ACPI: Remove msleep() bloat from acpi_os_sleep()

Message ID	2803b89021f991662b000f50e45dbaebdcca438a.1724729626.git.len.brown@intel.com
State	New
Headers	show Received: from mail-il1-f177.google.com (mail-il1-f177.google.com [209.85.166.177]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id ECCC417D2; Tue, 27 Aug 2024 03:36:44 +0000 (UTC) Sender: Len Brown <lenb417@gmail.com> From: Len Brown <lenb@kernel.org> To: rjw@rjwysocki.net Cc: linux-acpi@vger.kernel.org, linux-pm@vger.kernel.org, linux-kernel@vger.kernel.org, Len Brown <len.brown@intel.com>, Arjan van de Ven <arjan@linux.intel.com>, Todd Brandt <todd.e.brandt@intel.com> Subject: [PATCH] ACPI: Remove msleep() bloat from acpi_os_sleep() Date: Mon, 26 Aug 2024 23:35:18 -0400 Message-ID: <2803b89021f991662b000f50e45dbaebdcca438a.1724729626.git.len.brown@intel.com> Precedence: bulk MIME-Version: 1.0 Reply-To: Len Brown <lenb@kernel.org> Organization: Intel Open Source Technology Center Content-Transfer-Encoding: 8bit
Series	ACPI: Remove msleep() bloat from acpi_os_sleep() \| expand ACPI: Remove msleep() bloat from acpi_os_sleep()

Len Brown Aug. 27, 2024, 3:35 a.m. UTC

From: Len Brown <len.brown@intel.com>

Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling).
The floor of the range is the exact requested ms,
with an additional 1ms of slack for sleeps above 20ms.

This reduces  the kernel resume time of the Dell 9300
to 1,124 ms from 2,471 ms.

The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms),
which has invoked msleep(ms) since 2013.

But msleep(ms) is based on jiffies, and the rounding-up
logic to convert to jiffies on a HZ=250 system causes
msleep(5) to bloat to a minimum of a 12ms delay.
msleep(5) typically takes over 15ms!

As a result, AML delay loops with small Sleep() inside
magnify the entire loop.  A particularly painful example
is ACPI support for powering-on ICL and TGL
thunderbolt/pcie_ports during system resume.

Regarding jiffy-based msleep() being inexpensive
and hrtimer-based usleep_range() being expensive.
ACPI AML timer invocations are rare, and so it
is unlikely the hrtimer cost will be noticible,
or even measurable.  At the same time, the msleep()
timer duration bloat is significant enough to
be noticed by end users.

Regarding usleep_range() timer coalescing.
It virtually never works during ACPI flows, which
commonly run when there are few coalescing
opportunities. As a result, the timers almost
always expire at the maximum end of their specified range.

It was tempting to use usleep_range(us, us)
for all values of us.  But 1 ms is added to the
range for timers over 20ms on the reasoning that
the AML Sleep interface has a granularity of 1ms,
most costly loops use duration under 20ms inside,
and singular long sleeps are unlitly to notice an
additiona 1ms, so why not allow some coalescing...

Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
Signed-off-by: Len Brown <len.brown@intel.com>
Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
Tested-by: Todd Brandt <todd.e.brandt@intel.com>
---
 drivers/acpi/osl.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Rafael J. Wysocki Aug. 27, 2024, 11:29 a.m. UTC | #1

First, let me add a few people who know more about timers than I do.

On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@kernel.org> wrote:
>
> From: Len Brown <len.brown@intel.com>
>
> Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling).
> The floor of the range is the exact requested ms,
> with an additional 1ms of slack for sleeps above 20ms.
>
> This reduces  the kernel resume time of the Dell 9300
> to 1,124 ms from 2,471 ms.
>
> The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms),
> which has invoked msleep(ms) since 2013.
>
> But msleep(ms) is based on jiffies, and the rounding-up
> logic to convert to jiffies on a HZ=250 system causes
> msleep(5) to bloat to a minimum of a 12ms delay.
> msleep(5) typically takes over 15ms!
>
> As a result, AML delay loops with small Sleep() inside
> magnify the entire loop.  A particularly painful example
> is ACPI support for powering-on ICL and TGL
> thunderbolt/pcie_ports during system resume.
>
> Regarding jiffy-based msleep() being inexpensive
> and hrtimer-based usleep_range() being expensive.
> ACPI AML timer invocations are rare, and so it
> is unlikely the hrtimer cost will be noticible,
> or even measurable.  At the same time, the msleep()
> timer duration bloat is significant enough to
> be noticed by end users.

I'm not sure why you are refusing to follow the implementation of
fsleep() and Documentation/timers/timers-howto.rst and still use
msleep() for sleep durations longer than 20 ms.

Why should usleep_range() be used for 100 ms sleeps, for instance?
This goes against the recommendation in the above document, so is
there a particular reason?

> Regarding usleep_range() timer coalescing.
> It virtually never works during ACPI flows, which
> commonly run when there are few coalescing
> opportunities. As a result, the timers almost
> always expire at the maximum end of their specified range.

I don't think that's the main point of using a nonzero range in
usleep_range().  AFAICS, this is about letting the timekeeping
subsystem know how much you care about timer precision so it can
arrange things to meet everyone's needs.

> It was tempting to use usleep_range(us, us)
> for all values of us.  But 1 ms is added to the
> range for timers over 20ms on the reasoning that
> the AML Sleep interface has a granularity of 1ms,
> most costly loops use duration under 20ms inside,
> and singular long sleeps are unlitly to notice an
> additiona 1ms, so why not allow some coalescing...

So again, why not use msleep() for sleeps longer than 20 ms?

> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> Signed-off-by: Len Brown <len.brown@intel.com>
> Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
> Tested-by: Todd Brandt <todd.e.brandt@intel.com>
> ---
>  drivers/acpi/osl.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index 70af3fbbebe5..c4c76f86cd7a 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
>
>  void acpi_os_sleep(u64 ms)
>  {
> -       msleep(ms);
> +       u64 us = ms * 1000;
> +
> +       if (us <= 20000)
> +               usleep_range(us, us);
> +       else
> +               usleep_range(us, us + 1000);
> +
>  }
>
>  void acpi_os_stall(u32 us)
> --

While I agree with using usleep_range() for sleeps up to 20 ms in
acpi_os_sleep(), I disagree with the patch as is.

Thanks!

Len Brown Aug. 28, 2024, 4:06 a.m. UTC | #2

On Tue, Aug 27, 2024 at 7:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> First, let me add a few people who know more about timers than I do.
>
> On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@kernel.org> wrote:
> >
> > From: Len Brown <len.brown@intel.com>
> >
> > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling).
> > The floor of the range is the exact requested ms,
> > with an additional 1ms of slack for sleeps above 20ms.
> >
> > This reduces  the kernel resume time of the Dell 9300
> > to 1,124 ms from 2,471 ms.
> >
> > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms),
> > which has invoked msleep(ms) since 2013.
> >
> > But msleep(ms) is based on jiffies, and the rounding-up
> > logic to convert to jiffies on a HZ=250 system causes
> > msleep(5) to bloat to a minimum of a 12ms delay.
> > msleep(5) typically takes over 15ms!
> >
> > As a result, AML delay loops with small Sleep() inside
> > magnify the entire loop.  A particularly painful example
> > is ACPI support for powering-on ICL and TGL
> > thunderbolt/pcie_ports during system resume.
> >
> > Regarding jiffy-based msleep() being inexpensive
> > and hrtimer-based usleep_range() being expensive.
> > ACPI AML timer invocations are rare, and so it
> > is unlikely the hrtimer cost will be noticible,
> > or even measurable.  At the same time, the msleep()
> > timer duration bloat is significant enough to
> > be noticed by end users.
>
> I'm not sure why you are refusing to follow the implementation of
> fsleep() and Documentation/timers/timers-howto.rst and still use
> msleep() for sleep durations longer than 20 ms.

timers_howto.rst could use an update to reflect reality.

It doesn't disclose how toxic msleep actually is for small values.
msleep(1) takes 11ms on a HZ=250 system.

Linux/ACPI has to support any random AML Sleep(ms) call,
and sometimes we see timeout loops implemented around
an inner Sleep(1ms).  If we use msleep those loops explode
by 11x and aggregate durations that are noticed by end users.

fsleep does three things -- and none of them are a good fit
for acpi_os_sleep:

static inline void fsleep(unsigned long usecs)
{
        if (usecs <= 10)
                udelay(usecs);
        else if (usecs <= 20000)
                usleep_range(usecs, 2 * usecs);
        else
                msleep(DIV_ROUND_UP(usecs, 1000));
}

> udelay(usecs);
will never execute in the ACPI case, as the minimum delay is 1000 usec.

> usleep_range(usecs, 2 * usecs);

timers-howto.rst says this:

                        "With the introduction of a range, the scheduler is
                        free to coalesce your wakeup with any other wakeup
                        that may have happened for other reasons, or at the
                        worst case, fire an interrupt for your upper bound."

But the way usleep_range works is it sets the timer for the
upper bound, and opportunistically wakes/cancels if another timer
fires after the lower bound and before the upper bound.

It calls it a "worst case" that the timer fires at the upper bound.

But when ACPI suspend/resume flows are running the only other
timer is the tick, and so the "worst case" happens effectively ALWAYS.

So when fsleep does a usleep_range(usecs, 2 * usecs), it is effectively
DOUBLING the duration of all timers 20 ms and smaller.
There may be scenarios where doing this makes sense,
but acpi_os_sleep() is not one of them.

> msleep(DIV_ROUND_UP(usecs, 1000));

msleep(50) takes 59.8ms -- a 20% penalty.
We have loops with AML Sleep(50) in the middle,
and this  code would bloat them by 20%, while
the user is waiting -- for no benefit. o
Again, there may be scenarios where doing this makes sense,
but acpi_os_sleep() is not one of them.

> Why should usleep_range() be used for 100 ms sleeps, for instance?
> This goes against the recommendation in the above document, so is
> there a particular reason?

The document doesn't say *why* msleep is recommended.
One would assume that when there are many timers, msleep
is efficient because it consolidates them to jiffy boundaries,
and if they are long duration timers, perhaps the assumption is
that they don't care so much about the additional delays?

Again, there are certainly scenarios where that makes sense,
but at the end of the day, msleep(100) takes 106 msec.

ACPI is not a heavy timer user, so the msleep efficiency justification
for making the user wait longer holds no weight.

Now that we realize that end-users notice timer bloat in acp_os_sleep,
it is clear that msleep is simply a poor choice for acpi_os_sleep.

> > Regarding usleep_range() timer coalescing.
> > It virtually never works during ACPI flows, which
> > commonly run when there are few coalescing
> > opportunities. As a result, the timers almost
> > always expire at the maximum end of their specified range.
>
> I don't think that's the main point of using a nonzero range in
> usleep_range().  AFAICS, this is about letting the timekeeping
> subsystem know how much you care about timer precision so it can
> arrange things to meet everyone's needs.

The range in usleep_range is to enable timer coalescing,
which is to reduce the number of timers firing on the system.
If it has another purpose, neither the code nor the API documentation
are forthcoming.

> > It was tempting to use usleep_range(us, us)
> > for all values of us.  But 1 ms is added to the
> > range for timers over 20ms on the reasoning that
> > the AML Sleep interface has a granularity of 1ms,
> > most costly loops use duration under 20ms inside,
> > and singular long sleeps are unlitly to notice an
> > additiona 1ms, so why not allow some coalescing...
>
> So again, why not use msleep() for sleeps longer than 20 ms?

per above.  Too slow.

> > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> > Signed-off-by: Len Brown <len.brown@intel.com>
> > Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
> > Tested-by: Todd Brandt <todd.e.brandt@intel.com>
> > ---
> >  drivers/acpi/osl.c | 8 +++++++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > index 70af3fbbebe5..c4c76f86cd7a 100644
> > --- a/drivers/acpi/osl.c
> > +++ b/drivers/acpi/osl.c
> > @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
> >
> >  void acpi_os_sleep(u64 ms)
> >  {
> > -       msleep(ms);
> > +       u64 us = ms * 1000;
> > +
> > +       if (us <= 20000)
> > +               usleep_range(us, us);
> > +       else
> > +               usleep_range(us, us + 1000);
> > +
> >  }
> >
> >  void acpi_os_stall(u32 us)
> > --
>
> While I agree with using usleep_range() for sleeps up to 20 ms in
> acpi_os_sleep(), I disagree with the patch as is.

The measurement results do not support any form of acpi_os_sleep that
invokes any form of msleep().

Honestly, I think because of when and how acpi_os_sleep is called, we should
consider making it yet simpler:

void acpi_os_sleep(u64 ms)
{
        u64 us = ms * 1000;

        usleep_range(us, us);
}

thanks,
-Len

Rafael J. Wysocki Aug. 28, 2024, 10:58 a.m. UTC | #3

On Wed, Aug 28, 2024 at 6:12 AM Len Brown <lenb@kernel.org> wrote:
>
> On Tue, Aug 27, 2024 at 7:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > First, let me add a few people who know more about timers than I do.
> >
> > On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@kernel.org> wrote:
> > >
> > > From: Len Brown <len.brown@intel.com>
> > >
> > > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling).
> > > The floor of the range is the exact requested ms,
> > > with an additional 1ms of slack for sleeps above 20ms.
> > >
> > > This reduces  the kernel resume time of the Dell 9300
> > > to 1,124 ms from 2,471 ms.
> > >
> > > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms),
> > > which has invoked msleep(ms) since 2013.
> > >
> > > But msleep(ms) is based on jiffies, and the rounding-up
> > > logic to convert to jiffies on a HZ=250 system causes
> > > msleep(5) to bloat to a minimum of a 12ms delay.
> > > msleep(5) typically takes over 15ms!
> > >
> > > As a result, AML delay loops with small Sleep() inside
> > > magnify the entire loop.  A particularly painful example
> > > is ACPI support for powering-on ICL and TGL
> > > thunderbolt/pcie_ports during system resume.
> > >
> > > Regarding jiffy-based msleep() being inexpensive
> > > and hrtimer-based usleep_range() being expensive.
> > > ACPI AML timer invocations are rare, and so it
> > > is unlikely the hrtimer cost will be noticible,
> > > or even measurable.  At the same time, the msleep()
> > > timer duration bloat is significant enough to
> > > be noticed by end users.
> >
> > I'm not sure why you are refusing to follow the implementation of
> > fsleep() and Documentation/timers/timers-howto.rst and still use
> > msleep() for sleep durations longer than 20 ms.
>
> timers_howto.rst could use an update to reflect reality.

Maybe it could.

> It doesn't disclose how toxic msleep actually is for small values.
> msleep(1) takes 11ms on a HZ=250 system.
>
> Linux/ACPI has to support any random AML Sleep(ms) call,
> and sometimes we see timeout loops implemented around
> an inner Sleep(1ms).  If we use msleep those loops explode
> by 11x and aggregate durations that are noticed by end users.

So this is all about short sleeps.

You don't need to argue with me about short sleeps, I'm convinced.

Thus I'm going to skip all arguments regarding short sleeps going forward.

> fsleep does three things -- and none of them are a good fit
> for acpi_os_sleep:
>
> static inline void fsleep(unsigned long usecs)
> {
>         if (usecs <= 10)
>                 udelay(usecs);
>         else if (usecs <= 20000)
>                 usleep_range(usecs, 2 * usecs);
>         else
>                 msleep(DIV_ROUND_UP(usecs, 1000));
> }
>
> > udelay(usecs);
> will never execute in the ACPI case, as the minimum delay is 1000 usec.
>
> > usleep_range(usecs, 2 * usecs);
>
> timers-howto.rst says this:
>
>                         "With the introduction of a range, the scheduler is
>                         free to coalesce your wakeup with any other wakeup
>                         that may have happened for other reasons, or at the
>                         worst case, fire an interrupt for your upper bound."
>
> But the way usleep_range works is it sets the timer for the
> upper bound, and opportunistically wakes/cancels if another timer
> fires after the lower bound and before the upper bound.

Yes, that's how it works.

> It calls it a "worst case" that the timer fires at the upper bound.
>
> But when ACPI suspend/resume flows are running the only other
> timer is the tick, and so the "worst case" happens effectively ALWAYS.

Arguably though, ACPI suspend/resume flows are not the only case in
which AML skeeps.

> So when fsleep does a usleep_range(usecs, 2 * usecs), it is effectively
> DOUBLING the duration of all timers 20 ms and smaller.
> There may be scenarios where doing this makes sense,
> but acpi_os_sleep() is not one of them.

It would be good to actually try this.

However, I'm not arguing that usleep_range(usecs, 2 * usecs) is
suitable for ACPI sleeps.  I'm saying that it is friendly with respect
to the timekeeping subsystem to give it a nonzero range if it is
affordable.

And I do think that it is affordable in the ACPI sleep case.

> > msleep(DIV_ROUND_UP(usecs, 1000));
>
> msleep(50) takes 59.8ms -- a 20% penalty.

I guess you mean on a HZ=250 system system and I guess this is a
measured number.

> We have loops with AML Sleep(50) in the middle,
> and this  code would bloat them by 20%, while
> the user is waiting -- for no benefit. o

If this happens during system suspend/resume and if this is the only
thing that is waited for (ie. nothing runs in parallel with it).

So your resume time is 20% longer than it can be in theory if every
Sleep() causes an extra timer interrupt to be programmed.  Is this
extra overhead justified?

> Again, there may be scenarios where doing this makes sense,
> but acpi_os_sleep() is not one of them.

I'm not convinced.

> > Why should usleep_range() be used for 100 ms sleeps, for instance?
> > This goes against the recommendation in the above document, so is
> > there a particular reason?
>
> The document doesn't say *why* msleep is recommended.
> One would assume that when there are many timers, msleep
> is efficient because it consolidates them to jiffy boundaries,
> and if they are long duration timers, perhaps the assumption is
> that they don't care so much about the additional delays?
>
> Again, there are certainly scenarios where that makes sense,
> but at the end of the day, msleep(100) takes 106 msec.

And is this a big deal really?  Is a user really going to notice the
6% difference and even if they do, would they care?  For one, I
wouldn't.

Arguably, it all depends on the intention of the developer who uses
the msleep(), os Sleep() in the AML case.

Do they mean "exactly 100 ms" or do they mean "at least 100 ms"?  If
the latter is the case, the whole argument is moot if not misguided.

As a side note, I generally think that if a developer puts Sleep(n) in
their code they need to assume that the sleep will be slightly longer
than n for reasons related to scheduling etc. so I don't believe that
they ever mean "exactly n".  However, they probably don't think that
it may take 3 times longer, so it is better to make that not happen.

> ACPI is not a heavy timer user, so the msleep efficiency justification
> for making the user wait longer holds no weight.
>
> Now that we realize that end-users notice timer bloat in acp_os_sleep,
> it is clear that msleep is simply a poor choice for acpi_os_sleep.

For short sleeps, yes.

For long sleeps, using an hrtimer generally ends up in programming an
extra timer interrupt unless the range is really wide, but then
msleep() gets better again.  So is the extra timer interrupt really
justified every time AML evaluates Sleep()?  And again, system
suspend/resume are not the only cases when that happens.

Honestly, I don't think so and that's why I prefer msleep() to be used
for longer sleeps even though it is "slacky".

> > > Regarding usleep_range() timer coalescing.
> > > It virtually never works during ACPI flows, which
> > > commonly run when there are few coalescing
> > > opportunities. As a result, the timers almost
> > > always expire at the maximum end of their specified range.
> >
> > I don't think that's the main point of using a nonzero range in
> > usleep_range().  AFAICS, this is about letting the timekeeping
> > subsystem know how much you care about timer precision so it can
> > arrange things to meet everyone's needs.
>
> The range in usleep_range is to enable timer coalescing,
> which is to reduce the number of timers firing on the system.

Exactly.  Which reduces the number of times a timer interrupts needs
to be programmed.  Which generally reduces overhead.

> If it has another purpose, neither the code nor the API documentation
> are forthcoming.
>
> > > It was tempting to use usleep_range(us, us)
> > > for all values of us.  But 1 ms is added to the
> > > range for timers over 20ms on the reasoning that
> > > the AML Sleep interface has a granularity of 1ms,
> > > most costly loops use duration under 20ms inside,
> > > and singular long sleeps are unlitly to notice an
> > > additiona 1ms, so why not allow some coalescing...
> >
> > So again, why not use msleep() for sleeps longer than 20 ms?
>
> per above.  Too slow.

per above, I'm not convinced.

> > > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=216263
> > > Signed-off-by: Len Brown <len.brown@intel.com>
> > > Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
> > > Tested-by: Todd Brandt <todd.e.brandt@intel.com>
> > > ---
> > >  drivers/acpi/osl.c | 8 +++++++-
> > >  1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> > > index 70af3fbbebe5..c4c76f86cd7a 100644
> > > --- a/drivers/acpi/osl.c
> > > +++ b/drivers/acpi/osl.c
> > > @@ -607,7 +607,13 @@ acpi_status acpi_os_remove_interrupt_handler(u32 gsi, acpi_osd_handler handler)
> > >
> > >  void acpi_os_sleep(u64 ms)
> > >  {
> > > -       msleep(ms);
> > > +       u64 us = ms * 1000;
> > > +
> > > +       if (us <= 20000)
> > > +               usleep_range(us, us);
> > > +       else
> > > +               usleep_range(us, us + 1000);
> > > +
> > >  }
> > >
> > >  void acpi_os_stall(u32 us)
> > > --
> >
> > While I agree with using usleep_range() for sleeps up to 20 ms in
> > acpi_os_sleep(), I disagree with the patch as is.
>
> The measurement results do not support any form of acpi_os_sleep that
> invokes any form of msleep().

That's a bold claim and it seems a bit unfounded.

Have you tested anything other than system suspend/resume?

Have you tested anything other than x86 laptops?

> Honestly, I think because of when and how acpi_os_sleep is called, we should
> consider making it yet simpler:
>
> void acpi_os_sleep(u64 ms)
> {
>         u64 us = ms * 1000;
>
>         usleep_range(us, us);
> }

Which I obviously disagree with.

In the meantime I have checked how many users of usleep_range() use
the same value as both arguments of it and I have found 2 out of over
4000.

So now, is everyone clueless or is ACPI so special that it has to do
something different than everyone else?

And because I would actually like to see what the difference in terms
of system suspend/resume time is between the $subject patch and
something alternative that uses msleep() for longer sleeps and
generally uses nonzero ranges with usleep_range(), attached is a patch
to compare.

Thanks!

Rafael J. Wysocki Aug. 28, 2024, 1:35 p.m. UTC | #4

On Wed, Aug 28, 2024 at 12:58 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Aug 28, 2024 at 6:12 AM Len Brown <lenb@kernel.org> wrote:
> >
> > On Tue, Aug 27, 2024 at 7:29 AM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > >
> > > First, let me add a few people who know more about timers than I do.
> > >
> > > On Tue, Aug 27, 2024 at 5:42 AM Len Brown <lenb@kernel.org> wrote:
> > > >
> > > > From: Len Brown <len.brown@intel.com>
> > > >
> > > > Optimize acpi_os_sleep(ms) using usleep_range(floor, ceiling).
> > > > The floor of the range is the exact requested ms,
> > > > with an additional 1ms of slack for sleeps above 20ms.
> > > >
> > > > This reduces  the kernel resume time of the Dell 9300
> > > > to 1,124 ms from 2,471 ms.
> > > >
> > > > The ACPI AML Sleep(ms) method calls acpi_os_sleep(ms),
> > > > which has invoked msleep(ms) since 2013.
> > > >
> > > > But msleep(ms) is based on jiffies, and the rounding-up
> > > > logic to convert to jiffies on a HZ=250 system causes
> > > > msleep(5) to bloat to a minimum of a 12ms delay.
> > > > msleep(5) typically takes over 15ms!
> > > >
> > > > As a result, AML delay loops with small Sleep() inside
> > > > magnify the entire loop.  A particularly painful example
> > > > is ACPI support for powering-on ICL and TGL
> > > > thunderbolt/pcie_ports during system resume.
> > > >
> > > > Regarding jiffy-based msleep() being inexpensive
> > > > and hrtimer-based usleep_range() being expensive.
> > > > ACPI AML timer invocations are rare, and so it
> > > > is unlikely the hrtimer cost will be noticible,
> > > > or even measurable.  At the same time, the msleep()
> > > > timer duration bloat is significant enough to
> > > > be noticed by end users.
> > >
> > > I'm not sure why you are refusing to follow the implementation of
> > > fsleep() and Documentation/timers/timers-howto.rst and still use
> > > msleep() for sleep durations longer than 20 ms.
> >
> > timers_howto.rst could use an update to reflect reality.
>
> Maybe it could.
>
> > It doesn't disclose how toxic msleep actually is for small values.
> > msleep(1) takes 11ms on a HZ=250 system.
> >
> > Linux/ACPI has to support any random AML Sleep(ms) call,
> > and sometimes we see timeout loops implemented around
> > an inner Sleep(1ms).  If we use msleep those loops explode
> > by 11x and aggregate durations that are noticed by end users.
>
> So this is all about short sleeps.
>
> You don't need to argue with me about short sleeps, I'm convinced.
>
> Thus I'm going to skip all arguments regarding short sleeps going forward.
>
> > fsleep does three things -- and none of them are a good fit
> > for acpi_os_sleep:
> >
> > static inline void fsleep(unsigned long usecs)
> > {
> >         if (usecs <= 10)
> >                 udelay(usecs);
> >         else if (usecs <= 20000)
> >                 usleep_range(usecs, 2 * usecs);
> >         else
> >                 msleep(DIV_ROUND_UP(usecs, 1000));
> > }
> >
> > > udelay(usecs);
> > will never execute in the ACPI case, as the minimum delay is 1000 usec.
> >
> > > usleep_range(usecs, 2 * usecs);
> >
> > timers-howto.rst says this:
> >
> >                         "With the introduction of a range, the scheduler is
> >                         free to coalesce your wakeup with any other wakeup
> >                         that may have happened for other reasons, or at the
> >                         worst case, fire an interrupt for your upper bound."
> >
> > But the way usleep_range works is it sets the timer for the
> > upper bound, and opportunistically wakes/cancels if another timer
> > fires after the lower bound and before the upper bound.
>
> Yes, that's how it works.
>
> > It calls it a "worst case" that the timer fires at the upper bound.
> >
> > But when ACPI suspend/resume flows are running the only other
> > timer is the tick, and so the "worst case" happens effectively ALWAYS.
>
> Arguably though, ACPI suspend/resume flows are not the only case in
> which AML skeeps.
>
> > So when fsleep does a usleep_range(usecs, 2 * usecs), it is effectively
> > DOUBLING the duration of all timers 20 ms and smaller.
> > There may be scenarios where doing this makes sense,
> > but acpi_os_sleep() is not one of them.
>
> It would be good to actually try this.
>
> However, I'm not arguing that usleep_range(usecs, 2 * usecs) is
> suitable for ACPI sleeps.  I'm saying that it is friendly with respect
> to the timekeeping subsystem to give it a nonzero range if it is
> affordable.
>
> And I do think that it is affordable in the ACPI sleep case.
>
> > > msleep(DIV_ROUND_UP(usecs, 1000));
> >
> > msleep(50) takes 59.8ms -- a 20% penalty.
>
> I guess you mean on a HZ=250 system system and I guess this is a
> measured number.
>
> > We have loops with AML Sleep(50) in the middle,
> > and this  code would bloat them by 20%, while
> > the user is waiting -- for no benefit. o
>
> If this happens during system suspend/resume and if this is the only
> thing that is waited for (ie. nothing runs in parallel with it).
>
> So your resume time is 20% longer than it can be in theory if every
> Sleep() causes an extra timer interrupt to be programmed.  Is this
> extra overhead justified?
>
> > Again, there may be scenarios where doing this makes sense,
> > but acpi_os_sleep() is not one of them.
>
> I'm not convinced.
>
> > > Why should usleep_range() be used for 100 ms sleeps, for instance?
> > > This goes against the recommendation in the above document, so is
> > > there a particular reason?
> >
> > The document doesn't say *why* msleep is recommended.
> > One would assume that when there are many timers, msleep
> > is efficient because it consolidates them to jiffy boundaries,
> > and if they are long duration timers, perhaps the assumption is
> > that they don't care so much about the additional delays?
> >
> > Again, there are certainly scenarios where that makes sense,
> > but at the end of the day, msleep(100) takes 106 msec.
>
> And is this a big deal really?  Is a user really going to notice the
> 6% difference and even if they do, would they care?  For one, I
> wouldn't.
>
> Arguably, it all depends on the intention of the developer who uses
> the msleep(), os Sleep() in the AML case.
>
> Do they mean "exactly 100 ms" or do they mean "at least 100 ms"?  If
> the latter is the case, the whole argument is moot if not misguided.
>
> As a side note, I generally think that if a developer puts Sleep(n) in
> their code they need to assume that the sleep will be slightly longer
> than n for reasons related to scheduling etc. so I don't believe that
> they ever mean "exactly n".  However, they probably don't think that
> it may take 3 times longer, so it is better to make that not happen.
>
> > ACPI is not a heavy timer user, so the msleep efficiency justification
> > for making the user wait longer holds no weight.
> >
> > Now that we realize that end-users notice timer bloat in acp_os_sleep,
> > it is clear that msleep is simply a poor choice for acpi_os_sleep.
>
> For short sleeps, yes.
>
> For long sleeps, using an hrtimer generally ends up in programming an
> extra timer interrupt unless the range is really wide, but then
> msleep() gets better again.  So is the extra timer interrupt really
> justified every time AML evaluates Sleep()?  And again, system
> suspend/resume are not the only cases when that happens.
>
> Honestly, I don't think so and that's why I prefer msleep() to be used
> for longer sleeps even though it is "slacky".

That being said, I've read the thread mentioned in timers-howto.rst
and it turns out to be interesting in the context of this patch.

So Jon attempted to reimplement msleep() with the help of hrtimers and
it basically worked except that it broke USB mouse hotplug for Andrew.
IIUC, this issue was the reason why that patch was not applied.

But this appears to mean that there are no fundamental objections to
using hrtimers for long sleeps even though it is still recommended to
use msleep() in that case.  So it all boils down to the question: If a
sleep is relatively long, say 100 ms or longer, why exactly is
msleep(delay) more suitable for it than usleep_range(delay, delay +
delta), where delta is of the order of 1 jiffy?

I get that msleep() is based on tick and so it does not require an
extra hrtimer (which leads to programming an additional timer
interrupt in many cases), but is it worth the trouble of checking the
delay and using msleep() instead of usleep_range() if the sleep is
long enough?

I'm not sure now TBH.

Anna-Maria Behnsen Aug. 28, 2024, 7:44 p.m. UTC | #5

Hi,

I try to give some input from the timer perspective and maybe it helps
to clarify which should be the proper way to go for acpi_os_sleep(). It
also identifies some problems with the current documentation and
implementation of msleep/fsleep.

Len Brown <lenb@kernel.org> writes:

> Attempting to grab all the loose ends from this discussion
> and put them into a list of 4 things:
>
> 1. Is it time to re-visit Jon's proposal to fix msleep, especially for
> small sleep values?

Lets have a deeper look to msleep() internals: msleep() uses timer list
timers. Because of the design of the timer wheel (granularity of buckets
increases with the levels) and because of the granularity of jiffies the
sleep values will be longer as specified. Let's assume we are executing
a msleep(1) on a HZ=250 system:

First msecs are mapped on jiffies, so this results in a 4ms timeout
value, as there is nothing shorter than 1 jiffie. Then the jiffie value
is handed over to timer code and msleep() adds another jiffie to the
timeout. The timeout is then 2 jiffies or 8ms. With this timeout a timer
list timer is queued. To make sure that timers will not fire early or
race with a concurrent incrementation of jiffie, timers are queued
always into the next bucket. As the timer will end up in the first level
of the timer wheel the granularity of the buckets is 1 jiffies. This
means that the timeout would be 3 jiffies in worst case.

The additional jiffie in msleep() is the historical prevention that the
sleep time is at least the specified time. This is handled by timer
wheel core code itself, so this extra jiffie could be removed. I will
provide a patch for it.

But even with this change, the worst case sleep length will be 8ms
instead of 1ms.

For comparison, see here a table with the values for all the steps
explained above taking some different msleep values. I already dropped
the addition of the extra jiffie. The granularity of the timer wheel
levels can be found at the first long comment in kernel/time/timer.c.

This is still a HZ=250 system:

msleep()   | msecs_to_jiffies() | worst case delay after | timer wheel
           |                    | enqueue (next bucket)  | level
--------------------------------------------------------------------------
1 ms       | 1 jiffie           | 2 jiffies, 8 ms        | 0
256 ms     | 64 jiffies         | 72 jiffies, 288 ms     | 1
257 ms     | 65 jiffies         | 72 jiffies, 288 ms     | 1
2048 ms    | 513 jiffies        | 576 jiffies, 2304 ms   | 2
2300 ms    | 575 jiffies        | 576 jiffies, 2304 ms   | 2
4096 ms    | 1024 jiffies       | 1088 jiffies, 4352 ms  | 2

The same values on a HZ=1000 system:

msleep()   | msecs_to_jiffies() | worst case delay after | timer wheel
           |                    | enqueue (next bucket)  | level
--------------------------------------------------------------------------
1 ms       | 1 jiffie           | 2 jiffies, 2 ms        | 0
256 ms     | 256 jiffies        | 264 jiffies, 264 ms    | 1
257 ms     | 257 jiffies        | 264 jiffies, 264 ms    | 1
2048 ms    | 2048 jiffies       | 2112 jiffies, 2112 ms  | 2
2300 ms    | 2300 jiffies       | 2304 jiffies, 2304 ms  | 2
4096 ms    | 4096 jiffies       | 4608 jiffies, 4608 ms  | 3

The documentation states, that msleep is not valid for short sleep
values. But as demonstrated with the two tables, this is not enirely
true. So the descision whether msleep() is suitable for the usecase is
not as simple as documented. It depends on the HZ configuration in
combination with the timeout value, and on the request how precise the
timeout has to be. And another important point is: hrtimers are more
expensive then timer list timers...

The documentation was originally written back in 2010 where the non
cascading timer wheel wasn't in place yet. So is has to be updated.

> 2. We agree that usleep_range() is appropriate for short acpi_os_sleep()
> due to ASL loops with small Sleep() values.
> But if we do something different for large and small values,
> where is the line between small and large?

As pointed out above - this depends on HZ and what the requirements of
the callsite are.

> fsleep anointed 20ms, but didn't document why.
> (and it made both short sleeps *and* long sleeps too slow to be useful
> here, IMO)

fsleep() just implements what the outdated documentation states. And it
adds a magic max value for usleep_range(). It seems to me, that fsleep()
somehow accidentially found the way into the kernel in 2020. As it is
now there this needs to be fixed and should take the above things into
account in some generic way.

> 3. Is usleep_range(min, max) with min= max bad?
> If it is good, why is virtually nobody else using min=max?

It's not bad to use it. It depends on your use case. If you really need
the precise sleep length, then it should be valid to use min = max.

I hope the timer information will help to find the proper solution for
acpi_os_sleep().

Thanks,

	Anna-Maria

Rafael J. Wysocki Aug. 28, 2024, 8:29 p.m. UTC | #6

Hi,

On Wed, Aug 28, 2024 at 9:44 PM Anna-Maria Behnsen
<anna-maria@linutronix.de> wrote:
>
> Hi,
>
> I try to give some input from the timer perspective and maybe it helps
> to clarify which should be the proper way to go for acpi_os_sleep(). It
> also identifies some problems with the current documentation and
> implementation of msleep/fsleep.
>
> Len Brown <lenb@kernel.org> writes:
>
> > Attempting to grab all the loose ends from this discussion
> > and put them into a list of 4 things:
> >
> > 1. Is it time to re-visit Jon's proposal to fix msleep, especially for
> > small sleep values?
>
> Lets have a deeper look to msleep() internals: msleep() uses timer list
> timers. Because of the design of the timer wheel (granularity of buckets
> increases with the levels) and because of the granularity of jiffies the
> sleep values will be longer as specified. Let's assume we are executing
> a msleep(1) on a HZ=250 system:
>
> First msecs are mapped on jiffies, so this results in a 4ms timeout
> value, as there is nothing shorter than 1 jiffie. Then the jiffie value
> is handed over to timer code and msleep() adds another jiffie to the
> timeout. The timeout is then 2 jiffies or 8ms. With this timeout a timer
> list timer is queued. To make sure that timers will not fire early or
> race with a concurrent incrementation of jiffie, timers are queued
> always into the next bucket. As the timer will end up in the first level
> of the timer wheel the granularity of the buckets is 1 jiffies. This
> means that the timeout would be 3 jiffies in worst case.
>
> The additional jiffie in msleep() is the historical prevention that the
> sleep time is at least the specified time. This is handled by timer
> wheel core code itself, so this extra jiffie could be removed. I will
> provide a patch for it.
>
> But even with this change, the worst case sleep length will be 8ms
> instead of 1ms.
>
> For comparison, see here a table with the values for all the steps
> explained above taking some different msleep values. I already dropped
> the addition of the extra jiffie. The granularity of the timer wheel
> levels can be found at the first long comment in kernel/time/timer.c.
>
> This is still a HZ=250 system:
>
> msleep()   | msecs_to_jiffies() | worst case delay after | timer wheel
>            |                    | enqueue (next bucket)  | level
> --------------------------------------------------------------------------
> 1 ms       | 1 jiffie           | 2 jiffies, 8 ms        | 0
> 256 ms     | 64 jiffies         | 72 jiffies, 288 ms     | 1
> 257 ms     | 65 jiffies         | 72 jiffies, 288 ms     | 1
> 2048 ms    | 513 jiffies        | 576 jiffies, 2304 ms   | 2
> 2300 ms    | 575 jiffies        | 576 jiffies, 2304 ms   | 2
> 4096 ms    | 1024 jiffies       | 1088 jiffies, 4352 ms  | 2
>
> The same values on a HZ=1000 system:
>
> msleep()   | msecs_to_jiffies() | worst case delay after | timer wheel
>            |                    | enqueue (next bucket)  | level
> --------------------------------------------------------------------------
> 1 ms       | 1 jiffie           | 2 jiffies, 2 ms        | 0
> 256 ms     | 256 jiffies        | 264 jiffies, 264 ms    | 1
> 257 ms     | 257 jiffies        | 264 jiffies, 264 ms    | 1
> 2048 ms    | 2048 jiffies       | 2112 jiffies, 2112 ms  | 2
> 2300 ms    | 2300 jiffies       | 2304 jiffies, 2304 ms  | 2
> 4096 ms    | 4096 jiffies       | 4608 jiffies, 4608 ms  | 3
>
>
> The documentation states, that msleep is not valid for short sleep
> values. But as demonstrated with the two tables, this is not enirely
> true. So the descision whether msleep() is suitable for the usecase is
> not as simple as documented. It depends on the HZ configuration in
> combination with the timeout value, and on the request how precise the
> timeout has to be. And another important point is: hrtimers are more
> expensive then timer list timers...
>
> The documentation was originally written back in 2010 where the non
> cascading timer wheel wasn't in place yet. So is has to be updated.
>
> > 2. We agree that usleep_range() is appropriate for short acpi_os_sleep()
> > due to ASL loops with small Sleep() values.
> > But if we do something different for large and small values,
> > where is the line between small and large?
>
> As pointed out above - this depends on HZ and what the requirements of
> the callsite are.
>
> > fsleep anointed 20ms, but didn't document why.
> > (and it made both short sleeps *and* long sleeps too slow to be useful
> > here, IMO)
>
> fsleep() just implements what the outdated documentation states. And it
> adds a magic max value for usleep_range(). It seems to me, that fsleep()
> somehow accidentially found the way into the kernel in 2020. As it is
> now there this needs to be fixed and should take the above things into
> account in some generic way.
>
> > 3. Is usleep_range(min, max) with min= max bad?
> > If it is good, why is virtually nobody else using min=max?
>
> It's not bad to use it. It depends on your use case. If you really need
> the precise sleep length, then it should be valid to use min = max.
>
> I hope the timer information will help to find the proper solution for
> acpi_os_sleep().

Yes, it is useful, thank you!

The main difficulty is that acpi_os_sleep() really works on behalf of
AML code (it is called when the Sleep() operator is evaluated in AML)
and it is hard to say what behavior is intended there.

We know that the msleep() precision is not sufficient at least in some
cases (for example, a high-count loop in AML that sleeps for 5 ms in
every iteration), but we don't really know what precision is needed.

IMV it generally is reasonable to assume that firmware developers
would not expect the exact sleep time (and the ACPI specification is
not very precise regarding this either), but they wouldn't also expect
sleep times much longer (in relative terms) than requested.  So
roughly speaking they probably assume something between t and t + t/4,
where t is the requested sleep time in milliseconds, regardless of the
HZ value.

Also, you said above that hrtimers were more expensive than the timer
wheel timers, which we all agree with, but it is not clear to me what
exactly the difference is.  For example, if there is a loop of 1000
iterations in AML that each sleep for 5 ms, how much more overhead
from using hrtimers would be there relative to using timer-wheel
timers?

Generally speaking, Len's idea of using usleep_range() for all sleep
lengths in acpi_os_sleep() is compelling because it leads to simple
code, but it is not clear to me how much more it would cost relative
to msleep() (or what issues it may provoke for that matter) and what
delta value to use in there.  One idea is to derive the delta from the
sleep length (say, take 1/4 of it like I did above), but the
counter-argument is that this would cause the loops in AML in question
to take measurably more time and there may be no real reason for it.

Thanks!

Anna-Maria Behnsen Aug. 29, 2024, 8:04 a.m. UTC | #7

Anna-Maria Behnsen <anna-maria@linutronix.de> writes:

[...]

> Lets have a deeper look to msleep() internals: msleep() uses timer list
> timers. Because of the design of the timer wheel (granularity of buckets
> increases with the levels) and because of the granularity of jiffies the
> sleep values will be longer as specified. Let's assume we are executing
> a msleep(1) on a HZ=250 system:
>
> First msecs are mapped on jiffies, so this results in a 4ms timeout
> value, as there is nothing shorter than 1 jiffie. Then the jiffie value
> is handed over to timer code and msleep() adds another jiffie to the
> timeout. The timeout is then 2 jiffies or 8ms. With this timeout a timer
> list timer is queued. To make sure that timers will not fire early or
> race with a concurrent incrementation of jiffie, timers are queued
> always into the next bucket. As the timer will end up in the first level
> of the timer wheel the granularity of the buckets is 1 jiffies. This
> means that the timeout would be 3 jiffies in worst case.
>
> The additional jiffie in msleep() is the historical prevention that the
> sleep time is at least the specified time. This is handled by timer
> wheel core code itself, so this extra jiffie could be removed. I will
> provide a patch for it.

I missed to use the whole cc list above when sending the patch:

  https://lore.kernel.org/r/20240829074133.4547-1-anna-maria@linutronix.de/

Thanks,

	Anna-Maria

Anna-Maria Behnsen Aug. 29, 2024, 3:36 p.m. UTC | #8

Hi,

"Rafael J. Wysocki" <rafael@kernel.org> writes:

[...]

> The main difficulty is that acpi_os_sleep() really works on behalf of
> AML code (it is called when the Sleep() operator is evaluated in AML)
> and it is hard to say what behavior is intended there.
>
> We know that the msleep() precision is not sufficient at least in some
> cases (for example, a high-count loop in AML that sleeps for 5 ms in
> every iteration), but we don't really know what precision is needed.
>
> IMV it generally is reasonable to assume that firmware developers
> would not expect the exact sleep time (and the ACPI specification is
> not very precise regarding this either), but they wouldn't also expect
> sleep times much longer (in relative terms) than requested.  So
> roughly speaking they probably assume something between t and t + t/4,
> where t is the requested sleep time in milliseconds, regardless of the
> HZ value.
>
> Also, you said above that hrtimers were more expensive than the timer
> wheel timers, which we all agree with, but it is not clear to me what
> exactly the difference is.  For example, if there is a loop of 1000
> iterations in AML that each sleep for 5 ms, how much more overhead
> from using hrtimers would be there relative to using timer-wheel
> timers?

The general differences are:
- hrtimers are using a rbtree instead of hasing, which is more expensive
- in case the new hrtimer is the new first hrtimer, hardware has to be
  programmed, which is expensive too
- the timer wheel uses batching. hrtimers not. They might get an
  own interrupt for every hrtimer when they do not have the exact same
  expiry time.
- because of the above it is a lot more expensive to use hrtimers when
  hrtimers are cancelled and requeued with a new expiry time (like
  e.g. network timeouts do it)

It's hard to measure, because it depends... I briefly talked to Thomas
and his opinion is, that there is an overhead but it should be in a
range of ok.

> Generally speaking, Len's idea of using usleep_range() for all sleep
> lengths in acpi_os_sleep() is compelling because it leads to simple
> code, but it is not clear to me how much more it would cost relative
> to msleep() (or what issues it may provoke for that matter) and what
> delta value to use in there.  One idea is to derive the delta from the
> sleep length (say, take 1/4 of it like I did above), but the
> counter-argument is that this would cause the loops in AML in question
> to take measurably more time and there may be no real reason for it.

I created a patch for fsleep() - only complie tested - which should make
sure that the slack is maximum 25%. What do you think about it? Might it
be helpful?

---8<---
--- a/include/linux/delay.h
+++ b/include/linux/delay.h
@@ -78,15 +78,36 @@ static inline void ssleep(unsigned int s
 	msleep(seconds * 1000);
 }
 
-/* see Documentation/timers/timers-howto.rst for the thresholds */
+/**
+ * fsleep - flexible sleep which autoselects the best mechanism
+ * @usecs:	requested sleep duration in microseconds
+ *
+ * flseep() selects the best mechanism that will provide maximum 25% slack
+ * to the requested sleep duration. Therefore it uses:
+ *
+ * - udelay() loop for sleep durations <= 10 microseconds to avoid hrtimer
+ *   overhead for really short sleep durations.
+ *
+ * - usleep_range() for sleep durations which would lead with the usage of
+ *   msleep() to a slack larger than 25%. This depends on the granularity of
+ *   jiffies.
+ *
+ * - msleep() for all other sleep durations.
+ *
+ * Note: When CONFIG_HIGH_RES_TIMERS is not set, all sleeps are processed with
+ * the granularity of jiffies and the slack might exceed 25% especially for
+ * short sleep durations.
+ */
 static inline void fsleep(unsigned long usecs)
 {
+	const unsigned int max_slack_shift = 2;
+
 	if (usecs <= 10)
 		udelay(usecs);
-	else if (usecs <= 20000)
-		usleep_range(usecs, 2 * usecs);
+	else if (usecs < ((TICK_NSEC << max_slack_shift) / NSEC_PER_USEC))
+		usleep_range(usecs, usecs + (usecs >> max_slack_shift));
 	else
-		msleep(DIV_ROUND_UP(usecs, 1000));
+		msleep(DIV_ROUND_UP(usecs, USEC_PER_MSEC));
 }
 
 #endif /* defined(_LINUX_DELAY_H) */
---8<----

Thanks,

	Anna-Maria

Rafael J. Wysocki Aug. 30, 2024, 1:56 p.m. UTC | #9

On Fri, Aug 30, 2024 at 7:55 AM Len Brown <lenb@kernel.org> wrote:
>
> On Thu, Aug 29, 2024 at 11:37 AM Anna-Maria Behnsen
> <anna-maria@linutronix.de> wrote:
>
> > I created a patch for fsleep() - only complie tested - which should make
> > sure that the slack is maximum 25%. What do you think about it? Might it
> > be helpful?
>
> If the purpose of using msleep instead of usleep_range is to lower
> the cost of the timer sub-system....
>
> then I'm not sure that choosing to do an msleep instead of a usleep_range
> based on the timer duration makes any sense.
>
> The lighter overhead of the msleep is something that is more important
> when there are more timers.  More timers is not the same as longer timers.

My understanding is that when a new timer is added, it is not actually
known how many timers there are overall in use in the system, so it is
generally better to use a more lightweight variant in case there are
many of them.

However, the more lightweight variant is not suitable for short sleep
durations because it adds bloat of 2 jiffies (after the recent change
from Anna-Maria), so the idea is to start using msleep() at the point
when the bloat added by it starts to be comparable to the delta
between the usleep_range() arguments (used for short sleeps).

Now, because it is not known exactly what timer precision is required,
some assumptions need to be made and IMV it is reasonable to assume
that the actual sleep duration is expected to be somewhat greater than
requested, but it is also not expected to be much greater than
requested, so using a fraction of the requested sleep time as the
usleep_range() makes sense to me.

Of course, you can argue that in the ACPI case there are those
high-count loops and so more exact sleep durations are better, but
realistically this is mostly about system suspend/resume times and the
difference is not something that cannot be tolerated in that case IMV.
Also it would be better to avoid running those high-count loops in the
first place.

Overall, if my understanding is correct, I generally agree with
Anna-Maria's approach.

Rafael J. Wysocki Nov. 15, 2024, 9:45 p.m. UTC | #10

On Fri, Nov 15, 2024 at 10:20 PM Len Brown <lenb@kernel.org> wrote:
>
> Re: if short sleep then use usleep_range()
>
> I think we all agree on this.
>
> Though in the ACPI flows that brought up this topic, setting max to
> anything other than min is simply a request for bad performance.

I disagree with the "bad performance" statement.

Performance lower than the possible maximum doesn't necessarily count as "bad".

> Re: if long sleeps then use msleep()
>
> ... because a jiffy based timer effectively forces coalescing, and is
> the lowest overhead.
>
> The problem with this logic is, as you mention, coalescing is a
> function of the distribution of timer expirations over time,it is not
> a function of the duration of those timers.

I just think that high precision is not necessary for long timeouts.

I also don't think that ASL programmers expect high precision in those cases.

Rafael J. Wysocki Nov. 18, 2024, 10:07 a.m. UTC | #11

On Fri, Nov 15, 2024 at 11:05 PM Len Brown <lenb@kernel.org> wrote:
>
> > Performance lower than the possible maximum doesn't necessarily count as "bad".
>
> System resume slower than 1 second will fail multiple product acceptance tests.
>
> That isn't good, it is bad :-)
>
> > > Re: if long sleeps then use msleep()
> > >
> > > ... because a jiffy based timer effectively forces coalescing, and is
> > > the lowest overhead.
> > >
> > > The problem with this logic is, as you mention, coalescing is a
> > > function of the distribution of timer expirations over time,it is not
> > > a function of the duration of those timers.
> >
> > I just think that high precision is not necessary for long timeouts.
> >
> > I also don't think that ASL programmers expect high precision in those cases.
>
> Precision isn't the question.
> The benefit of additional delay is the question.
>
> ie. if an ASL programmer asks for Sleep(100), they expect it to take
> no less than 100ms.
>
> msleep(100) takes 106ms -- effectively always.

For HZ=250.

> usleep-range(100,100) takes 100ms, effectively always.
>
> Is the additional 6ms delay really worth the saved overhead of using
> jiffies rather than a timer for an ACPI flow?

Or the other way around, is the better timer precision worth the
additional overhead?

AFAICS, the reason why you are pushing so hard for this is
suspend/resume delays due to loops of many iterations that sleep for a
short time in every iteration.

I'm kind of having a hard time with accepting the argument that the
kernel needs to be made to use more resources always in the ACPI path
to address just this one case.

ACPI: Remove msleep() bloat from acpi_os_sleep()

Commit Message

Comments

Patch