mbox series

[v2,0/1] PM: Start asynchronous suspend threads upfront

Message ID 20240618121327.2177-1-kaiyen.chang@intel.com
Headers show
Series PM: Start asynchronous suspend threads upfront | expand

Message

Chang, Kaiyen June 18, 2024, 12:13 p.m. UTC
*** BLURB HERE ***

Kaiyen Chang (1):
  PM: Start asynchronous suspend threads upfront

 drivers/base/power/main.c | 90 +++++++++++++++++++++++++--------------
 1 file changed, 57 insertions(+), 33 deletions(-)

Comments

Rafael J. Wysocki June 18, 2024, 5:27 p.m. UTC | #1
On Tue, Jun 18, 2024 at 2:13 PM Kaiyen Chang <kaiyen.chang@intel.com> wrote:
>
> Currently, when performing a suspend operation, all devices on the
> dpm_list must wait for the "synchronous" devices that are listed
> after them to complete before the main suspend thread can start
> their suspend routines, even if they are "asynchronous". If the
> suspend routine of a synchronous device must enter a waiting state
> for some reason, it will cause the main suspend thread to go to
> sleep, thereby delaying the processing of all subsequent devices,
> including asynchronous ones, ultimately extending the overall
> suspend time.
>
> By starting the asynchronous suspend threads upfront, we can allow
> the system to handle the suspend routines of these asynchronous
> devices in parallel, without waiting for the suspend routines of
> the synchronous devices listed after them to complete, and without
> breaking their order with respect to their parents and children.
> This way, even if the main suspend thread is blocked, these
> asynchronous suspend threads can continue to run without being
> affected.
>
> Signed-off-by: Kaiyen Chang <kaiyen.chang@intel.com>

How exactly has this been tested?

In the past, changes going in this direction caused system suspend to
hang on at least some platforms (including the ones in my office).

> ---
> Change from v1: Fix some unclear parts in the commit messages.
> ---
>  drivers/base/power/main.c | 90 +++++++++++++++++++++++++--------------
>  1 file changed, 57 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c
> index 4a67e83300e1..6ddd6ef36625 100644
> --- a/drivers/base/power/main.c
> +++ b/drivers/base/power/main.c
> @@ -1283,6 +1283,7 @@ static void async_suspend_noirq(void *data, async_cookie_t cookie)
>
>  static int dpm_noirq_suspend_devices(pm_message_t state)
>  {
> +       struct device *dev;
>         ktime_t starttime = ktime_get();
>         int error = 0;
>
> @@ -1293,26 +1294,33 @@ static int dpm_noirq_suspend_devices(pm_message_t state)
>
>         mutex_lock(&dpm_list_mtx);
>
> +       /*
> +        * Trigger the suspend of "async" devices upfront so they don't have to
> +        * wait for the "non-async" ones that don't depend on them.
> +        */
> +
> +       list_for_each_entry_reverse(dev, &dpm_late_early_list, power.entry)
> +               dpm_async_fn(dev, async_suspend_noirq);
> +
>         while (!list_empty(&dpm_late_early_list)) {
> -               struct device *dev = to_device(dpm_late_early_list.prev);
> +               dev = to_device(dpm_late_early_list.prev);
>
>                 list_move(&dev->power.entry, &dpm_noirq_list);
>
> -               if (dpm_async_fn(dev, async_suspend_noirq))
> -                       continue;
> -
> -               get_device(dev);
> +               if (!dev->power.async_in_progress) {
> +                       get_device(dev);
>
> -               mutex_unlock(&dpm_list_mtx);
> +                       mutex_unlock(&dpm_list_mtx);
>
> -               error = device_suspend_noirq(dev, state, false);
> +                       error = device_suspend_noirq(dev, state, false);
>
> -               put_device(dev);
> +                       put_device(dev);
>
> -               mutex_lock(&dpm_list_mtx);
> +                       mutex_lock(&dpm_list_mtx);
>
> -               if (error || async_error)
> -                       break;
> +                       if (error || async_error)
> +                               break;
> +               }
>         }
>
>         mutex_unlock(&dpm_list_mtx);
> @@ -1454,6 +1462,7 @@ static void async_suspend_late(void *data, async_cookie_t cookie)
>   */
>  int dpm_suspend_late(pm_message_t state)
>  {
> +       struct device *dev;
>         ktime_t starttime = ktime_get();
>         int error = 0;
>
> @@ -1466,26 +1475,33 @@ int dpm_suspend_late(pm_message_t state)
>
>         mutex_lock(&dpm_list_mtx);
>
> +       /*
> +        * Trigger the suspend of "async" devices upfront so they don't have to
> +        * wait for the "non-async" ones that don't depend on them.
> +        */
> +
> +       list_for_each_entry_reverse(dev, &dpm_suspended_list, power.entry)
> +               dpm_async_fn(dev, async_suspend_late);
> +
>         while (!list_empty(&dpm_suspended_list)) {
> -               struct device *dev = to_device(dpm_suspended_list.prev);
> +               dev = to_device(dpm_suspended_list.prev);
>
>                 list_move(&dev->power.entry, &dpm_late_early_list);
>
> -               if (dpm_async_fn(dev, async_suspend_late))
> -                       continue;
> -
> -               get_device(dev);
> +               if (!dev->power.async_in_progress) {
> +                       get_device(dev);
>
> -               mutex_unlock(&dpm_list_mtx);
> +                       mutex_unlock(&dpm_list_mtx);
>
> -               error = device_suspend_late(dev, state, false);
> +                       error = device_suspend_late(dev, state, false);
>
> -               put_device(dev);
> +                       put_device(dev);
>
> -               mutex_lock(&dpm_list_mtx);
> +                       mutex_lock(&dpm_list_mtx);
>
> -               if (error || async_error)
> -                       break;
> +                       if (error || async_error)
> +                               break;
> +               }
>         }
>
>         mutex_unlock(&dpm_list_mtx);
> @@ -1719,6 +1735,7 @@ static void async_suspend(void *data, async_cookie_t cookie)
>   */
>  int dpm_suspend(pm_message_t state)
>  {
> +       struct device *dev;
>         ktime_t starttime = ktime_get();
>         int error = 0;
>
> @@ -1733,26 +1750,33 @@ int dpm_suspend(pm_message_t state)
>
>         mutex_lock(&dpm_list_mtx);
>
> +       /*
> +        * Trigger the suspend of "async" devices upfront so they don't have to
> +        * wait for the "non-async" ones that don't depend on them.
> +        */
> +
> +       list_for_each_entry_reverse(dev, &dpm_prepared_list, power.entry)
> +               dpm_async_fn(dev, async_suspend);
> +
>         while (!list_empty(&dpm_prepared_list)) {
> -               struct device *dev = to_device(dpm_prepared_list.prev);
> +               dev = to_device(dpm_prepared_list.prev);
>
>                 list_move(&dev->power.entry, &dpm_suspended_list);
>
> -               if (dpm_async_fn(dev, async_suspend))
> -                       continue;
> -
> -               get_device(dev);
> +               if (!dev->power.async_in_progress) {
> +                       get_device(dev);
>
> -               mutex_unlock(&dpm_list_mtx);
> +                       mutex_unlock(&dpm_list_mtx);
>
> -               error = device_suspend(dev, state, false);
> +                       error = device_suspend(dev, state, false);
>
> -               put_device(dev);
> +                       put_device(dev);
>
> -               mutex_lock(&dpm_list_mtx);
> +                       mutex_lock(&dpm_list_mtx);
>
> -               if (error || async_error)
> -                       break;
> +                       if (error || async_error)
> +                               break;
> +               }
>         }
>
>         mutex_unlock(&dpm_list_mtx);
> --
> 2.34.1
>
Rafael J. Wysocki July 2, 2024, 1:20 p.m. UTC | #2
On Thu, Jun 20, 2024 at 11:21 AM Chang, Kaiyen <kaiyen.chang@intel.com> wrote:
>
> On Tue, Jun 18, 2024 at 07:27:18PM +0200, Rafael J. Wysocki <rafael@kernel.org> wrote:
> >
> > On Tue, Jun 18, 2024 at 2:13 PM Kaiyen Chang <kaiyen.chang@intel.com>
> > wrote:
> > >
> > > Currently, when performing a suspend operation, all devices on the
> > > dpm_list must wait for the "synchronous" devices that are listed after
> > > them to complete before the main suspend thread can start their
> > > suspend routines, even if they are "asynchronous". If the suspend
> > > routine of a synchronous device must enter a waiting state for some
> > > reason, it will cause the main suspend thread to go to sleep, thereby
> > > delaying the processing of all subsequent devices, including
> > > asynchronous ones, ultimately extending the overall suspend time.
> > >
> > > By starting the asynchronous suspend threads upfront, we can allow the
> > > system to handle the suspend routines of these asynchronous devices in
> > > parallel, without waiting for the suspend routines of the synchronous
> > > devices listed after them to complete, and without breaking their
> > > order with respect to their parents and children.
> > > This way, even if the main suspend thread is blocked, these
> > > asynchronous suspend threads can continue to run without being
> > > affected.
> > >
> > > Signed-off-by: Kaiyen Chang <kaiyen.chang@intel.com>
> >
> > How exactly has this been tested?
> >
> > In the past, changes going in this direction caused system suspend to hang on
> > at least some platforms (including the ones in my office).
> >
>
> Hi Rafael, thanks for your reply.
>
> 1. We have currently verified this patch on the ADL-N Chromebook, running the suspend stress test 1000 times consecutively without encountering any issues.

Just one machine?  That's hardly sufficient.

> 2. Could you tell us what was the reason for suspend to hang back then?

It was what is still the case I think: Some dependencies between
devices are implicitly taken into account when the async suspend
threads are not started upfront.

> The reason why I believe that we can start the async device's suspend threads
> upfront is that, during resume, a sync devices listed after certain async devices
> on the list do not actually rely on the implicit dependency provided by the order
> of the devices on the list.

This problem has nothing to do with resume, because the machines hang
during suspend.

> This is because if there is no consumer-supplier or parent-children relationship between
> these devices, the suspend routine of this sync device will be executed directly without
> waiting for the suspend routines of the async devices listed before it to complete. In other
> words, this sync device does not depend on the async devices listed before them.

Well, not quite, IIUC.

Say there are two devices, A and B, the former is "sync" and the
latter is "async".  Say there is a dependency between them such that A
must suspend before B, but it is not expressed directly (no
parent-child relationship and no device link).

If A is before B in the list order and async suspend is not started
upfront, the dependency is met.  If async suspend is started upfront,
B may suspend before A.

> Therefore, during suspend, there is no need to ensure those async devices listed
> before the sync devices are awake before these sync devices complete their suspend
> routines. In summary, as long as we manage the dependencies between
> consumers/suppliers and parents/children relationships properly, we should be
> able to start the async devices' suspend routines upfront during suspend.

Well, if the dependencies are all represented either through
parent-child relationships or through device links, then yes, but are
they?  On all systems currently in the field?

So if you want to move this forward, please talk to Todd Brandt (CCed)
to arrange for testing your patch in the client power lab.  It must at
least not break suspend on any machines in there.
Chang, Kaiyen Aug. 16, 2024, 8:33 a.m. UTC | #3
On Tue, July 2, 2024 at 09:20:24PM +0200, Rafael J. Wysocki <rafael@kernel.org> wrote:
> 
> On Thu, Jun 20, 2024 at 11:21 AM Chang, Kaiyen <kaiyen.chang@intel.com>
> wrote:
> >
> > On Tue, Jun 18, 2024 at 07:27:18PM +0200, Rafael J. Wysocki
> <rafael@kernel.org> wrote:
> > >
> > > On Tue, Jun 18, 2024 at 2:13 PM Kaiyen Chang
> > > <kaiyen.chang@intel.com>
> > > wrote:
> > > >
> > > > Currently, when performing a suspend operation, all devices on the
> > > > dpm_list must wait for the "synchronous" devices that are listed
> > > > after them to complete before the main suspend thread can start
> > > > their suspend routines, even if they are "asynchronous". If the
> > > > suspend routine of a synchronous device must enter a waiting state
> > > > for some reason, it will cause the main suspend thread to go to
> > > > sleep, thereby delaying the processing of all subsequent devices,
> > > > including asynchronous ones, ultimately extending the overall suspend
> time.
> > > >
> > > > By starting the asynchronous suspend threads upfront, we can allow
> > > > the system to handle the suspend routines of these asynchronous
> > > > devices in parallel, without waiting for the suspend routines of
> > > > the synchronous devices listed after them to complete, and without
> > > > breaking their order with respect to their parents and children.
> > > > This way, even if the main suspend thread is blocked, these
> > > > asynchronous suspend threads can continue to run without being
> > > > affected.
> > > >
> > > > Signed-off-by: Kaiyen Chang <kaiyen.chang@intel.com>
> > >
> > > How exactly has this been tested?
> > >
> > > In the past, changes going in this direction caused system suspend
> > > to hang on at least some platforms (including the ones in my office).
> > >
> >
> > Hi Rafael, thanks for your reply.
> >
> > 1. We have currently verified this patch on the ADL-N Chromebook, running
> the suspend stress test 1000 times consecutively without encountering any
> issues.
> 
> Just one machine?  That's hardly sufficient.
> 

We have verified this patch on JSL, ADL-P, ADL-N, RPL, and MTL Chromebooks.
All devices can pass the suspend stress test 1000 times consecutively.

> > 2. Could you tell us what was the reason for suspend to hang back then?
> 
> It was what is still the case I think: Some dependencies between devices are
> implicitly taken into account when the async suspend threads are not started
> upfront.
> 
> > The reason why I believe that we can start the async device's suspend
> > threads upfront is that, during resume, a sync devices listed after
> > certain async devices on the list do not actually rely on the implicit
> > dependency provided by the order of the devices on the list.
> 
> This problem has nothing to do with resume, because the machines hang
> during suspend.
> 
> > This is because if there is no consumer-supplier or parent-children
> > relationship between these devices, the suspend routine of this sync
> > device will be executed directly without waiting for the suspend
> > routines of the async devices listed before it to complete. In other words,
> this sync device does not depend on the async devices listed before them.
> 
> Well, not quite, IIUC.
> 
> Say there are two devices, A and B, the former is "sync" and the latter is
> "async".  Say there is a dependency between them such that A must suspend
> before B, but it is not expressed directly (no parent-child relationship and no
> device link).
> 
> If A is before B in the list order and async suspend is not started upfront, the
> dependency is met.  If async suspend is started upfront, B may suspend
> before A.
> 

Here are two instances where I mistakenly wrote "suspend" instead of
"resume." Please allow me to correct them as follows:

Before correction:
----------------------------------------------------------------------
This is because if there is no consumer-supplier or parent-children
relationship between these devices, the *suspend* routine of this sync
device will be executed directly without waiting for the *suspend*
routines of the async devices listed before it to complete.

After correction:
----------------------------------------------------------------------
This is because if there is no consumer-supplier or parent-children
relationship between these devices, the *resume* routine of this sync
device will be executed directly without waiting for the *resume*
routines of the async devices listed before it to complete.

Here’s what I want to convey:

The commit 5af84b82701a "PM: Asynchronous suspend and resume of devices"
seems to be based on the assumption that, except for devices with
consumer-supplier or parent-children relationships, no sync device depends
on any async devices. This is because only devices with consumer-supplier
or parent-children relationships would wait for the async routines to
complete.

In the example you previously provided, if A (sync) must suspend before
B (async), it indicates that A depends on B (i.e., A needs B). Therefore,
during the resume phase, A should wait for B's resume routine to complete
before starting its resume routine. However, in commit 5af84b82701a, only
devices with consumer-supplier or parent-children relationships would wait
for the async routine to finish, while the other sync devices would not. This
makes me think that, in fact, the sync devices without clearly defined
dependencies do not actually need any async devices to be ready. Otherwise,
there would be a potential risk here.

> > Therefore, during suspend, there is no need to ensure those async
> > devices listed before the sync devices are awake before these sync
> > devices complete their suspend routines. In summary, as long as we
> > manage the dependencies between consumers/suppliers and
> > parents/children relationships properly, we should be able to start the async
> devices' suspend routines upfront during suspend.
> 
> Well, if the dependencies are all represented either through parent-child
> relationships or through device links, then yes, but are they?  On all systems
> currently in the field?
> 
> So if you want to move this forward, please talk to Todd Brandt (CCed) to
> arrange for testing your patch in the client power lab.  It must at least not
> break suspend on any machines in there.

Thank you for your suggestion. We've requested Todd's assistance with the
testing. The results show that after applying this patch, all the machines
in the lab encountered no issues. Could you please advise on the next steps?