diff mbox series

[1/2] soc: qcom: aoss: Don't wait for IRQ if we might be in suspend/resume noirq

Message ID 20200805091141.1.I86b3faaecb0d82997b599b1300f879606c71e116@changeid
State New
Headers show
Series [1/2] soc: qcom: aoss: Don't wait for IRQ if we might be in suspend/resume noirq | expand

Commit Message

Doug Anderson Aug. 5, 2020, 4:16 p.m. UTC
Running suspend/resume tests on a sc7180-based board with a modem I
found that both system suspend and system resume would hang for 1
second.  These messages indicate where:

  genpd genpd:0:4080000.remoteproc: calling genpd_suspend_noirq+0x0/0x2c @ 18659, parent: none
  genpd genpd:0:4080000.remoteproc: genpd_suspend_noirq+0x0/0x2c returned 0 after 987917 usecs

Adding a printout, I found that we were working with the power domain
where "res->pd.name" was "modem".

I found that we were hanging on the wait_event_interruptible_timeout()
call in qmp_send().  Specifically we'd wait for the whole 1 second
timeout to hit, then we'd notice that our condition was true and would
continue on our merry way.  Sure enough, I could confirm that
wait_event_interruptible_timeout() was returning "1" which indicates
that the condition evaluated to true and we also timed out.

Dumping the stack at the time of the failure made the problem clear.
Specifically the stack looked like:
   qmp_send+0x1cc/0x210
   qmp_pd_power_toggle+0x90/0xb8
   qmp_pd_power_off+0x20/0x2c
   genpd_sync_power_off+0x80/0x12c
   genpd_finish_suspend+0xd8/0x108
   genpd_suspend_noirq+0x20/0x2c
   dpm_run_callback+0xe0/0x1d4
   __device_suspend_noirq+0xfc/0x200
   dpm_suspend_noirq+0x174/0x3bc
   suspend_devices_and_enter+0x198/0x8a0
   pm_suspend+0x550/0x6f4
As you can see we're running from the "noirq" callback.  Looking at
what was supposed to wake us up, it was qmp_intr() (our IRQ handler).
Doh!

I believe that the correct fix here is to assume that our power_off /
power_on functions might be called at "noirq" time and just always
poll if we're called via that path.  Other paths can continue to wait
for the IRQ.

Fixes: 2209481409b7 ("soc: qcom: Add AOSS QMP driver")
Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
This problem was observed on the Chrome OS 5.4 tree which has some
extra patches in it compared to mainline.  The top of the current
Qualcomm tree didn't have the delay, but that's probably because
everything isn't fully enabled there yet.  I at least confirmed that
this patch doesn't actually _break_ anything on mainline, though.

 drivers/soc/qcom/qcom_aoss.c | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

Comments

Stephen Boyd Aug. 5, 2020, 5:28 p.m. UTC | #1
Quoting Douglas Anderson (2020-08-05 09:16:11)
> The function qmp_send() called wait_event_interruptible_timeout() to
> wait for its interrupt.  However, this function did not check for
> -ERESTARTSYS and assumed that any non-zero return value meant that the
> event happened.
> 
> While we could try to figure out how to handle interruptions by
> figuring out how to cancel and/or undo our transfer in a race-free way
> and then communicating this status back to all of our callers, that
> seems like a whole lot of complexity.  As I understand it the transfer
> should happen rather quickly and if we're really hitting the 1 second
> timeout we're in deep doggy doodoo anyway.  Let's just use the
> non-interruptible version of the function and call it good enough.
> 
> Found by code inspection.  No known test cases expose the problem
> described here.
> 
> Fixes: 2209481409b7 ("soc: qcom: Add AOSS QMP driver")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---

I would put this first in the series as it's an obvious bug fix.

Reviewed-by: Stephen Boyd <swboyd@chromium.org>
Stephen Boyd Aug. 5, 2020, 5:36 p.m. UTC | #2
Quoting Douglas Anderson (2020-08-05 09:16:10)
> Running suspend/resume tests on a sc7180-based board with a modem I
> found that both system suspend and system resume would hang for 1
> second.  These messages indicate where:
> 
>   genpd genpd:0:4080000.remoteproc: calling genpd_suspend_noirq+0x0/0x2c @ 18659, parent: none
>   genpd genpd:0:4080000.remoteproc: genpd_suspend_noirq+0x0/0x2c returned 0 after 987917 usecs
> 
> Adding a printout, I found that we were working with the power domain
> where "res->pd.name" was "modem".
> 
> I found that we were hanging on the wait_event_interruptible_timeout()
> call in qmp_send().  Specifically we'd wait for the whole 1 second
> timeout to hit, then we'd notice that our condition was true and would
> continue on our merry way.  Sure enough, I could confirm that
> wait_event_interruptible_timeout() was returning "1" which indicates
> that the condition evaluated to true and we also timed out.
> 
> Dumping the stack at the time of the failure made the problem clear.
> Specifically the stack looked like:
>    qmp_send+0x1cc/0x210
>    qmp_pd_power_toggle+0x90/0xb8
>    qmp_pd_power_off+0x20/0x2c
>    genpd_sync_power_off+0x80/0x12c
>    genpd_finish_suspend+0xd8/0x108
>    genpd_suspend_noirq+0x20/0x2c
>    dpm_run_callback+0xe0/0x1d4
>    __device_suspend_noirq+0xfc/0x200
>    dpm_suspend_noirq+0x174/0x3bc
>    suspend_devices_and_enter+0x198/0x8a0
>    pm_suspend+0x550/0x6f4
> As you can see we're running from the "noirq" callback.  Looking at
> what was supposed to wake us up, it was qmp_intr() (our IRQ handler).
> Doh!

Why is the genpd being powered off at all? It looks like the driver is
written in a way that it doesn't expect this to happen. See where
adsp_pds_disable() is called from. Looks like the remoteproc "stop"
callback should be called or the driver should be detached.

It sort of looks like the genpd is expected to be at the max level all
the time (it sets INT_MAX in adsp_pds_enable(), cool). Maybe we need to
add some sort of suspend hooks to the remote proc driver instead? Where
those suspend hooks are called earlier and drop the genpd performance
state request but otherwise leave it enabled across suspend?

I know this isn't clearing the land mine that is calling this code from
noirq phase of suspend, but I'm just looking at the driver and thinking
that it never expected to be called from this phase of suspend to begin
with.

> 
> I believe that the correct fix here is to assume that our power_off /
> power_on functions might be called at "noirq" time and just always
> poll if we're called via that path.  Other paths can continue to wait
> for the IRQ.
> 
> Fixes: 2209481409b7 ("soc: qcom: Add AOSS QMP driver")
> Signed-off-by: Douglas Anderson <dianders@chromium.org>
> ---
> This problem was observed on the Chrome OS 5.4 tree which has some
> extra patches in it compared to mainline.  The top of the current
> Qualcomm tree didn't have the delay, but that's probably because
> everything isn't fully enabled there yet.  I at least confirmed that
> this patch doesn't actually _break_ anything on mainline, though.
Doug Anderson Aug. 5, 2020, 8:24 p.m. UTC | #3
Hi,

On Wed, Aug 5, 2020 at 10:36 AM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Douglas Anderson (2020-08-05 09:16:10)
> > Running suspend/resume tests on a sc7180-based board with a modem I
> > found that both system suspend and system resume would hang for 1
> > second.  These messages indicate where:
> >
> >   genpd genpd:0:4080000.remoteproc: calling genpd_suspend_noirq+0x0/0x2c @ 18659, parent: none
> >   genpd genpd:0:4080000.remoteproc: genpd_suspend_noirq+0x0/0x2c returned 0 after 987917 usecs
> >
> > Adding a printout, I found that we were working with the power domain
> > where "res->pd.name" was "modem".
> >
> > I found that we were hanging on the wait_event_interruptible_timeout()
> > call in qmp_send().  Specifically we'd wait for the whole 1 second
> > timeout to hit, then we'd notice that our condition was true and would
> > continue on our merry way.  Sure enough, I could confirm that
> > wait_event_interruptible_timeout() was returning "1" which indicates
> > that the condition evaluated to true and we also timed out.
> >
> > Dumping the stack at the time of the failure made the problem clear.
> > Specifically the stack looked like:
> >    qmp_send+0x1cc/0x210
> >    qmp_pd_power_toggle+0x90/0xb8
> >    qmp_pd_power_off+0x20/0x2c
> >    genpd_sync_power_off+0x80/0x12c
> >    genpd_finish_suspend+0xd8/0x108
> >    genpd_suspend_noirq+0x20/0x2c
> >    dpm_run_callback+0xe0/0x1d4
> >    __device_suspend_noirq+0xfc/0x200
> >    dpm_suspend_noirq+0x174/0x3bc
> >    suspend_devices_and_enter+0x198/0x8a0
> >    pm_suspend+0x550/0x6f4
> > As you can see we're running from the "noirq" callback.  Looking at
> > what was supposed to wake us up, it was qmp_intr() (our IRQ handler).
> > Doh!

As per is typical for me, I'm poking around in code that I have very
little context in.  :-P  If something I'm saying seems wrong, feel
free to correct.


> Why is the genpd being powered off at all? It looks like the driver is
> written in a way that it doesn't expect this to happen. See where
> adsp_pds_disable() is called from. Looks like the remoteproc "stop"
> callback should be called or the driver should be detached.
>
> It sort of looks like the genpd is expected to be at the max level all
> the time (it sets INT_MAX in adsp_pds_enable(), cool).

In general in Linux there are some things that, at suspend time, get
done behind a driver's back.  The regulator API, for instance, allows
for regulators to be turned off in suspend even if a driver leaves
them on.  Sure, it's good practice for a driver to be explicit but the
regulator suspend states do allow for the more heavy-handed approach.

I guess I assume that genpd is a bit similar.  If a driver leaves a
genpd on all the time then it will still be turned off at suspend time
and then turned back on at resume time.  It seems like it must be part
of the genpd API.  Specifically genpd_sync_power_off() says: "Check if
the given PM domain can be powered off (during system suspend or
hibernation) and do that if so."  That makes it seem like it's how
genpd works.

Reading all the descriptions of things like GENPD_FLAG_ALWAYS_ON,
GENPD_FLAG_ACTIVE_WAKEUP, GENPD_FLAG_RPM_ALWAYS_ON makes me even more
convinced that it's normal (unless otherwise specified) for genpds to
get turned off in suspend even if a driver just blindly left them on.

Presumably if this "modem" genpd is supposed to stay on in suspend
time it should have been marked "always on"?  I'd guess we'd need to
add "GENPD_FLAG_ALWAYS_ON" in some (or all?) cases in qmp_pd_add() if
this was true?


> Maybe we need to
> add some sort of suspend hooks to the remote proc driver instead? Where
> those suspend hooks are called earlier and drop the genpd performance
> state request but otherwise leave it enabled across suspend?

I think you're saying:

a) You think it's a bug today that the "modem" genpd is being powered
off in suspend.  Any evidence to back this up?

b) Assuming it's a bug today, we should mark the "modem" as
GENPD_FLAG_ALWAYS_ON.

c) If there are genpds that sometimes should be left on in suspend but
sometimes not (and that doesn't match up with what
GENPD_FLAG_ACTIVE_WAKEUP does), then we'd have to pass
GENPD_FLAG_ALWAYS_ON as a flag and then add suspend hooks to make the
decision for us.

Did I understand that correctly?

...or are you suggesting that we work around the fact that
qmp_pd_power_off() can't be called at "noirq" time by forcing it to
suspend earlier?

...or am I just totally confused and you meant something else?


> I know this isn't clearing the land mine that is calling this code from
> noirq phase of suspend, but I'm just looking at the driver and thinking
> that it never expected to be called from this phase of suspend to begin
> with.

You're saying that qmp_pd_power_off() wasn't expecting to be called
from the noirq phase of suspend?  Sure, I guess not given the bug.
...but once we fix the bug, it works fine, doesn't it?  ...and it
appears that it's part of the genpd API to be able to be called from
the noirq phase.  To me that means that, even if we were supposed to
be keeping this particular PD on during suspend we should take my
patch.


So the summary is: I still think my patch is correct, but I could
certainly still be convinced otherwise.

-Doug
Doug Anderson Aug. 5, 2020, 8:25 p.m. UTC | #4
Hi,

On Wed, Aug 5, 2020 at 10:28 AM Stephen Boyd <swboyd@chromium.org> wrote:
>
> Quoting Douglas Anderson (2020-08-05 09:16:11)
> > The function qmp_send() called wait_event_interruptible_timeout() to
> > wait for its interrupt.  However, this function did not check for
> > -ERESTARTSYS and assumed that any non-zero return value meant that the
> > event happened.
> >
> > While we could try to figure out how to handle interruptions by
> > figuring out how to cancel and/or undo our transfer in a race-free way
> > and then communicating this status back to all of our callers, that
> > seems like a whole lot of complexity.  As I understand it the transfer
> > should happen rather quickly and if we're really hitting the 1 second
> > timeout we're in deep doggy doodoo anyway.  Let's just use the
> > non-interruptible version of the function and call it good enough.
> >
> > Found by code inspection.  No known test cases expose the problem
> > described here.
> >
> > Fixes: 2209481409b7 ("soc: qcom: Add AOSS QMP driver")
> > Signed-off-by: Douglas Anderson <dianders@chromium.org>
> > ---
>
> I would put this first in the series as it's an obvious bug fix.

I guess I thought of it the other way.  This is a less serious problem
(no known way to tickle it) and so deserved to be 2nd.  I'm happy to
flip it as needed, though.  It would also be trivially easy to flip
when applying.


> Reviewed-by: Stephen Boyd <swboyd@chromium.org>

Thanks!

-Doug
Stephen Boyd Aug. 5, 2020, 11:02 p.m. UTC | #5
+Sibi who wrote the code

Quoting Doug Anderson (2020-08-05 13:24:06)
> 
> On Wed, Aug 5, 2020 at 10:36 AM Stephen Boyd <swboyd@chromium.org> wrote:
> >
> > Why is the genpd being powered off at all? It looks like the driver is
> > written in a way that it doesn't expect this to happen. See where
> > adsp_pds_disable() is called from. Looks like the remoteproc "stop"
> > callback should be called or the driver should be detached.
> >
> > It sort of looks like the genpd is expected to be at the max level all
> > the time (it sets INT_MAX in adsp_pds_enable(), cool).
> 
> In general in Linux there are some things that, at suspend time, get
> done behind a driver's back.  The regulator API, for instance, allows
> for regulators to be turned off in suspend even if a driver leaves
> them on.  Sure, it's good practice for a driver to be explicit but the
> regulator suspend states do allow for the more heavy-handed approach.
> 
> I guess I assume that genpd is a bit similar.  If a driver leaves a
> genpd on all the time then it will still be turned off at suspend time
> and then turned back on at resume time.  It seems like it must be part
> of the genpd API.  Specifically genpd_sync_power_off() says: "Check if
> the given PM domain can be powered off (during system suspend or
> hibernation) and do that if so."  That makes it seem like it's how
> genpd works.
> 
> Reading all the descriptions of things like GENPD_FLAG_ALWAYS_ON,
> GENPD_FLAG_ACTIVE_WAKEUP, GENPD_FLAG_RPM_ALWAYS_ON makes me even more
> convinced that it's normal (unless otherwise specified) for genpds to
> get turned off in suspend even if a driver just blindly left them on.
> 
> Presumably if this "modem" genpd is supposed to stay on in suspend
> time it should have been marked "always on"?  I'd guess we'd need to
> add "GENPD_FLAG_ALWAYS_ON" in some (or all?) cases in qmp_pd_add() if
> this was true?

Agreed. I can't read the mind of Sibi so I can only guess that Sibi
wasn't expecting this behavior by reading the driver structure. That
could be a wrong assumption.

> 
> 
> > Maybe we need to
> > add some sort of suspend hooks to the remote proc driver instead? Where
> > those suspend hooks are called earlier and drop the genpd performance
> > state request but otherwise leave it enabled across suspend?
> 
> I think you're saying:
> 
> a) You think it's a bug today that the "modem" genpd is being powered
> off in suspend.  Any evidence to back this up?
> 
> b) Assuming it's a bug today, we should mark the "modem" as
> GENPD_FLAG_ALWAYS_ON.
> 
> c) If there are genpds that sometimes should be left on in suspend but
> sometimes not (and that doesn't match up with what
> GENPD_FLAG_ACTIVE_WAKEUP does), then we'd have to pass
> GENPD_FLAG_ALWAYS_ON as a flag and then add suspend hooks to make the
> decision for us.
> 
> Did I understand that correctly?
> 
> ...or are you suggesting that we work around the fact that
> qmp_pd_power_off() can't be called at "noirq" time by forcing it to
> suspend earlier?
> 
> ...or am I just totally confused and you meant something else?
> 
> 
> > I know this isn't clearing the land mine that is calling this code from
> > noirq phase of suspend, but I'm just looking at the driver and thinking
> > that it never expected to be called from this phase of suspend to begin
> > with.
> 
> You're saying that qmp_pd_power_off() wasn't expecting to be called
> from the noirq phase of suspend?  Sure, I guess not given the bug.
> ...but once we fix the bug, it works fine, doesn't it?  ...and it
> appears that it's part of the genpd API to be able to be called from
> the noirq phase.  To me that means that, even if we were supposed to
> be keeping this particular PD on during suspend we should take my
> patch.
> 
> 
> So the summary is: I still think my patch is correct, but I could
> certainly still be convinced otherwise.
> 

I'm trying to say that the driver looks like it expects to power off the
genpd in the adsp_stop() callback. That same callback sends some sort of
message to the modem saying that it is being stopped (see
qcom_q6v5_request_stop()). Turning the performance state down, or
turning the power domain off completely, without telling the modem that
it's happening like as is done in adsp_stop() looks wrong. But who
knows, maybe the modem is happy with that and doesn't care?

In general, the whole thing looks weird to me because I would expect the
modem to take care of its own power requirements, including this
"load_state" one. Anyway, I hope Sibi can clarify what's going on.
Sibi Sankar Aug. 6, 2020, 2:34 p.m. UTC | #6
On 2020-08-06 04:32, Stephen Boyd wrote:
> +Sibi who wrote the code
> 
> Quoting Doug Anderson (2020-08-05 13:24:06)
>> 
>> On Wed, Aug 5, 2020 at 10:36 AM Stephen Boyd <swboyd@chromium.org> 
>> wrote:
>> >
>> > Why is the genpd being powered off at all? It looks like the driver is
>> > written in a way that it doesn't expect this to happen. See where
>> > adsp_pds_disable() is called from. Looks like the remoteproc "stop"
>> > callback should be called or the driver should be detached.
>> >
>> > It sort of looks like the genpd is expected to be at the max level all
>> > the time (it sets INT_MAX in adsp_pds_enable(), cool).
>> 
>> In general in Linux there are some things that, at suspend time, get
>> done behind a driver's back.  The regulator API, for instance, allows
>> for regulators to be turned off in suspend even if a driver leaves
>> them on.  Sure, it's good practice for a driver to be explicit but the
>> regulator suspend states do allow for the more heavy-handed approach.
>> 
>> I guess I assume that genpd is a bit similar.  If a driver leaves a
>> genpd on all the time then it will still be turned off at suspend time
>> and then turned back on at resume time.  It seems like it must be part
>> of the genpd API.  Specifically genpd_sync_power_off() says: "Check if
>> the given PM domain can be powered off (during system suspend or
>> hibernation) and do that if so."  That makes it seem like it's how
>> genpd works.
>> 
>> Reading all the descriptions of things like GENPD_FLAG_ALWAYS_ON,
>> GENPD_FLAG_ACTIVE_WAKEUP, GENPD_FLAG_RPM_ALWAYS_ON makes me even more
>> convinced that it's normal (unless otherwise specified) for genpds to
>> get turned off in suspend even if a driver just blindly left them on.
>> 
>> Presumably if this "modem" genpd is supposed to stay on in suspend
>> time it should have been marked "always on"?  I'd guess we'd need to
>> add "GENPD_FLAG_ALWAYS_ON" in some (or all?) cases in qmp_pd_add() if
>> this was true?
> 
> Agreed. I can't read the mind of Sibi so I can only guess that Sibi
> wasn't expecting this behavior by reading the driver structure. That
> could be a wrong assumption.
> 
>> 
>> 
>> > Maybe we need to
>> > add some sort of suspend hooks to the remote proc driver instead? Where
>> > those suspend hooks are called earlier and drop the genpd performance
>> > state request but otherwise leave it enabled across suspend?
>> 
>> I think you're saying:
>> 
>> a) You think it's a bug today that the "modem" genpd is being powered
>> off in suspend.  Any evidence to back this up?
>> 
>> b) Assuming it's a bug today, we should mark the "modem" as
>> GENPD_FLAG_ALWAYS_ON.
>> 
>> c) If there are genpds that sometimes should be left on in suspend but
>> sometimes not (and that doesn't match up with what
>> GENPD_FLAG_ACTIVE_WAKEUP does), then we'd have to pass
>> GENPD_FLAG_ALWAYS_ON as a flag and then add suspend hooks to make the
>> decision for us.

Doug/Stephen,

Yes this is a bug, we wouldn't want
to disable aoss_qmp genpd for modem
during suspend (when the modem is
running). The qmp send for modem
is the primary means through which
aoss determines whether to wait for
modem before proceeding to sleep. So
looks like updating the flag with
GENPD_FLAG_ACTIVE_WAKEUP is the way
to go. But introducing another flag
that doesn't touch genpd's during
suspend/resume should also work.


>> 
>> Did I understand that correctly?
>> 
>> ...or are you suggesting that we work around the fact that
>> qmp_pd_power_off() can't be called at "noirq" time by forcing it to
>> suspend earlier?
>> 
>> ...or am I just totally confused and you meant something else?
>> 
>> 
>> > I know this isn't clearing the land mine that is calling this code from
>> > noirq phase of suspend, but I'm just looking at the driver and thinking
>> > that it never expected to be called from this phase of suspend to begin
>> > with.
>> 
>> You're saying that qmp_pd_power_off() wasn't expecting to be called
>> from the noirq phase of suspend?  Sure, I guess not given the bug.
>> ...but once we fix the bug, it works fine, doesn't it?  ...and it
>> appears that it's part of the genpd API to be able to be called from
>> the noirq phase.  To me that means that, even if we were supposed to
>> be keeping this particular PD on during suspend we should take my
>> patch.
>> 
>> 
>> So the summary is: I still think my patch is correct, but I could
>> certainly still be convinced otherwise.
>> 
> 
> I'm trying to say that the driver looks like it expects to power off 
> the
> genpd in the adsp_stop() callback. That same callback sends some sort 
> of
> message to the modem saying that it is being stopped (see
> qcom_q6v5_request_stop()). Turning the performance state down, or
> turning the power domain off completely, without telling the modem that
> it's happening like as is done in adsp_stop() looks wrong. But who
> knows, maybe the modem is happy with that and doesn't care?
> 
> In general, the whole thing looks weird to me because I would expect 
> the
> modem to take care of its own power requirements, including this
> "load_state" one. Anyway, I hope Sibi can clarify what's going on.
Sibi Sankar Aug. 6, 2020, 5:33 p.m. UTC | #7
On 2020-08-06 22:40, Doug Anderson wrote:
> Hi,
> 
> On Thu, Aug 6, 2020 at 7:36 AM Sibi Sankar <sibis@codeaurora.org> 
> wrote:
>> 
>> On 2020-08-06 04:32, Stephen Boyd wrote:
>> > +Sibi who wrote the code
>> >
>> > Quoting Doug Anderson (2020-08-05 13:24:06)
>> >>
>> >> On Wed, Aug 5, 2020 at 10:36 AM Stephen Boyd <swboyd@chromium.org>
>> >> wrote:
>> >> >
>> >> > Why is the genpd being powered off at all? It looks like the driver is
>> >> > written in a way that it doesn't expect this to happen. See where
>> >> > adsp_pds_disable() is called from. Looks like the remoteproc "stop"
>> >> > callback should be called or the driver should be detached.
>> >> >
>> >> > It sort of looks like the genpd is expected to be at the max level all
>> >> > the time (it sets INT_MAX in adsp_pds_enable(), cool).
>> >>
>> >> In general in Linux there are some things that, at suspend time, get
>> >> done behind a driver's back.  The regulator API, for instance, allows
>> >> for regulators to be turned off in suspend even if a driver leaves
>> >> them on.  Sure, it's good practice for a driver to be explicit but the
>> >> regulator suspend states do allow for the more heavy-handed approach.
>> >>
>> >> I guess I assume that genpd is a bit similar.  If a driver leaves a
>> >> genpd on all the time then it will still be turned off at suspend time
>> >> and then turned back on at resume time.  It seems like it must be part
>> >> of the genpd API.  Specifically genpd_sync_power_off() says: "Check if
>> >> the given PM domain can be powered off (during system suspend or
>> >> hibernation) and do that if so."  That makes it seem like it's how
>> >> genpd works.
>> >>
>> >> Reading all the descriptions of things like GENPD_FLAG_ALWAYS_ON,
>> >> GENPD_FLAG_ACTIVE_WAKEUP, GENPD_FLAG_RPM_ALWAYS_ON makes me even more
>> >> convinced that it's normal (unless otherwise specified) for genpds to
>> >> get turned off in suspend even if a driver just blindly left them on.
>> >>
>> >> Presumably if this "modem" genpd is supposed to stay on in suspend
>> >> time it should have been marked "always on"?  I'd guess we'd need to
>> >> add "GENPD_FLAG_ALWAYS_ON" in some (or all?) cases in qmp_pd_add() if
>> >> this was true?
>> >
>> > Agreed. I can't read the mind of Sibi so I can only guess that Sibi
>> > wasn't expecting this behavior by reading the driver structure. That
>> > could be a wrong assumption.
>> >
>> >>
>> >>
>> >> > Maybe we need to
>> >> > add some sort of suspend hooks to the remote proc driver instead? Where
>> >> > those suspend hooks are called earlier and drop the genpd performance
>> >> > state request but otherwise leave it enabled across suspend?
>> >>
>> >> I think you're saying:
>> >>
>> >> a) You think it's a bug today that the "modem" genpd is being powered
>> >> off in suspend.  Any evidence to back this up?
>> >>
>> >> b) Assuming it's a bug today, we should mark the "modem" as
>> >> GENPD_FLAG_ALWAYS_ON.
>> >>
>> >> c) If there are genpds that sometimes should be left on in suspend but
>> >> sometimes not (and that doesn't match up with what
>> >> GENPD_FLAG_ACTIVE_WAKEUP does), then we'd have to pass
>> >> GENPD_FLAG_ALWAYS_ON as a flag and then add suspend hooks to make the
>> >> decision for us.
>> 
>> Doug/Stephen,
>> 
>> Yes this is a bug, we wouldn't want
>> to disable aoss_qmp genpd for modem
>> during suspend (when the modem is
>> running). The qmp send for modem
>> is the primary means through which
>> aoss determines whether to wait for
>> modem before proceeding to sleep. So
>> looks like updating the flag with
>> GENPD_FLAG_ACTIVE_WAKEUP is the way
>> to go. But introducing another flag
>> that doesn't touch genpd's during
>> suspend/resume should also work.
> 
> OK, sounds good.  As per out-of-band conversation:
> 
> * You'll plan to post a patch updating the flag.
> 
> * There's still nothing here that says my patch is the wrong thing to
> do also.  It seems like genpd poweroff routine are expected to be able
> to run at "noirq" time so we should make sure we are able to do that.
> 
> I'm also curious: my patch doesn't affect the behavior.  The genpd
> would be powered off with or without my patch, my patch just removes a
> pointless 1 second delay.  Therefore I guess today there is some type
> of bug because the genpd is being turned off.  What would be the
> visible impact of that bug?  ...or is it somehow masked by something
> else keeping this power on so it wasn't an issue right now?

I've been told AOSS decides to wait
for modem suspend if its been notified
that modem is on through qmp_send. AFAIK
we never ran into this because AOSS sleep
sequence starts after xo-shutdown which
wont be reached in the presence of active
rpmh votes from modem.

Regardless we definitely want this genpd left
untouched during suspend/resume.

> 
> -Doug
> 
> 
> -Doug
diff mbox series

Patch

diff --git a/drivers/soc/qcom/qcom_aoss.c b/drivers/soc/qcom/qcom_aoss.c
index ed2c687c16b3..818cdf74a267 100644
--- a/drivers/soc/qcom/qcom_aoss.c
+++ b/drivers/soc/qcom/qcom_aoss.c
@@ -6,6 +6,7 @@ 
 #include <linux/clk-provider.h>
 #include <linux/interrupt.h>
 #include <linux/io.h>
+#include <linux/iopoll.h>
 #include <linux/mailbox_client.h>
 #include <linux/module.h>
 #include <linux/platform_device.h>
@@ -215,6 +216,8 @@  static bool qmp_message_empty(struct qmp *qmp)
  * @qmp: qmp context
  * @data: message to be sent
  * @len: length of the message
+ * @noirq: If true we might have been called from the "noirq" suspend/resume
+ *         callbacks, so fall back to polling mode for waiting for completion.
  *
  * Transmit @data to AOSS and wait for the AOSS to acknowledge the message.
  * @len must be a multiple of 4 and not longer than the mailbox size. Access is
@@ -222,11 +225,12 @@  static bool qmp_message_empty(struct qmp *qmp)
  *
  * Return: 0 on success, negative errno on failure
  */
-static int qmp_send(struct qmp *qmp, const void *data, size_t len)
+static int qmp_send(struct qmp *qmp, const void *data, size_t len, bool noirq)
 {
 	long time_left;
 	size_t tlen;
 	int ret;
+	bool is_empty;
 
 	if (WARN_ON(len + sizeof(u32) > qmp->size))
 		return -EINVAL;
@@ -245,8 +249,16 @@  static int qmp_send(struct qmp *qmp, const void *data, size_t len)
 	tlen = readl(qmp->msgram + qmp->offset);
 	qmp_kick(qmp);
 
-	time_left = wait_event_interruptible_timeout(qmp->event,
-						     qmp_message_empty(qmp), HZ);
+	/*
+	 * We may be called from a suspend/resume "noirq" context.  In such
+	 * a case we have no choice but to poll.
+	 */
+	if (noirq)
+		time_left = readx_poll_timeout_atomic(qmp_message_empty, qmp,
+						      is_empty, is_empty, 1U, 1000000U);
+	else
+		time_left = wait_event_interruptible_timeout(qmp->event,
+							     qmp_message_empty(qmp), HZ);
 	if (!time_left) {
 		dev_err(qmp->dev, "ucore did not ack channel\n");
 		ret = -ETIMEDOUT;
@@ -267,7 +279,7 @@  static int qmp_qdss_clk_prepare(struct clk_hw *hw)
 	static const char buf[QMP_MSG_LEN] = "{class: clock, res: qdss, val: 1}";
 	struct qmp *qmp = container_of(hw, struct qmp, qdss_clk);
 
-	return qmp_send(qmp, buf, sizeof(buf));
+	return qmp_send(qmp, buf, sizeof(buf), false);
 }
 
 static void qmp_qdss_clk_unprepare(struct clk_hw *hw)
@@ -275,7 +287,7 @@  static void qmp_qdss_clk_unprepare(struct clk_hw *hw)
 	static const char buf[QMP_MSG_LEN] = "{class: clock, res: qdss, val: 0}";
 	struct qmp *qmp = container_of(hw, struct qmp, qdss_clk);
 
-	qmp_send(qmp, buf, sizeof(buf));
+	qmp_send(qmp, buf, sizeof(buf), false);
 }
 
 static const struct clk_ops qmp_qdss_clk_ops = {
@@ -321,7 +333,7 @@  static int qmp_pd_power_toggle(struct qmp_pd *res, bool enable)
 	snprintf(buf, sizeof(buf),
 		 "{class: image, res: load_state, name: %s, val: %s}",
 		 res->pd.name, enable ? "on" : "off");
-	return qmp_send(res->qmp, buf, sizeof(buf));
+	return qmp_send(res->qmp, buf, sizeof(buf), true);
 }
 
 static int qmp_pd_power_on(struct generic_pm_domain *domain)
@@ -438,7 +450,7 @@  static int qmp_cdev_set_cur_state(struct thermal_cooling_device *cdev,
 			qmp_cdev->name,
 			cdev_state ? "on" : "off");
 
-	ret = qmp_send(qmp_cdev->qmp, buf, sizeof(buf));
+	ret = qmp_send(qmp_cdev->qmp, buf, sizeof(buf), false);
 
 	if (!ret)
 		qmp_cdev->state = cdev_state;