diff mbox series

wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s

Message ID 20241127105709.4014302-1-treapking@chromium.org
State New
Headers show
Series wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s | expand

Commit Message

Pin-yen Lin Nov. 27, 2024, 10:55 a.m. UTC
In commit 52250cbee7f6 ("mwifiex: use timeout variant for
wait_event_interruptible") it was noted that sometimes we seemed
to miss the signal that our host sleep settings took effect. A
10 second timeout was added to the code to make sure we didn't
hang forever waiting. It appears that this problem still exists
and we hit the timeout sometimes for Chromebooks in the field.

Recently on ChromeOS we've started setting the DPM watchdog
to trip if full system suspend takes over 10 seconds. Given
the timeout in the original patch, obviously we're hitting
the DPM watchdog before mwifiex gets a chance to timeout.

While we could increase the DPM watchdog in ChromeOS to avoid
this problem, it's probably better to simply decrease the
timeout. Any time we're waiting several seconds for the
firmware to respond it's likely that the firmware won't ever
respond. With that in mind, decrease the timeout in mwifiex
from 10 seconds to 5 seconds.

Suggested-by: Doug Anderson <dianders@chromium.org>
Signed-off-by: Pin-yen Lin <treapking@chromium.org>
---

 drivers/net/wireless/marvell/mwifiex/sta_ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Doug Anderson Dec. 4, 2024, 1:38 a.m. UTC | #1
Hi,

On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@chromium.org> wrote:
>
> Hi,
>
> On Wed, Nov 27, 2024 at 7:44 AM Doug Anderson <dianders@chromium.org> wrote:
> > On Wed, Nov 27, 2024 at 2:58 AM Pin-yen Lin <treapking@chromium.org> wrote:
> > > In commit 52250cbee7f6 ("mwifiex: use timeout variant for
> > > wait_event_interruptible") it was noted that sometimes we seemed
> > > to miss the signal that our host sleep settings took effect. A
> > > 10 second timeout was added to the code to make sure we didn't
> > > hang forever waiting. It appears that this problem still exists
> > > and we hit the timeout sometimes for Chromebooks in the field.
> > >
> > > Recently on ChromeOS we've started setting the DPM watchdog
> > > to trip if full system suspend takes over 10 seconds. Given
> > > the timeout in the original patch, obviously we're hitting
> > > the DPM watchdog before mwifiex gets a chance to timeout.
>
> The Kconfig default is 120 seconds, and it's hidden under
> CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds
> pretty small for triggering panics.) The smallest I can see outside of
> ChromeOS is 12 seconds, although 60 seconds is much more common. I
> also happen to see other WiFi drivers have hit similar problems, but
> they settled on 20 seconds (assuming a 60s timeout on other distros):
> https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@kernel.org/
> https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17
>
> Technically, this Kconfig lets you set a value as small as 1 second. I
> don't think we should work at reducing all timeouts to fit that
> target.
>
> I could maybe approve lowering this one, but I'd also recommend
> reconsidering your timeout value. And more questions below.

That's fair. I guess having a 10 second timeout for full system
suspend didn't seem totally crazy to me. If a system is taking more
than 10 seconds to do a full system suspend then that seems like
something is pretty broken. I guess it's somewhat like the same
argument that the WiFi driver had for picking 10 seconds but applied
to the whole system level, and I guess that's where we get into
trouble. That made me think that even 5 seconds seems a bit long for
any given driver to suspend. ...but yeah, it's squishy.

Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog
timer and that's a better solution and leave the WiFi driver how it
is?

Another thought: I wonder if it's possible to be dynamic and somehow
set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've
checked to see if the mwifiex can actually query the DPM watchdog
timeout... ;-)

...also, it sure seems like if we're going to choose a value so low
that we shouldn't panic. All of our other watchdogs that panic aren't
so short, so probably this one shouldn't be either. Maybe we could
submit a patch to make the DPM watchdog just abort the suspend if
that's not too hard (and if the power people accept it).


> > > While we could increase the DPM watchdog in ChromeOS to avoid
> > > this problem, it's probably better to simply decrease the
> > > timeout. Any time we're waiting several seconds for the
> > > firmware to respond it's likely that the firmware won't ever
> > > respond. With that in mind, decrease the timeout in mwifiex
> > > from 10 seconds to 5 seconds.
> > >
> > > Suggested-by: Doug Anderson <dianders@chromium.org>
> > > Signed-off-by: Pin-yen Lin <treapking@chromium.org>
> > > ---
> > >
> > >  drivers/net/wireless/marvell/mwifiex/sta_ioctl.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > I believe Brian Norris is still anointed as the personally nominally
> > in charge of mwifiex upstream (get_maintainer labels him as "odd"
> > fixer, which seems awfully judgemental), so he should be CCed on
> > fixes. Added him.
>
> I tend to see mwifiex patches even if I don't get CC'd, but thanks. I
> wonder why get_maintainer(?) picked up Francesco properly but not me.
> *shrug*
>
> > > diff --git a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > > index e06a0622973e..f79589cafe57 100644
> > > --- a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > > +++ b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
> > > @@ -545,7 +545,7 @@ int mwifiex_enable_hs(struct mwifiex_adapter *adapter)
> > >
> > >         if (wait_event_interruptible_timeout(adapter->hs_activate_wait_q,
> > >                                              adapter->hs_activate_wait_q_woken,
> > > -                                            (10 * HZ)) <= 0) {
> > > +                                            (5 * HZ)) <= 0) {
> >
> > Given that I suggested this fix, it should be no surprise:
> >
> > Reviewed-by: Douglas Anderson <dianders@chromium.org>
> >
> > Upon sleeping on the idea, the only slight concern I have here is
> > whether we'll trigger this timeout if we try to suspend while WiFi
> > scanning is in progress. I don't have any actual evidence supporting
> > this concern, but I remember many years ago when I used to deal with
> > the WiFi drivers more often there were cases where suspend could be
> > delayed if it happened while a scan was in progress. Maybe/hopefully
> > that's fixed now, but I figured I'd at least bring it up in case it
> > tickled anything in someone's mind...
>
> Scans should essentially have been canceled by now, but IIUC, the
> driver doesn't really force the card to stop if it's in the middle of
> a scan (I'm not 100% sure if it's possible). So it's possible that
> pending scans could delay this step.
>
> I wonder what the distribution of (successful) timing is today. I'd
> assume this typically take on the order of milliseconds, but do we
> commonly see outliers? What if a system is fully loaded while
> suspending?

I would hope this doesn't affect things from the DPM watchdog, but I
haven't researched. Hopefully the DPM watchdog starts after userspace
is frozen so the system being fully loaded shouldn't matter?

Things I could believe mattering are things like:

* If memory is full and something in the suspend patch allocates a big
chunk of memory then maybe (??) that could slow things down?

* If lots of USB devices are connected that could slow things down.

* If there are a large number of WiFi networks or APs or Bluetooth
devices I could believe that could cause suspend problems.


>  Can you try testing (and gather timing numbers) when
> suspending soon after initiating scans? It's hard to judge what the
> lower limit of this timeout should really be without any numbers, just
> like it's hard to judge whether your 10 second watchdog is reasonable.

Pin-yen: is this something you could gather?


> Also, for the record, since we might have to field regression reports
> for other systems: what hardware (chip variant, FW version) are you
> seeing problems on?

Pin-yen: I'm assuming you'll provide this.

-Doug
Brian Norris Dec. 4, 2024, 2:04 a.m. UTC | #2
Hi Doug,

On Tue, Dec 03, 2024 at 05:38:55PM -0800, Doug Anderson wrote:
> On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@chromium.org> wrote:
> > The Kconfig default is 120 seconds, and it's hidden under
> > CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds
> > pretty small for triggering panics.) The smallest I can see outside of
> > ChromeOS is 12 seconds, although 60 seconds is much more common. I
> > also happen to see other WiFi drivers have hit similar problems, but
> > they settled on 20 seconds (assuming a 60s timeout on other distros):
> > https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@kernel.org/
> > https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17
> >
> > Technically, this Kconfig lets you set a value as small as 1 second. I
> > don't think we should work at reducing all timeouts to fit that
> > target.
> >
> > I could maybe approve lowering this one, but I'd also recommend
> > reconsidering your timeout value. And more questions below.
> 
> That's fair. I guess having a 10 second timeout for full system
> suspend didn't seem totally crazy to me. If a system is taking more
> than 10 seconds to do a full system suspend then that seems like
> something is pretty broken. I guess it's somewhat like the same
> argument that the WiFi driver had for picking 10 seconds but applied
> to the whole system level, and I guess that's where we get into
> trouble. That made me think that even 5 seconds seems a bit long for
> any given driver to suspend. ...but yeah, it's squishy.

10 seconds is likely that *something* is wrong (or at least suboptimal),
but IMO, it's not quite at unreasonable levels. But yes, my point was
mainly that it's squishy, and I personally wouldn't want to be the one
running with the lowest CONFIG_DPM_WATCHDOG_TIMEOUT out there, given the
known behavior of multiple drivers and the timeout-means-panic behavior.

> Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog
> timer and that's a better solution and leave the WiFi driver how it
> is?

That seems reasonable.

To be clear, I'm OK with this patch, if we can get a little more
confidence in it (like the timing data and HW info). I *think* 5 vs 10
isn't a big deal here, but I don't have much other than my guess at the
moment.

> Another thought: I wonder if it's possible to be dynamic and somehow
> set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've

You probably meant min()?

> checked to see if the mwifiex can actually query the DPM watchdog
> timeout... ;-)

Yeah, I wondered similarly -- or in reverse, if we could somehow "pat"
the watchdog or prime it with an expected timeout. But AFAICT, neither
such feature exists today.

> ...also, it sure seems like if we're going to choose a value so low
> that we shouldn't panic. All of our other watchdogs that panic aren't
> so short, so probably this one shouldn't be either. Maybe we could
> submit a patch to make the DPM watchdog just abort the suspend if
> that's not too hard (and if the power people accept it).

Yeah, if you made the watchdog just interrupt suspend and dump some
warnings, then the effect would be pretty similar to this patch.

> > I wonder what the distribution of (successful) timing is today. I'd
> > assume this typically take on the order of milliseconds, but do we
> > commonly see outliers? What if a system is fully loaded while
> > suspending?
> 
> I would hope this doesn't affect things from the DPM watchdog, but I
> haven't researched. Hopefully the DPM watchdog starts after userspace
> is frozen so the system being fully loaded shouldn't matter?

I was just throwing out ideas, but I didn't specifically mean user
space. You provided a few more ideas. Anyway, I was just fishing for
*some* attempt at real-world (and, as-bad-as-you-can-simulate world)
numbers, since that's the point of a timeout like this.

> >  Can you try testing (and gather timing numbers) when
> > suspending soon after initiating scans? It's hard to judge what the
> > lower limit of this timeout should really be without any numbers, just
> > like it's hard to judge whether your 10 second watchdog is reasonable.
> 
> Pin-yen: is this something you could gather?
> 
> 
> > Also, for the record, since we might have to field regression reports
> > for other systems: what hardware (chip variant, FW version) are you
> > seeing problems on?
> 
> Pin-yen: I'm assuming you'll provide this.

I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
the above to provide a little more confidence, or if you want to
reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.

Brian
Pin-yen Lin Dec. 4, 2024, 1:45 p.m. UTC | #3
Hi Brian,

On Wed, Dec 4, 2024 at 10:04 AM Brian Norris <briannorris@chromium.org> wrote:
>
> Hi Doug,
>
> On Tue, Dec 03, 2024 at 05:38:55PM -0800, Doug Anderson wrote:
> > On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@chromium.org> wrote:
> > > The Kconfig default is 120 seconds, and it's hidden under
> > > CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds
> > > pretty small for triggering panics.) The smallest I can see outside of
> > > ChromeOS is 12 seconds, although 60 seconds is much more common. I
> > > also happen to see other WiFi drivers have hit similar problems, but
> > > they settled on 20 seconds (assuming a 60s timeout on other distros):
> > > https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@kernel.org/
> > > https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17
> > >
> > > Technically, this Kconfig lets you set a value as small as 1 second. I
> > > don't think we should work at reducing all timeouts to fit that
> > > target.
> > >
> > > I could maybe approve lowering this one, but I'd also recommend
> > > reconsidering your timeout value. And more questions below.
> >
> > That's fair. I guess having a 10 second timeout for full system
> > suspend didn't seem totally crazy to me. If a system is taking more
> > than 10 seconds to do a full system suspend then that seems like
> > something is pretty broken. I guess it's somewhat like the same
> > argument that the WiFi driver had for picking 10 seconds but applied
> > to the whole system level, and I guess that's where we get into
> > trouble. That made me think that even 5 seconds seems a bit long for
> > any given driver to suspend. ...but yeah, it's squishy.
>
> 10 seconds is likely that *something* is wrong (or at least suboptimal),
> but IMO, it's not quite at unreasonable levels. But yes, my point was
> mainly that it's squishy, and I personally wouldn't want to be the one
> running with the lowest CONFIG_DPM_WATCHDOG_TIMEOUT out there, given the
> known behavior of multiple drivers and the timeout-means-panic behavior.
>
> > Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog
> > timer and that's a better solution and leave the WiFi driver how it
> > is?
>
> That seems reasonable.
>
> To be clear, I'm OK with this patch, if we can get a little more
> confidence in it (like the timing data and HW info). I *think* 5 vs 10
> isn't a big deal here, but I don't have much other than my guess at the
> moment.
>
> > Another thought: I wonder if it's possible to be dynamic and somehow
> > set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've
>
> You probably meant min()?
>
> > checked to see if the mwifiex can actually query the DPM watchdog
> > timeout... ;-)
>
> Yeah, I wondered similarly -- or in reverse, if we could somehow "pat"
> the watchdog or prime it with an expected timeout. But AFAICT, neither
> such feature exists today.
>
> > ...also, it sure seems like if we're going to choose a value so low
> > that we shouldn't panic. All of our other watchdogs that panic aren't
> > so short, so probably this one shouldn't be either. Maybe we could
> > submit a patch to make the DPM watchdog just abort the suspend if
> > that's not too hard (and if the power people accept it).
>
> Yeah, if you made the watchdog just interrupt suspend and dump some
> warnings, then the effect would be pretty similar to this patch.
>
> > > I wonder what the distribution of (successful) timing is today. I'd
> > > assume this typically take on the order of milliseconds, but do we
> > > commonly see outliers? What if a system is fully loaded while
> > > suspending?
> >
> > I would hope this doesn't affect things from the DPM watchdog, but I
> > haven't researched. Hopefully the DPM watchdog starts after userspace
> > is frozen so the system being fully loaded shouldn't matter?
>
> I was just throwing out ideas, but I didn't specifically mean user
> space. You provided a few more ideas. Anyway, I was just fishing for
> *some* attempt at real-world (and, as-bad-as-you-can-simulate world)
> numbers, since that's the point of a timeout like this.
>
> > >  Can you try testing (and gather timing numbers) when
> > > suspending soon after initiating scans? It's hard to judge what the
> > > lower limit of this timeout should really be without any numbers, just
> > > like it's hard to judge whether your 10 second watchdog is reasonable.
> >
> > Pin-yen: is this something you could gather?

I tried entering suspend right after wifi scans, and the time spent in
mwifiex_enable_hs() is always around 100ms. It seems initiating
suspend does not increase the execution time for mwifiex_enable_hs(),
so I think the driver is capable of interrupting a scan.
> >
> >
> > > Also, for the record, since we might have to field regression reports
> > > for other systems: what hardware (chip variant, FW version) are you
> > > seeing problems on?
> >
> > Pin-yen: I'm assuming you'll provide this.
Doug Anderson Dec. 4, 2024, 5:56 p.m. UTC | #4
Hi,

On Tue, Dec 3, 2024 at 6:04 PM Brian Norris <briannorris@chromium.org> wrote:
>
> 10 seconds is likely that *something* is wrong (or at least suboptimal),
> but IMO, it's not quite at unreasonable levels. But yes, my point was
> mainly that it's squishy, and I personally wouldn't want to be the one
> running with the lowest CONFIG_DPM_WATCHDOG_TIMEOUT out there, given the
> known behavior of multiple drivers and the timeout-means-panic behavior.
>
> > Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog
> > timer and that's a better solution and leave the WiFi driver how it
> > is?
>
> That seems reasonable.

FWIW, I created a public feature request for this:

https://issuetracker.google.com/382269699

...we'll see if we can get anyone to bite on it. ...and then see if
upstream folks like the idea too.


> To be clear, I'm OK with this patch, if we can get a little more
> confidence in it (like the timing data and HW info). I *think* 5 vs 10
> isn't a big deal here, but I don't have much other than my guess at the
> moment.
>
> > Another thought: I wonder if it's possible to be dynamic and somehow
> > set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've
>
> You probably meant min()?

Yeah, I always screw that up. Sigh.


> > >  Can you try testing (and gather timing numbers) when
> > > suspending soon after initiating scans? It's hard to judge what the
> > > lower limit of this timeout should really be without any numbers, just
> > > like it's hard to judge whether your 10 second watchdog is reasonable.
> >
> > Pin-yen: is this something you could gather?
> >
> >
> > > Also, for the record, since we might have to field regression reports
> > > for other systems: what hardware (chip variant, FW version) are you
> > > seeing problems on?
> >
> > Pin-yen: I'm assuming you'll provide this.
>
> I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
> the above to provide a little more confidence, or if you want to
> reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.

Possibly the answer could be both?
Doug Anderson Dec. 4, 2024, 6:11 p.m. UTC | #5
Hi,

On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote:
>
> > > >  Can you try testing (and gather timing numbers) when
> > > > suspending soon after initiating scans? It's hard to judge what the
> > > > lower limit of this timeout should really be without any numbers, just
> > > > like it's hard to judge whether your 10 second watchdog is reasonable.
> > >
> > > Pin-yen: is this something you could gather?
>
> I tried entering suspend right after wifi scans, and the time spent in
> mwifiex_enable_hs() is always around 100ms. It seems initiating
> suspend does not increase the execution time for mwifiex_enable_hs(),
> so I think the driver is capable of interrupting a scan.
> > >
> > >
> > > > Also, for the record, since we might have to field regression reports
> > > > for other systems: what hardware (chip variant, FW version) are you
> > > > seeing problems on?
> > >
> > > Pin-yen: I'm assuming you'll provide this.
>
> From the debugfs entry:
>
> driver_name = "mwifiex"
> driver_version = mwifiex 1.0 (15.68.19.p54)
> verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54
>
> The compatible string of the DT is "marvell,sd8897".
> >
> > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
> > the above to provide a little more confidence, or if you want to
> > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.
>
> It looks okay to me to decrease the timeout here given that scanning
> doesn't seem to affect the suspend time. What's your thought, Doug?

I think Brian is right and that we should change how we're using the
DPM watchdog, but IMO that doesn't mean that we couldn't also change
this timeout.

If nothing else, you'd want to post a v2 of your patch containing a
summary of the info you've gathered so it's recorded with the patch.

Maybe what makes the most sense, though, is to put our money where our
mouth is and land some sort of patch in the ChromeOS tree and then
report back to upstream in a month when we have data about it. If
things look good in the field then presumably that would give some
confidence for upstream to be willing to land the patch?

Probably about the best data we could gather would be to break the
existing timeout into two halves. You could wait half the time, then
print a warning, and then wait the other half the time. We could even
land that change _without_ changing the timeout to 5 seconds. Then if
we ever see the warning print but then the suspend succeeds we know
that there are cases where waiting longer would have helped. If we
made that a WARN_ON() our existing infrastructure would help us gather
that info...

...so I guess summarizing my rambling, a good plan would be:

1. Change the ChromeOS DPM watchdog timeout to 15 seconds to avoid the
problem for now.

2. Land a patch to do a WARN_ON() in mwifiex if we take 5 seconds.
Actually, you could (probably) send this upstream and land it
FROMLIST?

3. Wait ~1-2 months and see if we ever see the WARN_ON() trigger but
then things succeed after 10 seconds. If this never happens then send
a v2 patch changing the timeout to 5 seconds with details in the
commit message.

4. In the background, see if we can convince folks to make the DPM
Watchdog less panicky.


-Doug
Brian Norris Dec. 4, 2024, 7:26 p.m. UTC | #6
On Wed, Dec 04, 2024 at 09:45:11PM +0800, Pin-yen Lin wrote:
> On Wed, Dec 4, 2024 at 10:04 AM Brian Norris <briannorris@chromium.org> wrote:
> >
> > > >  Can you try testing (and gather timing numbers) when
> > > > suspending soon after initiating scans? It's hard to judge what the
> > > > lower limit of this timeout should really be without any numbers, just
> > > > like it's hard to judge whether your 10 second watchdog is reasonable.
> > >
> > > Pin-yen: is this something you could gather?
> 
> I tried entering suspend right after wifi scans, and the time spent in
> mwifiex_enable_hs() is always around 100ms. It seems initiating
> suspend does not increase the execution time for mwifiex_enable_hs(),
> so I think the driver is capable of interrupting a scan.

Thanks! At some level, there are things we can only verify by
experimentation, since we don't have firmware source code. This seems
fine to me then.

> > > > Also, for the record, since we might have to field regression reports
> > > > for other systems: what hardware (chip variant, FW version) are you
> > > > seeing problems on?
> > >
> > > Pin-yen: I'm assuming you'll provide this.
> 
> From the debugfs entry:
> 
> driver_name = "mwifiex"
> driver_version = mwifiex 1.0 (15.68.19.p54)
> verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54
> 
> The compatible string of the DT is "marvell,sd8897".

Thanks.

I think it'd be good to see this info in the commit message, but
otherwise you can carry my:

Acked-by: Brian Norris <briannorris@chromium.org>

It'd be extra nice to see that you successfully use this patch in
your own releases, but I don't think that's a requirement for upstream.
And anyway, the upstream RC cycle is pretty long.

Brian
Pin-yen Lin Dec. 5, 2024, 1:46 p.m. UTC | #7
Hi Doug,

On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote:
> >
> > > > >  Can you try testing (and gather timing numbers) when
> > > > > suspending soon after initiating scans? It's hard to judge what the
> > > > > lower limit of this timeout should really be without any numbers, just
> > > > > like it's hard to judge whether your 10 second watchdog is reasonable.
> > > >
> > > > Pin-yen: is this something you could gather?
> >
> > I tried entering suspend right after wifi scans, and the time spent in
> > mwifiex_enable_hs() is always around 100ms. It seems initiating
> > suspend does not increase the execution time for mwifiex_enable_hs(),
> > so I think the driver is capable of interrupting a scan.
> > > >
> > > >
> > > > > Also, for the record, since we might have to field regression reports
> > > > > for other systems: what hardware (chip variant, FW version) are you
> > > > > seeing problems on?
> > > >
> > > > Pin-yen: I'm assuming you'll provide this.
> >
> > From the debugfs entry:
> >
> > driver_name = "mwifiex"
> > driver_version = mwifiex 1.0 (15.68.19.p54)
> > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54
> >
> > The compatible string of the DT is "marvell,sd8897".
> > >
> > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
> > > the above to provide a little more confidence, or if you want to
> > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.
> >
> > It looks okay to me to decrease the timeout here given that scanning
> > doesn't seem to affect the suspend time. What's your thought, Doug?
>
> I think Brian is right and that we should change how we're using the
> DPM watchdog, but IMO that doesn't mean that we couldn't also change
> this timeout.
>
> If nothing else, you'd want to post a v2 of your patch containing a
> summary of the info you've gathered so it's recorded with the patch.
>
> Maybe what makes the most sense, though, is to put our money where our
> mouth is and land some sort of patch in the ChromeOS tree and then
> report back to upstream in a month when we have data about it. If
> things look good in the field then presumably that would give some
> confidence for upstream to be willing to land the patch?
>
> Probably about the best data we could gather would be to break the
> existing timeout into two halves. You could wait half the time, then
> print a warning, and then wait the other half the time. We could even
> land that change _without_ changing the timeout to 5 seconds. Then if
> we ever see the warning print but then the suspend succeeds we know
> that there are cases where waiting longer would have helped. If we
> made that a WARN_ON() our existing infrastructure would help us gather
> that info...

I just realized that mwifiex_wait_queue_complete() has another 12s
timeout[1], though they are not directly involved in suspend/resume.

Should we also add a warning to that and see if we can make it half?
This starts to make me think how many timeouts we want to consider.
Does it make sense to only focus on the one in mwifiex_enable_hs()? Or
should we check other timeouts in mwifiex or even other drivers?

[1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51

>
> ...so I guess summarizing my rambling, a good plan would be:
>
> 1. Change the ChromeOS DPM watchdog timeout to 15 seconds to avoid the
> problem for now.
>
> 2. Land a patch to do a WARN_ON() in mwifiex if we take 5 seconds.
> Actually, you could (probably) send this upstream and land it
> FROMLIST?
>
> 3. Wait ~1-2 months and see if we ever see the WARN_ON() trigger but
> then things succeed after 10 seconds. If this never happens then send
> a v2 patch changing the timeout to 5 seconds with details in the
> commit message.
>
> 4. In the background, see if we can convince folks to make the DPM
> Watchdog less panicky.
>
>
> -Doug

Regards,
Pin-yen
Doug Anderson Dec. 5, 2024, 5:13 p.m. UTC | #8
Hi,

On Thu, Dec 5, 2024 at 5:46 AM Pin-yen Lin <treapking@chromium.org> wrote:
>
> Hi Doug,
>
> On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote:
> > >
> > > > > >  Can you try testing (and gather timing numbers) when
> > > > > > suspending soon after initiating scans? It's hard to judge what the
> > > > > > lower limit of this timeout should really be without any numbers, just
> > > > > > like it's hard to judge whether your 10 second watchdog is reasonable.
> > > > >
> > > > > Pin-yen: is this something you could gather?
> > >
> > > I tried entering suspend right after wifi scans, and the time spent in
> > > mwifiex_enable_hs() is always around 100ms. It seems initiating
> > > suspend does not increase the execution time for mwifiex_enable_hs(),
> > > so I think the driver is capable of interrupting a scan.
> > > > >
> > > > >
> > > > > > Also, for the record, since we might have to field regression reports
> > > > > > for other systems: what hardware (chip variant, FW version) are you
> > > > > > seeing problems on?
> > > > >
> > > > > Pin-yen: I'm assuming you'll provide this.
> > >
> > > From the debugfs entry:
> > >
> > > driver_name = "mwifiex"
> > > driver_version = mwifiex 1.0 (15.68.19.p54)
> > > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54
> > >
> > > The compatible string of the DT is "marvell,sd8897".
> > > >
> > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
> > > > the above to provide a little more confidence, or if you want to
> > > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.
> > >
> > > It looks okay to me to decrease the timeout here given that scanning
> > > doesn't seem to affect the suspend time. What's your thought, Doug?
> >
> > I think Brian is right and that we should change how we're using the
> > DPM watchdog, but IMO that doesn't mean that we couldn't also change
> > this timeout.
> >
> > If nothing else, you'd want to post a v2 of your patch containing a
> > summary of the info you've gathered so it's recorded with the patch.
> >
> > Maybe what makes the most sense, though, is to put our money where our
> > mouth is and land some sort of patch in the ChromeOS tree and then
> > report back to upstream in a month when we have data about it. If
> > things look good in the field then presumably that would give some
> > confidence for upstream to be willing to land the patch?
> >
> > Probably about the best data we could gather would be to break the
> > existing timeout into two halves. You could wait half the time, then
> > print a warning, and then wait the other half the time. We could even
> > land that change _without_ changing the timeout to 5 seconds. Then if
> > we ever see the warning print but then the suspend succeeds we know
> > that there are cases where waiting longer would have helped. If we
> > made that a WARN_ON() our existing infrastructure would help us gather
> > that info...
>
> I just realized that mwifiex_wait_queue_complete() has another 12s
> timeout[1], though they are not directly involved in suspend/resume.
>
> Should we also add a warning to that and see if we can make it half?
> This starts to make me think how many timeouts we want to consider.
> Does it make sense to only focus on the one in mwifiex_enable_hs()? Or
> should we check other timeouts in mwifiex or even other drivers?
>
> [1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51

IMO any of these "arbitrary and really long timeout to make sure we
don't hang forever" type things probably should have a warning so we
know if we're getting close to the timeout. It wouldn't be very hard
to make a wrapper around wait_event_interruptible_timeout() to handle
this. I suppose that wrapper could just be local to mwifiex, though if
other drivers found it useful it would make sense to put it somewhere
more common.

That being said, if we aren't actually hitting these other timeouts I
don't know that we need to do an extensive audit right now to find
every one of them.

Of course, Brian also said he'd be OK with just setting the timeout to
5 seconds based on the research you've already done. In that case I
don't know if we'd want a WARNing after 2.5 seconds. Maybe? 2.5
seconds is still pretty long, but I don't know if it's WARN-worthy. It
could at least be dev_warn() worthy...



-Doug
Pin-yen Lin Dec. 6, 2024, 11:21 a.m. UTC | #9
Hi Doug,

On Fri, Dec 6, 2024 at 1:13 AM Doug Anderson <dianders@chromium.org> wrote:
>
> Hi,
>
> On Thu, Dec 5, 2024 at 5:46 AM Pin-yen Lin <treapking@chromium.org> wrote:
> >
> > Hi Doug,
> >
> > On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@chromium.org> wrote:
> > >
> > > Hi,
> > >
> > > On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote:
> > > >
> > > > > > >  Can you try testing (and gather timing numbers) when
> > > > > > > suspending soon after initiating scans? It's hard to judge what the
> > > > > > > lower limit of this timeout should really be without any numbers, just
> > > > > > > like it's hard to judge whether your 10 second watchdog is reasonable.
> > > > > >
> > > > > > Pin-yen: is this something you could gather?
> > > >
> > > > I tried entering suspend right after wifi scans, and the time spent in
> > > > mwifiex_enable_hs() is always around 100ms. It seems initiating
> > > > suspend does not increase the execution time for mwifiex_enable_hs(),
> > > > so I think the driver is capable of interrupting a scan.
> > > > > >
> > > > > >
> > > > > > > Also, for the record, since we might have to field regression reports
> > > > > > > for other systems: what hardware (chip variant, FW version) are you
> > > > > > > seeing problems on?
> > > > > >
> > > > > > Pin-yen: I'm assuming you'll provide this.
> > > >
> > > > From the debugfs entry:
> > > >
> > > > driver_name = "mwifiex"
> > > > driver_version = mwifiex 1.0 (15.68.19.p54)
> > > > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54
> > > >
> > > > The compatible string of the DT is "marvell,sd8897".
> > > > >
> > > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
> > > > > the above to provide a little more confidence, or if you want to
> > > > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.
> > > >
> > > > It looks okay to me to decrease the timeout here given that scanning
> > > > doesn't seem to affect the suspend time. What's your thought, Doug?
> > >
> > > I think Brian is right and that we should change how we're using the
> > > DPM watchdog, but IMO that doesn't mean that we couldn't also change
> > > this timeout.
> > >
> > > If nothing else, you'd want to post a v2 of your patch containing a
> > > summary of the info you've gathered so it's recorded with the patch.
> > >
> > > Maybe what makes the most sense, though, is to put our money where our
> > > mouth is and land some sort of patch in the ChromeOS tree and then
> > > report back to upstream in a month when we have data about it. If
> > > things look good in the field then presumably that would give some
> > > confidence for upstream to be willing to land the patch?
> > >
> > > Probably about the best data we could gather would be to break the
> > > existing timeout into two halves. You could wait half the time, then
> > > print a warning, and then wait the other half the time. We could even
> > > land that change _without_ changing the timeout to 5 seconds. Then if
> > > we ever see the warning print but then the suspend succeeds we know
> > > that there are cases where waiting longer would have helped. If we
> > > made that a WARN_ON() our existing infrastructure would help us gather
> > > that info...
> >
> > I just realized that mwifiex_wait_queue_complete() has another 12s
> > timeout[1], though they are not directly involved in suspend/resume.
> >
> > Should we also add a warning to that and see if we can make it half?
> > This starts to make me think how many timeouts we want to consider.
> > Does it make sense to only focus on the one in mwifiex_enable_hs()? Or
> > should we check other timeouts in mwifiex or even other drivers?
> >
> > [1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51
>
> IMO any of these "arbitrary and really long timeout to make sure we
> don't hang forever" type things probably should have a warning so we
> know if we're getting close to the timeout. It wouldn't be very hard
> to make a wrapper around wait_event_interruptible_timeout() to handle
> this. I suppose that wrapper could just be local to mwifiex, though if
> other drivers found it useful it would make sense to put it somewhere
> more common.
>
> That being said, if we aren't actually hitting these other timeouts I
> don't know that we need to do an extensive audit right now to find
> every one of them.
>
> Of course, Brian also said he'd be OK with just setting the timeout to
> 5 seconds based on the research you've already done. In that case I
> don't know if we'd want a WARNing after 2.5 seconds. Maybe? 2.5
> seconds is still pretty long, but I don't know if it's WARN-worthy. It
> could at least be dev_warn() worthy...
>

Sounds like we will start to binary search the timeout...

Then, I would like to just send out a v2 for this with an updated
commit message. It sounds like it will be endless binary searches on
all those timeouts in the kernel if we really want to create a wrapper
around wait_event_interruptible_timeout(), and all the efforts only
make us a few more seconds faster on the system.
>
>
> -Doug

Regards,
Pin-yen
Doug Anderson Dec. 6, 2024, 4:09 p.m. UTC | #10
Hi,

On Fri, Dec 6, 2024 at 3:21 AM Pin-yen Lin <treapking@chromium.org> wrote:
>
> Hi Doug,
>
> On Fri, Dec 6, 2024 at 1:13 AM Doug Anderson <dianders@chromium.org> wrote:
> >
> > Hi,
> >
> > On Thu, Dec 5, 2024 at 5:46 AM Pin-yen Lin <treapking@chromium.org> wrote:
> > >
> > > Hi Doug,
> > >
> > > On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@chromium.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote:
> > > > >
> > > > > > > >  Can you try testing (and gather timing numbers) when
> > > > > > > > suspending soon after initiating scans? It's hard to judge what the
> > > > > > > > lower limit of this timeout should really be without any numbers, just
> > > > > > > > like it's hard to judge whether your 10 second watchdog is reasonable.
> > > > > > >
> > > > > > > Pin-yen: is this something you could gather?
> > > > >
> > > > > I tried entering suspend right after wifi scans, and the time spent in
> > > > > mwifiex_enable_hs() is always around 100ms. It seems initiating
> > > > > suspend does not increase the execution time for mwifiex_enable_hs(),
> > > > > so I think the driver is capable of interrupting a scan.
> > > > > > >
> > > > > > >
> > > > > > > > Also, for the record, since we might have to field regression reports
> > > > > > > > for other systems: what hardware (chip variant, FW version) are you
> > > > > > > > seeing problems on?
> > > > > > >
> > > > > > > Pin-yen: I'm assuming you'll provide this.
> > > > >
> > > > > From the debugfs entry:
> > > > >
> > > > > driver_name = "mwifiex"
> > > > > driver_version = mwifiex 1.0 (15.68.19.p54)
> > > > > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54
> > > > >
> > > > > The compatible string of the DT is "marvell,sd8897".
> > > > > >
> > > > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide
> > > > > > the above to provide a little more confidence, or if you want to
> > > > > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT.
> > > > >
> > > > > It looks okay to me to decrease the timeout here given that scanning
> > > > > doesn't seem to affect the suspend time. What's your thought, Doug?
> > > >
> > > > I think Brian is right and that we should change how we're using the
> > > > DPM watchdog, but IMO that doesn't mean that we couldn't also change
> > > > this timeout.
> > > >
> > > > If nothing else, you'd want to post a v2 of your patch containing a
> > > > summary of the info you've gathered so it's recorded with the patch.
> > > >
> > > > Maybe what makes the most sense, though, is to put our money where our
> > > > mouth is and land some sort of patch in the ChromeOS tree and then
> > > > report back to upstream in a month when we have data about it. If
> > > > things look good in the field then presumably that would give some
> > > > confidence for upstream to be willing to land the patch?
> > > >
> > > > Probably about the best data we could gather would be to break the
> > > > existing timeout into two halves. You could wait half the time, then
> > > > print a warning, and then wait the other half the time. We could even
> > > > land that change _without_ changing the timeout to 5 seconds. Then if
> > > > we ever see the warning print but then the suspend succeeds we know
> > > > that there are cases where waiting longer would have helped. If we
> > > > made that a WARN_ON() our existing infrastructure would help us gather
> > > > that info...
> > >
> > > I just realized that mwifiex_wait_queue_complete() has another 12s
> > > timeout[1], though they are not directly involved in suspend/resume.
> > >
> > > Should we also add a warning to that and see if we can make it half?
> > > This starts to make me think how many timeouts we want to consider.
> > > Does it make sense to only focus on the one in mwifiex_enable_hs()? Or
> > > should we check other timeouts in mwifiex or even other drivers?
> > >
> > > [1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51
> >
> > IMO any of these "arbitrary and really long timeout to make sure we
> > don't hang forever" type things probably should have a warning so we
> > know if we're getting close to the timeout. It wouldn't be very hard
> > to make a wrapper around wait_event_interruptible_timeout() to handle
> > this. I suppose that wrapper could just be local to mwifiex, though if
> > other drivers found it useful it would make sense to put it somewhere
> > more common.
> >
> > That being said, if we aren't actually hitting these other timeouts I
> > don't know that we need to do an extensive audit right now to find
> > every one of them.
> >
> > Of course, Brian also said he'd be OK with just setting the timeout to
> > 5 seconds based on the research you've already done. In that case I
> > don't know if we'd want a WARNing after 2.5 seconds. Maybe? 2.5
> > seconds is still pretty long, but I don't know if it's WARN-worthy. It
> > could at least be dev_warn() worthy...
> >
>
> Sounds like we will start to binary search the timeout...

I wouldn't bother binary searching the timeout and in general I
wouldn't try to tune these timeouts unless we think there's a real
benefit to it. Mostly, though, I like the idea of having a warning
_before_ a failure occurs. Like if I have a retry loop that will retry
10 times, I like to warn after I need 5 retries so that it's obvious
we're getting close. If I expect a timeout to never be needed, I'd
also like some warning that says we got close to the maximum timeout.
All of this makes it much easier to debug problems if/when we actually
hit the failure case. It also has the benefit that if we need to tune
a timeout / number of retries then we have more confidence that we
were nowhere near the max, but IMO that wouldn't be the primary
benefit.


> Then, I would like to just send out a v2 for this with an updated
> commit message.

Sure, though maybe it's not urgent? We've changed the DPM watchdog in
ChromeOS to 15 seconds so at least we won't be hitting it there. While
I agree that the 10 second timeout is still likely overkill and that 5
should be fine, gathering more data wouldn't hurt.

I'd be really curious how the system behaves after the timeout hits,
too. Does the next suspend actually succeed? Does WiFi recover or is
it busted until the next reboot?


> It sounds like it will be endless binary searches on
> all those timeouts in the kernel if we really want to create a wrapper
> around wait_event_interruptible_timeout(), and all the efforts only
> make us a few more seconds faster on the system.

My goal in general isn't to optimize the system but to know when we're
close to failure so it's easier to debug when the timeout occurs.
Kalle Valo Dec. 9, 2024, 3:59 p.m. UTC | #11
Pin-yen Lin <treapking@chromium.org> wrote:

> In commit 52250cbee7f6 ("mwifiex: use timeout variant for
> wait_event_interruptible") it was noted that sometimes we seemed
> to miss the signal that our host sleep settings took effect. A
> 10 second timeout was added to the code to make sure we didn't
> hang forever waiting. It appears that this problem still exists
> and we hit the timeout sometimes for Chromebooks in the field.
> 
> Recently on ChromeOS we've started setting the DPM watchdog
> to trip if full system suspend takes over 10 seconds. Given
> the timeout in the original patch, obviously we're hitting
> the DPM watchdog before mwifiex gets a chance to timeout.
> 
> While we could increase the DPM watchdog in ChromeOS to avoid
> this problem, it's probably better to simply decrease the
> timeout. Any time we're waiting several seconds for the
> firmware to respond it's likely that the firmware won't ever
> respond. With that in mind, decrease the timeout in mwifiex
> from 10 seconds to 5 seconds.
> 
> Suggested-by: Doug Anderson <dianders@chromium.org>
> Signed-off-by: Pin-yen Lin <treapking@chromium.org>
> Reviewed-by: Douglas Anderson <dianders@chromium.org>
> Acked-by: Brian Norris <briannorris@chromium.org>

Patch applied to wireless-next.git, thanks.

f143cece43dd wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s
diff mbox series

Patch

diff --git a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
index e06a0622973e..f79589cafe57 100644
--- a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
+++ b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c
@@ -545,7 +545,7 @@  int mwifiex_enable_hs(struct mwifiex_adapter *adapter)
 
 	if (wait_event_interruptible_timeout(adapter->hs_activate_wait_q,
 					     adapter->hs_activate_wait_q_woken,
-					     (10 * HZ)) <= 0) {
+					     (5 * HZ)) <= 0) {
 		mwifiex_dbg(adapter, ERROR,
 			    "hs_activate_wait_q terminated\n");
 		return false;