Message ID | 20241127105709.4014302-1-treapking@chromium.org |
---|---|
State | New |
Headers | show |
Series | wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s | expand |
Hi, On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@chromium.org> wrote: > > Hi, > > On Wed, Nov 27, 2024 at 7:44 AM Doug Anderson <dianders@chromium.org> wrote: > > On Wed, Nov 27, 2024 at 2:58 AM Pin-yen Lin <treapking@chromium.org> wrote: > > > In commit 52250cbee7f6 ("mwifiex: use timeout variant for > > > wait_event_interruptible") it was noted that sometimes we seemed > > > to miss the signal that our host sleep settings took effect. A > > > 10 second timeout was added to the code to make sure we didn't > > > hang forever waiting. It appears that this problem still exists > > > and we hit the timeout sometimes for Chromebooks in the field. > > > > > > Recently on ChromeOS we've started setting the DPM watchdog > > > to trip if full system suspend takes over 10 seconds. Given > > > the timeout in the original patch, obviously we're hitting > > > the DPM watchdog before mwifiex gets a chance to timeout. > > The Kconfig default is 120 seconds, and it's hidden under > CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds > pretty small for triggering panics.) The smallest I can see outside of > ChromeOS is 12 seconds, although 60 seconds is much more common. I > also happen to see other WiFi drivers have hit similar problems, but > they settled on 20 seconds (assuming a 60s timeout on other distros): > https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@kernel.org/ > https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17 > > Technically, this Kconfig lets you set a value as small as 1 second. I > don't think we should work at reducing all timeouts to fit that > target. > > I could maybe approve lowering this one, but I'd also recommend > reconsidering your timeout value. And more questions below. That's fair. I guess having a 10 second timeout for full system suspend didn't seem totally crazy to me. If a system is taking more than 10 seconds to do a full system suspend then that seems like something is pretty broken. I guess it's somewhat like the same argument that the WiFi driver had for picking 10 seconds but applied to the whole system level, and I guess that's where we get into trouble. That made me think that even 5 seconds seems a bit long for any given driver to suspend. ...but yeah, it's squishy. Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog timer and that's a better solution and leave the WiFi driver how it is? Another thought: I wonder if it's possible to be dynamic and somehow set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've checked to see if the mwifiex can actually query the DPM watchdog timeout... ;-) ...also, it sure seems like if we're going to choose a value so low that we shouldn't panic. All of our other watchdogs that panic aren't so short, so probably this one shouldn't be either. Maybe we could submit a patch to make the DPM watchdog just abort the suspend if that's not too hard (and if the power people accept it). > > > While we could increase the DPM watchdog in ChromeOS to avoid > > > this problem, it's probably better to simply decrease the > > > timeout. Any time we're waiting several seconds for the > > > firmware to respond it's likely that the firmware won't ever > > > respond. With that in mind, decrease the timeout in mwifiex > > > from 10 seconds to 5 seconds. > > > > > > Suggested-by: Doug Anderson <dianders@chromium.org> > > > Signed-off-by: Pin-yen Lin <treapking@chromium.org> > > > --- > > > > > > drivers/net/wireless/marvell/mwifiex/sta_ioctl.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > I believe Brian Norris is still anointed as the personally nominally > > in charge of mwifiex upstream (get_maintainer labels him as "odd" > > fixer, which seems awfully judgemental), so he should be CCed on > > fixes. Added him. > > I tend to see mwifiex patches even if I don't get CC'd, but thanks. I > wonder why get_maintainer(?) picked up Francesco properly but not me. > *shrug* > > > > diff --git a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c > > > index e06a0622973e..f79589cafe57 100644 > > > --- a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c > > > +++ b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c > > > @@ -545,7 +545,7 @@ int mwifiex_enable_hs(struct mwifiex_adapter *adapter) > > > > > > if (wait_event_interruptible_timeout(adapter->hs_activate_wait_q, > > > adapter->hs_activate_wait_q_woken, > > > - (10 * HZ)) <= 0) { > > > + (5 * HZ)) <= 0) { > > > > Given that I suggested this fix, it should be no surprise: > > > > Reviewed-by: Douglas Anderson <dianders@chromium.org> > > > > Upon sleeping on the idea, the only slight concern I have here is > > whether we'll trigger this timeout if we try to suspend while WiFi > > scanning is in progress. I don't have any actual evidence supporting > > this concern, but I remember many years ago when I used to deal with > > the WiFi drivers more often there were cases where suspend could be > > delayed if it happened while a scan was in progress. Maybe/hopefully > > that's fixed now, but I figured I'd at least bring it up in case it > > tickled anything in someone's mind... > > Scans should essentially have been canceled by now, but IIUC, the > driver doesn't really force the card to stop if it's in the middle of > a scan (I'm not 100% sure if it's possible). So it's possible that > pending scans could delay this step. > > I wonder what the distribution of (successful) timing is today. I'd > assume this typically take on the order of milliseconds, but do we > commonly see outliers? What if a system is fully loaded while > suspending? I would hope this doesn't affect things from the DPM watchdog, but I haven't researched. Hopefully the DPM watchdog starts after userspace is frozen so the system being fully loaded shouldn't matter? Things I could believe mattering are things like: * If memory is full and something in the suspend patch allocates a big chunk of memory then maybe (??) that could slow things down? * If lots of USB devices are connected that could slow things down. * If there are a large number of WiFi networks or APs or Bluetooth devices I could believe that could cause suspend problems. > Can you try testing (and gather timing numbers) when > suspending soon after initiating scans? It's hard to judge what the > lower limit of this timeout should really be without any numbers, just > like it's hard to judge whether your 10 second watchdog is reasonable. Pin-yen: is this something you could gather? > Also, for the record, since we might have to field regression reports > for other systems: what hardware (chip variant, FW version) are you > seeing problems on? Pin-yen: I'm assuming you'll provide this. -Doug
Hi Doug, On Tue, Dec 03, 2024 at 05:38:55PM -0800, Doug Anderson wrote: > On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@chromium.org> wrote: > > The Kconfig default is 120 seconds, and it's hidden under > > CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds > > pretty small for triggering panics.) The smallest I can see outside of > > ChromeOS is 12 seconds, although 60 seconds is much more common. I > > also happen to see other WiFi drivers have hit similar problems, but > > they settled on 20 seconds (assuming a 60s timeout on other distros): > > https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@kernel.org/ > > https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17 > > > > Technically, this Kconfig lets you set a value as small as 1 second. I > > don't think we should work at reducing all timeouts to fit that > > target. > > > > I could maybe approve lowering this one, but I'd also recommend > > reconsidering your timeout value. And more questions below. > > That's fair. I guess having a 10 second timeout for full system > suspend didn't seem totally crazy to me. If a system is taking more > than 10 seconds to do a full system suspend then that seems like > something is pretty broken. I guess it's somewhat like the same > argument that the WiFi driver had for picking 10 seconds but applied > to the whole system level, and I guess that's where we get into > trouble. That made me think that even 5 seconds seems a bit long for > any given driver to suspend. ...but yeah, it's squishy. 10 seconds is likely that *something* is wrong (or at least suboptimal), but IMO, it's not quite at unreasonable levels. But yes, my point was mainly that it's squishy, and I personally wouldn't want to be the one running with the lowest CONFIG_DPM_WATCHDOG_TIMEOUT out there, given the known behavior of multiple drivers and the timeout-means-panic behavior. > Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog > timer and that's a better solution and leave the WiFi driver how it > is? That seems reasonable. To be clear, I'm OK with this patch, if we can get a little more confidence in it (like the timing data and HW info). I *think* 5 vs 10 isn't a big deal here, but I don't have much other than my guess at the moment. > Another thought: I wonder if it's possible to be dynamic and somehow > set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've You probably meant min()? > checked to see if the mwifiex can actually query the DPM watchdog > timeout... ;-) Yeah, I wondered similarly -- or in reverse, if we could somehow "pat" the watchdog or prime it with an expected timeout. But AFAICT, neither such feature exists today. > ...also, it sure seems like if we're going to choose a value so low > that we shouldn't panic. All of our other watchdogs that panic aren't > so short, so probably this one shouldn't be either. Maybe we could > submit a patch to make the DPM watchdog just abort the suspend if > that's not too hard (and if the power people accept it). Yeah, if you made the watchdog just interrupt suspend and dump some warnings, then the effect would be pretty similar to this patch. > > I wonder what the distribution of (successful) timing is today. I'd > > assume this typically take on the order of milliseconds, but do we > > commonly see outliers? What if a system is fully loaded while > > suspending? > > I would hope this doesn't affect things from the DPM watchdog, but I > haven't researched. Hopefully the DPM watchdog starts after userspace > is frozen so the system being fully loaded shouldn't matter? I was just throwing out ideas, but I didn't specifically mean user space. You provided a few more ideas. Anyway, I was just fishing for *some* attempt at real-world (and, as-bad-as-you-can-simulate world) numbers, since that's the point of a timeout like this. > > Can you try testing (and gather timing numbers) when > > suspending soon after initiating scans? It's hard to judge what the > > lower limit of this timeout should really be without any numbers, just > > like it's hard to judge whether your 10 second watchdog is reasonable. > > Pin-yen: is this something you could gather? > > > > Also, for the record, since we might have to field regression reports > > for other systems: what hardware (chip variant, FW version) are you > > seeing problems on? > > Pin-yen: I'm assuming you'll provide this. I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide the above to provide a little more confidence, or if you want to reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT. Brian
Hi Brian, On Wed, Dec 4, 2024 at 10:04 AM Brian Norris <briannorris@chromium.org> wrote: > > Hi Doug, > > On Tue, Dec 03, 2024 at 05:38:55PM -0800, Doug Anderson wrote: > > On Tue, Dec 3, 2024 at 3:05 PM Brian Norris <briannorris@chromium.org> wrote: > > > The Kconfig default is 120 seconds, and it's hidden under > > > CONFIG_EXPERT. What makes you think 10 is a good value? (It sounds > > > pretty small for triggering panics.) The smallest I can see outside of > > > ChromeOS is 12 seconds, although 60 seconds is much more common. I > > > also happen to see other WiFi drivers have hit similar problems, but > > > they settled on 20 seconds (assuming a 60s timeout on other distros): > > > https://lore.kernel.org/linux-wireless/20230329162038.8637-1-kvalo@kernel.org/ > > > https://git.kernel.org/linus/cf5fa3ca0552f1b7ba8490de40700bbfb6979b17 > > > > > > Technically, this Kconfig lets you set a value as small as 1 second. I > > > don't think we should work at reducing all timeouts to fit that > > > target. > > > > > > I could maybe approve lowering this one, but I'd also recommend > > > reconsidering your timeout value. And more questions below. > > > > That's fair. I guess having a 10 second timeout for full system > > suspend didn't seem totally crazy to me. If a system is taking more > > than 10 seconds to do a full system suspend then that seems like > > something is pretty broken. I guess it's somewhat like the same > > argument that the WiFi driver had for picking 10 seconds but applied > > to the whole system level, and I guess that's where we get into > > trouble. That made me think that even 5 seconds seems a bit long for > > any given driver to suspend. ...but yeah, it's squishy. > > 10 seconds is likely that *something* is wrong (or at least suboptimal), > but IMO, it's not quite at unreasonable levels. But yes, my point was > mainly that it's squishy, and I personally wouldn't want to be the one > running with the lowest CONFIG_DPM_WATCHDOG_TIMEOUT out there, given the > known behavior of multiple drivers and the timeout-means-panic behavior. > > > Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog > > timer and that's a better solution and leave the WiFi driver how it > > is? > > That seems reasonable. > > To be clear, I'm OK with this patch, if we can get a little more > confidence in it (like the timing data and HW info). I *think* 5 vs 10 > isn't a big deal here, but I don't have much other than my guess at the > moment. > > > Another thought: I wonder if it's possible to be dynamic and somehow > > set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've > > You probably meant min()? > > > checked to see if the mwifiex can actually query the DPM watchdog > > timeout... ;-) > > Yeah, I wondered similarly -- or in reverse, if we could somehow "pat" > the watchdog or prime it with an expected timeout. But AFAICT, neither > such feature exists today. > > > ...also, it sure seems like if we're going to choose a value so low > > that we shouldn't panic. All of our other watchdogs that panic aren't > > so short, so probably this one shouldn't be either. Maybe we could > > submit a patch to make the DPM watchdog just abort the suspend if > > that's not too hard (and if the power people accept it). > > Yeah, if you made the watchdog just interrupt suspend and dump some > warnings, then the effect would be pretty similar to this patch. > > > > I wonder what the distribution of (successful) timing is today. I'd > > > assume this typically take on the order of milliseconds, but do we > > > commonly see outliers? What if a system is fully loaded while > > > suspending? > > > > I would hope this doesn't affect things from the DPM watchdog, but I > > haven't researched. Hopefully the DPM watchdog starts after userspace > > is frozen so the system being fully loaded shouldn't matter? > > I was just throwing out ideas, but I didn't specifically mean user > space. You provided a few more ideas. Anyway, I was just fishing for > *some* attempt at real-world (and, as-bad-as-you-can-simulate world) > numbers, since that's the point of a timeout like this. > > > > Can you try testing (and gather timing numbers) when > > > suspending soon after initiating scans? It's hard to judge what the > > > lower limit of this timeout should really be without any numbers, just > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > Pin-yen: is this something you could gather? I tried entering suspend right after wifi scans, and the time spent in mwifiex_enable_hs() is always around 100ms. It seems initiating suspend does not increase the execution time for mwifiex_enable_hs(), so I think the driver is capable of interrupting a scan. > > > > > > > Also, for the record, since we might have to field regression reports > > > for other systems: what hardware (chip variant, FW version) are you > > > seeing problems on? > > > > Pin-yen: I'm assuming you'll provide this.
Hi, On Tue, Dec 3, 2024 at 6:04 PM Brian Norris <briannorris@chromium.org> wrote: > > 10 seconds is likely that *something* is wrong (or at least suboptimal), > but IMO, it's not quite at unreasonable levels. But yes, my point was > mainly that it's squishy, and I personally wouldn't want to be the one > running with the lowest CONFIG_DPM_WATCHDOG_TIMEOUT out there, given the > known behavior of multiple drivers and the timeout-means-panic behavior. > > > Maybe the ChromeOS should change to 15 seconds for the DPM Watchdog > > timer and that's a better solution and leave the WiFi driver how it > > is? > > That seems reasonable. FWIW, I created a public feature request for this: https://issuetracker.google.com/382269699 ...we'll see if we can get anyone to bite on it. ...and then see if upstream folks like the idea too. > To be clear, I'm OK with this patch, if we can get a little more > confidence in it (like the timing data and HW info). I *think* 5 vs 10 > isn't a big deal here, but I don't have much other than my guess at the > moment. > > > Another thought: I wonder if it's possible to be dynamic and somehow > > set the timeout as "max(10, dpm_watchdog_timeout / 2)". Not that I've > > You probably meant min()? Yeah, I always screw that up. Sigh. > > > Can you try testing (and gather timing numbers) when > > > suspending soon after initiating scans? It's hard to judge what the > > > lower limit of this timeout should really be without any numbers, just > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > Pin-yen: is this something you could gather? > > > > > > > Also, for the record, since we might have to field regression reports > > > for other systems: what hardware (chip variant, FW version) are you > > > seeing problems on? > > > > Pin-yen: I'm assuming you'll provide this. > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide > the above to provide a little more confidence, or if you want to > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT. Possibly the answer could be both?
Hi, On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote: > > > > > Can you try testing (and gather timing numbers) when > > > > suspending soon after initiating scans? It's hard to judge what the > > > > lower limit of this timeout should really be without any numbers, just > > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > > > Pin-yen: is this something you could gather? > > I tried entering suspend right after wifi scans, and the time spent in > mwifiex_enable_hs() is always around 100ms. It seems initiating > suspend does not increase the execution time for mwifiex_enable_hs(), > so I think the driver is capable of interrupting a scan. > > > > > > > > > > Also, for the record, since we might have to field regression reports > > > > for other systems: what hardware (chip variant, FW version) are you > > > > seeing problems on? > > > > > > Pin-yen: I'm assuming you'll provide this. > > From the debugfs entry: > > driver_name = "mwifiex" > driver_version = mwifiex 1.0 (15.68.19.p54) > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54 > > The compatible string of the DT is "marvell,sd8897". > > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide > > the above to provide a little more confidence, or if you want to > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT. > > It looks okay to me to decrease the timeout here given that scanning > doesn't seem to affect the suspend time. What's your thought, Doug? I think Brian is right and that we should change how we're using the DPM watchdog, but IMO that doesn't mean that we couldn't also change this timeout. If nothing else, you'd want to post a v2 of your patch containing a summary of the info you've gathered so it's recorded with the patch. Maybe what makes the most sense, though, is to put our money where our mouth is and land some sort of patch in the ChromeOS tree and then report back to upstream in a month when we have data about it. If things look good in the field then presumably that would give some confidence for upstream to be willing to land the patch? Probably about the best data we could gather would be to break the existing timeout into two halves. You could wait half the time, then print a warning, and then wait the other half the time. We could even land that change _without_ changing the timeout to 5 seconds. Then if we ever see the warning print but then the suspend succeeds we know that there are cases where waiting longer would have helped. If we made that a WARN_ON() our existing infrastructure would help us gather that info... ...so I guess summarizing my rambling, a good plan would be: 1. Change the ChromeOS DPM watchdog timeout to 15 seconds to avoid the problem for now. 2. Land a patch to do a WARN_ON() in mwifiex if we take 5 seconds. Actually, you could (probably) send this upstream and land it FROMLIST? 3. Wait ~1-2 months and see if we ever see the WARN_ON() trigger but then things succeed after 10 seconds. If this never happens then send a v2 patch changing the timeout to 5 seconds with details in the commit message. 4. In the background, see if we can convince folks to make the DPM Watchdog less panicky. -Doug
On Wed, Dec 04, 2024 at 09:45:11PM +0800, Pin-yen Lin wrote: > On Wed, Dec 4, 2024 at 10:04 AM Brian Norris <briannorris@chromium.org> wrote: > > > > > > Can you try testing (and gather timing numbers) when > > > > suspending soon after initiating scans? It's hard to judge what the > > > > lower limit of this timeout should really be without any numbers, just > > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > > > Pin-yen: is this something you could gather? > > I tried entering suspend right after wifi scans, and the time spent in > mwifiex_enable_hs() is always around 100ms. It seems initiating > suspend does not increase the execution time for mwifiex_enable_hs(), > so I think the driver is capable of interrupting a scan. Thanks! At some level, there are things we can only verify by experimentation, since we don't have firmware source code. This seems fine to me then. > > > > Also, for the record, since we might have to field regression reports > > > > for other systems: what hardware (chip variant, FW version) are you > > > > seeing problems on? > > > > > > Pin-yen: I'm assuming you'll provide this. > > From the debugfs entry: > > driver_name = "mwifiex" > driver_version = mwifiex 1.0 (15.68.19.p54) > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54 > > The compatible string of the DT is "marvell,sd8897". Thanks. I think it'd be good to see this info in the commit message, but otherwise you can carry my: Acked-by: Brian Norris <briannorris@chromium.org> It'd be extra nice to see that you successfully use this patch in your own releases, but I don't think that's a requirement for upstream. And anyway, the upstream RC cycle is pretty long. Brian
Hi Doug, On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote: > > > > > > > Can you try testing (and gather timing numbers) when > > > > > suspending soon after initiating scans? It's hard to judge what the > > > > > lower limit of this timeout should really be without any numbers, just > > > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > > > > > Pin-yen: is this something you could gather? > > > > I tried entering suspend right after wifi scans, and the time spent in > > mwifiex_enable_hs() is always around 100ms. It seems initiating > > suspend does not increase the execution time for mwifiex_enable_hs(), > > so I think the driver is capable of interrupting a scan. > > > > > > > > > > > > > Also, for the record, since we might have to field regression reports > > > > > for other systems: what hardware (chip variant, FW version) are you > > > > > seeing problems on? > > > > > > > > Pin-yen: I'm assuming you'll provide this. > > > > From the debugfs entry: > > > > driver_name = "mwifiex" > > driver_version = mwifiex 1.0 (15.68.19.p54) > > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54 > > > > The compatible string of the DT is "marvell,sd8897". > > > > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide > > > the above to provide a little more confidence, or if you want to > > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT. > > > > It looks okay to me to decrease the timeout here given that scanning > > doesn't seem to affect the suspend time. What's your thought, Doug? > > I think Brian is right and that we should change how we're using the > DPM watchdog, but IMO that doesn't mean that we couldn't also change > this timeout. > > If nothing else, you'd want to post a v2 of your patch containing a > summary of the info you've gathered so it's recorded with the patch. > > Maybe what makes the most sense, though, is to put our money where our > mouth is and land some sort of patch in the ChromeOS tree and then > report back to upstream in a month when we have data about it. If > things look good in the field then presumably that would give some > confidence for upstream to be willing to land the patch? > > Probably about the best data we could gather would be to break the > existing timeout into two halves. You could wait half the time, then > print a warning, and then wait the other half the time. We could even > land that change _without_ changing the timeout to 5 seconds. Then if > we ever see the warning print but then the suspend succeeds we know > that there are cases where waiting longer would have helped. If we > made that a WARN_ON() our existing infrastructure would help us gather > that info... I just realized that mwifiex_wait_queue_complete() has another 12s timeout[1], though they are not directly involved in suspend/resume. Should we also add a warning to that and see if we can make it half? This starts to make me think how many timeouts we want to consider. Does it make sense to only focus on the one in mwifiex_enable_hs()? Or should we check other timeouts in mwifiex or even other drivers? [1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51 > > ...so I guess summarizing my rambling, a good plan would be: > > 1. Change the ChromeOS DPM watchdog timeout to 15 seconds to avoid the > problem for now. > > 2. Land a patch to do a WARN_ON() in mwifiex if we take 5 seconds. > Actually, you could (probably) send this upstream and land it > FROMLIST? > > 3. Wait ~1-2 months and see if we ever see the WARN_ON() trigger but > then things succeed after 10 seconds. If this never happens then send > a v2 patch changing the timeout to 5 seconds with details in the > commit message. > > 4. In the background, see if we can convince folks to make the DPM > Watchdog less panicky. > > > -Doug Regards, Pin-yen
Hi, On Thu, Dec 5, 2024 at 5:46 AM Pin-yen Lin <treapking@chromium.org> wrote: > > Hi Doug, > > On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@chromium.org> wrote: > > > > Hi, > > > > On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote: > > > > > > > > > Can you try testing (and gather timing numbers) when > > > > > > suspending soon after initiating scans? It's hard to judge what the > > > > > > lower limit of this timeout should really be without any numbers, just > > > > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > > > > > > > Pin-yen: is this something you could gather? > > > > > > I tried entering suspend right after wifi scans, and the time spent in > > > mwifiex_enable_hs() is always around 100ms. It seems initiating > > > suspend does not increase the execution time for mwifiex_enable_hs(), > > > so I think the driver is capable of interrupting a scan. > > > > > > > > > > > > > > > > Also, for the record, since we might have to field regression reports > > > > > > for other systems: what hardware (chip variant, FW version) are you > > > > > > seeing problems on? > > > > > > > > > > Pin-yen: I'm assuming you'll provide this. > > > > > > From the debugfs entry: > > > > > > driver_name = "mwifiex" > > > driver_version = mwifiex 1.0 (15.68.19.p54) > > > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54 > > > > > > The compatible string of the DT is "marvell,sd8897". > > > > > > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide > > > > the above to provide a little more confidence, or if you want to > > > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT. > > > > > > It looks okay to me to decrease the timeout here given that scanning > > > doesn't seem to affect the suspend time. What's your thought, Doug? > > > > I think Brian is right and that we should change how we're using the > > DPM watchdog, but IMO that doesn't mean that we couldn't also change > > this timeout. > > > > If nothing else, you'd want to post a v2 of your patch containing a > > summary of the info you've gathered so it's recorded with the patch. > > > > Maybe what makes the most sense, though, is to put our money where our > > mouth is and land some sort of patch in the ChromeOS tree and then > > report back to upstream in a month when we have data about it. If > > things look good in the field then presumably that would give some > > confidence for upstream to be willing to land the patch? > > > > Probably about the best data we could gather would be to break the > > existing timeout into two halves. You could wait half the time, then > > print a warning, and then wait the other half the time. We could even > > land that change _without_ changing the timeout to 5 seconds. Then if > > we ever see the warning print but then the suspend succeeds we know > > that there are cases where waiting longer would have helped. If we > > made that a WARN_ON() our existing infrastructure would help us gather > > that info... > > I just realized that mwifiex_wait_queue_complete() has another 12s > timeout[1], though they are not directly involved in suspend/resume. > > Should we also add a warning to that and see if we can make it half? > This starts to make me think how many timeouts we want to consider. > Does it make sense to only focus on the one in mwifiex_enable_hs()? Or > should we check other timeouts in mwifiex or even other drivers? > > [1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51 IMO any of these "arbitrary and really long timeout to make sure we don't hang forever" type things probably should have a warning so we know if we're getting close to the timeout. It wouldn't be very hard to make a wrapper around wait_event_interruptible_timeout() to handle this. I suppose that wrapper could just be local to mwifiex, though if other drivers found it useful it would make sense to put it somewhere more common. That being said, if we aren't actually hitting these other timeouts I don't know that we need to do an extensive audit right now to find every one of them. Of course, Brian also said he'd be OK with just setting the timeout to 5 seconds based on the research you've already done. In that case I don't know if we'd want a WARNing after 2.5 seconds. Maybe? 2.5 seconds is still pretty long, but I don't know if it's WARN-worthy. It could at least be dev_warn() worthy... -Doug
Hi Doug, On Fri, Dec 6, 2024 at 1:13 AM Doug Anderson <dianders@chromium.org> wrote: > > Hi, > > On Thu, Dec 5, 2024 at 5:46 AM Pin-yen Lin <treapking@chromium.org> wrote: > > > > Hi Doug, > > > > On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@chromium.org> wrote: > > > > > > Hi, > > > > > > On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote: > > > > > > > > > > > Can you try testing (and gather timing numbers) when > > > > > > > suspending soon after initiating scans? It's hard to judge what the > > > > > > > lower limit of this timeout should really be without any numbers, just > > > > > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > > > > > > > > > Pin-yen: is this something you could gather? > > > > > > > > I tried entering suspend right after wifi scans, and the time spent in > > > > mwifiex_enable_hs() is always around 100ms. It seems initiating > > > > suspend does not increase the execution time for mwifiex_enable_hs(), > > > > so I think the driver is capable of interrupting a scan. > > > > > > > > > > > > > > > > > > > Also, for the record, since we might have to field regression reports > > > > > > > for other systems: what hardware (chip variant, FW version) are you > > > > > > > seeing problems on? > > > > > > > > > > > > Pin-yen: I'm assuming you'll provide this. > > > > > > > > From the debugfs entry: > > > > > > > > driver_name = "mwifiex" > > > > driver_version = mwifiex 1.0 (15.68.19.p54) > > > > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54 > > > > > > > > The compatible string of the DT is "marvell,sd8897". > > > > > > > > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide > > > > > the above to provide a little more confidence, or if you want to > > > > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT. > > > > > > > > It looks okay to me to decrease the timeout here given that scanning > > > > doesn't seem to affect the suspend time. What's your thought, Doug? > > > > > > I think Brian is right and that we should change how we're using the > > > DPM watchdog, but IMO that doesn't mean that we couldn't also change > > > this timeout. > > > > > > If nothing else, you'd want to post a v2 of your patch containing a > > > summary of the info you've gathered so it's recorded with the patch. > > > > > > Maybe what makes the most sense, though, is to put our money where our > > > mouth is and land some sort of patch in the ChromeOS tree and then > > > report back to upstream in a month when we have data about it. If > > > things look good in the field then presumably that would give some > > > confidence for upstream to be willing to land the patch? > > > > > > Probably about the best data we could gather would be to break the > > > existing timeout into two halves. You could wait half the time, then > > > print a warning, and then wait the other half the time. We could even > > > land that change _without_ changing the timeout to 5 seconds. Then if > > > we ever see the warning print but then the suspend succeeds we know > > > that there are cases where waiting longer would have helped. If we > > > made that a WARN_ON() our existing infrastructure would help us gather > > > that info... > > > > I just realized that mwifiex_wait_queue_complete() has another 12s > > timeout[1], though they are not directly involved in suspend/resume. > > > > Should we also add a warning to that and see if we can make it half? > > This starts to make me think how many timeouts we want to consider. > > Does it make sense to only focus on the one in mwifiex_enable_hs()? Or > > should we check other timeouts in mwifiex or even other drivers? > > > > [1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51 > > IMO any of these "arbitrary and really long timeout to make sure we > don't hang forever" type things probably should have a warning so we > know if we're getting close to the timeout. It wouldn't be very hard > to make a wrapper around wait_event_interruptible_timeout() to handle > this. I suppose that wrapper could just be local to mwifiex, though if > other drivers found it useful it would make sense to put it somewhere > more common. > > That being said, if we aren't actually hitting these other timeouts I > don't know that we need to do an extensive audit right now to find > every one of them. > > Of course, Brian also said he'd be OK with just setting the timeout to > 5 seconds based on the research you've already done. In that case I > don't know if we'd want a WARNing after 2.5 seconds. Maybe? 2.5 > seconds is still pretty long, but I don't know if it's WARN-worthy. It > could at least be dev_warn() worthy... > Sounds like we will start to binary search the timeout... Then, I would like to just send out a v2 for this with an updated commit message. It sounds like it will be endless binary searches on all those timeouts in the kernel if we really want to create a wrapper around wait_event_interruptible_timeout(), and all the efforts only make us a few more seconds faster on the system. > > > -Doug Regards, Pin-yen
Hi, On Fri, Dec 6, 2024 at 3:21 AM Pin-yen Lin <treapking@chromium.org> wrote: > > Hi Doug, > > On Fri, Dec 6, 2024 at 1:13 AM Doug Anderson <dianders@chromium.org> wrote: > > > > Hi, > > > > On Thu, Dec 5, 2024 at 5:46 AM Pin-yen Lin <treapking@chromium.org> wrote: > > > > > > Hi Doug, > > > > > > On Thu, Dec 5, 2024 at 2:11 AM Doug Anderson <dianders@chromium.org> wrote: > > > > > > > > Hi, > > > > > > > > On Wed, Dec 4, 2024 at 5:45 AM Pin-yen Lin <treapking@chromium.org> wrote: > > > > > > > > > > > > > Can you try testing (and gather timing numbers) when > > > > > > > > suspending soon after initiating scans? It's hard to judge what the > > > > > > > > lower limit of this timeout should really be without any numbers, just > > > > > > > > like it's hard to judge whether your 10 second watchdog is reasonable. > > > > > > > > > > > > > > Pin-yen: is this something you could gather? > > > > > > > > > > I tried entering suspend right after wifi scans, and the time spent in > > > > > mwifiex_enable_hs() is always around 100ms. It seems initiating > > > > > suspend does not increase the execution time for mwifiex_enable_hs(), > > > > > so I think the driver is capable of interrupting a scan. > > > > > > > > > > > > > > > > > > > > > > Also, for the record, since we might have to field regression reports > > > > > > > > for other systems: what hardware (chip variant, FW version) are you > > > > > > > > seeing problems on? > > > > > > > > > > > > > > Pin-yen: I'm assuming you'll provide this. > > > > > > > > > > From the debugfs entry: > > > > > > > > > > driver_name = "mwifiex" > > > > > driver_version = mwifiex 1.0 (15.68.19.p54) > > > > > verext = w8897o-B0, RF87XX, FP68, 15.68.19.p54 > > > > > > > > > > The compatible string of the DT is "marvell,sd8897". > > > > > > > > > > > > I'll leave it up to y'all (Doug and Pin-Yen) whether you want to provide > > > > > > the above to provide a little more confidence, or if you want to > > > > > > reconsider your use of CONFIG_DPM_WATCHDOG_TIMEOUT. > > > > > > > > > > It looks okay to me to decrease the timeout here given that scanning > > > > > doesn't seem to affect the suspend time. What's your thought, Doug? > > > > > > > > I think Brian is right and that we should change how we're using the > > > > DPM watchdog, but IMO that doesn't mean that we couldn't also change > > > > this timeout. > > > > > > > > If nothing else, you'd want to post a v2 of your patch containing a > > > > summary of the info you've gathered so it's recorded with the patch. > > > > > > > > Maybe what makes the most sense, though, is to put our money where our > > > > mouth is and land some sort of patch in the ChromeOS tree and then > > > > report back to upstream in a month when we have data about it. If > > > > things look good in the field then presumably that would give some > > > > confidence for upstream to be willing to land the patch? > > > > > > > > Probably about the best data we could gather would be to break the > > > > existing timeout into two halves. You could wait half the time, then > > > > print a warning, and then wait the other half the time. We could even > > > > land that change _without_ changing the timeout to 5 seconds. Then if > > > > we ever see the warning print but then the suspend succeeds we know > > > > that there are cases where waiting longer would have helped. If we > > > > made that a WARN_ON() our existing infrastructure would help us gather > > > > that info... > > > > > > I just realized that mwifiex_wait_queue_complete() has another 12s > > > timeout[1], though they are not directly involved in suspend/resume. > > > > > > Should we also add a warning to that and see if we can make it half? > > > This starts to make me think how many timeouts we want to consider. > > > Does it make sense to only focus on the one in mwifiex_enable_hs()? Or > > > should we check other timeouts in mwifiex or even other drivers? > > > > > > [1]: https://elixir.bootlin.com/linux/v6.12.1/source/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c#L51 > > > > IMO any of these "arbitrary and really long timeout to make sure we > > don't hang forever" type things probably should have a warning so we > > know if we're getting close to the timeout. It wouldn't be very hard > > to make a wrapper around wait_event_interruptible_timeout() to handle > > this. I suppose that wrapper could just be local to mwifiex, though if > > other drivers found it useful it would make sense to put it somewhere > > more common. > > > > That being said, if we aren't actually hitting these other timeouts I > > don't know that we need to do an extensive audit right now to find > > every one of them. > > > > Of course, Brian also said he'd be OK with just setting the timeout to > > 5 seconds based on the research you've already done. In that case I > > don't know if we'd want a WARNing after 2.5 seconds. Maybe? 2.5 > > seconds is still pretty long, but I don't know if it's WARN-worthy. It > > could at least be dev_warn() worthy... > > > > Sounds like we will start to binary search the timeout... I wouldn't bother binary searching the timeout and in general I wouldn't try to tune these timeouts unless we think there's a real benefit to it. Mostly, though, I like the idea of having a warning _before_ a failure occurs. Like if I have a retry loop that will retry 10 times, I like to warn after I need 5 retries so that it's obvious we're getting close. If I expect a timeout to never be needed, I'd also like some warning that says we got close to the maximum timeout. All of this makes it much easier to debug problems if/when we actually hit the failure case. It also has the benefit that if we need to tune a timeout / number of retries then we have more confidence that we were nowhere near the max, but IMO that wouldn't be the primary benefit. > Then, I would like to just send out a v2 for this with an updated > commit message. Sure, though maybe it's not urgent? We've changed the DPM watchdog in ChromeOS to 15 seconds so at least we won't be hitting it there. While I agree that the 10 second timeout is still likely overkill and that 5 should be fine, gathering more data wouldn't hurt. I'd be really curious how the system behaves after the timeout hits, too. Does the next suspend actually succeed? Does WiFi recover or is it busted until the next reboot? > It sounds like it will be endless binary searches on > all those timeouts in the kernel if we really want to create a wrapper > around wait_event_interruptible_timeout(), and all the efforts only > make us a few more seconds faster on the system. My goal in general isn't to optimize the system but to know when we're close to failure so it's easier to debug when the timeout occurs.
Pin-yen Lin <treapking@chromium.org> wrote: > In commit 52250cbee7f6 ("mwifiex: use timeout variant for > wait_event_interruptible") it was noted that sometimes we seemed > to miss the signal that our host sleep settings took effect. A > 10 second timeout was added to the code to make sure we didn't > hang forever waiting. It appears that this problem still exists > and we hit the timeout sometimes for Chromebooks in the field. > > Recently on ChromeOS we've started setting the DPM watchdog > to trip if full system suspend takes over 10 seconds. Given > the timeout in the original patch, obviously we're hitting > the DPM watchdog before mwifiex gets a chance to timeout. > > While we could increase the DPM watchdog in ChromeOS to avoid > this problem, it's probably better to simply decrease the > timeout. Any time we're waiting several seconds for the > firmware to respond it's likely that the firmware won't ever > respond. With that in mind, decrease the timeout in mwifiex > from 10 seconds to 5 seconds. > > Suggested-by: Doug Anderson <dianders@chromium.org> > Signed-off-by: Pin-yen Lin <treapking@chromium.org> > Reviewed-by: Douglas Anderson <dianders@chromium.org> > Acked-by: Brian Norris <briannorris@chromium.org> Patch applied to wireless-next.git, thanks. f143cece43dd wifi: mwifiex: decrease timeout waiting for host sleep from 10s to 5s
diff --git a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c index e06a0622973e..f79589cafe57 100644 --- a/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c +++ b/drivers/net/wireless/marvell/mwifiex/sta_ioctl.c @@ -545,7 +545,7 @@ int mwifiex_enable_hs(struct mwifiex_adapter *adapter) if (wait_event_interruptible_timeout(adapter->hs_activate_wait_q, adapter->hs_activate_wait_q_woken, - (10 * HZ)) <= 0) { + (5 * HZ)) <= 0) { mwifiex_dbg(adapter, ERROR, "hs_activate_wait_q terminated\n"); return false;
In commit 52250cbee7f6 ("mwifiex: use timeout variant for wait_event_interruptible") it was noted that sometimes we seemed to miss the signal that our host sleep settings took effect. A 10 second timeout was added to the code to make sure we didn't hang forever waiting. It appears that this problem still exists and we hit the timeout sometimes for Chromebooks in the field. Recently on ChromeOS we've started setting the DPM watchdog to trip if full system suspend takes over 10 seconds. Given the timeout in the original patch, obviously we're hitting the DPM watchdog before mwifiex gets a chance to timeout. While we could increase the DPM watchdog in ChromeOS to avoid this problem, it's probably better to simply decrease the timeout. Any time we're waiting several seconds for the firmware to respond it's likely that the firmware won't ever respond. With that in mind, decrease the timeout in mwifiex from 10 seconds to 5 seconds. Suggested-by: Doug Anderson <dianders@chromium.org> Signed-off-by: Pin-yen Lin <treapking@chromium.org> --- drivers/net/wireless/marvell/mwifiex/sta_ioctl.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)