Message ID | 20220407115918.1.I8226c7fdae88329ef70957b96a39b346c69a914e@changeid |
---|---|
State | New |
Headers | show |
Series | USB: hcd-pci: Fully suspend across freeze/thaw cycle | expand |
On 12.4.2022 18.40, Alan Stern wrote: > On Tue, Apr 12, 2022 at 05:56:42PM +0300, Mathias Nyman wrote: >> On 11.4.2022 17.50, Alan Stern wrote: >>> For example, what would happen if the user unplugs a device right in the >>> middle of the freeze transition, after the root hub has been frozen but >>> before the controller is frozen? We don't want such an unplug event to >>> prevent the system from going into hibernation -- especially if the root >>> hub was not enabled for wakeup. >> >> We should be able to let system go to hibernate even if we get a disconnect >> interrupt between roothub and host controller freeze. >> Host is not yet suspended so no PME# wake is generated, only an interrupt. >> >> From Linux PM point of view it should be ok as well as the actual xhci >> device that is generating the interrupt is hasnt completer freeze() >> >> The xhci interrupt handler just needs to make sure that the disconnect >> isn't propagated if roothub is suspended and wake on disconnect >> is not set. And definitely make sure xhci doesn't start roothub polling. >> >> When freeze() is called for the host we should prevent the host from >> generating interrupts. > > I guess that means adding a new callback. Or we could just suspend the > controller, like Evan proposed originally Suspending the host in freeze should work. It will do an extra xhci controller state save stage, but that should be harmless. But is there really a need for the suggested noirq part? + .freeze_noirq = hcd_pci_suspend_noirq, That will try to set the host to PCI D3 state. It seems a bit unnecessary for freeze. > >>> (If the root hub _is_ enabled for wakeup then it's questionable. >>> Unplugging a device would be a wakeup event, so you could easily argue >>> that it _should_ prevent the system from going into hibernation. After >>> all, if the unplug happened a few milliseconds later, after the system >>> had fully gone into hibernation, then it would cause the system to wake >>> up.) >>> >>>> Would it make sense prevent xHCI interrupt generation in the host >>>> freeze() stage, clearing the xHCI EINT bit in addition to calling >>>> check_roothub_suspend()? >>>> Then enable it back in thaw() >>> >>> That won't fully eliminate the problem mentioned in the preceding >>> paragraphs, although I guess it would help somewhat. >> >> Would the following steps solve this? >> >> 1. Disable device initiated resume for connected usb devices in freeze() >> >> 2. Don't propagate connect or OC changes if roothub is suspended and port wake >> flags are disabled. I.E don't kick roothub polling in xhci interrupt >> handler here. > > I guess you can't just halt the entire host controller when only one of > the root hubs is suspended with wakeup disabled. That does complicate > things. But you could halt it as soon as both of the root hubs are > frozen. Wouldn't that prevent interrupt generation? True, but probably easier to just suspend host in freeze() as you stated above. -Mathias
Hi Alan and Mathias, On Thu, Apr 14, 2022 at 7:21 AM Alan Stern <stern@rowland.harvard.edu> wrote: > > On Thu, Apr 14, 2022 at 05:00:12PM +0300, Mathias Nyman wrote: > > On 12.4.2022 18.40, Alan Stern wrote: > > > On Tue, Apr 12, 2022 at 05:56:42PM +0300, Mathias Nyman wrote: > > >> On 11.4.2022 17.50, Alan Stern wrote: > > >>> For example, what would happen if the user unplugs a device right in the > > >>> middle of the freeze transition, after the root hub has been frozen but > > >>> before the controller is frozen? We don't want such an unplug event to > > >>> prevent the system from going into hibernation -- especially if the root > > >>> hub was not enabled for wakeup. > > >> > > >> We should be able to let system go to hibernate even if we get a disconnect > > >> interrupt between roothub and host controller freeze. > > >> Host is not yet suspended so no PME# wake is generated, only an interrupt. > > >> > > >> From Linux PM point of view it should be ok as well as the actual xhci > > >> device that is generating the interrupt is hasnt completer freeze() > > >> > > >> The xhci interrupt handler just needs to make sure that the disconnect > > >> isn't propagated if roothub is suspended and wake on disconnect > > >> is not set. And definitely make sure xhci doesn't start roothub polling. > > >> > > >> When freeze() is called for the host we should prevent the host from > > >> generating interrupts. > > > > > > I guess that means adding a new callback. Or we could just suspend the > > > controller, like Evan proposed originally > > > > Suspending the host in freeze should work. > > It will do an extra xhci controller state save stage, but that should be harmless. > > > > But is there really a need for the suggested noirq part? > > > > + .freeze_noirq = hcd_pci_suspend_noirq, > > > > That will try to set the host to PCI D3 state. > > It seems a bit unnecessary for freeze. > > Agreed. > > > >>> (If the root hub _is_ enabled for wakeup then it's questionable. > > >>> Unplugging a device would be a wakeup event, so you could easily argue > > >>> that it _should_ prevent the system from going into hibernation. After > > >>> all, if the unplug happened a few milliseconds later, after the system > > >>> had fully gone into hibernation, then it would cause the system to wake > > >>> up.) > > >>> > > >>>> Would it make sense prevent xHCI interrupt generation in the host > > >>>> freeze() stage, clearing the xHCI EINT bit in addition to calling > > >>>> check_roothub_suspend()? > > >>>> Then enable it back in thaw() > > >>> > > >>> That won't fully eliminate the problem mentioned in the preceding > > >>> paragraphs, although I guess it would help somewhat. > > >> > > >> Would the following steps solve this? > > >> > > >> 1. Disable device initiated resume for connected usb devices in freeze() > > >> > > >> 2. Don't propagate connect or OC changes if roothub is suspended and port wake > > >> flags are disabled. I.E don't kick roothub polling in xhci interrupt > > >> handler here. > > > > > > I guess you can't just halt the entire host controller when only one of > > > the root hubs is suspended with wakeup disabled. That does complicate > > > things. But you could halt it as soon as both of the root hubs are > > > frozen. Wouldn't that prevent interrupt generation? > > > > True, but probably easier to just suspend host in freeze() as you stated above. > > Okay. > > Evan, this discussion suggests that you rewrite your patch as a series > of three: > > 1. Change choose_wakeup() so that for PM_EVENT_FREEZE, wakeup is > always disabled. If I understand this correctly, this means potentially runtime resuming the device so its wakeup setting can be consistently set to wakeups disabled across a freeze transition. Got it I think in terms of the "how". > > 2. Change the xhci-hcd interrupt handler so that port-status > changes are ignored if the port's root hub is suspended with > wakeup disabled. This part confuses me. This would be way deep under xhci_handle_event(), probably in handle_port_status(), just throwing away certain events that come in the ring. How would we know to go back and process those events later? I think we don't need to do this if we suspend the controller as in #3 below. The suspended (halted) controller wouldn't generate event interrupts (since the spec mentions port status change generation is gated on HCHalted). So we're already covered against receiving interrupts in this zone by halting the controller, and the events stay nicely pending for when we restart it in thaw. Is the goal of #1 purely a setup change for #2, or does it stand on its own even if we nixed #2? Said differently, is #1 trying to ensure that wake signaling doesn't occur at all between freeze and thaw, even when the controller is suspended and guaranteed not to generate interrupts via its "normal" mechanism? I don't have a crisp mental picture of how the wake signaling works, but if the controller wake mechanism sidesteps the original problem of sending an MSI to a dead CPU (as in, it does not use MSIs), then it might be ok as-is. > > 3. As in the original patch, make the .freeze and .thaw callbacks > in hcd-pci.c call the appropriate suspend and resume routines, > but don't do anything for .freeze_noirq and .thaw_noirq. Sure. I had made the _noirq paths match suspend for consistency, I wasn't sure if those could mix n match without issues. I'll try it out leaving the _noirq callbacks alone. -Evan > > How does that sound? > > Alan Stern
diff --git a/drivers/usb/core/hcd-pci.c b/drivers/usb/core/hcd-pci.c index 8176bc81a635d6..e02506807ffc6c 100644 --- a/drivers/usb/core/hcd-pci.c +++ b/drivers/usb/core/hcd-pci.c @@ -616,10 +616,10 @@ const struct dev_pm_ops usb_hcd_pci_pm_ops = { .suspend_noirq = hcd_pci_suspend_noirq, .resume_noirq = hcd_pci_resume_noirq, .resume = hcd_pci_resume, - .freeze = check_root_hub_suspended, - .freeze_noirq = check_root_hub_suspended, - .thaw_noirq = NULL, - .thaw = NULL, + .freeze = hcd_pci_suspend, + .freeze_noirq = hcd_pci_suspend_noirq, + .thaw_noirq = hcd_pci_resume_noirq, + .thaw = hcd_pci_resume, .poweroff = hcd_pci_suspend, .poweroff_noirq = hcd_pci_suspend_noirq, .restore_noirq = hcd_pci_resume_noirq,
The documentation for the freeze() method says that it "should quiesce the device so that it doesn't generate IRQs or DMA". The unspoken consequence of not doing this is that MSIs aimed at non-boot CPUs may get fully lost if they're sent during the period where the target CPU is offline. The current callbacks for USB HCD do not fully quiesce interrupts, specifically on XHCI. Change to use the full suspend/resume flow for freeze/thaw to ensure interrupts are fully quiesced. This fixes issues where USB devices fail to thaw during hibernation because XHCI misses its interrupt and fails to recover. Signed-off-by: Evan Green <evgreen@chromium.org> --- You may be able to reproduce this issue on your own machine via the following: 1. Disable runtime PM on your XHCI controller 2. Aim your XHCI IRQ at a non-boot CPU (replace 174): echo 2 > /proc/irq/174/smp_affinity 3. Attempt to hibernate (no need to actually go all the way down). I run 2 and 3 in a loop, and can usually hit a hang or dead XHCI controller within 1-2 iterations. I happened to notice this on an Alderlake system where runtime PM is accidentally disabled for one of the XHCI controllers. Some more discussion and debugging can be found at [1]. [1] https://lore.kernel.org/linux-pci/CAE=gft4a-QL82iFJE_xRQ3JrMmz-KZKWREtz=MghhjFbJeK=8A@mail.gmail.com/T/#u --- drivers/usb/core/hcd-pci.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-)