diff mbox series

usb:xhci: Avoid hub_event() stuck when xHC restore state timeout

Message ID TYZPR01MB4784CB3058AC1B5787DB6601D586A@TYZPR01MB4784.apcprd01.prod.exchangelabs.com
State New
Headers show
Series usb:xhci: Avoid hub_event() stuck when xHC restore state timeout | expand

Commit Message

Yaxiong Tian Dec. 4, 2023, 8:02 a.m. UTC
From: Yaxiong Tian <tianyaxiong@kylinos.cn>

when xHc restore state timeout,the xhci_reusme() return -ETIMEDOUT
instantly. After usb_hc_died() called ,they kick hub_wq to running
hub_event() but the wq is freezd. When suspend ends,hub_evnet realy
running and sticking.
Such as:
[  968.794016][ 2] [   T37] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[  968.802969][ 2] [   T37] kworker/2:3     D    0   999      2 0x00000028
[  968.809579][ 2] [   T37] Workqueue: usb_hub_wq hub_event
[  968.814885][ 2] [   T37] Call trace:
[  968.818455][ 2] [   T37]  __switch_to+0xd4/0x138
[  968.823067][ 2] [   T37]  __schedule+0x2dc/0x6a0
[  968.827680][ 2] [   T37]  schedule+0x34/0xb0
[  968.831947][ 2] [   T37]  schedule_timeout+0x1e0/0x298
[  968.837079][ 2] [   T37]  __wait_for_common+0xf0/0x208
[  968.842212][ 2] [   T37]  wait_for_completion+0x1c/0x28
[  968.847432][ 2] [   T37]  xhci_configure_endpoint+0x104/0x640
[  968.853173][ 2] [   T37]  xhci_check_bandwidth+0x140/0x2e0
[  968.858652][ 2] [   T37]  usb_hcd_alloc_bandwidth+0x1c8/0x348
[  968.864393][ 2] [   T37]  usb_disable_device+0x198/0x260
[  968.869698][ 2] [   T37]  usb_disconnect+0xdc/0x3a0
[  968.874571][ 2] [   T37]  usb_disconnect+0xbc/0x3a0
[  968.879441][ 2] [   T37]  hub_quiesce+0xa0/0x108
[  968.884053][ 2] [   T37]  hub_event+0x4d4/0x1558
[  968.888664][ 2] [   T37]  kretprobe_trampoline+0x0/0xc4
[  968.893884][ 2] [   T37]  worker_thread+0x4c/0x488
[  968.898668][ 2] [   T37]  kthread+0xf8/0x128
[  968.902933][ 2] [   T37]  ret_from_fork+0x10/0x18

The result is that you cannot suspend again.because the wq can't
be freezed.Also hard to reboot,when some application visited this
piece.

The reason of stuck is that some access related to xhci hardware
is being called.But xhci has problem,at least not running.(
when xhci_restore_registers(),the xhci will load op_regs.The
CMD_RUN will clear in xhci_suspend().)

So using XHCI_STATE_DYING flag,to avoid any code to touching
hardware immediately.hub_event() will complete.The usb_hc_died
tasks will be completed and some sys interfaces will be removed.

Signed-off-by: Yaxiong Tian <tianyaxiong@kylinos.cn>
---
 drivers/usb/host/xhci.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Mathias Nyman Dec. 5, 2023, 2:58 p.m. UTC | #1
On 4.12.2023 10.02, Yaxiong Tian wrote:
> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
> 
> when xHc restore state timeout,the xhci_reusme() return -ETIMEDOUT

Out of curiosity, have you tried if it still is possible to revive your
xHC controller here?

Instead of returning -ETIMEDOUT, try setting " reinit_xhc = true", and
jump to  "if (reinit_xhc) {"  where we reinitialize xHC in xhci_resume() due
to other resume issues.

> instantly. After usb_hc_died() called ,they kick hub_wq to running
> hub_event() but the wq is freezd. When suspend ends,hub_evnet realy
> running and sticking.
> Such as:
> [  968.794016][ 2] [   T37] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [  968.802969][ 2] [   T37] kworker/2:3     D    0   999      2 0x00000028
> [  968.809579][ 2] [   T37] Workqueue: usb_hub_wq hub_event
> [  968.814885][ 2] [   T37] Call trace:
> [  968.818455][ 2] [   T37]  __switch_to+0xd4/0x138
> [  968.823067][ 2] [   T37]  __schedule+0x2dc/0x6a0
> [  968.827680][ 2] [   T37]  schedule+0x34/0xb0
> [  968.831947][ 2] [   T37]  schedule_timeout+0x1e0/0x298
> [  968.837079][ 2] [   T37]  __wait_for_common+0xf0/0x208
> [  968.842212][ 2] [   T37]  wait_for_completion+0x1c/0x28
> [  968.847432][ 2] [   T37]  xhci_configure_endpoint+0x104/0x640
> [  968.853173][ 2] [   T37]  xhci_check_bandwidth+0x140/0x2e0
> [  968.858652][ 2] [   T37]  usb_hcd_alloc_bandwidth+0x1c8/0x348
> [  968.864393][ 2] [   T37]  usb_disable_device+0x198/0x260
> [  968.869698][ 2] [   T37]  usb_disconnect+0xdc/0x3a0
> [  968.874571][ 2] [   T37]  usb_disconnect+0xbc/0x3a0
> [  968.879441][ 2] [   T37]  hub_quiesce+0xa0/0x108
> [  968.884053][ 2] [   T37]  hub_event+0x4d4/0x1558
> [  968.888664][ 2] [   T37]  kretprobe_trampoline+0x0/0xc4
> [  968.893884][ 2] [   T37]  worker_thread+0x4c/0x488
> [  968.898668][ 2] [   T37]  kthread+0xf8/0x128
> [  968.902933][ 2] [   T37]  ret_from_fork+0x10/0x18
> 
> The result is that you cannot suspend again.because the wq can't
> be freezed.Also hard to reboot,when some application visited this
> piece.
> 
> The reason of stuck is that some access related to xhci hardware
> is being called.But xhci has problem,at least not running.(
> when xhci_restore_registers(),the xhci will load op_regs.The
> CMD_RUN will clear in xhci_suspend().)
> 

Nice catch and debugging work btw.

> So using XHCI_STATE_DYING flag,to avoid any code to touching
> hardware immediately.hub_event() will complete.The usb_hc_died
> tasks will be completed and some sys interfaces will be removed.

The XHCI_STATE_DYING flag is currently only set in xhci_hc_died().
So when this flag is set we could assume that the command ring and
pending URBs are, or will be cleaned up. This would change with your patch.
  
We might need some other solution,
Maybe set the set_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags) after
CNR (Contorller Not Ready) is successfully cleared, and controller is
actually accessible.

We then would need to add checks to see if controller is accessible
before queuing any commands to xHC hardware.

Thanks
Mathias
Yaxiong Tian Dec. 8, 2023, 9:28 a.m. UTC | #2
在 2023/12/5 22:58, Mathias Nyman 写道:
> On 4.12.2023 10.02, Yaxiong Tian wrote:
>> From: Yaxiong Tian <tianyaxiong@kylinos.cn>
>>
>> when xHc restore state timeout,the xhci_reusme() return -ETIMEDOUT
>
> Out of curiosity, have you tried if it still is possible to revive your
> xHC controller here?
>
> Instead of returning -ETIMEDOUT, try setting " reinit_xhc = true", and
> jump to  "if (reinit_xhc) {"  where we reinitialize xHC in 
> xhci_resume() due
> to other resume issues.
Yes it is works.
>
>> instantly. After usb_hc_died() called ,they kick hub_wq to running
>> hub_event() but the wq is freezd. When suspend ends,hub_evnet realy
>> running and sticking.
>> Such as:
>> [  968.794016][ 2] [   T37] "echo 0 > 
>> /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  968.802969][ 2] [   T37] kworker/2:3     D    0   999      2 
>> 0x00000028
>> [  968.809579][ 2] [   T37] Workqueue: usb_hub_wq hub_event
>> [  968.814885][ 2] [   T37] Call trace:
>> [  968.818455][ 2] [   T37]  __switch_to+0xd4/0x138
>> [  968.823067][ 2] [   T37]  __schedule+0x2dc/0x6a0
>> [  968.827680][ 2] [   T37]  schedule+0x34/0xb0
>> [  968.831947][ 2] [   T37]  schedule_timeout+0x1e0/0x298
>> [  968.837079][ 2] [   T37]  __wait_for_common+0xf0/0x208
>> [  968.842212][ 2] [   T37]  wait_for_completion+0x1c/0x28
>> [  968.847432][ 2] [   T37]  xhci_configure_endpoint+0x104/0x640
>> [  968.853173][ 2] [   T37]  xhci_check_bandwidth+0x140/0x2e0
>> [  968.858652][ 2] [   T37]  usb_hcd_alloc_bandwidth+0x1c8/0x348
>> [  968.864393][ 2] [   T37]  usb_disable_device+0x198/0x260
>> [  968.869698][ 2] [   T37]  usb_disconnect+0xdc/0x3a0
>> [  968.874571][ 2] [   T37]  usb_disconnect+0xbc/0x3a0
>> [  968.879441][ 2] [   T37]  hub_quiesce+0xa0/0x108
>> [  968.884053][ 2] [   T37]  hub_event+0x4d4/0x1558
>> [  968.888664][ 2] [   T37]  kretprobe_trampoline+0x0/0xc4
>> [  968.893884][ 2] [   T37]  worker_thread+0x4c/0x488
>> [  968.898668][ 2] [   T37]  kthread+0xf8/0x128
>> [  968.902933][ 2] [   T37]  ret_from_fork+0x10/0x18
>>
>> The result is that you cannot suspend again.because the wq can't
>> be freezed.Also hard to reboot,when some application visited this
>> piece.
>>
>> The reason of stuck is that some access related to xhci hardware
>> is being called.But xhci has problem,at least not running.(
>> when xhci_restore_registers(),the xhci will load op_regs.The
>> CMD_RUN will clear in xhci_suspend().)
>>
>
> Nice catch and debugging work btw.
>
Thanks, I further debug found that the command is Infinite 
retransmission and xHc is
halted. So the __wait_for_common() never complete,because xHc is halted 
and others
don't kown.
Note:For fast reproduction  I force  return -ETIMEDOUT  by setting
      if (1 || xhci_handshake(&xhci->op_regs->status,STS_RESTORE, 0, 100 
* 1000)) )

[  131.024013][ 3] [  T135] xhci_hcd 0000:06:00.0: Command timeout, 
USBSTS: 0x00000001 HCHalted
[  131.032447][ 3] [  T135] xhci_hcd 0000:06:00.0: Command timeout on 
stopped ring
[  131.039745][ 3] [  T135] xhci_hcd 0000:06:00.0: Turn aborted command 
00000000620a076c to no-op
[  131.048346][ 3] [  T135] xhci_hcd 0000:06:00.0: // Ding dong

>> So using XHCI_STATE_DYING flag,to avoid any code to touching
>> hardware immediately.hub_event() will complete.The usb_hc_died
>> tasks will be completed and some sys interfaces will be removed.
>
> The XHCI_STATE_DYING flag is currently only set in xhci_hc_died().
> So when this flag is set we could assume that the command ring and
> pending URBs are, or will be cleaned up. This would change with your 
> patch.
>
I agree it.
> We might need some other solution,
> Maybe set the set_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags) after
> CNR (Contorller Not Ready) is successfully cleared, and controller is
> actually accessible.
>
> We then would need to add checks to see if controller is accessible
> before queuing any commands to xHC hardware.
>
I found another fix based on the debugging results。In xhci_suspend() set
xhci->xhc_state |= XHCI_STATE_HALTED and in xhci_resume() clear it.
like it.

------>8-----
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 08fbabff23a0..f364958c237a 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -928,6 +928,8 @@ int xhci_suspend(struct xhci_hcd *xhci, bool do_wakeup)
          spin_unlock_irq(&xhci->lock);
          return -ETIMEDOUT;
      }
+    xhci->xhc_state |= XHCI_STATE_HALTED;
+
      xhci_clear_command_ring(xhci);

      /* step 3: save registers */
@@ -1119,8 +1121,14 @@ int xhci_resume(struct xhci_hcd *xhci, 
pm_message_t msg)
      command = readl(&xhci->op_regs->command);
      command |= CMD_RUN;
      writel(command, &xhci->op_regs->command);
-    xhci_handshake(&xhci->op_regs->status, STS_HALT,
-          0, 250 * 1000);
+    if(xhci_handshake(&xhci->op_regs->status, STS_HALT,
+          0, 250 * 1000)){
+        xhci_warn(xhci, "WARN: xHC CMD_RUN timeout\n");
+        spin_unlock_irq(&xhci->lock);
+        return -ETIMEDOUT;
+    };
+
+    xhci->xhc_state &= ~XHCI_STATE_HALTED;

      /* step 5: walk topology and initialize portsc,
       * portpmsc and portli
----8<-------------

> Thanks
> Mathias
Thanks
diff mbox series

Patch

diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index cb885ddc2032..f0ddd7421e1f 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -1044,7 +1044,8 @@  int xhci_resume(struct xhci_hcd *xhci, pm_message_t msg)
 		 */
 		if (xhci_handshake(&xhci->op_regs->status,
 			      STS_RESTORE, 0, 100 * 1000)) {
-			xhci_warn(xhci, "WARN: xHC restore state timeout\n");
+			xhci_err(xhci, "xHC restore state timeout, assume dead\n");
+			xhci->xhc_state |= XHCI_STATE_DYING;
 			spin_unlock_irq(&xhci->lock);
 			return -ETIMEDOUT;
 		}