Message ID | 20250219181556.1020029-1-ralph.siemsen@linaro.org |
---|---|
State | New |
Headers | show |
Series | [RFC] usb: gadget: u_ether: prevent deadlock under RT | expand |
Hi Sebastian, Thanks for your reply! On Wed, Feb 26, 2025 at 3:29 AM Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote: > Based on the backtrace the problem is within the cdns3. The driver > acquires at the beginning of its threaded routine > spin_lock_irqsave(&priv_dev->lock, flags); > > and then before returning the URB it does > spin_unlock(&priv_dev->lock); > usb_gadget_giveback_request() > > so the lock is dropped but the interrupts are still disabled. This makes > me wonder why using threaded interrupts at all since interrupts are > disabled for the whole routine but then others do the same. I also wondered why threaded interrupts are being used, but I don't know the reason. > If you look at dwc3_thread_interrupt() they have local_bh_disable()/ > enable() before acquiring the lock and this is what I would suggest > doing here, too. The NCM is probably not the only one affected but > everything doing network that may since it may recourse into softirq. Thanks for the suggestion. I had also thought of doing this. I also noticed that the "cdnsp" driver has a similar fix in commit 58f2fcb3a845 ("usb: cdnsp: Fix deadlock issue during using NCM gadget"), which was also backported to stable branches. So I will prepare a v2 patch to do the same for "cdns3" driver. Regards Ralph
diff --git a/drivers/usb/gadget/function/u_ether.c b/drivers/usb/gadget/function/u_ether.c index 09e2838917e2..dc511c01b741 100644 --- a/drivers/usb/gadget/function/u_ether.c +++ b/drivers/usb/gadget/function/u_ether.c @@ -233,6 +233,7 @@ static void rx_complete(struct usb_ep *ep, struct usb_request *req) if (dev->unwrap) { unsigned long flags; + local_bh_disable(); spin_lock_irqsave(&dev->lock, flags); if (dev->port_usb) { status = dev->unwrap(dev->port_usb, @@ -243,6 +244,7 @@ static void rx_complete(struct usb_ep *ep, struct usb_request *req) status = -ENOTCONN; } spin_unlock_irqrestore(&dev->lock, flags); + local_bh_enable(); } else { skb_queue_tail(&dev->rx_frames, skb); }
A deadlock can be readily triggered when using NCM gadget with the Cadence cdns3 USB driver, under heavy traffic from "iperf3 --bidir". It has been observed under 6.1, 6.6 and 6.12 kernel versions, but only on PREEMPT_RT. Once deadlocked, even magicsysrq has no effect. However, there is periodic output from the rcu stall detector: [ 71.105941] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: [ 71.105966] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-1): P125/4:b..l [ 71.105992] rcu: (detected by 1, t=5252 jiffies, g=-515, q=7 ncpus=2) [ 71.106003] task:irq/507-s-f4000 state:D stack:0 pid:125 tgid:125 ppid:2 flags:0x00000008 [ 71.106018] Call trace: [ 71.106022] __switch_to+0xf4/0x158 [ 71.106046] __schedule+0x2b4/0x920 [ 71.106055] schedule_rtlock+0x24/0x50 [ 71.106064] rtlock_slowlock_locked+0x348/0xcb8 [ 71.106077] rt_spin_lock+0x88/0xb8 [ 71.106086] eth_start_xmit+0x30/0x1490 [u_ether] /*****/ [ 71.106112] ncm_tx_timeout+0x2c/0x50 [usb_f_ncm] [ 71.106131] __hrtimer_run_queues+0x180/0x378 [ 71.106143] hrtimer_run_softirq+0x90/0x100 [ 71.106151] handle_softirqs.isra.0+0x14c/0x360 [ 71.106165] __local_bh_enable_ip+0x104/0x118 [ 71.106177] __netdev_alloc_skb+0x1e0/0x210 [ 71.106192] ncm_unwrap_ntb+0x1ec/0x528 [usb_f_ncm] [ 71.106206] rx_complete+0x120/0x288 [u_ether] /*****/ [ 71.106221] usb_gadget_giveback_request+0x34/0xf8 [ 71.106236] cdns3_gadget_giveback+0xe4/0x2d0 [cdns3] [ 71.106286] cdns3_transfer_completed+0x3b0/0x630 [cdns3] [ 71.106320] cdns3_device_thread_irq_handler+0x8b8/0xd18 [cdns3] [ 71.106353] irq_thread_fn+0x34/0xb8 [ 71.106364] irq_thread+0x180/0x2f0 [ 71.106374] kthread+0x104/0x118 [ 71.106384] ret_from_fork+0x10/0x20 The deadlock occurs because eth_start_xmit() and rx_complete() both acquire the same spinlock in the same instance of struct eth_dev. The nested call occurs because rx_complete() calls __netdev_alloc_skb() which performs a brief local_bh_disable/enable() sequence. Prevent the deadlock by disabling softirq in rx_complete(), with the same scope as the spinlock. With this fix, no deadlocks are observed over many hours of continuous testing at USB 2.0 speed (480 Mbit/s). Signed-off-by: Ralph Siemsen <ralph.siemsen@linaro.org> --- Discussion items: 0) This is somewhat similar to the recent discussion https://lore.kernel.org/linux-rt-devel/20250212123302.0f620f23@gandalf.local.home/ and my solution is to "sprinkle local_bh_disable() around", which is clearly not optimal. Hence the RFC on this patch. 1) There are several more places using the same spinlock, for example the gether_suspend() and gether_resume() functions. I'm not using power management, but I wonder if there might be more deadlocks lurking? 2) By keeping softirq disabled for a longer duration, this patch could potentially increase the RT latency. I tried to quantify this using cyclictest, but I cannot get a baseline for comparison, since it deadlocks almost immediately when the USB is active. Other suggestions? 3) The fix is in generic u_ether.c code, to match the scope of the spinlock. Alternatively, it could be done in cdns3 specific code, such as in cdns3_device_thread_irq_handler(). I'm not sure if the same problem exists in other USB drivers? --- drivers/usb/gadget/function/u_ether.c | 2 ++ 1 file changed, 2 insertions(+)