diff mbox series

net: usb: lan78xx: Enforce a minimum interrupt polling period

Message ID 20250310165932.1201702-1-fiona.klute@gmx.de
State New
Headers show
Series net: usb: lan78xx: Enforce a minimum interrupt polling period | expand

Commit Message

Fiona Klute March 10, 2025, 4:59 p.m. UTC
If a new reset event appears before the previous one has been
processed, the device can get stuck into a reset loop. This happens
rarely, but blocks the device when it does, and floods the log with
messages like the following:

  lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped

The only bit that the driver pays attention to in the interrupt data
is "link was reset". If there's a flapping status bit in that endpoint
data (such as if PHY negotiation needs a few tries to get a stable
link), polling at a slower rate allows the state to settle.

This is a simplified version of a patch that's been in the Raspberry
Pi downstream kernel since their 4.14 branch, see also:
https://github.com/raspberrypi/linux/issues/2447

Signed-off-by: Fiona Klute <fiona.klute@gmx.de>
Cc: kernel-list@raspberrypi.com
Cc: stable@vger.kernel.org
---
For the stable crew: I've *tested* the patch with 6.12.7 and 6.13.5 on
a Revolution Pi Connect 4 (Raspberry Pi CM4 based device with built-in
LAN7800 as second ethernet port), according to the linked issue for
the RPi downstream kernel the problem should be present in all
maintained longterm kernel versions, too (based on how long they've
carried a patch).

 drivers/net/usb/lan78xx.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)


base-commit: dd83757f6e686a2188997cb58b5975f744bb7786

Comments

Andrew Lunn March 10, 2025, 9:27 p.m. UTC | #1
On Mon, Mar 10, 2025 at 05:59:31PM +0100, Fiona Klute wrote:
> If a new reset event appears before the previous one has been
> processed, the device can get stuck into a reset loop. This happens
> rarely, but blocks the device when it does, and floods the log with
> messages like the following:
> 
>   lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
> 
> The only bit that the driver pays attention to in the interrupt data
> is "link was reset". If there's a flapping status bit in that endpoint
> data (such as if PHY negotiation needs a few tries to get a stable
> link), polling at a slower rate allows the state to settle.

Could you expand on this a little bit more. What is the issue you are
seeing?

I had a quick look at the PHY handling code, and it looks broken. The
only time a MAC driver should look at members of phydev is during the
adjust link callback, so lan78xx_link_status_change(). Everything is
guaranteed to be consistent at this time. However, the current
lan78xx_link_status_change() only adjusts EEE setting. The PHY code in
lan78xx_link_reset() looks wrong. MAC drivers should not be reading
PHY registers, or calling functions like phy_read_status(). Setting
flow control should be performed in lan78xx_link_status_change() using
phydev->pause and phydev->asym_pause.

	Andrew
Fiona Klute March 11, 2025, 2:36 p.m. UTC | #2
Am 11.03.25 um 14:22 schrieb Andrew Lunn:
> On Tue, Mar 11, 2025 at 01:30:54PM +0100, Fiona Klute wrote:
>> Am 10.03.25 um 22:27 schrieb Andrew Lunn:
>>> On Mon, Mar 10, 2025 at 05:59:31PM +0100, Fiona Klute wrote:
>>>> If a new reset event appears before the previous one has been
>>>> processed, the device can get stuck into a reset loop. This happens
>>>> rarely, but blocks the device when it does, and floods the log with
>>>> messages like the following:
>>>>
>>>>     lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
>>>>
>>>> The only bit that the driver pays attention to in the interrupt data
>>>> is "link was reset". If there's a flapping status bit in that endpoint
>>>> data (such as if PHY negotiation needs a few tries to get a stable
>>>> link), polling at a slower rate allows the state to settle.
>>>
>>> Could you expand on this a little bit more. What is the issue you are
>>> seeing?
>>
>> What happens is that *sometimes* when the interface is activated (up, im
>> my case via NetworkManager) during boot, the "kevent 4 may have been
>> dropped" message starts to be emitted about every 6 or 7 ms.
>
> This sounding a bit like an interrupt storm. The PHY interrupt is not
> being cleared correctly. PHY interrupts are level interrupts, so if
> you don't clear the interrupt at the source, it will fire again as
> soon as you re-enable it.
>
> So which PHY driver is being used? If you look for the first kernel
> message about the lan78xx it probably tells you.
>
>> [   27.918335] Call trace:
>> [   27.918338]  console_flush_all+0x2b0/0x4f8 (P)
>> [   27.918346]  console_unlock+0x8c/0x170
>> [   27.918352]  vprintk_emit+0x238/0x3b8
>> [   27.918357]  dev_vprintk_emit+0xe4/0x1b8
>> [   27.918364]  dev_printk_emit+0x64/0x98
>> [   27.918368]  __netdev_printk+0xc8/0x228
>> [   27.918376]  netdev_info+0x70/0xa8
>> [   27.918382]  phy_print_status+0xcc/0x138
>> [   27.918386]  lan78xx_link_status_change+0x78/0xb0
>> [   27.918392]  phy_link_change+0x38/0x70
>> [   27.918398]  phy_check_link_status+0xa8/0x110
>> [   27.918405]  _phy_start_aneg+0x5c/0xb8
>> [   27.918409]  lan88xx_link_change_notify+0x5c/0x128
>> [   27.918416]  _phy_state_machine+0x12c/0x2b0
>> [   27.918420]  phy_state_machine+0x34/0x80
>> [   27.918425]  process_one_work+0x150/0x3b8
>> [   27.918432]  worker_thread+0x2a4/0x4b8
>> [   27.918438]  kthread+0xec/0xf8
>> [   27.918442]  ret_from_fork+0x10/0x20
>> [   27.918534] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
>> [   27.924985] lan78xx 2-3:1.0 enp1s0u3: kevent 4 may have been dropped
>
> Ah, O.K. This tells me the PHY is a lan88xx. And there is a workaround
> involved for an issue in this PHY. Often PHYs are driven by polling
> for status changes once per second. Not all PHYs/boards support
> interrupts. It could be this workaround has only been tested with
> polling, not interrupts, and so is broken when interrupts are used.
>
> As a quick hack test, in lan78xx_phy_init()
>
> 	/* if phyirq is not set, use polling mode in phylib */
> 	if (dev->domain_data.phyirq > 0)
> 		phydev->irq = dev->domain_data.phyirq;
> 	else
> 		phydev->irq = PHY_POLL;
>
> Hard code phydev->irq to PHY_POLL, so interrupts are not used.
>
> See if you can reproduce the issue when interrupts are not used.
Thank you, I'll test that. Given the issue appears rarely it'll
unfortunately take a while to be (mostly) sure.

Best regards,
Fiona
diff mbox series

Patch

diff --git a/drivers/net/usb/lan78xx.c b/drivers/net/usb/lan78xx.c
index a91bf9c7e31d..7bf01a31a932 100644
--- a/drivers/net/usb/lan78xx.c
+++ b/drivers/net/usb/lan78xx.c
@@ -173,6 +173,12 @@ 
 #define INT_EP_GPIO_1			(1)
 #define INT_EP_GPIO_0			(0)
 
+/* highspeed device, so polling interval is in microframes (eight per
+ * millisecond)
+ */
+#define INT_URB_MICROFRAMES_PER_MS	8
+#define MIN_INT_URB_INTERVAL_MS		8
+
 static const char lan78xx_gstrings[][ETH_GSTRING_LEN] = {
 	"RX FCS Errors",
 	"RX Alignment Errors",
@@ -4527,7 +4533,11 @@  static int lan78xx_probe(struct usb_interface *intf,
 	if (ret < 0)
 		goto out4;
 
-	period = ep_intr->desc.bInterval;
+	period = max(ep_intr->desc.bInterval,
+		     MIN_INT_URB_INTERVAL_MS * INT_URB_MICROFRAMES_PER_MS);
+	dev_info(&intf->dev,
+		 "interrupt urb period set to %d, bInterval is %d\n",
+		 period, ep_intr->desc.bInterval);
 	maxp = usb_maxpacket(dev->udev, dev->pipe_intr);
 
 	dev->urb_intr = usb_alloc_urb(0, GFP_KERNEL);