Message ID | 20241014210840.5941d336@foxbook |
---|---|
Headers | show |
Series | Fix the NEC stop bug workaround | expand |
On Mon, Oct 14, 2024 at 09:10:05PM +0200, Michal Pecio wrote: > The NEC uPD720200 has a bug, which prevents reliably stopping > an endpoint shortly after it has been restarted. This usually > happens when a driver kills many URBs in quick succession and > it results in concurrent execution and cancellation of TDs. > > This is handled by stopping the endpoint again if in doubt. > > This "doubt" turns out to be a problem, because Stop Endpoint > may be queued when the EP is already Stopped (for Set TR Deq > execution, for example) or becomes Stopped concurrently (by > Reset Endpoint, for example). If the EP is truly Stopped, the > command fails and further retries just keep failing forever. > > This is easily triggered by modifying uvcvideo to unlink its > isochronous URBs in 100us intervals instead of poisoning them. > Any driver that unlinks URBs asynchronously may trigger this, > and any URB unlink during ongoing halt recovery also can. > > Fix the problem by tracking redundant Stop Endpoint commands > which are sure to fail, and by not retrying them. It's easy, > because xhci_urb_dequeue() is the only user ever queuing the > command with the default handler and without ensuring that > the endpoint is Running and will not Halt before it Stops. > For this case, we assume that an endpoint with pending URBs > is always Running, unless certain operations are pending on > it which indicate known exceptions. > > Note that we need to catch those exceptions when they occur, > because their flags may be cleared before our handler runs. > > It's possible that other HCs have similar bugs (see also the > related "Running" case below), but the workaround is limited > to NEC because no such chips are currently known and tested. > > Fixes: fd9d55d190c0 ("xhci: retry Stop Endpoint on buggy NEC controllers") > Signed-off-by: Michal Pecio <michal.pecio@gmail.com> > --- > drivers/usb/host/xhci-ring.c | 44 +++++++++++++++++++++++++++++++++--- > drivers/usb/host/xhci.h | 2 ++ > 2 files changed, 43 insertions(+), 3 deletions(-) > Hi, This is the friendly patch-bot of Greg Kroah-Hartman. You have sent him a patch that has triggered this response. He used to manually respond to these common problems, but in order to save his sanity (he kept writing the same thing over and over, yet to different people), I was created. Hopefully you will not take offence and will fix the problem in your patch and resubmit it so that it can be accepted into the Linux kernel tree. You are receiving this message because of the following common error(s) as indicated below: - You have marked a patch with a "Fixes:" tag for a commit that is in an older released kernel, yet you do not have a cc: stable line in the signed-off-by area at all, which means that the patch will not be applied to any older kernel releases. To properly fix this, please follow the documented rules in the Documentation/process/stable-kernel-rules.rst file for how to resolve this. If you wish to discuss this problem further, or you have questions about how to resolve this issue, please feel free to respond to this email and Greg will reply once he has dug out from the pending patches received from other developers. thanks, greg k-h's patch email bot
On 14.10.2024 22.08, Michal Pecio wrote: > Hi, > > I found an unfortunate problem with my workaround for this hardware bug. > > To recap, Stop Endpoint sometimes fails, the Endpoint Context says the > EP is Stopped, but cancelled TRBs are still executed. I found this bug > earlier this year and submitted a workaround, which retries the command > (sometimes a few times) and all is good. > > This works fine for common cases, but what if the endpoint is really > stopped? Then Stop Endpoint is supposed to fail and fail it does. The > workaround code doesn't know that it happened and retries infinitely. > > I have never seen it in normal use, but I devised a reliable repro. > The effect isn't pretty - no URBs can be cancelled, device gets stuck, > if unplugged it locks up connections/disconnections on the whole bus. > > With some experimentation I found that the bug is a variant of the old > "stop after restart" issue - the doorbell ring is internally reordered > after the subsequent command. By busy-waiting I confirmed that EP state > which is initially seen as Stopped becomes Running some time later. > Seems host controllers aren't designed to stop, move dequeue, and restart an endpoint in quick succession. In addition to fixing this NEC case we could think about avoiding these cases, some could be avoided by adding a new ".flush_endpoint()" callback to the USB host side API. Usb core itself has a usb_hcd_flush_endpoint() function that calls .urb_dequeue() in a loop for each queued URB, causing host to issue the stop, move deq and ring doorbell for every URB. If usbcore knows all URBs will be cancelled it could let host do it in one go. i.e. stop endpoint once. Thanks Mathias
> Can we skip the new flag and just check for the correct flags here > directly? > > if (ep->ep_state & (SET_DEQ_PENDING | EP_HALTED | EP_CLEARING_TT) > break; Unfortunately not, because those pending operations may (and usually will) complete before our handler runs. They will not restart the EP because we set EP_STOP_CMD_PENDING, but they will clear their flags. So we know that Stop Endpoint is guaranteed to fail, but its handler will not see those flags and will have no clue why it failed, hence we store this one bit of knowledge specially for its use. But you raise a valid point. If Stop EP fails on a Halted endpoint and somebody else resets it before Stop EP handler runs, the handler will see EP_HALTED, because Reset EP handler must run later if the commands were queued and executed in this order. So if Stop EP handler tests for EP_HALTED, nobody needs to worry about updating EP_STOP_CMD_REDUNDANT for us. The helper function can go out, the patch is shorter, and the solution more robust against any changes to halt recovery code that anyone might do. All that "redundant" logic becomes concentrated in queue/handle _stop_endpoint() functions. I think I will do a v2. By the way, is this list of conditions complete? There are other flags like GETTING_STREAMS or CLEAR_TOGGLE, but I'm under impression that they are valid only with no queued URBs, so nothing can be cancelled then. Regards, Michal
On Tue, Oct 15, 2024 at 03:23:23PM +0300, Mathias Nyman wrote: > In addition to fixing this NEC case we could think about avoiding these > cases, some could be avoided by adding a new ".flush_endpoint()" callback to > the USB host side API. Usb core itself has a usb_hcd_flush_endpoint() function > that calls .urb_dequeue() in a loop for each queued URB, causing host to > issue the stop, move deq and ring doorbell for every URB. > > If usbcore knows all URBs will be cancelled it could let host do it in one go. > i.e. stop endpoint once. Indeed, this makes a lot of sense, and I have long thought that the API should have been designed this way from the beginning. At least for non-Control transfers, unlinking a single URB somewhere inside a sequence of URBs seems pointless. I doubt that it ever happens in the kernel. (On the other hand, it _is_ reasonable to do this for Control transfers, because they can come from several different sources, not just from the device's driver. The source for a Control URB might want to unlink it while not affecting the URBs from other sources.) Furthermore, I suspect this is what Windows does and what the USBIF originally had in mind for URB management. (It's harder to tell what they thought about Control transfers, though.) Alan Stern
> > With some experimentation I found that the bug is a variant of the > > old "stop after restart" issue - the doorbell ring is internally > > reordered after the subsequent command. By busy-waiting I confirmed > > that EP state which is initially seen as Stopped becomes Running > > some time later. > > Seems host controllers aren't designed to stop, move dequeue, and > restart an endpoint in quick succession. As it was you who added the Running case handling, do you know hardware other than NEC which triggers this? Or could it be just a single vendor who screwed up once 15 years ago and caused all the chaos? NEC sometimes triggers the Running case too and it is obvious why. I'm not sure how I missed it back in January and assumed it's some sort of random failure for no reason. BTW, the NEC problem appears to be limited to periodic endpoints. I am unable to reproduce it on bulk. I thought that I reproduced it on bulk back then, but on second thought it may have been interrupt, which that device also has. Unfortunatel I wasn't printing endpoint numbers then. Regards, Michal
On 16.10.2024 8.47, Michał Pecio wrote: >>> With some experimentation I found that the bug is a variant of the >>> old "stop after restart" issue - the doorbell ring is internally >>> reordered after the subsequent command. By busy-waiting I confirmed >>> that EP state which is initially seen as Stopped becomes Running >>> some time later. >> >> Seems host controllers aren't designed to stop, move dequeue, and >> restart an endpoint in quick succession. > > As it was you who added the Running case handling, do you know hardware > other than NEC which triggers this? Or could it be just a single vendor > who screwed up once 15 years ago and caused all the chaos? > > NEC sometimes triggers the Running case too and it is obvious why. I'm > not sure how I missed it back in January and assumed it's some sort of > random failure for no reason. > > BTW, the NEC problem appears to be limited to periodic endpoints. I am > unable to reproduce it on bulk. I thought that I reproduced it on bulk > back then, but on second thought it may have been interrupt, which that > device also has. Unfortunatel I wasn't printing endpoint numbers then. > > Regards, > Michal Sorry about the reply delay. I don't think this is a NEC only issue. I was originally fixing halted endpoints at stop endpoint command completion, did some stress testing, and was able to hit that running case on Intel xHC controllers See: 9ebf30007858 xhci: Fix halted endpoint at stop endpoint command completion 1174d44906d5 xhci: handle stop endpoint command completion with endpoint in running state. I also just got a report off-list about an exactly similar case as yours, endpoint stopped with ctx error, endpoint state was still stopped even if doorbell was already rung. This caused Set TR Deq command to fail with context error as endpoint was running by the time this command was processed. This was on a Intel host, se we need a generic solution to this. Thanks -Mathias