Message ID | 4d7aa3f4-22d9-9f5a-3d70-1bd7148ff4ba@google.com |
---|---|
State | New |
Headers | show |
Series | Kernel panic when unbinding UVC gadget function | expand |
Hi Avichal - thanks for all the detail On 20/07/2023 02:28, Avichal Rakesh wrote: > Hey all, > > I recently ran into a kernel panic when testing the UVC Gadget Driver. > The device ramdumps with the following stack when removing the UVC config from > configfs: > > KP: Oops - BUG: Fatal exception: comm:Thread-6 PC:__list_del_entry_valid+0xb0/0xc4 LR:__list_del_entry_valid+0xb0/0xc4 > PC: __list_del_entry_valid+0xb0 <FFFFFFE685330294> > LR: __list_del_entry_valid+0xb0 <FFFFFFE685330294> > > [<FFFFFFE685330294>] __list_del_entry_valid+0xb0 > [<FFFFFFE6857E50AC>] v4l2_fh_del+0x78 > [<FFFFFFE685769774>] uvc_v4l2_release+0xd0 > [<FFFFFFE6857D9B10>] v4l2_release+0xcc > [<FFFFFFE684EE192C>] __fput+0xf8 > [<FFFFFFE684EE17CC>] ____fput+0xc > [<FFFFFFE684B5C9E0>] task_work_run+0x138 > > This looks like a side effect of > https://lore.kernel.org/lkml/20230608204517.105396-1-badhri@google.com/. > Effectively, UVC function tried to disconnect the gadget before > cleaning up resources. However, usb_gadget_unregister_driver which is > removing the function prevents the gadget from disconnecting until the > function is unbound. > > As of the patch mentioned above, gadget_unbind_driver holds > udc->connect_lock and calls both usb_gadget_disconnect_locked and > udc->driver->unbind one after the other. > > usb_gadget_disconnect_locked calls into UVC Gadget driver as follows: > > 1. usb_gadget_disconnect_locked > 2. configfs_composite_disconnect > 3. __composite_disconnect > 4. uvc_function_disable > > udc->driver->unbind calls into UVC Gadget driver as follows: > > 1. udc->driver->unbind > 2. configfs_composite_unbind > 3. purge_configs_funcs > 4. uvc_function_unbind > > uvc_function_disable notifies the userspace application with > UVC_EVENT_DISCONNECTED which causes the V4L2 node to be released > (or unsubscribed to). Either on unsubscribe or on release, the UVC Gadget > Driver calls uvc_function_disconnect before cleaning up resources. Following > is the problematic stack from uvc_v4l2_disable. > > 1. uvc_v4l2_disable > 2. uvc_function_disconnect > 3. usb_function_deactivate > 4. usb_gadget_deactivate > > usb_gadget_deactivate attempts to lock udc->connect_lock as of the patch > mentioned above. > > This means that attempting to unregister the UVC Gadget Driver results in the > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only > be released after uvc_function_unbind finishes. This results in either the > gadget deactivating after the unbind process has finished, or in a Kernel Panic > as it tries to cleanup a V4L2 node that has been purged. > > This leaves us with two options: > 1. We "fix" the locking in core.c to restore old behavior, and let the > usb_gadget_deactivate call go through without locking. However, > I am not sure about the specifics of the patch were and what exact issue it > was trying to fix. > > Badhri, would you know if relaxing the constraints on > usb_gadget_deactivate is feasible? It is possible that other gadget drivers > run into similar issues as UVC driver. > > 3. UVC Gadget Driver calls usb_function_deactivate to take down the gadget if > the userspace application stops listening to the V4L2 node. However, AFAICT > disable is called as a part of the gadget resetting. So, if the V4L2 node > is released because of UVC_EVENT_DISCONNECT, we can skip calling > usb_function_deactivate as the gadget will be reset anyway. > > usb_function documentation seems to agree that if 'disable' is called, > the gadget will be reset/reconfigured shortly: > > @disable: (REQUIRED) Indicates the function should be disabled. Reasons > * include host resetting or reconfiguring the gadget, and disconnection. > > A dirty Patch for option 2 is attached below which skips calling > usb_function_deactivate if uvc_function_disable was called before. It seems > to work okay in testing. Let me know if the analysis and solutions seems okay > and I can upload a formal patch. For what it's worth the analysis makes sense; the patch looks ok to me so if the conclusion is to fix the problem that way I think it's fine, but I'm more inclined to consider this a locking problem in core - it'd be better to fix it there I think. > Thank you! > > --- > drivers/usb/gadget/function/f_uvc.c | 12 ++++++++++-- > drivers/usb/gadget/function/uvc.h | 1 + > 2 files changed, 11 insertions(+), 2 deletions(-) > > diff --git a/drivers/usb/gadget/function/f_uvc.c b/drivers/usb/gadget/function/f_uvc.c > index 5e919fb65833..cef92243f1f7 100644 > --- a/drivers/usb/gadget/function/f_uvc.c > +++ b/drivers/usb/gadget/function/f_uvc.c > @@ -385,7 +385,7 @@ uvc_function_disable(struct usb_function *f) > v4l2_event.type = UVC_EVENT_DISCONNECT; > v4l2_event_queue(&uvc->vdev, &v4l2_event); > > - uvc->state = UVC_STATE_DISCONNECTED; > + uvc->state = UVC_STATE_HOST_DISCONNECTED; > > usb_ep_disable(uvc->video.ep); > if (uvc->enable_interrupt_ep) > @@ -410,8 +410,16 @@ uvc_function_disconnect(struct uvc_device *uvc) > { > int ret; > > - if ((ret = usb_function_deactivate(&uvc->func)) < 0) > + if (uvc->state == UVC_STATE_HOST_DISCONNECTED) { > + /* > + * Don't deactivate gadget as this is being called in > + * response to the host resetting. Gadget will be deactivated > + * anyway. Just update to state as acknowledgement > + */ > + uvc->state = UVC_STATE_DISCONNECTED; > + } else if ((ret = usb_function_deactivate(&uvc->func)) < 0) { > uvcg_info(&uvc->func, "UVC disconnect failed with %d\n", ret); > + } > } > > /* -------------------------------------------------------------------------- > diff --git a/drivers/usb/gadget/function/uvc.h b/drivers/usb/gadget/function/uvc.h > index 100475b1363e..f1e2bc98dc61 100644 > --- a/drivers/usb/gadget/function/uvc.h > +++ b/drivers/usb/gadget/function/uvc.h > @@ -120,6 +120,7 @@ struct uvc_video { > }; > > enum uvc_state { > + UVC_STATE_HOST_DISCONNECTED, > UVC_STATE_DISCONNECTED, > UVC_STATE_CONNECTED, > UVC_STATE_STREAMING,
On Fri, Jul 21, 2023 at 04:57:51PM +0100, Dan Scally wrote: > Hi Avichal - thanks for all the detail > > On 20/07/2023 02:28, Avichal Rakesh wrote: > > This looks like a side effect of > > https://lore.kernel.org/lkml/20230608204517.105396-1-badhri@google.com/. > > Effectively, UVC function tried to disconnect the gadget before > > cleaning up resources. However, usb_gadget_unregister_driver which is > > removing the function prevents the gadget from disconnecting until the > > function is unbound. > > A dirty Patch for option 2 is attached below which skips calling > > usb_function_deactivate if uvc_function_disable was called before. It seems > > to work okay in testing. Let me know if the analysis and solutions seems okay > > and I can upload a formal patch. > > > For what it's worth the analysis makes sense; the patch looks ok to me so if > the conclusion is to fix the problem that way I think it's fine, but I'm > more inclined to consider this a locking problem in core - it'd be better to > fix it there I think. I'm not so sure that handling this in the core is feasible. Removing the driver obviously needs to be synchronized with deactivation, since the two actions affect the same parts of the state (i.e., the pull-ups and the "connected" flag). Consequently I don't see how to avoid a deadlock if the driver's unbind callback does a deactivate. Besides, as the patch mentions, doing so is never necessary. However, even with that call removed we could still have a problem. I don't know much about how the UVC function driver works, but it would be reasonable for the driver to have a private mutex that gets held both during unbind and when the user application closes the V4L2 node. Then there's an ABBA locking issue: Unbind: The UDC core holds connect_lock while calling the UVC unbind handler, which needs to acquire the private mutex. Close node: The UVC driver holds the private mutex while doing a deactivate, which needs to acquire connect_lock. Any ideas on how to clear this up? Alan Stern
Thank you both, Dan and Alan, for your comments! On Fri, Jul 21, 2023 at 12:32 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > On Fri, Jul 21, 2023 at 04:57:51PM +0100, Dan Scally wrote: > > Hi Avichal - thanks for all the detail > > > > > A dirty Patch for option 2 is attached below which skips calling > > > usb_function_deactivate if uvc_function_disable was called before. It seems > > > to work okay in testing. Let me know if the analysis and solutions seems okay > > > and I can upload a formal patch. > > > > > > For what it's worth the analysis makes sense; the patch looks ok to me so if > > the conclusion is to fix the problem that way I think it's fine, but I'm > > more inclined to consider this a locking problem in core - it'd be better to > > fix it there I think. > > I'm not so sure that handling this in the core is feasible. Removing > the driver obviously needs to be synchronized with deactivation, since > the two actions affect the same parts of the state (i.e., the pull-ups > and the "connected" flag). I don't have the full context on what caused the locking to be added, but now that it in place, it seems like there needs to be a clarification of expectation between core and the gadget drivers. Is it valid for the gadget drivers to call usb_gadget_deactivate (and similar functions) as a part of disable/unbind (in terms of API/expectations)? 1. If yes, maybe core can track when it is in the middle of resetting and drop calls to usb_gadget_deactivate if called in the middle of the disconnect--->unbind sequence. This is effectively what the patch above does in UVC driver, but core might (with some extra state) have stronger guarantees of when a call is redundant and can be safely dropped. 2. If no, then it becomes the gadget's responsibility to ensure that it doesn't call any of the usb_gadget_* functions when disabling/unbinding. However, it does require core to provide some concrete rules around when things are safe to call, and when they aren't. > > Consequently I don't see how to avoid a deadlock if the driver's unbind > callback does a deactivate. Besides, as the patch mentions, doing so is > never necessary. > > However, even with that call removed we could still have a problem. I > don't know much about how the UVC function driver works, but it would be > reasonable for the driver to have a private mutex that gets held both > during unbind and when the user application closes the V4L2 node. Then > there's an ABBA locking issue: > > Unbind: The UDC core holds connect_lock while calling the UVC > unbind handler, which needs to acquire the private mutex. > > Close node: The UVC driver holds the private mutex while doing > a deactivate, which needs to acquire connect_lock. > > Any ideas on how to clear this up? > I think my question above gives us two options out based on the answer: 1. Core handling redundant calls is the more bullet-proof solution IMO. It means that the gadget driver never holds connect_lock when it shouldn't. No more ABBA! One potential implementation is to track when core is resetting in a protected flag. All functions related to resetting/disconnecting would check the flag before locking connect_lock and would become no-ops if gadget is in the middle of resetting. 2. Some stronger guarantees will let the gadget driver's state machine decide if it can call usb_gadget_* functions. For example, if we can say for sure that disable call will always be followed by the unbind call, and that usb_gadget_* functions are disallowed between the two, then UVC driver can handle ABBA situation by tracking when it is between a disable and unbind call and skip calling usb_gadget_* function until unbind finishes. The downside of (2), is that a poorly written (or malicious) gadget driver can grind the gadget to a halt with a somewhat simple deadlock. Unfortunately, I am travelling over the next week, but I'll try to create and attach a dirty patch for core to handle redundant calls to usb_gadget_* over the next week. I am fairly new and don't know the full semantics around core, so if I am missing something simple here, please do let me know! Regards, Avi
On Fri, Jul 21, 2023 at 03:44:52PM -0700, Avichal Rakesh wrote: > Thank you both, Dan and Alan, for your comments! > > On Fri, Jul 21, 2023 at 12:32 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > > > On Fri, Jul 21, 2023 at 04:57:51PM +0100, Dan Scally wrote: > > > Hi Avichal - thanks for all the detail > > > > > > > A dirty Patch for option 2 is attached below which skips calling > > > > usb_function_deactivate if uvc_function_disable was called before. It seems > > > > to work okay in testing. Let me know if the analysis and solutions seems okay > > > > and I can upload a formal patch. > > > > > > > > > For what it's worth the analysis makes sense; the patch looks ok to me so if > > > the conclusion is to fix the problem that way I think it's fine, but I'm > > > more inclined to consider this a locking problem in core - it'd be better to > > > fix it there I think. > > > > I'm not so sure that handling this in the core is feasible. Removing > > the driver obviously needs to be synchronized with deactivation, since > > the two actions affect the same parts of the state (i.e., the pull-ups > > and the "connected" flag). > > I don't have the full context on what caused the locking to be added, > but now that it > in place, it seems like there needs to be a clarification of > expectation between core > and the gadget drivers. Is it valid for the gadget drivers to call > usb_gadget_deactivate (and similar functions) as a part of disable/unbind > (in terms of API/expectations)? > > 1. If yes, maybe core can track when it is in the middle of resetting and > drop calls to usb_gadget_deactivate if called in the middle of the > disconnect--->unbind sequence. This is effectively what the patch above > does in UVC driver, but core might (with some extra state) have stronger > guarantees of when a call is redundant and can be safely dropped. > > 2. If no, then it becomes the gadget's responsibility to ensure that it doesn't > call any of the usb_gadget_* functions when disabling/unbinding. However, it > does require core to provide some concrete rules around when things are safe > to call, and when they aren't. > > > > > Consequently I don't see how to avoid a deadlock if the driver's unbind > > callback does a deactivate. Besides, as the patch mentions, doing so is > > never necessary. > > > > However, even with that call removed we could still have a problem. I > > don't know much about how the UVC function driver works, but it would be > > reasonable for the driver to have a private mutex that gets held both > > during unbind and when the user application closes the V4L2 node. Then > > there's an ABBA locking issue: > > > > Unbind: The UDC core holds connect_lock while calling the UVC > > unbind handler, which needs to acquire the private mutex. > > > > Close node: The UVC driver holds the private mutex while doing > > a deactivate, which needs to acquire connect_lock. > > > > Any ideas on how to clear this up? > > > > I think my question above gives us two options out based on the answer: > > 1. Core handling redundant calls is the more bullet-proof solution IMO. It > means that the gadget driver never holds connect_lock when it shouldn't. > No more ABBA! > > One potential implementation is to track when core is resetting in a protected > flag. All functions related to resetting/disconnecting would check the > flag before > locking connect_lock and would become no-ops if gadget is in the middle of > resetting. > > 2. Some stronger guarantees will let the gadget driver's state machine decide > if it can call usb_gadget_* functions. For example, if we can say for sure that > disable call will always be followed by the unbind call, and that usb_gadget_* > functions are disallowed between the two, then UVC driver can handle ABBA > situation by tracking when it is between a disable and unbind call and skip > calling usb_gadget_* function until unbind finishes. > > The downside of (2), is that a poorly written (or malicious) gadget driver can > grind the gadget to a halt with a somewhat simple deadlock. > > Unfortunately, I am travelling over the next week, but I'll try to > create and attach > a dirty patch for core to handle redundant calls to usb_gadget_* over the next > week. > > I am fairly new and don't know the full semantics around core, so if I > am missing > something simple here, please do let me know! Here's a proposal, along the lines of your first suggestion above. The idea is to avoid holding the connect_lock mutex while invoking a gadget driver's callbacks. Unfortunately, this is unavoidable in the case of the ->disconnect() callback, but that's okay because the kerneldoc already says that ->disconnect() may be called in_interrupt, so it's not allowed to call any core routines that may sleep. Just to make this perfectly clear, the patch adds a couple of comments along these lines. As far as I can tell, there is no real reason to hold connect_lock during the ->unbind() callback. Not doing so should fix the problem in the UVC function driver. Let me know if this works. Alan Stern drivers/usb/gadget/udc/core.c | 9 +++++++++ 1 file changed, 9 insertions(+) Index: usb-devel/drivers/usb/gadget/udc/core.c =================================================================== --- usb-devel.orig/drivers/usb/gadget/udc/core.c +++ usb-devel/drivers/usb/gadget/udc/core.c @@ -822,6 +822,9 @@ EXPORT_SYMBOL_GPL(usb_gadget_disconnect) * usb_gadget_activate() is called. For example, user mode components may * need to be activated before the system can talk to hosts. * + * This routine may sleep; it must not be called in interrupt context + * (such as from within a gadget driver's disconnect() callback). + * * Returns zero on success, else negative errno. */ int usb_gadget_deactivate(struct usb_gadget *gadget) @@ -860,6 +863,8 @@ EXPORT_SYMBOL_GPL(usb_gadget_deactivate) * This routine activates gadget which was previously deactivated with * usb_gadget_deactivate() call. It calls usb_gadget_connect() if needed. * + * This routine may sleep; it must not be called in interrupt context. + * * Returns zero on success, else negative errno. */ int usb_gadget_activate(struct usb_gadget *gadget) @@ -1639,7 +1644,11 @@ static void gadget_unbind_driver(struct usb_gadget_disable_async_callbacks(udc); if (gadget->irq) synchronize_irq(gadget->irq); + mutex_unlock(&udc->connect_lock); + udc->driver->unbind(gadget); + + mutex_lock(&udc->connect_lock); usb_gadget_udc_stop_locked(udc); mutex_unlock(&udc->connect_lock);
On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > On Fri, Jul 21, 2023 at 03:44:52PM -0700, Avichal Rakesh wrote: > > Thank you both, Dan and Alan, for your comments! > > > > On Fri, Jul 21, 2023 at 12:32 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > > > > > On Fri, Jul 21, 2023 at 04:57:51PM +0100, Dan Scally wrote: > > > > Hi Avichal - thanks for all the detail > > > > > > > > > A dirty Patch for option 2 is attached below which skips calling > > > > > usb_function_deactivate if uvc_function_disable was called before. It seems > > > > > to work okay in testing. Let me know if the analysis and solutions seems okay > > > > > and I can upload a formal patch. > > > > > > > > > > > > For what it's worth the analysis makes sense; the patch looks ok to me so if > > > > the conclusion is to fix the problem that way I think it's fine, but I'm > > > > more inclined to consider this a locking problem in core - it'd be better to > > > > fix it there I think. > > > > > > I'm not so sure that handling this in the core is feasible. Removing > > > the driver obviously needs to be synchronized with deactivation, since > > > the two actions affect the same parts of the state (i.e., the pull-ups > > > and the "connected" flag). > > > > I don't have the full context on what caused the locking to be added, > > but now that it > > in place, it seems like there needs to be a clarification of > > expectation between core > > and the gadget drivers. Is it valid for the gadget drivers to call > > usb_gadget_deactivate (and similar functions) as a part of disable/unbind > > (in terms of API/expectations)? > > > > 1. If yes, maybe core can track when it is in the middle of resetting and > > drop calls to usb_gadget_deactivate if called in the middle of the > > disconnect--->unbind sequence. This is effectively what the patch above > > does in UVC driver, but core might (with some extra state) have stronger > > guarantees of when a call is redundant and can be safely dropped. > > > > 2. If no, then it becomes the gadget's responsibility to ensure that it doesn't > > call any of the usb_gadget_* functions when disabling/unbinding. However, it > > does require core to provide some concrete rules around when things are safe > > to call, and when they aren't. > > > > > > > > Consequently I don't see how to avoid a deadlock if the driver's unbind > > > callback does a deactivate. Besides, as the patch mentions, doing so is > > > never necessary. > > > > > > However, even with that call removed we could still have a problem. I > > > don't know much about how the UVC function driver works, but it would be > > > reasonable for the driver to have a private mutex that gets held both > > > during unbind and when the user application closes the V4L2 node. Then > > > there's an ABBA locking issue: > > > > > > Unbind: The UDC core holds connect_lock while calling the UVC > > > unbind handler, which needs to acquire the private mutex. > > > > > > Close node: The UVC driver holds the private mutex while doing > > > a deactivate, which needs to acquire connect_lock. > > > > > > Any ideas on how to clear this up? > > > > > > > I think my question above gives us two options out based on the answer: > > > > 1. Core handling redundant calls is the more bullet-proof solution IMO. It > > means that the gadget driver never holds connect_lock when it shouldn't. > > No more ABBA! > > > > One potential implementation is to track when core is resetting in a protected > > flag. All functions related to resetting/disconnecting would check the > > flag before > > locking connect_lock and would become no-ops if gadget is in the middle of > > resetting. > > > > 2. Some stronger guarantees will let the gadget driver's state machine decide > > if it can call usb_gadget_* functions. For example, if we can say for sure that > > disable call will always be followed by the unbind call, and that usb_gadget_* > > functions are disallowed between the two, then UVC driver can handle ABBA > > situation by tracking when it is between a disable and unbind call and skip > > calling usb_gadget_* function until unbind finishes. > > > > The downside of (2), is that a poorly written (or malicious) gadget driver can > > grind the gadget to a halt with a somewhat simple deadlock. > > > > Unfortunately, I am travelling over the next week, but I'll try to > > create and attach > > a dirty patch for core to handle redundant calls to usb_gadget_* over the next > > week. > > > > I am fairly new and don't know the full semantics around core, so if I > > am missing > > something simple here, please do let me know! > > Here's a proposal, along the lines of your first suggestion above. The > idea is to avoid holding the connect_lock mutex while invoking a gadget > driver's callbacks. > > Unfortunately, this is unavoidable in the case of the ->disconnect() > callback, but that's okay because the kerneldoc already says that > ->disconnect() may be called in_interrupt, so it's not allowed to call > any core routines that may sleep. Just to make this perfectly clear, > the patch adds a couple of comments along these lines. > > As far as I can tell, there is no real reason to hold connect_lock > during the ->unbind() callback. Not doing so should fix the problem in > the UVC function driver. Thank you for the patch (and apologies for the delay)! This is a neat fix I completely glossed over. Looked around at other places where unbind is called, and it looks like the lock isn't held anywhere else either, so dropping the lock before calling unbind shouldn't break any existing assumptions around the callback. > > Let me know if this works. > > > drivers/usb/gadget/udc/core.c | 9 +++++++++ > 1 file changed, 9 insertions(+) > > Index: usb-devel/drivers/usb/gadget/udc/core.c > =================================================================== > --- usb-devel.orig/drivers/usb/gadget/udc/core.c > +++ usb-devel/drivers/usb/gadget/udc/core.c > @@ -822,6 +822,9 @@ EXPORT_SYMBOL_GPL(usb_gadget_disconnect) > * usb_gadget_activate() is called. For example, user mode components may > * need to be activated before the system can talk to hosts. > * > + * This routine may sleep; it must not be called in interrupt context > + * (such as from within a gadget driver's disconnect() callback). > + * > * Returns zero on success, else negative errno. > */ > int usb_gadget_deactivate(struct usb_gadget *gadget) > @@ -860,6 +863,8 @@ EXPORT_SYMBOL_GPL(usb_gadget_deactivate) > * This routine activates gadget which was previously deactivated with > * usb_gadget_deactivate() call. It calls usb_gadget_connect() if needed. > * > + * This routine may sleep; it must not be called in interrupt context. > + * > * Returns zero on success, else negative errno. > */ > int usb_gadget_activate(struct usb_gadget *gadget) > @@ -1639,7 +1644,11 @@ static void gadget_unbind_driver(struct > usb_gadget_disable_async_callbacks(udc); > if (gadget->irq) > synchronize_irq(gadget->irq); > + mutex_unlock(&udc->connect_lock); > + > udc->driver->unbind(gadget); > + > + mutex_lock(&udc->connect_lock); > usb_gadget_udc_stop_locked(udc); > mutex_unlock(&udc->connect_lock); > > Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC driver can now be unbound without locking up with the V4L2 release callback. Logs confirm that the calls are still being interleaved, but don't result in a deadlock now. Regards, Avi.
On Fri, Jul 28, 2023 at 01:57:15PM +0530, Avichal Rakesh wrote: > On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > Here's a proposal, along the lines of your first suggestion above. The > > idea is to avoid holding the connect_lock mutex while invoking a gadget > > driver's callbacks. > > > > Unfortunately, this is unavoidable in the case of the ->disconnect() > > callback, but that's okay because the kerneldoc already says that > > ->disconnect() may be called in_interrupt, so it's not allowed to call > > any core routines that may sleep. Just to make this perfectly clear, > > the patch adds a couple of comments along these lines. > > > > As far as I can tell, there is no real reason to hold connect_lock > > during the ->unbind() callback. Not doing so should fix the problem in > > the UVC function driver. > > Thank you for the patch (and apologies for the delay)! This is a neat > fix I completely glossed over. Looked around at other places where > unbind is called, and it looks like the lock isn't held anywhere else > either, so dropping the lock before calling unbind shouldn't break any > existing assumptions around the callback. > Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC > driver can now be unbound without locking up with the V4L2 release > callback. Logs confirm that the calls are still being interleaved, but > don't result in a deadlock now. Thanks for trying it out. Is it okay for me to add your "Tested-by:" tag when the patch is submitted? Alan Stern
On Fri, Jul 28, 2023 at 10:05:39AM -0400, Alan Stern wrote: > On Fri, Jul 28, 2023 at 01:57:15PM +0530, Avichal Rakesh wrote: > > On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > > > Here's a proposal, along the lines of your first suggestion above. The > > > idea is to avoid holding the connect_lock mutex while invoking a gadget > > > driver's callbacks. > > > > > > Unfortunately, this is unavoidable in the case of the ->disconnect() > > > callback, but that's okay because the kerneldoc already says that > > > ->disconnect() may be called in_interrupt, so it's not allowed to call > > > any core routines that may sleep. Just to make this perfectly clear, > > > the patch adds a couple of comments along these lines. > > > > > > As far as I can tell, there is no real reason to hold connect_lock > > > during the ->unbind() callback. Not doing so should fix the problem in > > > the UVC function driver. > > > > Thank you for the patch (and apologies for the delay)! This is a neat > > fix I completely glossed over. Looked around at other places where > > unbind is called, and it looks like the lock isn't held anywhere else > > either, so dropping the lock before calling unbind shouldn't break any > > existing assumptions around the callback. > > > Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC > > driver can now be unbound without locking up with the V4L2 release > > callback. Logs confirm that the calls are still being interleaved, but > > don't result in a deadlock now. > > Thanks for trying it out. Is it okay for me to add your "Tested-by:" > tag when the patch is submitted? Another thing... It's not clear that the patch will fix the problem entirely. Your original analysis of the bug stated: > This means that attempting to unregister the UVC Gadget Driver results in the > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only > be released after uvc_function_unbind finishes. This results in either the > gadget deactivating after the unbind process has finished, or in a Kernel Panic > as it tries to cleanup a V4L2 node that has been purged. My patch removes the locking issue. But if an execution path can occur with a lock present, it can also occur when the lock has been removed. That means it may still be possible for the UVC gadget driver to try deactivating the UDC after the unbind process has finished or for it to try cleaning up a V4L2 node that has been purged. If either of those really could have happened (as opposed to just getting stuck in a deadlock, waiting for a mutex that would never be released), then it can still happen with the patch. Fixing them would require changes to the UVC gadget driver. So the problem may not be gone entirely. Alan Stern
On Fri, Jul 28, 2023 at 9:07 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > On Fri, Jul 28, 2023 at 10:05:39AM -0400, Alan Stern wrote: > > On Fri, Jul 28, 2023 at 01:57:15PM +0530, Avichal Rakesh wrote: > > > On Sat, Jul 22, 2023 at 8:57 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > > > > > Here's a proposal, along the lines of your first suggestion above. The > > > > idea is to avoid holding the connect_lock mutex while invoking a gadget > > > > driver's callbacks. > > > > > > > > Unfortunately, this is unavoidable in the case of the ->disconnect() > > > > callback, but that's okay because the kerneldoc already says that > > > > ->disconnect() may be called in_interrupt, so it's not allowed to call > > > > any core routines that may sleep. Just to make this perfectly clear, > > > > the patch adds a couple of comments along these lines. > > > > > > > > As far as I can tell, there is no real reason to hold connect_lock > > > > during the ->unbind() callback. Not doing so should fix the problem in > > > > the UVC function driver. > > > > > > Thank you for the patch (and apologies for the delay)! This is a neat > > > fix I completely glossed over. Looked around at other places where > > > unbind is called, and it looks like the lock isn't held anywhere else > > > either, so dropping the lock before calling unbind shouldn't break any > > > existing assumptions around the callback. > > > > > Tried the patch, and it fixes the issue for UVC Gadget Driver! UVC > > > driver can now be unbound without locking up with the V4L2 release > > > callback. Logs confirm that the calls are still being interleaved, but > > > don't result in a deadlock now. > > > > Thanks for trying it out. Is it okay for me to add your "Tested-by:" > > tag when the patch is submitted? Oh, yes of course! > > Another thing... > > It's not clear that the patch will fix the problem entirely. Your > original analysis of the bug stated: > > > This means that attempting to unregister the UVC Gadget Driver results in the > > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only > > be released after uvc_function_unbind finishes. This results in either the > > gadget deactivating after the unbind process has finished, or in a Kernel Panic > > as it tries to cleanup a V4L2 node that has been purged. > > My patch removes the locking issue. But if an execution path can > occur with a lock present, it can also occur when the lock has been > removed. That means it may still be possible for the UVC gadget driver > to try deactivating the UDC after the unbind process has finished or for > it to try cleaning up a V4L2 node that has been purged. > > If either of those really could have happened (as opposed to just > getting stuck in a deadlock, waiting for a mutex that would never be > released), then it can still happen with the patch. Fixing them would > require changes to the UVC gadget driver. So the problem may not be > gone entirely. > The current situation can theoretically happen without the deadlock, yes, but shouldn't happen in practice. UVC's disable/unbind code flow looks as follows: 1. When disable callback is called, the gadget driver signals the userspace application to close the V4L2 node. 2. Closing the V4L2 node calls the release callback to clean up resources. It is this callback that calls into gadget_deactivate and gets blocked currently (without your patch). 3. Separately, the unbind callback waits on release callback to finish for 500ms, assuming the userspace application to behave well and close the node in a reasonable amount of time. 4. If the release callback still hasn't been called, the V4L2 node is forcefully removed and UVC driver waits for another 1000ms for the release callback to clean up any pending resources. 5. The unbind process continues regardless of the status of release callback after waiting at most 1.5s for release. So the only way to run into the current issue is if the release callback fails to finish in both step 3 and step 4 (for example, due to a deadlock) in the span of 1.5s. It is possible, but fairly unlikely (at least in my limited understanding) for the release callback to be delayed for quite that long. Hope that makes some sense! - Avi.
On Fri, Jul 28, 2023 at 11:41:19PM +0530, Avichal Rakesh wrote: > On Fri, Jul 28, 2023 at 9:07 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > It's not clear that the patch will fix the problem entirely. Your > > original analysis of the bug stated: > > > > > This means that attempting to unregister the UVC Gadget Driver results in the > > > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only > > > be released after uvc_function_unbind finishes. This results in either the > > > gadget deactivating after the unbind process has finished, or in a Kernel Panic > > > as it tries to cleanup a V4L2 node that has been purged. > > > > My patch removes the locking issue. But if an execution path can > > occur with a lock present, it can also occur when the lock has been > > removed. That means it may still be possible for the UVC gadget driver > > to try deactivating the UDC after the unbind process has finished or for > > it to try cleaning up a V4L2 node that has been purged. > > > > If either of those really could have happened (as opposed to just > > getting stuck in a deadlock, waiting for a mutex that would never be > > released), then it can still happen with the patch. Fixing them would > > require changes to the UVC gadget driver. So the problem may not be > > gone entirely. > > > The current situation can theoretically happen without the deadlock, > yes, but shouldn't happen in practice. UVC's disable/unbind code flow > looks as follows: > > 1. When disable callback is called, the gadget driver signals the > userspace application to close the V4L2 node. > 2. Closing the V4L2 node calls the release callback to clean up > resources. It is this callback that calls into gadget_deactivate and > gets blocked currently (without your patch). > 3. Separately, the unbind callback waits on release callback to finish > for 500ms, assuming the userspace application to behave well and close > the node in a reasonable amount of time. > 4. If the release callback still hasn't been called, the V4L2 node is > forcefully removed and UVC driver waits for another 1000ms for the > release callback to clean up any pending resources. > 5. The unbind process continues regardless of the status of release > callback after waiting at most 1.5s for release. > > So the only way to run into the current issue is if the release > callback fails to finish in both step 3 and step 4 (for example, due > to a deadlock) in the span of 1.5s. It is possible, but fairly > unlikely (at least in my limited understanding) for the release > callback to be delayed for quite that long. > > Hope that makes some sense! Yes, and it shows that there really is a bug in the UVC driver. In kernel programming, fairly unlikely == not impossible == bound to happen eventually! I don't know enough about the driver to make any detailed recommendations. But you might consider, for example, that if the unbind routine can get along with forcibly removing the V4L2 node and not waiting for the release callback to finish, then why not have it behave that way all the time? In other words, shorten the timeouts from 500 ms and 1000 ms to 0 ms. Whether you do that or not, someone definitely should fix up the release routine so that it won't get into trouble if it is called after (or concurrently with) all of the cleanup operations -- which is quite likely to happen if those timeouts are eliminated! In particular, it shouldn't call gadget_deactivate unless it knows that an unbind hasn't happened yet. And if that is the case, it should block the unbind routine until gadget_deactivate returns. Basically, it's a bug for a function driver to call any gadget core routine after its unbind callback has returned. Alan Stern
On Sat, Jul 29, 2023 at 2:33 AM Alan Stern <stern@rowland.harvard.edu> wrote: > > On Fri, Jul 28, 2023 at 11:41:19PM +0530, Avichal Rakesh wrote: > > On Fri, Jul 28, 2023 at 9:07 PM Alan Stern <stern@rowland.harvard.edu> wrote: > > > It's not clear that the patch will fix the problem entirely. Your > > > original analysis of the bug stated: > > > > > > > This means that attempting to unregister the UVC Gadget Driver results in the > > > > V4L2 resource cleanup being stuck behind udc->connect_lock, which will only > > > > be released after uvc_function_unbind finishes. This results in either the > > > > gadget deactivating after the unbind process has finished, or in a Kernel Panic > > > > as it tries to cleanup a V4L2 node that has been purged. > > > > > > My patch removes the locking issue. But if an execution path can > > > occur with a lock present, it can also occur when the lock has been > > > removed. That means it may still be possible for the UVC gadget driver > > > to try deactivating the UDC after the unbind process has finished or for > > > it to try cleaning up a V4L2 node that has been purged. > > > > > > If either of those really could have happened (as opposed to just > > > getting stuck in a deadlock, waiting for a mutex that would never be > > > released), then it can still happen with the patch. Fixing them would > > > require changes to the UVC gadget driver. So the problem may not be > > > gone entirely. > > > > > The current situation can theoretically happen without the deadlock, > > yes, but shouldn't happen in practice. UVC's disable/unbind code flow > > looks as follows: > > > > 1. When disable callback is called, the gadget driver signals the > > userspace application to close the V4L2 node. > > 2. Closing the V4L2 node calls the release callback to clean up > > resources. It is this callback that calls into gadget_deactivate and > > gets blocked currently (without your patch). > > 3. Separately, the unbind callback waits on release callback to finish > > for 500ms, assuming the userspace application to behave well and close > > the node in a reasonable amount of time. > > 4. If the release callback still hasn't been called, the V4L2 node is > > forcefully removed and UVC driver waits for another 1000ms for the > > release callback to clean up any pending resources. > > 5. The unbind process continues regardless of the status of release > > callback after waiting at most 1.5s for release. > > > > So the only way to run into the current issue is if the release > > callback fails to finish in both step 3 and step 4 (for example, due > > to a deadlock) in the span of 1.5s. It is possible, but fairly > > unlikely (at least in my limited understanding) for the release > > callback to be delayed for quite that long. > > > > Hope that makes some sense! > > Yes, and it shows that there really is a bug in the UVC driver. In > kernel programming, fairly unlikely == not impossible == bound to happen > eventually! > > I don't know enough about the driver to make any detailed > recommendations. But you might consider, for example, that if the > unbind routine can get along with forcibly removing the V4L2 node and > not waiting for the release callback to finish, then why not have it > behave that way all the time? In other words, shorten the timeouts from > 500 ms and 1000 ms to 0 ms. Forcibly removing the V4L2 node doesn't clean up the associated resources (for example, the fd held by the userspace application), so we risk running into kernel panics if the V4L2 node is forcibly removed without a clean release from the userspace application. I don't see an easy way to reduce or remove the timeouts entirely, but I might be missing something simple again. Dan and Laurent, if you have ideas around this, I am happy to test stuff out! > > Whether you do that or not, someone definitely should fix up the release > routine so that it won't get into trouble if it is called after (or > concurrently with) all of the cleanup operations -- which is quite > likely to happen if those timeouts are eliminated! In particular, it > shouldn't call gadget_deactivate unless it knows that an unbind hasn't > happened yet. And if that is the case, it should block the unbind > routine until gadget_deactivate returns. Basically, it's a bug for a > function driver to call any gadget core routine after its unbind > callback has returned. > This seems like a reasonable check to have. I am out until next week, but I can test and put up a patch once I get back! Thank you! - Avi.
diff --git a/drivers/usb/gadget/function/f_uvc.c b/drivers/usb/gadget/function/f_uvc.c index 5e919fb65833..cef92243f1f7 100644 --- a/drivers/usb/gadget/function/f_uvc.c +++ b/drivers/usb/gadget/function/f_uvc.c @@ -385,7 +385,7 @@ uvc_function_disable(struct usb_function *f) v4l2_event.type = UVC_EVENT_DISCONNECT; v4l2_event_queue(&uvc->vdev, &v4l2_event); - uvc->state = UVC_STATE_DISCONNECTED; + uvc->state = UVC_STATE_HOST_DISCONNECTED; usb_ep_disable(uvc->video.ep); if (uvc->enable_interrupt_ep) @@ -410,8 +410,16 @@ uvc_function_disconnect(struct uvc_device *uvc) { int ret; - if ((ret = usb_function_deactivate(&uvc->func)) < 0) + if (uvc->state == UVC_STATE_HOST_DISCONNECTED) { + /* + * Don't deactivate gadget as this is being called in + * response to the host resetting. Gadget will be deactivated + * anyway. Just update to state as acknowledgement + */ + uvc->state = UVC_STATE_DISCONNECTED; + } else if ((ret = usb_function_deactivate(&uvc->func)) < 0) { uvcg_info(&uvc->func, "UVC disconnect failed with %d\n", ret); + } } /* -------------------------------------------------------------------------- diff --git a/drivers/usb/gadget/function/uvc.h b/drivers/usb/gadget/function/uvc.h index 100475b1363e..f1e2bc98dc61 100644 --- a/drivers/usb/gadget/function/uvc.h +++ b/drivers/usb/gadget/function/uvc.h @@ -120,6 +120,7 @@ struct uvc_video { }; enum uvc_state { + UVC_STATE_HOST_DISCONNECTED, UVC_STATE_DISCONNECTED, UVC_STATE_CONNECTED, UVC_STATE_STREAMING,