Message ID | 0c39bfed1cf6f7b747e702aa841f82c9d2140f27.1480922249.git.baolin.wang@linaro.org |
---|---|
State | New |
Headers | show |
Hi Mathias, On 12 December 2016 at 23:52, Mathias Nyman <mathias.nyman@linux.intel.com> wrote: > On 05.12.2016 09:51, Baolin Wang wrote: >> >> If a command event is found on the event ring during an interrupt, >> we need to stop the command timer with del_timer(). Since del_timer() >> can fail if the timer is running and waiting on the xHCI lock, then >> it maybe get the wrong timeout command in xhci_handle_command_timeout() >> if host fetched a new command and updated the xhci->current_cmd in >> handle_cmd_completion(). For this situation, we need a way to signal >> to the command timer that everything is fine and it should exit. > > > Ah, right, this could actually happen. > >> >> >> We should introduce a counter (xhci->current_cmd_pending) for the number >> of pending commands. If we need to cancel the command timer and >> del_timer() >> succeeds, we decrement the number of pending commands. If del_timer() >> fails, >> we leave the number of pending commands alone. >> >> For handling timeout command, in xhci_handle_command_timeout() we will >> check >> the counter after decrementing it, if the counter >> (xhci->current_cmd_pending) >> is 0, which means xhci->current_cmd is the right timeout command. If the >> counter (xhci->current_cmd_pending) is greater than 0, which means current >> timeout command has been handled by host and host has fetched new command >> as >> xhci->current_cmd, then just return and wait for new current command. > > > A counter like this could work. > > Writing the abort bit can generate either ABORT+STOP, or just STOP > event, this seems to cover both. > > quick check, case 1: timeout and cmd completion at the same time. > > cpu1 cpu2 > > queue_command(first), p++ (=1) > queue_command(more), > --completion irq fires-- -- timer times out at same time-- > handle_cmd_completion() handle_cmd_timeout(),) > lock(xhci_lock ) spin_on(xhci_lock) > del_timer() fail, p (=1, nochange) > cur_cmd = list_next(), p++ (=2) > unlock(xhci_lock) > lock(xhci_lock) > p-- (=1) > if (p > 0), exit > OK works > > case 2: normal timeout case with ABORT+STOP, no race. > > cpu1 cpu2 > > queue_command(first), p++ (=1) > queue_command(more), > handle_cmd_timeout() > p-- (P=0), don't exit > mod_timer(), p++ (P=1) > write_abort_bit() > handle_cmd_comletion(ABORT) > del_timer(), ok, p-- (p = 0) > handle_cmd_completion(STOP) > del_timer(), fail, (P=0) > handle_stopped_cmd_ring() > cur_cmd = list_next(), p++ (=1) > mod_timer() > > OK, works, and same for just STOP case, with the only difference that > during handle_cmd_completion(STOP) p is decremented (p--) Yes, that's the cases what I want to handle, thanks for your explicit explanation. > > So unless there is a way to find out if cur_cmd is valid in command timeout > in command timeout with the help of existing flags and lists this would be a > working > solution. > > -Mathias > -- Baolin.wang Best Regards -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 19.12.2016 13:34, Baolin Wang wrote: > Hi Mathias, > > On 19 December 2016 at 18:33, Mathias Nyman > <mathias.nyman@linux.intel.com> wrote: >> On 13.12.2016 05:21, Baolin Wang wrote: >>> >>> Hi Mathias, >>> >>> On 12 December 2016 at 23:52, Mathias Nyman >>> <mathias.nyman@linux.intel.com> wrote: >>>> >>>> On 05.12.2016 09:51, Baolin Wang wrote: >>>>> >>>>> >>>>> If a command event is found on the event ring during an interrupt, >>>>> we need to stop the command timer with del_timer(). Since del_timer() >>>>> can fail if the timer is running and waiting on the xHCI lock, then >>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout() >>>>> if host fetched a new command and updated the xhci->current_cmd in >>>>> handle_cmd_completion(). For this situation, we need a way to signal >>>>> to the command timer that everything is fine and it should exit. >>>> >>>> >>>> >>>> Ah, right, this could actually happen. >>>> >>>>> >>>>> >>>>> We should introduce a counter (xhci->current_cmd_pending) for the number >>>>> of pending commands. If we need to cancel the command timer and >>>>> del_timer() >>>>> succeeds, we decrement the number of pending commands. If del_timer() >>>>> fails, >>>>> we leave the number of pending commands alone. >>>>> >>>>> For handling timeout command, in xhci_handle_command_timeout() we will >>>>> check >>>>> the counter after decrementing it, if the counter >>>>> (xhci->current_cmd_pending) >>>>> is 0, which means xhci->current_cmd is the right timeout command. If the >>>>> counter (xhci->current_cmd_pending) is greater than 0, which means >>>>> current >>>>> timeout command has been handled by host and host has fetched new >>>>> command >>>>> as >>>>> xhci->current_cmd, then just return and wait for new current command. >>>> >>>> >>>> >>>> A counter like this could work. >>>> >>>> Writing the abort bit can generate either ABORT+STOP, or just STOP >>>> event, this seems to cover both. >>>> >>>> quick check, case 1: timeout and cmd completion at the same time. >>>> >>>> cpu1 cpu2 >>>> >>>> queue_command(first), p++ (=1) >>>> queue_command(more), >>>> --completion irq fires-- -- timer times out at same time-- >>>> handle_cmd_completion() handle_cmd_timeout(),) >>>> lock(xhci_lock ) spin_on(xhci_lock) >>>> del_timer() fail, p (=1, nochange) >>>> cur_cmd = list_next(), p++ (=2) >>>> unlock(xhci_lock) >>>> lock(xhci_lock) >>>> p-- (=1) >>>> if (p > 0), exit >>>> OK works >>>> >>>> case 2: normal timeout case with ABORT+STOP, no race. >>>> >>>> cpu1 cpu2 >>>> >>>> queue_command(first), p++ (=1) >>>> queue_command(more), >>>> handle_cmd_timeout() >>>> p-- (P=0), don't exit >>>> mod_timer(), p++ (P=1) >>>> write_abort_bit() >>>> handle_cmd_comletion(ABORT) >>>> del_timer(), ok, p-- (p = 0) >>>> handle_cmd_completion(STOP) >>>> del_timer(), fail, (P=0) >>>> handle_stopped_cmd_ring() >>>> cur_cmd = list_next(), p++ (=1) >>>> mod_timer() >>>> >>>> OK, works, and same for just STOP case, with the only difference that >>>> during handle_cmd_completion(STOP) p is decremented (p--) >>> >>> >>> Yes, that's the cases what I want to handle, thanks for your explicit >>> explanation. >>> >> >> Gave this some more thought over the weekend, and this implementation >> doesn't solve the case when the last command times out and races with the >> completion handler: >> >> cpu1 cpu2 >> >> queue_command(first), p++ (=1) >> --completion irq fires-- -- timer times out at same time-- >> handle_cmd_completion() handle_cmd_timeout(),) >> lock(xhci_lock ) spin_on(xhci_lock) >> del_timer() fail, p (=1, nochange) >> no more commands, P (=1, nochange) >> unlock(xhci_lock) >> lock(xhci_lock) >> p-- (=0) >> p == 0, continue, even if we should >> not. >> For this we still need to rely on >> checking cur_cmd == NULL in the timeout function. >> (Baolus patch sets it to NULL if there are no more commands pending) > > As I pointed out in patch 1 of this patchset, this patchset is based > on Lu Baolu's new fix patch: > usb: xhci: fix possible wild pointer > https://www.spinics.net/lists/linux-usb/msg150219.html > > After applying Baolu's patch, after decrement the counter, we will > check the xhci->cur_command if is NULL. So in this situation: > cpu1 cpu2 > > queue_command(first), p++ (=1) > --completion irq fires-- -- timer times out at same time-- > handle_cmd_completion() handle_cmd_timeout(),) > lock(xhci_lock ) spin_on(xhci_lock) > del_timer() fail, p (=1, nochange) > no more commands, P (=1, nochange) > unlock(xhci_lock) > lock(xhci_lock) > p-- (=0) > no current command, return > if (!xhci->current_cmd) { > unlock(xhci_lock); > return; > } > > It can work. Yes, What I wanted to say is that as it relies on Baolus patch for that one case it seems that patch 2/2 can be replaced by a single line change: if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer)) Right? -Mathias -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 19 December 2016 at 20:13, Mathias Nyman <mathias.nyman@intel.com> wrote: > On 19.12.2016 13:34, Baolin Wang wrote: >> >> Hi Mathias, >> >> On 19 December 2016 at 18:33, Mathias Nyman >> <mathias.nyman@linux.intel.com> wrote: >>> >>> On 13.12.2016 05:21, Baolin Wang wrote: >>>> >>>> >>>> Hi Mathias, >>>> >>>> On 12 December 2016 at 23:52, Mathias Nyman >>>> <mathias.nyman@linux.intel.com> wrote: >>>>> >>>>> >>>>> On 05.12.2016 09:51, Baolin Wang wrote: >>>>>> >>>>>> >>>>>> >>>>>> If a command event is found on the event ring during an interrupt, >>>>>> we need to stop the command timer with del_timer(). Since del_timer() >>>>>> can fail if the timer is running and waiting on the xHCI lock, then >>>>>> it maybe get the wrong timeout command in >>>>>> xhci_handle_command_timeout() >>>>>> if host fetched a new command and updated the xhci->current_cmd in >>>>>> handle_cmd_completion(). For this situation, we need a way to signal >>>>>> to the command timer that everything is fine and it should exit. >>>>> >>>>> >>>>> >>>>> >>>>> Ah, right, this could actually happen. >>>>> >>>>>> >>>>>> >>>>>> We should introduce a counter (xhci->current_cmd_pending) for the >>>>>> number >>>>>> of pending commands. If we need to cancel the command timer and >>>>>> del_timer() >>>>>> succeeds, we decrement the number of pending commands. If del_timer() >>>>>> fails, >>>>>> we leave the number of pending commands alone. >>>>>> >>>>>> For handling timeout command, in xhci_handle_command_timeout() we will >>>>>> check >>>>>> the counter after decrementing it, if the counter >>>>>> (xhci->current_cmd_pending) >>>>>> is 0, which means xhci->current_cmd is the right timeout command. If >>>>>> the >>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means >>>>>> current >>>>>> timeout command has been handled by host and host has fetched new >>>>>> command >>>>>> as >>>>>> xhci->current_cmd, then just return and wait for new current command. >>>>> >>>>> >>>>> >>>>> >>>>> A counter like this could work. >>>>> >>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP >>>>> event, this seems to cover both. >>>>> >>>>> quick check, case 1: timeout and cmd completion at the same time. >>>>> >>>>> cpu1 cpu2 >>>>> >>>>> queue_command(first), p++ (=1) >>>>> queue_command(more), >>>>> --completion irq fires-- -- timer times out at same >>>>> time-- >>>>> handle_cmd_completion() handle_cmd_timeout(),) >>>>> lock(xhci_lock ) spin_on(xhci_lock) >>>>> del_timer() fail, p (=1, nochange) >>>>> cur_cmd = list_next(), p++ (=2) >>>>> unlock(xhci_lock) >>>>> lock(xhci_lock) >>>>> p-- (=1) >>>>> if (p > 0), exit >>>>> OK works >>>>> >>>>> case 2: normal timeout case with ABORT+STOP, no race. >>>>> >>>>> cpu1 cpu2 >>>>> >>>>> queue_command(first), p++ (=1) >>>>> queue_command(more), >>>>> handle_cmd_timeout() >>>>> p-- (P=0), don't exit >>>>> mod_timer(), p++ (P=1) >>>>> write_abort_bit() >>>>> handle_cmd_comletion(ABORT) >>>>> del_timer(), ok, p-- (p = 0) >>>>> handle_cmd_completion(STOP) >>>>> del_timer(), fail, (P=0) >>>>> handle_stopped_cmd_ring() >>>>> cur_cmd = list_next(), p++ (=1) >>>>> mod_timer() >>>>> >>>>> OK, works, and same for just STOP case, with the only difference that >>>>> during handle_cmd_completion(STOP) p is decremented (p--) >>>> >>>> >>>> >>>> Yes, that's the cases what I want to handle, thanks for your explicit >>>> explanation. >>>> >>> >>> Gave this some more thought over the weekend, and this implementation >>> doesn't solve the case when the last command times out and races with the >>> completion handler: >>> >>> cpu1 cpu2 >>> >>> queue_command(first), p++ (=1) >>> --completion irq fires-- -- timer times out at same time-- >>> handle_cmd_completion() handle_cmd_timeout(),) >>> lock(xhci_lock ) spin_on(xhci_lock) >>> del_timer() fail, p (=1, nochange) >>> no more commands, P (=1, nochange) >>> unlock(xhci_lock) >>> lock(xhci_lock) >>> p-- (=0) >>> p == 0, continue, even if we >>> should >>> not. >>> For this we still need to rely >>> on >>> checking cur_cmd == NULL in the timeout function. >>> (Baolus patch sets it to NULL if there are no more commands pending) >> >> >> As I pointed out in patch 1 of this patchset, this patchset is based >> on Lu Baolu's new fix patch: >> usb: xhci: fix possible wild pointer >> https://www.spinics.net/lists/linux-usb/msg150219.html >> >> After applying Baolu's patch, after decrement the counter, we will >> check the xhci->cur_command if is NULL. So in this situation: >> cpu1 cpu2 >> >> queue_command(first), p++ (=1) >> --completion irq fires-- -- timer times out at same >> time-- >> handle_cmd_completion() handle_cmd_timeout(),) >> lock(xhci_lock ) spin_on(xhci_lock) >> del_timer() fail, p (=1, nochange) >> no more commands, P (=1, nochange) >> unlock(xhci_lock) >> lock(xhci_lock) >> p-- (=0) >> no current command, return >> if (!xhci->current_cmd) { >> unlock(xhci_lock); >> return; >> } >> >> It can work. > > > Yes, > > What I wanted to say is that as it relies on Baolus patch for that one case > it seems that patch 2/2 can be replaced by a single line change: > > if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer)) > > Right? After checking the code again, I am agree with you. I will resend the patch with fixing this issue. Thanks. -- Baolin.wang Best Regards -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote: > Hi Mathias, > > On 12/19/2016 08:13 PM, Mathias Nyman wrote: >> On 19.12.2016 13:34, Baolin Wang wrote: >>> Hi Mathias, >>> >>> On 19 December 2016 at 18:33, Mathias Nyman >>> <mathias.nyman@linux.intel.com> wrote: >>>> On 13.12.2016 05:21, Baolin Wang wrote: >>>>> >>>>> Hi Mathias, >>>>> >>>>> On 12 December 2016 at 23:52, Mathias Nyman >>>>> <mathias.nyman@linux.intel.com> wrote: >>>>>> >>>>>> On 05.12.2016 09:51, Baolin Wang wrote: >>>>>>> >>>>>>> >>>>>>> If a command event is found on the event ring during an interrupt, >>>>>>> we need to stop the command timer with del_timer(). Since del_timer() >>>>>>> can fail if the timer is running and waiting on the xHCI lock, then >>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout() >>>>>>> if host fetched a new command and updated the xhci->current_cmd in >>>>>>> handle_cmd_completion(). For this situation, we need a way to signal >>>>>>> to the command timer that everything is fine and it should exit. >>>>>> >>>>>> >>>>>> >>>>>> Ah, right, this could actually happen. >>>>>> >>>>>>> >>>>>>> >>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number >>>>>>> of pending commands. If we need to cancel the command timer and >>>>>>> del_timer() >>>>>>> succeeds, we decrement the number of pending commands. If del_timer() >>>>>>> fails, >>>>>>> we leave the number of pending commands alone. >>>>>>> >>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will >>>>>>> check >>>>>>> the counter after decrementing it, if the counter >>>>>>> (xhci->current_cmd_pending) >>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the >>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means >>>>>>> current >>>>>>> timeout command has been handled by host and host has fetched new >>>>>>> command >>>>>>> as >>>>>>> xhci->current_cmd, then just return and wait for new current command. >>>>>> >>>>>> >>>>>> >>>>>> A counter like this could work. >>>>>> >>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP >>>>>> event, this seems to cover both. >>>>>> >>>>>> quick check, case 1: timeout and cmd completion at the same time. >>>>>> >>>>>> cpu1 cpu2 >>>>>> >>>>>> queue_command(first), p++ (=1) >>>>>> queue_command(more), >>>>>> --completion irq fires-- -- timer times out at same time-- >>>>>> handle_cmd_completion() handle_cmd_timeout(),) >>>>>> lock(xhci_lock ) spin_on(xhci_lock) >>>>>> del_timer() fail, p (=1, nochange) >>>>>> cur_cmd = list_next(), p++ (=2) >>>>>> unlock(xhci_lock) >>>>>> lock(xhci_lock) >>>>>> p-- (=1) >>>>>> if (p > 0), exit >>>>>> OK works >>>>>> >>>>>> case 2: normal timeout case with ABORT+STOP, no race. >>>>>> >>>>>> cpu1 cpu2 >>>>>> >>>>>> queue_command(first), p++ (=1) >>>>>> queue_command(more), >>>>>> handle_cmd_timeout() >>>>>> p-- (P=0), don't exit >>>>>> mod_timer(), p++ (P=1) >>>>>> write_abort_bit() >>>>>> handle_cmd_comletion(ABORT) >>>>>> del_timer(), ok, p-- (p = 0) >>>>>> handle_cmd_completion(STOP) >>>>>> del_timer(), fail, (P=0) >>>>>> handle_stopped_cmd_ring() >>>>>> cur_cmd = list_next(), p++ (=1) >>>>>> mod_timer() >>>>>> >>>>>> OK, works, and same for just STOP case, with the only difference that >>>>>> during handle_cmd_completion(STOP) p is decremented (p--) >>>>> >>>>> >>>>> Yes, that's the cases what I want to handle, thanks for your explicit >>>>> explanation. >>>>> >>>> >>>> Gave this some more thought over the weekend, and this implementation >>>> doesn't solve the case when the last command times out and races with the >>>> completion handler: >>>> >>>> cpu1 cpu2 >>>> >>>> queue_command(first), p++ (=1) >>>> --completion irq fires-- -- timer times out at same time-- >>>> handle_cmd_completion() handle_cmd_timeout(),) >>>> lock(xhci_lock ) spin_on(xhci_lock) >>>> del_timer() fail, p (=1, nochange) >>>> no more commands, P (=1, nochange) >>>> unlock(xhci_lock) >>>> lock(xhci_lock) >>>> p-- (=0) >>>> p == 0, continue, even if we should >>>> not. >>>> For this we still need to rely on >>>> checking cur_cmd == NULL in the timeout function. >>>> (Baolus patch sets it to NULL if there are no more commands pending) >>> >>> As I pointed out in patch 1 of this patchset, this patchset is based >>> on Lu Baolu's new fix patch: >>> usb: xhci: fix possible wild pointer >>> https://www.spinics.net/lists/linux-usb/msg150219.html >>> >>> After applying Baolu's patch, after decrement the counter, we will >>> check the xhci->cur_command if is NULL. So in this situation: >>> cpu1 cpu2 >>> >>> queue_command(first), p++ (=1) >>> --completion irq fires-- -- timer times out at same time-- >>> handle_cmd_completion() handle_cmd_timeout(),) >>> lock(xhci_lock ) spin_on(xhci_lock) >>> del_timer() fail, p (=1, nochange) >>> no more commands, P (=1, nochange) >>> unlock(xhci_lock) >>> lock(xhci_lock) >>> p-- (=0) >>> no current command, return >>> if (!xhci->current_cmd) { >>> unlock(xhci_lock); >>> return; >>> } >>> >>> It can work. >> >> Yes, >> >> What I wanted to say is that as it relies on Baolus patch for that one case >> it seems that patch 2/2 can be replaced by a single line change: >> >> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer)) >> >> Right? >> >> -Mathias >> > > It seems that the watch dog algorithm for command queue becomes > more and more complicated and hard for maintain. I am also seeing > another case where a command may lose the chance to be tracked by > the watch dog timer. > > Say, > > queue_command(the only command in queue) > - completion irq fires-- - timer times out at same time-- - another command enqueue-- > - lock(xhci_lock ) - spin_on(xhci_lock) - spin_on(xhci_lock) > - del_timer() fail > - free the command and > set current_cmd to NULL > - unlock(xhci_lock) > - lock(xhci_lock) > - queue_command()(timer will > not rescheduled since the timer > is pending) In this case, since the command timer was fired and you did not re-add the command timer, why here timer is pending? Maybe I missed something? Thanks. > - lock(xhci_lock) > - no current command > - return > > As the result, the later command isn't under track of the watch dog. > If hardware fails to response to this command, kernel will hang in > the thread which is waiting for the completion of the command. > > I can write a patch to fix this and cc stable kernel as well. For long > term, in order to make it simple and easy to maintain, how about > allocating a watch dog timer for each command? It could be part > of the command structure and be managed just like the life cycle > of a command structure. > > I can write a patch for review and discussion, if you think this > change is possible. > > Best regards, > Lu Baolu -- Baolin.wang Best Regards -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 20 December 2016 at 14:39, Lu Baolu <baolu.lu@linux.intel.com> wrote: > Hi, > > On 12/20/2016 02:06 PM, Baolin Wang wrote: >> Hi, >> >> On 20 December 2016 at 12:29, Lu Baolu <baolu.lu@linux.intel.com> wrote: >>> Hi Mathias, >>> >>> On 12/19/2016 08:13 PM, Mathias Nyman wrote: >>>> On 19.12.2016 13:34, Baolin Wang wrote: >>>>> Hi Mathias, >>>>> >>>>> On 19 December 2016 at 18:33, Mathias Nyman >>>>> <mathias.nyman@linux.intel.com> wrote: >>>>>> On 13.12.2016 05:21, Baolin Wang wrote: >>>>>>> Hi Mathias, >>>>>>> >>>>>>> On 12 December 2016 at 23:52, Mathias Nyman >>>>>>> <mathias.nyman@linux.intel.com> wrote: >>>>>>>> On 05.12.2016 09:51, Baolin Wang wrote: >>>>>>>>> >>>>>>>>> If a command event is found on the event ring during an interrupt, >>>>>>>>> we need to stop the command timer with del_timer(). Since del_timer() >>>>>>>>> can fail if the timer is running and waiting on the xHCI lock, then >>>>>>>>> it maybe get the wrong timeout command in xhci_handle_command_timeout() >>>>>>>>> if host fetched a new command and updated the xhci->current_cmd in >>>>>>>>> handle_cmd_completion(). For this situation, we need a way to signal >>>>>>>>> to the command timer that everything is fine and it should exit. >>>>>>>> >>>>>>>> >>>>>>>> Ah, right, this could actually happen. >>>>>>>> >>>>>>>>> >>>>>>>>> We should introduce a counter (xhci->current_cmd_pending) for the number >>>>>>>>> of pending commands. If we need to cancel the command timer and >>>>>>>>> del_timer() >>>>>>>>> succeeds, we decrement the number of pending commands. If del_timer() >>>>>>>>> fails, >>>>>>>>> we leave the number of pending commands alone. >>>>>>>>> >>>>>>>>> For handling timeout command, in xhci_handle_command_timeout() we will >>>>>>>>> check >>>>>>>>> the counter after decrementing it, if the counter >>>>>>>>> (xhci->current_cmd_pending) >>>>>>>>> is 0, which means xhci->current_cmd is the right timeout command. If the >>>>>>>>> counter (xhci->current_cmd_pending) is greater than 0, which means >>>>>>>>> current >>>>>>>>> timeout command has been handled by host and host has fetched new >>>>>>>>> command >>>>>>>>> as >>>>>>>>> xhci->current_cmd, then just return and wait for new current command. >>>>>>>> >>>>>>>> >>>>>>>> A counter like this could work. >>>>>>>> >>>>>>>> Writing the abort bit can generate either ABORT+STOP, or just STOP >>>>>>>> event, this seems to cover both. >>>>>>>> >>>>>>>> quick check, case 1: timeout and cmd completion at the same time. >>>>>>>> >>>>>>>> cpu1 cpu2 >>>>>>>> >>>>>>>> queue_command(first), p++ (=1) >>>>>>>> queue_command(more), >>>>>>>> --completion irq fires-- -- timer times out at same time-- >>>>>>>> handle_cmd_completion() handle_cmd_timeout(),) >>>>>>>> lock(xhci_lock ) spin_on(xhci_lock) >>>>>>>> del_timer() fail, p (=1, nochange) >>>>>>>> cur_cmd = list_next(), p++ (=2) >>>>>>>> unlock(xhci_lock) >>>>>>>> lock(xhci_lock) >>>>>>>> p-- (=1) >>>>>>>> if (p > 0), exit >>>>>>>> OK works >>>>>>>> >>>>>>>> case 2: normal timeout case with ABORT+STOP, no race. >>>>>>>> >>>>>>>> cpu1 cpu2 >>>>>>>> >>>>>>>> queue_command(first), p++ (=1) >>>>>>>> queue_command(more), >>>>>>>> handle_cmd_timeout() >>>>>>>> p-- (P=0), don't exit >>>>>>>> mod_timer(), p++ (P=1) >>>>>>>> write_abort_bit() >>>>>>>> handle_cmd_comletion(ABORT) >>>>>>>> del_timer(), ok, p-- (p = 0) >>>>>>>> handle_cmd_completion(STOP) >>>>>>>> del_timer(), fail, (P=0) >>>>>>>> handle_stopped_cmd_ring() >>>>>>>> cur_cmd = list_next(), p++ (=1) >>>>>>>> mod_timer() >>>>>>>> >>>>>>>> OK, works, and same for just STOP case, with the only difference that >>>>>>>> during handle_cmd_completion(STOP) p is decremented (p--) >>>>>>> >>>>>>> Yes, that's the cases what I want to handle, thanks for your explicit >>>>>>> explanation. >>>>>>> >>>>>> Gave this some more thought over the weekend, and this implementation >>>>>> doesn't solve the case when the last command times out and races with the >>>>>> completion handler: >>>>>> >>>>>> cpu1 cpu2 >>>>>> >>>>>> queue_command(first), p++ (=1) >>>>>> --completion irq fires-- -- timer times out at same time-- >>>>>> handle_cmd_completion() handle_cmd_timeout(),) >>>>>> lock(xhci_lock ) spin_on(xhci_lock) >>>>>> del_timer() fail, p (=1, nochange) >>>>>> no more commands, P (=1, nochange) >>>>>> unlock(xhci_lock) >>>>>> lock(xhci_lock) >>>>>> p-- (=0) >>>>>> p == 0, continue, even if we should >>>>>> not. >>>>>> For this we still need to rely on >>>>>> checking cur_cmd == NULL in the timeout function. >>>>>> (Baolus patch sets it to NULL if there are no more commands pending) >>>>> As I pointed out in patch 1 of this patchset, this patchset is based >>>>> on Lu Baolu's new fix patch: >>>>> usb: xhci: fix possible wild pointer >>>>> https://www.spinics.net/lists/linux-usb/msg150219.html >>>>> >>>>> After applying Baolu's patch, after decrement the counter, we will >>>>> check the xhci->cur_command if is NULL. So in this situation: >>>>> cpu1 cpu2 >>>>> >>>>> queue_command(first), p++ (=1) >>>>> --completion irq fires-- -- timer times out at same time-- >>>>> handle_cmd_completion() handle_cmd_timeout(),) >>>>> lock(xhci_lock ) spin_on(xhci_lock) >>>>> del_timer() fail, p (=1, nochange) >>>>> no more commands, P (=1, nochange) >>>>> unlock(xhci_lock) >>>>> lock(xhci_lock) >>>>> p-- (=0) >>>>> no current command, return >>>>> if (!xhci->current_cmd) { >>>>> unlock(xhci_lock); >>>>> return; >>>>> } >>>>> >>>>> It can work. >>>> Yes, >>>> >>>> What I wanted to say is that as it relies on Baolus patch for that one case >>>> it seems that patch 2/2 can be replaced by a single line change: >>>> >>>> if (!xhci->current_cmd || timer_pending(&xhci->cmd_timer)) >>>> >>>> Right? >>>> >>>> -Mathias >>>> >>> It seems that the watch dog algorithm for command queue becomes >>> more and more complicated and hard for maintain. I am also seeing >>> another case where a command may lose the chance to be tracked by >>> the watch dog timer. >>> >>> Say, >>> >>> queue_command(the only command in queue) >>> - completion irq fires-- - timer times out at same time-- - another command enqueue-- >>> - lock(xhci_lock ) - spin_on(xhci_lock) - spin_on(xhci_lock) >>> - del_timer() fail >>> - free the command and >>> set current_cmd to NULL >>> - unlock(xhci_lock) >>> - lock(xhci_lock) >>> - queue_command()(timer will >>> not rescheduled since the timer >>> is pending) >> In this case, since the command timer was fired and you did not re-add >> the command timer, why here timer is pending? Maybe I missed >> something? Thanks. > > In queue_command(), > > /* if there are no other commands queued we start the timeout timer */ > if (list_is_singular(&xhci->cmd_list) && > !timer_pending(&xhci->cmd_timer)) { > xhci->current_cmd = cmd; > mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT); > } > > timer_pending() will return true if the timer is fired, but the function is still > running on another CPU. Do I understand it right? From my understanding, if the timer was fired, no matter the timeout function is running or finished, timer_pending() will return false. Please correct me if I made mistakes. Thanks. > > Best regards, > Lu Baolu > >>> - lock(xhci_lock) >>> - no current command >>> - return >>> >>> As the result, the later command isn't under track of the watch dog. >>> If hardware fails to response to this command, kernel will hang in >>> the thread which is waiting for the completion of the command. >>> >>> I can write a patch to fix this and cc stable kernel as well. For long >>> term, in order to make it simple and easy to maintain, how about >>> allocating a watch dog timer for each command? It could be part >>> of the command structure and be managed just like the life cycle >>> of a command structure. >>> >>> I can write a patch for review and discussion, if you think this >>> change is possible. >>> >>> Best regards, >>> Lu Baolu >> >> > -- Baolin.wang Best Regards -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 20.12.2016 09:30, Baolin Wang wrote: ... Alright, I gathered all current work related to xhci races and timeouts and put them into a branch: git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes Its based on 4.9 It includes a few other patches just to avoid conflicts and make my life easier Interesting patches are: ee4eb91 xhci: remove unnecessary check for pending timer 0cba67d xhci: detect stop endpoint race using pending timer instead of counter. 4f2535f xhci: Handle command completion and timeout race b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command 529a5a0 usb: xhci: fix possible wild pointer 4766555 xhci: Fix race related to abort operation de834a3 xhci: Use delayed_work instead of timer for command timeout 69973b8 Linux 4.9 The fixes for command queue races will go to usb-linus and stable, the reworks for stop ep watchdog timer will go to usb-next. Still completely untested, (well it compiles) Felipe gave instructions how to modify dwc3 driver to timeout on address devicecommands to test these, I'll try to set that up. All additional testing is welcome, especially if you can trigger timeouts and races -Mathias -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Mathias, On 20 December 2016 at 23:13, Mathias Nyman <mathias.nyman@linux.intel.com> wrote: > On 20.12.2016 09:30, Baolin Wang wrote: > ... > > Alright, I gathered all current work related to xhci races and timeouts > and put them into a branch: > > git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git > timeout_race_fixes > > Its based on 4.9 > It includes a few other patches just to avoid conflicts and make my life > easier > > Interesting patches are: > > ee4eb91 xhci: remove unnecessary check for pending timer > 0cba67d xhci: detect stop endpoint race using pending timer instead of > counter. > 4f2535f xhci: Handle command completion and timeout race > b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort > command > 529a5a0 usb: xhci: fix possible wild pointer > 4766555 xhci: Fix race related to abort operation > de834a3 xhci: Use delayed_work instead of timer for command timeout > 69973b8 Linux 4.9 > > The fixes for command queue races will go to usb-linus and stable, the > reworks for stop ep watchdog timer will go to usb-next. How about applying below patch in your 'timeout_race_fixes' branch? [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is timeout https://lkml.org/lkml/2016/12/13/94 > > Still completely untested, (well it compiles) > > Felipe gave instructions how to modify dwc3 driver to timeout on address > devicecommands to test these, I'll try to set that up. > > All additional testing is welcome, especially if you can trigger timeouts > and races I will try to test these patches. -- Baolin.wang Best Regards -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Mathias, I have some comments for the implementation of xhci_abort_cmd_ring() below. On 12/20/2016 11:13 PM, Mathias Nyman wrote: > On 20.12.2016 09:30, Baolin Wang wrote: > ... > > Alright, I gathered all current work related to xhci races and timeouts > and put them into a branch: > > git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes > > Its based on 4.9 > It includes a few other patches just to avoid conflicts and make my life easier > > Interesting patches are: > > ee4eb91 xhci: remove unnecessary check for pending timer > 0cba67d xhci: detect stop endpoint race using pending timer instead of counter. > 4f2535f xhci: Handle command completion and timeout race > b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command > 529a5a0 usb: xhci: fix possible wild pointer > 4766555 xhci: Fix race related to abort operation > de834a3 xhci: Use delayed_work instead of timer for command timeout > 69973b8 Linux 4.9 > > The fixes for command queue races will go to usb-linus and stable, the > reworks for stop ep watchdog timer will go to usb-next. > > Still completely untested, (well it compiles) > > Felipe gave instructions how to modify dwc3 driver to timeout on address > devicecommands to test these, I'll try to set that up. > > All additional testing is welcome, especially if you can trigger timeouts > and races > > -Mathias > > Below is the latest code. I put my comments in line. 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci) 323 { 324 u64 temp_64; 325 int ret; 326 327 xhci_dbg(xhci, "Abort command ring\n"); 328 329 reinit_completion(&xhci->cmd_ring_stop_completion); 330 331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); 332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT, 333 &xhci->op_regs->cmd_ring); We should hold xhci->lock when we are modifying xhci registers at runtime. 334 335 /* Section 4.6.1.2 of xHCI 1.0 spec says software should 336 * time the completion od all xHCI commands, including s/od/of/g 337 * the Command Abort operation. If software doesn't see 338 * CRR negated in a timely manner (e.g. longer than 5 339 * seconds), then it should assume that the there are 340 * larger problems with the xHC and assert HCRST. 341 */ 342 ret = xhci_handshake(&xhci->op_regs->cmd_ring, 343 CMD_RING_RUNNING, 0, 5 * 1000 * 1000); 344 if (ret < 0) { 345 /* we are about to kill xhci, give it one more chance */ The retry of setting CMD_RING_ABORT is not necessary according to previous discussion. We have cleaned code for second try in xhci_handle_command_timeout(). Need to clean up here as well. 346 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT, 347 &xhci->op_regs->cmd_ring); 348 udelay(1000); 349 ret = xhci_handshake(&xhci->op_regs->cmd_ring, 350 CMD_RING_RUNNING, 0, 3 * 1000 * 1000); 351 if (ret < 0) { 352 xhci_err(xhci, "Stopped the command ring failed, " 353 "maybe the host is dead\n"); 354 xhci->xhc_state |= XHCI_STATE_DYING; 355 xhci_halt(xhci); 356 return -ESHUTDOWN; 357 } 358 } 359 /* 360 * Writing the CMD_RING_ABORT bit should cause a cmd completion event, 361 * however on some host hw the CMD_RING_RUNNING bit is correctly cleared 362 * but the completion event in never sent. Wait 2 secs (arbitrary s/in never sent/is never sent/g 363 * number) to handle those cases after negation of CMD_RING_RUNNING. 364 */ This should be implemented with a quirk bit. It's not common for all host controllers. 365 if (!wait_for_completion_timeout(&xhci->cmd_ring_stop_completion, 366 2 * HZ)) { 367 xhci_dbg(xhci, "No stop event for abort, ring start fail?\n"); 368 xhci_cleanup_command_queue(xhci); 369 } else { 370 unsigned long flags; 371 372 spin_lock_irqsave(&xhci->lock, flags); 373 xhci_handle_stopped_cmd_ring(xhci, xhci_next_queued_cmd(xhci)); 374 spin_unlock_irqrestore(&xhci->lock, flags); 375 } 376 return 0; 377 } Best regards, Lu Baolu -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Mathias, I have some comments for the implementation of xhci_handle_command_timeout() as well. On 12/20/2016 11:13 PM, Mathias Nyman wrote: > On 20.12.2016 09:30, Baolin Wang wrote: > ... > > Alright, I gathered all current work related to xhci races and timeouts > and put them into a branch: > > git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git timeout_race_fixes > > Its based on 4.9 > It includes a few other patches just to avoid conflicts and make my life easier > > Interesting patches are: > > ee4eb91 xhci: remove unnecessary check for pending timer > 0cba67d xhci: detect stop endpoint race using pending timer instead of counter. > 4f2535f xhci: Handle command completion and timeout race > b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort command > 529a5a0 usb: xhci: fix possible wild pointer > 4766555 xhci: Fix race related to abort operation > de834a3 xhci: Use delayed_work instead of timer for command timeout > 69973b8 Linux 4.9 > > The fixes for command queue races will go to usb-linus and stable, the > reworks for stop ep watchdog timer will go to usb-next. > > Still completely untested, (well it compiles) > > Felipe gave instructions how to modify dwc3 driver to timeout on address > devicecommands to test these, I'll try to set that up. > > All additional testing is welcome, especially if you can trigger timeouts > and races > > -Mathias > > I post the code below and add my comments in line. 1276 void xhci_handle_command_timeout(struct work_struct *work) 1277 { 1278 struct xhci_hcd *xhci; 1279 int ret; 1280 unsigned long flags; 1281 u64 hw_ring_state; 1282 1283 xhci = container_of(to_delayed_work(work), struct xhci_hcd, cmd_timer); 1284 1285 spin_lock_irqsave(&xhci->lock, flags); 1286 1287 /* 1288 * If timeout work is pending, or current_cmd is NULL, it means we 1289 * raced with command completion. Command is handled so just return. 1290 */ 1291 if (!xhci->current_cmd || delayed_work_pending(&xhci->cmd_timer)) { 1292 spin_unlock_irqrestore(&xhci->lock, flags); 1293 return; 1294 } 1295 /* mark this command to be cancelled */ 1296 xhci->current_cmd->status = COMP_CMD_ABORT; 1297 1298 /* Make sure command ring is running before aborting it */ 1299 hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); 1300 if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) && 1301 (hw_ring_state & CMD_RING_RUNNING)) { 1302 /* Prevent new doorbell, and start command abort */ 1303 xhci->cmd_ring_state = CMD_RING_STATE_ABORTED; 1304 spin_unlock_irqrestore(&xhci->lock, flags); 1305 xhci_dbg(xhci, "Command timeout\n"); 1306 ret = xhci_abort_cmd_ring(xhci); 1307 if (unlikely(ret == -ESHUTDOWN)) { 1308 xhci_err(xhci, "Abort command ring failed\n"); 1309 xhci_cleanup_command_queue(xhci); 1310 usb_hc_died(xhci_to_hcd(xhci)->primary_hcd); 1311 xhci_dbg(xhci, "xHCI host controller is dead.\n"); 1312 } 1313 return; 1314 } 1315 1316 /* host removed. Bail out */ 1317 if (xhci->xhc_state & XHCI_STATE_REMOVING) { 1318 spin_unlock_irqrestore(&xhci->lock, flags); 1319 xhci_dbg(xhci, "host removed, ring start fail?\n"); 1320 xhci_cleanup_command_queue(xhci); 1321 return; 1322 } I think this part of code should be moved up to line 1295. 1323 1324 /* command timeout on stopped ring, ring can't be aborted */ 1325 xhci_dbg(xhci, "Command timeout on stopped ring\n"); 1326 xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd); 1327 spin_unlock_irqrestore(&xhci->lock, flags); This part of code is tricky. I have no idea about in which case should this code be executed? Anyway, we shouldn't call xhci_handle_stopped_cmd_ring() here, right? 1328 return; 1329 } Best regards, Lu Baolu -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Mathias Nyman <mathias.nyman@linux.intel.com> writes: >> Below is the latest code. I put my comments in line. >> >> 322 static int xhci_abort_cmd_ring(struct xhci_hcd *xhci) >> 323 { >> 324 u64 temp_64; >> 325 int ret; >> 326 >> 327 xhci_dbg(xhci, "Abort command ring\n"); >> 328 >> 329 reinit_completion(&xhci->cmd_ring_stop_completion); >> 330 >> 331 temp_64 = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); >> 332 xhci_write_64(xhci, temp_64 | CMD_RING_ABORT, >> 333 &xhci->op_regs->cmd_ring); >> >> We should hold xhci->lock when we are modifying xhci registers >> at runtime. >> > > Makes sense, but we need to unlock it before sleeping or waiting for > completion. I need to look into that in more detail. > > But this was an issue already before these changes. We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what is for taking lock for register though, I guess it should be enough just lock around of read=>write of ->cmd_ring if need lock. [Rather ->cmd_ring user should check CMD_RING_STATE_ABORTED state.] > But then again I really like OGAWA Hiroumi's solution that separates the > command ring stopping from aborting commands and restarting the ring. > > The current way of always restarting the command ring as a response to > a stop command ring command really limits its usage. > > So, with this in mind most reasonable would be to > 1. fix the lock to cover abort+CRR check, and send it to usb-linus +stable > 2. rebase OGAWA Hirofumi's changes on top of that, and send to usb-linus only > 3. remove unnecessary second abort try as a separate patch, send to usb-next > 4. remove polling for the Command ring running (CRR), waiting for completion > is enough, if completion times out then we can check CRR. for usb-next I think we should check both of CRR and even of stop completion. Because CRR and stop completion was not same time (both can be first). Thanks. -- OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 22.12.2016 03:46, Lu Baolu wrote: > Hi, > > On 12/21/2016 11:18 PM, OGAWA Hirofumi wrote: >> Mathias Nyman <mathias.nyman@linux.intel.com> writes: >> >>>> We set CMD_RING_STATE_ABORTED state under locking. I'm not checking what >>>> is for taking lock for register though, I guess it should be enough just >>>> lock around of read=>write of ->cmd_ring if need lock. >>> After your patch it should be enough to have the lock only while >>> reading and writing the cmd_ring register. >>> >>> If we want a locking fix that applies more easily to older stable >>> releases before your change then the lock needs to cover set >>> CMD_RING_STATE_ABORT, read cmd_reg, write cmd_reg and busiloop >>> checking CRR bit. Otherwise the stop cmd ring interrupt handler may >>> restart the ring just before we start checing CRR. The stop cmd ring >>> interrupt will set the CMD_RING_STATE_ABORTED to >>> CMD_RING_STATE_RUNNING so ring will really restart in the interrupt >>> handler. >> Just for record (no chance to make patch I myself for now, sorry), while >> checking locking slightly, I noticed unrelated missing locking. >> >> xhci_cleanup_command_queue() >> >> We are calling it without locking, but we need to lock for accessing list. > > Yeah. I can make the patch. > Force updated timeout_race_fixes branch. I'll be out for a few days during xmas, continue testing after that -Mathias -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2 January 2017 at 22:57, Mathias Nyman <mathias.nyman@linux.intel.com> wrote: > On 27.12.2016 05:07, Baolin Wang wrote: >> >> Hi, >> >> On 21 December 2016 at 21:00, Mathias Nyman >> <mathias.nyman@linux.intel.com> wrote: >>> >>> On 21.12.2016 04:22, Baolin Wang wrote: >>>> >>>> >>>> Hi Mathias, >>>> >>>> On 20 December 2016 at 23:13, Mathias Nyman >>>> <mathias.nyman@linux.intel.com> wrote: >>>>> >>>>> >>>>> On 20.12.2016 09:30, Baolin Wang wrote: >>>>> ... >>>>> >>>>> Alright, I gathered all current work related to xhci races and timeouts >>>>> and put them into a branch: >>>>> >>>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git >>>>> timeout_race_fixes >>>>> >>>>> Its based on 4.9 >>>>> It includes a few other patches just to avoid conflicts and make my >>>>> life >>>>> easier >>>>> >>>>> Interesting patches are: >>>>> >>>>> ee4eb91 xhci: remove unnecessary check for pending timer >>>>> 0cba67d xhci: detect stop endpoint race using pending timer instead of >>>>> counter. >>>>> 4f2535f xhci: Handle command completion and timeout race >>>>> b9d00d7 usb: host: xhci: Fix possible wild pointer when handling abort >>>>> command >>>>> 529a5a0 usb: xhci: fix possible wild pointer >>>>> 4766555 xhci: Fix race related to abort operation >>>>> de834a3 xhci: Use delayed_work instead of timer for command timeout >>>>> 69973b8 Linux 4.9 >>>>> >>>>> The fixes for command queue races will go to usb-linus and stable, the >>>>> reworks for stop ep watchdog timer will go to usb-next. >>>> >>>> >>>> >>>> How about applying below patch in your 'timeout_race_fixes' branch? >>>> [PATCH] usb: host: xhci: Clean up commands when stop endpoint command is >>>> timeout >>>> https://lkml.org/lkml/2016/12/13/94 >>>> >>> >>> usb_hc_died() should eventyally end up calling xhci_mem_cleanup() >>> which will cleanup the command queue. But I need to look into that >> >> >> usb_hc_died() did not call xhci_mem_cleanup() to clean up command >> queue, it just disconnects all children devices attached on the dying >> hub. I did not find where it will clean up the command queue when >> issuing usb_hc_died(). Also before issuing usb_hc_died() in >> xhci_handle_command_timeout(), we will call >> xhci_cleanup_command_queue(). I think it should same as in >> xhci_stop_endpoint_command_watchdog(). >> > > You're right, it doesn't call xhci_mem_cleanup. > Need to look at this after getting first series of fixes to usb-linus Thanks. -- Baolin.wang Best Regards -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/usb/host/xhci-ring.c b/drivers/usb/host/xhci-ring.c index 9965a4c..edc9ac2 100644 --- a/drivers/usb/host/xhci-ring.c +++ b/drivers/usb/host/xhci-ring.c @@ -1253,6 +1253,7 @@ static void xhci_handle_stopped_cmd_ring(struct xhci_hcd *xhci, if ((xhci->cmd_ring->dequeue != xhci->cmd_ring->enqueue) && !(xhci->xhc_state & XHCI_STATE_DYING)) { xhci->current_cmd = cur_cmd; + xhci->current_cmd_pending++; mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT); xhci_ring_cmd_db(xhci); } @@ -1269,11 +1270,27 @@ void xhci_handle_command_timeout(unsigned long data) xhci = (struct xhci_hcd *) data; spin_lock_irqsave(&xhci->lock, flags); + xhci->current_cmd_pending--; + if (!xhci->current_cmd) { spin_unlock_irqrestore(&xhci->lock, flags); return; } + /* + * If the current_cmd_pending is 0, which means current command is + * timeout. + * + * If the current_cmd_pending is greater than 0, which means current + * timeout command has been handled by host and host has fetched new + * command as xhci->current_cmd, then just return and wait for new + * current command. + */ + if (xhci->current_cmd_pending > 0) { + spin_unlock_irqrestore(&xhci->lock, flags); + return; + } + if (xhci->current_cmd->status == COMP_CMD_ABORT) second_timeout = true; xhci->current_cmd->status = COMP_CMD_ABORT; @@ -1282,6 +1299,8 @@ void xhci_handle_command_timeout(unsigned long data) hw_ring_state = xhci_read_64(xhci, &xhci->op_regs->cmd_ring); if ((xhci->cmd_ring_state & CMD_RING_STATE_RUNNING) && (hw_ring_state & CMD_RING_RUNNING)) { + /* Will add command timer again to wait for abort event */ + xhci->current_cmd_pending++; spin_unlock_irqrestore(&xhci->lock, flags); xhci_dbg(xhci, "Command timeout\n"); ret = xhci_abort_cmd_ring(xhci); @@ -1336,7 +1355,13 @@ static void handle_cmd_completion(struct xhci_hcd *xhci, cmd = list_entry(xhci->cmd_list.next, struct xhci_command, cmd_list); - del_timer(&xhci->cmd_timer); + /* + * If the command timer is running on another CPU, we don't decrement + * current_cmd_pending, since we didn't successfully stop the command + * timer. + */ + if (del_timer(&xhci->cmd_timer)) + xhci->current_cmd_pending--; trace_xhci_cmd_completion(cmd_trb, (struct xhci_generic_trb *) event); @@ -1427,6 +1452,7 @@ static void handle_cmd_completion(struct xhci_hcd *xhci, if (cmd->cmd_list.next != &xhci->cmd_list) { xhci->current_cmd = list_entry(cmd->cmd_list.next, struct xhci_command, cmd_list); + xhci->current_cmd_pending++; mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT); } else if (xhci->current_cmd == cmd) { xhci->current_cmd = NULL; @@ -3927,6 +3953,7 @@ static int queue_command(struct xhci_hcd *xhci, struct xhci_command *cmd, if (xhci->cmd_list.next == &cmd->cmd_list && !timer_pending(&xhci->cmd_timer)) { xhci->current_cmd = cmd; + xhci->current_cmd_pending++; mod_timer(&xhci->cmd_timer, jiffies + XHCI_CMD_DEFAULT_TIMEOUT); } diff --git a/drivers/usb/host/xhci.h b/drivers/usb/host/xhci.h index 9dbaacf..5d81257 100644 --- a/drivers/usb/host/xhci.h +++ b/drivers/usb/host/xhci.h @@ -1567,6 +1567,7 @@ struct xhci_hcd { unsigned int cmd_ring_reserved_trbs; struct timer_list cmd_timer; struct xhci_command *current_cmd; + u32 current_cmd_pending; struct xhci_ring *event_ring; struct xhci_erst erst; /* Scratchpad */
If a command event is found on the event ring during an interrupt, we need to stop the command timer with del_timer(). Since del_timer() can fail if the timer is running and waiting on the xHCI lock, then it maybe get the wrong timeout command in xhci_handle_command_timeout() if host fetched a new command and updated the xhci->current_cmd in handle_cmd_completion(). For this situation, we need a way to signal to the command timer that everything is fine and it should exit. We should introduce a counter (xhci->current_cmd_pending) for the number of pending commands. If we need to cancel the command timer and del_timer() succeeds, we decrement the number of pending commands. If del_timer() fails, we leave the number of pending commands alone. For handling timeout command, in xhci_handle_command_timeout() we will check the counter after decrementing it, if the counter (xhci->current_cmd_pending) is 0, which means xhci->current_cmd is the right timeout command. If the counter (xhci->current_cmd_pending) is greater than 0, which means current timeout command has been handled by host and host has fetched new command as xhci->current_cmd, then just return and wait for new current command. Signed-off-by: Baolin Wang <baolin.wang@linaro.org> --- drivers/usb/host/xhci-ring.c | 29 ++++++++++++++++++++++++++++- drivers/usb/host/xhci.h | 1 + 2 files changed, 29 insertions(+), 1 deletion(-) -- 1.7.9.5 -- To unsubscribe from this list: send the line "unsubscribe linux-usb" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html