Message ID | 20230129234441.116310-8-michael.christie@oracle.com |
---|---|
State | New |
Headers | show |
Series | target: TMF and recovery fixes | expand |
On Sun, Jan 29, 2023 at 05:44:34PM -0600, Mike Christie wrote: > > This fixes a bug where an initiator thinks a LUN_RESET has cleaned > up running commands when it hasn't. The bug was added in: > > commit 51ec502a3266 ("target: Delete tmr from list before processing") > > The problem occurs when: > > 1. We have N IO cmds running in the target layer spread over 2 sessions. > 2. The initiator sends a LUN_RESET for each session. > 3. session1's LUN_RESET loops over all the running commands from both > sessions and moves them to its local drain_task_list. > 4. session2's LUN_RESET does not see the LUN_RESET from session1 because > the commit above has it remove itself. session2 also does not see any > commands since the other reset moved them off the state lists. > 5. sessions2's LUN_RESET will then complete with a successful response. > 6. sessions2's inititor believes the running commands on its session are > now cleaned up due to the successful response and cleans up the running > commands from its side. It then restarts them. > 7. The commands do eventually complete on the backend and the target > starts to return aborted task statuses for them. The initiator will > either throw a invalid ITT error or might accidentally lookup a new task > if the ITT has been reallocated already. > > This fixes the bug by reverting the patch. > > Fixes: 51ec502a3266 ("target: Delete tmr from list before processing") > Signed-off-by: Mike Christie <michael.christie@oracle.com> > Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> Actually, this patch even fixes a crash that we've just faced. The second LUN_RESET moves the first LUN_RESET from tmr_list to its drain_tmr_list, then the first LUN_RESET removes itself from second`s drain_tmr_list, then the second LUN_RESET tries to remove the first from the list and crashes because it was deleted already. So, Tested-by: Dmitry Bogdanov <d.bogdanov@yadro.com>
On 3/3/23 3:26 AM, Dmitry Bogdanov wrote: > On Thu, Mar 02, 2023 at 11:29:50AM -0600, Mike Christie wrote: >> >> On 3/2/23 3:43 AM, Dmitry Bogdanov wrote: >>> On Sat, Feb 11, 2023 at 11:59:22AM +0300, Dmitry Bogdanov wrote: >>>> On Sun, Jan 29, 2023 at 05:44:34PM -0600, Mike Christie wrote: >>>>> >>>>> This fixes a bug where an initiator thinks a LUN_RESET has cleaned >>>>> up running commands when it hasn't. The bug was added in: >>>>> >>>>> commit 51ec502a3266 ("target: Delete tmr from list before processing") >>>>> >>>>> The problem occurs when: >>>>> >>>>> 1. We have N IO cmds running in the target layer spread over 2 sessions. >>>>> 2. The initiator sends a LUN_RESET for each session. >>>>> 3. session1's LUN_RESET loops over all the running commands from both >>>>> sessions and moves them to its local drain_task_list. >>>>> 4. session2's LUN_RESET does not see the LUN_RESET from session1 because >>>>> the commit above has it remove itself. session2 also does not see any >>>>> commands since the other reset moved them off the state lists. >>>>> 5. sessions2's LUN_RESET will then complete with a successful response. >>>>> 6. sessions2's inititor believes the running commands on its session are >>>>> now cleaned up due to the successful response and cleans up the running >>>>> commands from its side. It then restarts them. >>>>> 7. The commands do eventually complete on the backend and the target >>>>> starts to return aborted task statuses for them. The initiator will >>>>> either throw a invalid ITT error or might accidentally lookup a new task >>>>> if the ITT has been reallocated already. >>>>> >>>>> This fixes the bug by reverting the patch. >>>>> >>>>> Fixes: 51ec502a3266 ("target: Delete tmr from list before processing") >>>>> Signed-off-by: Mike Christie <michael.christie@oracle.com> >>>>> Reviewed-by: Dmitry Bogdanov <d.bogdanov@yadro.com> >>>> >>>> Actually, this patch even fixes a crash that we've just faced. >>>> The second LUN_RESET moves the first LUN_RESET from tmr_list to its >>>> drain_tmr_list, then the first LUN_RESET removes itself from second`s >>>> drain_tmr_list, then the second LUN_RESET tries to remove the first from >>>> the list and crashes because it was deleted already. >>>> So, >>>> >>>> Tested-by: Dmitry Bogdanov <d.bogdanov@yadro.com> >>> >>> Unfortunately, I am revoking my tags. This patch leads to deadlock of two >>> LUN_RESETs waiting for each other in its drain_tmr_list. >>> >>> To keep LUN_RESETs ignoring each other something like that is needed: >>>> --- >>>> drivers/target/target_core_tmr.c | 5 +++-- >>>> 1 file changed, 3 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c >>>> index 2b95b4550a63..a60802b4c5a3 100644 >>>> --- a/drivers/target/target_core_tmr.c >>>> +++ b/drivers/target/target_core_tmr.c >>>> @@ -188,9 +188,10 @@ static void core_tmr_drain_tmr_list( >>>> * LUN_RESET tmr.. >>>> */ >>>> spin_lock_irqsave(&dev->se_tmr_lock, flags); >>>> - if (tmr) >>>> - list_del_init(&tmr->tmr_list); >>>> list_for_each_entry_safe(tmr_p, tmr_pp, &dev->dev_tmr_list, tmr_list) { >>> - > + if (tmr_p == tmr) >>> - > + continue; >>> - > + >>> >>> + /* Ignore LUN_RESETs to avoid deadlocks */ >>> + if (tmr_p->function == TMR_LUN_RESET) >>> + continue; >>> + >> >> Shoot, that adds back the bug I was hitting. > > Haha, exactly. > >> I have an idea for how to fix both issues, but let me do some more testing >> and will repost the set. Thanks for testing and reviewing this. > > scst serializes LUN_RESET execution, may be we should do the same. > That's what I'm working on. I was thinking async aborts are nice for ESX where you have N VMs on a LUN and you don't want VM1's abort to have to wait for VM2's abort so left that. However, for LUN_RESET, it has to wait for all the commands, so it's not as bad to serialize them.
diff --git a/drivers/target/target_core_tmr.c b/drivers/target/target_core_tmr.c index 2b95b4550a63..a60802b4c5a3 100644 --- a/drivers/target/target_core_tmr.c +++ b/drivers/target/target_core_tmr.c @@ -188,9 +188,10 @@ static void core_tmr_drain_tmr_list( * LUN_RESET tmr.. */ spin_lock_irqsave(&dev->se_tmr_lock, flags); - if (tmr) - list_del_init(&tmr->tmr_list); list_for_each_entry_safe(tmr_p, tmr_pp, &dev->dev_tmr_list, tmr_list) { + if (tmr_p == tmr) + continue; + cmd = tmr_p->task_cmd; if (!cmd) { pr_err("Unable to locate struct se_cmd for TMR\n");