diff mbox series

scsi: libiscsi: Fix cmds hung when sd_shutdown

Message ID 1604132622-497115-1-git-send-email-wubo40@huawei.com
State New
Headers show
Series scsi: libiscsi: Fix cmds hung when sd_shutdown | expand

Commit Message

Wu Bo Oct. 31, 2020, 8:23 a.m. UTC
For some reason, during reboot the system, iscsi.service failed to
logout all sessions. kernel will hang forever on its
sd_sync_cache() logic, after issuing the SYNCHRONIZE_CACHE cmd to all
still existent paths.

[ 1044.098991] reboot: Mddev shutdown finished.
[ 1044.099311] reboot: Usermodehelper disable finished.
[ 1050.611244]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295152378, last ping 4295153633, now 4295154944
[ 1348.599676] Call trace:
[ 1348.599887]  __switch_to+0xe8/0x150
[ 1348.600113]  __schedule+0x33c/0xa08
[ 1348.600372]  schedule+0x2c/0x88
[ 1348.600567]  schedule_timeout+0x184/0x3a8
[ 1348.600820]  io_schedule_timeout+0x28/0x48
[ 1348.601089]  wait_for_common_io.constprop.2+0x168/0x258
[ 1348.601425]  wait_for_completion_io_timeout+0x28/0x38
[ 1348.601762]  blk_execute_rq+0x98/0xd8
[ 1348.602006]  __scsi_execute+0xe0/0x1e8
[ 1348.602262]  sd_sync_cache+0xd0/0x220 [sd_mod]
[ 1348.602551]  sd_shutdown+0x6c/0xf8 [sd_mod]
[ 1348.602826]  device_shutdown+0x13c/0x250
[ 1348.603078]  kernel_restart_prepare+0x5c/0x68
[ 1348.603400]  kernel_restart+0x20/0x98
[ 1348.603683]  __se_sys_reboot+0x214/0x260
[ 1348.603987]  __arm64_sys_reboot+0x24/0x30
[ 1348.604300]  el0_svc_common+0x80/0x1b8
[ 1348.604590]  el0_svc_handler+0x78/0xe0
[ 1348.604877]  el0_svc+0x10/0x260

d754941225 (scsi: libiscsi: Allow sd_shutdown on bad transport) Once
solved this problem. The iscsi_eh_cmd_timed_out() function add system_state
judgment, and will return BLK_EH_DONE and mark the result as 
DID_NO_CONNECT when system_state is not SYSTEM_RUNNING, 
To tell upper layers that the command was handled during 
the transport layer error handler helper.

The scsi Mid Layer timeout handler function(scsi_times_out) will be
abort the cmd if the scsi LLD timeout handler return BLK_EH_DONE.
if abort cmd failed, will enter scsi EH logic.

Scsi EH will do reset target logic, if reset target failed, Will
call iscsi_eh_session_reset() function to drop the session.

The iscsi_eh_session_reset function will wait for a relogin,
session termination from userspace, or a recovery/replacement timeout.
But at this time, the app iscsid has exited, and the session was marked as
ISCSI_STATE_FAILED, So the SCSI EH process will never be 
scheduled back again.

PID: 9123   TASK: ffff80020c1b4d80  CPU: 3   COMMAND: "scsi_eh_2"
 #0 [ffff00008632bb70] __switch_to at ffff000080088738
 #1 [ffff00008632bb90] __schedule at ffff000080a00480
 #2 [ffff00008632bc20] schedule at ffff000080a00b58
 #3 [ffff00008632bc30] iscsi_eh_session_reset at ffff000000d1ab9c [libiscsi]
 #4 [ffff00008632bcb0] iscsi_eh_recover_target at ffff000000d1d1fc [libiscsi]
 #5 [ffff00008632bd00] scsi_try_target_reset at ffff0000806f0bac
 #6 [ffff00008632bd30] scsi_eh_ready_devs at ffff0000806f2724
 #7 [ffff00008632bde0] scsi_error_handler at ffff0000806f41d4
 #8 [ffff00008632be70] kthread at ffff000080119ae0

Reported-by: Tianxiong Lu <lutianxiong@huawei.com>
Signed-off-by: Wu Bo <wubo40@huawei.com>
---
 drivers/scsi/libiscsi.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Comments

Lee Duncan Oct. 31, 2020, 6:06 p.m. UTC | #1
On 10/31/20 1:23 AM, Wu Bo wrote:
> For some reason, during reboot the system, iscsi.service failed to

> logout all sessions. kernel will hang forever on its

> sd_sync_cache() logic, after issuing the SYNCHRONIZE_CACHE cmd to all

> still existent paths.

> 

> [ 1044.098991] reboot: Mddev shutdown finished.

> [ 1044.099311] reboot: Usermodehelper disable finished.

> [ 1050.611244]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295152378, last ping 4295153633, now 4295154944

> [ 1348.599676] Call trace:

> [ 1348.599887]  __switch_to+0xe8/0x150

> [ 1348.600113]  __schedule+0x33c/0xa08

> [ 1348.600372]  schedule+0x2c/0x88

> [ 1348.600567]  schedule_timeout+0x184/0x3a8

> [ 1348.600820]  io_schedule_timeout+0x28/0x48

> [ 1348.601089]  wait_for_common_io.constprop.2+0x168/0x258

> [ 1348.601425]  wait_for_completion_io_timeout+0x28/0x38

> [ 1348.601762]  blk_execute_rq+0x98/0xd8

> [ 1348.602006]  __scsi_execute+0xe0/0x1e8

> [ 1348.602262]  sd_sync_cache+0xd0/0x220 [sd_mod]

> [ 1348.602551]  sd_shutdown+0x6c/0xf8 [sd_mod]

> [ 1348.602826]  device_shutdown+0x13c/0x250

> [ 1348.603078]  kernel_restart_prepare+0x5c/0x68

> [ 1348.603400]  kernel_restart+0x20/0x98

> [ 1348.603683]  __se_sys_reboot+0x214/0x260

> [ 1348.603987]  __arm64_sys_reboot+0x24/0x30

> [ 1348.604300]  el0_svc_common+0x80/0x1b8

> [ 1348.604590]  el0_svc_handler+0x78/0xe0

> [ 1348.604877]  el0_svc+0x10/0x260

> 

> d754941225 (scsi: libiscsi: Allow sd_shutdown on bad transport) Once

> solved this problem. The iscsi_eh_cmd_timed_out() function add system_state

> judgment, and will return BLK_EH_DONE and mark the result as 

> DID_NO_CONNECT when system_state is not SYSTEM_RUNNING, 

> To tell upper layers that the command was handled during 

> the transport layer error handler helper.

> 

> The scsi Mid Layer timeout handler function(scsi_times_out) will be

> abort the cmd if the scsi LLD timeout handler return BLK_EH_DONE.

> if abort cmd failed, will enter scsi EH logic.

> 

> Scsi EH will do reset target logic, if reset target failed, Will

> call iscsi_eh_session_reset() function to drop the session.

> 

> The iscsi_eh_session_reset function will wait for a relogin,

> session termination from userspace, or a recovery/replacement timeout.

> But at this time, the app iscsid has exited, and the session was marked as

> ISCSI_STATE_FAILED, So the SCSI EH process will never be 

> scheduled back again.

> 

> PID: 9123   TASK: ffff80020c1b4d80  CPU: 3   COMMAND: "scsi_eh_2"

>  #0 [ffff00008632bb70] __switch_to at ffff000080088738

>  #1 [ffff00008632bb90] __schedule at ffff000080a00480

>  #2 [ffff00008632bc20] schedule at ffff000080a00b58

>  #3 [ffff00008632bc30] iscsi_eh_session_reset at ffff000000d1ab9c [libiscsi]

>  #4 [ffff00008632bcb0] iscsi_eh_recover_target at ffff000000d1d1fc [libiscsi]

>  #5 [ffff00008632bd00] scsi_try_target_reset at ffff0000806f0bac

>  #6 [ffff00008632bd30] scsi_eh_ready_devs at ffff0000806f2724

>  #7 [ffff00008632bde0] scsi_error_handler at ffff0000806f41d4

>  #8 [ffff00008632be70] kthread at ffff000080119ae0

> 

> Reported-by: Tianxiong Lu <lutianxiong@huawei.com>

> Signed-off-by: Wu Bo <wubo40@huawei.com>

> ---

>  drivers/scsi/libiscsi.c | 12 +++++++++++-

>  1 file changed, 11 insertions(+), 1 deletion(-)

> 

> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c

> index 1e9c317..2570768 100644

> --- a/drivers/scsi/libiscsi.c

> +++ b/drivers/scsi/libiscsi.c

> @@ -2380,7 +2380,17 @@ int iscsi_eh_session_reset(struct scsi_cmnd *sc)

>  

>  	mutex_lock(&session->eh_mutex);

>  	spin_lock_bh(&session->frwd_lock);

> -	if (session->state == ISCSI_STATE_TERMINATE) {

> +

> +	/*

> +	 * During shutdown, if session is prematurely disconnected

> +	 * recovery won't happen and there will be hung cmds.

> +	 * To solve this case, all cmds would be enter scsi EH.

> +	 * But the EH path will wait for wait_event_interruptible() completed,

> +	 * when the session state machine is not ISCSI_STATE_TERMINATE,

> +	 * ISCSI_STATE_LOGGED_IN and ISCSI_STATE_RECOVERY_FAILED.

> +	 */

> +	if (session->state == ISCSI_STATE_TERMINATE ||

> +		unlikely(system_state != SYSTEM_RUNNING)) {

>  failed:

>  		ISCSI_DBG_EH(session,

>  			     "failing session reset: Could not log back into "

> 


Reviewed-by: Lee Duncan <lduncan@suse.com>
Mike Christie Oct. 31, 2020, 7:07 p.m. UTC | #2
On 10/31/20 3:23 AM, Wu Bo wrote:
> For some reason, during reboot the system, iscsi.service failed to

> logout all sessions. kernel will hang forever on its

> sd_sync_cache() logic, after issuing the SYNCHRONIZE_CACHE cmd to all

> still existent paths.

> 

> [ 1044.098991] reboot: Mddev shutdown finished.

> [ 1044.099311] reboot: Usermodehelper disable finished.

> [ 1050.611244]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295152378, last ping 4295153633, now 4295154944

> [ 1348.599676] Call trace:

> [ 1348.599887]  __switch_to+0xe8/0x150

> [ 1348.600113]  __schedule+0x33c/0xa08

> [ 1348.600372]  schedule+0x2c/0x88

> [ 1348.600567]  schedule_timeout+0x184/0x3a8

> [ 1348.600820]  io_schedule_timeout+0x28/0x48

> [ 1348.601089]  wait_for_common_io.constprop.2+0x168/0x258

> [ 1348.601425]  wait_for_completion_io_timeout+0x28/0x38

> [ 1348.601762]  blk_execute_rq+0x98/0xd8

> [ 1348.602006]  __scsi_execute+0xe0/0x1e8

> [ 1348.602262]  sd_sync_cache+0xd0/0x220 [sd_mod]

> [ 1348.602551]  sd_shutdown+0x6c/0xf8 [sd_mod]

> [ 1348.602826]  device_shutdown+0x13c/0x250

> [ 1348.603078]  kernel_restart_prepare+0x5c/0x68

> [ 1348.603400]  kernel_restart+0x20/0x98

> [ 1348.603683]  __se_sys_reboot+0x214/0x260

> [ 1348.603987]  __arm64_sys_reboot+0x24/0x30

> [ 1348.604300]  el0_svc_common+0x80/0x1b8

> [ 1348.604590]  el0_svc_handler+0x78/0xe0

> [ 1348.604877]  el0_svc+0x10/0x260

> 

> d754941225 (scsi: libiscsi: Allow sd_shutdown on bad transport) Once

> solved this problem. The iscsi_eh_cmd_timed_out() function add system_state

> judgment, and will return BLK_EH_DONE and mark the result as

> DID_NO_CONNECT when system_state is not SYSTEM_RUNNING,

> To tell upper layers that the command was handled during

> the transport layer error handler helper.

> 

> The scsi Mid Layer timeout handler function(scsi_times_out) will be

> abort the cmd if the scsi LLD timeout handler return BLK_EH_DONE.

> if abort cmd failed, will enter scsi EH logic.

> 

> Scsi EH will do reset target logic, if reset target failed, Will

> call iscsi_eh_session_reset() function to drop the session.

> 

> The iscsi_eh_session_reset function will wait for a relogin,

> session termination from userspace, or a recovery/replacement timeout.

> But at this time, the app iscsid has exited, and the session was marked as

> ISCSI_STATE_FAILED, So the SCSI EH process will never be

> scheduled back again.

> 

> PID: 9123   TASK: ffff80020c1b4d80  CPU: 3   COMMAND: "scsi_eh_2"

>   #0 [ffff00008632bb70] __switch_to at ffff000080088738

>   #1 [ffff00008632bb90] __schedule at ffff000080a00480

>   #2 [ffff00008632bc20] schedule at ffff000080a00b58

>   #3 [ffff00008632bc30] iscsi_eh_session_reset at ffff000000d1ab9c [libiscsi]

>   #4 [ffff00008632bcb0] iscsi_eh_recover_target at ffff000000d1d1fc [libiscsi]

>   #5 [ffff00008632bd00] scsi_try_target_reset at ffff0000806f0bac

>   #6 [ffff00008632bd30] scsi_eh_ready_devs at ffff0000806f2724

>   #7 [ffff00008632bde0] scsi_error_handler at ffff0000806f41d4

>   #8 [ffff00008632be70] kthread at ffff000080119ae0

> 

> Reported-by: Tianxiong Lu <lutianxiong@huawei.com>

> Signed-off-by: Wu Bo <wubo40@huawei.com>

> ---

>   drivers/scsi/libiscsi.c | 12 +++++++++++-

>   1 file changed, 11 insertions(+), 1 deletion(-)

> 

> diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c

> index 1e9c317..2570768 100644

> --- a/drivers/scsi/libiscsi.c

> +++ b/drivers/scsi/libiscsi.c

> @@ -2380,7 +2380,17 @@ int iscsi_eh_session_reset(struct scsi_cmnd *sc)

>   

>   	mutex_lock(&session->eh_mutex);

>   	spin_lock_bh(&session->frwd_lock);

> -	if (session->state == ISCSI_STATE_TERMINATE) {

> +

> +	/*

> +	 * During shutdown, if session is prematurely disconnected

> +	 * recovery won't happen and there will be hung cmds.

> +	 * To solve this case, all cmds would be enter scsi EH.

> +	 * But the EH path will wait for wait_event_interruptible() completed,

> +	 * when the session state machine is not ISCSI_STATE_TERMINATE,

> +	 * ISCSI_STATE_LOGGED_IN and ISCSI_STATE_RECOVERY_FAILED.

> +	 */

> +	if (session->state == ISCSI_STATE_TERMINATE ||

> +		unlikely(system_state != SYSTEM_RUNNING)) {

>   failed:

>   		ISCSI_DBG_EH(session,

>   			     "failing session reset: Could not log back into "


Do you need this with the current code? If the system_state is not 
SYSTEM_RUNNING above, shouldn't we call

iscsi_conn_failure -> iscsi_conn_error_event ->
stop_conn_work_fn -> iscsi_if_stop_conn(STOP_CONN_TERM) -> iscsi_conn_stop.

iscsi_conn_stop will set session->state to ISCSI_STATE_TERMINATE, so 
when iscsi_eh_session_reset does:

wait_event_interruptible(conn->ehwait,
                          session->state == ISCSI_STATE_TERMINATE ||

....

that will fail immediately right?
Wu Bo Nov. 2, 2020, 12:53 p.m. UTC | #3
On 2020/10/31 16:23, Wu Bo wrote:
> For some reason, during reboot the system, iscsi.service failed to

> logout all sessions. kernel will hang forever on its

> sd_sync_cache() logic, after issuing the SYNCHRONIZE_CACHE cmd to all

> still existent paths.

> 

> [ 1044.098991] reboot: Mddev shutdown finished.

> [ 1044.099311] reboot: Usermodehelper disable finished.

> [ 1050.611244]  connection2:0: ping timeout of 5 secs expired, recv timeout 5, last rx 4295152378, last ping 4295153633, now 4295154944

> [ 1348.599676] Call trace:

> [ 1348.599887]  __switch_to+0xe8/0x150

> [ 1348.600113]  __schedule+0x33c/0xa08

> [ 1348.600372]  schedule+0x2c/0x88

> [ 1348.600567]  schedule_timeout+0x184/0x3a8

> [ 1348.600820]  io_schedule_timeout+0x28/0x48

> [ 1348.601089]  wait_for_common_io.constprop.2+0x168/0x258

> [ 1348.601425]  wait_for_completion_io_timeout+0x28/0x38

> [ 1348.601762]  blk_execute_rq+0x98/0xd8

> [ 1348.602006]  __scsi_execute+0xe0/0x1e8

> [ 1348.602262]  sd_sync_cache+0xd0/0x220 [sd_mod]

> [ 1348.602551]  sd_shutdown+0x6c/0xf8 [sd_mod]

> [ 1348.602826]  device_shutdown+0x13c/0x250

> [ 1348.603078]  kernel_restart_prepare+0x5c/0x68

> [ 1348.603400]  kernel_restart+0x20/0x98

> [ 1348.603683]  __se_sys_reboot+0x214/0x260

> [ 1348.603987]  __arm64_sys_reboot+0x24/0x30

> [ 1348.604300]  el0_svc_common+0x80/0x1b8

> [ 1348.604590]  el0_svc_handler+0x78/0xe0

> [ 1348.604877]  el0_svc+0x10/0x260

> 


Hi,

Sorry, I did not add this commit:

commit 0ab710458da113a71c461c4df27e7f1353d9f864
Author: Bharath Ravi <rbharath@google.com>
Date:   Sat Jan 25 01:19:25 2020 -0500

     scsi: iscsi: Perform connection failure entirely in kernel space

It makes the session recovery in the kernel do not need iscsid.

This commit was not added in my system, So the problem still exists. 
After adding the patch, Tested and verified that the problem I 
encountered has been resolved.

Please ignore this patch.

Thanks.
diff mbox series

Patch

diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 1e9c317..2570768 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -2380,7 +2380,17 @@  int iscsi_eh_session_reset(struct scsi_cmnd *sc)
 
 	mutex_lock(&session->eh_mutex);
 	spin_lock_bh(&session->frwd_lock);
-	if (session->state == ISCSI_STATE_TERMINATE) {
+
+	/*
+	 * During shutdown, if session is prematurely disconnected
+	 * recovery won't happen and there will be hung cmds.
+	 * To solve this case, all cmds would be enter scsi EH.
+	 * But the EH path will wait for wait_event_interruptible() completed,
+	 * when the session state machine is not ISCSI_STATE_TERMINATE,
+	 * ISCSI_STATE_LOGGED_IN and ISCSI_STATE_RECOVERY_FAILED.
+	 */
+	if (session->state == ISCSI_STATE_TERMINATE ||
+		unlikely(system_state != SYSTEM_RUNNING)) {
 failed:
 		ISCSI_DBG_EH(session,
 			     "failing session reset: Could not log back into "