Message ID | 1604556596-27228-1-git-send-email-muneendra.kumar@broadcom.com |
---|---|
Headers | show |
Series | scsi: Support to handle Intermittent errors | expand |
On 11/5/20 12:09 AM, Muneendra wrote: > int fc_block_scsi_eh(struct scsi_cmnd *cmnd) > { > struct fc_rport *rport = starget_to_rport(scsi_target(cmnd->device)); > + int ret = 0; > > if (WARN_ON_ONCE(!rport)) > return FAST_IO_FAIL; > > - return fc_block_rport(rport); > + ret = fc_block_rport(rport); > + /* > + * Clear the SCMD_NORETRIES_ABORT bit if the Port state has > + * changed from marginal to online due to > + * fc_remote_port_delete and fc_remote_port_add > + */ > + if (rport->port_state != FC_PORTSTATE_MARGINAL) > + clear_bit(SCMD_NORETRIES_ABORT, &cmnd->state); > + return ret; > } Hey sorry for the late reply. I was trying to test some things out but am not sure if all drivers work the same. For the code above, what will happen if we have passed that check in the driver, then the driver does the report del and add sequence? Let's say it's initially calling the abort callout, and we passed that check, we then do the del/add seqeuence, what will happen next? Do the fc drivers return success or failure for the abort call. What happens for the other callouts too? If failure, then the eh escalates and when we call the next callout, and we hit the check above and will clear it, so we are ok. If success then we would not get a chance to clear it right? If this is the case, then I think you need to instead go the route where you add the eh cmd completion/decide_disposition callout. You would call it in scmd_eh_abort_handler, scsi_eh_bus_device_reset, etc when we are deciding if we want to retry/fail the command. In this approach you do not need the eh_timed_out changes, since we only seem to care about the port state after the eh callout has completed.
Hi Mike, Thanks for the input. Below are my replies. >Hey sorry for the late reply. I was trying to test some things out but am >not sure if all drivers work the same. >For the code above, what will happen if we have passed that check in the >driver, then the driver does the report del and add sequence? Let's say >it's initially calling the abort callout, and we passed that check, we then >do the >del/add seqeuence, what will happen next? Do the fc drivers return >success or failure for the abort call. What happens for the other callouts >too? >If failure, then the eh escalates and when we call the next callout, and we >hit the check above and will clear it, so we are ok. If success then we would not get a chance to clear it right? [Muneendra]Agreed. So what about clearing the flags in fc_remote_port_del. I think this should address all the concerns? > If this is the case, then I think you need to instead go the route where > you add the eh cmd completion/decide_disposition callout. You would call > it in scmd_eh_abort_handler, scsi_eh_bus_device_reset, etc when we are > deciding if we want to retry/fail the command. [Muneendra]Sorry I didn't get what you are saying could you please elaborate on the same. In this approach you do not need the eh_timed_out changes, since we only seem to care about the port state after the eh callout has completed. [Muneendra]what about setting the SCMD_NORETRIES_ABORT bit? Regards, Muneendra.
On 11/5/20 11:27 AM, Muneendra Kumar M wrote: > Hi Mike, > Thanks for the input. > Below are my replies. > > >> Hey sorry for the late reply. I was trying to test some things out but am >> not sure if all drivers work the same. > >> For the code above, what will happen if we have passed that check in the >> driver, then the driver does the report del and add sequence? Let's say >> it's initially calling the abort callout, and we passed that check, we then >> do the >del/add seqeuence, what will happen next? Do the fc drivers return >> success or failure for the abort call. What happens for the other callouts >> too? > >> If failure, then the eh escalates and when we call the next callout, and we >> hit the check above and will clear it, so we are ok. > > If success then we would not get a chance to clear it right? > [Muneendra]Agreed. So what about clearing the flags in fc_remote_port_del. I > think this should address all the concerns? > >> If this is the case, then I think you need to instead go the route where >> you add the eh cmd completion/decide_disposition callout. You would call >> it in scmd_eh_abort_handler, scsi_eh_bus_device_reset, etc when we are >> deciding if we want to retry/fail the command. > [Muneendra]Sorry I didn't get what you are saying could you please elaborate > on the same. > > In this approach you do not need the eh_timed_out changes, since we only > seem to care about the port state after the eh callout has completed. > [Muneendra]what about setting the SCMD_NORETRIES_ABORT bit? > I don't think you need it. It sounds like we only care about the port state when the cmd is completing. For example we have: 1. the case where the cmd times out, we do aborts/resets, then the port state goes into marginal, then the aborts/resets complete. We want to fail the cmds without retries. 2. If the port state is in marginal, the cmd times out, we do the aborts/resets and when we are done if the port state is still marginal we want to fail the cmd without retries. 3. If the port state is marginal (or any value), before or after the cmd initially times out, but the port state goes back to online, then when the aborts/resets complete we want to retry the cmd. So can we just add a callout to check the port state when the eh has completed like the untested unfinished patch below: diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c index 983eeb0..8ad3a9a 100644 --- a/drivers/scsi/lpfc/lpfc_scsi.c +++ b/drivers/scsi/lpfc/lpfc_scsi.c @@ -6041,6 +6041,7 @@ struct scsi_host_template lpfc_template = { .info = lpfc_info, .queuecommand = lpfc_queuecommand, .eh_timed_out = fc_eh_timed_out, + .eh_timed_out = fc_eh_should_retry_cmd, .eh_abort_handler = lpfc_abort_handler, .eh_device_reset_handler = lpfc_device_reset_handler, .eh_target_reset_handler = lpfc_target_reset_handler, diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index f11f51e..7c66d17 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -140,6 +140,7 @@ static bool scsi_cmd_retry_allowed(struct scsi_cmnd *cmd) struct scsi_cmnd *scmd = container_of(work, struct scsi_cmnd, abort_work.work); struct scsi_device *sdev = scmd->device; + struct Scsi_Host *host = sdev->host; int rtn; if (scsi_host_eh_past_deadline(sdev->host)) { @@ -159,7 +160,8 @@ static bool scsi_cmd_retry_allowed(struct scsi_cmnd *cmd) "eh timeout, not retrying " "aborted command\n")); } else if (!scsi_noretry_cmd(scmd) && - scsi_cmd_retry_allowed(scmd)) { + scsi_cmd_retry_allowed(scmd) && + host->hostt->eh_should_retry_cmd(scmd)) { SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_WARNING, scmd, "retry aborted command\n")); @@ -2105,7 +2107,8 @@ void scsi_eh_flush_done_q(struct list_head *done_q) list_for_each_entry_safe(scmd, next, done_q, eh_entry) { list_del_init(&scmd->eh_entry); if (scsi_device_online(scmd->device) && - !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd)) { + !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd) && + host->hostt->eh_should_retry_cmd(scmd)) { SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, "%s: flush retry cmd\n", diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c index 2ff7f06..7011963 100644 --- a/drivers/scsi/scsi_transport_fc.c +++ b/drivers/scsi/scsi_transport_fc.c @@ -2043,6 +2043,18 @@ static int fc_vport_match(struct attribute_container *cont, return &i->vport_attr_cont.ac == cont; } +bool fc_eh_should_retry_cmd(struct scsi_cmnd *scmd) +{ + struct fc_rport *rport = starget_to_rport(scsi_target(scmd->device)); + + if (rport->port_state == FC_PORTSTATE_MARGINAL) + return false; + + /* Other port states will set the sdev state */ + /* TODO check comment above */ + return true; +} +EXPORT_SYMBOL_GPL(fc_eh_should_retry_cmd); /** * fc_eh_timed_out - FC Transport I/O timeout intercept handler diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h index 701f178..51d5af0 100644 --- a/include/scsi/scsi_host.h +++ b/include/scsi/scsi_host.h @@ -315,6 +315,13 @@ struct scsi_host_template { */ enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *); + /* + * Optional routine that allows the transport to decide if a cmd is + * retryable. Return true if the transport is in a state the cmd + * should be retried on. + */ + bool (*eh_should_retry_cmd)(struct scsi_cmnd *); + /* This is an optional routine that allows transport to initiate * LLD adapter or firmware reset using sysfs attribute. * diff --git a/include/scsi/scsi_transport_fc.h b/include/scsi/scsi_transport_fc.h index 1c7dd35..f21b583 100644 --- a/include/scsi/scsi_transport_fc.h +++ b/include/scsi/scsi_transport_fc.h @@ -803,6 +803,7 @@ struct fc_vport *fc_vport_create(struct Scsi_Host *shost, int channel, int fc_block_rport(struct fc_rport *rport); int fc_block_scsi_eh(struct scsi_cmnd *cmnd); enum blk_eh_timer_return fc_eh_timed_out(struct scsi_cmnd *scmd); +bool fc_eh_should_retry_cmd(struct scsi_cmnd *scmd); static inline struct Scsi_Host *fc_bsg_to_shost(struct bsg_job *job) {
Hi Mike, Thanks for the detailed input. I will apply the below changes and will resubmit the patches. Regards, Muneendra. -----Original Message----- From: Mike Christie [mailto:michael.christie@oracle.com] Sent: Thursday, November 5, 2020 9:29 PM To: Muneendra <muneendra.kumar@broadcom.com>; linux-scsi@vger.kernel.org; hare@suse.de Cc: jsmart2021@gmail.com; emilne@redhat.com; mkumar@redhat.com Subject: Re: [PATCH v6 3/4] scsi_transport_fc: Added a new rport state FC_PORTSTATE_MARGINAL On 11/5/20 12:09 AM, Muneendra wrote: > int fc_block_scsi_eh(struct scsi_cmnd *cmnd) > { > struct fc_rport *rport = > starget_to_rport(scsi_target(cmnd->device)); > + int ret = 0; > > if (WARN_ON_ONCE(!rport)) > return FAST_IO_FAIL; > > - return fc_block_rport(rport); > + ret = fc_block_rport(rport); > + /* > + * Clear the SCMD_NORETRIES_ABORT bit if the Port state has > + * changed from marginal to online due to > + * fc_remote_port_delete and fc_remote_port_add > + */ > + if (rport->port_state != FC_PORTSTATE_MARGINAL) > + clear_bit(SCMD_NORETRIES_ABORT, &cmnd->state); > + return ret; > } Hey sorry for the late reply. I was trying to test some things out but am not sure if all drivers work the same. For the code above, what will happen if we have passed that check in the driver, then the driver does the report del and add sequence? Let's say it's initially calling the abort callout, and we passed that check, we then do the del/add seqeuence, what will happen next? Do the fc drivers return success or failure for the abort call. What happens for the other callouts too? If failure, then the eh escalates and when we call the next callout, and we hit the check above and will clear it, so we are ok. If success then we would not get a chance to clear it right? If this is the case, then I think you need to instead go the route where you add the eh cmd completion/decide_disposition callout. You would call it in scmd_eh_abort_handler, scsi_eh_bus_device_reset, etc when we are deciding if we want to retry/fail the command. In this approach you do not need the eh_timed_out changes, since we only seem to care about the port state after the eh callout has completed.
Hi Mike, Thanks for the detailed input. I will apply the below changes and will resubmit the patches. Regards, Muneendra. -----Original Message----- From: Mike Christie [mailto:michael.christie@oracle.com] Sent: Friday, November 6, 2020 12:46 AM To: Muneendra Kumar M <muneendra.kumar@broadcom.com>; linux-scsi@vger.kernel.org; hare@suse.de Cc: jsmart2021@gmail.com; emilne@redhat.com; mkumar@redhat.com Subject: Re: [PATCH v6 3/4] scsi_transport_fc: Added a new rport state FC_PORTSTATE_MARGINAL On 11/5/20 11:27 AM, Muneendra Kumar M wrote: > Hi Mike, > Thanks for the input. > Below are my replies. > > >> Hey sorry for the late reply. I was trying to test some things out >> but am not sure if all drivers work the same. > >> For the code above, what will happen if we have passed that check in >> the driver, then the driver does the report del and add sequence? >> Let's say it's initially calling the abort callout, and we passed >> that check, we then do the >del/add seqeuence, what will happen next? >> Do the fc drivers return success or failure for the abort call. What >> happens for the other callouts too? > >> If failure, then the eh escalates and when we call the next callout, >> and we hit the check above and will clear it, so we are ok. > > If success then we would not get a chance to clear it right? > [Muneendra]Agreed. So what about clearing the flags in > fc_remote_port_del. I think this should address all the concerns? > >> If this is the case, then I think you need to instead go the route >> where you add the eh cmd completion/decide_disposition callout. You >> would call it in scmd_eh_abort_handler, scsi_eh_bus_device_reset, etc >> when we are deciding if we want to retry/fail the command. > [Muneendra]Sorry I didn't get what you are saying could you please > elaborate on the same. > > In this approach you do not need the eh_timed_out changes, since we > only seem to care about the port state after the eh callout has completed. > [Muneendra]what about setting the SCMD_NORETRIES_ABORT bit? > I don't think you need it. It sounds like we only care about the port state when the cmd is completing. For example we have: 1. the case where the cmd times out, we do aborts/resets, then the port state goes into marginal, then the aborts/resets complete. We want to fail the cmds without retries. 2. If the port state is in marginal, the cmd times out, we do the aborts/resets and when we are done if the port state is still marginal we want to fail the cmd without retries. 3. If the port state is marginal (or any value), before or after the cmd initially times out, but the port state goes back to online, then when the aborts/resets complete we want to retry the cmd. So can we just add a callout to check the port state when the eh has completed like the untested unfinished patch below: diff --git a/drivers/scsi/lpfc/lpfc_scsi.c b/drivers/scsi/lpfc/lpfc_scsi.c index 983eeb0..8ad3a9a 100644 --- a/drivers/scsi/lpfc/lpfc_scsi.c +++ b/drivers/scsi/lpfc/lpfc_scsi.c @@ -6041,6 +6041,7 @@ struct scsi_host_template lpfc_template = { .info = lpfc_info, .queuecommand = lpfc_queuecommand, .eh_timed_out = fc_eh_timed_out, + .eh_timed_out = fc_eh_should_retry_cmd, .eh_abort_handler = lpfc_abort_handler, .eh_device_reset_handler = lpfc_device_reset_handler, .eh_target_reset_handler = lpfc_target_reset_handler, diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c index f11f51e..7c66d17 100644 --- a/drivers/scsi/scsi_error.c +++ b/drivers/scsi/scsi_error.c @@ -140,6 +140,7 @@ static bool scsi_cmd_retry_allowed(struct scsi_cmnd *cmd) struct scsi_cmnd *scmd = container_of(work, struct scsi_cmnd, abort_work.work); struct scsi_device *sdev = scmd->device; + struct Scsi_Host *host = sdev->host; int rtn; if (scsi_host_eh_past_deadline(sdev->host)) { @@ -159,7 +160,8 @@ static bool scsi_cmd_retry_allowed(struct scsi_cmnd *cmd) "eh timeout, not retrying " "aborted command\n")); } else if (!scsi_noretry_cmd(scmd) && - scsi_cmd_retry_allowed(scmd)) { + scsi_cmd_retry_allowed(scmd) && + host->hostt->eh_should_retry_cmd(scmd)) { SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_WARNING, scmd, "retry aborted command\n")); @@ -2105,7 +2107,8 @@ void scsi_eh_flush_done_q(struct list_head *done_q) list_for_each_entry_safe(scmd, next, done_q, eh_entry) { list_del_init(&scmd->eh_entry); if (scsi_device_online(scmd->device) && - !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd)) { + !scsi_noretry_cmd(scmd) && scsi_cmd_retry_allowed(scmd) && + host->hostt->eh_should_retry_cmd(scmd)) { SCSI_LOG_ERROR_RECOVERY(3, scmd_printk(KERN_INFO, scmd, "%s: flush retry cmd\n", diff --git a/drivers/scsi/scsi_transport_fc.c b/drivers/scsi/scsi_transport_fc.c index 2ff7f06..7011963 100644 --- a/drivers/scsi/scsi_transport_fc.c +++ b/drivers/scsi/scsi_transport_fc.c @@ -2043,6 +2043,18 @@ static int fc_vport_match(struct attribute_container *cont, return &i->vport_attr_cont.ac == cont; } +bool fc_eh_should_retry_cmd(struct scsi_cmnd *scmd) { + struct fc_rport *rport = starget_to_rport(scsi_target(scmd->device)); + + if (rport->port_state == FC_PORTSTATE_MARGINAL) + return false; + + /* Other port states will set the sdev state */ + /* TODO check comment above */ + return true; +} +EXPORT_SYMBOL_GPL(fc_eh_should_retry_cmd); /** * fc_eh_timed_out - FC Transport I/O timeout intercept handler diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h index 701f178..51d5af0 100644 --- a/include/scsi/scsi_host.h +++ b/include/scsi/scsi_host.h @@ -315,6 +315,13 @@ struct scsi_host_template { */ enum blk_eh_timer_return (*eh_timed_out)(struct scsi_cmnd *); + /* + * Optional routine that allows the transport to decide if a cmd is + * retryable. Return true if the transport is in a state the cmd + * should be retried on. + */ + bool (*eh_should_retry_cmd)(struct scsi_cmnd *); + /* This is an optional routine that allows transport to initiate * LLD adapter or firmware reset using sysfs attribute. * diff --git a/include/scsi/scsi_transport_fc.h b/include/scsi/scsi_transport_fc.h index 1c7dd35..f21b583 100644 --- a/include/scsi/scsi_transport_fc.h +++ b/include/scsi/scsi_transport_fc.h @@ -803,6 +803,7 @@ struct fc_vport *fc_vport_create(struct Scsi_Host *shost, int channel, int fc_block_rport(struct fc_rport *rport); int fc_block_scsi_eh(struct scsi_cmnd *cmnd); enum blk_eh_timer_return fc_eh_timed_out(struct scsi_cmnd *scmd); +bool fc_eh_should_retry_cmd(struct scsi_cmnd *scmd); static inline struct Scsi_Host *fc_bsg_to_shost(struct bsg_job *job) {
On 11/5/20 8:15 PM, Mike Christie wrote: > On 11/5/20 11:27 AM, Muneendra Kumar M wrote: >> Hi Mike, >> Thanks for the input. >> Below are my replies. >> >> >>> Hey sorry for the late reply. I was trying to test some things out but am >>> not sure if all drivers work the same. >> >>> For the code above, what will happen if we have passed that check in the >>> driver, then the driver does the report del and add sequence? Let's say >>> it's initially calling the abort callout, and we passed that check, we then >>> do the >del/add seqeuence, what will happen next? Do the fc drivers return >>> success or failure for the abort call. What happens for the other callouts >>> too? >> >>> If failure, then the eh escalates and when we call the next callout, and we >>> hit the check above and will clear it, so we are ok. >> >> If success then we would not get a chance to clear it right? >> [Muneendra]Agreed. So what about clearing the flags in fc_remote_port_del. I >> think this should address all the concerns? >> >>> If this is the case, then I think you need to instead go the route where >>> you add the eh cmd completion/decide_disposition callout. You would call >>> it in scmd_eh_abort_handler, scsi_eh_bus_device_reset, etc when we are >>> deciding if we want to retry/fail the command. >> [Muneendra]Sorry I didn't get what you are saying could you please elaborate >> on the same. >> >> In this approach you do not need the eh_timed_out changes, since we only >> seem to care about the port state after the eh callout has completed. >> [Muneendra]what about setting the SCMD_NORETRIES_ABORT bit? >> > > I don't think you need it. It sounds like we only care about the port state > when the cmd is completing. For example we have: > > 1. the case where the cmd times out, we do aborts/resets, then the > port state goes into marginal, then the aborts/resets complete. We want to > fail the cmds without retries. > > 2. If the port state is in marginal, the cmd times out, we do the aborts/resets > and when we are done if the port state is still marginal we want to fail the > cmd without retries. > > 3. If the port state is marginal (or any value), before or after the cmd > initially times out, but the port state goes back to online, then when the > aborts/resets complete we want to retry the cmd. > Actually, I don't think the third case can (or should) happen. A transition from marginal to online should always include a link bounce (ie a rport_del/rport_add sequence), which would cancel all outstanding commands anyway. And if we have the above provision we could clear the flag in rport_del() and everything would be dandy. Cheers, Hannes -- Dr. Hannes Reinecke Kernel Storage Architect hare@suse.de +49 911 74053 688 SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
On 11/6/20 1:23 AM, Hannes Reinecke wrote: > On 11/5/20 8:15 PM, Mike Christie wrote: >> On 11/5/20 11:27 AM, Muneendra Kumar M wrote: >>> Hi Mike, >>> Thanks for the input. >>> Below are my replies. >>> >>> >>>> Hey sorry for the late reply. I was trying to test some things out but am >>>> not sure if all drivers work the same. >>> >>>> For the code above, what will happen if we have passed that check in the >>>> driver, then the driver does the report del and add sequence? Let's say >>>> it's initially calling the abort callout, and we passed that check, we then >>>> do the >del/add seqeuence, what will happen next? Do the fc drivers return >>>> success or failure for the abort call. What happens for the other callouts >>>> too? >>> >>>> If failure, then the eh escalates and when we call the next callout, and we >>>> hit the check above and will clear it, so we are ok. >>> >>> If success then we would not get a chance to clear it right? >>> [Muneendra]Agreed. So what about clearing the flags in fc_remote_port_del. I >>> think this should address all the concerns? >>> >>>> If this is the case, then I think you need to instead go the route where >>>> you add the eh cmd completion/decide_disposition callout. You would call >>>> it in scmd_eh_abort_handler, scsi_eh_bus_device_reset, etc when we are >>>> deciding if we want to retry/fail the command. >>> [Muneendra]Sorry I didn't get what you are saying could you please elaborate >>> on the same. >>> >>> In this approach you do not need the eh_timed_out changes, since we only >>> seem to care about the port state after the eh callout has completed. >>> [Muneendra]what about setting the SCMD_NORETRIES_ABORT bit? >>> >> >> I don't think you need it. It sounds like we only care about the port state >> when the cmd is completing. For example we have: >> >> 1. the case where the cmd times out, we do aborts/resets, then the >> port state goes into marginal, then the aborts/resets complete. We want to >> fail the cmds without retries. >> >> 2. If the port state is in marginal, the cmd times out, we do the aborts/resets >> and when we are done if the port state is still marginal we want to fail the >> cmd without retries. >> >> 3. If the port state is marginal (or any value), before or after the cmd >> initially times out, but the port state goes back to online, then when the >> aborts/resets complete we want to retry the cmd. >> > Actually, I don't think the third case can (or should) happen. > A transition from marginal to online should always include a link bounce (ie a rport_del/rport_add sequence), which would cancel all outstanding commands anyway. > And if we have the above provision we could clear the flag in rport_del() and everything would be dandy. That is the part I didn't like or I thought could have issues: 1. What if we go back into marginal after the add() but before we have done scsi_eh_flush_done_q? You need the original looping code and some code for del() right? 2. My issue with #1 is why do we want to add code that loops over commands when we only care about the port state and when scsi_eh_flush_done_q and the abort completion code is already looping over them. 3. This is probably a matter of preference, and I can see why people would not like an extra callout. However, I like the idea that we have an eh callout for the transport class that checks if it's even worth it to run the eh code based on the port state and then we have a eh completion callout that can also check the port state and determine if it's ok. If we had a marginal scsi_device state or even a bit then you would not actually need #3. The fc class could just set the device state/bit and we could check that in the eh completion code paths.