[v2,0/8] scsi: Support to handle Intermittent errors

Message ID 1601268657-940-1-git-send-email-muneendra.kumar@broadcom.com

Message

Muneendra Kumar Sept. 28, 2020, 4:50 a.m. UTC
This patch series adds support for preventing retries of all pending/inflight
I/Os after an abort succeeds on a particular device when transport
connectivity to the device is encountering intermittent errors.

Intermittent connectivity is a condition that can be detected by transport
fabric notifications. A service can monitor the ELS notifications and
take action on all the outstanding I/Os of a scsi device at that instant.

This feature is intended for devices that are part of a multipath
environment. When the service detects poor connectivity, the multipath
path can be placed in a marginal path group and ignored for further I/O
operations.

After placing a path in the marginal path group, the daemon sets the
port_state to Marginal through the new sysfs interface provided in this
series. This sets a bit in scmd->state for all the outstanding I/Os on
that particular device, which prevents retries of all the pending/inflight
I/Os if an I/O hits a scsi timeout, which in turn issues an abort.
If the abort succeeds on a marginal path, the I/O is immediately retried
on another active path. If the abort fails, recovery escalates to the
existing target reset sg interface recovery process.
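
The decision described above can be sketched in plain C. This is only an
illustration under stated assumptions, not the code from the patches: the
bit value chosen for SCMD_NORETRIES_ABORT, the enum, and the function
after_abort() are hypothetical, and the "normal retry" branch assumes the
usual retry handling applies when the bit is not set.

```c
#include <stdbool.h>

/* Hypothetical bit value; the series defines the real one in scsi_cmnd.h. */
#define SCMD_NORETRIES_ABORT (1u << 0)

/* Hypothetical outcomes, named for illustration only. */
enum abort_outcome {
    OUTCOME_NORMAL_RETRY,      /* path not marginal: usual retry handling */
    OUTCOME_FAIL_TO_MULTIPATH, /* marginal path: no retry here, so that
                                  multipath can use another active path */
    OUTCOME_ESCALATE_EH,       /* abort failed: escalate error recovery */
};

/* Decide what happens to a timed-out command once the abort has completed. */
enum abort_outcome after_abort(unsigned int cmd_state, bool abort_succeeded)
{
    if (!abort_succeeded)
        return OUTCOME_ESCALATE_EH;
    if (cmd_state & SCMD_NORETRIES_ABORT)
        return OUTCOME_FAIL_TO_MULTIPATH;
    return OUTCOME_NORMAL_RETRY;
}
```

In this sketch a failed abort always escalates, whether or not the bit is
set, which matches the description that escalation follows the existing
recovery process.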

Below is the interface provided to set the port state to Marginal
and Online.

echo "Marginal" >> /sys/class/fc_transport/targetX\:Y\:Z/port_state
echo "Online" >> /sys/class/fc_transport/targetX\:Y\:Z/port_state
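
A daemon would more likely write the attribute from code than from a
shell. A minimal sketch in C follows, assuming the attribute accepts the
same newline-terminated strings that echo writes; the helper name
set_port_state() is hypothetical and the caller supplies the full path to
the port_state file.

```c
#include <stdio.h>
#include <string.h>

/*
 * Write "Marginal" or "Online" to a port_state attribute file.
 * Returns 0 on success, -1 on an unknown state or an I/O error.
 * set_port_state() is an illustrative helper, not part of the series.
 */
int set_port_state(const char *attr_path, const char *state)
{
    FILE *f;
    int ok;

    /* Only the two states named in the cover letter are accepted. */
    if (strcmp(state, "Marginal") != 0 && strcmp(state, "Online") != 0)
        return -1;

    f = fopen(attr_path, "w");
    if (!f)
        return -1;

    /* Match what echo sends: the state followed by a newline. */
    ok = fprintf(f, "%s\n", state) > 0;

    /* Write errors may only surface when the stream is flushed on close. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Checking the fclose() result matters here because a rejected write to a
sysfs attribute may only be reported when the buffered data is flushed.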


The patches were cut against the 5.10/scsi-queue tree.

---
v2:
Added new error code DID_TRANSPORT_MARGINAL to handle marginal errors.
Added a new rport_state FC_PORTSTATE_MARGINAL and also added a new
sysfs interface port_state to set the port_state to marginal.
Added the support in lpfc to handle the marginal state.


Muneendra (8):
  scsi: Added a new definition in scsi_cmnd.h
  scsi: Added a new error code in scsi.h
  scsi: Clear state bit SCMD_NORETRIES_ABORT of scsi_cmd before start
    request
  scsi: No retries on abort success
  scsi: Added routine to set/clear SCMD_NORETRIES_ABORT bit for
    outstanding io on scsi_dev
  scsi_transport_fc: Added a new rport state FC_PORTSTATE_MARGINAL
  scsi_transport_fc: Added a new sysfs attribute port_state
  lpfc: Added support to handle marginal state

 drivers/scsi/lpfc/lpfc_scsi.c    |   6 ++
 drivers/scsi/scsi_error.c        |  86 +++++++++++++++++++
 drivers/scsi/scsi_lib.c          |   2 +
 drivers/scsi/scsi_priv.h         |   2 +
 drivers/scsi/scsi_transport_fc.c | 140 +++++++++++++++++++++++++++----
 include/scsi/scsi.h              |   1 +
 include/scsi/scsi_cmnd.h         |   3 +
 include/scsi/scsi_transport_fc.h |  24 ++++++
 8 files changed, 246 insertions(+), 18 deletions(-)

Comments

Mike Christie Oct. 2, 2020, 5:01 p.m. UTC | #1
On 9/27/20 11:50 PM, Muneendra wrote:
> This patch adds a support to prevent retries of all the pending/inflight
> io's after an abort succeeds on a particular device when transport
> connectivity to the device is encountering intermittent errors.
> 
> Intermittent connectivity is a condition that can be detected by transport
> fabric notifications. A service can monitor the ELS notifications and
> take action on all the outstanding io's of a scsi device at that instant.
> 

Is the service mentioned above a new daemon or is it integrated into
something like multipathd?

What does the part about monitoring ELS notifications mean? Is the
service just doing something like an ELS ECHO, or is it able to watch
the IO on the wire/card (like if you did tcpdump and watched iscsi/tcp
traffic) or is it something completely different?

James Smart Oct. 2, 2020, 5:27 p.m. UTC | #2
On 10/2/2020 10:01 AM, Mike Christie wrote:
> On 9/27/20 11:50 PM, Muneendra wrote:
>> This patch adds a support to prevent retries of all the pending/inflight
>> io's after an abort succeeds on a particular device when transport
>> connectivity to the device is encountering intermittent errors.
>>
>> Intermittent connectivity is a condition that can be detected by transport
>> fabric notifications. A service can monitor the ELS notifications and
>> take action on all the outstanding io's of a scsi device at that instant.
>>
> 
> Is the service mentioned above a new daemon or is it integrated into
> something like multipathd?
> 
> What does the part about monitoring ELS notifications mean? Is the
> service just doing something like a ELS ECHO, or is it able to watch
> the IO on the wire/card (like if you did tcpdump and watched iscsi/tcp
> traffic) or is it something completely different?
> 

For the last part: the FC drivers, on receiving FC FPIN ELSes, call a
scsi transport routine with the FPIN payload. The transport pushes this
as an "event" via netlink. An app bound to the local address used by the
scsi transport can receive the event and parse it.

This is a new daemon, specific to FC, which monitors for FPIN events, 
parses the related topology devices, then interacts with sysfs and 
possibly multipath based on what it's seeing from the fabric.
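
As a rough sketch of the receiving side, an application could open a
netlink socket on the scsi transport protocol and join the FC events
group. The function name open_fc_event_socket() is hypothetical, and the
group bit used below is an assumption meant to mirror
SCSI_NL_GRP_FC_EVENTS from <scsi/scsi_netlink_fc.h>; parsing of the FPIN
payload is left out.

```c
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>

/* Assumption: mirrors SCSI_NL_GRP_FC_EVENTS in <scsi/scsi_netlink_fc.h>. */
#define FC_EVENTS_GROUP (1u << 2)

/*
 * Open a netlink socket subscribed to FC transport events.
 * Returns the socket descriptor, or -1 if the socket cannot be created
 * or bound (for example, no scsi transport netlink support or
 * insufficient privileges).
 */
int open_fc_event_socket(void)
{
    struct sockaddr_nl addr;
    int fd;

    fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_SCSITRANSPORT);
    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.nl_family = AF_NETLINK;
    addr.nl_pid = 0;                  /* let the kernel pick the port id */
    addr.nl_groups = FC_EVENTS_GROUP; /* multicast group bitmask */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A daemon would then recv() datagrams on the returned descriptor and hand
each one to its FPIN parser before touching sysfs or multipath.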

-- james