diff mbox series

[1/1] scsi_dh_alua: properly handling the ALUA transitioning state

Message ID CAHZQxy+4sTPz9+pY3=7VJH+CLUJsDct81KtnR2be8ycN5mhqTg@mail.gmail.com
State New
Series [1/1] scsi_dh_alua: properly handling the ALUA transitioning state

Commit Message

Brian Bunker May 2, 2022, 3:09 p.m. UTC
The handling of the ALUA transitioning state is currently broken. When
a target goes into this state, it is expected that the target is
allowed to stay in this state for the implicit transition timeout
without a path failure. The handler has this logic, but it gets
skipped currently.

When the target transitions, there is in-flight I/O from the
initiator. The first of these responses from the target will be a unit
attention letting the initiator know that the ALUA state has changed.
The remaining in-flight I/Os, before the initiator finds out that the
portal state has changed, will return not ready, ALUA state is
transitioning. The portal state will change to
SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O
immediately failing the path unexpectedly. The path failure happens in
less than a second instead of the expected successes until the
transition timer is exceeded.

This change allows I/Os to continue while the path is in the ALUA
transitioning state. The handler already takes care of a target that
stays in the transitioning state for too long by changing the state to
ALUA state standby once the transition timeout is exceeded at which
point the path will fail.

Signed-off-by: Brian Bunker <brian@purestorage.com>
Acked-by: Krishna Kant <krishna.kant@purestorage.com>
Acked-by: Seamus Connor <sconnor@purestorage.com>
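
The hunk under discussion (quoted again by Christoph Hellwig in comment #11 below; the enclosing function is presumably alua_prep_fn() in drivers/scsi/device_handler/scsi_dh_alua.c):

        case SCSI_ACCESS_STATE_ACTIVE:
        case SCSI_ACCESS_STATE_LBA:
-               return BLK_STS_OK;
        case SCSI_ACCESS_STATE_TRANSITIONING:
-               return BLK_STS_AGAIN;
+               return BLK_STS_OK;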

Comments

Martin Wilck May 20, 2022, 10:57 a.m. UTC | #1
Brian, Martin, 

sorry, I've overlooked this patch previously. I have to say I think
it's wrong and shouldn't have been applied. At least I need more in-
depth explanation.

On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
> 
> > The handling of the ALUA transitioning state is currently broken.
> > When
> > a target goes into this state, it is expected that the target is
> > allowed to stay in this state for the implicit transition timeout
> > without a path failure. 

Can you please show me a quote from the specs on which this expectation
("without a path failure") is based? AFAIK the SCSI specs don't say
anything about device-mapper multipath semantics.

> > The handler has this logic, but it gets
> > skipped currently.
> > 
> > When the target transitions, there is in-flight I/O from the
> > initiator. The first of these responses from the target will be a
> > unit
> > attention letting the initiator know that the ALUA state has
> > changed.
> > The remaining in-flight I/Os, before the initiator finds out that
> > the
> > portal state has changed, will return not ready, ALUA state is
> > transitioning. The portal state will change to
> > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O
> > immediately failing the path unexpectedly. The path failure happens
> > in
> > less than a second instead of the expected successes until the
> > transition timer is exceeded.

dm multipath has no concept of "transitioning" state. Path state can be
either active or inactive. As Brian wrote, commands sent to the
transitioning device will return NOT READY, TRANSITIONING, and require
retries on the SCSI layer. If we know this in advance, why should we
continue sending I/O down this semi-broken path? If other, healthy
paths are available, why would it not be the right thing to switch
I/O to them ASAP?

I suppose the problem you want to solve here is a transient situation
in which all paths are transitioning (some up, some down), which would
lead to a failure on the dm level (at least with no_path_retry=0). IMO
this has to be avoided at the firmware level, and if that is
impossible, multipath-tools' (no_path_retry * polling_interval) must be
set to a value that is higher than the time for which this transient
degraded situation would persist.
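
For illustration, assuming polling_interval 5, that would be a multipath.conf
fragment along these lines (the values are examples only):

defaults {
        # path checker runs every 5 seconds
        polling_interval 5
        # keep queueing for 12 checker intervals (~60s) before failing I/O
        no_path_retry 12
}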

Am I missing something?

The way I see it, this is a problem that affects only storage from one
vendor, and would cause suboptimal behavior on most others. If you
really need this, I would suggest a new devinfo flag, e.g.
BLIST_DONT_FAIL_TRANSITIONING.
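
As a rough sketch of what I mean (the bit position is purely illustrative and
would have to be a free bit in include/scsi/scsi_devinfo.h):

/* hypothetical new devinfo flag */
#define BLIST_DONT_FAIL_TRANSITIONING	((__force blist_flags_t)(1ULL << 31))

and in alua_prep_fn():

	case SCSI_ACCESS_STATE_TRANSITIONING:
		/* only devices that opt in keep dispatching here */
		if (sdev->sdev_bflags & BLIST_DONT_FAIL_TRANSITIONING)
			return BLK_STS_OK;
		return BLK_STS_AGAIN;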

Regards,
Martin


> > 
> > [...]
> 
> Applied to 5.18/scsi-fixes, thanks!
> 
> [1/1] scsi_dh_alua: properly handling the ALUA transitioning state
>       https://git.kernel.org/mkp/scsi/c/6056a92ceb2a
>
Hannes Reinecke May 20, 2022, 12:06 p.m. UTC | #2
On 5/20/22 12:57, Martin Wilck wrote:
> Brian, Martin,
> 
> sorry, I've overlooked this patch previously. I have to say I think
> it's wrong and shouldn't have been applied. At least I need more in-
> depth explanation.
> 
> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
>>
>>> The handling of the ALUA transitioning state is currently broken.
>>> When
>>> a target goes into this state, it is expected that the target is
>>> allowed to stay in this state for the implicit transition timeout
>>> without a path failure.
> 
> Can you please show me a quote from the specs on which this expectation
> ("without a path failure") is based? AFAIK the SCSI specs don't say
> anything about device-mapper multipath semantics.
> 
>>> The handler has this logic, but it gets
>>> skipped currently.
>>>
>>> When the target transitions, there is in-flight I/O from the
>>> initiator. The first of these responses from the target will be a
>>> unit
>>> attention letting the initiator know that the ALUA state has
>>> changed.
>>> The remaining in-flight I/Os, before the initiator finds out that
>>> the
>>> portal state has changed, will return not ready, ALUA state is
>>> transitioning. The portal state will change to
>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O
>>> immediately failing the path unexpectedly. The path failure happens
>>> in
>>> less than a second instead of the expected successes until the
>>> transition timer is exceeded.
> 
> dm multipath has no concept of "transitioning" state. Path state can be
> either active or inactive. As Brian wrote, commands sent to the
> transitioning device will return NOT READY, TRANSITIONING, and require
> retries on the SCSI layer. If we know this in advance, why should we
> continue sending I/O down this semi-broken path? If other, healthy
> paths are available, why it would it not be the right thing to switch
> I/O to them ASAP?
> 
But we do, don't we?
Commands are being returned with the appropriate status, and 
dm-multipath should make the corresponding decisions here.
This patch just modifies the check when _sending_ commands; ie multipath 
had decided that the path is still usable.

Question rather would be why multipath did that; however that logic 
isn't modified here.

Cheers,

Hannes
Martin Wilck May 20, 2022, 2:03 p.m. UTC | #3
On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
> On 5/20/22 12:57, Martin Wilck wrote:
> > Brian, Martin,
> > 
> > sorry, I've overlooked this patch previously. I have to say I think
> > it's wrong and shouldn't have been applied. At least I need more
> > in-
> > depth explanation.
> > 
> > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
> > > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
> > > 
> > > > The handling of the ALUA transitioning state is currently
> > > > broken.
> > > > When
> > > > a target goes into this state, it is expected that the target
> > > > is
> > > > allowed to stay in this state for the implicit transition
> > > > timeout
> > > > without a path failure.
> > 
> > Can you please show me a quote from the specs on which this
> > expectation
> > ("without a path failure") is based? AFAIK the SCSI specs don't say
> > anything about device-mapper multipath semantics.
> > 
> > > > The handler has this logic, but it gets
> > > > skipped currently.
> > > > 
> > > > When the target transitions, there is in-flight I/O from the
> > > > initiator. The first of these responses from the target will be
> > > > a
> > > > unit
> > > > attention letting the initiator know that the ALUA state has
> > > > changed.
> > > > The remaining in-flight I/Os, before the initiator finds out
> > > > that
> > > > the
> > > > portal state has changed, will return not ready, ALUA state is
> > > > transitioning. The portal state will change to
> > > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O
> > > > immediately failing the path unexpectedly. The path failure
> > > > happens
> > > > in
> > > > less than a second instead of the expected successes until the
> > > > transition timer is exceeded.
> > 
> > dm multipath has no concept of "transitioning" state. Path state
> > can be
> > either active or inactive. As Brian wrote, commands sent to the
> > transitioning device will return NOT READY, TRANSITIONING, and
> > require
> > retries on the SCSI layer. If we know this in advance, why should
> > we
> > continue sending I/O down this semi-broken path? If other, healthy
> > paths are available, why it would it not be the right thing to
> > switch
> > I/O to them ASAP?
> > 
> But we do, don't we?
> Commands are being returned with the appropriate status, and 
> dm-multipath should make the corresponding decisions here.
> This patch just modifies the check when _sending_ commands; ie
> multipath 
> had decided that the path is still usable.
> Question rather would be why multipath did that;

If alua_prep_fn() got called, the path was considered usable at the
given point in time by dm-multipath. Most probably the reason was
simply that no error condition had occurred on this path before ALUA
state switched to transitioning. I suppose this can happen if storage
switches a PG consisting of multiple paths to TRANSITIONING. We get an
error on one path (sda, say), issue an RTPG, and receive the new ALUA
state for all paths of the PG. For all paths except sda, we'd just see
a switch to TRANSITIONING without a previous SCSI error.

With this patch, we'll dispatch I/O (usually an entire bunch) to these
paths despite seeing them in TRANSITIONING state. Eventually, when the
SCSI responses are received, this leads to path failures. If I/O
latencies are small, this happens after a few ms. In that case, the
goal of Brian's patch is not reached, because the time until path
failure would still be on the order of milliseconds. OTOH, if latencies
are high, it takes substantially longer for the kernel to realize that
the path is non-functional, while other, good paths may be idle. I fail
to see the benefit.

Regards,
Martin


>  however that logic 
> isn't modified here.
> 
> Cheers,
> 
> Hannes
Mike Christie May 20, 2022, 7:08 p.m. UTC | #4
On 5/20/22 9:03 AM, Martin Wilck wrote:
> On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
>> On 5/20/22 12:57, Martin Wilck wrote:
>>> Brian, Martin,
>>>
>>> sorry, I've overlooked this patch previously. I have to say I think
>>> it's wrong and shouldn't have been applied. At least I need more
>>> in-
>>> depth explanation.
>>>
>>> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
>>>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
>>>>
>>>>> The handling of the ALUA transitioning state is currently
>>>>> broken.
>>>>> When
>>>>> a target goes into this state, it is expected that the target
>>>>> is
>>>>> allowed to stay in this state for the implicit transition
>>>>> timeout
>>>>> without a path failure.
>>>
>>> Can you please show me a quote from the specs on which this
>>> expectation
>>> ("without a path failure") is based? AFAIK the SCSI specs don't say
>>> anything about device-mapper multipath semantics.
>>>
>>>>> The handler has this logic, but it gets
>>>>> skipped currently.
>>>>>
>>>>> When the target transitions, there is in-flight I/O from the
>>>>> initiator. The first of these responses from the target will be
>>>>> a
>>>>> unit
>>>>> attention letting the initiator know that the ALUA state has
>>>>> changed.
>>>>> The remaining in-flight I/Os, before the initiator finds out
>>>>> that
>>>>> the
>>>>> portal state has changed, will return not ready, ALUA state is
>>>>> transitioning. The portal state will change to
>>>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new I/O
>>>>> immediately failing the path unexpectedly. The path failure
>>>>> happens
>>>>> in
>>>>> less than a second instead of the expected successes until the
>>>>> transition timer is exceeded.
>>>
>>> dm multipath has no concept of "transitioning" state. Path state
>>> can be
>>> either active or inactive. As Brian wrote, commands sent to the
>>> transitioning device will return NOT READY, TRANSITIONING, and
>>> require
>>> retries on the SCSI layer. If we know this in advance, why should
>>> we
>>> continue sending I/O down this semi-broken path? If other, healthy
>>> paths are available, why it would it not be the right thing to
>>> switch
>>> I/O to them ASAP?
>>>
>> But we do, don't we?
>> Commands are being returned with the appropriate status, and 
>> dm-multipath should make the corresponding decisions here.
>> This patch just modifies the check when _sending_ commands; ie
>> multipath 
>> had decided that the path is still usable.
>> Question rather would be why multipath did that;
> 
> If alua_prep_fn() got called, the path was considered usable at the
> given point in time by dm-multipath. Most probably the reason was
> simply that no error condition had occured on this path before ALUA
> state switched to transitioning. I suppose this can happen if storage
> switches a PG consisting of multiple paths to TRANSITIONING. We get an
> error on one path (sda, say), issue an RTPG, and receive the new ALUA
> state for all paths of the PG. For all paths except sda, we'd just see
> a switch to TRANSITIONING without a previous SCSI error.
> 
> With this patch, we'll dispatch I/O (usually an entire bunch) to these
> paths despite seeing them in TRANSITIONING state. Eventually, when the
> SCSI responses are received, this leads to path failures. If I/O
> latencies are small, this happens after a few ms. In that case, the
> goal of Brian's patch is not reached, because the time until path
> failure would still be on the order of milliseconds. OTOH, if latencies
> are high, it takes substantially longer for the kernel to realize that
> the path is non-functional, while other, good paths may be idle. I fail
> to see the benefit.
> 

I'm not sure everyone agrees with you on the meaning of transitioning.

If we go from non-optimized to optimized or standby to non-opt/optimized
we don't want to try other paths because it can cause thrashing. We just
need to transition resources before we can fully use the path. It could
be a local cache operation or for distributed targets it could be a really
expensive operation.

For both though, it can take longer than the retries we get from scsi-ml.
For example this patch:

commit 2b35865e7a290d313c3d156c0c2074b4c4ffaf52
Author: Hannes Reinecke <hare@suse.de>
Date:   Fri Feb 19 09:17:13 2016 +0100

    scsi_dh_alua: Recheck state on unit attention


caused us issues because the retries were used up quickly. We just changed
the target to return BUSY status and we don't use the transitioning state.
The spec does mention using either return value in "5.15.2.5 Transitions
between target port asymmetric access states":

------
if during the transition the logical unit is inaccessible, then the transition
is performed as a single indivisible event and the device server shall respond
by either returning BUSY status, or returning CHECK CONDITION status, with the
sense key set to NOT READY, and the sense code set to LOGICAL UNIT NOT ACCESSIBLE,
ASYMMETRIC ACCESS STATE TRANSITION;

------

So Brian's patch works if you return BUSY instead of 02/04/0a and are setting
the state to transitioning during the time it's transitioning.

I do partially agree with you and it's kind of a messy mix and match. However,
I think we should change alua_check_sense to handle 02/04/0a the same way we
handle it in alua_prep_fn. And then we should add a new flag for devices that
have a bug and return transitioning forever.
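
Roughly, for the alua_check_sense() part I am thinking of something like the
sketch below, reusing the h/pg locals the function already has and assuming
pg->expiry holds the deadline alua_rtpg() computes from the reported
transition timeout:

	case NOT_READY:
		if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) {
			/*
			 * LUN Not Accessible - ALUA state transition.
			 * Keep retrying only while the transition deadline
			 * has not expired; afterwards fall back to the
			 * default disposition.
			 */
			rcu_read_lock();
			pg = rcu_dereference(h->pg);
			if (pg && (!pg->expiry ||
				   time_before(jiffies, pg->expiry))) {
				pg->state = SCSI_ACCESS_STATE_TRANSITIONING;
				rcu_read_unlock();
				return ADD_TO_MLQUEUE;
			}
			rcu_read_unlock();
			return SCSI_RETURN_NOT_HANDLED;
		}
		break;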
Martin Wilck May 20, 2022, 8:03 p.m. UTC | #5
On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
> On 5/20/22 9:03 AM, Martin Wilck wrote:
> > On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
> > > On 5/20/22 12:57, Martin Wilck wrote:
> > > > Brian, Martin,
> > > > 
> > > > sorry, I've overlooked this patch previously. I have to say I
> > > > think
> > > > it's wrong and shouldn't have been applied. At least I need
> > > > more
> > > > in-
> > > > depth explanation.
> > > > 
> > > > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
> > > > > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
> > > > > 
> > > > > > The handling of the ALUA transitioning state is currently
> > > > > > broken.
> > > > > > When
> > > > > > a target goes into this state, it is expected that the
> > > > > > target
> > > > > > is
> > > > > > allowed to stay in this state for the implicit transition
> > > > > > timeout
> > > > > > without a path failure.
> > > > 
> > > > Can you please show me a quote from the specs on which this
> > > > expectation
> > > > ("without a path failure") is based? AFAIK the SCSI specs don't
> > > > say
> > > > anything about device-mapper multipath semantics.
> > > > 
> > > > > > The handler has this logic, but it gets
> > > > > > skipped currently.
> > > > > > 
> > > > > > When the target transitions, there is in-flight I/O from
> > > > > > the
> > > > > > initiator. The first of these responses from the target
> > > > > > will be
> > > > > > a
> > > > > > unit
> > > > > > attention letting the initiator know that the ALUA state
> > > > > > has
> > > > > > changed.
> > > > > > The remaining in-flight I/Os, before the initiator finds
> > > > > > out
> > > > > > that
> > > > > > the
> > > > > > portal state has changed, will return not ready, ALUA state
> > > > > > is
> > > > > > transitioning. The portal state will change to
> > > > > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new
> > > > > > I/O
> > > > > > immediately failing the path unexpectedly. The path failure
> > > > > > happens
> > > > > > in
> > > > > > less than a second instead of the expected successes until
> > > > > > the
> > > > > > transition timer is exceeded.
> > > > 
> > > > dm multipath has no concept of "transitioning" state. Path
> > > > state
> > > > can be
> > > > either active or inactive. As Brian wrote, commands sent to the
> > > > transitioning device will return NOT READY, TRANSITIONING, and
> > > > require
> > > > retries on the SCSI layer. If we know this in advance, why
> > > > should
> > > > we
> > > > continue sending I/O down this semi-broken path? If other,
> > > > healthy
> > > > paths are available, why it would it not be the right thing to
> > > > switch
> > > > I/O to them ASAP?
> > > > 
> > > But we do, don't we?
> > > Commands are being returned with the appropriate status, and 
> > > dm-multipath should make the corresponding decisions here.
> > > This patch just modifies the check when _sending_ commands; ie
> > > multipath 
> > > had decided that the path is still usable.
> > > Question rather would be why multipath did that;
> > 
> > If alua_prep_fn() got called, the path was considered usable at the
> > given point in time by dm-multipath. Most probably the reason was
> > simply that no error condition had occured on this path before ALUA
> > state switched to transitioning. I suppose this can happen
> > if storage
> > switches a PG consisting of multiple paths to TRANSITIONING. We get
> > an
> > error on one path (sda, say), issue an RTPG, and receive the new
> > ALUA
> > state for all paths of the PG. For all paths except sda, we'd just
> > see
> > a switch to TRANSITIONING without a previous SCSI error.
> > 
> > With this patch, we'll dispatch I/O (usually an entire bunch) to
> > these
> > paths despite seeing them in TRANSITIONING state. Eventually, when
> > the
> > SCSI responses are received, this leads to path failures. If I/O
> > latencies are small, this happens after a few ms. In that case, the
> > goal of Brian's patch is not reached, because the time until path
> > failure would still be on the order of milliseconds. OTOH, if
> > latencies
> > are high, it takes substantially longer for the kernel to realize
> > that
> > the path is non-functional, while other, good paths may be idle. I
> > fail
> > to see the benefit.
> > 
> 
> I'm not sure everyone agrees with you on the meaning of
> transitioning.
> 
> If we go from non-optimized to optimized or standby to non-
> opt/optimized
> we don't want to try other paths because it can cause thrashing.

But only with explicit ALUA, or am I missing something? I agree that
the host shouldn't initiate a PG switch if it encounters transitioning
state. I also agree that for transitioning towards a "better" state,
e.g. standby to (non)-optimized, failing the path would be
questionable. Unfortunately we don't know in which "direction" the path
is transitioning - it could be for 'better' or 'worse'. I suppose that
in the case of a PG switch, it can happen that we dispatch I/O to a 
device that used to be in Standby and is now transitioning. Would it
make sense to remember the previous state and "guess" what we're going
to transition to? I.e. if the previous state was "Standby", it's
probably going to be (non)optimized after the transition, and vice-
versa?

>  We just
> need to transition resources before we can fully use the path. It
> could
> be a local cache operation or for distributed targets it could be a
> really
> expensive operation.
> 
> For both though, it can take longer than the retries we get from
> scsi-ml.

So if we want to do "the right thing", we'd continue dispatching to the
device until either the state changes or the device-reported transition
timeout has expired?
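
(For concreteness, I picture something roughly like this in alua_prep_fn(),
assuming the deadline that alua_rtpg() derives from the reported transition
timeout is copied out of the port group together with the state:)

	struct alua_port_group *pg;
	unsigned char state = SCSI_ACCESS_STATE_OPTIMAL;
	unsigned long expiry = 0;

	rcu_read_lock();
	pg = rcu_dereference(h->pg);
	if (pg) {
		state = pg->state;
		expiry = pg->expiry;	/* assumed transition deadline */
	}
	rcu_read_unlock();
	...
	case SCSI_ACCESS_STATE_TRANSITIONING:
		/* keep dispatching while the transition may still complete */
		if (expiry && time_before(jiffies, expiry))
			return BLK_STS_OK;
		return BLK_STS_AGAIN;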

Martin


> For example this patch:
> 
> commit 2b35865e7a290d313c3d156c0c2074b4c4ffaf52
> Author: Hannes Reinecke <hare@suse.de>
> Date:   Fri Feb 19 09:17:13 2016 +0100
> 
>     scsi_dh_alua: Recheck state on unit attention
> 
> 
> caused us issues because the retries were used up quickly. We just
> changed
> the target to return BUSY status and we don't use the transitioning
> state.
> The spec does mention using either return value in "5.15.2.5
> Transitions
> between target port asymmetric access states":
> 
> ------
> if during the transition the logical unit is inaccessible, then the
> transition
> is performed as a single indivisible event and the device server
> shall respond
> by either returning BUSY status, or returning CHECK CONDITION status,
> with the
> sense key set to NOT READY, and the sense code set to LOGICAL UNIT
> NOT ACCESSIBLE,
> ASYMMETRIC ACCESS STATE TRANSITION;
> 
> ------
> 
> So Brian's patch works if you return BUSY instead of 02/04/0a and are
> setting
> the state to transitioning during the time it's transitioning.
> 
> I do partially agree with you and it's kind of a messy mix and match.
> However,
> I think we should change alua_check_sense to handle 02/04/0a the same
> way we
> handle it in alua_prep_fn. And then we should add a new flag for
> devices that
> have a bug and return transitioning forever.
>
Hannes Reinecke May 21, 2022, 10:17 a.m. UTC | #6
On 5/20/22 13:03, Martin Wilck wrote:
> On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
>> On 5/20/22 9:03 AM, Martin Wilck wrote:
>>> On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
>>>> On 5/20/22 12:57, Martin Wilck wrote:
>>>>> Brian, Martin,
>>>>>
>>>>> sorry, I've overlooked this patch previously. I have to say I
>>>>> think
>>>>> it's wrong and shouldn't have been applied. At least I need
>>>>> more
>>>>> in-
>>>>> depth explanation.
>>>>>
>>>>> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
>>>>>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
>>>>>>
>>>>>>> The handling of the ALUA transitioning state is currently
>>>>>>> broken.
>>>>>>> When
>>>>>>> a target goes into this state, it is expected that the
>>>>>>> target
>>>>>>> is
>>>>>>> allowed to stay in this state for the implicit transition
>>>>>>> timeout
>>>>>>> without a path failure.
>>>>>
>>>>> Can you please show me a quote from the specs on which this
>>>>> expectation
>>>>> ("without a path failure") is based? AFAIK the SCSI specs don't
>>>>> say
>>>>> anything about device-mapper multipath semantics.
>>>>>
>>>>>>> The handler has this logic, but it gets
>>>>>>> skipped currently.
>>>>>>>
>>>>>>> When the target transitions, there is in-flight I/O from
>>>>>>> the
>>>>>>> initiator. The first of these responses from the target
>>>>>>> will be
>>>>>>> a
>>>>>>> unit
>>>>>>> attention letting the initiator know that the ALUA state
>>>>>>> has
>>>>>>> changed.
>>>>>>> The remaining in-flight I/Os, before the initiator finds
>>>>>>> out
>>>>>>> that
>>>>>>> the
>>>>>>> portal state has changed, will return not ready, ALUA state
>>>>>>> is
>>>>>>> transitioning. The portal state will change to
>>>>>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new
>>>>>>> I/O
>>>>>>> immediately failing the path unexpectedly. The path failure
>>>>>>> happens
>>>>>>> in
>>>>>>> less than a second instead of the expected successes until
>>>>>>> the
>>>>>>> transition timer is exceeded.
>>>>>
>>>>> dm multipath has no concept of "transitioning" state. Path
>>>>> state
>>>>> can be
>>>>> either active or inactive. As Brian wrote, commands sent to the
>>>>> transitioning device will return NOT READY, TRANSITIONING, and
>>>>> require
>>>>> retries on the SCSI layer. If we know this in advance, why
>>>>> should
>>>>> we
>>>>> continue sending I/O down this semi-broken path? If other,
>>>>> healthy
>>>>> paths are available, why it would it not be the right thing to
>>>>> switch
>>>>> I/O to them ASAP?
>>>>>
>>>> But we do, don't we?
>>>> Commands are being returned with the appropriate status, and
>>>> dm-multipath should make the corresponding decisions here.
>>>> This patch just modifies the check when _sending_ commands; ie
>>>> multipath
>>>> had decided that the path is still usable.
>>>> Question rather would be why multipath did that;
>>>
>>> If alua_prep_fn() got called, the path was considered usable at the
>>> given point in time by dm-multipath. Most probably the reason was
>>> simply that no error condition had occured on this path before ALUA
>>> state switched to transitioning. I suppose this can happen
>>> if storage
>>> switches a PG consisting of multiple paths to TRANSITIONING. We get
>>> an
>>> error on one path (sda, say), issue an RTPG, and receive the new
>>> ALUA
>>> state for all paths of the PG. For all paths except sda, we'd just
>>> see
>>> a switch to TRANSITIONING without a previous SCSI error.
>>>
>>> With this patch, we'll dispatch I/O (usually an entire bunch) to
>>> these
>>> paths despite seeing them in TRANSITIONING state. Eventually, when
>>> the
>>> SCSI responses are received, this leads to path failures. If I/O
>>> latencies are small, this happens after a few ms. In that case, the
>>> goal of Brian's patch is not reached, because the time until path
>>> failure would still be on the order of milliseconds. OTOH, if
>>> latencies
>>> are high, it takes substantially longer for the kernel to realize
>>> that
>>> the path is non-functional, while other, good paths may be idle. I
>>> fail
>>> to see the benefit.
>>>
>>
>> I'm not sure everyone agrees with you on the meaning of
>> transitioning.
>>
>> If we go from non-optimized to optimized or standby to non-
>> opt/optimized
>> we don't want to try other paths because it can cause thrashing.
> 
> But only with explicit ALUA, or am I missing something? I agree that
> the host shouldn't initiate a PG switch if it encounters transitioning
> state. I also agree that for transitioning towards a "better" state,
> e.g. standby to (non)-optimized, failing the path would be
> questionable. Unfortunately we don't know in which "direction" the path
> is transitioning - it could be for 'better' or 'worse'. I suppose that
> in the case of a PG switch, it can happen that we dispatch I/O to a
> device that used to be in Standby and is now transitioning. Would it
> make sense to remember the previous state and "guess" what we're going
> to transition to? I.e. if the previous state was "Standby", it's
> probably going to be (non)optimized after the transition, and vice-
> versa?
> 
>>   We just
>> need to transition resources before we can fully use the path. It
>> could
>> be a local cache operation or for distributed targets it could be a
>> really
>> expensive operation.
>>
>> For both though, it can take longer than the retries we get from
>> scsi-ml.
> 
> So if we want to do "the right thing", we'd continue dispatching to the
> device until either the state changes or the device-reported transition
> timeout has expired?
> 
Not quite.
I think multipath should make the decision, and we shouldn't try to 
out-guess multipathing from the device driver.
We should only make the information available (to wit: the path state) 
for multipath to make the decision.
But if multipath decides to send I/O via a path which we _think_ is in 
transitioning state, we shouldn't be arguing with that (ie we shouldn't 
preempt it from prep_fn) but rather let it rip; who knows, maybe there was a 
reason for that.

The only thing this patch changes (for any current setup) is that we'll 
incur additional round-trip times for I/O being sent in between the 
first I/O returning with 'transitioning' and multipath picking up the 
new path state. This I/O will be returned more-or-less immediately by 
the target, so really we're only having an additional round-trip latency 
for each I/O during that time. This is at most in the milliseconds 
range, so I guess we can live with that.

If, however, multipath fails to pick up the 'transitioning' state then 
this won't work, and we indeed will incur a regression. But that 
arguably is an issue with multipath ...

Cheers,

Hannes
Mike Christie May 21, 2022, 4:58 p.m. UTC | #7
On 5/20/22 3:03 PM, Martin Wilck wrote:
> On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
>> On 5/20/22 9:03 AM, Martin Wilck wrote:
>>> On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
>>>> On 5/20/22 12:57, Martin Wilck wrote:
>>>>> Brian, Martin,
>>>>>
>>>>> sorry, I've overlooked this patch previously. I have to say I
>>>>> think
>>>>> it's wrong and shouldn't have been applied. At least I need
>>>>> more
>>>>> in-
>>>>> depth explanation.
>>>>>
>>>>> On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen wrote:
>>>>>> On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
>>>>>>
>>>>>>> The handling of the ALUA transitioning state is currently
>>>>>>> broken.
>>>>>>> When
>>>>>>> a target goes into this state, it is expected that the
>>>>>>> target
>>>>>>> is
>>>>>>> allowed to stay in this state for the implicit transition
>>>>>>> timeout
>>>>>>> without a path failure.
>>>>>
>>>>> Can you please show me a quote from the specs on which this
>>>>> expectation
>>>>> ("without a path failure") is based? AFAIK the SCSI specs don't
>>>>> say
>>>>> anything about device-mapper multipath semantics.
>>>>>
>>>>>>> The handler has this logic, but it gets
>>>>>>> skipped currently.
>>>>>>>
>>>>>>> When the target transitions, there is in-flight I/O from
>>>>>>> the
>>>>>>> initiator. The first of these responses from the target
>>>>>>> will be
>>>>>>> a
>>>>>>> unit
>>>>>>> attention letting the initiator know that the ALUA state
>>>>>>> has
>>>>>>> changed.
>>>>>>> The remaining in-flight I/Os, before the initiator finds
>>>>>>> out
>>>>>>> that
>>>>>>> the
>>>>>>> portal state has changed, will return not ready, ALUA state
>>>>>>> is
>>>>>>> transitioning. The portal state will change to
>>>>>>> SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all new
>>>>>>> I/O
>>>>>>> immediately failing the path unexpectedly. The path failure
>>>>>>> happens
>>>>>>> in
>>>>>>> less than a second instead of the expected successes until
>>>>>>> the
>>>>>>> transition timer is exceeded.
>>>>>
>>>>> dm multipath has no concept of "transitioning" state. Path
>>>>> state
>>>>> can be
>>>>> either active or inactive. As Brian wrote, commands sent to the
>>>>> transitioning device will return NOT READY, TRANSITIONING, and
>>>>> require
>>>>> retries on the SCSI layer. If we know this in advance, why
>>>>> should
>>>>> we
>>>>> continue sending I/O down this semi-broken path? If other,
>>>>> healthy
>>>>> paths are available, why it would it not be the right thing to
>>>>> switch
>>>>> I/O to them ASAP?
>>>>>
>>>> But we do, don't we?
>>>> Commands are being returned with the appropriate status, and 
>>>> dm-multipath should make the corresponding decisions here.
>>>> This patch just modifies the check when _sending_ commands; ie
>>>> multipath 
>>>> had decided that the path is still usable.
>>>> Question rather would be why multipath did that;
>>>
>>> If alua_prep_fn() got called, the path was considered usable at the
>>> given point in time by dm-multipath. Most probably the reason was
>>> simply that no error condition had occured on this path before ALUA
>>> state switched to transitioning. I suppose this can happen
>>> if storage
>>> switches a PG consisting of multiple paths to TRANSITIONING. We get
>>> an
>>> error on one path (sda, say), issue an RTPG, and receive the new
>>> ALUA
>>> state for all paths of the PG. For all paths except sda, we'd just
>>> see
>>> a switch to TRANSITIONING without a previous SCSI error.
>>>
>>> With this patch, we'll dispatch I/O (usually an entire bunch) to
>>> these
>>> paths despite seeing them in TRANSITIONING state. Eventually, when
>>> the
>>> SCSI responses are received, this leads to path failures. If I/O
>>> latencies are small, this happens after a few ms. In that case, the
>>> goal of Brian's patch is not reached, because the time until path
>>> failure would still be on the order of milliseconds. OTOH, if
>>> latencies
>>> are high, it takes substantially longer for the kernel to realize
>>> that
>>> the path is non-functional, while other, good paths may be idle. I
>>> fail
>>> to see the benefit.
>>>
>>
>> I'm not sure everyone agrees with you on the meaning of
>> transitioning.
>>
>> If we go from non-optimized to optimized or standby to non-
>> opt/optimized
>> we don't want to try other paths because it can cause thrashing.
> 
> But only with explicit ALUA, or am I missing something? I agree that

That section of the spec mentions both implicit and explicit. For
implicit, the target can want to rebalance resources for reasons like
a resource going down permanently, more ports being added, or resources
being brought up/down dynamically based on usage or maintenance.


> the host shouldn't initiate a PG switch if it encounters transitioning
> state. I also agree that for transitioning towards a "better" state,
> e.g. standby to (non)-optimized, failing the path would be
> questionable. Unfortunately we don't know in which "direction" the path
> is transitioning - it could be for 'better' or 'worse'. I suppose that

For implicit, the target knows. It's initiating the transition based
on whatever metrics or resources it has. We want the initiator to let us
complete what we are doing.

For explicit, then again the target knows what it wants to do when
it gets the STPG, and we only asked it to set the paths to optimized. So
if it goes that route where it completes the STPG before the transition
completes, then goes into transitioning, then we can just let the
device do its transitions.


> in the case of a PG switch, it can happen that we dispatch I/O to a 
> device that used to be in Standby and is now transitioning. Would it
> make sense to remember the previous state and "guess" what we're going
> to transition to? I.e. if the previous state was "Standby", it's
> probably going to be (non)optimized after the transition, and vice-
> versa?

You are referring to the issue Hannes mentions where multipath can pick
up the transitioning state and it might get confused, right? I'm not
sure what to do.

> 
>>  We just
>> need to transition resources before we can fully use the path. It
>> could
>> be a local cache operation or for distributed targets it could be a
>> really
>> expensive operation.
>>
>> For both though, it can take longer than the retries we get from
>> scsi-ml.
> 
> So if we want to do "the right thing", we'd continue dispatching to the
> device until either the state changes or the device-reported transition
> timeout has expired?
Sort of.

Ideally I think it would be nice if we blocked the device/queue for
normal IO, then just sent a RTPG every N secs or msecs until we changed
state or until the timer expired. We then unblock and either fail upwards
or dispatch. I think this is a lot of work though.

The problem with constant dispatching is that on low latency systems we
retry too quickly. I had to add a little sleep on the target side for this
or we hammer the target/initiator too hard and we got warnings (I can't
remember the exact warn/err).
Martin Wilck May 23, 2022, 3:33 p.m. UTC | #8
On Sat, 2022-05-21 at 12:17 +0200, Hannes Reinecke wrote:
> On 5/20/22 13:03, Martin Wilck wrote:
> > On Fri, 2022-05-20 at 14:08 -0500, Mike Christie wrote:
> > > On 5/20/22 9:03 AM, Martin Wilck wrote:
> > > > On Fri, 2022-05-20 at 14:06 +0200, Hannes Reinecke wrote:
> > > > > On 5/20/22 12:57, Martin Wilck wrote:
> > > > > > Brian, Martin,
> > > > > > 
> > > > > > sorry, I've overlooked this patch previously. I have to say
> > > > > > I
> > > > > > think
> > > > > > it's wrong and shouldn't have been applied. At least I need
> > > > > > more
> > > > > > in-
> > > > > > depth explanation.
> > > > > > 
> > > > > > On Mon, 2022-05-02 at 20:50 -0400, Martin K. Petersen
> > > > > > wrote:
> > > > > > > On Mon, 2 May 2022 08:09:17 -0700, Brian Bunker wrote:
> > > > > > > 
> > > > > > > > The handling of the ALUA transitioning state is
> > > > > > > > currently
> > > > > > > > broken.
> > > > > > > > When
> > > > > > > > a target goes into this state, it is expected that the
> > > > > > > > target
> > > > > > > > is
> > > > > > > > allowed to stay in this state for the implicit
> > > > > > > > transition
> > > > > > > > timeout
> > > > > > > > without a path failure.
> > > > > > 
> > > > > > Can you please show me a quote from the specs on which this
> > > > > > expectation
> > > > > > ("without a path failure") is based? AFAIK the SCSI specs
> > > > > > don't
> > > > > > say
> > > > > > anything about device-mapper multipath semantics.
> > > > > > 
> > > > > > > > The handler has this logic, but it gets
> > > > > > > > skipped currently.
> > > > > > > > 
> > > > > > > > When the target transitions, there is in-flight I/O
> > > > > > > > from
> > > > > > > > the
> > > > > > > > initiator. The first of these responses from the target
> > > > > > > > will be
> > > > > > > > a
> > > > > > > > unit
> > > > > > > > attention letting the initiator know that the ALUA
> > > > > > > > state
> > > > > > > > has
> > > > > > > > changed.
> > > > > > > > The remaining in-flight I/Os, before the initiator
> > > > > > > > finds
> > > > > > > > out
> > > > > > > > that
> > > > > > > > the
> > > > > > > > portal state has changed, will return not ready, ALUA
> > > > > > > > state
> > > > > > > > is
> > > > > > > > transitioning. The portal state will change to
> > > > > > > > SCSI_ACCESS_STATE_TRANSITIONING. This will lead to all
> > > > > > > > new
> > > > > > > > I/O
> > > > > > > > immediately failing the path unexpectedly. The path
> > > > > > > > failure
> > > > > > > > happens
> > > > > > > > in
> > > > > > > > less than a second instead of the expected successes
> > > > > > > > until
> > > > > > > > the
> > > > > > > > transition timer is exceeded.
> > > > > > 
> > > > > > dm multipath has no concept of "transitioning" state. Path
> > > > > > state
> > > > > > can be
> > > > > > either active or inactive. As Brian wrote, commands sent to
> > > > > > the
> > > > > > transitioning device will return NOT READY, TRANSITIONING,
> > > > > > and
> > > > > > require
> > > > > > retries on the SCSI layer. If we know this in advance, why
> > > > > > should
> > > > > > we
> > > > > > continue sending I/O down this semi-broken path? If other,
> > > > > > healthy
> > > > > > paths are available, why it would it not be the right thing
> > > > > > to
> > > > > > switch
> > > > > > I/O to them ASAP?
> > > > > > 
> > > > > But we do, don't we?
> > > > > Commands are being returned with the appropriate status, and
> > > > > dm-multipath should make the corresponding decisions here.
> > > > > This patch just modifies the check when _sending_ commands;
> > > > > ie
> > > > > multipath
> > > > > had decided that the path is still usable.
> > > > > Question rather would be why multipath did that;
> > > > 
> > > > If alua_prep_fn() got called, the path was considered usable at
> > > > the
> > > > given point in time by dm-multipath. Most probably the reason
> > > > was
> > > > simply that no error condition had occured on this path before
> > > > ALUA
> > > > state switched to transitioning. I suppose this can happen
> > > > if storage
> > > > switches a PG consisting of multiple paths to TRANSITIONING. We
> > > > get
> > > > an
> > > > error on one path (sda, say), issue an RTPG, and receive the
> > > > new
> > > > ALUA
> > > > state for all paths of the PG. For all paths except sda, we'd
> > > > just
> > > > see
> > > > a switch to TRANSITIONING without a previous SCSI error.
> > > > 
> > > > With this patch, we'll dispatch I/O (usually an entire bunch)
> > > > to
> > > > these
> > > > paths despite seeing them in TRANSITIONING state. Eventually,
> > > > when
> > > > the
> > > > SCSI responses are received, this leads to path failures. If
> > > > I/O
> > > > latencies are small, this happens after a few ms. In that case,
> > > > the
> > > > goal of Brian's patch is not reached, because the time until
> > > > path
> > > > failure would still be on the order of milliseconds. OTOH, if
> > > > latencies
> > > > are high, it takes substantially longer for the kernel to
> > > > realize
> > > > that
> > > > the path is non-functional, while other, good paths may be
> > > > idle. I
> > > > fail
> > > > to see the benefit.
> > > > 
> > > 
> > > I'm not sure everyone agrees with you on the meaning of
> > > transitioning.
> > > 
> > > If we go from non-optimized to optimized or standby to non-
> > > opt/optimized
> > > we don't want to try other paths because it can cause thrashing.
> > 
> > But only with explicit ALUA, or am I missing something? I agree
> > that
> > the host shouldn't initiate a PG switch if it encounters
> > transitioning
> > state. I also agree that for transitioning towards a "better"
> > state,
> > e.g. standby to (non)-optimized, failing the path would be
> > questionable. Unfortunately we don't know in which "direction" the
> > path
> > is transitioning - it could be for 'better' or 'worse'. I suppose
> > that
> > in the case of a PG switch, it can happen that we dispatch I/O to a
> > device that used to be in Standby and is now transitioning. Would
> > it
> > make sense to remember the previous state and "guess" what we're
> > going
> > to transition to? I.e. if the previous state was "Standby", it's
> > probably going to be (non)optimized after the transition, and vice-
> > versa?
> > 
> > >   We just
> > > need to transition resources before we can fully use the path. It
> > > could
> > > be a local cache operation or for distributed targets it could be
> > > a
> > > really
> > > expensive operation.
> > > 
> > > For both though, it can take longer than the retries we get from
> > > scsi-ml.
> > 
> > So if we want to do "the right thing", we'd continue dispatching to
> > the
> > device until either the state changes or the device-reported
> > transition
> > timeout has expired?
> > 
> Not quite.
> I think multipath should make the decision, and we shouldn't try to 
> out-guess multipathing from the device driver.
> We should only make the information available (to wit: the path
> state) 
> for multipath to make the decision. 

The only information that dm-multipath currently has is the block-level
error code. The SCSI layer uses BLK_STS_AGAIN for "transitioning". Can
dm-multipath assume that BLK_STS_AGAIN *always* has these semantics,
regardless of the low-level driver? It isn't clearly documented, AFAICS.

Doing this properly requires additional logic in dm-mpath.c to treat
BLK_STS_AGAIN differently from other path errors. IIUC, what Brian and
Mike say is that dm-multipath may do path failover, but not PG failover
for BLK_STS_AGAIN. We currently don't have this logic, but we could
work on it.
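
For the sake of argument, a minimal sketch of what that could look like in the
error branch of multipath_end_io() (fail_path_keep_pg() is hypothetical and
does not exist today):

	if (error && blk_path_error(error)) {
		if (error == BLK_STS_AGAIN && pgpath)
			/* transitioning: fail the path, but pin the current PG */
			fail_path_keep_pg(pgpath);
		else if (pgpath)
			fail_path(pgpath);
		r = DM_ENDIO_REQUEUE;
	}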

IMO this means that Brian's patch would be a step in the wrong
direction: by returning BLK_STS_OK instead of BLK_STS_AGAIN, dm-
multipath gets *less* information about path state than before.

dm-multipath has 3 ways to get notified of transitioning state
(correct me if I'm missing anything):

 1. I/O queuing fails with BLK_STS_AGAIN in scsi_prep_fn(). This won't
    happen any more with Brian's patch.
 2. I/O returns from target with CHECK CONDITION / NOT READY / 0x4:0xa
    (LU not accessible - asymmetric state change transition).
    In this case the SCSI layer applies the "maybe_retry" logic,
    i.e. it retries the command. When retries are exhausted, it fails
    the command with BLK_STS_AGAIN.
 3. multipathd in user space sees a TRANSITIONING status during routine
    path checking. Currently multipathd uses prio=0 for TRANSITIONING
    state, which is even less than the prio used for STANDBY (1). Thus
    multipathd would initiate a PG switch sooner or later. I believe
    these prio assignments by multipathd are questionable, but that's
    a different discussion.

dm-multipath interprets BLK_STS_AGAIN in 2.) as path error, and will
thus do a path failover in the PG. In a case like Brian's, the other
paths in the PG will likely also be in TRANSITIONING state, so
procedure 2.) will repeat until all paths in the PG have failed. In
that situation, dm-multipath will initiate a PG switch, which is what
Brian wanted to avoid in the first place. Depending on timing, multipathd
may or may not already have done a PG switch because of 3.).

> But if multipath decides to send I/O via a path which we _think_ it's
> in 
> transitioning we shouldn't be arguing with that (ie we shouldn't
> preempt 
> is from prep_fn) but rather let it rip; who knows, maybe there was a 
> reason for that.

You seem to assume that multipath knows more about the device than the
SCSI layer. Isn't the opposite true? dm-multipath would never know that
the path is "transitioning", and it has zero knowledge about the
transition timeout. Only the ALUA code knows that certain path devices
will change their status simultaneously, always. The only additional
"knowledge" that multipath has is knowledge about path siblings and
PGs. By returning BLK_STS_OK from scsi_prep_fn(), we hold back
information that the multipath layer could use for an informed
decision.

> The only thing this patch changes (for any current setup) is that
> we'll 
> incur additional round-trip times for I/O being sent in between the 
> first I/O returning with 'transitioning' and multipath picking up the
> new path state. This I/O will be returned more-or-less immediately by
> the target, so really we're only having an additional round-trip
> latency 
> for each I/O during that time. This is at most in the milliseconds 
> range, so I guess we can live with that.

Side note: AFAICS it'll be several round-trips (until SCSI retries are
exhausted, as this is not a "fast io fail" case).

Anyway, you're right, a PG switch will happen after a few milliseconds.
That's what Brian wanted to avoid. IOW, the patch won't achieve what
it's supposed to achieve. *But* it will hold back information that dm-
multipath could use for handling transitions smarter than it currently
does.

> If, however, multipath fails to pick up the 'transitioning' state
> then 
> this won't work, and we indeed will incur a regression. But that 
> arguably is an issue with multipath ...

dm-multipath currently doesn't. Requesting that it should is a valid
feature request, no less and no more. I'm not against working on this
feature. But the generic device model in dm-multipath isn't quite
powerful enough for this. We could try to handle it more cleverly in
user space, but as yet I don't see how it could be done.

Cheers,
Martin
Martin Wilck May 23, 2022, 4:03 p.m. UTC | #9
On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote:
> From my perspective, the ALUA transitioning state is a temporary
> state
> where the target is saying that it does not have a permanent state
> yet. Having the initiator try another pg to me doesn't seem like the
> right thing for it to do.

I agree. Unfortunately, there's no logic in dm-multipath saying "I may
switch paths inside a PG, but I may not do PG failover".

>  If the target wanted the initiator to use a
> different pg, it should use an ALUA state which would make that
> clear,
> standby, unavailable, etc. The target would only return an error
> state
> if it was aware that some other path is in an active state.When
> transitioning is returned, I don't think the initiator should assume
> that any other pg would be a better choice. I think it should assume
> that the target will make its intention clear for that path with a
> permanent state within a transition timeout.

For me the question is still whether trying to send I/O to the path
that is known not to be able to process it makes sense. As noted
elsewhere, your patch just delays the BLK_STS_AGAIN by a few
milliseconds. You want to avoid a PG switch, and I second that, but IMO
that needs a different approach.

> From my perspective the right thing to do is to let the ALUA handler
> do what it is trying to do. If the pg state is transitioning and
> within the transition timeout it should continue to retry that
> request
> checking each time the transition timeout.

But this means that we should modify the logic not only in
alua_prep_fn() but also for handling of NOT READY conditions, either in
alua_check_sense() or in scsi_io_completion_action().
I agree that this would make a lot of sense, perhaps more than trying
to implement a cleverer logic in dm-multipath as discussed between
Hannes and myself.

This is what we need to figure out first: Do we want to change the
logic in the multipath layer, making it somehow aware of the special
nature of "transitioning" state, or should we tune the retry logic in
the SCSI layer such that dm-multipath will "do the right thing"
automatically?
 
Regards
Martin
Brian Bunker May 23, 2022, 4:52 p.m. UTC | #10
> On May 23, 2022, at 9:03 AM, Martin Wilck <mwilck@suse.com> wrote:
> 
> On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote:
>> From my perspective, the ALUA transitioning state is a temporary
>> state
>> where the target is saying that it does not have a permanent state
>> yet. Having the initiator try another pg to me doesn't seem like the
>> right thing for it to do.
> 
> I agree. Unfortunately, there's no logic in dm-multipath saying "I may
> switch paths inside a PG, but I may not do PG failover".
> 
>> If the target wanted the initiator to use a
>> different pg, it should use an ALUA state which would make that
>> clear,
>> standby, unavailable, etc. The target would only return an error
>> state
>> if it was aware that some other path is in an active state.When
>> transitioning is returned, I don't think the initiator should assume
>> that any other pg would be a better choice. I think it should assume
>> that the target will make its intention clear for that path with a
>> permanent state within a transition timeout.
> 
> For me the question is still whether trying to send I/O to the path
> that is known not to be able to process it makes sense. As noted
> elsewhere, you patch just delays the BLK_STS_AGAIN by a few
> milliseconds. You want to avoid a PG switch, and I second that, but IMO
> that needs a different approach.
> 
>> From my perspective the right thing to do is to let the ALUA handler
>> do what it is trying to do. If the pg state is transitioning and
>> within the transition timeout it should continue to retry that
>> request
>> checking each time the transition timeout.
> 
> But this means that we should modify the logic not only in
> alua_prep_fn() but also for handling of NOT READY conditions, either in
> alua_check_sense() or in scsi_io_completion_action().
> I agree that this would make a lot of sense, perhaps more than trying
> to implement a cleverer logic in dm-multipath as discussed between
> Hannes and myself.
> 
> This is what we need to figure out first: Do we want to change the
> logic in the multipath layer, making it somehow aware of the special
> nature of "transitioning" state, or should we tune the retry logic in
> the SCSI layer such that dm-multipath will "do the right thing"
> automatically?
> 
> Regards
> Martin
> 
I think a couple of things are needed to get to where my expectation of how it should work would be. First, this code should come out of the NOT READY sense check. The reason is that it will continually override alua_check’s attempt to change the pg state once the allowed time is exceeded, when the path state is changed to standby to handle a misbehaving target that stays in transitioning past the timer.

--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -409,20 +409,12 @@ static char print_alua_state(unsigned char state)
 static enum scsi_disposition alua_check_sense(struct scsi_device *sdev,
                                              struct scsi_sense_hdr *sense_hdr)
 {
-       struct alua_dh_data *h = sdev->handler_data;
-       struct alua_port_group *pg;
-
        switch (sense_hdr->sense_key) {
        case NOT_READY:
                if (sense_hdr->asc == 0x04 && sense_hdr->ascq == 0x0a) {
                        /*
                         * LUN Not Accessible - ALUA state transition
                         */
-                       rcu_read_lock();
-                       pg = rcu_dereference(h->pg);
-                       if (pg)
-                               pg->state = SCSI_ACCESS_STATE_TRANSITIONING;
-                       rcu_read_unlock();

Second, for this to work the way it used to work in Linux, and the way it works for other OS’s where an ALUA state transition is not a path-failing response:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 8d18cc7e510e..33828aa44347 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -776,11 +776,11 @@ static void scsi_io_completion_action(struct scsi_cmnd *cmd, int result)
                                case 0x1b: /* sanitize in progress */
                                case 0x1d: /* configuration in progress */
                                case 0x24: /* depopulation in progress */
                                       action = ACTION_DELAYED_RETRY;
                                        break;
                                case 0x0a: /* ALUA state transition */
-                                       blk_stat = BLK_STS_AGAIN;
-                                       fallthrough;
+                                       action = ACTION_REPREP;
+                                       break;

Because, as you said, the expiration check for whether to continually allow new I/O on the pg is in the prep function of the ALUA handler. I think that this does introduce a lot of error processing for a target that will respond quickly. Maybe there is some way to use the equivalent of ACTION_DELAYED_RETRY so that the target is not retried as aggressively in the transitioning state.
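
If we went that way, it would presumably just be this in scsi_io_completion_action() (sketch only):

                                case 0x0a: /* ALUA state transition */
                                        action = ACTION_DELAYED_RETRY;
                                        break;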

It is possible, maybe likely, that in other OS’s this retry is done at the multipath level. The DSM from Microsoft on GitHub would seem to show that Windows does it that way. The multipath-tools in Linux, though, don’t seem to use sense data to make decisions, so getting this logic in would require a decent set of changes and the buy-in to go that route.

Thanks,
Brian
'Christoph Hellwig' May 24, 2022, 5:25 a.m. UTC | #11
On Mon, May 02, 2022 at 08:09:17AM -0700, Brian Bunker wrote:
>         case SCSI_ACCESS_STATE_ACTIVE:
>         case SCSI_ACCESS_STATE_LBA:
> -               return BLK_STS_OK;
>         case SCSI_ACCESS_STATE_TRANSITIONING:
> -               return BLK_STS_AGAIN;
> +               return BLK_STS_OK;

As there is a lot of discussion on BLK_STS_AGAIN in the thread:
Independent of the actual outcome here, BLK_STS_AGAIN is fundamentally
the wrong thing to return here.  BLK_STS_AGAIN must only be returned
for REQ_NOWAIT requests that would have to block.
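
To illustrate the point (sketch only, not a proposed patch; the helper name is made up and the BLK_STS_RESOURCE fallback is just an assumption for the example), a prep function that cannot serve a request right now would have to distinguish the nowait case, roughly:

/* kernel fragment, types from linux/blk_types.h and linux/blk-mq.h */
static blk_status_t transitioning_prep_status(struct request *req)
{
	if (req->cmd_flags & REQ_NOWAIT)
		return BLK_STS_AGAIN;	/* nowait caller, fail fast */
	return BLK_STS_RESOURCE;	/* requeued and retried by the block layer */
}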
Hannes Reinecke May 24, 2022, 5:33 a.m. UTC | #12
On 5/24/22 07:25, Christoph Hellwig wrote:
> On Mon, May 02, 2022 at 08:09:17AM -0700, Brian Bunker wrote:
>>          case SCSI_ACCESS_STATE_ACTIVE:
>>          case SCSI_ACCESS_STATE_LBA:
>> -               return BLK_STS_OK;
>>          case SCSI_ACCESS_STATE_TRANSITIONING:
>> -               return BLK_STS_AGAIN;
>> +               return BLK_STS_OK;
> 
> As there is a lot of discussion on BLK_STS_AGAIN in the thread:
> Independent of the actual outcome here, BLK_STS_AGAIN is fundamentally
> the wrong thing to return here.  BLK_STS_AGAIN must only be returned
> for REQ_NOWAIT requests that would have to block.

Guess we should document that.
It wasn't clear to me that this is a requirement.

Cheers,

Hannes
Martin Wilck May 24, 2022, 8:29 a.m. UTC | #13
On Mon, 2022-05-23 at 09:52 -0700, Brian Bunker wrote:
> 
> > On May 23, 2022, at 9:03 AM, Martin Wilck <mwilck@suse.com> wrote:
> > 
> > On Fri, 2022-05-20 at 19:52 -0700, Brian Bunker wrote:
> > > From my perspective, the ALUA transitioning state is a temporary
> > > state
> > > where the target is saying that it does not have a permanent
> > > state
> > > yet. Having the initiator try another pg to me doesn't seem like
> > > the
> > > right thing for it to do.
> > 
> > I agree. Unfortunately, there's no logic in dm-multipath saying "I
> > may
> > switch paths inside a PG, but I may not do PG failover".
> > 
> > > If the target wanted the initiator to use a
> > > different pg, it should use an ALUA state which would make that
> > > clear,
> > > standby, unavailable, etc. The target would only return an error
> > > state
> > > if it was aware that some other path is in an active state.When
> > > transitioning is returned, I don't think the initiator should
> > > assume
> > > that any other pg would be a better choice. I think it should
> > > assume
> > > that the target will make its intention clear for that path with
> > > a
> > > permanent state within a transition timeout.
> > 
> > For me the question is still whether trying to send I/O to the path
> > that is known not to be able to process it makes sense. As noted
> > elsewhere, you patch just delays the BLK_STS_AGAIN by a few
> > milliseconds. You want to avoid a PG switch, and I second that, but
> > IMO
> > that needs a different approach.
> > 
> > > From my perspective the right thing to do is to let the ALUA
> > > handler
> > > do what it is trying to do. If the pg state is transitioning and
> > > within the transition timeout it should continue to retry that
> > > request
> > > checking each time the transition timeout.
> > 
> > But this means that we should modify the logic not only in
> > alua_prep_fn() but also for handling of NOT READY conditions,
> > either in
> > alua_check_sense() or in scsi_io_completion_action().
> > I agree that this would make a lot of sense, perhaps more than
> > trying
> > to implement a cleverer logic in dm-multipath as discussed between
> > Hannes and myself.
> > 
> > This is what we need to figure out first: Do we want to change the
> > logic in the multipath layer, making it somehow aware of the
> > special
> > nature of "transitioning" state, or should we tune the retry logic
> > in
> > the SCSI layer such that dm-multipath will "do the right thing"
> > automatically?
> > 
> > Regards
> > Martin
> > 
> I think a couple of changes are needed to get to how I would expect
> this to work. First, this code should come out of the NOT READY
> sense check. The reason is that it continually overrides
> alua_check's attempt to change the pg state to standby once the
> allowed transition time is exceeded, which is how the handler deals
> with a misbehaving target that stays in transitioning past the
> timer.
> 
> --- a/drivers/scsi/device_handler/scsi_dh_alua.c
> +++ b/drivers/scsi/device_handler/scsi_dh_alua.c
> @@ -409,20 +409,12 @@ static char print_alua_state(unsigned char
> state)
>  static enum scsi_disposition alua_check_sense(struct scsi_device
> *sdev,
>                                               struct scsi_sense_hdr
> *sense_hdr)
>  {
> -       struct alua_dh_data *h = sdev->handler_data;
> -       struct alua_port_group *pg;
> -
>         switch (sense_hdr->sense_key) {
>         case NOT_READY:
>                 if (sense_hdr->asc == 0x04 && sense_hdr->ascq ==
> 0x0a) {
>                         /*
>                          * LUN Not Accessible - ALUA state transition
>                          */
> -                       rcu_read_lock();
> -                       pg = rcu_dereference(h->pg);
> -                       if (pg)
> -                               pg->state =
> SCSI_ACCESS_STATE_TRANSITIONING;
> -                       rcu_read_unlock();
> 

Good point. But I'd say that rather than removing it, this code should
be made aware of the timer.
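
Something along these lines, perhaps (untested sketch, assuming the pg->expiry bookkeeping the handler already keeps for the transition timeout):

	rcu_read_lock();
	pg = rcu_dereference(h->pg);
	/*
	 * Only flag the port group as transitioning while the transition
	 * timer has not run out, so that the switch to standby after the
	 * timeout is not undone by in-flight sense data.
	 */
	if (pg && (!pg->expiry || time_before(jiffies, pg->expiry)))
		pg->state = SCSI_ACCESS_STATE_TRANSITIONING;
	rcu_read_unlock();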

> Second, for this to work the way it used to work in Linux, and the
> way it works on other OSes where an ALUA state transition is not a
> fail-path response:
> 
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 8d18cc7e510e..33828aa44347 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -776,11 +776,11 @@ static void scsi_io_completion_action(struct
> scsi_cmnd *cmd, int result)
>                                 case 0x1b: /* sanitize in progress */
>                                 case 0x1d: /* configuration in
> progress */
>                                 case 0x24: /* depopulation in
> progress */
>                                        action = ACTION_DELAYED_RETRY;
>                                         break;
>                                 case 0x0a: /* ALUA state transition
> */
> -                                       blk_stat = BLK_STS_AGAIN;
> -                                       fallthrough;
> +                                       action = ACTION_REPREP;
> +                                       break;
> 
> Because, as you said, the expiration check for whether to continue
> allowing new I/O on the pg is in the prep function of the ALUA
> handler.

Not any more with your patch, or what am I missing?

> I think this does introduce a lot of error processing for a target
> that will respond quickly. Maybe there is some way to use the
> equivalent of ACTION_DELAYED_RETRY so that the target is not retried
> as aggressively in the transitioning state.

IMHO ACTION_DELAYED_RETRY would be a good option. Note that it blocks
the device; thus it prevents dispatching IO to the device according to
the scsi_device_block() logic. This way it automatically introduces
some delay. 

> 
> It is possible, maybe likely, that in other OSes this retry is done
> at the multipath level. The DSM from Microsoft on GitHub would seem
> to show that Windows does it that way. The multipath-tools in Linux,
> though, don't seem to use sense data to make decisions, so getting
> this logic in would require a decent set of changes and the buy-in
> to go that route.

IMO we're closer to a good solution on the SCSI side. I've no idea how
Microsoft's DSM works, but it probably has a more detailed concept of
path states than Linux' dm-multipath, which operates on abstract block
devices.

Neither dm-multipath nor multipath-tools have a concept of
transitioning state. While it'd be possible to add this logic in
multipath-tools (user space), I see a couple of hard-to-solve problems:

 - multipathd has no concept of ALUA. The ALUA functionality is split
into the device handler part (handled in the kernel, responsible for
explicit PG switching and error handling), and the "prioritizer"
handled in user space. ALUA is just one out of several path
prioritizers. The generic prioritizer concept doesn't define a
"transitioning" property. We'd need to try to generalize the concept,
and possibly figure out how to implement it with other prioritizers.

 - the dm-multipath path group concept is abstract and, in general,
independent of ALUA's port groups. The core property of ALUA (all ports
in a group always have the same primary port state) doesn't generally
apply to multipath PGs. Nothing prevents users from configuring
different path grouping policies, like multibus or grouping by latency,
even if ALUA is supported. The idea to prevent dm-multipath from
switching PG away from a transitioning path group is only valid if dm-
multipath PGs map 1:1 to ALUA port groups.

 - ALUA itself is more generic than the scenario you describe. You say
that if a path is transitioning, trying to switch PGs is wrong.
This is true if there's just one other PG, and that other PG is
unavailable. But there may be setups with more than 2 PGs, where one PG
is perfectly healthy while another is transitioning. In such a case,
it'd be reasonable to switch PGs towards the healthy one.

 - multipathd doesn't get notified of ALUA state changes. It polls in
rather long time intervals (5s). If this logic is implemented in
multipathd, we'll see considerable latency in PG switching.

 - In general, multipathd doesn't have full control over dm-multipath PG
switching. Some logic is in the kernel and some in user space. It's not
easy for multipathd to prevent the kernel from switching to a certain
PG. Basically the only way is to set all paths in that PG to failed
status.

Here's one dumb question: ALUA defines "transitioning" in an abstract
fashion, as a "movement from one target port asymmetric access state to
another". This doesn't say anything about the state the port is
transitioning towards. IOW, it could be transitioning towards "STANDBY"
or "UNAVAILABLE", perhaps even "OFFLINE". I am unsure if this makes
sense in practice; I _guess_ that most of the time "transitioning"
means something like "booting". Your suggestions seem to confirm that:
at least for PURE hardware, "transitioning" always means "will be
usable soon". Is this correct? And if so, to what extent could this
be generalized to other arrays, past, present, and future?

Regards
Martin
diff mbox series

Patch

diff --git a/drivers/scsi/device_handler/scsi_dh_alua.c b/drivers/scsi/device_handler/scsi_dh_alua.c
index 37d06f993b76..1d9be771f3ee 100644
--- a/drivers/scsi/device_handler/scsi_dh_alua.c
+++ b/drivers/scsi/device_handler/scsi_dh_alua.c
@@ -1172,9 +1172,8 @@ static blk_status_t alua_prep_fn(struct scsi_device *sdev, struct request *req)
        case SCSI_ACCESS_STATE_OPTIMAL:
        case SCSI_ACCESS_STATE_ACTIVE:
        case SCSI_ACCESS_STATE_LBA:
-               return BLK_STS_OK;
        case SCSI_ACCESS_STATE_TRANSITIONING:
-               return BLK_STS_AGAIN;
+               return BLK_STS_OK;
        default:
                req->rq_flags |= RQF_QUIET;