Message ID | CAHZQxyKJ1qFatzhR-k19PXjAPo7eC0ZgwgaGKwfndB=jEO8mRQ@mail.gmail.com
---|---
State | New
Series | [1/1] scsi dm-mpath do not fail paths which are in ALUA state transitioning
Hello Brian, On Do, 2021-07-08 at 13:42 -0700, Brian Bunker wrote: > In a controller failover do not fail paths that are transitioning or > an unexpected I/O error will return when accessing a multipath device. > > Consider this case, a two controller array with paths coming from a > primary and a secondary controller. During any upgrade there will be a > transition from a secondary to a primary state. > > [...] > 4. It is not expected that the remaining 4 paths will also fail. This > was not the case until the change which introduced BLK_STS_AGAIN into > the SCSI ALUA device handler. With that change new I/O which reaches > that handler on paths that are in ALUA state transitioning will result > in those paths failing. Previous Linux versions, before that change, > will not return an I/O error back to the client application. > Similarly, this problem does not happen in other operating systems, > e.g. ESXi, Windows, AIX, etc. Please confirm that your kernel included ee8868c5c78f ("scsi: scsi_dh_alua: Retry RTPG on a different path after failure"). That commit should cause the RTPG to be retried on other map members which are not in failed state, thus avoiding this phenomenon. > [...] > > 6. The error gets back to the user of the muitipath device > unexpectedly: > Thu Jul 8 13:33:59 2021: /opt/Purity/bin/bb/pureload I/O Error: io > 43047 fd 36 op read offset 00000028ef7a7000 size 4096 errno 11 > rsize -1 > > The earlier patch I made for this was not desirable, so I am proposing > this much smaller patch which will similarly not allow the > transitioning paths to result in immediate failure. > > Signed-off-by: Brian Bunker <brian@purestorage.com> > Acked-by: Krishna Kant <krishna.kant@purestorage.com> > Acked-by: Seamus Connor <sconnor@purestorage.com> > > ____ > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c > index bced42f082b0..d5d6be96068d 100644 > --- a/drivers/md/dm-mpath.c > +++ b/drivers/md/dm-mpath.c > @@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target > *ti, struct request *clone, > else > r = DM_ENDIO_REQUEUE; > > - if (pgpath) > + if (pgpath && (error != BLK_STS_AGAIN)) > fail_path(pgpath); > > if (!atomic_read(&m->nr_valid_paths) && > I doubt that this is correct. If you look at the commit msg of 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA transitioning state"): "When the ALUA state indicates transitioning we should not retry the command immediately, but rather complete the command with BLK_STS_AGAIN to signal the completion handler that it might be retried. This allows multipathing to redirect the command to another path if possible, and avoid stalls during lengthy transitioning times." The purpose of that patch was to set the state of the transitioning path to failed in order to make sure IO is retried on a different path. Your patch would undermine this purpose. Regards Martin
Martin, Please confirm that your kernel included ee8868c5c78f ("scsi: scsi_dh_alua: Retry RTPG on a different path after failure"). That commit should cause the RTPG to be retried on other map members which are not in failed state, thus avoiding this phenomenon. In my case, there are no other map members that are not in the failed state. One set of paths goes to the ALUA unavailable state when the primary fails, and the second set of paths moves to ALUA state transitioning as the previous secondary becomes the primary. If the paths are failed which are transitioning, an all paths down state happens which is not expected. There should be a time for which transitioning is a transient state until the next state is entered. Failing a path assuming there would be non-failed paths seems wrong. The purpose of that patch was to set the state of the transitioning path to failed in order to make sure IO is retried on a different path. Your patch would undermine this purpose. I agree this is what happens but those transitioning paths might be the only non-failed paths available. I don't think it is reasonable to fail them. This is the same as treating transitioning as standby or unavailable. As you point out this happened with the commit you mention. Before this commit what I am doing does not result in an all paths down error, and similarly, it does not in earlier Linux versions or other OS's under the same condition. I see this as a regression. Thanks, Brian On Mon, Jul 12, 2021 at 2:19 AM Martin Wilck <mwilck@suse.com> wrote: > > Hello Brian, > > On Do, 2021-07-08 at 13:42 -0700, Brian Bunker wrote: > > In a controller failover do not fail paths that are transitioning or > > an unexpected I/O error will return when accessing a multipath device. > > > > Consider this case, a two controller array with paths coming from a > > primary and a secondary controller. During any upgrade there will be a > > transition from a secondary to a primary state. > > > > [...] > > 4. It is not expected that the remaining 4 paths will also fail. This > > was not the case until the change which introduced BLK_STS_AGAIN into > > the SCSI ALUA device handler. With that change new I/O which reaches > > that handler on paths that are in ALUA state transitioning will result > > in those paths failing. Previous Linux versions, before that change, > > will not return an I/O error back to the client application. > > Similarly, this problem does not happen in other operating systems, > > e.g. ESXi, Windows, AIX, etc. > > Please confirm that your kernel included ee8868c5c78f ("scsi: > scsi_dh_alua: Retry RTPG on a different path after failure"). > That commit should cause the RTPG to be retried on other map members > which are not in failed state, thus avoiding this phenomenon. > > > > [...] > > > > 6. The error gets back to the user of the muitipath device > > unexpectedly: > > Thu Jul 8 13:33:59 2021: /opt/Purity/bin/bb/pureload I/O Error: io > > 43047 fd 36 op read offset 00000028ef7a7000 size 4096 errno 11 > > rsize -1 > > > > The earlier patch I made for this was not desirable, so I am proposing > > this much smaller patch which will similarly not allow the > > transitioning paths to result in immediate failure. 
> > > > Signed-off-by: Brian Bunker <brian@purestorage.com> > > Acked-by: Krishna Kant <krishna.kant@purestorage.com> > > Acked-by: Seamus Connor <sconnor@purestorage.com> > > > > ____ > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c > > index bced42f082b0..d5d6be96068d 100644 > > --- a/drivers/md/dm-mpath.c > > +++ b/drivers/md/dm-mpath.c > > @@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target > > *ti, struct request *clone, > > else > > r = DM_ENDIO_REQUEUE; > > > > - if (pgpath) > > + if (pgpath && (error != BLK_STS_AGAIN)) > > fail_path(pgpath); > > > > if (!atomic_read(&m->nr_valid_paths) && > > > > I doubt that this is correct. If you look at the commit msg of > 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA > transitioning state"): > > > "When the ALUA state indicates transitioning we should not retry the command > immediately, but rather complete the command with BLK_STS_AGAIN to signal > the completion handler that it might be retried. This allows multipathing > to redirect the command to another path if possible, and avoid stalls > during lengthy transitioning times." > > The purpose of that patch was to set the state of the transitioning > path to failed in order to make sure IO is retried on a different path. > Your patch would undermine this purpose. > > Regards > Martin > > -- Brian Bunker PURE Storage, Inc. brian@purestorage.com
Hello Brian, On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote: > Martin, > > > Please confirm that your kernel included ee8868c5c78f ("scsi: > > scsi_dh_alua: Retry RTPG on a different path after failure"). > > That commit should cause the RTPG to be retried on other map > > members > > which are not in failed state, thus avoiding this phenomenon. > > In my case, there are no other map members that are not in the failed > state. One set of paths goes to the ALUA unavailable state when the > primary fails, and the second set of paths moves to ALUA state > transitioning as the previous secondary becomes the primary. IMO this is the problem. How does your array respond to SCSI commands while ports are transitioning? SPC5 (§5.16.2.6) says that the server should either fail all commands with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support all TMFs and a subset of SCSI commands, while responding with CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different. No matter how you read that paragraph, it's pretty clear that "transitioning" is generally not a healthy state to attempt I/O. Are you saying that on your server, the transitioning ports are able to process regular I/O commands like READ and WRITE? If that's the case, why do you pretend to be "transitioning" at all, rather than in an active state? If it's not the case, why would it make sense for the host to retry I/O on the transitioning path? > If the > paths are failed which are transitioning, an all paths down state > happens which is not expected. IMO it _is_ expected if in fact no path is able to process SCSI commands at the given point in time. > There should be a time for which > transitioning is a transient state until the next state is entered. > Failing a path assuming there would be non-failed paths seems wrong. This is a misunderstanding. The path isn't failed because of assumptions about other paths. It is failed because we know that it's non-functional, and thus we must try other paths, if there are any. Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA transitioning state"), I/O was indeed retried on transitioning paths, possibly forever. This posed a serious problem when a transitioning path was about to be removed (e.g. dev_loss_tmo expired). And I'm still quite convinced that it was wrong in general, because by all reasonable means a "transitioning" path isn't usable for the host. If we find a transitioning path, it might make sense to retry on other paths first and eventually switch back to the transitioning path, when all others have failed hard (e.g. "unavailable" state). However, this logic doesn't exist in the kernel. In theory, it could be mapped to a "transitioning" priority group in device-mapper multipath. But prio groups are managed in user space (multipathd), which treats transitioning paths as "almost failed" (priority 0) anyway. We can discuss enhancing multipathd such that it re-checks transitioning paths more frequently, in order to be able to reinstate them ASAP. According to what you said above, the "transitioning" ports in the problem situation ("second set") are those that were in "unavailable" state before, which means "failed" as far as device mapper is concerned - IOW, the paths in question would be unused anyway, until they got reinstated, which wouldn't happen before they are fully up. With this in mind, I have to say I don't understand why your proposed patch would help at all. 
Please explain. > > The purpose of that patch was to set the state of the transitioning > > path to failed in order to make sure IO is retried on a different > > path. > > Your patch would undermine this purpose. (Additional indentation added by me) Can you please use proper quoting? You were mixing my statements and your own. > I agree this is what happens but those transitioning paths might be > the only non-failed paths available. I don't think it is reasonable > to > fail them. This is the same as treating transitioning as standby or > unavailable. Right, and according to the SPC spec (see above), that makes more sense than treating it as "active". Storage vendors seem to interpret "transitioning" very differently, both in terms of commands supported and in terms of time required to reach the target state. That makes it hard to deal with it correctly on the host side. > As you point out this happened with the commit you > mention. Before this commit what I am doing does not result in an all > paths down error, and similarly, it does not in earlier Linux > versions > or other OS's under the same condition. I see this as a regression. If you use a suitable "no_path_retry" setting in multipathd, you should be able to handle the situation you describe just fine by queueing the I/O until the transitioning paths are fully up. IIUC, on your server "transitioning" is a transient state that ends quickly, so queueing shouldn't be an issue. E.g. if you are certain that "transitioning" won't last longer than 10s, you could set "no_path_retry 2". Regards, Martin
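For reference, the "no_path_retry" approach Martin describes above can be expressed as a minimal multipath.conf sketch. The vendor/product strings and the other device-section values here are illustrative assumptions for a two-controller ALUA array, not a vetted recommendation:

# Illustrative /etc/multipath.conf fragment (example values only).
# no_path_retry N means: once all paths in a map have failed, keep queueing
# I/O for N checker intervals (polling_interval seconds each) before
# failing it back to the caller.
defaults {
        polling_interval 5
}
devices {
        device {
                vendor                  "PURE"          # assumed vendor string
                product                 "FlashArray"    # assumed product string
                path_grouping_policy    group_by_prio
                prio                    alua
                hardware_handler        "1 alua"
                failback                immediate
                no_path_retry           2               # ~10s of queueing with polling_interval 5
        }
}

With polling_interval 5, no_path_retry 2 corresponds to roughly the 10 seconds of queueing Martin mentions.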
On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote: > > Hello Brian, > > On Mo, 2021-07-12 at 14:38 -0700, Brian Bunker wrote: > > Martin, > > > > > Please confirm that your kernel included ee8868c5c78f ("scsi: > > > scsi_dh_alua: Retry RTPG on a different path after failure"). > > > That commit should cause the RTPG to be retried on other map > > > members > > > which are not in failed state, thus avoiding this phenomenon. > > > > In my case, there are no other map members that are not in the failed > > state. One set of paths goes to the ALUA unavailable state when the > > primary fails, and the second set of paths moves to ALUA state > > transitioning as the previous secondary becomes the primary. > > IMO this is the problem. How does your array respond to SCSI commands > while ports are transitioning? > > SPC5 (§5.16.2.6) says that the server should either fail all commands > with BUSY or CHECK CONDITION/NOT READY/LOGICAL UNIT NOT > ACCESSIBLE/ASYMMETRIC ACCESS STATE TRANSITION (a), or should support > all TMFs and a subset of SCSI commands, while responding with > CC/NR/AAST to all other commands (b). SPC6 (§5.18.2.6) is no different. > > No matter how you read that paragraph, it's pretty clear that > "transitioning" is generally not a healthy state to attempt I/O. > > Are you saying that on your server, the transitioning ports are able to > process regular I/O commands like READ and WRITE? If that's the case, > why do you pretend to be "transitioning" at all, rather than in an > active state? If it's not the case, why would it make sense for the > host to retry I/O on the transitioning path? In the ALUA transitioning state, we cannot process READ or WRITE and will return with the sense data as you mentioned above. We expect retries down that transitioning path until it transitions to another ALUA state (at least for some reasonable period of time for the transition). The spec defines the state as the transition between target asymmetric states. The current implementation requires coordination on the target not to return a state transition down all paths at the same time or risk all paths being failed. Using the ALUA transitioning state allows us to respond to initiator READ and WRITE requests even if we can't serve them when our internal target state is transitioning (secondary to primary). The alternative is to queue them which presents a different set of problems. > > > If the > > paths are failed which are transitioning, an all paths down state > > happens which is not expected. > > IMO it _is_ expected if in fact no path is able to process SCSI > commands at the given point in time. In this case it would seem having all paths move to transitioning would lead to all paths lost. It is possible to imagine implementations where for a brief period of time all paths are in a transitioning state. What would be the point of returning a transient state if the result is a permanent failure? > > > There should be a time for which > > transitioning is a transient state until the next state is entered. > > Failing a path assuming there would be non-failed paths seems wrong. > > This is a misunderstanding. The path isn't failed because of > assumptions about other paths. It is failed because we know that it's > non-functional, and thus we must try other paths, if there are any. > > Before 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA > transitioning state"), I/O was indeed retried on transitioning paths, > possibly forever. 
This posed a serious problem when a transitioning > path was about to be removed (e.g. dev_loss_tmo expired). And I'm still > quite convinced that it was wrong in general, because by all reasonable > means a "transitioning" path isn't usable for the host. > > If we find a transitioning path, it might make sense to retry on other > paths first and eventually switch back to the transitioning path, when > all others have failed hard (e.g. "unavailable" state). However, this > logic doesn't exist in the kernel. In theory, it could be mapped to a > "transitioning" priority group in device-mapper multipath. But prio > groups are managed in user space (multipathd), which treats > transitioning paths as "almost failed" (priority 0) anyway. We can > discuss enhancing multipathd such that it re-checks transitioning paths > more frequently, in order to be able to reinstate them ASAP. > > According to what you said above, the "transitioning" ports in the > problem situation ("second set") are those that were in "unavailable" > state before, which means "failed" as far as device mapper is concerned > - IOW, the paths in question would be unused anyway, until they got > reinstated, which wouldn't happen before they are fully up. With this > in mind, I have to say I don't understand why your proposed patch would > help at all. Please explain. > My proposed patch would not fail the paths in the case of BLK_STS_AGAIN. This seems to result in the requests being retried until the path transitions to either a failed state, standby or unavailable, or an online state. > > > The purpose of that patch was to set the state of the transitioning > > > path to failed in order to make sure IO is retried on a different > > > path. > > > Your patch would undermine this purpose. > > (Additional indentation added by me) Can you please use proper quoting? > You were mixing my statements and your own. > > > I agree this is what happens but those transitioning paths might be > > the only non-failed paths available. I don't think it is reasonable > > to > > fail them. This is the same as treating transitioning as standby or > > unavailable. > > Right, and according to the SPC spec (see above), that makes more sense > than treating it as "active". > > Storage vendors seem to interpret "transitioning" very differently, > both in terms of commands supported and in terms of time required to > reach the target state. That makes it hard to deal with it correctly on > the host side. > > > As you point out this happened with the commit you > > mention. Before this commit what I am doing does not result in an all > > paths down error, and similarly, it does not in earlier Linux > > versions > > or other OS's under the same condition. I see this as a regression. > > If you use a suitable "no_path_retry" setting in multipathd, you should > be able to handle the situation you describe just fine by queueing the > I/O until the transitioning paths are fully up. IIUC, on your server > "transitioning" is a transient state that ends quickly, so queueing > shouldn't be an issue. E.g. if you are certain that "transitioning" > won't last longer than 10s, you could set "no_path_retry 2". > > Regards, > Martin > > > I have tested using the no_path_retry and you are correct that it does work around the issue that I am seeing. The problem with that is there are times we want to convey all paths down to the initiator as quickly as possible and doing this will delay that. Thanks, Brian Brian Bunker PURE Storage, Inc. brian@purestorage.com
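The sense data referred to here has fixed values in SPC: sense key NOT READY (02h) with ASC/ASCQ 04h/0Ah, "logical unit not accessible, asymmetric access state transition". A small self-contained sketch of how an initiator-side check for it could look; the helper name and the assumption of fixed-format sense data are illustrative, this is not code taken from the kernel:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* SPC: NOT READY / LOGICAL UNIT NOT ACCESSIBLE, ASYMMETRIC ACCESS STATE TRANSITION */
#define SENSE_KEY_NOT_READY     0x02
#define ASC_LUN_NOT_ACCESSIBLE  0x04
#define ASCQ_STATE_TRANSITION   0x0a

/* Illustrative helper: only handles fixed-format sense data (response code 70h/71h). */
static bool sense_is_alua_transitioning(const uint8_t *sense, size_t len)
{
        uint8_t rc;

        if (len < 14)
                return false;
        rc = sense[0] & 0x7f;
        if (rc != 0x70 && rc != 0x71)
                return false;   /* descriptor-format sense not handled in this sketch */
        return (sense[2] & 0x0f) == SENSE_KEY_NOT_READY &&
               sense[12] == ASC_LUN_NOT_ACCESSIBLE &&
               sense[13] == ASCQ_STATE_TRANSITION;
}

int main(void)
{
        /* Example CHECK CONDITION sense buffer as a transitioning port would return it. */
        uint8_t sense[18] = { 0x70, 0x00, SENSE_KEY_NOT_READY, 0, 0, 0, 0, 0x0a,
                              0, 0, 0, 0, ASC_LUN_NOT_ACCESSIBLE, ASCQ_STATE_TRANSITION };

        printf("ALUA transitioning: %s\n",
               sense_is_alua_transitioning(sense, sizeof(sense)) ? "yes" : "no");
        return 0;
}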
On Di, 2021-07-13 at 17:37 -0700, Brian Bunker wrote: > On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote: > > Are you saying that on your server, the transitioning ports are able > > to > > process regular I/O commands like READ and WRITE? If that's the case, > > why do you pretend to be "transitioning" at all, rather than in an > > active state? If it's not the case, why would it make sense for the > > host to retry I/O on the transitioning path? > > In the ALUA transitioning state, we cannot process READ or WRITE and > will return with the sense data as you mentioned above. We expect > retries down that transitioning path until it transitions to another > ALUA state (at least for some reasonable period of time for the > transition). The spec defines the state as the transition between > target asymmetric states. The current implementation requires > coordination on the target not to return a state transition down all > paths at the same time or risk all paths being failed. Using the ALUA > transition state allows us to respond to initiator READ and WRITE > requests even if we can't serve them when our internal target state is > transitioning (secondary to primary). The alternative is to queue them > which presents a different set of problems. Indeed, it would be less prone to host-side errors if the "new" pathgroup went to a usable state before the "old" pathgroup got unavailable. Granted, this may be difficult to guarantee on the storage side. > > > If the > > > paths are failed which are transitioning, an all paths down state > > > happens which is not expected. > > > > IMO it _is_ expected if in fact no path is able to process SCSI > > commands at the given point in time. > > In this case it would seem having all paths move to transitioning > would lead to all paths lost. It is possible to imagine > implementations where for a brief period of time all paths are in a > transitioning state. What would be the point of returning a transient > state if the result is a permanent failure? When a command fails with ALUA TRANSITIONING status, we must make sure that: 1) The command itself is not retried on the path at hand, neither on the SCSI layer nor on the blk-mq layer. The former was the case anyway, the latter is guaranteed by 0d88232010d5 ("scsi: core: Return BLK_STS_AGAIN for ALUA transitioning"). 2) No new commands are sent down this path until it reaches a usable final state. This is achieved on the SCSI layer by alua_prep_fn(), with 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA transitioning state"). These two items would still be true with your patch applied. However, the problem is that if the path isn't failed, dm-multipath would continue sending I/O down this path. If the path isn't set to failed state, the path selector algorithm may or may not choose a different path next time. In the worst case, dm-multipath would busy-loop retrying the I/O on the same path. 
Please consider the following:

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 86518aec32b4..3f3a89fc2b3b 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 	if (error && blk_path_error(error)) {
 		struct multipath *m = ti->private;
 
-		if (error == BLK_STS_RESOURCE)
+		if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
 			r = DM_ENDIO_DELAY_REQUEUE;
 		else
 			r = DM_ENDIO_REQUEUE;
 
-		if (pgpath)
+		if (pgpath && (error != BLK_STS_AGAIN))
 			fail_path(pgpath);

This way we'd avoid busy-looping by delaying the retry. This would cause I/O delay in the case where some healthy paths are still in the same dm-multipath priority group as the transitioning path. I suppose this is a minor problem, because in the default case for ALUA (group_by_prio in multipathd), all paths in the PG would have switched to "transitioning" simultaneously.

Note that multipathd assigns priority 0 (the same prio as "unavailable") if it happens to notice a "transitioning" path. This is something we may want to change eventually. In your specific case, it would cause the paths to be temporarily re-grouped, all paths being moved to the same non-functional PG. The way you describe it, for your storage at least, "transitioning" should be assigned a higher priority.

> > If you use a suitable "no_path_retry" setting in multipathd, you should
> > be able to handle the situation you describe just fine by queueing the
> > I/O until the transitioning paths are fully up. IIUC, on your server
> > "transitioning" is a transient state that ends quickly, so queueing
> > shouldn't be an issue. E.g. if you are certain that "transitioning"
> > won't last longer than 10s, you could set "no_path_retry 2".
>
> I have tested using the no_path_retry and you are correct that it does
> work around the issue that I am seeing. The problem with that is there
> are times we want to convey all paths down to the initiator as quickly
> as possible and doing this will delay that.

Ok, that makes sense e.g. for cluster configurations. So, the purpose is to distinguish between two cases where no path can process SCSI commands: a) all paths are really down on the storage, and b) some paths are down and some are "coming up".

Thanks,
Martin
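To spell out the behaviour the proposed diff produces, here is a small self-contained model of the error branch of multipath_end_io() as it would look with that change applied. This is ordinary userspace C with stand-in enums that only mirror the kernel names; it is a sketch of the decision logic, not kernel code:

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins that only mirror the kernel names; the values are arbitrary here. */
enum blk_sts { BLK_STS_OK, BLK_STS_IOERR, BLK_STS_RESOURCE, BLK_STS_AGAIN };
enum dm_endio { DM_ENDIO_DONE, DM_ENDIO_REQUEUE, DM_ENDIO_DELAY_REQUEUE };

/*
 * Model of the error branch of multipath_end_io() with the proposed change:
 * BLK_STS_AGAIN (ALUA transitioning) delays the requeue and does NOT fail the
 * path; BLK_STS_RESOURCE also delays the requeue but still fails the path;
 * every other path error requeues immediately and fails the path.
 */
static enum dm_endio end_io_disposition(enum blk_sts error, bool *fail_the_path)
{
        enum dm_endio r;

        if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN)
                r = DM_ENDIO_DELAY_REQUEUE;     /* retry later, avoid busy-looping */
        else
                r = DM_ENDIO_REQUEUE;           /* retry immediately on another path */

        *fail_the_path = (error != BLK_STS_AGAIN);
        return r;
}

int main(void)
{
        static const char *const names[] = { "OK", "IOERR", "RESOURCE", "AGAIN" };

        for (int e = BLK_STS_IOERR; e <= BLK_STS_AGAIN; e++) {
                bool fail;
                enum dm_endio r = end_io_disposition((enum blk_sts)e, &fail);

                printf("BLK_STS_%-8s -> %s, fail_path=%s\n", names[e],
                       r == DM_ENDIO_DELAY_REQUEUE ? "DM_ENDIO_DELAY_REQUEUE"
                                                   : "DM_ENDIO_REQUEUE",
                       fail ? "yes" : "no");
        }
        return 0;
}

The difference from the v1 patch at the top of the thread is the DM_ENDIO_DELAY_REQUEUE case for BLK_STS_AGAIN, which keeps the path selector from immediately re-picking the same transitioning path.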
On Wed, Jul 14, 2021 at 1:39 AM Martin Wilck <mwilck@suse.com> wrote: > > On Di, 2021-07-13 at 17:37 -0700, Brian Bunker wrote: > > On Tue, Jul 13, 2021 at 2:13 AM Martin Wilck <mwilck@suse.com> wrote: > > > Are you saying that on your server, the transitioning ports are able > > > to > > > process regular I/O commands like READ and WRITE? If that's the case, > > > why do you pretend to be "transitioning" at all, rather than in an > > > active state? If it's not the case, why would it make sense for the > > > host to retry I/O on the transitioning path? > > > > In the ALUA transitioning state, we cannot process READ or WRITE and > > will return with the sense data as you mentioned above. We expect > > retries down that transitioning path until it transitions to another > > ALUA state (at least for some reasonable period of time for the > > transition). The spec defines the state as the transition between > > target asymmetric states. The current implementation requires > > coordination on the target not to return a state transition down all > > paths at the same time or risk all paths being failed. Using the ALUA > > transition state allows us to respond to initiator READ and WRITE > > requests even if we can't serve them when our internal target state is > > transitioning (secondary to primary). The alternative is to queue them > > which presents a different set of problems. > > Indeed, it would be less prone to host-side errors if the "new" > pathgroup went to a usable state before the "old" pathgroup got > unavailable. Granted, this may be difficult to guarantee on the storage > side. > For us, this is not easily doable. We are transitioning from a secondary to a primary, so for a brief time we have no paths which can serve the I/O. The transition looks like this (a bit oversimplified, but the general idea): 1. primary and secondary are active optimized 2. primary fails and secondary starts promoting 3. both previous primary and promoting secondary return transitioning 4. once the promoting primary gets far enough along the previous primary returns unavailable 5. the promoting primary continues to return transitioning 6. the promoting primary is done and returns active optimized 7. the previous primary becomes secondary and returns active optimized So there are windows when we can return AO for all paths, transitioning for all paths, transitioning for half the paths and unavailable for the other half, even though these windows can be very small, they are possible. > > > > If the > > > > paths are failed which are transitioning, an all paths down state > > > > happens which is not expected. > > > > > > IMO it _is_ expected if in fact no path is able to process SCSI > > > commands at the given point in time. > > > > In this case it would seem having all paths move to transitioning > > would lead to all paths lost. It is possible to imagine > > implementations where for a brief period of time all paths are in a > > transitioning state. What would be the point of returning a transient > > state if the result is a permanent failure? > > When a command fails with ALUA TRANSITIONING status, we must make sure > that: > > 1) The command itself is not retried on the path at hand, neither on > the SCSI layer nor on the blk-mq layer. The former was the case anyway, > the latter is guaranteed by 0d88232010d5 ("scsi: core: Return > BLK_STS_AGAIN for ALUA transitioning"). > > 2) No new commands are sent down this path until it reaches a usable > final state. 
This is achieved on the SCSI layer by alua_prep_fn(), with > 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA > transitioning state"). > > These two items would still be true with your patch applied. However, > the problem is that if the path isn't failed, dm-multipath would > continue sending I/O down this path. If the path isn't set to failed > state, the path selector algorithm may or may not choose a different > path next time. In the worst case, dm-multipath would busy-loop > retrying the I/O on the same path. Please consider the following: > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c > index 86518aec32b4..3f3a89fc2b3b 100644 > --- a/drivers/md/dm-mpath.c > +++ b/drivers/md/dm-mpath.c > @@ -1654,12 +1654,12 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone, > if (error && blk_path_error(error)) { > struct multipath *m = ti->private; > > - if (error == BLK_STS_RESOURCE) > + if (error == BLK_STS_RESOURCE || error == BLK_STS_AGAIN) > r = DM_ENDIO_DELAY_REQUEUE; > else > r = DM_ENDIO_REQUEUE; > > - if (pgpath) > + if (pgpath && (error != BLK_STS_AGAIN)) > fail_path(pgpath); > > This way we'd avoid busy-looping by delaying the retry. This would > cause I/O delay in the case where some healthy paths are still in the > same dm-multipath priority group as the transitioning path. I suppose > this is a minor problem, because in the default case for ALUA > (group_by_prio in multipathd), all paths in the PG would have switched > to "transitioning" simultaneously. > > Note that multipathd assigns priority 0 (the same prio as > "unavailable") if it happens to notice a "transitioning" path. This is > something we may want to change eventually. In your specific case, it > would cause the paths to be temporarily re-grouped, all paths being > moved to the same non-functional PG. The way you describe it, for your > storage at least, "transitioning" should be assigned a higher priority. > > > Yes. I tried it with this change and it works. Should I re-submit this modified version or do you want to do it? Either way works for me. I was flying a bit in the dark with my initial patch since I just found the fail_path and made that not run without much thought to what happens after that. > > > If you use a suitable "no_path_retry" setting in multipathd, you > > > should > > > be able to handle the situation you describe just fine by queueing > > > the > > > I/O until the transitioning paths are fully up. IIUC, on your > > > server > > > "transitioning" is a transient state that ends quickly, so queueing > > > shouldn't be an issue. E.g. if you are certain that "transitioning" > > > won't last longer than 10s, you could set "no_path_retry 2". > > > > I have tested using the no_path_retry and you are correct that it > > does > > work around the issue that I am seeing. The problem with that is are > > times > > we want to convey all paths down to the initiator as quickly > > as possible and doing this will delay that. > > Ok, that makes sense e.g. for cluster configurations. So, the purpose > is to distinguish between two cases where no path can process SCSI > commands: a) all paths are really down on the storage, and b) some > paths are down and some are "coming up". > > Thanks, > Martin > > Thanks, Brian -- Brian Bunker PURE Storage, Inc. brian@purestorage.com
On Mi, 2021-07-14 at 11:13 -0700, Brian Bunker wrote: > On Wed, Jul 14, 2021 at 1:39 AM Martin Wilck <mwilck@suse.com> wrote: > > > > > When a command fails with ALUA TRANSITIONING status, we must make > > sure > > that: > > > > 1) The command itself is not retried on the path at hand, neither > > on > > the SCSI layer nor on the blk-mq layer. The former was the case > > anyway, > > the latter is guaranteed by 0d88232010d5 ("scsi: core: Return > > BLK_STS_AGAIN for ALUA transitioning"). > > > > 2) No new commands are sent down this path until it reaches a > > usable > > final state. This is achieved on the SCSI layer by alua_prep_fn(), > > with > > 268940b80fa4 ("scsi: scsi_dh_alua: Return BLK_STS_AGAIN for ALUA > > transitioning state"). > > > > These two items would still be true with your patch applied. > > However, > > the problem is that if the path isn't failed, dm-multipath would > > continue sending I/O down this path. If the path isn't set to > > failed > > state, the path selector algorithm may or may not choose a > > different > > path next time. In the worst case, dm-multipath would busy-loop > > retrying the I/O on the same path. Please consider the following: > > > > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c > > index 86518aec32b4..3f3a89fc2b3b 100644 > > --- a/drivers/md/dm-mpath.c > > +++ b/drivers/md/dm-mpath.c > > @@ -1654,12 +1654,12 @@ static int multipath_end_io(struct > > dm_target *ti, struct request *clone, > > if (error && blk_path_error(error)) { > > struct multipath *m = ti->private; > > > > - if (error == BLK_STS_RESOURCE) > > + if (error == BLK_STS_RESOURCE || error == > > BLK_STS_AGAIN) > > r = DM_ENDIO_DELAY_REQUEUE; > > else > > r = DM_ENDIO_REQUEUE; > > > > - if (pgpath) > > + if (pgpath && (error != BLK_STS_AGAIN)) > > fail_path(pgpath); > > > > This way we'd avoid busy-looping by delaying the retry. This would > > cause I/O delay in the case where some healthy paths are still in > > the > > same dm-multipath priority group as the transitioning path. I > > suppose > > this is a minor problem, because in the default case for ALUA > > (group_by_prio in multipathd), all paths in the PG would have > > switched > > to "transitioning" simultaneously. > > > > Note that multipathd assigns priority 0 (the same prio as > > "unavailable") if it happens to notice a "transitioning" path. This > > is > > something we may want to change eventually. In your specific case, > > it > > would cause the paths to be temporarily re-grouped, all paths being > > moved to the same non-functional PG. The way you describe it, for > > your > > storage at least, "transitioning" should be assigned a higher > > priority. > > > > > > > Yes. I tried it with this change and it works. Should I re-submit > this > modified version or do you want to do it? Yes, please. Regards, Martin
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index bced42f082b0..d5d6be96068d 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -1657,7 +1657,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
 		else
 			r = DM_ENDIO_REQUEUE;
 
-		if (pgpath)
+		if (pgpath && (error != BLK_STS_AGAIN))
 			fail_path(pgpath);