[v4,5/9] scsi: Do not wait for a request in scsi_eh_lock_door()

Message ID 20201130024615.29171-6-bvanassche@acm.org
State New
Series Rework runtime suspend and SPI domain validation

Commit Message

Bart Van Assche Nov. 30, 2020, 2:46 a.m. UTC
scsi_eh_lock_door() is the only function in the SCSI error handler that
calls blk_get_request(). It is not guaranteed that a request is available
when scsi_eh_lock_door() is called. Hence pass the BLK_MQ_REQ_NOWAIT flag
to blk_get_request().

Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Can Guo <cang@codeaurora.org>
Cc: Stanley Chu <stanley.chu@mediatek.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 drivers/scsi/scsi_error.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)
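
The non-blocking allocation described in the commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `toy_queue`, `toy_request`, and `toy_get_request_nowait()` are hypothetical names, not block layer APIs, the error codes are chosen for illustration, and the `ERR_PTR` helpers are simplified stand-ins for the kernel macros of the same name. It shows the pattern the patch relies on: the allocator fails at once with an encoded error instead of sleeping, and the caller checks the result with `IS_ERR()`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095
static inline void *ERR_PTR(long err) { return (void *)(intptr_t)err; }
static inline long PTR_ERR(const void *p) { return (long)(intptr_t)p; }
static inline int IS_ERR(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical toy request pool; not a real block layer structure. */
struct toy_request { int in_use; };
struct toy_queue {
	struct toy_request slots[2];
	int frozen;
};

/*
 * Hypothetical non-blocking allocator: it never sleeps. It returns an
 * encoded error if the queue is frozen or if every slot is taken.
 * The specific error codes are assumptions made for this sketch.
 */
struct toy_request *toy_get_request_nowait(struct toy_queue *q)
{
	size_t i;

	if (q->frozen)
		return ERR_PTR(-EBUSY);
	for (i = 0; i < sizeof(q->slots) / sizeof(q->slots[0]); i++) {
		if (!q->slots[i].in_use) {
			q->slots[i].in_use = 1;
			return &q->slots[i];
		}
	}
	return ERR_PTR(-EAGAIN);
}
```

A caller in this model checks `IS_ERR()` on the result and returns early on failure, which mirrors the `if (IS_ERR(req)) return;` lines in the patch.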

Comments

Hannes Reinecke Dec. 2, 2020, 7:06 a.m. UTC | #1
On 11/30/20 3:46 AM, Bart Van Assche wrote:
> scsi_eh_lock_door() is the only function in the SCSI error handler that
> calls blk_get_request(). It is not guaranteed that a request is available
> when scsi_eh_lock_door() is called. Hence pass the BLK_MQ_REQ_NOWAIT flag
> to blk_get_request().
>
> Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Cc: Can Guo <cang@codeaurora.org>
> Cc: Stanley Chu <stanley.chu@mediatek.com>
> Cc: Ming Lei <ming.lei@redhat.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Signed-off-by: Bart Van Assche <bvanassche@acm.org>
> ---
>   drivers/scsi/scsi_error.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index d94449188270..6de6e1bf3dcb 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1993,7 +1993,12 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
>   	struct request *req;
>   	struct scsi_request *rq;
>
> -	req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN, 0);
> +	/*
> +	 * It is not guaranteed that a request is available nor that
> +	 * sdev->request_queue is unfrozen. Hence the BLK_MQ_REQ_NOWAIT below.
> +	 */
> +	req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
> +			      BLK_MQ_REQ_NOWAIT);
>   	if (IS_ERR(req))
>   		return;
>   	rq = scsi_req(req);
>

Well ... had been thinking about that one, too.
The idea of this function is that prior to SCSI EH the device was locked
via scsi_set_medium_removal(). And during SCSI EH the device might have 
become unlocked, so we need to lock it again.
However, scsi_set_medium_removal() not only issues the 
PREVENT_ALLOW_MEDIUM_REMOVAL command, but also sets the 'locked' flag 
based on the result.
So if we fail to get a request here, shouldn't we unset the 'locked' 
flag, too?
And what happens if we fail here? There is no return value, hence
SCSI EH might run to completion, and the system will continue
with an unlocked door ...
Not sure if that's a good idea.

But anyway, at the very least unset the 'locked' flag upon failure such 
that the internal state is correctly updated.

_Actually_, the flag should be unset after each successful SCSI EH step, 
to mirror the actual state. But this is probably out of scope for this 
patch.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
Bart Van Assche Dec. 3, 2020, 5:10 a.m. UTC | #2
On 12/1/20 11:06 PM, Hannes Reinecke wrote:
> On 11/30/20 3:46 AM, Bart Van Assche wrote:
>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>> index d94449188270..6de6e1bf3dcb 100644
>> --- a/drivers/scsi/scsi_error.c
>> +++ b/drivers/scsi/scsi_error.c
>> @@ -1993,7 +1993,12 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
>>       struct request *req;
>>       struct scsi_request *rq;
>> -    req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN, 0);
>> +    /*
>> +     * It is not guaranteed that a request is available nor that
>> +     * sdev->request_queue is unfrozen. Hence the BLK_MQ_REQ_NOWAIT below.
>> +     */
>> +    req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
>> +                  BLK_MQ_REQ_NOWAIT);
>>       if (IS_ERR(req))
>>           return;
>>       rq = scsi_req(req);
>>
>
> Well ... had been thinking about that one, too.
> The idea of this function is that prior to SCSI EH the device was locked
> via scsi_set_medium_removal(). And during SCSI EH the device might have
> become unlocked, so we need to lock it again.
> However, scsi_set_medium_removal() not only issues the
> PREVENT_ALLOW_MEDIUM_REMOVAL command, but also sets the 'locked' flag
> based on the result.
> So if we fail to get a request here, shouldn't we unset the 'locked'
> flag, too?

Probably not. My interpretation of the 'locked' flag is that it
represents the door state before error handling began. The following
code in the SCSI error handler restores the door state after a bus reset:

	if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
		scsi_eh_lock_door(sdev);
		sdev->was_reset = 0;
	}

> And what does happen if we fail here? There is no return value, hence
> SCSI EH might run to completion, and the system will continue
> with an unlocked door ...
> Not sure if that's a good idea.

How about applying the following patch on top of patch 5/9?

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 6de6e1bf3dcb..feac7262e40e 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1988,7 +1988,7 @@ static void eh_lock_door_done(struct request *req, blk_status_t status)
  * 	We queue up an asynchronous "ALLOW MEDIUM REMOVAL" request on the
  * 	head of the devices request queue, and continue.
  */
-static void scsi_eh_lock_door(struct scsi_device *sdev)
+static int scsi_eh_lock_door(struct scsi_device *sdev)
 {
 	struct request *req;
 	struct scsi_request *rq;
@@ -2000,7 +2000,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
 	req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
 			      BLK_MQ_REQ_NOWAIT);
 	if (IS_ERR(req))
-		return;
+		return PTR_ERR(req);
 	rq = scsi_req(req);

 	rq->cmd[0] = ALLOW_MEDIUM_REMOVAL;
@@ -2016,6 +2016,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
 	rq->retries = 5;

 	blk_execute_rq_nowait(req->q, NULL, req, 1, eh_lock_door_done);
+	return 0;
 }

 /**
@@ -2037,8 +2038,8 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
 	 * is no point trying to lock the door of an off-line device.
 	 */
 	shost_for_each_device(sdev, shost) {
-		if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
-			scsi_eh_lock_door(sdev);
+		if (scsi_device_online(sdev) && sdev->was_reset &&
+		    sdev->locked && scsi_eh_lock_door(sdev) == 0) {
 			sdev->was_reset = 0;
 		}
 	}

Thanks,

Bart.
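
The behaviour of the follow-up patch above can be modelled in userspace. This is a minimal sketch, assuming a hypothetical `toy_sdev` structure and hypothetical helpers `toy_lock_door()` and `toy_restart_one()`; only the flag names `online`, `was_reset`, and `locked` mirror the fields discussed in the thread. It shows the consequence of putting the call inside the condition: `was_reset` is cleared only when the lock request was queued, so a failed attempt leaves the device eligible for another attempt on a later pass.

```c
#include <assert.h>
#include <errno.h>

/* Toy device; the last two fields are hypothetical bookkeeping. */
struct toy_sdev {
	int online;
	int was_reset;
	int locked;
	int requests_free;     /* free requests left in the toy pool */
	int lock_cmds_queued;  /* how many lock commands were queued */
};

/* Hypothetical stand-in for an int-returning scsi_eh_lock_door(). */
int toy_lock_door(struct toy_sdev *sdev)
{
	if (sdev->requests_free == 0)
		return -EAGAIN;    /* no request available, do not wait */
	sdev->requests_free--;
	sdev->lock_cmds_queued++;
	return 0;
}

/*
 * Loop body for one device: short-circuit evaluation means the lock
 * attempt happens only for online, reset, locked devices, and
 * was_reset is cleared only if the attempt succeeded.
 */
void toy_restart_one(struct toy_sdev *sdev)
{
	if (sdev->online && sdev->was_reset &&
	    sdev->locked && toy_lock_door(sdev) == 0)
		sdev->was_reset = 0;
}
```

In this model a device with no free request keeps `was_reset` set after the first call, and a second call made once a request is free queues the lock command and clears the flag. When such a second pass would actually happen in the kernel is not addressed by this sketch.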
Hannes Reinecke Dec. 3, 2020, 7:18 a.m. UTC | #3
On 12/3/20 6:10 AM, Bart Van Assche wrote:
> On 12/1/20 11:06 PM, Hannes Reinecke wrote:
>> On 11/30/20 3:46 AM, Bart Van Assche wrote:
>>> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
>>> index d94449188270..6de6e1bf3dcb 100644
>>> --- a/drivers/scsi/scsi_error.c
>>> +++ b/drivers/scsi/scsi_error.c
>>> @@ -1993,7 +1993,12 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
>>>        struct request *req;
>>>        struct scsi_request *rq;
>>> -    req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN, 0);
>>> +    /*
>>> +     * It is not guaranteed that a request is available nor that
>>> +     * sdev->request_queue is unfrozen. Hence the BLK_MQ_REQ_NOWAIT below.
>>> +     */
>>> +    req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
>>> +                  BLK_MQ_REQ_NOWAIT);
>>>        if (IS_ERR(req))
>>>            return;
>>>        rq = scsi_req(req);
>>>
>>
>> Well ... had been thinking about that one, too.
>> The idea of this function is that prior to SCSI EH the device was locked
>> via scsi_set_medium_removal(). And during SCSI EH the device might have
>> become unlocked, so we need to lock it again.
>> However, scsi_set_medium_removal() not only issues the
>> PREVENT_ALLOW_MEDIUM_REMOVAL command, but also sets the 'locked' flag
>> based on the result.
>> So if we fail to get a request here, shouldn't we unset the 'locked'
>> flag, too?
>
> Probably not. My interpretation of the 'locked' flag is that it
> represents the door state before error handling began. The following
> code in the SCSI error handler restores the door state after a bus reset:
>
> 	if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
> 		scsi_eh_lock_door(sdev);
> 		sdev->was_reset = 0;
> 	}
>
>> And what does happen if we fail here? There is no return value, hence
>> SCSI EH might run to completion, and the system will continue
>> with an unlocked door ...
>> Not sure if that's a good idea.
>
> How about applying the following patch on top of patch 5/9?
>
> diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> index 6de6e1bf3dcb..feac7262e40e 100644
> --- a/drivers/scsi/scsi_error.c
> +++ b/drivers/scsi/scsi_error.c
> @@ -1988,7 +1988,7 @@ static void eh_lock_door_done(struct request *req, blk_status_t status)
>    * 	We queue up an asynchronous "ALLOW MEDIUM REMOVAL" request on the
>    * 	head of the devices request queue, and continue.
>    */
> -static void scsi_eh_lock_door(struct scsi_device *sdev)
> +static int scsi_eh_lock_door(struct scsi_device *sdev)
>   {
>   	struct request *req;
>   	struct scsi_request *rq;
> @@ -2000,7 +2000,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
>   	req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
>   			      BLK_MQ_REQ_NOWAIT);
>   	if (IS_ERR(req))
> -		return;
> +		return PTR_ERR(req);
>   	rq = scsi_req(req);
>
>   	rq->cmd[0] = ALLOW_MEDIUM_REMOVAL;
> @@ -2016,6 +2016,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
>   	rq->retries = 5;
>
>   	blk_execute_rq_nowait(req->q, NULL, req, 1, eh_lock_door_done);
> +	return 0;
>   }
>
>   /**
> @@ -2037,8 +2038,8 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
>   	 * is no point trying to lock the door of an off-line device.
>   	 */
>   	shost_for_each_device(sdev, shost) {
> -		if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
> -			scsi_eh_lock_door(sdev);
> +		if (scsi_device_online(sdev) && sdev->was_reset &&
> +		    sdev->locked && scsi_eh_lock_door(sdev) == 0) {
>   			sdev->was_reset = 0;
>   		}
>   	}
>

I probably didn't make myself clear.
As per SBC (in this case, sbc3r36), the effects of
PREVENT_ALLOW_MEDIUM_REMOVAL are reset by a successful LUN Reset,
Hard Reset, Power/On Reset, or an I_T Nexus loss. This incidentally
maps nicely onto SCSI EH, so after a successful SCSI EH the door will be
unlocked (which is why we need to call scsi_eh_lock_door()).
In the SCSI midlayer this state is reflected by the 'locked' flag.
Now, if scsi_eh_lock_door() is _not_ executed due to a
blk_get_request() failure, the device remains unlocked, and as such the
'locked' flag would need to be _unset_.

So I was thinking more along these lines:

@@ -2030,7 +2037,8 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
          */
         shost_for_each_device(sdev, shost) {
                 if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
-                       scsi_eh_lock_door(sdev);
+                       if (scsi_eh_lock_door(sdev) < 0)
+                               sdev->locked = 0;
                         sdev->was_reset = 0;
                 }
         }


Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare@suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
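
For comparison, the alternative proposed in comment #3 can be modelled the same way. Again this is only a sketch under assumptions: `toy_sdev`, `toy_lock_door()`, and `toy_restart_one_clear()` are hypothetical userspace names, and the model presumes an int-returning lock function, which the patch under review does not provide. Here a failed lock attempt clears `locked` so that the flag matches the assumed device state after the reset, and `was_reset` is cleared in either case.

```c
#include <assert.h>
#include <errno.h>

/* Toy device; the last two fields are hypothetical bookkeeping. */
struct toy_sdev {
	int online;
	int was_reset;
	int locked;
	int requests_free;     /* free requests left in the toy pool */
	int lock_cmds_queued;  /* how many lock commands were queued */
};

/* Hypothetical stand-in for an int-returning scsi_eh_lock_door(). */
int toy_lock_door(struct toy_sdev *sdev)
{
	if (sdev->requests_free == 0)
		return -EAGAIN;    /* no request available, do not wait */
	sdev->requests_free--;
	sdev->lock_cmds_queued++;
	return 0;
}

/*
 * Loop body for one device: if the lock command cannot be queued,
 * record that the door is no longer locked. The reset is treated as
 * handled whether or not the lock attempt succeeded.
 */
void toy_restart_one_clear(struct toy_sdev *sdev)
{
	if (sdev->online && sdev->was_reset && sdev->locked) {
		if (toy_lock_door(sdev) < 0)
			sdev->locked = 0;
		sdev->was_reset = 0;
	}
}
```

The trade-off in this model is that no later pass retries the lock, because `locked` is already zero; the software state stays consistent with an unlocked door instead.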
Ming Lei Dec. 3, 2020, 7:27 a.m. UTC | #4
On Thu, Dec 03, 2020 at 08:18:57AM +0100, Hannes Reinecke wrote:
> On 12/3/20 6:10 AM, Bart Van Assche wrote:
> > On 12/1/20 11:06 PM, Hannes Reinecke wrote:
> > > On 11/30/20 3:46 AM, Bart Van Assche wrote:
> > > > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > > > index d94449188270..6de6e1bf3dcb 100644
> > > > --- a/drivers/scsi/scsi_error.c
> > > > +++ b/drivers/scsi/scsi_error.c
> > > > @@ -1993,7 +1993,12 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
> > > >        struct request *req;
> > > >        struct scsi_request *rq;
> > > > -    req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN, 0);
> > > > +    /*
> > > > +     * It is not guaranteed that a request is available nor that
> > > > +     * sdev->request_queue is unfrozen. Hence the BLK_MQ_REQ_NOWAIT below.
> > > > +     */
> > > > +    req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
> > > > +                  BLK_MQ_REQ_NOWAIT);
> > > >        if (IS_ERR(req))
> > > >            return;
> > > >        rq = scsi_req(req);
> > > >
> > >
> > > Well ... had been thinking about that one, too.
> > > The idea of this function is that prior to SCSI EH the device was locked
> > > via scsi_set_medium_removal(). And during SCSI EH the device might have
> > > become unlocked, so we need to lock it again.
> > > However, scsi_set_medium_removal() not only issues the
> > > PREVENT_ALLOW_MEDIUM_REMOVAL command, but also sets the 'locked' flag
> > > based on the result.
> > > So if we fail to get a request here, shouldn't we unset the 'locked'
> > > flag, too?
> >
> > Probably not. My interpretation of the 'locked' flag is that it
> > represents the door state before error handling began. The following
> > code in the SCSI error handler restores the door state after a bus reset:
> >
> > 	if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
> > 		scsi_eh_lock_door(sdev);
> > 		sdev->was_reset = 0;
> > 	}
> >
> > > And what does happen if we fail here? There is no return value, hence
> > > SCSI EH might run to completion, and the system will continue
> > > with an unlocked door ...
> > > Not sure if that's a good idea.
> >
> > How about applying the following patch on top of patch 5/9?
> >
> > diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
> > index 6de6e1bf3dcb..feac7262e40e 100644
> > --- a/drivers/scsi/scsi_error.c
> > +++ b/drivers/scsi/scsi_error.c
> > @@ -1988,7 +1988,7 @@ static void eh_lock_door_done(struct request *req, blk_status_t status)
> >    * 	We queue up an asynchronous "ALLOW MEDIUM REMOVAL" request on the
> >    * 	head of the devices request queue, and continue.
> >    */
> > -static void scsi_eh_lock_door(struct scsi_device *sdev)
> > +static int scsi_eh_lock_door(struct scsi_device *sdev)
> >   {
> >   	struct request *req;
> >   	struct scsi_request *rq;
> > @@ -2000,7 +2000,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
> >   	req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
> >   			      BLK_MQ_REQ_NOWAIT);
> >   	if (IS_ERR(req))
> > -		return;
> > +		return PTR_ERR(req);
> >   	rq = scsi_req(req);
> >
> >   	rq->cmd[0] = ALLOW_MEDIUM_REMOVAL;
> > @@ -2016,6 +2016,7 @@ static void scsi_eh_lock_door(struct scsi_device *sdev)
> >   	rq->retries = 5;
> >
> >   	blk_execute_rq_nowait(req->q, NULL, req, 1, eh_lock_door_done);
> > +	return 0;
> >   }
> >
> >   /**
> > @@ -2037,8 +2038,8 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
> >   	 * is no point trying to lock the door of an off-line device.
> >   	 */
> >   	shost_for_each_device(sdev, shost) {
> > -		if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
> > -			scsi_eh_lock_door(sdev);
> > +		if (scsi_device_online(sdev) && sdev->was_reset &&
> > +		    sdev->locked && scsi_eh_lock_door(sdev) == 0) {
> >   			sdev->was_reset = 0;
> >   		}
> >   	}
> >
> I probably didn't make myself clear.
> As per SBC (in this case, sbc3r36) the effects of
> PREVENT_ALLOW_MEDIUM_REMOVAL are being reset by a successful LUN Reset,
> Hard Reset, Power/On Reset, or an I_T Nexus loss. Which incidentally maps
> nicely onto SCSI EH, so after a successful SCSI EH the door will be unlocked
> (which is why we need to call scsi_eh_lock_door()).
> In the SCSI midlayer this state is being reflected by the 'locked' flag.
> Now, if scsi_eh_lock_door() is _not_ being executed due to a
> blk_get_request() failure, the device remains unlocked, and as such the
> 'locked' flag would need to be _unset_.
>
> So I was thinking more along these lines:
>
> @@ -2030,7 +2037,8 @@ static void scsi_restart_operations(struct Scsi_Host *shost)
>          */
>         shost_for_each_device(sdev, shost) {
>                 if (scsi_device_online(sdev) && sdev->was_reset && sdev->locked) {
> -                       scsi_eh_lock_door(sdev);
> +                       if (scsi_eh_lock_door(sdev) < 0)
> +                               sdev->locked = 0;

BTW, scsi_eh_lock_door() returns void, and it can't be made synchronous
because there may not be any driver tag available. Even if a tag is
available, the host state isn't RUNNING yet, so the command can't be
queued to the LLD yet.

Maybe the above lines should be moved to after the host state is updated
to RUNNING.

Also, changing to NOWAIT can't avoid the issue completely: what if 'none'
is used?


Thanks,
Ming
Bart Van Assche Dec. 4, 2020, 4:50 p.m. UTC | #5
On 12/2/20 11:27 PM, Ming Lei wrote:
> BTW, scsi_eh_lock_door() returns void, and it can't be sync because
> there may not be any driver tag available. Even though it is available,
> the host state isn't running yet, so the command can't be queued to LLD
> yet.
>
> Maybe the above lines should be put after host state is updated to
> RUNNING.
>
> Also changing to NOWAIT can't avoid the issue completely, what if 'none'
> is used?

Hi Ming,

I am considering dropping this patch since the latest version of the SPI
DV patch no longer introduces a new blk_mq_freeze_queue() call in the
SPI DV code. In other words, any potential issues with
scsi_eh_lock_door() are existing issues and are not made worse by my
patch series.

Thanks,

Bart.

Patch

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index d94449188270..6de6e1bf3dcb 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -1993,7 +1993,12 @@  static void scsi_eh_lock_door(struct scsi_device *sdev)
 	struct request *req;
 	struct scsi_request *rq;
 
-	req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN, 0);
+	/*
+	 * It is not guaranteed that a request is available nor that
+	 * sdev->request_queue is unfrozen. Hence the BLK_MQ_REQ_NOWAIT below.
+	 */
+	req = blk_get_request(sdev->request_queue, REQ_OP_SCSI_IN,
+			      BLK_MQ_REQ_NOWAIT);
 	if (IS_ERR(req))
 		return;
 	rq = scsi_req(req);