[07/19] scsi: sd: Do not issue commands to suspended disks on remove

Message ID 20230911040217.253905-8-dlemoal@kernel.org
State New
Series [01/19] ata: libata-core: Fix ata_port_request_pm() locking

Commit Message

Damien Le Moal Sept. 11, 2023, 4:02 a.m. UTC
If an error occurs when resuming a host adapter before the devices
attached to the adapter are resumed, the adapter low level driver may
remove the scsi host, resulting in a call to sd_remove() for the
disks of the host. However, since this function calls sd_shutdown(),
a synchronize cache command and a start stop unit may be issued with the
drive still sleeping and the HBA non-functional. This causes PM resume
to hang, forcing a reset of the machine to recover.

Fix this by checking the state of the device's host in sd_shutdown() and
returning early, without issuing any command, if the host state is not
SHOST_RUNNING.

Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 drivers/scsi/sd.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Bart Van Assche Sept. 13, 2023, 8:50 p.m. UTC | #1
On 9/10/23 21:02, Damien Le Moal wrote:
> If an error occurs when resuming a host adapter before the devices
> attached to the adapter are resumed, the adapter low level driver may
> remove the scsi host, resulting in a call to sd_remove() for the
> disks of the host. However, since this function calls sd_shutdown(),
> a synchronize cache command and a start stop unit may be issued with the
> drive still sleeping and the HBA non-functional. This causes PM resume
> to hang, forcing a reset of the machine to recover.
> 
> Fix this by checking the state of the device's host in sd_shutdown() and
> returning early, without issuing any command, if the host state is not
> SHOST_RUNNING.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
> ---
>   drivers/scsi/sd.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index c92a317ba547..a415abb721d3 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
>   	if (!sdkp)
>   		return;         /* this can happen */
>   
> -	if (pm_runtime_suspended(dev))
> +	if (pm_runtime_suspended(dev) ||
> +	    sdkp->device->host->shost_state != SHOST_RUNNING)
>   		return;
>   
>   	if (sdkp->WCE && sdkp->media_present) {

Why test the host state instead of dev->power.runtime_status? I don't
think that it is safe to skip shutdown if the error handler is active.
If the error handler can recover the device, a SYNCHRONIZE CACHE command
should be submitted.

Thanks,

Bart.
Damien Le Moal Sept. 14, 2023, 12:29 a.m. UTC | #2
On 9/14/23 05:50, Bart Van Assche wrote:
> On 9/10/23 21:02, Damien Le Moal wrote:
>> If an error occurs when resuming a host adapter before the devices
>> attached to the adapter are resumed, the adapter low level driver may
>> remove the scsi host, resulting in a call to sd_remove() for the
>> disks of the host. However, since this function calls sd_shutdown(),
>> a synchronize cache command and a start stop unit may be issued with the
>> drive still sleeping and the HBA non-functional. This causes PM resume
>> to hang, forcing a reset of the machine to recover.
>>
>> Fix this by checking the state of the device's host in sd_shutdown() and
>> returning early, without issuing any command, if the host state is not
>> SHOST_RUNNING.
>>
>> Cc: stable@vger.kernel.org
>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>> ---
>>   drivers/scsi/sd.c | 3 ++-
>>   1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>> index c92a317ba547..a415abb721d3 100644
>> --- a/drivers/scsi/sd.c
>> +++ b/drivers/scsi/sd.c
>> @@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
>>   	if (!sdkp)
>>   		return;         /* this can happen */
>>   
>> -	if (pm_runtime_suspended(dev))
>> +	if (pm_runtime_suspended(dev) ||
>> +	    sdkp->device->host->shost_state != SHOST_RUNNING)
>>   		return;
>>   
>>   	if (sdkp->WCE && sdkp->media_present) {
> 
> Why test the host state instead of dev->power.runtime_status? I don't
> think that it is safe to skip shutdown if the error handler is active.
> If the error handler can recover the device, a SYNCHRONIZE CACHE command
> should be submitted.

But there is no synchronization with EH that I can see anyway. At least for
sd_remove(), I would assume that this is called only once the device references
were all dropped, so presumably EH is not doing anything with the drive when
that happens, no?

In any case, looking at dev->power.runtime_status is not correct as this is set
to RPM_ACTIVE when the device is suspended through system suspend. We could
replace the test "sdkp->device->host->shost_state != SHOST_RUNNING" with
"dev->power.is_suspended", as that is true (1) for a suspended device.
However, I really do not like that as that is a PM internal field and we should
not be accessing it directly. The PM code comments say as much. Any better idea?
Bart Van Assche Sept. 14, 2023, 2:39 p.m. UTC | #3
On 9/13/23 17:29, Damien Le Moal wrote:
> On 9/14/23 05:50, Bart Van Assche wrote:
>> On 9/10/23 21:02, Damien Le Moal wrote:
>>> If an error occurs when resuming a host adapter before the devices
>>> attached to the adapter are resumed, the adapter low level driver may
>>> remove the scsi host, resulting in a call to sd_remove() for the
>>> disks of the host. However, since this function calls sd_shutdown(),
>>> a synchronize cache command and a start stop unit may be issued with the
>>> drive still sleeping and the HBA non-functional. This causes PM resume
>>> to hang, forcing a reset of the machine to recover.
>>>
>>> Fix this by checking the state of the device's host in sd_shutdown() and
>>> returning early, without issuing any command, if the host state is not
>>> SHOST_RUNNING.
>>>
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
>>> ---
>>>    drivers/scsi/sd.c | 3 ++-
>>>    1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
>>> index c92a317ba547..a415abb721d3 100644
>>> --- a/drivers/scsi/sd.c
>>> +++ b/drivers/scsi/sd.c
>>> @@ -3763,7 +3763,8 @@ static void sd_shutdown(struct device *dev)
>>>    	if (!sdkp)
>>>    		return;         /* this can happen */
>>>    
>>> -	if (pm_runtime_suspended(dev))
>>> +	if (pm_runtime_suspended(dev) ||
>>> +	    sdkp->device->host->shost_state != SHOST_RUNNING)
>>>    		return;
>>>    
>>>    	if (sdkp->WCE && sdkp->media_present) {
>>
>> Why test the host state instead of dev->power.runtime_status? I don't
>> think that it is safe to skip shutdown if the error handler is active.
>> If the error handler can recover the device, a SYNCHRONIZE CACHE command
>> should be submitted.
> 
> But there is no synchronization with EH that I can see anyway. At least for
> sd_remove(), I would assume that this is called only once the device references
> were all dropped, so presumably EH is not doing anything with the drive when
> that happens, no?
> 
> In any case, looking at dev->power.runtime_status is not correct as this is set
> to RPM_ACTIVE when the device is suspended through system suspend. We could
> replace the test "sdkp->device->host->shost_state != SHOST_RUNNING" with
> "dev->power.is_suspended", as that is true (1) for a suspended device.
> However, I really do not like that as that is a PM internal field and we should
> not be accessing it directly. The PM code comments say as much. Any better idea?

I will reply to the above question on v2 of this patch.

Bart.

Patch

diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index c92a317ba547..a415abb721d3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3763,7 +3763,8 @@  static void sd_shutdown(struct device *dev)
 	if (!sdkp)
 		return;         /* this can happen */
 
-	if (pm_runtime_suspended(dev))
+	if (pm_runtime_suspended(dev) ||
+	    sdkp->device->host->shost_state != SHOST_RUNNING)
 		return;
 
 	if (sdkp->WCE && sdkp->media_present) {
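
For readers who want to reason about the guard outside the kernel, the
following is a minimal userspace sketch of the patched early-return logic.
Everything prefixed with mock_ or MOCK_ is a hypothetical stand-in invented
for illustration; none of it is the real sd.c code or a kernel API. The
sketch only models the SYNCHRONIZE CACHE path and leaves out the START STOP
UNIT handling.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the SCSI host states; not the kernel enum. */
enum mock_shost_state {
	MOCK_SHOST_RUNNING,
	MOCK_SHOST_CANCEL,
	MOCK_SHOST_DEL,
};

/* Hypothetical, heavily simplified model of the state sd_shutdown() reads. */
struct mock_disk {
	bool runtime_suspended;           /* models pm_runtime_suspended(dev) */
	enum mock_shost_state host_state; /* models sdkp->device->host->shost_state */
	bool wce;                         /* models sdkp->WCE */
	bool media_present;               /* models sdkp->media_present */
	int commands_issued;              /* counts commands "sent" to the drive */
};

/*
 * Mirrors the condition added by the patch: do nothing if the device is
 * runtime suspended or if its host is in any state other than running.
 */
static bool mock_shutdown_should_skip(const struct mock_disk *d)
{
	return d->runtime_suspended || d->host_state != MOCK_SHOST_RUNNING;
}

/* Models only the cache flush part of the shutdown path. */
static void mock_sd_shutdown(struct mock_disk *d)
{
	if (mock_shutdown_should_skip(d))
		return;

	if (d->wce && d->media_present)
		d->commands_issued++; /* stands in for SYNCHRONIZE CACHE */
}
```

In this model, a disk whose host is in MOCK_SHOST_RUNNING with the write
cache enabled and media present gets one command. Once the host state
changes to anything else, further calls leave the counter untouched. That
is the behavior the patch relies on to avoid the resume hang, and also the
behavior questioned in comment #1.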