Message ID | 20221018202958.1902564-1-bvanassche@acm.org |
---|---|
Headers | show |
Series | Fix a deadlock in the UFS driver | expand |
On 10/19/22 12:57, Mike Christie wrote: > On 10/18/22 3:29 PM, Bart Van Assche wrote: >> Open-code scsi_execute() because a later patch will modify scmd->flags >> and because scsi_execute() does not support setting scmd->flags. No >> functionality is changed. >> >> Signed-off-by: Bart Van Assche <bvanassche@acm.org> >> --- >> drivers/ufs/core/ufshcd.c | 39 ++++++++++++++++++++++++++++++++++----- >> 1 file changed, 34 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c >> index 2a32bcc93d2e..c5ccc7ba583b 100644 >> --- a/drivers/ufs/core/ufshcd.c >> +++ b/drivers/ufs/core/ufshcd.c >> @@ -8729,6 +8729,39 @@ static void ufshcd_hba_exit(struct ufs_hba *hba) >> } >> } >> >> +static int ufshcd_execute_start_stop(struct scsi_device *sdev, >> + enum ufs_dev_pwr_mode pwr_mode, >> + struct scsi_sense_hdr *sshdr) >> +{ >> + unsigned char cdb[6] = { START_STOP, 0, 0, 0, pwr_mode << 4, 0 }; >> + struct request *req; >> + struct scsi_cmnd *scmd; >> + int ret; >> + >> + req = scsi_alloc_request(sdev->request_queue, REQ_OP_DRV_IN, >> + BLK_MQ_REQ_PM); >> > > Can you hit a case where we have run out of tags (__blk_mq_alloc_requests > is hitting the blk_mq_get_tag == BLK_MQ_NO_TAG check), the host has gone > into recovery and so commands are completing to add a tag back and then we > try to call this and get stuck waiting on a tag? Or for passthrough do we > have some special reserve? > > If so do you need to use BLK_MQ_REQ_NOWAIT here? Maybe do the retry loop > yourself like: > > retry: > if host is in recovery > return failure > > req = scsi_alloc_request(.... BLK_MQ_REQ_NOWAIT) > if (!req and we have not hit some retry limit) > goto retry > > > or have some special reserve command/tag. Hi Mike, No other SCSI commands should be in progress when ufshcd_execute_start_stop() is called because that function is only called during system suspend and resume and no other I/O should be in progress at that time. Additionally, there is a mutual exclusion mechanism in the UFS driver to serialize system suspend/resume activity and error handling. Thanks, Bart.
On 10/19/22 4:13 PM, Bart Van Assche wrote: > On 10/19/22 12:57, Mike Christie wrote: >> On 10/18/22 3:29 PM, Bart Van Assche wrote: >>> Open-code scsi_execute() because a later patch will modify scmd->flags >>> and because scsi_execute() does not support setting scmd->flags. No >>> functionality is changed. >>> >>> Signed-off-by: Bart Van Assche <bvanassche@acm.org> >>> --- >>> drivers/ufs/core/ufshcd.c | 39 ++++++++++++++++++++++++++++++++++----- >>> 1 file changed, 34 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c >>> index 2a32bcc93d2e..c5ccc7ba583b 100644 >>> --- a/drivers/ufs/core/ufshcd.c >>> +++ b/drivers/ufs/core/ufshcd.c >>> @@ -8729,6 +8729,39 @@ static void ufshcd_hba_exit(struct ufs_hba *hba) >>> } >>> } >>> +static int ufshcd_execute_start_stop(struct scsi_device *sdev, >>> + enum ufs_dev_pwr_mode pwr_mode, >>> + struct scsi_sense_hdr *sshdr) >>> +{ >>> + unsigned char cdb[6] = { START_STOP, 0, 0, 0, pwr_mode << 4, 0 }; >>> + struct request *req; >>> + struct scsi_cmnd *scmd; >>> + int ret; >>> + >>> + req = scsi_alloc_request(sdev->request_queue, REQ_OP_DRV_IN, >>> + BLK_MQ_REQ_PM); >>> >> >> Can you hit a case where we have run out of tags (__blk_mq_alloc_requests >> is hitting the blk_mq_get_tag == BLK_MQ_NO_TAG check), the host has gone >> into recovery and so commands are completing to add a tag back and then we >> try to call this and get stuck waiting on a tag? Or for passthrough do we >> have some special reserve? >> >> If so do you need to use BLK_MQ_REQ_NOWAIT here? Maybe do the retry loop >> yourself like: >> >> retry: >> if host is in recovery >> return failure >> >> req = scsi_alloc_request(.... BLK_MQ_REQ_NOWAIT) >> if (!req and we have not hit some retry limit) >> goto retry >> >> >> or have some special reserve command/tag. > > Hi Mike, > > No other SCSI commands should be in progress when ufshcd_execute_start_stop() is called because that function is only called during system suspend and resume and no other I/O should be in progress at that time. Additionally, there is a mutual exclusion mechanism in the UFS driver to serialize system suspend/resume activity and error handling. Looks ok to me then. Reviewed-by: Mike Christie <michael.christie@oracle.com>
Bart, > This patch series fixes a deadlock in the UFS driver between the > suspend/resume code and the SCSI error handler. Please consider this > patch series for the next merge window. Applied to 6.2/scsi-staging, thanks!
Martin Hi, > Bart, > > > This patch series fixes a deadlock in the UFS driver between the > > suspend/resume code and the SCSI error handler. Please consider this > > patch series for the next merge window. > > Applied to 6.2/scsi-staging, thanks! Patch #6: dcd5b7637c6d (scsi: ufs: Reduce the START STOP UNIT timeout) wasn't acked by any device manufacturer. This timeout is now reduced from 10 seconds to 3 seconds, after it was reduced from 3x60 on August. Please allow few more days to verify it internally. Thanks, Avri > > -- > Martin K. Petersen Oracle Linux Engineering
On 10/22/22 23:16, Avri Altman wrote: > Patch #6: dcd5b7637c6d (scsi: ufs: Reduce the START STOP UNIT > timeout) wasn't acked by any device manufacturer. This timeout is now > reduced from 10 seconds to 3 seconds, after it was reduced from 3x60 > on August. Please allow few more days to verify it internally. Thanks Avri for the additional testing. If the timeout value would have to be modified I think that can be done easily with a follow-up patch. Bart.
On Tue, 18 Oct 2022 13:29:48 -0700, Bart Van Assche wrote: > This patch series fixes a deadlock in the UFS driver between the suspend/resume > code and the SCSI error handler. Please consider this patch series for the next > merge window. > > Thanks, > > Bart. > > [...] Applied to 6.2/scsi-queue, thanks! [01/10] scsi: core: Fix a race between scsi_done() and scsi_timeout() https://git.kernel.org/mkp/scsi/c/978b7922d3dc [02/10] scsi: core: Change the return type of .eh_timed_out() https://git.kernel.org/mkp/scsi/c/dee7121e8c0a [03/10] scsi: core: Support failing requests while recovering https://git.kernel.org/mkp/scsi/c/310bcaef6d7e [04/10] scsi: ufs: Remove an outdated comment https://git.kernel.org/mkp/scsi/c/1626c7bba1c4 [05/10] scsi: ufs: Use 'else' in ufshcd_set_dev_pwr_mode() https://git.kernel.org/mkp/scsi/c/836d322d73cb [06/10] scsi: ufs: Reduce the START STOP UNIT timeout https://git.kernel.org/mkp/scsi/c/dcd5b7637c6d [07/10] scsi: ufs: Try harder to change the power mode https://git.kernel.org/mkp/scsi/c/579a4e9dbd53 [08/10] scsi: ufs: Track system suspend / resume activity https://git.kernel.org/mkp/scsi/c/1a547cbc6fdd [09/10] scsi: ufs: Introduce the function ufshcd_execute_start_stop() https://git.kernel.org/mkp/scsi/c/6a354a7e740e [10/10] scsi: ufs: Fix a deadlock between PM and the SCSI error handler https://git.kernel.org/mkp/scsi/c/7029e2151a7c