Message ID | 20220915115858.7642-1-peter.wang@mediatek.com |
---|---|
State | New |
Headers | show |
Series | [v1] ufs: core: bypass get rpm when err handling with pm_op_in_progress | expand |
On 9/15/22 04:58, peter.wang@mediatek.com wrote: > -static void ufshcd_err_handling_prepare(struct ufs_hba *hba) > +static void ufshcd_err_handling_prepare(struct ufs_hba *hba, bool *rpm_put) > { > - ufshcd_rpm_get_sync(hba); > + if (!hba->pm_op_in_progress) { > + ufshcd_rpm_get_sync(hba); > + *rpm_put = true; > + } > + Hi Peter, I don't think that this patch is sufficient. If ufshcd_err_handling_prepare() is called by the host reset handler (ufshcd_eh_host_reset_handler()) then the host state will be SHOST_RECOVERY. In that state SCSI command submission will hang and hence any ufshcd_rpm_get_sync() call will hang. How about removing the ufshcd_rpm_get_sync() call from ufshcd_err_handling_prepare() and the ufshcd_rpm_put() call from ufshcd_err_handling_unprepare()? It is guaranteed that no commands are in progress for a runtime suspended LUN so the code for aborting pending requests in the UFS error handler will be skipped anyway if it is invoked for a runtime suspended device. Thanks, Bart.
On 9/17/22 05:39, Bart Van Assche wrote: > On 9/15/22 04:58, peter.wang@mediatek.com wrote: >> -static void ufshcd_err_handling_prepare(struct ufs_hba *hba) >> +static void ufshcd_err_handling_prepare(struct ufs_hba *hba, bool >> *rpm_put) >> { >> - ufshcd_rpm_get_sync(hba); >> + if (!hba->pm_op_in_progress) { >> + ufshcd_rpm_get_sync(hba); >> + *rpm_put = true; >> + } >> + > > Hi Peter, > > I don't think that this patch is sufficient. If > ufshcd_err_handling_prepare() is called by the host reset handler > (ufshcd_eh_host_reset_handler()) then the host state will be > SHOST_RECOVERY. In that state SCSI command submission will hang and > hence any ufshcd_rpm_get_sync() call will hang. > > How about removing the ufshcd_rpm_get_sync() call from > ufshcd_err_handling_prepare() and the ufshcd_rpm_put() call from > ufshcd_err_handling_unprepare()? It is guaranteed that no commands are > in progress for a runtime suspended LUN so the code for aborting > pending requests in the UFS error handler will be skipped anyway if it > is invoked for a runtime suspended device. > > Thanks, > > Bart. Hi Bart, If the scsi error happened and need do ufshcd_eh_host_reset_handler, the rpm state should in RPM_ACTIVE. Because scsi need wakeup suspended LUN, and send command to LUN then get error, right? So, ufshcd_rpm_get_sync should not hang in this case. If remove ufshcd_rpm_get_sync directly, think about this case. ufshcd_err_handler is on going and try to abort some task (which may get stuck and timeout too). Then rpm count down try to suspend. Finally runtime suspend callback may return IO error and IO hang. Thanks. Peter
On 9/19/22 07:47, Peter Wang wrote: > If the scsi error happened and need do ufshcd_eh_host_reset_handler, the > rpm state should in RPM_ACTIVE. > Because scsi need wakeup suspended LUN, and send command to LUN then get > error, right? The following sequence may activate the SCSI error handler while the RPM state is RPM_RESUMING: * The RPM state is RPM_SUSPENDED. * The RPM state is changed into RPM_RESUMING and ufshcd_wl_resume() is called. * ufshcd_set_dev_pwr_mode() calls scsi_execute() and the START STOP UNIT command times out. * Because of this timeout the SCSI error handler is activated. > If remove ufshcd_rpm_get_sync directly, think about this case. > ufshcd_err_handler is on going and try to abort some task (which may get > stuck and timeout too). > Then rpm count down try to suspend. Finally runtime suspend callback may > return IO error and IO hang. Hmm ... suspending a UFS device involves calling ufshcd_wl_shutdown(), ufshcd_set_dev_pwr_mode() and scsi_execute(). scsi_execute() is serialized against the UFS error handler because the latter calls ufshcd_scsi_block_requests(). Thanks, Bart.
On 9/20/22 00:25, Bart Van Assche wrote: > On 9/19/22 07:47, Peter Wang wrote: >> If the scsi error happened and need do ufshcd_eh_host_reset_handler, >> the rpm state should in RPM_ACTIVE. >> Because scsi need wakeup suspended LUN, and send command to LUN then >> get error, right? > > The following sequence may activate the SCSI error handler while the > RPM state is RPM_RESUMING: > * The RPM state is RPM_SUSPENDED. > * The RPM state is changed into RPM_RESUMING and ufshcd_wl_resume() is > called. > * ufshcd_set_dev_pwr_mode() calls scsi_execute() and the START STOP > UNIT command times out. > * Because of this timeout the SCSI error handler is activated. This case will not get rpm, because pm_op_in_progress is true. So it won't hang with ufshcd_rpm_get_sync. Thanks Peter
On 9/19/22 19:00, Peter Wang wrote: > > On 9/20/22 00:25, Bart Van Assche wrote: >> On 9/19/22 07:47, Peter Wang wrote: >>> If the scsi error happened and need do ufshcd_eh_host_reset_handler, >>> the rpm state should in RPM_ACTIVE. >>> Because scsi need wakeup suspended LUN, and send command to LUN then >>> get error, right? >> >> The following sequence may activate the SCSI error handler while the >> RPM state is RPM_RESUMING: >> * The RPM state is RPM_SUSPENDED. >> * The RPM state is changed into RPM_RESUMING and ufshcd_wl_resume() is >> called. >> * ufshcd_set_dev_pwr_mode() calls scsi_execute() and the START STOP >> UNIT command times out. >> * Because of this timeout the SCSI error handler is activated. > > This case will not get rpm, because pm_op_in_progress is true. > > So it won't hang with ufshcd_rpm_get_sync. Right, but I think the following scenario will result in a hang: * The RPM state is changed from RPM_SUSPENDED into RPM_RESUMING and ufshcd_wl_resume() has not yet been called. * ufshcd_eh_host_reset_handler() queues ufshcd_err_handler() and the latter function calls ufshcd_rpm_get_sync(). * This results in a deadlock: the scsi_execute() call by ufshcd_wl_resume() cannot make progress because the SCSI host state is SHOST_RECOVERY and the error handler cannot make progress because it keeps waiting until ufshcd_rpm_get_sync() has finished. Thanks, Bart.
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c index a202d7d5240d..cc58fb585df2 100644 --- a/drivers/ufs/core/ufshcd.c +++ b/drivers/ufs/core/ufshcd.c @@ -6086,9 +6086,13 @@ static void ufshcd_clk_scaling_suspend(struct ufs_hba *hba, bool suspend) } } -static void ufshcd_err_handling_prepare(struct ufs_hba *hba) +static void ufshcd_err_handling_prepare(struct ufs_hba *hba, bool *rpm_put) { - ufshcd_rpm_get_sync(hba); + if (!hba->pm_op_in_progress) { + ufshcd_rpm_get_sync(hba); + *rpm_put = true; + } + if (pm_runtime_status_suspended(&hba->ufs_device_wlun->sdev_gendev) || hba->is_sys_suspended) { enum ufs_pm_op pm_op; @@ -6122,13 +6126,14 @@ static void ufshcd_err_handling_prepare(struct ufs_hba *hba) cancel_work_sync(&hba->eeh_work); } -static void ufshcd_err_handling_unprepare(struct ufs_hba *hba) +static void ufshcd_err_handling_unprepare(struct ufs_hba *hba, bool rpm_put) { ufshcd_scsi_unblock_requests(hba); ufshcd_release(hba); if (ufshcd_is_clkscaling_supported(hba)) ufshcd_clk_scaling_suspend(hba, false); - ufshcd_rpm_put(hba); + if (rpm_put) + ufshcd_rpm_put(hba); } static inline bool ufshcd_err_handling_should_stop(struct ufs_hba *hba) @@ -6210,6 +6215,7 @@ static void ufshcd_err_handler(struct work_struct *work) bool err_tm; int pmc_err; int tag; + bool rpm_put = false; hba = container_of(work, struct ufs_hba, eh_work); @@ -6231,7 +6237,7 @@ static void ufshcd_err_handler(struct work_struct *work) } ufshcd_set_eh_in_progress(hba); spin_unlock_irqrestore(hba->host->host_lock, flags); - ufshcd_err_handling_prepare(hba); + ufshcd_err_handling_prepare(hba, &rpm_put); /* Complete requests that have door-bell cleared by h/w */ ufshcd_complete_requests(hba); spin_lock_irqsave(hba->host->host_lock, flags); @@ -6394,7 +6400,7 @@ static void ufshcd_err_handler(struct work_struct *work) } ufshcd_clear_eh_in_progress(hba); spin_unlock_irqrestore(hba->host->host_lock, flags); - ufshcd_err_handling_unprepare(hba); + ufshcd_err_handling_unprepare(hba, rpm_put); up(&hba->host_sem); dev_info(hba->dev, "%s finished; HBA state %s\n", __func__,