[v1] ufs: core: wlun suspend SSU fail recovery

Message ID	20221101142410.31463-1-peter.wang@mediatek.com
State	New
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: <peter.wang@mediatek.com> To: <stanley.chu@mediatek.com>, <linux-scsi@vger.kernel.org>, <martin.petersen@oracle.com>, <avri.altman@wdc.com>, <alim.akhtar@samsung.com>, <jejb@linux.ibm.com> CC: <wsd_upstream@mediatek.com>, <linux-mediatek@lists.infradead.org>, <peter.wang@mediatek.com>, <chun-hung.wu@mediatek.com>, <alice.chao@mediatek.com>, <cc.chou@mediatek.com>, <chaotian.jing@mediatek.com>, <jiajie.hao@mediatek.com>, <powen.kao@mediatek.com>, <qilin.tan@mediatek.com>, <lin.gui@mediatek.com>, <tun-yu.yu@mediatek.com>, <eddie.huang@mediatek.com>, <naomi.chu@mediatek.com> Subject: [PATCH v1] ufs: core: wlun suspend SSU fail recovery Date: Tue, 1 Nov 2022 22:24:10 +0800 Message-ID: <20221101142410.31463-1-peter.wang@mediatek.com> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk
Series	[v1] ufs: core: wlun suspend SSU fail recovery \| expand [v1] ufs: core: wlun suspend SSU fail recovery

Peter Wang (王信友) Nov. 1, 2022, 2:24 p.m. UTC

From: Peter Wang <peter.wang@mediatek.com>

When SSU fail in wlun suspend flow, trigger error handlder and
return busy to break the suspend.
If not, wlun runtime pm status become error and the consumer will
stuck in runtime suspend status.

Signed-off-by: Peter Wang <peter.wang@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

Adrian Hunter Nov. 30, 2022, 10:17 a.m. UTC | #1

On 1/11/22 16:24, peter.wang@mediatek.com wrote:
> From: Peter Wang <peter.wang@mediatek.com>
> 
> When SSU fail in wlun suspend flow, trigger error handlder and

handlder -> handler

Why / how does SSU fail?

> return busy to break the suspend.
> If not, wlun runtime pm status become error and the consumer will
> stuck in runtime suspend status.
> 
> Signed-off-by: Peter Wang <peter.wang@mediatek.com>
> ---
>  drivers/ufs/core/ufshcd.c | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> index b1f59a5fe632..2f2d3d5d8684 100644
> --- a/drivers/ufs/core/ufshcd.c
> +++ b/drivers/ufs/core/ufshcd.c
> @@ -8970,6 +8970,7 @@ static int __ufshcd_wl_suspend(struct ufs_hba *hba, enum ufs_pm_op pm_op)
>  	enum ufs_pm_level pm_lvl;
>  	enum ufs_dev_pwr_mode req_dev_pwr_mode;
>  	enum uic_link_state req_link_state;
> +	unsigned long flags;
>  
>  	hba->pm_op_in_progress = true;
>  	if (pm_op != UFS_SHUTDOWN_PM) {
> @@ -9049,8 +9050,21 @@ static int __ufshcd_wl_suspend(struct ufs_hba *hba, enum ufs_pm_op pm_op)
>  
>  		if (!hba->dev_info.b_rpm_dev_flush_capable) {
>  			ret = ufshcd_set_dev_pwr_mode(hba, req_dev_pwr_mode);
> -			if (ret)
> +			if (ret) {
> +				/*
> +				 * If retrun err in suspend flow, IO will hang.

retrun -> return

> +				 * Trigger error handler and break suspend for
> +				 * error recovery.
> +				 */
> +				spin_lock_irqsave(hba->host->host_lock, flags);
> +				hba->force_reset = true;
> +				ufshcd_schedule_eh_work(hba);

__ufshcd_wl_suspend() is also used for shutdown and poweroff
where scheduling EH is not appropriate.

> +				spin_unlock_irqrestore(hba->host->host_lock,
> +					flags);
> +
> +				ret = -EBUSY;
>  				goto enable_scaling;
> +			}
>  		}
>  	}
>

Bart Van Assche Nov. 30, 2022, 5:49 p.m. UTC | #2

On 11/30/22 02:17, Adrian Hunter wrote:
> On 1/11/22 16:24, peter.wang@mediatek.com wrote:
>> From: Peter Wang <peter.wang@mediatek.com>
>>
>> When SSU fail in wlun suspend flow, trigger error handlder and
> 
> handlder -> handler
> 
> Why / how does SSU fail?

I'm not sure but the issue that Peter is trying to fix with this patch 
may already have been fixed by my patch series "Fix a deadlock in the 
UFS driver".

Thanks,

Bart.

Peter Wang (王信友) Dec. 1, 2022, 9:19 a.m. UTC | #3

On Wed, 2022-11-30 at 12:17 +0200, Adrian Hunter wrote:
> On 1/11/22 16:24, peter.wang@mediatek.com wrote:
> > From: Peter Wang <peter.wang@mediatek.com>
> > 
> > When SSU fail in wlun suspend flow, trigger error handlder and
> 
> handlder -> handler

will fix next versio, thanks.

> 
> Why / how does SSU fail?
> 

Unknow, it is seldom and we cannot catch fail the failure scenarios. We
suspect is maybe the device problem which response wrong sense code.
So, I think the driver is better to cover this error.
Here is our cmd histroy log with 3 times SSU fail, and retrun busy to
do error handler.

 26-c(1),  2576.085601525, 9396,18, rs, ret=-16, time_us=   41136,
pwr_mode=1, link_status=1
 25-r(1),  2576.085027371, 9396,
1,0x1b,t=53,db:0x       0,is:0x   80000,crypt:0,0,lba=         0,len=  
  -1,	31308
 24-r(1),  2576.084996910, 9396,
0,0x1b,t=53,db:0x       0,is:0x   80000,crypt:0,0,lba=         0,len=  
  -1,	0
 23-r(1),  2576.084919679, 9396,
1,0x1b,t=52,db:0x       0,is:0x   80000,crypt:0,0,lba=         0,len=  
  -1,	32846
 22-r(1),  2576.084887679, 9396,
0,0x1b,t=52,db:0x       0,is:0x   80000,crypt:0,0,lba=         0,len=  
  -1,	0
 21-r(1),  2576.084704295,    0,
1,0x1b,t=55,db:0x       0,is:0x   80000,crypt:0,0,lba=         0,len=  
  -1,	39777539
 20-r(1),  2576.044927679, 9396,
0,0x1b,t=55,db:0x       0,is:0x   80000,crypt:0,0,lba=         0,len=  
  -1,	0


> > return busy to break the suspend.
> > If not, wlun runtime pm status become error and the consumer will
> > stuck in runtime suspend status.
> > 
> > Signed-off-by: Peter Wang <peter.wang@mediatek.com>
> > ---
> >  drivers/ufs/core/ufshcd.c | 16 +++++++++++++++-
> >  1 file changed, 15 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
> > index b1f59a5fe632..2f2d3d5d8684 100644
> > --- a/drivers/ufs/core/ufshcd.c
> > +++ b/drivers/ufs/core/ufshcd.c
> > @@ -8970,6 +8970,7 @@ static int __ufshcd_wl_suspend(struct ufs_hba
> > *hba, enum ufs_pm_op pm_op)
> >  	enum ufs_pm_level pm_lvl;
> >  	enum ufs_dev_pwr_mode req_dev_pwr_mode;
> >  	enum uic_link_state req_link_state;
> > +	unsigned long flags;
> >  
> >  	hba->pm_op_in_progress = true;
> >  	if (pm_op != UFS_SHUTDOWN_PM) {
> > @@ -9049,8 +9050,21 @@ static int __ufshcd_wl_suspend(struct
> > ufs_hba *hba, enum ufs_pm_op pm_op)
> >  
> >  		if (!hba->dev_info.b_rpm_dev_flush_capable) {
> >  			ret = ufshcd_set_dev_pwr_mode(hba,
> > req_dev_pwr_mode);
> > -			if (ret)
> > +			if (ret) {
> > +				/*
> > +				 * If retrun err in suspend flow, IO
> > will hang.
> 
> retrun -> return

will fix next version, thanks

> 
> > +				 * Trigger error handler and break
> > suspend for
> > +				 * error recovery.
> > +				 */
> > +				spin_lock_irqsave(hba->host->host_lock, 
> > flags);
> > +				hba->force_reset = true;
> > +				ufshcd_schedule_eh_work(hba);
> 
> __ufshcd_wl_suspend() is also used for shutdown and poweroff
> where scheduling EH is not appropriate.

Will check pm_op and bypass UFS_SHUTDOWN_PM next version, thanks.

> 
> > +				spin_unlock_irqrestore(hba->host-
> > >host_lock,
> > +					flags);
> > +
> > +				ret = -EBUSY;
> >  				goto enable_scaling;
> > +			}
> >  		}
> >  	}
> >  
> 
>

Peter Wang (王信友) Dec. 1, 2022, 9:31 a.m. UTC | #4

On Wed, 2022-11-30 at 09:49 -0800, Bart Van Assche wrote:
> On 11/30/22 02:17, Adrian Hunter wrote:
> > On 1/11/22 16:24, peter.wang@mediatek.com wrote:
> > > From: Peter Wang <peter.wang@mediatek.com>
> > > 
> > > When SSU fail in wlun suspend flow, trigger error handlder and
> > 
> > handlder -> handler
> > 
> > Why / how does SSU fail?
> 
> I'm not sure but the issue that Peter is trying to fix with this
> patch 
> may already have been fixed by my patch series "Fix a deadlock in
> the 
> UFS driver".
> 
> Thanks,
> 
> Bart.
> 

Hi Bart,

Yes, it may also can fix by 
"[v4,07/10] scsi: ufs: Try harder to change the power mode"

But I am not sure if any outher corner case SSU will got antother err
which retry canoot fix? (ex, device always return same error sense code
which host is not expect to receive again, again and again let retry
sill fail)

By the way, in wl suspend flow, not only SSU, but also enter
hibern8(0x17) could got fail. I think it is better fix in both fail
case.
Hera is our 0x17 timeout fail log and runtime suspend retrun timeout.

473-c(1), 21140.355276017,31742,18, rs, ret=-110, time_us= 2892755,
pwr_mode=2, link_status=3
472-u(1), 21139.829423708,31742, 7,0x17,arg1=0x0,arg2=0x0,arg3=0x0,	
0
471-r(1), 21139.829361861,    0,
1,0x1b,t=46,db:0x       0,is:0x   80000,crypt:0,0,lba=         0,len=  
  -1,	1738384
470-r(1), 21139.827623708,31742,
0,0x1b,t=46,db:0x       0,is:0x   80000,crypt:0,0,lba=         0,len=  
  -1,	0

Thanks.
Peter

Daniil Lunev Dec. 1, 2022, 11:32 p.m. UTC | #5

On Thu, Dec 1, 2022 at 4:50 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> On 11/30/22 02:17, Adrian Hunter wrote:
> > On 1/11/22 16:24, peter.wang@mediatek.com wrote:
> >> From: Peter Wang <peter.wang@mediatek.com>
> >>
> >> When SSU fail in wlun suspend flow, trigger error handlder and
> >
> > handlder -> handler
> >
> > Why / how does SSU fail?
>
> I'm not sure but the issue that Peter is trying to fix with this patch
> may already have been fixed by my patch series "Fix a deadlock in the
> UFS driver".

We observe a similar failure scenario when a device tries to change
power mode (e.g. due to autosuspend) while "purge" is in progress. It
appears that in this case the "pm_only" counter is increased on the
device queue prior to the attempt of entering a runtime suspended
state, but not reduced upon it failing due to the device being busy
with an ongoing purge. That leads to a deadlock of subsequent IO
operations to the device.

--Daniil

Peter Wang (王信友) Dec. 2, 2022, 6:40 a.m. UTC | #6

On Fri, 2022-12-02 at 10:32 +1100, Daniil Lunev wrote:
> On Thu, Dec 1, 2022 at 4:50 AM Bart Van Assche <bvanassche@acm.org>
> wrote:
> > 
> > On 11/30/22 02:17, Adrian Hunter wrote:
> > > On 1/11/22 16:24, peter.wang@mediatek.com wrote:
> > > > From: Peter Wang <peter.wang@mediatek.com>
> > > > 
> > > > When SSU fail in wlun suspend flow, trigger error handlder and
> > > 
> > > handlder -> handler
> > > 
> > > Why / how does SSU fail?
> > 
> > I'm not sure but the issue that Peter is trying to fix with this
> > patch
> > may already have been fixed by my patch series "Fix a deadlock in
> > the
> > UFS driver".
> 
> We observe a similar failure scenario when a device tries to change
> power mode (e.g. due to autosuspend) while "purge" is in progress. It
> appears that in this case the "pm_only" counter is increased on the
> device queue prior to the attempt of entering a runtime suspended
> state, but not reduced upon it failing due to the device being busy
> with an ongoing purge. That leads to a deadlock of subsequent IO
> operations to the device.
> 
> --Daniil

Hi Daniil,

May I have a question, which dievce you use in this failure scenario?  
I think it is same issue due to SSU fail, and wlun devcie pm enter
error state. So the cunsomer(scsi lu device) is block in suspend state
and connot resume to reduce pm_only lead to IO hang.

Thanks.
BR
Peter

Daniil Lunev Dec. 4, 2022, 8:02 p.m. UTC | #7

> May I have a question, which dievce you use in this failure scenario?
> I think it is same issue due to SSU fail, and wlun devcie pm enter
> error state. So the cunsomer(scsi lu device) is block in suspend state
> and connot resume to reduce pm_only lead to IO hang.
>
> Thanks.
> BR
> Peter

I don't think it is device specific. I am pretty sure it is the same
problem, for it traces to the same origin and your patch fixes the
issue (though with a lot of log spam which is a bit unfortunate, but
we can live with that)

--Daniil

Peter Wang (王信友) Dec. 5, 2022, 7:39 a.m. UTC | #8

Hi Daniil,

Thanks your feedback. 
I am glad to hear your problem can be solved by this patch.
Please also help have a review of my v3 patch

https://patchwork.kernel.org/project/linux-scsi/patch/20221202073532.7884-1-peter.wang@mediatek.com/

Thanks.
BR
Peter



On Mon, 2022-12-05 at 07:02 +1100, Daniil Lunev wrote:
> > May I have a question, which dievce you use in this failure
> > scenario?
> > I think it is same issue due to SSU fail, and wlun devcie pm enter
> > error state. So the cunsomer(scsi lu device) is block in suspend
> > state
> > and connot resume to reduce pm_only lead to IO hang.
> > 
> > Thanks.
> > BR
> > Peter
> 
> I don't think it is device specific. I am pretty sure it is the same
> problem, for it traces to the same origin and your patch fixes the
> issue (though with a lot of log spam which is a bit unfortunate, but
> we can live with that)
> 
> --Daniil

[v1] ufs: core: wlun suspend SSU fail recovery

Commit Message

Comments

Patch