blk-mq: avoid to iterate over stale request

Message ID	20210906065003.439019-1-ming.lei@redhat.com
State	New
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: Ming Lei <ming.lei@redhat.com> To: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org, Ming Lei <ming.lei@redhat.com>, linux-scsi@vger.kernel.org, "Martin K. Petersen" <martin.petersen@oracle.com>, luojiaxing <luojiaxing@huawei.com> Subject: [PATCH] blk-mq: avoid to iterate over stale request Date: Mon, 6 Sep 2021 14:50:03 +0800 Message-Id: <20210906065003.439019-1-ming.lei@redhat.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	blk-mq: avoid to iterate over stale request \| expand blk-mq: avoid to iterate over stale request

Ming Lei Sept. 6, 2021, 6:50 a.m. UTC

blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver
tag is released.

So there is chance to iterating over one stale request just after the
tag is allocated and before updating ->rqs[tag].

scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight requests after scsi host is blocked, so no new scsi command can
be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can
be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,
but this request may have been kept in another slot of ->rqs[], meantime
the slot can be allocated out but ->rqs[] isn't updated yet. Then this
in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes
trouble in handling scsi error.

Fixes the issue by not iterating over stale request.

Cc: linux-scsi@vger.kernel.org
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Reported-by: luojiaxing <luojiaxing@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
---
 block/blk-mq-tag.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Ming Lei Sept. 7, 2021, 1:14 a.m. UTC | #1

On Mon, Sep 06, 2021 at 03:44:08PM -0700, Bart Van Assche wrote:
> On 9/5/21 23:50, Ming Lei wrote:

> > -	if (!rq || !refcount_inc_not_zero(&rq->ref))

> > +	if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))

> >   		rq = NULL;

> 

> Shouldn't the rq->tag != bitnr test happen after the refcount has been

> incremented since otherwise rq->tag can change after it has been read and

> before the refcount is incremented?

rq->tag can change too after its refcount is grabbed. If the rq is released
during the iterating, either SCMD_STATE_INFLIGHT is cleared or
refcount_inc_not_zero() fails. So this way works.

The use case for scsi_host_queue_ready() and scsi EH handling is a bit
special. For others, either the iterating needn't to be exact, or queue
is frozen.

Thanks,
Ming

Ming Lei Sept. 13, 2021, 1:26 a.m. UTC | #2

On Mon, Sep 06, 2021 at 02:50:03PM +0800, Ming Lei wrote:
> blk-mq can't run allocating driver tag and updating ->rqs[tag]

> atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver

> tag is released.

> 

> So there is chance to iterating over one stale request just after the

> tag is allocated and before updating ->rqs[tag].

> 

> scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi

> in-flight requests after scsi host is blocked, so no new scsi command can

> be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can

> be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,

> but this request may have been kept in another slot of ->rqs[], meantime

> the slot can be allocated out but ->rqs[] isn't updated yet. Then this

> in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes

> trouble in handling scsi error.

> 

> Fixes the issue by not iterating over stale request.

> 

> Cc: linux-scsi@vger.kernel.org

> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>

> Reported-by: luojiaxing <luojiaxing@huawei.com>

> Signed-off-by: Ming Lei <ming.lei@redhat.com>


Hello Jens,

luojiaxiang has verified that this patch fixes his issue, any chance to
merge it?


Thanks,
Ming

Jens Axboe Sept. 13, 2021, 1:31 a.m. UTC | #3

On 9/12/21 7:26 PM, Ming Lei wrote:
> On Mon, Sep 06, 2021 at 02:50:03PM +0800, Ming Lei wrote:

>> blk-mq can't run allocating driver tag and updating ->rqs[tag]

>> atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver

>> tag is released.

>>

>> So there is chance to iterating over one stale request just after the

>> tag is allocated and before updating ->rqs[tag].

>>

>> scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi

>> in-flight requests after scsi host is blocked, so no new scsi command can

>> be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can

>> be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT,

>> but this request may have been kept in another slot of ->rqs[], meantime

>> the slot can be allocated out but ->rqs[] isn't updated yet. Then this

>> in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes

>> trouble in handling scsi error.

>>

>> Fixes the issue by not iterating over stale request.

>>

>> Cc: linux-scsi@vger.kernel.org

>> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>

>> Reported-by: luojiaxing <luojiaxing@huawei.com>

>> Signed-off-by: Ming Lei <ming.lei@redhat.com>

> 

> Hello Jens,

> 

> luojiaxiang has verified that this patch fixes his issue, any chance to

> merge it?


I'll queue it up for 5.15-rc2, thanks.

-- 
Jens Axboe

Kashyap Desai Dec. 14, 2021, 6:41 p.m. UTC | #4

+ John Garry

> blk-mq can't run allocating driver tag and updating ->rqs[tag]
atomically,
> meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is
released.
>
> So there is chance to iterating over one stale request just after the
tag is
> allocated and before updating ->rqs[tag].
>
> scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
in-flight
> requests after scsi host is blocked, so no new scsi command can be
marked as
> SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by
blk-
> mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request
> may have been kept in another slot of ->rqs[], meantime the slot can be
> allocated out but ->rqs[] isn't updated yet. Then this in-flight request
is
> counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in
handling
> scsi error.

Hi Ming,

We found similar issue on RHEL8.5 (kernel  does not have this patch in
discussion.). Issue reproduced on 5.15 kernel as well.
I understood this commit will fix specific race condition and avoid
reading incorrect host_busy value.
As per commit message - That incorrect host_busy will be just transient.
If we read after some delay, correct host_busy count will be available.
Right ?

In my case (I am using shared host tag enabled driver), it is not race
condition issue but stale rqs[] entries create permanent incorrect count
of host_busy.
Example - There are two pending IOs. This IOs are timed out. Bitmap of
pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually belongs
to hctx1).  Note  - This is a shared bit map.
If hctx0 has same address of the request at 5th and 10th index, we will
count total 2 inflight commands instead of 1 from hctx0 context + From
hctx1 context, we will count 1 inflight command = Total is 3.
Even though we read after some delay, host_busy will be incorrect. We
expect host_busy = 2 but it will return 3.

This patch fix my issue explained above for shared host-tag case.  I am
confused reading the commit message. You may not have intentionally fix
the issue as I explained but indirectly it fixes my issue. Am I correct ?

What was an issue reported by Luojiaxiang ? I am interested to know if
issue reported by Luojiaxiang had shared host tagset enabled ?

Kashyap

>
> Fixes the issue by not iterating over stale request.
>
> Cc: linux-scsi@vger.kernel.org
> Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> Reported-by: luojiaxing <luojiaxing@huawei.com>
> Signed-off-by: Ming Lei <ming.lei@redhat.com>
> ---
>  block/blk-mq-tag.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> 86f87346232a..ff5caeb82542 100644
> --- a/block/blk-mq-tag.c
> +++ b/block/blk-mq-tag.c
> @@ -208,7 +208,7 @@ static struct request
> *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
>
>  	spin_lock_irqsave(&tags->lock, flags);
>  	rq = tags->rqs[bitnr];
> -	if (!rq || !refcount_inc_not_zero(&rq->ref))
> +	if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
>  		rq = NULL;
>  	spin_unlock_irqrestore(&tags->lock, flags);
>  	return rq;
> --
> 2.31.1

Ming Lei Dec. 15, 2021, 3:36 a.m. UTC | #5

Hello Kashyap,

On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote:
> + John Garry
> 
> > blk-mq can't run allocating driver tag and updating ->rqs[tag]
> atomically,
> > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is
> released.
> >
> > So there is chance to iterating over one stale request just after the
> tag is
> > allocated and before updating ->rqs[tag].
> >
> > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi
> in-flight
> > requests after scsi host is blocked, so no new scsi command can be
> marked as
> > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by
> blk-
> > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request
> > may have been kept in another slot of ->rqs[], meantime the slot can be
> > allocated out but ->rqs[] isn't updated yet. Then this in-flight request
> is
> > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in
> handling
> > scsi error.
> 
> Hi Ming,
> 
> We found similar issue on RHEL8.5 (kernel  does not have this patch in
> discussion.). Issue reproduced on 5.15 kernel as well.
> I understood this commit will fix specific race condition and avoid
> reading incorrect host_busy value.
> As per commit message - That incorrect host_busy will be just transient.
> If we read after some delay, correct host_busy count will be available.
> Right ?

Yeah, any counter(include atomic counter) works in this way.

But here it may be 'permanent' because one stale request pointer may stay in
one slot of ->rqs[] for long enough time if this slot isn't reused, meantime
the same request can be reallocated in case of real io scheduler. Maybe the
commit log should be improved a bit for making it explicit.

> 
> In my case (I am using shared host tag enabled driver), it is not race
> condition issue but stale rqs[] entries create permanent incorrect count
> of host_busy.
> Example - There are two pending IOs. This IOs are timed out. Bitmap of
> pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually belongs
> to hctx1).  Note  - This is a shared bit map.
> If hctx0 has same address of the request at 5th and 10th index, we will

It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit
map, both tag#5 and tag#10 are set, and both shared_tags->rqs[5] &
shared_tags->rqs[10] should point to the updated requests(timed out).

> count total 2 inflight commands instead of 1 from hctx0 context + From
> hctx1 context, we will count 1 inflight command = Total is 3.
> Even though we read after some delay, host_busy will be incorrect. We
> expect host_busy = 2 but it will return 3.
> 
> This patch fix my issue explained above for shared host-tag case.  I am
> confused reading the commit message. You may not have intentionally fix
> the issue as I explained but indirectly it fixes my issue. Am I correct ?
> 
> What was an issue reported by Luojiaxiang ? I am interested to know if
> issue reported by Luojiaxiang had shared host tagset enabled ?

https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@huawei.com/

It should be same with yours, and the original report described & explained
the issue in details.

thanks,
Ming

> 
> Kashyap
> 
> >
> > Fixes the issue by not iterating over stale request.
> >
> > Cc: linux-scsi@vger.kernel.org
> > Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
> > Reported-by: luojiaxing <luojiaxing@huawei.com>
> > Signed-off-by: Ming Lei <ming.lei@redhat.com>
> > ---
> >  block/blk-mq-tag.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index
> > 86f87346232a..ff5caeb82542 100644
> > --- a/block/blk-mq-tag.c
> > +++ b/block/blk-mq-tag.c
> > @@ -208,7 +208,7 @@ static struct request
> > *blk_mq_find_and_get_req(struct blk_mq_tags *tags,
> >
> >  	spin_lock_irqsave(&tags->lock, flags);
> >  	rq = tags->rqs[bitnr];
> > -	if (!rq || !refcount_inc_not_zero(&rq->ref))
> > +	if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref))
> >  		rq = NULL;
> >  	spin_unlock_irqrestore(&tags->lock, flags);
> >  	return rq;
> > --
> > 2.31.1

Kashyap Desai Dec. 15, 2021, 7:30 a.m. UTC | #6

>
> Hello Kashyap,
>
> On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote:
> > + John Garry
> >
> > > blk-mq can't run allocating driver tag and updating ->rqs[tag]
> > atomically,
> > > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is
> > released.
> > >
> > > So there is chance to iterating over one stale request just after
> > > the
> > tag is
> > > allocated and before updating ->rqs[tag].
> > >
> > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count
> > > scsi
> > in-flight
> > > requests after scsi host is blocked, so no new scsi command can be
> > marked as
> > > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run
> > > by
> > blk-
> > > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this
> > > request may have been kept in another slot of ->rqs[], meantime the
> > > slot can be allocated out but ->rqs[] isn't updated yet. Then this
> > > in-flight request
> > is
> > > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in
> > handling
> > > scsi error.
> >
> > Hi Ming,
> >
> > We found similar issue on RHEL8.5 (kernel  does not have this patch in
> > discussion.). Issue reproduced on 5.15 kernel as well.
> > I understood this commit will fix specific race condition and avoid
> > reading incorrect host_busy value.
> > As per commit message - That incorrect host_busy will be just
transient.
> > If we read after some delay, correct host_busy count will be
available.
> > Right ?
>
> Yeah, any counter(include atomic counter) works in this way.
>
> But here it may be 'permanent' because one stale request pointer may
stay in
> one slot of ->rqs[] for long enough time if this slot isn't reused,
meantime the
> same request can be reallocated in case of real io scheduler. Maybe the
> commit log should be improved a bit for making it explicit.

Changing commit log description will help.

>
> >
> > In my case (I am using shared host tag enabled driver), it is not race
> > condition issue but stale rqs[] entries create permanent incorrect
> > count of host_busy.
> > Example - There are two pending IOs. This IOs are timed out. Bitmap of
> > pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually
> > belongs to hctx1).  Note  - This is a shared bit map.
> > If hctx0 has same address of the request at 5th and 10th index, we
> > will
>
> It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit
map, both
> tag#5 and tag#10 are set, and both shared_tags->rqs[5] & shared_tags-
> >rqs[10] should point to the updated requests(timed out).
Updated pointers will be there for actual hctx.  Below is possible and
that is what causing problem in original issue.

shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command)
shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command)  <-
This is incorrect. While looping on hctx0 tags[], bitmap = 10 this entry
is also found which is actually outstanding on hctx1.
shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command)


Issue noticed by me is the exact same issue described @ below -
https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@hu
awei.com/

Issue is only exposed to shared host tagset. I got the required
information. Thanks.

Kashyap
>
> > count total 2 inflight commands instead of 1 from hctx0 context + From
> > hctx1 context, we will count 1 inflight command = Total is 3.
> > Even though we read after some delay, host_busy will be incorrect. We
> > expect host_busy = 2 but it will return 3.
> >
> > This patch fix my issue explained above for shared host-tag case.  I
> > am confused reading the commit message. You may not have intentionally
> > fix the issue as I explained but indirectly it fixes my issue. Am I
correct ?
> >
> > What was an issue reported by Luojiaxiang ? I am interested to know if
> > issue reported by Luojiaxiang had shared host tagset enabled ?
>
> https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-
> 5c10539492cb@huawei.com/

I check this. It is same issue as what I am seeing on Broadcom controller
only if shared host tagset is enabled.

Ming Lei Dec. 15, 2021, 8:02 a.m. UTC | #7

On Wed, Dec 15, 2021 at 01:00:49PM +0530, Kashyap Desai wrote:
> >
> > Hello Kashyap,
> >
> > On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote:
> > > + John Garry
> > >
> > > > blk-mq can't run allocating driver tag and updating ->rqs[tag]
> > > atomically,
> > > > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is
> > > released.
> > > >
> > > > So there is chance to iterating over one stale request just after
> > > > the
> > > tag is
> > > > allocated and before updating ->rqs[tag].
> > > >
> > > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count
> > > > scsi
> > > in-flight
> > > > requests after scsi host is blocked, so no new scsi command can be
> > > marked as
> > > > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run
> > > > by
> > > blk-
> > > > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this
> > > > request may have been kept in another slot of ->rqs[], meantime the
> > > > slot can be allocated out but ->rqs[] isn't updated yet. Then this
> > > > in-flight request
> > > is
> > > > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in
> > > handling
> > > > scsi error.
> > >
> > > Hi Ming,
> > >
> > > We found similar issue on RHEL8.5 (kernel  does not have this patch in
> > > discussion.). Issue reproduced on 5.15 kernel as well.
> > > I understood this commit will fix specific race condition and avoid
> > > reading incorrect host_busy value.
> > > As per commit message - That incorrect host_busy will be just
> transient.
> > > If we read after some delay, correct host_busy count will be
> available.
> > > Right ?
> >
> > Yeah, any counter(include atomic counter) works in this way.
> >
> > But here it may be 'permanent' because one stale request pointer may
> stay in
> > one slot of ->rqs[] for long enough time if this slot isn't reused,
> meantime the
> > same request can be reallocated in case of real io scheduler. Maybe the
> > commit log should be improved a bit for making it explicit.
> 
> Changing commit log description will help.
> 
> >
> > >
> > > In my case (I am using shared host tag enabled driver), it is not race
> > > condition issue but stale rqs[] entries create permanent incorrect
> > > count of host_busy.
> > > Example - There are two pending IOs. This IOs are timed out. Bitmap of
> > > pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually
> > > belongs to hctx1).  Note  - This is a shared bit map.
> > > If hctx0 has same address of the request at 5th and 10th index, we
> > > will
> >
> > It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit
> map, both
> > tag#5 and tag#10 are set, and both shared_tags->rqs[5] & shared_tags-
> > >rqs[10] should point to the updated requests(timed out).
> Updated pointers will be there for actual hctx.  Below is possible and
> that is what causing problem in original issue.
> 
> shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command)
> shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command)  <-
> This is incorrect. While looping on hctx0 tags[], bitmap = 10 this entry
> is also found which is actually outstanding on hctx1.
> shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command)

Sorry, I am a bit confused, please look at the following code and
blk_mq_find_and_get_req()(<-bt_tags_iter).

void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
                busy_tag_iter_fn *fn, void *priv)
{
        unsigned int flags = tagset->flags;
        int i, nr_tags;

        nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues;

        for (i = 0; i < nr_tags; i++) {
                if (tagset->tags && tagset->tags[i])
                        __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
                                              BT_TAG_ITER_STARTED);
        }
}

In case of shared tags, only tagset->tags[0] is iterated over, so both
5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10]
shouldn't just point to the two latest requests? How can both 0xAA &
0xBB be retrieved from shared_tags->rqs[10]?

> Issue noticed by me is the exact same issue described @ below -
> https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@hu
> awei.com/
> 
> Issue is only exposed to shared host tagset. I got the required
> information. Thanks.
> 
> Kashyap
> >
> > > count total 2 inflight commands instead of 1 from hctx0 context + From
> > > hctx1 context, we will count 1 inflight command = Total is 3.
> > > Even though we read after some delay, host_busy will be incorrect. We
> > > expect host_busy = 2 but it will return 3.
> > >
> > > This patch fix my issue explained above for shared host-tag case.  I
> > > am confused reading the commit message. You may not have intentionally
> > > fix the issue as I explained but indirectly it fixes my issue. Am I
> correct ?
> > >
> > > What was an issue reported by Luojiaxiang ? I am interested to know if
> > > issue reported by Luojiaxiang had shared host tagset enabled ?
> >
> > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-
> > 5c10539492cb@huawei.com/
> 
> I check this. It is same issue as what I am seeing on Broadcom controller
> only if shared host tagset is enabled.

But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't
only for shared tags.



Thanks 
Ming

Kashyap Desai Dec. 15, 2021, 8:45 a.m. UTC | #8

> >
> > shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command)
> > shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command)
> > <- This is incorrect. While looping on hctx0 tags[], bitmap = 10 this
> > entry is also found which is actually outstanding on hctx1.
> > shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command)
>
> Sorry, I am a bit confused, please look at the following code and
> blk_mq_find_and_get_req()(<-bt_tags_iter).


My issue is without shared tags. Below patch is certainly a fix for shared
tags (for 5.16-rc).
I have seen scsi eh deadlock issue on 5.15 kernel which does not have
shared tags (but it has shared bitmap.)

>
> void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
>                 busy_tag_iter_fn *fn, void *priv) {
>         unsigned int flags = tagset->flags;
>         int i, nr_tags;
>
>         nr_tags = blk_mq_is_shared_tags(flags) ? 1 :
tagset->nr_hw_queues;
>
>         for (i = 0; i < nr_tags; i++) {
>                 if (tagset->tags && tagset->tags[i])
>                         __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
>                                               BT_TAG_ITER_STARTED);
>         }
> }
>
> In case of shared tags, only tagset->tags[0] is iterated over, so both
> 5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10]
> shouldn't just point to the two latest requests? How can both 0xAA &
0xBB be
> retrieved from shared_tags->rqs[10]?

Since I am not talking about shared tags, you can remap same condition
now.
In 5.15 kernel (shared bitmap is enabled but not shared tags) -
blk_mq_tagset_busy_iter is called for each tags[] of nr_hw_queues times.
It will call  bt_tags_for_each()  for hctx0.tags->[], hctx1.tags->[] ..
Since tags->bitmap_tags is shared when it traverse hctx0.tags->[], it can
find hctx0.tags[10].rq which is really stale entry.
If the same request which is stale @ hctx0.tags[10] is really outstanding
at some other tag#, stale entry will be counted in host_busy.

>
> > Issue noticed by me is the exact same issue described @ below -
> > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492c
> > b@hu
> > awei.com/
> >
> > Issue is only exposed to shared host tagset. I got the required
> > information. Thanks.
> >
> > Kashyap
> > >
> > > > count total 2 inflight commands instead of 1 from hctx0 context +
> > > > From
> > > > hctx1 context, we will count 1 inflight command = Total is 3.
> > > > Even though we read after some delay, host_busy will be incorrect.
> > > > We expect host_busy = 2 but it will return 3.
> > > >
> > > > This patch fix my issue explained above for shared host-tag case.
> > > > I am confused reading the commit message. You may not have
> > > > intentionally fix the issue as I explained but indirectly it fixes
> > > > my issue. Am I
> > correct ?
> > > >
> > > > What was an issue reported by Luojiaxiang ? I am interested to
> > > > know if issue reported by Luojiaxiang had shared host tagset
enabled ?
> > >
> > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-
> > > 5c10539492cb@huawei.com/
> >
> > I check this. It is same issue as what I am seeing on Broadcom
> > controller only if shared host tagset is enabled.
>
> But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't
only for
> shared tags.

I mean, shared bitmap in my whole discussion. Megaraid_sas driver use
shared bitmap, so it is exposed and It is confirmed from this discussion.
Do we still have exposure (if "blk-mq: avoid to iterate over stale
request" is not part of kernel)  to mpi3mr type driver which does not use
shared bitmap but has nr_hw_queues > 1. ?

Kashyap

>
>
>
> Thanks
> Ming

Ming Lei Dec. 15, 2021, 12:56 p.m. UTC | #9

On Wed, Dec 15, 2021 at 02:15:51PM +0530, Kashyap Desai wrote:
> > >
> > > shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command)
> > > shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command)
> > > <- This is incorrect. While looping on hctx0 tags[], bitmap = 10 this
> > > entry is also found which is actually outstanding on hctx1.
> > > shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command)
> >
> > Sorry, I am a bit confused, please look at the following code and
> > blk_mq_find_and_get_req()(<-bt_tags_iter).
> 
> 
> My issue is without shared tags. Below patch is certainly a fix for shared
> tags (for 5.16-rc).
> I have seen scsi eh deadlock issue on 5.15 kernel which does not have
> shared tags (but it has shared bitmap.)

OK, if you are talking about non-shared tags, your issue is exactly
addressed by 67f3b2f822b7 blk-mq: avoid to iterate over stale request

> 
> >
> > void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset,
> >                 busy_tag_iter_fn *fn, void *priv) {
> >         unsigned int flags = tagset->flags;
> >         int i, nr_tags;
> >
> >         nr_tags = blk_mq_is_shared_tags(flags) ? 1 :
> tagset->nr_hw_queues;
> >
> >         for (i = 0; i < nr_tags; i++) {
> >                 if (tagset->tags && tagset->tags[i])
> >                         __blk_mq_all_tag_iter(tagset->tags[i], fn, priv,
> >                                               BT_TAG_ITER_STARTED);
> >         }
> > }
> >
> > In case of shared tags, only tagset->tags[0] is iterated over, so both
> > 5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10]
> > shouldn't just point to the two latest requests? How can both 0xAA &
> 0xBB be
> > retrieved from shared_tags->rqs[10]?
> 
> Since I am not talking about shared tags, you can remap same condition
> now.
> In 5.15 kernel (shared bitmap is enabled but not shared tags) -

Shared bitmap and shared tags should have be same thing, that means all hw queues
share same tag space.

> blk_mq_tagset_busy_iter is called for each tags[] of nr_hw_queues times.
> It will call  bt_tags_for_each()  for hctx0.tags->[], hctx1.tags->[] ..
> Since tags->bitmap_tags is shared when it traverse hctx0.tags->[], it can

If ->birtmap_tags is shared, only one tags should be iterated over, so
the following commit was added:

0994c64eb415 blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags

> find hctx0.tags[10].rq which is really stale entry.
> If the same request which is stale @ hctx0.tags[10] is really outstanding
> at some other tag#, stale entry will be counted in host_busy.
> 
> >
> > > Issue noticed by me is the exact same issue described @ below -
> > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492c
> > > b@hu
> > > awei.com/
> > >
> > > Issue is only exposed to shared host tagset. I got the required
> > > information. Thanks.
> > >
> > > Kashyap
> > > >
> > > > > count total 2 inflight commands instead of 1 from hctx0 context +
> > > > > From
> > > > > hctx1 context, we will count 1 inflight command = Total is 3.
> > > > > Even though we read after some delay, host_busy will be incorrect.
> > > > > We expect host_busy = 2 but it will return 3.
> > > > >
> > > > > This patch fix my issue explained above for shared host-tag case.
> > > > > I am confused reading the commit message. You may not have
> > > > > intentionally fix the issue as I explained but indirectly it fixes
> > > > > my issue. Am I
> > > correct ?
> > > > >
> > > > > What was an issue reported by Luojiaxiang ? I am interested to
> > > > > know if issue reported by Luojiaxiang had shared host tagset
> enabled ?
> > > >
> > > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-
> > > > 5c10539492cb@huawei.com/
> > >
> > > I check this. It is same issue as what I am seeing on Broadcom
> > > controller only if shared host tagset is enabled.
> >
> > But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't
> only for
> > shared tags.
> 
> I mean, shared bitmap in my whole discussion. Megaraid_sas driver use
> shared bitmap, so it is exposed and It is confirmed from this discussion.
> Do we still have exposure (if "blk-mq: avoid to iterate over stale
> request" is not part of kernel)  to mpi3mr type driver which does not use
> shared bitmap but has nr_hw_queues > 1. ?

Not sure I understand your poing, but patch "blk-mq: avoid to iterate over stale
request" can cover both shared tags or not.



Thanks, 
Ming

Kashyap Desai Dec. 16, 2021, 11:55 a.m. UTC | #10

> >
> > I mean, shared bitmap in my whole discussion. Megaraid_sas driver use
> > shared bitmap, so it is exposed and It is confirmed from this
discussion.
> > Do we still have exposure (if "blk-mq: avoid to iterate over stale
> > request" is not part of kernel)  to mpi3mr type driver which does not
> > use shared bitmap but has nr_hw_queues > 1. ?
>
> Not sure I understand your poing, but patch "blk-mq: avoid to iterate
over
> stale request" can cover both shared tags or not.

I agree with all above reply.

My query is for mpi3mr driver which is not calling "host->host_tagset =
1;", but nr_hw_queues are more than 1.
Current <mpi3mr> driver is nvme style interface. nr_hw_queues  > 1 and
host->host_tagset = 0.

Is this patch  "blk-mq: avoid to iterate over stale request" is applicable
for <mpi3mr> driver ?

Kashyap

>
>
>
> Thanks,
> Ming

Ming Lei Dec. 16, 2021, 12:50 p.m. UTC | #11

On Thu, Dec 16, 2021 at 05:25:41PM +0530, Kashyap Desai wrote:
> > >
> > > I mean, shared bitmap in my whole discussion. Megaraid_sas driver use
> > > shared bitmap, so it is exposed and It is confirmed from this
> discussion.
> > > Do we still have exposure (if "blk-mq: avoid to iterate over stale
> > > request" is not part of kernel)  to mpi3mr type driver which does not
> > > use shared bitmap but has nr_hw_queues > 1. ?
> >
> > Not sure I understand your poing, but patch "blk-mq: avoid to iterate
> over
> > stale request" can cover both shared tags or not.
> 
> I agree with all above reply.
> 
> My query is for mpi3mr driver which is not calling "host->host_tagset =
> 1;", but nr_hw_queues are more than 1.
> Current <mpi3mr> driver is nvme style interface. nr_hw_queues  > 1 and
> host->host_tagset = 0.
> 
> Is this patch  "blk-mq: avoid to iterate over stale request" is applicable
> for <mpi3mr> driver ?

Yeah, this change covers both ->host_tagset and !->host_tagset, same
with the following words:

	scsi_host_check_in_flight() is used to counting scsi in-flight requests
	after scsi host is blocked, so no new scsi command can be marked as
	SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run at
	that time by blk-mq core.
	
	The issue is in blk_mq_tagset_busy_iter(). One request is in-flight, but
	this request may be kept in another slot of ->rqs[], meantime the slot
	can be allocated out but ->rqs[] isn't updated yet. Then the in-flight
	request is counted twice as SCMD_STATE_INFLIGHT.



Thanks,
Ming

blk-mq: avoid to iterate over stale request

Commit Message

Comments

Patch