Message ID | 20210906065003.439019-1-ming.lei@redhat.com |
---|---|
State | New |
Headers | show |
Series | blk-mq: avoid to iterate over stale request | expand |
On Mon, Sep 06, 2021 at 03:44:08PM -0700, Bart Van Assche wrote: > On 9/5/21 23:50, Ming Lei wrote: > > - if (!rq || !refcount_inc_not_zero(&rq->ref)) > > + if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref)) > > rq = NULL; > > Shouldn't the rq->tag != bitnr test happen after the refcount has been > incremented since otherwise rq->tag can change after it has been read and > before the refcount is incremented? rq->tag can change too after its refcount is grabbed. If the rq is released during the iterating, either SCMD_STATE_INFLIGHT is cleared or refcount_inc_not_zero() fails. So this way works. The use case for scsi_host_queue_ready() and scsi EH handling is a bit special. For others, either the iterating needn't to be exact, or queue is frozen. Thanks, Ming
On Mon, Sep 06, 2021 at 02:50:03PM +0800, Ming Lei wrote: > blk-mq can't run allocating driver tag and updating ->rqs[tag] > atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver > tag is released. > > So there is chance to iterating over one stale request just after the > tag is allocated and before updating ->rqs[tag]. > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi > in-flight requests after scsi host is blocked, so no new scsi command can > be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can > be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT, > but this request may have been kept in another slot of ->rqs[], meantime > the slot can be allocated out but ->rqs[] isn't updated yet. Then this > in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes > trouble in handling scsi error. > > Fixes the issue by not iterating over stale request. > > Cc: linux-scsi@vger.kernel.org > Cc: "Martin K. Petersen" <martin.petersen@oracle.com> > Reported-by: luojiaxing <luojiaxing@huawei.com> > Signed-off-by: Ming Lei <ming.lei@redhat.com> Hello Jens, luojiaxiang has verified that this patch fixes his issue, any chance to merge it? Thanks, Ming
On 9/12/21 7:26 PM, Ming Lei wrote: > On Mon, Sep 06, 2021 at 02:50:03PM +0800, Ming Lei wrote: >> blk-mq can't run allocating driver tag and updating ->rqs[tag] >> atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver >> tag is released. >> >> So there is chance to iterating over one stale request just after the >> tag is allocated and before updating ->rqs[tag]. >> >> scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi >> in-flight requests after scsi host is blocked, so no new scsi command can >> be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can >> be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT, >> but this request may have been kept in another slot of ->rqs[], meantime >> the slot can be allocated out but ->rqs[] isn't updated yet. Then this >> in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes >> trouble in handling scsi error. >> >> Fixes the issue by not iterating over stale request. >> >> Cc: linux-scsi@vger.kernel.org >> Cc: "Martin K. Petersen" <martin.petersen@oracle.com> >> Reported-by: luojiaxing <luojiaxing@huawei.com> >> Signed-off-by: Ming Lei <ming.lei@redhat.com> > > Hello Jens, > > luojiaxiang has verified that this patch fixes his issue, any chance to > merge it? I'll queue it up for 5.15-rc2, thanks. -- Jens Axboe
+ John Garry > blk-mq can't run allocating driver tag and updating ->rqs[tag] atomically, > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is released. > > So there is chance to iterating over one stale request just after the tag is > allocated and before updating ->rqs[tag]. > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi in-flight > requests after scsi host is blocked, so no new scsi command can be marked as > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by blk- > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request > may have been kept in another slot of ->rqs[], meantime the slot can be > allocated out but ->rqs[] isn't updated yet. Then this in-flight request is > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in handling > scsi error. Hi Ming, We found similar issue on RHEL8.5 (kernel does not have this patch in discussion.). Issue reproduced on 5.15 kernel as well. I understood this commit will fix specific race condition and avoid reading incorrect host_busy value. As per commit message - That incorrect host_busy will be just transient. If we read after some delay, correct host_busy count will be available. Right ? In my case (I am using shared host tag enabled driver), it is not race condition issue but stale rqs[] entries create permanent incorrect count of host_busy. Example - There are two pending IOs. This IOs are timed out. Bitmap of pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually belongs to hctx1). Note - This is a shared bit map. If hctx0 has same address of the request at 5th and 10th index, we will count total 2 inflight commands instead of 1 from hctx0 context + From hctx1 context, we will count 1 inflight command = Total is 3. Even though we read after some delay, host_busy will be incorrect. We expect host_busy = 2 but it will return 3. This patch fix my issue explained above for shared host-tag case. I am confused reading the commit message. You may not have intentionally fix the issue as I explained but indirectly it fixes my issue. Am I correct ? What was an issue reported by Luojiaxiang ? I am interested to know if issue reported by Luojiaxiang had shared host tagset enabled ? Kashyap > > Fixes the issue by not iterating over stale request. > > Cc: linux-scsi@vger.kernel.org > Cc: "Martin K. Petersen" <martin.petersen@oracle.com> > Reported-by: luojiaxing <luojiaxing@huawei.com> > Signed-off-by: Ming Lei <ming.lei@redhat.com> > --- > block/blk-mq-tag.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index > 86f87346232a..ff5caeb82542 100644 > --- a/block/blk-mq-tag.c > +++ b/block/blk-mq-tag.c > @@ -208,7 +208,7 @@ static struct request > *blk_mq_find_and_get_req(struct blk_mq_tags *tags, > > spin_lock_irqsave(&tags->lock, flags); > rq = tags->rqs[bitnr]; > - if (!rq || !refcount_inc_not_zero(&rq->ref)) > + if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref)) > rq = NULL; > spin_unlock_irqrestore(&tags->lock, flags); > return rq; > -- > 2.31.1
Hello Kashyap, On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote: > + John Garry > > > blk-mq can't run allocating driver tag and updating ->rqs[tag] > atomically, > > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is > released. > > > > So there is chance to iterating over one stale request just after the > tag is > > allocated and before updating ->rqs[tag]. > > > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi > in-flight > > requests after scsi host is blocked, so no new scsi command can be > marked as > > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by > blk- > > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request > > may have been kept in another slot of ->rqs[], meantime the slot can be > > allocated out but ->rqs[] isn't updated yet. Then this in-flight request > is > > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in > handling > > scsi error. > > Hi Ming, > > We found similar issue on RHEL8.5 (kernel does not have this patch in > discussion.). Issue reproduced on 5.15 kernel as well. > I understood this commit will fix specific race condition and avoid > reading incorrect host_busy value. > As per commit message - That incorrect host_busy will be just transient. > If we read after some delay, correct host_busy count will be available. > Right ? Yeah, any counter(include atomic counter) works in this way. But here it may be 'permanent' because one stale request pointer may stay in one slot of ->rqs[] for long enough time if this slot isn't reused, meantime the same request can be reallocated in case of real io scheduler. Maybe the commit log should be improved a bit for making it explicit. > > In my case (I am using shared host tag enabled driver), it is not race > condition issue but stale rqs[] entries create permanent incorrect count > of host_busy. > Example - There are two pending IOs. This IOs are timed out. Bitmap of > pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually belongs > to hctx1). Note - This is a shared bit map. > If hctx0 has same address of the request at 5th and 10th index, we will It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit map, both tag#5 and tag#10 are set, and both shared_tags->rqs[5] & shared_tags->rqs[10] should point to the updated requests(timed out). > count total 2 inflight commands instead of 1 from hctx0 context + From > hctx1 context, we will count 1 inflight command = Total is 3. > Even though we read after some delay, host_busy will be incorrect. We > expect host_busy = 2 but it will return 3. > > This patch fix my issue explained above for shared host-tag case. I am > confused reading the commit message. You may not have intentionally fix > the issue as I explained but indirectly it fixes my issue. Am I correct ? > > What was an issue reported by Luojiaxiang ? I am interested to know if > issue reported by Luojiaxiang had shared host tagset enabled ? https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@huawei.com/ It should be same with yours, and the original report described & explained the issue in details. thanks, Ming > > Kashyap > > > > > Fixes the issue by not iterating over stale request. > > > > Cc: linux-scsi@vger.kernel.org > > Cc: "Martin K. Petersen" <martin.petersen@oracle.com> > > Reported-by: luojiaxing <luojiaxing@huawei.com> > > Signed-off-by: Ming Lei <ming.lei@redhat.com> > > --- > > block/blk-mq-tag.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index > > 86f87346232a..ff5caeb82542 100644 > > --- a/block/blk-mq-tag.c > > +++ b/block/blk-mq-tag.c > > @@ -208,7 +208,7 @@ static struct request > > *blk_mq_find_and_get_req(struct blk_mq_tags *tags, > > > > spin_lock_irqsave(&tags->lock, flags); > > rq = tags->rqs[bitnr]; > > - if (!rq || !refcount_inc_not_zero(&rq->ref)) > > + if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref)) > > rq = NULL; > > spin_unlock_irqrestore(&tags->lock, flags); > > return rq; > > -- > > 2.31.1
> > Hello Kashyap, > > On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote: > > + John Garry > > > > > blk-mq can't run allocating driver tag and updating ->rqs[tag] > > atomically, > > > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is > > released. > > > > > > So there is chance to iterating over one stale request just after > > > the > > tag is > > > allocated and before updating ->rqs[tag]. > > > > > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count > > > scsi > > in-flight > > > requests after scsi host is blocked, so no new scsi command can be > > marked as > > > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run > > > by > > blk- > > > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this > > > request may have been kept in another slot of ->rqs[], meantime the > > > slot can be allocated out but ->rqs[] isn't updated yet. Then this > > > in-flight request > > is > > > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in > > handling > > > scsi error. > > > > Hi Ming, > > > > We found similar issue on RHEL8.5 (kernel does not have this patch in > > discussion.). Issue reproduced on 5.15 kernel as well. > > I understood this commit will fix specific race condition and avoid > > reading incorrect host_busy value. > > As per commit message - That incorrect host_busy will be just transient. > > If we read after some delay, correct host_busy count will be available. > > Right ? > > Yeah, any counter(include atomic counter) works in this way. > > But here it may be 'permanent' because one stale request pointer may stay in > one slot of ->rqs[] for long enough time if this slot isn't reused, meantime the > same request can be reallocated in case of real io scheduler. Maybe the > commit log should be improved a bit for making it explicit. Changing commit log description will help. > > > > > In my case (I am using shared host tag enabled driver), it is not race > > condition issue but stale rqs[] entries create permanent incorrect > > count of host_busy. > > Example - There are two pending IOs. This IOs are timed out. Bitmap of > > pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually > > belongs to hctx1). Note - This is a shared bit map. > > If hctx0 has same address of the request at 5th and 10th index, we > > will > > It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit map, both > tag#5 and tag#10 are set, and both shared_tags->rqs[5] & shared_tags- > >rqs[10] should point to the updated requests(timed out). Updated pointers will be there for actual hctx. Below is possible and that is what causing problem in original issue. shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command) shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command) <- This is incorrect. While looping on hctx0 tags[], bitmap = 10 this entry is also found which is actually outstanding on hctx1. shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command) Issue noticed by me is the exact same issue described @ below - https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@hu awei.com/ Issue is only exposed to shared host tagset. I got the required information. Thanks. Kashyap > > > count total 2 inflight commands instead of 1 from hctx0 context + From > > hctx1 context, we will count 1 inflight command = Total is 3. > > Even though we read after some delay, host_busy will be incorrect. We > > expect host_busy = 2 but it will return 3. > > > > This patch fix my issue explained above for shared host-tag case. I > > am confused reading the commit message. You may not have intentionally > > fix the issue as I explained but indirectly it fixes my issue. Am I correct ? > > > > What was an issue reported by Luojiaxiang ? I am interested to know if > > issue reported by Luojiaxiang had shared host tagset enabled ? > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab- > 5c10539492cb@huawei.com/ I check this. It is same issue as what I am seeing on Broadcom controller only if shared host tagset is enabled.
On Wed, Dec 15, 2021 at 01:00:49PM +0530, Kashyap Desai wrote: > > > > Hello Kashyap, > > > > On Wed, Dec 15, 2021 at 12:11:13AM +0530, Kashyap Desai wrote: > > > + John Garry > > > > > > > blk-mq can't run allocating driver tag and updating ->rqs[tag] > > > atomically, > > > > meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is > > > released. > > > > > > > > So there is chance to iterating over one stale request just after > > > > the > > > tag is > > > > allocated and before updating ->rqs[tag]. > > > > > > > > scsi_host_busy_iter() calls scsi_host_check_in_flight() to count > > > > scsi > > > in-flight > > > > requests after scsi host is blocked, so no new scsi command can be > > > marked as > > > > SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run > > > > by > > > blk- > > > > mq core. One request is marked as SCMD_STATE_INFLIGHT, but this > > > > request may have been kept in another slot of ->rqs[], meantime the > > > > slot can be allocated out but ->rqs[] isn't updated yet. Then this > > > > in-flight request > > > is > > > > counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in > > > handling > > > > scsi error. > > > > > > Hi Ming, > > > > > > We found similar issue on RHEL8.5 (kernel does not have this patch in > > > discussion.). Issue reproduced on 5.15 kernel as well. > > > I understood this commit will fix specific race condition and avoid > > > reading incorrect host_busy value. > > > As per commit message - That incorrect host_busy will be just > transient. > > > If we read after some delay, correct host_busy count will be > available. > > > Right ? > > > > Yeah, any counter(include atomic counter) works in this way. > > > > But here it may be 'permanent' because one stale request pointer may > stay in > > one slot of ->rqs[] for long enough time if this slot isn't reused, > meantime the > > same request can be reallocated in case of real io scheduler. Maybe the > > commit log should be improved a bit for making it explicit. > > Changing commit log description will help. > > > > > > > > > In my case (I am using shared host tag enabled driver), it is not race > > > condition issue but stale rqs[] entries create permanent incorrect > > > count of host_busy. > > > Example - There are two pending IOs. This IOs are timed out. Bitmap of > > > pending IO is tag#5 (actually belongs to hctx0), tag#10 (actually > > > belongs to hctx1). Note - This is a shared bit map. > > > If hctx0 has same address of the request at 5th and 10th index, we > > > will > > > > It shouldn't be possible, since ->rqs[] is per-tags. If it is shared bit > map, both > > tag#5 and tag#10 are set, and both shared_tags->rqs[5] & shared_tags- > > >rqs[10] should point to the updated requests(timed out). > Updated pointers will be there for actual hctx. Below is possible and > that is what causing problem in original issue. > > shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command) > shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command) <- > This is incorrect. While looping on hctx0 tags[], bitmap = 10 this entry > is also found which is actually outstanding on hctx1. > shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command) Sorry, I am a bit confused, please look at the following code and blk_mq_find_and_get_req()(<-bt_tags_iter). void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, busy_tag_iter_fn *fn, void *priv) { unsigned int flags = tagset->flags; int i, nr_tags; nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues; for (i = 0; i < nr_tags; i++) { if (tagset->tags && tagset->tags[i]) __blk_mq_all_tag_iter(tagset->tags[i], fn, priv, BT_TAG_ITER_STARTED); } } In case of shared tags, only tagset->tags[0] is iterated over, so both 5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10] shouldn't just point to the two latest requests? How can both 0xAA & 0xBB be retrieved from shared_tags->rqs[10]? > Issue noticed by me is the exact same issue described @ below - > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492cb@hu > awei.com/ > > Issue is only exposed to shared host tagset. I got the required > information. Thanks. > > Kashyap > > > > > count total 2 inflight commands instead of 1 from hctx0 context + From > > > hctx1 context, we will count 1 inflight command = Total is 3. > > > Even though we read after some delay, host_busy will be incorrect. We > > > expect host_busy = 2 but it will return 3. > > > > > > This patch fix my issue explained above for shared host-tag case. I > > > am confused reading the commit message. You may not have intentionally > > > fix the issue as I explained but indirectly it fixes my issue. Am I > correct ? > > > > > > What was an issue reported by Luojiaxiang ? I am interested to know if > > > issue reported by Luojiaxiang had shared host tagset enabled ? > > > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab- > > 5c10539492cb@huawei.com/ > > I check this. It is same issue as what I am seeing on Broadcom controller > only if shared host tagset is enabled. But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't only for shared tags. Thanks Ming
> > > > shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command) > > shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command) > > <- This is incorrect. While looping on hctx0 tags[], bitmap = 10 this > > entry is also found which is actually outstanding on hctx1. > > shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command) > > Sorry, I am a bit confused, please look at the following code and > blk_mq_find_and_get_req()(<-bt_tags_iter). My issue is without shared tags. Below patch is certainly a fix for shared tags (for 5.16-rc). I have seen scsi eh deadlock issue on 5.15 kernel which does not have shared tags (but it has shared bitmap.) > > void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, > busy_tag_iter_fn *fn, void *priv) { > unsigned int flags = tagset->flags; > int i, nr_tags; > > nr_tags = blk_mq_is_shared_tags(flags) ? 1 : tagset->nr_hw_queues; > > for (i = 0; i < nr_tags; i++) { > if (tagset->tags && tagset->tags[i]) > __blk_mq_all_tag_iter(tagset->tags[i], fn, priv, > BT_TAG_ITER_STARTED); > } > } > > In case of shared tags, only tagset->tags[0] is iterated over, so both > 5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10] > shouldn't just point to the two latest requests? How can both 0xAA & 0xBB be > retrieved from shared_tags->rqs[10]? Since I am not talking about shared tags, you can remap same condition now. In 5.15 kernel (shared bitmap is enabled but not shared tags) - blk_mq_tagset_busy_iter is called for each tags[] of nr_hw_queues times. It will call bt_tags_for_each() for hctx0.tags->[], hctx1.tags->[] .. Since tags->bitmap_tags is shared when it traverse hctx0.tags->[], it can find hctx0.tags[10].rq which is really stale entry. If the same request which is stale @ hctx0.tags[10] is really outstanding at some other tag#, stale entry will be counted in host_busy. > > > Issue noticed by me is the exact same issue described @ below - > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492c > > b@hu > > awei.com/ > > > > Issue is only exposed to shared host tagset. I got the required > > information. Thanks. > > > > Kashyap > > > > > > > count total 2 inflight commands instead of 1 from hctx0 context + > > > > From > > > > hctx1 context, we will count 1 inflight command = Total is 3. > > > > Even though we read after some delay, host_busy will be incorrect. > > > > We expect host_busy = 2 but it will return 3. > > > > > > > > This patch fix my issue explained above for shared host-tag case. > > > > I am confused reading the commit message. You may not have > > > > intentionally fix the issue as I explained but indirectly it fixes > > > > my issue. Am I > > correct ? > > > > > > > > What was an issue reported by Luojiaxiang ? I am interested to > > > > know if issue reported by Luojiaxiang had shared host tagset enabled ? > > > > > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab- > > > 5c10539492cb@huawei.com/ > > > > I check this. It is same issue as what I am seeing on Broadcom > > controller only if shared host tagset is enabled. > > But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't only for > shared tags. I mean, shared bitmap in my whole discussion. Megaraid_sas driver use shared bitmap, so it is exposed and It is confirmed from this discussion. Do we still have exposure (if "blk-mq: avoid to iterate over stale request" is not part of kernel) to mpi3mr type driver which does not use shared bitmap but has nr_hw_queues > 1. ? Kashyap > > > > Thanks > Ming
On Wed, Dec 15, 2021 at 02:15:51PM +0530, Kashyap Desai wrote: > > > > > > shared_tags->rqs[5] of hctx0 is having scmd = 0xAA (inflight command) > > > shared_tags->rqs[10] of hctx0 is having scmd = 0xAA (inflight command) > > > <- This is incorrect. While looping on hctx0 tags[], bitmap = 10 this > > > entry is also found which is actually outstanding on hctx1. > > > shared_tags->rqs[10] of hctx1 is having scmd = 0xBB (inflight command) > > > > Sorry, I am a bit confused, please look at the following code and > > blk_mq_find_and_get_req()(<-bt_tags_iter). > > > My issue is without shared tags. Below patch is certainly a fix for shared > tags (for 5.16-rc). > I have seen scsi eh deadlock issue on 5.15 kernel which does not have > shared tags (but it has shared bitmap.) OK, if you are talking about non-shared tags, your issue is exactly addressed by 67f3b2f822b7 blk-mq: avoid to iterate over stale request > > > > > void blk_mq_tagset_busy_iter(struct blk_mq_tag_set *tagset, > > busy_tag_iter_fn *fn, void *priv) { > > unsigned int flags = tagset->flags; > > int i, nr_tags; > > > > nr_tags = blk_mq_is_shared_tags(flags) ? 1 : > tagset->nr_hw_queues; > > > > for (i = 0; i < nr_tags; i++) { > > if (tagset->tags && tagset->tags[i]) > > __blk_mq_all_tag_iter(tagset->tags[i], fn, priv, > > BT_TAG_ITER_STARTED); > > } > > } > > > > In case of shared tags, only tagset->tags[0] is iterated over, so both > > 5 and 10 are checked, and shared_tags->rqs[5] and shared_tags->rqs[10] > > shouldn't just point to the two latest requests? How can both 0xAA & > 0xBB be > > retrieved from shared_tags->rqs[10]? > > Since I am not talking about shared tags, you can remap same condition > now. > In 5.15 kernel (shared bitmap is enabled but not shared tags) - Shared bitmap and shared tags should have be same thing, that means all hw queues share same tag space. > blk_mq_tagset_busy_iter is called for each tags[] of nr_hw_queues times. > It will call bt_tags_for_each() for hctx0.tags->[], hctx1.tags->[] .. > Since tags->bitmap_tags is shared when it traverse hctx0.tags->[], it can If ->birtmap_tags is shared, only one tags should be iterated over, so the following commit was added: 0994c64eb415 blk-mq: Fix blk_mq_tagset_busy_iter() for shared tags > find hctx0.tags[10].rq which is really stale entry. > If the same request which is stale @ hctx0.tags[10] is really outstanding > at some other tag#, stale entry will be counted in host_busy. > > > > > > Issue noticed by me is the exact same issue described @ below - > > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab-5c10539492c > > > b@hu > > > awei.com/ > > > > > > Issue is only exposed to shared host tagset. I got the required > > > information. Thanks. > > > > > > Kashyap > > > > > > > > > count total 2 inflight commands instead of 1 from hctx0 context + > > > > > From > > > > > hctx1 context, we will count 1 inflight command = Total is 3. > > > > > Even though we read after some delay, host_busy will be incorrect. > > > > > We expect host_busy = 2 but it will return 3. > > > > > > > > > > This patch fix my issue explained above for shared host-tag case. > > > > > I am confused reading the commit message. You may not have > > > > > intentionally fix the issue as I explained but indirectly it fixes > > > > > my issue. Am I > > > correct ? > > > > > > > > > > What was an issue reported by Luojiaxiang ? I am interested to > > > > > know if issue reported by Luojiaxiang had shared host tagset > enabled ? > > > > > > > > https://lore.kernel.org/linux-scsi/fe5cf6c4-ce5e-4a0f-f4ab- > > > > 5c10539492cb@huawei.com/ > > > > > > I check this. It is same issue as what I am seeing on Broadcom > > > controller only if shared host tagset is enabled. > > > > But 67f3b2f822b7 ("blk-mq: avoid to iterate over stale request") isn't > only for > > shared tags. > > I mean, shared bitmap in my whole discussion. Megaraid_sas driver use > shared bitmap, so it is exposed and It is confirmed from this discussion. > Do we still have exposure (if "blk-mq: avoid to iterate over stale > request" is not part of kernel) to mpi3mr type driver which does not use > shared bitmap but has nr_hw_queues > 1. ? Not sure I understand your poing, but patch "blk-mq: avoid to iterate over stale request" can cover both shared tags or not. Thanks, Ming
> > > > I mean, shared bitmap in my whole discussion. Megaraid_sas driver use > > shared bitmap, so it is exposed and It is confirmed from this discussion. > > Do we still have exposure (if "blk-mq: avoid to iterate over stale > > request" is not part of kernel) to mpi3mr type driver which does not > > use shared bitmap but has nr_hw_queues > 1. ? > > Not sure I understand your poing, but patch "blk-mq: avoid to iterate over > stale request" can cover both shared tags or not. I agree with all above reply. My query is for mpi3mr driver which is not calling "host->host_tagset = 1;", but nr_hw_queues are more than 1. Current <mpi3mr> driver is nvme style interface. nr_hw_queues > 1 and host->host_tagset = 0. Is this patch "blk-mq: avoid to iterate over stale request" is applicable for <mpi3mr> driver ? Kashyap > > > > Thanks, > Ming
On Thu, Dec 16, 2021 at 05:25:41PM +0530, Kashyap Desai wrote: > > > > > > I mean, shared bitmap in my whole discussion. Megaraid_sas driver use > > > shared bitmap, so it is exposed and It is confirmed from this > discussion. > > > Do we still have exposure (if "blk-mq: avoid to iterate over stale > > > request" is not part of kernel) to mpi3mr type driver which does not > > > use shared bitmap but has nr_hw_queues > 1. ? > > > > Not sure I understand your poing, but patch "blk-mq: avoid to iterate > over > > stale request" can cover both shared tags or not. > > I agree with all above reply. > > My query is for mpi3mr driver which is not calling "host->host_tagset = > 1;", but nr_hw_queues are more than 1. > Current <mpi3mr> driver is nvme style interface. nr_hw_queues > 1 and > host->host_tagset = 0. > > Is this patch "blk-mq: avoid to iterate over stale request" is applicable > for <mpi3mr> driver ? Yeah, this change covers both ->host_tagset and !->host_tagset, same with the following words: scsi_host_check_in_flight() is used to counting scsi in-flight requests after scsi host is blocked, so no new scsi command can be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run at that time by blk-mq core. The issue is in blk_mq_tagset_busy_iter(). One request is in-flight, but this request may be kept in another slot of ->rqs[], meantime the slot can be allocated out but ->rqs[] isn't updated yet. Then the in-flight request is counted twice as SCMD_STATE_INFLIGHT. Thanks, Ming
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c index 86f87346232a..ff5caeb82542 100644 --- a/block/blk-mq-tag.c +++ b/block/blk-mq-tag.c @@ -208,7 +208,7 @@ static struct request *blk_mq_find_and_get_req(struct blk_mq_tags *tags, spin_lock_irqsave(&tags->lock, flags); rq = tags->rqs[bitnr]; - if (!rq || !refcount_inc_not_zero(&rq->ref)) + if (!rq || rq->tag != bitnr || !refcount_inc_not_zero(&rq->ref)) rq = NULL; spin_unlock_irqrestore(&tags->lock, flags); return rq;
blk-mq can't run allocating driver tag and updating ->rqs[tag] atomically, meantime blk-mq doesn't clear ->rqs[tag] after the driver tag is released. So there is chance to iterating over one stale request just after the tag is allocated and before updating ->rqs[tag]. scsi_host_busy_iter() calls scsi_host_check_in_flight() to count scsi in-flight requests after scsi host is blocked, so no new scsi command can be marked as SCMD_STATE_INFLIGHT. However, driver tag allocation still can be run by blk-mq core. One request is marked as SCMD_STATE_INFLIGHT, but this request may have been kept in another slot of ->rqs[], meantime the slot can be allocated out but ->rqs[] isn't updated yet. Then this in-flight request is counted twice as SCMD_STATE_INFLIGHT. This way causes trouble in handling scsi error. Fixes the issue by not iterating over stale request. Cc: linux-scsi@vger.kernel.org Cc: "Martin K. Petersen" <martin.petersen@oracle.com> Reported-by: luojiaxing <luojiaxing@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> --- block/blk-mq-tag.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-)