| Message ID | 1591810159-240929-3-git-send-email-john.garry@huawei.com |
|---|---|
| State | New |
| Series | blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs |
On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
> From: Hannes Reinecke <hare@suse.de>
>
> The function does not set the depth, but rather transitions from
> shared to non-shared queues and vice versa.
> So rename it to blk_mq_update_tag_set_shared() to better reflect
> its purpose.

It is fine to rename it for me, however:

This patch claims to rename blk_mq_update_tag_set_shared(), but also
change blk_mq_init_bitmap_tags's signature.

So suggest to split this patch into two or add comment log on changing
blk_mq_init_bitmap_tags().

Thanks,
Ming
On 11/06/2020 03:57, Ming Lei wrote:
> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>> From: Hannes Reinecke <hare@suse.de>
>>
>> The function does not set the depth, but rather transitions from
>> shared to non-shared queues and vice versa.
>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>> its purpose.
>
> It is fine to rename it for me, however:
>
> This patch claims to rename blk_mq_update_tag_set_shared(), but also
> change blk_mq_init_bitmap_tags's signature.

I was going to update the commit message here, but forgot again...

>
> So suggest to split this patch into two or add comment log on changing
> blk_mq_init_bitmap_tags().

I think I'll just split into 2x commits.

Thanks,
John
On 11/06/2020 09:26, John Garry wrote:
> On 11/06/2020 03:57, Ming Lei wrote:
>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>>> From: Hannes Reinecke <hare@suse.de>
>>>
>>> The function does not set the depth, but rather transitions from
>>> shared to non-shared queues and vice versa.
>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>>> its purpose.
>>
>> It is fine to rename it for me, however:
>>
>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
>> change blk_mq_init_bitmap_tags's signature.
>
> I was going to update the commit message here, but forgot again...
>
>>
>> So suggest to split this patch into two or add comment log on changing
>> blk_mq_init_bitmap_tags().
>
> I think I'll just split into 2x commits.

Hi Hannes,

Do you have any issue with splitting the undocumented changes into
another patch as so:

-------------------->8---------------------

From db3f8ec1295efbf53273ffc137d348857cbd411e Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Tue, 23 Jun 2020 12:07:33 +0100
Subject: [PATCH] blk-mq: Free tags in blk_mq_init_tags() upon error

Since the tags are allocated in blk_mq_init_tags(), it's better practice
to free them in that same function upon error, rather than in a callee
whose job is to init the bitmap tags - blk_mq_init_bitmap_tags().

Signed-off-by: Hannes Reinecke <hare@suse.de>
[jpg: split from an earlier patch with a new commit message, minor mod
 to return NULL directly for error]
Signed-off-by: John Garry <john.garry@huawei.com>

diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 1085dc9848f3..b8972014cd90 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -487,24 +487,22 @@ static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
 				       node);
 }
 
-static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
-						   int node, int alloc_policy)
+static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
+				   int node, int alloc_policy)
 {
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
 	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
-		goto free_tags;
+		return -ENOMEM;
 	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags,
 		     round_robin, node))
 		goto free_bitmap_tags;
 
-	return tags;
+	return 0;
 free_bitmap_tags:
 	sbitmap_queue_free(&tags->bitmap_tags);
-free_tags:
-	kfree(tags);
-	return NULL;
+	return -ENOMEM;
 }
 
 struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
@@ -525,7 +523,11 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 
-	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
+	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
+		kfree(tags);
+		return NULL;
+	}
+	return tags;
 }
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)

--------------------8<---------------------

Thanks
On 6/23/20 1:25 PM, John Garry wrote:
> On 11/06/2020 09:26, John Garry wrote:
>> On 11/06/2020 03:57, Ming Lei wrote:
>>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
>>>> From: Hannes Reinecke <hare@suse.de>
>>>>
>>>> The function does not set the depth, but rather transitions from
>>>> shared to non-shared queues and vice versa.
>>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
>>>> its purpose.
>>>
>>> It is fine to rename it for me, however:
>>>
>>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
>>> change blk_mq_init_bitmap_tags's signature.
>>
>> I was going to update the commit message here, but forgot again...
>>
>>>
>>> So suggest to split this patch into two or add comment log on changing
>>> blk_mq_init_bitmap_tags().
>>
>> I think I'll just split into 2x commits.
>
> Hi Hannes,
>
> Do you have any issue with splitting the undocumented changes into
> another patch as so:
>
No, that's perfectly fine.

Kashyap, I've also attached an updated patch for the elevator_count patch;
if you agree John can include it in the next version.

Cheers,

Hannes
--
Dr. Hannes Reinecke            Teamlead Storage & Networking
hare@suse.de                   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

From d50b5f773713070208c405f7c7056eb1afed896a Mon Sep 17 00:00:00 2001
From: Hannes Reinecke <hare@suse.de>
Date: Tue, 23 Jun 2020 16:18:40 +0200
Subject: [PATCH] elevator: count requests per hctx to improve performance

Add an 'elevator_queued' count to the hctx to avoid triggering the
elevator even though there are no requests queued.

Suggested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Hannes Reinecke <hare@suse.de>
---
 block/bfq-iosched.c    | 5 +++++
 block/blk-mq.c         | 1 +
 block/mq-deadline.c    | 5 +++++
 include/linux/blk-mq.h | 4 ++++
 4 files changed, 15 insertions(+)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a1123d4d586d..3d63b35f6121 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,6 +4640,9 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
 
+	if (!atomic_read(&hctx->elevator_queued))
+		return false;
+
 	/*
 	 * Avoiding lock: a race on bfqd->busy_queues should cause at
 	 * most a call to dispatch for nothing
@@ -5554,6 +5557,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		bfq_insert_request(hctx, rq, at_head);
+		atomic_inc(&hctx->elevator_queued);
 	}
 }
 
@@ -5933,6 +5937,7 @@ static void bfq_finish_requeue_request(struct request *rq)
 
 		bfq_completed_request(bfqq, bfqd);
 		bfq_finish_requeue_request_body(bfqq);
+		atomic_dec(&rq->mq_hctx->elevator_queued);
 
 		spin_unlock_irqrestore(&bfqd->lock, flags);
 	} else {
diff --git a/block/blk-mq.c b/block/blk-mq.c
index e06e8c9f326f..f5403fc97572 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2542,6 +2542,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 		goto free_hctx;
 
 	atomic_set(&hctx->nr_active, 0);
+	atomic_set(&hctx->elevator_queued, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b57470e154c8..9d753745e6be 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -533,6 +533,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
 		rq = list_first_entry(list, struct request, queuelist);
 		list_del_init(&rq->queuelist);
 		dd_insert_request(hctx, rq, at_head);
+		atomic_inc(&hctx->elevator_queued);
 	}
 	spin_unlock(&dd->lock);
 }
@@ -573,12 +574,16 @@ static void dd_finish_request(struct request *rq)
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
+	atomic_dec(&rq->mq_hctx->elevator_queued);
 }
 
 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
 	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
 
+	if (!atomic_read(&hctx->elevator_queued))
+		return false;
+
 	return !list_empty_careful(&dd->dispatch) ||
 		!list_empty_careful(&dd->fifo_list[0]) ||
 		!list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7234db..a18c506b14e7 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
 	 * shared across request queues.
 	 */
 	atomic_t		nr_active;
+	/**
+	 * @elevator_queued: Number of queued requests on hctx.
+	 */
+	atomic_t		elevator_queued;
 
 	/** @cpuhp_online: List to store request if CPU is going to die */
 	struct hlist_node	cpuhp_online;
-- 
2.26.2
>
> On 6/23/20 1:25 PM, John Garry wrote:
> > On 11/06/2020 09:26, John Garry wrote:
> >> On 11/06/2020 03:57, Ming Lei wrote:
> >>> On Thu, Jun 11, 2020 at 01:29:09AM +0800, John Garry wrote:
> >>>> From: Hannes Reinecke <hare@suse.de>
> >>>>
> >>>> The function does not set the depth, but rather transitions from
> >>>> shared to non-shared queues and vice versa.
> >>>> So rename it to blk_mq_update_tag_set_shared() to better reflect
> >>>> its purpose.
> >>>
> >>> It is fine to rename it for me, however:
> >>>
> >>> This patch claims to rename blk_mq_update_tag_set_shared(), but also
> >>> change blk_mq_init_bitmap_tags's signature.
> >>
> >> I was going to update the commit message here, but forgot again...
> >>
> >>>
> >>> So suggest to split this patch into two or add comment log on
> >>> changing blk_mq_init_bitmap_tags().
> >>
> >> I think I'll just split into 2x commits.
> >
> > Hi Hannes,
> >
> > Do you have any issue with splitting the undocumented changes into
> > another patch as so:
> >
> No, that's perfectly fine.
>
> Kashyap, I've also attached an updated patch for the elevator_count patch;
> if
> you agree John can include it in the next version.

Hannes - Patch looks good. Header does not include problem statement. How
about adding below in header ?

High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
contention is possible in mq-deadline and bfq io scheduler when
nr_hw_queues is more than one.
It is because kblockd work queue can submit IO from all online CPUs
(through blk_mq_run_hw_queues) even though only one hctx has pending
commands.
Elevator callback "has_work" for mq-deadline and bfq scheduler consider
pending work if there are any IOs on request queue and it does not account
hctx context.

I have not seen performance drop after this patch, but I will continue
further testing.

John - One more thing, I am working on megaraid_sas driver to provide both
host_tagset = 1 and 0 option through module parameter.

Kashyap

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke            Teamlead Storage & Networking
> hare@suse.de                   +49 911 74053 688
> SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg HRB 36809
> (AG Nürnberg), Geschäftsführer: Felix Imendörffer
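For context on the problem statement above, the run path Kashyap refers to
looks roughly like the following simplified sketch of blk_mq_run_hw_queues()
(an illustration, not a verbatim copy of the kernel code): every hardware
queue gets run whenever the queue is kicked, so a has_work() callback that
only checks the queue-wide scheduler lists makes every hctx look busy and
every CPU ends up contending on the scheduler lock.

/* Simplified sketch: run each hardware queue of a request queue. */
void blk_mq_run_hw_queues(struct request_queue *q, bool async)
{
	struct blk_mq_hw_ctx *hctx;
	int i;

	queue_for_each_hw_ctx(q, hctx, i) {
		if (blk_mq_hctx_stopped(hctx))
			continue;
		/* Runs the hctx; the elevator's ->has_work() decides
		 * whether a dispatch is actually attempted. */
		blk_mq_run_hw_queue(hctx, async);
	}
}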
On 24/06/2020 09:13, Kashyap Desai wrote:
>>> Hi Hannes,
>>>
>>> Do you have any issue with splitting the undocumented changes into
>>> another patch as so:
>>>
>> No, that's perfectly fine.
>>
>> Kashyap, I've also attached an updated patch for the elevator_count patch;
>> if
>> you agree John can include it in the next version.

ok, but I need to check it myself.

> Hannes - Patch looks good. Header does not include problem statement. How
> about adding below in header ?
>
> High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
> contention is possible in mq-deadline and bfq io scheduler when nr_hw_queues
> is more than one.
> It is because kblockd work queue can submit IO from all online CPUs (through
> blk_mq_run_hw_queues) even though only one hctx has pending commands.
> Elevator callback "has_work" for mq-deadline and bfq scheduler consider
> pending work if there are any IOs on request queue and it does not account
> hctx context.
>
> I have not seen performance drop after this patch, but I will continue
> further testing.
>
> John - One more thing, I am working on megaraid_sas driver to provide both
> host_tagset = 1 and 0 option through module parameter.

I was hoping that we wouldn't have this, and have host_tagset = 1 always.
Or maybe host_tagset = 1 by default, and allow module param to set = 0.
Your choice, though.

Thanks,
John
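A module parameter along the lines being discussed could look roughly like
the sketch below. The parameter name host_tagset_enable and the probe-time
wiring are assumptions for illustration only, not the actual megaraid_sas
change; the sketch follows John's suggestion of defaulting to host_tagset = 1
with an opt-out.

#include <linux/module.h>
#include <scsi/scsi_host.h>

/* Hypothetical parameter name - an assumption, not the real driver code. */
static bool host_tagset_enable = true;
module_param(host_tagset_enable, bool, 0444);
MODULE_PARM_DESC(host_tagset_enable,
		 "Shared host-wide tag set (0 - disabled, 1 - enabled, default 1)");

static void example_setup_host_tagset(struct Scsi_Host *shost)
{
	/* host_tagset is the flag introduced by this series; keep it on by
	 * default and let the module parameter turn it off. */
	shost->host_tagset = host_tagset_enable ? 1 : 0;
}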
> > Kashyap, I've also attached an updated patch for the elevator_count
> > patch; if you agree John can include it in the next version.
>
> Hannes - Patch looks good. Header does not include problem statement.
> How about adding below in header ?
>
> High CPU utilization on "native_queued_spin_lock_slowpath" due to lock
> contention is possible in mq-deadline and bfq io scheduler when
> nr_hw_queues is more than one.
> It is because kblockd work queue can submit IO from all online CPUs
> (through
> blk_mq_run_hw_queues) even though only one hctx has pending commands.
> Elevator callback "has_work" for mq-deadline and bfq scheduler consider
> pending work if there are any IOs on request queue and it does not account
> hctx context.

Hannes/John - We need one more correction for below patch -

https://github.com/hisilicon/kernel-dev/commit/ff631eb80aa0449eaeb78a282fd7eff2a9e42f77

I noticed that the elevator_queued count goes negative mainly because there
are some cases where IO was submitted from the dispatch queue (not the
scheduler queue) and the request still has the "RQF_ELVPRIV" flag set.
In such cases "dd_finish_request" is called without "dd_insert_request".
I think it is better to decrement the counter once it is taken out of the
dispatch queue. (Ming proposed to use the dispatch path for decrementing
the counter, but I somehow did not account for it, assuming RQF_ELVPRIV
will be set only if IO is submitted from the scheduler queue.)

Below is the additional change. Can you merge this ?

diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 9d75374..bc413dd 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -385,6 +385,8 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 
 	spin_lock(&dd->lock);
 	rq = __dd_dispatch_request(dd);
+	if (rq)
+		atomic_dec(&rq->mq_hctx->elevator_queued);
 	spin_unlock(&dd->lock);
 
 	return rq;
@@ -574,7 +576,6 @@ static void dd_finish_request(struct request *rq)
 			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
 		spin_unlock_irqrestore(&dd->zone_lock, flags);
 	}
-	atomic_dec(&rq->mq_hctx->elevator_queued);
 }
 
 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
-- 
2.9.5

Kashyap
On 10/08/2020 17:51, Kashyap Desai wrote:
>> tx context.
> Hannes/John - We need one more correction for below patch -
>
> https://github.com/hisilicon/kernel-dev/commit/ff631eb80aa0449eaeb78a282fd7eff2a9e42f77
>
> I noticed - that elevator_queued count goes negative mainly because there
> are some case where IO was submitted from dispatch queue(not scheduler
> queue) and request still has "RQF_ELVPRIV" flag set.
> In such cases " dd_finish_request" is called without " dd_insert_request". I
> think it is better to decrement counter once it is taken out from dispatched
> queue. (Ming proposed to use dispatch path for decrementing counter, but I
> somehow did not accounted assuming RQF_ELVPRIV will be set only if IO is
> submitted from scheduler queue.)
>
> Below is additional change. Can you merge this ?
>
> diff --git a/block/mq-deadline.c b/block/mq-deadline.c
> index 9d75374..bc413dd 100644
> --- a/block/mq-deadline.c
> +++ b/block/mq-deadline.c
> @@ -385,6 +385,8 @@ static struct request *dd_dispatch_request(struct
> blk_mq_hw_ctx *hctx)
>
>  	spin_lock(&dd->lock);
>  	rq = __dd_dispatch_request(dd);
> +	if (rq)
> +		atomic_dec(&rq->mq_hctx->elevator_queued);

Is there any reason why this operation could not be taken outside the
spinlock? I assume raciness is not a problem with this patch...

>  	spin_unlock(&dd->lock);
>
>  	return rq;
> @@ -574,7 +576,6 @@ static void dd_finish_request(struct request *rq)
>  			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
>  		spin_unlock_irqrestore(&dd->zone_lock, flags);
>  	}
> -	atomic_dec(&rq->mq_hctx->elevator_queued);
> }
>
> static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> --
> 2.9.5
>
> Kashyap
> .

btw, can you provide signed-off-by if you want credit upgraded to
Co-developed-by?

Thanks,
john
> > diff --git a/block/mq-deadline.c b/block/mq-deadline.c index
> > 9d75374..bc413dd 100644
> > --- a/block/mq-deadline.c
> > +++ b/block/mq-deadline.c
> > @@ -385,6 +385,8 @@ static struct request *dd_dispatch_request(struct
> > blk_mq_hw_ctx *hctx)
> >
> >  	spin_lock(&dd->lock);
> >  	rq = __dd_dispatch_request(dd);
> > +	if (rq)
> > +		atomic_dec(&rq->mq_hctx->elevator_queued);
>
> Is there any reason why this operation could not be taken outside the
> spinlock? I assume raciness is not a problem with this patch...

No issue if we want to move this outside spinlock.

>
> >  	spin_unlock(&dd->lock);
> >
> >  	return rq;
> > @@ -574,7 +576,6 @@ static void dd_finish_request(struct request *rq)
> >  			blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
> >  		spin_unlock_irqrestore(&dd->zone_lock, flags);
> >  	}
> > -	atomic_dec(&rq->mq_hctx->elevator_queued);
> > }
> >
> > static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
> > --
> > 2.9.5
> >
> > Kashyap
> > .
>
> btw, can you provide signed-off-by if you want credit upgraded to Co-
> developed-by?

I will send you merged patch which you can push to your git repo.

Kashyap

>
> Thanks,
> john
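For reference, with the decrement moved outside the scheduler lock as John
suggests, dd_dispatch_request() would look roughly like the sketch below.
This is only an illustration built from the diff above, not the merged
patch.

static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
{
	struct deadline_data *dd = hctx->queue->elevator->elevator_data;
	struct request *rq;

	spin_lock(&dd->lock);
	rq = __dd_dispatch_request(dd);
	spin_unlock(&dd->lock);

	/* elevator_queued is an atomic_t, so the decrement does not need
	 * dd->lock; doing it after the unlock keeps the critical section
	 * short, and a transiently stale count only risks one extra
	 * has_work()/dispatch attempt. */
	if (rq)
		atomic_dec(&rq->mq_hctx->elevator_queued);

	return rq;
}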
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 85aa1690cbcf..bedddf168253 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -454,24 +454,22 @@ static int bt_alloc(struct sbitmap_queue *bt, unsigned int depth,
 				       node);
 }
 
-static struct blk_mq_tags *blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
-						   int node, int alloc_policy)
+static int blk_mq_init_bitmap_tags(struct blk_mq_tags *tags,
+				   int node, int alloc_policy)
 {
 	unsigned int depth = tags->nr_tags - tags->nr_reserved_tags;
 	bool round_robin = alloc_policy == BLK_TAG_ALLOC_RR;
 
 	if (bt_alloc(&tags->bitmap_tags, depth, round_robin, node))
-		goto free_tags;
+		return -ENOMEM;
 	if (bt_alloc(&tags->breserved_tags, tags->nr_reserved_tags,
 		     round_robin, node))
 		goto free_bitmap_tags;
 
-	return tags;
+	return 0;
 free_bitmap_tags:
 	sbitmap_queue_free(&tags->bitmap_tags);
-free_tags:
-	kfree(tags);
-	return NULL;
+	return -ENOMEM;
 }
 
 struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
@@ -492,7 +490,11 @@ struct blk_mq_tags *blk_mq_init_tags(unsigned int total_tags,
 	tags->nr_tags = total_tags;
 	tags->nr_reserved_tags = reserved_tags;
 
-	return blk_mq_init_bitmap_tags(tags, node, alloc_policy);
+	if (blk_mq_init_bitmap_tags(tags, node, alloc_policy) < 0) {
+		kfree(tags);
+		tags = NULL;
+	}
+	return tags;
 }
 
 void blk_mq_free_tags(struct blk_mq_tags *tags)
diff --git a/block/blk-mq.c b/block/blk-mq.c
index d255c485ca5f..c20d75c851f2 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2802,8 +2802,8 @@ static void queue_set_hctx_shared(struct request_queue *q, bool shared)
 	}
 }
 
-static void blk_mq_update_tag_set_depth(struct blk_mq_tag_set *set,
-					bool shared)
+static void blk_mq_update_tag_set_shared(struct blk_mq_tag_set *set,
+					 bool shared)
 {
 	struct request_queue *q;
 
@@ -2826,7 +2826,7 @@ static void blk_mq_del_queue_tag_set(struct request_queue *q)
 		/* just transitioned to unshared */
 		set->flags &= ~BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_depth(set, false);
+		blk_mq_update_tag_set_shared(set, false);
 	}
 	mutex_unlock(&set->tag_list_lock);
 	INIT_LIST_HEAD(&q->tag_set_list);
@@ -2844,7 +2844,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 	    !(set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)) {
 		set->flags |= BLK_MQ_F_TAG_QUEUE_SHARED;
 		/* update existing queue */
-		blk_mq_update_tag_set_depth(set, true);
+		blk_mq_update_tag_set_shared(set, true);
 	}
 	if (set->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
 		queue_set_hctx_shared(q, true);