Message ID | 20231130193139.880955-2-bvanassche@acm.org
---|---
State | New
Series | Disable fair tag sharing for UFS devices
On 30.11.23 20:31, Bart Van Assche wrote:
> +void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
> +{
> +	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
> +	struct blk_mq_hw_ctx *hctx;
> +	struct request_queue *q;
> +	unsigned long i;
> +
> +	/*
> +	 * Serialize against blk_mq_update_nr_hw_queues() and
> +	 * blk_mq_realloc_hw_ctxs().
> +	 */
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		blk_mq_freeze_queue(q);
> +	assign_bit(DFTS_BIT, &set->flags, !enable);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		queue_for_each_hw_ctx(q, hctx, i)
> +			assign_bit(DFTS_BIT, &hctx->flags, !enable);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		blk_mq_unfreeze_queue(q);
> +	mutex_unlock(&set->tag_list_lock);

Hi Bart,

The above code adds a third user (at least) of the following pattern to the kernel:

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_freeze_queue(q);

	/* do stuff */

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unfreeze_queue(q);

Would it maybe be beneficial if we'd introduce functions for this, like:

static inline void blk_mq_freeze_tag_set(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	lockdep_assert_held(&set->tag_list_lock);

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_freeze_queue(q);
}

static inline void blk_mq_unfreeze_tag_set(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	lockdep_assert_held(&set->tag_list_lock);

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unfreeze_queue(q);
}
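For illustration, with the two helpers above, blk_mq_update_fair_sharing() from the quoted patch could shrink to roughly the following sketch. This assumes the helper names proposed in this message; it is not code from the posted series.

void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
{
	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
	struct blk_mq_hw_ctx *hctx;
	struct request_queue *q;
	unsigned long i;

	/* Serialize against blk_mq_update_nr_hw_queues(). */
	mutex_lock(&set->tag_list_lock);
	blk_mq_freeze_tag_set(set);
	assign_bit(DFTS_BIT, &set->flags, !enable);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		queue_for_each_hw_ctx(q, hctx, i)
			assign_bit(DFTS_BIT, &hctx->flags, !enable);
	blk_mq_unfreeze_tag_set(set);
	mutex_unlock(&set->tag_list_lock);
}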
On 12/1/23 23:21, Yu Kuai wrote:
> On 2023/12/01 3:31, Bart Van Assche wrote:
>> +/*
>> + * Enable or disable fair tag sharing for all request queues associated with
>> + * a tag set.
>> + */
>> +void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
>> +{
>> +	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
>> +	struct blk_mq_hw_ctx *hctx;
>> +	struct request_queue *q;
>> +	unsigned long i;
>> +
>> +	/*
>> +	 * Serialize against blk_mq_update_nr_hw_queues() and
>> +	 * blk_mq_realloc_hw_ctxs().
>> +	 */
>> +	mutex_lock(&set->tag_list_lock);
>
> I'm a little confused about this comment, because
> blk_mq_realloc_hw_ctxs() can be called from
> blk_mq_update_nr_hw_queues().
>
> If you are talking about blk_mq_init_allocated_queue(), it looks like
> just holding this lock is not enough?

I added that comment because blk_mq_init_allocated_queue() calls
blk_mq_realloc_hw_ctxs() before the request queue is added to
set->tag_list. I will take a closer look at how
blk_mq_init_allocated_queue() reads set->flags and will make sure
that these reads are properly serialized against the changes made
by blk_mq_update_fair_sharing().

Thanks,

Bart.
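For context, the ordering being discussed is roughly the following simplified sketch of the call order in blk_mq_init_allocated_queue() (details omitted; only the order matters here): the hctx flags are copied from set->flags before the queue becomes visible on set->tag_list, so a concurrent blk_mq_update_fair_sharing() holding set->tag_list_lock cannot yet see the new queue.

	/* Simplified call order, for illustration only: */
	blk_mq_realloc_hw_ctxs(set, q);		/* hctx->flags is initialized from set->flags */
	/* ... */
	blk_mq_add_queue_tag_set(set, q);	/* takes set->tag_list_lock and adds q to set->tag_list */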
Hi, Bart!

On 2023/12/04 12:13, Bart Van Assche wrote:
> On 12/1/23 23:21, Yu Kuai wrote:
>> On 2023/12/01 3:31, Bart Van Assche wrote:
>>> +/*
>>> + * Enable or disable fair tag sharing for all request queues associated with
>>> + * a tag set.
>>> + */
>>> +void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
>>> +{
>>> +	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
>>> +	struct blk_mq_hw_ctx *hctx;
>>> +	struct request_queue *q;
>>> +	unsigned long i;
>>> +
>>> +	/*
>>> +	 * Serialize against blk_mq_update_nr_hw_queues() and
>>> +	 * blk_mq_realloc_hw_ctxs().
>>> +	 */
>>> +	mutex_lock(&set->tag_list_lock);
>>
>> I'm a little confused about this comment, because
>> blk_mq_realloc_hw_ctxs() can be called from
>> blk_mq_update_nr_hw_queues().
>>
>> If you are talking about blk_mq_init_allocated_queue(), it looks like
>> just holding this lock is not enough?
>
> I added that comment because blk_mq_init_allocated_queue() calls
> blk_mq_realloc_hw_ctxs() before the request queue is added to
> set->tag_list. I will take a closer look at how
> blk_mq_init_allocated_queue() reads set->flags and will make sure
> that these reads are properly serialized against the changes made
> by blk_mq_update_fair_sharing().

Are you still interested in this patchset? I really want this switch in
our product as well.

If so, what do you think about the following change? A new field in
blk_mq_tag_set would make the synchronization much easier.

Thanks,
Kuai

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6ab7f360ff2a..791306dcd656 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3935,6 +3935,34 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	}
 }
 
+static void queue_update_fair_tag_sharing(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (q->tag_set->disable_fair_tag_sharing)
+			hctx->flags |= BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+		else
+			hctx->flags &= ~BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+	}
+}
+
+void blk_mq_update_fair_tag_sharing(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	lockdep_assert_held(&set->tag_list_lock);
+
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		blk_mq_freeze_queue(q);
+		queue_update_fair_tag_sharing(q);
+		blk_mq_unfreeze_queue(q);
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_update_fair_tag_sharing);
+
 /*
  * Caller needs to ensure that we're either frozen/quiesced, or that
  * the queue isn't live yet.
@@ -3989,6 +4017,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 {
 	mutex_lock(&set->tag_list_lock);
 
+	queue_update_fair_tag_sharing(q);
 	/*
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 958ed7e89b30..d76630ac45d8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -506,6 +506,7 @@ struct blk_mq_tag_set {
 	int			numa_node;
 	unsigned int		timeout;
 	unsigned int		flags;
+	bool			disable_fair_tag_sharing;
 	void			*driver_data;
 
 	struct blk_mq_tags	**tags;

>
> Thanks,
>
> Bart.
On 12/25/23 04:51, Yu Kuai wrote:
> Are you still interested in this patchset? I really want this switch in
> our product as well.
>
> If so, what do you think about the following change? A new field in
> blk_mq_tag_set would make the synchronization much easier.

Hi Kuai,

Thanks for the reminder. I plan to continue working on this patch series
in January 2024 (after the merge window has closed). I will take a closer
look at the patch in your email.

Thanks,

Bart.
On 12/25/23 04:51, Yu Kuai wrote:
> Are you still interested in this patchset? I really want this switch in
> our product as well.
>
> If so, what do you think about the following change? A new field in
> blk_mq_tag_set would make the synchronization much easier.

Do you perhaps see the new field as an alternative for the
BLK_MQ_F_DISABLE_FAIR_TAG_SHARING flag? I'm not sure that would be an
improvement. hctx_may_queue() is called from the hot path. Using the
'flags' field will make it easier for the compiler to optimize that
function compared to using a new structure member.

Thanks,

Bart.
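To illustrate that point with a sketch (not code from any posted patch): when both bits live in the same hctx->flags word, the two tests in hctx_may_queue() can be folded into a single load and compare, whereas a separate structure member would require an additional load, possibly from another cache line.

	/* Sketch: both conditions evaluated from a single hctx->flags load. */
	if (!hctx ||
	    (hctx->flags & (BLK_MQ_F_TAG_QUEUE_SHARED |
			    BLK_MQ_F_DISABLE_FAIR_TAG_SHARING)) !=
	    BLK_MQ_F_TAG_QUEUE_SHARED)
		return true;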
Hi,

On 2024/01/12 3:22, Bart Van Assche wrote:
> On 12/25/23 04:51, Yu Kuai wrote:
>> Are you still interested in this patchset? I really want this switch in
>> our product as well.
>>
>> If so, what do you think about the following change? A new field in
>> blk_mq_tag_set would make the synchronization much easier.
>
> Do you perhaps see the new field as an alternative for the
> BLK_MQ_F_DISABLE_FAIR_TAG_SHARING flag? I'm not sure that would be an
> improvement. hctx_may_queue() is called from the hot path. Using the
> 'flags' field will make it easier for the compiler to optimize that
> function compared to using a new structure member.

Yes, I realized that; handling the new flag in hctx_may_queue() is good.
How about the following change?

Thanks,
Kuai

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6ab7f360ff2a..dd7c9e3eca1b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3706,7 +3706,8 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
 	hctx->queue = q;
-	hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;
+	hctx->flags = set->flags & ~(BLK_MQ_F_TAG_QUEUE_SHARED |
+				     BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
 
 	INIT_LIST_HEAD(&hctx->hctx_list);
 
@@ -3935,6 +3936,37 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	}
 }
 
+static void queue_update_fair_tag_sharing(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+	bool disabled = q->tag_set->flags & BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+
+	lockdep_assert_held(&q->tag_set->tag_list_lock);
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (disabled)
+			hctx->flags |= BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+		else
+			hctx->flags &= ~BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+	}
+}
+
+void blk_mq_update_fair_tag_sharing(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	lockdep_assert_held(&set->tag_list_lock);
+
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		blk_mq_freeze_queue(q);
+		queue_update_fair_tag_sharing(q);
+		blk_mq_unfreeze_queue(q);
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_update_fair_tag_sharing);
+
 /*
  * Caller needs to ensure that we're either frozen/quiesced, or that
  * the queue isn't live yet.
@@ -3989,6 +4021,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 {
 	mutex_lock(&set->tag_list_lock);
 
+	queue_update_fair_tag_sharing(q);
 	/*
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
@@ -4767,6 +4800,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_map_swqueue(q);
 	}
 
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		queue_update_fair_tag_sharing(q);
+
 reregister:
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_sysfs_register_hctxs(q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1743857e0b01..8b9aac701035 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -393,7 +393,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
+	    (hctx->flags & BLK_MQ_F_DISABLE_FAIR_TAG_SHARING))
 		return true;
 
 	/*

>
> Thanks,
>
> Bart.
On Fri, Jan 12, 2024 at 09:08:25AM +0800, Yu Kuai wrote:
> Yes, I realized that; handling the new flag in hctx_may_queue() is good.
> How about the following change?

Who would make that decision and on what grounds?
On Sun, Jan 14, 2024 at 11:22:01AM +0800, Yu Kuai wrote:
> As you might have noticed, Bart and I both ran into this performance
> problem in production, due to fair tag sharing in environments where the
> total number of driver tags is not sufficient. Disabling fair tag sharing
> is a straightforward way to fix the problem. Of course this is not the
> ideal solution, but making tag sharing configurable and letting drivers
> decide whether they want to disable it really solves the dilemma, and
> won't have any influence outside the driver.

How can the driver make any sensible decision here? This really looks
like a horrible band-aid.

You'll need to figure out a way to make the fair sharing less costly
or adaptive. That might involve making it a little less fair, which
is probably OK as long as the original goals are met.
On 1/14/24 21:59, Christoph Hellwig wrote:
> How can the driver make any sensible decision here? This really looks
> like a horrible band-aid.

(just returned from a four day trip)

Hi Christoph,

I agree that in general it is not up to the driver to decide whether or
not fair tag sharing should be disabled. The UFS driver is an exception
because we know that for all UFS use cases the latency for all logical
units is similar.

> You'll need to figure out a way to make the fair sharing less costly
> or adaptive. That might involve making it a little less fair, which
> is probably OK as long as the original goals are met.

I disagree. Fair tag sharing is something that should be implemented in
hardware (e.g. in an NVMe controller) rather than in software.
Additionally, I'm convinced that it is impossible to come up with a
better algorithm than the current one without slowing down the hot path
in the block layer, something that nobody wants.

Bart.
On 1/14/24 22:18, Yu Kuai wrote:
> Can you take a look at my previous patchset if you haven't? And it'll
> be great to hear your comments.
>
> https://lore.kernel.org/all/20231021154806.4019417-1-yukuai1@huaweicloud.com/

Something is missing from the cover letter of that patch series:
measurements that show the impact of that patch series on the maximum
IOPS that can be achieved with the null_blk driver. I'm afraid that the
performance impact of that patch series will be larger than what is
acceptable.

Thanks,

Bart.
Hi,

On 2024/01/16 10:59, Bart Van Assche wrote:
> On 1/14/24 22:18, Yu Kuai wrote:
>> Can you take a look at my previous patchset if you haven't? And it'll
>> be great to hear your comments.
>>
>> https://lore.kernel.org/all/20231021154806.4019417-1-yukuai1@huaweicloud.com/
>
> Something is missing from the cover letter of that patch series:
> measurements that show the impact of that patch series on the maximum
> IOPS that can be achieved with the null_blk driver. I'm afraid that the
> performance impact of that patch series will be larger than what is
> acceptable.

This version is just an RFC and focuses on the method. I ran some tests
on null_blk in my VM and didn't notice a performance regression. My
concern is that I have already made this patchset complicated, and I'm
looking for comments on it without spending too much time on this
blindly.

I'll provide null_blk test results in the next version if anyone thinks
the approach is acceptable:

1) add a new field 'available_tags' and update it in the slow path, so
   that the fast path hctx_may_queue() won't be affected;
2) delay tag sharing until a driver tag allocation fails;
3) add a per-shared-tags timer to rebalance the shared tags.

Thanks,
Kuai

>
> Thanks,
>
> Bart.
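To make idea 1) in the list above concrete, a minimal sketch of what the fast path could look like follows. The 'available_tags' field is hypothetical and this is not code taken from the RFC series.

static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
				  struct sbitmap_queue *bt)
{
	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
		return true;

	/*
	 * 'available_tags' would be recomputed in the slow path whenever a
	 * queue becomes active or idle, so the hot path is reduced to a
	 * single comparison.
	 */
	return __blk_mq_active_requests(hctx) < READ_ONCE(hctx->available_tags);
}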
On 1/16/24 02:24, Yu Kuai wrote:
> I'll provide null_blk test results in the next version if anyone thinks
> the approach is acceptable:
>
> 1) add a new field 'available_tags' and update it in the slow path, so
>    that the fast path hctx_may_queue() won't be affected;
> 2) delay tag sharing until a driver tag allocation fails;
> 3) add a per-shared-tags timer to rebalance the shared tags.

My concern is that the complexity of the algorithm introduced by that
patch series is significant. I prefer code that is easy to understand.
This is why I haven't yet started a detailed review. If anyone else
wants to review that patch series, that's fine with me.

Thanks,

Bart.
On 1/17/24 23:31, Christoph Hellwig wrote:
> On Tue, Jan 16, 2024 at 09:36:27AM -0800, Bart Van Assche wrote:
>> My concern is that the complexity of the algorithm introduced by that
>> patch series is significant. I prefer code that is easy to understand.
>> This is why I haven't yet started a detailed review. If anyone else
>> wants to review that patch series, that's fine with me.
>
> Given that simply disabling fair sharing isn't going to fly we'll need
> something more complex than that.
>
> The question is how much complexity do we need, and for that it would
> be good to collect the use cases first.

Hi Christoph,

Patch "[PATCH v6 2/4] scsi: core: Make fair tag sharing configurable in
the host template" of this series can be dropped by making the UFS driver
call blk_mq_update_fair_sharing() directly.

So far two use cases have been identified: setups with a UFSHCI 3.0 host
controller and ATA controllers for which all storage devices have similar
latency characteristics. Both storage controllers have a queue depth
limit of 32 commands.

It seems to me that disabling fair sharing will always result in better
performance than any algorithm that realizes fair sharing (including the
current algorithm). Only a single boolean needs to be tested to determine
whether or not fair sharing should be disabled. Any fair sharing
algorithm that we can come up with will be significantly more complex
than testing a single boolean. I think this is a strong argument for
adding support for disabling fair sharing.

If anyone wants to improve the fair sharing algorithm that's fine with
me. However, I do not plan to work on this myself.

Thanks,

Bart.
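For illustration, calling blk_mq_update_fair_sharing() directly from the UFS driver could look roughly like the sketch below. The function name and call site are assumptions made for illustration; they are not part of the posted series.

/* Hypothetical hook, e.g. invoked once the SCSI host has been set up: */
static void ufshcd_disable_fair_tag_sharing(struct ufs_hba *hba)
{
	/*
	 * All UFS logical units behind one host have similar latency
	 * characteristics, so fair tag sharing only costs performance here.
	 */
	blk_mq_update_fair_sharing(&hba->host->tag_set, false);
}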
On Thu, Jan 18, 2024 at 10:40:26AM -0800, Bart Van Assche wrote:
> So far two use cases have been identified: setups with a UFSHCI 3.0
> host controller and ATA controllers for which all storage devices have
> similar latency characteristics. Both storage controllers have a queue
> depth limit of 32 commands.
>
> It seems to me that disabling fair sharing will always result in better
> performance than any algorithm that realizes fair sharing (including the
> current algorithm).

Not doing fair sharing is by definition always faster than doing fair
sharing; that is not the point. The point is why you think fair sharing
is not actually required for these particular setups only.
On 1/23/24 01:13, Christoph Hellwig wrote:
> The point is why you think fair sharing is not actually required for
> these particular setups only.

Hi Christoph,

Do you perhaps want me to move the SCSI host sysfs attribute that
controls fair sharing to the /sys/block/${bdev}/queue directory?

Thanks,

Bart.
On Tue, Jan 23, 2024 at 07:16:05AM -0800, Bart Van Assche wrote:
> On 1/23/24 01:13, Christoph Hellwig wrote:
>> The point is why you think fair sharing is not actually required for
>> these particular setups only.
>
> Do you perhaps want me to move the SCSI host sysfs attribute that
> controls fair sharing to the /sys/block/${bdev}/queue directory?

No. I want an explanation from you why you think your use case is so
snowflake special that you and just you need to disable fair sharing.
On 1/24/24 01:08, Christoph Hellwig wrote:
> On Tue, Jan 23, 2024 at 07:16:05AM -0800, Bart Van Assche wrote:
>> On 1/23/24 01:13, Christoph Hellwig wrote:
>>> The point is why you think fair sharing is not actually required for
>>> these particular setups only.
>>
>> Do you perhaps want me to move the SCSI host sysfs attribute that
>> controls fair sharing to the /sys/block/${bdev}/queue directory?
>
> No. I want an explanation from you why you think your use case is so
> snowflake special that you and just you need to disable fair sharing.

Hi Christoph,

Would you agree with disabling fair sharing entirely?

The use cases that need fair sharing most are those where different
storage types (e.g. hard disks and SSDs) are connected to the same
storage controller. This scenario often occurs in a cloud computing
context. There are better solutions for cloud computing contexts than
fair sharing, e.g. associating different storage types with different
storage controllers. The same approach works for storage-over-network,
since storage arrays that have a network connection usually support
establishing multiple connections from a storage initiator to the
storage server.

Thanks,

Bart.
On Mon, Jan 29, 2024 at 04:03:11PM -0800, Bart Van Assche wrote:
> Would you agree with disabling fair sharing entirely?
As far as I can tell fair sharing exists for two reasons:
1) to guarantee each queue can actually make progress for e.g.
memory reclaim
 2) to not unfairly give queues an advantage over others
What are your arguments that we do not need this?
On 1/30/24 22:22, Christoph Hellwig wrote:
> On Mon, Jan 29, 2024 at 04:03:11PM -0800, Bart Van Assche wrote:
>> Would you agree with disabling fair sharing entirely?
>
> As far as I can tell fair sharing exists for two reasons:
>
>  1) to guarantee each queue can actually make progress for e.g.
>     memory reclaim
>  2) to not unfairly give queues an advantage over others
>
> What are your arguments that we do not need this?

Regarding (1), isn't forward progress guaranteed by the sbitmap
implementation? The algorithm in __sbitmap_queue_wake_up() does not
guarantee perfect fairness, but I think it is good enough to guarantee
forward progress of code that is waiting for a block layer tag.

Regarding (2), the fairness algorithm in the blk-mq code was introduced
before the fairness of the sbitmap implementation was improved. See also
commit 0d2602ca30e4 ("blk-mq: improve support for shared tags maps") from
2014 and commit 976570b4ecd3 ("sbitmap: Advance the queue index before
waking up a queue") from 2022. The current code in the sbitmap
implementation is probably good enough if request queues share a tag set.
It would be interesting to verify this with two null_blk driver instances
with shared_tags and different completion_nsec values.

Thanks,

Bart.
On Wed, Jan 31, 2024 at 01:32:40PM -0800, Bart Van Assche wrote:
> On 1/30/24 22:22, Christoph Hellwig wrote:
> > On Mon, Jan 29, 2024 at 04:03:11PM -0800, Bart Van Assche wrote:
> > > Would you agree with disabling fair sharing entirely?
> >
> > As far as I can tell fair sharing exists for two reasons:
> >
> >  1) to guarantee each queue can actually make progress for e.g.
> >     memory reclaim
> >  2) to not unfairly give queues an advantage over others
> >
> > What are your arguments that we do not need this?
>
> Regarding (1), isn't forward progress guaranteed by the sbitmap
> implementation? The algorithm in __sbitmap_queue_wake_up() does not
> guarantee perfect fairness, but I think it is good enough to guarantee
> forward progress of code that is waiting for a block layer tag.

What if all the tags are used by one queue and all the tags are
performing long running operations? Sure, sbitmap might wake up the
longest waiter, but that could be hours.
On 1/31/24 13:37, Keith Busch wrote:
> What if all the tags are used by one queue and all the tags are
> performing long running operations? Sure, sbitmap might wake up the
> longest waiter, but that could be hours.

I have not yet encountered any storage driver that needs hours to
process a single request. Can you give an example of a storage device
that is that slow?

Bart.
On Wed, Jan 31, 2024 at 01:42:30PM -0800, Bart Van Assche wrote:
> On 1/31/24 13:37, Keith Busch wrote:
> > What if all the tags are used by one queue and all the tags are
> > performing long running operations? Sure, sbitmap might wake up the
> > longest waiter, but that could be hours.
>
> I have not yet encountered any storage driver that needs hours to
> process a single request. Can you give an example of a storage device
> that is that slow?

I didn't have anything in mind; just that protocols don't require all
commands be fast. NVMe has wait event commands that might not ever
complete. A copy command requesting multiple terabytes won't be quick
for even the fastest hardware (not "hours", but not fast).

If hardware stops responding, the tags are locked up for as long as it
takes recovery escalation to reclaim them. For nvme, error recovery
could take over a minute by default.

Anyway, even with the sbitmap approach, it's possible most of the
waiting threads are for the greedy queue and will be selected ahead of
the others. Relying on sbitmap might eventually get forward progress,
but maybe not fair.
On 1/31/24 15:04, Keith Busch wrote:
> I didn't have anything in mind; just that protocols don't require all
> commands be fast.

The default block layer timeout is 30 seconds because typical storage
commands complete in much less than 30 seconds.

> NVMe has wait event commands that might not ever complete.

Are you perhaps referring to the NVMe Asynchronous Event Request
command? That command doesn't count because the command ID for that
command comes from another set than I/O commands. From the NVMe driver:

static inline bool nvme_is_aen_req(u16 qid, __u16 command_id)
{
	return !qid &&
		nvme_tag_from_cid(command_id) >= NVME_AQ_BLK_MQ_DEPTH;
}

> A copy command requesting multiple terabytes won't be quick for even the
> fastest hardware (not "hours", but not fast).

Is there any setup in which such large commands are submitted? Write
commands that last long may negatively affect read latency. This is a
good reason not to make the max_sectors value too large.

> If hardware stops responding, the tags are locked up for as long as it
> takes recovery escalation to reclaim them. For nvme, error recovery
> could take over a minute by default.

If hardware stops responding, who cares about fairness of tag allocation
since this means that request processing halts for all queues associated
with the controller that locked up?

Bart.
On 2/1/24 08:41, Bart Van Assche wrote:
> On 1/31/24 15:04, Keith Busch wrote:
>> I didn't have anything in mind; just that protocols don't require all
>> commands be fast.
>
> The default block layer timeout is 30 seconds because typical storage
> commands complete in much less than 30 seconds.
>
>> NVMe has wait event commands that might not ever complete.
>
> Are you perhaps referring to the NVMe Asynchronous Event Request
> command? That command doesn't count because the command ID for that
> command comes from another set than I/O commands. From the NVMe driver:
>
> static inline bool nvme_is_aen_req(u16 qid, __u16 command_id)
> {
> 	return !qid &&
> 		nvme_tag_from_cid(command_id) >= NVME_AQ_BLK_MQ_DEPTH;
> }
>
>> A copy command requesting multiple terabytes won't be quick for even the
>> fastest hardware (not "hours", but not fast).
>
> Is there any setup in which such large commands are submitted? Write
> commands that last long may negatively affect read latency. This is a
> good reason not to make the max_sectors value too large.

Even if max_sectors is not very large, if the device has a gigantic
write cache that needs to be flushed first to be able to process an
incoming write, then writes can be slow. I have seen issues in the field
with that causing timeouts.

Or even a worst-case scenario: HDDs doing on-media caching of writes
when the volatile write cache is disabled by the user. Then, if the
on-media write cache needs to be freed up for a new write, the HDD will
be very, very slow handling writes.

There are plenty of scenarios out there where the device can suddenly
become slow, hogging a lot of tags in the process.

>> If hardware stops responding, the tags are locked up for as long as it
>> takes recovery escalation to reclaim them. For nvme, error recovery
>> could take over a minute by default.
>
> If hardware stops responding, who cares about fairness of tag allocation
> since this means that request processing halts for all queues associated
> with the controller that locked up?

Considering the above, it would be more about a horrible slowdown of all
devices sharing the tag set because, for whatever reason, one of the
devices is slow.

Note: this is only my 2 cents of input. I have not seen any issue in
practice with shared tag sets, but I do not think I ever encountered a
system actually using that feature :)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 5cbeb9344f2f..f41408103106 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -198,6 +198,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(NO_SCHED),
 	HCTX_FLAG_NAME(STACKING),
 	HCTX_FLAG_NAME(TAG_HCTX_SHARED),
+	HCTX_FLAG_NAME(DISABLE_FAIR_TAG_SHARING),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b8093155df8d..206295606cec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4569,6 +4569,34 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
+/*
+ * Enable or disable fair tag sharing for all request queues associated with
+ * a tag set.
+ */
+void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
+{
+	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
+	struct blk_mq_hw_ctx *hctx;
+	struct request_queue *q;
+	unsigned long i;
+
+	/*
+	 * Serialize against blk_mq_update_nr_hw_queues() and
+	 * blk_mq_realloc_hw_ctxs().
+	 */
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_freeze_queue(q);
+	assign_bit(DFTS_BIT, &set->flags, !enable);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		queue_for_each_hw_ctx(q, hctx, i)
+			assign_bit(DFTS_BIT, &hctx->flags, !enable);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_unfreeze_queue(q);
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL(blk_mq_update_fair_sharing);
+
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..eda6bd0611ea 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -416,7 +416,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
+	    (hctx->flags & BLK_MQ_F_DISABLE_FAIR_TAG_SHARING))
 		return true;
 
 	/*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1ab3081c82ed..ddda190b5c24 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -503,7 +503,7 @@ struct blk_mq_tag_set {
 	unsigned int		cmd_size;
 	int			numa_node;
 	unsigned int		timeout;
-	unsigned int		flags;
+	unsigned long		flags;
 	void			*driver_data;
 
 	struct blk_mq_tags	**tags;
@@ -662,7 +662,8 @@ enum {
 	 * or shared hwqs instead of 'mq-deadline'.
 	 */
 	BLK_MQ_F_NO_SCHED_BY_DEFAULT	= 1 << 7,
-	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
+	BLK_MQ_F_DISABLE_FAIR_TAG_SHARING = 1 << 8,
+	BLK_MQ_F_ALLOC_POLICY_START_BIT = 16,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
 	BLK_MQ_S_STOPPED	= 0,
@@ -705,6 +706,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
 		const struct blk_mq_ops *ops, unsigned int queue_depth,
 		unsigned int set_flags);
 void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
+void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable);
 
 void blk_mq_free_request(struct request *rq);
 int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
The fair sharing algorithm has a negative performance impact for storage
devices for which the full queue depth is required to reach peak
performance, e.g. UFS devices. This is because it takes a long time after
a request queue has become inactive before its tags are reassigned to the
active request queue(s). Since making tag sharing fair is not needed if
the request processing latency is similar for all request queues,
introduce a function for configuring fair tag sharing. Increase
BLK_MQ_F_ALLOC_POLICY_START_BIT to prevent the fair tag sharing flag from
overlapping with the tag allocation policy.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq.c         | 28 ++++++++++++++++++++++++++++
 block/blk-mq.h         |  3 ++-
 include/linux/blk-mq.h |  6 ++++--
 4 files changed, 35 insertions(+), 3 deletions(-)