Message ID | 20231130193139.880955-2-bvanassche@acm.org
---|---
State | New
Series | Disable fair tag sharing for UFS devices
On 30.11.23 20:31, Bart Van Assche wrote:
> +void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
> +{
> +	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
> +	struct blk_mq_hw_ctx *hctx;
> +	struct request_queue *q;
> +	unsigned long i;
> +
> +	/*
> +	 * Serialize against blk_mq_update_nr_hw_queues() and
> +	 * blk_mq_realloc_hw_ctxs().
> +	 */
> +	mutex_lock(&set->tag_list_lock);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		blk_mq_freeze_queue(q);
> +	assign_bit(DFTS_BIT, &set->flags, !enable);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		queue_for_each_hw_ctx(q, hctx, i)
> +			assign_bit(DFTS_BIT, &hctx->flags, !enable);
> +	list_for_each_entry(q, &set->tag_list, tag_set_list)
> +		blk_mq_unfreeze_queue(q);
> +	mutex_unlock(&set->tag_list_lock);

Hi Bart,

The above code adds a third user (at least) of the following pattern to the kernel:

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_freeze_queue(q);

	/* do stuff */

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unfreeze_queue(q);

Would it maybe be beneficial if we'd introduce functions for this, like:

static inline void blk_mq_freeze_tag_set(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	lockdep_assert_held(&set->tag_list_lock);

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_freeze_queue(q);
}

static inline void blk_mq_unfreeze_tag_set(struct blk_mq_tag_set *set)
{
	struct request_queue *q;

	lockdep_assert_held(&set->tag_list_lock);

	list_for_each_entry(q, &set->tag_list, tag_set_list)
		blk_mq_unfreeze_queue(q);
}
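For illustration, with the two helpers above, blk_mq_update_fair_sharing() from the quoted patch could shrink to roughly the following sketch. This assumes the helper names proposed in this message; it is not code from the posted series.

void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
{
	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
	struct blk_mq_hw_ctx *hctx;
	struct request_queue *q;
	unsigned long i;

	/* Serialize against blk_mq_update_nr_hw_queues(). */
	mutex_lock(&set->tag_list_lock);
	blk_mq_freeze_tag_set(set);
	assign_bit(DFTS_BIT, &set->flags, !enable);
	list_for_each_entry(q, &set->tag_list, tag_set_list)
		queue_for_each_hw_ctx(q, hctx, i)
			assign_bit(DFTS_BIT, &hctx->flags, !enable);
	blk_mq_unfreeze_tag_set(set);
	mutex_unlock(&set->tag_list_lock);
}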
On 12/1/23 23:21, Yu Kuai wrote:
> On 2023/12/01 3:31, Bart Van Assche wrote:
>> +/*
>> + * Enable or disable fair tag sharing for all request queues associated with
>> + * a tag set.
>> + */
>> +void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
>> +{
>> +	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
>> +	struct blk_mq_hw_ctx *hctx;
>> +	struct request_queue *q;
>> +	unsigned long i;
>> +
>> +	/*
>> +	 * Serialize against blk_mq_update_nr_hw_queues() and
>> +	 * blk_mq_realloc_hw_ctxs().
>> +	 */
>> +	mutex_lock(&set->tag_list_lock);
>
> I'm a little confused about this comment, because
> blk_mq_realloc_hw_ctxs() can be called from
> blk_mq_update_nr_hw_queues().
>
> If you are talking about blk_mq_init_allocated_queue(), it looks like
> just holding this lock is not enough?

I added that comment because blk_mq_init_allocated_queue() calls
blk_mq_realloc_hw_ctxs() before the request queue is added to
set->tag_list. I will take a closer look at how
blk_mq_init_allocated_queue() reads set->flags and will make sure
that these reads are properly serialized against the changes made
by blk_mq_update_fair_sharing().

Thanks,

Bart.
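For context, the ordering being discussed is roughly the following simplified sketch of the call order in blk_mq_init_allocated_queue() (details omitted; only the order matters here): the hctx flags are copied from set->flags before the queue becomes visible on set->tag_list, so a concurrent blk_mq_update_fair_sharing() holding set->tag_list_lock cannot yet see the new queue.

	/* Simplified call order, for illustration only: */
	blk_mq_realloc_hw_ctxs(set, q);		/* hctx->flags is initialized from set->flags */
	/* ... */
	blk_mq_add_queue_tag_set(set, q);	/* takes set->tag_list_lock and adds q to set->tag_list */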
Hi, Bart!

On 2023/12/04 12:13, Bart Van Assche wrote:
> On 12/1/23 23:21, Yu Kuai wrote:
>> On 2023/12/01 3:31, Bart Van Assche wrote:
>>> +/*
>>> + * Enable or disable fair tag sharing for all request queues associated with
>>> + * a tag set.
>>> + */
>>> +void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
>>> +{
>>> +	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
>>> +	struct blk_mq_hw_ctx *hctx;
>>> +	struct request_queue *q;
>>> +	unsigned long i;
>>> +
>>> +	/*
>>> +	 * Serialize against blk_mq_update_nr_hw_queues() and
>>> +	 * blk_mq_realloc_hw_ctxs().
>>> +	 */
>>> +	mutex_lock(&set->tag_list_lock);
>>
>> I'm a little confused about this comment, because
>> blk_mq_realloc_hw_ctxs() can be called from
>> blk_mq_update_nr_hw_queues().
>>
>> If you are talking about blk_mq_init_allocated_queue(), it looks like
>> just holding this lock is not enough?
>
> I added that comment because blk_mq_init_allocated_queue() calls
> blk_mq_realloc_hw_ctxs() before the request queue is added to
> set->tag_list. I will take a closer look at how
> blk_mq_init_allocated_queue() reads set->flags and will make sure
> that these reads are properly serialized against the changes made
> by blk_mq_update_fair_sharing().

Are you still interested in this patchset? I really want this switch in
our product as well.

If so, what do you think about the following change? A new field in
blk_mq_tag_set would make the synchronization much easier.

Thanks,
Kuai

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6ab7f360ff2a..791306dcd656 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3935,6 +3935,34 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	}
 }
 
+static void queue_update_fair_tag_sharing(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (q->tag_set->disable_fair_tag_sharing)
+			hctx->flags |= BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+		else
+			hctx->flags &= ~BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+	}
+}
+
+void blk_mq_update_fair_tag_sharing(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	lockdep_assert_held(&set->tag_list_lock);
+
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		blk_mq_freeze_queue(q);
+		queue_update_fair_tag_sharing(q);
+		blk_mq_unfreeze_queue(q);
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_update_fair_tag_sharing);
+
 /*
  * Caller needs to ensure that we're either frozen/quiesced, or that
  * the queue isn't live yet.
@@ -3989,6 +4017,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 {
 	mutex_lock(&set->tag_list_lock);
 
+	queue_update_fair_tag_sharing(q);
 	/*
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 958ed7e89b30..d76630ac45d8 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -506,6 +506,7 @@ struct blk_mq_tag_set {
 	int			numa_node;
 	unsigned int		timeout;
 	unsigned int		flags;
+	bool			disable_fair_tag_sharing;
 	void			*driver_data;
 
 	struct blk_mq_tags	**tags;

>
> Thanks,
>
> Bart.
On 12/25/23 04:51, Yu Kuai wrote:
> Are you still interested in this patchset? I really want this switch in
> our product as well.
>
> If so, what do you think about the following change? A new field in
> blk_mq_tag_set would make the synchronization much easier.

Hi Kuai,

Thanks for the reminder. I plan to continue working on this patch series
in January 2024 (after the merge window has closed). I will take a closer
look at the patch in your email.

Thanks,

Bart.
On 12/25/23 04:51, Yu Kuai wrote:
> Are you still interested in this patchset? I really want this switch in
> our product as well.
>
> If so, what do you think about the following change? A new field in
> blk_mq_tag_set would make the synchronization much easier.

Do you perhaps see the new field as an alternative for the
BLK_MQ_F_DISABLE_FAIR_TAG_SHARING flag? I'm not sure that would be an
improvement. hctx_may_queue() is called from the hot path. Using the
'flags' field will make it easier for the compiler to optimize that
function compared to using a new structure member.

Thanks,

Bart.
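To illustrate that point with a sketch (not code from any posted patch): when both bits live in the same hctx->flags word, the two tests in hctx_may_queue() can be folded into a single load and compare, whereas a separate structure member would require an additional load, possibly from another cache line.

	/* Sketch: both conditions evaluated from a single hctx->flags load. */
	if (!hctx ||
	    (hctx->flags & (BLK_MQ_F_TAG_QUEUE_SHARED |
			    BLK_MQ_F_DISABLE_FAIR_TAG_SHARING)) !=
	    BLK_MQ_F_TAG_QUEUE_SHARED)
		return true;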
Hi,

On 2024/01/12 3:22, Bart Van Assche wrote:
> On 12/25/23 04:51, Yu Kuai wrote:
>> Are you still interested in this patchset? I really want this switch in
>> our product as well.
>>
>> If so, what do you think about the following change? A new field in
>> blk_mq_tag_set would make the synchronization much easier.
>
> Do you perhaps see the new field as an alternative for the
> BLK_MQ_F_DISABLE_FAIR_TAG_SHARING flag? I'm not sure that would be an
> improvement. hctx_may_queue() is called from the hot path. Using the
> 'flags' field will make it easier for the compiler to optimize that
> function compared to using a new structure member.

Yes, I realized that; handling the new flag in hctx_may_queue() is good.
How about the following change?

Thanks,
Kuai

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 6ab7f360ff2a..dd7c9e3eca1b 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -3706,7 +3706,8 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	spin_lock_init(&hctx->lock);
 	INIT_LIST_HEAD(&hctx->dispatch);
 	hctx->queue = q;
-	hctx->flags = set->flags & ~BLK_MQ_F_TAG_QUEUE_SHARED;
+	hctx->flags = set->flags & ~(BLK_MQ_F_TAG_QUEUE_SHARED |
+				     BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
 
 	INIT_LIST_HEAD(&hctx->hctx_list);
 
@@ -3935,6 +3936,37 @@ static void blk_mq_map_swqueue(struct request_queue *q)
 	}
 }
 
+static void queue_update_fair_tag_sharing(struct request_queue *q)
+{
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+	bool disabled = q->tag_set->flags & BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+
+	lockdep_assert_held(&q->tag_set->tag_list_lock);
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (disabled)
+			hctx->flags |= BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+		else
+			hctx->flags &= ~BLK_MQ_F_DISABLE_FAIR_TAG_SHARING;
+	}
+}
+
+void blk_mq_update_fair_tag_sharing(struct blk_mq_tag_set *set)
+{
+	struct request_queue *q;
+
+	lockdep_assert_held(&set->tag_list_lock);
+
+	list_for_each_entry(q, &set->tag_list, tag_set_list) {
+		blk_mq_freeze_queue(q);
+		queue_update_fair_tag_sharing(q);
+		blk_mq_unfreeze_queue(q);
+	}
+}
+EXPORT_SYMBOL_GPL(blk_mq_update_fair_tag_sharing);
+
 /*
  * Caller needs to ensure that we're either frozen/quiesced, or that
  * the queue isn't live yet.
@@ -3989,6 +4021,7 @@ static void blk_mq_add_queue_tag_set(struct blk_mq_tag_set *set,
 {
 	mutex_lock(&set->tag_list_lock);
 
+	queue_update_fair_tag_sharing(q);
 	/*
 	 * Check to see if we're transitioning to shared (from 1 to 2 queues).
 	 */
@@ -4767,6 +4800,9 @@ static void __blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set,
 		blk_mq_map_swqueue(q);
 	}
 
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		queue_update_fair_tag_sharing(q);
+
 reregister:
 	list_for_each_entry(q, &set->tag_list, tag_set_list) {
 		blk_mq_sysfs_register_hctxs(q);
diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1743857e0b01..8b9aac701035 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -393,7 +393,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
+	    (hctx->flags & BLK_MQ_F_DISABLE_FAIR_TAG_SHARING))
 		return true;
 
 	/*

>
> Thanks,
>
> Bart.
On Fri, Jan 12, 2024 at 09:08:25AM +0800, Yu Kuai wrote:
> Yes, I realized that; handling the new flag in hctx_may_queue() is good.
> How about the following change?

Who would make that decision and on what grounds?
On Sun, Jan 14, 2024 at 11:22:01AM +0800, Yu Kuai wrote:
> As you might have noticed, Bart and I both ran into this performance
> problem in production, due to fair tag sharing in environments where the
> total number of driver tags is not sufficient. Disabling fair tag sharing
> is a straightforward way to fix the problem. Of course this is not the
> ideal solution, but making tag sharing configurable and letting drivers
> decide whether they want to disable it really solves the dilemma, and
> won't have any influence outside the driver.

How can the driver make any sensible decision here? This really looks
like a horrible band-aid.

You'll need to figure out a way to make the fair sharing less costly
or adaptive. That might involve making it a little less fair, which
is probably OK as long as the original goals are met.
On 1/14/24 21:59, Christoph Hellwig wrote:
> How can the driver make any sensible decision here? This really looks
> like a horrible band-aid.

(just returned from a four day trip)

Hi Christoph,

I agree that in general it is not up to the driver to decide whether or
not fair tag sharing should be disabled. The UFS driver is an exception
because we know that for all UFS use cases the latency for all logical
units is similar.

> You'll need to figure out a way to make the fair sharing less costly
> or adaptive. That might involve making it a little less fair, which
> is probably OK as long as the original goals are met.

I disagree. Fair tag sharing is something that should be implemented in
hardware (e.g. in an NVMe controller) rather than in software.
Additionally, I'm convinced that it is impossible to come up with a
better algorithm than the current one without slowing down the hot path
in the block layer, something that nobody wants.

Bart.
On 1/14/24 22:18, Yu Kuai wrote:
> Can you take a look at my previous patchset if you haven't? And it'll
> be great to hear your comments.
>
> https://lore.kernel.org/all/20231021154806.4019417-1-yukuai1@huaweicloud.com/

Something is missing from the cover letter of that patch series:
measurements that show the impact of that patch series on the maximum
IOPS that can be achieved with the null_blk driver. I'm afraid that the
performance impact of that patch series will be larger than what is
acceptable.

Thanks,

Bart.
Hi,

On 2024/01/16 10:59, Bart Van Assche wrote:
> On 1/14/24 22:18, Yu Kuai wrote:
>> Can you take a look at my previous patchset if you haven't? And it'll
>> be great to hear your comments.
>>
>> https://lore.kernel.org/all/20231021154806.4019417-1-yukuai1@huaweicloud.com/
>
> Something is missing from the cover letter of that patch series:
> measurements that show the impact of that patch series on the maximum
> IOPS that can be achieved with the null_blk driver. I'm afraid that the
> performance impact of that patch series will be larger than what is
> acceptable.

This version is just an RFC and focuses on the method. I ran some tests
on null_blk in my VM and didn't notice a performance regression. My
concern is that I have already made this patchset complicated, and I'm
looking for comments on it without spending too much time on this
blindly.

I'll provide null_blk test results in the next version if anyone thinks
the approach is acceptable:

1) add a new field 'available_tags' and update it in the slow path, so
   that the fast path hctx_may_queue() won't be affected;
2) delay tag sharing until a driver tag allocation fails;
3) add a per-shared-tags timer to rebalance the shared tags.

Thanks,
Kuai

>
> Thanks,
>
> Bart.
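To make idea 1) in the list above concrete, a minimal sketch of what the fast path could look like follows. The 'available_tags' field is hypothetical and this is not code taken from the RFC series.

static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
				  struct sbitmap_queue *bt)
{
	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
		return true;

	/*
	 * 'available_tags' would be recomputed in the slow path whenever a
	 * queue becomes active or idle, so the hot path is reduced to a
	 * single comparison.
	 */
	return __blk_mq_active_requests(hctx) < READ_ONCE(hctx->available_tags);
}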
On 1/16/24 02:24, Yu Kuai wrote:
> I'll provide null_blk test results in the next version if anyone thinks
> the approach is acceptable:
>
> 1) add a new field 'available_tags' and update it in the slow path, so
>    that the fast path hctx_may_queue() won't be affected;
> 2) delay tag sharing until a driver tag allocation fails;
> 3) add a per-shared-tags timer to rebalance the shared tags.

My concern is that the complexity of the algorithm introduced by that
patch series is significant. I prefer code that is easy to understand.
This is why I haven't yet started a detailed review. If anyone else
wants to review that patch series, that's fine with me.

Thanks,

Bart.
On 1/17/24 23:31, Christoph Hellwig wrote:
> On Tue, Jan 16, 2024 at 09:36:27AM -0800, Bart Van Assche wrote:
>> My concern is that the complexity of the algorithm introduced by that
>> patch series is significant. I prefer code that is easy to understand.
>> This is why I haven't yet started a detailed review. If anyone else
>> wants to review that patch series, that's fine with me.
>
> Given that simply disabling fair sharing isn't going to fly we'll need
> something more complex than that.
>
> The question is how much complexity do we need, and for that it would
> be good to collect the use cases first.

Hi Christoph,

Patch "[PATCH v6 2/4] scsi: core: Make fair tag sharing configurable in
the host template" of this series can be dropped by making the UFS driver
call blk_mq_update_fair_sharing() directly.

So far two use cases have been identified: setups with a UFSHCI 3.0 host
controller and ATA controllers for which all storage devices have similar
latency characteristics. Both storage controllers have a queue depth
limit of 32 commands.

It seems to me that disabling fair sharing will always result in better
performance than any algorithm that realizes fair sharing (including the
current algorithm). Only a single boolean needs to be tested to determine
whether or not fair sharing should be disabled. Any fair sharing
algorithm that we can come up with will be significantly more complex
than testing a single boolean. I think this is a strong argument for
adding support for disabling fair sharing.

If anyone wants to improve the fair sharing algorithm that's fine with
me. However, I do not plan to work on this myself.

Thanks,

Bart.
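For illustration, calling blk_mq_update_fair_sharing() directly from the UFS driver could look roughly like the sketch below. The function name and call site are assumptions made for illustration; they are not part of the posted series.

/* Hypothetical hook, e.g. invoked once the SCSI host has been set up: */
static void ufshcd_disable_fair_tag_sharing(struct ufs_hba *hba)
{
	/*
	 * All UFS logical units behind one host have similar latency
	 * characteristics, so fair tag sharing only costs performance here.
	 */
	blk_mq_update_fair_sharing(&hba->host->tag_set, false);
}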
On Thu, Jan 18, 2024 at 10:40:26AM -0800, Bart Van Assche wrote:
> So far two use cases have been identified: setups with a UFSHCI 3.0
> host controller and ATA controllers for which all storage devices have
> similar latency characteristics. Both storage controllers have a queue
> depth limit of 32 commands.
>
> It seems to me that disabling fair sharing will always result in better
> performance than any algorithm that realizes fair sharing (including the
> current algorithm).

Not doing fair sharing is by definition always faster than doing fair
sharing; that is not the point. The point is why you think fair sharing
is not actually required for these particular setups only.
On 1/23/24 01:13, Christoph Hellwig wrote:
> The point is why you think fair sharing is not actually required for
> these particular setups only.

Hi Christoph,

Do you perhaps want me to move the SCSI host sysfs attribute that
controls fair sharing to the /sys/block/${bdev}/queue directory?

Thanks,

Bart.
On Tue, Jan 23, 2024 at 07:16:05AM -0800, Bart Van Assche wrote:
> On 1/23/24 01:13, Christoph Hellwig wrote:
>> The point is why you think fair sharing is not actually required for
>> these particular setups only.
>
> Do you perhaps want me to move the SCSI host sysfs attribute that
> controls fair sharing to the /sys/block/${bdev}/queue directory?

No. I want an explanation from you why you think your use case is so
snowflake special that you and just you need to disable fair sharing.
On 1/24/24 01:08, Christoph Hellwig wrote:
> On Tue, Jan 23, 2024 at 07:16:05AM -0800, Bart Van Assche wrote:
>> On 1/23/24 01:13, Christoph Hellwig wrote:
>>> The point is why you think fair sharing is not actually required for
>>> these particular setups only.
>>
>> Do you perhaps want me to move the SCSI host sysfs attribute that
>> controls fair sharing to the /sys/block/${bdev}/queue directory?
>
> No. I want an explanation from you why you think your use case is so
> snowflake special that you and just you need to disable fair sharing.

Hi Christoph,

Would you agree with disabling fair sharing entirely?

The use cases that need fair sharing most are those where different
storage types (e.g. hard disks and SSDs) are connected to the same
storage controller. This scenario often occurs in a cloud computing
context. There are better solutions for cloud computing contexts than
fair sharing, e.g. associating different storage types with different
storage controllers. The same approach works for storage-over-network,
since storage arrays that have a network connection usually support
establishing multiple connections from a storage initiator to the
storage server.

Thanks,

Bart.
On Mon, Jan 29, 2024 at 04:03:11PM -0800, Bart Van Assche wrote:
> Would you agree with disabling fair sharing entirely?
As far as I can tell fair sharing exists for two reasons:
1) to guarantee each queue can actually make progress for e.g.
memory reclaim
 2) to not unfairly give queues an advantage over others
What are your arguments that we do not need this?
On 1/30/24 22:22, Christoph Hellwig wrote:
> On Mon, Jan 29, 2024 at 04:03:11PM -0800, Bart Van Assche wrote:
>> Would you agree with disabling fair sharing entirely?
>
> As far as I can tell fair sharing exists for two reasons:
>
>  1) to guarantee each queue can actually make progress for e.g.
>     memory reclaim
>  2) to not unfairly give queues an advantage over others
>
> What are your arguments that we do not need this?

Regarding (1), isn't forward progress guaranteed by the sbitmap
implementation? The algorithm in __sbitmap_queue_wake_up() does not
guarantee perfect fairness, but I think it is good enough to guarantee
forward progress of code that is waiting for a block layer tag.

Regarding (2), the fairness algorithm in the blk-mq code was introduced
before the fairness of the sbitmap implementation was improved. See also
commit 0d2602ca30e4 ("blk-mq: improve support for shared tags maps") from
2014 and commit 976570b4ecd3 ("sbitmap: Advance the queue index before
waking up a queue") from 2022. The current code in the sbitmap
implementation is probably good enough if request queues share a tag set.
It would be interesting to verify this with two null_blk driver instances
with shared_tags and different completion_nsec values.

Thanks,

Bart.
On Wed, Jan 31, 2024 at 01:32:40PM -0800, Bart Van Assche wrote:
> On 1/30/24 22:22, Christoph Hellwig wrote:
> > On Mon, Jan 29, 2024 at 04:03:11PM -0800, Bart Van Assche wrote:
> > > Would you agree with disabling fair sharing entirely?
> >
> > As far as I can tell fair sharing exists for two reasons:
> >
> >  1) to guarantee each queue can actually make progress for e.g.
> >     memory reclaim
> >  2) to not unfairly give queues an advantage over others
> >
> > What are your arguments that we do not need this?
>
> Regarding (1), isn't forward progress guaranteed by the sbitmap
> implementation? The algorithm in __sbitmap_queue_wake_up() does not
> guarantee perfect fairness, but I think it is good enough to guarantee
> forward progress of code that is waiting for a block layer tag.

What if all the tags are used by one queue and all the tags are
performing long running operations? Sure, sbitmap might wake up the
longest waiter, but that could be hours.
On 1/31/24 13:37, Keith Busch wrote:
> What if all the tags are used by one queue and all the tags are
> performing long running operations? Sure, sbitmap might wake up the
> longest waiter, but that could be hours.

I have not yet encountered any storage driver that needs hours to
process a single request. Can you give an example of a storage device
that is that slow?

Bart.
On Wed, Jan 31, 2024 at 01:42:30PM -0800, Bart Van Assche wrote:
> On 1/31/24 13:37, Keith Busch wrote:
> > What if all the tags are used by one queue and all the tags are
> > performing long running operations? Sure, sbitmap might wake up the
> > longest waiter, but that could be hours.
>
> I have not yet encountered any storage driver that needs hours to
> process a single request. Can you give an example of a storage device
> that is that slow?

I didn't have anything in mind; just that protocols don't require all
commands be fast. NVMe has wait event commands that might not ever
complete. A copy command requesting multiple terabytes won't be quick
for even the fastest hardware (not "hours", but not fast).

If hardware stops responding, the tags are locked up for as long as it
takes recovery escalation to reclaim them. For nvme, error recovery
could take over a minute by default.

Anyway, even with the sbitmap approach, it's possible most of the
waiting threads are for the greedy queue and will be selected ahead of
the others. Relying on sbitmap might eventually get forward progress,
but maybe not fair.
On 1/31/24 15:04, Keith Busch wrote:
> I didn't have anything in mind; just that protocols don't require all
> commands be fast.

The default block layer timeout is 30 seconds because typical storage
commands complete in much less than 30 seconds.

> NVMe has wait event commands that might not ever complete.

Are you perhaps referring to the NVMe Asynchronous Event Request
command? That command doesn't count because the command ID for that
command comes from another set than I/O commands. From the NVMe driver:

static inline bool nvme_is_aen_req(u16 qid, __u16 command_id)
{
	return !qid &&
		nvme_tag_from_cid(command_id) >= NVME_AQ_BLK_MQ_DEPTH;
}

> A copy command requesting multiple terabytes won't be quick for even the
> fastest hardware (not "hours", but not fast).

Is there any setup in which such large commands are submitted? Write
commands that last long may negatively affect read latency. This is a
good reason not to make the max_sectors value too large.

> If hardware stops responding, the tags are locked up for as long as it
> takes recovery escalation to reclaim them. For nvme, error recovery
> could take over a minute by default.

If hardware stops responding, who cares about fairness of tag allocation
since this means that request processing halts for all queues associated
with the controller that locked up?

Bart.
On 2/1/24 08:41, Bart Van Assche wrote:
> On 1/31/24 15:04, Keith Busch wrote:
>> I didn't have anything in mind; just that protocols don't require all
>> commands be fast.
>
> The default block layer timeout is 30 seconds because typical storage
> commands complete in much less than 30 seconds.
>
>> NVMe has wait event commands that might not ever complete.
>
> Are you perhaps referring to the NVMe Asynchronous Event Request
> command? That command doesn't count because the command ID for that
> command comes from another set than I/O commands. From the NVMe driver:
>
> static inline bool nvme_is_aen_req(u16 qid, __u16 command_id)
> {
> 	return !qid &&
> 		nvme_tag_from_cid(command_id) >= NVME_AQ_BLK_MQ_DEPTH;
> }
>
>> A copy command requesting multiple terabytes won't be quick for even the
>> fastest hardware (not "hours", but not fast).
>
> Is there any setup in which such large commands are submitted? Write
> commands that last long may negatively affect read latency. This is a
> good reason not to make the max_sectors value too large.

Even if max_sectors is not very large, if the device has a gigantic
write cache that needs to be flushed first to be able to process an
incoming write, then writes can be slow. I have seen issues in the field
with that causing timeouts.

Or even a worst-case scenario: HDDs doing on-media caching of writes
when the volatile write cache is disabled by the user. Then, if the
on-media write cache needs to be freed up for a new write, the HDD will
be very, very slow handling writes.

There are plenty of scenarios out there where the device can suddenly
become slow, hogging a lot of tags in the process.

>> If hardware stops responding, the tags are locked up for as long as it
>> takes recovery escalation to reclaim them. For nvme, error recovery
>> could take over a minute by default.
>
> If hardware stops responding, who cares about fairness of tag allocation
> since this means that request processing halts for all queues associated
> with the controller that locked up?

Considering the above, it would be more about a horrible slowdown of all
devices sharing the tag set because, for whatever reason, one of the
devices is slow.

Note: this is only my 2 cents of input. I have not seen any issue in
practice with shared tag sets, but I do not think I ever encountered a
system actually using that feature :)
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 5cbeb9344f2f..f41408103106 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -198,6 +198,7 @@ static const char *const hctx_flag_name[] = {
 	HCTX_FLAG_NAME(NO_SCHED),
 	HCTX_FLAG_NAME(STACKING),
 	HCTX_FLAG_NAME(TAG_HCTX_SHARED),
+	HCTX_FLAG_NAME(DISABLE_FAIR_TAG_SHARING),
 };
 #undef HCTX_FLAG_NAME
 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index b8093155df8d..206295606cec 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -4569,6 +4569,34 @@ void blk_mq_free_tag_set(struct blk_mq_tag_set *set)
 }
 EXPORT_SYMBOL(blk_mq_free_tag_set);
 
+/*
+ * Enable or disable fair tag sharing for all request queues associated with
+ * a tag set.
+ */
+void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable)
+{
+	const unsigned int DFTS_BIT = ilog2(BLK_MQ_F_DISABLE_FAIR_TAG_SHARING);
+	struct blk_mq_hw_ctx *hctx;
+	struct request_queue *q;
+	unsigned long i;
+
+	/*
+	 * Serialize against blk_mq_update_nr_hw_queues() and
+	 * blk_mq_realloc_hw_ctxs().
+	 */
+	mutex_lock(&set->tag_list_lock);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_freeze_queue(q);
+	assign_bit(DFTS_BIT, &set->flags, !enable);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		queue_for_each_hw_ctx(q, hctx, i)
+			assign_bit(DFTS_BIT, &hctx->flags, !enable);
+	list_for_each_entry(q, &set->tag_list, tag_set_list)
+		blk_mq_unfreeze_queue(q);
+	mutex_unlock(&set->tag_list_lock);
+}
+EXPORT_SYMBOL(blk_mq_update_fair_sharing);
+
 int blk_mq_update_nr_requests(struct request_queue *q, unsigned int nr)
 {
 	struct blk_mq_tag_set *set = q->tag_set;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index f75a9ecfebde..eda6bd0611ea 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -416,7 +416,8 @@ static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
 {
 	unsigned int depth, users;
 
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
+	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED) ||
+	    (hctx->flags & BLK_MQ_F_DISABLE_FAIR_TAG_SHARING))
 		return true;
 
 	/*
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 1ab3081c82ed..ddda190b5c24 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -503,7 +503,7 @@ struct blk_mq_tag_set {
 	unsigned int		cmd_size;
 	int			numa_node;
 	unsigned int		timeout;
-	unsigned int		flags;
+	unsigned long		flags;
 	void			*driver_data;
 
 	struct blk_mq_tags	**tags;
@@ -662,7 +662,8 @@ enum {
 	 * or shared hwqs instead of 'mq-deadline'.
 	 */
 	BLK_MQ_F_NO_SCHED_BY_DEFAULT	= 1 << 7,
-	BLK_MQ_F_ALLOC_POLICY_START_BIT = 8,
+	BLK_MQ_F_DISABLE_FAIR_TAG_SHARING = 1 << 8,
+	BLK_MQ_F_ALLOC_POLICY_START_BIT = 16,
 	BLK_MQ_F_ALLOC_POLICY_BITS = 1,
 
 	BLK_MQ_S_STOPPED	= 0,
@@ -705,6 +706,7 @@ int blk_mq_alloc_sq_tag_set(struct blk_mq_tag_set *set,
 		const struct blk_mq_ops *ops, unsigned int queue_depth,
 		unsigned int set_flags);
 void blk_mq_free_tag_set(struct blk_mq_tag_set *set);
+void blk_mq_update_fair_sharing(struct blk_mq_tag_set *set, bool enable);
 
 void blk_mq_free_request(struct request *rq);
 int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
The fair sharing algorithm has a negative performance impact for storage
devices for which the full queue depth is required to reach peak
performance, e.g. UFS devices. This is because it takes a long time after
a request queue has become inactive before its tags are reassigned to the
active request queue(s). Since making tag sharing fair is not needed if
the request processing latency is similar for all request queues,
introduce a function for configuring fair tag sharing. Increase
BLK_MQ_F_ALLOC_POLICY_START_BIT to prevent the fair tag sharing flag from
overlapping with the tag allocation policy.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: Ed Tsai <ed.tsai@mediatek.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
---
 block/blk-mq-debugfs.c |  1 +
 block/blk-mq.c         | 28 ++++++++++++++++++++++++++++
 block/blk-mq.h         |  3 ++-
 include/linux/blk-mq.h |  6 ++++--
 4 files changed, 35 insertions(+), 3 deletions(-)