Message ID: 20240325044452.3125418-1-dlemoal@kernel.org
Series: Zone write plugging
On 3/24/24 21:44, Damien Le Moal wrote:
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 8aeb8e96f1a7..9e6e2a9a147c 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -820,11 +820,11 @@ static void blk_complete_request(struct request *req)
>  		/* Completion has already been traced */
>  		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
>
> -		if (req_op(req) == REQ_OP_ZONE_APPEND)
> -			bio->bi_iter.bi_sector = req->__sector;
> -
> -		if (!is_flush)
> +		if (!is_flush) {
> +			blk_zone_update_request_bio(req, bio);
>  			bio_endio(bio);
> +		}

The above change includes a behavior change. It seems wrong to me not
to call blk_zone_update_request_bio() for REQ_OP_ZONE_APPEND requests
if RQF_FLUSH_SEQ has been set.

Thanks,

Bart.
On 3/24/24 21:44, Damien Le Moal wrote:
> +/*
> + * Per-zone write plug.
> + */
> +struct blk_zone_wplug {
> +	struct hlist_node	node;
> +	struct list_head	err;
> +	atomic_t		ref;
> +	spinlock_t		lock;
> +	unsigned int		flags;
> +	unsigned int		zone_no;
> +	unsigned int		wp_offset;
> +	struct bio_list		bio_list;
> +	struct work_struct	bio_work;
> +};

Please document what 'lock' protects. Please also document the unit of
wp_offset.

Since there is an atomic reference count in this data structure, why is
the flag BLK_ZONE_WPLUG_FREEING required? Can that flag be replaced by
checking whether or not 'ref' is zero?

> -void disk_free_zone_bitmaps(struct gendisk *disk)
> +static bool disk_insert_zone_wplug(struct gendisk *disk,
> +				   struct blk_zone_wplug *zwplug)
> +{
> +	struct blk_zone_wplug *zwplg;
> +	unsigned long flags;
> +	unsigned int idx =
> +		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
> +
> +	/*
> +	 * Add the new zone write plug to the hash table, but carefully as we
> +	 * are racing with other submission context, so we may already have a
> +	 * zone write plug for the same zone.
> +	 */
> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
> +	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
> +		if (zwplg->zone_no == zwplug->zone_no) {
> +			spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +			return false;
> +		}
> +	}
> +	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
> +
> +	return true;
> +}

Since this function inserts an element into disk->zone_wplugs_hash[],
can it happen that another thread removes that element from the hash
list before this function returns?

> +static inline void disk_put_zone_wplug(struct blk_zone_wplug *zwplug)
> +{
> +	if (atomic_dec_and_test(&zwplug->ref)) {
> +		WARN_ON_ONCE(!bio_list_empty(&zwplug->bio_list));
> +		WARN_ON_ONCE(!list_empty(&zwplug->err));
> +
> +		kmem_cache_free(blk_zone_wplugs_cachep, zwplug);
> +	}
> +}

Calling kfree() or any of its variants for elements on an RCU-list
usually is not safe. If this is really safe in this case, a comment
should explain why.

> +static struct blk_zone_wplug *disk_get_zone_wplug_locked(struct gendisk *disk,
> +					sector_t sector, gfp_t gfp_mask,
> +					unsigned long *flags)

What does the "_locked" suffix in the function name mean? Please add a
comment that explains this.

> +static void disk_zone_wplug_abort_unaligned(struct gendisk *disk,
> +					    struct blk_zone_wplug *zwplug)

What does the "_unaligned" suffix in this function name mean? Please add
a comment that explains this.

Thanks,

Bart.
On 3/24/24 21:44, Damien Le Moal wrote:
> With the block layer generic plugging of write operations for zoned
> block devices, mq-deadline, or any other scheduler, can only ever
> see at most one write operation per zone at any time. There is thus no
> sequentiality requirements for these writes and thus no need to tightly
> control the dispatching of write requests using zone write locking.
>
> Remove all the code that implement this control in the mq-deadline
> scheduler and remove advertizing support for the
> ELEVATOR_F_ZBD_SEQ_WRITE elevator feature.

Reviewed-by: Bart Van Assche <bvanassche@acm.org>
On 3/26/24 06:53, Bart Van Assche wrote:
> On 3/24/24 21:44, Damien Le Moal wrote:
>> +	/*
>> +	 * Remember the capacity of the first sequential zone and check
>> +	 * if it is constant for all zones.
>> +	 */
>> +	if (!args->zone_capacity)
>> +		args->zone_capacity = zone->capacity;
>> +	if (zone->capacity != args->zone_capacity) {
>> +		pr_warn("%s: Invalid variable zone capacity\n",
>> +			disk->disk_name);
>> +		return -ENODEV;
>> +	}
>
> The above code won't refuse devices for which the first few zones have
> capacity zero. Shouldn't these be rejected?

The device driver is supposed to ensure that zone capacity is always set to
the device reported value or to the zone size for devices that do not have a
zone capacity smaller than the zone size. But sure, I can add a check.
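[For illustration, such a zero-capacity check might look roughly like the
fragment below, placed just before the constant-capacity comparison quoted
above. This is only a sketch of the idea, not the check Damien ended up
posting:

	/* Sketch only: reject zones that report a zero capacity. */
	if (!zone->capacity) {
		pr_warn("%s: Invalid zero zone capacity\n",
			disk->disk_name);
		return -ENODEV;
	}
]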
On 3/26/24 04:52, Bart Van Assche wrote:
> On 3/24/24 21:44, Damien Le Moal wrote:
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 8aeb8e96f1a7..9e6e2a9a147c 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -820,11 +820,11 @@ static void blk_complete_request(struct request *req)
>>  		/* Completion has already been traced */
>>  		bio_clear_flag(bio, BIO_TRACE_COMPLETION);
>>
>> -		if (req_op(req) == REQ_OP_ZONE_APPEND)
>> -			bio->bi_iter.bi_sector = req->__sector;
>> -
>> -		if (!is_flush)
>> +		if (!is_flush) {
>> +			blk_zone_update_request_bio(req, bio);
>>  			bio_endio(bio);
>> +		}
>
> The above change includes a behavior change. It seems wrong to me not
> to call blk_zone_update_request_bio() for REQ_OP_ZONE_APPEND requests if
> RQF_FLUSH_SEQ has been set.

REQ_OP_ZONE_APPEND + RQF_FLUSH_SEQ is not something supported, and this patch
series is not changing that. The reason is that the flush machinery is not
zone-append aware and will break if such request is issued for a device that
does not support fua. We probably should check for this, but that is not
something for this series to do and should be a separate fix.
On 3/26/24 06:53, Bart Van Assche wrote:
> On 3/24/24 21:44, Damien Le Moal wrote:
>> +/*
>> + * Per-zone write plug.
>> + */
>> +struct blk_zone_wplug {
>> +	struct hlist_node	node;
>> +	struct list_head	err;
>> +	atomic_t		ref;
>> +	spinlock_t		lock;
>> +	unsigned int		flags;
>> +	unsigned int		zone_no;
>> +	unsigned int		wp_offset;
>> +	struct bio_list		bio_list;
>> +	struct work_struct	bio_work;
>> +};
>
> Please document what 'lock' protects. Please also document the unit of
> wp_offset.
>
> Since there is an atomic reference count in this data structure, why is
> the flag BLK_ZONE_WPLUG_FREEING required? Can that flag be replaced by
> checking whether or not 'ref' is zero?

Nope, we cannot. The reason is that BIO issuing and zone reset/finish can be
concurrently processed and we need to be ready for a user doing really stupid
things like resetting or finishing a zone while BIOs for that zone are being
issued. When zone reset/finish is processed, the plug is removed from the hash
table, but disk_get_zone_wplug_locked() may still get a reference to it because
we do not have the plug locked yet. Hence the flag, to prevent reusing the plug
for the reset/finished zone that was already removed from the hash table. This
is mentioned with a comment in disk_get_zone_wplug_locked():

		/*
		 * Check that a BIO completion or a zone reset or finish
		 * operation has not already flagged the zone write plug for
		 * freeing and dropped its reference count. In such case, we
		 * need to get a new plug so start over from the beginning.
		 */

The reference count dropping to 0 will then be the trigger for actually freeing
the plug, after all in-flight or plugged BIOs are completed (most likely
failed).

>> -void disk_free_zone_bitmaps(struct gendisk *disk)
>> +static bool disk_insert_zone_wplug(struct gendisk *disk,
>> +				   struct blk_zone_wplug *zwplug)
>> +{
>> +	struct blk_zone_wplug *zwplg;
>> +	unsigned long flags;
>> +	unsigned int idx =
>> +		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
>> +
>> +	/*
>> +	 * Add the new zone write plug to the hash table, but carefully as we
>> +	 * are racing with other submission context, so we may already have a
>> +	 * zone write plug for the same zone.
>> +	 */
>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>> +	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
>> +		if (zwplg->zone_no == zwplug->zone_no) {
>> +			spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +			return false;
>> +		}
>> +	}
>> +	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>> +
>> +	return true;
>> +}
>
> Since this function inserts an element into disk->zone_wplugs_hash[],
> can it happen that another thread removes that element from the hash
> list before this function returns?

No, that cannot happen. Both insertion and deletion of plugs in the hash table
are serialized with disk->zone_wplugs_lock. See disk_remove_zone_wplug().
On Tue, Mar 26, 2024 at 08:23:19AM +0900, Damien Le Moal wrote:
> REQ_OP_ZONE_APPEND + RQF_FLUSH_SEQ is not something supported, and this patch
> series is not changing that. The reason is that the flush machinery is not
> zone-append aware and will break if such request is issued for a device that
> does not support fua. We probably should check for this, but that is not
> something for this series to do and should be a separate fix.

Btw, I don't think we're even catching this right now. Would be great
to have a submission path check for it.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
On Tue, Mar 26, 2024 at 12:12:03PM +0900, Damien Le Moal wrote:
> Nope, we cannot. The reason is that BIO issuing and zone reset/finish can be
> concurrently processed and we need to be ready for a user doing really stupid
> things like resetting or finishing a zone while BIOs for that zone are being
> issued. When zone reset/finish is processed, the plug is removed from the hash
> table, but disk_get_zone_wplug_locked() may still get a reference to it because
> we do not have the plug locked yet. Hence the flag, to prevent reusing the plug
> for the reset/finished zone that was already removed from the hash table. This
> is mentioned with a comment in disk_get_zone_wplug_locked():
>
> 		/*
> 		 * Check that a BIO completion or a zone reset or finish
> 		 * operation has not already flagged the zone write plug for
> 		 * freeing and dropped its reference count. In such case, we
> 		 * need to get a new plug so start over from the beginning.
> 		 */
>
> The reference count dropping to 0 will then be the trigger for actually freeing
> the plug, after all in-flight or plugged BIOs are completed (most likely
> failed).

Maybe the comment should be expanded even more and moved to the definition of
the freeing flag?
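[For illustration, Christoph's suggestion could look roughly like the sketch
below, with the explanation carried by the flag definition itself. The wording
paraphrases Damien's explanation in this thread; the bit position is
illustrative only and not taken from the posted series:

	/*
	 * BLK_ZONE_WPLUG_FREEING (illustrative bit value): the zone write
	 * plug was removed from the hash table (e.g. by a zone reset or
	 * finish operation, or by a BIO completion) and its reference
	 * dropped. A lookup that still finds a stale reference must not
	 * reuse the plug and should restart to get a fresh one; the plug
	 * is actually freed once its reference count reaches zero.
	 */
	#define BLK_ZONE_WPLUG_FREEING	(1U << 1)
]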
On Mon, Mar 25, 2024 at 01:44:33PM +0900, Damien Le Moal wrote:
> For a zoned block device that has no limit on the number of open zones
> and no limit on the number of active zones, the zone write plug mempool
> size defaults to 128 zone write plugs. For such case, set the device
> max_open_zones queue limit to this value to indicate to the user the
> potential performance penalty that may happen when writing
> simultaneously to more zones than the mempool size.

zone_hw_limits is a horrible name for max(lim->max_active_zones,
lim->max_open_zones). But why do we even need it? I'd rather make sure
that we always have valid max_open/active limits, doing something like:

	if (!max_open && !max_active)
		max_open = max_active = DEFAULT_CAP;
	else if (!max_open || WARN_ON_ONCE(max_open > max_active))
		max_open = max_active;
	else if (!max_active)
		max_active = max_open;

and do away with the extra limit. Maybe DEFAULT_CAP should be tunable as
well?
On 3/26/24 15:37, Christoph Hellwig wrote:
> On Tue, Mar 26, 2024 at 08:23:19AM +0900, Damien Le Moal wrote:
>> REQ_OP_ZONE_APPEND + RQF_FLUSH_SEQ is not something supported, and this patch
>> series is not changing that. The reason is that the flush machinery is not
>> zone-append aware and will break if such request is issued for a device that
>> does not support fua. We probably should check for this, but that is not
>> something for this series to do and should be a separate fix.
>
> Btw, I don't think we're even catching this right now. Would be great
> to have a submission path check for it.

Yep, we are not checking this, but we should. Will send a separate patch for
this.
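[For illustration, such a submission path check might take roughly the shape
below, failing zone append BIOs that carry flush/FUA semantics on a device
without native FUA support, since those would end up in the flush machinery.
The exact condition and placement are assumptions; this is only a sketch of
the idea discussed here, not the separate patch Damien refers to:

	/*
	 * Sketch only: refuse zone append BIOs that would have to go
	 * through the flush machinery, which is not zone-append aware.
	 */
	if (bio_op(bio) == REQ_OP_ZONE_APPEND &&
	    (bio->bi_opf & (REQ_PREFLUSH | REQ_FUA)) &&
	    !bdev_fua(bio->bi_bdev)) {
		bio_io_error(bio);
		return;
	}
]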
On 3/25/24 20:12, Damien Le Moal wrote:
> On 3/26/24 06:53, Bart Van Assche wrote:
>> On 3/24/24 21:44, Damien Le Moal wrote:
>>> -void disk_free_zone_bitmaps(struct gendisk *disk)
>>> +static bool disk_insert_zone_wplug(struct gendisk *disk,
>>> +				    struct blk_zone_wplug *zwplug)
>>> +{
>>> +	struct blk_zone_wplug *zwplg;
>>> +	unsigned long flags;
>>> +	unsigned int idx =
>>> +		hash_32(zwplug->zone_no, disk->zone_wplugs_hash_bits);
>>> +
>>> +	/*
>>> +	 * Add the new zone write plug to the hash table, but carefully as we
>>> +	 * are racing with other submission context, so we may already have a
>>> +	 * zone write plug for the same zone.
>>> +	 */
>>> +	spin_lock_irqsave(&disk->zone_wplugs_lock, flags);
>>> +	hlist_for_each_entry_rcu(zwplg, &disk->zone_wplugs_hash[idx], node) {
>>> +		if (zwplg->zone_no == zwplug->zone_no) {
>>> +			spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>>> +			return false;
>>> +		}
>>> +	}
>>> +	hlist_add_head_rcu(&zwplug->node, &disk->zone_wplugs_hash[idx]);
>>> +	spin_unlock_irqrestore(&disk->zone_wplugs_lock, flags);
>>> +
>>> +	return true;
>>> +}
>>
>> Since this function inserts an element into disk->zone_wplugs_hash[],
>> can it happen that another thread removes that element from the hash
>> list before this function returns?
>
> No, that cannot happen. Both insertion and deletion of plugs in the hash table
> are serialized with disk->zone_wplugs_lock. See disk_remove_zone_wplug().

I think that documenting locking assumptions with lockdep_assert_held() would
make this patch easier to review.

Thanks,

Bart.
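[For illustration, Bart's suggestion usually takes the form below: any helper
that relies on its caller holding zone_wplugs_lock carries an explicit
annotation, so lockdep flags misuse at runtime and reviewers can see the
assumption. The helper name is hypothetical and not taken from the posted
series; the hash walk mirrors the code quoted above:

	/* Hypothetical helper that assumes zone_wplugs_lock is held. */
	static struct blk_zone_wplug *
	disk_find_zone_wplug(struct gendisk *disk, unsigned int zone_no)
	{
		struct blk_zone_wplug *zwplug;
		unsigned int idx =
			hash_32(zone_no, disk->zone_wplugs_hash_bits);

		/* Document the locking assumption instead of leaving it implicit. */
		lockdep_assert_held(&disk->zone_wplugs_lock);

		hlist_for_each_entry(zwplug, &disk->zone_wplugs_hash[idx], node)
			if (zwplug->zone_no == zone_no)
				return zwplug;
		return NULL;
	}
]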