[v5,07/28] block: Introduce zone write plugging

Zone write plugging implements a per-zone "plug" for write operations
to control the submission and execution order of write operations to
sequential write required zones of a zoned block device. Per-zone
plugging guarantees that at any time there is at most one write
request per zone being executed. This mechanism is intended to replace
zone write locking which implements a similar per-zone write throttling
at the scheduler level, but is implemented only by mq-deadline.

Unlike zone write locking which operates on requests, zone write
plugging operates on BIOs. A zone write plug is simply a BIO list that
is atomically manipulated using a spinlock and a kblockd submission
work. A write BIO to a zone is "plugged" to delay its execution if a
write BIO for the same zone was already issued, that is, if a write
request for the same zone is being executed. The next plugged BIO is
unplugged and issued once the write request completes.
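
As an illustration only, the plug/unplug behavior described above can be
modeled in user space as follows. This is a minimal sketch: the
model_bio and model_zone_plug types and both functions are hypothetical
stand-ins, not the kernel's struct bio or struct blk_zone_wplug, and the
spinlock and the kblockd work are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a write BIO targeting one zone. */
struct model_bio {
	unsigned int id;
	struct model_bio *next;
};

/* Hypothetical stand-in for a zone write plug: a FIFO list of BIOs. */
struct model_zone_plug {
	bool write_in_flight;		/* a write to this zone is executing */
	struct model_bio *head;		/* oldest plugged BIO */
	struct model_bio *tail;		/* newest plugged BIO */
};

/*
 * Returns true if the BIO was plugged (its execution is delayed) and
 * false if the caller may issue it immediately.
 */
static bool model_plug_bio(struct model_zone_plug *plug, struct model_bio *bio)
{
	bio->next = NULL;
	if (!plug->write_in_flight) {
		/* No write is executing for this zone: issue right away. */
		plug->write_in_flight = true;
		return false;
	}
	/* A write is already executing: queue the BIO in arrival order. */
	if (plug->tail)
		plug->tail->next = bio;
	else
		plug->head = bio;
	plug->tail = bio;
	return true;
}

/*
 * Called when the in-flight write completes. Returns the next BIO to
 * issue, or NULL if the plug is now idle.
 */
static struct model_bio *model_write_complete(struct model_zone_plug *plug)
{
	struct model_bio *next = plug->head;

	assert(plug->write_in_flight);
	if (!next) {
		plug->write_in_flight = false;
		return NULL;
	}
	plug->head = next->next;
	if (!plug->head)
		plug->tail = NULL;
	next->next = NULL;
	/* write_in_flight stays true: the unplugged BIO is now executing. */
	return next;
}
```

In this model, at most one BIO per plug is ever "in flight", and plugged
BIOs are released strictly in submission order.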

This mechanism has the following benefits:
 - Zone write ordering is untangled from block IO schedulers. This allows
   removing the restriction on using mq-deadline for writing to zoned
   block devices. Any block IO scheduler, including "none" can be used.
 - Zone write plugging operates on BIOs instead of requests. Plugged
   BIOs waiting for execution do not hold scheduling tags and thus
   do not prevent other BIOs from executing (reads or writes to
   other zones). Depending on the workload, this can significantly
   improve the device use (higher queue depth operation) and
   performance.
 - Both blk-mq (request based) zoned devices and BIO-based zoned devices
   (e.g. device mapper) can use zone write plugging. It is mandatory
   for the former but optional for the latter. BIO-based drivers can
   use zone write plugging to implement write ordering guarantees, or
   the drivers can implement their own if needed.
 - The code is less invasive in the block layer and is mostly limited to
   blk-zoned.c with some small changes in blk-mq.c, blk-merge.c and
   bio.c.

Zone write plugging is implemented using struct blk_zone_wplug. This
structure includes a spinlock, a BIO list and a work structure to
handle the submission of plugged BIOs. Zone write plug structures are
managed using a per-disk hash table.

Plugging of zone write BIOs is done using the function
blk_zone_write_plug_bio() which returns false if a BIO execution does
not need to be delayed and true otherwise. This function is called
from blk_mq_submit_bio() after a BIO is split to avoid large BIOs
spanning multiple zones which would cause mishandling of zone write
plugs. This change enables by default zone write plugging for any mq
request-based block device. BIO-based device drivers can also use zone
write plugging by explicitly calling blk_zone_write_plug_bio() in their
->submit_bio method. For such devices, the driver must ensure that a
BIO passed to blk_zone_write_plug_bio() is already split and not
straddling zone boundaries.
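
For illustration, the "already split" condition above amounts to a
simple check on the BIO sector range. The sketch below is a user-space
model under stated assumptions: model_bio_straddles_zones() is a
hypothetical helper, sectors are 512-byte units, and all zones have the
same size zone_sectors. It is not the block layer implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if the sector range
 * [sector, sector + nr_sectors) crosses a zone boundary, assuming
 * that all zones are zone_sectors long.
 */
static bool model_bio_straddles_zones(uint64_t sector, uint32_t nr_sectors,
				      uint64_t zone_sectors)
{
	assert(zone_sectors > 0);

	/* An empty range cannot cross a boundary. */
	if (nr_sectors == 0)
		return false;

	/* Compare the zone index of the first and of the last sector. */
	return sector / zone_sectors !=
		(sector + nr_sectors - 1) / zone_sectors;
}
```

A BIO-based driver following this model would only pass a BIO to the
plugging function when the check returns false.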

Only write and write zeroes BIOs are plugged. Zone write plugging does
not introduce any significant overhead for other operations. A BIO that
is being handled through zone write plugging is flagged using the new
BIO flag BIO_ZONE_WRITE_PLUGGING. A request handling a BIO flagged with
this new flag is flagged with the new RQF_ZONE_WRITE_PLUGGING flag.
The completion of flagged BIOs and of flagged requests triggers calls to
blk_zone_write_bio_endio() and blk_zone_write_complete_request(),
respectively. The latter function is used to
trigger submission of the next plugged BIO using the zone plug work.
blk_zone_write_bio_endio() does the same for BIO-based devices.
This ensures that at any time, at most one request (blk-mq devices) or
one BIO (BIO-based devices) is being executed for any zone. The
handling of zone write plugs using a per-zone plug spinlock maximizes
parallelism and device usage by allowing multiple zones to be written
simultaneously without lock contention.

Zone write plugging ignores flush BIOs without data. However, any flush
BIO that has data is always plugged so that the write part of the flush
sequence is serialized with other regular writes.
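
The selection rules above (only writes and write zeroes are plugged,
empty flushes are ignored, flushes with data are plugged) can be
summarized with the following user-space sketch. The enum, the
descriptor structure and the function name are hypothetical and only
model the decision. The sketch also assumes that the BIO targets a
sequential write required zone.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation codes, not the kernel's REQ_OP_* values. */
enum model_op {
	MODEL_OP_READ,
	MODEL_OP_WRITE,
	MODEL_OP_WRITE_ZEROES,
	MODEL_OP_ZONE_RESET,
};

/* Hypothetical, reduced description of a BIO. */
struct model_bio_desc {
	enum model_op op;
	bool preflush;		/* the BIO requests a cache flush */
	uint32_t nr_sectors;	/* 0 for a flush without data */
};

/* Returns true if the BIO must go through zone write plugging. */
static bool model_needs_zone_write_plugging(const struct model_bio_desc *bio)
{
	/* A flush without data has no write to order: ignore it. */
	if (bio->preflush && bio->nr_sectors == 0)
		return false;

	switch (bio->op) {
	case MODEL_OP_WRITE:
	case MODEL_OP_WRITE_ZEROES:
		/* This includes flush BIOs that carry data. */
		return true;
	default:
		/* Reads and all other operations are not plugged. */
		return false;
	}
}
```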

Given that any BIO handled through zone write plugging will be the only
BIO in flight for the target zone when it is executed, the unplugging
and submission of a BIO will have no chance of successfully merging with
plugged requests or requests in the scheduler. To overcome this
potential performance degradation, blk_mq_submit_bio() calls the
function blk_zone_write_plug_attempt_merge() to try to merge other
plugged BIOs with the one just unplugged and submitted. Successful
merging is signaled using blk_zone_write_plug_bio_merged(), called from
bio_attempt_back_merge(). Furthermore, to avoid recalculating the number
of segments of plugged BIOs to attempt merging, the number of segments
of a plugged BIO is saved using the new struct bio field
__bi_nr_segments. To avoid growing the size of struct bio, this field is
added as a union with the bi_cookie field. This is safe to do as
polling is always disabled for plugged BIOs.
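
The effect of sharing storage through a union can be illustrated with
the user-space sketch below. The two structures and the helper are
hypothetical and much smaller than struct bio. They only show that
overlaying a segment count on a same-sized cookie field leaves the
structure size unchanged, provided the two users never overlap in time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout before the change: the cookie field only. */
struct model_bio_before {
	bool polled;
	uint32_t cookie;
};

/* Hypothetical layout after the change: both fields share storage. */
struct model_bio {
	bool polled;	/* stand-in for "polling is enabled" */
	union {
		uint32_t cookie;	/* meaningful for polled BIOs only */
		uint32_t nr_segments;	/* meaningful for plugged writes only */
	};
};

/* The union does not grow the structure. */
_Static_assert(sizeof(struct model_bio) == sizeof(struct model_bio_before),
	       "union must not change the structure size");

/* Save the segment count of a BIO that is being plugged. */
static void model_bio_plug_save_segments(struct model_bio *bio,
					 uint32_t nr_segments)
{
	/* Polling is disabled for plugged BIOs, so the cookie is unused. */
	bio->polled = false;
	bio->nr_segments = nr_segments;
}
```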

When BIOs are plugged in a zone write plug, the device request queue
usage counter is always incremented. This reference is kept and reused
for blk-mq devices when the plugged BIO is unplugged and submitted
again using submit_bio_noacct_nocheck(). For this case, the unplugged
BIO is already flagged with BIO_ZONE_WRITE_PLUGGING and
blk_mq_submit_bio() proceeds directly to allocating a new request for
the BIO, re-using the usage reference count taken when the BIO was
plugged. This extra reference count is dropped in
blk_zone_write_plug_attempt_merge() for any plugged BIO that is
successfully merged. Given that BIO-based devices will not take this
path, the extra reference is dropped after a plugged BIO is unplugged
and submitted.

Zone write plugs are dynamically allocated and managed using a hash
table (an array of struct hlist_head) with RCU protection.
A zone write plug is allocated when a write BIO is received for the
zone and not freed until the zone is fully written, reset or finished.
To detect when a zone write plug can be freed, the write state of each
zone is tracked using a write pointer offset which corresponds to the
offset of a zone write pointer relative to the zone start. Write
operations always increment this write pointer offset. Zone reset
operations set it to 0 and zone finish operations set it to the zone
size.
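
The write pointer offset tracking described above can be modeled in
user space as follows. The model_wplug structure and its helpers are
hypothetical. Treating "fully written" as wp_offset reaching the zone
size is an assumption of this sketch, and the check for freeing ignores
plugged BIOs and references, which the real code must also consider.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced model of the per-zone write state. */
struct model_wplug {
	uint32_t wp_offset;	/* write pointer offset from the zone start */
	uint32_t zone_sectors;	/* zone size in sectors */
};

/* A write always advances the write pointer offset. */
static void model_wplug_write(struct model_wplug *z, uint32_t nr_sectors)
{
	assert(nr_sectors <= z->zone_sectors - z->wp_offset);
	z->wp_offset += nr_sectors;
}

/* A zone reset rewinds the write pointer offset to the zone start. */
static void model_wplug_reset(struct model_wplug *z)
{
	z->wp_offset = 0;
}

/* A zone finish moves the write pointer offset to the zone size. */
static void model_wplug_finish(struct model_wplug *z)
{
	z->wp_offset = z->zone_sectors;
}

/* In this model, a plug is only needed for a partially written zone. */
static bool model_wplug_can_free(const struct model_wplug *z)
{
	return z->wp_offset == 0 || z->wp_offset == z->zone_sectors;
}
```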

If a write error happens, the wp_offset value of a zone write plug may
become incorrect and out of sync with the device managed write pointer.
This is handled using the zone write plug flag BLK_ZONE_WPLUG_ERROR.
The function blk_zone_wplug_handle_error() is called from the new disk
zone write plug work when this flag is set. This function executes a
report zone to update the zone write pointer offset to the current
value as indicated by the device. The disk zone write plug work is
scheduled whenever a BIO flagged with BIO_ZONE_WRITE_PLUGGING completes
with an error or when bio_zone_wplug_prepare_bio() detects an unaligned
write. Once scheduled, the disk zone write plug work keeps running
until all zone errors are handled.
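
As a rough user-space illustration of this error recovery, the sketch
below marks a plug as being in error and later resynchronizes its write
pointer offset from a value reported by the device. The flag name, the
structure and the functions are hypothetical, and the report zone
itself is reduced to two sector values passed by the caller.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag, modeled after the error flag described above. */
#define MODEL_WPLUG_ERROR	(1U << 0)

/* Hypothetical, reduced model of a zone write plug. */
struct model_wplug {
	unsigned int flags;
	uint32_t wp_offset;	/* may be stale after a failed write */
};

/* A failed write only marks the plug: wp_offset is no longer trusted. */
static void model_wplug_write_failed(struct model_wplug *z)
{
	z->flags |= MODEL_WPLUG_ERROR;
}

/*
 * Resynchronize wp_offset with the device. zone_start and reported_wp
 * are absolute sectors, as a zone report would provide them.
 */
static void model_wplug_handle_error(struct model_wplug *z,
				     uint64_t zone_start, uint64_t reported_wp)
{
	/* Nothing to do for a plug that is not in error. */
	if (!(z->flags & MODEL_WPLUG_ERROR))
		return;

	assert(reported_wp >= zone_start);
	z->wp_offset = (uint32_t)(reported_wp - zone_start);
	z->flags &= ~MODEL_WPLUG_ERROR;
}
```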

To match the new data structures used for zoned disks, the function
disk_free_zone_bitmaps() is renamed to the more generic
disk_free_zone_resources(). The function disk_init_zone_resources() is
also introduced to initialize zone write plug resources when a gendisk
is allocated.

In order to guarantee that the user can simultaneously write up to a
number of zones equal to a device max active zone limit or max open zone
limit, zone write plugs are allocated using a mempool sized to the
maximum of these two device limits. For a device that does not have
active and open zone limits, 128 is used as the default mempool size.
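
The sizing rule above is simple enough to express as a small function.
The following is a user-space sketch with hypothetical names. It
assumes that a limit value of 0 means "no limit", which is how the
block layer reports the max open and max active zone limits.

```c
#include <assert.h>

/* Default pool size for devices without open/active zone limits. */
#define MODEL_DEFAULT_WPLUG_POOL_SIZE	128U

/*
 * Hypothetical helper: number of zone write plugs to reserve so that
 * the user can write simultaneously to as many zones as the device
 * allows. A limit of 0 is assumed to mean "no limit".
 */
static unsigned int model_wplug_pool_size(unsigned int max_active_zones,
					  unsigned int max_open_zones)
{
	unsigned int nr = max_active_zones > max_open_zones ?
		max_active_zones : max_open_zones;

	/* Neither limit is set: fall back to the default size. */
	return nr ? nr : MODEL_DEFAULT_WPLUG_POOL_SIZE;
}
```

With this rule, a device reporting only one of the two limits gets a
pool sized to that limit, since the other one is 0.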

If a change to the device active and open zone limits is detected, the
disk mempool is resized when blk_revalidate_disk_zones() is executed.

This commit contains contributions from Christoph Hellwig <hch@lst.de>.

Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
---
 block/bio.c               |    6 +
 block/blk-merge.c         |   11 +
 block/blk-mq.c            |   32 +-
 block/blk-zoned.c         | 1090 ++++++++++++++++++++++++++++++++++++-
 block/blk.h               |   47 +-
 block/genhd.c             |    3 +-
 include/linux/blk-mq.h    |    2 +
 include/linux/blk_types.h |    8 +-
 include/linux/blkdev.h    |   12 +
 9 files changed, 1200 insertions(+), 11 deletions(-)

Message ID	20240403084247.856481-8-dlemoal@kernel.org
State	Superseded
From	Damien Le Moal <dlemoal@kernel.org>
To	linux-block@vger.kernel.org, Jens Axboe <axboe@kernel.dk>, linux-scsi@vger.kernel.org, "Martin K . Petersen" <martin.petersen@oracle.com>, dm-devel@lists.linux.dev, Mike Snitzer <snitzer@redhat.com>, linux-nvme@lists.infradead.org, Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>
Subject	[PATCH v5 07/28] block: Introduce zone write plugging
Date	Wed, 3 Apr 2024 17:42:26 +0900