
[00/26] Zone write plugging

Message ID 20240202073104.2418230-1-dlemoal@kernel.org

Message

Damien Le Moal Feb. 2, 2024, 7:30 a.m. UTC
This patch series introduces zone write plugging (ZWP) as the new
mechanism to control the ordering of writes to zoned block devices.
ZWP replaces zone write locking (ZWL), which is implemented only by
mq-deadline today. ZWP also allows emulating zone append operations
using regular writes for zoned devices that do not natively support
this operation (e.g. SMR HDDs). This series removes the zone append
emulation in the scsi disk driver and in device mapper in favor of the
ZWP emulation.

Unlike ZWL which operates on requests, ZWP operates on BIOs. A zone
write plug is simply a BIO list that is atomically manipulated using a
spinlock and a kblockd submission work. A write BIO to a zone is
"plugged" to delay its execution if a write BIO for the same zone was
already issued, that is, if a write request for the same zone is being
executed. The next plugged BIO is unplugged and issued once the write
request completes.
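As a rough illustration only (the names and data structures below are
hypothetical, not the kernel's; the real implementation uses a BIO
list, a spinlock and a kblockd work item), the plug/unplug behavior
described above can be modeled as a per-zone FIFO with at most one
write in flight:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_PLUGGED 16

/* Toy model of one zone's write plug: a FIFO of pending write "BIOs"
 * (tracked by integer id) with at most one write in flight per zone. */
struct zone_wplug {
	int pending[MAX_PLUGGED];
	size_t head, tail;
	bool write_in_flight;
};

/* Submit a write BIO to the zone: returns the id to issue now, or -1
 * if the BIO was plugged behind an already in-flight write. */
static int zwp_submit_bio(struct zone_wplug *zwp, int bio_id)
{
	if (zwp->write_in_flight) {
		zwp->pending[zwp->tail++ % MAX_PLUGGED] = bio_id;
		return -1;
	}
	zwp->write_in_flight = true;
	return bio_id;
}

/* Complete the in-flight write: returns the next plugged BIO id to
 * issue (which becomes the new in-flight write), or -1 if the zone
 * has no more pending writes. */
static int zwp_complete_bio(struct zone_wplug *zwp)
{
	if (zwp->head == zwp->tail) {
		zwp->write_in_flight = false;
		return -1;
	}
	return zwp->pending[zwp->head++ % MAX_PLUGGED];
}
```

The key property the sketch captures is that plugged writes are held
before issuing, so at most one write per zone is ever outstanding.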

This mechanism makes it possible to:
 - Untangle zone write ordering from the block IO schedulers. This
   allows removing the restriction on using only mq-deadline for zoned
   block devices. Any block IO scheduler, including "none", can be used.
 - Operate on BIOs instead of requests. Plugged BIOs waiting for
   execution thus do not hold scheduling tags and so do not prevent
   other BIOs from being submitted to the device (reads, or writes to
   other zones). Depending on the workload, this can significantly
   improve device utilization and performance.
 - Use ZWP for both blk-mq (request-based) zoned devices and BIO-based
   devices (e.g. device mapper). ZWP is mandatory for the former but
   optional for the latter: a BIO-based driver can use zone write
   plugging to implement write ordering guarantees, or implement its
   own mechanism if needed.
 - Keep the changes less invasive in the block layer and in device
   drivers. The ZWP implementation is mostly limited to blk-zoned.c,
   with some small changes in blk-mq.c, blk-merge.c and bio.c.

Performance evaluation results are shown below.

The series is organized as follows:

 - Patches 1 to 5 are preparatory changes for patch 6.
 - Patch 6 introduces ZWP.
 - Patches 7 and 8 add zone append emulation to ZWP.
 - Patches 9 to 16 modify zoned block device drivers to use ZWP and
   prepare for the removal of ZWL.
 - Patches 17 to 24 remove zone write locking.
 - Finally, patches 25 and 26 improve ZWP (memory usage reduction and
   debugfs attributes).

Overall, these changes do not increase the amount of code: the diffstat
shows a small reduction, and the reduction is in fact much larger if
comments are ignored.

Many thanks must go to Christoph Hellwig for comments and suggestions
he provided on earlier versions of these patches.

Performance evaluation results
==============================

Environments:
 - Xeon 8-cores/16-threads, 128GB of RAM
 - Kernel:
   - Baseline: 6.8-rc2, Linus tree as of 2024-02-01
   - Baseline-next: Jens block/for-next branch as of 2024-02-01
   - ZWP: Jens block/for-next patched to add zone write plugging
   (all kernels were compiled with the same configuration turning off
   most heavy debug features)

Workloads:
 - seqw4K1: 4KB sequential write, qd=1
 - seqw4K16: 4KB sequential write, qd=16
 - seqw1M16: 1MB sequential write, qd=16
 - rndw4K16: 4KB random write, qd=16
 - rndw128K16: 128KB random write, qd=16
 - btrfs workload: single fio job writing 128 MB files using 128 KB
   direct IOs at qd=16.

Devices:
 - nullblk (zoned): 4096 zones of 256 MB, no zone resource limits.
 - NVMe ZNS drive: 1 TB ZNS drive with 2GB zone size, 14 max open/active
   zones.
 - SMR HDD: 26 TB disk with 256MB zone size and 128 max open zones.

For ZWP, the results show the performance percentage increase (or
decrease) against the current for-next branch.
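For reference, the percentage deltas in the tables below appear to be
the relative change of a ZWP result against the Baseline-next value,
truncated toward zero; this truncation convention is inferred from the
table values, not stated anywhere, so treat the sketch as a reading
aid only:

```c
#include <assert.h>

/* Relative change of a result against its baseline, in percent,
 * truncated toward zero (the C integer cast does the truncation). */
static int pct_delta(double result, double baseline)
{
	return (int)((result - baseline) / baseline * 100.0);
}
```

For example, the ZWP/mq-deadline seqw4K1 entry of table 1 is
pct_delta(946, 921) = +2%.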

1) null_blk zoned device:

               +---------+----------+----------+----------+------------+
               | seqw4K1 | seqw4K16 | seqw1M16 | rndw4K16 | rndw128K16 |
               | (MB/s)  |  (MB/s)  |  (MB/s)  |  (KIOPS) |   (KIOPS)  |
+--------------+---------+----------+----------+----------+------------+
|   Baseline   | 1005    | 881      | 15600    | 564      | 217        |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
| Baseline-next| 921     | 813      | 14300    | 817      | 330        |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 946     | 826      | 15000    | 935      | 358        |
|  mq-deadline |(+2%)    | (+1%)    | (+4%)    | (+14%)   | (+8%)      |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 2937    | 1882     | 19900    | 2286     | 709        |
|     none     | (+218%) | (+131%)  | (+39%)   | (+179%)  | (+114%)    |
+--------------+---------+----------+----------+----------+------------+

The for-next mq-deadline changes and ZWP significantly increase random
write performance but slightly reduce sequential write performance
compared to ZWL. However, ZWP's ability to run fast block devices with
the none scheduler results in very large performance increases for all
workloads.

2) NVMe ZNS drive:

               +---------+----------+----------+----------+------------+
               | seqw4K1 | seqw4K16 | seqw1M16 | rndw4K16 | rndw128K16 |
               | (MB/s)  |  (MB/s)  |  (MB/s)  |  (KIOPS) |   (KIOPS)  |
+--------------+---------+----------+----------+----------+------------+
|   Baseline   | 183     | 798      | 1104     | 53.5     | 14.6       |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
| Baseline-next| 180     | 261      | 1113     | 51.6     | 14.9       |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 181     | 671      | 1109     | 51.7     | 14.7       |
|  mq-deadline |(+0%)    | (+157%)  | (+0%)    | (+0%)    | (-1%)      |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 190     | 660      | 1106     | 51.4     | 15.1       |
|     none     | (+5%)   | (+152%)  | (+0%)    | (-0%)    | (+1%)      |
+--------------+---------+----------+----------+----------+------------+

The current block/for-next significantly regresses sequential small
write performance at high queue depth due to lost BIO merge
opportunities. ZWP corrects this but is not as efficient as ZWL for
this workload.

3) SMR SATA HDD:

               +---------+----------+----------+----------+------------+
               | seqw4K1 | seqw4K16 | seqw1M16 | rndw4K16 | rndw128K16 |
               | (MB/s)  |  (MB/s)  |  (MB/s)  |  (IOPS)  |   (IOPS)   |
+--------------+---------+----------+----------+----------+------------+
|   Baseline   | 121     | 251      | 251      | 2471     | 664        |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
| Baseline-next| 121     | 137      | 249      | 2428     | 649        |
|  mq-deadline |         |          |          |          |            |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 118     | 137      | 251      | 2415     | 651        |
|  mq-deadline |(-2%)    | (+0%)    | (+0%)    | (+0%)    | (+0%)      |
+--------------+---------+----------+----------+----------+------------+
|      ZWP     | 117     | 238      | 251      | 2400     | 666        |
|     none     | (-3%)   | (+73%)   | (+0%)    | (-1%)    | (+2%)      |
+--------------+---------+----------+----------+----------+------------+

Same observation as for ZNS: for-next regresses sequential high-QD
performance, but ZWP restores better performance, though still slightly
lower than with ZWL.

4) Zone append tests using btrfs:

                +-------------+-------------+-----------+-------------+
                |  null-blk   |  null_blk   |    ZNS    |     SMR     |
                |  native ZA  | emulated ZA | native ZA | emulated ZA |
                |    (MB/s)   |   (MB/s)    |   (MB/s)  |    (MB/s)   |
+---------------+-------------+-------------+-----------+-------------+
|    Baseline   | 2412        | N/A         | 1080      | 203         |
|   mq-deadline |             |             |           |             |
+---------------+-------------+-------------+-----------+-------------+
| Baseline-next | 2471        | N/A         | 1084      | 209         |
|  mq-deadline  |             |             |           |             |
+---------------+-------------+-------------+-----------+-------------+
|      ZWP      | 2397        | 3025        | 1085      | 245         |
|  mq-deadline  | (-2%)       |             | (+0%)     | (+17%)      |
+---------------+-------------+-------------+-----------+-------------+
|      ZWP      | 2614        | 3301        | 1082      | 247         |
|      none     | (+5%)       |             | (-0%)     | (+18%)      |
+---------------+-------------+-------------+-----------+-------------+

With a more realistic use of the device by the FS, ZWP significantly
improves SMR HDD performance thanks to the more efficient zone append
emulation compared to ZWL.

Damien Le Moal (26):
  block: Restore sector of flush requests
  block: Remove req_bio_endio()
  block: Introduce bio_straddle_zones() and bio_offset_from_zone_start()
  block: Introduce blk_zone_complete_request_bio()
  block: Allow using bio_attempt_back_merge() internally
  block: Introduce zone write plugging
  block: Allow zero value of max_zone_append_sectors queue limit
  block: Implement zone append emulation
  block: Allow BIO-based drivers to use blk_revalidate_disk_zones()
  dm: Use the block layer zone append emulation
  scsi: sd: Use the block layer zone append emulation
  ublk_drv: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  null_blk: Do not request ELEVATOR_F_ZBD_SEQ_WRITE elevator feature
  null_blk: Introduce zone_append_max_sectors attribute
  null_blk: Introduce fua attribute
  nvmet: zns: Do not reference the gendisk conv_zones_bitmap
  block: Remove BLK_STS_ZONE_RESOURCE
  block: Simplify blk_revalidate_disk_zones() interface
  block: mq-deadline: Remove support for zone write locking
  block: Remove elevator required features
  block: Do not check zone type in blk_check_zone_append()
  block: Move zone related debugfs attribute to blk-zoned.c
  block: Remove zone write locking
  block: Do not special-case plugging of zone write operations
  block: Reduce zone write plugging memory usage
  block: Add zone_active_wplugs debugfs entry

 block/Kconfig                     |    4 -
 block/Makefile                    |    1 -
 block/bio.c                       |    7 +
 block/blk-core.c                  |   13 +-
 block/blk-flush.c                 |    1 +
 block/blk-merge.c                 |   22 +-
 block/blk-mq-debugfs-zoned.c      |   22 -
 block/blk-mq-debugfs.c            |    4 +-
 block/blk-mq-debugfs.h            |   11 +-
 block/blk-mq.c                    |  134 ++--
 block/blk-mq.h                    |   31 -
 block/blk-settings.c              |   51 +-
 block/blk-sysfs.c                 |    2 +-
 block/blk-zoned.c                 | 1143 ++++++++++++++++++++++++++---
 block/blk.h                       |   69 +-
 block/elevator.c                  |   46 +-
 block/elevator.h                  |    1 -
 block/genhd.c                     |    2 +-
 block/mq-deadline.c               |  176 +----
 drivers/block/null_blk/main.c     |   52 +-
 drivers/block/null_blk/null_blk.h |    2 +
 drivers/block/null_blk/zoned.c    |   32 +-
 drivers/block/ublk_drv.c          |    4 +-
 drivers/block/virtio_blk.c        |    2 +-
 drivers/md/dm-core.h              |   11 +-
 drivers/md/dm-zone.c              |  470 ++----------
 drivers/md/dm.c                   |   44 +-
 drivers/md/dm.h                   |    7 -
 drivers/nvme/host/zns.c           |    2 +-
 drivers/nvme/target/zns.c         |   10 +-
 drivers/scsi/scsi_lib.c           |    1 -
 drivers/scsi/sd.c                 |    8 -
 drivers/scsi/sd.h                 |   19 -
 drivers/scsi/sd_zbc.c             |  335 +--------
 include/linux/blk-mq.h            |   85 +--
 include/linux/blk_types.h         |   30 +-
 include/linux/blkdev.h            |  102 ++-
 37 files changed, 1453 insertions(+), 1503 deletions(-)
 delete mode 100644 block/blk-mq-debugfs-zoned.c

Comments

Jens Axboe Feb. 3, 2024, 12:11 p.m. UTC | #1
On 2/2/24 12:37 AM, Damien Le Moal wrote:
> On 2/2/24 16:30, Damien Le Moal wrote:
>> The patch series introduces zone write plugging (ZWP) as the new
>> mechanism to control the ordering of writes to zoned block devices.
>> ZWP replaces zone write locking (ZWL) which is implemented only by
>> mq-deadline today. ZWP also allows emulating zone append operations
>> using regular writes for zoned devices that do not natively support this
>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>> driver and device mapper zone append emulation to use ZWP emulation.
>>
>> Unlike ZWL which operates on requests, ZWP operates on BIOs. A zone
>> write plug is simply a BIO list that is atomically manipulated using a
>> spinlock and a kblockd submission work. A write BIO to a zone is
>> "plugged" to delay its execution if a write BIO for the same zone was
>> already issued, that is, if a write request for the same zone is being
>> executed. The next plugged BIO is unplugged and issued once the write
>> request completes.
>>
>> This mechanism allows to:
>>  - Untangle zone write ordering from the block IO schedulers. This
>>    allows removing the restriction on using only mq-deadline for zoned
>>    block devices. Any block IO scheduler, including "none" can be used.
>>  - Zone write plugging operates on BIOs instead of requests. Plugged
>>    BIOs waiting for execution thus do not hold scheduling tags and thus
>>    do not prevent other BIOs from being submitted to the device (reads
>>    or writes to other zones). Depending on the workload, this can
>>    significantly improve the device use and the performance.
>>  - Both blk-mq (request) based zoned devices and BIO-based devices (e.g.
>>    device mapper) can use ZWP. It is mandatory for the
>>    former but optional for the latter: BIO-based driver can use zone
>>    write plugging to implement write ordering guarantees, or the drivers
>>    can implement their own if needed.
>>  - The code is less invasive in the block layer and in device drivers.
>>    ZWP implementation is mostly limited to blk-zoned.c, with some small
>>    changes in blk-mq.c, blk-merge.c and bio.c.
>>
>> Performance evaluation results are shown below.
>>
>> The series is organized as follows:
> 
> I forgot to mention that the patches are against Jens block/for-next
> branch with the addition of Christoph's "clean up blk_mq_submit_bio"
> patches [1] and my patch "null_blk: Always split BIOs to respect queue
> limits" [2].

I figured that was the case, I'll get both of these properly setup in a
for-6.9/block branch, just wanted -rc3 to get cut first. JFYI that they
are coming tomorrow.
Bart Van Assche Feb. 5, 2024, 5:21 p.m. UTC | #2
On 2/1/24 23:30, Damien Le Moal wrote:
> The patch series introduces zone write plugging (ZWP) as the new
> mechanism to control the ordering of writes to zoned block devices.
> ZWP replaces zone write locking (ZWL) which is implemented only by
> mq-deadline today. ZWP also allows emulating zone append operations
> using regular writes for zoned devices that do not natively support this
> operation (e.g. SMR HDDs). This patch series removes the scsi disk
> driver and device mapper zone append emulation to use ZWP emulation.

How are SCSI unit attention conditions handled?

Thanks,

Bart.
Bart Van Assche Feb. 5, 2024, 6:18 p.m. UTC | #3
On 2/1/24 23:30, Damien Le Moal wrote:
>   - Zone write plugging operates on BIOs instead of requests. Plugged
>     BIOs waiting for execution thus do not hold scheduling tags and thus
>     do not prevent other BIOs from being submitted to the device (reads
>     or writes to other zones). Depending on the workload, this can
>     significantly improve the device use and the performance.

Deep queues may introduce performance problems. In Android we had to
restrict the number of pending writes to the device queue depth because
otherwise read latency is too high (e.g. to start the camera app).

I'm not convinced that queuing zoned write bios is a better approach
than queuing zoned write requests.

Are there numbers available about the performance differences
(bandwidth and latency) between plugging zoned write bios and plugging
zoned write requests?

Thanks,

Bart.
Damien Le Moal Feb. 5, 2024, 11:42 p.m. UTC | #4
On 2/6/24 02:21, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>> The patch series introduces zone write plugging (ZWP) as the new
>> mechanism to control the ordering of writes to zoned block devices.
>> ZWP replaces zone write locking (ZWL) which is implemented only by
>> mq-deadline today. ZWP also allows emulating zone append operations
>> using regular writes for zoned devices that do not natively support this
>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>> driver and device mapper zone append emulation to use ZWP emulation.
> 
> How are SCSI unit attention conditions handled?

???? How does that have anything to do with this series?
Whatever SCSI sd is doing with unit attention conditions remains the
same. I did not touch that.
Damien Le Moal Feb. 6, 2024, 12:07 a.m. UTC | #5
On 2/6/24 03:18, Bart Van Assche wrote:
> On 2/1/24 23:30, Damien Le Moal wrote:
>>   - Zone write plugging operates on BIOs instead of requests. Plugged
>>     BIOs waiting for execution thus do not hold scheduling tags and thus
>>     do not prevent other BIOs from being submitted to the device (reads
>>     or writes to other zones). Depending on the workload, this can
>>     significantly improve the device use and the performance.
> 
> Deep queues may introduce performance problems. In Android we had to
> restrict the number of pending writes to the device queue depth because
> otherwise read latency is too high (e.g. to start the camera app).

With zone write plugging, BIOs are delayed well above the scheduler and
device. BIOs that are plugged/delayed by ZWP do not hold tags, not even
a scheduler tag, which allows reads (which are never plugged) to
proceed. That is unlike zone write locking, which can hold on to all
scheduler tags, thus preventing reads from proceeding.

> I'm not convinced that queuing zoned write bios is a better approach than
> queuing zoned write requests.

Well, I do not see why not. The above point is, on its own, a good
enough argument to me. And various tests with btrfs showed that even
with a slow HDD I can see better overall throughput with ZWP compared
to zone write locking. And for fast solid-state zoned devices
(NVMe/UFS), you do not even need an IO scheduler anymore.

> 
> Are there numbers available about the performance differences (bandwidth
> and latency) between plugging zoned write bios and zoned write plugging
> requests?

Finish reading the cover letter. It has lots of measurements with rc2,
Jens' block/for-next and ZWP...

I actually reran all these perf tests over the weekend, but this time
did 10 runs and took the average for comparison. Overall, I confirmed
the results shown in the cover letter: performance is generally on par
with ZWL or better, but there is one exception: small sequential writes
at high qd. There seems to be an issue with regular plugging
(current->plug) which results in lost merging opportunities, causing
the performance regression. I am digging into that to understand what
is happening.
Bart Van Assche Feb. 6, 2024, 12:57 a.m. UTC | #6
On 2/5/24 15:42, Damien Le Moal wrote:
> On 2/6/24 02:21, Bart Van Assche wrote:
>> On 2/1/24 23:30, Damien Le Moal wrote:
>>> The patch series introduces zone write plugging (ZWP) as the new
>>> mechanism to control the ordering of writes to zoned block devices.
>>> ZWP replaces zone write locking (ZWL) which is implemented only by
>>> mq-deadline today. ZWP also allows emulating zone append operations
>>> using regular writes for zoned devices that do not natively support this
>>> operation (e.g. SMR HDDs). This patch series removes the scsi disk
>>> driver and device mapper zone append emulation to use ZWP emulation.
>>
>> How are SCSI unit attention conditions handled?
> 
> ???? How does that have anything to do with this series ?
> Whatever SCSI sd is doing with unit attention conditions remains the same. I did
> not touch that.

I wrote my question before I realized that this patch series restricts
the number of outstanding writes to one per zone. Hence, there is no
risk of unaligned write pointer errors caused by writes being reordered
after unit attention conditions, and my question can be ignored :-)

Thanks,

Bart.
Damien Le Moal Feb. 9, 2024, 4:03 a.m. UTC | #7
On 2/6/24 10:25, Bart Van Assche wrote:
> On 2/5/24 16:07, Damien Le Moal wrote:
>> On 2/6/24 03:18, Bart Van Assche wrote:
>>> Are there numbers available about the performance differences (bandwidth
>>> and latency) between plugging zoned write bios and zoned write plugging
>>> requests?
>>
>> Finish reading the cover letter. It has lots of measurements with rc2, Jens
>> block/for-next and ZWP...
> Hmm ... as far as I know nobody ever implemented zoned write plugging
> for requests in the block layer core so these numbers can't be in the
> cover letter.

No, I have not implemented zone write plugging for requests, as I
believe it would lead to very similar results to zone write locking,
that is, a potential problem with efficiently using a device under a
mixed read/write workload, since having too many plugged writes can
lead to read starvation (blocking of read submission on request
allocation when nr_requests is reached).

> Has the bio plugging approach perhaps been chosen because it works
> better for bio-based device mapper drivers?

Not that it "works better", but rather that doing the plugging at the
BIO level allows reusing the exact same code for zone append emulation
and for write ordering (if a DM driver wants the block layer to handle
that). We had zone append emulation implemented for DM (for dm-crypt)
using BIOs and in the scsi sd driver using requests. ZWP unifies all
this and will trivially allow enabling that emulation for other device
types as well (e.g. NVMe ZNS drives that do not have native zone
append support).
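The essence of the emulation being discussed can be sketched as a toy
userspace model (hypothetical names, not the kernel code): a zone
append is turned into a regular write at the zone's current write
pointer, which advances while the zone write plug serializes writes,
and the assigned sector is reported to the caller on completion, as
native zone append would do.

```c
#include <assert.h>

#define ZONE_FULL (~0ULL)

/* Illustrative per-zone state used to emulate zone append with a
 * regular write; updates are assumed serialized by the zone plug. */
struct zone_state {
	unsigned long long start;    /* first sector of the zone */
	unsigned long long wp;       /* current write pointer */
	unsigned long long capacity; /* writable sectors in the zone */
};

/* Assign a sector to an append of nr_sectors: the data is then issued
 * as a regular write at the returned sector, which is also the sector
 * reported back to the caller on completion. Returns ZONE_FULL if the
 * append would not fit in the zone. */
static unsigned long long emulate_zone_append(struct zone_state *z,
					      unsigned long long nr_sectors)
{
	unsigned long long sector = z->wp;

	if (z->wp + nr_sectors > z->start + z->capacity)
		return ZONE_FULL;
	z->wp += nr_sectors;
	return sector;
}
```

Because the plug guarantees one write in flight per zone, the write
pointer seen here cannot race with another emulated append to the same
zone.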
Damien Le Moal Feb. 9, 2024, 5:28 a.m. UTC | #8
On 2/3/24 21:11, Jens Axboe wrote:
>> I forgot to mention that the patches are against Jens block/for-next
>> branch with the addition of Christoph's "clean up blk_mq_submit_bio"
>> patches [1] and my patch "null_blk: Always split BIOs to respect queue
>> limits" [2].
> 
> I figured that was the case, I'll get both of these properly setup in a
> for-6.9/block branch, just wanted -rc3 to get cut first. JFYI that they
> are coming tomorrow.

Jens,

I saw the updated rc3-based for-next branch. Thanks for that. But it
seems that you removed the mq-deadline insert optimization? Was that on
purpose, or did I mess up something?