mbox series

[PATCHv9,0/7] write hints with nvme fdp, scsi streams

Message ID 20241025213645.3464331-1-kbusch@meta.com
Headers show
Series write hints with nvme fdp, scsi streams | expand

Message

Keith Busch Oct. 25, 2024, 9:36 p.m. UTC
From: Keith Busch <kbusch@kernel.org>

A little something for everyone here.

Upfront, I really didn't get the feedback about setting different flags
for stream vs. temperature support. Who wants to use it, and where and
how is that information used?

Changes from v8:

  Added reviews.

  Removed an unused header.

  Changed "hint" to "streams" in the commit logs.

  Ability to split available hints that a partition can use.

  Dropped all the generic filesystem changes that were defaulting to the
  kiocb write_hint. They are unchanged, which having no functional
  change was really the intention anyway, so let's just not change them.

  The above means we don't need a special fop flag to indicate support
  for the kiocb write_hint. Those filesystems that don't support it
  simply don't read it.

  Added the SCSI support since I had to read the spec anyway, and it is
  just a one-line change.

Kanchan Joshi (2):
  io_uring: enable per-io hinting capability
  nvme: enable FDP support

Keith Busch (5):
  block: use generic u16 for write hints
  block: introduce max_write_hints queue limit
  block: allow ability to limit partition write hints
  block, fs: add write hint to kiocb
  scsi: set permanent stream count in block limits

 Documentation/ABI/stable/sysfs-block |  7 +++
 block/bdev.c                         | 15 +++++
 block/blk-settings.c                 |  3 +
 block/blk-sysfs.c                    |  3 +
 block/fops.c                         | 26 ++++++++-
 block/partitions/core.c              | 46 +++++++++++++++-
 drivers/nvme/host/core.c             | 82 ++++++++++++++++++++++++++++
 drivers/nvme/host/nvme.h             |  5 ++
 drivers/scsi/sd.c                    |  2 +
 include/linux/blk-mq.h               |  3 +-
 include/linux/blk_types.h            |  4 +-
 include/linux/blkdev.h               | 12 ++++
 include/linux/fs.h                   |  1 +
 include/linux/nvme.h                 | 19 +++++++
 include/uapi/linux/io_uring.h        |  4 ++
 io_uring/rw.c                        |  3 +-
 16 files changed, 225 insertions(+), 10 deletions(-)

Comments

Christoph Hellwig Oct. 28, 2024, 11:49 a.m. UTC | #1
On Fri, Oct 25, 2024 at 02:36:38PM -0700, Keith Busch wrote:
> Upfront, I really didn't get the feedback about setting different flags
> for stream vs. temperature support. Who wants to use it, and where and
> how is that information used?

The SBC permanent streams have a 'relative lifetime' associated with
them.  So if you expose them as plain write streams you violate the
contract with the device.
Christoph Hellwig Oct. 28, 2024, 11:51 a.m. UTC | #2
On Fri, Oct 25, 2024 at 02:36:40PM -0700, Keith Busch wrote:
> +static inline unsigned short bdev_max_write_hints(struct block_device *bdev)
> +{
> +	return queue_max_write_hints(bdev_get_queue(bdev));
> +}

As pointed out by Bart last time, you can't simply give the write hints
to all block device.  Assume we'd want to wire up the write stream based
separate to f2fs (which btw would be a good demonstration), and you'd
have two different f2fs file systems on separate partitions that'd
now start sharing the write streams if they simply started from stream
1.  Same for our pending XFS data placement work.
Christoph Hellwig Oct. 28, 2024, 11:52 a.m. UTC | #3
On Mon, Oct 28, 2024 at 12:51:32PM +0100, Christoph Hellwig wrote:
> As pointed out by Bart last time, you can't simply give the write hints
> to all block device.  Assume we'd want to wire up the write stream based
> separate to f2fs (which btw would be a good demonstration), and you'd
> have two different f2fs file systems on separate partitions that'd
> now start sharing the write streams if they simply started from stream
> 1.  Same for our pending XFS data placement work.

And I'm an idiot and should have looked at the next patch patch first.
Sorry for that.
Bart Van Assche Oct. 28, 2024, 6:19 p.m. UTC | #4
On 10/25/24 2:36 PM, Keith Busch wrote:
> This is still backwards compatible with lifetime hints. It just doesn't
> constrain the hints to that definition.

Since struct bio is modified, and since it is important not to increase
the size of struct bio, some comments about whether or not the size of
struct bio is affected would be welcome.

Thanks,

Bart.
Keith Busch Oct. 28, 2024, 6:38 p.m. UTC | #5
On Mon, Oct 28, 2024 at 11:19:33AM -0700, Bart Van Assche wrote:
> On 10/25/24 2:36 PM, Keith Busch wrote:
> > This is still backwards compatible with lifetime hints. It just doesn't
> > constrain the hints to that definition.
> 
> Since struct bio is modified, and since it is important not to increase
> the size of struct bio, some comments about whether or not the size of
> struct bio is affected would be welcome.

Sure. The type just shrinks a hole from 2 bytes to 1. Total size remains
the same.
Anuj gupta Oct. 29, 2024, 12:46 p.m. UTC | #6
On Sat, Oct 26, 2024 at 3:13 AM Keith Busch <kbusch@meta.com> wrote:
>
> From: Kanchan Joshi <joshi.k@samsung.com>
>
> With F_SET_RW_HINT fcntl, user can set a hint on the file inode, and
> all the subsequent writes on the file pass that hint value down. This
> can be limiting for block device as all the writes will be tagged with
> only one lifetime hint value. Concurrent writes (with different hint
> values) are hard to manage. Per-IO hinting solves that problem.
>
> Allow userspace to pass additional metadata in the SQE.
>
>         __u16 write_hint;
>
> If the hint is provided, filesystems may optionally use it. A filesytem
> may ignore this field if it does not support per-io hints, or if the
> value is invalid for its backing storage. Just like the inode hints,
> requesting values that are not supported by the hardware are not an
> error.
>
> Reviewed-by: Hannes Reinecke <hare@suse.de>
> Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
> Signed-off-by: Nitesh Shetty <nj.shetty@samsung.com>
> Signed-off-by: Keith Busch <kbusch@kernel.org>
> ---
>  include/uapi/linux/io_uring.h | 4 ++++
>  io_uring/rw.c                 | 3 ++-
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 60b9c98595faf..8cdcc461d464c 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -92,6 +92,10 @@ struct io_uring_sqe {
>                         __u16   addr_len;
>                         __u16   __pad3[1];
>                 };
> +               struct {
> +                       __u16   write_hint;
> +                       __u16   __pad4[1];
> +               };
>         };
>         union {
>                 struct {
> diff --git a/io_uring/rw.c b/io_uring/rw.c
> index 8080ffd6d5712..5a1231bfecc3a 100644
> --- a/io_uring/rw.c
> +++ b/io_uring/rw.c
> @@ -279,7 +279,8 @@ static int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe,
>                 rw->kiocb.ki_ioprio = get_current_ioprio();
>         }
>         rw->kiocb.dio_complete = NULL;
> -
> +       if (ddir == ITER_SOURCE)
> +               rw->kiocb.ki_write_hint = READ_ONCE(sqe->write_hint);
>         rw->addr = READ_ONCE(sqe->addr);
>         rw->len = READ_ONCE(sqe->len);
>         rw->flags = READ_ONCE(sqe->rw_flags);
> --
> 2.43.5
>

Since this patch adds a couple of new fields, it makes sense to add
BUILD_BUG_ON() checks in io_uring_init for these fields to assert the
layout of struct io_uring_sqe. And probably a zero check for pad4 in
io_prep_rw.
--
Anuj Gupta