mbox series

[v9,00/11] Read/Write with meta/integrity

Message ID 20241114104517.51726-1-anuj20.g@samsung.com
Headers show
Series Read/Write with meta/integrity | expand

Message

Anuj Gupta Nov. 14, 2024, 10:45 a.m. UTC
This adds a new io_uring interface to pass additional attributes with
read/write. It can be done using two ways.

1. Passing the attribute information via user pointer.
2. Passing the attribute information inline with SQE/SQE128.

We have had discussions in the past about using the SQE128 space for passing
PI fields. Still if there are concerns about it, can we please consider
this patchset for inclusion only with the user pointer approach for now
(and drop patch 7)?

Example program for using the interface is appended below [1].
In the program, write is done via inline scheme and read is done via user
pointer scheme.

The patchset is on top of block/for-next.

Block path (direct IO) , NVMe and SCSI driver are modified to support
this.

Patch 1 is an enhancement patch.
Patch 2 is required to make the bounce buffer copy back work correctly.
Patch 3 to 5 are prep patches.
Patch 6 and 7 adds the io_uring support.
Patch 8 gives us unified interface for user and kernel generated
integrity.
Patch 9 adds support in SCSI and patch 10 in NVMe.
Patch 11 adds the support for block direct IO.

Changes since v8:
https://lore.kernel.org/io-uring/20241106121842.5004-1-anuj20.g@samsung.com/

- add option of the pass the PI information from user space via a
  pointer (Pavel)

Changes since v7:
https://lore.kernel.org/io-uring/20241104140601.12239-1-anuj20.g@samsung.com/

- change the sign-off order (hch)
- add a check for doing metadata completion handling only for async-io
- change meta_type name to something more meaningful (hch, keith)
- add detail description in io-uring patch (hch)

Changes since v6:
https://lore.kernel.org/linux-block/20241030180112.4635-1-joshi.k@samsung.com/

- io_uring changes (bring back meta_type, move PI to the end of SQE128)
- Fix robot warnings

Changes since v5:
https://lore.kernel.org/linux-block/20241029162402.21400-1-anuj20.g@samsung.com/

- remove meta_type field from SQE (hch, keith)
- remove __bitwise annotation (hch)
- remove BIP_CTRL_NOCHECK from scsi (hch)

Changes since v4:
https://lore.kernel.org/linux-block/20241016112912.63542-1-anuj20.g@samsung.com/

- better variable names to describe bounce buffer copy back (hch)
- move defintion of flags in the same patch introducing uio_meta (hch)
- move uio_meta definition to include/linux/uio.h (hch)
- bump seed size in uio_meta to 8 bytes (martin)
- move flags definition to include/uapi/linux/fs.h (hch)
- s/meta/metadata in commit description of io-uring (hch)
- rearrange the meta fields in sqe for cleaner layout
- partial submission case is not applicable as, we are only plumbing for async case
- s/META_TYPE_INTEGRITY/META_TYPE_PI (hch, martin)
- remove unlikely branching (hch)
- Better formatting, misc cleanups, better commit descriptions, reordering commits(hch)

Changes since v3:
https://lore.kernel.org/linux-block/20240823103811.2421-1-anuj20.g@samsung.com/

- add reftag seed support (Martin)
- fix incorrect formatting in uio_meta (hch)
- s/IOCB_HAS_META/IOCB_HAS_METADATA (hch)
- move integrity check flags to block layer header (hch)
- add comments for BIP_CHECK_GUARD/REFTAG/APPTAG flags (hch)
- remove bio_integrity check during completion if IOCB_HAS_METADATA is set (hch)
- use goto label to get rid of duplicate error handling (hch)
- add warn_on if trying to do sync io with iocb_has_metadata flag (hch)
- remove check for disabling reftag remapping (hch)
- remove BIP_INTEGRITY_USER flag (hch)
- add comment for app_tag field introduced in bio_integrity_payload (hch)
- pass request to nvme_set_app_tag function (hch)
- right indentation at a place in scsi patch (hch)
- move IOCB_HAS_METADATA to a separate fs patch (hch)

Changes since v2:
https://lore.kernel.org/linux-block/20240626100700.3629-1-anuj20.g@samsung.com/
- io_uring error handling styling (Gabriel)
- add documented helper to get metadata bytes from data iter (hch)
- during clone specify "what flags to clone" rather than
"what not to clone" (hch)
- Move uio_meta defination to bio-integrity.h (hch)
- Rename apptag field to app_tag (hch)
- Change datatype of flags field in uio_meta to bitwise (hch)
- Don't introduce BIP_USER_CHK_FOO flags (hch, martin)
- Driver should rely on block layer flags instead of seeing if it is
user-passthrough (hch)
- update the scsi code for handling user-meta (hch, martin)

Changes since v1:
https://lore.kernel.org/linux-block/20240425183943.6319-1-joshi.k@samsung.com/
- Do not use new opcode for meta, and also add the provision to introduce new
meta types beyond integrity (Pavel)
- Stuff IOCB_HAS_META check in need_complete_io (Jens)
- Split meta handling in NVMe into a separate handler (Keith)
- Add meta handling for __blkdev_direct_IO too (Keith)
- Don't inherit BIP_COPY_USER flag for cloned bio's (Christoph)
- Better commit descriptions (Christoph)

Changes since RFC:
- modify io_uring plumbing based on recent async handling state changes
- fixes/enhancements to correctly handle the split for meta buffer
- add flags to specify guard/reftag/apptag checks
- add support to send apptag

[1]

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <linux/fs.h>
#include <linux/io_uring.h>
#include <linux/types.h>
#include "liburing.h"

/*
 * write data/meta. read both. compare. send apptag too.
 * prerequisite:
 * protected xfer: format namespace with 4KB + 8b, pi_type = 1
 * For testing reftag remapping on device-mapper, create a
 * device-mapper and run this program. Device mapper creation:
 * # echo 0 80 linear /dev/nvme0n1 0 > /tmp/table
 * # echo 80 160 linear /dev/nvme0n1 200 >> /tmp/table
 * # dmsetup create two /tmp/table
 * # ./a.out /dev/dm-0
 */

#define DATA_LEN 4096
#define META_LEN 8

struct t10_pi_tuple {
        __be16  guard;
        __be16  apptag;
        __be32  reftag;
};

int main(int argc, char *argv[])
{
         struct io_uring ring;
         struct io_uring_sqe *sqe = NULL;
         struct io_uring_cqe *cqe = NULL;
         void *wdb,*rdb;
         char wmb[META_LEN], rmb[META_LEN];
         int fd, ret, blksize;
         struct stat fstat;
         unsigned long long offset = DATA_LEN * 10;
         struct t10_pi_tuple *pi;
         struct io_uring_sqe_ext *sqe_ext;
	 struct io_uring_attr_pi pi_attr;
	 struct io_uring_attr_vec attr_vec;

         if (argc != 2) {
                 fprintf(stderr, "Usage: %s <block-device>", argv[0]);
                 return 1;
         };

         if (stat(argv[1], &fstat) == 0) {
                 blksize = (int)fstat.st_blksize;
         } else {
                 perror("stat");
                 return 1;
         }

         if (posix_memalign(&wdb, blksize, DATA_LEN)) {
                 perror("posix_memalign failed");
                 return 1;
         }
         if (posix_memalign(&rdb, blksize, DATA_LEN)) {
                 perror("posix_memalign failed");
                 return 1;
         }

         memset(wdb, 0, DATA_LEN);

         fd = open(argv[1], O_RDWR | O_DIRECT);
         if (fd < 0) {
                 printf("Error in opening device\n");
                 return 0;
         }

         ret = io_uring_queue_init(8, &ring, IORING_SETUP_SQE128);
         if (ret) {
                 fprintf(stderr, "ring setup failed: %d\n", ret);
                 return 1;
         }

         /* write data + meta-buffer to device */
         sqe = io_uring_get_sqe(&ring);
         if (!sqe) {
                 fprintf(stderr, "get sqe failed\n");
                 return 1;
         }

         io_uring_prep_write(sqe, fd, wdb, DATA_LEN, offset);

	 sqe->attr_inline_flags = ATTR_FLAG_PI;
         sqe_ext= (struct io_uring_sqe_ext *) (sqe + 1);
         sqe_ext->rw_pi.addr = (__u64)wmb;
         sqe_ext->rw_pi.len = META_LEN;
         /* flags to ask for guard/reftag/apptag*/
         sqe_ext->rw_pi.flags = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG | IO_INTEGRITY_CHK_APPTAG;
         sqe_ext->rw_pi.app_tag = 0x1234;
         sqe_ext->rw_pi.seed = 10;

         pi = (struct t10_pi_tuple *)wmb;
         pi->guard = 0;
         pi->reftag = 0x0A000000;
         pi->apptag = 0x3412;

         ret = io_uring_submit(&ring);
         if (ret <= 0) {
                 fprintf(stderr, "sqe submit failed: %d\n", ret);
                 return 1;
         }

         ret = io_uring_wait_cqe(&ring, &cqe);
         if (!cqe) {
                 fprintf(stderr, "cqe is NULL :%d\n", ret);
                 return 1;
         }
         if (cqe->res < 0) {
                 fprintf(stderr, "write cqe failure: %d", cqe->res);
                 return 1;
         }

         io_uring_cqe_seen(&ring, cqe);

         /* read data + meta-buffer back from device */
         sqe = io_uring_get_sqe(&ring);
         if (!sqe) {
                 fprintf(stderr, "get sqe failed\n");
                 return 1;
         }

         io_uring_prep_read(sqe, fd, rdb, DATA_LEN, offset);

         sqe->nr_attr_indirect = 1;
	 sqe->attr_vec_addr = (__u64)&attr_vec;
	 attr_vec.type = ATTR_TYPE_PI;
	 attr_vec.addr = (__u64)&pi_attr;
         pi_attr.addr = (__u64)rmb;
         pi_attr.len = META_LEN;
         pi_attr.flags = IO_INTEGRITY_CHK_GUARD | IO_INTEGRITY_CHK_REFTAG | IO_INTEGRITY_CHK_APPTAG;
         pi_attr.app_tag = 0x1234;
         pi_attr.seed = 10;

         ret = io_uring_submit(&ring);
         if (ret <= 0) {
                 fprintf(stderr, "sqe submit failed: %d\n", ret);
                 return 1;
         }

         ret = io_uring_wait_cqe(&ring, &cqe);
         if (!cqe) {
                 fprintf(stderr, "cqe is NULL :%d\n", ret);
                 return 1;
         }

         if (cqe->res < 0) {
                 fprintf(stderr, "read cqe failure: %d", cqe->res);
                 return 1;
         }

	 pi = (struct t10_pi_tuple *)rmb;
	 if (pi->apptag != 0x3412)
		 printf("Failure: apptag mismatch!\n");
	 if (pi->reftag != 0x0A000000)
		 printf("Failure: reftag mismatch!\n");

         io_uring_cqe_seen(&ring, cqe);

         pi = (struct t10_pi_tuple *)rmb;

         if (strncmp(wmb, rmb, META_LEN))
                 printf("Failure: meta mismatch!, wmb=%s, rmb=%s\n", wmb, rmb);

         if (strncmp(wdb, rdb, DATA_LEN))
                 printf("Failure: data mismatch!\n");

         io_uring_queue_exit(&ring);
         free(rdb);
         free(wdb);
         return 0;
}


Anuj Gupta (8):
  block: define set of integrity flags to be inherited by cloned bip
  block: modify bio_integrity_map_user to accept iov_iter as argument
  fs, iov_iter: define meta io descriptor
  fs: introduce IOCB_HAS_METADATA for metadata
  io_uring: introduce attributes for read/write and PI support
  io_uring: inline read/write attributes and PI
  block: introduce BIP_CHECK_GUARD/REFTAG/APPTAG bip_flags
  scsi: add support for user-meta interface

Christoph Hellwig (1):
  block: copy back bounce buffer to user-space correctly in case of
    split

Kanchan Joshi (2):
  nvme: add support for passing on the application tag
  block: add support to pass user meta buffer

 block/bio-integrity.c         |  84 ++++++++++++++----
 block/blk-integrity.c         |  10 ++-
 block/fops.c                  |  45 +++++++---
 drivers/nvme/host/core.c      |  21 +++--
 drivers/scsi/sd.c             |   4 +-
 include/linux/bio-integrity.h |  25 ++++--
 include/linux/fs.h            |   1 +
 include/linux/uio.h           |   9 ++
 include/uapi/linux/fs.h       |   9 ++
 include/uapi/linux/io_uring.h |  40 +++++++++
 io_uring/io_uring.c           |   5 ++
 io_uring/rw.c                 | 160 +++++++++++++++++++++++++++++++++-
 io_uring/rw.h                 |  14 ++-
 13 files changed, 381 insertions(+), 46 deletions(-)

Comments

Christoph Hellwig Nov. 14, 2024, 12:16 p.m. UTC | #1
On Thu, Nov 14, 2024 at 04:15:12PM +0530, Anuj Gupta wrote:
> PI attribute is supported only for direct IO. Also, vectored read/write
> operations are not supported with PI currently.

Eww.  I know it's frustration for your if maintainers give contradicting
guidance, but this is really an awful interface.  Not only the pointless
indirection which make the interface hard to use, but limiting it to
not support vectored I/O makes it pretty useless.

I guess I need to do a little read-up on why Pavel wants this, but
from the block/fs perspective the previous interface made so much
more sense.
Pavel Begunkov Nov. 14, 2024, 1:09 p.m. UTC | #2
On 11/14/24 12:16, Christoph Hellwig wrote:
> On Thu, Nov 14, 2024 at 04:15:12PM +0530, Anuj Gupta wrote:
>> PI attribute is supported only for direct IO. Also, vectored read/write
>> operations are not supported with PI currently.

And my apologies Anuj, I've been busy, I hope to take a look
at this series today / tomorrow.

> Eww.  I know it's frustration for your if maintainers give contradicting
> guidance, but this is really an awful interface.  Not only the pointless

Because once you placed it at a fixed location nothing realistically
will be able to reuse it. Not everyone will need PI, but the assumption
that there will be more more additional types of attributes / parameters.

With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
of whether a particular request needs it or not, and the user will need
to zero them for each request.

The discussion continued in the v6 thread, here

https://lore.kernel.org/all/20241031065535.GA26299@lst.de/T/#m12beca2ede2bd2017796adb391bedec9c95d85c3

and a little bit more here:

https://lore.kernel.org/all/20241031065535.GA26299@lst.de/T/#mc3f7a95915a64551e061d37b33a643676c5d87b2

> indirection which make the interface hard to use, but limiting it to
> not support vectored I/O makes it pretty useless.

I'm not sure why that's the case and need to take a look), but I
don't immediately see how it's relevant to that part of the API. It
shouldn't really matter where the main PI structure is located, you
get an iovec pointer and code from there wouldn't be any different.

> I guess I need to do a little read-up on why Pavel wants this, but
> from the block/fs perspective the previous interface made so much
> more sense.
Pavel Begunkov Nov. 15, 2024, 4:40 p.m. UTC | #3
On 11/14/24 15:19, Christoph Hellwig wrote:
> On Thu, Nov 14, 2024 at 01:09:44PM +0000, Pavel Begunkov wrote:
>>> Eww.  I know it's frustration for your if maintainers give contradicting
>>> guidance, but this is really an awful interface.  Not only the pointless
>>
>> Because once you placed it at a fixed location nothing realistically
>> will be able to reuse it. Not everyone will need PI, but the assumption
>> that there will be more more additional types of attributes / parameters.
> 
> So?  If we have a strong enough requirement for something else we
> can triviall add another opcode.  Maybe we should just add different
> opcodes for read/write with metadata so that folks don't freak out
> about this?

IMHO, PI is not so special to have a special opcode for it unlike
some more generic read/write with meta / attributes, but that one
would have same questions.

FWIW, the series was steered from the separate opcode approach to avoid
duplicating things, for example there are 3 different OP_READ* opcodes
varying by the buffer type, and there is no reason meta reads wouldn't
want to support all of them as well. I have to admit that the effort is
a bit unfortunate on that side switching back a forth at least a couple
of times including attempts from 2+ years ago by some other guy.

>> With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
>> of whether a particular request needs it or not, and the user will need
>> to zero them for each request.
> 
> The user is not going to create a SQE128 ring unless they need to,
> so this seem like a bit of an odd objection.

It doesn't bring this overhead to those who don't use meta/PI, that's
right, but it does add it if you want to mix it with nearly all other
request types, and that is desirable.

As I mentioned before, it's just one downside but not a deal breaker.
I'm more concerned that the next type of meta information won't be
able to fit into the SQE and then we'll need to solve the same problem
(indirection + optimising copy_from_user with other means) while having
PI as a special case. And that's more of a problem of the static
placing from previous version, e.g. it wouldn't be a problem if in the
long run it becomes sth like:

struct attr attr, *p;

if (flags & META_IN_USE_SQE128)
	p = sqe + 1;
else {
	copy_from_user(&attr);
	p = &attr;
}

but that shouldn't be PI specific.
Jens Axboe Nov. 15, 2024, 5:44 p.m. UTC | #4
On 11/15/24 10:12 AM, Christoph Hellwig wrote:
> On Fri, Nov 15, 2024 at 04:40:58PM +0000, Pavel Begunkov wrote:
>>> So?  If we have a strong enough requirement for something else we
>>> can triviall add another opcode.  Maybe we should just add different
>>> opcodes for read/write with metadata so that folks don't freak out
>>> about this?
>>
>> IMHO, PI is not so special to have a special opcode for it unlike
>> some more generic read/write with meta / attributes, but that one
>> would have same questions.
> 
> Well, apparently is one the hand hand not general enough that you
> don't want to give it SQE128 space, but you also don't want to give
> it an opcode.
> 
> Maybe we just need make it uring_cmd to get out of these conflicting
> requirements.

Let's please lay off the hyperbole here, uring_cmd would be a terrible
way to do this. We're working through the flags requirements. Obviously
this is now missing 6.13, but there's no reason why it's not on track to
make 6.14 in a saner way.
Matthew Wilcox Nov. 15, 2024, 6:04 p.m. UTC | #5
On Thu, Nov 14, 2024 at 01:09:44PM +0000, Pavel Begunkov wrote:
> With SQE128 it's also a problem that now all SQEs are 128 bytes regardless
> of whether a particular request needs it or not, and the user will need
> to zero them for each request.

The way we handled this in NVMe was to use a bit in the command that
was called (iirc) FUSED, which let you use two consecutive entries for
a single command.

Some variant on that could surely be used for io_uring.  Perhaps a
special opcode that says "the real opcode is here, and this is a two-slot
command".  Processing gets a little spicy when one slot is the last in
the buffer and the next is the the first in the buffer, but that's a SMOP.
Pavel Begunkov Nov. 16, 2024, midnight UTC | #6
On 11/14/24 10:45, Anuj Gupta wrote:
> Add the ability to pass additional attributes along with read/write.
> Application can populate an array of 'struct io_uring_attr_vec' and pass
> its address using the SQE field:
> 	__u64	attr_vec_addr;
> 
> Along with number of attributes using:
> 	__u8	nr_attr_indirect;
> 
> Overall 16 attributes are allowed and currently one attribute
> 'ATTR_TYPE_PI' is supported.

Why only 16? It's possible that might need more, 256 would
be a safer choice and fits into u8. I don't think you even
need to commit to a number, it should be ok to add more as
long as it fits into the given types (u8 above). It can also
be u16 as well.

> With PI attribute, userspace can pass following information:
> - flags: integrity check flags IO_INTEGRITY_CHK_{GUARD/APPTAG/REFTAG}
> - len: length of PI/metadata buffer
> - addr: address of metadata buffer
> - seed: seed value for reftag remapping
> - app_tag: application defined 16b value

In terms of flexibility I like it apart from small nits,
but the double indirection could be a bit inefficient,
this thing:

struct pi_attr pi = {};
attr_array = { &pi, ... };
sqe->attr_addr = attr_array;

So maybe we should just flatten it? An attempt to pseudo
code it to understand what it entails is below. Certainly
buggy and some handling is omitted, but should show the
idea.

// uapi/.../io_uring.h

struct sqe {
	...
	u64 attr_addr;
	/* the total size of the array pointed by attr_addr */
	u16 attr_size; /* max 64KB, more than enough */
}

struct io_attr_header {
	/* bit mask of attributes passed, can be helpful in the future
	 * for optimising processing.
	 */
	u64 attr_type_map;
};

/* each attribute should start with a preamble */
struct io_uring_attr_preamble {
	u16 attr_type;
};

// user space

struct PI_param {
	struct io_attr_header header;
	struct io_uring_attr_preamble preamble;
	struct io_uring_attr_pi pi;
};

struct PI_param p = {};
p.header.map = 1 << ATTR_TYPE_PI;
p.preamble.type = ATTR_TYPE_PI;
p.pi = {...};

sqe->attr_addr = &p;
sqe->attr_size = sizeof(p);


The holes b/w structures should be packed better. For the same
reason I don't like a separate preamble structure much, maybe it
should be embedded into the attribute structures, e.g.

struct io_uring_attr_pi {
	u16 attr_type;
	...
}

The user side looks ok to me, should be pretty straightforward
if the user can define a structure like PI_param, i.e. knows
at compilation time which attributes it wants to use.

// kernel space (current patch set, PI only)

addr = READ_ONCE(sqe->attr_addr);
if (addr) {
	size = READ_ONCE(sqe->attr_size);
	process_pi(addr, size);
}

process_pi(addr, size) {
	struct PI_param p;

	if (size != sizeof(PI_attr + struct attr_preamble + struct attr_header))
		return -EINVAL;
	copy_from_user(p, addr, sizeof(p));
	if (p.preamble != ATTR_TYPE_PI)
		return -EINVAL;
	do_pi_setup(&p->pi);
}

This one is pretty simple as well. A bit more troublesome if
extended with many attributes, but it'd need additional handling
regardless:

process_pi(addr, size) {
	if (size < sizeof(header + preamble))
		return -EINVAL;

	attr_array = malloc(size); // +caching by io_uring
	copy_from_user(attr_array);
	handle_attributes(attr_array, size);
}

handle_attributes(attr_array, size) {
	struct io_attr_header *hdr = attr_array;
	offset = sizeof(*hdr);

	while (1) {
		if (offset + sizeof(struct preamble) > size)
			break;

		struct preamble *pr = attr_array + offset;
		if (pr->type > MAX_TYPES)
			return -EINVAL;
		attr_size = attr_sizes[pr->type];
		if (offset + sizeof(preamble) + attr_size > size)
			return -EINVAL;
		offset += sizeof(preamble) + attr_size;

		process_attr(pr->type, (void *)(pr + 1));
	}
}

Some checks can probably be optimised by playing with the uapi
a bit.

attr_type_map is unused here, but I like the idea. I'd love
to see all actual attribute handling to move deeper into the
stack to those who actually need it, but that's for far
away undecided future. E.g.

io_uring_rw {
	p = malloc();
	copy_from_user(p, sqe->attr_addr);
	kiocb->attributes = p;
}

block_do_read {
	hdr = kiocb->attributes;
	type_mask = /* all types block layer recognises */
	if (hdr->attr_type_map & type_mask)
		use_attributes();
}

copy_from_user can be optimised, I mentioned before, we can
have a pre-mapped area into which the indirection can point.
The infra is already in there and even used for passing
waiting arguments.

process_pi(addr, size) {
	struct PI_param *p, __p;

	if (some_flags & USE_REGISTERED_REGION) {
		// Glorified p = ctx->ptr; with some checks
		p = io_uring_get_mem(addr, size);
	} else {
		copy_from_user(__p, addr, sizeof(__p));
		p = &__p;
	}
	...
}

In this case all reads would need to be READ_ONCE, but that
shouldn't be a problem. It might also optimise out the kmalloc
in the extended version.