[v3,0/7] no-copy bvec

Message ID cover.1610170479.git.asml.silence@gmail.com
Series: no-copy bvec

Message

Pavel Begunkov Jan. 9, 2021, 4:02 p.m. UTC
Currently, when iomap and block direct IO get a bvec-based iterator, the
bvec array is copied, with all the associated accounting that costs CPU
time and, for larger bvecs, causes additional allocations. This patchset
makes them reuse the bvec passed in with the iterator instead.

[1,2] forbid zero-length bvec segments so special cases don't pile up,
and [3] skips/fixes PSI tracking so bvecs aren't iterated over an extra
time.


nullblk completion_nsec=0 submit_queues=NR_CORES, no merges, no stats
fio/t/io_uring /dev/nullb0 -d 128 -s 32 -c 32 -p 0 -B 1 -F 1 -b BLOCK_SIZE

BLOCK_SIZE             512     4K      8K      16K     32K     64K
===================================================================
old (KIOPS)            1208    1208    1131    1039    863     699
new (KIOPS)            1222    1222    1170    1137    1083    982

Previously, Jens measured about a 10% difference when polling real HW
with small block sizes, but that was with an older version that had one
fewer iov_iter_advance() call.


since RFC:
- add target_core_file patch by Christoph
- make no-copy default behaviour, remove iter flag
- use iter_advance() instead of hacks to make reverting work
- add bvec iter_advance() optimisation patch
- remove PSI annotations from direct IO (iomap, block and fs/direct)
- note in d/f/porting

since v1:
- don't allow zero-length bvec segments (Ming)
- don't add a BIO_WORKINGSET-less version of bio_add_page(), just clear
  the flag at the end and leave it for further cleanups (Christoph)
- commit message and comments rewording (Dave)
- other nits by Christoph

since v2:
- add a comment in 1/7 (Christoph)
- add a note about 0-len bvecs in biovecs.rst (Matthew)

Christoph Hellwig (1):
  target/file: allocate the bvec array as part of struct
    target_core_file_cmd

Pavel Begunkov (6):
  splice: don't generate zero-len segment bvecs
  bvec/iter: disallow zero-length segment bvecs
  block/psi: remove PSI annotations from direct IO
  iov_iter: optimise bvec iov_iter_advance()
  bio: add a helper calculating nr segments to alloc
  bio: don't copy bvec for direct IO

 Documentation/block/biovecs.rst       |  2 +
 Documentation/filesystems/porting.rst | 16 ++++++
 block/bio.c                           | 71 +++++++++++++--------------
 drivers/target/target_core_file.c     | 20 +++-----
 fs/block_dev.c                        |  7 +--
 fs/direct-io.c                        |  2 +
 fs/iomap/direct-io.c                  |  9 ++--
 fs/splice.c                           |  9 ++--
 include/linux/bio.h                   | 13 +++++
 lib/iov_iter.c                        | 21 +++++++-
 10 files changed, 106 insertions(+), 64 deletions(-)

Comments

Ming Lei Jan. 11, 2021, 2:48 a.m. UTC | #1
On Sat, Jan 09, 2021 at 04:02:57PM +0000, Pavel Begunkov wrote:
> iter_file_splice_write() may spawn bvec segments with zero-length. In
> preparation for prohibiting them, filter out by hand at splice level.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  fs/splice.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/splice.c b/fs/splice.c
> index 866d5c2367b2..474fb8b5562a 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -662,12 +662,14 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
>  
>  		/* build the vector */
>  		left = sd.total_len;
> -		for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++, n++) {
> +		for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) {
>  			struct pipe_buffer *buf = &pipe->bufs[tail & mask];
>  			size_t this_len = buf->len;
>  
> -			if (this_len > left)
> -				this_len = left;
> +			/* zero-length bvecs are not supported, skip them */
> +			if (!this_len)
> +				continue;
> +			this_len = min(this_len, left);
>  
>  			ret = pipe_buf_confirm(pipe, buf);
>  			if (unlikely(ret)) {
> @@ -680,6 +682,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
>  			array[n].bv_len = this_len;
>  			array[n].bv_offset = buf->offset;
>  			left -= this_len;
> +			n++;
>  		}
>  
>  		iov_iter_bvec(&from, WRITE, array, n, sd.total_len - left);
> -- 
> 2.24.0
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>


-- 
Ming
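A minimal userspace model of the patched loop shows the skip behaviour. The `build_vec` helper and `struct buf` below are hypothetical stand-ins for the kernel's pipe-buffer walk, not kernel API: zero-length buffers are skipped, and `n` counts only the segments actually emitted.

```c
#include <assert.h>
#include <stddef.h>

struct buf { size_t len; };

/* Build a segment-length array from pipe buffers, skipping zero-length
 * ones, mirroring the logic of the patched iter_file_splice_write() loop.
 * Returns the number of segments emitted. */
static size_t build_vec(const struct buf *bufs, size_t nbufs,
			size_t left, size_t *out_len)
{
	size_t n = 0;

	for (size_t i = 0; i < nbufs && left && n < nbufs; i++) {
		size_t this_len = bufs[i].len;

		if (!this_len)		/* zero-length bvecs unsupported: skip */
			continue;
		if (this_len > left)	/* i.e. min(this_len, left) */
			this_len = left;
		out_len[n++] = this_len;
		left -= this_len;
	}
	return n;
}
```

The key point of the fix is that `n` is no longer advanced by the `for` statement itself, so a skipped buffer leaves no zero-length hole in the output array.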
Ming Lei Jan. 11, 2021, 2:52 a.m. UTC | #2
On Sat, Jan 09, 2021 at 04:02:59PM +0000, Pavel Begunkov wrote:
> Direct IO does not operate on the current working set of pages managed
> by the kernel, so it should not be accounted as memory stall to PSI
> infrastructure.
> 
> The block layer and iomap direct IO use bio_iov_iter_get_pages()
> to build bios, and they are the only users of it, so to avoid PSI
> tracking for them clear out BIO_WORKINGSET flag. Do same for
> dio_bio_submit() because fs/direct_io constructs bios by hand directly
> calling bio_add_page().
> 
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  block/bio.c    | 6 ++++++
>  fs/direct-io.c | 2 ++
>  2 files changed, 8 insertions(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 1f2cc1fbe283..9f26984af643 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1099,6 +1099,9 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>   * fit into the bio, or are requested in @iter, whatever is smaller. If
>   * MM encounters an error pinning the requested pages, it stops. Error
>   * is returned only if 0 pages could be pinned.
> + *
> + * It's intended for direct IO, so doesn't do PSI tracking, the caller is
> + * responsible for setting BIO_WORKINGSET if necessary.
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> @@ -1123,6 +1126,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  
>  	if (is_bvec)
>  		bio_set_flag(bio, BIO_NO_PAGE_REF);
> +
> +	/* don't account direct I/O as memory stall */
> +	bio_clear_flag(bio, BIO_WORKINGSET);
>  	return bio->bi_vcnt ? 0 : ret;
>  }
>  EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index d53fa92a1ab6..0e689233f2c7 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -426,6 +426,8 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
>  	unsigned long flags;
>  
>  	bio->bi_private = dio;
> +	/* don't account direct I/O as memory stall */
> +	bio_clear_flag(bio, BIO_WORKINGSET);
>  
>  	spin_lock_irqsave(&dio->bio_lock, flags);
>  	dio->refcount++;
> -- 
> 2.24.0
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>


-- 
Ming
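The mechanism here is just bit-flag manipulation. A tiny userspace model (the `mini_` names and bit values below are made up for the sketch; the real flags live in the kernel's `enum` in blk_types.h) shows why unconditionally clearing at the end is safe and simpler than a flag-less `bio_add_page()` variant: clearing is a no-op when the bit was never set and leaves other flags untouched.

```c
#include <assert.h>

/* Illustrative flag bits; the kernel's real bit numbers differ. */
enum {
	MBIO_WORKINGSET  = 1u << 0,
	MBIO_NO_PAGE_REF = 1u << 1,
};

struct mini_bio { unsigned int flags; };

static void mini_bio_set_flag(struct mini_bio *b, unsigned int f)
{
	b->flags |= f;
}

static void mini_bio_clear_flag(struct mini_bio *b, unsigned int f)
{
	b->flags &= ~f;	/* harmless if the bit wasn't set */
}
```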
Ming Lei Jan. 11, 2021, 2:55 a.m. UTC | #3
On Sat, Jan 09, 2021 at 04:03:01PM +0000, Pavel Begunkov wrote:
> iov_iter_advance() is heavily used, but implemented through generic
> means. For bvecs there is a specifically crafted function for that, so
> use bvec_iter_advance() instead, it's faster and slimmer.
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  lib/iov_iter.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
> 
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 7de304269641..9b1c109dc8a9 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1065,6 +1065,21 @@ static void pipe_advance(struct iov_iter *i, size_t size)
>  	pipe_truncate(i);
>  }
>  
> +static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
> +{
> +	struct bvec_iter bi;
> +
> +	bi.bi_size = i->count;
> +	bi.bi_bvec_done = i->iov_offset;
> +	bi.bi_idx = 0;
> +	bvec_iter_advance(i->bvec, &bi, size);
> +
> +	i->bvec += bi.bi_idx;
> +	i->nr_segs -= bi.bi_idx;
> +	i->count = bi.bi_size;
> +	i->iov_offset = bi.bi_bvec_done;
> +}
> +
>  void iov_iter_advance(struct iov_iter *i, size_t size)
>  {
>  	if (unlikely(iov_iter_is_pipe(i))) {
> +		i->count -= size;
>  		return;
>  	}
> +	if (iov_iter_is_bvec(i)) {
> +		iov_iter_bvec_advance(i, size);
> +		return;
> +	}
>  	iterate_and_advance(i, size, v, 0, 0, 0)
>  }
>  EXPORT_SYMBOL(iov_iter_advance);
> -- 
> 2.24.0
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>


-- 
Ming
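To see what the specialised path buys, here is a userspace re-implementation of the same stepping logic. The struct layouts mirror include/linux/bvec.h but `mini_bvec_iter_advance` is a sketch, not the kernel code: it steps over whole segments at a time instead of going through the generic iterate_and_advance() machinery.

```c
#include <assert.h>

struct bio_vec {
	void *bv_page;
	unsigned int bv_len;
	unsigned int bv_offset;
};

struct bvec_iter {
	unsigned int bi_size;		/* residual bytes */
	unsigned int bi_idx;		/* current segment index */
	unsigned int bi_bvec_done;	/* bytes consumed in current segment */
};

/* Advance @bytes through @bv: subtract from the running total, then step
 * over whole segments, leaving the intra-segment remainder in bi_bvec_done. */
static void mini_bvec_iter_advance(const struct bio_vec *bv,
				   struct bvec_iter *iter, unsigned int bytes)
{
	unsigned int idx = iter->bi_idx;

	assert(bytes <= iter->bi_size);
	iter->bi_size -= bytes;
	bytes += iter->bi_bvec_done;
	while (bytes && bytes >= bv[idx].bv_len)
		bytes -= bv[idx++].bv_len;
	iter->bi_idx = idx;
	iter->bi_bvec_done = bytes;
}
```

The patch's iov_iter_bvec_advance() then just maps the iov_iter fields onto a bvec_iter, calls the real bvec_iter_advance(), and maps the results back (`bvec += bi_idx`, `nr_segs -= bi_idx`, and so on).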
Ming Lei Jan. 11, 2021, 3 a.m. UTC | #4
On Sat, Jan 09, 2021 at 04:03:03PM +0000, Pavel Begunkov wrote:
> The block layer spends quite a while in blkdev_direct_IO() to copy and
> initialise bio's bvec. However, if we've already got a bvec in the input
> iterator it might be reused in some cases, i.e. when new
> ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable
> performance boost, and it also reduces memory footprint.
> 
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  Documentation/filesystems/porting.rst |  9 ++++
>  block/bio.c                           | 67 ++++++++++++---------------
>  include/linux/bio.h                   |  5 +-
>  3 files changed, 42 insertions(+), 39 deletions(-)
> 
> diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
> index c722d94f29ea..1f8cf8e10b34 100644
> --- a/Documentation/filesystems/porting.rst
> +++ b/Documentation/filesystems/porting.rst
> @@ -872,3 +872,12 @@ its result is kern_unmount() or kern_unmount_array().
>  
>  zero-length bvec segments are disallowed, they must be filtered out before
>  passed on to an iterator.
> +
> +---
> +
> +**mandatory**
> +
> +For bvec based iterators bio_iov_iter_get_pages() now doesn't copy bvecs but
> +uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec and
> +page references stay until I/O has completed, i.e. until ->ki_complete() has
> +been called or returned with non -EIOCBQUEUED code.
> diff --git a/block/bio.c b/block/bio.c
> index 9f26984af643..6f031a04b59a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -960,21 +960,17 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
>  }
>  EXPORT_SYMBOL_GPL(bio_release_pages);
>  
> -static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> +static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>  {
> -	const struct bio_vec *bv = iter->bvec;
> -	unsigned int len;
> -	size_t size;
> -
> -	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
> -		return -EINVAL;
> -
> -	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
> -	size = bio_add_page(bio, bv->bv_page, len,
> -				bv->bv_offset + iter->iov_offset);
> -	if (unlikely(size != len))
> -		return -EINVAL;
> -	iov_iter_advance(iter, size);
> +	WARN_ON_ONCE(BVEC_POOL_IDX(bio) != 0);
> +
> +	bio->bi_vcnt = iter->nr_segs;
> +	bio->bi_max_vecs = iter->nr_segs;
> +	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
> +	bio->bi_iter.bi_bvec_done = iter->iov_offset;
> +	bio->bi_iter.bi_size = iter->count;
> +
> +	iov_iter_advance(iter, iter->count);
>  	return 0;
>  }
>  
> @@ -1088,12 +1084,12 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>   * This takes either an iterator pointing to user memory, or one pointing to
>   * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
>   * map them into the kernel. On IO completion, the caller should put those
> - * pages. If we're adding kernel pages, and the caller told us it's safe to
> - * do so, we just have to add the pages to the bio directly. We don't grab an
> - * extra reference to those pages (the user should already have that), and we
> - * don't put the page on IO completion. The caller needs to check if the bio is
> - * flagged BIO_NO_PAGE_REF on IO completion. If it isn't, then pages should be
> - * released.
> + * pages. For bvec based iterators bio_iov_iter_get_pages() uses the provided
> + * bvecs rather than copying them. Hence anyone issuing kiocb based IO needs
> + * to ensure the bvecs and pages stay referenced until the submitted I/O is
> + * completed by a call to ->ki_complete() or returns with an error other than
> + * -EIOCBQUEUED. The caller needs to check if the bio is flagged BIO_NO_PAGE_REF
> + * on IO completion. If it isn't, then pages should be released.
>   *
>   * The function tries, but does not guarantee, to pin as many pages as
>   * fit into the bio, or are requested in @iter, whatever is smaller. If
> @@ -1105,27 +1101,22 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> -	const bool is_bvec = iov_iter_is_bvec(iter);
> -	int ret;
> -
> -	if (WARN_ON_ONCE(bio->bi_vcnt))
> -		return -EINVAL;
> +	int ret = 0;
>  
> -	do {
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> -			if (WARN_ON_ONCE(is_bvec))
> -				return -EINVAL;
> -			ret = __bio_iov_append_get_pages(bio, iter);
> -		} else {
> -			if (is_bvec)
> -				ret = __bio_iov_bvec_add_pages(bio, iter);
> +	if (iov_iter_is_bvec(iter)) {
> +		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
> +			return -EINVAL;
> +		bio_iov_bvec_set(bio, iter);
> +		bio_set_flag(bio, BIO_NO_PAGE_REF);
> +		return 0;
> +	} else {
> +		do {
> +			if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +				ret = __bio_iov_append_get_pages(bio, iter);
>  			else
>  				ret = __bio_iov_iter_get_pages(bio, iter);
> -		}
> -	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
> -
> -	if (is_bvec)
> -		bio_set_flag(bio, BIO_NO_PAGE_REF);
> +		} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
> +	}
>  
>  	/* don't account direct I/O as memory stall */
>  	bio_clear_flag(bio, BIO_WORKINGSET);
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d8f9077c43ef..1d30572a8c53 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -444,10 +444,13 @@ static inline void bio_wouldblock_error(struct bio *bio)
>  
>  /*
>   * Calculate number of bvec segments that should be allocated to fit data
> - * pointed by @iter.
> + * pointed by @iter. If @iter is backed by bvec it's going to be reused
> + * instead of allocating a new one.
>   */
>  static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
>  {
> +	if (iov_iter_is_bvec(iter))
> +		return 0;
>  	return iov_iter_npages(iter, max_segs);
>  }
>  
> -- 
> 2.24.0
> 

Reviewed-by: Ming Lei <ming.lei@redhat.com>


-- 
Ming
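The field mapping done by the new bio_iov_bvec_set() can be modelled in userspace. The `mini_` names below are stand-ins and the iterator-consumption step is simplified (the real function calls iov_iter_advance(iter, iter->count) and has a WARN_ON_ONCE guard): the bio aliases the iterator's bvec array and inherits its partial progress through bi_bvec_done.

```c
#include <assert.h>
#include <stddef.h>

struct bio_vec { void *bv_page; unsigned int bv_len; unsigned int bv_offset; };

struct mini_iter {
	const struct bio_vec *bvec;
	unsigned int nr_segs;
	size_t iov_offset;	/* progress within the first segment */
	size_t count;		/* total bytes remaining */
};

struct mini_bio {
	struct bio_vec *bi_io_vec;
	unsigned int bi_vcnt, bi_max_vecs;
	struct { size_t bi_size; size_t bi_bvec_done; } bi_iter;
};

/* Sketch of bio_iov_bvec_set(): point the bio at the iterator's array and
 * carry over partial progress via bi_bvec_done, instead of copying segments
 * one by one through bio_add_page(). */
static void mini_bio_iov_bvec_set(struct mini_bio *bio, struct mini_iter *iter)
{
	bio->bi_vcnt = bio->bi_max_vecs = iter->nr_segs;
	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
	bio->bi_iter.bi_bvec_done = iter->iov_offset;
	bio->bi_iter.bi_size = iter->count;
	/* consume the whole iterator, as iov_iter_advance(iter, count) would */
	iter->iov_offset = 0;
	iter->count = 0;
}
```

A partially advanced iterator (non-zero iov_offset) thus works without any special casing: the bio's own iterator starts mid-segment via bi_bvec_done.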
Jens Axboe Jan. 25, 2021, 3:58 p.m. UTC | #5
On 1/9/21 9:02 AM, Pavel Begunkov wrote:
> Currently, when iomap and block direct IO gets a bvec based iterator
> the bvec will be copied, with all other accounting that takes much
> CPU time and causes additional allocation for larger bvecs. The
> patchset makes it to reuse the passed in iter bvec.

Applied, thanks.

-- 
Jens Axboe