Message ID: cover.1610170479.git.asml.silence@gmail.com
Series: no-copy bvec
On Sat, Jan 09, 2021 at 04:02:57PM +0000, Pavel Begunkov wrote:
> iter_file_splice_write() may spawn bvec segments with zero-length. In
> preparation for prohibiting them, filter out by hand at splice level.
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  fs/splice.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 866d5c2367b2..474fb8b5562a 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -662,12 +662,14 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
>
>  	/* build the vector */
>  	left = sd.total_len;
> -	for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++, n++) {
> +	for (n = 0; !pipe_empty(head, tail) && left && n < nbufs; tail++) {
>  		struct pipe_buffer *buf = &pipe->bufs[tail & mask];
>  		size_t this_len = buf->len;
>
> -		if (this_len > left)
> -			this_len = left;
> +		/* zero-length bvecs are not supported, skip them */
> +		if (!this_len)
> +			continue;
> +		this_len = min(this_len, left);
>
>  		ret = pipe_buf_confirm(pipe, buf);
>  		if (unlikely(ret)) {
> @@ -680,6 +682,7 @@ iter_file_splice_write(struct pipe_inode_info *pipe, struct file *out,
>  		array[n].bv_len = this_len;
>  		array[n].bv_offset = buf->offset;
>  		left -= this_len;
> +		n++;
>  	}
>
>  	iov_iter_bvec(&from, WRITE, array, n, sd.total_len - left);
> --
> 2.24.0

Reviewed-by: Ming Lei <ming.lei@redhat.com>

--
Ming
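[Editor's note: the loop pattern in the patch above can be sketched in plain userspace C. This is a hypothetical analogue, not kernel code; `struct seg` and `build_segs()` are illustrative names. The key point it demonstrates is that the output index `n` advances only for segments actually emitted, so zero-length inputs are filtered out.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative userspace analogue of the patched iter_file_splice_write()
 * loop: build an output segment array from input segments, skipping
 * zero-length entries and capping the total at `left` bytes. */
struct seg { size_t len; };

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Returns the number of segments written to out[]; on return *left_out
 * holds the bytes that were not covered by any segment. */
static size_t build_segs(const struct seg *in, size_t nin,
                         struct seg *out, size_t nout, size_t *left_out)
{
	size_t left = *left_out, n = 0;

	for (size_t i = 0; i < nin && left && n < nout; i++) {
		size_t this_len = in[i].len;

		/* zero-length segments are not supported, skip them;
		 * `n` only advances for segments actually emitted */
		if (!this_len)
			continue;
		this_len = min_sz(this_len, left);

		out[n].len = this_len;
		left -= this_len;
		n++;
	}
	*left_out = left;
	return n;
}
```

With inputs of lengths 4, 0, 6, 0, 8 and a 12-byte budget, this produces three segments (4, 6, 2) rather than five, which is exactly the filtering the patch performs before handing the array to `iov_iter_bvec()`.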
On Sat, Jan 09, 2021 at 04:02:59PM +0000, Pavel Begunkov wrote:
> Direct IO does not operate on the current working set of pages managed
> by the kernel, so it should not be accounted as memory stall to PSI
> infrastructure.
>
> The block layer and iomap direct IO use bio_iov_iter_get_pages()
> to build bios, and they are the only users of it, so to avoid PSI
> tracking for them clear out BIO_WORKINGSET flag. Do same for
> dio_bio_submit() because fs/direct_io constructs bios by hand directly
> calling bio_add_page().
>
> Reported-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  block/bio.c    | 6 ++++++
>  fs/direct-io.c | 2 ++
>  2 files changed, 8 insertions(+)
>
> diff --git a/block/bio.c b/block/bio.c
> index 1f2cc1fbe283..9f26984af643 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1099,6 +1099,9 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>   * fit into the bio, or are requested in @iter, whatever is smaller. If
>   * MM encounters an error pinning the requested pages, it stops. Error
>   * is returned only if 0 pages could be pinned.
> + *
> + * It's intended for direct IO, so doesn't do PSI tracking, the caller is
> + * responsible for setting BIO_WORKINGSET if necessary.
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> @@ -1123,6 +1126,9 @@ int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>
>  	if (is_bvec)
>  		bio_set_flag(bio, BIO_NO_PAGE_REF);
> +
> +	/* don't account direct I/O as memory stall */
> +	bio_clear_flag(bio, BIO_WORKINGSET);
>  	return bio->bi_vcnt ? 0 : ret;
>  }
>  EXPORT_SYMBOL_GPL(bio_iov_iter_get_pages);
> diff --git a/fs/direct-io.c b/fs/direct-io.c
> index d53fa92a1ab6..0e689233f2c7 100644
> --- a/fs/direct-io.c
> +++ b/fs/direct-io.c
> @@ -426,6 +426,8 @@ static inline void dio_bio_submit(struct dio *dio, struct dio_submit *sdio)
>  	unsigned long flags;
>
>  	bio->bi_private = dio;
> +	/* don't account direct I/O as memory stall */
> +	bio_clear_flag(bio, BIO_WORKINGSET);
>
>  	spin_lock_irqsave(&dio->bio_lock, flags);
>  	dio->refcount++;
> --
> 2.24.0

Reviewed-by: Ming Lei <ming.lei@redhat.com>

--
Ming
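[Editor's note: the mechanism this patch relies on is ordinary bit-flag manipulation: `bio_add_page()` may set BIO_WORKINGSET for workingset pages, and clearing the bit afterwards simply undoes that accounting for the direct-IO path. A minimal userspace sketch, with made-up bit positions (the `MY_` names are not the kernel's):]

```c
#include <assert.h>

/* Hypothetical bit positions for illustration only. */
enum { MY_BIO_WORKINGSET = 0, MY_BIO_NO_PAGE_REF = 1 };

/* Set, clear, and test a flag bit in a flags word, mirroring the shape of
 * the kernel's bio_set_flag()/bio_clear_flag()/bio_flagged() helpers. */
static void flag_set(unsigned int *flags, unsigned int bit)
{
	*flags |= 1u << bit;
}

static void flag_clear(unsigned int *flags, unsigned int bit)
{
	*flags &= ~(1u << bit);
}

static int flag_test(unsigned int flags, unsigned int bit)
{
	return !!(flags & (1u << bit));
}
```

Clearing one bit leaves the others untouched, which is why the patch can drop BIO_WORKINGSET late, after page setup, without disturbing BIO_NO_PAGE_REF.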
On Sat, Jan 09, 2021 at 04:03:01PM +0000, Pavel Begunkov wrote:
> iov_iter_advance() is heavily used, but implemented through generic
> means. For bvecs there is a specifically crafted function for that, so
> use bvec_iter_advance() instead, it's faster and slimmer.
>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  lib/iov_iter.c | 19 +++++++++++++++++++
>  1 file changed, 19 insertions(+)
>
> diff --git a/lib/iov_iter.c b/lib/iov_iter.c
> index 7de304269641..9b1c109dc8a9 100644
> --- a/lib/iov_iter.c
> +++ b/lib/iov_iter.c
> @@ -1065,6 +1065,21 @@ static void pipe_advance(struct iov_iter *i, size_t size)
>  	pipe_truncate(i);
>  }
>
> +static void iov_iter_bvec_advance(struct iov_iter *i, size_t size)
> +{
> +	struct bvec_iter bi;
> +
> +	bi.bi_size = i->count;
> +	bi.bi_bvec_done = i->iov_offset;
> +	bi.bi_idx = 0;
> +	bvec_iter_advance(i->bvec, &bi, size);
> +
> +	i->bvec += bi.bi_idx;
> +	i->nr_segs -= bi.bi_idx;
> +	i->count = bi.bi_size;
> +	i->iov_offset = bi.bi_bvec_done;
> +}
> +
>  void iov_iter_advance(struct iov_iter *i, size_t size)
>  {
>  	if (unlikely(iov_iter_is_pipe(i))) {
> @@ -1075,6 +1090,10 @@ void iov_iter_advance(struct iov_iter *i, size_t size)
>  		i->count -= size;
>  		return;
>  	}
> +	if (iov_iter_is_bvec(i)) {
> +		iov_iter_bvec_advance(i, size);
> +		return;
> +	}
>  	iterate_and_advance(i, size, v, 0, 0, 0)
>  }
>  EXPORT_SYMBOL(iov_iter_advance);
> --
> 2.24.0

Reviewed-by: Ming Lei <ming.lei@redhat.com>

--
Ming
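[Editor's note: what a bvec-style advance does can be sketched in userspace. This is an illustrative model, not the kernel's `bvec_iter_advance()`: the `myvec`/`myiter` types are made up, but the bookkeeping is the same idea the patch exploits; consume `size` bytes by walking segment lengths, updating the segment pointer, segment count, per-segment offset, and total count, without touching any data.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative segment and iterator types (not kernel structures). */
struct myvec { size_t len; };

struct myiter {
	const struct myvec *vec; /* current segment */
	size_t nr_segs;          /* segments remaining */
	size_t offset;           /* bytes already consumed in *vec */
	size_t count;            /* total bytes remaining */
};

/* Advance the iterator by `size` bytes, clamped to what remains. */
static void myiter_advance(struct myiter *i, size_t size)
{
	if (size > i->count)
		size = i->count;
	i->count -= size;

	while (size) {
		size_t room = i->vec->len - i->offset;

		if (size < room) {
			/* stays within the current segment */
			i->offset += size;
			break;
		}
		/* current segment exhausted, move to the next one */
		size -= room;
		i->vec++;
		i->nr_segs--;
		i->offset = 0;
	}
}
```

The generic `iterate_and_advance()` path has to dispatch per iterator type on every step; a dedicated walk like this touches only the four fields above, which is the "faster and slimmer" claim in the commit message.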
On Sat, Jan 09, 2021 at 04:03:03PM +0000, Pavel Begunkov wrote:
> The block layer spends quite a while in blkdev_direct_IO() to copy and
> initialise bio's bvec. However, if we've already got a bvec in the input
> iterator it might be reused in some cases, i.e. when new
> ITER_BVEC_FLAG_FIXED flag is set. Simple tests show considerable
> performance boost, and it also reduces memory footprint.
>
> Suggested-by: Matthew Wilcox <willy@infradead.org>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
> ---
>  Documentation/filesystems/porting.rst |  9 ++++
>  block/bio.c                           | 67 ++++++++++++---------------
>  include/linux/bio.h                   |  5 +-
>  3 files changed, 42 insertions(+), 39 deletions(-)
>
> diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
> index c722d94f29ea..1f8cf8e10b34 100644
> --- a/Documentation/filesystems/porting.rst
> +++ b/Documentation/filesystems/porting.rst
> @@ -872,3 +872,12 @@ its result is kern_unmount() or kern_unmount_array().
>
>  zero-length bvec segments are disallowed, they must be filtered out before
>  passed on to an iterator.
> +
> +---
> +
> +**mandatory**
> +
> +For bvec based itererators bio_iov_iter_get_pages() now doesn't copy bvecs but
> +uses the one provided. Anyone issuing kiocb-I/O should ensure that the bvec and
> +page references stay until I/O has completed, i.e. until ->ki_complete() has
> +been called or returned with non -EIOCBQUEUED code.
> diff --git a/block/bio.c b/block/bio.c
> index 9f26984af643..6f031a04b59a 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -960,21 +960,17 @@ void bio_release_pages(struct bio *bio, bool mark_dirty)
>  }
>  EXPORT_SYMBOL_GPL(bio_release_pages);
>
> -static int __bio_iov_bvec_add_pages(struct bio *bio, struct iov_iter *iter)
> +static int bio_iov_bvec_set(struct bio *bio, struct iov_iter *iter)
>  {
> -	const struct bio_vec *bv = iter->bvec;
> -	unsigned int len;
> -	size_t size;
> -
> -	if (WARN_ON_ONCE(iter->iov_offset > bv->bv_len))
> -		return -EINVAL;
> -
> -	len = min_t(size_t, bv->bv_len - iter->iov_offset, iter->count);
> -	size = bio_add_page(bio, bv->bv_page, len,
> -				bv->bv_offset + iter->iov_offset);
> -	if (unlikely(size != len))
> -		return -EINVAL;
> -	iov_iter_advance(iter, size);
> +	WARN_ON_ONCE(BVEC_POOL_IDX(bio) != 0);
> +
> +	bio->bi_vcnt = iter->nr_segs;
> +	bio->bi_max_vecs = iter->nr_segs;
> +	bio->bi_io_vec = (struct bio_vec *)iter->bvec;
> +	bio->bi_iter.bi_bvec_done = iter->iov_offset;
> +	bio->bi_iter.bi_size = iter->count;
> +
> +	iov_iter_advance(iter, iter->count);
>  	return 0;
>  }
>
> @@ -1088,12 +1084,12 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>   * This takes either an iterator pointing to user memory, or one pointing to
>   * kernel pages (BVEC iterator). If we're adding user pages, we pin them and
>   * map them into the kernel. On IO completion, the caller should put those
> - * pages. If we're adding kernel pages, and the caller told us it's safe to
> - * do so, we just have to add the pages to the bio directly. We don't grab an
> - * extra reference to those pages (the user should already have that), and we
> - * don't put the page on IO completion. The caller needs to check if the bio is
> - * flagged BIO_NO_PAGE_REF on IO completion. If it isn't, then pages should be
> - * released.
> + * pages. For bvec based iterators bio_iov_iter_get_pages() uses the provided
> + * bvecs rather than copying them. Hence anyone issuing kiocb based IO needs
> + * to ensure the bvecs and pages stay referenced until the submitted I/O is
> + * completed by a call to ->ki_complete() or returns with an error other than
> + * -EIOCBQUEUED. The caller needs to check if the bio is flagged BIO_NO_PAGE_REF
> + * on IO completion. If it isn't, then pages should be released.
>   *
>   * The function tries, but does not guarantee, to pin as many pages as
>   * fit into the bio, or are requested in @iter, whatever is smaller. If
> @@ -1105,27 +1101,22 @@ static int __bio_iov_append_get_pages(struct bio *bio, struct iov_iter *iter)
>   */
>  int bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  {
> -	const bool is_bvec = iov_iter_is_bvec(iter);
> -	int ret;
> -
> -	if (WARN_ON_ONCE(bio->bi_vcnt))
> -		return -EINVAL;
> +	int ret = 0;
>
> -	do {
> -		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
> -			if (WARN_ON_ONCE(is_bvec))
> -				return -EINVAL;
> -			ret = __bio_iov_append_get_pages(bio, iter);
> -		} else {
> -			if (is_bvec)
> -				ret = __bio_iov_bvec_add_pages(bio, iter);
> +	if (iov_iter_is_bvec(iter)) {
> +		if (WARN_ON_ONCE(bio_op(bio) == REQ_OP_ZONE_APPEND))
> +			return -EINVAL;
> +		bio_iov_bvec_set(bio, iter);
> +		bio_set_flag(bio, BIO_NO_PAGE_REF);
> +		return 0;
> +	} else {
> +		do {
> +			if (bio_op(bio) == REQ_OP_ZONE_APPEND)
> +				ret = __bio_iov_append_get_pages(bio, iter);
>  			else
>  				ret = __bio_iov_iter_get_pages(bio, iter);
> -		}
> -	} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
> -
> -	if (is_bvec)
> -		bio_set_flag(bio, BIO_NO_PAGE_REF);
> +		} while (!ret && iov_iter_count(iter) && !bio_full(bio, 0));
> +	}
>
>  	/* don't account direct I/O as memory stall */
>  	bio_clear_flag(bio, BIO_WORKINGSET);
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index d8f9077c43ef..1d30572a8c53 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -444,10 +444,13 @@ static inline void bio_wouldblock_error(struct bio *bio)
>
>  /*
>   * Calculate number of bvec segments that should be allocated to fit data
> - * pointed by @iter.
> + * pointed by @iter. If @iter is backed by bvec it's going to be reused
> + * instead of allocating a new one.
>   */
>  static inline int bio_iov_vecs_to_alloc(struct iov_iter *iter, int max_segs)
>  {
> +	if (iov_iter_is_bvec(iter))
> +		return 0;
>  	return iov_iter_npages(iter, max_segs);
>  }
>
> --
> 2.24.0

Reviewed-by: Ming Lei <ming.lei@redhat.com>

--
Ming
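[Editor's note: the core no-copy idea in `bio_iov_bvec_set()` is simply aliasing, not copying. The userspace sketch below uses made-up `vec`/`fakebio` types to illustrate it; the cost of the optimization is the lifetime rule the porting note spells out, since the bio now borrows the caller's array rather than owning a copy.]

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative types; not the kernel's bio/bio_vec. */
struct vec { size_t len; };

struct fakebio {
	const struct vec *io_vec; /* borrowed from the caller, not owned */
	size_t vcnt;
	size_t size;              /* total bytes covered */
	size_t done;              /* offset into the first segment */
};

/* Mirrors the shape of bio_iov_bvec_set(): point the bio at the caller's
 * segment array instead of allocating and memcpy-ing a private copy. */
static void fakebio_vec_set(struct fakebio *bio, const struct vec *vec,
                            size_t nr_segs, size_t offset, size_t count)
{
	bio->io_vec = vec; /* no allocation, no copy */
	bio->vcnt = nr_segs;
	bio->done = offset;
	bio->size = count;
}

/* Mirrors bio_iov_vecs_to_alloc(): a bvec-backed iterator needs no new
 * vector storage at all, since the existing array will be reused. */
static int vecs_to_alloc(int is_bvec_backed, int npages)
{
	return is_bvec_backed ? 0 : npages;
}
```

Because `io_vec` aliases the caller's array, the caller must keep that array (and its page references) alive until completion; that is the "mandatory" rule the patch adds to porting.rst, and it is why `bio_iov_vecs_to_alloc()` can return 0 for bvec iterators.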
On 1/9/21 9:02 AM, Pavel Begunkov wrote:
> Currently, when iomap and block direct IO gets a bvec based iterator
> the bvec will be copied, with all other accounting that takes much
> CPU time and causes additional allocation for larger bvecs. The
> patchset makes it to reuse the passed in iter bvec.

Applied, thanks.

--
Jens Axboe