| Message ID | 20220927014420.71141-1-axboe@kernel.dk |
| --- | --- |
| Series | Enable alloc caching and batched freeing for passthrough |
On Tue, Sep 27, 2022 at 7:14 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Hi,
>
> The passthrough IO path currently doesn't do any request allocation
> batching like we do for normal IO. Wire this up through the usual
> blk_mq_alloc_request() allocation helper.
>
> Similarly, we don't currently support batched completions for
> passthrough IO. Allow the request->end_io() handler to return back
> whether or not it retains ownership of the request. By default all
> handlers are converted to returning RQ_END_IO_NONE, which retains
> the existing behavior. But with that in place, we can tweak the
> nvme uring_cmd end_io handler to pass back ownership, and hence enable
> completion batching for passthrough requests as well.
>
> This is good for a 10% improvement for passthrough performance. For
> a non-drive limited test case, passthrough IO is now more efficient
> than the regular bdev O_DIRECT path.
>
> Changes since v1:
> - Remove spurious semicolon
> - Cleanup struct nvme_uring_cmd_pdu handling
>
> --
> Jens Axboe
>
>

I see an improvement of ~12% (2.34 to 2.63 MIOPS) with polling enabled and
an improvement of ~4% (1.84 to 1.92 MIOPS) with polling disabled using the
t/io_uring utility (in fio) in my setup with this patch series!

--
Anuj Gupta
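[Editor's note] The ownership contract the cover letter describes boils down to the end_io return value. Below is a minimal, hypothetical handler; `struct my_cmd` and `my_pt_end_io()` are illustrative names, not from this series, and the sketch assumes the `enum rq_end_io_ret` / `RQ_END_IO_*` names the series introduces.

```c
#include <linux/blk-mq.h>
#include <linux/completion.h>

/* Illustrative only: struct my_cmd and my_pt_end_io() are not from the series. */
struct my_cmd {
	struct completion done;
	blk_status_t status;
};

static enum rq_end_io_ret my_pt_end_io(struct request *req, blk_status_t err)
{
	struct my_cmd *cmd = req->end_io_data;

	cmd->status = err;
	complete(&cmd->done);

	/*
	 * RQ_END_IO_FREE: this handler is done with the request, so the
	 * block layer frees it, and can now do so as part of a batch.
	 * A handler that still needs the request afterwards (for example
	 * one that defers work to task context, as the series describes
	 * for the nvme uring_cmd meta path) returns RQ_END_IO_NONE and
	 * stays responsible for freeing it.
	 */
	return RQ_END_IO_FREE;
}
```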
On Tue, Sep 27, 2022 at 7:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> The filesystem IO path can take advantage of allocating batches of
> requests, if the underlying submitter tells the block layer about it
> through the blk_plug. For passthrough IO, the exported API is the
> blk_mq_alloc_request() helper, and that one does not allow for
> request caching.
>
> Wire up request caching for blk_mq_alloc_request(), which is generally
> done without having a bio available upfront.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-mq.c | 80 ++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 71 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c11949d66163..d3a9f8b9c7ee 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -510,25 +510,87 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
>  			alloc_time_ns);
>  }
>
> -struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
> -		blk_mq_req_flags_t flags)
> +static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
> +					    struct blk_plug *plug,
> +					    blk_opf_t opf,
> +					    blk_mq_req_flags_t flags)
>  {
>  	struct blk_mq_alloc_data data = {
>  		.q		= q,
>  		.flags		= flags,
>  		.cmd_flags	= opf,
> -		.nr_tags	= 1,
> +		.nr_tags	= plug->nr_ios,
> +		.cached_rq	= &plug->cached_rq,
>  	};
>  	struct request *rq;
> -	int ret;
>
> -	ret = blk_queue_enter(q, flags);
> -	if (ret)
> -		return ERR_PTR(ret);
> +	if (blk_queue_enter(q, flags))
> +		return NULL;
> +
> +	plug->nr_ios = 1;
>
>  	rq = __blk_mq_alloc_requests(&data);
> -	if (!rq)
> -		goto out_queue_exit;
> +	if (unlikely(!rq))
> +		blk_queue_exit(q);
> +	return rq;
> +}
> +
> +static struct request *blk_mq_alloc_cached_request(struct request_queue *q,
> +						   blk_opf_t opf,
> +						   blk_mq_req_flags_t flags)
> +{
> +	struct blk_plug *plug = current->plug;
> +	struct request *rq;
> +
> +	if (!plug)
> +		return NULL;
> +	if (rq_list_empty(plug->cached_rq)) {
> +		if (plug->nr_ios == 1)
> +			return NULL;
> +		rq = blk_mq_rq_cache_fill(q, plug, opf, flags);
> +		if (rq)
> +			goto got_it;
> +		return NULL;
> +	}
> +	rq = rq_list_peek(&plug->cached_rq);
> +	if (!rq || rq->q != q)
> +		return NULL;
> +
> +	if (blk_mq_get_hctx_type(opf) != rq->mq_hctx->type)
> +		return NULL;
> +	if (op_is_flush(rq->cmd_flags) != op_is_flush(opf))
> +		return NULL;
> +
> +	plug->cached_rq = rq_list_next(rq);
> +got_it:
> +	rq->cmd_flags = opf;
> +	INIT_LIST_HEAD(&rq->queuelist);
> +	return rq;
> +}
> +
> +struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
> +				     blk_mq_req_flags_t flags)
> +{
> +	struct request *rq;
> +
> +	rq = blk_mq_alloc_cached_request(q, opf, flags);
> +	if (!rq) {
> +		struct blk_mq_alloc_data data = {
> +			.q		= q,
> +			.flags		= flags,
> +			.cmd_flags	= opf,
> +			.nr_tags	= 1,
> +		};
> +		int ret;
> +
> +		ret = blk_queue_enter(q, flags);
> +		if (ret)
> +			return ERR_PTR(ret);
> +
> +		rq = __blk_mq_alloc_requests(&data);
> +		if (!rq)
> +			goto out_queue_exit;
> +	}
>  	rq->__data_len = 0;
>  	rq->__sector = (sector_t) -1;
>  	rq->bio = rq->biotail = NULL;
> --
> 2.35.1
>

A large chunk of this improvement in passthrough performance comes from
enabling request caching. On my setup, performance improves from 2.34 to
2.54 MIOPS. I tested this using the t/io_uring utility (in fio) on an
Intel Optane Gen2 device.

Tested-by: Anuj Gupta <anuj20.g@samsung.com>

--
Anuj Gupta
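[Editor's note] A rough sketch of how a passthrough submitter would hit this cache. `submit_passthrough_batch()` is hypothetical and reuses the `my_pt_end_io()` / `struct my_cmd` sketch above; it assumes `blk_start_plug_nr_ios()` and the two-argument `blk_execute_rq_nowait()` of contemporary kernels.

```c
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/err.h>

/* Hypothetical submission loop, not part of the patch. */
static void submit_passthrough_batch(struct request_queue *q,
				     struct my_cmd *cmds, unsigned short nr)
{
	struct blk_plug plug;
	unsigned short i;

	/*
	 * Announce how many requests we expect: the first
	 * blk_mq_alloc_request() call fills plug.cached_rq with up to nr
	 * requests, and later calls pop from that cache instead of going
	 * through the tag allocator each time.
	 */
	blk_start_plug_nr_ios(&plug, nr);

	for (i = 0; i < nr; i++) {
		struct request *rq;

		rq = blk_mq_alloc_request(q, REQ_OP_DRV_IN, 0);
		if (IS_ERR(rq))
			break;

		/* fill in the driver-specific command via blk_mq_rq_to_pdu(rq) here */
		rq->end_io_data = &cmds[i];
		rq->end_io = my_pt_end_io;
		blk_execute_rq_nowait(rq, false);
	}

	/* Flushing the plug also releases any cached requests left unused. */
	blk_finish_plug(&plug);
}
```

The design choice here is that the cache is keyed off current->plug, so a caller gets allocation batching simply by holding a plug across its submission loop, just as the normal bio-based path already does.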
On Tue, Sep 27, 2022 at 7:20 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> With end_io handlers now being able to potentially pass ownership of
> the request upon completion, we can allow requests with end_io handlers
> in the batch completion handling.
>
> Co-developed-by: Stefan Roesch <shr@fb.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-mq.c         | 13 +++++++++++--
>  include/linux/blk-mq.h |  3 ++-
>  2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a4e018c82b7c..a7dfe7a898a4 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -823,8 +823,10 @@ static void blk_complete_request(struct request *req)
>  	 * can find how many bytes remain in the request
>  	 * later.
>  	 */
> -	req->bio = NULL;
> -	req->__data_len = 0;
> +	if (!req->end_io) {
> +		req->bio = NULL;
> +		req->__data_len = 0;
> +	}
>  }
>
>  /**
> @@ -1055,6 +1057,13 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
>
>  		rq_qos_done(rq->q, rq);
>
> +		/*
> +		 * If end_io handler returns NONE, then it still has
> +		 * ownership of the request.
> +		 */
> +		if (rq->end_io && rq->end_io(rq, 0) == RQ_END_IO_NONE)
> +			continue;
> +
>  		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
>  		if (!req_ref_put_and_test(rq))
>  			continue;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index e6fa49dd6196..50811d0fb143 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -853,8 +853,9 @@ static inline bool blk_mq_add_to_batch(struct request *req,
>  					struct io_comp_batch *iob, int ioerror,
>  					void (*complete)(struct io_comp_batch *))
>  {
> -	if (!iob || (req->rq_flags & RQF_ELV) || req->end_io || ioerror)
> +	if (!iob || (req->rq_flags & RQF_ELV) || ioerror)
>  		return false;
> +
>  	if (!iob->complete)
>  		iob->complete = complete;
>  	else if (iob->complete != complete)
> --
> 2.35.1
>

Looks good.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>

--
Anuj Gupta
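[Editor's note] As a rough illustration of what this enables on the driver side: `struct my_queue`, `my_next_completed()`, and `my_complete_batch()` below are hypothetical stand-ins for a driver's own completion-queue walk; only blk_mq_add_to_batch(), blk_mq_complete_request(), and blk_mq_end_request_batch() are real block-layer APIs.

```c
#include <linux/blk-mq.h>

struct my_queue;                                        /* illustrative */
struct request *my_next_completed(struct my_queue *mq); /* illustrative */

static void my_complete_batch(struct io_comp_batch *iob)
{
	/* per-request driver teardown (DMA unmap, etc.) would go here */
	blk_mq_end_request_batch(iob);
}

static int my_poll(struct my_queue *mq, struct io_comp_batch *iob)
{
	struct request *req;
	int found = 0;

	while ((req = my_next_completed(mq)) != NULL) {
		found++;
		/*
		 * With this patch, requests carrying an end_io handler may
		 * be batched too; blk_mq_end_request_batch() calls the
		 * handler and only frees the request if it returns
		 * something other than RQ_END_IO_NONE.  Fall back to
		 * individual completion when batching is not possible
		 * (no iob, elevator-attached request, or an error).
		 */
		if (!blk_mq_add_to_batch(req, iob, 0, my_complete_batch))
			blk_mq_complete_request(req);
	}
	return found;
}
```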
On Tue, Sep 27, 2022 at 7:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Now that the normal passthrough end_io path doesn't need the request
> anymore, we can kill the explicit blk_mq_free_request() and just pass
> back RQ_END_IO_FREE instead. This enables the batched completion path
> to free batches of requests at that time.
>
> This brings passthrough IO performance at least on par with bdev based
> O_DIRECT with io_uring. With this and batched allocations, peak performance
> goes from 110M IOPS to 122M IOPS. For IRQ based, passthrough is now also
> about 10% faster than previously, going from ~61M to ~67M IOPS.
>
> Co-developed-by: Stefan Roesch <shr@fb.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  drivers/nvme/host/ioctl.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index 9e356a6c96c2..d9633f426690 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -423,8 +423,7 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
>  	else
>  		io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
>
> -	blk_mq_free_request(req);
> -	return RQ_END_IO_NONE;
> +	return RQ_END_IO_FREE;
>  }
>
>  static enum rq_end_io_ret nvme_uring_cmd_end_io_meta(struct request *req,
> --
> 2.35.1
>

Looks good to me.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>

--
Anuj Gupta
On 9/28/22 7:23 AM, Anuj Gupta wrote:
> On Tue, Sep 27, 2022 at 7:14 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Hi,
>>
>> The passthrough IO path currently doesn't do any request allocation
>> batching like we do for normal IO. Wire this up through the usual
>> blk_mq_alloc_request() allocation helper.
>>
>> Similarly, we don't currently support batched completions for
>> passthrough IO. Allow the request->end_io() handler to return back
>> whether or not it retains ownership of the request. By default all
>> handlers are converted to returning RQ_END_IO_NONE, which retains
>> the existing behavior. But with that in place, we can tweak the
>> nvme uring_cmd end_io handler to pass back ownership, and hence enable
>> completion batching for passthrough requests as well.
>>
>> This is good for a 10% improvement for passthrough performance. For
>> a non-drive limited test case, passthrough IO is now more efficient
>> than the regular bdev O_DIRECT path.
>>
>> Changes since v1:
>> - Remove spurious semicolon
>> - Cleanup struct nvme_uring_cmd_pdu handling
>>
>> --
>> Jens Axboe
>>
>>
> I see an improvement of ~12% (2.34 to 2.63 MIOPS) with polling enabled and
> an improvement of ~4% (1.84 to 1.92 MIOPS) with polling disabled using the
> t/io_uring utility (in fio) in my setup with this patch series!

Thanks for your testing! I'll add your reviewed-by to the series.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Series looks good.
Reviewed-by: Keith Busch <kbusch@kernel.org>