| Message ID | 20220927014420.71141-1-axboe@kernel.dk |
| --- | --- |
| Series | Enable alloc caching and batched freeing for passthrough |
On Tue, Sep 27, 2022 at 7:14 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Hi,
>
> The passthrough IO path currently doesn't do any request allocation
> batching like we do for normal IO. Wire this up through the usual
> blk_mq_alloc_request() allocation helper.
>
> Similarly, we don't currently support batched completions for
> passthrough IO. Allow the request->end_io() handler to return back
> whether or not it retains ownership of the request. By default all
> handlers are converted to returning RQ_END_IO_NONE, which retains
> the existing behavior. But with that in place, we can tweak the
> nvme uring_cmd end_io handler to pass back ownership, and hence enable
> completion batching for passthrough requests as well.
>
> This is good for a 10% improvement for passthrough performance. For
> a non-drive limited test case, passthrough IO is now more efficient
> than the regular bdev O_DIRECT path.
>
> Changes since v1:
> - Remove spurious semicolon
> - Cleanup struct nvme_uring_cmd_pdu handling
>
> --
> Jens Axboe
>
>

I see an improvement of ~12% (2.34 to 2.63 MIOPS) with polling enabled and
an improvement of ~4% (1.84 to 1.92 MIOPS) with polling disabled using the
t/io_uring utility (in fio) in my setup with this patch series!

--
Anuj Gupta
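[Editor's note] The ownership contract the cover letter describes boils down to the end_io return value. Below is a minimal, hypothetical handler; `struct my_cmd` and `my_pt_end_io()` are illustrative names, not from this series, and the sketch assumes the `enum rq_end_io_ret` / `RQ_END_IO_*` names the series introduces.

```c
#include <linux/blk-mq.h>
#include <linux/completion.h>

/* Illustrative only: struct my_cmd and my_pt_end_io() are not from the series. */
struct my_cmd {
	struct completion done;
	blk_status_t status;
};

static enum rq_end_io_ret my_pt_end_io(struct request *req, blk_status_t err)
{
	struct my_cmd *cmd = req->end_io_data;

	cmd->status = err;
	complete(&cmd->done);

	/*
	 * RQ_END_IO_FREE: this handler is done with the request, so the
	 * block layer frees it, and can now do so as part of a batch.
	 * A handler that still needs the request afterwards (for example
	 * one that defers work to task context, as the series describes
	 * for the nvme uring_cmd meta path) returns RQ_END_IO_NONE and
	 * stays responsible for freeing it.
	 */
	return RQ_END_IO_FREE;
}
```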
On Tue, Sep 27, 2022 at 7:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> The filesystem IO path can take advantage of allocating batches of
> requests, if the underlying submitter tells the block layer about it
> through the blk_plug. For passthrough IO, the exported API is the
> blk_mq_alloc_request() helper, and that one does not allow for
> request caching.
>
> Wire up request caching for blk_mq_alloc_request(), which is generally
> done without having a bio available upfront.
>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-mq.c | 80 ++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 71 insertions(+), 9 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index c11949d66163..d3a9f8b9c7ee 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -510,25 +510,87 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
>  			alloc_time_ns);
>  }
>
> -struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
> -		blk_mq_req_flags_t flags)
> +static struct request *blk_mq_rq_cache_fill(struct request_queue *q,
> +					    struct blk_plug *plug,
> +					    blk_opf_t opf,
> +					    blk_mq_req_flags_t flags)
>  {
>  	struct blk_mq_alloc_data data = {
>  		.q		= q,
>  		.flags		= flags,
>  		.cmd_flags	= opf,
> -		.nr_tags	= 1,
> +		.nr_tags	= plug->nr_ios,
> +		.cached_rq	= &plug->cached_rq,
>  	};
>  	struct request *rq;
> -	int ret;
>
> -	ret = blk_queue_enter(q, flags);
> -	if (ret)
> -		return ERR_PTR(ret);
> +	if (blk_queue_enter(q, flags))
> +		return NULL;
> +
> +	plug->nr_ios = 1;
>
>  	rq = __blk_mq_alloc_requests(&data);
> -	if (!rq)
> -		goto out_queue_exit;
> +	if (unlikely(!rq))
> +		blk_queue_exit(q);
> +	return rq;
> +}
> +
> +static struct request *blk_mq_alloc_cached_request(struct request_queue *q,
> +						   blk_opf_t opf,
> +						   blk_mq_req_flags_t flags)
> +{
> +	struct blk_plug *plug = current->plug;
> +	struct request *rq;
> +
> +	if (!plug)
> +		return NULL;
> +	if (rq_list_empty(plug->cached_rq)) {
> +		if (plug->nr_ios == 1)
> +			return NULL;
> +		rq = blk_mq_rq_cache_fill(q, plug, opf, flags);
> +		if (rq)
> +			goto got_it;
> +		return NULL;
> +	}
> +	rq = rq_list_peek(&plug->cached_rq);
> +	if (!rq || rq->q != q)
> +		return NULL;
> +
> +	if (blk_mq_get_hctx_type(opf) != rq->mq_hctx->type)
> +		return NULL;
> +	if (op_is_flush(rq->cmd_flags) != op_is_flush(opf))
> +		return NULL;
> +
> +	plug->cached_rq = rq_list_next(rq);
> +got_it:
> +	rq->cmd_flags = opf;
> +	INIT_LIST_HEAD(&rq->queuelist);
> +	return rq;
> +}
> +
> +struct request *blk_mq_alloc_request(struct request_queue *q, blk_opf_t opf,
> +				     blk_mq_req_flags_t flags)
> +{
> +	struct request *rq;
> +
> +	rq = blk_mq_alloc_cached_request(q, opf, flags);
> +	if (!rq) {
> +		struct blk_mq_alloc_data data = {
> +			.q		= q,
> +			.flags		= flags,
> +			.cmd_flags	= opf,
> +			.nr_tags	= 1,
> +		};
> +		int ret;
> +
> +		ret = blk_queue_enter(q, flags);
> +		if (ret)
> +			return ERR_PTR(ret);
> +
> +		rq = __blk_mq_alloc_requests(&data);
> +		if (!rq)
> +			goto out_queue_exit;
> +	}
>  	rq->__data_len = 0;
>  	rq->__sector = (sector_t) -1;
>  	rq->bio = rq->biotail = NULL;
> --
> 2.35.1
>

A large chunk of this improvement in passthrough performance comes from
enabling request caching. On my setup, performance improves from 2.34 to
2.54 MIOPS. I tested this using the t/io_uring utility (in fio) on an
Intel Optane Gen2 device.

Tested-by: Anuj Gupta <anuj20.g@samsung.com>

--
Anuj Gupta
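[Editor's note] A rough sketch of how a passthrough submitter would hit this cache. `submit_passthrough_batch()` is hypothetical and reuses the `my_pt_end_io()` / `struct my_cmd` sketch above; it assumes `blk_start_plug_nr_ios()` and the two-argument `blk_execute_rq_nowait()` of contemporary kernels.

```c
#include <linux/blkdev.h>
#include <linux/blk-mq.h>
#include <linux/err.h>

/* Hypothetical submission loop, not part of the patch. */
static void submit_passthrough_batch(struct request_queue *q,
				     struct my_cmd *cmds, unsigned short nr)
{
	struct blk_plug plug;
	unsigned short i;

	/*
	 * Announce how many requests we expect: the first
	 * blk_mq_alloc_request() call fills plug.cached_rq with up to nr
	 * requests, and later calls pop from that cache instead of going
	 * through the tag allocator each time.
	 */
	blk_start_plug_nr_ios(&plug, nr);

	for (i = 0; i < nr; i++) {
		struct request *rq;

		rq = blk_mq_alloc_request(q, REQ_OP_DRV_IN, 0);
		if (IS_ERR(rq))
			break;

		/* fill in the driver-specific command via blk_mq_rq_to_pdu(rq) here */
		rq->end_io_data = &cmds[i];
		rq->end_io = my_pt_end_io;
		blk_execute_rq_nowait(rq, false);
	}

	/* Flushing the plug also releases any cached requests left unused. */
	blk_finish_plug(&plug);
}
```

The design choice here is that the cache is keyed off current->plug, so a caller gets allocation batching simply by holding a plug across its submission loop, just as the normal bio-based path already does.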
On Tue, Sep 27, 2022 at 7:20 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> With end_io handlers now being able to potentially pass ownership of
> the request upon completion, we can allow requests with end_io handlers
> in the batch completion handling.
>
> Co-developed-by: Stefan Roesch <shr@fb.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  block/blk-mq.c         | 13 +++++++++++--
>  include/linux/blk-mq.h |  3 ++-
>  2 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index a4e018c82b7c..a7dfe7a898a4 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -823,8 +823,10 @@ static void blk_complete_request(struct request *req)
>  	 * can find how many bytes remain in the request
>  	 * later.
>  	 */
> -	req->bio = NULL;
> -	req->__data_len = 0;
> +	if (!req->end_io) {
> +		req->bio = NULL;
> +		req->__data_len = 0;
> +	}
>  }
>
>  /**
> @@ -1055,6 +1057,13 @@ void blk_mq_end_request_batch(struct io_comp_batch *iob)
>
>  		rq_qos_done(rq->q, rq);
>
> +		/*
> +		 * If end_io handler returns NONE, then it still has
> +		 * ownership of the request.
> +		 */
> +		if (rq->end_io && rq->end_io(rq, 0) == RQ_END_IO_NONE)
> +			continue;
> +
>  		WRITE_ONCE(rq->state, MQ_RQ_IDLE);
>  		if (!req_ref_put_and_test(rq))
>  			continue;
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index e6fa49dd6196..50811d0fb143 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -853,8 +853,9 @@ static inline bool blk_mq_add_to_batch(struct request *req,
>  					struct io_comp_batch *iob, int ioerror,
>  					void (*complete)(struct io_comp_batch *))
>  {
> -	if (!iob || (req->rq_flags & RQF_ELV) || req->end_io || ioerror)
> +	if (!iob || (req->rq_flags & RQF_ELV) || ioerror)
>  		return false;
> +
>  	if (!iob->complete)
>  		iob->complete = complete;
>  	else if (iob->complete != complete)
> --
> 2.35.1
>

Looks good.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>

--
Anuj Gupta
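[Editor's note] As a rough illustration of what this enables on the driver side: `struct my_queue`, `my_next_completed()`, and `my_complete_batch()` below are hypothetical stand-ins for a driver's own completion-queue walk; only blk_mq_add_to_batch(), blk_mq_complete_request(), and blk_mq_end_request_batch() are real block-layer APIs.

```c
#include <linux/blk-mq.h>

struct my_queue;                                        /* illustrative */
struct request *my_next_completed(struct my_queue *mq); /* illustrative */

static void my_complete_batch(struct io_comp_batch *iob)
{
	/* per-request driver teardown (DMA unmap, etc.) would go here */
	blk_mq_end_request_batch(iob);
}

static int my_poll(struct my_queue *mq, struct io_comp_batch *iob)
{
	struct request *req;
	int found = 0;

	while ((req = my_next_completed(mq)) != NULL) {
		found++;
		/*
		 * With this patch, requests carrying an end_io handler may
		 * be batched too; blk_mq_end_request_batch() calls the
		 * handler and only frees the request if it returns
		 * something other than RQ_END_IO_NONE.  Fall back to
		 * individual completion when batching is not possible
		 * (no iob, elevator-attached request, or an error).
		 */
		if (!blk_mq_add_to_batch(req, iob, 0, my_complete_batch))
			blk_mq_complete_request(req);
	}
	return found;
}
```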
On Tue, Sep 27, 2022 at 7:19 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> Now that the normal passthrough end_io path doesn't need the request
> anymore, we can kill the explicit blk_mq_free_request() and just pass
> back RQ_END_IO_FREE instead. This enables the batched completion path
> to free batches of requests at that time.
>
> This brings passthrough IO performance at least on par with bdev based
> O_DIRECT with io_uring. With this and batched allocations, peak performance
> goes from 110M IOPS to 122M IOPS. For IRQ based, passthrough is now also
> about 10% faster than previously, going from ~61M to ~67M IOPS.
>
> Co-developed-by: Stefan Roesch <shr@fb.com>
> Signed-off-by: Jens Axboe <axboe@kernel.dk>
> ---
>  drivers/nvme/host/ioctl.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/nvme/host/ioctl.c b/drivers/nvme/host/ioctl.c
> index 9e356a6c96c2..d9633f426690 100644
> --- a/drivers/nvme/host/ioctl.c
> +++ b/drivers/nvme/host/ioctl.c
> @@ -423,8 +423,7 @@ static enum rq_end_io_ret nvme_uring_cmd_end_io(struct request *req,
>  	else
>  		io_uring_cmd_complete_in_task(ioucmd, nvme_uring_task_cb);
>
> -	blk_mq_free_request(req);
> -	return RQ_END_IO_NONE;
> +	return RQ_END_IO_FREE;
>  }
>
>  static enum rq_end_io_ret nvme_uring_cmd_end_io_meta(struct request *req,
> --
> 2.35.1
>

Looks good to me.

Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>

--
Anuj Gupta
On 9/28/22 7:23 AM, Anuj Gupta wrote:
> On Tue, Sep 27, 2022 at 7:14 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> Hi,
>>
>> The passthrough IO path currently doesn't do any request allocation
>> batching like we do for normal IO. Wire this up through the usual
>> blk_mq_alloc_request() allocation helper.
>>
>> Similarly, we don't currently support batched completions for
>> passthrough IO. Allow the request->end_io() handler to return back
>> whether or not it retains ownership of the request. By default all
>> handlers are converted to returning RQ_END_IO_NONE, which retains
>> the existing behavior. But with that in place, we can tweak the
>> nvme uring_cmd end_io handler to pass back ownership, and hence enable
>> completion batching for passthrough requests as well.
>>
>> This is good for a 10% improvement for passthrough performance. For
>> a non-drive limited test case, passthrough IO is now more efficient
>> than the regular bdev O_DIRECT path.
>>
>> Changes since v1:
>> - Remove spurious semicolon
>> - Cleanup struct nvme_uring_cmd_pdu handling
>>
>> --
>> Jens Axboe
>>
>>
> I see an improvement of ~12% (2.34 to 2.63 MIOPS) with polling enabled and
> an improvement of ~4% (1.84 to 1.92 MIOPS) with polling disabled using the
> t/io_uring utility (in fio) in my setup with this patch series!

Thanks for your testing! I'll add your reviewed-by to the series.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Series looks good.
Reviewed-by: Keith Busch <kbusch@kernel.org>