[RFC,v3,0/2] add simple copy support

Message ID	20201211135139.49232-1-selvakuma.s1@samsung.com
Headers	show Return-Path: <linux-scsi-owner@kernel.org> From: SelvaKumar S <selvakuma.s1@samsung.com> To: linux-nvme@lists.infradead.org Cc: kbusch@kernel.org, axboe@kernel.dk, damien.lemoal@wdc.com, Johannes.Thumshirn@wdc.com, hch@lst.de, sagi@grimberg.me, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, linux-scsi@vger.kernel.org, martin.petersen@oracle.com, bvanassche@acm.org, mpatocka@redhat.com, hare@suse.de, dm-devel@redhat.com, snitzer@redhat.com, selvajove@gmail.com, nj.shetty@samsung.com, joshi.k@samsung.com, javier.gonz@samsung.com, SelvaKumar S <selvakuma.s1@samsung.com> Subject: [RFC PATCH v3 0/2] add simple copy support Date: Fri, 11 Dec 2020 19:21:37 +0530 Message-Id: <20201211135139.49232-1-selvakuma.s1@samsung.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Content-Type: text/plain; charset="utf-8" CMS-TYPE: 105P References: <CGME20201211135154epcas5p34cd7f57fb4b13fcd50619efdc0099fa8@epcas5p3.samsung.com> Precedence: bulk
Series	add simple copy support \| expand [RFC,v3,0/2] add simple copy support [RFC,v3,1/2] block: add simple copy support [RFC,v3,2/2] nvme: add simple copy support

Message ID

20201211135139.49232-1-selvakuma.s1@samsung.com

Headers

From: SelvaKumar S <selvakuma.s1@samsung.com>
To: linux-nvme@lists.infradead.org
Cc: kbusch@kernel.org, axboe@kernel.dk, damien.lemoal@wdc.com,
	Johannes.Thumshirn@wdc.com, hch@lst.de, sagi@grimberg.me,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-scsi@vger.kernel.org, martin.petersen@oracle.com,
	bvanassche@acm.org, mpatocka@redhat.com, hare@suse.de,
	dm-devel@redhat.com, snitzer@redhat.com, selvajove@gmail.com,
	nj.shetty@samsung.com, joshi.k@samsung.com,
	javier.gonz@samsung.com, SelvaKumar S <selvakuma.s1@samsung.com>
Subject: [RFC PATCH v3 0/2] add simple copy support
Date: Fri, 11 Dec 2020 19:21:37 +0530
Message-Id: <20201211135139.49232-1-selvakuma.s1@samsung.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="utf-8"
CMS-TYPE: 105P
References: <CGME20201211135154epcas5p34cd7f57fb4b13fcd50619efdc0099fa8@epcas5p3.samsung.com>
Precedence: bulk

Series

add simple copy support | expand

Message

SelvaKumar S Dec. 11, 2020, 1:51 p.m. UTC

This patchset tries to add support for TP4065a ("Simple Copy Command"),
v2020.05.04 ("Ratified")

The Specification can be found in following link.
https://nvmexpress.org/wp-content/uploads/NVM-Express-1.4-Ratified-TPs-1.zip

This is an RFC. Looking forward for any feedbacks or other alternate
designs for plumbing simple copy to IO stack.

Simple copy command is a copy offloading operation and is  used to copy
multiple contiguous ranges (source_ranges) of LBA's to a single destination
LBA within the device reducing traffic between host and device.

This implementation accepts destination, no of sources and arrays of
source ranges from application. For devices suporting native copy
offloading, attach the control informantion as payload to the bio and
submits to the device. For devices without native copy support, copy
emulation is done by reading source range into memory and writing it to
the destination.

Following limits are added to queue limits and are exposed in sysfs
to userspace
	- *copy_offload* controls copy_offload. set 0 to disable copy
offload, 1 to enable native copy offloading support.
	- *max_copy_sectors* limits the sum of all source_range length
	- *max_copy_nr_ranges* limits the number of source ranges
	- *max_copy_range_sectors* limit the maximum number of sectors
		that can constitute a single source range.

	max_copy_sectors = 0 indicates the device doesn't support copy
offloading.

	*copy offload* sysfs entry is configurable and can be used toggle
between emulation and native support depending upon the usecase.

Changes from v2

1. Add emulation support for devices not supporting copy.
2. Add *copy_offload* sysfs entry to enable and disable copy_offload
	in devices supporting simple copy.
3. Remove simple copy support for stacked devices.

Changes from v1:

1. Fix memory leak in __blkdev_issue_copy
2. Unmark blk_check_copy inline
3. Fix line break in blk_check_copy_eod
4. Remove p checks and made code more readable
5. Don't use bio_set_op_attrs and remove op and set
   bi_opf directly
6. Use struct_size to calculate total_size
7. Fix partition remap of copy destination
8. Remove mcl,mssrl,msrc from nvme_ns
9. Initialize copy queue limits to 0 in nvme_config_copy
10. Remove return in QUEUE_FLAG_COPY check
11. Remove unused OCFS

SelvaKumar S (2):
  block: add simple copy support
  nvme: add simple copy support

 block/blk-core.c          |  94 ++++++++++++++++++--
 block/blk-lib.c           | 182 ++++++++++++++++++++++++++++++++++++++
 block/blk-merge.c         |   2 +
 block/blk-settings.c      |  10 +++
 block/blk-sysfs.c         |  50 +++++++++++
 block/blk-zoned.c         |   1 +
 block/bounce.c            |   1 +
 block/ioctl.c             |  43 +++++++++
 drivers/nvme/host/core.c  |  89 +++++++++++++++++++
 include/linux/bio.h       |   1 +
 include/linux/blk_types.h |  15 ++++
 include/linux/blkdev.h    |  16 ++++
 include/linux/nvme.h      |  43 ++++++++-
 include/uapi/linux/fs.h   |  13 +++
 14 files changed, 549 insertions(+), 11 deletions(-)

Comments

Keith Busch Dec. 11, 2020, 6:04 p.m. UTC | #1

On Fri, Dec 11, 2020 at 07:21:38PM +0530, SelvaKumar S wrote:
> +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload *payload,
> +		gfp_t gfp_mask)
> +{
> +	struct request_queue *q = bdev_get_queue(bdev);
> +	struct bio *bio;
> +	void *buf = NULL;
> +	int i, nr_srcs, max_range_len, ret, cur_dest, cur_size;
> +
> +	nr_srcs = payload->copy_range;
> +	max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT;

The default value for this limit is 0, and this is the function for when
the device doesn't support copy. Are we expecting drivers to set this
value to something else for that case?

> +	cur_dest = payload->dest;
> +	buf = kvmalloc(max_range_len, GFP_ATOMIC);
> +	if (!buf)
> +		return -ENOMEM;
> +
> +	for (i = 0; i < nr_srcs; i++) {
> +		bio = bio_alloc(gfp_mask, 1);
> +		bio->bi_iter.bi_sector = payload->range[i].src;
> +		bio->bi_opf = REQ_OP_READ;
> +		bio_set_dev(bio, bdev);
> +
> +		cur_size = payload->range[i].len << SECTOR_SHIFT;
> +		ret = bio_add_page(bio, virt_to_page(buf), cur_size,
> +						   offset_in_page(payload));

'buf' is vmalloc'ed, so we don't necessarily have congituous pages. I
think you need to allocate the bio with bio_map_kern() or something like
that instead with that kind of memory.

> +		if (ret != cur_size) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		ret = submit_bio_wait(bio);
> +		bio_put(bio);
> +		if (ret)
> +			goto out;
> +
> +		bio = bio_alloc(gfp_mask, 1);
> +		bio_set_dev(bio, bdev);
> +		bio->bi_opf = REQ_OP_WRITE;
> +		bio->bi_iter.bi_sector = cur_dest;
> +		ret = bio_add_page(bio, virt_to_page(buf), cur_size,
> +						   offset_in_page(payload));
> +		if (ret != cur_size) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		ret = submit_bio_wait(bio);
> +		bio_put(bio);
> +		if (ret)
> +			goto out;
> +
> +		cur_dest += payload->range[i].len;
> +	}

I think this would be a faster implementation if the reads were
asynchronous with a payload buffer allocated specific to that read, and
the callback can enqueue the write part. This would allow you to
accumulate all the read data and write it in a single call.

Selva Jove Dec. 14, 2020, 6:57 a.m. UTC | #2

On Fri, Dec 11, 2020 at 11:35 PM Keith Busch <kbusch@kernel.org> wrote:
>

> On Fri, Dec 11, 2020 at 07:21:38PM +0530, SelvaKumar S wrote:

> > +int blk_copy_emulate(struct block_device *bdev, struct blk_copy_payload *payload,

> > +             gfp_t gfp_mask)

> > +{

> > +     struct request_queue *q = bdev_get_queue(bdev);

> > +     struct bio *bio;

> > +     void *buf = NULL;

> > +     int i, nr_srcs, max_range_len, ret, cur_dest, cur_size;

> > +

> > +     nr_srcs = payload->copy_range;

> > +     max_range_len = q->limits.max_copy_range_sectors << SECTOR_SHIFT;

>

> The default value for this limit is 0, and this is the function for when

> the device doesn't support copy. Are we expecting drivers to set this

> value to something else for that case?


Sorry. Missed that. Will add a fix.

>

> > +     cur_dest = payload->dest;

> > +     buf = kvmalloc(max_range_len, GFP_ATOMIC);

> > +     if (!buf)

> > +             return -ENOMEM;

> > +

> > +     for (i = 0; i < nr_srcs; i++) {

> > +             bio = bio_alloc(gfp_mask, 1);

> > +             bio->bi_iter.bi_sector = payload->range[i].src;

> > +             bio->bi_opf = REQ_OP_READ;

> > +             bio_set_dev(bio, bdev);

> > +

> > +             cur_size = payload->range[i].len << SECTOR_SHIFT;

> > +             ret = bio_add_page(bio, virt_to_page(buf), cur_size,

> > +                                                offset_in_page(payload));

>

> 'buf' is vmalloc'ed, so we don't necessarily have congituous pages. I

> think you need to allocate the bio with bio_map_kern() or something like

> that instead with that kind of memory.

>


Sure. Will use bio_map_kern().

> > +             if (ret != cur_size) {

> > +                     ret = -ENOMEM;

> > +                     goto out;

> > +             }

> > +

> > +             ret = submit_bio_wait(bio);

> > +             bio_put(bio);

> > +             if (ret)

> > +                     goto out;

> > +

> > +             bio = bio_alloc(gfp_mask, 1);

> > +             bio_set_dev(bio, bdev);

> > +             bio->bi_opf = REQ_OP_WRITE;

> > +             bio->bi_iter.bi_sector = cur_dest;

> > +             ret = bio_add_page(bio, virt_to_page(buf), cur_size,

> > +                                                offset_in_page(payload));

> > +             if (ret != cur_size) {

> > +                     ret = -ENOMEM;

> > +                     goto out;

> > +             }

> > +

> > +             ret = submit_bio_wait(bio);

> > +             bio_put(bio);

> > +             if (ret)

> > +                     goto out;

> > +

> > +             cur_dest += payload->range[i].len;

> > +     }

>

> I think this would be a faster implementation if the reads were

> asynchronous with a payload buffer allocated specific to that read, and

> the callback can enqueue the write part. This would allow you to

> accumulate all the read data and write it in a single call.


Sounds like a better approach. Will add this implementation in v4.