Message ID: 20200930162010.21610-1-borisp@mellanox.com
Series: nvme-tcp receive offloads
> + * tcp_ddp.h
> + * Author: Boris Pismenny <borisp@mellanox.com>
> + * Copyright (C) 2020 Mellanox Technologies.
> + */
> +#ifndef _TCP_DDP_H
> +#define _TCP_DDP_H
> +
> +#include <linux/blkdev.h>

Why is blkdev.h needed?

> +#include <linux/netdevice.h>
> +#include <net/inet_connection_sock.h>
> +#include <net/sock.h>
> +
> +/* limits returned by the offload driver, zero means don't care */
> +struct tcp_ddp_limits {
> +	int	max_ddp_sgl_len;
> +};
> +
> +enum tcp_ddp_type {
> +	TCP_DDP_NVME = 1,
> +};
> +
> +struct tcp_ddp_config {
> +	enum tcp_ddp_type	type;
> +	unsigned char		buf[];

A little kdoc may help here as it's not exactly clear what buf is
used for (at this point at least)...

> +};
> +
> +struct nvme_tcp_config {

struct nvme_tcp_ddp_config

> +	struct tcp_ddp_config	cfg;
> +
> +	u16			pfv;
> +	u8			cpda;
> +	u8			dgst;
> +	int			queue_size;
> +	int			queue_id;
> +	int			io_cpu;
> +};
> +

Other than that this looks good to me.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
>   bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
> +void nvme_tcp_ddp_teardown_done(void *ddp_ctx);
>   const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops __read_mostly = {
> +
>   	.resync_request		= nvme_tcp_resync_request,
> +	.ddp_teardown_done	= nvme_tcp_ddp_teardown_done,
>   };
>
> +static
> +int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
> +			  uint16_t command_id,
> +			  struct request *rq)
> +{
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +	int ret;
> +
> +	if (unlikely(!netdev)) {
> +		pr_info_ratelimited("%s: netdev not found\n", __func__);

dev_info_ratelimited

> +		return -EINVAL;
> +	}
> +
> +	ret = netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, queue->sock->sk,
> +						    &req->ddp, rq);
> +	sg_free_table_chained(&req->ddp.sg_table, SG_CHUNK_SIZE);
> +	req->offloaded = false;
> +	return ret;
> +}
> +
> +void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
> +{
> +	struct request *rq = ddp_ctx;
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +
> +	if (!nvme_try_complete_req(rq, cpu_to_le16(req->status << 1), req->result))
> +		nvme_complete_rq(rq);
> +}
> +
> +static
> +int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
> +		       uint16_t command_id,
> +		       struct request *rq)
> +{
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +	struct net_device *netdev = queue->ctrl->offloading_netdev;
> +	int ret;
> +
> +	req->offloaded = false;
> +
> +	if (unlikely(!netdev)) {
> +		pr_info_ratelimited("%s: netdev not found\n", __func__);
> +		return -EINVAL;
> +	}
> +
> +	req->ddp.command_id = command_id;
> +	req->ddp.sg_table.sgl = req->ddp.first_sgl;
> +	ret = sg_alloc_table_chained(&req->ddp.sg_table,
> +		blk_rq_nr_phys_segments(rq), req->ddp.sg_table.sgl,
> +		SG_CHUNK_SIZE);
> +	if (ret)
> +		return -ENOMEM;

newline here

> +	req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl);
> +
> +	ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev,
> +						 queue->sock->sk,
> +						 &req->ddp);
> +	if (!ret)
> +		req->offloaded = true;
> +	return ret;
> +}
> +
>   static
>   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>   			    struct nvme_tcp_config *config)
> @@ -351,6 +422,25 @@ bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags)
>
>   #else
>
> +static
> +int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue,
> +		       uint16_t command_id,
> +		       struct request *rq)
> +{
> +	return -EINVAL;
> +}
> +
> +static
> +int nvme_tcp_teardown_ddp(struct nvme_tcp_queue *queue,
> +			  uint16_t command_id,
> +			  struct request *rq)
> +{
> +	return -EINVAL;
> +}
> +
> +void nvme_tcp_ddp_teardown_done(void *ddp_ctx)
> +{}
> +
>   static
>   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>   			    struct nvme_tcp_config *config)
> @@ -630,6 +720,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
>   static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>   		struct nvme_completion *cqe)
>   {
> +	struct nvme_tcp_request *req;
>   	struct request *rq;
>
>   	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
> @@ -641,8 +732,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>   		return -EINVAL;
>   	}
>
> -	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
> -		nvme_complete_rq(rq);
> +	req = blk_mq_rq_to_pdu(rq);
> +	if (req->offloaded) {
> +		req->status = cqe->status;
> +		req->result = cqe->result;
> +		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
> +	} else {
> +		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
> +			nvme_complete_rq(rq);
> +	}
>   	queue->nr_cqe++;
>
>   	return 0;
> @@ -836,9 +934,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   static inline void nvme_tcp_end_request(struct request *rq, u16 status)
>   {
>   	union nvme_result res = {};
> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
> +	struct nvme_tcp_queue *queue = req->queue;
> +	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;
>
> -	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
> -		nvme_complete_rq(rq);
> +	if (req->offloaded) {
> +		req->status = cpu_to_le16(status << 1);
> +		req->result = res;
> +		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);
> +	} else {
> +		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))
> +			nvme_complete_rq(rq);
> +	}
>   }
>
>   static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
> @@ -1115,6 +1222,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>   	bool inline_data = nvme_tcp_has_inline_data(req);
>   	u8 hdgst = nvme_tcp_hdgst_len(queue);
>   	int len = sizeof(*pdu) + hdgst - req->offset;
> +	struct request *rq = blk_mq_rq_from_pdu(req);
>   	int flags = MSG_DONTWAIT;
>   	int ret;
>
> @@ -1123,6 +1231,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>   	else
>   		flags |= MSG_EOR;
>
> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
> +	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
> +		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);

I'd assume that this is something we want to setup in
nvme_tcp_setup_cmd_pdu. Why do it here?
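[Editor's note: the "complete now or defer until DDP teardown" split above appears twice, in nvme_tcp_process_nvme_cqe() and nvme_tcp_end_request(). A possible way to factor it is sketched below; this is not part of the posted series, and the helper name and parameter list are illustrative only.]

```c
/* Illustrative helper only: consolidates the offloaded/non-offloaded
 * completion paths shown in the hunks above.
 */
static void nvme_tcp_complete_request(struct nvme_tcp_queue *queue,
				      struct request *rq, __le16 status,
				      union nvme_result result,
				      u16 command_id)
{
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

	if (req->offloaded) {
		/* Defer the completion until the HW has released its
		 * command_id -> buffer mapping; nvme_tcp_ddp_teardown_done()
		 * finishes the request asynchronously.
		 */
		req->status = status;
		req->result = result;
		nvme_tcp_teardown_ddp(queue, command_id, rq);
		return;
	}

	if (!nvme_try_complete_req(rq, status, result))
		nvme_complete_rq(rq);
}
```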
> crc offload of the nvme capsule. Check if all the skb bits
> are on, and if not recalculate the crc in SW and check it.

Can you clarify in the patch description that this is only
for pdu data digest and not header digest?

>
> This patch reworks the receive-side crc calculation to always
> run at the end, so as to keep a single flow for both offload
> and non-offload. This change simplifies the code, but it may degrade
> performance for non-offload crc calculation.

??

From my scan it doesn't look like you do that.. Am I missing something?
Can you explain?

>
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> ---
>   drivers/nvme/host/tcp.c | 66 ++++++++++++++++++++++++++++++++++++-----
>   1 file changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 7bd97f856677..9a620d1dacb4 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -94,6 +94,7 @@ struct nvme_tcp_queue {
>   	size_t			data_remaining;
>   	size_t			ddgst_remaining;
>   	unsigned int		nr_cqe;
> +	bool			crc_valid;
>
>   	/* send state */
>   	struct nvme_tcp_request *request;
> @@ -233,6 +234,41 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req,
>   	return nvme_tcp_pdu_data_left(req) <= len;
>   }
>
> +static inline bool nvme_tcp_device_ddgst_ok(struct nvme_tcp_queue *queue)

Maybe call it nvme_tcp_ddp_ddgst_ok?

> +{
> +	return queue->crc_valid;
> +}
> +
> +static inline void nvme_tcp_device_ddgst_update(struct nvme_tcp_queue *queue,
> +						struct sk_buff *skb)

Maybe call it nvme_tcp_ddp_ddgst_update?

> +{
> +	if (queue->crc_valid)
> +#ifdef CONFIG_TCP_DDP_CRC
> +		queue->crc_valid = skb->ddp_crc;
> +#else
> +		queue->crc_valid = false;
> +#endif
> +}
> +
> +static void nvme_tcp_crc_recalculate(struct nvme_tcp_queue *queue,
> +				     struct nvme_tcp_data_pdu *pdu)

Maybe call it nvme_tcp_ddp_ddgst_recalc?

> +{
> +	struct nvme_tcp_request *req;
> +	struct request *rq;
> +
> +	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
> +	if (!rq)
> +		return;
> +	req = blk_mq_rq_to_pdu(rq);
> +	crypto_ahash_init(queue->rcv_hash);
> +	req->ddp.sg_table.sgl = req->ddp.first_sgl;
> +	/* req->ddp.sg_table is allocated and filled in nvme_tcp_setup_ddp */
> +	ahash_request_set_crypt(queue->rcv_hash, req->ddp.sg_table.sgl, NULL,
> +				le32_to_cpu(pdu->data_length));
> +	crypto_ahash_update(queue->rcv_hash);
> +}
> +
> +
>   #ifdef CONFIG_TCP_DDP
>
>   bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags);
> @@ -706,6 +742,7 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>   	queue->pdu_offset = 0;
>   	queue->data_remaining = -1;
>   	queue->ddgst_remaining = 0;
> +	queue->crc_valid = true;
>   }
>
>   static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> @@ -955,6 +992,8 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   	struct nvme_tcp_request *req;
>   	struct request *rq;
>
> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
> +		nvme_tcp_device_ddgst_update(queue, skb);

Is the queue->data_digest condition missing here?

>   	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
>   	if (!rq) {
>   		dev_err(queue->ctrl->ctrl.device,
> @@ -992,7 +1031,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>   		recv_len = min_t(size_t, recv_len,
>   				iov_iter_count(&req->iter));
>
> -		if (queue->data_digest)
> +		if (queue->data_digest && !test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
>   			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
>   				&req->iter, recv_len, queue->rcv_hash);

This is the skb copy and hash, not clear why you say that you move this
to the end...

>   		else
> @@ -1012,7 +1051,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>
>   	if (!queue->data_remaining) {
>   		if (queue->data_digest) {
> -			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);

If I instead do:
		if (!test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);

Does that help the mess in nvme_tcp_recv_ddgst?

>   			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>   		} else {
>   			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
> @@ -1033,8 +1071,11 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>   	char *ddgst = (char *)&queue->recv_ddgst;
>   	size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
>   	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
> +	bool ddgst_offload_fail;
>   	int ret;
>
> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
> +		nvme_tcp_device_ddgst_update(queue, skb);
>   	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
>   	if (unlikely(ret))
>   		return ret;
> @@ -1045,12 +1086,21 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>   	if (queue->ddgst_remaining)
>   		return 0;
>
> -	if (queue->recv_ddgst != queue->exp_ddgst) {
> -		dev_err(queue->ctrl->ctrl.device,
> -			"data digest error: recv %#x expected %#x\n",
> -			le32_to_cpu(queue->recv_ddgst),
> -			le32_to_cpu(queue->exp_ddgst));
> -		return -EIO;
> +	ddgst_offload_fail = !nvme_tcp_device_ddgst_ok(queue);
> +	if (!test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) ||
> +	    ddgst_offload_fail) {
> +		if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
> +		    ddgst_offload_fail)
> +			nvme_tcp_crc_recalculate(queue, pdu);
> +
> +		nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
> +		if (queue->recv_ddgst != queue->exp_ddgst) {
> +			dev_err(queue->ctrl->ctrl.device,
> +				"data digest error: recv %#x expected %#x\n",
> +				le32_to_cpu(queue->recv_ddgst),
> +				le32_to_cpu(queue->exp_ddgst));
> +			return -EIO;

This gets convoluted here...

> +		}
>   	}
>
>   	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>
>>   static
>>   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>>   			    struct nvme_tcp_config *config)
>> @@ -630,6 +720,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
>>   static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>>   		struct nvme_completion *cqe)
>>   {
>> +	struct nvme_tcp_request *req;
>>   	struct request *rq;
>>
>>   	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
>> @@ -641,8 +732,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>>   		return -EINVAL;
>>   	}
>>
>> -	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>> -		nvme_complete_rq(rq);
>> +	req = blk_mq_rq_to_pdu(rq);
>> +	if (req->offloaded) {
>> +		req->status = cqe->status;
>> +		req->result = cqe->result;
>> +		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
>> +	} else {
>> +		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>> +			nvme_complete_rq(rq);
>> +	}

Oh forgot to ask,

We have places in the driver that we may complete (cancel) one
or more requests from the error recovery or timeout flow. We
first prevent future incoming RX on the socket such that we
can safely cancel requests. This may break with the deferred
completion in ddp_teardown_done.

If I have a request that is waiting for ddp_teardown_done, do
I have a way to tell the HW to never call ddp_teardown_done
on a specific socket?

If so, the place to do it is in nvme_tcp_stop_queue.
On 9/30/20 9:20 AM, Boris Pismenny wrote:
> This series adds support for nvme-tcp receive offloads
> which do not mandate the offload of the network stack to the device.
> Instead, these work together with TCP to offload:
> 1. copy from SKB to the block layer buffers
> 2. CRC verification for received PDU
>
> The series implements these as a generic offload infrastructure for storage
> protocols, which we call TCP Direct Data Placement (TCP_DDP) and TCP DDP CRC,
> respectively. We use this infrastructure to implement NVMe-TCP offload for copy
> and CRC. Future implementations can reuse the same infrastructure for other
> protocols such as iSCSI.
>
> Note:
> These offloads are similar in nature to the packet-based NIC TLS offloads,
> which are already upstream (see net/tls/tls_device.c).
> You can read more about TLS offload here:
> https://www.kernel.org/doc/html/latest/networking/tls-offload.html
>
> Initialization and teardown:
> =========================================
> The offload for IO queues is initialized after the handshake of the
> NVMe-TCP protocol is finished by calling `nvme_tcp_offload_socket`
> with the tcp socket of the nvme_tcp_queue:
> This operation sets up all relevant hardware contexts. If it fails, then the
> IO queue proceeds as usual with no offload.
> If it succeeds then `nvme_tcp_setup_ddp` and `nvme_tcp_teardown_ddp` may be
> called to perform copy offload, and crc offload will be used.
> This initialization does not change the normal operation of nvme-tcp in any
> way besides adding the option to call the above mentioned NDO operations.
>
> For the admin queue, nvme-tcp does not initialize the offload.
> Instead, nvme-tcp calls the driver to configure limits for the controller,
> such as max_hw_sectors and max_segments; these must be limited to accommodate
> potential HW resource limits, and to improve performance.
>
> If some error occurred, and the IO queue must be closed or reconnected, then
> the offload is torn down and initialized again. Additionally, we handle netdev
> down events via the existing error recovery flow.
>
> Copy offload works as follows:
> =========================================
> The nvme-tcp layer calls the NIC driver to map block layer buffers to ccid using
> `nvme_tcp_setup_ddp` before sending the read request. When the response is
> received, the NIC HW will write the PDU payload directly into the
> designated buffer, and build an SKB such that it points into the destination
> buffer; this SKB represents the entire packet received on the wire, but it
> points to the block layer buffers. Once nvme-tcp attempts to copy data from
> this SKB to the block layer buffer it can skip the copy by checking in the
> copying function (memcpy_to_page):
> if (src == dst) -> skip copy
> Finally, when the PDU has been processed to completion, the nvme-tcp layer
> releases the NIC HW context by calling `nvme_tcp_teardown_ddp`, which
> asynchronously unmaps the buffers from NIC HW.
>
> As the last change is to a sensitive function, we are careful to place it under
> a static_key which is only enabled when this functionality is actually used for
> nvme-tcp copy offload.
>
> Asynchronous completion:
> =========================================
> The NIC must release its mapping between command IDs and the target buffers.
> This mapping is released when NVMe-TCP calls the NIC
> driver (`nvme_tcp_offload_socket`).
> As completing IOs is performance critical, we introduce asynchronous
> completions for NVMe-TCP, i.e.
> NVMe-TCP calls the NIC, which will later
> call NVMe-TCP to complete the IO (`nvme_tcp_ddp_teardown_done`).
>
> An alternative approach is to move all the functions related to copying from
> SKBs to the block layer buffers inside the nvme-tcp code - about 200 LOC.
>
> CRC offload works as follows:
> =========================================
> After offload is initialized, we use the SKB's ddp_crc bit to indicate that:
> "there was no problem with the verification of all CRC fields in this packet's
> payload". The bit is set to zero if there was an error, or if HW skipped
> offload for some reason. If *any* SKB in a PDU has (ddp_crc != 1), then software
> must compute the CRC, and check it. We perform this check, and the
> accompanying software fallback, at the end of processing a received PDU.
>
> SKB changes:
> =========================================
> The CRC offload requires an additional bit in the SKB, which is useful for
> preventing the coalescing of SKBs with different crc offload values. This bit
> is similar in concept to the "decrypted" bit.
>
> Performance:
> =========================================
> The expected performance gain from this offload varies with the block size.
> We perform a CPU cycles breakdown of the copy/CRC operations in nvme-tcp
> fio random read workloads:
> For 4K blocks we see up to 11% improvement for a 100% read fio workload,
> while for 128K blocks we see up to 52%. If we run nvme-tcp, and skip these
> operations, then we observe a gain of about 1.1x and 2x respectively.

Nice!

>
> Resynchronization:
> =========================================
> The resynchronization flow is performed to reset the hardware tracking of
> NVMe-TCP PDUs within the TCP stream. The flow consists of a request from
> the driver, regarding a possible location of a PDU header, followed by
> a response from the nvme-tcp driver.
>
> This flow is rare, and it should happen only after packet loss or
> reordering events that involve nvme-tcp PDU headers.
>
> The patches are organized as follows:
> =========================================
> Patch 1 is the iov_iter change to skip the copy if (src == dst).
> Patches 2-3 add the infrastructure for all TCP DDP
>             and TCP DDP CRC offloads, respectively.
> Patch 4 exposes the get_netdev_for_sock function from TLS.
> Patch 5 contains the NVMe-TCP changes to call the NIC driver on queue
>             init/teardown.
> Patch 6 contains the NVMe-TCP changes to call the NIC driver on IO operation
>             setup/teardown, and to support async completions.
> Patch 7 contains the NVMe-TCP changes to support CRC offload on receive.
>             This patch also moves the CRC calculation to the end of the PDU
>             in case the offload requires a software fallback.
> Patch 8 adds NVMe-TCP handling of netdev events: stop the offload if the
>             netdev is going down.
> Patches 9-10 implement support for NVMe-TCP copy and CRC offload in
>             the mlx5 NIC driver.
>
> Testing:
> =========================================
> This series was tested using fio with various configurations of IO sizes,
> depths, MTUs, and with both the SPDK and kernel NVMe-TCP targets.
>
> Future work:
> =========================================
> A follow-up series will introduce support for transmit side CRC. Then,
> we will work on adding support for TLS in NVMe-TCP and combining the
> two offloads.

Boris, Or and Yoray,

Thanks for submitting this work. Overall this looks good to me. The
model here is not messy at all, which is not trivial when it comes to
tcp offloads.

Gave you comments in the patches themselves, but overall this looks
good! Would love to see the TLS work from you moving forward.
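[Editor's note: for reference, a minimal sketch of the `if (src == dst) -> skip copy` check described in the cover letter, assuming the static memcpy_to_page() helper used by the iov_iter code; the exact shape of patch 1 may differ.]

```c
static void memcpy_to_page(struct page *page, size_t offset,
			   const char *from, size_t len)
{
	char *to = kmap_atomic(page);

	/* With DDP the NIC already placed the payload in the destination
	 * buffer, so source and destination addresses match and the copy
	 * can be skipped.
	 */
	if (to + offset != from)
		memcpy(to + offset, from, len);
	kunmap_atomic(to);
}
```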
On 09/10/2020 0:47, Sagi Grimberg wrote:
>> + * tcp_ddp.h
>> + * Author: Boris Pismenny <borisp@mellanox.com>
>> + * Copyright (C) 2020 Mellanox Technologies.
>> + */
>> +#ifndef _TCP_DDP_H
>> +#define _TCP_DDP_H
>> +
>> +#include <linux/blkdev.h>
> Why is blkdev.h needed?

That's a leftover from a previous iteration of this code. I'll remove
it for the next patch.

>
>> +#include <linux/netdevice.h>
>> +#include <net/inet_connection_sock.h>
>> +#include <net/sock.h>
>> +
>> +/* limits returned by the offload driver, zero means don't care */
>> +struct tcp_ddp_limits {
>> +	int	max_ddp_sgl_len;
>> +};
>> +
>> +enum tcp_ddp_type {
>> +	TCP_DDP_NVME = 1,
>> +};
>> +
>> +struct tcp_ddp_config {
>> +	enum tcp_ddp_type	type;
>> +	unsigned char		buf[];
> A little kdoc may help here as it's not exactly clear what buf is
> used for (at this point at least)...

Will add.

>> +};
>> +
>> +struct nvme_tcp_config {
> struct nvme_tcp_ddp_config

Sure.

>> +	struct tcp_ddp_config	cfg;
>> +
>> +	u16			pfv;
>> +	u8			cpda;
>> +	u8			dgst;
>> +	int			queue_size;
>> +	int			queue_id;
>> +	int			io_cpu;
>> +};
>> +
> Other than that this looks good to me.

Thanks Sagi!
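[Editor's note: a possible shape for the agreed rename and kdoc is sketched below. The field descriptions are assumptions drawn from common NVMe-TCP terminology (pfv = PDU format version, cpda = controller PDU data alignment, dgst = enabled digests), not text from the series.]

```c
/**
 * struct tcp_ddp_config - generic tcp ddp configuration
 * @type:	type of this configuration (e.g. TCP_DDP_NVME)
 * @buf:	protocol-specific configuration that follows this header,
 *		interpreted according to @type
 */
struct tcp_ddp_config {
	enum tcp_ddp_type	type;
	unsigned char		buf[];
};

/**
 * struct nvme_tcp_ddp_config - nvme-tcp ddp configuration for an IO queue
 * @pfv:	pdu format version negotiated for this queue
 * @cpda:	controller pdu data alignment
 * @dgst:	digest types enabled on this queue
 * @queue_size:	number of elements in the nvme-tcp IO queue
 * @queue_id:	queue identifier
 * @io_cpu:	cpu core running the IO thread for this queue
 */
struct nvme_tcp_ddp_config {
	struct tcp_ddp_config	cfg;

	u16			pfv;
	u8			cpda;
	u8			dgst;
	int			queue_size;
	int			queue_id;
	int			io_cpu;
};
```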
On 09/10/2020 1:44, Sagi Grimberg wrote:
> On 9/30/20 7:20 PM, Boris Pismenny wrote:
>
> > crc offload of the nvme capsule. Check if all the skb bits are on,
> > and if not recalculate the crc in SW and check it.
>
> Can you clarify in the patch description that this is only for pdu
> data digest and not header digest?
>

Not a security expert, but according to my understanding, the NVMeTCP
data digest is a layer 5 CRC, and as such it is expected to be
end-to-end, meaning it is computed by layer 5 on the transmitter and
verified by layer 5 on the receiver.
Any data corruption which happens in any of the lower layers, including
their software processing, should be protected by this CRC. For example,
if the IP or TCP stack has a bug that corrupts the NVMeTCP payload data,
the CRC should protect against it. It seems that may not be the case
with this offload.

>
> > This patch reworks the receive-side crc calculation to always run at
> > the end, so as to keep a single flow for both offload and non-offload.
> > This change simplifies the code, but it may degrade performance for
> > non-offload crc calculation.
>
> ??
>
> From my scan it doesn't look like you do that.. Am I missing something?
> Can you explain?
>
> > Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> > Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> > Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> > Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> > ---
> >   drivers/nvme/host/tcp.c | 66 ++++++++++++++++++++++++++++++++++++-----
> >   1 file changed, 58 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 7bd97f856677..9a620d1dacb4 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -94,6 +94,7 @@ struct nvme_tcp_queue {
> >   	size_t			data_remaining;
> >   	size_t			ddgst_remaining;
> >   	unsigned int		nr_cqe;
> > +	bool			crc_valid;

I suggest to rename it to ddgst_valid.
On 08/11/2020 8:59, Shai Malin wrote:
> On 09/10/2020 1:44, Sagi Grimberg wrote:
>> On 9/30/20 7:20 PM, Boris Pismenny wrote:
>>
>>> crc offload of the nvme capsule. Check if all the skb bits are on,
>>> and if not recalculate the crc in SW and check it.
>> Can you clarify in the patch description that this is only for pdu
>> data digest and not header digest?
>>
> Not a security expert, but according to my understanding, the NVMeTCP
> data digest is a layer 5 CRC, and as such it is expected to be
> end-to-end, meaning it is computed by layer 5 on the transmitter and
> verified by layer 5 on the receiver.
> Any data corruption which happens in any of the lower layers, including
> their software processing, should be protected by this CRC. For example,
> if the IP or TCP stack has a bug that corrupts the NVMeTCP payload data,
> the CRC should protect against it. It seems that may not be the case
> with this offload.

If the TCP/IP stack corrupts packet data, then many TCP/IP consumers
will likely be affected, and it will be fixed promptly. Unlike with TOE,
these bugs can be easily fixed/hotpatched by the community.

>
>>> This patch reworks the receive-side crc calculation to always run at
>>> the end, so as to keep a single flow for both offload and non-offload.
>>> This change simplifies the code, but it may degrade performance for
>>> non-offload crc calculation.
>> ??
>>
>> From my scan it doesn't look like you do that.. Am I missing something?
>> Can you explain?
>>
>>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
>>> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
>>> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>>> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
>>> ---
>>>   drivers/nvme/host/tcp.c | 66 ++++++++++++++++++++++++++++++++++++-----
>>>   1 file changed, 58 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>>> index 7bd97f856677..9a620d1dacb4 100644
>>> --- a/drivers/nvme/host/tcp.c
>>> +++ b/drivers/nvme/host/tcp.c
>>> @@ -94,6 +94,7 @@ struct nvme_tcp_queue {
>>>   	size_t			data_remaining;
>>>   	size_t			ddgst_remaining;
>>>   	unsigned int		nr_cqe;
>>> +	bool			crc_valid;
> I suggest to rename it to ddgst_valid.
>

Sure
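[Editor's note: a small sketch of how the agreed rename (crc_valid -> ddgst_valid) might look, together with the helper names Sagi suggested for patch 7. The names are illustrative only; the #ifdef is hoisted outside the if purely for readability and behaves the same as the posted hunk.]

```c
static inline bool nvme_tcp_ddp_ddgst_ok(struct nvme_tcp_queue *queue)
{
	return queue->ddgst_valid;
}

static inline void nvme_tcp_ddp_ddgst_update(struct nvme_tcp_queue *queue,
					     struct sk_buff *skb)
{
#ifdef CONFIG_TCP_DDP_CRC
	/* Once any skb of the PDU arrives without the ddp_crc bit set,
	 * the whole PDU falls back to software digest verification.
	 */
	if (queue->ddgst_valid)
		queue->ddgst_valid = skb->ddp_crc;
#else
	queue->ddgst_valid = false;
#endif
}
```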
On 09/10/2020 1:29, Sagi Grimberg wrote:
>
>>   static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>> @@ -1115,6 +1222,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>   	bool inline_data = nvme_tcp_has_inline_data(req);
>>   	u8 hdgst = nvme_tcp_hdgst_len(queue);
>>   	int len = sizeof(*pdu) + hdgst - req->offset;
>> +	struct request *rq = blk_mq_rq_from_pdu(req);
>>   	int flags = MSG_DONTWAIT;
>>   	int ret;
>>
>> @@ -1123,6 +1231,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>   	else
>>   		flags |= MSG_EOR;
>>
>> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
>> +	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
>> +		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);
> I'd assume that this is something we want to setup in
> nvme_tcp_setup_cmd_pdu. Why do it here?

Our goal in placing it here is to keep both setup and teardown in the
same thread. This enables drivers to avoid locking for per-queue
operations.
On 09/10/2020 2:00, Sagi Grimberg wrote:
>>>   static
>>>   int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue,
>>>   			    struct nvme_tcp_config *config)
>>> @@ -630,6 +720,7 @@ static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
>>>   static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>>>   		struct nvme_completion *cqe)
>>>   {
>>> +	struct nvme_tcp_request *req;
>>>   	struct request *rq;
>>>
>>>   	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
>>> @@ -641,8 +732,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
>>>   		return -EINVAL;
>>>   	}
>>>
>>> -	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>>> -		nvme_complete_rq(rq);
>>> +	req = blk_mq_rq_to_pdu(rq);
>>> +	if (req->offloaded) {
>>> +		req->status = cqe->status;
>>> +		req->result = cqe->result;
>>> +		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);
>>> +	} else {
>>> +		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))
>>> +			nvme_complete_rq(rq);
>>> +	}
> Oh forgot to ask,
>
> We have places in the driver that we may complete (cancel) one
> or more requests from the error recovery or timeout flow. We
> first prevent future incoming RX on the socket such that we
> can safely cancel requests. This may break with the deferred
> completion in ddp_teardown_done.
>
> If I have a request that is waiting for ddp_teardown_done, do
> I have a way to tell the HW to never call ddp_teardown_done
> on a specific socket?
>
> If so, the place to do it is in nvme_tcp_stop_queue.

Interesting. Indeed, this is a problem that we haven't considered.
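[Editor's note: purely illustrative sketch of where such a quiesce could live, based on the upstream __nvme_tcp_stop_queue(); the nvme_tcp_unoffload_socket() call and its semantics (drop all pending DDP contexts so ddp_teardown_done is never invoked afterwards) are hypothetical and not something the posted series provides.]

```c
static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
{
	kernel_sock_shutdown(queue->sock, SHUT_RDWR);

	/* Hypothetical hook: ask the device to cancel outstanding DDP
	 * contexts for this socket so no further ddp_teardown_done()
	 * callbacks can race with request cancellation below.
	 */
	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
		nvme_tcp_unoffload_socket(queue);

	nvme_tcp_restore_sock_calls(queue);
	cancel_work_sync(&queue->io_work);
}
```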
On 09/10/2020 1:44, Sagi Grimberg wrote:
>> crc offload of the nvme capsule. Check if all the skb bits
>> are on, and if not recalculate the crc in SW and check it.
> Can you clarify in the patch description that this is only
> for pdu data digest and not header digest?

Will do

>
>> This patch reworks the receive-side crc calculation to always
>> run at the end, so as to keep a single flow for both offload
>> and non-offload. This change simplifies the code, but it may degrade
>> performance for non-offload crc calculation.
> ??
>
> From my scan it doesn't look like you do that.. Am I missing something?
> Can you explain?

The performance of the CRC data digest in the offload's fallback path
may be worse than a CRC calculation done with skb_copy_and_hash.

To be clear, the fallback path occurs when `queue->data_digest &&
test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags)`, while we receive SKBs
where `skb->ddp_crc = 0`.

>
>>   	rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
>>   	if (!rq) {
>>   		dev_err(queue->ctrl->ctrl.device,
>> @@ -992,7 +1031,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>   		recv_len = min_t(size_t, recv_len,
>>   				iov_iter_count(&req->iter));
>>
>> -		if (queue->data_digest)
>> +		if (queue->data_digest && !test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
>>   			ret = skb_copy_and_hash_datagram_iter(skb, *offset,
>>   				&req->iter, recv_len, queue->rcv_hash);
> This is the skb copy and hash, not clear why you say that you move this
> to the end...

See the offload fallback path below.

>
>>   		else
>> @@ -1012,7 +1051,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>
>>   	if (!queue->data_remaining) {
>>   		if (queue->data_digest) {
>> -			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
> If I instead do:
> 		if (!test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
> 			nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>
> Does that help the mess in nvme_tcp_recv_ddgst?

Not really, as the code path there takes care of the fallback path,
i.e. offload was requested but didn't succeed.
>
>>   			queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH;
>>   		} else {
>>   			if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>> @@ -1033,8 +1071,11 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>>   	char *ddgst = (char *)&queue->recv_ddgst;
>>   	size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining);
>>   	off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining;
>> +	bool ddgst_offload_fail;
>>   	int ret;
>>
>> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags))
>> +		nvme_tcp_device_ddgst_update(queue, skb);
>>   	ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len);
>>   	if (unlikely(ret))
>>   		return ret;
>> @@ -1045,12 +1086,21 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue,
>>   	if (queue->ddgst_remaining)
>>   		return 0;
>>
>> -	if (queue->recv_ddgst != queue->exp_ddgst) {
>> -		dev_err(queue->ctrl->ctrl.device,
>> -			"data digest error: recv %#x expected %#x\n",
>> -			le32_to_cpu(queue->recv_ddgst),
>> -			le32_to_cpu(queue->exp_ddgst));
>> -		return -EIO;
>> +	ddgst_offload_fail = !nvme_tcp_device_ddgst_ok(queue);
>> +	if (!test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) ||
>> +	    ddgst_offload_fail) {
>> +		if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
>> +		    ddgst_offload_fail)
>> +			nvme_tcp_crc_recalculate(queue, pdu);
>> +
>> +		nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
>> +		if (queue->recv_ddgst != queue->exp_ddgst) {
>> +			dev_err(queue->ctrl->ctrl.device,
>> +				"data digest error: recv %#x expected %#x\n",
>> +				le32_to_cpu(queue->recv_ddgst),
>> +				le32_to_cpu(queue->exp_ddgst));
>> +			return -EIO;
> This gets convoluted here...

Will try to simplify. The general idea is that there are 3 paths with
common code:
1. non-offload
2. offload failed
3. offload success

(1) and (2) share the code for finalizing and checking the data digest,
while (3) skips it entirely. In other words, how about this:

```
	offload_fail = !nvme_tcp_ddp_ddgst_ok(queue);
	offload = test_bit(NVME_TCP_Q_OFF_CRC_RX, &queue->flags);
	if (!offload || offload_fail) {
		if (offload && offload_fail) // software-fallback
			nvme_tcp_ddp_ddgst_recalc(queue, pdu);

		nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst);
		if (queue->recv_ddgst != queue->exp_ddgst) {
			dev_err(queue->ctrl->ctrl.device,
				"data digest error: recv %#x expected %#x\n",
				le32_to_cpu(queue->recv_ddgst),
				le32_to_cpu(queue->exp_ddgst));
			return -EIO;
		}
	}
```

>
>> +		}
>>   	}
>>
>>   	if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) {
>>
>>>   static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
>>> @@ -1115,6 +1222,7 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>>   	bool inline_data = nvme_tcp_has_inline_data(req);
>>>   	u8 hdgst = nvme_tcp_hdgst_len(queue);
>>>   	int len = sizeof(*pdu) + hdgst - req->offset;
>>> +	struct request *rq = blk_mq_rq_from_pdu(req);
>>>   	int flags = MSG_DONTWAIT;
>>>   	int ret;
>>>
>>> @@ -1123,6 +1231,10 @@ static int nvme_tcp_try_send_cmd_pdu(struct nvme_tcp_request *req)
>>>   	else
>>>   		flags |= MSG_EOR;
>>>
>>> +	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
>>> +	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
>>> +		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);
>> I'd assume that this is something we want to setup in
>> nvme_tcp_setup_cmd_pdu. Why do it here?
> Our goal in placing it here is to keep both setup and teardown in the
> same thread. This enables drivers to avoid locking for per-queue
> operations.

I also think that it is cleaner to do this when setting up the PDU. Do
note that if queues match 1x1 with cpu cores then any synchronization
is pretty lightweight, and if not, we have other synchronizations
anyways...
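[Editor's note: for illustration, one way the call could move into nvme_tcp_setup_cmd_pdu() as suggested. This is a sketch against the upstream function signature of that era, with the existing PDU header/digest initialization elided; the NVME_TCP_Q_OFFLOADS flag and nvme_tcp_setup_ddp() helper are assumed from the posted series.]

```c
static blk_status_t nvme_tcp_setup_cmd_pdu(struct nvme_ns *ns,
		struct request *rq)
{
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
	struct nvme_tcp_cmd_pdu *pdu = req->pdu;
	struct nvme_tcp_queue *queue = req->queue;
	blk_status_t ret;

	ret = nvme_setup_cmd(ns, rq, &pdu->cmd);
	if (ret)
		return ret;

	/* ... existing PDU header/digest initialization ... */

	/* Map the READ buffers for direct data placement while still in
	 * the submission context, before the command PDU is sent.
	 */
	if (test_bit(NVME_TCP_Q_OFFLOADS, &queue->flags) &&
	    blk_rq_nr_phys_segments(rq) && rq_data_dir(rq) == READ)
		nvme_tcp_setup_ddp(queue, pdu->cmd.common.command_id, rq);

	return 0;
}
```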