[v2,net-next,00/21] nvme-tcp receive offloads

Message ID 20210114151033.13020-1-borisp@mellanox.com

Boris Pismenny Jan. 14, 2021, 3:10 p.m. UTC
Changes since v1:
=========================================
* Rework iov_iter copy skip if src==dst to be less intrusive (David Ahern)
* Add tcp-ddp documentation (David Ahern)
* Refactor mellanox driver patches into more patches (Saeed Mahameed)
* Avoid pointer casting (David Ahern)
* Rename nvme-tcp offload flags (Shai Malin)
* Update cover-letter according to the above

Changes since RFC v1:
=========================================
* Split mlx5 driver patches to several commits
* Fix nvme-tcp handling of recovery flows. In particular, move queue offload
  init/teardown to the start/stop functions.

Overview
=========================================
This series adds support for nvme-tcp receive offloads
which do not mandate the offload of the network stack to the device.
Instead, these work together with TCP to offload:
1. copy from SKB to the block layer buffers
2. CRC verification for received PDUs

The series implements these as a generic offload infrastructure for storage
protocols, which we call TCP Direct Data Placement (TCP_DDP) and TCP DDP CRC,
respectively. We use this infrastructure to implement NVMe-TCP offload for copy
and CRC. Future implementations can reuse the same infrastructure for other
protocols such as iSCSI.

Note:
These offloads are similar in nature to the packet-based NIC TLS offloads,
which are already upstream (see net/tls/tls_device.c).
You can read more about TLS offload here:
https://www.kernel.org/doc/html/latest/networking/tls-offload.html

Initialization and teardown:
=========================================
The offload for IO queues is initialized after the NVMe-TCP handshake
completes, by calling `nvme_tcp_offload_socket` with the TCP socket of
the nvme_tcp_queue. This operation configures all relevant contexts in
hardware. If it fails, the IO queue proceeds as usual with no offload.
If it succeeds, then `nvme_tcp_setup_ddp` and `nvme_tcp_teardown_ddp` may
be called to perform copy offload, and CRC offload will be used.
This initialization does not change the normal operation of nvme-tcp in
any way besides adding the option to call the above-mentioned NDO
operations.

For the admin queue, nvme-tcp does not initialize the offload.
Instead, nvme-tcp calls the driver to configure limits for the controller,
such as max_hw_sectors and max_segments; these must be limited to accommodate
potential HW resource limits, and to improve performance.

If an error occurs and the IO queue must be closed or reconnected, the
offload is torn down and initialized again. Additionally, we handle netdev
down events via the existing error recovery flow.
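
To make the control path concrete, here is a minimal sketch of how the new
NDO operations are expected to be used. The `*_sketch` helpers and the exact
clamping of the controller limits are simplifications for illustration, not
the code from patches 6-7:

#include <net/tcp_ddp.h>	/* added by patch 2 of this series */

/*
 * Sketch only: queue offload init and controller limits.
 * Assumes CONFIG_TCP_DDP; helper and field names may differ from the
 * real patches.
 */
static int nvme_tcp_offload_socket_sketch(struct net_device *netdev,
					  struct sock *sk,
					  struct nvme_tcp_ddp_config *config)
{
	if (!(netdev->features & NETIF_F_HW_TCP_DDP) ||
	    !netdev->tcp_ddp_ops || !netdev->tcp_ddp_ops->tcp_ddp_sk_add)
		return -EOPNOTSUPP;

	config->cfg.type = TCP_DDP_NVME;
	/* on failure the IO queue simply runs without offload */
	return netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev, sk, &config->cfg);
}

/* admin queue path: clamp controller limits to what the device can DDP */
static void nvme_tcp_apply_ddp_limits_sketch(struct nvme_ctrl *ctrl,
					     struct net_device *netdev)
{
	struct tcp_ddp_limits limits = {};

	if (netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, &limits))
		return;
	if (limits.max_ddp_sgl_len)
		ctrl->max_segments = min_t(u32, ctrl->max_segments,
					   limits.max_ddp_sgl_len);
}

Because failure leaves the queue running unmodified, the offload stays
strictly opt-in per queue.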

Copy offload works as follows:
=========================================
The nvme-tcp layer calls the NIC driver to map block layer buffers to a ccid
using `nvme_tcp_setup_ddp` before sending the read request. When the response is
received, then the NIC HW will write the PDU payload directly into the
designated buffer, and build an SKB such that it points into the destination
buffer; this SKB represents the entire packet received on the wire, but it
points to the block layer buffers. Once nvme-tcp attempts to copy data from
this SKB to the block layer buffer it can skip the copy by checking in the
copying function (memcpy_to_page):
if (src == dst) -> skip copy
Finally, when the PDU has been processed to completion, the nvme-tcp layer
releases the NIC HW context by calling `nvme_tcp_teardown_ddp`, which
asynchronously unmaps the buffers from the NIC HW.
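
Schematically, the per-IO half of this looks as follows. This is a hedged
sketch rather than the code from patch 7; the way the `tcp_ddp_io` is filled
(block tag as command_id, a single flat SGL) is an assumption:

/* Sketch: associate the request's SG list with its command_id before the
 * read command goes out; release the association once the PDU completes.
 */
static int nvme_tcp_setup_ddp_sketch(struct net_device *netdev,
				     struct sock *sk,
				     struct tcp_ddp_io *ddp,
				     struct request *rq)
{
	ddp->command_id = rq->tag;	/* assumed: tag used as wire command_id */
	ddp->nents = blk_rq_map_sg(rq->q, rq, ddp->first_sgl);
	ddp->sg_table.sgl = ddp->first_sgl;

	/* HW will now place matching C2HData payload straight into the SGL */
	return netdev->tcp_ddp_ops->tcp_ddp_setup(netdev, sk, ddp);
}

static void nvme_tcp_teardown_ddp_sketch(struct net_device *netdev,
					 struct sock *sk,
					 struct tcp_ddp_io *ddp, void *ddp_ctx)
{
	/* asynchronous: completion arrives via ddp_teardown_done(ddp_ctx) */
	netdev->tcp_ddp_ops->tcp_ddp_teardown(netdev, sk, ddp, ddp_ctx);
}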

As the copy skip change is in a sensitive function, we are careful to avoid
changing it. To that end, we create alternative skb copy and hash iterators
that skip the copy/hash if (src == dst). Nvme-tcp is the first user of these.
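
The skip itself is a one-line condition. A minimal sketch of what the
DDP-aware copy step boils down to (the series adds such variants in
lib/iov_iter.c and net/core/datagram.c rather than this exact function):

/* Sketch: when the skb frag already points at the destination page,
 * "copying" it to that page is a no-op.
 */
static void memcpy_to_page_ddp_sketch(struct page *page, size_t offset,
				      const char *from, size_t len)
{
	char *to = kmap_atomic(page);

	if (to + offset != from)	/* src == dst: payload was DDP'ed in place */
		memcpy(to + offset, from, len);
	kunmap_atomic(to);
}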

Asynchronous completion:
=========================================
The NIC must release its mapping between command IDs and the target buffers.
This mapping is released when NVMe-TCP calls the NIC
driver (`nvme_tcp_teardown_ddp`).
As completing IOs is performance critical, we introduce asynchronous
completions for NVMe-TCP, i.e. NVMe-TCP calls the NIC driver, which will later
call NVMe-TCP back to complete the IO (`nvme_tcp_ddp_teardown_done`).

An alternative approach is to move all the functions related to copying from
SKBs to the block layer buffers into the nvme-tcp code - about 200 LOC.
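
A hedged sketch of what the completion callback amounts to, reusing the
`req->status`/`req->result` fields that patch 7 adds to
`struct nvme_tcp_request`; the convention that the request itself is passed
as ddp_ctx is an assumption:

/* Sketch: the NIC driver calls this once the HW mapping for the command
 * is gone; only then does nvme-tcp complete the block request.
 */
static void nvme_tcp_ddp_teardown_done_sketch(void *ddp_ctx)
{
	struct request *rq = ddp_ctx;		/* assumed: rq passed at teardown */
	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

	if (!nvme_try_complete_req(rq, req->status, req->result))
		nvme_complete_rq(rq);
}

static const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops_sketch = {
	.resync_request		= NULL,	/* see the resynchronization section */
	.ddp_teardown_done	= nvme_tcp_ddp_teardown_done_sketch,
};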

CRC offload works as follows:
=========================================
After offload is initialized, we use the SKB's ddp_crc bit to indicate that:
"there was no problem with the verification of all CRC fields in this packet's
payload". The bit is set to zero if there was an error, or if HW skipped
offload for some reason. If *any* SKB in a PDU has (ddp_crc != 1), then software
must compute the CRC, and check it. We perform this check, and
accompanying software fallback at the end of the processing of a received PDU.
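
A minimal sketch of that end-of-PDU decision; only `skb->ddp_crc` comes from
the series (under CONFIG_TCP_DDP_CRC), the per-PDU state here is illustrative:

struct nvme_tcp_pdu_state_sketch {	/* illustrative per-PDU state */
	bool ddgst_offload_ok;		/* starts true for every new PDU */
};

/* Sketch: called for every skb that carries part of the current PDU */
static void nvme_tcp_note_skb_crc_sketch(struct nvme_tcp_pdu_state_sketch *st,
					 const struct sk_buff *skb)
{
	st->ddgst_offload_ok &= !!skb->ddp_crc;
}

/* Sketch: at the end of the PDU; true means recompute and verify in SW */
static bool nvme_tcp_need_sw_ddgst_sketch(const struct nvme_tcp_pdu_state_sketch *st)
{
	return !st->ddgst_offload_ok;
}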

SKB changes:
=========================================
The CRC offload requires an additional bit in the SKB, which is used to
prevent the coalescing of SKBs with different CRC offload values. This bit
is similar in concept to the "decrypted" bit.
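
A sketch of the coalescing guard this enables, in the spirit of the existing
"decrypted"-bit check in tcp_try_coalesce(); the helper here is illustrative,
the actual hunk lives in net/ipv4/tcp_input.c:

/* Sketch: refuse to merge skbs whose ddp_crc bits differ, so a PDU's
 * CRC-offload status is never diluted by a neighbouring segment.
 */
static bool tcp_ddp_can_coalesce_sketch(const struct sk_buff *to,
					const struct sk_buff *from)
{
#ifdef CONFIG_TCP_DDP_CRC
	if (to->ddp_crc != from->ddp_crc)
		return false;
#endif
	return true;
}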

Performance:
=========================================
The expected performance gain from this offload varies with the block size.
We performed a CPU-cycle breakdown of the copy/CRC operations in nvme-tcp
fio random read workloads:
for 4K blocks we see up to 11% improvement for a 100% read fio workload,
while for 128K blocks we see up to 52%. If we run nvme-tcp and skip these
operations, then we observe a gain of about 1.1x and 2x, respectively.

Resynchronization:
=========================================
The resynchronization flow is performed to reset the hardware's tracking of
NVMe-TCP PDUs within the TCP stream. The flow consists of a request from
the NIC driver regarding a possible location of a PDU header, followed by
a response from the nvme-tcp layer.

This flow is rare, and it should happen only after packet loss or
reordering events that involve nvme-tcp PDU headers.
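
A hedged sketch of both halves of the exchange; `TCP_DDP_RESYNC_REQ` and
`tcp_ddp_resync` are from patch 2, the bookkeeping on the nvme-tcp side is
simplified:

/* Sketch, NIC -> nvme-tcp: "does a PDU header start at TCP sequence @seq?" */
static bool nvme_tcp_resync_request_sketch(struct sock *sk, u32 seq, u32 flags)
{
	if (!(flags & TCP_DDP_RESYNC_REQ))
		return false;
	/* remember @seq; the PDU-parsing path answers when it next sees a
	 * PDU header at that sequence number:
	 *
	 *   netdev->tcp_ddp_ops->tcp_ddp_resync(netdev, sk, pdu_seq);
	 *
	 * after which HW resumes tracking PDUs at pdu_seq.
	 */
	return true;
}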

The patches are organized as follows:
=========================================
Patch 1         the iov_iter change to skip copy if (src == dst).
Patches 2-4     the infrastructure for the TCP DDP
                and TCP DDP CRC offloads, respectively.
Patch 5         exposes the get_netdev_for_sock function from TLS.
Patch 6         NVMe-TCP changes to call the NIC driver on queue init/teardown.
Patch 7         NVMe-TCP changes to call the NIC driver on IO operation
                setup/teardown, and to support async completions.
Patch 8         NVMe-TCP changes to support CRC offload on receive.
                Also, this patch moves the CRC calculation to the end of the
                PDU in case the offload requires a software fallback.
Patch 9         NVMe-TCP handling of netdev events: stop the offload if the
                netdev is going down.
Patches 10-20   implement support for NVMe-TCP copy and CRC offload in
                the mlx5 NIC driver as the first user.
Patch 21        documents the TCP DDP offload.

Testing:
=========================================
This series was tested using fio with various configurations of IO sizes,
depths, MTUs, and with both the SPDK and kernel NVMe-TCP targets.
Also, we have used QEMU and gate-level simulation to verify these patches.

Future work:
=========================================
A follow-up series will introduce support for transmit side CRC. Then,
we will work on adding support for TLS in NVMe-TCP and combining the
two offloads.

Ben Ben-Ishay (8):
  net/mlx5e: NVMEoTCP offload initialization
  net/mlx5e: KLM UMR helper macros
  net/mlx5e: NVMEoTCP use KLM UMRs
  net/mlx5e: NVMEoTCP queue init/teardown
  net/mlx5e: NVMEoTCP async ddp invalidation
  net/mlx5e: NVMEoTCP ddp setup and resync
  net/mlx5e: NVMEoTCP, data-path for DDP offload
  net/mlx5e: NVMEoTCP statistics

Ben Ben-ishay (4):
  net: SKB copy(+hash) iterators for DDP offloads
  nvme-tcp : Recalculate crc in the end of the capsule
  net/mlx5: Header file changes for nvme-tcp offload
  net/mlx5: Add 128B CQE for NVMEoTCP offload

Boris Pismenny (8):
  iov_iter: Introduce new procedures for copy to iter/pages
  net: Introduce direct data placement tcp offload
  net: Introduce crc offload for tcp ddp ulp
  net/tls: expose get_netdev_for_sock
  nvme-tcp: Add DDP offload control path
  nvme-tcp: Add DDP data-path
  net/mlx5e: TCP flow steering for nvme-tcp
  Documentation: add TCP DDP offload documentation

Or Gerlitz (1):
  nvme-tcp: Deal with netdevice DOWN events

 Documentation/networking/index.rst            |    1 +
 Documentation/networking/tcp-ddp-offload.rst  |  296 +++++
 .../net/ethernet/mellanox/mlx5/core/Kconfig   |   11 +
 .../net/ethernet/mellanox/mlx5/core/Makefile  |    2 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   31 +-
 .../net/ethernet/mellanox/mlx5/core/en/fs.h   |    4 +-
 .../ethernet/mellanox/mlx5/core/en/params.h   |    1 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |   13 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |    1 +
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |    1 +
 .../mellanox/mlx5/core/en_accel/en_accel.h    |    9 +-
 .../mellanox/mlx5/core/en_accel/fs_tcp.c      |   10 +
 .../mellanox/mlx5/core/en_accel/fs_tcp.h      |    2 +-
 .../mellanox/mlx5/core/en_accel/nvmeotcp.c    | 1012 +++++++++++++++++
 .../mellanox/mlx5/core/en_accel/nvmeotcp.h    |  120 ++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.c        |  259 +++++
 .../mlx5/core/en_accel/nvmeotcp_rxtx.h        |   43 +
 .../mlx5/core/en_accel/nvmeotcp_utils.h       |   80 ++
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   39 +-
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |   66 +-
 .../ethernet/mellanox/mlx5/core/en_stats.c    |   37 +
 .../ethernet/mellanox/mlx5/core/en_stats.h    |   24 +
 .../net/ethernet/mellanox/mlx5/core/en_txrx.c |   17 +
 drivers/net/ethernet/mellanox/mlx5/core/fw.c  |    6 +
 drivers/nvme/host/tcp.c                       |  466 +++++++-
 include/linux/mlx5/device.h                   |   44 +-
 include/linux/mlx5/mlx5_ifc.h                 |  101 +-
 include/linux/mlx5/qp.h                       |    1 +
 include/linux/netdev_features.h               |    4 +
 include/linux/netdevice.h                     |    5 +
 include/linux/skbuff.h                        |   10 +
 include/linux/uio.h                           |   12 +
 include/net/inet_connection_sock.h            |    4 +
 include/net/sock.h                            |   17 +
 include/net/tcp_ddp.h                         |  136 +++
 lib/iov_iter.c                                |   45 +
 net/Kconfig                                   |   17 +
 net/core/datagram.c                           |   44 +
 net/core/skbuff.c                             |    9 +-
 net/ethtool/common.c                          |    2 +
 net/ipv4/tcp_input.c                          |    7 +
 net/ipv4/tcp_ipv4.c                           |    3 +
 net/ipv4/tcp_offload.c                        |    3 +
 net/tls/tls_device.c                          |   20 +-
 44 files changed, 2977 insertions(+), 58 deletions(-)
 create mode 100644 Documentation/networking/tcp-ddp-offload.rst
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h
 create mode 100644 include/net/tcp_ddp.h

Comments

Eric Dumazet Jan. 14, 2021, 3:57 p.m. UTC | #1
On Thu, Jan 14, 2021 at 4:10 PM Boris Pismenny <borisp@mellanox.com> wrote:
>
> This commit introduces direct data placement offload for TCP.
> This capability is accompanied by new net_device operations that
> configure hardware contexts. There is a context per socket, and a context per DDP
> opreation. Additionally, a resynchronization routine is used to assist
> hardware handle TCP OOO, and continue the offload.
> Furthermore, we let the offloading driver advertise what is the max hw
> sectors/segments.
>
> Using this interface, the NIC hardware will scatter TCP payload directly
> to the BIO pages according to the command_id.
> To maintain the correctness of the network stack, the driver is expected
> to construct SKBs that point to the BIO pages.
>
> This, the SKB represents the data on the wire, while it is pointing
> to data that is already placed in the destination buffer.
> As a result, data from page frags should not be copied out to
> the linear part.
>
> As SKBs that use DDP are already very memory efficient, we modify
> skb_condence to avoid copying data from fragments to the linear
> part of SKBs that belong to a socket that uses DDP offload.
>
> A follow-up patch will use this interface for DDP in NVMe-TCP.
>
> Signed-off-by: Boris Pismenny <borisp@mellanox.com>
> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> Signed-off-by: Yoray Zack <yorayz@mellanox.com>
> ---
>  include/linux/netdev_features.h    |   2 +
>  include/linux/netdevice.h          |   5 ++
>  include/net/inet_connection_sock.h |   4 +
>  include/net/tcp_ddp.h              | 136 +++++++++++++++++++++++++++++
>  net/Kconfig                        |   9 ++
>  net/core/skbuff.c                  |   9 +-
>  net/ethtool/common.c               |   1 +
>  7 files changed, 165 insertions(+), 1 deletion(-)
>  create mode 100644 include/net/tcp_ddp.h
>
> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h
> index 934de56644e7..fb35dcac03d2 100644
> --- a/include/linux/netdev_features.h
> +++ b/include/linux/netdev_features.h
> @@ -84,6 +84,7 @@ enum {
>         NETIF_F_GRO_FRAGLIST_BIT,       /* Fraglist GRO */
>
>         NETIF_F_HW_MACSEC_BIT,          /* Offload MACsec operations */
> +       NETIF_F_HW_TCP_DDP_BIT,         /* TCP direct data placement offload */
>
>         /*
>          * Add your fresh new feature above and remember to update
> @@ -157,6 +158,7 @@ enum {
>  #define NETIF_F_GRO_FRAGLIST   __NETIF_F(GRO_FRAGLIST)
>  #define NETIF_F_GSO_FRAGLIST   __NETIF_F(GSO_FRAGLIST)
>  #define NETIF_F_HW_MACSEC      __NETIF_F(HW_MACSEC)
> +#define NETIF_F_HW_TCP_DDP     __NETIF_F(HW_TCP_DDP)
>
>  /* Finds the next feature with the highest number of the range of start till 0.
>   */
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 259be67644e3..3dd3cdf5dec3 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -941,6 +941,7 @@ struct dev_ifalias {
>
>  struct devlink;
>  struct tlsdev_ops;
> +struct tcp_ddp_dev_ops;
>
>  struct netdev_name_node {
>         struct hlist_node hlist;
> @@ -1937,6 +1938,10 @@ struct net_device {
>         const struct tlsdev_ops *tlsdev_ops;
>  #endif
>
> +#ifdef CONFIG_TCP_DDP
> +       const struct tcp_ddp_dev_ops *tcp_ddp_ops;
> +#endif
> +
>         const struct header_ops *header_ops;
>
>         unsigned int            flags;
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 7338b3865a2a..a08b85b53aa8 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {
>   * @icsk_ulp_ops          Pluggable ULP control hook
>   * @icsk_ulp_data         ULP private data
>   * @icsk_clean_acked      Clean acked data hook
> + * @icsk_ulp_ddp_ops      Pluggable ULP direct data placement control hook
> + * @icsk_ulp_ddp_data     ULP direct data placement private data
>   * @icsk_listen_portaddr_node  hash to the portaddr listener hashtable
>   * @icsk_ca_state:        Congestion control state
>   * @icsk_retransmits:     Number of unrecovered [RTO] timeouts
> @@ -94,6 +96,8 @@ struct inet_connection_sock {
>         const struct tcp_ulp_ops  *icsk_ulp_ops;
>         void __rcu                *icsk_ulp_data;
>         void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);

#ifdef CONFIG_TCP_DDP ?

> +       const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;
> +       void __rcu                *icsk_ulp_ddp_data;
>         struct hlist_node         icsk_listen_portaddr_node;
>         unsigned int              (*icsk_sync_mss)(struct sock *sk, u32 pmtu);
>         __u8                      icsk_ca_state:5,
> diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h
> new file mode 100644
> index 000000000000..31e5b1a16d0f
> --- /dev/null
> +++ b/include/net/tcp_ddp.h
> @@ -0,0 +1,136 @@
> +/* SPDX-License-Identifier: GPL-2.0
> + *
> + * tcp_ddp.h
> + *     Author: Boris Pismenny <borisp@mellanox.com>
> + *     Copyright (C) 2021 Mellanox Technologies.
> + */
> +#ifndef _TCP_DDP_H
> +#define _TCP_DDP_H
> +
> +#include <linux/netdevice.h>
> +#include <net/inet_connection_sock.h>
> +#include <net/sock.h>
> +
> +/* limits returned by the offload driver, zero means don't care */
> +struct tcp_ddp_limits {
> +       int      max_ddp_sgl_len;
> +};
> +
> +enum tcp_ddp_type {
> +       TCP_DDP_NVME = 1,
> +};
> +
> +/**
> + * struct tcp_ddp_config - Generic tcp ddp configuration: tcp ddp IO queue
> + * config implementations must use this as the first member.
> + * Add new instances of tcp_ddp_config below (nvme-tcp, etc.).
> + */
> +struct tcp_ddp_config {
> +       enum tcp_ddp_type    type;
> +       unsigned char        buf[];
> +};
> +
> +/**
> + * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue
> + *
> + * @pfv:        pdu version (e.g., NVME_TCP_PFV_1_0)
> + * @cpda:       controller pdu data alignmend (dwords, 0's based)
> + * @dgst:       digest types enabled.
> + *              The netdev will offload crc if ddp_crc is supported.
> + * @queue_size: number of nvme-tcp IO queue elements
> + * @queue_id:   queue identifier
> + * @cpu_io:     cpu core running the IO thread for this queue
> + */
> +struct nvme_tcp_ddp_config {
> +       struct tcp_ddp_config   cfg;
> +
> +       u16                     pfv;
> +       u8                      cpda;
> +       u8                      dgst;
> +       int                     queue_size;
> +       int                     queue_id;
> +       int                     io_cpu;
> +};
> +
> +/**
> + * struct tcp_ddp_io - tcp ddp configuration for an IO request.
> + *
> + * @command_id:  identifier on the wire associated with these buffers
> + * @nents:       number of entries in the sg_table
> + * @sg_table:    describing the buffers for this IO request
> + * @first_sgl:   first SGL in sg_table
> + */
> +struct tcp_ddp_io {
> +       u32                     command_id;
> +       int                     nents;
> +       struct sg_table         sg_table;
> +       struct scatterlist      first_sgl[SG_CHUNK_SIZE];
> +};
> +
> +/* struct tcp_ddp_dev_ops - operations used by an upper layer protocol to configure ddp offload
> + *
> + * @tcp_ddp_limits:    limit the number of scatter gather entries per IO.
> + *                     the device driver can use this to limit the resources allocated per queue.
> + * @tcp_ddp_sk_add:    add offload for the queue represennted by the socket+config pair.
> + *                     this function is used to configure either copy, crc or both offloads.
> + * @tcp_ddp_sk_del:    remove offload from the socket, and release any device related resources.
> + * @tcp_ddp_setup:     request copy offload for buffers associated with a command_id in tcp_ddp_io.
> + * @tcp_ddp_teardown:  release offload resources association between buffers and command_id in
> + *                     tcp_ddp_io.
> + * @tcp_ddp_resync:    respond to the driver's resync_request. Called only if resync is successful.
> + */
> +struct tcp_ddp_dev_ops {
> +       int (*tcp_ddp_limits)(struct net_device *netdev,
> +                             struct tcp_ddp_limits *limits);
> +       int (*tcp_ddp_sk_add)(struct net_device *netdev,
> +                             struct sock *sk,
> +                             struct tcp_ddp_config *config);
> +       void (*tcp_ddp_sk_del)(struct net_device *netdev,
> +                              struct sock *sk);
> +       int (*tcp_ddp_setup)(struct net_device *netdev,
> +                            struct sock *sk,
> +                            struct tcp_ddp_io *io);
> +       int (*tcp_ddp_teardown)(struct net_device *netdev,
> +                               struct sock *sk,
> +                               struct tcp_ddp_io *io,
> +                               void *ddp_ctx);
> +       void (*tcp_ddp_resync)(struct net_device *netdev,
> +                              struct sock *sk, u32 seq);
> +};
> +
> +#define TCP_DDP_RESYNC_REQ BIT(0)
> +
> +/**
> + * struct tcp_ddp_ulp_ops - Interface to register uppper layer Direct Data Placement (DDP) TCP offload
> + */
> +struct tcp_ddp_ulp_ops {
> +       /* NIC requests ulp to indicate if @seq is the start of a message */
> +       bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);
> +       /* NIC driver informs the ulp that ddp teardown is done - used for async completions*/
> +       void (*ddp_teardown_done)(void *ddp_ctx);
> +};
> +
> +/**
> + * struct tcp_ddp_ctx - Generic tcp ddp context: device driver per queue contexts must
> + * use this as the first member.
> + */
> +struct tcp_ddp_ctx {
> +       enum tcp_ddp_type    type;
> +       unsigned char        buf[];
> +};
> +
> +static inline struct tcp_ddp_ctx *tcp_ddp_get_ctx(const struct sock *sk)
> +{
> +       struct inet_connection_sock *icsk = inet_csk(sk);
> +
> +       return (__force struct tcp_ddp_ctx *)icsk->icsk_ulp_ddp_data;
> +}
> +
> +static inline void tcp_ddp_set_ctx(struct sock *sk, void *ctx)
> +{
> +       struct inet_connection_sock *icsk = inet_csk(sk);
> +
> +       rcu_assign_pointer(icsk->icsk_ulp_ddp_data, ctx);
> +}
> +
> +#endif //_TCP_DDP_H
> diff --git a/net/Kconfig b/net/Kconfig
> index f4c32d982af6..3876861cdc90 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -457,6 +457,15 @@ config ETHTOOL_NETLINK
>           netlink. It provides better extensibility and some new features,
>           e.g. notification messages.
>
> +config TCP_DDP
> +       bool "TCP direct data placement offload"
> +       default n
> +       help
> +         Direct Data Placement (DDP) offload for TCP enables ULP, such as
> +         NVMe-TCP/iSCSI, to request the NIC to place TCP payload data
> +         of a command response directly into kernel pages.
> +
> +
>  endif   # if NET
>
>  # Used by archs to tell that they support BPF JIT compiler plus which flavour.
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f62cae3f75d8..791c1b6bc067 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -69,6 +69,7 @@
>  #include <net/xfrm.h>
>  #include <net/mpls.h>
>  #include <net/mptcp.h>
> +#include <net/tcp_ddp.h>
>
>  #include <linux/uaccess.h>
>  #include <trace/events/skb.h>
> @@ -6140,9 +6141,15 @@ EXPORT_SYMBOL(pskb_extract);
>   */
>  void skb_condense(struct sk_buff *skb)
>  {
> +       bool is_ddp = false;
> +
> +#ifdef CONFIG_TCP_DDP

This looks strange to me : TCP should call this helper while skb->sk is NULL

Are you sure this is not dead code ?

> +       is_ddp = skb->sk && inet_csk(skb->sk) &&
> +                inet_csk(skb->sk)->icsk_ulp_ddp_data;
> +#endif
>         if (skb->data_len) {
>                 if (skb->data_len > skb->end - skb->tail ||
> -                   skb_cloned(skb))
> +                   skb_cloned(skb) || is_ddp)
>                         return;
>
>                 /* Nice, we can free page frag(s) right now */
> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> index 24036e3055a1..a2ff7a4a6bbf 100644
> --- a/net/ethtool/common.c
> +++ b/net/ethtool/common.c
> @@ -68,6 +68,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {
>         [NETIF_F_HW_TLS_RX_BIT] =        "tls-hw-rx-offload",
>         [NETIF_F_GRO_FRAGLIST_BIT] =     "rx-gro-list",
>         [NETIF_F_HW_MACSEC_BIT] =        "macsec-hw-offload",
> +       [NETIF_F_HW_TCP_DDP_BIT] =       "tcp-ddp-offload",
>  };
>
>  const char
> --
> 2.24.1
>
David Ahern Jan. 16, 2021, 4:57 a.m. UTC | #2
I have not had time to review this version of the patches, but this
patch seems very similar to 13 of 15 from v1 and you did not respond to
my question on it ...

On 1/14/21 8:10 AM, Boris Pismenny wrote:
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c

> new file mode 100644

> index 000000000000..f446b5d56d64

> --- /dev/null

> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c

> @@ -0,0 +1,243 @@

> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB

> +/* Copyright (c) 2021 Mellanox Technologies. */

> +

> +#include "en_accel/nvmeotcp_rxtx.h"

> +#include "en_accel/nvmeotcp.h"

> +#include <linux/mlx5/mlx5_ifc.h>

> +

> +#define	MLX5E_TC_FLOW_ID_MASK  0x00ffffff

> +static void nvmeotcp_update_resync(struct mlx5e_nvmeotcp_queue *queue,

> +				   struct mlx5e_cqe128 *cqe128)

> +{

> +	const struct tcp_ddp_ulp_ops *ulp_ops;

> +	u32 seq;

> +

> +	seq = be32_to_cpu(cqe128->resync_tcp_sn);

> +	ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops;

> +	if (ulp_ops && ulp_ops->resync_request)

> +		ulp_ops->resync_request(queue->sk, seq, TCP_DDP_RESYNC_REQ);

> +}

> +

> +static void mlx5e_nvmeotcp_advance_sgl_iter(struct mlx5e_nvmeotcp_queue *queue)

> +{

> +	struct nvmeotcp_queue_entry *nqe = &queue->ccid_table[queue->ccid];

> +

> +	queue->ccoff += nqe->sgl[queue->ccsglidx].length;

> +	queue->ccoff_inner = 0;

> +	queue->ccsglidx++;

> +}

> +

> +static inline void

> +mlx5e_nvmeotcp_add_skb_frag(struct net_device *netdev, struct sk_buff *skb,

> +			    struct mlx5e_nvmeotcp_queue *queue,

> +			    struct nvmeotcp_queue_entry *nqe, u32 fragsz)

> +{

> +	dma_sync_single_for_cpu(&netdev->dev,

> +				nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner,

> +				fragsz, DMA_FROM_DEVICE);

> +	page_ref_inc(compound_head(sg_page(&nqe->sgl[queue->ccsglidx])));

> +	// XXX: consider reducing the truesize, as no new memory is consumed

> +	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,

> +			sg_page(&nqe->sgl[queue->ccsglidx]),

> +			nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner,

> +			fragsz,

> +			fragsz);

> +}

> +

> +static struct sk_buff*

> +mlx5_nvmeotcp_add_tail_nonlinear(struct mlx5e_nvmeotcp_queue *queue,

> +				 struct sk_buff *skb, skb_frag_t *org_frags,

> +				 int org_nr_frags, int frag_index)

> +{

> +	struct mlx5e_priv *priv = queue->priv;

> +

> +	while (org_nr_frags != frag_index) {

> +		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {

> +			dev_kfree_skb_any(skb);

> +			return NULL;

> +		}

> +		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,

> +				skb_frag_page(&org_frags[frag_index]),

> +				skb_frag_off(&org_frags[frag_index]),

> +				skb_frag_size(&org_frags[frag_index]),

> +				skb_frag_size(&org_frags[frag_index]));

> +		page_ref_inc(skb_frag_page(&org_frags[frag_index]));

> +		frag_index++;

> +	}

> +	return skb;

> +}

> +

> +static struct sk_buff*

> +mlx5_nvmeotcp_add_tail(struct mlx5e_nvmeotcp_queue *queue, struct sk_buff *skb,

> +		       int offset, int len)

> +{

> +	struct mlx5e_priv *priv = queue->priv;

> +

> +	if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {

> +		dev_kfree_skb_any(skb);

> +		return NULL;

> +	}

> +	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,

> +			virt_to_page(skb->data),

> +			offset,

> +			len,

> +			len);

> +	page_ref_inc(virt_to_page(skb->data));

> +	return skb;

> +}

> +

> +static void mlx5_nvmeotcp_trim_nonlinear(struct sk_buff *skb,

> +					 skb_frag_t *org_frags,

> +					 int *frag_index,

> +					 int remaining)

> +{

> +	unsigned int frag_size;

> +	int nr_frags;

> +

> +	/* skip @remaining bytes in frags */

> +	*frag_index = 0;

> +	while (remaining) {

> +		frag_size = skb_frag_size(&skb_shinfo(skb)->frags[*frag_index]);

> +		if (frag_size > remaining) {

> +			skb_frag_off_add(&skb_shinfo(skb)->frags[*frag_index],

> +					 remaining);

> +			skb_frag_size_sub(&skb_shinfo(skb)->frags[*frag_index],

> +					  remaining);

> +			remaining = 0;

> +		} else {

> +			remaining -= frag_size;

> +			skb_frag_unref(skb, *frag_index);

> +			*frag_index += 1;

> +		}

> +	}

> +

> +	/* save original frags for the tail and unref */

> +	nr_frags = skb_shinfo(skb)->nr_frags;

> +	memcpy(&org_frags[*frag_index], &skb_shinfo(skb)->frags[*frag_index],

> +	       (nr_frags - *frag_index) * sizeof(skb_frag_t));

> +	while (--nr_frags >= *frag_index)

> +		skb_frag_unref(skb, nr_frags);

> +

> +	/* remove frags from skb */

> +	skb_shinfo(skb)->nr_frags = 0;

> +	skb->len -= skb->data_len;

> +	skb->truesize -= skb->data_len;

> +	skb->data_len = 0;

> +}

> +

> +struct sk_buff*

> +mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,

> +			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt,

> +			     bool linear)

> +{

> +	int ccoff, cclen, hlen, ccid, remaining, fragsz, to_copy = 0;

> +	struct mlx5e_priv *priv = netdev_priv(netdev);

> +	skb_frag_t org_frags[MAX_SKB_FRAGS];

> +	struct mlx5e_nvmeotcp_queue *queue;

> +	struct nvmeotcp_queue_entry *nqe;

> +	int org_nr_frags, frag_index;

> +	struct mlx5e_cqe128 *cqe128;

> +	u32 queue_id;

> +

> +	queue_id = (be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK);

> +	queue = mlx5e_nvmeotcp_get_queue(priv->nvmeotcp, queue_id);

> +	if (unlikely(!queue)) {

> +		dev_kfree_skb_any(skb);

> +		return NULL;

> +	}

> +

> +	cqe128 = container_of(cqe, struct mlx5e_cqe128, cqe64);

> +	if (cqe_is_nvmeotcp_resync(cqe)) {

> +		nvmeotcp_update_resync(queue, cqe128);

> +		mlx5e_nvmeotcp_put_queue(queue);

> +		return skb;

> +	}

> +

> +#ifdef CONFIG_TCP_DDP_CRC

> +	/* If a resync occurred in the previous cqe,

> +	 * the current cqe.crcvalid bit may not be valid,

> +	 * so we will treat it as 0

> +	 */

> +	skb->ddp_crc = queue->after_resync_cqe ? 0 :

> +		cqe_is_nvmeotcp_crcvalid(cqe);

> +	queue->after_resync_cqe = 0;

> +#endif

> +	if (!cqe_is_nvmeotcp_zc(cqe)) {

> +		mlx5e_nvmeotcp_put_queue(queue);

> +		return skb;

> +	}

> +

> +	/* cc ddp from cqe */

> +	ccid = be16_to_cpu(cqe128->ccid);

> +	ccoff = be32_to_cpu(cqe128->ccoff);

> +	cclen = be16_to_cpu(cqe128->cclen);

> +	hlen  = be16_to_cpu(cqe128->hlen);

> +

> +	/* carve a hole in the skb for DDP data */

> +	if (linear) {

> +		skb_trim(skb, hlen);

> +	} else {

> +		org_nr_frags = skb_shinfo(skb)->nr_frags;

> +		mlx5_nvmeotcp_trim_nonlinear(skb, org_frags, &frag_index,

> +					     cclen);

> +	}

> +

> +	nqe = &queue->ccid_table[ccid];

> +

> +	/* packet starts new ccid? */

> +	if (queue->ccid != ccid || queue->ccid_gen != nqe->ccid_gen) {

> +		queue->ccid = ccid;

> +		queue->ccoff = 0;

> +		queue->ccoff_inner = 0;

> +		queue->ccsglidx = 0;

> +		queue->ccid_gen = nqe->ccid_gen;

> +	}

> +

> +	/* skip inside cc until the ccoff in the cqe */

> +	while (queue->ccoff + queue->ccoff_inner < ccoff) {

> +		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;

> +		fragsz = min_t(off_t, remaining,

> +			       ccoff - (queue->ccoff + queue->ccoff_inner));

> +

> +		if (fragsz == remaining)

> +			mlx5e_nvmeotcp_advance_sgl_iter(queue);

> +		else

> +			queue->ccoff_inner += fragsz;

> +	}

> +

> +	/* adjust the skb according to the cqe cc */

> +	while (to_copy < cclen) {

> +		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {

> +			dev_kfree_skb_any(skb);

> +			mlx5e_nvmeotcp_put_queue(queue);

> +			return NULL;

> +		}

> +

> +		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;

> +		fragsz = min_t(int, remaining, cclen - to_copy);

> +

> +		mlx5e_nvmeotcp_add_skb_frag(netdev, skb, queue, nqe, fragsz);

> +		to_copy += fragsz;

> +		if (fragsz == remaining)

> +			mlx5e_nvmeotcp_advance_sgl_iter(queue);

> +		else

> +			queue->ccoff_inner += fragsz;

> +	}

> +

> +	if (cqe_bcnt > hlen + cclen) {

> +		remaining = cqe_bcnt - hlen - cclen;

> +		if (linear)

> +			skb = mlx5_nvmeotcp_add_tail(queue, skb,

> +						     offset_in_page(skb->data) +

> +								hlen + cclen,

> +						     remaining);

> +		else

> +			skb = mlx5_nvmeotcp_add_tail_nonlinear(queue, skb,

> +							       org_frags,

> +							       org_nr_frags,

> +							       frag_index);

> +	}

> +

> +	mlx5e_nvmeotcp_put_queue(queue);

> +	return skb;

> +}




... I'll copy and paste my question here:

"mlx5e_skb_from_cqe_mpwrq_linear and mlx5e_skb_from_cqe_mpwrq_nolinear
create an skb and then this function comes behind it, strips any frags
originally added to the skb, adds the frags for the sgls, and then
re-adds the original frags.

Why is this needed? Why can't the skb be created with all of the frags
in proper order?

It seems like this dance is not needed if you had generic header/payload
splits with the payload written to less restrictive SGLs."

This patch seems to be something very similar, and it is a really
complicated way to create each skb for DDP. The patch description does
little to explain why it is needed.
Boris Pismenny Jan. 17, 2021, 8:42 a.m. UTC | #3
On 16/01/2021 6:57, David Ahern wrote:
> I have not had time to review this version of the patches, but this
> patch seems very similar to 13 of 15 from v1 and you did not respond to
> my question on it ...
>
> On 1/14/21 8:10 AM, Boris Pismenny wrote:

>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c

>> new file mode 100644

>> index 000000000000..f446b5d56d64

>> --- /dev/null

>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c

>> @@ -0,0 +1,243 @@

>> +// SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB

>> +/* Copyright (c) 2021 Mellanox Technologies. */

>> +

>> +#include "en_accel/nvmeotcp_rxtx.h"

>> +#include "en_accel/nvmeotcp.h"

>> +#include <linux/mlx5/mlx5_ifc.h>

>> +

>> +#define	MLX5E_TC_FLOW_ID_MASK  0x00ffffff

>> +static void nvmeotcp_update_resync(struct mlx5e_nvmeotcp_queue *queue,

>> +				   struct mlx5e_cqe128 *cqe128)

>> +{

>> +	const struct tcp_ddp_ulp_ops *ulp_ops;

>> +	u32 seq;

>> +

>> +	seq = be32_to_cpu(cqe128->resync_tcp_sn);

>> +	ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops;

>> +	if (ulp_ops && ulp_ops->resync_request)

>> +		ulp_ops->resync_request(queue->sk, seq, TCP_DDP_RESYNC_REQ);

>> +}

>> +

>> +static void mlx5e_nvmeotcp_advance_sgl_iter(struct mlx5e_nvmeotcp_queue *queue)

>> +{

>> +	struct nvmeotcp_queue_entry *nqe = &queue->ccid_table[queue->ccid];

>> +

>> +	queue->ccoff += nqe->sgl[queue->ccsglidx].length;

>> +	queue->ccoff_inner = 0;

>> +	queue->ccsglidx++;

>> +}

>> +

>> +static inline void

>> +mlx5e_nvmeotcp_add_skb_frag(struct net_device *netdev, struct sk_buff *skb,

>> +			    struct mlx5e_nvmeotcp_queue *queue,

>> +			    struct nvmeotcp_queue_entry *nqe, u32 fragsz)

>> +{

>> +	dma_sync_single_for_cpu(&netdev->dev,

>> +				nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner,

>> +				fragsz, DMA_FROM_DEVICE);

>> +	page_ref_inc(compound_head(sg_page(&nqe->sgl[queue->ccsglidx])));

>> +	// XXX: consider reducing the truesize, as no new memory is consumed

>> +	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,

>> +			sg_page(&nqe->sgl[queue->ccsglidx]),

>> +			nqe->sgl[queue->ccsglidx].offset + queue->ccoff_inner,

>> +			fragsz,

>> +			fragsz);

>> +}

>> +

>> +static struct sk_buff*

>> +mlx5_nvmeotcp_add_tail_nonlinear(struct mlx5e_nvmeotcp_queue *queue,

>> +				 struct sk_buff *skb, skb_frag_t *org_frags,

>> +				 int org_nr_frags, int frag_index)

>> +{

>> +	struct mlx5e_priv *priv = queue->priv;

>> +

>> +	while (org_nr_frags != frag_index) {

>> +		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {

>> +			dev_kfree_skb_any(skb);

>> +			return NULL;

>> +		}

>> +		skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,

>> +				skb_frag_page(&org_frags[frag_index]),

>> +				skb_frag_off(&org_frags[frag_index]),

>> +				skb_frag_size(&org_frags[frag_index]),

>> +				skb_frag_size(&org_frags[frag_index]));

>> +		page_ref_inc(skb_frag_page(&org_frags[frag_index]));

>> +		frag_index++;

>> +	}

>> +	return skb;

>> +}

>> +

>> +static struct sk_buff*

>> +mlx5_nvmeotcp_add_tail(struct mlx5e_nvmeotcp_queue *queue, struct sk_buff *skb,

>> +		       int offset, int len)

>> +{

>> +	struct mlx5e_priv *priv = queue->priv;

>> +

>> +	if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {

>> +		dev_kfree_skb_any(skb);

>> +		return NULL;

>> +	}

>> +	skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,

>> +			virt_to_page(skb->data),

>> +			offset,

>> +			len,

>> +			len);

>> +	page_ref_inc(virt_to_page(skb->data));

>> +	return skb;

>> +}

>> +

>> +static void mlx5_nvmeotcp_trim_nonlinear(struct sk_buff *skb,

>> +					 skb_frag_t *org_frags,

>> +					 int *frag_index,

>> +					 int remaining)

>> +{

>> +	unsigned int frag_size;

>> +	int nr_frags;

>> +

>> +	/* skip @remaining bytes in frags */

>> +	*frag_index = 0;

>> +	while (remaining) {

>> +		frag_size = skb_frag_size(&skb_shinfo(skb)->frags[*frag_index]);

>> +		if (frag_size > remaining) {

>> +			skb_frag_off_add(&skb_shinfo(skb)->frags[*frag_index],

>> +					 remaining);

>> +			skb_frag_size_sub(&skb_shinfo(skb)->frags[*frag_index],

>> +					  remaining);

>> +			remaining = 0;

>> +		} else {

>> +			remaining -= frag_size;

>> +			skb_frag_unref(skb, *frag_index);

>> +			*frag_index += 1;

>> +		}

>> +	}

>> +

>> +	/* save original frags for the tail and unref */

>> +	nr_frags = skb_shinfo(skb)->nr_frags;

>> +	memcpy(&org_frags[*frag_index], &skb_shinfo(skb)->frags[*frag_index],

>> +	       (nr_frags - *frag_index) * sizeof(skb_frag_t));

>> +	while (--nr_frags >= *frag_index)

>> +		skb_frag_unref(skb, nr_frags);

>> +

>> +	/* remove frags from skb */

>> +	skb_shinfo(skb)->nr_frags = 0;

>> +	skb->len -= skb->data_len;

>> +	skb->truesize -= skb->data_len;

>> +	skb->data_len = 0;

>> +}

>> +

>> +struct sk_buff*

>> +mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb,

>> +			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt,

>> +			     bool linear)

>> +{

>> +	int ccoff, cclen, hlen, ccid, remaining, fragsz, to_copy = 0;

>> +	struct mlx5e_priv *priv = netdev_priv(netdev);

>> +	skb_frag_t org_frags[MAX_SKB_FRAGS];

>> +	struct mlx5e_nvmeotcp_queue *queue;

>> +	struct nvmeotcp_queue_entry *nqe;

>> +	int org_nr_frags, frag_index;

>> +	struct mlx5e_cqe128 *cqe128;

>> +	u32 queue_id;

>> +

>> +	queue_id = (be32_to_cpu(cqe->sop_drop_qpn) & MLX5E_TC_FLOW_ID_MASK);

>> +	queue = mlx5e_nvmeotcp_get_queue(priv->nvmeotcp, queue_id);

>> +	if (unlikely(!queue)) {

>> +		dev_kfree_skb_any(skb);

>> +		return NULL;

>> +	}

>> +

>> +	cqe128 = container_of(cqe, struct mlx5e_cqe128, cqe64);

>> +	if (cqe_is_nvmeotcp_resync(cqe)) {

>> +		nvmeotcp_update_resync(queue, cqe128);

>> +		mlx5e_nvmeotcp_put_queue(queue);

>> +		return skb;

>> +	}

>> +

>> +#ifdef CONFIG_TCP_DDP_CRC

>> +	/* If a resync occurred in the previous cqe,

>> +	 * the current cqe.crcvalid bit may not be valid,

>> +	 * so we will treat it as 0

>> +	 */

>> +	skb->ddp_crc = queue->after_resync_cqe ? 0 :

>> +		cqe_is_nvmeotcp_crcvalid(cqe);

>> +	queue->after_resync_cqe = 0;

>> +#endif

>> +	if (!cqe_is_nvmeotcp_zc(cqe)) {

>> +		mlx5e_nvmeotcp_put_queue(queue);

>> +		return skb;

>> +	}

>> +

>> +	/* cc ddp from cqe */

>> +	ccid = be16_to_cpu(cqe128->ccid);

>> +	ccoff = be32_to_cpu(cqe128->ccoff);

>> +	cclen = be16_to_cpu(cqe128->cclen);

>> +	hlen  = be16_to_cpu(cqe128->hlen);

>> +

>> +	/* carve a hole in the skb for DDP data */

>> +	if (linear) {

>> +		skb_trim(skb, hlen);

>> +	} else {

>> +		org_nr_frags = skb_shinfo(skb)->nr_frags;

>> +		mlx5_nvmeotcp_trim_nonlinear(skb, org_frags, &frag_index,

>> +					     cclen);

>> +	}

>> +

>> +	nqe = &queue->ccid_table[ccid];

>> +

>> +	/* packet starts new ccid? */

>> +	if (queue->ccid != ccid || queue->ccid_gen != nqe->ccid_gen) {

>> +		queue->ccid = ccid;

>> +		queue->ccoff = 0;

>> +		queue->ccoff_inner = 0;

>> +		queue->ccsglidx = 0;

>> +		queue->ccid_gen = nqe->ccid_gen;

>> +	}

>> +

>> +	/* skip inside cc until the ccoff in the cqe */

>> +	while (queue->ccoff + queue->ccoff_inner < ccoff) {

>> +		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;

>> +		fragsz = min_t(off_t, remaining,

>> +			       ccoff - (queue->ccoff + queue->ccoff_inner));

>> +

>> +		if (fragsz == remaining)

>> +			mlx5e_nvmeotcp_advance_sgl_iter(queue);

>> +		else

>> +			queue->ccoff_inner += fragsz;

>> +	}

>> +

>> +	/* adjust the skb according to the cqe cc */

>> +	while (to_copy < cclen) {

>> +		if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) {

>> +			dev_kfree_skb_any(skb);

>> +			mlx5e_nvmeotcp_put_queue(queue);

>> +			return NULL;

>> +		}

>> +

>> +		remaining = nqe->sgl[queue->ccsglidx].length - queue->ccoff_inner;

>> +		fragsz = min_t(int, remaining, cclen - to_copy);

>> +

>> +		mlx5e_nvmeotcp_add_skb_frag(netdev, skb, queue, nqe, fragsz);

>> +		to_copy += fragsz;

>> +		if (fragsz == remaining)

>> +			mlx5e_nvmeotcp_advance_sgl_iter(queue);

>> +		else

>> +			queue->ccoff_inner += fragsz;

>> +	}

>> +

>> +	if (cqe_bcnt > hlen + cclen) {

>> +		remaining = cqe_bcnt - hlen - cclen;

>> +		if (linear)

>> +			skb = mlx5_nvmeotcp_add_tail(queue, skb,

>> +						     offset_in_page(skb->data) +

>> +								hlen + cclen,

>> +						     remaining);

>> +		else

>> +			skb = mlx5_nvmeotcp_add_tail_nonlinear(queue, skb,

>> +							       org_frags,

>> +							       org_nr_frags,

>> +							       frag_index);

>> +	}

>> +

>> +	mlx5e_nvmeotcp_put_queue(queue);

>> +	return skb;

>> +}

>
> ... I'll copy and paste my question here:
>
> "mlx5e_skb_from_cqe_mpwrq_linear and mlx5e_skb_from_cqe_mpwrq_nolinear
> create an skb and then this function comes behind it, strips any frags
> originally added to the skb, adds the frags for the sgls, and then
> re-adds the original frags.
>
> Why is this needed? Why can't the skb be created with all of the frags
> in proper order?
>
> It seems like this dance is not needed if you had generic header/payload
> splits with the payload written to less restrictive SGLs."
>
> This patch seems to be something very similar, and it is a really
> complicated way to create each skb for DDP. The patch description does
> little to explain why it is needed.


This is the same patch as before.

I'll start by explaining why this is needed. Then, clarify why generic
header-data split is not enough.

This is needed for a few reasons that are explained in detail
in the tcp-ddp offload documentation. See patch 21 overview
and rx-data-path sections. Our reasons are as follows:
1) Each SKB may contain multiple PDUs. DDP offload doesn't operate on
PDU headers, so these are written in the receive ring. Therefore, we
need to rebuild the SKB to account for it. Additionally, due to HW
limitations, we will only offload the first PDU in the SKB.
2) The newly constructed SKB represents the original data as it is on
the wire, such that the network stack is oblivious to the offload.
3) We decided not to modify all of the mlx5e_skb_from_cqe* functions
because it would make the offload harder to distinguish, and it would
add overhead to the existing data-path functions. Therefore, we opted
for this modular approach.

If we only had generic header-data split, then we just couldn't
provide this offload. It is not enough to place payload into some
buffer without TCP headers because RPC protocols and advanced storage
protocols, such as nvme-tcp, reorder their responses and require data
to be placed into application/pagecache buffers, which are anything
but anonymous. In other words, header-data split alone writes data
to the wrong buffers (reordering), or to anonymous buffers that
can't be page-flipped to replace application/pagecache buffers.
David Ahern Jan. 19, 2021, 4:18 a.m. UTC | #4
On 1/14/21 8:10 AM, Boris Pismenny wrote:
> @@ -664,8 +753,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,

>  		return -EINVAL;

>  	}

>  

> -	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))

> -		nvme_complete_rq(rq);

> +	req = blk_mq_rq_to_pdu(rq);

> +	if (req->offloaded) {

> +		req->status = cqe->status;

> +		req->result = cqe->result;

> +		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);

> +	} else {

> +		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))

> +			nvme_complete_rq(rq);

> +	}

>  	queue->nr_cqe++;

>  

>  	return 0;

> @@ -859,9 +955,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,

>  static inline void nvme_tcp_end_request(struct request *rq, u16 status)

>  {

>  	union nvme_result res = {};

> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

> +	struct nvme_tcp_queue *queue = req->queue;

> +	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;

>  

> -	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))

> -		nvme_complete_rq(rq);

> +	if (req->offloaded) {

> +		req->status = cpu_to_le16(status << 1);

> +		req->result = res;

> +		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);

> +	} else {

> +		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))

> +			nvme_complete_rq(rq);

> +	}

>  }

>  

>  static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,



The req->offload checks assume the offload is to the expected
offload_netdev, but you do not verify the data arrived as expected. You
might get lucky if both netdev's belong to the same PCI device (assuming
the h/w handles it a certain way), but it will not if the netdev's
belong to different devices.

Consider a system with 2 network cards -- even if it is 2 mlx5 based
devices. One setup can have the system using a bond with 1 port from
each PCI device. The tx path picks a leg based on the hash of the ntuple
and that (with Tariq's bond patches) becomes the expected offload
device. A similar example holds for a pure routing setup with ECMP. For
both there is full redundancy in the network - separate NIC cards
connected to separate TORs to have independent network paths.

A packet arrives on the *other* netdevice - you have *no* control over
the Rx path. Your current checks will think the packet arrived with DDP
but it did not.
David Ahern Jan. 19, 2021, 4:36 a.m. UTC | #5
On 1/17/21 1:42 AM, Boris Pismenny wrote:
> This is needed for a few reasons that are explained in detail
> in the tcp-ddp offload documentation. See patch 21 overview
> and rx-data-path sections. Our reasons are as follows:

I read the documentation patch, and it does not explain it and really
should not since this is very mlx specific based on the changes.
Different h/w will have different limitations. Given that, it would be
best to enhance the patch description to explain why these gymnastics
are needed for the skb.

> 1) Each SKB may contain multiple PDUs. DDP offload doesn't operate on
> PDU headers, so these are written in the receive ring. Therefore, we
> need to rebuild the SKB to account for it. Additionally, due to HW
> limitations, we will only offload the first PDU in the SKB.

Are you referring to LRO skbs here? I can't imagine going through this
for 1500 byte packets that have multiple PDUs.

> 2) The newly constructed SKB represents the original data as it is on
> the wire, such that the network stack is oblivious to the offload.
> 3) We decided not to modify all of the mlx5e_skb_from_cqe* functions
> because it would make the offload harder to distinguish, and it would
> add overhead to the existing data-path functions. Therefore, we opted
> for this modular approach.
>
> If we only had generic header-data split, then we just couldn't
> provide this offload. It is not enough to place payload into some
> buffer without TCP headers because RPC protocols and advanced storage
> protocols, such as nvme-tcp, reorder their responses and require data
> to be placed into application/pagecache buffers, which are anything
> but anonymous. In other words, header-data split alone writes data
> to the wrong buffers (reordering), or to anonymous buffers that
> can't be page-flipped to replace application/pagecache buffers.
Boris Pismenny Jan. 31, 2021, 8:44 a.m. UTC | #6
On 19/01/2021 6:18, David Ahern wrote:
> On 1/14/21 8:10 AM, Boris Pismenny wrote:

>> @@ -664,8 +753,15 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,

>>  		return -EINVAL;

>>  	}

>>  

>> -	if (!nvme_try_complete_req(rq, cqe->status, cqe->result))

>> -		nvme_complete_rq(rq);

>> +	req = blk_mq_rq_to_pdu(rq);

>> +	if (req->offloaded) {

>> +		req->status = cqe->status;

>> +		req->result = cqe->result;

>> +		nvme_tcp_teardown_ddp(queue, cqe->command_id, rq);

>> +	} else {

>> +		if (!nvme_try_complete_req(rq, cqe->status, cqe->result))

>> +			nvme_complete_rq(rq);

>> +	}

>>  	queue->nr_cqe++;

>>  

>>  	return 0;

>> @@ -859,9 +955,18 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb,

>>  static inline void nvme_tcp_end_request(struct request *rq, u16 status)

>>  {

>>  	union nvme_result res = {};

>> +	struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);

>> +	struct nvme_tcp_queue *queue = req->queue;

>> +	struct nvme_tcp_data_pdu *pdu = (void *)queue->pdu;

>>  

>> -	if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))

>> -		nvme_complete_rq(rq);

>> +	if (req->offloaded) {

>> +		req->status = cpu_to_le16(status << 1);

>> +		req->result = res;

>> +		nvme_tcp_teardown_ddp(queue, pdu->command_id, rq);

>> +	} else {

>> +		if (!nvme_try_complete_req(rq, cpu_to_le16(status << 1), res))

>> +			nvme_complete_rq(rq);

>> +	}

>>  }

>>  

>>  static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,

> 

> 

> The req->offload checks assume the offload is to the expected

> offload_netdev, but you do not verify the data arrived as expected. You

> might get lucky if both netdev's belong to the same PCI device (assuming

> the h/w handles it a certain way), but it will not if the netdev's

> belong to different devices.

> 

> Consider a system with 2 network cards -- even if it is 2 mlx5 based

> devices. One setup can have the system using a bond with 1 port from

> each PCI device. The tx path picks a leg based on the hash of the ntuple

> and that (with Tariq's bond patches) becomes the expected offload

> device. A similar example holds for a pure routing setup with ECMP. For

> both there is full redundancy in the network - separate NIC cards

> connected to separate TORs to have independent network paths.

> 

> A packet arrives on the *other* netdevice - you have *no* control over

> the Rx path. Your current checks will think the packet arrived with DDP

> but it did not.

> 


There's no problem if another (non-offload) netdevice receives traffic
that arrives here. Because that other device will never set the SKB
frag pages to point to the final destination buffers, and so copy
offload will not take place.

The req->offload indication is mainly for matching ddp_setup with
ddp_teardown calls, it does not control copy/crc offload as these are
controlled per-skb using frags for copy and skb bits for crc.
Boris Pismenny Jan. 31, 2021, 9:27 a.m. UTC | #7
On 19/01/2021 6:36, David Ahern wrote:
> On 1/17/21 1:42 AM, Boris Pismenny wrote:
>> This is needed for a few reasons that are explained in detail
>> in the tcp-ddp offload documentation. See patch 21 overview
>> and rx-data-path sections. Our reasons are as follows:
>
> I read the documentation patch, and it does not explain it and really
> should not since this is very mlx specific based on the changes.
> Different h/w will have different limitations. Given that, it would be
> best to enhance the patch description to explain why these gymnastics
> are needed for the skb.

The text in the documentation that describes this trade-off:
"We remark that a single TCP packet may have numerous PDUs embedded
inside. NICs can choose to offload one or more of these PDUs according
to various trade-offs. Possibly, offloading such small PDUs is of little
value, and it is better to leave it to software."

Indeed, different HW may have other additional trade-offs. But, I
suspect that this one will be important for all.

>> 1) Each SKB may contain multiple PDUs. DDP offload doesn't operate on
>> PDU headers, so these are written in the receive ring. Therefore, we
>> need to rebuild the SKB to account for it. Additionally, due to HW
>> limitations, we will only offload the first PDU in the SKB.
>
> Are you referring to LRO skbs here? I can't imagine going through this
> for 1500 byte packets that have multiple PDUs.

No, that is true for any skb, and non-LRO skbs in particular. Most SKBs
do not contain multiple PDUs, but the ones that do are handled
gracefully in this function.
Boris Pismenny Jan. 31, 2021, 10:40 a.m. UTC | #8
On 14/01/2021 22:43, Eric Dumazet wrote:
> On Thu, Jan 14, 2021 at 9:19 PM Boris Pismenny <borispismenny@gmail.com> wrote:

>>

>>

>>

>> On 14/01/2021 17:57, Eric Dumazet wrote:

>>> On Thu, Jan 14, 2021 at 4:10 PM Boris Pismenny <borisp@mellanox.com> wrote:

>>>>

>>>> This commit introduces direct data placement offload for TCP.

>>>> This capability is accompanied by new net_device operations that

>>>> configure hardware contexts. There is a context per socket, and a context per DDP

>>>> opreation. Additionally, a resynchronization routine is used to assist

>>>> hardware handle TCP OOO, and continue the offload.

>>>> Furthermore, we let the offloading driver advertise what is the max hw

>>>> sectors/segments.

>>>>

>>>> Using this interface, the NIC hardware will scatter TCP payload directly

>>>> to the BIO pages according to the command_id.

>>>> To maintain the correctness of the network stack, the driver is expected

>>>> to construct SKBs that point to the BIO pages.

>>>>

>>>> This, the SKB represents the data on the wire, while it is pointing

>>>> to data that is already placed in the destination buffer.

>>>> As a result, data from page frags should not be copied out to

>>>> the linear part.

>>>>

>>>> As SKBs that use DDP are already very memory efficient, we modify

>>>> skb_condence to avoid copying data from fragments to the linear

>>>> part of SKBs that belong to a socket that uses DDP offload.

>>>>

>>>> A follow-up patch will use this interface for DDP in NVMe-TCP.

>>>>

>>>> Signed-off-by: Boris Pismenny <borisp@mellanox.com>

>>>> Signed-off-by: Ben Ben-Ishay <benishay@mellanox.com>

>>>> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>

>>>> Signed-off-by: Yoray Zack <yorayz@mellanox.com>

>>>> ---

>>>>  include/linux/netdev_features.h    |   2 +

>>>>  include/linux/netdevice.h          |   5 ++

>>>>  include/net/inet_connection_sock.h |   4 +

>>>>  include/net/tcp_ddp.h              | 136 +++++++++++++++++++++++++++++

>>>>  net/Kconfig                        |   9 ++

>>>>  net/core/skbuff.c                  |   9 +-

>>>>  net/ethtool/common.c               |   1 +

>>>>  7 files changed, 165 insertions(+), 1 deletion(-)

>>>>  create mode 100644 include/net/tcp_ddp.h

>>>>

>>>> diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h

>>>> index 934de56644e7..fb35dcac03d2 100644

>>>> --- a/include/linux/netdev_features.h

>>>> +++ b/include/linux/netdev_features.h

>>>> @@ -84,6 +84,7 @@ enum {

>>>>         NETIF_F_GRO_FRAGLIST_BIT,       /* Fraglist GRO */

>>>>

>>>>         NETIF_F_HW_MACSEC_BIT,          /* Offload MACsec operations */

>>>> +       NETIF_F_HW_TCP_DDP_BIT,         /* TCP direct data placement offload */

>>>>

>>>>         /*

>>>>          * Add your fresh new feature above and remember to update

>>>> @@ -157,6 +158,7 @@ enum {

>>>>  #define NETIF_F_GRO_FRAGLIST   __NETIF_F(GRO_FRAGLIST)

>>>>  #define NETIF_F_GSO_FRAGLIST   __NETIF_F(GSO_FRAGLIST)

>>>>  #define NETIF_F_HW_MACSEC      __NETIF_F(HW_MACSEC)

>>>> +#define NETIF_F_HW_TCP_DDP     __NETIF_F(HW_TCP_DDP)

>>>>

>>>>  /* Finds the next feature with the highest number of the range of start till 0.

>>>>   */

>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h

>>>> index 259be67644e3..3dd3cdf5dec3 100644

>>>> --- a/include/linux/netdevice.h

>>>> +++ b/include/linux/netdevice.h

>>>> @@ -941,6 +941,7 @@ struct dev_ifalias {

>>>>

>>>>  struct devlink;

>>>>  struct tlsdev_ops;

>>>> +struct tcp_ddp_dev_ops;

>>>>

>>>>  struct netdev_name_node {

>>>>         struct hlist_node hlist;

>>>> @@ -1937,6 +1938,10 @@ struct net_device {

>>>>         const struct tlsdev_ops *tlsdev_ops;

>>>>  #endif

>>>>

>>>> +#ifdef CONFIG_TCP_DDP

>>>> +       const struct tcp_ddp_dev_ops *tcp_ddp_ops;

>>>> +#endif

>>>> +

>>>>         const struct header_ops *header_ops;

>>>>

>>>>         unsigned int            flags;

>>>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h

>>>> index 7338b3865a2a..a08b85b53aa8 100644

>>>> --- a/include/net/inet_connection_sock.h

>>>> +++ b/include/net/inet_connection_sock.h

>>>> @@ -66,6 +66,8 @@ struct inet_connection_sock_af_ops {

>>>>   * @icsk_ulp_ops          Pluggable ULP control hook

>>>>   * @icsk_ulp_data         ULP private data

>>>>   * @icsk_clean_acked      Clean acked data hook

>>>> + * @icsk_ulp_ddp_ops      Pluggable ULP direct data placement control hook

>>>> + * @icsk_ulp_ddp_data     ULP direct data placement private data

>>>>   * @icsk_listen_portaddr_node  hash to the portaddr listener hashtable

>>>>   * @icsk_ca_state:        Congestion control state

>>>>   * @icsk_retransmits:     Number of unrecovered [RTO] timeouts

>>>> @@ -94,6 +96,8 @@ struct inet_connection_sock {

>>>>         const struct tcp_ulp_ops  *icsk_ulp_ops;

>>>>         void __rcu                *icsk_ulp_data;

>>>>         void (*icsk_clean_acked)(struct sock *sk, u32 acked_seq);

>>>

>>> #ifdef CONFIG_TCP_DDP ?

>>>

>>>> +       const struct tcp_ddp_ulp_ops  *icsk_ulp_ddp_ops;

>>>> +       void __rcu                *icsk_ulp_ddp_data;

>>>>         struct hlist_node         icsk_listen_portaddr_node;

>>>>         unsigned int              (*icsk_sync_mss)(struct sock *sk, u32 pmtu);

>>>>         __u8                      icsk_ca_state:5,

>>>> diff --git a/include/net/tcp_ddp.h b/include/net/tcp_ddp.h

>>>> new file mode 100644

>>>> index 000000000000..31e5b1a16d0f

>>>> --- /dev/null

>>>> +++ b/include/net/tcp_ddp.h

>>>> @@ -0,0 +1,136 @@

>>>> +/* SPDX-License-Identifier: GPL-2.0

>>>> + *

>>>> + * tcp_ddp.h

>>>> + *     Author: Boris Pismenny <borisp@mellanox.com>

>>>> + *     Copyright (C) 2021 Mellanox Technologies.

>>>> + */

>>>> +#ifndef _TCP_DDP_H

>>>> +#define _TCP_DDP_H

>>>> +

>>>> +#include <linux/netdevice.h>

>>>> +#include <net/inet_connection_sock.h>

>>>> +#include <net/sock.h>

>>>> +

>>>> +/* limits returned by the offload driver, zero means don't care */

>>>> +struct tcp_ddp_limits {

>>>> +       int      max_ddp_sgl_len;

>>>> +};

>>>> +
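
As an aside, a minimal sketch of how a ULP might use these limits to clamp
its per-command SG length, via the tcp_ddp_limits device op defined further
down in this header. The helper name and the clamping policy are
illustrative only, and CONFIG_TCP_DDP=y is assumed:

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <net/tcp_ddp.h>

/* Illustrative: fold the driver's DDP limits into the ULP's own maximum. */
static int ulp_ddp_clamp_sgl_len(struct net_device *netdev, int ulp_max)
{
        struct tcp_ddp_limits limits = {};

        /* tcp_ddp_ops only exists when CONFIG_TCP_DDP is enabled. */
        if (!netdev->tcp_ddp_ops ||
            netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, &limits))
                return ulp_max;                 /* no offload, keep ULP default */

        if (limits.max_ddp_sgl_len)             /* zero means "don't care" */
                ulp_max = min(ulp_max, limits.max_ddp_sgl_len);

        return ulp_max;
}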

>>>> +enum tcp_ddp_type {

>>>> +       TCP_DDP_NVME = 1,

>>>> +};

>>>> +

>>>> +/**

>>>> + * struct tcp_ddp_config - Generic tcp ddp configuration: tcp ddp IO queue

>>>> + * config implementations must use this as the first member.

>>>> + * Add new instances of tcp_ddp_config below (nvme-tcp, etc.).

>>>> + */

>>>> +struct tcp_ddp_config {

>>>> +       enum tcp_ddp_type    type;

>>>> +       unsigned char        buf[];

>>>> +};

>>>> +

>>>> +/**

>>>> + * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue

>>>> + *

>>>> + * @pfv:        pdu version (e.g., NVME_TCP_PFV_1_0)

>>>> + * @cpda:       controller pdu data alignment (dwords, 0's based)

>>>> + * @dgst:       digest types enabled.

>>>> + *              The netdev will offload crc if ddp_crc is supported.

>>>> + * @queue_size: number of nvme-tcp IO queue elements

>>>> + * @queue_id:   queue identifier

>>>> + * @io_cpu:     cpu core running the IO thread for this queue

>>>> + */

>>>> +struct nvme_tcp_ddp_config {

>>>> +       struct tcp_ddp_config   cfg;

>>>> +

>>>> +       u16                     pfv;

>>>> +       u8                      cpda;

>>>> +       u8                      dgst;

>>>> +       int                     queue_size;

>>>> +       int                     queue_id;

>>>> +       int                     io_cpu;

>>>> +};

>>>> +
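
For illustration only, a sketch of how an nvme-tcp queue could fill this
config before registering its socket for offload; the helper and the digest
policy below are assumptions, not code taken from this series:

#include <linux/nvme-tcp.h>
#include <net/tcp_ddp.h>

/* Illustrative helper: describe one nvme-tcp IO queue to the offload. */
static void nvme_tcp_fill_ddp_config(struct nvme_tcp_ddp_config *config,
                                     int queue_id, int queue_size,
                                     int io_cpu, bool data_digest)
{
        config->cfg.type   = TCP_DDP_NVME;
        config->pfv        = NVME_TCP_PFV_1_0;
        config->cpda       = 0;         /* no extra controller PDU alignment */
        config->dgst       = data_digest ? NVME_TCP_DATA_DIGEST_ENABLE : 0;
        config->queue_size = queue_size;
        config->queue_id   = queue_id;
        config->io_cpu     = io_cpu;
}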

>>>> +/**

>>>> + * struct tcp_ddp_io - tcp ddp configuration for an IO request.

>>>> + *

>>>> + * @command_id:  identifier on the wire associated with these buffers

>>>> + * @nents:       number of entries in the sg_table

>>>> + * @sg_table:    describing the buffers for this IO request

>>>> + * @first_sgl:   first SGL in sg_table

>>>> + */

>>>> +struct tcp_ddp_io {

>>>> +       u32                     command_id;

>>>> +       int                     nents;

>>>> +       struct sg_table         sg_table;

>>>> +       struct scatterlist      first_sgl[SG_CHUNK_SIZE];

>>>> +};

>>>> +
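
A minimal sketch of filling a tcp_ddp_io for a single-page read buffer; a
real user would map the full block-layer SG list for the request, so treat
the helper below as illustrative only:

#include <linux/scatterlist.h>
#include <net/tcp_ddp.h>

/* Illustrative: associate one destination page with a wire command_id. */
static void ulp_ddp_prep_io(struct tcp_ddp_io *ddp, u32 command_id,
                            struct page *page, unsigned int len)
{
        ddp->command_id = command_id;   /* matches the CID sent on the wire */

        sg_init_table(ddp->first_sgl, 1);
        sg_set_page(&ddp->first_sgl[0], page, len, 0);

        ddp->sg_table.sgl        = ddp->first_sgl;
        ddp->sg_table.nents      = 1;
        ddp->sg_table.orig_nents = 1;
        ddp->nents               = 1;
}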

>>>> +/* struct tcp_ddp_dev_ops - operations used by an upper layer protocol to configure ddp offload

>>>> + *

>>>> + * @tcp_ddp_limits:    limit the number of scatter gather entries per IO.

>>>> + *                     the device driver can use this to limit the resources allocated per queue.

>>>> + * @tcp_ddp_sk_add:    add offload for the queue represented by the socket+config pair.

>>>> + *                     this function is used to configure either copy, crc or both offloads.

>>>> + * @tcp_ddp_sk_del:    remove offload from the socket, and release any device related resources.

>>>> + * @tcp_ddp_setup:     request copy offload for buffers associated with a command_id in tcp_ddp_io.

>>>> + * @tcp_ddp_teardown:  release offload resources association between buffers and command_id in

>>>> + *                     tcp_ddp_io.

>>>> + * @tcp_ddp_resync:    respond to the driver's resync_request. Called only if resync is successful.

>>>> + */

>>>> +struct tcp_ddp_dev_ops {

>>>> +       int (*tcp_ddp_limits)(struct net_device *netdev,

>>>> +                             struct tcp_ddp_limits *limits);

>>>> +       int (*tcp_ddp_sk_add)(struct net_device *netdev,

>>>> +                             struct sock *sk,

>>>> +                             struct tcp_ddp_config *config);

>>>> +       void (*tcp_ddp_sk_del)(struct net_device *netdev,

>>>> +                              struct sock *sk);

>>>> +       int (*tcp_ddp_setup)(struct net_device *netdev,

>>>> +                            struct sock *sk,

>>>> +                            struct tcp_ddp_io *io);

>>>> +       int (*tcp_ddp_teardown)(struct net_device *netdev,

>>>> +                               struct sock *sk,

>>>> +                               struct tcp_ddp_io *io,

>>>> +                               void *ddp_ctx);

>>>> +       void (*tcp_ddp_resync)(struct net_device *netdev,

>>>> +                              struct sock *sk, u32 seq);

>>>> +};

>>>> +
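
To make the calling convention concrete, a hedged sketch of how a ULP could
drive these ops. The feature checks are simplified, the function names other
than the ops themselves are illustrative, and CONFIG_TCP_DDP=y is assumed:

#include <linux/errno.h>
#include <linux/netdevice.h>
#include <net/tcp_ddp.h>

/* Enable per-socket offload once the ULP handshake is done. */
static int ulp_ddp_offload_socket(struct net_device *netdev, struct sock *sk,
                                  struct tcp_ddp_config *config)
{
        if (!(netdev->features & NETIF_F_HW_TCP_DDP) || !netdev->tcp_ddp_ops)
                return -EOPNOTSUPP;

        return netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev, sk, config);
}

/* Request placement of one IO's buffers before sending the read command. */
static int ulp_ddp_setup_io(struct net_device *netdev, struct sock *sk,
                            struct tcp_ddp_io *io)
{
        return netdev->tcp_ddp_ops->tcp_ddp_setup(netdev, sk, io);
}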

>>>> +#define TCP_DDP_RESYNC_REQ BIT(0)

>>>> +

>>>> +/**

>>>> + * struct tcp_ddp_ulp_ops - Interface to register upper layer Direct Data Placement (DDP) TCP offload

>>>> + */

>>>> +struct tcp_ddp_ulp_ops {

>>>> +       /* NIC requests ulp to indicate if @seq is the start of a message */

>>>> +       bool (*resync_request)(struct sock *sk, u32 seq, u32 flags);

>>>> +       /* NIC driver informs the ulp that ddp teardown is done - used for async completions */

>>>> +       void (*ddp_teardown_done)(void *ddp_ctx);

>>>> +};

>>>> +
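
And a hedged sketch of the ULP side of this interface; the resync decision
is stubbed out here since it is protocol specific:

#include <net/tcp_ddp.h>

/* NIC asks whether @seq starts a PDU; a real ULP would consult its own
 * PDU tracking state here instead of blindly accepting. */
static bool ulp_ddp_resync_request(struct sock *sk, u32 seq, u32 flags)
{
        if (!(flags & TCP_DDP_RESYNC_REQ))
                return false;

        return true;    /* stub: accept the NIC's proposed PDU start */
}

/* Async teardown completion: @ddp_ctx is whatever the ULP passed to
 * tcp_ddp_teardown(); a real ULP would complete the request here. */
static void ulp_ddp_teardown_done(void *ddp_ctx)
{
}

static const struct tcp_ddp_ulp_ops ulp_ddp_ulp_ops = {
        .resync_request    = ulp_ddp_resync_request,
        .ddp_teardown_done = ulp_ddp_teardown_done,
};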

>>>> +/**

>>>> + * struct tcp_ddp_ctx - Generic tcp ddp context: device driver per queue contexts must

>>>> + * use this as the first member.

>>>> + */

>>>> +struct tcp_ddp_ctx {

>>>> +       enum tcp_ddp_type    type;

>>>> +       unsigned char        buf[];

>>>> +};

>>>> +

>>>> +static inline struct tcp_ddp_ctx *tcp_ddp_get_ctx(const struct sock *sk)

>>>> +{

>>>> +       struct inet_connection_sock *icsk = inet_csk(sk);

>>>> +

>>>> +       return (__force struct tcp_ddp_ctx *)icsk->icsk_ulp_ddp_data;

>>>> +}

>>>> +

>>>> +static inline void tcp_ddp_set_ctx(struct sock *sk, void *ctx)

>>>> +{

>>>> +       struct inet_connection_sock *icsk = inet_csk(sk);

>>>> +

>>>> +       rcu_assign_pointer(icsk->icsk_ulp_ddp_data, ctx);

>>>> +}

>>>> +
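
For completeness, a sketch of how a device driver might embed this context
and recover its own per-queue state from the socket; the struct and function
names below are made up for the example:

#include <linux/kernel.h>
#include <net/tcp_ddp.h>

struct mydrv_ddp_queue {
        struct tcp_ddp_ctx ddp_ctx;     /* must stay the first member */
        u32                queue_id;
};

static struct mydrv_ddp_queue *mydrv_ddp_queue_from_sk(const struct sock *sk)
{
        struct tcp_ddp_ctx *ctx = tcp_ddp_get_ctx(sk);

        if (!ctx || ctx->type != TCP_DDP_NVME)
                return NULL;

        return container_of(ctx, struct mydrv_ddp_queue, ddp_ctx);
}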

>>>> +#endif //_TCP_DDP_H

>>>> diff --git a/net/Kconfig b/net/Kconfig

>>>> index f4c32d982af6..3876861cdc90 100644

>>>> --- a/net/Kconfig

>>>> +++ b/net/Kconfig

>>>> @@ -457,6 +457,15 @@ config ETHTOOL_NETLINK

>>>>           netlink. It provides better extensibility and some new features,

>>>>           e.g. notification messages.

>>>>

>>>> +config TCP_DDP

>>>> +       bool "TCP direct data placement offload"

>>>> +       default n

>>>> +       help

>>>> +         Direct Data Placement (DDP) offload for TCP enables ULPs, such as

>>>> +         NVMe-TCP/iSCSI, to request the NIC to place TCP payload data

>>>> +         of a command response directly into kernel pages.

>>>> +

>>>> +

>>>>  endif   # if NET

>>>>

>>>>  # Used by archs to tell that they support BPF JIT compiler plus which flavour.

>>>> diff --git a/net/core/skbuff.c b/net/core/skbuff.c

>>>> index f62cae3f75d8..791c1b6bc067 100644

>>>> --- a/net/core/skbuff.c

>>>> +++ b/net/core/skbuff.c

>>>> @@ -69,6 +69,7 @@

>>>>  #include <net/xfrm.h>

>>>>  #include <net/mpls.h>

>>>>  #include <net/mptcp.h>

>>>> +#include <net/tcp_ddp.h>

>>>>

>>>>  #include <linux/uaccess.h>

>>>>  #include <trace/events/skb.h>

>>>> @@ -6140,9 +6141,15 @@ EXPORT_SYMBOL(pskb_extract);

>>>>   */

>>>>  void skb_condense(struct sk_buff *skb)

>>>>  {

>>>> +       bool is_ddp = false;

>>>> +

>>>> +#ifdef CONFIG_TCP_DDP

>>>

>>> This looks strange to me : TCP should call this helper while skb->sk is NULL

>>>

>>> Are you sure this is not dead code ?

>>>

>>

>> Will verify again on Sunday. AFAICT, early demux sets skb->sk before this code

>> is called.

> 

> 

> First, early demux is optional.

> 

> Secondly, skb->sk is stolen in skb_steal_sock() if early demux was performed.

> 

> 

> Just to clarify, the purpose of this code is to avoid skb condensing

>> data that is already placed into destination buffers.

> 

> Then this has not been tested. This suggests this code could be

> removed, I doubt that your target traffic would ever be 'condensed'.

> 


Thanks for the feedback!

Some background on this:
We originally used the skb->ddp_crc bit for this check, but later we
tried to avoid it to reduce our dependence on that skb bit.
Unfortunately, that later change wasn't tested thoroughly.

We'll post v3 patches that use the skb->ddp_crc bit for this check.
After re-testing with the v3 patch, I'm confident it works.
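
For reference, the v3 check could end up looking roughly like the diff
below, assuming an skb->ddp_crc bit set by the driver on offloaded skbs;
the exact field name and placement are up to the v3 patches, so treat this
as a sketch rather than the final change:

 	if (skb->data_len) {
 		if (skb->data_len > skb->end - skb->tail ||
-		    skb_cloned(skb))
+		    skb_cloned(skb) || skb->ddp_crc)
 			return;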

>>

>>>> +       is_ddp = skb->sk && inet_csk(skb->sk) &&

>>>> +                inet_csk(skb->sk)->icsk_ulp_ddp_data;

>>>> +#endif

>>>>         if (skb->data_len) {

>>>>                 if (skb->data_len > skb->end - skb->tail ||

>>>> -                   skb_cloned(skb))

>>>> +                   skb_cloned(skb) || is_ddp)

>>>>                         return;

>>>>

>>>>                 /* Nice, we can free page frag(s) right now */

>>>> diff --git a/net/ethtool/common.c b/net/ethtool/common.c

>>>> index 24036e3055a1..a2ff7a4a6bbf 100644

>>>> --- a/net/ethtool/common.c

>>>> +++ b/net/ethtool/common.c

>>>> @@ -68,6 +68,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = {

>>>>         [NETIF_F_HW_TLS_RX_BIT] =        "tls-hw-rx-offload",

>>>>         [NETIF_F_GRO_FRAGLIST_BIT] =     "rx-gro-list",

>>>>         [NETIF_F_HW_MACSEC_BIT] =        "macsec-hw-offload",

>>>> +       [NETIF_F_HW_TCP_DDP_BIT] =       "tcp-ddp-offload",

>>>>  };

>>>>

>>>>  const char

>>>> --

>>>> 2.24.1

>>>>