From patchwork Thu Jan 14 15:10:15 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363368 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A7FBCC432C3 for ; Thu, 14 Jan 2021 15:12:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 850DA23A69 for ; Thu, 14 Jan 2021 15:12:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729396AbhANPMT (ORCPT ); Thu, 14 Jan 2021 10:12:19 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:38723 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729259AbhANPLi (ORCPT ); Thu, 14 Jan 2021 10:11:38 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:44 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh03000835; Thu, 14 Jan 2021 17:10:44 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com, Ben Ben-Ishay , Or Gerlitz , Yoray Zack Subject: [PATCH v2 net-next 03/21] net: Introduce crc offload for tcp ddp ulp Date: Thu, 14 Jan 2021 17:10:15 +0200 Message-Id: <20210114151033.13020-4-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This commit introduces support for CRC offload to direct data placement ULP on the receive side. Both DDP and CRC share a common API to initialize the offload for a TCP socket. But otherwise, both can be executed independently. On the receive side, CRC offload requires a new SKB bit that indicates that no CRC error was encountered while processing this packet. If all packets of a ULP message have this bit set, then the CRC verification for the message can be skipped, as hardware already checked it. The following patches will set and use this bit to perform NVME-TCP CRC offload. A subsequent series, will add NVMe-TCP transmit side CRC support. Signed-off-by: Boris Pismenny Signed-off-by: Ben Ben-Ishay Signed-off-by: Or Gerlitz Signed-off-by: Yoray Zack Reviewed-by: Sagi Grimberg --- include/linux/netdev_features.h | 2 ++ include/linux/skbuff.h | 5 +++++ net/Kconfig | 8 ++++++++ net/ethtool/common.c | 1 + net/ipv4/tcp_input.c | 7 +++++++ net/ipv4/tcp_ipv4.c | 3 +++ net/ipv4/tcp_offload.c | 3 +++ 7 files changed, 29 insertions(+) diff --git a/include/linux/netdev_features.h b/include/linux/netdev_features.h index fb35dcac03d2..dc79709586cd 100644 --- a/include/linux/netdev_features.h +++ b/include/linux/netdev_features.h @@ -85,6 +85,7 @@ enum { NETIF_F_HW_MACSEC_BIT, /* Offload MACsec operations */ NETIF_F_HW_TCP_DDP_BIT, /* TCP direct data placement offload */ + NETIF_F_HW_TCP_DDP_CRC_RX_BIT, /* TCP DDP CRC RX offload */ /* * Add your fresh new feature above and remember to update @@ -159,6 +160,7 @@ enum { #define NETIF_F_GSO_FRAGLIST __NETIF_F(GSO_FRAGLIST) #define NETIF_F_HW_MACSEC __NETIF_F(HW_MACSEC) #define NETIF_F_HW_TCP_DDP __NETIF_F(HW_TCP_DDP) +#define NETIF_F_HW_TCP_DDP_CRC_RX __NETIF_F(HW_TCP_DDP_CRC_RX) /* Finds the next feature with the highest number of the range of start till 0. */ diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 333bcdc39635..a841bc974452 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -683,6 +683,7 @@ typedef unsigned char *sk_buff_data_t; * CHECKSUM_UNNECESSARY (max 3) * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB + * @ddp_crc: NIC is responsible for PDU's CRC computation and verification * @napi_id: id of the NAPI struct this skb came from * @sender_cpu: (aka @napi_id) source CPU in XPS * @secmark: security marking @@ -859,6 +860,10 @@ struct sk_buff { #ifdef CONFIG_TLS_DEVICE __u8 decrypted:1; #endif +#ifdef CONFIG_TCP_DDP_CRC + __u8 ddp_crc:1; +#endif + #ifdef CONFIG_NET_SCHED __u16 tc_index; /* traffic control index */ diff --git a/net/Kconfig b/net/Kconfig index 3876861cdc90..80ed9f038968 100644 --- a/net/Kconfig +++ b/net/Kconfig @@ -465,6 +465,14 @@ config TCP_DDP NVMe-TCP/iSCSI, to request the NIC to place TCP payload data of a command response directly into kernel pages. +config TCP_DDP_CRC + bool "TCP direct data placement CRC offload" + default n + help + Direct Data Placement (DDP) CRC32C offload for TCP enables ULP, such as + NVMe-TCP/iSCSI, to request the NIC to calculate/verify the data digest + of commands as they go through the NIC. Thus avoiding the costly + per-byte overhead. endif # if NET diff --git a/net/ethtool/common.c b/net/ethtool/common.c index a2ff7a4a6bbf..cc6858105449 100644 --- a/net/ethtool/common.c +++ b/net/ethtool/common.c @@ -69,6 +69,7 @@ const char netdev_features_strings[NETDEV_FEATURE_COUNT][ETH_GSTRING_LEN] = { [NETIF_F_GRO_FRAGLIST_BIT] = "rx-gro-list", [NETIF_F_HW_MACSEC_BIT] = "macsec-hw-offload", [NETIF_F_HW_TCP_DDP_BIT] = "tcp-ddp-offload", + [NETIF_F_HW_TCP_DDP_CRC_RX_BIT] = "tcp-ddp-crc-rx-offload", }; const char diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index c7e16b0ed791..090c695e18b8 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5147,6 +5147,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, memcpy(nskb->cb, skb->cb, sizeof(skb->cb)); #ifdef CONFIG_TLS_DEVICE nskb->decrypted = skb->decrypted; +#endif +#ifdef CONFIG_TCP_DDP_CRC + nskb->ddp_crc = skb->ddp_crc; #endif TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start; if (list) @@ -5180,6 +5183,10 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, #ifdef CONFIG_TLS_DEVICE if (skb->decrypted != nskb->decrypted) goto end; +#endif +#ifdef CONFIG_TCP_DDP_CRC + if (skb->ddp_crc != nskb->ddp_crc) + goto end; #endif } } diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 58207c7769d0..64ccaf08fb3a 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1815,6 +1815,9 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb) TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) || #ifdef CONFIG_TLS_DEVICE tail->decrypted != skb->decrypted || +#endif +#ifdef CONFIG_TCP_DDP_CRC + tail->ddp_crc != skb->ddp_crc || #endif thtail->doff != th->doff || memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th))) diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c index e09147ac9a99..39f5f0bcf181 100644 --- a/net/ipv4/tcp_offload.c +++ b/net/ipv4/tcp_offload.c @@ -262,6 +262,9 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb) #ifdef CONFIG_TLS_DEVICE flush |= p->decrypted ^ skb->decrypted; #endif +#ifdef CONFIG_TCP_DDP_CRC + flush |= p->ddp_crc ^ skb->ddp_crc; +#endif if (flush || skb_gro_receive(p, skb)) { mss = 1; From patchwork Thu Jan 14 15:10:16 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363367 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 93619C43333 for ; Thu, 14 Jan 2021 15:12:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6E33923A5F for ; Thu, 14 Jan 2021 15:12:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729283AbhANPLi (ORCPT ); Thu, 14 Jan 2021 10:11:38 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:38722 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729245AbhANPLh (ORCPT ); Thu, 14 Jan 2021 10:11:37 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:44 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh04000835; Thu, 14 Jan 2021 17:10:44 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com, Or Gerlitz , Yoray Zack Subject: [PATCH v2 net-next 04/21] net: SKB copy(+hash) iterators for DDP offloads Date: Thu, 14 Jan 2021 17:10:16 +0200 Message-Id: <20210114151033.13020-5-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Ben Ben-ishay This commit introduces new functions to support direct data placement operation, when using direct data placement the copy of the data from the SKB to the destination buffer might be unnecessary and thus the copy should be skipped, those functions take care of it in cases that the destination buffer is represented by bio_vec. Signed-off-by: Boris Pismenny Signed-off-by: Ben Ben-Ishay Signed-off-by: Or Gerlitz Signed-off-by: Yoray Zack --- include/linux/skbuff.h | 5 +++++ net/core/datagram.c | 44 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 49 insertions(+) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index a841bc974452..96847aabba5b 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3536,6 +3536,8 @@ __poll_t datagram_poll(struct file *file, struct socket *sock, struct poll_table_struct *wait); int skb_copy_datagram_iter(const struct sk_buff *from, int offset, struct iov_iter *to, int size); +int skb_ddp_copy_datagram_iter(const struct sk_buff *from, int offset, + struct iov_iter *to, int size); static inline int skb_copy_datagram_msg(const struct sk_buff *from, int offset, struct msghdr *msg, int size) { @@ -3546,6 +3548,9 @@ int skb_copy_and_csum_datagram_msg(struct sk_buff *skb, int hlen, int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset, struct iov_iter *to, int len, struct ahash_request *hash); +int skb_ddp_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset, + struct iov_iter *to, int len, + struct ahash_request *hash); int skb_copy_datagram_from_iter(struct sk_buff *skb, int offset, struct iov_iter *from, int len); int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm); diff --git a/net/core/datagram.c b/net/core/datagram.c index 81809fa735a7..bbc476cadc71 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -495,6 +495,25 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset, return 0; } +/** + * skb_ddp_copy_and_hash_datagram_iter - Copies datagrams from skb frags to + * an iterator and update a hash. If the iterator and skb frag point to the + * same page and offset, then the copy is skipped. + * @skb: buffer to copy + * @offset: offset in the buffer to start copying from + * @to: iovec iterator to copy to + * @len: amount of data to copy from buffer to iovec + * @hash: hash request to update + */ +int skb_ddp_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset, + struct iov_iter *to, int len, + struct ahash_request *hash) +{ + return __skb_datagram_iter(skb, offset, to, len, true, + ddp_hash_and_copy_to_iter, hash); +} +EXPORT_SYMBOL(skb_ddp_copy_and_hash_datagram_iter); + /** * skb_copy_and_hash_datagram_iter - Copy datagram to an iovec iterator * and update a hash. @@ -513,6 +532,31 @@ int skb_copy_and_hash_datagram_iter(const struct sk_buff *skb, int offset, } EXPORT_SYMBOL(skb_copy_and_hash_datagram_iter); +static size_t simple_ddp_copy_to_iter(const void *addr, size_t bytes, + void *data __always_unused, + struct iov_iter *i) +{ + return ddp_copy_to_iter(addr, bytes, i); +} + +/** + * skb_ddp_copy_datagram_iter - Copies datagrams from skb frags to an + * iterator. If the iterator and skb frag point to the same page and + * offset, then the copy is skipped. + * @skb: buffer to copy + * @offset: offset in the buffer to start copying from + * @to: iovec iterator to copy to + * @len: amount of data to copy from buffer to iovec + */ +int skb_ddp_copy_datagram_iter(const struct sk_buff *skb, int offset, + struct iov_iter *to, int len) +{ + trace_skb_copy_datagram_iovec(skb, len); + return __skb_datagram_iter(skb, offset, to, len, false, + simple_ddp_copy_to_iter, NULL); +} +EXPORT_SYMBOL(skb_ddp_copy_datagram_iter); + static size_t simple_copy_to_iter(const void *addr, size_t bytes, void *data __always_unused, struct iov_iter *i) { From patchwork Thu Jan 14 15:10:17 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363372 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D589DC43381 for ; Thu, 14 Jan 2021 15:12:22 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A6DC823A5F for ; Thu, 14 Jan 2021 15:12:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729292AbhANPLj (ORCPT ); Thu, 14 Jan 2021 10:11:39 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:38720 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729244AbhANPLh (ORCPT ); Thu, 14 Jan 2021 10:11:37 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:44 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh05000835; Thu, 14 Jan 2021 17:10:44 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com Subject: [PATCH v2 net-next 05/21] net/tls: expose get_netdev_for_sock Date: Thu, 14 Jan 2021 17:10:17 +0200 Message-Id: <20210114151033.13020-6-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org get_netdev_for_sock is a utility that is used to obtain the net_device structure from a connected socket. Later patches will use this for nvme-tcp DDP and DDP CRC offloads. Signed-off-by: Boris Pismenny Reviewed-by: Sagi Grimberg --- include/net/sock.h | 17 +++++++++++++++++ net/tls/tls_device.c | 20 ++------------------ 2 files changed, 19 insertions(+), 18 deletions(-) diff --git a/include/net/sock.h b/include/net/sock.h index bdc4323ce53c..9c21a4d696ad 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -2721,4 +2721,21 @@ void sock_set_sndtimeo(struct sock *sk, s64 secs); int sock_bind_add(struct sock *sk, struct sockaddr *addr, int addr_len); +/* Assume that the socket is already connected */ +static inline struct net_device *get_netdev_for_sock(struct sock *sk, bool hold) +{ + struct dst_entry *dst = sk_dst_get(sk); + struct net_device *netdev = NULL; + + if (likely(dst)) { + netdev = dst->dev; + if (hold) + dev_hold(netdev); + } + + dst_release(dst); + + return netdev; +} + #endif /* _SOCK_H */ diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c index f7fb7d2c1de1..cbd24d897023 100644 --- a/net/tls/tls_device.c +++ b/net/tls/tls_device.c @@ -106,22 +106,6 @@ static void tls_device_queue_ctx_destruction(struct tls_context *ctx) spin_unlock_irqrestore(&tls_device_lock, flags); } -/* We assume that the socket is already connected */ -static struct net_device *get_netdev_for_sock(struct sock *sk) -{ - struct dst_entry *dst = sk_dst_get(sk); - struct net_device *netdev = NULL; - - if (likely(dst)) { - netdev = dst->dev; - dev_hold(netdev); - } - - dst_release(dst); - - return netdev; -} - static void destroy_record(struct tls_record_info *record) { int i; @@ -1106,7 +1090,7 @@ int tls_set_device_offload(struct sock *sk, struct tls_context *ctx) if (skb) TCP_SKB_CB(skb)->eor = 1; - netdev = get_netdev_for_sock(sk); + netdev = get_netdev_for_sock(sk, true); if (!netdev) { pr_err_ratelimited("%s: netdev not found\n", __func__); rc = -EINVAL; @@ -1182,7 +1166,7 @@ int tls_set_device_offload_rx(struct sock *sk, struct tls_context *ctx) if (ctx->crypto_recv.info.version != TLS_1_2_VERSION) return -EOPNOTSUPP; - netdev = get_netdev_for_sock(sk); + netdev = get_netdev_for_sock(sk, true); if (!netdev) { pr_err_ratelimited("%s: netdev not found\n", __func__); return -EINVAL; From patchwork Thu Jan 14 15:10:18 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363370 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 165F1C433E0 for ; Thu, 14 Jan 2021 15:12:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id F335123A80 for ; Thu, 14 Jan 2021 15:12:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729370AbhANPMH (ORCPT ); Thu, 14 Jan 2021 10:12:07 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:38721 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729247AbhANPLj (ORCPT ); Thu, 14 Jan 2021 10:11:39 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:44 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh06000835; Thu, 14 Jan 2021 17:10:44 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com, Ben Ben-Ishay , Or Gerlitz , Yoray Zack Subject: [PATCH v2 net-next 06/21] nvme-tcp: Add DDP offload control path Date: Thu, 14 Jan 2021 17:10:18 +0200 Message-Id: <20210114151033.13020-7-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This commit introduces direct data placement offload to NVME TCP. There is a context per queue, which is established after the handshake using the tcp_ddp_sk_add/del NDOs. Additionally, a resynchronization routine is used to assist hardware recovery from TCP OOO, and continue the offload. Resynchronization operates as follows: 1. TCP OOO causes the NIC HW to stop the offload 2. NIC HW identifies a PDU header at some TCP sequence number, and asks NVMe-TCP to confirm it. This request is delivered from the NIC driver to NVMe-TCP by first finding the socket for the packet that triggered the request, and then finding the nvme_tcp_queue that is used by this routine. Finally, the request is recorded in the nvme_tcp_queue. 3. When NVMe-TCP observes the requested TCP sequence, it will compare it with the PDU header TCP sequence, and report the result to the NIC driver (tcp_ddp_resync), which will update the HW, and resume offload when all is successful. Furthermore, we let the offloading driver advertise what is the max hw sectors/segments via tcp_ddp_limits. A follow-up patch introduces the data-path changes required for this offload. Signed-off-by: Boris Pismenny Signed-off-by: Ben Ben-Ishay Signed-off-by: Or Gerlitz Signed-off-by: Yoray Zack --- drivers/nvme/host/tcp.c | 199 +++++++++++++++++++++++++++++++++++++++- 1 file changed, 197 insertions(+), 2 deletions(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 1ba659927442..31bf9e3ea236 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -14,6 +14,7 @@ #include #include #include +#include #include "nvme.h" #include "fabrics.h" @@ -62,6 +63,7 @@ enum nvme_tcp_queue_flags { NVME_TCP_Q_ALLOCATED = 0, NVME_TCP_Q_LIVE = 1, NVME_TCP_Q_POLLING = 2, + NVME_TCP_Q_OFF_DDP = 3, }; enum nvme_tcp_recv_state { @@ -110,6 +112,8 @@ struct nvme_tcp_queue { void (*state_change)(struct sock *); void (*data_ready)(struct sock *); void (*write_space)(struct sock *); + + atomic64_t resync_req; }; struct nvme_tcp_ctrl { @@ -128,6 +132,8 @@ struct nvme_tcp_ctrl { struct delayed_work connect_work; struct nvme_tcp_request async_req; u32 io_queues[HCTX_MAX_TYPES]; + + struct net_device *offloading_netdev; }; static LIST_HEAD(nvme_tcp_ctrl_list); @@ -222,6 +228,182 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req, return nvme_tcp_pdu_data_left(req) <= len; } +#ifdef CONFIG_TCP_DDP + +static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags); +static const struct tcp_ddp_ulp_ops nvme_tcp_ddp_ulp_ops = { + .resync_request = nvme_tcp_resync_request, +}; + +static +int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue) +{ + struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true); + struct nvme_tcp_ddp_config config = {}; + int ret; + + if (!netdev) { + dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n"); + return -ENODEV; + } + + if (!(netdev->features & NETIF_F_HW_TCP_DDP)) { + dev_put(netdev); + return -EOPNOTSUPP; + } + + config.cfg.type = TCP_DDP_NVME; + config.pfv = NVME_TCP_PFV_1_0; + config.cpda = 0; + config.dgst = queue->hdr_digest ? + NVME_TCP_HDR_DIGEST_ENABLE : 0; + config.dgst |= queue->data_digest ? + NVME_TCP_DATA_DIGEST_ENABLE : 0; + config.queue_size = queue->queue_size; + config.queue_id = nvme_tcp_queue_id(queue); + config.io_cpu = queue->io_cpu; + + ret = netdev->tcp_ddp_ops->tcp_ddp_sk_add(netdev, + queue->sock->sk, + (struct tcp_ddp_config *)&config); + if (ret) { + dev_put(netdev); + return ret; + } + + inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = &nvme_tcp_ddp_ulp_ops; + if (netdev->features & NETIF_F_HW_TCP_DDP) + set_bit(NVME_TCP_Q_OFF_DDP, &queue->flags); + + return ret; +} + +static +void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue) +{ + struct net_device *netdev = queue->ctrl->offloading_netdev; + + if (!netdev) { + dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n"); + return; + } + + netdev->tcp_ddp_ops->tcp_ddp_sk_del(netdev, queue->sock->sk); + + inet_csk(queue->sock->sk)->icsk_ulp_ddp_ops = NULL; + dev_put(netdev); /* put the queue_init get_netdev_for_sock() */ +} + +static +int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue) +{ + struct net_device *netdev = get_netdev_for_sock(queue->sock->sk, true); + struct tcp_ddp_limits limits; + int ret = 0; + + if (!netdev) { + dev_info_ratelimited(queue->ctrl->ctrl.device, "netdev not found\n"); + return -ENODEV; + } + + if (netdev->features & NETIF_F_HW_TCP_DDP && + netdev->tcp_ddp_ops && + netdev->tcp_ddp_ops->tcp_ddp_limits) + ret = netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, &limits); + else + ret = -EOPNOTSUPP; + + if (!ret) { + queue->ctrl->offloading_netdev = netdev; + dev_dbg_ratelimited(queue->ctrl->ctrl.device, + "netdev %s offload limits: max_ddp_sgl_len %d\n", + netdev->name, limits.max_ddp_sgl_len); + queue->ctrl->ctrl.max_segments = limits.max_ddp_sgl_len; + queue->ctrl->ctrl.max_hw_sectors = + limits.max_ddp_sgl_len << (ilog2(SZ_4K) - 9); + } else { + queue->ctrl->offloading_netdev = NULL; + } + + dev_put(netdev); + + return ret; +} + +static +void nvme_tcp_resync_response(struct nvme_tcp_queue *queue, + unsigned int pdu_seq) +{ + struct net_device *netdev = queue->ctrl->offloading_netdev; + u64 resync_val; + u32 resync_seq; + + resync_val = atomic64_read(&queue->resync_req); + /* Lower 32 bit flags. Check validity of the request */ + if ((resync_val & TCP_DDP_RESYNC_REQ) == 0) + return; + + /* Obtain and check requested sequence number: is this PDU header before the request? */ + resync_seq = resync_val >> 32; + if (before(pdu_seq, resync_seq)) + return; + + if (unlikely(!netdev)) { + pr_info_ratelimited("%s: netdev not found\n", __func__); + return; + } + + /** + * The atomic operation gurarantees that we don't miss any NIC driver + * resync requests submitted after the above checks. + */ + if (atomic64_cmpxchg(&queue->resync_req, resync_val, + resync_val & ~TCP_DDP_RESYNC_REQ)) + netdev->tcp_ddp_ops->tcp_ddp_resync(netdev, queue->sock->sk, pdu_seq); +} + +static +bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags) +{ + struct nvme_tcp_queue *queue = sk->sk_user_data; + + atomic64_set(&queue->resync_req, + (((uint64_t)seq << 32) | flags)); + + return true; +} + +#else + +static +int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue) +{ + return -EINVAL; +} + +static +void nvme_tcp_unoffload_socket(struct nvme_tcp_queue *queue) +{} + +static +int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue) +{ + return -EINVAL; +} + +static +void nvme_tcp_resync_response(struct nvme_tcp_queue *queue, + unsigned int pdu_seq) +{} + +static +bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags) +{ + return false; +} + +#endif + static void nvme_tcp_init_iter(struct nvme_tcp_request *req, unsigned int dir) { @@ -627,6 +809,11 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb, size_t rcv_len = min_t(size_t, *len, queue->pdu_remaining); int ret; + u64 pdu_seq = TCP_SKB_CB(skb)->seq + *offset - queue->pdu_offset; + + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) + nvme_tcp_resync_response(queue, pdu_seq); + ret = skb_copy_bits(skb, *offset, &pdu[queue->pdu_offset], rcv_len); if (unlikely(ret)) @@ -1517,6 +1704,9 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue) kernel_sock_shutdown(queue->sock, SHUT_RDWR); nvme_tcp_restore_sock_calls(queue); cancel_work_sync(&queue->io_work); + + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) + nvme_tcp_unoffload_socket(queue); } static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid) @@ -1534,10 +1724,13 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx) struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl); int ret; - if (idx) + if (idx) { ret = nvmf_connect_io_queue(nctrl, idx, false); - else + nvme_tcp_offload_socket(&ctrl->queues[idx]); + } else { ret = nvmf_connect_admin_queue(nctrl); + nvme_tcp_offload_limits(&ctrl->queues[idx]); + } if (!ret) { set_bit(NVME_TCP_Q_LIVE, &ctrl->queues[idx].flags); @@ -1640,6 +1833,8 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl) { int ret; + to_tcp_ctrl(ctrl)->offloading_netdev = NULL; + ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH); if (ret) return ret; From patchwork Thu Jan 14 15:10:20 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363371 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 03382C4332D for ; Thu, 14 Jan 2021 15:12:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id DEAFD23A5F for ; Thu, 14 Jan 2021 15:12:22 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729364AbhANPMG (ORCPT ); Thu, 14 Jan 2021 10:12:06 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:38739 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729253AbhANPLj (ORCPT ); Thu, 14 Jan 2021 10:11:39 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:44 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh08000835; Thu, 14 Jan 2021 17:10:44 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com, Ben Ben-Ishay , Or Gerlitz , Yoray Zack Subject: [PATCH v2 net-next 08/21] nvme-tcp : Recalculate crc in the end of the capsule Date: Thu, 14 Jan 2021 17:10:20 +0200 Message-Id: <20210114151033.13020-9-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Ben Ben-ishay crc offload of the nvme capsule. Check if all the skb bits are on, and if not recalculate the crc in SW and check it. This patch reworks the receive-side crc calculation to always run at the end, so as to keep a single flow for both offload and non-offload. This change simplifies the code, but it may degrade performance for non-offload crc calculation. Signed-off-by: Boris Pismenny Signed-off-by: Ben Ben-Ishay Signed-off-by: Or Gerlitz Signed-off-by: Yoray Zack --- drivers/nvme/host/tcp.c | 116 ++++++++++++++++++++++++++++++++-------- 1 file changed, 94 insertions(+), 22 deletions(-) diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c index 9a43bd31ec15..09d2e2e529cd 100644 --- a/drivers/nvme/host/tcp.c +++ b/drivers/nvme/host/tcp.c @@ -69,6 +69,7 @@ enum nvme_tcp_queue_flags { NVME_TCP_Q_LIVE = 1, NVME_TCP_Q_POLLING = 2, NVME_TCP_Q_OFF_DDP = 3, + NVME_TCP_Q_OFF_DDGST_RX = 4, }; enum nvme_tcp_recv_state { @@ -95,6 +96,7 @@ struct nvme_tcp_queue { size_t data_remaining; size_t ddgst_remaining; unsigned int nr_cqe; + bool ddgst_valid; /* send state */ struct nvme_tcp_request *request; @@ -233,6 +235,55 @@ static inline size_t nvme_tcp_pdu_last_send(struct nvme_tcp_request *req, return nvme_tcp_pdu_data_left(req) <= len; } +static inline bool nvme_tcp_ddp_ddgst_ok(struct nvme_tcp_queue *queue) +{ + return queue->ddgst_valid; +} + +static inline void nvme_tcp_ddp_ddgst_update(struct nvme_tcp_queue *queue, + struct sk_buff *skb) +{ + if (queue->ddgst_valid) +#ifdef CONFIG_TCP_DDP_CRC + queue->ddgst_valid = skb->ddp_crc; +#else + queue->ddgst_valid = false; +#endif +} + +static int nvme_tcp_req_map_sg(struct nvme_tcp_request *req, struct request *rq) +{ + int ret; + + req->ddp.sg_table.sgl = req->ddp.first_sgl; + ret = sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq), + req->ddp.sg_table.sgl, SG_CHUNK_SIZE); + if (ret) + return -ENOMEM; + req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl); + return 0; +} + +static void nvme_tcp_ddp_ddgst_recalc(struct ahash_request *hash, + struct request *rq) +{ + struct nvme_tcp_request *req; + + if (!rq) + return; + + req = blk_mq_rq_to_pdu(rq); + + if (!req->offloaded && nvme_tcp_req_map_sg(req, rq)) + return; + + crypto_ahash_init(hash); + req->ddp.sg_table.sgl = req->ddp.first_sgl; + ahash_request_set_crypt(hash, req->ddp.sg_table.sgl, NULL, + le32_to_cpu(req->data_len)); + crypto_ahash_update(hash); +} + #ifdef CONFIG_TCP_DDP static bool nvme_tcp_resync_request(struct sock *sk, u32 seq, u32 flags); @@ -289,12 +340,9 @@ int nvme_tcp_setup_ddp(struct nvme_tcp_queue *queue, } req->ddp.command_id = command_id; - req->ddp.sg_table.sgl = req->ddp.first_sgl; - ret = sg_alloc_table_chained(&req->ddp.sg_table, blk_rq_nr_phys_segments(rq), - req->ddp.sg_table.sgl, SG_CHUNK_SIZE); + ret = nvme_tcp_req_map_sg(req, rq); if (ret) return -ENOMEM; - req->ddp.nents = blk_rq_map_sg(rq->q, rq, req->ddp.sg_table.sgl); ret = netdev->tcp_ddp_ops->tcp_ddp_setup(netdev, queue->sock->sk, @@ -316,7 +364,7 @@ int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue) return -ENODEV; } - if (!(netdev->features & NETIF_F_HW_TCP_DDP)) { + if (!(netdev->features & (NETIF_F_HW_TCP_DDP | NETIF_F_HW_TCP_DDP_CRC_RX))) { dev_put(netdev); return -EOPNOTSUPP; } @@ -344,6 +392,9 @@ int nvme_tcp_offload_socket(struct nvme_tcp_queue *queue) if (netdev->features & NETIF_F_HW_TCP_DDP) set_bit(NVME_TCP_Q_OFF_DDP, &queue->flags); + if (netdev->features & NETIF_F_HW_TCP_DDP_CRC_RX) + set_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags); + return ret; } @@ -375,7 +426,7 @@ int nvme_tcp_offload_limits(struct nvme_tcp_queue *queue) return -ENODEV; } - if (netdev->features & NETIF_F_HW_TCP_DDP && + if ((netdev->features & (NETIF_F_HW_TCP_DDP | NETIF_F_HW_TCP_DDP_CRC_RX)) && netdev->tcp_ddp_ops && netdev->tcp_ddp_ops->tcp_ddp_limits) ret = netdev->tcp_ddp_ops->tcp_ddp_limits(netdev, &limits); @@ -727,6 +778,7 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue) queue->pdu_offset = 0; queue->data_remaining = -1; queue->ddgst_remaining = 0; + queue->ddgst_valid = true; } static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl) @@ -907,7 +959,8 @@ static int nvme_tcp_recv_pdu(struct nvme_tcp_queue *queue, struct sk_buff *skb, u64 pdu_seq = TCP_SKB_CB(skb)->seq + *offset - queue->pdu_offset; - if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags) || + test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags)) nvme_tcp_resync_response(queue, pdu_seq); ret = skb_copy_bits(skb, *offset, @@ -976,6 +1029,8 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, struct nvme_tcp_request *req; struct request *rq; + if (queue->data_digest && test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags)) + nvme_tcp_ddp_ddgst_update(queue, skb); rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id); if (!rq) { dev_err(queue->ctrl->ctrl.device, @@ -1013,15 +1068,17 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, recv_len = min_t(size_t, recv_len, iov_iter_count(&req->iter)); - if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) { - if (queue->data_digest) + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) { + if (queue->data_digest && + !test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags)) ret = skb_ddp_copy_and_hash_datagram_iter(skb, *offset, &req->iter, recv_len, queue->rcv_hash); else ret = skb_ddp_copy_datagram_iter(skb, *offset, &req->iter, recv_len); } else { - if (queue->data_digest) + if (queue->data_digest && + !test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags)) ret = skb_copy_and_hash_datagram_iter(skb, *offset, &req->iter, recv_len, queue->rcv_hash); else @@ -1043,7 +1100,6 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb, if (!queue->data_remaining) { if (queue->data_digest) { - nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); queue->ddgst_remaining = NVME_TCP_DIGEST_LENGTH; } else { if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { @@ -1064,8 +1120,12 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue, char *ddgst = (char *)&queue->recv_ddgst; size_t recv_len = min_t(size_t, *len, queue->ddgst_remaining); off_t off = NVME_TCP_DIGEST_LENGTH - queue->ddgst_remaining; + bool offload_fail, offload_en; + struct request *rq = NULL; int ret; + if (test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags)) + nvme_tcp_ddp_ddgst_update(queue, skb); ret = skb_copy_bits(skb, *offset, &ddgst[off], recv_len); if (unlikely(ret)) return ret; @@ -1076,17 +1136,29 @@ static int nvme_tcp_recv_ddgst(struct nvme_tcp_queue *queue, if (queue->ddgst_remaining) return 0; - if (queue->recv_ddgst != queue->exp_ddgst) { - dev_err(queue->ctrl->ctrl.device, - "data digest error: recv %#x expected %#x\n", - le32_to_cpu(queue->recv_ddgst), - le32_to_cpu(queue->exp_ddgst)); - return -EIO; + offload_fail = !nvme_tcp_ddp_ddgst_ok(queue); + offload_en = test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags); + if (!offload_en || offload_fail) { + if (offload_en && offload_fail) { // software-fallback + rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), + pdu->command_id); + nvme_tcp_ddp_ddgst_recalc(queue->rcv_hash, rq); + } + + nvme_tcp_ddgst_final(queue->rcv_hash, &queue->exp_ddgst); + if (queue->recv_ddgst != queue->exp_ddgst) { + dev_err(queue->ctrl->ctrl.device, + "data digest error: recv %#x expected %#x\n", + le32_to_cpu(queue->recv_ddgst), + le32_to_cpu(queue->exp_ddgst)); + return -EIO; + } } if (pdu->hdr.flags & NVME_TCP_F_DATA_SUCCESS) { - struct request *rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), - pdu->command_id); + if (!rq) + rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), + pdu->command_id); nvme_tcp_end_request(rq, NVME_SC_SUCCESS); queue->nr_cqe++; @@ -1825,8 +1897,10 @@ static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue) nvme_tcp_restore_sock_calls(queue); cancel_work_sync(&queue->io_work); - if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags)) + if (test_bit(NVME_TCP_Q_OFF_DDP, &queue->flags) || + test_bit(NVME_TCP_Q_OFF_DDGST_RX, &queue->flags)) nvme_tcp_unoffload_socket(queue); + } static void nvme_tcp_stop_queue(struct nvme_ctrl *nctrl, int qid) @@ -1953,8 +2027,6 @@ static int nvme_tcp_alloc_admin_queue(struct nvme_ctrl *ctrl) { int ret; - to_tcp_ctrl(ctrl)->offloading_netdev = NULL; - ret = nvme_tcp_alloc_queue(ctrl, 0, NVME_AQ_DEPTH); if (ret) return ret; From patchwork Thu Jan 14 15:10:24 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363377 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 879AAC433E0 for ; Thu, 14 Jan 2021 15:11:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 590B523A5F for ; Thu, 14 Jan 2021 15:11:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729312AbhANPLk (ORCPT ); Thu, 14 Jan 2021 10:11:40 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:38820 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729261AbhANPLh (ORCPT ); Thu, 14 Jan 2021 10:11:37 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:45 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh0C000835; Thu, 14 Jan 2021 17:10:44 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com, Ben Ben-Ishay , Or Gerlitz , Yoray Zack Subject: [PATCH v2 net-next 12/21] net/mlx5e: TCP flow steering for nvme-tcp Date: Thu, 14 Jan 2021 17:10:24 +0200 Message-Id: <20210114151033.13020-13-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Both nvme-tcp and tls require tcp flow steering. Compile it for both of them. Additionally, use reference counting to allocate/free TCP flow steering. Signed-off-by: Boris Pismenny Signed-off-by: Ben Ben-Ishay Signed-off-by: Or Gerlitz Signed-off-by: Yoray Zack --- drivers/net/ethernet/mellanox/mlx5/core/en/fs.h | 4 ++-- .../net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c | 10 ++++++++++ .../net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h | 2 +- 3 files changed, 13 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h index a16297e7e2ac..a7fe3a6358ea 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/fs.h @@ -137,7 +137,7 @@ enum { MLX5E_L2_FT_LEVEL, MLX5E_TTC_FT_LEVEL, MLX5E_INNER_TTC_FT_LEVEL, -#ifdef CONFIG_MLX5_EN_TLS +#if defined(CONFIG_MLX5_EN_TLS) || defined(CONFIG_MLX5_EN_NVMEOTCP) MLX5E_ACCEL_FS_TCP_FT_LEVEL, #endif #ifdef CONFIG_MLX5_EN_ARFS @@ -256,7 +256,7 @@ struct mlx5e_flow_steering { #ifdef CONFIG_MLX5_EN_ARFS struct mlx5e_arfs_tables arfs; #endif -#ifdef CONFIG_MLX5_EN_TLS +#if defined(CONFIG_MLX5_EN_TLS) || defined(CONFIG_MLX5_EN_NVMEOTCP) struct mlx5e_accel_fs_tcp *accel_tcp; #endif }; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c index e51f60b55daa..21341a92f355 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.c @@ -14,6 +14,7 @@ enum accel_fs_tcp_type { struct mlx5e_accel_fs_tcp { struct mlx5e_flow_table tables[ACCEL_FS_TCP_NUM_TYPES]; struct mlx5_flow_handle *default_rules[ACCEL_FS_TCP_NUM_TYPES]; + refcount_t ref_count; }; static enum mlx5e_traffic_types fs_accel2tt(enum accel_fs_tcp_type i) @@ -337,6 +338,7 @@ static int accel_fs_tcp_enable(struct mlx5e_priv *priv) return err; } } + refcount_set(&priv->fs.accel_tcp->ref_count, 1); return 0; } @@ -360,6 +362,9 @@ void mlx5e_accel_fs_tcp_destroy(struct mlx5e_priv *priv) if (!priv->fs.accel_tcp) return; + if (!refcount_dec_and_test(&priv->fs.accel_tcp->ref_count)) + return; + accel_fs_tcp_disable(priv); for (i = 0; i < ACCEL_FS_TCP_NUM_TYPES; i++) @@ -376,6 +381,11 @@ int mlx5e_accel_fs_tcp_create(struct mlx5e_priv *priv) if (!MLX5_CAP_FLOWTABLE_NIC_RX(priv->mdev, ft_field_support.outer_ip_version)) return -EOPNOTSUPP; + if (priv->fs.accel_tcp) { + refcount_inc(&priv->fs.accel_tcp->ref_count); + return 0; + } + priv->fs.accel_tcp = kzalloc(sizeof(*priv->fs.accel_tcp), GFP_KERNEL); if (!priv->fs.accel_tcp) return -ENOMEM; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h index 589235824543..8aff9298183c 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/fs_tcp.h @@ -6,7 +6,7 @@ #include "en.h" -#ifdef CONFIG_MLX5_EN_TLS +#if defined(CONFIG_MLX5_EN_TLS) || defined(CONFIG_MLX5_EN_NVMEOTCP) int mlx5e_accel_fs_tcp_create(struct mlx5e_priv *priv); void mlx5e_accel_fs_tcp_destroy(struct mlx5e_priv *priv); struct mlx5_flow_handle *mlx5e_accel_fs_add_sk(struct mlx5e_priv *priv, From patchwork Thu Jan 14 15:10:26 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363369 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.9 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,UNPARSEABLE_RELAY, UNWANTED_LANGUAGE_BODY,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8CB64C43217 for ; Thu, 14 Jan 2021 15:12:25 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 59CFE23A5F for ; Thu, 14 Jan 2021 15:12:25 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729410AbhANPMY (ORCPT ); Thu, 14 Jan 2021 10:12:24 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:38889 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729263AbhANPLi (ORCPT ); Thu, 14 Jan 2021 10:11:38 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:45 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh0E000835; Thu, 14 Jan 2021 17:10:45 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com, Or Gerlitz , Yoray Zack Subject: [PATCH v2 net-next 14/21] net/mlx5e: KLM UMR helper macros Date: Thu, 14 Jan 2021 17:10:26 +0200 Message-Id: <20210114151033.13020-15-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Ben Ben-Ishay Add helper macros for posting KLM UMR WQE. Signed-off-by: Boris Pismenny Signed-off-by: Ben Ben-Ishay Signed-off-by: Or Gerlitz Signed-off-by: Yoray Zack --- drivers/net/ethernet/mellanox/mlx5/core/en.h | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h index ff100c290b59..26259d68b98a 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h @@ -152,6 +152,24 @@ struct page_pool; #define MLX5E_UMR_WQEBBS \ (DIV_ROUND_UP(MLX5E_UMR_WQE_INLINE_SZ, MLX5_SEND_WQE_BB)) +#define KLM_ALIGNMENT 4 +#define MLX5E_KLM_UMR_WQE_SZ(sgl_len)\ + (sizeof(struct mlx5e_umr_wqe) +\ + (sizeof(struct mlx5_klm) * (sgl_len))) + +#define MLX5E_KLM_UMR_WQEBBS(sgl_len)\ + (DIV_ROUND_UP(MLX5E_KLM_UMR_WQE_SZ(sgl_len), MLX5_SEND_WQE_BB)) + +#define MLX5E_KLM_UMR_DS_CNT(sgl_len)\ + DIV_ROUND_UP(MLX5E_KLM_UMR_WQE_SZ(sgl_len), MLX5_SEND_WQE_DS) + +#define MLX5E_MAX_KLM_ENTRIES_PER_WQE(wqe_size)\ + (((wqe_size) - sizeof(struct mlx5e_umr_wqe)) / sizeof(struct mlx5_klm)) + +#define MLX5E_KLM_ENTRIES_PER_WQE(wqe_size)\ + (MLX5E_MAX_KLM_ENTRIES_PER_WQE(wqe_size) -\ + (MLX5E_MAX_KLM_ENTRIES_PER_WQE(wqe_size) % KLM_ALIGNMENT)) + #define MLX5E_MSG_LEVEL NETIF_MSG_LINK #define mlx5e_dbg(mlevel, priv, format, ...) \ From patchwork Thu Jan 14 15:10:28 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363373 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 57940C43333 for ; Thu, 14 Jan 2021 15:12:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 235BB23A5F for ; Thu, 14 Jan 2021 15:12:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729362AbhANPL6 (ORCPT ); Thu, 14 Jan 2021 10:11:58 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:38937 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729264AbhANPL5 (ORCPT ); Thu, 14 Jan 2021 10:11:57 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:45 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh0G000835; Thu, 14 Jan 2021 17:10:45 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com, Or Gerlitz , Yoray Zack Subject: [PATCH v2 net-next 16/21] net/mlx5e: NVMEoTCP queue init/teardown Date: Thu, 14 Jan 2021 17:10:28 +0200 Message-Id: <20210114151033.13020-17-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Ben Ben-Ishay When nvme-tcp establishes new connections, we allocate a hardware context to offload operations for this queue: - Use a separate TIR to identify the queue and maintain the HW context - Use a separate ICOSQ for maintain the HW context - Use a separate tag buffer for buffer registration - Maintain static and progress HW contexts by posting the proper WQEs at creation time, or upon resync Queue teardown will free the corresponding contexts. Signed-off-by: Boris Pismenny Signed-off-by: Ben Ben-Ishay Signed-off-by: Or Gerlitz Signed-off-by: Yoray Zack --- .../net/ethernet/mellanox/mlx5/core/en/txrx.h | 6 + .../mellanox/mlx5/core/en_accel/nvmeotcp.c | 650 +++++++++++++++++- .../mellanox/mlx5/core/en_accel/nvmeotcp.h | 4 + .../mlx5/core/en_accel/nvmeotcp_utils.h | 68 ++ .../net/ethernet/mellanox/mlx5/core/en_rx.c | 7 + 5 files changed, 711 insertions(+), 24 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h index bfeea906612e..f00d76f1ded4 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h @@ -36,6 +36,7 @@ enum mlx5e_icosq_wqe_type { #endif #ifdef CONFIG_MLX5_EN_NVMEOTCP MLX5E_ICOSQ_WQE_UMR_NVME_TCP, + MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP, #endif }; @@ -178,6 +179,11 @@ struct mlx5e_icosq_wqe_info { struct { struct mlx5e_ktls_rx_resync_buf *buf; } tls_get_params; +#endif +#ifdef CONFIG_MLX5_EN_NVMEOTCP + struct { + struct mlx5e_nvmeotcp_queue *queue; + } nvmeotcp_q; #endif }; }; diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c index bbeded08ef01..3a42477f1bd7 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c @@ -3,6 +3,7 @@ #include #include +#include #include "en_accel/nvmeotcp.h" #include "en_accel/nvmeotcp_utils.h" #include "en_accel/fs_tcp.h" @@ -20,35 +21,180 @@ static const struct rhashtable_params rhash_queues = { .max_size = MAX_NVMEOTCP_QUEUES, }; +#define MLX5_NVME_TCP_MAX_SEGMENTS 128 + +static u32 mlx5e_get_max_sgl(struct mlx5_core_dev *mdev) +{ + return min_t(u32, + MLX5_NVME_TCP_MAX_SEGMENTS, + 1 << MLX5_CAP_GEN(mdev, log_max_klm_list_size)); +} + +static void mlx5e_nvmeotcp_destroy_tir(struct mlx5e_priv *priv, int tirn) +{ + mlx5_core_destroy_tir(priv->mdev, tirn); +} + +static inline u32 +mlx5e_get_channel_ix_from_io_cpu(struct mlx5e_priv *priv, u32 io_cpu) +{ + int num_channels = priv->channels.params.num_channels; + u32 channel_ix = io_cpu; + + if (channel_ix >= num_channels) + channel_ix = channel_ix % num_channels; + + return channel_ix; +} + +static int mlx5e_nvmeotcp_create_tir(struct mlx5e_priv *priv, + struct sock *sk, + struct nvme_tcp_ddp_config *config, + struct mlx5e_nvmeotcp_queue *queue, + bool zerocopy, bool crc_rx) +{ + u32 rqtn = priv->direct_tir[queue->channel_ix].rqt.rqtn; + int err, inlen; + void *tirc; + u32 tirn; + u32 *in; + + inlen = MLX5_ST_SZ_BYTES(create_tir_in); + in = kvzalloc(inlen, GFP_KERNEL); + if (!in) + return -ENOMEM; + tirc = MLX5_ADDR_OF(create_tir_in, in, ctx); + MLX5_SET(tirc, tirc, disp_type, MLX5_TIRC_DISP_TYPE_INDIRECT); + MLX5_SET(tirc, tirc, rx_hash_fn, MLX5_RX_HASH_FN_INVERTED_XOR8); + MLX5_SET(tirc, tirc, indirect_table, rqtn); + MLX5_SET(tirc, tirc, transport_domain, priv->mdev->mlx5e_res.td.tdn); + if (zerocopy) { + MLX5_SET(tirc, tirc, nvmeotcp_zero_copy_en, 1); + MLX5_SET(tirc, tirc, nvmeotcp_tag_buffer_table_id, + queue->tag_buf_table_id); + } + + if (crc_rx) + MLX5_SET(tirc, tirc, nvmeotcp_crc_en, 1); + + MLX5_SET(tirc, tirc, self_lb_block, + MLX5_TIRC_SELF_LB_BLOCK_BLOCK_UNICAST | + MLX5_TIRC_SELF_LB_BLOCK_BLOCK_MULTICAST); + err = mlx5_core_create_tir(priv->mdev, in, &tirn); + + if (!err) + queue->tirn = tirn; + + kvfree(in); + return err; +} + +static +int mlx5e_create_nvmeotcp_tag_buf_table(struct mlx5_core_dev *mdev, + struct mlx5e_nvmeotcp_queue *queue, + u8 log_table_size) +{ + u32 in[MLX5_ST_SZ_DW(create_nvmeotcp_tag_buf_table_in)] = {}; + u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)]; + u64 general_obj_types; + void *obj; + int err; + + obj = MLX5_ADDR_OF(create_nvmeotcp_tag_buf_table_in, in, + nvmeotcp_tag_buf_table_obj); + + general_obj_types = MLX5_CAP_GEN_64(mdev, general_obj_types); + if (!(general_obj_types & + MLX5_HCA_CAP_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE)) + return -EINVAL; + + MLX5_SET(general_obj_in_cmd_hdr, in, opcode, + MLX5_CMD_OP_CREATE_GENERAL_OBJECT); + MLX5_SET(general_obj_in_cmd_hdr, in, obj_type, + MLX5_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE); + MLX5_SET(nvmeotcp_tag_buf_table_obj, obj, + log_tag_buffer_table_size, log_table_size); + + err = mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out)); + if (!err) + queue->tag_buf_table_id = MLX5_GET(general_obj_out_cmd_hdr, + out, obj_id); + return err; +} + +static +void mlx5_destroy_nvmeotcp_tag_buf_table(struct mlx5_core_dev *mdev, u32 uid) +{ + u32 in[MLX5_ST_SZ_DW(general_obj_in_cmd_hdr)] = {}; + u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)]; + + MLX5_SET(general_obj_in_cmd_hdr, in, opcode, + MLX5_CMD_OP_DESTROY_GENERAL_OBJECT); + MLX5_SET(general_obj_in_cmd_hdr, in, obj_type, + MLX5_GENERAL_OBJECT_TYPES_NVMEOTCP_TAG_BUFFER_TABLE); + MLX5_SET(general_obj_in_cmd_hdr, in, obj_id, uid); + + mlx5_cmd_exec(mdev, in, sizeof(in), out, sizeof(out)); +} + +#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_TIR_PARAMS 0x2 +#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS 0x2 +#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_UMR 0x0 + +#define STATIC_PARAMS_DS_CNT \ + DIV_ROUND_UP(MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ, MLX5_SEND_WQE_DS) + +#define PROGRESS_PARAMS_DS_CNT \ + DIV_ROUND_UP(MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ, MLX5_SEND_WQE_DS) + +enum wqe_type { + KLM_UMR = 0, + BSF_KLM_UMR = 1, + SET_PSV_UMR = 2, + BSF_UMR = 3, +}; + static void fill_nvmeotcp_klm_wqe(struct mlx5e_nvmeotcp_queue *queue, struct mlx5e_umr_wqe *wqe, u16 ccid, u32 klm_entries, - u16 klm_offset) + u16 klm_offset, enum wqe_type klm_type) { struct scatterlist *sgl_mkey; u32 lkey, i; - lkey = queue->priv->mdev->mlx5e_res.mkey.key; - for (i = 0; i < klm_entries; i++) { - sgl_mkey = &queue->ccid_table[ccid].sgl[i + klm_offset]; - wqe->inline_klms[i].bcount = cpu_to_be32(sgl_mkey->length); - wqe->inline_klms[i].key = cpu_to_be32(lkey); - wqe->inline_klms[i].va = cpu_to_be64(sgl_mkey->dma_address); - } - - for (; i < ALIGN(klm_entries, KLM_ALIGNMENT); i++) { - wqe->inline_klms[i].bcount = 0; - wqe->inline_klms[i].key = 0; - wqe->inline_klms[i].va = 0; + if (klm_type == BSF_KLM_UMR) { + for (i = 0; i < klm_entries; i++) { + lkey = queue->ccid_table[i + klm_offset].klm_mkey.key; + wqe->inline_klms[i].bcount = cpu_to_be32(1); + wqe->inline_klms[i].key = cpu_to_be32(lkey); + wqe->inline_klms[i].va = 0; + } + } else { + lkey = queue->priv->mdev->mlx5e_res.mkey.key; + for (i = 0; i < klm_entries; i++) { + sgl_mkey = &queue->ccid_table[ccid].sgl[i + klm_offset]; + wqe->inline_klms[i].bcount = cpu_to_be32(sgl_mkey->length); + wqe->inline_klms[i].key = cpu_to_be32(lkey); + wqe->inline_klms[i].va = cpu_to_be64(sgl_mkey->dma_address); + } + + for (; i < ALIGN(klm_entries, KLM_ALIGNMENT); i++) { + wqe->inline_klms[i].bcount = 0; + wqe->inline_klms[i].key = 0; + wqe->inline_klms[i].va = 0; + } } } static void build_nvmeotcp_klm_umr(struct mlx5e_nvmeotcp_queue *queue, struct mlx5e_umr_wqe *wqe, u16 ccid, int klm_entries, - u32 klm_offset, u32 len) + u32 klm_offset, u32 len, enum wqe_type klm_type) { - u32 id = queue->ccid_table[ccid].klm_mkey.key; + u32 id = (klm_type == KLM_UMR) ? queue->ccid_table[ccid].klm_mkey.key : + (queue->tirn << MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT); + u8 opc_mod = (klm_type == KLM_UMR) ? MLX5_CTRL_SEGMENT_OPC_MOD_UMR_UMR : + MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS; struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl; struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl; struct mlx5_mkey_seg *mkc = &wqe->mkc; @@ -57,12 +203,12 @@ build_nvmeotcp_klm_umr(struct mlx5e_nvmeotcp_queue *queue, u16 pc = queue->sq->icosq.pc; cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) | - MLX5_OPCODE_UMR); + MLX5_OPCODE_UMR | (opc_mod) << 24); cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) | MLX5E_KLM_UMR_DS_CNT(ALIGN(klm_entries, KLM_ALIGNMENT))); cseg->general_id = cpu_to_be32(id); - if (!klm_offset) { + if (klm_type == KLM_UMR && !klm_offset) { ucseg->mkey_mask |= cpu_to_be64(MLX5_MKEY_MASK_XLT_OCT_SIZE | MLX5_MKEY_MASK_LEN | MLX5_MKEY_MASK_FREE); mkc->xlt_oct_size = cpu_to_be32(ALIGN(len, KLM_ALIGNMENT)); @@ -72,21 +218,150 @@ build_nvmeotcp_klm_umr(struct mlx5e_nvmeotcp_queue *queue, ucseg->flags = MLX5_UMR_INLINE | MLX5_UMR_TRANSLATION_OFFSET_EN; ucseg->xlt_octowords = cpu_to_be16(ALIGN(klm_entries, KLM_ALIGNMENT)); ucseg->xlt_offset = cpu_to_be16(klm_offset); - fill_nvmeotcp_klm_wqe(queue, wqe, ccid, klm_entries, klm_offset); + fill_nvmeotcp_klm_wqe(queue, wqe, ccid, klm_entries, klm_offset, klm_type); +} + +static void +fill_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue, + struct mlx5_seg_nvmeotcp_progress_params *params, + u32 seq) +{ + void *ctx = params->ctx; + + params->tir_num = cpu_to_be32(queue->tirn); + + MLX5_SET(nvmeotcp_progress_params, ctx, + next_pdu_tcp_sn, seq); + MLX5_SET(nvmeotcp_progress_params, ctx, pdu_tracker_state, + MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_START); +} + +void +build_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue, + struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe, + u32 seq) +{ + struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl; + u32 sqn = queue->sq->icosq.sqn; + u16 pc = queue->sq->icosq.pc; + u8 opc_mod; + + memset(wqe, 0, MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ); + opc_mod = MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_PROGRESS_PARAMS; + cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) | + MLX5_OPCODE_SET_PSV | (opc_mod << 24)); + cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) | + PROGRESS_PARAMS_DS_CNT); + fill_nvmeotcp_progress_params(queue, &wqe->params, seq); +} + +static void +fill_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue, + struct mlx5_seg_nvmeotcp_static_params *params, + u32 resync_seq, bool zero_copy_en, + bool ddgst_offload_en) +{ + void *ctx = params->ctx; + + MLX5_SET(transport_static_params, ctx, const_1, 1); + MLX5_SET(transport_static_params, ctx, const_2, 2); + MLX5_SET(transport_static_params, ctx, acc_type, + MLX5_TRANSPORT_STATIC_PARAMS_ACC_TYPE_NVMETCP); + MLX5_SET(transport_static_params, ctx, nvme_resync_tcp_sn, resync_seq); + MLX5_SET(transport_static_params, ctx, pda, queue->pda); + MLX5_SET(transport_static_params, ctx, ddgst_en, queue->dgst); + MLX5_SET(transport_static_params, ctx, ddgst_offload_en, ddgst_offload_en); + MLX5_SET(transport_static_params, ctx, hddgst_en, 0); + MLX5_SET(transport_static_params, ctx, hdgst_offload_en, 0); + MLX5_SET(transport_static_params, ctx, ti, + MLX5_TRANSPORT_STATIC_PARAMS_TI_INITIATOR); + MLX5_SET(transport_static_params, ctx, const1, 1); + MLX5_SET(transport_static_params, ctx, zero_copy_en, zero_copy_en); +} + +void +build_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue, + struct mlx5e_set_nvmeotcp_static_params_wqe *wqe, + u32 resync_seq, bool zerocopy, bool crc_rx) +{ + u8 opc_mod = MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_STATIC_PARAMS; + struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl; + struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl; + u32 sqn = queue->sq->icosq.sqn; + u16 pc = queue->sq->icosq.pc; + + memset(wqe, 0, MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ); + + cseg->opmod_idx_opcode = cpu_to_be32((pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) | + MLX5_OPCODE_UMR | (opc_mod) << 24); + cseg->qpn_ds = cpu_to_be32((sqn << MLX5_WQE_CTRL_QPN_SHIFT) | + STATIC_PARAMS_DS_CNT); + cseg->imm = cpu_to_be32(queue->tirn << MLX5_WQE_CTRL_TIR_TIS_INDEX_SHIFT); + + ucseg->flags = MLX5_UMR_INLINE; + ucseg->bsf_octowords = + cpu_to_be16(MLX5E_NVMEOTCP_STATIC_PARAMS_OCTWORD_SIZE); + fill_nvmeotcp_static_params(queue, &wqe->params, resync_seq, zerocopy, crc_rx); } static void mlx5e_nvmeotcp_fill_wi(struct mlx5e_nvmeotcp_queue *nvmeotcp_queue, - struct mlx5e_icosq *sq, u32 wqe_bbs, u16 pi) + struct mlx5e_icosq *sq, u32 wqe_bbs, u16 pi, + enum wqe_type type) { struct mlx5e_icosq_wqe_info *wi = &sq->db.wqe_info[pi]; wi->num_wqebbs = wqe_bbs; - wi->wqe_type = MLX5E_ICOSQ_WQE_UMR_NVME_TCP; + switch (type) { + case SET_PSV_UMR: + wi->wqe_type = MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP; + break; + default: + wi->wqe_type = MLX5E_ICOSQ_WQE_UMR_NVME_TCP; + break; + } + + if (type == SET_PSV_UMR) + wi->nvmeotcp_q.queue = nvmeotcp_queue; +} + +static void +mlx5e_nvmeotcp_rx_post_static_params_wqe(struct mlx5e_nvmeotcp_queue *queue, + u32 resync_seq) +{ + struct mlx5e_set_nvmeotcp_static_params_wqe *wqe; + struct mlx5e_icosq *sq = &queue->sq->icosq; + u16 pi, wqe_bbs; + + wqe_bbs = MLX5E_NVMEOTCP_STATIC_PARAMS_WQEBBS; + pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs); + wqe = MLX5E_NVMEOTCP_FETCH_STATIC_PARAMS_WQE(sq, pi); + mlx5e_nvmeotcp_fill_wi(NULL, sq, wqe_bbs, pi, BSF_UMR); + build_nvmeotcp_static_params(queue, wqe, resync_seq, queue->zerocopy, queue->crc_rx); + sq->pc += wqe_bbs; + mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, &wqe->ctrl); +} + +static void +mlx5e_nvmeotcp_rx_post_progress_params_wqe(struct mlx5e_nvmeotcp_queue *queue, + u32 seq) +{ + struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe; + struct mlx5e_icosq *sq = &queue->sq->icosq; + u16 pi, wqe_bbs; + + wqe_bbs = MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQEBBS; + pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs); + wqe = MLX5E_NVMEOTCP_FETCH_PROGRESS_PARAMS_WQE(sq, pi); + mlx5e_nvmeotcp_fill_wi(queue, sq, wqe_bbs, pi, SET_PSV_UMR); + build_nvmeotcp_progress_params(queue, wqe, seq); + sq->pc += wqe_bbs; + mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, &wqe->ctrl); } static void post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue, + enum wqe_type wqe_type, u16 ccid, u32 klm_length, u32 *klm_offset) @@ -102,9 +377,9 @@ post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue, wqe_bbs = DIV_ROUND_UP(wqe_sz, MLX5_SEND_WQE_BB); pi = mlx5e_icosq_get_next_pi(sq, wqe_bbs); wqe = MLX5E_NVMEOTCP_FETCH_KLM_WQE(sq, pi); - mlx5e_nvmeotcp_fill_wi(queue, sq, wqe_bbs, pi); + mlx5e_nvmeotcp_fill_wi(queue, sq, wqe_bbs, pi, wqe_type); build_nvmeotcp_klm_umr(queue, wqe, ccid, cur_klm_entries, *klm_offset, - klm_length); + klm_length, wqe_type); *klm_offset += cur_klm_entries; sq->pc += wqe_bbs; sq->doorbell_cseg = &wqe->ctrl; @@ -112,6 +387,7 @@ post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue, static int mlx5e_nvmeotcp_post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue, + enum wqe_type wqe_type, u16 ccid, u32 klm_length) { @@ -129,31 +405,326 @@ mlx5e_nvmeotcp_post_klm_wqe(struct mlx5e_nvmeotcp_queue *queue, return -ENOSPC; for (i = 0; i < wqes; i++) - post_klm_wqe(queue, ccid, klm_length, &klm_offset); + post_klm_wqe(queue, wqe_type, ccid, klm_length, &klm_offset); mlx5e_notify_hw(&sq->wq, sq->pc, sq->uar_map, sq->doorbell_cseg); return 0; } +static int mlx5e_create_nvmeotcp_mkey(struct mlx5_core_dev *mdev, + u8 access_mode, + u32 translation_octword_size, + struct mlx5_core_mkey *mkey) +{ + int inlen = MLX5_ST_SZ_BYTES(create_mkey_in); + void *mkc; + u32 *in; + int err; + + in = kvzalloc(inlen, GFP_KERNEL); + if (!in) + return -ENOMEM; + + mkc = MLX5_ADDR_OF(create_mkey_in, in, memory_key_mkey_entry); + MLX5_SET(mkc, mkc, free, 1); + MLX5_SET(mkc, mkc, translations_octword_size, translation_octword_size); + MLX5_SET(mkc, mkc, umr_en, 1); + MLX5_SET(mkc, mkc, lw, 1); + MLX5_SET(mkc, mkc, lr, 1); + MLX5_SET(mkc, mkc, access_mode_1_0, access_mode); + + MLX5_SET(mkc, mkc, qpn, 0xffffff); + MLX5_SET(mkc, mkc, pd, mdev->mlx5e_res.pdn); + + err = mlx5_core_create_mkey(mdev, mkey, in, inlen); + + kvfree(in); + return err; +} + static int mlx5e_nvmeotcp_offload_limits(struct net_device *netdev, struct tcp_ddp_limits *limits) { + struct mlx5e_priv *priv = netdev_priv(netdev); + struct mlx5_core_dev *mdev = priv->mdev; + + limits->max_ddp_sgl_len = mlx5e_get_max_sgl(mdev); + return 0; +} + +static void +mlx5e_nvmeotcp_destroy_sq(struct mlx5e_nvmeotcp_sq *nvmeotcpsq) +{ + mlx5e_deactivate_icosq(&nvmeotcpsq->icosq); + mlx5e_close_icosq(&nvmeotcpsq->icosq); + mlx5e_close_cq(&nvmeotcpsq->icosq.cq); + list_del(&nvmeotcpsq->list); + kfree(nvmeotcpsq); +} + +static int +mlx5e_nvmeotcp_build_icosq(struct mlx5e_nvmeotcp_queue *queue, + struct mlx5e_priv *priv) +{ + u16 max_sgl, max_klm_per_wqe, max_umr_per_ccid, sgl_rest, wqebbs_rest; + struct mlx5e_channel *c = priv->channels.c[queue->channel_ix]; + struct mlx5e_sq_param icosq_param = {0}; + struct dim_cq_moder icocq_moder = {0}; + struct mlx5e_nvmeotcp_sq *nvmeotcp_sq; + struct mlx5e_create_cq_param ccp; + struct mlx5e_icosq *icosq; + int err = -ENOMEM; + u16 log_icosq_sz; + u32 max_wqebbs; + + nvmeotcp_sq = kzalloc(sizeof(*nvmeotcp_sq), GFP_KERNEL); + if (!nvmeotcp_sq) + return err; + + icosq = &nvmeotcp_sq->icosq; + max_sgl = mlx5e_get_max_sgl(priv->mdev); + max_klm_per_wqe = queue->max_klms_per_wqe; + max_umr_per_ccid = max_sgl / max_klm_per_wqe; + sgl_rest = max_sgl % max_klm_per_wqe; + wqebbs_rest = sgl_rest ? MLX5E_KLM_UMR_WQEBBS(sgl_rest) : 0; + max_wqebbs = (MLX5E_KLM_UMR_WQEBBS(max_klm_per_wqe) * + max_umr_per_ccid + wqebbs_rest) * queue->size; + log_icosq_sz = order_base_2(max_wqebbs); + + mlx5e_build_icosq_param(priv, log_icosq_sz, &icosq_param); + mlx5e_build_create_cq_param(&ccp, c); + err = mlx5e_open_cq(priv, icocq_moder, &icosq_param.cqp, &ccp, &icosq->cq); + if (err) + goto err_nvmeotcp_sq; + + err = mlx5e_open_icosq(c, &priv->channels.params, &icosq_param, icosq); + if (err) + goto close_cq; + + INIT_LIST_HEAD(&nvmeotcp_sq->list); + spin_lock(&c->nvmeotcp_icosq_lock); + list_add(&nvmeotcp_sq->list, &c->list_nvmeotcpsq); + spin_unlock(&c->nvmeotcp_icosq_lock); + queue->sq = nvmeotcp_sq; + mlx5e_activate_icosq(icosq); + return 0; + +close_cq: + mlx5e_close_cq(&icosq->cq); +err_nvmeotcp_sq: + kfree(nvmeotcp_sq); + + return err; +} + +static void +mlx5e_nvmeotcp_destroy_rx(struct mlx5e_nvmeotcp_queue *queue, + struct mlx5_core_dev *mdev, bool zerocopy) +{ + int i; + + mlx5e_accel_fs_del_sk(queue->fh); + for (i = 0; i < queue->size && zerocopy; i++) + mlx5_core_destroy_mkey(mdev, &queue->ccid_table[i].klm_mkey); + + mlx5e_nvmeotcp_destroy_tir(queue->priv, queue->tirn); + if (zerocopy) { + kfree(queue->ccid_table); + mlx5_destroy_nvmeotcp_tag_buf_table(mdev, queue->tag_buf_table_id); + } + + mlx5e_nvmeotcp_destroy_sq(queue->sq); +} + +static int +mlx5e_nvmeotcp_queue_rx_init(struct mlx5e_nvmeotcp_queue *queue, + struct nvme_tcp_ddp_config *config, + struct net_device *netdev, + bool zerocopy, bool crc) +{ + u8 log_queue_size = order_base_2(config->queue_size); + struct mlx5e_priv *priv = netdev_priv(netdev); + struct mlx5_core_dev *mdev = priv->mdev; + struct sock *sk = queue->sk; + int err, max_sgls, i; + + if (zerocopy) { + if (config->queue_size > + BIT(MLX5_CAP_DEV_NVMEOTCP(mdev, log_max_nvmeotcp_tag_buffer_size))) { + return -EINVAL; + } + + err = mlx5e_create_nvmeotcp_tag_buf_table(mdev, queue, log_queue_size); + if (err) + return err; + } + + err = mlx5e_nvmeotcp_build_icosq(queue, priv); + if (err) + goto destroy_tag_buffer_table; + + /* initializes queue->tirn */ + err = mlx5e_nvmeotcp_create_tir(priv, sk, config, queue, zerocopy, crc); + if (err) + goto destroy_icosq; + + mlx5e_nvmeotcp_rx_post_static_params_wqe(queue, 0); + mlx5e_nvmeotcp_rx_post_progress_params_wqe(queue, tcp_sk(sk)->copied_seq); + + if (zerocopy) { + queue->ccid_table = kcalloc(queue->size, + sizeof(struct nvmeotcp_queue_entry), + GFP_KERNEL); + if (!queue->ccid_table) { + err = -ENOMEM; + goto destroy_tir; + } + + max_sgls = mlx5e_get_max_sgl(mdev); + for (i = 0; i < queue->size; i++) { + err = mlx5e_create_nvmeotcp_mkey(mdev, + MLX5_MKC_ACCESS_MODE_KLMS, + max_sgls, + &queue->ccid_table[i].klm_mkey); + if (err) + goto free_sgl; + } + + err = mlx5e_nvmeotcp_post_klm_wqe(queue, BSF_KLM_UMR, 0, queue->size); + if (err) + goto free_sgl; + } + + if (!(WARN_ON(!wait_for_completion_timeout(&queue->done, 0)))) + queue->fh = mlx5e_accel_fs_add_sk(priv, sk, queue->tirn, queue->id); + + if (IS_ERR_OR_NULL(queue->fh)) { + err = -EINVAL; + goto free_sgl; + } + return 0; + +free_sgl: + while ((i--) && zerocopy) + mlx5_core_destroy_mkey(mdev, &queue->ccid_table[i].klm_mkey); + + if (zerocopy) + kfree(queue->ccid_table); +destroy_tir: + mlx5e_nvmeotcp_destroy_tir(priv, queue->tirn); +destroy_icosq: + mlx5e_nvmeotcp_destroy_sq(queue->sq); +destroy_tag_buffer_table: + if (zerocopy) + mlx5_destroy_nvmeotcp_tag_buf_table(mdev, queue->tag_buf_table_id); + + return err; } +#define OCTWORD_SHIFT 4 +#define MAX_DS_VALUE 63 static int mlx5e_nvmeotcp_queue_init(struct net_device *netdev, struct sock *sk, struct tcp_ddp_config *tconfig) { - return 0; + struct nvme_tcp_ddp_config *config = (struct nvme_tcp_ddp_config *)tconfig; + bool crc_rx = (netdev->features & NETIF_F_HW_TCP_DDP_CRC_RX); + bool zerocopy = (netdev->features & NETIF_F_HW_TCP_DDP); + struct mlx5e_priv *priv = netdev_priv(netdev); + struct mlx5_core_dev *mdev = priv->mdev; + struct mlx5e_nvmeotcp_queue *queue; + int max_wqe_sz_cap, queue_id, err; + + if (tconfig->type != TCP_DDP_NVME) { + err = -EOPNOTSUPP; + goto out; + } + + queue = kzalloc(sizeof(*queue), GFP_KERNEL); + if (!queue) { + err = -ENOMEM; + goto out; + } + + queue_id = ida_simple_get(&priv->nvmeotcp->queue_ids, + MIN_NVMEOTCP_QUEUES, MAX_NVMEOTCP_QUEUES, + GFP_KERNEL); + if (queue_id < 0) { + err = -ENOSPC; + goto free_queue; + } + + queue->crc_rx = crc_rx; + queue->zerocopy = zerocopy; + queue->tcp_ddp_ctx.type = TCP_DDP_NVME; + queue->sk = sk; + queue->id = queue_id; + queue->dgst = config->dgst; + queue->pda = config->cpda; + queue->channel_ix = mlx5e_get_channel_ix_from_io_cpu(priv, + config->io_cpu); + queue->size = config->queue_size; + max_wqe_sz_cap = min_t(int, MAX_DS_VALUE * MLX5_SEND_WQE_DS, + MLX5_CAP_GEN(mdev, max_wqe_sz_sq) << OCTWORD_SHIFT); + queue->max_klms_per_wqe = MLX5E_KLM_ENTRIES_PER_WQE(max_wqe_sz_cap); + queue->priv = priv; + init_completion(&queue->done); + + if (zerocopy || crc_rx) { + err = mlx5e_nvmeotcp_queue_rx_init(queue, config, netdev, + zerocopy, crc_rx); + if (err) + goto remove_queue_id; + } + + err = rhashtable_insert_fast(&priv->nvmeotcp->queue_hash, &queue->hash, + rhash_queues); + if (err) + goto destroy_rx; + + write_lock_bh(&sk->sk_callback_lock); + tcp_ddp_set_ctx(sk, queue); + write_unlock_bh(&sk->sk_callback_lock); + refcount_set(&queue->ref_count, 1); + return err; + +destroy_rx: + if (zerocopy || crc_rx) + mlx5e_nvmeotcp_destroy_rx(queue, mdev, zerocopy); +remove_queue_id: + ida_simple_remove(&priv->nvmeotcp->queue_ids, queue_id); +free_queue: + kfree(queue); +out: + return err; } static void mlx5e_nvmeotcp_queue_teardown(struct net_device *netdev, struct sock *sk) { + struct mlx5e_priv *priv = netdev_priv(netdev); + struct mlx5_core_dev *mdev = priv->mdev; + struct mlx5e_nvmeotcp_queue *queue; + + queue = container_of(tcp_ddp_get_ctx(sk), struct mlx5e_nvmeotcp_queue, tcp_ddp_ctx); + + napi_synchronize(&priv->channels.c[queue->channel_ix]->napi); + + WARN_ON(refcount_read(&queue->ref_count) != 1); + if (queue->zerocopy | queue->crc_rx) + mlx5e_nvmeotcp_destroy_rx(queue, mdev, queue->zerocopy); + + rhashtable_remove_fast(&priv->nvmeotcp->queue_hash, &queue->hash, + rhash_queues); + ida_simple_remove(&priv->nvmeotcp->queue_ids, queue->id); + write_lock_bh(&sk->sk_callback_lock); + tcp_ddp_set_ctx(sk, NULL); + write_unlock_bh(&sk->sk_callback_lock); + mlx5e_nvmeotcp_put_queue(queue); } static int @@ -164,6 +735,16 @@ mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev, return 0; } +void mlx5e_nvmeotcp_ctx_comp(struct mlx5e_icosq_wqe_info *wi) +{ + struct mlx5e_nvmeotcp_queue *queue = wi->nvmeotcp_q.queue; + + if (unlikely(!queue)) + return; + + complete(&queue->done); +} + static int mlx5e_nvmeotcp_ddp_teardown(struct net_device *netdev, struct sock *sk, @@ -188,6 +769,27 @@ static const struct tcp_ddp_dev_ops mlx5e_nvmeotcp_ops = { .tcp_ddp_resync = mlx5e_nvmeotcp_dev_resync, }; +struct mlx5e_nvmeotcp_queue * +mlx5e_nvmeotcp_get_queue(struct mlx5e_nvmeotcp *nvmeotcp, int id) +{ + struct mlx5e_nvmeotcp_queue *queue; + + rcu_read_lock(); + queue = rhashtable_lookup_fast(&nvmeotcp->queue_hash, + &id, rhash_queues); + if (queue && !IS_ERR(queue)) + if (!refcount_inc_not_zero(&queue->ref_count)) + queue = NULL; + rcu_read_unlock(); + return queue; +} + +void mlx5e_nvmeotcp_put_queue(struct mlx5e_nvmeotcp_queue *queue) +{ + if (refcount_dec_and_test(&queue->ref_count)) + kfree(queue); +} + int set_feature_nvme_tcp(struct net_device *netdev, bool enable) { struct mlx5e_priv *priv = netdev_priv(netdev); diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h index 753757fc44a3..d0e515502d6d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.h @@ -100,6 +100,10 @@ int mlx5e_nvmeotcp_init(struct mlx5e_priv *priv); int set_feature_nvme_tcp(struct net_device *netdev, bool enable); int set_feature_nvme_tcp_crc(struct net_device *netdev, bool enable); void mlx5e_nvmeotcp_cleanup(struct mlx5e_priv *priv); +struct mlx5e_nvmeotcp_queue * +mlx5e_nvmeotcp_get_queue(struct mlx5e_nvmeotcp *nvmeotcp, int id); +void mlx5e_nvmeotcp_put_queue(struct mlx5e_nvmeotcp_queue *queue); +void mlx5e_nvmeotcp_ctx_comp(struct mlx5e_icosq_wqe_info *wi); int mlx5e_nvmeotcp_init_rx(struct mlx5e_priv *priv); void mlx5e_nvmeotcp_cleanup_rx(struct mlx5e_priv *priv); #else diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h index 329e114d6571..44671e28a9ea 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_utils.h @@ -4,9 +4,77 @@ #define __MLX5E_NVMEOTCP_UTILS_H__ #include "en.h" +#include "en_accel/nvmeotcp.h" + +enum { + MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_START = 0, + MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_TRACKING = 1, + MLX5E_NVMEOTCP_PROGRESS_PARAMS_PDU_TRACKER_STATE_SEARCHING = 2, +}; + +struct mlx5_seg_nvmeotcp_static_params { + u8 ctx[MLX5_ST_SZ_BYTES(transport_static_params)]; +}; + +struct mlx5_seg_nvmeotcp_progress_params { + __be32 tir_num; + u8 ctx[MLX5_ST_SZ_BYTES(nvmeotcp_progress_params)]; +}; + +struct mlx5e_set_nvmeotcp_static_params_wqe { + struct mlx5_wqe_ctrl_seg ctrl; + struct mlx5_wqe_umr_ctrl_seg uctrl; + struct mlx5_mkey_seg mkc; + struct mlx5_seg_nvmeotcp_static_params params; +}; + +struct mlx5e_set_nvmeotcp_progress_params_wqe { + struct mlx5_wqe_ctrl_seg ctrl; + struct mlx5_seg_nvmeotcp_progress_params params; +}; + +struct mlx5e_get_psv_wqe { + struct mlx5_wqe_ctrl_seg ctrl; + struct mlx5_seg_get_psv psv; +}; + +/////////////////////////////////////////// +#define MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ \ + (sizeof(struct mlx5e_set_nvmeotcp_static_params_wqe)) + +#define MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ \ + (sizeof(struct mlx5e_set_nvmeotcp_progress_params_wqe)) +#define MLX5E_NVMEOTCP_STATIC_PARAMS_OCTWORD_SIZE \ + (MLX5_ST_SZ_BYTES(transport_static_params) / MLX5_SEND_WQE_DS) + +#define MLX5E_NVMEOTCP_STATIC_PARAMS_WQEBBS \ + (DIV_ROUND_UP(MLX5E_NVMEOTCP_STATIC_PARAMS_WQE_SZ, MLX5_SEND_WQE_BB)) +#define MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQEBBS \ + (DIV_ROUND_UP(MLX5E_NVMEOTCP_PROGRESS_PARAMS_WQE_SZ, MLX5_SEND_WQE_BB)) + +#define MLX5E_NVMEOTCP_FETCH_STATIC_PARAMS_WQE(sq, pi) \ + ((struct mlx5e_set_nvmeotcp_static_params_wqe *)\ + mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_set_nvmeotcp_static_params_wqe))) + +#define MLX5E_NVMEOTCP_FETCH_PROGRESS_PARAMS_WQE(sq, pi) \ + ((struct mlx5e_set_nvmeotcp_progress_params_wqe *)\ + mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_set_nvmeotcp_progress_params_wqe))) #define MLX5E_NVMEOTCP_FETCH_KLM_WQE(sq, pi) \ ((struct mlx5e_umr_wqe *)\ mlx5e_fetch_wqe(&(sq)->wq, pi, sizeof(struct mlx5e_umr_wqe))) +#define MLX5_CTRL_SEGMENT_OPC_MOD_UMR_NVMEOTCP_TIR_PROGRESS_PARAMS 0x4 + +void +build_nvmeotcp_progress_params(struct mlx5e_nvmeotcp_queue *queue, + struct mlx5e_set_nvmeotcp_progress_params_wqe *wqe, + u32 seq); + +void +build_nvmeotcp_static_params(struct mlx5e_nvmeotcp_queue *queue, + struct mlx5e_set_nvmeotcp_static_params_wqe *wqe, + u32 resync_seq, + bool zerocopy, bool crc_rx); + #endif /* __MLX5E_NVMEOTCP_UTILS_H__ */ diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index 28aa99265c0a..056fc24dd5ed 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -47,6 +47,7 @@ #include "fpga/ipsec.h" #include "en_accel/ipsec_rxtx.h" #include "en_accel/tls_rxtx.h" +#include "en_accel/nvmeotcp.h" #include "lib/clock.h" #include "en/xdp.h" #include "en/xsk/rx.h" @@ -629,6 +630,9 @@ void mlx5e_free_icosq_descs(struct mlx5e_icosq *sq) #ifdef CONFIG_MLX5_EN_NVMEOTCP case MLX5E_ICOSQ_WQE_UMR_NVME_TCP: break; + case MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP: + mlx5e_nvmeotcp_ctx_comp(wi); + break; #endif } } @@ -703,6 +707,9 @@ int mlx5e_poll_ico_cq(struct mlx5e_cq *cq) #ifdef CONFIG_MLX5_EN_NVMEOTCP case MLX5E_ICOSQ_WQE_UMR_NVME_TCP: break; + case MLX5E_ICOSQ_WQE_SET_PSV_NVME_TCP: + mlx5e_nvmeotcp_ctx_comp(wi); + break; #endif default: netdev_WARN_ONCE(cq->netdev, From patchwork Thu Jan 14 15:10:32 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363375 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0C177C4332B for ; Thu, 14 Jan 2021 15:12:00 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id CAB2823A5F for ; Thu, 14 Jan 2021 15:11:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729357AbhANPLw (ORCPT ); Thu, 14 Jan 2021 10:11:52 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:39014 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729316AbhANPLv (ORCPT ); Thu, 14 Jan 2021 10:11:51 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:45 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh0K000835; Thu, 14 Jan 2021 17:10:45 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com, Or Gerlitz , Yoray Zack Subject: [PATCH v2 net-next 20/21] net/mlx5e: NVMEoTCP statistics Date: Thu, 14 Jan 2021 17:10:32 +0200 Message-Id: <20210114151033.13020-21-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Ben Ben-Ishay NVMEoTCP offload statistics includes both control and data path statistic: counters for ndo, offloaded packets/bytes, dropped packets and resync operation. Signed-off-by: Boris Pismenny Signed-off-by: Ben Ben-Ishay Signed-off-by: Or Gerlitz Signed-off-by: Yoray Zack --- .../mellanox/mlx5/core/en_accel/nvmeotcp.c | 23 +++++++++++- .../mlx5/core/en_accel/nvmeotcp_rxtx.c | 16 ++++++++ .../ethernet/mellanox/mlx5/core/en_stats.c | 37 +++++++++++++++++++ .../ethernet/mellanox/mlx5/core/en_stats.h | 24 ++++++++++++ 4 files changed, 98 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c index c2967cb4cb6a..80cd32942d8d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp.c @@ -656,6 +656,11 @@ mlx5e_nvmeotcp_queue_init(struct net_device *netdev, struct mlx5_core_dev *mdev = priv->mdev; struct mlx5e_nvmeotcp_queue *queue; int max_wqe_sz_cap, queue_id, err; + struct mlx5e_rq_stats *stats; + u32 channel_ix; + + channel_ix = mlx5e_get_channel_ix_from_io_cpu(priv, config->io_cpu); + stats = &priv->channel_stats[channel_ix].rq; if (tconfig->type != TCP_DDP_NVME) { err = -EOPNOTSUPP; @@ -683,8 +688,7 @@ mlx5e_nvmeotcp_queue_init(struct net_device *netdev, queue->id = queue_id; queue->dgst = config->dgst; queue->pda = config->cpda; - queue->channel_ix = mlx5e_get_channel_ix_from_io_cpu(priv, - config->io_cpu); + queue->channel_ix = channel_ix; queue->size = config->queue_size; max_wqe_sz_cap = min_t(int, MAX_DS_VALUE * MLX5_SEND_WQE_DS, MLX5_CAP_GEN(mdev, max_wqe_sz_sq) << OCTWORD_SHIFT); @@ -704,6 +708,7 @@ mlx5e_nvmeotcp_queue_init(struct net_device *netdev, if (err) goto destroy_rx; + stats->nvmeotcp_queue_init++; write_lock_bh(&sk->sk_callback_lock); tcp_ddp_set_ctx(sk, queue); write_unlock_bh(&sk->sk_callback_lock); @@ -718,6 +723,7 @@ mlx5e_nvmeotcp_queue_init(struct net_device *netdev, free_queue: kfree(queue); out: + stats->nvmeotcp_queue_init_fail++; return err; } @@ -728,11 +734,15 @@ mlx5e_nvmeotcp_queue_teardown(struct net_device *netdev, struct mlx5e_priv *priv = netdev_priv(netdev); struct mlx5_core_dev *mdev = priv->mdev; struct mlx5e_nvmeotcp_queue *queue; + struct mlx5e_rq_stats *stats; queue = container_of(tcp_ddp_get_ctx(sk), struct mlx5e_nvmeotcp_queue, tcp_ddp_ctx); napi_synchronize(&priv->channels.c[queue->channel_ix]->napi); + stats = &priv->channel_stats[queue->channel_ix].rq; + stats->nvmeotcp_queue_teardown++; + WARN_ON(refcount_read(&queue->ref_count) != 1); if (queue->zerocopy | queue->crc_rx) mlx5e_nvmeotcp_destroy_rx(queue, mdev, queue->zerocopy); @@ -754,6 +764,7 @@ mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev, struct mlx5e_priv *priv = netdev_priv(netdev); struct scatterlist *sg = ddp->sg_table.sgl; struct mlx5e_nvmeotcp_queue *queue; + struct mlx5e_rq_stats *stats; struct mlx5_core_dev *mdev; int i, size = 0, count = 0; @@ -775,6 +786,11 @@ mlx5e_nvmeotcp_ddp_setup(struct net_device *netdev, queue->ccid_table[ddp->command_id].ccid_gen++; queue->ccid_table[ddp->command_id].sgl_length = count; + stats = &priv->channel_stats[queue->channel_ix].rq; + stats->nvmeotcp_ddp_setup++; + if (unlikely(mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, count))) + stats->nvmeotcp_ddp_setup_fail++; + return 0; } @@ -815,6 +831,7 @@ mlx5e_nvmeotcp_ddp_teardown(struct net_device *netdev, struct mlx5e_nvmeotcp_queue *queue; struct mlx5e_priv *priv = netdev_priv(netdev); struct nvmeotcp_queue_entry *q_entry; + struct mlx5e_rq_stats *stats; queue = container_of(tcp_ddp_get_ctx(sk), struct mlx5e_nvmeotcp_queue, tcp_ddp_ctx); q_entry = &queue->ccid_table[ddp->command_id]; @@ -824,6 +841,8 @@ mlx5e_nvmeotcp_ddp_teardown(struct net_device *netdev, q_entry->queue = queue; mlx5e_nvmeotcp_post_klm_wqe(queue, KLM_UMR, ddp->command_id, 0); + stats = &priv->channel_stats[queue->channel_ix].rq; + stats->nvmeotcp_ddp_teardown++; return 0; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c index f446b5d56d64..e1558dfc0f3d 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_accel/nvmeotcp_rxtx.c @@ -10,12 +10,16 @@ static void nvmeotcp_update_resync(struct mlx5e_nvmeotcp_queue *queue, struct mlx5e_cqe128 *cqe128) { const struct tcp_ddp_ulp_ops *ulp_ops; + struct mlx5e_rq_stats *stats; u32 seq; seq = be32_to_cpu(cqe128->resync_tcp_sn); ulp_ops = inet_csk(queue->sk)->icsk_ulp_ddp_ops; if (ulp_ops && ulp_ops->resync_request) ulp_ops->resync_request(queue->sk, seq, TCP_DDP_RESYNC_REQ); + + stats = queue->priv->channels.c[queue->channel_ix]->rq.stats; + stats->nvmeotcp_resync++; } static void mlx5e_nvmeotcp_advance_sgl_iter(struct mlx5e_nvmeotcp_queue *queue) @@ -50,10 +54,13 @@ mlx5_nvmeotcp_add_tail_nonlinear(struct mlx5e_nvmeotcp_queue *queue, int org_nr_frags, int frag_index) { struct mlx5e_priv *priv = queue->priv; + struct mlx5e_rq_stats *stats; while (org_nr_frags != frag_index) { if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) { dev_kfree_skb_any(skb); + stats = priv->channels.c[queue->channel_ix]->rq.stats; + stats->nvmeotcp_drop++; return NULL; } skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, @@ -72,9 +79,12 @@ mlx5_nvmeotcp_add_tail(struct mlx5e_nvmeotcp_queue *queue, struct sk_buff *skb, int offset, int len) { struct mlx5e_priv *priv = queue->priv; + struct mlx5e_rq_stats *stats; if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) { dev_kfree_skb_any(skb); + stats = priv->channels.c[queue->channel_ix]->rq.stats; + stats->nvmeotcp_drop++; return NULL; } skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags, @@ -135,6 +145,7 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, skb_frag_t org_frags[MAX_SKB_FRAGS]; struct mlx5e_nvmeotcp_queue *queue; struct nvmeotcp_queue_entry *nqe; + struct mlx5e_rq_stats *stats; int org_nr_frags, frag_index; struct mlx5e_cqe128 *cqe128; u32 queue_id; @@ -167,6 +178,8 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, return skb; } + stats = priv->channels.c[queue->channel_ix]->rq.stats; + /* cc ddp from cqe */ ccid = be16_to_cpu(cqe128->ccid); ccoff = be32_to_cpu(cqe128->ccoff); @@ -209,6 +222,7 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, while (to_copy < cclen) { if (skb_shinfo(skb)->nr_frags >= MAX_SKB_FRAGS) { dev_kfree_skb_any(skb); + stats->nvmeotcp_drop++; mlx5e_nvmeotcp_put_queue(queue); return NULL; } @@ -238,6 +252,8 @@ mlx5e_nvmeotcp_handle_rx_skb(struct net_device *netdev, struct sk_buff *skb, frag_index); } + stats->nvmeotcp_offload_packets++; + stats->nvmeotcp_offload_bytes += cclen; mlx5e_nvmeotcp_put_queue(queue); return skb; } diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c index 2cf2042b37c7..ca7d2cb5099f 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c @@ -34,6 +34,7 @@ #include "en.h" #include "en_accel/tls.h" #include "en_accel/en_accel.h" +#include "en_accel/nvmeotcp.h" static unsigned int stats_grps_num(struct mlx5e_priv *priv) { @@ -189,6 +190,18 @@ static const struct counter_desc sw_stats_desc[] = { { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_resync_res_ok) }, { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_resync_res_skip) }, { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_tls_err) }, +#endif +#ifdef CONFIG_MLX5_EN_NVMEOTCP + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_init) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_init_fail) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_queue_teardown) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_setup) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_setup_fail) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_ddp_teardown) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_drop) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_resync) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_offload_packets) }, + { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_nvmeotcp_offload_bytes) }, #endif { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_events) }, { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, ch_poll) }, @@ -352,6 +365,18 @@ static void mlx5e_stats_grp_sw_update_stats_rq_stats(struct mlx5e_sw_stats *s, s->rx_tls_resync_res_skip += rq_stats->tls_resync_res_skip; s->rx_tls_err += rq_stats->tls_err; #endif +#ifdef CONFIG_MLX5_EN_NVMEOTCP + s->rx_nvmeotcp_queue_init += rq_stats->nvmeotcp_queue_init; + s->rx_nvmeotcp_queue_init_fail += rq_stats->nvmeotcp_queue_init_fail; + s->rx_nvmeotcp_queue_teardown += rq_stats->nvmeotcp_queue_teardown; + s->rx_nvmeotcp_ddp_setup += rq_stats->nvmeotcp_ddp_setup; + s->rx_nvmeotcp_ddp_setup_fail += rq_stats->nvmeotcp_ddp_setup_fail; + s->rx_nvmeotcp_ddp_teardown += rq_stats->nvmeotcp_ddp_teardown; + s->rx_nvmeotcp_drop += rq_stats->nvmeotcp_drop; + s->rx_nvmeotcp_resync += rq_stats->nvmeotcp_resync; + s->rx_nvmeotcp_offload_packets += rq_stats->nvmeotcp_offload_packets; + s->rx_nvmeotcp_offload_bytes += rq_stats->nvmeotcp_offload_bytes; +#endif } static void mlx5e_stats_grp_sw_update_stats_ch_stats(struct mlx5e_sw_stats *s, @@ -1612,6 +1637,18 @@ static const struct counter_desc rq_stats_desc[] = { { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, tls_resync_res_skip) }, { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, tls_err) }, #endif +#ifdef CONFIG_MLX5_EN_NVMEOTCP + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_init) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_init_fail) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_queue_teardown) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_setup) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_setup_fail) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_ddp_teardown) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_drop) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_resync) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_offload_packets) }, + { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, nvmeotcp_offload_bytes) }, +#endif }; static const struct counter_desc sq_stats_desc[] = { diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h index e41fc11f2ce7..a5cf8ec4295b 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h @@ -179,6 +179,18 @@ struct mlx5e_sw_stats { u64 rx_congst_umr; u64 rx_arfs_err; u64 rx_recover; +#ifdef CONFIG_MLX5_EN_NVMEOTCP + u64 rx_nvmeotcp_queue_init; + u64 rx_nvmeotcp_queue_init_fail; + u64 rx_nvmeotcp_queue_teardown; + u64 rx_nvmeotcp_ddp_setup; + u64 rx_nvmeotcp_ddp_setup_fail; + u64 rx_nvmeotcp_ddp_teardown; + u64 rx_nvmeotcp_drop; + u64 rx_nvmeotcp_resync; + u64 rx_nvmeotcp_offload_packets; + u64 rx_nvmeotcp_offload_bytes; +#endif u64 ch_events; u64 ch_poll; u64 ch_arm; @@ -342,6 +354,18 @@ struct mlx5e_rq_stats { u64 tls_resync_res_skip; u64 tls_err; #endif +#ifdef CONFIG_MLX5_EN_NVMEOTCP + u64 nvmeotcp_queue_init; + u64 nvmeotcp_queue_init_fail; + u64 nvmeotcp_queue_teardown; + u64 nvmeotcp_ddp_setup; + u64 nvmeotcp_ddp_setup_fail; + u64 nvmeotcp_ddp_teardown; + u64 nvmeotcp_drop; + u64 nvmeotcp_resync; + u64 nvmeotcp_offload_packets; + u64 nvmeotcp_offload_bytes; +#endif }; struct mlx5e_sq_stats { From patchwork Thu Jan 14 15:10:33 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Boris Pismenny X-Patchwork-Id: 363376 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-11.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id ABC24C433E9 for ; Thu, 14 Jan 2021 15:11:59 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 84B2E23A7F for ; Thu, 14 Jan 2021 15:11:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729354AbhANPLq (ORCPT ); Thu, 14 Jan 2021 10:11:46 -0500 Received: from mail-il-dmz.mellanox.com ([193.47.165.129]:39030 "EHLO mellanox.co.il" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1729317AbhANPLo (ORCPT ); Thu, 14 Jan 2021 10:11:44 -0500 Received: from Internal Mail-Server by MTLPINE1 (envelope-from borisp@mellanox.com) with SMTP; 14 Jan 2021 17:10:46 +0200 Received: from gen-l-vrt-133.mtl.labs.mlnx. (gen-l-vrt-133.mtl.labs.mlnx [10.237.11.160]) by labmailer.mlnx (8.13.8/8.13.8) with ESMTP id 10EFAh0L000835; Thu, 14 Jan 2021 17:10:45 +0200 From: Boris Pismenny To: kuba@kernel.org, davem@davemloft.net, saeedm@nvidia.com, hch@lst.de, sagi@grimberg.me, axboe@fb.com, kbusch@kernel.org, viro@zeniv.linux.org.uk, edumazet@google.com, dsahern@gmail.com, smalin@marvell.com Cc: boris.pismenny@gmail.com, linux-nvme@lists.infradead.org, netdev@vger.kernel.org, benishay@nvidia.com, ogerlitz@nvidia.com, yorayz@nvidia.com Subject: [PATCH v2 net-next 21/21] Documentation: add TCP DDP offload documentation Date: Thu, 14 Jan 2021 17:10:33 +0200 Message-Id: <20210114151033.13020-22-borisp@mellanox.com> X-Mailer: git-send-email 2.24.1 In-Reply-To: <20210114151033.13020-1-borisp@mellanox.com> References: <20210114151033.13020-1-borisp@mellanox.com> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Add documentation to describe NIC direct data placement offload for protocol layered over TCP. Use NVMe-TCP as an example. --- Documentation/networking/index.rst | 1 + Documentation/networking/tcp-ddp-offload.rst | 296 +++++++++++++++++++ 2 files changed, 297 insertions(+) create mode 100644 Documentation/networking/tcp-ddp-offload.rst diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b8a29997d433..99644159a0cc 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -99,6 +99,7 @@ Contents: sysfs-tagging tc-actions-env-rules tcp-thin + tcp-ddp-offload team timestamping tipc diff --git a/Documentation/networking/tcp-ddp-offload.rst b/Documentation/networking/tcp-ddp-offload.rst new file mode 100644 index 000000000000..1607e8210968 --- /dev/null +++ b/Documentation/networking/tcp-ddp-offload.rst @@ -0,0 +1,296 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) + +================================= +TCP direct data placement offload +================================= + +Overview +======== + +The Linux kernel TCP direct data placement (DDP) offload infrastructure +provides tagged request-response protocols, such as NVMe-TCP, the ability to +place response data directly in pre-registered buffers according to header +tags. DDP is particularly useful for data-intensive pipelined protocols whose +responses may be reordered. + +For example, in NVMe-TCP numerous read requests are sent together and each +request is tagged using the PDU header CID field. Receiving servers process +requests as fast as possible and sometimes responses for smaller requests +bypasses responses to larger requests, i.e., read 4KB bypasses read 1GB. +Thereafter, clients corrleate responses to requests using PDU header CID tags. +The processing of each response requires copying data from SKBs to read +request destination buffers; The offload avoids this copy. The offload is +oblivious to destination buffers which can reside either in userspace +(O_DIRECT) or in kernel pagecache. + +Request TCP byte-stream: + +.. parsed-literal:: + + +---------------+-------+---------------+-------+---------------+-------+ + | PDU hdr CID=1 | Req 1 | PDU hdr CID=2 | Req 2 | PDU hdr CID=3 | Req 3 | + +---------------+-------+---------------+-------+---------------+-------+ + +Response TCP byte-stream: + +.. parsed-literal:: + + +---------------+--------+---------------+--------+---------------+--------+ + | PDU hdr CID=2 | Resp 2 | PDU hdr CID=3 | Resp 3 | PDU hdr CID=1 | Resp 1 | + +---------------+--------+---------------+--------+---------------+--------+ + +Offloading requires no new SKB bits. Instead, the driver builds SKB page +fragments that point destination buffers. Consequently, SKBs represent the +original data on the wire, which enables *transparent* inter-operation with the +network stack. To avoid copies between SKBs and destination buffers, the +layer-5 protocol (L5P) will check ``if (src == dst)`` for SKB page fragments, +success indicates that data is already placed there by NIC hardware and copy +should be skipped. + +Offloading does require NIC hardware to track L5P procotol framing, similarly +to RX TLS offload (see documentation at +:ref:`Documentation/networking/tls-offload.rst `). NIC hardware +will parse PDU headers extract fields such as operation type, length, ,tag +identifier, etc. and offload only segments that correspond to tags registered +with the NIC, see the :ref:`buf_reg` section. + +Device configuration +==================== + +During driver initialization the device sets the ``NETIF_F_HW_TCP_DDP`` and +feature and installs its +:c:type:`struct tcp_ddp_ops ` +pointer in the :c:member:`tcp_ddp_ops` member of the +:c:type:`struct net_device `. + +Later, after the L5P completes its handshake offload is installed on the socket. +If offload installation fails, then the connection is handled by software as if +offload was not attempted. Offload installation should configure + +To request offload for a socket `sk`, the L5P calls :c:member:`tcp_ddp_sk_add`: + +.. code-block:: c + + int (*tcp_ddp_sk_add)(struct net_device *netdev, + struct sock *sk, + struct tcp_ddp_config *config); + +The function return 0 for success. In case of failure, L5P software should +fallback to normal non-offloaded operation. The `config` parameter indicates +the L5P type and any metadata relevant for that protocol. For example, in +NVMe-TCP the following config is used: + +.. code-block:: c + + /** + * struct nvme_tcp_ddp_config - nvme tcp ddp configuration for an IO queue + * + * @pfv: pdu version (e.g., NVME_TCP_PFV_1_0) + * @cpda: controller pdu data alignmend (dwords, 0's based) + * @dgst: digest types enabled. + * The netdev will offload crc if ddp_crc is supported. + * @queue_size: number of nvme-tcp IO queue elements + * @queue_id: queue identifier + * @cpu_io: cpu core running the IO thread for this queue + */ + struct nvme_tcp_ddp_config { + struct tcp_ddp_config cfg; + + u16 pfv; + u8 cpda; + u8 dgst; + int queue_size; + int queue_id; + int io_cpu; + }; + +When offload is not needed anymore, e.g., the socket is being released, the L5P +calls :c:member:`tcp_ddp_sk_del` to release device contexts: + +.. code-block:: c + + void (*tcp_ddp_sk_del)(struct net_device *netdev, + struct sock *sk); + +Normal operation +================ + +At the very least, the device maintains the following state for each connection: + + * 5-tuple + * expected TCP sequence number + * mapping between tags and corresponding buffers + * current offset within PDU, PDU length, current PDU tag + +NICs should not assume any correleation between PDUs and TCP packets. Assuming +that TCP packets arrive in-order, offload will place PDU payload directly +inside corresponding registered buffers. No packets are to be delayed by NIC +offload. If offload is not possible, than the packet is to be passed as-is to +software. To perform offload on incoming packets without buffering packets in +the NIC, the NIC stores some inter-packet state, such as partial PDU headers. + +RX data-path +------------ + +After the device validates TCP checksums, it can perform DDP offload. The +packet is steered to the DDP offload context according to the 5-tuple. +Thereafter, the expected TCP sequence number is checked against the packet's +TCP sequence number. If there's a match, then offload is performed: PDU payload +is DMA written to corresponding destination buffer according to the PDU header +tag. The data should be DMAed only once, and the NIC receive ring will only +store the remaining TCP and PDU headers. + +We remark that a single TCP packet may have numerous PDUs embedded inside. NICs +can choose to offload one or more of these PDUs according to various +trade-offs. Possibly, offloading such small PDUs is of little value, and it is +better to leave it to software. + +Upon receiving a DDP offloaded packet, the driver reconstructs the original SKB +using page frags, while pointing to the destination buffers whenever possible. +This method enables seemless integration with the network stack, which can +inspect and modify packet fields transperently to the offload. + +.. _buf_reg: + +Destination buffer registration +------------------------------- + +To register the mapping betwteen tags and destination buffers for a socket +`sk`, the L5P calls :c:member:`tcp_ddp_setup` of :c:type:`struct tcp_ddp_ops +`: + +.. code-block:: c + + int (*tcp_ddp_setup)(struct net_device *netdev, + struct sock *sk, + struct tcp_ddp_io *io); + + +The `io` provides the buffer via scatter-gather list (`sg_table`) and +corresponding tag (`command_id`): + +.. code-block:: c + /** + * struct tcp_ddp_io - tcp ddp configuration for an IO request. + * + * @command_id: identifier on the wire associated with these buffers + * @nents: number of entries in the sg_table + * @sg_table: describing the buffers for this IO request + * @first_sgl: first SGL in sg_table + */ + struct tcp_ddp_io { + u32 command_id; + int nents; + struct sg_table sg_table; + struct scatterlist first_sgl[SG_CHUNK_SIZE]; + }; + +After the buffers have been consumed by the L5P, to release the NIC mapping of +buffers the L5P calls :c:member:`tcp_ddp_teardown` of :c:type:`struct +tcp_ddp_ops `: + +.. code-block:: c + + int (*tcp_ddp_teardown)(struct net_device *netdev, + struct sock *sk, + struct tcp_ddp_io *io, + void *ddp_ctx); + +`tcp_ddp_teardown` receives the same `io` context and an additional opaque +`ddp_ctx` that is used for asynchronous teardown, see the :ref:`async_release` +section. + +.. _async_release: + +Asynchronous teardown +--------------------- + +To teardown the association between tags and buffers and allow tag reuse NIC HW +is called by the NIC driver during `tcp_ddp_teardown`. This operation may be +performed either synchronously or asynchronously. In asynchronous teardown, +`tcp_ddp_teardown` returns immediately without unmapping NIC HW buffers. Later, +when the unmapping completes by NIC HW, the NIC driver will call up to L5P +using :c:member:`ddp_teardown_done` of :c:type:`struct tcp_ddp_ulp_ops`: + +.. code-block:: c + + void (*ddp_teardown_done)(void *ddp_ctx); + +The `ddp_ctx` parameter passed in `ddp_teardown_done` is the same on provided +in `tcp_ddp_teardown` and it is used to carry some context about the buffers +and tags that are released. + +Resync handling +=============== + +In presence of packet drops or network packet reordering, the device may lose +synchronization between the TCP stream and the L5P framing, and require a +resync with the kernel's TCP stack. When the device is out of sync, no offload +takes place, and packets are passed as-is to software. (resync is very similar +to TLS offload (see documentation at +:ref:`Documentation/networking/tls-offload.rst `) + +If only packets with L5P data are lost or reordered, then resynchronization may +be avoided by NIC HW that keeps tracking PDU headers. If, however, PDU headers +are reordered, then resynchronization is necessary. + +To resynchronize hardware during traffic, we use a handshake between hardware +and software. The NIC HW searches for a sequence of bytes that identifies L5P +headers (i.e., magic pattern). For example, in NVMe-TCP, the PDU operation +type can be used for this purpose. Using the PDU header length field, the NIC +HW will continue to find and match magic patterns in subsequent PDU headers. If +the pattern is missing in an expected position, then searching for the pattern +starts anew. + +The NIC will not resume offload when the magic pattern is first identified. +Instead, it will request L5P software to confirm that indeed this is a PDU +header. To request confirmation the NIC driver calls up to L5P using +:c:member:`*resync_request` of :c:type:`struct tcp_ddp_ulp_ops`: + +.. code-block:: c + + bool (*resync_request)(struct sock *sk, u32 seq, u32 flags); + +The `seq` field contains the TCP sequence of the last byte in the PDU header. +L5P software will respond to this request after observing the packet containing +TCP sequence `seq` in-order. If the PDU header is indeed there, then L5P +software calls the NIC driver using the :c:member:`tcp_ddp_resync` function of +the :c:type:`struct tcp_ddp_ops ` inside the :c:type:`struct +net_device ` while passing the same `seq` to confirm it is a PDU +header. + +.. code-block:: c + + void (*tcp_ddp_resync)(struct net_device *netdev, + struct sock *sk, u32 seq); + +Statistics +========== + +Per L5P protocol, the following NIC driver must report statistics for the above +netdevice operations and packets processed by offload. For example, NVMe-TCP +offload reports: + + * ``rx_nvmeotcp_queue_init`` - number of NVMe-TCP offload contexts created. + * ``rx_nvmeotcp_queue_teardown`` - number of NVMe-TCP offload contexts + destroyed. + * ``rx_nvmeotcp_ddp_setup`` - number of DDP buffers mapped. + * ``rx_nvmeotcp_ddp_setup_fail`` - number of DDP buffers mapping that failed. + * ``rx_nvmeotcp_ddp_teardown`` - number of DDP buffers unmapped. + * ``rx_nvmeotcp_drop`` - number of packets dropped in the driver due to fatal + errors. + * ``rx_nvmeotcp_resync`` - number of packets with resync requests. + * ``rx_nvmeotcp_offload_packets`` - number of packets that used offload. + * ``rx_nvmeotcp_offload_bytes`` - number of bytes placed in DDP buffers. + +NIC requirements +================ + +NIC hardware should meet the following requirements to provide this offload: + + * Offload must never buffer TCP packets. + * Offload must never modify TCP packet headers. + * Offload must never reorder TCP packets within a flow. + * Offload must never drop TCP packets. + * Offload must not depend on any TCP fields beyond the + 5-tuple and TCP sequence number.