From patchwork Tue Jul 30 02:26:06 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 815497 Received: from mail-yb1-f201.google.com (mail-yb1-f201.google.com [209.85.219.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 484E67A724 for ; Tue, 30 Jul 2024 02:26:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306399; cv=none; b=pFOwGWjYpgoS55XR7zT7HjATg7Gv2c6ExmtP8sT77Jz2F0G7HVui6SaSYOC4UWC4YhihOB3rDiS9lo98Gk2dvmPUWHfH+P7PqCJ7Qn0BhzahHGdD66mdl7CBIhjK4qIKGPNhAzJSh0rIcH0lkqmr8NTeSSj0ZTJyH+ApRtPCBSc= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306399; c=relaxed/simple; bh=Y4cSC9AbCZ/DYKUv7lbClfDwIFTwTH36GaMwbf76APc=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=KDSDSGChkXCsbB+55EyRDgAlIsPOXFN2jViK0YUokRcveW2Aidqqt2MfZITyxkSAElSAVqFmnZLJUtGc8F2zakFaO/qaK9FSwf1MH0KLXdoeeKeF/A/orgHqsIdRuoYOPTLF0hjWwvrcyCz/knTThYyArFSqF3wRDNaDGNIA73A= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=p5z1jKU7; arc=none smtp.client-ip=209.85.219.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="p5z1jKU7" Received: by mail-yb1-f201.google.com with SMTP id 3f1490d57ef6-e035f7b5976so8113926276.0 for ; Mon, 29 Jul 2024 19:26:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1722306393; x=1722911193; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=SW3nR/c8w2jWnWQOFC0hGYdn6iDlxfge6uVUTKHPFX0=; b=p5z1jKU7GvzbXFaan5E6xxYoX94+6ZLQBJHiVusxcnLZpj+Oa0/caqNas/zgJtExeL pz4kMqBe8tROlTExIyajGF/EoXbBk8JBo7qJtJf+v1Itluh92B7rzVMgvnt55DOz740V D+wW3IiiTBefwffxXND9lqqEkr7pybf2l69Eh6t+ld44ZUJUx/tTMJ1yy7rIC0NvwQA4 cm1JjMxK6MYgIWBccuubIqdUkcYKIC2DRrH1Ua9Qr/cExDLiu5A0bseccSCSU0l+SHB0 3I/NvtEaMnxHlPyiyEcHkGi6ndjNQ0k4LSGEn6sIzAc8COgY6vssjxl8K2Mq89UWV/oO QV4Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722306393; x=1722911193; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=SW3nR/c8w2jWnWQOFC0hGYdn6iDlxfge6uVUTKHPFX0=; b=adSHPDjzD7I6jzRPmtRKMXSRq3wVqAGaAohNsW0wPAMNTlRT20Mzjjaz17SrUFWr7t urmDQpdZRnLjILypqtn7fUt07oKdVXxOwYOkWnSuBf+OI229TlvU91U8uJ/45N7RH09Y a9ua/0PtH9NRnuVuGOoVjCTygyxy/Egsym6CSLUofyUn6Vf0Fc9evC7QtKTxq+je4PiL CJzV9x4GdP3qScPzmErkIkG7NwZrWRL6N13YzlVAZPFZ8jhrGO2L3u9yunEA3N/ZtQJp AZdeqdziUC0b0S4fLbCdDhWuSaxeZSYrIMl2/5nwbzmLzR8M/5/OvDpQBsGMQwXWXy8R NJOQ== X-Forwarded-Encrypted: i=1; AJvYcCVC+9T1hc75jorMNhe/H5sWieEKSTo3CO0JY25yq7XjlInDqoXSrdrKCvgtmM49s5GoE47Oyowuq96oUPUm3CfWRAI50cJ4eufnWQRwBvd+ X-Gm-Message-State: AOJu0Yx2USAf87Jxu8wg3N7WFadSvuacASfQhCVhANug/kiAHRb6zJjA PT7j9m+Y8vRHwsRb0etWeu5qcE+zF3cZaZSISUd7k0rHPCRCxwhRdu7GG8lVun3hnXX9SPYctEb A/dZe5LCLeKbWzXUu2F01Cg== X-Google-Smtp-Source: AGHT+IEazFT2FMavtQ0rD3fqQd5wrThlN6RKIWPCBREcwqatea4sFXCde9QYgu0BrhSVshaY8VaI2ZppPQVw5FMrEw== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a05:6902:100b:b0:e0b:4dd5:398f with SMTP id 3f1490d57ef6-e0b9e1f3833mr13520276.5.1722306392910; Mon, 29 Jul 2024 19:26:32 -0700 (PDT) Date: Tue, 30 Jul 2024 02:26:06 +0000 In-Reply-To: <20240730022623.98909-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240730022623.98909-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.rc1.232.g9752f9e123-goog Message-ID: <20240730022623.98909-3-almasrymina@google.com> Subject: [PATCH net-next v17 02/14] net: netdev netlink api to bind dma-buf to a net device From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, bpf@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Donald Hunter , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Stanislav Fomichev API takes the dma-buf fd as input, and binds it to the netdevice. The user can specify the rx queues to bind the dma-buf to. Suggested-by: Stanislav Fomichev Signed-off-by: Mina Almasry Reviewed-by: Donald Hunter Reviewed-by: Jakub Kicinski --- v16: - Use subset-of: queue queue-id instead of queue-dmabuf (Jakub). - Rename attribute 'bind-dmabuf' to more generic 'dmabuf' (Jakub). - Use 'dmabuf' everywhere instead of mix of 'dma-buf' and 'dmabuf' (Donald). - Remove repetitive 'dmabuf' naming that appeared in some places (Jakub). - Reordered where the new operations went so I don't break the enum UAPI (Jakub). v7: - Use flags: [ admin-perm ] instead of a CAP_NET_ADMIN check. Changes in v1: - Add rx-queue-type to distingish rx from tx (Jakub) - Return dma-buf ID from netlink API (David, Stan) Changes in RFC-v3: - Support binding multiple rx rx-queues --- Documentation/netlink/specs/netdev.yaml | 47 +++++++++++++++++++++++++ include/uapi/linux/netdev.h | 11 ++++++ net/core/netdev-genl-gen.c | 19 ++++++++++ net/core/netdev-genl-gen.h | 2 ++ net/core/netdev-genl.c | 6 ++++ tools/include/uapi/linux/netdev.h | 11 ++++++ 6 files changed, 96 insertions(+) diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index 959755be4d7f9..4930e8142aa63 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -457,6 +457,39 @@ attribute-sets: Number of times driver re-started accepting send requests to this queue from the stack. type: uint + - + name: queue-id + subset-of: queue + attributes: + - + name: id + - + name: type + - + name: dmabuf + attributes: + - + name: ifindex + doc: netdev ifindex to bind the dmabuf to. + type: u32 + checks: + min: 1 + - + name: queues + doc: receive queues to bind the dmabuf to. + type: nest + nested-attributes: queue-id + multi-attr: true + - + name: fd + doc: dmabuf file descriptor to bind. + type: u32 + - + name: id + doc: id of the dmabuf binding + type: u32 + checks: + min: 1 operations: list: @@ -619,6 +652,20 @@ operations: - rx-bytes - tx-packets - tx-bytes + - + name: bind-rx + doc: Bind dmabuf to netdev + attribute-set: dmabuf + flags: [ admin-perm ] + do: + request: + attributes: + - ifindex + - fd + - queues + reply: + attributes: + - id mcast-groups: list: diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 43742ac5b00da..91bf3ecc5f1d9 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -173,6 +173,16 @@ enum { NETDEV_A_QSTATS_MAX = (__NETDEV_A_QSTATS_MAX - 1) }; +enum { + NETDEV_A_DMABUF_IFINDEX = 1, + NETDEV_A_DMABUF_QUEUES, + NETDEV_A_DMABUF_FD, + NETDEV_A_DMABUF_ID, + + __NETDEV_A_DMABUF_MAX, + NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, @@ -186,6 +196,7 @@ enum { NETDEV_CMD_QUEUE_GET, NETDEV_CMD_NAPI_GET, NETDEV_CMD_QSTATS_GET, + NETDEV_CMD_BIND_RX, __NETDEV_CMD_MAX, NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1) diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index 8350a0afa9ec7..6b7fe6035067b 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -27,6 +27,11 @@ const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFIND [NETDEV_A_PAGE_POOL_IFINDEX] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_page_pool_ifindex_range), }; +const struct nla_policy netdev_queue_id_nl_policy[NETDEV_A_QUEUE_TYPE + 1] = { + [NETDEV_A_QUEUE_ID] = { .type = NLA_U32, }, + [NETDEV_A_QUEUE_TYPE] = NLA_POLICY_MAX(NLA_U32, 1), +}; + /* NETDEV_CMD_DEV_GET - do */ static const struct nla_policy netdev_dev_get_nl_policy[NETDEV_A_DEV_IFINDEX + 1] = { [NETDEV_A_DEV_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), @@ -74,6 +79,13 @@ static const struct nla_policy netdev_qstats_get_nl_policy[NETDEV_A_QSTATS_SCOPE [NETDEV_A_QSTATS_SCOPE] = NLA_POLICY_MASK(NLA_UINT, 0x1), }; +/* NETDEV_CMD_BIND_RX - do */ +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_DMABUF_FD + 1] = { + [NETDEV_A_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), + [NETDEV_A_DMABUF_FD] = { .type = NLA_U32, }, + [NETDEV_A_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_id_nl_policy), +}; + /* Ops table for netdev */ static const struct genl_split_ops netdev_nl_ops[] = { { @@ -151,6 +163,13 @@ static const struct genl_split_ops netdev_nl_ops[] = { .maxattr = NETDEV_A_QSTATS_SCOPE, .flags = GENL_CMD_CAP_DUMP, }, + { + .cmd = NETDEV_CMD_BIND_RX, + .doit = netdev_nl_bind_rx_doit, + .policy = netdev_bind_rx_nl_policy, + .maxattr = NETDEV_A_DMABUF_FD, + .flags = GENL_ADMIN_PERM | GENL_CMD_CAP_DO, + }, }; static const struct genl_multicast_group netdev_nl_mcgrps[] = { diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h index 4db40fd5b4a9e..67c34005750c3 100644 --- a/net/core/netdev-genl-gen.h +++ b/net/core/netdev-genl-gen.h @@ -13,6 +13,7 @@ /* Common nested types */ extern const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFINDEX + 1]; +extern const struct nla_policy netdev_queue_id_nl_policy[NETDEV_A_QUEUE_TYPE + 1]; int netdev_nl_dev_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); @@ -30,6 +31,7 @@ int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); int netdev_nl_qstats_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); +int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info); enum { NETDEV_NLGRP_MGMT, diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index 05f9515d2c05c..2d726e65211dd 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -721,6 +721,12 @@ int netdev_nl_qstats_get_dumpit(struct sk_buff *skb, return err; } +/* Stub */ +int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + static int netdev_genl_netdevice_event(struct notifier_block *nb, unsigned long event, void *ptr) { diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index 43742ac5b00da..91bf3ecc5f1d9 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -173,6 +173,16 @@ enum { NETDEV_A_QSTATS_MAX = (__NETDEV_A_QSTATS_MAX - 1) }; +enum { + NETDEV_A_DMABUF_IFINDEX = 1, + NETDEV_A_DMABUF_QUEUES, + NETDEV_A_DMABUF_FD, + NETDEV_A_DMABUF_ID, + + __NETDEV_A_DMABUF_MAX, + NETDEV_A_DMABUF_MAX = (__NETDEV_A_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, @@ -186,6 +196,7 @@ enum { NETDEV_CMD_QUEUE_GET, NETDEV_CMD_NAPI_GET, NETDEV_CMD_QSTATS_GET, + NETDEV_CMD_BIND_RX, __NETDEV_CMD_MAX, NETDEV_CMD_MAX = (__NETDEV_CMD_MAX - 1) From patchwork Tue Jul 30 02:26:08 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 815496 Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CEAAE14F9F9 for ; Tue, 30 Jul 2024 02:26:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.214.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306401; cv=none; b=ICeTkJ5kgMSrNiy+bdFP1JvqvPR2KnW+zznl0YzeDSbGTf1i0X6ekTEXJogj+zaecnyk7Wa6fzvnqj6lJBKkjKoBtX1Vq1Z7J5ukIxggDDGMxXZsbTDvogi9GojKy0Molkl5j9sLdWt3xwmmhezhO3fdge6zF+URv2qGctiiUoM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306401; c=relaxed/simple; bh=dNmQtYy68N/SLiUBNq045aQrib8pOh18k72EvrDuAAE=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=K46rXCYJTp/p9ODXDBePYy/mxrKfhOGw7N8f70pDoh6NKcx03aopVW1LSrJctq73fvEI/A9Lug7SW4fXOKoU+o+JbKV1gKMK0EXBW4KUUyvKQIAwToujAn68POKCTaVxoAQ/8zvtISOKrZt2egtDPyZHDcDMzNO7pgS/xBL5nUA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=TAcAkG5M; arc=none smtp.client-ip=209.85.214.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="TAcAkG5M" Received: by mail-pl1-f201.google.com with SMTP id d9443c01a7336-1fda155bc43so29914635ad.3 for ; Mon, 29 Jul 2024 19:26:37 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1722306397; x=1722911197; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=8cpgk7A1Ci0fLZZgBR1RDLKJO/LQnlu3QBndF/2loSU=; b=TAcAkG5Mn7Twf2Fv5G0iXo5JzEEbl7akVcbftdXwLTvkEgDoGcLP5hl0ComRENaINK o3V4RvuFsQ4STu5yzfB8uoQhzUEZOQKr2NpGASL/5md181+2a1Q6g3eo9YRiFzv7Rp1l xoXyoMP+d4io2AYVuQ1oNMjuRqgVCK+z/SpFUs7eyYkr7+L5v6bd1TP6PLap/q+66kvg EwL4AIxy4BVYQjQ5EO0v/2ybA5cIJcJqXmks3pptO9sDmh6gsjuNtGrVE9v356m32RPT +xz8vSrXIYcemR6v0Wr8/+ZMmwIoyiN3yPyD6R5jQiuYbuC0AxtablzQaKrqzlZy2s64 BlqA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722306397; x=1722911197; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=8cpgk7A1Ci0fLZZgBR1RDLKJO/LQnlu3QBndF/2loSU=; b=U3lwgWtaEkchn9C/FzvQqzH0TTk9Ay9uVVH41riBbvyRyoMgj23k1Xgt2/jhhQoftt /0uh74WwyvxBqJQpOr/wDs6bvr+p2r3ySON1oVCaZeJYJ+Q1TiEuelyCRrnwP5/N0btw vvrimex4o5sAkS6EwHpfqreYUImYhsWV+dK4Nm87bkK0D6AO2MdeZAoWH8n2/Sf+o7bF uCnTT0yJCmSW+9YsFFmpUv3ncCMxzYniVooLmDZTS4N6jH38pFm7mE8HS5IEoqjlZ0v3 aLdwV1AkTlfbwX2cs6VZZERcu87lxtP4oX6PK8/LFPN1IApFO6B2DZGsVIP73L4aMxQm IIBw== X-Forwarded-Encrypted: i=1; AJvYcCVHfbeqd4oEzUM2RQKmNQ1j/g9tfW1EiFfueYl8+IwfmqCHglrZJfqYpHpjxORapCNNXMluUSevd/mT4TAF4jVCxpaHD2ATAUBA6NIiMiG+ X-Gm-Message-State: AOJu0Yy0BWAAULR0HHn+4jyn2dNchWMc4JmHG4pDOiJSMeg9zB1X3wqp GWBkli9QoJpc1Dq7PawS5/v3rLj3S77DxG/6UeRAk3RNrY2Z8w4mWhxnEyYip7va8nPVDf9J9dn Bv3HUaPf0B/6pvUmIhJg7JQ== X-Google-Smtp-Source: AGHT+IFUJSMorXRhdXYvyK6tSmZn4zMz3HIyw1VyM+jL6c8F5EfQ4VozIOl2s1shRCqIeczmwAPxygQeJAkoT8ZZyA== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a17:903:1d2:b0:1fa:fc15:c513 with SMTP id d9443c01a7336-1ff04872d65mr4166475ad.9.1722306396807; Mon, 29 Jul 2024 19:26:36 -0700 (PDT) Date: Tue, 30 Jul 2024 02:26:08 +0000 In-Reply-To: <20240730022623.98909-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240730022623.98909-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.rc1.232.g9752f9e123-goog Message-ID: <20240730022623.98909-5-almasrymina@google.com> Subject: [PATCH net-next v17 04/14] netdev: netdevice devmem allocator From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, bpf@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Donald Hunter , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Willem de Bruijn , Kaiyuan Zhang Implement netdev devmem allocator. The allocator takes a given struct netdev_dmabuf_binding as input and allocates net_iov from that binding. The allocation simply delegates to the binding's genpool for the allocation logic and wraps the returned memory region in a net_iov struct. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov --- v17: - Don't acquire a binding ref for every allocation (Jakub). v11: - Fix extraneous inline directive (Paolo) v8: - Rename netdev_dmabuf_binding -> net_devmem_dmabuf_binding to avoid patch-by-patch build error. - Move niov->pp_magic/pp/pp_ref_counter usage to later patch to avoid patch-by-patch build error. v7: - netdev_ -> net_devmem_* naming (Yunsheng). v6: - Add comment on net_iov_dma_addr to explain why we don't use niov->dma_addr (Pavel) - Refactor new functions into net/core/devmem.c (Pavel) v1: - Rename devmem -> dmabuf (David). --- include/net/devmem.h | 13 +++++++++++++ include/net/netmem.h | 18 ++++++++++++++++++ net/core/devmem.c | 40 ++++++++++++++++++++++++++++++++++++++++ 3 files changed, 71 insertions(+) diff --git a/include/net/devmem.h b/include/net/devmem.h index c7bd6a0a6b9e9..2e7cc46d4d3ca 100644 --- a/include/net/devmem.h +++ b/include/net/devmem.h @@ -69,7 +69,20 @@ void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding); int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, struct net_devmem_dmabuf_binding *binding); void dev_dmabuf_uninstall(struct net_device *dev); +struct net_iov * +net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding); +void net_devmem_free_dmabuf(struct net_iov *ppiov); #else +static inline struct net_iov * +net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) +{ + return NULL; +} + +static inline void net_devmem_free_dmabuf(struct net_iov *ppiov) +{ +} + static inline void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) { diff --git a/include/net/netmem.h b/include/net/netmem.h index 41e96c2f94b5c..664df8325ece5 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -14,8 +14,26 @@ struct net_iov { struct dmabuf_genpool_chunk_owner *owner; + unsigned long dma_addr; }; +static inline struct dmabuf_genpool_chunk_owner * +net_iov_owner(const struct net_iov *niov) +{ + return niov->owner; +} + +static inline unsigned int net_iov_idx(const struct net_iov *niov) +{ + return niov - net_iov_owner(niov)->niovs; +} + +static inline struct net_devmem_dmabuf_binding * +net_iov_binding(const struct net_iov *niov) +{ + return net_iov_owner(niov)->binding; +} + /* netmem */ /** diff --git a/net/core/devmem.c b/net/core/devmem.c index 9a357235bde8f..3f73d0bda023f 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -32,6 +32,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool, kfree(owner); } +static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov) +{ + struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov); + + return owner->base_dma_addr + + ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT); +} + void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) { size_t size, avail; @@ -54,6 +62,38 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) kfree(binding); } +struct net_iov * +net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) +{ + struct dmabuf_genpool_chunk_owner *owner; + unsigned long dma_addr; + struct net_iov *niov; + ssize_t offset; + ssize_t index; + + dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE, + (void **)&owner); + if (!dma_addr) + return NULL; + + offset = dma_addr - owner->base_dma_addr; + index = offset / PAGE_SIZE; + niov = &owner->niovs[index]; + + niov->dma_addr = 0; + + return niov; +} + +void net_devmem_free_dmabuf(struct net_iov *niov) +{ + struct net_devmem_dmabuf_binding *binding = net_iov_binding(niov); + unsigned long dma_addr = net_devmem_get_dma_addr(niov); + + if (gen_pool_has_addr(binding->chunk_pool, dma_addr, PAGE_SIZE)) + gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); +} + /* Protected by rtnl_lock() */ static DEFINE_XARRAY_FLAGS(net_devmem_dmabuf_bindings, XA_FLAGS_ALLOC1); From patchwork Tue Jul 30 02:26:10 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 815495 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DC4CC166300 for ; Tue, 30 Jul 2024 02:26:41 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306408; cv=none; b=A1Y53VIxhQSRI7ARrar1GB8bTMfs73FDM6CJXGCB/XzmXJCj1+VkO2VHd7WQd5B/Px4ngFu/tafna+mS8GruvK+VRyFl22YpFvu5oSvbFNw3aQIkcgoxPO1tRXJL4kRcTN/yQbq5z8gCtOV3RM70xlE0P6h3iSCY1cpHH02itNs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306408; c=relaxed/simple; bh=0RWXDr5VzQMERBlF+jBiUSqoB7RHJYh7v6Uv4n8N/Ko=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=K0G1yIZcc0ctgbG/Osyouw258HdyCUP8uw/iJGMU7fswyCWG2TxQHbsvQz4OTNaaw8qnF18qAWb4YrYlRKQSHz6A1vQzoDuD3+HyNh6u8UDeztScqmr/St2TNh+fSWZbUyS2YOY1hiBl3VBRvCndWvzdBdZ7qwv/Y/vZMSxmoDk= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=RPfbmrS/; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="RPfbmrS/" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6688c44060fso75017127b3.2 for ; Mon, 29 Jul 2024 19:26:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1722306400; x=1722911200; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=kTSW0jcxZ96GuqfHfNFkzPLfDAIUTOIVgXo8FwVFlTs=; b=RPfbmrS/Nek8EzzkfLkP2+KQ21aqEHQ+oBv7r1LDoF/c45pgYohqV4/xkPzjEpSlFx IQzSHQEQgzWjXUgQT5cKHrRRacbg70ZLeLzKwSCTfxOx2Muj5YZEW+5sEW3Mt9MAOo6Y eclws8j3N3fq4IkAHQlxH52ol7PeDrxmdwPyWSd6vsiSEFdktczB6vjv5fYe6FdcVGtJ 6ow3BqvkRFtxNBBhPWeLKSU6huP+i8tRIE6FpUNPc32JOy61Vb7FakrwNw5P0WjepydD Qbn7FEMBt9BaVwfM0CUtx4Qx4PvYwpMn8N57A1uENKgwH/gz8by6ZhOJ8u9SijA7rHZ/ y4YQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722306400; x=1722911200; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=kTSW0jcxZ96GuqfHfNFkzPLfDAIUTOIVgXo8FwVFlTs=; b=T9btnWYMmR71bsxjRQqfPu23K5kuPtgzBTihA/mnVjJWet68ovqnxS5aXzvwE94IVe ZaHH1tzE2q6qIiAECA5otQdibIjGURWL7Yj8MX4HUYRgEhjMtlBK67oJsAwLTZMQLmZY ThRrNper/HV7+FKQJSTb0RaK3jEh87GDcXKXID2K+XU3Lg3GmCT6a7cB8X2Of4bqNCD8 THrDvKsvTk9OmV7wHhAHsSy44O/B2R62GOlbUseiv1J/lm6vGvKCNe5DlEU9cVAaW0f/ /tKFHn41hIraag55z/so4/WpGXv8+MzrERvwim0p3ySlib0P5zH7PRwtDM8EsolHiXbU MxkQ== X-Forwarded-Encrypted: i=1; AJvYcCUkUgH7lW/Gm2TGM3cnClv76xfX7cEg1HO8RfEOwXsHXtC/w9u/hbH/pI/e2U7Sc8qf+wlDd3M8oCp6dXS6dwxrc5vXCpGwpN6milKQbEKN X-Gm-Message-State: AOJu0YwkNTKZSNa6lpbbZRTe9VZKj6xDhS+SDTTRUiRV0BH+ZIPHPfgq XBHIwBl/SHCDTVdYSmvAFUCEVfPW0yOei4blliM/QcTsDyM4Q5BysqKRXgHQka+a1+fqHVZks4D gzE9SyW2qsBfzPstE4q2cgw== X-Google-Smtp-Source: AGHT+IHDZoOJM2/30G3SwTOuXSHkG3EjqLMwZDDxg+up8OLC898g4pUxezJ5Euklm0IUYxOY1BrByoB6MimI8XQSfA== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a05:690c:808:b0:648:3f93:68e0 with SMTP id 00721157ae682-67a0997c413mr1023707b3.6.1722306400511; Mon, 29 Jul 2024 19:26:40 -0700 (PDT) Date: Tue, 30 Jul 2024 02:26:10 +0000 In-Reply-To: <20240730022623.98909-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240730022623.98909-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.rc1.232.g9752f9e123-goog Message-ID: <20240730022623.98909-7-almasrymina@google.com> Subject: [PATCH net-next v17 06/14] page_pool: devmem support From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, bpf@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Donald Hunter , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , linux-mm@kvack.org, Matthew Wilcox Convert netmem to be a union of struct page and struct netmem. Overload the LSB of struct netmem* to indicate that it's a net_iov, otherwise it's a page. Currently these entries in struct page are rented by the page_pool and used exclusively by the net stack: struct { unsigned long pp_magic; struct page_pool *pp; unsigned long _pp_mapping_pad; unsigned long dma_addr; atomic_long_t pp_ref_count; }; Mirror these (and only these) entries into struct net_iov and implement netmem helpers that can access these common fields regardless of whether the underlying type is page or net_iov. Implement checks for net_iov in netmem helpers which delegate to mm APIs, to ensure net_iov are never passed to the mm stack. Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov --- v17: - Rename netmem_to_pfn to netmem_pfn_trace (Jakub) - Move some low level netmem helpers to netmem_priv.h (Jakub). v13: - Move NET_IOV dependent changes to this patch. - Fixed comment (Pavel) - Applied Reviewed-by from Pavel. v9: https://lore.kernel.org/netdev/20240403002053.2376017-8-almasrymina@google.com/ - Remove CONFIG checks in netmem_is_net_iov() (Pavel/David/Jens) v7: - Remove static_branch_unlikely from netmem_to_net_iov(). We're getting better results from the fast path in bench_page_pool_simple tests without the static_branch_unlikely, and the addition of static_branch_unlikely doesn't improve performance of devmem TCP. Additionally only check netmem_to_net_iov() if CONFIG_DMA_SHARED_BUFFER is enabled, otherwise dmabuf net_iovs cannot exist anyway. net-next base: 8 cycle fast path. with static_branch_unlikely: 10 cycle fast path. without static_branch_unlikely: 9 cycle fast path. CONFIG_DMA_SHARED_BUFFER disabled: 8 cycle fast path as baseline. Performance of devmem TCP is at 95% line rate is regardless of static_branch_unlikely or not. v6: - Rebased on top of the merged netmem_ref type. - Rebased on top of the merged skb_pp_frag_ref() changes. v5: - Use netmem instead of page* with LSB set. - Use pp_ref_count for refcounting net_iov. - Removed many of the custom checks for netmem. v1: - Disable fragmentation support for iov properly. - fix napi_pp_put_page() path (Yunsheng). - Use pp_frag_count for devmem refcounting. Cc: linux-mm@kvack.org Cc: Matthew Wilcox --- include/net/netmem.h | 108 +++++++++++++++++++++++++++++-- include/net/page_pool/helpers.h | 12 ++-- include/trace/events/page_pool.h | 12 ++-- net/core/devmem.c | 3 + net/core/netmem_priv.h | 36 +++++++++++ net/core/page_pool.c | 38 ++++++----- net/core/skbuff.c | 23 ++++--- 7 files changed, 181 insertions(+), 51 deletions(-) create mode 100644 net/core/netmem_priv.h diff --git a/include/net/netmem.h b/include/net/netmem.h index 664df8325ece5..99531780e53af 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -9,14 +9,51 @@ #define _NET_NETMEM_H #include +#include /* net_iov */ +DECLARE_STATIC_KEY_FALSE(page_pool_mem_providers); + +/* We overload the LSB of the struct page pointer to indicate whether it's + * a page or net_iov. + */ +#define NET_IOV 0x01UL + struct net_iov { + unsigned long __unused_padding; + unsigned long pp_magic; + struct page_pool *pp; struct dmabuf_genpool_chunk_owner *owner; unsigned long dma_addr; + atomic_long_t pp_ref_count; }; +/* These fields in struct page are used by the page_pool and net stack: + * + * struct { + * unsigned long pp_magic; + * struct page_pool *pp; + * unsigned long _pp_mapping_pad; + * unsigned long dma_addr; + * atomic_long_t pp_ref_count; + * }; + * + * We mirror the page_pool fields here so the page_pool can access these fields + * without worrying whether the underlying fields belong to a page or net_iov. + * + * The non-net stack fields of struct page are private to the mm stack and must + * never be mirrored to net_iov. + */ +#define NET_IOV_ASSERT_OFFSET(pg, iov) \ + static_assert(offsetof(struct page, pg) == \ + offsetof(struct net_iov, iov)) +NET_IOV_ASSERT_OFFSET(pp_magic, pp_magic); +NET_IOV_ASSERT_OFFSET(pp, pp); +NET_IOV_ASSERT_OFFSET(dma_addr, dma_addr); +NET_IOV_ASSERT_OFFSET(pp_ref_count, pp_ref_count); +#undef NET_IOV_ASSERT_OFFSET + static inline struct dmabuf_genpool_chunk_owner * net_iov_owner(const struct net_iov *niov) { @@ -47,20 +84,22 @@ net_iov_binding(const struct net_iov *niov) */ typedef unsigned long __bitwise netmem_ref; +static inline bool netmem_is_net_iov(const netmem_ref netmem) +{ + return (__force unsigned long)netmem & NET_IOV; +} + /* This conversion fails (returns NULL) if the netmem_ref is not struct page * backed. - * - * Currently struct page is the only possible netmem, and this helper never - * fails. */ static inline struct page *netmem_to_page(netmem_ref netmem) { + if (WARN_ON_ONCE(netmem_is_net_iov(netmem))) + return NULL; + return (__force struct page *)netmem; } -/* Converting from page to netmem is always safe, because a page can always be - * a netmem. - */ static inline netmem_ref page_to_netmem(struct page *page) { return (__force netmem_ref)page; @@ -68,17 +107,72 @@ static inline netmem_ref page_to_netmem(struct page *page) static inline int netmem_ref_count(netmem_ref netmem) { + /* The non-pp refcount of net_iov is always 1. On net_iov, we only + * support pp refcounting which uses the pp_ref_count field. + */ + if (netmem_is_net_iov(netmem)) + return 1; + return page_ref_count(netmem_to_page(netmem)); } -static inline unsigned long netmem_to_pfn(netmem_ref netmem) +static inline unsigned long netmem_pfn_trace(netmem_ref netmem) { + if (netmem_is_net_iov(netmem)) + return 0; + return page_to_pfn(netmem_to_page(netmem)); } +static inline struct net_iov *__netmem_clear_lsb(netmem_ref netmem) +{ + return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV); +} + +static inline struct page_pool *netmem_get_pp(netmem_ref netmem) +{ + return __netmem_clear_lsb(netmem)->pp; +} + +static inline atomic_long_t *netmem_get_pp_ref_count_ref(netmem_ref netmem) +{ + return &__netmem_clear_lsb(netmem)->pp_ref_count; +} + +static inline bool netmem_is_pref_nid(netmem_ref netmem, int pref_nid) +{ + /* Assume net_iov are on the preferred node without actually + * checking... + * + * This check is only used to check for recycling memory in the page + * pool's fast paths. Currently the only implementation of net_iov + * is dmabuf device memory. It's a deliberate decision by the user to + * bind a certain dmabuf to a certain netdev, and the netdev rx queue + * would not be able to reallocate memory from another dmabuf that + * exists on the preferred node, so, this check doesn't make much sense + * in this case. Assume all net_iovs can be recycled for now. + */ + if (netmem_is_net_iov(netmem)) + return true; + + return page_to_nid(netmem_to_page(netmem)) == pref_nid; +} + static inline netmem_ref netmem_compound_head(netmem_ref netmem) { + /* niov are never compounded */ + if (netmem_is_net_iov(netmem)) + return netmem; + return page_to_netmem(compound_head(netmem_to_page(netmem))); } +static inline void *netmem_address(netmem_ref netmem) +{ + if (netmem_is_net_iov(netmem)) + return NULL; + + return page_address(netmem_to_page(netmem)); +} + #endif /* _NET_NETMEM_H */ diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index 8f27ecc00bb16..50c1efd4018fd 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -216,7 +216,7 @@ page_pool_get_dma_dir(const struct page_pool *pool) static inline void page_pool_fragment_netmem(netmem_ref netmem, long nr) { - atomic_long_set(&netmem_to_page(netmem)->pp_ref_count, nr); + atomic_long_set(netmem_get_pp_ref_count_ref(netmem), nr); } /** @@ -244,7 +244,7 @@ static inline void page_pool_fragment_page(struct page *page, long nr) static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) { - struct page *page = netmem_to_page(netmem); + atomic_long_t *pp_ref_count = netmem_get_pp_ref_count_ref(netmem); long ret; /* If nr == pp_ref_count then we have cleared all remaining @@ -261,19 +261,19 @@ static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) * initially, and only overwrite it when the page is partitioned into * more than one piece. */ - if (atomic_long_read(&page->pp_ref_count) == nr) { + if (atomic_long_read(pp_ref_count) == nr) { /* As we have ensured nr is always one for constant case using * the BUILD_BUG_ON(), only need to handle the non-constant case * here for pp_ref_count draining, which is a rare case. */ BUILD_BUG_ON(__builtin_constant_p(nr) && nr != 1); if (!__builtin_constant_p(nr)) - atomic_long_set(&page->pp_ref_count, 1); + atomic_long_set(pp_ref_count, 1); return 0; } - ret = atomic_long_sub_return(nr, &page->pp_ref_count); + ret = atomic_long_sub_return(nr, pp_ref_count); WARN_ON(ret < 0); /* We are the last user here too, reset pp_ref_count back to 1 to @@ -282,7 +282,7 @@ static inline long page_pool_unref_netmem(netmem_ref netmem, long nr) * page_pool_unref_page() currently. */ if (unlikely(!ret)) - atomic_long_set(&page->pp_ref_count, 1); + atomic_long_set(pp_ref_count, 1); return ret; } diff --git a/include/trace/events/page_pool.h b/include/trace/events/page_pool.h index 543e54e432a18..31825ed300324 100644 --- a/include/trace/events/page_pool.h +++ b/include/trace/events/page_pool.h @@ -57,12 +57,12 @@ TRACE_EVENT(page_pool_state_release, __entry->pool = pool; __entry->netmem = (__force unsigned long)netmem; __entry->release = release; - __entry->pfn = netmem_to_pfn(netmem); + __entry->pfn = netmem_pfn_trace(netmem); ), - TP_printk("page_pool=%p netmem=%p pfn=0x%lx release=%u", + TP_printk("page_pool=%p netmem=%p is_net_iov=%lu pfn=0x%lx release=%u", __entry->pool, (void *)__entry->netmem, - __entry->pfn, __entry->release) + __entry->netmem & NET_IOV, __entry->pfn, __entry->release) ); TRACE_EVENT(page_pool_state_hold, @@ -83,12 +83,12 @@ TRACE_EVENT(page_pool_state_hold, __entry->pool = pool; __entry->netmem = (__force unsigned long)netmem; __entry->hold = hold; - __entry->pfn = netmem_to_pfn(netmem); + __entry->pfn = netmem_pfn_trace(netmem); ), - TP_printk("page_pool=%p netmem=%p pfn=0x%lx hold=%u", + TP_printk("page_pool=%p netmem=%p is_net_iov=%lu, pfn=0x%lx hold=%u", __entry->pool, (void *)__entry->netmem, - __entry->pfn, __entry->hold) + __entry->netmem & NET_IOV, __entry->pfn, __entry->hold) ); TRACE_EVENT(page_pool_update_nid, diff --git a/net/core/devmem.c b/net/core/devmem.c index 3f73d0bda023f..befff59a2ee64 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -80,7 +80,10 @@ net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) index = offset / PAGE_SIZE; niov = &owner->niovs[index]; + niov->pp_magic = 0; + niov->pp = NULL; niov->dma_addr = 0; + atomic_long_set(&niov->pp_ref_count, 0); return niov; } diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h new file mode 100644 index 0000000000000..d18eaca38cdaa --- /dev/null +++ b/net/core/netmem_priv.h @@ -0,0 +1,36 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef __NETMEM_PRIV_H +#define __NETMEM_PRIV_H + +static inline unsigned long netmem_get_pp_magic(netmem_ref netmem) +{ + return __netmem_clear_lsb(netmem)->pp_magic; +} + +static inline void netmem_or_pp_magic(netmem_ref netmem, unsigned long pp_magic) +{ + __netmem_clear_lsb(netmem)->pp_magic |= pp_magic; +} + +static inline void netmem_clear_pp_magic(netmem_ref netmem) +{ + __netmem_clear_lsb(netmem)->pp_magic = 0; +} + +static inline void netmem_set_pp(netmem_ref netmem, struct page_pool *pool) +{ + __netmem_clear_lsb(netmem)->pp = pool; +} + +static inline unsigned long netmem_get_dma_addr(netmem_ref netmem) +{ + return __netmem_clear_lsb(netmem)->dma_addr; +} + +static inline void netmem_set_dma_addr(netmem_ref netmem, + unsigned long dma_addr) +{ + __netmem_clear_lsb(netmem)->dma_addr = dma_addr; +} +#endif diff --git a/net/core/page_pool.c b/net/core/page_pool.c index a032f731d4146..c5c303746d494 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -25,6 +25,9 @@ #include #include "page_pool_priv.h" +#include "netmem_priv.h" + +DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); #define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ) @@ -358,7 +361,7 @@ static noinline netmem_ref page_pool_refill_alloc_cache(struct page_pool *pool) if (unlikely(!netmem)) break; - if (likely(page_to_nid(netmem_to_page(netmem)) == pref_nid)) { + if (likely(netmem_is_pref_nid(netmem, pref_nid))) { pool->alloc.cache[pool->alloc.count++] = netmem; } else { /* NUMA mismatch; @@ -454,10 +457,8 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) { - struct page *page = netmem_to_page(netmem); - - page->pp = pool; - page->pp_magic |= PP_SIGNATURE; + netmem_set_pp(netmem, pool); + netmem_or_pp_magic(netmem, PP_SIGNATURE); /* Ensuring all pages have been split into one fragment initially: * page_pool_set_pp_info() is only called once for every page when it @@ -472,10 +473,8 @@ static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) static void page_pool_clear_pp_info(netmem_ref netmem) { - struct page *page = netmem_to_page(netmem); - - page->pp_magic = 0; - page->pp = NULL; + netmem_clear_pp_magic(netmem); + netmem_set_pp(netmem, NULL); } static struct page *__page_pool_alloc_page_order(struct page_pool *pool, @@ -692,8 +691,9 @@ static bool page_pool_recycle_in_cache(netmem_ref netmem, static bool __page_pool_page_can_be_recycled(netmem_ref netmem) { - return page_ref_count(netmem_to_page(netmem)) == 1 && - !page_is_pfmemalloc(netmem_to_page(netmem)); + return netmem_is_net_iov(netmem) || + (page_ref_count(netmem_to_page(netmem)) == 1 && + !page_is_pfmemalloc(netmem_to_page(netmem))); } /* If the page refcnt == 1, this will try to recycle the page. @@ -728,6 +728,7 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem, /* Page found as candidate for recycling */ return netmem; } + /* Fallback/non-XDP mode: API user have elevated refcnt. * * Many drivers split up the page into fragments, and some @@ -949,7 +950,7 @@ static void page_pool_empty_ring(struct page_pool *pool) /* Empty recycle ring */ while ((netmem = (__force netmem_ref)ptr_ring_consume_bh(&pool->ring))) { /* Verify the refcnt invariant of cached pages */ - if (!(page_ref_count(netmem_to_page(netmem)) == 1)) + if (!(netmem_ref_count(netmem) == 1)) pr_crit("%s() page_pool refcnt %d violation\n", __func__, netmem_ref_count(netmem)); @@ -1102,9 +1103,7 @@ EXPORT_SYMBOL(page_pool_update_nid); dma_addr_t page_pool_get_dma_addr_netmem(netmem_ref netmem) { - struct page *page = netmem_to_page(netmem); - - dma_addr_t ret = page->dma_addr; + dma_addr_t ret = netmem_get_dma_addr(netmem); if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) ret <<= PAGE_SHIFT; @@ -1115,18 +1114,17 @@ EXPORT_SYMBOL(page_pool_get_dma_addr_netmem); bool page_pool_set_dma_addr_netmem(netmem_ref netmem, dma_addr_t addr) { - struct page *page = netmem_to_page(netmem); - if (PAGE_POOL_32BIT_ARCH_WITH_64BIT_DMA) { - page->dma_addr = addr >> PAGE_SHIFT; + netmem_set_dma_addr(netmem, addr >> PAGE_SHIFT); /* We assume page alignment to shave off bottom bits, * if this "compression" doesn't work we need to drop. */ - return addr != (dma_addr_t)page->dma_addr << PAGE_SHIFT; + return addr != (dma_addr_t)netmem_get_dma_addr(netmem) + << PAGE_SHIFT; } - page->dma_addr = addr; + netmem_set_dma_addr(netmem, addr); return false; } EXPORT_SYMBOL(page_pool_set_dma_addr_netmem); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 83f8cd8aa2d16..6fd822b5d96d0 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -89,6 +89,7 @@ #include "dev.h" #include "sock_destructor.h" +#include "netmem_priv.h" #ifdef CONFIG_SKB_EXTENSIONS static struct kmem_cache *skbuff_ext_cache __ro_after_init; @@ -917,9 +918,9 @@ static void skb_clone_fraglist(struct sk_buff *skb) skb_get(list); } -static bool is_pp_page(struct page *page) +static bool is_pp_netmem(netmem_ref netmem) { - return (page->pp_magic & ~0x3UL) == PP_SIGNATURE; + return (netmem_get_pp_magic(netmem) & ~0x3UL) == PP_SIGNATURE; } int skb_pp_cow_data(struct page_pool *pool, struct sk_buff **pskb, @@ -1017,9 +1018,7 @@ EXPORT_SYMBOL(skb_cow_data_for_xdp); #if IS_ENABLED(CONFIG_PAGE_POOL) bool napi_pp_put_page(netmem_ref netmem) { - struct page *page = netmem_to_page(netmem); - - page = compound_head(page); + netmem = netmem_compound_head(netmem); /* page->pp_magic is OR'ed with PP_SIGNATURE after the allocation * in order to preserve any existing bits, such as bit 0 for the @@ -1028,10 +1027,10 @@ bool napi_pp_put_page(netmem_ref netmem) * and page_is_pfmemalloc() is checked in __page_pool_put_page() * to avoid recycling the pfmemalloc page. */ - if (unlikely(!is_pp_page(page))) + if (unlikely(!is_pp_netmem(netmem))) return false; - page_pool_put_full_netmem(page->pp, page_to_netmem(page), false); + page_pool_put_full_netmem(netmem_get_pp(netmem), netmem, false); return true; } @@ -1058,7 +1057,7 @@ static bool skb_pp_recycle(struct sk_buff *skb, void *data) static int skb_pp_frag_ref(struct sk_buff *skb) { struct skb_shared_info *shinfo; - struct page *head_page; + netmem_ref head_netmem; int i; if (!skb->pp_recycle) @@ -1067,11 +1066,11 @@ static int skb_pp_frag_ref(struct sk_buff *skb) shinfo = skb_shinfo(skb); for (i = 0; i < shinfo->nr_frags; i++) { - head_page = compound_head(skb_frag_page(&shinfo->frags[i])); - if (likely(is_pp_page(head_page))) - page_pool_ref_page(head_page); + head_netmem = netmem_compound_head(shinfo->frags[i].netmem); + if (likely(is_pp_netmem(head_netmem))) + page_pool_ref_netmem(head_netmem); else - page_ref_inc(head_page); + page_ref_inc(netmem_to_page(head_netmem)); } return 0; } From patchwork Tue Jul 30 02:26:12 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 815494 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2BC4116C6A0 for ; Tue, 30 Jul 2024 02:26:45 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306411; cv=none; b=jvgmzyEOjUqgItrOekI/g5SKaLGJTwGDYIsp7GmDgETiBUH/C7xeh43+LScBHpxYp1XHIErQy8Mu2yoeZvIfut4IiESK6NrqNvg5uEeYM6QlQnSYLUCaRqzauXwDtvsqOQAo0/Z8CLOIjbBRc/abeExHMwxda1OkvFIvfO28FUM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306411; c=relaxed/simple; bh=jdl3M2jr4u6pLSlqbqlvsZuuQl3RupjG2vZvs7fMYOE=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=sAG+ovTnmxP6Kp8i/G0k39dgYvx8sh3AGJBaKLEvHk2vmZISf8fJPqoPZZCRo5Wpj9kXDeLMeoRBjp3o0EzsF5l/ZDzv4gUctvmIsQPqqqO3AE2cmrakawqUrjmRiLEFeQFqu/Q782vD6C2fmhDROJi9KHvSMaFOqrH9141O2zo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=SUngBfQ2; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="SUngBfQ2" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-650b621f4cdso69973247b3.1 for ; Mon, 29 Jul 2024 19:26:45 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1722306404; x=1722911204; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=vplH4mlU/1WQZxtGxZ/nznz8iPlQ8XpYd89zmbdHSbk=; b=SUngBfQ2wMrmVPAEOXTjjnLEpynGdDmarPQCnm+QjAR5L3tzyi1e5G5zrIXvUSz8cG Jd1zRmi1y496Hgp27AVrCkyEkUoA5pD/mI/5R9iIrhl0WqmWxVqoV8WbIh+Xu2lrXSE1 HuQOTb7e4u1gZhbJiURAm/TR3JkdvDnNaBhmwxAcyEh/M1z4GWvQXGAtTfCOzNc+Qb4t 34mV7uwos8BKSZuDDWR1QEimrj6Xa1j+cGlDNKSJl8gsnyYVh3GQZKpCPxXKGVPF1LJe bL5QwhegbLqEOO/tXtdFvgqdoDKYuPYDM/keVe+zAfyUtUXqofL99dLFGMp5/fy1pZTy nPiw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722306404; x=1722911204; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=vplH4mlU/1WQZxtGxZ/nznz8iPlQ8XpYd89zmbdHSbk=; b=sWWgAenUL1D59zZgKbYt3JzPKImo+u4tXLoZQMrVZz3Z+xC7WGlnjZEUznWH1KbfxR 04ojLQzs7+9E/Wi+foEVL483gSedGwjl+Ut+JgOsWF+3KJLxnKjGeORCWfWNsT+B0u3q GxynIpOHFmoSmB3dGM6JMrG+ONTioSll84WPH4JJsaryxM8QlGoj1ys2ZbFrVPElvRp3 0ORKEJZ+CC1+905RCbQKGtiLvDz/52+tOpEAf6Pg0AspzCDhR6gGAZ+G3mobJB467IeY 8ruHZomgBMQSEsTSEg3TolALQkXbQino1TVd3957GKrs6JjXxa60mlsdGtxPkWsWpFyG Zbqg== X-Forwarded-Encrypted: i=1; AJvYcCW98iR7oY7UTRBY2Ik7Bt7bc/Mn20S19B0mNfAExHe2HhHnMv5QOLdiOifl6sW7ryE9npxFJBSw+zciwD6iUcv4XToWuA1b+WPgdXu55UGZ X-Gm-Message-State: AOJu0Yw0Lo3AQPe2zVNd9zJxllSvcjTOrRc13sAR7N/uxMkACRaVS1Eo l/eIyS9DThj/VDZLLwRV8IN5FPueRFg/he1UFfAigzJu6MUB5+agYzL4bZpgr1Z4E/ZkwCdpJ30 n9uAHt/mHndqPH/tf7jSuzA== X-Google-Smtp-Source: AGHT+IFXmNBH9+V8ugq5JoD4zenlmgEWIS7YmPxOH4wl4jR2tGtmVMHeRQqlQt6cS9xP6nMVdDQ2QWBEFdi3jssVrg== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a05:690c:6612:b0:62d:42d:5f63 with SMTP id 00721157ae682-67a08eb9721mr1531347b3.5.1722306403972; Mon, 29 Jul 2024 19:26:43 -0700 (PDT) Date: Tue, 30 Jul 2024 02:26:12 +0000 In-Reply-To: <20240730022623.98909-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240730022623.98909-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.rc1.232.g9752f9e123-goog Message-ID: <20240730022623.98909-9-almasrymina@google.com> Subject: [PATCH net-next v17 08/14] net: support non paged skb frags From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, bpf@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Donald Hunter , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi Make skb_frag_page() fail in the case where the frag is not backed by a page, and fix its relevant callers to handle this case. Signed-off-by: Mina Almasry Reviewed-by: Eric Dumazet --- v10: - Fixed newly generated kdoc warnings found by patchwork. While we're at it, fix the Return section of the functions I touched. v6: - Rebased on top of the merged netmem changes. Changes in v1: - Fix illegal_highdma() (Yunsheng). - Rework napi_pp_put_page() slightly to reduce code churn (Willem). --- include/linux/skbuff.h | 42 +++++++++++++++++++++++++++++++++++++- include/linux/skbuff_ref.h | 9 ++++---- net/core/dev.c | 3 ++- net/core/gro.c | 3 ++- net/core/skbuff.c | 11 ++++++++++ net/ipv4/esp4.c | 3 ++- net/ipv4/tcp.c | 3 +++ net/ipv6/esp6.c | 3 ++- 8 files changed, 67 insertions(+), 10 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 29c3ea5b6e930..6e22ff7b91964 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3523,21 +3523,58 @@ static inline void skb_frag_off_copy(skb_frag_t *fragto, fragto->offset = fragfrom->offset; } +/* Return: true if the skb_frag contains a net_iov. */ +static inline bool skb_frag_is_net_iov(const skb_frag_t *frag) +{ + return netmem_is_net_iov(frag->netmem); +} + +/** + * skb_frag_net_iov - retrieve the net_iov referred to by fragment + * @frag: the fragment + * + * Return: the &struct net_iov associated with @frag. Returns NULL if this + * frag has no associated net_iov. + */ +static inline struct net_iov *skb_frag_net_iov(const skb_frag_t *frag) +{ + if (!skb_frag_is_net_iov(frag)) + return NULL; + + return netmem_to_net_iov(frag->netmem); +} + /** * skb_frag_page - retrieve the page referred to by a paged fragment * @frag: the paged fragment * - * Returns the &struct page associated with @frag. + * Return: the &struct page associated with @frag. Returns NULL if this frag + * has no associated page. */ static inline struct page *skb_frag_page(const skb_frag_t *frag) { + if (skb_frag_is_net_iov(frag)) + return NULL; + return netmem_to_page(frag->netmem); } +/** + * skb_frag_netmem - retrieve the netmem referred to by a fragment + * @frag: the fragment + * + * Return: the &netmem_ref associated with @frag. + */ +static inline netmem_ref skb_frag_netmem(const skb_frag_t *frag) +{ + return frag->netmem; +} + int skb_pp_cow_data(struct page_pool *pool, struct sk_buff **pskb, unsigned int headroom); int skb_cow_data_for_xdp(struct page_pool *pool, struct sk_buff **pskb, struct bpf_prog *prog); + /** * skb_frag_address - gets the address of the data contained in a paged fragment * @frag: the paged fragment buffer @@ -3547,6 +3584,9 @@ int skb_cow_data_for_xdp(struct page_pool *pool, struct sk_buff **pskb, */ static inline void *skb_frag_address(const skb_frag_t *frag) { + if (!skb_frag_page(frag)) + return NULL; + return page_address(skb_frag_page(frag)) + skb_frag_off(frag); } diff --git a/include/linux/skbuff_ref.h b/include/linux/skbuff_ref.h index 16c241a234728..0f3c58007488a 100644 --- a/include/linux/skbuff_ref.h +++ b/include/linux/skbuff_ref.h @@ -34,14 +34,13 @@ static inline void skb_frag_ref(struct sk_buff *skb, int f) bool napi_pp_put_page(netmem_ref netmem); -static inline void -skb_page_unref(struct page *page, bool recycle) +static inline void skb_page_unref(netmem_ref netmem, bool recycle) { #ifdef CONFIG_PAGE_POOL - if (recycle && napi_pp_put_page(page_to_netmem(page))) + if (recycle && napi_pp_put_page(netmem)) return; #endif - put_page(page); + put_page(netmem_to_page(netmem)); } /** @@ -54,7 +53,7 @@ skb_page_unref(struct page *page, bool recycle) */ static inline void __skb_frag_unref(skb_frag_t *frag, bool recycle) { - skb_page_unref(skb_frag_page(frag), recycle); + skb_page_unref(skb_frag_netmem(frag), recycle); } /** diff --git a/net/core/dev.c b/net/core/dev.c index 29d1920cf01fd..0e392bcda2ff0 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3434,8 +3434,9 @@ static int illegal_highdma(struct net_device *dev, struct sk_buff *skb) if (!(dev->features & NETIF_F_HIGHDMA)) { for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = skb_frag_page(frag); - if (PageHighMem(skb_frag_page(frag))) + if (page && PageHighMem(page)) return 1; } } diff --git a/net/core/gro.c b/net/core/gro.c index b3b43de1a6502..26f09c3e830b7 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -408,7 +408,8 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff) pinfo = skb_shinfo(skb); frag0 = &pinfo->frags[0]; - if (pinfo->nr_frags && !PageHighMem(skb_frag_page(frag0)) && + if (pinfo->nr_frags && skb_frag_page(frag0) && + !PageHighMem(skb_frag_page(frag0)) && (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) { NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0); NAPI_GRO_CB(skb)->frag0_len = min_t(unsigned int, diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 6fd822b5d96d0..9c026ea694dcd 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1368,6 +1368,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt) struct page *p; u8 *vaddr; + if (skb_frag_is_net_iov(frag)) { + printk("%sskb frag %d: not readable\n", level, i); + len -= skb_frag_size(frag); + if (!len) + break; + continue; + } + skb_frag_foreach_page(frag, skb_frag_off(frag), skb_frag_size(frag), p, p_off, p_len, copied) { @@ -3160,6 +3168,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) { const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; + if (WARN_ON_ONCE(!skb_frag_page(f))) + return false; + if (__splice_segment(skb_frag_page(f), skb_frag_off(f), skb_frag_size(f), offset, len, spd, false, sk, pipe)) diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index 47378ca41904b..f3281312eb5eb 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -115,7 +115,8 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp, struct sk_buff *skb) */ if (req->src != req->dst) for (sg = sg_next(req->src); sg; sg = sg_next(sg)) - skb_page_unref(sg_page(sg), skb->pp_recycle); + skb_page_unref(page_to_netmem(sg_page(sg)), + skb->pp_recycle); } #ifdef CONFIG_INET_ESPINTCP diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index e03a342c9162b..8050148d302b1 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2177,6 +2177,9 @@ static int tcp_zerocopy_receive(struct sock *sk, break; } page = skb_frag_page(frags); + if (WARN_ON_ONCE(!page)) + break; + prefetchw(page); pages[pages_to_map++] = page; length += PAGE_SIZE; diff --git a/net/ipv6/esp6.c b/net/ipv6/esp6.c index 3920e8aa1031e..b2400c226a325 100644 --- a/net/ipv6/esp6.c +++ b/net/ipv6/esp6.c @@ -132,7 +132,8 @@ static void esp_ssg_unref(struct xfrm_state *x, void *tmp, struct sk_buff *skb) */ if (req->src != req->dst) for (sg = sg_next(req->src); sg; sg = sg_next(sg)) - skb_page_unref(sg_page(sg), skb->pp_recycle); + skb_page_unref(page_to_netmem(sg_page(sg)), + skb->pp_recycle); } #ifdef CONFIG_INET6_ESPINTCP From patchwork Tue Jul 30 02:26:14 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 815493 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B2A5841E22 for ; Tue, 30 Jul 2024 02:26:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306416; cv=none; b=geWAaGhL8SdfcF6dibgO+V1pmkREdMiRYvNL2XiWMuRYmm8D4XGFQAe8sKmWtl9OXPNJIZ/xn8oquPLwu8DIUaSkYO37/Hb8UTv2IKl3BI1eEfCSpG3SZ40pTatspMLT2hCN6MiOFRQbTD9sJta2GM8kiq76Hrj9vOljW+TWf1I= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306416; c=relaxed/simple; bh=3rYPp6yJ4TQkU0nAHTZU8bAVR7XFl/3iIOgV8FPkTFk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=DHU2M3kTTiU9IbpA9PjGHP+vQ6Yt0QyjIpkpP4rXSlw5h7PILVVHvbW4KCLEQHqsEp5roA0g5xylOIs1IZcUcKDJpV6MUVkYEzazjDIYZCA9oQVWOYdz0UwoYlas0Au7dqNi7SmLCrHoxQ72fWZEorVDV+zCbRdU5Q3R/aAk11s= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=paP6w9tI; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="paP6w9tI" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6643016423fso86432567b3.3 for ; Mon, 29 Jul 2024 19:26:49 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1722306407; x=1722911207; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=64egfm4rDpW7JIdlp3Mi9TPkhTjwI9/YxKykovbhv54=; b=paP6w9tI/nuf0+JH7I3/cMMuEfpd/32G5YGQS7cQ6osgxTVifC4C8QFJtiH9hTIkOt hYLKlYbUZzhUCwowj+GWAZRah9NN7hlO/bN1cV0b6FVSPPl6XdiMUPswUgMH6uLLZKTu SudPKsNHPk0Z/bpOXc6Q982L3b4/HxQkL2vMBqwOgV4t9aNhkAknxqMtXZmW8ojkY4GF y4gZ/pAw8e8B3j7cArSxtAM1ao5bqHdLmpiWwf5TxEzh35HqxDAJkR7Xm+00UzKEAoIt 9EUTspk+W5X5v3o0FI63ZxckOQfNXb5LBDEu+U6OoB6eHKn5FNbZ5c7gDXk+8S+o2juB QiCQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722306407; x=1722911207; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=64egfm4rDpW7JIdlp3Mi9TPkhTjwI9/YxKykovbhv54=; b=HTSJL6HjlvOSPYIxIAoqVffo8LyqquhFATpAHbxcPZvgjlz420dWZjMuiJXvYpO9HG ExwwZS4DuRy5qCjzugqz1WDCmep+WwWV132RARJRW4dgf9VNV/q2hsO14E+KwEniHxww mHT7LWdLOciDFFeowaBMLOB5g+1SLZb5dtaNl818Kfx7Urkuix8lxAlAGYMX73uZ4Db3 E9kYV7R6aRGime1SUl4058LPLBZn+FTlx20CeHuMOOhIG7dbROlIc4Gh0OLqSNb5CCOW 8oO5lNha9ScOPwlIemFmT6nZoTBsHG4XqEg+04VAl0GCqKr48bHvdE6wjoF4LGDvCXvy XI0A== X-Forwarded-Encrypted: i=1; AJvYcCWKP/uOsmMSDIapxh7QF8GXNA+C9mO3ChkN7joNFR1Q9JXkYgJylnchAbmzV/pVuNpu0lt8qUjdFXQ3J3b8Ytmv8XB2l7MPiPKTLsCjssYc X-Gm-Message-State: AOJu0Yxc/R6f8cIiVKqOpvKfb/sI09QPAeUZ+17Do1CqKvQnvO4DjqHp xfg3ZLa2Q9+rr2Fp6uzpqoWYA9cta1w6Kmkl+hwuvkUaPmCXZ0XVcB/wvwgzu9lGf63nv8YC5mW r948kuj+A8vRx7Wp/aWOvxw== X-Google-Smtp-Source: AGHT+IGt6JZOa2u/1mIOg8XQuoqmW2i1+QCVvVsezulwDwMHMU5t0MC2qoyz0lXL/JPthXrZ8qplIhOAj9Q0ZxUyeQ== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a05:690c:ece:b0:62f:f535:f2c with SMTP id 00721157ae682-67a053db18cmr3521907b3.2.1722306407595; Mon, 29 Jul 2024 19:26:47 -0700 (PDT) Date: Tue, 30 Jul 2024 02:26:14 +0000 In-Reply-To: <20240730022623.98909-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240730022623.98909-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.rc1.232.g9752f9e123-goog Message-ID: <20240730022623.98909-11-almasrymina@google.com> Subject: [PATCH net-next v17 10/14] tcp: RX path for devmem TCP From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, bpf@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Donald Hunter , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Willem de Bruijn , Kaiyuan Zhang In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling. tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer. tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information: 1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released. The pages awaiting freeing are stored in the newly added sk->sk_user_frags, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Eric Dumazet --- v16: - Fix number assignement (Arnd). v13: - Refactored user frags cleanup into a common function to avoid __maybe_unused. (Pavel) - change to offset = 0 for some improved clarity. v11: - Refactor to common function te remove conditional lock sparse warning (Paolo) v7: - Updated the SO_DEVMEM_* uapi to use the next available entries (Arnd). - Updated dmabuf_cmsg struct to be __u64 padded (Arnd). - Squashed fix from Eric to initialize sk_user_frags for passive sockets (Eric). v6 - skb->dmabuf -> skb->readable (Pavel) - Fixed asm definitions of SO_DEVMEM_LINEAR/SO_DEVMEM_DMABUF not found on some archs. - Squashed in locking optimizations from edumazet@google.com. With this change we lock the xarray once per per tcp_recvmsg_dmabuf() rather than once per frag in xa_alloc(). Changes in v1: - Added dmabuf_id to dmabuf_cmsg (David/Stan). - Devmem -> dmabuf (David). - Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo). - Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng). RFC v3: - Fixed issue with put_cmsg() failing silently. --- arch/alpha/include/uapi/asm/socket.h | 5 + arch/mips/include/uapi/asm/socket.h | 5 + arch/parisc/include/uapi/asm/socket.h | 5 + arch/sparc/include/uapi/asm/socket.h | 5 + include/linux/socket.h | 1 + include/net/netmem.h | 13 ++ include/net/sock.h | 2 + include/uapi/asm-generic/socket.h | 5 + include/uapi/linux/uio.h | 13 ++ net/ipv4/tcp.c | 255 +++++++++++++++++++++++++- net/ipv4/tcp_ipv4.c | 16 ++ net/ipv4/tcp_minisocks.c | 2 + 12 files changed, 322 insertions(+), 5 deletions(-) diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index e94f621903fee..ef4656a41058a 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -140,6 +140,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index 60ebaed28a4ca..414807d55e33f 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -151,6 +151,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index be264c2b1a117..2b817efd45444 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -132,6 +132,11 @@ #define SO_PASSPIDFD 0x404A #define SO_PEERPIDFD 0x404B +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 682da3714686c..00248fc689773 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -133,6 +133,11 @@ #define SO_PASSPIDFD 0x0055 #define SO_PEERPIDFD 0x0056 +#define SO_DEVMEM_LINEAR 0x0057 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 0x0058 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) diff --git a/include/linux/socket.h b/include/linux/socket.h index df9cdb8bbfb88..d18cc47e89bd0 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -327,6 +327,7 @@ struct ucred { * plain text and require encryption */ +#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */ #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */ #define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */ #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */ diff --git a/include/net/netmem.h b/include/net/netmem.h index 34dd146f9e940..b4077c6c5ba96 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -65,6 +65,19 @@ static inline unsigned int net_iov_idx(const struct net_iov *niov) return niov - net_iov_owner(niov)->niovs; } +static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov) +{ + struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov); + + return owner->base_virtual + + ((unsigned long)net_iov_idx(niov) << PAGE_SHIFT); +} + +static inline u32 net_iov_binding_id(const struct net_iov *niov) +{ + return net_iov_owner(niov)->binding->id; +} + static inline struct net_devmem_dmabuf_binding * net_iov_binding(const struct net_iov *niov) { diff --git a/include/net/sock.h b/include/net/sock.h index cce23ac4d5148..f8ec869be2382 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -337,6 +337,7 @@ struct sk_filter; * @sk_txtime_report_errors: set report errors mode for SO_TXTIME * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference + * @sk_user_frags: xarray of pages the user is holding a reference on. */ struct sock { /* @@ -542,6 +543,7 @@ struct sock { #endif struct rcu_head sk_rcu; netns_tracker ns_tracker; + struct xarray sk_user_frags; }; struct sock_bh_locked { diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 8ce8a39a1e5f0..e993edc9c0ee6 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -135,6 +135,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 78 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 79 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4f..3a22ddae376a2 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,19 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; +struct dmabuf_cmsg { + __u64 frag_offset; /* offset into the dmabuf where the frag starts. + */ + __u32 frag_size; /* size of the frag. */ + __u32 frag_token; /* token representing this frag for + * DEVMEM_DONTNEED. + */ + __u32 dmabuf_id; /* dmabuf id this frag belongs to. */ + __u32 flags; /* Currently unused. Reserved for future + * uses. + */ +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index adc0eb9a9b5c3..308b2fb7aa3d9 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -471,6 +471,7 @@ void tcp_init_sock(struct sock *sk) set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); + xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1); } EXPORT_SYMBOL(tcp_init_sock); @@ -2323,6 +2324,220 @@ static int tcp_inq_hint(struct sock *sk) return inq; } +/* batch __xa_alloc() calls and reduce xa_lock()/xa_unlock() overhead. */ +struct tcp_xa_pool { + u8 max; /* max <= MAX_SKB_FRAGS */ + u8 idx; /* idx <= max */ + __u32 tokens[MAX_SKB_FRAGS]; + netmem_ref netmems[MAX_SKB_FRAGS]; +}; + +static void tcp_xa_pool_commit_locked(struct sock *sk, struct tcp_xa_pool *p) +{ + int i; + + /* Commit part that has been copied to user space. */ + for (i = 0; i < p->idx; i++) + __xa_cmpxchg(&sk->sk_user_frags, p->tokens[i], XA_ZERO_ENTRY, + (__force void *)p->netmems[i], GFP_KERNEL); + /* Rollback what has been pre-allocated and is no longer needed. */ + for (; i < p->max; i++) + __xa_erase(&sk->sk_user_frags, p->tokens[i]); + + p->max = 0; + p->idx = 0; +} + +static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p) +{ + if (!p->max) + return; + + xa_lock_bh(&sk->sk_user_frags); + + tcp_xa_pool_commit_locked(sk, p); + + xa_unlock_bh(&sk->sk_user_frags); +} + +static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, + unsigned int max_frags) +{ + int err, k; + + if (p->idx < p->max) + return 0; + + xa_lock_bh(&sk->sk_user_frags); + + tcp_xa_pool_commit_locked(sk, p); + + for (k = 0; k < max_frags; k++) { + err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k], + XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL); + if (err) + break; + } + + xa_unlock_bh(&sk->sk_user_frags); + + p->max = k; + p->idx = 0; + return k ? 0 : err; +} + +/* On error, returns the -errno. On success, returns number of bytes sent to the + * user. May not consume all of @remaining_len. + */ +static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, + unsigned int offset, struct msghdr *msg, + int remaining_len) +{ + struct dmabuf_cmsg dmabuf_cmsg = { 0 }; + struct tcp_xa_pool tcp_xa_pool; + unsigned int start; + int i, copy, n; + int sent = 0; + int err = 0; + + tcp_xa_pool.max = 0; + tcp_xa_pool.idx = 0; + do { + start = skb_headlen(skb); + + if (skb_frags_readable(skb)) { + err = -ENODEV; + goto out; + } + + /* Copy header. */ + copy = start - offset; + if (copy > 0) { + copy = min(copy, remaining_len); + + n = copy_to_iter(skb->data + offset, copy, + &msg->msg_iter); + if (n != copy) { + err = -EFAULT; + goto out; + } + + offset += copy; + remaining_len -= copy; + + /* First a dmabuf_cmsg for # bytes copied to user + * buffer. + */ + memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg)); + dmabuf_cmsg.frag_size = copy; + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_LINEAR, + sizeof(dmabuf_cmsg), &dmabuf_cmsg); + if (err || msg->msg_flags & MSG_CTRUNC) { + msg->msg_flags &= ~MSG_CTRUNC; + if (!err) + err = -ETOOSMALL; + goto out; + } + + sent += copy; + + if (remaining_len == 0) + goto out; + } + + /* after that, send information of dmabuf pages through a + * sequence of cmsg + */ + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct net_iov *niov; + u64 frag_offset; + int end; + + /* !skb_frags_readable() should indicate that ALL the + * frags in this skb are dmabuf net_iovs. We're checking + * for that flag above, but also check individual frags + * here. If the tcp stack is not setting + * skb_frags_readable() correctly, we still don't want + * to crash here. + */ + if (!skb_frag_net_iov(frag)) { + net_err_ratelimited("Found non-dmabuf skb with net_iov"); + err = -ENODEV; + goto out; + } + + niov = skb_frag_net_iov(frag); + end = start + skb_frag_size(frag); + copy = end - offset; + + if (copy > 0) { + copy = min(copy, remaining_len); + + frag_offset = net_iov_virtual_addr(niov) + + skb_frag_off(frag) + offset - + start; + dmabuf_cmsg.frag_offset = frag_offset; + dmabuf_cmsg.frag_size = copy; + err = tcp_xa_pool_refill(sk, &tcp_xa_pool, + skb_shinfo(skb)->nr_frags - i); + if (err) + goto out; + + /* Will perform the exchange later */ + dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx]; + dmabuf_cmsg.dmabuf_id = net_iov_binding_id(niov); + + offset += copy; + remaining_len -= copy; + + err = put_cmsg(msg, SOL_SOCKET, + SO_DEVMEM_DMABUF, + sizeof(dmabuf_cmsg), + &dmabuf_cmsg); + if (err || msg->msg_flags & MSG_CTRUNC) { + msg->msg_flags &= ~MSG_CTRUNC; + if (!err) + err = -ETOOSMALL; + goto out; + } + + atomic_long_inc(&niov->pp_ref_count); + tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag); + + sent += copy; + + if (remaining_len == 0) + goto out; + } + start = end; + } + + tcp_xa_pool_commit(sk, &tcp_xa_pool); + if (!remaining_len) + goto out; + + /* if remaining_len is not satisfied yet, we need to go to the + * next frag in the frag_list to satisfy remaining_len. + */ + skb = skb_shinfo(skb)->frag_list ?: skb->next; + + offset = 0; + } while (skb); + + if (remaining_len) { + err = -EFAULT; + goto out; + } + +out: + tcp_xa_pool_commit(sk, &tcp_xa_pool); + if (!sent) + sent = err; + + return sent; +} + /* * This routine copies from a sock struct into the user buffer. * @@ -2336,6 +2551,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, int *cmsg_flags) { struct tcp_sock *tp = tcp_sk(sk); + int last_copied_dmabuf = -1; /* uninitialized */ int copied = 0; u32 peek_seq; u32 *seq; @@ -2515,15 +2731,44 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, } if (!(flags & MSG_TRUNC)) { - err = skb_copy_datagram_msg(skb, offset, msg, used); - if (err) { - /* Exception. Bailout! */ - if (!copied) - copied = -EFAULT; + if (last_copied_dmabuf != -1 && + last_copied_dmabuf != !skb_frags_readable(skb)) break; + + if (skb_frags_readable(skb)) { + err = skb_copy_datagram_msg(skb, offset, msg, + used); + if (err) { + /* Exception. Bailout! */ + if (!copied) + copied = -EFAULT; + break; + } + } else { + if (!(flags & MSG_SOCK_DEVMEM)) { + /* dmabuf skbs can only be received + * with the MSG_SOCK_DEVMEM flag. + */ + if (!copied) + copied = -EFAULT; + + break; + } + + err = tcp_recvmsg_dmabuf(sk, skb, offset, msg, + used); + if (err <= 0) { + if (!copied) + copied = -EFAULT; + + break; + } + used = err; } } + last_copied_dmabuf = !skb_frags_readable(skb); + WRITE_ONCE(*seq, *seq + used); copied += used; len -= used; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index fd17f25ff288a..f3b2ae0823c4b 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -79,6 +79,7 @@ #include #include #include +#include #include #include @@ -2507,10 +2508,25 @@ static void tcp_md5sig_info_free_rcu(struct rcu_head *head) } #endif +static void tcp_release_user_frags(struct sock *sk) +{ +#ifdef CONFIG_PAGE_POOL + unsigned long index; + void *netmem; + + xa_for_each(&sk->sk_user_frags, index, netmem) + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem)); +#endif +} + void tcp_v4_destroy_sock(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); + tcp_release_user_frags(sk); + + xa_destroy(&sk->sk_user_frags); + trace_tcp_destroy_sock(sk); tcp_clear_xmit_timers(sk); diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c index a19a9dbd3409f..9ab87a41255dd 100644 --- a/net/ipv4/tcp_minisocks.c +++ b/net/ipv4/tcp_minisocks.c @@ -625,6 +625,8 @@ struct sock *tcp_create_openreq_child(const struct sock *sk, __TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS); + xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1); + return newsk; } EXPORT_SYMBOL(tcp_create_openreq_child); From patchwork Tue Jul 30 02:26:16 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 815492 Received: from mail-pg1-f201.google.com (mail-pg1-f201.google.com [209.85.215.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 9E55516631D for ; Tue, 30 Jul 2024 02:26:54 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.215.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306420; cv=none; b=dOOhiPslU4/Ga2GUtVNsxQosvPZP6cpPYah2QDsg4oPgHsN0oFwAhUYs4S3oMqY3uRrAsEW19guxyWYzDZ5/y2+jPzgAbs9m/Z9kR9XSGG94SaR6xl65Kt3TuH0w0Bn0vjHlYM2HaGSFvHgko0geI2fFWqA79Cw+xEI/fmuUXME= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306420; c=relaxed/simple; bh=ZguThXd5UFU6KcfN5f2B2swOUTtlFzXFRNLBm1mhiGY=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=CmWv27iJTMAFA03ugBpd/kt70p5sz/gOmAIrsF6brCJJNyTSVt444aGsaN1vdk9igEyqMl+zsEUcPvSldskY+9eaqj7kB8nTQlgEudoi/4ScE6lD6CoEqre9jDIBI7pWFMPoufeuU3D+CZUKmnQxlzoXLdhd5gge6gFnnx//KjQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=Qiee9KgR; arc=none smtp.client-ip=209.85.215.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="Qiee9KgR" Received: by mail-pg1-f201.google.com with SMTP id 41be03b00d2f7-721d20a0807so4239106a12.1 for ; Mon, 29 Jul 2024 19:26:54 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1722306412; x=1722911212; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=sQ9Lh92v4N8uQTK97JV3TjqyrsKJVx9Vrs3hEvza44U=; b=Qiee9KgRqnvGDdbkGo6QbHiWWTaijcfyZaotBUJ9NeEc3ye0fJAlXBbwohhsp6fkcD r+MXubd/bwq9SaRExCxKw3WbeTmqdj1EkieEjh7YO+wNhWespljpkDlILyK1Lr9CPID1 FqJWxoY72SE5DOEQ7m4P3HH94aLSvFXoVBsCbLt0a7fi8Mres6u8LoUVbw5QHv1j/qt8 00bRMbeGURQXPenioW4cKNDYFFs0SSeDuJjXYte2uS7oUe+hTPcVDnUJgcQ2njxxSV9X qLIkLxlWPRtOzFy0Et+xiACzRzUiu/UakpsS10eY817YkzgOwenl9e9AtDfk6YmCPco9 34Gg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722306412; x=1722911212; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=sQ9Lh92v4N8uQTK97JV3TjqyrsKJVx9Vrs3hEvza44U=; b=n/JWLeUJ4AsTniZlLkgJsU1jtN86MZxKl81UrotELfABnsRh1udned0cJ+c8lme7ib IlLEs6V3fco+NT7uSnqQUFL44kgtzRhVv2pvGPcJ0wtAr2Xd6jPLKdS59iy44SMegOpn Qp0oHbGbnhaCOEb9KU2L2xh6EY183/Kv3/lYT2q01A7PJoXz+r0jX4JmG+oT17NcphKb dXnVPul5NpuGzvcA8kIkRF3AwnBI5tlTbwmuS+XQejPfCrO62B10Q7Z9PZ+P21ezf8zL hm8o1r+Rd0ad+kHG50Dh+hOVTaOo5ekxiXqc1oJf587A9epNB9HOLznqizHm6TivGBUw zvkA== X-Forwarded-Encrypted: i=1; AJvYcCU4DaXkk5FVEzwYb3yrvNjL1/UjJiZkZ31B2U6vnDteIEbGVi3CCYpzCMzqORLDGE+fA92pyKrAZu++5HT7ezRQsJWWPONVpQubSVd1AiMs X-Gm-Message-State: AOJu0YwsnKB0VyBAXmIENMkaBB1FlvXDhFi/ZeRu+aeNq798iL8Pql4Y KWnq4GP+gX5QCYAWv+X9jSVf+s4f7Oac6rgOZ/aLmJ5SfFAVnLat7ETNF1I8HmiZBzy0q+g0qGM eLJ1NGhek1WmKvmakj9S5Lg== X-Google-Smtp-Source: AGHT+IGuSWVGLjm6BNAxj9VNGjQsoYaKq5yjJAnoeHRkOGCgYBBqNcnicpH8QRX9RPVpSHcGjZeBf7DVtTu1adI2Eg== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a05:6a02:452:b0:6f6:d708:f6cf with SMTP id 41be03b00d2f7-7ac8adec009mr28511a12.0.1722306411282; Mon, 29 Jul 2024 19:26:51 -0700 (PDT) Date: Tue, 30 Jul 2024 02:26:16 +0000 In-Reply-To: <20240730022623.98909-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240730022623.98909-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.rc1.232.g9752f9e123-goog Message-ID: <20240730022623.98909-13-almasrymina@google.com> Subject: [PATCH net-next v17 12/14] net: add devmem TCP documentation From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, bpf@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Donald Hunter , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi Add documentation outlining the usage and details of devmem TCP. Signed-off-by: Mina Almasry Reviewed-by: Bagas Sanjaya Reviewed-by: Donald Hunter --- v16: - Add documentation on unbinding the NIC from dmabuf (Donald). - Add note that any dmabuf should work (Donald). v9: https://lore.kernel.org/netdev/20240403002053.2376017-14-almasrymina@google.com/ - Bagas doc suggestions. v8: - Applied docs suggestions (Randy). Thanks! v7: - Applied docs suggestions (Jakub). v2: - Missing spdx (simon) - add to index.rst (simon) --- Documentation/networking/devmem.rst | 269 ++++++++++++++++++++++++++++ Documentation/networking/index.rst | 1 + 2 files changed, 270 insertions(+) create mode 100644 Documentation/networking/devmem.rst diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst new file mode 100644 index 0000000000000..417fc977844ee --- /dev/null +++ b/Documentation/networking/devmem.rst @@ -0,0 +1,269 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Device Memory TCP +================= + + +Intro +===== + +Device memory TCP (devmem TCP) enables receiving data directly into device +memory (dmabuf). The feature is currently implemented for TCP sockets. + + +Opportunity +----------- + +A large number of data transfers have device memory as the source and/or +destination. Accelerators drastically increased the prevalence of such +transfers. Some examples include: + +- Distributed training, where ML accelerators, such as GPUs on different hosts, + exchange data. + +- Distributed raw block storage applications transfer large amounts of data with + remote SSDs. Much of this data does not require host processing. + +Typically the Device-to-Device data transfers in the network are implemented as +the following low-level operations: Device-to-Host copy, Host-to-Host network +transfer, and Host-to-Device copy. + +The flow involving host copies is suboptimal, especially for bulk data transfers, +and can put significant strains on system resources such as host memory +bandwidth and PCIe bandwidth. + +Devmem TCP optimizes this use case by implementing socket APIs that enable +the user to receive incoming network packets directly into device memory. + +Packet payloads go directly from the NIC to device memory. + +Packet headers go to host memory and are processed by the TCP/IP stack +normally. The NIC must support header split to achieve this. + +Advantages: + +- Alleviate host memory bandwidth pressure, compared to existing + network-transfer + device-copy semantics. + +- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest + level of the PCIe tree, compared to the traditional path which sends data + through the root complex. + + +More Info +--------- + + slides, video + https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html + + patchset + [RFC PATCH v6 00/12] Device Memory TCP + https://lore.kernel.org/netdev/20240305020153.2787423-1-almasrymina@google.com/ + + +Interface +========= + + +Example +------- + +tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up +the RX path of this API. + + +NIC Setup +--------- + +Header split, flow steering, & RSS are required features for devmem TCP. + +Header split is used to split incoming packets into a header buffer in host +memory, and a payload buffer in device memory. + +Flow steering & RSS are used to ensure that only flows targeting devmem land on +an RX queue bound to devmem. + +Enable header split & flow steering:: + + # enable header split + ethtool -G eth1 tcp-data-split on + + + # enable flow steering + ethtool -K eth1 ntuple on + +Configure RSS to steer all traffic away from the target RX queue (queue 15 in +this example):: + + ethtool --set-rxfh-indir eth1 equal 15 + + +The user must bind a dmabuf to any number of RX queues on a given NIC using +the netlink API:: + + /* Bind dmabuf to NIC RX queue 15 */ + struct netdev_queue *queues; + queues = malloc(sizeof(*queues) * 1); + + queues[0]._present.type = 1; + queues[0]._present.idx = 1; + queues[0].type = NETDEV_RX_QUEUE_TYPE_RX; + queues[0].idx = 15; + + *ys = ynl_sock_create(&ynl_netdev_family, &yerr); + + req = netdev_bind_rx_req_alloc(); + netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */); + netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd); + __netdev_bind_rx_req_set_queues(req, queues, n_queue_index); + + rsp = netdev_bind_rx(*ys, req); + + dmabuf_id = rsp->dmabuf_id; + + +The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf +that has been bound. + +The user can unbind the dmabuf from the netdevice by closing the netlink socket +that established the binding. We do this so that the binding is automatically +unbound even if the userspace process crashes. + +Note that any reasonably well-behaved dmabuf from any exporter should work with +devmem TCP, even if the dmabuf is not actually backed by devmem. An example of +this is udmabuf, which wraps user memory (non-devmem) in a dmabuf. + + +Socket Setup +------------ + +The socket must be flow steered to the dmabuf bound RX queue:: + + ethtool -N eth1 flow-type tcp4 ... queue 15, + + +Receiving data +-------------- + +The user application must signal to the kernel that it is capable of receiving +devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg:: + + ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM); + +Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT +on devmem data. + +Devmem data is received directly into the dmabuf bound to the NIC in 'NIC +Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs:: + + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level != SOL_SOCKET || + (cm->cmsg_type != SCM_DEVMEM_DMABUF && + cm->cmsg_type != SCM_DEVMEM_LINEAR)) + continue; + + dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm); + + if (cm->cmsg_type == SCM_DEVMEM_DMABUF) { + /* Frag landed in dmabuf. + * + * dmabuf_cmsg->dmabuf_id is the dmabuf the + * frag landed on. + * + * dmabuf_cmsg->frag_offset is the offset into + * the dmabuf where the frag starts. + * + * dmabuf_cmsg->frag_size is the size of the + * frag. + * + * dmabuf_cmsg->frag_token is a token used to + * refer to this frag for later freeing. + */ + + struct dmabuf_token token; + token.token_start = dmabuf_cmsg->frag_token; + token.token_count = 1; + continue; + } + + if (cm->cmsg_type == SCM_DEVMEM_LINEAR) + /* Frag landed in linear buffer. + * + * dmabuf_cmsg->frag_size is the size of the + * frag. + */ + continue; + + } + +Applications may receive 2 cmsgs: + +- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated + by dmabuf_id. + +- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer. + This typically happens when the NIC is unable to split the packet at the + header boundary, such that part (or all) of the payload landed in host + memory. + +Applications may receive no SO_DEVMEM_* cmsgs. That indicates non-devmem, +regular TCP data that landed on an RX queue not bound to a dmabuf. + + +Freeing frags +------------- + +Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user +processes the frag. The user must return the frag to the kernel via +SO_DEVMEM_DONTNEED:: + + ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token, + sizeof(token)); + +The user must ensure the tokens are returned to the kernel in a timely manner. +Failure to do so will exhaust the limited dmabuf that is bound to the RX queue +and will lead to packet drops. + + +Implementation & Caveats +======================== + +Unreadable skbs +--------------- + +Devmem payloads are inaccessible to the kernel processing the packets. This +results in a few quirks for payloads of devmem skbs: + +- Loopback is not functional. Loopback relies on copying the payload, which is + not possible with devmem skbs. + +- Software checksum calculation fails. + +- TCP Dump and bpf can't access devmem packet payloads. + + +Testing +======= + +More realistic example code can be found in the kernel source under +tools/testing/selftests/net/ncdevmem.c + +ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but +receives data directly into a udmabuf. + +To run ncdevmem, you need to run it on a server on the machine under test, and +you need to run netcat on a peer to provide the TX data. + +ncdevmem has a validation mode as well that expects a repeating pattern of +incoming data and validates it as such. For example, you can launch +ncdevmem on the server by:: + + ncdevmem -s -c -f eth1 -d 3 -n 0000:06:00.0 -l \ + -p 5201 -v 7 + +On client side, use regular netcat to send TX data to ncdevmem process +on the server:: + + yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \ + tr \\n \\0 | head -c 5G | nc 5201 -p 5201 diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index d1af04b952f81..0be9924db6423 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -49,6 +49,7 @@ Contents: cdc_mbim dccp dctcp + devmem dns_resolver driver eql From patchwork Tue Jul 30 02:26:18 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 815491 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8C562190478 for ; Tue, 30 Jul 2024 02:26:58 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306423; cv=none; b=LJAzJuJh92W4UNnUOO9Xo6TXg6BP40L1GOyki8evwXU/Aa0Thjik9o3BLPeQLEiUskqfhmWVML0Mz9RemEThR4h5+PH/dr1CGNEhVX8GY3M4sxphDi9K5fJ32LJSdQsGQrzDDj4ZFmgFa3K4gRBx3q47IZegDS1DKeYMocVpo4U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1722306423; c=relaxed/simple; bh=lWFZoXTllZzDQ5KVW7/ITGpCF1rpfZvNs38SGJyNKoI=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=G60455B2DIj63uodYxSvgIB9F2JSEe8jZUaUPTqqvL7vaXTLx7DSRuk4HAz0j2ZF5rxK85ubkK2q1gp3nJoo8asgMyVcT7I3/37v0AafMpe0hiYEGITIj7mq21+ouib5ogPgmRct3A6wMBBq/l+3uuaq+NTgMb+y7CLc3e7rETI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=THAh/aUD; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="THAh/aUD" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-66b3b4415c7so80161817b3.0 for ; Mon, 29 Jul 2024 19:26:57 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1722306416; x=1722911216; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=YpeL2jjjuQiDotVyStSYknocEdgkMBCGkLNpfXlLDrU=; b=THAh/aUDClyx15DqWIoGLWSXzUOSV/VmWRa2IjEVmel+G6UnLKVMxZ0e6mgZmTuMDi kdgL2Kq2CiD4iLBJuent3eWsAZRAHeEqjgsOWsgLWfE6Qakw0k4IhOrdjYp1n6YWPMjY gOrxL3AELs505g+6/0EA36eeTrOjnckx21hW03w5VcdCyYC194ZD7wKgNcGe8Bb3zQ1n BHYjOT4LHbttDrmFNYDHJNGXQMp9/iNq6zO5J7WDgr2uv+Ww/njP+FyddxBbyD2iBYgH JZSRZBhjABWM2WzHjqlgvCUzyVWXGFK4KIPT9mEZP7yXnWARvmYg3D86mDFGMGjIOiW/ F/rw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1722306416; x=1722911216; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=YpeL2jjjuQiDotVyStSYknocEdgkMBCGkLNpfXlLDrU=; b=LmNKNXsh9k/+r8Padc7HCycr7yZ9L3x4sIt8Z/EdqNH4kDPO0NYM9ijD2bX8jtMiR+ lXLo1zC83HYUVhncl/o5Csu/flBYzfbI71NYam+pm7xesiPYJ5bybSNYbuK7DWWU9bAU CLMERaD/uFkvjM7bbelgCT8sZcF/ojUuEBAoXWAmWUs2YCR82SRHjuU9rAbLtRmFLMS7 fYpKJfu0cQV+HT25hSFI0pxSW/XTytTgVRxsawB+7uKTVplVniumWqWZHqZhScF3aoiY ePOnPXf5VzgP/1Rzr3o03HRAsZCoV2o0yHT+MMU/MP+4L20BcRFAiI1hgEWBxxsL1mdt DCqw== X-Forwarded-Encrypted: i=1; AJvYcCX/XIjJNOqQEpfmRptVkOwp0YKORzc8faovhKuw1o7Cny87vBtlrNUOL6WZMW6CAWVViCr+D8ipTCVojfEgItIkndGSHVvUPwHWaleck75n X-Gm-Message-State: AOJu0Yz9pAQ3bwA9n388UP4ME2tDS3a+tEW3rUAa7YoD3/+XGXasMthz TiHl9+fpUo1H39Ep2838ohxGIo6zPJgCRY9g2WUj7obEKbRZpDthGQ9G+e5zHnqgXBFHLQQqjtI vUm5+XcAERyGcFiaei6hOCQ== X-Google-Smtp-Source: AGHT+IHC29ZYOcgw90qZqEqccd2UZ1WHTmKpe+12sX2SEVfmpnkgtaiJhRBXg0BZNAdwmD96v+mTgP75qI/xOgbFKg== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a05:6902:1029:b0:e0b:9b5:8647 with SMTP id 3f1490d57ef6-e0b544ec4ddmr18023276.8.1722306415855; Mon, 29 Jul 2024 19:26:55 -0700 (PDT) Date: Tue, 30 Jul 2024 02:26:18 +0000 In-Reply-To: <20240730022623.98909-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240730022623.98909-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.rc1.232.g9752f9e123-goog Message-ID: <20240730022623.98909-15-almasrymina@google.com> Subject: [PATCH net-next v17 14/14] netdev: add dmabuf introspection From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, linux-kselftest@vger.kernel.org, bpf@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Donald Hunter , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi Add dmabuf information to page_pool stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump page-pool-get ... {'dmabuf': 10, 'id': 456, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 455, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 454, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 453, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 452, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 451, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 450, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, {'dmabuf': 10, 'id': 449, 'ifindex': 3, 'inflight': 1023, 'inflight-mem': 4190208}, And queue stats: $ ./cli.py --spec ../netlink/specs/netdev.yaml --dump queue-get ... {'dmabuf': 10, 'id': 8, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 9, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 10, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 11, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 12, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 13, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 14, 'ifindex': 3, 'type': 'rx'}, {'dmabuf': 10, 'id': 15, 'ifindex': 3, 'type': 'rx'}, Suggested-by: Jakub Kicinski Signed-off-by: Mina Almasry Reviewed-by: Jakub Kicinski --- Documentation/netlink/specs/netdev.yaml | 10 ++++++++++ include/uapi/linux/netdev.h | 2 ++ net/core/netdev-genl.c | 10 ++++++++++ net/core/page_pool_user.c | 4 ++++ tools/include/uapi/linux/netdev.h | 2 ++ 5 files changed, 28 insertions(+) diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index 0c747530c275e..08412c279297b 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -167,6 +167,10 @@ attribute-sets: "re-attached", they are just waiting to disappear. Attribute is absent if Page Pool has not been detached, and can still be used to allocate new memory. + - + name: dmabuf + doc: ID of the dmabuf this page-pool is attached to. + type: u32 - name: page-pool-info subset-of: page-pool @@ -268,6 +272,10 @@ attribute-sets: name: napi-id doc: ID of the NAPI instance which services this queue. type: u32 + - + name: dmabuf + doc: ID of the dmabuf attached to this queue, if any. + type: u32 - name: qstats @@ -543,6 +551,7 @@ operations: - inflight - inflight-mem - detach-time + - dmabuf dump: reply: *pp-reply config-cond: page-pool @@ -607,6 +616,7 @@ operations: - type - napi-id - ifindex + - dmabuf dump: request: attributes: diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 91bf3ecc5f1d9..7c308f04e7a06 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -93,6 +93,7 @@ enum { NETDEV_A_PAGE_POOL_INFLIGHT, NETDEV_A_PAGE_POOL_INFLIGHT_MEM, NETDEV_A_PAGE_POOL_DETACH_TIME, + NETDEV_A_PAGE_POOL_DMABUF, __NETDEV_A_PAGE_POOL_MAX, NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1) @@ -131,6 +132,7 @@ enum { NETDEV_A_QUEUE_IFINDEX, NETDEV_A_QUEUE_TYPE, NETDEV_A_QUEUE_NAPI_ID, + NETDEV_A_QUEUE_DMABUF, __NETDEV_A_QUEUE_MAX, NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1) diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index bd54cf50b658a..e944fd56c6b8e 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -293,6 +293,7 @@ static int netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev, u32 q_idx, u32 q_type, const struct genl_info *info) { + struct net_devmem_dmabuf_binding *binding; struct netdev_rx_queue *rxq; struct netdev_queue *txq; void *hdr; @@ -312,6 +313,15 @@ netdev_nl_queue_fill_one(struct sk_buff *rsp, struct net_device *netdev, if (rxq->napi && nla_put_u32(rsp, NETDEV_A_QUEUE_NAPI_ID, rxq->napi->napi_id)) goto nla_put_failure; + + binding = (struct net_devmem_dmabuf_binding *) + rxq->mp_params.mp_priv; + if (binding) { + if (nla_put_u32(rsp, NETDEV_A_QUEUE_DMABUF, + binding->id)) + goto nla_put_failure; + } + break; case NETDEV_QUEUE_TYPE_TX: txq = netdev_get_tx_queue(netdev, q_idx); diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c index 3a3277ba167b1..ca13363aea343 100644 --- a/net/core/page_pool_user.c +++ b/net/core/page_pool_user.c @@ -212,6 +212,7 @@ static int page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool, const struct genl_info *info) { + struct net_devmem_dmabuf_binding *binding = pool->mp_priv; size_t inflight, refsz; void *hdr; @@ -241,6 +242,9 @@ page_pool_nl_fill(struct sk_buff *rsp, const struct page_pool *pool, pool->user.detach_time)) goto err_cancel; + if (binding && nla_put_u32(rsp, NETDEV_A_PAGE_POOL_DMABUF, binding->id)) + goto err_cancel; + genlmsg_end(rsp, hdr); return 0; diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index 91bf3ecc5f1d9..7c308f04e7a06 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -93,6 +93,7 @@ enum { NETDEV_A_PAGE_POOL_INFLIGHT, NETDEV_A_PAGE_POOL_INFLIGHT_MEM, NETDEV_A_PAGE_POOL_DETACH_TIME, + NETDEV_A_PAGE_POOL_DMABUF, __NETDEV_A_PAGE_POOL_MAX, NETDEV_A_PAGE_POOL_MAX = (__NETDEV_A_PAGE_POOL_MAX - 1) @@ -131,6 +132,7 @@ enum { NETDEV_A_QUEUE_IFINDEX, NETDEV_A_QUEUE_TYPE, NETDEV_A_QUEUE_NAPI_ID, + NETDEV_A_QUEUE_DMABUF, __NETDEV_A_QUEUE_MAX, NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1)