From patchwork Mon Sep 9 05:43:06 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 826773 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 49C4F1AD3ED for ; Mon, 9 Sep 2024 05:43:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860609; cv=none; b=gGyH4t4xQkEPPhPB75d8VLvZsKr/cgOeTAXAgXBaa5AZ2dQdHvngJ+3oDKJAZKUZKeDLmu58RYhH5Q3ewAKOPgWsFO61//Q2HL604zFXrnOJixlynISQrwANoQNR9GoTcjyebuYxsG1yY/Uv4R5chemI//IOpwy7J5KDNNLgEBk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860609; c=relaxed/simple; bh=DNA3F85kfwiW9bgsfvLXXO875oSihrq+DpGt530m/B8=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=IoYSQWnGht5CLYLayP4IqInr5xNCT3zHS520klQjNEppA9rvjbztuvErfbSYCElEFFn8vZh8Rmntfxcn0bHIbMoX0MUUbNuE1wzVPiYM5xUeiVabZc0VnMd59NTdXiWjbRRbbetLfIxgJo5NESjiXFYJKu+0UAiN1gTpqpH4NWI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=bhjNMIen; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="bhjNMIen" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-6d54ab222fcso134229327b3.1 for ; Sun, 08 Sep 2024 22:43:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725860604; x=1726465404; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=i/PEygQn081YuovQ6b+Ov4J7AW7lnL8H+wq29MS093A=; b=bhjNMIenuk14C2l5V7q6HmUTPR/6a5fI1mLuNOuKhTPL4QFbZJ2hAKOng7I4m73L02 dHo+Zen81V06Aijil2ADBN+FROdjz+IeuA3aQGt1ncU0D0V1emng0fAt2+NaiRpEheEo p8kRRkFfRif7SNooNw9eofva7REUSIZrKXRym/sJ5Ko3jToKwvJsM8gFwWXH1uirzcgT mShe2gTJRgNWiabbBO9zBKkzmVgXiZUmgomCqZDYJOsyZ3kbN/vPlb74iaSAddbXRsmV 08KuEaD+WIiPe6JeWkXqBgu2GBMEYmI5pCWsFZD3BO7dQnqu8g72LK+qf0OQCG4GMBgX oxNQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725860604; x=1726465404; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=i/PEygQn081YuovQ6b+Ov4J7AW7lnL8H+wq29MS093A=; b=YGlf0HQUKemFYjH73ANa9aRcJBqA5a9KTRa7J0IVm9PD1tCNGszfsxJT7pkMrcSBXT J2JMTAKik/Fmm4IRI1PRxkDV9sMmuI1Jsvf7KVLTv+IiX+ZVdlYvIdGd6Dhm/Bw5swZm dvjVmkpUPmpPyNf04eVj4hno4Mq8AMPKJF4rAXrRGuQyFBnDhdHqAtInrr6X3LBA0U48 e9P0XRJi2JCn00gF6MpxDF37H8sxpXJoMGLo+LGIvhV+bUc0qR357MMUQ+/NgozzdBOC iYzc2482yQiqjtMnXb9KuD2p1/KHilCox7/yZcyAbsW+aRqqe9eCwTdVf5hGzCh42jpK r0/w== X-Forwarded-Encrypted: i=1; AJvYcCVBouyZ93co1mUH03+vLHMkoskDbinIYCKu1naN5prqPLZLYF6ONtYFhNOwOBXNVZAxnq7Nm79b3eoZO+7CvDg=@vger.kernel.org X-Gm-Message-State: AOJu0YyoYmyxuVHgz4OOagDDS23ObNa1Ye7rhbyZt4N5qEb8GNsH/9EK M23o09LifJpb1v6sJwOzj1NRX8iHJ0UYrO+uv5pCIoSAsxVLU3Pc4UdNmxff8yG9XHwgoLnhu/K T53jI4pZoTWDLUsPIPZ486g== X-Google-Smtp-Source: AGHT+IHY3ZWZM5m6iLmr25lpRxzPgba3UWyYzEPhtrKYdPsv3KEg5Ew5Cd8AVaehz177/KHSxTj8z97y5iwx9jcZ0A== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a05:6902:4d1:b0:e0b:cce3:45c7 with SMTP id 3f1490d57ef6-e1d349e38d5mr18201276.9.1725860604167; Sun, 08 Sep 2024 22:43:24 -0700 (PDT) Date: Mon, 9 Sep 2024 05:43:06 +0000 In-Reply-To: <20240909054318.1809580-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240909054318.1809580-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.469.g59c65b2a67-goog Message-ID: <20240909054318.1809580-2-almasrymina@google.com> Subject: [PATCH net-next v25 01/13] netdev: add netdev_rx_queue_restart() From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , Donald Hunter , Jakub Kicinski , "David S. Miller" , Eric Dumazet , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , " =?utf-8?b?QmrDtnJu?= =?utf-8?b?IFTDtnBlbA==?= " , Magnus Karlsson , Maciej Fijalkowski , Jonathan Lemon , Shuah Khan , Alexei Starovoitov , Daniel Borkmann , John Fastabend , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo Add netdev_rx_queue_restart(), which resets an rx queue using the queue API recently merged[1]. The queue API was merged to enable the core net stack to reset individual rx queues to actuate changes in the rx queue's configuration. In later patches in this series, we will use netdev_rx_queue_restart() to reset rx queues after binding or unbinding dmabuf configuration, which will cause reallocation of the page_pool to repopulate its memory using the new configuration. [1] https://lore.kernel.org/netdev/20240430231420.699177-1-shailend@google.com/T/ Signed-off-by: David Wei Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Jakub Kicinski --- v18: - Add more color to commit message (Xuan Zhuo). v17: - Use ASSERT_RTNL() (Jakub). v13: - Add reviewed-by from Pavel (thanks!) - Fixed comment (Pavel) v11: - Fix not checking dev->queue_mgmt_ops (Pavel). - Fix ndo_queue_mem_free call that passed the wrong pointer (David). v9: https://lore.kernel.org/all/20240502045410.3524155-4-dw@davidwei.uk/ (submitted by David). - fixed SPDX license identifier (Simon). - Rebased on top of merged queue API definition, and changed implementation to match that. - Replace rtnl_lock() with rtnl_is_locked() to make it useable from my netlink code where rtnl is already locked. --- include/net/netdev_rx_queue.h | 3 ++ net/core/Makefile | 1 + net/core/netdev_rx_queue.c | 74 +++++++++++++++++++++++++++++++++++ 3 files changed, 78 insertions(+) create mode 100644 net/core/netdev_rx_queue.c diff --git a/include/net/netdev_rx_queue.h b/include/net/netdev_rx_queue.h index aa1716fb0e53..e78ca52d67fb 100644 --- a/include/net/netdev_rx_queue.h +++ b/include/net/netdev_rx_queue.h @@ -54,4 +54,7 @@ get_netdev_rx_queue_index(struct netdev_rx_queue *queue) return index; } #endif + +int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq); + #endif diff --git a/net/core/Makefile b/net/core/Makefile index 62be9aef2528..f82232b358a2 100644 --- a/net/core/Makefile +++ b/net/core/Makefile @@ -19,6 +19,7 @@ obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o obj-y += net-sysfs.o obj-y += hotdata.o +obj-y += netdev_rx_queue.o obj-$(CONFIG_PAGE_POOL) += page_pool.o page_pool_user.o obj-$(CONFIG_PROC_FS) += net-procfs.o obj-$(CONFIG_NET_PKTGEN) += pktgen.o diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c new file mode 100644 index 000000000000..da11720a5983 --- /dev/null +++ b/net/core/netdev_rx_queue.c @@ -0,0 +1,74 @@ +// SPDX-License-Identifier: GPL-2.0-or-later + +#include +#include +#include + +int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx) +{ + void *new_mem, *old_mem; + int err; + + if (!dev->queue_mgmt_ops || !dev->queue_mgmt_ops->ndo_queue_stop || + !dev->queue_mgmt_ops->ndo_queue_mem_free || + !dev->queue_mgmt_ops->ndo_queue_mem_alloc || + !dev->queue_mgmt_ops->ndo_queue_start) + return -EOPNOTSUPP; + + ASSERT_RTNL(); + + new_mem = kvzalloc(dev->queue_mgmt_ops->ndo_queue_mem_size, GFP_KERNEL); + if (!new_mem) + return -ENOMEM; + + old_mem = kvzalloc(dev->queue_mgmt_ops->ndo_queue_mem_size, GFP_KERNEL); + if (!old_mem) { + err = -ENOMEM; + goto err_free_new_mem; + } + + err = dev->queue_mgmt_ops->ndo_queue_mem_alloc(dev, new_mem, rxq_idx); + if (err) + goto err_free_old_mem; + + err = dev->queue_mgmt_ops->ndo_queue_stop(dev, old_mem, rxq_idx); + if (err) + goto err_free_new_queue_mem; + + err = dev->queue_mgmt_ops->ndo_queue_start(dev, new_mem, rxq_idx); + if (err) + goto err_start_queue; + + dev->queue_mgmt_ops->ndo_queue_mem_free(dev, old_mem); + + kvfree(old_mem); + kvfree(new_mem); + + return 0; + +err_start_queue: + /* Restarting the queue with old_mem should be successful as we haven't + * changed any of the queue configuration, and there is not much we can + * do to recover from a failure here. + * + * WARN if we fail to recover the old rx queue, and at least free + * old_mem so we don't also leak that. + */ + if (dev->queue_mgmt_ops->ndo_queue_start(dev, old_mem, rxq_idx)) { + WARN(1, + "Failed to restart old queue in error path. RX queue %d may be unhealthy.", + rxq_idx); + dev->queue_mgmt_ops->ndo_queue_mem_free(dev, old_mem); + } + +err_free_new_queue_mem: + dev->queue_mgmt_ops->ndo_queue_mem_free(dev, new_mem); + +err_free_old_mem: + kvfree(old_mem); + +err_free_new_mem: + kvfree(new_mem); + + return err; +} From patchwork Mon Sep 9 05:43:09 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 826771 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id C67351AD9E3 for ; Mon, 9 Sep 2024 05:43:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860617; cv=none; b=J4LRxk06gh9Jk8535DKms07/0SHICifsJgzjNJrkcH1DuP5pS4e4E8shv/dg6WdjBccmVNZ9ZKr9LZxRCLyz3rfTOADGWINgejOWy7JqahfJwD8vX2EOPODKvMDU8bZ1kXl1p4HWfA5YjYLyVAjDQS1/tqc2Vir+UNt4M/xRIhk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860617; c=relaxed/simple; bh=WeTDJ7ZZ1u9c8OCH/DRwuCCCr2tmlYwQobBUfuuwPY8=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=FQANaRe7oTFPlGGnbK3yaidxk2igUJlgMgxJ2YEX2zsdCUsLl9Gm1OSFPH5vnshcMnd0eYYl4o1uklW6m2gc5AnF5jZgaNw7FZuNmbLSW5camcTuf1KtzjWWMSKaOBAKwti6+nzckUsOW1Vf569R4ci4/m37AfbwhG4ySiA5MDQ= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=t5K3PGdP; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="t5K3PGdP" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e1a8de19f7aso8245757276.3 for ; Sun, 08 Sep 2024 22:43:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725860609; x=1726465409; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=ECIeZR9RzpNIy7i73AGnhxjov/QI8UJt0jpUvT+70vI=; b=t5K3PGdPt/z6TH4R0diqGEYCEOpL5gAnad9t38DA/ZTVH3fcr8BzajKSnp5t6r4NnK XAjpRUEEtxUQWl+c5qmYv6vqhZzjjKwiJYL2+tFe8bXpP5XA9/V0Z9Ozwxr051bOJ1Cc L8dmvNNHoCxm4fkTC92qzl3BNFB/5HQ4gv3nJSF1n5hvNdS2krQaBAm4lCDA5/QFJTQL JgXXIwHUcU7+zta7BXSX/inf3p/42n2ehbB6jhjGuULH3zy6Rmg7jKfrhagLIGB8j0dv 8Alf+zTJfz6xH8crT2Tw9figfu4tl/ZIbQ4i2MU3vx70UEDoCorLG9/k4kdSrKkdcJXG Ej7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725860609; x=1726465409; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=ECIeZR9RzpNIy7i73AGnhxjov/QI8UJt0jpUvT+70vI=; b=DvubNBqllOG0souzSlsRot+kM7oKzYbM+GcbxY4zWkLVnLPqisyRh9UQ0nh+1WkYSv yakOp04IuIfEPxxjeNUMwnB2wEriR+xqWg+bQaEMVOT1VRAOInLp+at+orFDxT3BtPgU 8ir5cXXQ2aegws9OWkD47DNCAqlB1erbDcrogBICpqmcwX8EHct5A8ldpyyK/ZWkAyoO 9RrVzr7S/t0GAdDHk0mkOllMxPOlC9jG1510iQCvDfUDFz1PA6JqZ6PFE0mTjSbXGta6 U5TLGyYPEpONKvhatRrF5nZuhTYJrz2DIXeXi0eV/W7Wso2d640x15olOG12HyS/X93s HB+A== X-Forwarded-Encrypted: i=1; AJvYcCUVnNlGkU75ik+oZXyUZUW5SmWF/wHa8xaYkh43fuFLwZtlf9qZHu9rO//MBzfwzYC9RTJeKpRq1S0sWogNG9E=@vger.kernel.org X-Gm-Message-State: AOJu0Yz4boNlqV/JSTvpjK0QA46iHxO5DNcx5Thg4Fpo5hFxQ2W8lEqU G8C5a+jHyVhYmyiZEv+OkQDIQXyQyZgUUJ8c3vcvky/bxEx0e+DAHdAG5/9HYhNoIqZgm+LvxgH NcIKeQokd3CH+IzDfCCvQ6A== X-Google-Smtp-Source: AGHT+IH6BBD54EtEYwpncKGXuiMb46yGAk99MKkGAmBV/dBkcwIDJg6JBjgLXyLNhVw8tz5j31vVfWXujpW14zoBJw== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a25:abb3:0:b0:e1d:21ae:3b95 with SMTP id 3f1490d57ef6-e1d34a7232cmr22362276.10.1725860609193; Sun, 08 Sep 2024 22:43:29 -0700 (PDT) Date: Mon, 9 Sep 2024 05:43:09 +0000 In-Reply-To: <20240909054318.1809580-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240909054318.1809580-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.469.g59c65b2a67-goog Message-ID: <20240909054318.1809580-5-almasrymina@google.com> Subject: [PATCH net-next v25 04/13] netdev: netdevice devmem allocator From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , Donald Hunter , Jakub Kicinski , "David S. Miller" , Eric Dumazet , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , " =?utf-8?b?QmrDtnJu?= =?utf-8?b?IFTDtnBlbA==?= " , Magnus Karlsson , Maciej Fijalkowski , Jonathan Lemon , Shuah Khan , Alexei Starovoitov , Daniel Borkmann , John Fastabend , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Willem de Bruijn , Kaiyuan Zhang Implement netdev devmem allocator. The allocator takes a given struct netdev_dmabuf_binding as input and allocates net_iov from that binding. The allocation simply delegates to the binding's genpool for the allocation logic and wraps the returned memory region in a net_iov struct. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Jakub Kicinski --- v23: - WARN_ON when we don't see the dma_addr in the gen_pool (Jakub) v20: - Removed dma_addr field in dmabuf_genpool_chunk_owner not used in this patch (moved to later patch where it's used). v19: - Don't reset dma_addr on allocation/free (Jakub) v17: - Don't acquire a binding ref for every allocation (Jakub). v11: - Fix extraneous inline directive (Paolo) v8: - Rename netdev_dmabuf_binding -> net_devmem_dmabuf_binding to avoid patch-by-patch build error. - Move niov->pp_magic/pp/pp_ref_counter usage to later patch to avoid patch-by-patch build error. v7: - netdev_ -> net_devmem_* naming (Yunsheng). v6: - Add comment on net_iov_dma_addr to explain why we don't use niov->dma_addr (Pavel) - Refactor new functions into net/core/devmem.c (Pavel) v1: - Rename devmem -> dmabuf (David). fix allocator --- net/core/devmem.c | 41 +++++++++++++++++++++++++++++++++++++++++ net/core/devmem.h | 34 ++++++++++++++++++++++++++++++++++ 2 files changed, 75 insertions(+) diff --git a/net/core/devmem.c b/net/core/devmem.c index 8dd7beb080d2..9beb03763dc9 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -34,6 +34,14 @@ static void net_devmem_dmabuf_free_chunk_owner(struct gen_pool *genpool, kfree(owner); } +static dma_addr_t net_devmem_get_dma_addr(const struct net_iov *niov) +{ + struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov); + + return owner->base_dma_addr + + ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT); +} + void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) { size_t size, avail; @@ -56,6 +64,39 @@ void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) kfree(binding); } +struct net_iov * +net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) +{ + struct dmabuf_genpool_chunk_owner *owner; + unsigned long dma_addr; + struct net_iov *niov; + ssize_t offset; + ssize_t index; + + dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE, + (void **)&owner); + if (!dma_addr) + return NULL; + + offset = dma_addr - owner->base_dma_addr; + index = offset / PAGE_SIZE; + niov = &owner->niovs[index]; + + return niov; +} + +void net_devmem_free_dmabuf(struct net_iov *niov) +{ + struct net_devmem_dmabuf_binding *binding = net_iov_binding(niov); + unsigned long dma_addr = net_devmem_get_dma_addr(niov); + + if (WARN_ON(!gen_pool_has_addr(binding->chunk_pool, dma_addr, + PAGE_SIZE))) + return; + + gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); +} + void net_devmem_unbind_dmabuf(struct net_devmem_dmabuf_binding *binding) { struct netdev_rx_queue *rxq; diff --git a/net/core/devmem.h b/net/core/devmem.h index c50f91d858dd..b1db4877cff9 100644 --- a/net/core/devmem.h +++ b/net/core/devmem.h @@ -74,6 +74,23 @@ int net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, struct netlink_ext_ack *extack); void dev_dmabuf_uninstall(struct net_device *dev); +static inline struct dmabuf_genpool_chunk_owner * +net_iov_owner(const struct net_iov *niov) +{ + return niov->owner; +} + +static inline unsigned int net_iov_idx(const struct net_iov *niov) +{ + return niov - net_iov_owner(niov)->niovs; +} + +static inline struct net_devmem_dmabuf_binding * +net_iov_binding(const struct net_iov *niov) +{ + return net_iov_owner(niov)->binding; +} + static inline void net_devmem_dmabuf_binding_get(struct net_devmem_dmabuf_binding *binding) { @@ -89,7 +106,13 @@ net_devmem_dmabuf_binding_put(struct net_devmem_dmabuf_binding *binding) __net_devmem_dmabuf_binding_free(binding); } +struct net_iov * +net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding); +void net_devmem_free_dmabuf(struct net_iov *ppiov); + #else +struct net_devmem_dmabuf_binding; + static inline void __net_devmem_dmabuf_binding_free(struct net_devmem_dmabuf_binding *binding) { @@ -119,6 +142,17 @@ net_devmem_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, static inline void dev_dmabuf_uninstall(struct net_device *dev) { } + +static inline struct net_iov * +net_devmem_alloc_dmabuf(struct net_devmem_dmabuf_binding *binding) +{ + return NULL; +} + +static inline void net_devmem_free_dmabuf(struct net_iov *ppiov) +{ +} + #endif #endif /* _NET_DEVMEM_H */ From patchwork Mon Sep 9 05:43:11 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 826770 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 4F4DA1AE051 for ; Mon, 9 Sep 2024 05:43:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860621; cv=none; b=JlQMssPDIdjiw/E0pdZcI1Klhp1/B+a55kHKEWzKIjF6wwG8edm8AN8w8BDAO2S0ycpqO9dDk9rCLtScbV6nbvHBeW3jVMihurG1X/YUyXJZL+X1KkJmTHEMy6YQB3iqvg7QXafUR29RIjQxKcLMJmVS/MMkvicONbGHJyHyWRE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860621; c=relaxed/simple; bh=Wl44tIj63+fUhmECOSdYw72q021zG+rG1uQBl4hLvvU=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=B9ARuic0FEJJrBqrt1mEykN7NU2QPTIHOnuGLX1H+xr0SeYT/gbfbY0s+d2MFlPZX7mJvVS2yylQncsf8woQOnYa5eoFiQT/XXvyL9y6C0D4/kCSdvOTOOdJ32b1H8nrtKi35SLmr66A9vevgyms9FskAi8huzUmrOsuFSP7lO4= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=gwMQy1t+; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="gwMQy1t+" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-690404fd230so123957877b3.3 for ; Sun, 08 Sep 2024 22:43:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725860613; x=1726465413; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=Z5my9lIBS8Kt8aaeKl6dbNNkzL2f+kvxRUOGslDtYFg=; b=gwMQy1t+7/nM7OwBE8qZgXTAe66YmnoNZsLKqzRyAKhhsPfYmVYPSVaSxKSu4Dj7Pa FJ2ZBFffqwdbEttDqWvtOO/EyFDMqlwVkh9+PINwQy/ozPXF+9mrKFkLGW7rz2/D/9o0 2YYNbg4P8Qih+ng1bru/74uPZ63BAsVELRgPuV6NNJqq7s0DdlILgcHjrmxOJc6Q4pq2 jOSJw/hYerLn7qLtB06QwF4Ql98LTurOXxNm7PQ5/fgawjQOn4H261ZCXrfK5cjHWE0Y vbm9hEJzWDD+X/cUfjOVauD00QxLZJRcaAdgb1llrm9gYYabFIlXaI0ISKGTQcLZi48C id5w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725860613; x=1726465413; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Z5my9lIBS8Kt8aaeKl6dbNNkzL2f+kvxRUOGslDtYFg=; b=sakyOhDYq+RZ6uRGWkzR/M0UrB9uuqEDjunIk73O6cuTYfEnMKQOFvGmo0pVXxDmrJ TqrUTb1PmjNUK5VAw3bXuoK22gEVOl1nimohjz/eiAw6r0qQpg/uS9SK2Uj+laol86cJ NI1S86hZ6rRdagSjXRxm22P9x7yKL1h4NWGQnUo7qvbPP5h83kO9i4UvvUlIEanti+OW C5IRj6IQIhS+ZrRam9b0zlAhybCm5kG3r8C38lQpBFllCAPyeOrQOaPIIQVtTJEFpKgQ 5Dsy7a3Trl2Hvblq12TEoGOdfZetlALelPGLmdY2U57eSybBRMuDo4b5V5oPNaZmO5Vb q32g== X-Forwarded-Encrypted: i=1; AJvYcCV9MP7kc5tOaEITdW1o5Q38xaXOBsY2zS3VKf4Uf0MCf0O1PeaPK8HNCalnGUs5Zi05bMGDQzJcYIx97PaDE3s=@vger.kernel.org X-Gm-Message-State: AOJu0YwWcjDGwhwTAyDniIYl+l7c9TysuCcsg8HGxEdBPZpGOH/R8yK+ rjBUSyGtaqdD4FectbBfiwOO9geVwXR2csB63CCBEu+1ptjwbqDVoBNpn/mbqvCR9tjiBVnVPLs hThoJ3kFQngG/OzopJ3AI7w== X-Google-Smtp-Source: AGHT+IG6Opvhjlb6+C6OdNH39NhbTND/s060Br2RN0O6b6IdgeCsfaCjK4AvOPTEVXT00sLdSZzrIYqlDKJBteQ0iw== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a25:86c1:0:b0:e0e:4350:d7de with SMTP id 3f1490d57ef6-e1d349dae5dmr16707276.9.1725860613267; Sun, 08 Sep 2024 22:43:33 -0700 (PDT) Date: Mon, 9 Sep 2024 05:43:11 +0000 In-Reply-To: <20240909054318.1809580-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240909054318.1809580-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.469.g59c65b2a67-goog Message-ID: <20240909054318.1809580-7-almasrymina@google.com> Subject: [PATCH net-next v25 06/13] memory-provider: dmabuf devmem memory provider From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , Donald Hunter , Jakub Kicinski , "David S. Miller" , Eric Dumazet , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , " =?utf-8?b?QmrDtnJu?= =?utf-8?b?IFTDtnBlbA==?= " , Magnus Karlsson , Maciej Fijalkowski , Jonathan Lemon , Shuah Khan , Alexei Starovoitov , Daniel Borkmann , John Fastabend , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Willem de Bruijn , Kaiyuan Zhang Implement a memory provider that allocates dmabuf devmem in the form of net_iov. The provider receives a reference to the struct netdev_dmabuf_binding via the pool->mp_priv pointer. The driver needs to set this pointer for the provider in the net_iov. The provider obtains a reference on the netdev_dmabuf_binding which guarantees the binding and the underlying mapping remains alive until the provider is destroyed. Usage of PP_FLAG_DMA_MAP is required for this memory provide such that the page_pool can provide the driver with the dma-addrs of the devmem. Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order != 0. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Jakub Kicinski --- v25: - Change page_pool_param netdev_rx_queue struct to queue_idx (Jakub) - Address nits (Jakub) - Move mp_dmabuf_devmem.h to net/core (jakub). v23: - Sort includes (Jakub) - Add missing linux/mm.h include found after sorting. v21: - Provide empty definitions of functions moved to page_pool_priv.h, so that the build still succeeds when CONFIG_PAGE_POOL is not set. v20: - Moved queue pp_params field from fast path entries to slow path entries. - Moved page_pool_check_memory_provider() call to inside netdev_rx_queue_restart (Pavel). - Removed binding arg to page_pool_check_memory_provider() (Pavel). - Removed unnecessary includes from page_pool.c - Removed EXPORT_SYMBOL(page_pool_mem_providers) (Jakub) - Check pool->slow.queue instead of walking binding xarray (Pavel & Jakub). v19: - Add PP_FLAG_ALLOW_UNREADABLE_NETMEM flag. It serves 2 purposes, (a) it guards drivers that don't support unreadable netmem (net_iov backed) from accidentally getting exposed to it, and (b) drivers that wish to create header pools can unset it for that pool to force readable netmem. - Add page_pool_check_memory_provider, which verifies that the driver has created a page_pool with the expected configuration. This is used to report to the user if the mp configuration succeeded, and also verify that the driver is doing the right thing. - Don't reset niov->dma_addr on allocation/free. v17: - Use ASSERT_RTNL (Jakub) v16: - Add DEBUG_NET_WARN_ON_ONCE(!rtnl_is_locked()), to catch cases if page_pool_init without rtnl_locking when the queue is provided. In this case, the queue configuration may be changed while we're initing the page_pool, which could be a race. v13: - Return on warning (Pavel). - Fixed pool->recycle_stats not being freed on error (Pavel). - Applied reviewed-by from Pavel. v11: - Rebase to not use the ops. (Christoph) v8: - Use skb_frag_size instead of frag->bv_len to fix patch-by-patch build error v6: - refactor new memory provider functions into net/core/devmem.c (Pavel) v2: - Disable devmem for p.order != 0 v1: - static_branch check in page_is_page_pool_iov() (Willem & Paolo). - PP_DEVMEM -> PP_IOV (David). - Require PP_FLAG_DMA_MAP (Jakub). --- include/net/netmem.h | 1 + include/net/page_pool/types.h | 17 +++++- net/core/devmem.c | 67 ++++++++++++++++++++++ net/core/mp_dmabuf_devmem.h | 44 +++++++++++++++ net/core/netdev_rx_queue.c | 7 +++ net/core/page_pool.c | 103 +++++++++++++++++++++++++--------- net/core/page_pool_priv.h | 20 +++++++ net/core/page_pool_user.c | 27 ++++++++- 8 files changed, 257 insertions(+), 29 deletions(-) create mode 100644 net/core/mp_dmabuf_devmem.h diff --git a/include/net/netmem.h b/include/net/netmem.h index 5eccc40df92d..8a6e20be4b9d 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -8,6 +8,7 @@ #ifndef _NET_NETMEM_H #define _NET_NETMEM_H +#include #include /* net_iov */ diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index 4afd6dd56351..c022c410abe3 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -20,8 +20,18 @@ * device driver responsibility */ #define PP_FLAG_SYSTEM_POOL BIT(2) /* Global system page_pool */ + +/* Allow unreadable (net_iov backed) netmem in this page_pool. Drivers setting + * this must be able to support unreadable netmem, where netmem_address() would + * return NULL. This flag should not be set for header page_pools. + * + * If the driver sets PP_FLAG_ALLOW_UNREADABLE_NETMEM, it should also set + * page_pool_params.slow.queue_idx. + */ +#define PP_FLAG_ALLOW_UNREADABLE_NETMEM BIT(3) + #define PP_FLAG_ALL (PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV | \ - PP_FLAG_SYSTEM_POOL) + PP_FLAG_SYSTEM_POOL | PP_FLAG_ALLOW_UNREADABLE_NETMEM) /* * Fast allocation side cache array/stack @@ -57,7 +67,9 @@ struct pp_alloc_cache { * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV * @slow: params with slowpath access only (initialization and Netlink) * @netdev: netdev this pool will serve (leave as NULL if none or multiple) - * @flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_SYSTEM_POOL + * @queue_idx: queue idx this page_pool is being created for. + * @flags: PP_FLAG_DMA_MAP, PP_FLAG_DMA_SYNC_DEV, PP_FLAG_SYSTEM_POOL, + * PP_FLAG_ALLOW_UNREADABLE_NETMEM. */ struct page_pool_params { struct_group_tagged(page_pool_params_fast, fast, @@ -72,6 +84,7 @@ struct page_pool_params { ); struct_group_tagged(page_pool_params_slow, slow, struct net_device *netdev; + unsigned int queue_idx; unsigned int flags; /* private: used by test code only */ void (*init_callback)(netmem_ref netmem, void *arg); diff --git a/net/core/devmem.c b/net/core/devmem.c index 7efeb602cf45..11b91c12ee11 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -18,6 +18,7 @@ #include #include "devmem.h" +#include "mp_dmabuf_devmem.h" #include "page_pool_priv.h" /* Device memory support */ @@ -320,3 +321,69 @@ void dev_dmabuf_uninstall(struct net_device *dev) } } } + +/*** "Dmabuf devmem memory provider" ***/ + +int mp_dmabuf_devmem_init(struct page_pool *pool) +{ + struct net_devmem_dmabuf_binding *binding = pool->mp_priv; + + if (!binding) + return -EINVAL; + + if (!pool->dma_map) + return -EOPNOTSUPP; + + if (pool->dma_sync) + return -EOPNOTSUPP; + + if (pool->p.order != 0) + return -E2BIG; + + net_devmem_dmabuf_binding_get(binding); + return 0; +} + +netmem_ref mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool, gfp_t gfp) +{ + struct net_devmem_dmabuf_binding *binding = pool->mp_priv; + struct net_iov *niov; + netmem_ref netmem; + + niov = net_devmem_alloc_dmabuf(binding); + if (!niov) + return 0; + + netmem = net_iov_to_netmem(niov); + + page_pool_set_pp_info(pool, netmem); + + pool->pages_state_hold_cnt++; + trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt); + return netmem; +} + +void mp_dmabuf_devmem_destroy(struct page_pool *pool) +{ + struct net_devmem_dmabuf_binding *binding = pool->mp_priv; + + net_devmem_dmabuf_binding_put(binding); +} + +bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem) +{ + long refcount = atomic_long_read(netmem_get_pp_ref_count_ref(netmem)); + + if (WARN_ON_ONCE(!netmem_is_net_iov(netmem))) + return false; + + if (WARN_ON_ONCE(refcount != 1)) + return false; + + page_pool_clear_pp_info(netmem); + + net_devmem_free_dmabuf(netmem_to_net_iov(netmem)); + + /* We don't want the page pool put_page()ing our net_iovs. */ + return false; +} diff --git a/net/core/mp_dmabuf_devmem.h b/net/core/mp_dmabuf_devmem.h new file mode 100644 index 000000000000..67cd0dd7319c --- /dev/null +++ b/net/core/mp_dmabuf_devmem.h @@ -0,0 +1,44 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Dmabuf device memory provider. + * + * Authors: Mina Almasry + * + */ +#ifndef _NET_MP_DMABUF_DEVMEM_H +#define _NET_MP_DMABUF_DEVMEM_H + +#include + +#if defined(CONFIG_NET_DEVMEM) +int mp_dmabuf_devmem_init(struct page_pool *pool); + +netmem_ref mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool, gfp_t gfp); + +void mp_dmabuf_devmem_destroy(struct page_pool *pool); + +bool mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem); +#else +static inline int mp_dmabuf_devmem_init(struct page_pool *pool) +{ + return -EOPNOTSUPP; +} + +static inline netmem_ref +mp_dmabuf_devmem_alloc_netmems(struct page_pool *pool, gfp_t gfp) +{ + return 0; +} + +static inline void mp_dmabuf_devmem_destroy(struct page_pool *pool) +{ +} + +static inline bool +mp_dmabuf_devmem_release_page(struct page_pool *pool, netmem_ref netmem) +{ + return false; +} +#endif + +#endif /* _NET_MP_DMABUF_DEVMEM_H */ diff --git a/net/core/netdev_rx_queue.c b/net/core/netdev_rx_queue.c index da11720a5983..e217a5838c87 100644 --- a/net/core/netdev_rx_queue.c +++ b/net/core/netdev_rx_queue.c @@ -4,8 +4,11 @@ #include #include +#include "page_pool_priv.h" + int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx) { + struct netdev_rx_queue *rxq = __netif_get_rx_queue(dev, rxq_idx); void *new_mem, *old_mem; int err; @@ -31,6 +34,10 @@ int netdev_rx_queue_restart(struct net_device *dev, unsigned int rxq_idx) if (err) goto err_free_old_mem; + err = page_pool_check_memory_provider(dev, rxq); + if (err) + goto err_free_new_queue_mem; + err = dev->queue_mgmt_ops->ndo_queue_stop(dev, old_mem, rxq_idx); if (err) goto err_free_new_queue_mem; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 52659db2d765..c737200f4fac 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -11,6 +11,7 @@ #include #include +#include #include #include @@ -24,8 +25,10 @@ #include +#include "mp_dmabuf_devmem.h" #include "netmem_priv.h" #include "page_pool_priv.h" +#include "mp_dmabuf_devmem.h" DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); @@ -190,6 +193,8 @@ static int page_pool_init(struct page_pool *pool, int cpuid) { unsigned int ring_qsize = 1024; /* Default */ + struct netdev_rx_queue *rxq; + int err; page_pool_struct_check(); @@ -271,7 +276,37 @@ static int page_pool_init(struct page_pool *pool, if (pool->dma_map) get_device(pool->p.dev); + if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) { + /* We rely on rtnl_lock()ing to make sure netdev_rx_queue + * configuration doesn't change while we're initializing + * the page_pool. + */ + ASSERT_RTNL(); + rxq = __netif_get_rx_queue(pool->slow.netdev, + pool->slow.queue_idx); + pool->mp_priv = rxq->mp_params.mp_priv; + } + + if (pool->mp_priv) { + err = mp_dmabuf_devmem_init(pool); + if (err) { + pr_warn("%s() mem-provider init failed %d\n", __func__, + err); + goto free_ptr_ring; + } + + static_branch_inc(&page_pool_mem_providers); + } + return 0; + +free_ptr_ring: + ptr_ring_cleanup(&pool->ring, NULL); +#ifdef CONFIG_PAGE_POOL_STATS + if (!pool->system) + free_percpu(pool->recycle_stats); +#endif + return err; } static void page_pool_uninit(struct page_pool *pool) @@ -455,28 +490,6 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) return false; } -static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) -{ - netmem_set_pp(netmem, pool); - netmem_or_pp_magic(netmem, PP_SIGNATURE); - - /* Ensuring all pages have been split into one fragment initially: - * page_pool_set_pp_info() is only called once for every page when it - * is allocated from the page allocator and page_pool_fragment_page() - * is dirtying the same cache line as the page->pp_magic above, so - * the overhead is negligible. - */ - page_pool_fragment_netmem(netmem, 1); - if (pool->has_init_callback) - pool->slow.init_callback(netmem, pool->slow.init_arg); -} - -static void page_pool_clear_pp_info(netmem_ref netmem) -{ - netmem_clear_pp_magic(netmem); - netmem_set_pp(netmem, NULL); -} - static struct page *__page_pool_alloc_page_order(struct page_pool *pool, gfp_t gfp) { @@ -572,7 +585,10 @@ netmem_ref page_pool_alloc_netmem(struct page_pool *pool, gfp_t gfp) return netmem; /* Slow-path: cache empty, do real allocation */ - netmem = __page_pool_alloc_pages_slow(pool, gfp); + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv) + netmem = mp_dmabuf_devmem_alloc_netmems(pool, gfp); + else + netmem = __page_pool_alloc_pages_slow(pool, gfp); return netmem; } EXPORT_SYMBOL(page_pool_alloc_netmem); @@ -608,6 +624,28 @@ s32 page_pool_inflight(const struct page_pool *pool, bool strict) return inflight; } +void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) +{ + netmem_set_pp(netmem, pool); + netmem_or_pp_magic(netmem, PP_SIGNATURE); + + /* Ensuring all pages have been split into one fragment initially: + * page_pool_set_pp_info() is only called once for every page when it + * is allocated from the page allocator and page_pool_fragment_page() + * is dirtying the same cache line as the page->pp_magic above, so + * the overhead is negligible. + */ + page_pool_fragment_netmem(netmem, 1); + if (pool->has_init_callback) + pool->slow.init_callback(netmem, pool->slow.init_arg); +} + +void page_pool_clear_pp_info(netmem_ref netmem) +{ + netmem_clear_pp_magic(netmem); + netmem_set_pp(netmem, NULL); +} + static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, netmem_ref netmem) { @@ -636,8 +674,13 @@ static __always_inline void __page_pool_release_page_dma(struct page_pool *pool, void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) { int count; + bool put; - __page_pool_release_page_dma(pool, netmem); + put = true; + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_priv) + put = mp_dmabuf_devmem_release_page(pool, netmem); + else + __page_pool_release_page_dma(pool, netmem); /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. @@ -645,8 +688,10 @@ void page_pool_return_page(struct page_pool *pool, netmem_ref netmem) count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, netmem, count); - page_pool_clear_pp_info(netmem); - put_page(netmem_to_page(netmem)); + if (put) { + page_pool_clear_pp_info(netmem); + put_page(netmem_to_page(netmem)); + } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call). @@ -965,6 +1010,12 @@ static void __page_pool_destroy(struct page_pool *pool) page_pool_unlist(pool); page_pool_uninit(pool); + + if (pool->mp_priv) { + mp_dmabuf_devmem_destroy(pool); + static_branch_dec(&page_pool_mem_providers); + } + kfree(pool); } diff --git a/net/core/page_pool_priv.h b/net/core/page_pool_priv.h index d602c1e728c2..57439787b9c2 100644 --- a/net/core/page_pool_priv.h +++ b/net/core/page_pool_priv.h @@ -35,4 +35,24 @@ static inline bool page_pool_set_dma_addr(struct page *page, dma_addr_t addr) return page_pool_set_dma_addr_netmem(page_to_netmem(page), addr); } +#if defined(CONFIG_PAGE_POOL) +void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem); +void page_pool_clear_pp_info(netmem_ref netmem); +int page_pool_check_memory_provider(struct net_device *dev, + struct netdev_rx_queue *rxq); +#else +static inline void page_pool_set_pp_info(struct page_pool *pool, + netmem_ref netmem) +{ +} +static inline void page_pool_clear_pp_info(netmem_ref netmem) +{ +} +static inline int page_pool_check_memory_provider(struct net_device *dev, + struct netdev_rx_queue *rxq) +{ + return 0; +} +#endif + #endif diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c index 3a3277ba167b..cd6267ba6fa3 100644 --- a/net/core/page_pool_user.c +++ b/net/core/page_pool_user.c @@ -4,8 +4,9 @@ #include #include #include -#include +#include #include +#include #include #include "page_pool_priv.h" @@ -344,6 +345,30 @@ void page_pool_unlist(struct page_pool *pool) mutex_unlock(&page_pools_lock); } +int page_pool_check_memory_provider(struct net_device *dev, + struct netdev_rx_queue *rxq) +{ + struct net_devmem_dmabuf_binding *binding = rxq->mp_params.mp_priv; + struct page_pool *pool; + struct hlist_node *n; + + if (!binding) + return 0; + + mutex_lock(&page_pools_lock); + hlist_for_each_entry_safe(pool, n, &dev->page_pools, user.list) { + if (pool->mp_priv != binding) + continue; + + if (pool->slow.queue_idx == get_netdev_rx_queue_index(rxq)) { + mutex_unlock(&page_pools_lock); + return 0; + } + } + mutex_unlock(&page_pools_lock); + return -ENODATA; +} + static void page_pool_unreg_netdev_wipe(struct net_device *netdev) { struct page_pool *pool; From patchwork Mon Sep 9 05:43:13 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 826769 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 498FD1AED41 for ; Mon, 9 Sep 2024 05:43:40 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860625; cv=none; b=NXI+n+3jHKPDUPbDWf2OYGXP4pTnwihSIfBABFYHYo4WCYW6k5yclj79P1pw3GV6ny599ZMkstSRtORAdkhzOfKT7ZJC9J4flb44y9xV7B0G8aOG7Ff/CUJGBMRfaqtGzpnve6MsF3GAVIMSUXqRXAE/IvXTDQdcQEZk5z5dOfU= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860625; c=relaxed/simple; bh=8yTKyPXR/JJ8W41/h1ILEYUwp0ILfE7ny7htLf2eC3k=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=mx5UxvBFcl+3M4HEx2CzaRzGTi2+JDj6/aAFsqrX1peTB6p15V/vFtR/6gnwJbUcavm4WQuHs/dy3dYhUSlWCkwrbGBDxJhlhMjRXGZzxTVP+f1bOQNUra/jR5lUfMVJ5hMcPN+E76wWFRAB9XZx+mA2fLdJeaVpj6Qa9YOTPM0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=xShQO+Mc; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="xShQO+Mc" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e035949cc4eso8356081276.1 for ; Sun, 08 Sep 2024 22:43:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725860617; x=1726465417; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=hJWTzgzMpuO58f7zhEkczq8k8Bg4k7akg7ko/9LTHJU=; b=xShQO+MciIJ2af+8V8eulpMa3XmdjeVWz397F/qUSG7Jb/tTXuh+npCMjooVskuBew aQ2XktqYayEUOa7akG9bB+fQLqqKXZtyOZDVc5Fi/t4ygLNPjzFqMT08zyVZoXLNjLnQ p0v74cFz5RmABL48u/S5DsAX3Q2BJYlslCIs5RAXl5HBoCGSVZMNZXvmZXgo/kgwgu0C hBP7XuqjDklQEfcdB2noTOZ5ieyUlqGnfkuL17OSaCUhRViZegx2SmHbNFHsDlsM36c+ KH/KAsUo3UC8fJ91v4MB/D2s3lBhvJxsQGAFzL/2faH0rHuuMwxwb7TPdRKyJoJtev7r PXbg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725860617; x=1726465417; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=hJWTzgzMpuO58f7zhEkczq8k8Bg4k7akg7ko/9LTHJU=; b=O9yW5kg9txpLTTI1xXVZdLDbIVha35JV7Idatz/c3mMHmmyAdHmp5mFj11yTZeM6VH +ynCUjH+d8r8/olE8bUbbgcynHN3yps+2VJ3nD9ido62nKr55f7FF/K/IxJ/gEq8krJe TKllL7zosUjeyEYHH3s/WWdi6Ct4oCsPIGtMN4zzkj7Duw02aIuOp4/cBtV/0I2Y3zJP rI3d+ZDiao8iDHsz65J1a4MiYgO+tDLv/SfpBzEwOfy0Hhim4/TTWjh/sQ0cK/pxbN4S qSJ+fDPgTguD48iKGj36qbWm0oiSP3MVDDWByPxOcNtXT+hvGGYMMg/3BTAyNDWPAYt9 oWLA== X-Forwarded-Encrypted: i=1; AJvYcCWE06PDK0SBBPHvNBpP0zvcICX1IdtJfKnhBXpEC/WSW137V6+e8PjnfhrGO9CRT2hrODmgINF8D6rkQ5SVHeU=@vger.kernel.org X-Gm-Message-State: AOJu0YwAyjr1cEgLM+w869O86Ig8ddH/wv18o9KVIi2H/gOsFW6oo29X mxeVIN+cHdT5LGWXPLRyrBbsxZF8L7/kCr6Yu3PX5Zy9H5Cigjeyt7uGNrElA240s+i1Tuthnbo bwgQj9GUtLikLq8W2RJy/vw== X-Google-Smtp-Source: AGHT+IHaG4QItTDUAJsf/nvBwk/4JNxRN6QfYUiZZnCQudbcE3ZJunvmcRT3sNrhU3Nm1q9ZUqIwOogeDp7HTE8Qow== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a25:3628:0:b0:e03:2f90:e81d with SMTP id 3f1490d57ef6-e1d34a40ed1mr22965276.11.1725860617082; Sun, 08 Sep 2024 22:43:37 -0700 (PDT) Date: Mon, 9 Sep 2024 05:43:13 +0000 In-Reply-To: <20240909054318.1809580-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240909054318.1809580-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.469.g59c65b2a67-goog Message-ID: <20240909054318.1809580-9-almasrymina@google.com> Subject: [PATCH net-next v25 08/13] net: add support for skbs with unreadable frags From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , Donald Hunter , Jakub Kicinski , "David S. Miller" , Eric Dumazet , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , " =?utf-8?b?QmrDtnJu?= =?utf-8?b?IFTDtnBlbA==?= " , Magnus Karlsson , Maciej Fijalkowski , Jonathan Lemon , Shuah Khan , Alexei Starovoitov , Daniel Borkmann , John Fastabend , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Willem de Bruijn , Kaiyuan Zhang For device memory TCP, we expect the skb headers to be available in host memory for access, and we expect the skb frags to be in device memory and unaccessible to the host. We expect there to be no mixing and matching of device memory frags (unaccessible) with host memory frags (accessible) in the same skb. Add a skb->devmem flag which indicates whether the frags in this skb are device memory frags or not. __skb_fill_netmem_desc() now checks frags added to skbs for net_iov, and marks the skb as skb->devmem accordingly. Add checks through the network stack to avoid accessing the frags of devmem skbs and avoid coalescing devmem skbs with non devmem skbs. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Eric Dumazet --- v25: - Remove readable check in tcp_skb_can_collapse_to(). - Add check in skb_checksum_help (Jakub). - Add WARN_ON_ONCE around __skb_checksum readable check (Jakub). v16: - Fix unreadable handling in skb_split_no_header() (Eric). v11: - drop excessive checks for frag 0 pull (Paolo) v9: https://lore.kernel.org/netdev/20240403002053.2376017-11-almasrymina@google.com/ - change skb->readable to skb->unreadable (Pavel/David). skb->readable was very complicated, because by default skbs are readable so the flag needed to be set to true in all code paths where new skbs were created or cloned. Forgetting to set skb->readable=true in some paths caused crashes. Flip it to skb->unreadable so that the default 0 value works well, and we only need to set it to true when we add unreadable frags. v6 - skb->dmabuf -> skb->readable (Pavel). Pavel's original suggestion was to remove the skb->dmabuf flag entirely, but when I looked into it closely, I found the issue that if we remove the flag we have to dereference the shinfo(skb) pointer to obtain the first frag, which can cause a performance regression if it dirties the cache line when the shinfo(skb) was not really needed. Instead, I converted the skb->dmabuf flag into a generic skb->readable flag which can be re-used by io_uring. Changes in v1: - Rename devmem -> dmabuf (David). - Flip skb_frags_not_readable (Jakub). --- include/linux/skbuff.h | 19 +++++++++++++++++-- include/net/tcp.h | 3 ++- net/core/datagram.c | 6 ++++++ net/core/dev.c | 4 ++++ net/core/skbuff.c | 43 ++++++++++++++++++++++++++++++++++++++++-- net/ipv4/tcp.c | 3 +++ net/ipv4/tcp_input.c | 13 ++++++++++--- net/ipv4/tcp_output.c | 5 ++++- net/packet/af_packet.c | 4 ++-- 9 files changed, 89 insertions(+), 11 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index dbadf2dd6b35..d02a88bad953 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -827,6 +827,8 @@ enum skb_tstamp_type { * @csum_level: indicates the number of consecutive checksums found in * the packet minus one that have been verified as * CHECKSUM_UNNECESSARY (max 3) + * @unreadable: indicates that at least 1 of the fragments in this skb is + * unreadable. * @dst_pending_confirm: need to confirm neighbour * @decrypted: Decrypted SKB * @slow_gro: state present at GRO time, slower prepare step required @@ -1008,7 +1010,7 @@ struct sk_buff { #if IS_ENABLED(CONFIG_IP_SCTP) __u8 csum_not_inet:1; #endif - + __u8 unreadable:1; #if defined(CONFIG_NET_SCHED) || defined(CONFIG_NET_XGRESS) __u16 tc_index; /* traffic control index */ #endif @@ -1823,6 +1825,12 @@ static inline void skb_zcopy_downgrade_managed(struct sk_buff *skb) __skb_zcopy_downgrade_managed(skb); } +/* Return true if frags in this skb are readable by the host. */ +static inline bool skb_frags_readable(const struct sk_buff *skb) +{ + return !skb->unreadable; +} + static inline void skb_mark_not_on_list(struct sk_buff *skb) { skb->next = NULL; @@ -2539,10 +2547,17 @@ static inline void skb_len_add(struct sk_buff *skb, int delta) static inline void __skb_fill_netmem_desc(struct sk_buff *skb, int i, netmem_ref netmem, int off, int size) { - struct page *page = netmem_to_page(netmem); + struct page *page; __skb_fill_netmem_desc_noacc(skb_shinfo(skb), i, netmem, off, size); + if (netmem_is_net_iov(netmem)) { + skb->unreadable = true; + return; + } + + page = netmem_to_page(netmem); + /* Propagate page pfmemalloc to the skb if we can. The problem is * that not all callers have unique ownership of the page but rely * on page_is_pfmemalloc doing the right thing(tm). diff --git a/include/net/tcp.h b/include/net/tcp.h index 2aac11e7e1cc..f77f812bfbe7 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1069,7 +1069,8 @@ static inline bool tcp_skb_can_collapse(const struct sk_buff *to, /* skb_cmp_decrypted() not needed, use tcp_write_collapse_fence() */ return likely(tcp_skb_can_collapse_to(to) && mptcp_skb_can_collapse(to, from) && - skb_pure_zcopy_same(to, from)); + skb_pure_zcopy_same(to, from) && + skb_frags_readable(to) == skb_frags_readable(from)); } static inline bool tcp_skb_can_collapse_rx(const struct sk_buff *to, diff --git a/net/core/datagram.c b/net/core/datagram.c index a40f733b37d7..f0693707aece 100644 --- a/net/core/datagram.c +++ b/net/core/datagram.c @@ -407,6 +407,9 @@ static int __skb_datagram_iter(const struct sk_buff *skb, int offset, return 0; } + if (!skb_frags_readable(skb)) + goto short_copy; + /* Copy paged appendix. Hmm... why does this look so complicated? */ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; @@ -623,6 +626,9 @@ int zerocopy_fill_skb_from_iter(struct sk_buff *skb, { int frag = skb_shinfo(skb)->nr_frags; + if (!skb_frags_readable(skb)) + return -EFAULT; + while (length && iov_iter_count(from)) { struct page *head, *last_head = NULL; struct page *pages[MAX_SKB_FRAGS]; diff --git a/net/core/dev.c b/net/core/dev.c index b4ef6578f31f..c7d5319596c8 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3312,6 +3312,10 @@ int skb_checksum_help(struct sk_buff *skb) return -EINVAL; } + if (!skb_frags_readable(skb)) { + return -EFAULT; + } + /* Before computing a checksum, we should make sure no frag could * be modified by an external entity : checksum could be wrong. */ diff --git a/net/core/skbuff.c b/net/core/skbuff.c index 7ea1508a1176..51a6e9570808 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1972,6 +1972,9 @@ int skb_copy_ubufs(struct sk_buff *skb, gfp_t gfp_mask) if (skb_shared(skb) || skb_unclone(skb, gfp_mask)) return -EINVAL; + if (!skb_frags_readable(skb)) + return -EFAULT; + if (!num_frags) goto release; @@ -2145,6 +2148,9 @@ struct sk_buff *skb_copy(const struct sk_buff *skb, gfp_t gfp_mask) unsigned int size; int headerlen; + if (!skb_frags_readable(skb)) + return NULL; + if (WARN_ON_ONCE(skb_shinfo(skb)->gso_type & SKB_GSO_FRAGLIST)) return NULL; @@ -2483,6 +2489,9 @@ struct sk_buff *skb_copy_expand(const struct sk_buff *skb, struct sk_buff *n; int oldheadroom; + if (!skb_frags_readable(skb)) + return NULL; + if (WARN_ON_ONCE(skb_shinfo(skb)->gso_type & SKB_GSO_FRAGLIST)) return NULL; @@ -2827,6 +2836,9 @@ void *__pskb_pull_tail(struct sk_buff *skb, int delta) */ int i, k, eat = (skb->tail + delta) - skb->end; + if (!skb_frags_readable(skb)) + return NULL; + if (eat > 0 || skb_cloned(skb)) { if (pskb_expand_head(skb, 0, eat > 0 ? eat + 128 : 0, GFP_ATOMIC)) @@ -2980,6 +2992,9 @@ int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len) to += copy; } + if (!skb_frags_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *f = &skb_shinfo(skb)->frags[i]; @@ -3168,6 +3183,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, /* * then map the fragments */ + if (!skb_frags_readable(skb)) + return false; + for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) { const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; @@ -3391,6 +3409,9 @@ int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len) from += copy; } + if (!skb_frags_readable(skb)) + goto fault; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; int end; @@ -3470,6 +3491,9 @@ __wsum __skb_checksum(const struct sk_buff *skb, int offset, int len, pos = copy; } + if (WARN_ON_ONCE(!skb_frags_readable(skb))) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; @@ -3570,6 +3594,9 @@ __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, pos = copy; } + if (!skb_frags_readable(skb)) + return 0; + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { int end; @@ -4061,6 +4088,7 @@ static inline void skb_split_inside_header(struct sk_buff *skb, skb_shinfo(skb1)->frags[i] = skb_shinfo(skb)->frags[i]; skb_shinfo(skb1)->nr_frags = skb_shinfo(skb)->nr_frags; + skb1->unreadable = skb->unreadable; skb_shinfo(skb)->nr_frags = 0; skb1->data_len = skb->data_len; skb1->len += skb1->data_len; @@ -4108,6 +4136,8 @@ static inline void skb_split_no_header(struct sk_buff *skb, pos += size; } skb_shinfo(skb1)->nr_frags = k; + + skb1->unreadable = skb->unreadable; } /** @@ -4345,6 +4375,9 @@ unsigned int skb_seq_read(unsigned int consumed, const u8 **data, return block_limit - abs_offset; } + if (!skb_frags_readable(st->cur_skb)) + return 0; + if (st->frag_idx == 0 && !st->frag_data) st->stepped_offset += skb_headlen(st->cur_skb); @@ -5957,7 +5990,10 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from, if (to->pp_recycle != from->pp_recycle) return false; - if (len <= skb_tailroom(to)) { + if (skb_frags_readable(from) != skb_frags_readable(to)) + return false; + + if (len <= skb_tailroom(to) && skb_frags_readable(from)) { if (len) BUG_ON(skb_copy_bits(from, 0, skb_put(to, len), len)); *delta_truesize = 0; @@ -6134,6 +6170,9 @@ int skb_ensure_writable(struct sk_buff *skb, unsigned int write_len) if (!pskb_may_pull(skb, write_len)) return -ENOMEM; + if (!skb_frags_readable(skb)) + return -EFAULT; + if (!skb_cloned(skb) || skb_clone_writable(skb, write_len)) return 0; @@ -6813,7 +6852,7 @@ void skb_condense(struct sk_buff *skb) { if (skb->data_len) { if (skb->data_len > skb->end - skb->tail || - skb_cloned(skb)) + skb_cloned(skb) || !skb_frags_readable(skb)) return; /* Nice, we can free page frag(s) right now */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 05844a36ffeb..30238963fe99 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2160,6 +2160,9 @@ static int tcp_zerocopy_receive(struct sock *sk, skb = tcp_recv_skb(sk, seq, &offset); } + if (!skb_frags_readable(skb)) + break; + if (TCP_SKB_CB(skb)->has_rxtstamp) { tcp_update_recv_tstamps(skb, tss); zc->msg_flags |= TCP_CMSG_TS; diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index e37488d3453f..9f314dfa1490 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5391,6 +5391,9 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) { n = tcp_skb_next(skb, list); + if (!skb_frags_readable(skb)) + goto skip_this; + /* No new bits? It is possible on ofo queue. */ if (!before(start, TCP_SKB_CB(skb)->end_seq)) { skb = tcp_collapse_one(sk, skb, list, root); @@ -5411,17 +5414,20 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, break; } - if (n && n != tail && tcp_skb_can_collapse_rx(skb, n) && + if (n && n != tail && skb_frags_readable(n) && + tcp_skb_can_collapse_rx(skb, n) && TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) { end_of_skbs = false; break; } +skip_this: /* Decided to skip this, advance start seq. */ start = TCP_SKB_CB(skb)->end_seq; } if (end_of_skbs || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + !skb_frags_readable(skb)) return; __skb_queue_head_init(&tmp); @@ -5463,7 +5469,8 @@ tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root, if (!skb || skb == tail || !tcp_skb_can_collapse_rx(nskb, skb) || - (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN))) + (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)) || + !skb_frags_readable(skb)) goto end; } } diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index cdd0def14427..4fd746bd4d54 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2344,7 +2344,8 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len) if (unlikely(TCP_SKB_CB(skb)->eor) || tcp_has_tx_tstamp(skb) || - !skb_pure_zcopy_same(skb, next)) + !skb_pure_zcopy_same(skb, next) || + skb_frags_readable(skb) != skb_frags_readable(next)) return false; len -= skb->len; @@ -3264,6 +3265,8 @@ static bool tcp_can_collapse(const struct sock *sk, const struct sk_buff *skb) return false; if (skb_cloned(skb)) return false; + if (!skb_frags_readable(skb)) + return false; /* Some heuristics for collapsing over SACK'd could be invented */ if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED) return false; diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c index 4a364cdd445e..a705ec214254 100644 --- a/net/packet/af_packet.c +++ b/net/packet/af_packet.c @@ -2216,7 +2216,7 @@ static int packet_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb); res = run_filter(skb, sk, snaplen); if (!res) @@ -2336,7 +2336,7 @@ static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, } } - snaplen = skb->len; + snaplen = skb_frags_readable(skb) ? skb->len : skb_headlen(skb); res = run_filter(skb, sk, snaplen); if (!res) From patchwork Mon Sep 9 05:43:15 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 826768 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D421C1AF4EC for ; Mon, 9 Sep 2024 05:43:43 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860629; cv=none; b=CMPFO/GpSKlHwW6nEqCkkRYxY+BnlYRhMtjvfYi5LrBDEUUlEGG8iiIpFNB8ZOTPDkFDn8nvJqBH/H3qlUBKbFqdKdcpQ1f1uVGAGOnBzjMylQUDlMlzpmln8sDaGEF8ZhEvQ0NoxxKFkZRNnqKOLmQQE0JVa4EQ2SHGEeK3024= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860629; c=relaxed/simple; bh=y493pBuMuXtKSS1nHc/d3DLVMXqmDAj6y5DRsbVmyGQ=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=E/CR9vwZVgcIKvp7oIX2zLpegECa82hrRcV6xgTbMFn5n8jKl2VPYfLFUZyGQjdy8Mw846TlGmOyyve8KYJIaO0dJfwDuddb2Iyo59KBLBK50SRQiLndaB9U5oy5u0IhpXNp5+uwwJVB6qUnSdtJCL4Z9FQnQvTGWZ9A67UygNM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=pxip9lAM; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="pxip9lAM" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e165fc5d94fso9023782276.2 for ; Sun, 08 Sep 2024 22:43:42 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725860621; x=1726465421; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=39H1hvAt4oUXephuqAsRMNiSS8rVQOUnNeObWUoRUdg=; b=pxip9lAMdYiMHfexF+TjF9EMjtLm3e/pDGK9lmuQHcs/zqfJO6JbdfFddiaXtijtXi 5xEcZ7HlgD3L124+GWYfVbbjjiw1kYyP1dw0ROE3Aa7dFtedAlRK2Dib7+8r8Zx8VzkU pDb0sGNDu1cFqKslpMvXbBxP3nSzUxb0SowGRQYlR4Y1eSO6rHRImkDSce4X43D//tAU 0VczJSzHWTSKj2wwzM7ZHsoA5WVBvGEmBMaKyGwr2GfLLgci3ZZ8bsIfSEX87PdA5nw1 5rX8sb5laIaS36IBaz/oEp2qfM4z4eWqEbFnXRbaf7p+qJ8n8Irh7Y8RI/piUDW1VUMe y0+A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725860621; x=1726465421; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=39H1hvAt4oUXephuqAsRMNiSS8rVQOUnNeObWUoRUdg=; b=JbvyGei7Zu2xtWis0rucDGh5+Ib7viTSKKy7mI6ylJ/Sq91+RdSpFHbNe045mIMO3A Y6hXXd1vuxkJCnTMAS/dRZ8LlXMsrV/J6mBipZWR+HPAIwtbNPT/MxcFBOj2xGY/+mS/ 3O+rJEQ3EBAMnokorcpo+80O+9RMxqTgGRzZXKxl5s4NFvwFMTZvXDFLeW+b50+WrHKn ktGyq1Dokzs53gb3J2HIVKKrfyk3zZuGRsfObGNNcA+04394yPmAGIA9R8/+25ZHF6IZ M/OPHABvRqup9dY6l0pMciKXiNL0PPSj13K16gwUMV/PVVWhwVLvO2BiPfTl2c+vhRmq 0jRg== X-Forwarded-Encrypted: i=1; AJvYcCUbdDNPIG/a3ujZJQI40HK5GmW+RkFu7f8br9n2tQvchjxF/YsCu58ZkrBeKBTlRekjnLYbDxxOMqGGvTndPDM=@vger.kernel.org X-Gm-Message-State: AOJu0Yw2GhHIVWNl2u60E4h3QHI36xjV7PoWZxdwpkazvwMriqig+Trj f+2LD0q4lzwHbm1e9ByZXvSHhOcV4voC4DLmIX+I2eTpc5ezlXfFiYnMbNUVW0HF+WCGGVmEEAj M21XX+TyT3Ph8BnjELEXoEA== X-Google-Smtp-Source: AGHT+IEwmsbqC92+DjushWmcPnvMXZEnQwr6ZXYZAUSZ5yEdh5w1D2StqyVObxPpmpAyWdBFMpepDqvTsTcdgy7ohw== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a25:2e08:0:b0:e0b:ea2e:7b00 with SMTP id 3f1490d57ef6-e1d3489ec34mr23893276.5.1725860621103; Sun, 08 Sep 2024 22:43:41 -0700 (PDT) Date: Mon, 9 Sep 2024 05:43:15 +0000 In-Reply-To: <20240909054318.1809580-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240909054318.1809580-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.469.g59c65b2a67-goog Message-ID: <20240909054318.1809580-11-almasrymina@google.com> Subject: [PATCH net-next v25 10/13] net: add SO_DEVMEM_DONTNEED setsockopt to release RX frags From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , Donald Hunter , Jakub Kicinski , "David S. Miller" , Eric Dumazet , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , " =?utf-8?b?QmrDtnJu?= =?utf-8?b?IFTDtnBlbA==?= " , Magnus Karlsson , Maciej Fijalkowski , Jonathan Lemon , Shuah Khan , Alexei Starovoitov , Daniel Borkmann , John Fastabend , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Willem de Bruijn , Kaiyuan Zhang Add an interface for the user to notify the kernel that it is done reading the devmem dmabuf frags returned as cmsg. The kernel will drop the reference on the frags to make them available for reuse. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry Reviewed-by: Pavel Begunkov Reviewed-by: Eric Dumazet --- v16: - Use sk_is_tcp(). - Fix unnamed 128 DONTNEED limit (David). - Fix kernel allocating for 128 tokens even if the user didn't ask for that much (Eric). - Fix number assignement (Arnd). v10: - Fix leak of tokens (Nikolay). v7: - Updated SO_DEVMEM_* uapi to use the next available entry (Arnd). v6: - Squash in locking optimizations from edumazet@google.com. With his changes we lock the xarray once per sock_devmem_dontneed operation rather than once per frag. Changes in v1: - devmemtoken -> dmabuf_token (David). - Use napi_pp_put_page() for refcounting (Yunsheng). - Fix build error with missing socket options on other asms. --- arch/alpha/include/uapi/asm/socket.h | 1 + arch/mips/include/uapi/asm/socket.h | 1 + arch/parisc/include/uapi/asm/socket.h | 1 + arch/sparc/include/uapi/asm/socket.h | 1 + include/uapi/asm-generic/socket.h | 1 + include/uapi/linux/uio.h | 4 ++ net/core/sock.c | 68 +++++++++++++++++++++++++++ 7 files changed, 77 insertions(+) diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index ef4656a41058..251b73c5481e 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -144,6 +144,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index 414807d55e33..8ab7582291ab 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -155,6 +155,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index 2b817efd4544..38fc0b188e08 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -136,6 +136,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 00248fc68977..57084ed2f3c4 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -137,6 +137,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 0x0058 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 0x0059 #if !defined(__KERNEL__) diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index e993edc9c0ee..3b4e3e815602 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -139,6 +139,7 @@ #define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR #define SO_DEVMEM_DMABUF 79 #define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF +#define SO_DEVMEM_DONTNEED 80 #if !defined(__KERNEL__) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 3a22ddae376a..d17f8fcd93ec 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -33,6 +33,10 @@ struct dmabuf_cmsg { */ }; +struct dmabuf_token { + __u32 token_start; + __u32 token_count; +}; /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/core/sock.c b/net/core/sock.c index 468b1239606c..bbb57b5af0b1 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -124,6 +124,7 @@ #include #include #include +#include #include #include #include @@ -1049,6 +1050,69 @@ static int sock_reserve_memory(struct sock *sk, int bytes) return 0; } +#ifdef CONFIG_PAGE_POOL + +/* This is the number of tokens that the user can SO_DEVMEM_DONTNEED in + * 1 syscall. The limit exists to limit the amount of memory the kernel + * allocates to copy these tokens. + */ +#define MAX_DONTNEED_TOKENS 128 + +static noinline_for_stack int +sock_devmem_dontneed(struct sock *sk, sockptr_t optval, unsigned int optlen) +{ + unsigned int num_tokens, i, j, k, netmem_num = 0; + struct dmabuf_token *tokens; + netmem_ref netmems[16]; + int ret = 0; + + if (!sk_is_tcp(sk)) + return -EBADF; + + if (optlen % sizeof(struct dmabuf_token) || + optlen > sizeof(*tokens) * MAX_DONTNEED_TOKENS) + return -EINVAL; + + tokens = kvmalloc_array(optlen, sizeof(*tokens), GFP_KERNEL); + if (!tokens) + return -ENOMEM; + + num_tokens = optlen / sizeof(struct dmabuf_token); + if (copy_from_sockptr(tokens, optval, optlen)) { + kvfree(tokens); + return -EFAULT; + } + + xa_lock_bh(&sk->sk_user_frags); + for (i = 0; i < num_tokens; i++) { + for (j = 0; j < tokens[i].token_count; j++) { + netmem_ref netmem = (__force netmem_ref)__xa_erase( + &sk->sk_user_frags, tokens[i].token_start + j); + + if (netmem && + !WARN_ON_ONCE(!netmem_is_net_iov(netmem))) { + netmems[netmem_num++] = netmem; + if (netmem_num == ARRAY_SIZE(netmems)) { + xa_unlock_bh(&sk->sk_user_frags); + for (k = 0; k < netmem_num; k++) + WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); + netmem_num = 0; + xa_lock_bh(&sk->sk_user_frags); + } + ret++; + } + } + } + + xa_unlock_bh(&sk->sk_user_frags); + for (k = 0; k < netmem_num; k++) + WARN_ON_ONCE(!napi_pp_put_page(netmems[k])); + + kvfree(tokens); + return ret; +} +#endif + void sockopt_lock_sock(struct sock *sk) { /* When current->bpf_ctx is set, the setsockopt is called from @@ -1211,6 +1275,10 @@ int sk_setsockopt(struct sock *sk, int level, int optname, ret = -EOPNOTSUPP; return ret; } +#ifdef CONFIG_PAGE_POOL + case SO_DEVMEM_DONTNEED: + return sock_devmem_dontneed(sk, optval, optlen); +#endif } sockopt_lock_sock(sk); From patchwork Mon Sep 9 05:43:17 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 826767 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 3FD811B12DC for ; Mon, 9 Sep 2024 05:43:48 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860635; cv=none; b=lnHCqXgnbcSm1hAZsFNa/afjNFr9xjGjbd4xhuRYZezCn5bfYPOmLCAUmu+AGazrnZgKLsbtaMMA0wkCNo8REMnfYTRRcfBKMoKeipPOH5slfTY7BpFOTtus8ODrPt4ZDkjHjT3oSaMm1qhWMrnBPo4+rtiTQrL31yk1iHl25gg= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1725860635; c=relaxed/simple; bh=2t0/+0TBLvt24U/Cay2GykrKaGxoygrhy1NVq2NzoCQ=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=n+uSy+iYwns4vHk8XcS0bW8khj/hiXa+0ZApi3uR/bbfiLMZDiXK3rL4zWi/YPRmSAzV01wvhTZaGpjvCrolM5Nt5TY2fyYaGXilTNWoJew6vbpXCsdgkzfmYdF6eTgPf5AiyLAibCincIN2V39ws2HlE77tMd/fZix2dzuVvuM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=SWRYyWmq; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="SWRYyWmq" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-e1a7fd2eb36so8206954276.0 for ; Sun, 08 Sep 2024 22:43:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1725860625; x=1726465425; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=4UCmDs+YyvPenlA61IqWCFYTu9251Izdm+WiM3HHOus=; b=SWRYyWmqOuaQk0wOJMlulBFRvpSlFghWilrbZvUjuIKX1b2y0uGq+TQun6QLa/Jgay I8zyvKifZh4fCb2W4M8cCGTkVaMicSomawscetljNsmhHD+fqLjkzheHOP59LxLMum0Y +ENR8yiHp9a9hI/NvgZG0MS1oFhOfq0Q9cMB4EfcNENNwzFKrqoWAns/+VjMxpwKQg6W jh80+2ylPDmqvd9TNC/jxeQU3BWDAOJ6WDbrnAlDmvyT3EUHCFT3hw53BpCChVq3dLDQ ub9Usz5aF1Rv4Ps8lf9iYrlN9mReDPC5AdfEMW9o3H8EfgNwOkuqTn6xL3ps+lBIFkFi +Eow== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1725860625; x=1726465425; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=4UCmDs+YyvPenlA61IqWCFYTu9251Izdm+WiM3HHOus=; b=ea/Npjs8ylyBSR9qwju5FscsX6eQ7gxEZwsrLdhM60aTkJYUi9wTW3zc6ElqkZW7Jn NEshuGqP7nnhH/VB/RCig3NI6270MvGO5YVhmto38W1GmnKm+MXUmm90bBlJadF2Oehg jcLVXzCuHFsvN3SIieMX/ndlzsJg4qCyxxyUxGt8CSFTlYmaza+bxeOZZXtgyk+gYFpF pZd22T5fPOgSx0R5WlGgK2M2yU6mqaWZZOrcOA53owJ3NOT/LLy4MkS3vJAShzF8o4B3 HHJ+Sa3gtEC7cy/RL+wWow7n8PUqK7WSO3MwBmfP2F5VYoRoeJPgD04kt0etfnDrhqvo PITA== X-Forwarded-Encrypted: i=1; AJvYcCWlX+8ss/Q3zCXd0QuRA9u20U2uqkB5uJAvvfJPHiXRPsggy3UzCYpBEGgER2IoJa8dEYdW6VGc6+4yRbs94ks=@vger.kernel.org X-Gm-Message-State: AOJu0YzZOIeWuqssElz79BPiSv4PvrZVrtw+ecfXpqynw0/ADokccaG9 xTdKUD03bIlXIDaDhtyfThemQ9FML7/QMWZJ4yAqsrSF+EvtAKBtlp0rQ78Wk4sSJiqky6WzlQ3 ZtK+ig+WVrgp6cZjdXQw0Dw== X-Google-Smtp-Source: AGHT+IE2dzlbqeOCkIJo0ES06/D7J34uo+HoGhvXFeIDFEeJtb7QCm38enDVong8KvW/miT6qomHWpfJwUgdnX9wkQ== X-Received: from almasrymina.c.googlers.com ([fda3:e722:ac3:cc00:20:ed76:c0a8:4bc5]) (user=almasrymina job=sendgmr) by 2002:a5b:b50:0:b0:e0e:445b:606 with SMTP id 3f1490d57ef6-e1d346b001fmr18940276.0.1725860625148; Sun, 08 Sep 2024 22:43:45 -0700 (PDT) Date: Mon, 9 Sep 2024 05:43:17 +0000 In-Reply-To: <20240909054318.1809580-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240909054318.1809580-1-almasrymina@google.com> X-Mailer: git-send-email 2.46.0.469.g59c65b2a67-goog Message-ID: <20240909054318.1809580-13-almasrymina@google.com> Subject: [PATCH net-next v25 12/13] selftests: add ncdevmem, netcat for devmem TCP From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , Donald Hunter , Jakub Kicinski , "David S. Miller" , Eric Dumazet , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Steffen Klassert , Herbert Xu , David Ahern , Willem de Bruijn , " =?utf-8?b?QmrDtnJu?= =?utf-8?b?IFTDtnBlbA==?= " , Magnus Karlsson , Maciej Fijalkowski , Jonathan Lemon , Shuah Khan , Alexei Starovoitov , Daniel Borkmann , John Fastabend , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Bagas Sanjaya , Christoph Hellwig , Nikolay Aleksandrov , Taehee Yoo , Stanislav Fomichev ncdevmem is a devmem TCP netcat. It works similarly to netcat, but it sends and receives data using the devmem TCP APIs. It uses udmabuf as the dmabuf provider. It is compatible with a regular netcat running on a peer, or a ncdevmem running on a peer. In addition to normal netcat support, ncdevmem has a validation mode, where it sends a specific pattern and validates this pattern on the receiver side to ensure data integrity. Suggested-by: Stanislav Fomichev Signed-off-by: Mina Almasry --- v24: - Add *.d to .gitignore, to stop tracking this file the build generates: ``` git status ... ?? tools/net/ynl/lib/ynl.d ``` v22: - Add run_command helper. It reduces boiler plate and prints the commands it is running. v20: - Remove unnecessary sleep(1) - Add test to ensure dmabuf binding fails if header split is disabled. v19: - Check return code of ethtool commands. - Add test for deactivating mp bound rx queues - Add test for attempting to bind with missing netlink attributes. v16: - Remove outdated -n option (Taehee). - Use 'ifname' instead of accidentally hardcoded 'eth1'. (Taehee) - Remove dead code 'iterations' (Taehee). - Use if_nametoindex() instead of passing device index (Taehee). v15: - Fix linking against libynl. (Jakub) v9: https://lore.kernel.org/netdev/20240403002053.2376017-15-almasrymina@google.com/ - Remove unused nic_pci_addr entry (Cong). v6: - Updated to bind 8 queues. - Added RSS configuration. - Added some more tests for the netlink API. Changes in v1: - Many more general cleanups (Willem). - Removed driver reset (Jakub). - Removed hardcoded if index (Paolo). RFC v2: - General cleanups (Willem). --- tools/net/ynl/lib/.gitignore | 1 + tools/testing/selftests/net/.gitignore | 1 + tools/testing/selftests/net/Makefile | 9 + tools/testing/selftests/net/ncdevmem.c | 570 +++++++++++++++++++++++++ 4 files changed, 581 insertions(+) create mode 100644 tools/testing/selftests/net/ncdevmem.c diff --git a/tools/net/ynl/lib/.gitignore b/tools/net/ynl/lib/.gitignore index c18dd8d83cee..296c4035dbf2 100644 --- a/tools/net/ynl/lib/.gitignore +++ b/tools/net/ynl/lib/.gitignore @@ -1 +1,2 @@ __pycache__/ +*.d diff --git a/tools/testing/selftests/net/.gitignore b/tools/testing/selftests/net/.gitignore index 923bf098e2eb..1c04c780db66 100644 --- a/tools/testing/selftests/net/.gitignore +++ b/tools/testing/selftests/net/.gitignore @@ -17,6 +17,7 @@ ipv6_flowlabel ipv6_flowlabel_mgr log.txt msg_zerocopy +ncdevmem nettest psock_fanout psock_snd diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile index 27362e40eb37..fc1c3a64de2d 100644 --- a/tools/testing/selftests/net/Makefile +++ b/tools/testing/selftests/net/Makefile @@ -97,6 +97,11 @@ TEST_PROGS += fq_band_pktlimit.sh TEST_PROGS += vlan_hw_filter.sh TEST_PROGS += bpf_offload.py +# YNL files, must be before "include ..lib.mk" +EXTRA_CLEAN += $(OUTPUT)/libynl.a +YNL_GEN_FILES := ncdevmem +TEST_GEN_FILES += $(YNL_GEN_FILES) + TEST_FILES := settings TEST_FILES += in_netns.sh lib.sh net_helper.sh setup_loopback.sh setup_veth.sh @@ -106,6 +111,10 @@ TEST_INCLUDES := forwarding/lib.sh include ../lib.mk +# YNL build +YNL_GENS := netdev +include ynl.mk + $(OUTPUT)/epoll_busy_poll: LDLIBS += -lcap $(OUTPUT)/reuseport_bpf_numa: LDLIBS += -lnuma $(OUTPUT)/tcp_mmap: LDLIBS += -lpthread -lcrypto diff --git a/tools/testing/selftests/net/ncdevmem.c b/tools/testing/selftests/net/ncdevmem.c new file mode 100644 index 000000000000..64d6805381c5 --- /dev/null +++ b/tools/testing/selftests/net/ncdevmem.c @@ -0,0 +1,570 @@ +// SPDX-License-Identifier: GPL-2.0 +#define _GNU_SOURCE +#define __EXPORTED_HEADERS__ + +#include +#include +#include +#include +#include +#include +#include +#define __iovec_defined +#include +#include +#include + +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "netdev-user.h" +#include + +#define PAGE_SHIFT 12 +#define TEST_PREFIX "ncdevmem" +#define NUM_PAGES 16000 + +#ifndef MSG_SOCK_DEVMEM +#define MSG_SOCK_DEVMEM 0x2000000 +#endif + +/* + * tcpdevmem netcat. Works similarly to netcat but does device memory TCP + * instead of regular TCP. Uses udmabuf to mock a dmabuf provider. + * + * Usage: + * + * On server: + * ncdevmem -s -c -f eth1 -l -p 5201 -v 7 + * + * On client: + * yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \ + * tr \\n \\0 | \ + * head -c 5G | \ + * nc 5201 -p 5201 + * + * Note this is compatible with regular netcat. i.e. the sender or receiver can + * be replaced with regular netcat to test the RX or TX path in isolation. + */ + +static char *server_ip = "192.168.1.4"; +static char *client_ip = "192.168.1.2"; +static char *port = "5201"; +static size_t do_validation; +static int start_queue = 8; +static int num_queues = 8; +static char *ifname = "eth1"; +static unsigned int ifindex; +static unsigned int dmabuf_id; + +void print_bytes(void *ptr, size_t size) +{ + unsigned char *p = ptr; + int i; + + for (i = 0; i < size; i++) + printf("%02hhX ", p[i]); + printf("\n"); +} + +void print_nonzero_bytes(void *ptr, size_t size) +{ + unsigned char *p = ptr; + unsigned int i; + + for (i = 0; i < size; i++) + putchar(p[i]); + printf("\n"); +} + +void validate_buffer(void *line, size_t size) +{ + static unsigned char seed = 1; + unsigned char *ptr = line; + int errors = 0; + size_t i; + + for (i = 0; i < size; i++) { + if (ptr[i] != seed) { + fprintf(stderr, + "Failed validation: expected=%u, actual=%u, index=%lu\n", + seed, ptr[i], i); + errors++; + if (errors > 20) + error(1, 0, "validation failed."); + } + seed++; + if (seed == do_validation) + seed = 0; + } + + fprintf(stdout, "Validated buffer\n"); +} + +#define run_command(cmd, ...) \ + ({ \ + char command[256]; \ + memset(command, 0, sizeof(command)); \ + snprintf(command, sizeof(command), cmd, ##__VA_ARGS__); \ + printf("Running: %s\n", command); \ + system(command); \ + }) + +static int reset_flow_steering(void) +{ + int ret = 0; + + ret = run_command("sudo ethtool -K %s ntuple off", ifname); + if (ret) + return ret; + + return run_command("sudo ethtool -K %s ntuple on", ifname); +} + +static int configure_headersplit(bool on) +{ + return run_command("sudo ethtool -G %s tcp-data-split %s", ifname, + on ? "on" : "off"); +} + +static int configure_rss(void) +{ + return run_command("sudo ethtool -X %s equal %d", ifname, start_queue); +} + +static int configure_channels(unsigned int rx, unsigned int tx) +{ + return run_command("sudo ethtool -L %s rx %u tx %u", ifname, rx, tx); +} + +static int configure_flow_steering(void) +{ + return run_command("sudo ethtool -N %s flow-type tcp4 src-ip %s dst-ip %s src-port %s dst-port %s queue %d", + ifname, client_ip, server_ip, port, port, start_queue); +} + +static int bind_rx_queue(unsigned int ifindex, unsigned int dmabuf_fd, + struct netdev_queue_id *queues, + unsigned int n_queue_index, struct ynl_sock **ys) +{ + struct netdev_bind_rx_req *req = NULL; + struct netdev_bind_rx_rsp *rsp = NULL; + struct ynl_error yerr; + + *ys = ynl_sock_create(&ynl_netdev_family, &yerr); + if (!*ys) { + fprintf(stderr, "YNL: %s\n", yerr.msg); + return -1; + } + + req = netdev_bind_rx_req_alloc(); + netdev_bind_rx_req_set_ifindex(req, ifindex); + netdev_bind_rx_req_set_fd(req, dmabuf_fd); + __netdev_bind_rx_req_set_queues(req, queues, n_queue_index); + + rsp = netdev_bind_rx(*ys, req); + if (!rsp) { + perror("netdev_bind_rx"); + goto err_close; + } + + if (!rsp->_present.id) { + perror("id not present"); + goto err_close; + } + + printf("got dmabuf id=%d\n", rsp->id); + dmabuf_id = rsp->id; + + netdev_bind_rx_req_free(req); + netdev_bind_rx_rsp_free(rsp); + + return 0; + +err_close: + fprintf(stderr, "YNL failed: %s\n", (*ys)->err.msg); + netdev_bind_rx_req_free(req); + ynl_sock_destroy(*ys); + return -1; +} + +static void create_udmabuf(int *devfd, int *memfd, int *buf, size_t dmabuf_size) +{ + struct udmabuf_create create; + int ret; + + *devfd = open("/dev/udmabuf", O_RDWR); + if (*devfd < 0) { + error(70, 0, + "%s: [skip,no-udmabuf: Unable to access DMA buffer device file]\n", + TEST_PREFIX); + } + + *memfd = memfd_create("udmabuf-test", MFD_ALLOW_SEALING); + if (*memfd < 0) + error(70, 0, "%s: [skip,no-memfd]\n", TEST_PREFIX); + + /* Required for udmabuf */ + ret = fcntl(*memfd, F_ADD_SEALS, F_SEAL_SHRINK); + if (ret < 0) + error(73, 0, "%s: [skip,fcntl-add-seals]\n", TEST_PREFIX); + + ret = ftruncate(*memfd, dmabuf_size); + if (ret == -1) + error(74, 0, "%s: [FAIL,memfd-truncate]\n", TEST_PREFIX); + + memset(&create, 0, sizeof(create)); + + create.memfd = *memfd; + create.offset = 0; + create.size = dmabuf_size; + *buf = ioctl(*devfd, UDMABUF_CREATE, &create); + if (*buf < 0) + error(75, 0, "%s: [FAIL, create udmabuf]\n", TEST_PREFIX); +} + +int do_server(void) +{ + char ctrl_data[sizeof(int) * 20000]; + struct netdev_queue_id *queues; + size_t non_page_aligned_frags = 0; + struct sockaddr_in client_addr; + struct sockaddr_in server_sin; + size_t page_aligned_frags = 0; + int devfd, memfd, buf, ret; + size_t total_received = 0; + socklen_t client_addr_len; + bool is_devmem = false; + char *buf_mem = NULL; + struct ynl_sock *ys; + size_t dmabuf_size; + char iobuf[819200]; + char buffer[256]; + int socket_fd; + int client_fd; + size_t i = 0; + int opt = 1; + + dmabuf_size = getpagesize() * NUM_PAGES; + + create_udmabuf(&devfd, &memfd, &buf, dmabuf_size); + + if (reset_flow_steering()) + error(1, 0, "Failed to reset flow steering\n"); + + /* Configure RSS to divert all traffic from our devmem queues */ + if (configure_rss()) + error(1, 0, "Failed to configure rss\n"); + + /* Flow steer our devmem flows to start_queue */ + if (configure_flow_steering()) + error(1, 0, "Failed to configure flow steering\n"); + + sleep(1); + + queues = malloc(sizeof(*queues) * num_queues); + + for (i = 0; i < num_queues; i++) { + queues[i]._present.type = 1; + queues[i]._present.id = 1; + queues[i].type = NETDEV_QUEUE_TYPE_RX; + queues[i].id = start_queue + i; + } + + if (bind_rx_queue(ifindex, buf, queues, num_queues, &ys)) + error(1, 0, "Failed to bind\n"); + + buf_mem = mmap(NULL, dmabuf_size, PROT_READ | PROT_WRITE, MAP_SHARED, + buf, 0); + if (buf_mem == MAP_FAILED) + error(1, 0, "mmap()"); + + server_sin.sin_family = AF_INET; + server_sin.sin_port = htons(atoi(port)); + + ret = inet_pton(server_sin.sin_family, server_ip, &server_sin.sin_addr); + if (socket < 0) + error(79, 0, "%s: [FAIL, create socket]\n", TEST_PREFIX); + + socket_fd = socket(server_sin.sin_family, SOCK_STREAM, 0); + if (socket < 0) + error(errno, errno, "%s: [FAIL, create socket]\n", TEST_PREFIX); + + ret = setsockopt(socket_fd, SOL_SOCKET, SO_REUSEPORT, &opt, + sizeof(opt)); + if (ret) + error(errno, errno, "%s: [FAIL, set sock opt]\n", TEST_PREFIX); + + ret = setsockopt(socket_fd, SOL_SOCKET, SO_REUSEADDR, &opt, + sizeof(opt)); + if (ret) + error(errno, errno, "%s: [FAIL, set sock opt]\n", TEST_PREFIX); + + printf("binding to address %s:%d\n", server_ip, + ntohs(server_sin.sin_port)); + + ret = bind(socket_fd, &server_sin, sizeof(server_sin)); + if (ret) + error(errno, errno, "%s: [FAIL, bind]\n", TEST_PREFIX); + + ret = listen(socket_fd, 1); + if (ret) + error(errno, errno, "%s: [FAIL, listen]\n", TEST_PREFIX); + + client_addr_len = sizeof(client_addr); + + inet_ntop(server_sin.sin_family, &server_sin.sin_addr, buffer, + sizeof(buffer)); + printf("Waiting or connection on %s:%d\n", buffer, + ntohs(server_sin.sin_port)); + client_fd = accept(socket_fd, &client_addr, &client_addr_len); + + inet_ntop(client_addr.sin_family, &client_addr.sin_addr, buffer, + sizeof(buffer)); + printf("Got connection from %s:%d\n", buffer, + ntohs(client_addr.sin_port)); + + while (1) { + struct iovec iov = { .iov_base = iobuf, + .iov_len = sizeof(iobuf) }; + struct dmabuf_cmsg *dmabuf_cmsg = NULL; + struct dma_buf_sync sync = { 0 }; + struct cmsghdr *cm = NULL; + struct msghdr msg = { 0 }; + struct dmabuf_token token; + ssize_t ret; + + is_devmem = false; + printf("\n\n"); + + msg.msg_iov = &iov; + msg.msg_iovlen = 1; + msg.msg_control = ctrl_data; + msg.msg_controllen = sizeof(ctrl_data); + ret = recvmsg(client_fd, &msg, MSG_SOCK_DEVMEM); + printf("recvmsg ret=%ld\n", ret); + if (ret < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) + continue; + if (ret < 0) { + perror("recvmsg"); + continue; + } + if (ret == 0) { + printf("client exited\n"); + goto cleanup; + } + + i++; + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level != SOL_SOCKET || + (cm->cmsg_type != SCM_DEVMEM_DMABUF && + cm->cmsg_type != SCM_DEVMEM_LINEAR)) { + fprintf(stdout, "skipping non-devmem cmsg\n"); + continue; + } + + dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm); + is_devmem = true; + + if (cm->cmsg_type == SCM_DEVMEM_LINEAR) { + /* TODO: process data copied from skb's linear + * buffer. + */ + fprintf(stdout, + "SCM_DEVMEM_LINEAR. dmabuf_cmsg->frag_size=%u\n", + dmabuf_cmsg->frag_size); + + continue; + } + + token.token_start = dmabuf_cmsg->frag_token; + token.token_count = 1; + + total_received += dmabuf_cmsg->frag_size; + printf("received frag_page=%llu, in_page_offset=%llu, frag_offset=%llu, frag_size=%u, token=%u, total_received=%lu, dmabuf_id=%u\n", + dmabuf_cmsg->frag_offset >> PAGE_SHIFT, + dmabuf_cmsg->frag_offset % getpagesize(), + dmabuf_cmsg->frag_offset, dmabuf_cmsg->frag_size, + dmabuf_cmsg->frag_token, total_received, + dmabuf_cmsg->dmabuf_id); + + if (dmabuf_cmsg->dmabuf_id != dmabuf_id) + error(1, 0, + "received on wrong dmabuf_id: flow steering error\n"); + + if (dmabuf_cmsg->frag_size % getpagesize()) + non_page_aligned_frags++; + else + page_aligned_frags++; + + sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_START; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + if (do_validation) + validate_buffer( + ((unsigned char *)buf_mem) + + dmabuf_cmsg->frag_offset, + dmabuf_cmsg->frag_size); + else + print_nonzero_bytes( + ((unsigned char *)buf_mem) + + dmabuf_cmsg->frag_offset, + dmabuf_cmsg->frag_size); + + sync.flags = DMA_BUF_SYNC_READ | DMA_BUF_SYNC_END; + ioctl(buf, DMA_BUF_IOCTL_SYNC, &sync); + + ret = setsockopt(client_fd, SOL_SOCKET, + SO_DEVMEM_DONTNEED, &token, + sizeof(token)); + if (ret != 1) + error(1, 0, + "SO_DEVMEM_DONTNEED not enough tokens"); + } + if (!is_devmem) + error(1, 0, "flow steering error\n"); + + printf("total_received=%lu\n", total_received); + } + + fprintf(stdout, "%s: ok\n", TEST_PREFIX); + + fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n", + page_aligned_frags, non_page_aligned_frags); + + fprintf(stdout, "page_aligned_frags=%lu, non_page_aligned_frags=%lu\n", + page_aligned_frags, non_page_aligned_frags); + +cleanup: + + munmap(buf_mem, dmabuf_size); + close(client_fd); + close(socket_fd); + close(buf); + close(memfd); + close(devfd); + ynl_sock_destroy(ys); + + return 0; +} + +void run_devmem_tests(void) +{ + struct netdev_queue_id *queues; + int devfd, memfd, buf; + struct ynl_sock *ys; + size_t dmabuf_size; + size_t i = 0; + + dmabuf_size = getpagesize() * NUM_PAGES; + + create_udmabuf(&devfd, &memfd, &buf, dmabuf_size); + + /* Configure RSS to divert all traffic from our devmem queues */ + if (configure_rss()) + error(1, 0, "rss error\n"); + + queues = calloc(num_queues, sizeof(*queues)); + + if (configure_headersplit(1)) + error(1, 0, "Failed to configure header split\n"); + + if (!bind_rx_queue(ifindex, buf, queues, num_queues, &ys)) + error(1, 0, "Binding empty queues array should have failed\n"); + + for (i = 0; i < num_queues; i++) { + queues[i]._present.type = 1; + queues[i]._present.id = 1; + queues[i].type = NETDEV_QUEUE_TYPE_RX; + queues[i].id = start_queue + i; + } + + if (configure_headersplit(0)) + error(1, 0, "Failed to configure header split\n"); + + if (!bind_rx_queue(ifindex, buf, queues, num_queues, &ys)) + error(1, 0, "Configure dmabuf with header split off should have failed\n"); + + if (configure_headersplit(1)) + error(1, 0, "Failed to configure header split\n"); + + for (i = 0; i < num_queues; i++) { + queues[i]._present.type = 1; + queues[i]._present.id = 1; + queues[i].type = NETDEV_QUEUE_TYPE_RX; + queues[i].id = start_queue + i; + } + + if (bind_rx_queue(ifindex, buf, queues, num_queues, &ys)) + error(1, 0, "Failed to bind\n"); + + /* Deactivating a bound queue should not be legal */ + if (!configure_channels(num_queues, num_queues - 1)) + error(1, 0, "Deactivating a bound queue should be illegal.\n"); + + /* Closing the netlink socket does an implicit unbind */ + ynl_sock_destroy(ys); +} + +int main(int argc, char *argv[]) +{ + int is_server = 0, opt; + + while ((opt = getopt(argc, argv, "ls:c:p:v:q:t:f:")) != -1) { + switch (opt) { + case 'l': + is_server = 1; + break; + case 's': + server_ip = optarg; + break; + case 'c': + client_ip = optarg; + break; + case 'p': + port = optarg; + break; + case 'v': + do_validation = atoll(optarg); + break; + case 'q': + num_queues = atoi(optarg); + break; + case 't': + start_queue = atoi(optarg); + break; + case 'f': + ifname = optarg; + break; + case '?': + printf("unknown option: %c\n", optopt); + break; + } + } + + ifindex = if_nametoindex(ifname); + + for (; optind < argc; optind++) + printf("extra arguments: %s\n", argv[optind]); + + run_devmem_tests(); + + if (is_server) + return do_server(); + + return 0; +}