From patchwork Tue Mar 5 02:01:37 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 778171 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D8241182DB for ; Tue, 5 Mar 2024 02:02:02 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604125; cv=none; b=NpKhpMwM81tFO60LhrvE5A6T7A2DfGhze/HKZPYTkPbUl/5rIWTuRbsRCDGqWWqPyaQUmbck/wEuQSGQEgWmKBV7AbPfaVjhfXXWbg8cmh60mv/hUiDEw4zBFjFa4lQ9XKG3EJrs5nEe3ODxversITmdw34+FvRcKBli/+juVZo= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604125; c=relaxed/simple; bh=kPcvF1yAIkU4SC9auITlzhMeVIQD4E0gxQqihz8N8Fk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=CcMpRtdSUr1DFhM0NcGWfl6X+POpwgDS06EqCRMSxMgJjMGrLbIJdAYw8SI1G6n4rkHRGY65J8Y74MQ7bkmd0uI41UBu88Q650Y/aAAe5/aLKdmVmg/yiEBqG4xyu4dI1n/3K5BJPz11WvYKbA+yX0i3vHQ8pOnFCP4urHm16bA= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=gmeKkjC/; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="gmeKkjC/" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-608d6ffc64eso73699827b3.0 for ; Mon, 04 Mar 2024 18:02:02 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1709604122; x=1710208922; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=QOWmAeQMGWMC5aJa64Dg0AHzjb+8gPpNDHS7ziT/3ac=; b=gmeKkjC/zxBnWVkZCwsSw7pqWMZgdwGOi3VEVTPYLgq7YA5GF0ar4Qx5H7PKSMJC/6 7HzA+z1BVI9oNpIB0wFEYO+D5jMVHBMUinGnOdtcfm+unNf65u01qPIY7LRi7+YMTb60 m6K62SlyfuOdYF69cdEqwaJKVnbqgivhEFe1ErC5tcy99gy8jE+/Hg+EWw0JI7DrQdA9 PGbgK9+YSaiX9ooaBRYf0y3aqrIMo4cv2b8XUMPcCwXfgmQmLskoPekvP1o9I7JvldCY bl/H5AG2Ol4gk6joZU5Up2XRc3N+nNBaO2igwxyqPjmIztLt0RrFWAiAl0BmUR6P13ia ONKw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709604122; x=1710208922; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=QOWmAeQMGWMC5aJa64Dg0AHzjb+8gPpNDHS7ziT/3ac=; b=mtptJVe0GrbAUcSCYluyAlUylW1Fm9dKyMxlnWXYYYuoUEJB1m9EK2TOsU7pYY1OeN fiQLtRdWkvV/qXCSA4twbSgxymQ8p98OdFc3gwYv5elqSqphaaFB0BuR6PH5nBDi/xzT Xc88DBN6vzEnlO6yK9j6H1xYiG+wHraFIC3cL1NUzN9Le1JdsvlLtWVp3v5uJUEnNs48 TVQV4yRAc3utzo9fgiAV6tEinPeSIm9yPmGkMLWoid7JrQlejHyUYPUrleXVuYSepCrx 7Us8fdS2xaxjZtViSjQxcHRRCOU9ZIh8mcq4oFWh9poWpOB1JyYMCjVXL4gxSgjcZYYz 44Hg== X-Forwarded-Encrypted: i=1; AJvYcCWWU6igz3C3y3VxTMmOeH6353df4su11gOtG8qxgyVcB6aLf81rtRlNPfbF8wnrtCEjg/oMGTaOwgKsh24npUmI+2WASJXcSV9+9736stgf X-Gm-Message-State: AOJu0YzLHZoKBdbkwVwSTCcW4hec725SUjXEcEPe8BI+rMlyRM83OK6+ J2TO3B/4O4LNLwtEL9MK2q6qcexlf25m5GQ09hJzSGcbarXPTFTLpRqR2wqa+kGCdgy12YQ66ep 2pAksiIDL9cQWJsiuOPklMA== X-Google-Smtp-Source: AGHT+IG6KeWQw0XbqPbYmQM0P/6OwcAsrFu5mozN13W1clztdP6zot9Fisjb9bkCfEQNwnMXpfHNCZGrzpwJASvg9w== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:b614:914c:63cd:3830]) (user=almasrymina job=sendgmr) by 2002:a81:4c93:0:b0:609:3a33:bacc with SMTP id z141-20020a814c93000000b006093a33baccmr264479ywa.5.1709604121833; Mon, 04 Mar 2024 18:02:01 -0800 (PST) Date: Mon, 4 Mar 2024 18:01:37 -0800 In-Reply-To: <20240305020153.2787423-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240305020153.2787423-1-almasrymina@google.com> X-Mailer: git-send-email 2.44.0.rc1.240.g4c46232300-goog Message-ID: <20240305020153.2787423-3-almasrymina@google.com> Subject: [RFC PATCH net-next v6 02/15] net: page_pool: create hooks for custom page providers From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi From: Jakub Kicinski The page providers which try to reuse the same pages will need to hold onto the ref, even if page gets released from the pool - as in releasing the page from the pp just transfers the "ownership" reference from pp to the provider, and provider will wait for other references to be gone before feeding this page back into the pool. Signed-off-by: Jakub Kicinski Signed-off-by: Mina Almasry --- This is implemented by Jakub in his RFC: https://lore.kernel.org/netdev/f8270765-a27b-6ccf-33ea-cda097168d79@redhat.com/T/ I take no credit for the idea or implementation; I only added minor edits to make this workable with device memory TCP, and removed some hacky test code. This is a critical dependency of device memory TCP and thus I'm pulling it into this series to make it revewable and mergeable. RFC v3 -> v1 - Removed unusued mem_provider. (Yunsheng). - Replaced memory_provider & mp_priv with netdev_rx_queue (Jakub). --- include/net/page_pool/types.h | 12 ++++++++++ net/core/page_pool.c | 43 +++++++++++++++++++++++++++++++---- 2 files changed, 50 insertions(+), 5 deletions(-) diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index 5e43a08d3231..ffe5f31fb0da 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -52,6 +52,7 @@ struct pp_alloc_cache { * @dev: device, for DMA pre-mapping purposes * @netdev: netdev this pool will serve (leave as NULL if none or multiple) * @napi: NAPI which is the sole consumer of pages, otherwise NULL + * @queue: struct netdev_rx_queue this page_pool is being created for. * @dma_dir: DMA mapping direction * @max_len: max DMA sync memory size for PP_FLAG_DMA_SYNC_DEV * @offset: DMA sync address offset for PP_FLAG_DMA_SYNC_DEV @@ -64,6 +65,7 @@ struct page_pool_params { int nid; struct device *dev; struct napi_struct *napi; + struct netdev_rx_queue *queue; enum dma_data_direction dma_dir; unsigned int max_len; unsigned int offset; @@ -126,6 +128,13 @@ struct page_pool_stats { }; #endif +struct memory_provider_ops { + int (*init)(struct page_pool *pool); + void (*destroy)(struct page_pool *pool); + struct page *(*alloc_pages)(struct page_pool *pool, gfp_t gfp); + bool (*release_page)(struct page_pool *pool, struct page *page); +}; + struct page_pool { struct page_pool_params_fast p; @@ -176,6 +185,9 @@ struct page_pool { */ struct ptr_ring ring; + void *mp_priv; + const struct memory_provider_ops *mp_ops; + #ifdef CONFIG_PAGE_POOL_STATS /* recycle stats are per-cpu to avoid locking */ struct page_pool_recycle_stats __percpu *recycle_stats; diff --git a/net/core/page_pool.c b/net/core/page_pool.c index d706fe5548df..8776fcad064a 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -25,6 +25,8 @@ #include "page_pool_priv.h" +static DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); + #define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ) @@ -177,6 +179,7 @@ static int page_pool_init(struct page_pool *pool, int cpuid) { unsigned int ring_qsize = 1024; /* Default */ + int err; memcpy(&pool->p, ¶ms->fast, sizeof(pool->p)); memcpy(&pool->slow, ¶ms->slow, sizeof(pool->slow)); @@ -248,10 +251,25 @@ static int page_pool_init(struct page_pool *pool, /* Driver calling page_pool_create() also call page_pool_destroy() */ refcount_set(&pool->user_cnt, 1); + if (pool->mp_ops) { + err = pool->mp_ops->init(pool); + if (err) { + pr_warn("%s() mem-provider init failed %d\n", + __func__, err); + goto free_ptr_ring; + } + + static_branch_inc(&page_pool_mem_providers); + } + if (pool->p.flags & PP_FLAG_DMA_MAP) get_device(pool->p.dev); return 0; + +free_ptr_ring: + ptr_ring_cleanup(&pool->ring, NULL); + return err; } static void page_pool_uninit(struct page_pool *pool) @@ -546,7 +564,10 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp) return page; /* Slow-path: cache empty, do real allocation */ - page = __page_pool_alloc_pages_slow(pool, gfp); + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) + page = pool->mp_ops->alloc_pages(pool, gfp); + else + page = __page_pool_alloc_pages_slow(pool, gfp); return page; } EXPORT_SYMBOL(page_pool_alloc_pages); @@ -603,10 +624,13 @@ void __page_pool_release_page_dma(struct page_pool *pool, struct page *page) void page_pool_return_page(struct page_pool *pool, struct page *page) { int count; + bool put; - __page_pool_release_page_dma(pool, page); - - page_pool_clear_pp_info(page); + put = true; + if (static_branch_unlikely(&page_pool_mem_providers) && pool->mp_ops) + put = pool->mp_ops->release_page(pool, page); + else + __page_pool_release_page_dma(pool, page); /* This may be the last page returned, releasing the pool, so * it is not safe to reference pool afterwards. @@ -614,7 +638,10 @@ void page_pool_return_page(struct page_pool *pool, struct page *page) count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt); trace_page_pool_state_release(pool, page, count); - put_page(page); + if (put) { + page_pool_clear_pp_info(page); + put_page(page); + } /* An optimization would be to call __free_pages(page, pool->p.order) * knowing page is not part of page-cache (thus avoiding a * __page_cache_release() call). @@ -884,6 +911,12 @@ static void __page_pool_destroy(struct page_pool *pool) page_pool_unlist(pool); page_pool_uninit(pool); + + if (pool->mp_ops) { + pool->mp_ops->destroy(pool); + static_branch_dec(&page_pool_mem_providers); + } + kfree(pool); } From patchwork Tue Mar 5 02:01:39 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 778170 Received: from mail-yw1-f201.google.com (mail-yw1-f201.google.com [209.85.128.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2F2E61B81E for ; Tue, 5 Mar 2024 02:02:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604130; cv=none; b=FwzM8KL4DR9pqMnBACPclVb9yK/z2uJGDgv/KVU+JlldZygQK2s512+F1cpceZp55653/vIxEe0/IDAJ0b4FLE7myVeMBXm90PJNbZMh4K0ePax6aY9rs7c3A11QiFvHC5bvyBM6jsRsy89CMNiO/PJM4ha+i8gxsFQKJsl/JC8= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604130; c=relaxed/simple; bh=+5fuQw8FEn9Ki+OtwrFBOWWNabnKmFqlQvLgFxgogjA=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=Z1yP11hP3FuEp+yM9X8sytfJTnYUTBGIejp5wZjrCXJxusQwze8Mv04j2bEjDHuUm8sCM03ek/JGrIZo21NecPnVd0TGEW37tlF8eroEPnzJZNl7v/cOip2QJKrYutnGU2AZCwBcz8TvDAm6ECc6i7oFrbRGm5P2YM0eO2uoSWo= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=j7THmitg; arc=none smtp.client-ip=209.85.128.201 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="j7THmitg" Received: by mail-yw1-f201.google.com with SMTP id 00721157ae682-60987370f06so40684497b3.1 for ; Mon, 04 Mar 2024 18:02:07 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1709604126; x=1710208926; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=UYMI/uWgjt8U8M7NGaa1me0BRvgZMubHAraMibVciB8=; b=j7THmitgARoCJhwEeNKrgVt16I3Mftk6acapsBrcn3433Ti+wW62Gk5D86wXb0dJq1 71UlRzCzUgKCl1hcof2TQBMC0Clq+CNcUZ2vgh27aV/6I0JzT0bWpRQPfRGTFeYyGGET hZb+3GkNTigmOrU+MDV1g3hngHj5UA9Bt2G59Fk1zIPit9p70EtzFPjbb4rHUSIMaP5Y RmAHXPNln5VwVxchrNWNppC28LGPpGyVrElIxIahQHN1zahO9VVplf8IFNbS2DEsEU1v +8+l3+pMHaC1A7vIhGWDyrroiAqLDkBxfAHh2NSRqU+0ndVWr0/Ne8HU3jJz+f3/EYKo CP5A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709604126; x=1710208926; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=UYMI/uWgjt8U8M7NGaa1me0BRvgZMubHAraMibVciB8=; b=Ci2+fhVIpzqlnI6St1LJY5wDESwPPQ5evRh7MzZ0RfeRa7K/XW41hyoqFsd+qi7sEH +e0ZnSnvYK4NOMATEhvoC03LVJ0hZP2kU5+iMPLnVwEPRNS4lIQwQT7p3TdPiaYdD5HJ F4y4HzS59kImclawqbVS7QLolq0/W78jCrr4BkCBcMqnOucSS6vDi+lmZMtWYUM00gqT UBgxiitK2m/Yosb3cLiWGHGnmRVnv+D+iEwBitilUYOVumvIAK1ZrK6dirGL5+Iq44Vf nOwn7XygUMGq9dPxyEFTm2UQvk/0uACEnm+gP/rtRMi4uItW9EJF3vph+wrlGbnAxWW0 MCMg== X-Forwarded-Encrypted: i=1; AJvYcCWzuP09tgGZpt5qqZ/DAvpH6fNS8l+cLJ3Ef+EQwO/S7NBrhjsOSEq2WK4+56TnTnThwApLpj6ORvVr2v2hxclnm8royajKddBHq/N/rjEP X-Gm-Message-State: AOJu0YxyJ+khs/iDz5WT7MS5ckHhdiHXdEcBv2gLTZ1NgM5MbJ/kl/s3 ROvZsn0xXJ2QUJS10rlj8cLrMKwnQrbGw43COkhxdqxn2airdofK4mh/L3rc9AfNX2Gp9/fiVy6 kkFc+VYjNk5h1oKlPYP5VdA== X-Google-Smtp-Source: AGHT+IHikshAxx++8nG2dw06luiJ0Gl3uT8tIFsii7AEgYCRGjLOiaQ75+rHkDEm6uTu8+77ovp2pBnS9VNumweGBg== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:b614:914c:63cd:3830]) (user=almasrymina job=sendgmr) by 2002:a05:690c:82:b0:609:78d7:4e9 with SMTP id be2-20020a05690c008200b0060978d704e9mr2267874ywb.6.1709604126330; Mon, 04 Mar 2024 18:02:06 -0800 (PST) Date: Mon, 4 Mar 2024 18:01:39 -0800 In-Reply-To: <20240305020153.2787423-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240305020153.2787423-1-almasrymina@google.com> X-Mailer: git-send-email 2.44.0.rc1.240.g4c46232300-goog Message-ID: <20240305020153.2787423-5-almasrymina@google.com> Subject: [RFC PATCH net-next v6 04/15] net: netdev netlink api to bind dma-buf to a net device From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi API takes the dma-buf fd as input, and binds it to the netdevice. The user can specify the rx queues to bind the dma-buf to. Suggested-by: Stanislav Fomichev Signed-off-by: Mina Almasry --- Changes in v1: - Add rx-queue-type to distingish rx from tx (Jakub) - Return dma-buf ID from netlink API (David, Stan) Changes in RFC-v3: - Support binding multiple rx rx-queues --- Documentation/netlink/specs/netdev.yaml | 52 +++++++++++++++++++++++++ include/uapi/linux/netdev.h | 19 +++++++++ net/core/netdev-genl-gen.c | 19 +++++++++ net/core/netdev-genl-gen.h | 2 + net/core/netdev-genl.c | 6 +++ tools/include/uapi/linux/netdev.h | 19 +++++++++ 6 files changed, 117 insertions(+) diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml index 3addac970680..6f235fd7b14d 100644 --- a/Documentation/netlink/specs/netdev.yaml +++ b/Documentation/netlink/specs/netdev.yaml @@ -264,6 +264,45 @@ attribute-sets: name: napi-id doc: ID of the NAPI instance which services this queue. type: u32 + - + name: queue-dmabuf + attributes: + - + name: type + doc: rx or tx queue + type: u8 + enum: queue-type + - + name: idx + doc: queue index + type: u32 + + - + name: bind-dmabuf + attributes: + - + name: ifindex + doc: netdev ifindex to bind the dma-buf to. + type: u32 + checks: + min: 1 + - + name: queues + doc: receive queues to bind the dma-buf to. + type: nest + nested-attributes: queue-dmabuf + multi-attr: true + - + name: dmabuf-fd + doc: dmabuf file descriptor to bind. + type: u32 + - + name: dmabuf-id + doc: id of the dmabuf binding + type: u32 + checks: + min: 1 + operations: list: @@ -386,6 +425,19 @@ operations: attributes: - ifindex reply: *queue-get-op + - + name: bind-rx + doc: Bind dmabuf to netdev + attribute-set: bind-dmabuf + do: + request: + attributes: + - ifindex + - dmabuf-fd + - queues + reply: + attributes: + - dmabuf-id - name: napi-get doc: Get information about NAPI instances configured on the system. diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h index 93cb411adf72..6a5cd4af68c8 100644 --- a/include/uapi/linux/netdev.h +++ b/include/uapi/linux/netdev.h @@ -132,6 +132,24 @@ enum { NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1) }; +enum { + NETDEV_A_QUEUE_DMABUF_TYPE = 1, + NETDEV_A_QUEUE_DMABUF_IDX, + + __NETDEV_A_QUEUE_DMABUF_MAX, + NETDEV_A_QUEUE_DMABUF_MAX = (__NETDEV_A_QUEUE_DMABUF_MAX - 1) +}; + +enum { + NETDEV_A_BIND_DMABUF_IFINDEX = 1, + NETDEV_A_BIND_DMABUF_QUEUES, + NETDEV_A_BIND_DMABUF_DMABUF_FD, + NETDEV_A_BIND_DMABUF_DMABUF_ID, + + __NETDEV_A_BIND_DMABUF_MAX, + NETDEV_A_BIND_DMABUF_MAX = (__NETDEV_A_BIND_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, @@ -143,6 +161,7 @@ enum { NETDEV_CMD_PAGE_POOL_CHANGE_NTF, NETDEV_CMD_PAGE_POOL_STATS_GET, NETDEV_CMD_QUEUE_GET, + NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_GET, __NETDEV_CMD_MAX, diff --git a/net/core/netdev-genl-gen.c b/net/core/netdev-genl-gen.c index be7f2ebd61b2..3384b1ae3f40 100644 --- a/net/core/netdev-genl-gen.c +++ b/net/core/netdev-genl-gen.c @@ -27,6 +27,11 @@ const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFIND [NETDEV_A_PAGE_POOL_IFINDEX] = NLA_POLICY_FULL_RANGE(NLA_U32, &netdev_a_page_pool_ifindex_range), }; +const struct nla_policy netdev_queue_dmabuf_nl_policy[NETDEV_A_QUEUE_DMABUF_IDX + 1] = { + [NETDEV_A_QUEUE_DMABUF_TYPE] = NLA_POLICY_MAX(NLA_U8, 1), + [NETDEV_A_QUEUE_DMABUF_IDX] = { .type = NLA_U32, }, +}; + /* NETDEV_CMD_DEV_GET - do */ static const struct nla_policy netdev_dev_get_nl_policy[NETDEV_A_DEV_IFINDEX + 1] = { [NETDEV_A_DEV_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), @@ -58,6 +63,13 @@ static const struct nla_policy netdev_queue_get_dump_nl_policy[NETDEV_A_QUEUE_IF [NETDEV_A_QUEUE_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), }; +/* NETDEV_CMD_BIND_RX - do */ +static const struct nla_policy netdev_bind_rx_nl_policy[NETDEV_A_BIND_DMABUF_DMABUF_FD + 1] = { + [NETDEV_A_BIND_DMABUF_IFINDEX] = NLA_POLICY_MIN(NLA_U32, 1), + [NETDEV_A_BIND_DMABUF_DMABUF_FD] = { .type = NLA_U32, }, + [NETDEV_A_BIND_DMABUF_QUEUES] = NLA_POLICY_NESTED(netdev_queue_dmabuf_nl_policy), +}; + /* NETDEV_CMD_NAPI_GET - do */ static const struct nla_policy netdev_napi_get_do_nl_policy[NETDEV_A_NAPI_ID + 1] = { [NETDEV_A_NAPI_ID] = { .type = NLA_U32, }, @@ -124,6 +136,13 @@ static const struct genl_split_ops netdev_nl_ops[] = { .maxattr = NETDEV_A_QUEUE_IFINDEX, .flags = GENL_CMD_CAP_DUMP, }, + { + .cmd = NETDEV_CMD_BIND_RX, + .doit = netdev_nl_bind_rx_doit, + .policy = netdev_bind_rx_nl_policy, + .maxattr = NETDEV_A_BIND_DMABUF_DMABUF_FD, + .flags = GENL_CMD_CAP_DO, + }, { .cmd = NETDEV_CMD_NAPI_GET, .doit = netdev_nl_napi_get_doit, diff --git a/net/core/netdev-genl-gen.h b/net/core/netdev-genl-gen.h index a47f2bcbe4fa..a7ede514eccd 100644 --- a/net/core/netdev-genl-gen.h +++ b/net/core/netdev-genl-gen.h @@ -13,6 +13,7 @@ /* Common nested types */ extern const struct nla_policy netdev_page_pool_info_nl_policy[NETDEV_A_PAGE_POOL_IFINDEX + 1]; +extern const struct nla_policy netdev_queue_dmabuf_nl_policy[NETDEV_A_QUEUE_DMABUF_IDX + 1]; int netdev_nl_dev_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); @@ -26,6 +27,7 @@ int netdev_nl_page_pool_stats_get_dumpit(struct sk_buff *skb, int netdev_nl_queue_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); +int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_napi_get_doit(struct sk_buff *skb, struct genl_info *info); int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb); diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c index fd98936da3ae..0ed292d87ae0 100644 --- a/net/core/netdev-genl.c +++ b/net/core/netdev-genl.c @@ -469,6 +469,12 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb) return skb->len; } +/* Stub */ +int netdev_nl_bind_rx_doit(struct sk_buff *skb, struct genl_info *info) +{ + return 0; +} + static int netdev_genl_netdevice_event(struct notifier_block *nb, unsigned long event, void *ptr) { diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h index 93cb411adf72..6a5cd4af68c8 100644 --- a/tools/include/uapi/linux/netdev.h +++ b/tools/include/uapi/linux/netdev.h @@ -132,6 +132,24 @@ enum { NETDEV_A_QUEUE_MAX = (__NETDEV_A_QUEUE_MAX - 1) }; +enum { + NETDEV_A_QUEUE_DMABUF_TYPE = 1, + NETDEV_A_QUEUE_DMABUF_IDX, + + __NETDEV_A_QUEUE_DMABUF_MAX, + NETDEV_A_QUEUE_DMABUF_MAX = (__NETDEV_A_QUEUE_DMABUF_MAX - 1) +}; + +enum { + NETDEV_A_BIND_DMABUF_IFINDEX = 1, + NETDEV_A_BIND_DMABUF_QUEUES, + NETDEV_A_BIND_DMABUF_DMABUF_FD, + NETDEV_A_BIND_DMABUF_DMABUF_ID, + + __NETDEV_A_BIND_DMABUF_MAX, + NETDEV_A_BIND_DMABUF_MAX = (__NETDEV_A_BIND_DMABUF_MAX - 1) +}; + enum { NETDEV_CMD_DEV_GET = 1, NETDEV_CMD_DEV_ADD_NTF, @@ -143,6 +161,7 @@ enum { NETDEV_CMD_PAGE_POOL_CHANGE_NTF, NETDEV_CMD_PAGE_POOL_STATS_GET, NETDEV_CMD_QUEUE_GET, + NETDEV_CMD_BIND_RX, NETDEV_CMD_NAPI_GET, __NETDEV_CMD_MAX, From patchwork Tue Mar 5 02:01:41 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 778169 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5A29D199BC for ; Tue, 5 Mar 2024 02:02:11 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604135; cv=none; b=jKGhKgeum1kFxyZtEKFu92OjvFbNL4XUagc4HiUYoW7mg2uK0Oc/YYMvRiZPoY67pkhbWmu1Y8XqeG2HRnfBWoFWHTmKkOm1wVQM9lfDOvx68PvhIX/j0InHdJqjSJ7zKGckW6ZEQBp21RzaA32vjo5VZbs6q4bZb5Uzwvd7xt0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604135; c=relaxed/simple; bh=cvQhiIpC3y+RS9Gm99XbZoZlqQ7K9itfikVyQItkfww=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=eNrxaHtQV1x2qpAzKd6qdK9R5KpNSxBiDCVttqgODqweflQc6Isfs9OkIfFwVk+1e11PchFZ2BYVvXafFoDcFwNFa4T/WoLPEuuYC/oVAnjJyT0OxmLaFPUEXriO7BMcnshv2YUmMK0Jri78lHLchHeAqiQcsHhOljU5YtsyLME= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=AfKALwNM; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="AfKALwNM" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-dc64f63d768so318876276.2 for ; Mon, 04 Mar 2024 18:02:11 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1709604130; x=1710208930; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=GYNPKIZQ281lVVO6ngxcCFyACEH85a0Tth4BHfTjTEU=; b=AfKALwNMjAAVlDN+SvPY8KglsT5I76vqxr0bHIbXyU+wTdjgM+Z7sMNwzvS5zYxwcd 39LBe/XaH+3Xdd2o9ST323JA8QBylAl+49dz1wrf7ruYNgVsTzXpAA3M44Q/0kkSeFQZ XFf6wQvF2ppRphiGYJ71tlEBOo8i7RET1aLRI85pRsmOFnCnwO+dokpJmzrGwbInsp54 +7gI9GnpZcT80vrKeam638irWwUbQRiOB+soGiYmGkQZNDx1Is/sW9ELvcbX88KMWnCl DQcaqbizaJJguAtAyU3L/8dglxpJd+vwBRTzWQSzOy4aaPoy7foT5HaXVUp+d8oRkIiy dSHA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709604130; x=1710208930; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=GYNPKIZQ281lVVO6ngxcCFyACEH85a0Tth4BHfTjTEU=; b=aX/ChxbkXBxB19Y2jMb2c4xyuniuyeuTVEr9DWxZ1iH7TWGJGRFEnDcdhnm/nBPEne 6pgPVZ0Xl34k1JxjAOsn0QvJPtRxbYM2X3Iv0e4puF0jk526DXFRcYiKCBc2inv6unTK ps61rncSq0YwhlL0DGRnXlBSlJbhAlQaqWrcyfdfCTKVfezBgT8PLVG/T3TjlvSE6kwB Pmx7m10wwLSZTBEwPKv2sfcz3vyQTaxW0VmEAiKKMYokaJKLbvA8pt9ZcefYfqfB00pM V/HVYsNJdkoRrQbXuYiYttNOS5cyrRqQR3z6SfrxxTTxkcX8HwVDMbXbmziaMbX6onb9 q6Hg== X-Forwarded-Encrypted: i=1; AJvYcCVYzFFIbUCiAQwqHqL178mR6QLf5uIFOiegunGqdq6YTIzsrlgWs2GZw6zDGTKrLSUE4/DMLQPVpkbVKwDsljjZQhGb/cM9uQWbEP0U3QnI X-Gm-Message-State: AOJu0Ywc7OeXbVVupFSiD7/EuT1aOpl5okS53n6V9yQ/RttmPtU7eQyi XEMaQXfeuW8M/R7BrzULfzNfno2ui6Tyir6qhk8SwzWCvLqm2Mr0+MnUn1YrfZYgGU76wzvH0Cb 9IBJmmHNzrssr5QvPQovwYQ== X-Google-Smtp-Source: AGHT+IFp6UEtUOWoZdRym40bjXyf80/9kbOdm1B3G9vAhJbMjUPJhOdTewfBOuFMN7wsO34DBHuDxFG4/drkODwMOA== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:b614:914c:63cd:3830]) (user=almasrymina job=sendgmr) by 2002:a05:6902:2492:b0:dcb:b9d7:2760 with SMTP id ds18-20020a056902249200b00dcbb9d72760mr2965477ybb.13.1709604130576; Mon, 04 Mar 2024 18:02:10 -0800 (PST) Date: Mon, 4 Mar 2024 18:01:41 -0800 In-Reply-To: <20240305020153.2787423-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240305020153.2787423-1-almasrymina@google.com> X-Mailer: git-send-email 2.44.0.rc1.240.g4c46232300-goog Message-ID: <20240305020153.2787423-7-almasrymina@google.com> Subject: [RFC PATCH net-next v6 06/15] netdev: netdevice devmem allocator From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Willem de Bruijn , Kaiyuan Zhang Implement netdev devmem allocator. The allocator takes a given struct netdev_dmabuf_binding as input and allocates net_iov from that binding. The allocation simply delegates to the binding's genpool for the allocation logic and wraps the returned memory region in a net_iov struct. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- v6: - Add comment on net_iov_dma_addr to explain why we don't use niov->dma_addr (Pavel) - Refactor new functions into net/core/devmem.c (Pavel) v1: - Rename devmem -> dmabuf (David). --- include/net/devmem.h | 12 ++++++++++++ include/net/netmem.h | 40 ++++++++++++++++++++++++++++++++++++++++ net/core/devmem.c | 38 ++++++++++++++++++++++++++++++++++++++ 3 files changed, 90 insertions(+) diff --git a/include/net/devmem.h b/include/net/devmem.h index 85ccbbe84c65..4207adadc2bb 100644 --- a/include/net/devmem.h +++ b/include/net/devmem.h @@ -67,6 +67,8 @@ struct dmabuf_genpool_chunk_owner { }; #ifdef CONFIG_DMA_SHARED_BUFFER +struct net_iov *netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding); +void netdev_free_dmabuf(struct net_iov *ppiov); void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding); int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, struct netdev_dmabuf_binding **out); @@ -74,6 +76,16 @@ void netdev_unbind_dmabuf(struct netdev_dmabuf_binding *binding); int netdev_bind_dmabuf_to_queue(struct net_device *dev, u32 rxq_idx, struct netdev_dmabuf_binding *binding); #else +static inline struct net_iov * +netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding) +{ + return NULL; +} + +static inline void netdev_free_dmabuf(struct net_iov *ppiov) +{ +} + static inline void __netdev_dmabuf_binding_free(struct netdev_dmabuf_binding *binding) { diff --git a/include/net/netmem.h b/include/net/netmem.h index 72e932a1a948..ca17ea1d33f8 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -14,8 +14,48 @@ struct net_iov { struct dmabuf_genpool_chunk_owner *owner; + unsigned long dma_addr; }; +static inline struct dmabuf_genpool_chunk_owner * +net_iov_owner(const struct net_iov *niov) +{ + return niov->owner; +} + +static inline unsigned int net_iov_idx(const struct net_iov *niov) +{ + return niov - net_iov_owner(niov)->niovs; +} + +/* This returns the absolute dma_addr_t calculated from + * net_iov_owner(niov)->owner->base_dma_addr, not the page_pool-owned + * niov->dma_addr. + * + * The absolute dma_addr_t is a dma_addr_t that is always uncompressed. + * + * The page_pool-owner niov->dma_addr is the absolute dma_addr compressed into + * an unsigned long. Special handling is done when the unsigned long is 32-bit + * but the dma_addr_t is 64-bit. + * + * In general code looking for the dma_addr_t should use net_iov_dma_addr(), + * while page_pool code looking for the unsigned long dma_addr which mirrors + * the field in struct page should use niov->dma_addr. + */ +static inline dma_addr_t net_iov_dma_addr(const struct net_iov *niov) +{ + struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov); + + return owner->base_dma_addr + + ((dma_addr_t)net_iov_idx(niov) << PAGE_SHIFT); +} + +static inline struct netdev_dmabuf_binding * +net_iov_binding(const struct net_iov *niov) +{ + return net_iov_owner(niov)->binding; +} + /* netmem */ /** diff --git a/net/core/devmem.c b/net/core/devmem.c index 779ad990971e..57d3a1f223ef 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -93,6 +93,44 @@ static int netdev_restart_rx_queue(struct net_device *dev, int rxq_idx) return err; } +struct net_iov *netdev_alloc_dmabuf(struct netdev_dmabuf_binding *binding) +{ + struct dmabuf_genpool_chunk_owner *owner; + unsigned long dma_addr; + struct net_iov *niov; + ssize_t offset; + ssize_t index; + + dma_addr = gen_pool_alloc_owner(binding->chunk_pool, PAGE_SIZE, + (void **)&owner); + if (!dma_addr) + return NULL; + + offset = dma_addr - owner->base_dma_addr; + index = offset / PAGE_SIZE; + niov = &owner->niovs[index]; + + niov->pp_magic = 0; + niov->pp = NULL; + niov->dma_addr = 0; + atomic_long_set(&niov->pp_ref_count, 0); + + netdev_dmabuf_binding_get(binding); + + return niov; +} + +void netdev_free_dmabuf(struct net_iov *niov) +{ + struct netdev_dmabuf_binding *binding = net_iov_binding(niov); + unsigned long dma_addr = net_iov_dma_addr(niov); + + if (gen_pool_has_addr(binding->chunk_pool, dma_addr, PAGE_SIZE)) + gen_pool_free(binding->chunk_pool, dma_addr, PAGE_SIZE); + + netdev_dmabuf_binding_put(binding); +} + /* Protected by rtnl_lock() */ static DEFINE_XARRAY_FLAGS(netdev_dmabuf_bindings, XA_FLAGS_ALLOC1); From patchwork Tue Mar 5 02:01:44 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 778168 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 627BF7E0EA for ; Tue, 5 Mar 2024 02:02:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604143; cv=none; b=S8v5iOEJe0JUcO+DidFjEDveFKNukDeoc3dfxiWCdoEz/CzsQB/vkjYYFgTsdvEaRnTxwK1uWq8ops715CsXZB8xbwP8h5WtH8lpxFX2jIwbshx5HCHzOmOWDpF9fRfwYPKIme4POer5wxS8yEd4cCkr9zX0DgWoSNyq9reFgbA= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604143; c=relaxed/simple; bh=CBkia3F06CuPcVqYc1E2oaD9qB4csiiy6tPKTs3axrc=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=VWcwiRd8H9UX/qCj6OBOKrueq9Ow6OPcGklsVVIFvxweUCreIkx4/rNfoVZZgUnSkJXkomEReUF1LBujEaNSYQSK61X5eFeWLQtOcpcHGLKz8bqMRQMBuajxLRlB2RomDmiC3VPoPImKo+7MQGKfbIAxJat0AWmrOD/c6sqaoh0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=RP0+u7X1; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="RP0+u7X1" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-608e4171382so67829767b3.3 for ; Mon, 04 Mar 2024 18:02:18 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1709604137; x=1710208937; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=byPVwM06Y0oEPlUZgHka14Nm3J9bKSTxMh+j9bVfhME=; b=RP0+u7X1RbgJrrk7oCN6yxZeqbm6dSTpERpzE4b5pziQub90+TQtTF13kSLgJC90RU A3L0EL/2Llgrt8yh1wg4yzpqFZnESLLBJ0JckNieEFXigR947T++AP6ErLvKW2dTcDGD dKfjLYPNCqDCjZzsm23HBokKvPIVIgOOEWP5YcF8Z4TDsWNiluTrKYMsWM+uBHcHZ7gI jEywNiHmky3GMOyPhvE2v+dEy0LqvWvV8QV3wj6Y3egOytlAmpFVPRhPdtbR/D3k4vys 7Ai66M+qPn3iUzbuRD1EJB23eV20A/6nIn0hYODfoZIzQwedwUzdsMYpd/Ti365W03ar Dv6Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709604137; x=1710208937; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=byPVwM06Y0oEPlUZgHka14Nm3J9bKSTxMh+j9bVfhME=; b=EOxZA9SmqxYNCgHif9sugeoSGhtvNNiA1PCdOYrA2R0evOjJSuntJBBcfP9KoM2INT szSJ1HHf6Jx6ESOP2RUnTCbQs4AuvvrJUNU/O3XWi0SqG5n688AWEOUCsDrk9m8tKpfS RRfWgwcur1gO0/LNvmuROkR74q/emOnmUWQkvdL01jqirwN6ihxCvlXSvxCyG+BYcMiK ugrnoNX/NGf3EfjP/yiHhNZky/oH/fU3CwTncB0IdFmH1MnbcV8c1Ssv4nz/3CiIrJxV KrTM1ac/DVpepdMAgBX1B6pLVGUNMJIuZ7wMujWkOJnWqNvbr2T5KZ/idHbGxpPvBCjm E6vg== X-Forwarded-Encrypted: i=1; AJvYcCXOejIuitmjdKkhCmmRd46IAbJhGA2960cPubf0BhStPZaZ1BreZcjl4bIaOIOml+Fkp1jJTrhGvdFjdWJjZVHZ9yvDZd9RiNiSYbXufyFz X-Gm-Message-State: AOJu0YwvqPTkOlMS3mN/5zqDIWR9z0fg9LkNeoerjD+KjOErx2bxm3SB cMozfs8eV7OUpEBLoXVjB+5ABjBCzUVjvLOtZfINrmoTOdrsGfd4OP4oRZ5YM9oow2+lYYj9++z zWYWDERl5RpM3B+CXV9Jsag== X-Google-Smtp-Source: AGHT+IEoXl0wNdHnbs6D9FyjuWe8y/qyisOWLTyu22a+L//Vxmoz1SXuDH6MSW6bIn4oo2iyEmWJJoNV4OBrBIfN8g== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:b614:914c:63cd:3830]) (user=almasrymina job=sendgmr) by 2002:a05:6902:f0b:b0:dcc:79ab:e522 with SMTP id et11-20020a0569020f0b00b00dcc79abe522mr463378ybb.11.1709604137118; Mon, 04 Mar 2024 18:02:17 -0800 (PST) Date: Mon, 4 Mar 2024 18:01:44 -0800 In-Reply-To: <20240305020153.2787423-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240305020153.2787423-1-almasrymina@google.com> X-Mailer: git-send-email 2.44.0.rc1.240.g4c46232300-goog Message-ID: <20240305020153.2787423-10-almasrymina@google.com> Subject: [RFC PATCH net-next v6 09/15] memory-provider: dmabuf devmem memory provider From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Willem de Bruijn , Kaiyuan Zhang Implement a memory provider that allocates dmabuf devmem in the form of net_iov. The provider receives a reference to the struct netdev_dmabuf_binding via the pool->mp_priv pointer. The driver needs to set this pointer for the provider in the net_iov. The provider obtains a reference on the netdev_dmabuf_binding which guarantees the binding and the underlying mapping remains alive until the provider is destroyed. Usage of PP_FLAG_DMA_MAP is required for this memory provide such that the page_pool can provide the driver with the dma-addrs of the devmem. Support for PP_FLAG_DMA_SYNC_DEV is omitted for simplicity & p.order != 0. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- v6: - refactor new memory provider functions into net/core/devmem.c (Pavel) v2: - Disable devmem for p.order != 0 v1: - static_branch check in page_is_page_pool_iov() (Willem & Paolo). - PP_DEVMEM -> PP_IOV (David). - Require PP_FLAG_DMA_MAP (Jakub). --- include/net/netmem.h | 14 ++++++ include/net/page_pool/helpers.h | 21 +++++++++ include/net/page_pool/types.h | 2 + net/core/devmem.c | 82 +++++++++++++++++++++++++++++++++ net/core/page_pool.c | 35 ++++++-------- 5 files changed, 132 insertions(+), 22 deletions(-) diff --git a/include/net/netmem.h b/include/net/netmem.h index 8699788d587d..a2de9411025d 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -127,6 +127,20 @@ static inline struct page *netmem_to_page(netmem_ref netmem) return (__force struct page *)netmem; } +static inline struct net_iov *netmem_to_net_iov(netmem_ref netmem) +{ + if (netmem_is_net_iov(netmem)) + return (struct net_iov *)((__force unsigned long)netmem & ~NET_IOV); + + DEBUG_NET_WARN_ON_ONCE(true); + return NULL; +} + +static inline netmem_ref net_iov_to_netmem(struct net_iov *niov) +{ + return (__force netmem_ref)((unsigned long)niov | NET_IOV); +} + static inline netmem_ref page_to_netmem(struct page *page) { return (__force netmem_ref)page; diff --git a/include/net/page_pool/helpers.h b/include/net/page_pool/helpers.h index c6a55eddefae..00682b4de6e8 100644 --- a/include/net/page_pool/helpers.h +++ b/include/net/page_pool/helpers.h @@ -453,4 +453,25 @@ static inline void page_pool_nid_changed(struct page_pool *pool, int new_nid) page_pool_update_nid(pool, new_nid); } +static inline void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) +{ + netmem_set_pp(netmem, pool); + netmem_or_pp_magic(netmem, PP_SIGNATURE); + + /* Ensuring all pages have been split into one fragment initially: + * page_pool_set_pp_info() is only called once for every page when it + * is allocated from the page allocator and page_pool_fragment_page() + * is dirtying the same cache line as the page->pp_magic above, so + * the overhead is negligible. + */ + page_pool_fragment_netmem(netmem, 1); + if (pool->has_init_callback) + pool->slow.init_callback(netmem, pool->slow.init_arg); +} + +static inline void page_pool_clear_pp_info(netmem_ref netmem) +{ + netmem_clear_pp_magic(netmem); + netmem_set_pp(netmem, NULL); +} #endif /* _NET_PAGE_POOL_HELPERS_H */ diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h index e29e77f7934e..096cd2455b2c 100644 --- a/include/net/page_pool/types.h +++ b/include/net/page_pool/types.h @@ -136,6 +136,8 @@ struct memory_provider_ops { bool (*release_page)(struct page_pool *pool, netmem_ref netmem); }; +extern const struct memory_provider_ops dmabuf_devmem_ops; + struct page_pool { struct page_pool_params_fast p; diff --git a/net/core/devmem.c b/net/core/devmem.c index 57d3a1f223ef..3ced312f7860 100644 --- a/net/core/devmem.c +++ b/net/core/devmem.c @@ -329,3 +329,85 @@ int netdev_bind_dmabuf(struct net_device *dev, unsigned int dmabuf_fd, return err; } #endif + +/*** "Dmabuf devmem memory provider" ***/ + +static int mp_dmabuf_devmem_init(struct page_pool *pool) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + + if (!binding) + return -EINVAL; + + if (!(pool->p.flags & PP_FLAG_DMA_MAP)) + return -EOPNOTSUPP; + + if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV) + return -EOPNOTSUPP; + + if (pool->p.order != 0) + return -E2BIG; + + netdev_dmabuf_binding_get(binding); + return 0; +} + +static netmem_ref mp_dmabuf_devmem_alloc_pages(struct page_pool *pool, + gfp_t gfp) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + netmem_ref netmem; + struct net_iov *niov; + dma_addr_t dma_addr; + + niov = netdev_alloc_dmabuf(binding); + if (!niov) + return 0; + + dma_addr = net_iov_dma_addr(niov); + + netmem = net_iov_to_netmem(niov); + + page_pool_set_pp_info(pool, netmem); + + if (page_pool_set_dma_addr_netmem(netmem, dma_addr)) + goto err_free; + + pool->pages_state_hold_cnt++; + trace_page_pool_state_hold(pool, netmem, pool->pages_state_hold_cnt); + return netmem; + +err_free: + netdev_free_dmabuf(niov); + return 0; +} + +static void mp_dmabuf_devmem_destroy(struct page_pool *pool) +{ + struct netdev_dmabuf_binding *binding = pool->mp_priv; + + netdev_dmabuf_binding_put(binding); +} + +static bool mp_dmabuf_devmem_release_page(struct page_pool *pool, + netmem_ref netmem) +{ + WARN_ON_ONCE(!netmem_is_net_iov(netmem)); + WARN_ON_ONCE(atomic_long_read(netmem_get_pp_ref_count_ref(netmem)) + != 1); + + page_pool_clear_pp_info(netmem); + + netdev_free_dmabuf(netmem_to_net_iov(netmem)); + + /* We don't want the page pool put_page()ing our net_iovs. */ + return false; +} + +const struct memory_provider_ops dmabuf_devmem_ops = { + .init = mp_dmabuf_devmem_init, + .destroy = mp_dmabuf_devmem_destroy, + .alloc_pages = mp_dmabuf_devmem_alloc_pages, + .release_page = mp_dmabuf_devmem_release_page, +}; +EXPORT_SYMBOL(dmabuf_devmem_ops); diff --git a/net/core/page_pool.c b/net/core/page_pool.c index 22e3d439da18..2cee7d9f6ca6 100644 --- a/net/core/page_pool.c +++ b/net/core/page_pool.c @@ -12,6 +12,7 @@ #include #include +#include #include #include @@ -20,12 +21,15 @@ #include #include #include +#include +#include #include #include "page_pool_priv.h" DEFINE_STATIC_KEY_FALSE(page_pool_mem_providers); +EXPORT_SYMBOL(page_pool_mem_providers); #define DEFER_TIME (msecs_to_jiffies(1000)) #define DEFER_WARN_INTERVAL (60 * HZ) @@ -178,6 +182,7 @@ static int page_pool_init(struct page_pool *pool, const struct page_pool_params *params, int cpuid) { + struct netdev_dmabuf_binding *binding = NULL; unsigned int ring_qsize = 1024; /* Default */ int err; @@ -251,6 +256,14 @@ static int page_pool_init(struct page_pool *pool, /* Driver calling page_pool_create() also call page_pool_destroy() */ refcount_set(&pool->user_cnt, 1); + if (pool->p.queue) + binding = READ_ONCE(pool->p.queue->binding); + + if (binding) { + pool->mp_ops = &dmabuf_devmem_ops; + pool->mp_priv = binding; + } + if (pool->mp_ops) { err = pool->mp_ops->init(pool); if (err) { @@ -444,28 +457,6 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem) return false; } -static void page_pool_set_pp_info(struct page_pool *pool, netmem_ref netmem) -{ - netmem_set_pp(netmem, pool); - netmem_or_pp_magic(netmem, PP_SIGNATURE); - - /* Ensuring all pages have been split into one fragment initially: - * page_pool_set_pp_info() is only called once for every page when it - * is allocated from the page allocator and page_pool_fragment_page() - * is dirtying the same cache line as the page->pp_magic above, so - * the overhead is negligible. - */ - page_pool_fragment_netmem(netmem, 1); - if (pool->has_init_callback) - pool->slow.init_callback(netmem, pool->slow.init_arg); -} - -static void page_pool_clear_pp_info(netmem_ref netmem) -{ - netmem_clear_pp_magic(netmem); - netmem_set_pp(netmem, NULL); -} - static struct page *__page_pool_alloc_page_order(struct page_pool *pool, gfp_t gfp) { From patchwork Tue Mar 5 02:01:45 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 778167 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 72B8F7E599 for ; Tue, 5 Mar 2024 02:02:20 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604145; cv=none; b=dTI9HRkwEURFUJrKPRv+uSR146pvGCyh/f5ZteL0Mo0mtpK44d8D652cIFWEwHwW+D/9NEEwnETP4mzCZRmXITjFmxzMnh6kJzkVglAYF3H13l8fkNs4HLZPSk07rsuCV/hkAfBL7EDVsw4p0Z2lVXUq9bZK1oa40IHf5aK+bwE= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604145; c=relaxed/simple; bh=ZkKDRjbXOJaCgdsUUhVdqCPuQ0SfWJoIIrbbmmGQ/FE=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=Beh5uLq691JiSIw9FtkTxwOpv2D+JTRxWe+aRlRpXZ9EJgiXbsccg9JMFafR//GOHhjmhncGgtdFrtHm3yQLkuKMZ2aIRHihDrAUX600YJmDulikCsl250NyxKo59lFgsf891FXkboflD4aquR1OrAbR92uUrW9SS6zT4zzIFFM= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=WLoWZiZp; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="WLoWZiZp" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-dd0ae66422fso225731276.0 for ; Mon, 04 Mar 2024 18:02:20 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1709604139; x=1710208939; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=kFOLH8z6Kc1/DHNjJCRtctXlk4wkt0G4E5LxaUbD2lM=; b=WLoWZiZpUGPt4dSD7lzG4aIwzfZFDrSu2b9dCRJI0s0vPx1F/4kssBhpB/fm7J5vM4 YGrzYu5FikmZpB/QM6HH8/roiaA3J/zaE9kbJwyib2Esv/gG7G2EtE0pvOSIMvYeuDZn cE1g0GZkmVTpkpY4ktpZPtuiBap6M6ChGB/G6zxyhq70o1sNFr7v8o/Xl7J2tkFosklJ rKP2Nx/+5+Wnc9JCAiPP09YRZa8EqY1HSxlr8Zjs9ut3WndncFAvAlNd38BuRBcZLcS7 KQ0OXjzG1WV6nuXHeVo62EjW9ffcoue8rRspAYYPDJJyqltH3qGHhZwD7pRWVqn4GHfm bP7g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709604139; x=1710208939; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=kFOLH8z6Kc1/DHNjJCRtctXlk4wkt0G4E5LxaUbD2lM=; b=XMV5IDcP2g+Eh25Kx69kxDd7QhMkSfKMPlQqcrEUrInUCNWkzfX4Ssiwpmb++cG/lK VLRjmpDod4rynVxCWkV0SnFWZoQHPdxj6/uj+v6TPV0t7B2EzT1LGgP9JZkidIZFBNpJ 5du6al1SGNxt6vxmGDSAzrkLUZnEI1WEblsAbq/ciCmaJHlh+PQTz3gTh3axqscnQoWp 3/znI0orAjbb3RqYM/PxLVNBx7esLRRsGy/H43XXqa/39q6aVScM+zaUDSTSF2HHfKHB EKwzm0sDZ0j++ZH8T/W/0kRqFXAJaTACSWcV9HrWanzrubvviZ8PcDlKuCJyf7zqFtXV UZFg== X-Forwarded-Encrypted: i=1; AJvYcCXd9+OlqRioAvRhQ/EiXhB1J1u0XqYgGY33CRbzjnt0yP1GK/YWs53ZxmFEFJUXd08OHYVvV1yvPJhGEJec8FkYPXg6hmHpAz6ADI7vF9up X-Gm-Message-State: AOJu0YwLdN10Cps6WBR14y7zcY7T4WbcKOR7dQ395BMCyo5NFhnwIaFY iKrt9UpMp54Hx1xl/unpY7k3AYOIFsfAUYYqYY66VBTZNj/AWxboZE8Xka+pSO32z2pTBMFENiN Tc7STBXnBd2qOAZ8Cvd81nQ== X-Google-Smtp-Source: AGHT+IERAxLDH9CLsybVTiG1Kl+Ij4FHQxio+Skkp7XFr1VDsZ/QGA1eBFcDme38Y+MfcJE7+b6HhIBoi+3CQvIBng== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:b614:914c:63cd:3830]) (user=almasrymina job=sendgmr) by 2002:a05:6902:1504:b0:dbd:b4e8:1565 with SMTP id q4-20020a056902150400b00dbdb4e81565mr342247ybu.4.1709604139120; Mon, 04 Mar 2024 18:02:19 -0800 (PST) Date: Mon, 4 Mar 2024 18:01:45 -0800 In-Reply-To: <20240305020153.2787423-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240305020153.2787423-1-almasrymina@google.com> X-Mailer: git-send-email 2.44.0.rc1.240.g4c46232300-goog Message-ID: <20240305020153.2787423-11-almasrymina@google.com> Subject: [RFC PATCH net-next v6 10/15] net: support non paged skb frags From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi Make skb_frag_page() fail in the case where the frag is not backed by a page, and fix its relevant callers to handle this case. Signed-off-by: Mina Almasry --- v6: - Rebased on top of the merged netmem changes. Changes in v1: - Fix illegal_highdma() (Yunsheng). - Rework napi_pp_put_page() slightly to reduce code churn (Willem). --- include/linux/skbuff.h | 47 +++++++++++++++++++++++++++++++++++++----- net/core/dev.c | 3 ++- net/core/gro.c | 2 +- net/core/skbuff.c | 11 ++++++++++ net/ipv4/tcp.c | 3 +++ 5 files changed, 59 insertions(+), 7 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index ca29d1fd4561..d68d430dc596 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -3466,17 +3466,53 @@ static inline void skb_frag_off_copy(skb_frag_t *fragto, fragto->offset = fragfrom->offset; } +/* Returns true if the skb_frag contains a net_iov. */ +static inline bool skb_frag_is_net_iov(const skb_frag_t *frag) +{ + return netmem_is_net_iov(frag->netmem); +} + +/** + * skb_frag_net_iov - retrieve the net_iov referred to by fragment + * @frag: the fragment + * + * Returns the &struct net_iov associated with @frag. Returns NULL if this + * frag has no associated net_iov. + */ +static inline struct net_iov *skb_frag_net_iov(const skb_frag_t *frag) +{ + if (!skb_frag_is_net_iov(frag)) + return NULL; + + return netmem_to_net_iov(frag->netmem); +} + /** * skb_frag_page - retrieve the page referred to by a paged fragment * @frag: the paged fragment * - * Returns the &struct page associated with @frag. + * Returns the &struct page associated with @frag. Returns NULL if this frag + * has no associated page. */ static inline struct page *skb_frag_page(const skb_frag_t *frag) { + if (skb_frag_is_net_iov(frag)) + return NULL; + return netmem_to_page(frag->netmem); } +/** + * skb_frag_netmem - retrieve the netmem referred to by a fragment + * @frag: the fragment + * + * Returns the &netmem_ref associated with @frag. + */ +static inline netmem_ref skb_frag_netmem(const skb_frag_t *frag) +{ + return frag->netmem; +} + /** * __skb_frag_ref - take an addition reference on a paged fragment. * @frag: the paged fragment @@ -3509,13 +3545,11 @@ bool napi_pp_put_page(netmem_ref netmem, bool napi_safe); static inline void napi_frag_unref(skb_frag_t *frag, bool recycle, bool napi_safe) { - struct page *page = skb_frag_page(frag); - #ifdef CONFIG_PAGE_POOL - if (recycle && napi_pp_put_page(page_to_netmem(page), napi_safe)) + if (recycle && napi_pp_put_page(skb_frag_netmem(frag), napi_safe)) return; #endif - put_page(page); + put_page(skb_frag_page(frag)); } /** @@ -3555,6 +3589,9 @@ static inline void skb_frag_unref(struct sk_buff *skb, int f) */ static inline void *skb_frag_address(const skb_frag_t *frag) { + if (!skb_frag_page(frag)) + return NULL; + return page_address(skb_frag_page(frag)) + skb_frag_off(frag); } diff --git a/net/core/dev.c b/net/core/dev.c index bbea1b252529..765f4a995693 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -3385,8 +3385,9 @@ static int illegal_highdma(struct net_device *dev, struct sk_buff *skb) if (!(dev->features & NETIF_F_HIGHDMA)) { for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct page *page = skb_frag_page(frag); - if (PageHighMem(skb_frag_page(frag))) + if (page && PageHighMem(page)) return 1; } } diff --git a/net/core/gro.c b/net/core/gro.c index 0759277dc14e..42d7f6755f32 100644 --- a/net/core/gro.c +++ b/net/core/gro.c @@ -376,7 +376,7 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff) NAPI_GRO_CB(skb)->frag0 = NULL; NAPI_GRO_CB(skb)->frag0_len = 0; - if (!skb_headlen(skb) && pinfo->nr_frags && + if (!skb_headlen(skb) && pinfo->nr_frags && skb_frag_page(frag0) && !PageHighMem(skb_frag_page(frag0)) && (!NET_IP_ALIGN || !((skb_frag_off(frag0) + nhoff) & 3))) { NAPI_GRO_CB(skb)->frag0 = skb_frag_address(frag0); diff --git a/net/core/skbuff.c b/net/core/skbuff.c index cf23392e97f5..907fff2894d3 100644 --- a/net/core/skbuff.c +++ b/net/core/skbuff.c @@ -1378,6 +1378,14 @@ void skb_dump(const char *level, const struct sk_buff *skb, bool full_pkt) struct page *p; u8 *vaddr; + if (skb_frag_is_net_iov(frag)) { + printk("%sskb frag %d: not readable\n", level, i); + len -= frag->bv_len; + if (!len) + break; + continue; + } + skb_frag_foreach_page(frag, skb_frag_off(frag), skb_frag_size(frag), p, p_off, p_len, copied) { @@ -3145,6 +3153,9 @@ static bool __skb_splice_bits(struct sk_buff *skb, struct pipe_inode_info *pipe, for (seg = 0; seg < skb_shinfo(skb)->nr_frags; seg++) { const skb_frag_t *f = &skb_shinfo(skb)->frags[seg]; + if (WARN_ON_ONCE(!skb_frag_page(f))) + return false; + if (__splice_segment(skb_frag_page(f), skb_frag_off(f), skb_frag_size(f), offset, len, spd, false, sk, pipe)) diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index c82dc42f57c6..51a5d263e8b4 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -2166,6 +2166,9 @@ static int tcp_zerocopy_receive(struct sock *sk, break; } page = skb_frag_page(frags); + if (WARN_ON_ONCE(!page)) + break; + prefetchw(page); pages[pages_to_map++] = page; length += PAGE_SIZE; From patchwork Tue Mar 5 02:01:47 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 778166 Received: from mail-yb1-f202.google.com (mail-yb1-f202.google.com [209.85.219.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id E9EF27EF09 for ; Tue, 5 Mar 2024 02:02:25 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604151; cv=none; b=uwIe/xjSviIl6zwQaG1R0M2MqFZYMRLmHNvCtcJ3/DJjwbYtsTBuUdFIWAuGJDVHX1wFUk8bkdflwiyx3a0SOeqr8JaYxHYcg5T7FaWNsONzG7aMmUyBxm1xs/JOHszMxygZH5ThHGrmAYu3szYxbrCwurnS+SpPtQ4xi6udEqI= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604151; c=relaxed/simple; bh=OxUK2yuPZ0gVaZ38KRC8xVb9nea7y4CyLEB+TAnkjgk=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=fAH+7qnzQA9FCQUUzq5T5mzcWKzXN7TMVG/Vn/kDC2trqlHvLP9pfLlFqBW1tpmqAXLATzjfDpLzV4NpG9VtlKRpj+A/Mkjymq4x9ln7xYw4dXA0RY3ASLnej53JwJdtV54dniJC7PQ/OgVhA1t6bXtjwff6RfxQxwF51pTc098= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=GejCZmLA; arc=none smtp.client-ip=209.85.219.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="GejCZmLA" Received: by mail-yb1-f202.google.com with SMTP id 3f1490d57ef6-dbf216080f5so8634056276.1 for ; Mon, 04 Mar 2024 18:02:24 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1709604143; x=1710208943; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=G8LLKKAsPxPmekNKpp7Ok+iKx4V3zK/x5GMXZWuZ1C8=; b=GejCZmLArLJ+kDcv3WbACnCBvOjJlYBEKkG+RjW5GBcC11mGffC+ldMsLWXjcRE1SC ASuo55dI0dnR0S6vfuMiUPlmOMIQNwMFzdnNPcCW8EZqjq9kw5LEceC7MluFl8Gzdtjt /bz38t8srfltfwZs6gj9uW20DbCBgmaa0w1L4OIYDEH0fiWwYyw0ly0Z4OT7w5wVYf0n /U96qi+ggqkrDDYM/pF7RENIaSYgzFMxfWs9DrLv1TLmfN5PjK1R+fefLsoU8tkPWSFv XnPuwITn1L+GkiOGBQfh8rTF5wwxZaX3oSWGB1pjAa5w1DpMJ67Y8QbaxQW2hXjKzicx sjlg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709604143; x=1710208943; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=G8LLKKAsPxPmekNKpp7Ok+iKx4V3zK/x5GMXZWuZ1C8=; b=XmHGm7hDq3eTT0201+2NV0/IRblGUQyCKqUVATTHvbx0wl3uX7rZJ2rDk71d/Ijq4s GYOgyFyNaLEg/GecXnD/lGbIXdzFR9lgqG61Ybe4UrQFrg0XYboatZ55Lq51WAR5Jn1n HQ0kBt03aLKXh959Q4CdjjUFZZmsBxFmMZwxujJkz/sdKtXSzRCr/Ml0jnJTaCFeyGbq aLEytNdZ0BsG8sJrBlclpCb6/5ddCdGFJDOCvh7FJtlRijVb0jY+9sgHtmWXwI1opkBO 92JiJyS1Z2AdBgIgGTAkB5MjItkZbRTJsdyYMLEVKSJ4agumH8f8y4Rq7f6vqzHCl2wf tc9g== X-Forwarded-Encrypted: i=1; AJvYcCVicPl9wwbau2Zlm+y3WF5Jc6/IhwlV7TlkUlng3EhG5UJHtQGYLbJmM4jKFOzUJyMWkT79NVmoadmAnejzaz+weOa9fONfnBLZT//59NvF X-Gm-Message-State: AOJu0Yxej7GThIWQuzqMMousyCt5N9Kmnw17n306ckp5FQDbXCFUGdFx hT/8byW+mjnq0v6KFmwtR78E3G792ury32nbh01/hEBp48KT74JxQ9NHVoqC+cyY32noB6lPlWE s23ztSfADrYjhtjTm1CQwSA== X-Google-Smtp-Source: AGHT+IEuuDlIhDQWa3FGDlFl4giPLInY87bQxHaIbAPxFLDRPGmK63Thv5zNW54mQh/PGq+T64W5x3XMjWtk/YtF1A== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:b614:914c:63cd:3830]) (user=almasrymina job=sendgmr) by 2002:a05:6902:1744:b0:dcf:6b50:9bd7 with SMTP id bz4-20020a056902174400b00dcf6b509bd7mr2805190ybb.7.1709604143276; Mon, 04 Mar 2024 18:02:23 -0800 (PST) Date: Mon, 4 Mar 2024 18:01:47 -0800 In-Reply-To: <20240305020153.2787423-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240305020153.2787423-1-almasrymina@google.com> X-Mailer: git-send-email 2.44.0.rc1.240.g4c46232300-goog Message-ID: <20240305020153.2787423-13-almasrymina@google.com> Subject: [RFC PATCH net-next v6 12/15] tcp: RX path for devmem TCP From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi , Willem de Bruijn , Kaiyuan Zhang In tcp_recvmsg_locked(), detect if the skb being received by the user is a devmem skb. In this case - if the user provided the MSG_SOCK_DEVMEM flag - pass it to tcp_recvmsg_devmem() for custom handling. tcp_recvmsg_devmem() copies any data in the skb header to the linear buffer, and returns a cmsg to the user indicating the number of bytes returned in the linear buffer. tcp_recvmsg_devmem() then loops over the unaccessible devmem skb frags, and returns to the user a cmsg_devmem indicating the location of the data in the dmabuf device memory. cmsg_devmem contains this information: 1. the offset into the dmabuf where the payload starts. 'frag_offset'. 2. the size of the frag. 'frag_size'. 3. an opaque token 'frag_token' to return to the kernel when the buffer is to be released. The pages awaiting freeing are stored in the newly added sk->sk_user_frags, and each page passed to userspace is get_page()'d. This reference is dropped once the userspace indicates that it is done reading this page. All pages are released when the socket is destroyed. Signed-off-by: Willem de Bruijn Signed-off-by: Kaiyuan Zhang Signed-off-by: Mina Almasry --- v6 - skb->dmabuf -> skb->readable (Pavel) - Fixed asm definitions of SO_DEVMEM_LINEAR/SO_DEVMEM_DMABUF not found on some archs. - Squashed in locking optimizations from edumazet@google.com. With this change we lock the xarray once per per tcp_recvmsg_dmabuf() rather than once per frag in xa_alloc(). Changes in v1: - Added dmabuf_id to dmabuf_cmsg (David/Stan). - Devmem -> dmabuf (David). - Change tcp_recvmsg_dmabuf() check to skb->dmabuf (Paolo). - Use __skb_frag_ref() & napi_pp_put_page() for refcounting (Yunsheng). RFC v3: - Fixed issue with put_cmsg() failing silently. --- arch/alpha/include/uapi/asm/socket.h | 5 + arch/mips/include/uapi/asm/socket.h | 5 + arch/parisc/include/uapi/asm/socket.h | 5 + arch/sparc/include/uapi/asm/socket.h | 5 + include/linux/socket.h | 1 + include/net/netmem.h | 13 ++ include/net/sock.h | 2 + include/uapi/asm-generic/socket.h | 5 + include/uapi/linux/uio.h | 10 + net/ipv4/tcp.c | 251 +++++++++++++++++++++++++- net/ipv4/tcp_ipv4.c | 9 + 11 files changed, 306 insertions(+), 5 deletions(-) diff --git a/arch/alpha/include/uapi/asm/socket.h b/arch/alpha/include/uapi/asm/socket.h index e94f621903fe..1a9439f1c02e 100644 --- a/arch/alpha/include/uapi/asm/socket.h +++ b/arch/alpha/include/uapi/asm/socket.h @@ -140,6 +140,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 79 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 80 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/mips/include/uapi/asm/socket.h b/arch/mips/include/uapi/asm/socket.h index 60ebaed28a4c..a940747504a0 100644 --- a/arch/mips/include/uapi/asm/socket.h +++ b/arch/mips/include/uapi/asm/socket.h @@ -151,6 +151,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 79 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 80 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/parisc/include/uapi/asm/socket.h b/arch/parisc/include/uapi/asm/socket.h index be264c2b1a11..1d399f104f08 100644 --- a/arch/parisc/include/uapi/asm/socket.h +++ b/arch/parisc/include/uapi/asm/socket.h @@ -132,6 +132,11 @@ #define SO_PASSPIDFD 0x404A #define SO_PEERPIDFD 0x404B +#define SO_DEVMEM_LINEAR 98 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 99 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 diff --git a/arch/sparc/include/uapi/asm/socket.h b/arch/sparc/include/uapi/asm/socket.h index 682da3714686..a5961ecff31d 100644 --- a/arch/sparc/include/uapi/asm/socket.h +++ b/arch/sparc/include/uapi/asm/socket.h @@ -133,6 +133,11 @@ #define SO_PASSPIDFD 0x0055 #define SO_PEERPIDFD 0x0056 +#define SO_DEVMEM_LINEAR 0x0058 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 0x0059 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) diff --git a/include/linux/socket.h b/include/linux/socket.h index cfcb7e2c3813..fe2b9e2081bb 100644 --- a/include/linux/socket.h +++ b/include/linux/socket.h @@ -326,6 +326,7 @@ struct ucred { * plain text and require encryption */ +#define MSG_SOCK_DEVMEM 0x2000000 /* Receive devmem skbs as cmsg */ #define MSG_ZEROCOPY 0x4000000 /* Use user data in kernel path */ #define MSG_SPLICE_PAGES 0x8000000 /* Splice the pages from the iterator in sendmsg() */ #define MSG_FASTOPEN 0x20000000 /* Send data in TCP SYN */ diff --git a/include/net/netmem.h b/include/net/netmem.h index a2de9411025d..2f36f21b3a3c 100644 --- a/include/net/netmem.h +++ b/include/net/netmem.h @@ -65,6 +65,19 @@ static inline unsigned int net_iov_idx(const struct net_iov *niov) return niov - net_iov_owner(niov)->niovs; } +static inline unsigned long net_iov_virtual_addr(const struct net_iov *niov) +{ + struct dmabuf_genpool_chunk_owner *owner = net_iov_owner(niov); + + return owner->base_virtual + + ((unsigned long)net_iov_idx(niov) << PAGE_SHIFT); +} + +static inline u32 net_iov_binding_id(const struct net_iov *niov) +{ + return net_iov_owner(niov)->binding->id; +} + /* This returns the absolute dma_addr_t calculated from * net_iov_owner(niov)->owner->base_dma_addr, not the page_pool-owned * niov->dma_addr. diff --git a/include/net/sock.h b/include/net/sock.h index 09a0cde8bf52..2b500a804191 100644 --- a/include/net/sock.h +++ b/include/net/sock.h @@ -337,6 +337,7 @@ struct sk_filter; * @sk_txtime_report_errors: set report errors mode for SO_TXTIME * @sk_txtime_unused: unused txtime flags * @ns_tracker: tracker for netns reference + * @sk_user_frags: xarray of pages the user is holding a reference on. */ struct sock { /* @@ -542,6 +543,7 @@ struct sock { #endif struct rcu_head sk_rcu; netns_tracker ns_tracker; + struct xarray sk_user_frags; }; enum sk_pacing { diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h index 8ce8a39a1e5f..25a2f5255f52 100644 --- a/include/uapi/asm-generic/socket.h +++ b/include/uapi/asm-generic/socket.h @@ -135,6 +135,11 @@ #define SO_PASSPIDFD 76 #define SO_PEERPIDFD 77 +#define SO_DEVMEM_LINEAR 98 +#define SCM_DEVMEM_LINEAR SO_DEVMEM_LINEAR +#define SO_DEVMEM_DMABUF 99 +#define SCM_DEVMEM_DMABUF SO_DEVMEM_DMABUF + #if !defined(__KERNEL__) #if __BITS_PER_LONG == 64 || (defined(__x86_64__) && defined(__ILP32__)) diff --git a/include/uapi/linux/uio.h b/include/uapi/linux/uio.h index 059b1a9147f4..ad92e37699da 100644 --- a/include/uapi/linux/uio.h +++ b/include/uapi/linux/uio.h @@ -20,6 +20,16 @@ struct iovec __kernel_size_t iov_len; /* Must be size_t (1003.1g) */ }; +struct dmabuf_cmsg { + __u64 frag_offset; /* offset into the dmabuf where the frag starts. + */ + __u32 frag_size; /* size of the frag. */ + __u32 frag_token; /* token representing this frag for + * DEVMEM_DONTNEED. + */ + __u32 dmabuf_id; /* dmabuf id this frag belongs to. */ +}; + /* * UIO_MAXIOV shall be at least 16 1003.1g (5.4.1.1) */ diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 00b5ffa06ab6..60cbd166f0df 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -461,6 +461,7 @@ void tcp_init_sock(struct sock *sk) set_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags); sk_sockets_allocated_inc(sk); + xa_init_flags(&sk->sk_user_frags, XA_FLAGS_ALLOC1); } EXPORT_SYMBOL(tcp_init_sock); @@ -2312,6 +2313,216 @@ static int tcp_inq_hint(struct sock *sk) return inq; } +/* batch __xa_alloc() calls and reduce xa_lock()/xa_unlock() overhead. */ +struct tcp_xa_pool { + u8 max; /* max <= MAX_SKB_FRAGS */ + u8 idx; /* idx <= max */ + __u32 tokens[MAX_SKB_FRAGS]; + netmem_ref netmems[MAX_SKB_FRAGS]; +}; + +static void tcp_xa_pool_commit(struct sock *sk, struct tcp_xa_pool *p, + bool lock) +{ + int i; + + if (!p->max) + return; + if (lock) + xa_lock_bh(&sk->sk_user_frags); + /* Commit part that has been copied to user space. */ + for (i = 0; i < p->idx; i++) + __xa_cmpxchg(&sk->sk_user_frags, + p->tokens[i], + XA_ZERO_ENTRY, + (__force void *)p->netmems[i], + GFP_KERNEL); + /* Rollback what has been pre-allocated and is no longer needed. */ + for (; i < p->max; i++) + __xa_erase(&sk->sk_user_frags, p->tokens[i]); + if (lock) + xa_unlock_bh(&sk->sk_user_frags); + p->max = 0; + p->idx = 0; +} + +static int tcp_xa_pool_refill(struct sock *sk, struct tcp_xa_pool *p, + unsigned int max_frags) +{ + int err, k; + + if (p->idx < p->max) + return 0; + + xa_lock_bh(&sk->sk_user_frags); + + tcp_xa_pool_commit(sk, p, false); + for (k = 0; k < max_frags; k++) { + err = __xa_alloc(&sk->sk_user_frags, &p->tokens[k], + XA_ZERO_ENTRY, xa_limit_31b, GFP_KERNEL); + if (err) + break; + } + + xa_unlock_bh(&sk->sk_user_frags); + + p->max = k; + p->idx = 0; + return k ? 0 : err; +} + +/* On error, returns the -errno. On success, returns number of bytes sent to the + * user. May not consume all of @remaining_len. + */ +static int tcp_recvmsg_dmabuf(struct sock *sk, const struct sk_buff *skb, + unsigned int offset, struct msghdr *msg, + int remaining_len) +{ + struct dmabuf_cmsg dmabuf_cmsg = { 0 }; + struct tcp_xa_pool tcp_xa_pool; + unsigned int start; + int i, copy, n; + int sent = 0; + int err = 0; + + tcp_xa_pool.max = 0; + tcp_xa_pool.idx = 0; + do { + start = skb_headlen(skb); + + if (skb->readable) { + err = -ENODEV; + goto out; + } + + /* Copy header. */ + copy = start - offset; + if (copy > 0) { + copy = min(copy, remaining_len); + + n = copy_to_iter(skb->data + offset, copy, + &msg->msg_iter); + if (n != copy) { + err = -EFAULT; + goto out; + } + + offset += copy; + remaining_len -= copy; + + /* First a dmabuf_cmsg for # bytes copied to user + * buffer. + */ + memset(&dmabuf_cmsg, 0, sizeof(dmabuf_cmsg)); + dmabuf_cmsg.frag_size = copy; + err = put_cmsg(msg, SOL_SOCKET, SO_DEVMEM_LINEAR, + sizeof(dmabuf_cmsg), &dmabuf_cmsg); + if (err || msg->msg_flags & MSG_CTRUNC) { + msg->msg_flags &= ~MSG_CTRUNC; + if (!err) + err = -ETOOSMALL; + goto out; + } + + sent += copy; + + if (remaining_len == 0) + goto out; + } + + /* after that, send information of dmabuf pages through a + * sequence of cmsg + */ + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + skb_frag_t *frag = &skb_shinfo(skb)->frags[i]; + struct net_iov *niov; + u64 frag_offset; + int end; + + /* !skb->readable should indicate that ALL the frags in + * this skb are dmabuf net_iovs. We're checking + * for that flag above, but also check individual frags + * here. If the tcp stack is not setting skb->readable + * correctly, we still don't want to crash here when + * accessing pgmap or priv below. + */ + if (!skb_frag_net_iov(frag)) { + net_err_ratelimited("Found non-dmabuf skb with net_iov"); + err = -ENODEV; + goto out; + } + + niov = skb_frag_net_iov(frag); + end = start + skb_frag_size(frag); + copy = end - offset; + + if (copy > 0) { + copy = min(copy, remaining_len); + + frag_offset = net_iov_virtual_addr(niov) + + skb_frag_off(frag) + offset - + start; + dmabuf_cmsg.frag_offset = frag_offset; + dmabuf_cmsg.frag_size = copy; + err = tcp_xa_pool_refill(sk, &tcp_xa_pool, + skb_shinfo(skb)->nr_frags - i); + if (err) + goto out; + + /* Will perform the exchange later */ + dmabuf_cmsg.frag_token = tcp_xa_pool.tokens[tcp_xa_pool.idx]; + dmabuf_cmsg.dmabuf_id = net_iov_binding_id(niov); + + offset += copy; + remaining_len -= copy; + + err = put_cmsg(msg, SOL_SOCKET, + SO_DEVMEM_DMABUF, + sizeof(dmabuf_cmsg), + &dmabuf_cmsg); + if (err || msg->msg_flags & MSG_CTRUNC) { + msg->msg_flags &= ~MSG_CTRUNC; + if (!err) + err = -ETOOSMALL; + goto out; + } + + atomic_long_inc(&niov->pp_ref_count); + tcp_xa_pool.netmems[tcp_xa_pool.idx++] = skb_frag_netmem(frag); + + sent += copy; + + if (remaining_len == 0) + goto out; + } + start = end; + } + + tcp_xa_pool_commit(sk, &tcp_xa_pool, true); + if (!remaining_len) + goto out; + + /* if remaining_len is not satisfied yet, we need to go to the + * next frag in the frag_list to satisfy remaining_len. + */ + skb = skb_shinfo(skb)->frag_list ?: skb->next; + + offset = offset - start; + } while (skb); + + if (remaining_len) { + err = -EFAULT; + goto out; + } + +out: + tcp_xa_pool_commit(sk, &tcp_xa_pool, true); + if (!sent) + sent = err; + + return sent; +} + /* * This routine copies from a sock struct into the user buffer. * @@ -2325,6 +2536,7 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, int *cmsg_flags) { struct tcp_sock *tp = tcp_sk(sk); + int last_copied_dmabuf = -1; /* uninitialized */ int copied = 0; u32 peek_seq; u32 *seq; @@ -2502,15 +2714,44 @@ static int tcp_recvmsg_locked(struct sock *sk, struct msghdr *msg, size_t len, } if (!(flags & MSG_TRUNC)) { - err = skb_copy_datagram_msg(skb, offset, msg, used); - if (err) { - /* Exception. Bailout! */ - if (!copied) - copied = -EFAULT; + if (last_copied_dmabuf != -1 && + last_copied_dmabuf != !skb->readable) break; + + if (skb->readable) { + err = skb_copy_datagram_msg(skb, offset, msg, + used); + if (err) { + /* Exception. Bailout! */ + if (!copied) + copied = -EFAULT; + break; + } + } else { + if (!(flags & MSG_SOCK_DEVMEM)) { + /* dmabuf skbs can only be received + * with the MSG_SOCK_DEVMEM flag. + */ + if (!copied) + copied = -EFAULT; + + break; + } + + err = tcp_recvmsg_dmabuf(sk, skb, offset, msg, + used); + if (err <= 0) { + if (!copied) + copied = -EFAULT; + + break; + } + used = err; } } + last_copied_dmabuf = !skb->readable; + WRITE_ONCE(*seq, *seq + used); copied += used; len -= used; diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index a22ee5838751..e8dc831df007 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -2498,6 +2498,15 @@ static void tcp_md5sig_info_free_rcu(struct rcu_head *head) void tcp_v4_destroy_sock(struct sock *sk) { struct tcp_sock *tp = tcp_sk(sk); + __maybe_unused unsigned long index; + __maybe_unused void *netmem; + +#ifdef CONFIG_PAGE_POOL + xa_for_each(&sk->sk_user_frags, index, netmem) + WARN_ON_ONCE(!napi_pp_put_page((__force netmem_ref)netmem, false)); +#endif + + xa_destroy(&sk->sk_user_frags); trace_tcp_destroy_sock(sk); From patchwork Tue Mar 5 02:01:49 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Mina Almasry X-Patchwork-Id: 778165 Received: from mail-yw1-f202.google.com (mail-yw1-f202.google.com [209.85.128.202]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 8A0A57F7FA for ; Tue, 5 Mar 2024 02:02:30 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.202 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604154; cv=none; b=DonNaIEPYMwJku2214EDHq+TuSEcyHEgd8yJbuTu+6E8W3J8wSqQNYmT3i/dc5LtcnaoY0kkdtwi+e49k4RN1KM/gfiFGCQ4bVWEUVFnikB0wPyNV7NUqeIyFYA80b/s5F+n/YgIbiWtxBziObi4aR7Sh618l2nbO8iAcBgn3aM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1709604154; c=relaxed/simple; bh=pfSBox9Ga0DV9HP7LL34Jt5xxP+j2xIrHHM73+n+DHw=; h=Date:In-Reply-To:Mime-Version:References:Message-ID:Subject:From: To:Cc:Content-Type; b=tml4KnULrZDMy3O7PngPI2Jwa2vAGl8mqTNqWNkVmFmiHApPquQW6dU6+9y3lqzY944dkF6+H4RWqg8TZTMXmWp8vAnHPq5CGlq4rT+Qfx/DvsKWB7zQWJa4FS0tVqIfY8ppEH52/BF003US/tra4UckyvYhmBDsq8chaevtKtY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b=kbYkxrnX; arc=none smtp.client-ip=209.85.128.202 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=google.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=flex--almasrymina.bounces.google.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=google.com header.i=@google.com header.b="kbYkxrnX" Received: by mail-yw1-f202.google.com with SMTP id 00721157ae682-607838c0800so67759997b3.1 for ; Mon, 04 Mar 2024 18:02:30 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1709604147; x=1710208947; darn=vger.kernel.org; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:from:to:cc:subject:date:message-id:reply-to; bh=cdaLgPgB3k0IUnbuFiSyW7eJOw/whsgAOWZdbzevfeI=; b=kbYkxrnXAOt9lB6I0LQo4/KBekOwEsHRtYE1sMrEEQbtr7ohxPkZcWhCxivuttc97/ tkoH8e3F0KolYpoL/IrfDDTxEs8vWXCswalyEjF4H+v8ZBWWwOgnWer9ecwNPURaI8fP DX1u/GmTr/iXpA2c+bUlDT85sOMon55Q4F62J69qmSY4cI4mT7lJRmvQkg6RiLogtgDH U27Vk+at9Ayci3qiW62ME3aIo3C+VnQA3aC2wc7c1bTJYq/K6WaPkFFay1bzAzHxk00L 83IGr+vYvyh5WW9XqBoZyzQcbwZsM389I8V33vZdf9UEeW/yO9hQyTGl9lUbNiPI7xA5 fBLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1709604147; x=1710208947; h=cc:to:from:subject:message-id:references:mime-version:in-reply-to :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=cdaLgPgB3k0IUnbuFiSyW7eJOw/whsgAOWZdbzevfeI=; b=rJIm5XgqOpt2G9xqIzEaypO209w0sQjiZ6lhlwoqs4CEpjwR4mq+tQLTY8yIZCS7wa SEbDD0G6uys+wvTUJakzYX1MIOtHxiDknqnvxt1jLqKGLXeZgd8pMtVCZo6NIQ4z4y6i dabSVBkPs5j2Hpjo1huOrsIPyFHm1Rnw4rBJ4HmHU4zYWcv1kZLcn3Y56gjOEPgYdFeR ATQhYPfODzJbQiWeu4L/hOsdDnJAa9EAK8xufTJOU+Xl4L1+Rr3tqkFqZxYesgN1iixY lFyBO/zz2jL/ivE0Wq08TmkeWvt/8O4BnbSm9lpsnzaVarSPmqiydunRc6Zrs52cEH62 Keuw== X-Forwarded-Encrypted: i=1; AJvYcCUfYsG0Knk8OwkqCsKHJuXGowwIb9nzVns4m3U0dqnPZYWm28hOQhrhwaTqZn0g1tg03fRpjB2aUSb7vCrjbxPAxHsJ+GBMSoKO4FVwESD7 X-Gm-Message-State: AOJu0YzCezaN9amDuAFoIg6IZ2y2clqEiTBA73vgPufnn1kOc88EhnZu q0SfqLpM0aBU+OX+3YWFQTKDwPZz+R4Q/JsW9eyb+Tnup9XeRP73PwVy7Z9LWbGwb2s/7TdQ4/n Sx942wYXOPmZqEADvB9VX8w== X-Google-Smtp-Source: AGHT+IHQs0SIwa2ekbs3tLJ18U+t6mqdEQQQa57loe9W+ctuY8OVB4RpI+mzdPMB5Lnn+KZ21aUA8Q2c99dGnxQY1A== X-Received: from almasrymina.svl.corp.google.com ([2620:15c:2c4:200:b614:914c:63cd:3830]) (user=almasrymina job=sendgmr) by 2002:a5b:54c:0:b0:dc7:5925:92d2 with SMTP id r12-20020a5b054c000000b00dc7592592d2mr249722ybp.1.1709604147667; Mon, 04 Mar 2024 18:02:27 -0800 (PST) Date: Mon, 4 Mar 2024 18:01:49 -0800 In-Reply-To: <20240305020153.2787423-1-almasrymina@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240305020153.2787423-1-almasrymina@google.com> X-Mailer: git-send-email 2.44.0.rc1.240.g4c46232300-goog Message-ID: <20240305020153.2787423-15-almasrymina@google.com> Subject: [RFC PATCH net-next v6 14/15] net: add devmem TCP documentation From: Mina Almasry To: netdev@vger.kernel.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-alpha@vger.kernel.org, linux-mips@vger.kernel.org, linux-parisc@vger.kernel.org, sparclinux@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-arch@vger.kernel.org, bpf@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-media@vger.kernel.org, dri-devel@lists.freedesktop.org Cc: Mina Almasry , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Jonathan Corbet , Richard Henderson , Ivan Kokshaysky , Matt Turner , Thomas Bogendoerfer , "James E.J. Bottomley" , Helge Deller , Andreas Larsson , Jesper Dangaard Brouer , Ilias Apalodimas , Steven Rostedt , Masami Hiramatsu , Mathieu Desnoyers , Arnd Bergmann , Alexei Starovoitov , Daniel Borkmann , Andrii Nakryiko , Martin KaFai Lau , Eduard Zingerman , Song Liu , Yonghong Song , John Fastabend , KP Singh , Stanislav Fomichev , Hao Luo , Jiri Olsa , David Ahern , Willem de Bruijn , Shuah Khan , Sumit Semwal , " =?utf-8?q?Christian_K=C3=B6nig?= " , Pavel Begunkov , David Wei , Jason Gunthorpe , Yunsheng Lin , Shailend Chand , Harshitha Ramamurthy , Shakeel Butt , Jeroen de Borst , Praveen Kaligineedi Signed-off-by: Mina Almasry --- v1 -> v2: - Missing spdx (simon) - add to index.rst (simon) --- Documentation/networking/devmem.rst | 271 ++++++++++++++++++++++++++++ Documentation/networking/index.rst | 1 + 2 files changed, 272 insertions(+) create mode 100644 Documentation/networking/devmem.rst diff --git a/Documentation/networking/devmem.rst b/Documentation/networking/devmem.rst new file mode 100644 index 000000000000..4712f029e5ed --- /dev/null +++ b/Documentation/networking/devmem.rst @@ -0,0 +1,271 @@ +.. SPDX-License-Identifier: GPL-2.0 + +================= +Device Memory TCP +================= + + +Intro +===== + +Device memory TCP (devmem TCP) enables receiving data directly into device +memory (dmabuf). The feature is currently implemented for TCP sockets. + + +Opportunity +----------- + +A large amount of data transfers have device memory as the source and/or +destination. Accelerators drastically increased the volume of such transfers. +Some examples include: + +- Distributed training, where ML accelerators, such as GPUs on different hosts, + exchange data among them. + +- Distributed raw block storage applications transfer large amounts of data with + remote SSDs, much of this data does not require host processing. + +Today, the majority of the Device-to-Device data transfers the network are +implemented as the following low level operations: Device-to-Host copy, +Host-to-Host network transfer, and Host-to-Device copy. + +The implementation is suboptimal, especially for bulk data transfers, and can +put significant strains on system resources such as host memory bandwidth and +PCIe bandwidth. + +Devmem TCP optimizes this use case by implementing socket APIs that enable +the user to receive incoming network packets directly into device memory. + +Packet payloads go directly from the NIC to device memory. + +Packet headers go to host memory and are processed by the TCP/IP stack +normally. The NIC must support header split to achieve this. + +Advantages: + +- Alleviate host memory bandwidth pressure, compared to existing + network-transfer + device-copy semantics. + +- Alleviate PCIe bandwidth pressure, by limiting data transfer to the lowest + level of the PCIe tree, compared to traditional path which sends data through + the root complex. + + +More Info +--------- + + slides, video + https://netdevconf.org/0x17/sessions/talk/device-memory-tcp.html + + patchset + [RFC PATCH v3 00/12] Device Memory TCP + https://lore.kernel.org/lkml/20231106024413.2801438-1-almasrymina@google.com/T/ + + +Interface +========= + +Example +------- + +tools/testing/selftests/net/ncdevmem.c:do_server shows an example of setting up +the RX path of this API. + +NIC Setup +--------- + +Header split, flow steering, & RSS are required features for devmem TCP. + +Header split is used to split incoming packets into a header buffer in host +memory, and a payload buffer in device memory. + +Flow steering & RSS are used to ensure that only flows targeting devmem land on +RX queue bound to devmem. + +Enable header split & flow steering: + +:: + + # enable header split (assuming priv-flag) + ethtool --set-priv-flags eth1 enable-header-split on + + # enable flow steering + ethtool -K eth1 ntuple on + +Configure RSS to steer all traffic away from the target RX queue (queue 15 in +this example): + +:: + + ethtool --set-rxfh-indir eth1 equal 15 + + +The user must bind a dmabuf to any number of RX queues on a given NIC using +netlink API: + +:: + + /* Bind dmabuf to NIC RX queue 15 */ + struct netdev_queue *queues; + queues = malloc(sizeof(*queues) * 1); + + queues[0]._present.type = 1; + queues[0]._present.idx = 1; + queues[0].type = NETDEV_RX_QUEUE_TYPE_RX; + queues[0].idx = 15; + + *ys = ynl_sock_create(&ynl_netdev_family, &yerr); + + req = netdev_bind_rx_req_alloc(); + netdev_bind_rx_req_set_ifindex(req, 1 /* ifindex */); + netdev_bind_rx_req_set_dmabuf_fd(req, dmabuf_fd); + __netdev_bind_rx_req_set_queues(req, queues, n_queue_index); + + rsp = netdev_bind_rx(*ys, req); + + dmabuf_id = rsp->dmabuf_id; + + +The netlink API returns a dmabuf_id: a unique ID that refers to this dmabuf +that has been bound. + +Socket Setup +------------ + +The socket must be flow steering to the dmabuf bound RX queue: + +:: + + ethtool -N eth1 flow-type tcp4 ... queue 15, + + +Receiving data +-------------- + +The user application must signal to the kernel that it is capable of receiving +devmem data by passing the MSG_SOCK_DEVMEM flag to recvmsg: + +:: + + ret = recvmsg(fd, &msg, MSG_SOCK_DEVMEM); + +Applications that do not specify the MSG_SOCK_DEVMEM flag will receive an EFAULT +on devmem data. + +Devmem data is received directly into the dmabuf bound to the NIC in 'NIC +Setup', and the kernel signals such to the user via the SCM_DEVMEM_* cmsgs: + +:: + + for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) { + if (cm->cmsg_level != SOL_SOCKET || + (cm->cmsg_type != SCM_DEVMEM_DMABUF && + cm->cmsg_type != SCM_DEVMEM_LINEAR)) + continue; + + dmabuf_cmsg = (struct dmabuf_cmsg *)CMSG_DATA(cm); + + if (cm->cmsg_type == SCM_DEVMEM_DMABUF) { + /* Frag landed in dmabuf. + * + * dmabuf_cmsg->dmabuf_id is the dmabuf the + * frag landed on. + * + * dmabuf_cmsg->frag_offset is the offset into + * the dmabuf where the frag starts. + * + * dmabuf_cmsg->frag_size is the size of the + * frag. + * + * dmabuf_cmsg->frag_token is a token used to + * refer to this frag for later freeing. + */ + + struct dmabuf_token token; + token.token_start = dmabuf_cmsg->frag_token; + token.token_count = 1; + continue; + } + + if (cm->cmsg_type == SCM_DEVMEM_LINEAR) + /* Frag landed in linear buffer. + * + * dmabuf_cmsg->frag_size is the size of the + * frag. + */ + continue; + + } + +Applications may receive 2 cmsgs: + +- SCM_DEVMEM_DMABUF: this indicates the fragment landed in the dmabuf indicated + by dmabuf_id. + +- SCM_DEVMEM_LINEAR: this indicates the fragment landed in the linear buffer. + This typically happens when the NIC is unable to split the packet at the + header boundary, such that part (or all) of the payload landed in host + memory. + +Applications may receive no SO_DEVMEM_* cmsgs. That indicates non-devmem, +regular TCP data that landed on an RX queue not bound to a dmabuf. + + +Freeing frags +------------- + +Frags received via SCM_DEVMEM_DMABUF are pinned by the kernel while the user +processes the frag. The user must return the frag to the kernel via +SO_DEVMEM_DONTNEED: + +:: + + ret = setsockopt(client_fd, SOL_SOCKET, SO_DEVMEM_DONTNEED, &token, + sizeof(token)); + +The user must ensure the tokens are returned to the kernel in a timely manner. +Failure to do so will exhaust the limited dmabuf that is bound to the RX queue +and will lead to packet drops. + + +Implementation & Caveats +======================== + +Unreadable skbs +--------------- + +Devmem payloads are inaccessible to the kernel processing the packets. This +results in a few quirks for payloads of devmem skbs: + +- Loopback is not functional. Loopback relies on copying the payload, which is + not possible with devmem skbs. + +- Software checksum calculation fails. + +- TCP Dump and bpf can't access devmem packet payloads. + + +Testing +======= + +More realistic example code can be found in the kernel source under +tools/testing/selftests/net/ncdevmem.c + +ncdevmem is a devmem TCP netcat. It works very similarly to netcat, but +receives data directly into a udmabuf. + +To run ncdevmem, you need to run it a server on the machine under test, and you +need to run netcat on a peer to provide the TX data. + +ncdevmem has a validation mode as well that expects a repeating pattern of +incoming data and validates it as such: + +:: + + # On server: + ncdevmem -s -c -f eth1 -d 3 -n 0000:06:00.0 -l \ + -p 5201 -v 7 + + # On client: + yes $(echo -e \\x01\\x02\\x03\\x04\\x05\\x06) | \ + tr \\n \\0 | head -c 5G | nc 5201 -p 5201 diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 69f3d6dcd9fd..d9f86514aa1e 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -48,6 +48,7 @@ Contents: cdc_mbim dccp dctcp + devmem dns_resolver driver eql