[v12,bpf-next,00/18] mvneta: introduce XDP multi-buffer support

Message ID cover.1629473233.git.lorenzo@kernel.org

Message

Lorenzo Bianconi Aug. 20, 2021, 3:40 p.m. UTC
This series introduces XDP multi-buffer support. The mvneta driver is
the first to support these new "non-linear" xdp_{buff,frame}. Reviewers
please focus on how these new types of xdp_{buff,frame} packets
traverse the different layers and the layout design. It is on purpose
that BPF-helpers are kept simple, as we don't want to expose the
internal layout to allow later changes.

The main idea for the new multi-buffer layout is to reuse the same
structure used for non-linear SKBs. This relies on the "skb_shared_info"
struct at the end of the first buffer to link together subsequent
buffers. Keeping the layout compatible with SKBs is also done to ease
and speed up creating an SKB from an xdp_{buff,frame}.
Converting an xdp_frame to an SKB and delivering it to the network stack
is shown in patch 05/18 (e.g. for cpumaps).
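
For reference, a minimal sketch of how the shared area is reached from an
xdp_buff, modelled on the xdp_get_shared_info_from_buff() and
xdp_data_hard_end() names that appear in the patches quoted later in this
thread; treat the exact body as illustrative, not the series' definitive code:

/* Sketch only: the skb_shared_info lives in the reserved tailroom at the
 * end of the first buffer, so converting to an SKB needs no extra copy.
 */
static inline struct skb_shared_info *
xdp_get_shared_info_from_buff(struct xdp_buff *xdp)
{
	return (struct skb_shared_info *)xdp_data_hard_end(xdp);
}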

A multi-buffer bit (mb) has been introduced in the flags field of the
xdp_{buff,frame} structure to notify the bpf/network layer whether this is
an xdp multi-buffer frame (mb = 1) or not (mb = 0).
The mb bit will be set by an xdp multi-buffer capable driver only for
non-linear frames, maintaining the capability to receive linear frames
without any extra cost, since the skb_shared_info structure at the end
of the first buffer will be initialized only if mb is set.
Moreover, the flags field in xdp_{buff,frame} will also be reused for
xdp rx csum offloading in a future series.
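
As a sketch of how a driver interacts with the mb bit: the helper names match
the xdp_buff_{is,set,clear}_mb() calls used by the mvneta patches quoted later
in this thread, while the flag name XDP_FLAGS_MULTI_BUFF is an assumption made
here for illustration:

/* Illustrative only; the flag name is assumed. */
#define XDP_FLAGS_MULTI_BUFF	BIT(0)

static __always_inline bool xdp_buff_is_mb(struct xdp_buff *xdp)
{
	return !!(xdp->flags & XDP_FLAGS_MULTI_BUFF);
}

static __always_inline void xdp_buff_set_mb(struct xdp_buff *xdp)
{
	xdp->flags |= XDP_FLAGS_MULTI_BUFF;
}

static __always_inline void xdp_buff_clear_mb(struct xdp_buff *xdp)
{
	xdp->flags &= ~XDP_FLAGS_MULTI_BUFF;
}

A driver clears the bit when preparing a linear buffer and sets it only when
it appends a fragment, so the single-buffer path never touches skb_shared_info.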

Typical use cases for this series are:
- Jumbo-frames
- Packet header split (please see Google’s use-case @ NetDevConf 0x14, [0])
- TSO/GRO

The following two eBPF helpers (and related selftests) have been introduced (a usage sketch follows the list):
- bpf_xdp_adjust_data:
  Move the xdp_md->data and xdp_md->data_end pointers into the fragment
  containing the offset provided by the eBPF program. This helper can be
  used to read/write values in the frame payload.
- bpf_xdp_get_buff_len:
  Return the total frame size (linear + paged parts).
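
A hypothetical XDP program using the two helpers, assuming the prototypes from
this (not yet merged) series are available through the generated helper
definitions; the payload offset, program name and section name are purely
illustrative:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int read_paged_byte(struct xdp_md *ctx)
{
	__u32 want = 2048;	/* hypothetical payload offset to inspect */
	long frag_start;

	/* total size = linear + paged parts */
	if (bpf_xdp_get_buff_len(ctx) <= want)
		return XDP_PASS;

	/* move data/data_end to the fragment holding offset 'want' */
	frag_start = bpf_xdp_adjust_data(ctx, want);
	if (frag_start < 0)
		return XDP_PASS;

	{
		void *data = (void *)(long)ctx->data;
		void *data_end = (void *)(long)ctx->data_end;

		if (data + 1 > data_end)	/* verifier bounds check */
			goto out;
		/* *(__u8 *)data is the byte at payload offset 'want' */
	}
out:
	/* restore pointers before calling bpf_xdp_adjust_{head,tail,meta}() */
	bpf_xdp_adjust_data(ctx, 0);
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";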

The bpf_xdp_adjust_tail and bpf_xdp_copy helpers have been modified to take
xdp multi-buff frames into account.

More info about the main idea behind this approach can be found here [1][2].

Changes since v11:
- add missing static to bpf_xdp_get_buff_len_proto structure
- fix bpf_xdp_adjust_data helper when offset is smaller than linear area length.

Changes since v10:
- move xdp->data to the requested payload offset instead of to the beginning of
  the fragment in bpf_xdp_adjust_data()

Changes since v9:
- introduce bpf_xdp_adjust_data helper and related selftest
- add xdp_frags_size and xdp_frags_tsize fields in skb_shared_info
- introduce xdp_update_skb_shared_info utility routine in order to not reset
  the frags array in skb_shared_info when converting from an xdp_buff/xdp_frame to a skb
- simplify bpf_xdp_copy routine

Changes since v8:
- add proper dma unmapping if XDP_TX fails on mvneta for a xdp multi-buff
- switch back to skb_shared_info implementation from previous xdp_shared_info
  one
- avoid using a bitfield in xdp_buff/xdp_frame since it introduces performance
  regressions. Tested now on a 10G NIC (ixgbe) to verify there are no performance
  penalties for the regular codebase
- add bpf_xdp_get_buff_len helper and remove frame_length field in xdp ctx
- add data_len field in skb_shared_info struct
- introduce XDP_FLAGS_FRAGS_PF_MEMALLOC flag

Changes since v7:
- rebase on top of bpf-next
- fix sparse warnings
- improve comments for frame_length in include/net/xdp.h

Changes since v6:
- the main difference with respect to previous versions is the new approach proposed
  by Eelco to pass the full length of the packet to the eBPF layer in the XDP context
- reintroduce multi-buff support to eBPF self-tests
- reintroduce multi-buff support to bpf_xdp_adjust_tail helper
- introduce multi-buffer support to bpf_xdp_copy helper
- rebase on top of bpf-next

Changes since v5:
- rebase on top of bpf-next
- initialize mb bit in xdp_init_buff() and drop per-driver initialization
- drop xdp->mb initialization in xdp_convert_zc_to_xdp_frame()
- postpone introduction of frame_length field in XDP ctx to another series
- minor changes

Changes since v4:
- rebase on top of bpf-next
- introduce xdp_shared_info to build xdp multi-buff instead of using the
  skb_shared_info struct
- introduce frame_length in xdp ctx
- drop previous bpf helpers
- fix bpf_xdp_adjust_tail for xdp multi-buff
- introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail
- fix xdp_return_frame_bulk for xdp multi-buff

Changes since v3:
- rebase on top of bpf-next
- add patch 10/13 to copy back paged data from an xdp multi-buff frame to a
  userspace buffer for xdp multi-buff selftests

Changes since v2:
- add throughput measurements
- drop bpf_xdp_adjust_mb_header bpf helper
- introduce selftest for xdp multibuffer
- addressed comments on bpf_xdp_get_frags_count
- introduce xdp multi-buff support to cpumaps

Changes since v1:
- Fix use-after-free in xdp_return_{buff/frame}
- Introduce bpf helpers
- Introduce xdp_mb sample program
- access skb_shared_info->nr_frags only on the last fragment

Changes since RFC:
- squash multi-buffer bit initialization in a single patch
- add mvneta non-linear XDP buff support for tx side

[0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org
[2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDP multi-buffers section)

Eelco Chaudron (3):
  bpf: add multi-buff support to the bpf_xdp_adjust_tail() API
  bpf: add multi-buffer support to xdp copy helpers
  bpf: update xdp_adjust_tail selftest to include multi-buffer

Lorenzo Bianconi (15):
  net: skbuff: add size metadata to skb_shared_info for xdp
  xdp: introduce flags field in xdp_buff/xdp_frame
  net: mvneta: update mb bit before passing the xdp buffer to eBPF layer
  net: mvneta: simplify mvneta_swbm_add_rx_fragment management
  net: xdp: add xdp_update_skb_shared_info utility routine
  net: marvell: rely on xdp_update_skb_shared_info utility routine
  xdp: add multi-buff support to xdp_return_{buff/frame}
  net: mvneta: add multi buffer support to XDP_TX
  net: mvneta: enable jumbo frames for XDP
  bpf: introduce bpf_xdp_get_buff_len helper
  bpf: move user_size out of bpf_test_init
  bpf: introduce multibuff support to bpf_prog_test_run_xdp()
  bpf: test_run: add xdp_shared_info pointer in bpf_test_finish
    signature
  net: xdp: introduce bpf_xdp_adjust_data helper
  bpf: add bpf_xdp_adjust_data selftest

 drivers/net/ethernet/marvell/mvneta.c         | 204 ++++++++++-------
 include/linux/skbuff.h                        |   6 +-
 include/net/xdp.h                             |  95 +++++++-
 include/uapi/linux/bpf.h                      |  39 ++++
 kernel/trace/bpf_trace.c                      |   3 +
 net/bpf/test_run.c                            | 117 ++++++++--
 net/core/filter.c                             | 213 +++++++++++++++++-
 net/core/xdp.c                                |  76 ++++++-
 tools/include/uapi/linux/bpf.h                |  39 ++++
 .../bpf/prog_tests/xdp_adjust_data.c          |  55 +++++
 .../bpf/prog_tests/xdp_adjust_tail.c          | 118 ++++++++++
 .../selftests/bpf/prog_tests/xdp_bpf2bpf.c    | 151 +++++++++----
 .../bpf/progs/test_xdp_adjust_tail_grow.c     |  10 +-
 .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 ++-
 .../selftests/bpf/progs/test_xdp_bpf2bpf.c    |   2 +-
 .../bpf/progs/test_xdp_update_frags.c         |  41 ++++
 16 files changed, 1036 insertions(+), 165 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_adjust_data.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_update_frags.c

Comments

John Fastabend Aug. 31, 2021, 11:13 p.m. UTC | #1
Lorenzo Bianconi wrote:
> Introduce xdp_frags_tsize field in skb_shared_info data structure

> to store xdp_buff/xdp_frame truesize (xdp_frags_tsize will be used

> in xdp multi-buff support). In order to not increase skb_shared_info

> size we will use a hole due to skb_shared_info alignment.

> Introduce xdp_frags_size field in skb_shared_info data structure

> reusing gso_type field in order to store xdp_buff/xdp_frame paged size.

> xdp_frags_size will be used in xdp multi-buff support.

> 

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>


I assume we can use xdp_frags_tsize for anything else above XDP later?
Other than that simple question, this looks OK to me.

Acked-by: John Fastabend <john.fastabend@gmail.com>


> ---

>  include/linux/skbuff.h | 6 +++++-

>  1 file changed, 5 insertions(+), 1 deletion(-)

> 

> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h

> index 6bdb0db3e825..1abeba7ef82e 100644

> --- a/include/linux/skbuff.h

> +++ b/include/linux/skbuff.h

> @@ -522,13 +522,17 @@ struct skb_shared_info {

>  	unsigned short	gso_segs;

>  	struct sk_buff	*frag_list;

>  	struct skb_shared_hwtstamps hwtstamps;

> -	unsigned int	gso_type;

> +	union {

> +		unsigned int	gso_type;

> +		unsigned int	xdp_frags_size;

> +	};

>  	u32		tskey;

>  

>  	/*

>  	 * Warning : all fields before dataref are cleared in __alloc_skb()

>  	 */

>  	atomic_t	dataref;

> +	unsigned int	xdp_frags_tsize;

>  

>  	/* Intermediate layers must ensure that destructor_arg

>  	 * remains valid until skb destructor */

> -- 

> 2.31.1

>
John Fastabend Aug. 31, 2021, 11:23 p.m. UTC | #2
Lorenzo Bianconi wrote:
> Update multi-buffer bit (mb) in xdp_buff to notify XDP/eBPF layer and

> XDP remote drivers if this is a "non-linear" XDP buffer. Access

> skb_shared_info only if xdp_buff mb is set in order to avoid possible

> cache-misses.

> 

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

> ---

>  drivers/net/ethernet/marvell/mvneta.c | 23 ++++++++++++++++++-----

>  1 file changed, 18 insertions(+), 5 deletions(-)

> 

> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c

> index 5d1007e1b5c9..9f4858e35566 100644

> --- a/drivers/net/ethernet/marvell/mvneta.c

> +++ b/drivers/net/ethernet/marvell/mvneta.c

> @@ -2037,9 +2037,14 @@ mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,

>  {

>  	int i;

>  

> +	if (likely(!xdp_buff_is_mb(xdp)))

> +		goto out;

> +


Wouldn't nr_frags be 0 in the !xdp_buff_is_mb case? Is the
xdp_buff_is_mb check with the goto really required?

>  	for (i = 0; i < sinfo->nr_frags; i++)

>  		page_pool_put_full_page(rxq->page_pool,

>  					skb_frag_page(&sinfo->frags[i]), true);

> +

> +out:

>  	page_pool_put_page(rxq->page_pool, virt_to_head_page(xdp->data),

>  			   sync_len, true);

>  }

> @@ -2241,7 +2246,6 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,

>  	int data_len = -MVNETA_MH_SIZE, len;

>  	struct net_device *dev = pp->dev;

>  	enum dma_data_direction dma_dir;

> -	struct skb_shared_info *sinfo;

>  

>  	if (*size > MVNETA_MAX_RX_BUF_SIZE) {

>  		len = MVNETA_MAX_RX_BUF_SIZE;

> @@ -2261,11 +2265,9 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,

>  

>  	/* Prefetch header */

>  	prefetch(data);

> +	xdp_buff_clear_mb(xdp);

>  	xdp_prepare_buff(xdp, data, pp->rx_offset_correction + MVNETA_MH_SIZE,

>  			 data_len, false);

> -

> -	sinfo = xdp_get_shared_info_from_buff(xdp);

> -	sinfo->nr_frags = 0;

>  }

>  

>  static void

> @@ -2299,6 +2301,9 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,

>  		skb_frag_off_set(frag, pp->rx_offset_correction);

>  		skb_frag_size_set(frag, data_len);

>  		__skb_frag_set_page(frag, page);

> +

> +		if (!xdp_buff_is_mb(xdp))

> +			xdp_buff_set_mb(xdp);

>  	} else {

>  		page_pool_put_full_page(rxq->page_pool, page, true);

>  	}

> @@ -2320,8 +2325,12 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,

>  		      struct xdp_buff *xdp, u32 desc_status)

>  {

>  	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);

> -	int i, num_frags = sinfo->nr_frags;

>  	struct sk_buff *skb;

> +	u8 num_frags;

> +	int i;

> +

> +	if (unlikely(xdp_buff_is_mb(xdp)))

> +		num_frags = sinfo->nr_frags;

>  

>  	skb = build_skb(xdp->data_hard_start, PAGE_SIZE);

>  	if (!skb)

> @@ -2333,6 +2342,9 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,

>  	skb_put(skb, xdp->data_end - xdp->data);

>  	skb->ip_summed = mvneta_rx_csum(pp, desc_status);

>  

> +	if (likely(!xdp_buff_is_mb(xdp)))

> +		goto out;

> +


Not that I care much, but couldn't you just init num_frags = 0 and
avoid the goto?

Anyway, it's not my driver, so no need to change it if you like it better
the way it is. Mostly just checking my understanding.

>  	for (i = 0; i < num_frags; i++) {

>  		skb_frag_t *frag = &sinfo->frags[i];

>  

> @@ -2341,6 +2353,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,

>  				skb_frag_size(frag), PAGE_SIZE);

>  	}

>  

> +out:

>  	return skb;

>  }

>  

> -- 

> 2.31.1

>
John Fastabend Aug. 31, 2021, 11:38 p.m. UTC | #3
Lorenzo Bianconi wrote:
> Introduce xdp_update_skb_shared_info routine to update frags array

> metadata in skb_shared_info data structure converting to a skb from

> a xdp_buff or xdp_frame.

> According to the current skb_shared_info architecture in

> xdp_frame/xdp_buff and to the xdp multi-buff support, there is

> no need to run skb_add_rx_frag() and reset frags array converting the buffer

> to a skb since the frag array will be in the same position for xdp_buff/xdp_frame

> and for the skb, we just need to update memory metadata.

> Introduce XDP_FLAGS_PF_MEMALLOC flag in xdp_buff_flags in order to mark

> the xdp_buff or xdp_frame as under memory-pressure if pages of the frags array

> are under memory pressure. Doing so we can avoid looping over all fragments in

> xdp_update_skb_shared_info routine. The driver is expected to set the

> flag constructing the xdp_buffer using xdp_buff_set_frag_pfmemalloc

> utility routine.

> Rely on xdp_update_skb_shared_info in __xdp_build_skb_from_frame routine

> converting the multi-buff xdp_frame to a skb after performing a XDP_REDIRECT.

> 

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>


Acked-by: John Fastabend <john.fastabend@gmail.com>
John Fastabend Aug. 31, 2021, 11:43 p.m. UTC | #4
Lorenzo Bianconi wrote:
> Take into account if the received xdp_buff/xdp_frame is non-linear

> recycling/returning the frame memory to the allocator or into

> xdp_frame_bulk.

> 

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

> ---


Acked-by: John Fastabend <john.fastabend@gmail.com>
John Fastabend Aug. 31, 2021, 11:45 p.m. UTC | #5
Lorenzo Bianconi wrote:
> Enable the capability to receive jumbo frames even if the interface is

> running in XDP mode

> 

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

> ---


Acked-by: John Fastabend <john.fastabend@gmail.com>
John Fastabend Sept. 1, 2021, 12:10 a.m. UTC | #6
Lorenzo Bianconi wrote:
> From: Eelco Chaudron <echaudro@redhat.com>

> 

> This change adds support for tail growing and shrinking for XDP multi-buff.

> 

> When called on a multi-buffer packet with a grow request, it will always

> work on the last fragment of the packet. So the maximum grow size is the

> last fragments tailroom, i.e. no new buffer will be allocated.

> 

> When shrinking, it will work from the last fragment, all the way down to

> the base buffer depending on the shrinking size. It's important to mention

> that once you shrink down the fragment(s) are freed, so you can not grow

> again to the original size.

> 

> Co-developed-by: Lorenzo Bianconi <lorenzo@kernel.org>

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

> Signed-off-by: Eelco Chaudron <echaudro@redhat.com>

> ---


LGTM.

Acked-by: John Fastabend <john.fastabend@gmail.com>
John Fastabend Sept. 1, 2021, 12:36 a.m. UTC | #7
Lorenzo Bianconi wrote:
> For XDP frames split over multiple buffers, the xdp_md->data and

> xdp_md->data_end pointers will point to the start and end of the first

> fragment only. bpf_xdp_adjust_data can be used to access subsequent

> fragments by moving the data pointers. To use, an XDP program can call

> this helper with the byte offset of the packet payload that

> it wants to access; the helper will move xdp_md->data and xdp_md ->data_end

> so they point to the requested payload offset and to the end of the

> fragment containing this byte offset, and return the byte offset of the

> start of the fragment.

> To move back to the beginning of the packet, simply call the

> helper with an offset of '0'.

> Note also that the helpers that modify the packet boundaries

> (bpf_xdp_adjust_head(), bpf_xdp_adjust_tail() and

> bpf_xdp_adjust_meta()) will fail if the pointers have been

> moved; it is the responsibility of the BPF program to move them

> back before using these helpers.


I'm OK with this for a first iteration; I guess with more work we
can make the helpers use the updated pointers, though.

> 

> Suggested-by: John Fastabend <john.fastabend@gmail.com>

> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>


Overall looks good couple small nits/questions below. Thanks!

> ---

>  include/net/xdp.h              |  8 +++++

>  include/uapi/linux/bpf.h       | 32 ++++++++++++++++++

>  net/bpf/test_run.c             |  8 +++++

>  net/core/filter.c              | 62 +++++++++++++++++++++++++++++++++-

>  tools/include/uapi/linux/bpf.h | 32 ++++++++++++++++++

>  5 files changed, 141 insertions(+), 1 deletion(-)

> 

> diff --git a/include/net/xdp.h b/include/net/xdp.h

> index cdaecf8d4d61..ce4764c7cd40 100644

> --- a/include/net/xdp.h

> +++ b/include/net/xdp.h

> @@ -82,6 +82,11 @@ struct xdp_buff {

>  	struct xdp_txq_info *txq;

>  	u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/

>  	u16 flags; /* supported values defined in xdp_flags */

> +	/* xdp multi-buff metadata used for frags iteration */

> +	struct {

> +		u16 headroom;	/* frame headroom: data - data_hard_start */

> +		u16 headlen;	/* first buffer length: data_end - data */

> +	} mb;

>  };

>  

>  static __always_inline bool xdp_buff_is_mb(struct xdp_buff *xdp)

> @@ -127,6 +132,9 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,

>  	xdp->data = data;

>  	xdp->data_end = data + data_len;

>  	xdp->data_meta = meta_valid ? data : data + 1;

> +	/* mb metadata for frags iteration */

> +	xdp->mb.headroom = headroom;

> +	xdp->mb.headlen = data_len;

>  }

>  

>  /* Reserve memory area at end-of data area.

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

> index 9e2c3b12ea49..a7b5185a718a 100644

> --- a/include/uapi/linux/bpf.h

> +++ b/include/uapi/linux/bpf.h

> @@ -4877,6 +4877,37 @@ union bpf_attr {

>   *		Get the total size of a given xdp buff (linear and paged area)

>   *	Return

>   *		The total size of a given xdp buffer.

> + *

> + * long bpf_xdp_adjust_data(struct xdp_buff *xdp_md, u32 offset)

> + *	Description

> + *		For XDP frames split over multiple buffers, the

> + *		*xdp_md*\ **->data** and*xdp_md *\ **->data_end** pointers

                                       ^^^^
missing space?

> + *		will point to the start and end of the first fragment only.

> + *		This helper can be used to access subsequent fragments by

> + *		moving the data pointers. To use, an XDP program can call

> + *		this helper with the byte offset of the packet payload that

> + *		it wants to access; the helper will move *xdp_md*\ **->data**

> + *		and *xdp_md *\ **->data_end** so they point to the requested

> + *		payload offset and to the end of the fragment containing this

> + *		byte offset, and return the byte offset of the start of the

> + *		fragment.

> + *		To move back to the beginning of the packet, simply call the

> + *		helper with an offset of '0'.

> + *		Note also that the helpers that modify the packet boundaries

> + *		(*bpf_xdp_adjust_head()*, *bpf_xdp_adjust_tail()* and

> + *		*bpf_xdp_adjust_meta()*) will fail if the pointers have been

> + *		moved; it is the responsibility of the BPF program to move them

> + *		back before using these helpers.

> + *

> + *		A call to this helper is susceptible to change the underlying

> + *		packet buffer. Therefore, at load time, all checks on pointers

> + *		previously done by the verifier are invalidated and must be

> + *		performed again, if the helper is used in combination with

> + *		direct packet access.

> + *	Return

> + *		offset between the beginning of the current fragment and

> + *		original *xdp_md*\ **->data** on success, or a negative error

> + *		in case of failure.

>   */

>  #define __BPF_FUNC_MAPPER(FN)		\

>  	FN(unspec),			\

> @@ -5055,6 +5086,7 @@ union bpf_attr {

>  	FN(get_func_ip),		\

>  	FN(get_attach_cookie),		\

>  	FN(xdp_get_buff_len),		\

> +	FN(xdp_adjust_data),		\

>  	/* */

>  

>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper

> diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c

> index 869dcf23a1ca..f09c2c8c0d6c 100644

> --- a/net/bpf/test_run.c

> +++ b/net/bpf/test_run.c

> @@ -757,6 +757,8 @@ static int xdp_convert_md_to_buff(struct xdp_md *xdp_md, struct xdp_buff *xdp)

>  	}

>  

>  	xdp->data = xdp->data_meta + xdp_md->data;

> +	xdp->mb.headroom = xdp->data - xdp->data_hard_start;

> +	xdp->mb.headlen = xdp->data_end - xdp->data;

>  	return 0;

>  

>  free_dev:

> @@ -871,6 +873,12 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,

>  	if (ret)

>  		goto out;

>  

> +	/* data pointers need to be reset after frag iteration */

> +	if (unlikely(xdp.data_hard_start + xdp.mb.headroom != xdp.data)) {

> +		ret = -EFAULT;

> +		goto out;

> +	}

> +

>  	size = xdp.data_end - xdp.data_meta + sinfo->xdp_frags_size;

>  	ret = bpf_test_finish(kattr, uattr, xdp.data_meta, sinfo, size,

>  			      retval, duration);

> diff --git a/net/core/filter.c b/net/core/filter.c

> index 2122c00c680f..ed2a6632adce 100644

> --- a/net/core/filter.c

> +++ b/net/core/filter.c

> @@ -3827,6 +3827,10 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)

>  	void *data_start = xdp_frame_end + metalen;

>  	void *data = xdp->data + offset;

>  

> +	/* data pointers need to be reset after frag iteration */

> +	if (unlikely(xdp->data_hard_start + xdp->mb.headroom != xdp->data))

> +		return -EINVAL;


-EFAULT? It might be nice if the error code is different from the one below,
for debugging.

> +

>  	if (unlikely(data < data_start ||

>  		     data > xdp->data_end - ETH_HLEN))

>  		return -EINVAL;

> @@ -3836,6 +3840,9 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)

>  			xdp->data_meta, metalen);

>  	xdp->data_meta += offset;

>  	xdp->data = data;

> +	/* update metada for multi-buff frag iteration */

> +	xdp->mb.headroom = xdp->data - xdp->data_hard_start;

> +	xdp->mb.headlen = xdp->data_end - xdp->data;

>  

>  	return 0;

>  }

> @@ -3910,6 +3917,10 @@ BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)

>  	void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */

>  	void *data_end = xdp->data_end + offset;

>  

> +	/* data pointer needs to be reset after frag iteration */

> +	if (unlikely(xdp->data + xdp->mb.headlen != xdp->data_end))

> +		return -EINVAL;


EFAULT?

> +

>  	if (unlikely(xdp_buff_is_mb(xdp)))

>  		return bpf_xdp_mb_adjust_tail(xdp, offset);

>  

> @@ -3949,6 +3960,10 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)

>  	void *meta = xdp->data_meta + offset;

>  	unsigned long metalen = xdp->data - meta;

>  

> +	/* data pointer needs to be reset after frag iteration */

> +	if (unlikely(xdp->data_hard_start + xdp->mb.headroom != xdp->data))

> +		return -EINVAL;


same comment.

>  	if (xdp_data_meta_unsupported(xdp))

>  		return -ENOTSUPP;

>  	if (unlikely(meta < xdp_frame_end ||

> @@ -3970,6 +3985,48 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {

>  	.arg2_type	= ARG_ANYTHING,

>  };

>  

> +BPF_CALL_2(bpf_xdp_adjust_data, struct xdp_buff *, xdp, u32, offset)

> +{

> +	struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);

> +	u32 base_offset = xdp->mb.headlen;

> +	int i;

> +

> +	if (!xdp_buff_is_mb(xdp) || offset > sinfo->xdp_frags_size)

> +		return -EINVAL;


Do we need to error out here? If it's not mb we can just return the same
as offset == 0?

> +

> +	if (offset < xdp->mb.headlen) {

> +		/* linear area */

> +		xdp->data = xdp->data_hard_start + xdp->mb.headroom + offset;

> +		xdp->data_end = xdp->data_hard_start + xdp->mb.headroom +

> +				xdp->mb.headlen;

> +		return 0;

> +	}

> +

> +	for (i = 0; i < sinfo->nr_frags; i++) {

> +		/* paged area */

> +		skb_frag_t *frag = &sinfo->frags[i];

> +		unsigned int size = skb_frag_size(frag);

> +

> +		if (offset < base_offset + size) {

> +			u8 *addr = skb_frag_address(frag);

> +

> +			xdp->data = addr + offset - base_offset;

> +			xdp->data_end = addr + size;

> +			break;

> +		}

> +		base_offset += size;

> +	}

> +	return base_offset;

> +}
John Fastabend Sept. 1, 2021, 12:45 a.m. UTC | #8
Lorenzo Bianconi wrote:
> This series introduce XDP multi-buffer support. The mvneta driver is

> the first to support these new "non-linear" xdp_{buff,frame}. Reviewers

> please focus on how these new types of xdp_{buff,frame} packets

> traverse the different layers and the layout design. It is on purpose

> that BPF-helpers are kept simple, as we don't want to expose the

> internal layout to allow later changes.

> 

> The main idea for the new multi-buffer layout is to reuse the same

> structure used for non-linear SKB. This rely on the "skb_shared_info"

> struct at the end of the first buffer to link together subsequent

> buffers. Keeping the layout compatible with SKBs is also done to ease

> and speedup creating a SKB from an xdp_{buff,frame}.

> Converting xdp_frame to SKB and deliver it to the network stack is shown

> in patch 05/18 (e.g. cpumaps).

> 

> A multi-buffer bit (mb) has been introduced in the flags field of xdp_{buff,frame}

> structure to notify the bpf/network layer if this is a xdp multi-buffer frame

> (mb = 1) or not (mb = 0).

> The mb bit will be set by a xdp multi-buffer capable driver only for

> non-linear frames maintaining the capability to receive linear frames

> without any extra cost since the skb_shared_info structure at the end

> of the first buffer will be initialized only if mb is set.

> Moreover the flags field in xdp_{buff,frame} will be reused even for

> xdp rx csum offloading in future series.

> 


The series is looking really close to me. A couple of small comments/questions
inline. Also, I think we should call out the potential issues in the cover
letter with regard to backwards compatibility. Something like:

"
A multi-buffer enabled NIC may receive XDP frames with multiple frags.
If a BPF program does not understand mb layouts, it's possible to contrive
a BPF program that incorrectly views data_end as the end of data when
there is more data in the payload. Note that helpers will generally do the
correct thing; for example, perf_output will consume the entire payload. But
it is still possible some programs could do the wrong thing, even if only in
an edge case. Although we expect most BPF programs not to be impacted,
we can't rule it out; you've been warned.
"

I can't think of an elegant way around this, and it does require at least
some type of opt-in by increasing the MTU limit, so I'm OK with it given
I think it should impact few (no?) real programs.

> Typical use cases for this series are:

> - Jumbo-frames

> - Packet header split (please see Google’s use-case @ NetDevConf 0x14, [0])

> - TSO/GRO

> 

> The two following ebpf helpers (and related selftests) has been introduced:

> - bpf_xdp_adjust_data:

>   Move xdp_md->data and xdp_md->data_end pointers in subsequent fragments

>   according to the offset provided by the ebpf program. This helper can be

>   used to read/write values in frame payload.

> - bpf_xdp_get_buff_len:

>   Return the total frame size (linear + paged parts)

> 

> bpf_xdp_adjust_tail and bpf_xdp_copy helpers have been modified to take into

> account xdp multi-buff frames.

> 

> More info about the main idea behind this approach can be found here [1][2].

> 

> Changes since v11:

> - add missing static to bpf_xdp_get_buff_len_proto structure

> - fix bpf_xdp_adjust_data helper when offset is smaller than linear area length.

> 

> Changes since v10:

> - move xdp->data to the requested payload offset instead of to the beginning of

>   the fragment in bpf_xdp_adjust_data()

> 

> Changes since v9:

> - introduce bpf_xdp_adjust_data helper and related selftest

> - add xdp_frags_size and xdp_frags_tsize fields in skb_shared_info

> - introduce xdp_update_skb_shared_info utility routine in ordere to not reset

>   frags array in skb_shared_info converting from a xdp_buff/xdp_frame to a skb 

> - simplify bpf_xdp_copy routine

> 

> Changes since v8:

> - add proper dma unmapping if XDP_TX fails on mvneta for a xdp multi-buff

> - switch back to skb_shared_info implementation from previous xdp_shared_info

>   one

> - avoid using a bietfield in xdp_buff/xdp_frame since it introduces performance

>   regressions. Tested now on 10G NIC (ixgbe) to verify there are no performance

>   penalties for regular codebase

> - add bpf_xdp_get_buff_len helper and remove frame_length field in xdp ctx

> - add data_len field in skb_shared_info struct

> - introduce XDP_FLAGS_FRAGS_PF_MEMALLOC flag

> 

> Changes since v7:

> - rebase on top of bpf-next

> - fix sparse warnings

> - improve comments for frame_length in include/net/xdp.h

> 

> Changes since v6:

> - the main difference respect to previous versions is the new approach proposed

>   by Eelco to pass full length of the packet to eBPF layer in XDP context

> - reintroduce multi-buff support to eBPF kself-tests

> - reintroduce multi-buff support to bpf_xdp_adjust_tail helper

> - introduce multi-buffer support to bpf_xdp_copy helper

> - rebase on top of bpf-next

> 

> Changes since v5:

> - rebase on top of bpf-next

> - initialize mb bit in xdp_init_buff() and drop per-driver initialization

> - drop xdp->mb initialization in xdp_convert_zc_to_xdp_frame()

> - postpone introduction of frame_length field in XDP ctx to another series

> - minor changes

> 

> Changes since v4:

> - rebase ontop of bpf-next

> - introduce xdp_shared_info to build xdp multi-buff instead of using the

>   skb_shared_info struct

> - introduce frame_length in xdp ctx

> - drop previous bpf helpers

> - fix bpf_xdp_adjust_tail for xdp multi-buff

> - introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail

> - fix xdp_return_frame_bulk for xdp multi-buff

> 

> Changes since v3:

> - rebase ontop of bpf-next

> - add patch 10/13 to copy back paged data from a xdp multi-buff frame to

>   userspace buffer for xdp multi-buff selftests

> 

> Changes since v2:

> - add throughput measurements

> - drop bpf_xdp_adjust_mb_header bpf helper

> - introduce selftest for xdp multibuffer

> - addressed comments on bpf_xdp_get_frags_count

> - introduce xdp multi-buff support to cpumaps

> 

> Changes since v1:

> - Fix use-after-free in xdp_return_{buff/frame}

> - Introduce bpf helpers

> - Introduce xdp_mb sample program

> - access skb_shared_info->nr_frags only on the last fragment

> 

> Changes since RFC:

> - squash multi-buffer bit initialization in a single patch

> - add mvneta non-linear XDP buff support for tx side

> 

> [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy

> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org

> [2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDPmulti-buffers section)

> 

> Eelco Chaudron (3):

>   bpf: add multi-buff support to the bpf_xdp_adjust_tail() API

>   bpf: add multi-buffer support to xdp copy helpers

>   bpf: update xdp_adjust_tail selftest to include multi-buffer

> 

> Lorenzo Bianconi (15):

>   net: skbuff: add size metadata to skb_shared_info for xdp

>   xdp: introduce flags field in xdp_buff/xdp_frame

>   net: mvneta: update mb bit before passing the xdp buffer to eBPF layer

>   net: mvneta: simplify mvneta_swbm_add_rx_fragment management

>   net: xdp: add xdp_update_skb_shared_info utility routine

>   net: marvell: rely on xdp_update_skb_shared_info utility routine

>   xdp: add multi-buff support to xdp_return_{buff/frame}

>   net: mvneta: add multi buffer support to XDP_TX

>   net: mvneta: enable jumbo frames for XDP

>   bpf: introduce bpf_xdp_get_buff_len helper

>   bpf: move user_size out of bpf_test_init

>   bpf: introduce multibuff support to bpf_prog_test_run_xdp()

>   bpf: test_run: add xdp_shared_info pointer in bpf_test_finish

>     signature

>   net: xdp: introduce bpf_xdp_adjust_data helper

>   bpf: add bpf_xdp_adjust_data selftest

> 

>  drivers/net/ethernet/marvell/mvneta.c         | 204 ++++++++++-------

>  include/linux/skbuff.h                        |   6 +-

>  include/net/xdp.h                             |  95 +++++++-

>  include/uapi/linux/bpf.h                      |  39 ++++

>  kernel/trace/bpf_trace.c                      |   3 +

>  net/bpf/test_run.c                            | 117 ++++++++--

>  net/core/filter.c                             | 213 +++++++++++++++++-

>  net/core/xdp.c                                |  76 ++++++-

>  tools/include/uapi/linux/bpf.h                |  39 ++++

>  .../bpf/prog_tests/xdp_adjust_data.c          |  55 +++++

>  .../bpf/prog_tests/xdp_adjust_tail.c          | 118 ++++++++++

>  .../selftests/bpf/prog_tests/xdp_bpf2bpf.c    | 151 +++++++++----

>  .../bpf/progs/test_xdp_adjust_tail_grow.c     |  10 +-

>  .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 ++-

>  .../selftests/bpf/progs/test_xdp_bpf2bpf.c    |   2 +-

>  .../bpf/progs/test_xdp_update_frags.c         |  41 ++++

>  16 files changed, 1036 insertions(+), 165 deletions(-)

>  create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_adjust_data.c

>  create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_update_frags.c

> 

> -- 

> 2.31.1

>
Lorenzo Bianconi Sept. 3, 2021, 5:13 p.m. UTC | #9
>

> Lorenzo Bianconi wrote:

> > Introduce xdp_frags_tsize field in skb_shared_info data structure

> > to store xdp_buff/xdp_frame truesize (xdp_frags_tsize will be used

> > in xdp multi-buff support). In order to not increase skb_shared_info

> > size we will use a hole due to skb_shared_info alignment.

> > Introduce xdp_frags_size field in skb_shared_info data structure

> > reusing gso_type field in order to store xdp_buff/xdp_frame paged size.

> > xdp_frags_size will be used in xdp multi-buff support.

> >

> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

>

> I assume we can use xdp_frags_tsize for anything else above XDP later?

> Other than simple question looks OK to me.


Yes, right, as we did for gso_type/xdp_frags_size.

Regards,
Lorenzo

>

> Acked-by: John Fastabend <john.fastabend@gmail.com>

>

> > ---

> >  include/linux/skbuff.h | 6 +++++-

> >  1 file changed, 5 insertions(+), 1 deletion(-)

> >

> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h

> > index 6bdb0db3e825..1abeba7ef82e 100644

> > --- a/include/linux/skbuff.h

> > +++ b/include/linux/skbuff.h

> > @@ -522,13 +522,17 @@ struct skb_shared_info {

> >       unsigned short  gso_segs;

> >       struct sk_buff  *frag_list;

> >       struct skb_shared_hwtstamps hwtstamps;

> > -     unsigned int    gso_type;

> > +     union {

> > +             unsigned int    gso_type;

> > +             unsigned int    xdp_frags_size;

> > +     };

> >       u32             tskey;

> >

> >       /*

> >        * Warning : all fields before dataref are cleared in __alloc_skb()

> >        */

> >       atomic_t        dataref;

> > +     unsigned int    xdp_frags_tsize;

> >

> >       /* Intermediate layers must ensure that destructor_arg

> >        * remains valid until skb destructor */

> > --

> > 2.31.1

> >

>

>
Lorenzo Bianconi Sept. 3, 2021, 5:23 p.m. UTC | #10
On Wed, Sep 1, 2021 at 1:24 AM John Fastabend <john.fastabend@gmail.com> wrote:
>

> Lorenzo Bianconi wrote:

> > Update multi-buffer bit (mb) in xdp_buff to notify XDP/eBPF layer and

> > XDP remote drivers if this is a "non-linear" XDP buffer. Access

> > skb_shared_info only if xdp_buff mb is set in order to avoid possible

> > cache-misses.

> >

> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

> > ---

> >  drivers/net/ethernet/marvell/mvneta.c | 23 ++++++++++++++++++-----

> >  1 file changed, 18 insertions(+), 5 deletions(-)

> >

> > diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c

> > index 5d1007e1b5c9..9f4858e35566 100644

> > --- a/drivers/net/ethernet/marvell/mvneta.c

> > +++ b/drivers/net/ethernet/marvell/mvneta.c

> > @@ -2037,9 +2037,14 @@ mvneta_xdp_put_buff(struct mvneta_port *pp, struct mvneta_rx_queue *rxq,

> >  {

> >       int i;

> >

> > +     if (likely(!xdp_buff_is_mb(xdp)))

> > +             goto out;

> > +

>

> Wouldn't nr_frags = 0 in the !xdp_buff_is_mb case? Is the

> xdp_buff_is_mb check with goto really required?


If xdp_buff_is_mb is false, nr_frags will not be initialized; otherwise
we would trigger a cache miss for the single-buffer use case (where
initializing skb_shared_info is not required).

>

> >       for (i = 0; i < sinfo->nr_frags; i++)

> >               page_pool_put_full_page(rxq->page_pool,

> >                                       skb_frag_page(&sinfo->frags[i]), true);

> > +

> > +out:

> >       page_pool_put_page(rxq->page_pool, virt_to_head_page(xdp->data),

> >                          sync_len, true);

> >  }

> > @@ -2241,7 +2246,6 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,

> >       int data_len = -MVNETA_MH_SIZE, len;

> >       struct net_device *dev = pp->dev;

> >       enum dma_data_direction dma_dir;

> > -     struct skb_shared_info *sinfo;

> >

> >       if (*size > MVNETA_MAX_RX_BUF_SIZE) {

> >               len = MVNETA_MAX_RX_BUF_SIZE;

> > @@ -2261,11 +2265,9 @@ mvneta_swbm_rx_frame(struct mvneta_port *pp,

> >

> >       /* Prefetch header */

> >       prefetch(data);

> > +     xdp_buff_clear_mb(xdp);

> >       xdp_prepare_buff(xdp, data, pp->rx_offset_correction + MVNETA_MH_SIZE,

> >                        data_len, false);

> > -

> > -     sinfo = xdp_get_shared_info_from_buff(xdp);

> > -     sinfo->nr_frags = 0;

> >  }

> >

> >  static void

> > @@ -2299,6 +2301,9 @@ mvneta_swbm_add_rx_fragment(struct mvneta_port *pp,

> >               skb_frag_off_set(frag, pp->rx_offset_correction);

> >               skb_frag_size_set(frag, data_len);

> >               __skb_frag_set_page(frag, page);

> > +

> > +             if (!xdp_buff_is_mb(xdp))

> > +                     xdp_buff_set_mb(xdp);

> >       } else {

> >               page_pool_put_full_page(rxq->page_pool, page, true);

> >       }

> > @@ -2320,8 +2325,12 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,

> >                     struct xdp_buff *xdp, u32 desc_status)

> >  {

> >       struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);

> > -     int i, num_frags = sinfo->nr_frags;

> >       struct sk_buff *skb;

> > +     u8 num_frags;

> > +     int i;

> > +

> > +     if (unlikely(xdp_buff_is_mb(xdp)))

> > +             num_frags = sinfo->nr_frags;

> >

> >       skb = build_skb(xdp->data_hard_start, PAGE_SIZE);

> >       if (!skb)

> > @@ -2333,6 +2342,9 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,

> >       skb_put(skb, xdp->data_end - xdp->data);

> >       skb->ip_summed = mvneta_rx_csum(pp, desc_status);

> >

> > +     if (likely(!xdp_buff_is_mb(xdp)))

> > +             goto out;

> > +

>

> Not that I care much, but couldn't you just init num_frags = 0 and

> avoid the goto?


same here.

Regards,
Lorenzo

>

> Anyways its not my driver so no need to change it if you like it better

> the way it is. Mostly just checking my understanding.

>

> >       for (i = 0; i < num_frags; i++) {

> >               skb_frag_t *frag = &sinfo->frags[i];

> >

> > @@ -2341,6 +2353,7 @@ mvneta_swbm_build_skb(struct mvneta_port *pp, struct page_pool *pool,

> >                               skb_frag_size(frag), PAGE_SIZE);

> >       }

> >

> > +out:

> >       return skb;

> >  }

> >

> > --

> > 2.31.1

> >

>

>
Lorenzo Bianconi Sept. 3, 2021, 5:57 p.m. UTC | #11
>

> Lorenzo Bianconi wrote:

> > For XDP frames split over multiple buffers, the xdp_md->data and

> > xdp_md->data_end pointers will point to the start and end of the first

> > fragment only. bpf_xdp_adjust_data can be used to access subsequent

> > fragments by moving the data pointers. To use, an XDP program can call

> > this helper with the byte offset of the packet payload that

> > it wants to access; the helper will move xdp_md->data and xdp_md ->data_end

> > so they point to the requested payload offset and to the end of the

> > fragment containing this byte offset, and return the byte offset of the

> > start of the fragment.

> > To move back to the beginning of the packet, simply call the

> > helper with an offset of '0'.

> > Note also that the helpers that modify the packet boundaries

> > (bpf_xdp_adjust_head(), bpf_xdp_adjust_tail() and

> > bpf_xdp_adjust_meta()) will fail if the pointers have been

> > moved; it is the responsibility of the BPF program to move them

> > back before using these helpers.

>

> I'm ok with this for a first iteration I guess with more work we

> can make the helpers use the updated pointers though.

>

> >

> > Suggested-by: John Fastabend <john.fastabend@gmail.com>

> > Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>

>

> Overall looks good couple small nits/questions below. Thanks!

>

> > ---

> >  include/net/xdp.h              |  8 +++++

> >  include/uapi/linux/bpf.h       | 32 ++++++++++++++++++

> >  net/bpf/test_run.c             |  8 +++++

> >  net/core/filter.c              | 62 +++++++++++++++++++++++++++++++++-

> >  tools/include/uapi/linux/bpf.h | 32 ++++++++++++++++++

> >  5 files changed, 141 insertions(+), 1 deletion(-)

> >

> > diff --git a/include/net/xdp.h b/include/net/xdp.h

> > index cdaecf8d4d61..ce4764c7cd40 100644

> > --- a/include/net/xdp.h

> > +++ b/include/net/xdp.h

> > @@ -82,6 +82,11 @@ struct xdp_buff {

> >       struct xdp_txq_info *txq;

> >       u32 frame_sz; /* frame size to deduce data_hard_end/reserved tailroom*/

> >       u16 flags; /* supported values defined in xdp_flags */

> > +     /* xdp multi-buff metadata used for frags iteration */

> > +     struct {

> > +             u16 headroom;   /* frame headroom: data - data_hard_start */

> > +             u16 headlen;    /* first buffer length: data_end - data */

> > +     } mb;

> >  };

> >

> >  static __always_inline bool xdp_buff_is_mb(struct xdp_buff *xdp)

> > @@ -127,6 +132,9 @@ xdp_prepare_buff(struct xdp_buff *xdp, unsigned char *hard_start,

> >       xdp->data = data;

> >       xdp->data_end = data + data_len;

> >       xdp->data_meta = meta_valid ? data : data + 1;

> > +     /* mb metadata for frags iteration */

> > +     xdp->mb.headroom = headroom;

> > +     xdp->mb.headlen = data_len;

> >  }

> >

> >  /* Reserve memory area at end-of data area.

> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

> > index 9e2c3b12ea49..a7b5185a718a 100644

> > --- a/include/uapi/linux/bpf.h

> > +++ b/include/uapi/linux/bpf.h

> > @@ -4877,6 +4877,37 @@ union bpf_attr {

> >   *           Get the total size of a given xdp buff (linear and paged area)

> >   *   Return

> >   *           The total size of a given xdp buffer.

> > + *

> > + * long bpf_xdp_adjust_data(struct xdp_buff *xdp_md, u32 offset)

> > + *   Description

> > + *           For XDP frames split over multiple buffers, the

> > + *           *xdp_md*\ **->data** and*xdp_md *\ **->data_end** pointers

>                                        ^^^^

> missing space?


ack, right. I will fix it.

>

> > + *           will point to the start and end of the first fragment only.

> > + *           This helper can be used to access subsequent fragments by

> > + *           moving the data pointers. To use, an XDP program can call

> > + *           this helper with the byte offset of the packet payload that

> > + *           it wants to access; the helper will move *xdp_md*\ **->data**

> > + *           and *xdp_md *\ **->data_end** so they point to the requested

> > + *           payload offset and to the end of the fragment containing this

> > + *           byte offset, and return the byte offset of the start of the

> > + *           fragment.

> > + *           To move back to the beginning of the packet, simply call the

> > + *           helper with an offset of '0'.

> > + *           Note also that the helpers that modify the packet boundaries

> > + *           (*bpf_xdp_adjust_head()*, *bpf_xdp_adjust_tail()* and

> > + *           *bpf_xdp_adjust_meta()*) will fail if the pointers have been

> > + *           moved; it is the responsibility of the BPF program to move them

> > + *           back before using these helpers.

> > + *

> > + *           A call to this helper is susceptible to change the underlying

> > + *           packet buffer. Therefore, at load time, all checks on pointers

> > + *           previously done by the verifier are invalidated and must be

> > + *           performed again, if the helper is used in combination with

> > + *           direct packet access.

> > + *   Return

> > + *           offset between the beginning of the current fragment and

> > + *           original *xdp_md*\ **->data** on success, or a negative error

> > + *           in case of failure.

> >   */

> >  #define __BPF_FUNC_MAPPER(FN)                \

> >       FN(unspec),                     \

> > @@ -5055,6 +5086,7 @@ union bpf_attr {

> >       FN(get_func_ip),                \

> >       FN(get_attach_cookie),          \

> >       FN(xdp_get_buff_len),           \

> > +     FN(xdp_adjust_data),            \

> >       /* */

> >

> >  /* integer value in 'imm' field of BPF_CALL instruction selects which helper

> > diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c

> > index 869dcf23a1ca..f09c2c8c0d6c 100644

> > --- a/net/bpf/test_run.c

> > +++ b/net/bpf/test_run.c

> > @@ -757,6 +757,8 @@ static int xdp_convert_md_to_buff(struct xdp_md *xdp_md, struct xdp_buff *xdp)

> >       }

> >

> >       xdp->data = xdp->data_meta + xdp_md->data;

> > +     xdp->mb.headroom = xdp->data - xdp->data_hard_start;

> > +     xdp->mb.headlen = xdp->data_end - xdp->data;

> >       return 0;

> >

> >  free_dev:

> > @@ -871,6 +873,12 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,

> >       if (ret)

> >               goto out;

> >

> > +     /* data pointers need to be reset after frag iteration */

> > +     if (unlikely(xdp.data_hard_start + xdp.mb.headroom != xdp.data)) {

> > +             ret = -EFAULT;

> > +             goto out;

> > +     }

> > +

> >       size = xdp.data_end - xdp.data_meta + sinfo->xdp_frags_size;

> >       ret = bpf_test_finish(kattr, uattr, xdp.data_meta, sinfo, size,

> >                             retval, duration);

> > diff --git a/net/core/filter.c b/net/core/filter.c

> > index 2122c00c680f..ed2a6632adce 100644

> > --- a/net/core/filter.c

> > +++ b/net/core/filter.c

> > @@ -3827,6 +3827,10 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)

> >       void *data_start = xdp_frame_end + metalen;

> >       void *data = xdp->data + offset;

> >

> > +     /* data pointers need to be reset after frag iteration */

> > +     if (unlikely(xdp->data_hard_start + xdp->mb.headroom != xdp->data))

> > +             return -EINVAL;

>

> -EFAULT? It might be nice if error code is different from below

> for debugging?


ack, I will fix it in v13

>

> > +

> >       if (unlikely(data < data_start ||

> >                    data > xdp->data_end - ETH_HLEN))

> >               return -EINVAL;

> > @@ -3836,6 +3840,9 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)

> >                       xdp->data_meta, metalen);

> >       xdp->data_meta += offset;

> >       xdp->data = data;

> > +     /* update metada for multi-buff frag iteration */

> > +     xdp->mb.headroom = xdp->data - xdp->data_hard_start;

> > +     xdp->mb.headlen = xdp->data_end - xdp->data;

> >

> >       return 0;

> >  }

> > @@ -3910,6 +3917,10 @@ BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)

> >       void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */

> >       void *data_end = xdp->data_end + offset;

> >

> > +     /* data pointer needs to be reset after frag iteration */

> > +     if (unlikely(xdp->data + xdp->mb.headlen != xdp->data_end))

> > +             return -EINVAL;

>

> EFAULT?


ack, I will fix it in v13

>

> > +

> >       if (unlikely(xdp_buff_is_mb(xdp)))

> >               return bpf_xdp_mb_adjust_tail(xdp, offset);

> >

> > @@ -3949,6 +3960,10 @@ BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)

> >       void *meta = xdp->data_meta + offset;

> >       unsigned long metalen = xdp->data - meta;

> >

> > +     /* data pointer needs to be reset after frag iteration */

> > +     if (unlikely(xdp->data_hard_start + xdp->mb.headroom != xdp->data))

> > +             return -EINVAL;

>

> same comment.


ack, I will fix it in v13

>

> >       if (xdp_data_meta_unsupported(xdp))

> >               return -ENOTSUPP;

> >       if (unlikely(meta < xdp_frame_end ||

> > @@ -3970,6 +3985,48 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {

> >       .arg2_type      = ARG_ANYTHING,

> >  };

> >

> > +BPF_CALL_2(bpf_xdp_adjust_data, struct xdp_buff *, xdp, u32, offset)

> > +{

> > +     struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);

> > +     u32 base_offset = xdp->mb.headlen;

> > +     int i;

> > +

> > +     if (!xdp_buff_is_mb(xdp) || offset > sinfo->xdp_frags_size)

> > +             return -EINVAL;

>

> Do we need to error this? If its not mb we can just return the same

> as offset==0?


ack, we can do something like:

u32 max_offset = xdp->mb.headlen;

if (xdp_buff_is_mb(xdp))
   max_offset += sinfo->xdp_frags_size;

if (offset > max_offset)
   return -EINVAL;

what do you think?

Regards,
Lorenzo

>

> > +

> > +     if (offset < xdp->mb.headlen) {

> > +             /* linear area */

> > +             xdp->data = xdp->data_hard_start + xdp->mb.headroom + offset;

> > +             xdp->data_end = xdp->data_hard_start + xdp->mb.headroom +

> > +                             xdp->mb.headlen;

> > +             return 0;

> > +     }

> > +

> > +     for (i = 0; i < sinfo->nr_frags; i++) {

> > +             /* paged area */

> > +             skb_frag_t *frag = &sinfo->frags[i];

> > +             unsigned int size = skb_frag_size(frag);

> > +

> > +             if (offset < base_offset + size) {

> > +                     u8 *addr = skb_frag_address(frag);

> > +

> > +                     xdp->data = addr + offset - base_offset;

> > +                     xdp->data_end = addr + size;

> > +                     break;

> > +             }

> > +             base_offset += size;

> > +     }

> > +     return base_offset;

> > +}

>
Lorenzo Bianconi Sept. 7, 2021, 8:35 a.m. UTC | #12
> Lorenzo Bianconi wrote:

> > This series introduce XDP multi-buffer support. The mvneta driver is

> > the first to support these new "non-linear" xdp_{buff,frame}. Reviewers

> > please focus on how these new types of xdp_{buff,frame} packets

> > traverse the different layers and the layout design. It is on purpose

> > that BPF-helpers are kept simple, as we don't want to expose the

> > internal layout to allow later changes.

> > 

> > The main idea for the new multi-buffer layout is to reuse the same

> > structure used for non-linear SKB. This rely on the "skb_shared_info"

> > struct at the end of the first buffer to link together subsequent

> > buffers. Keeping the layout compatible with SKBs is also done to ease

> > and speedup creating a SKB from an xdp_{buff,frame}.

> > Converting xdp_frame to SKB and deliver it to the network stack is shown

> > in patch 05/18 (e.g. cpumaps).

> > 

> > A multi-buffer bit (mb) has been introduced in the flags field of xdp_{buff,frame}

> > structure to notify the bpf/network layer if this is a xdp multi-buffer frame

> > (mb = 1) or not (mb = 0).

> > The mb bit will be set by a xdp multi-buffer capable driver only for

> > non-linear frames maintaining the capability to receive linear frames

> > without any extra cost since the skb_shared_info structure at the end

> > of the first buffer will be initialized only if mb is set.

> > Moreover the flags field in xdp_{buff,frame} will be reused even for

> > xdp rx csum offloading in future series.

> > 

> 

> The series is looking really close to me. Couple small comments/questions

> inline. Also I think we should call out the potential issues in the cover

> letter with regards to backwards compatibility. Something like, 

> 

> "

> A multi-buffer enabled NIC may receive XDP frames with multiple frags.

> If a BPF program does not understand mb layouts its possible to contrive

> a BPF program that incorrectly views data_end as the end of data when

> there is more data in the payload. Note helpers will generally due the

> correct thing, for example perf_output will consume entire payload. But,

> it is still possible some programs could do the wrong thing even if in

> an edge case. Although we expect most BPF programs not to be impacted

> we can't rule out, you've been warned.


ack, I will add it to the cover letter in v13.

Regards,
Lorenzo

> "

> 

> I can't think of an elegant way around this and it does require at least

> some type of opt-in by increasing the MTU limit so I'm OK with it given

> I think it should impact few (no?) real programs.

> 

> > Typical use cases for this series are:

> > - Jumbo-frames

> > - Packet header split (please see Google’s use-case @ NetDevConf 0x14, [0])

> > - TSO/GRO

> > 

> > The two following ebpf helpers (and related selftests) has been introduced:

> > - bpf_xdp_adjust_data:

> >   Move xdp_md->data and xdp_md->data_end pointers in subsequent fragments

> >   according to the offset provided by the ebpf program. This helper can be

> >   used to read/write values in frame payload.

> > - bpf_xdp_get_buff_len:

> >   Return the total frame size (linear + paged parts)

> > 

> > bpf_xdp_adjust_tail and bpf_xdp_copy helpers have been modified to take into

> > account xdp multi-buff frames.

> > 

> > More info about the main idea behind this approach can be found here [1][2].

> > 

> > Changes since v11:

> > - add missing static to bpf_xdp_get_buff_len_proto structure

> > - fix bpf_xdp_adjust_data helper when offset is smaller than linear area length.

> > 

> > Changes since v10:

> > - move xdp->data to the requested payload offset instead of to the beginning of

> >   the fragment in bpf_xdp_adjust_data()

> > 

> > Changes since v9:

> > - introduce bpf_xdp_adjust_data helper and related selftest

> > - add xdp_frags_size and xdp_frags_tsize fields in skb_shared_info

> > - introduce xdp_update_skb_shared_info utility routine in ordere to not reset

> >   frags array in skb_shared_info converting from a xdp_buff/xdp_frame to a skb 

> > - simplify bpf_xdp_copy routine

> > 

> > Changes since v8:

> > - add proper dma unmapping if XDP_TX fails on mvneta for a xdp multi-buff

> > - switch back to skb_shared_info implementation from previous xdp_shared_info

> >   one

> > - avoid using a bietfield in xdp_buff/xdp_frame since it introduces performance

> >   regressions. Tested now on 10G NIC (ixgbe) to verify there are no performance

> >   penalties for regular codebase

> > - add bpf_xdp_get_buff_len helper and remove frame_length field in xdp ctx

> > - add data_len field in skb_shared_info struct

> > - introduce XDP_FLAGS_FRAGS_PF_MEMALLOC flag

> > 

> > Changes since v7:

> > - rebase on top of bpf-next

> > - fix sparse warnings

> > - improve comments for frame_length in include/net/xdp.h

> > 

> > Changes since v6:

> > - the main difference respect to previous versions is the new approach proposed

> >   by Eelco to pass full length of the packet to eBPF layer in XDP context

> > - reintroduce multi-buff support to eBPF kself-tests

> > - reintroduce multi-buff support to bpf_xdp_adjust_tail helper

> > - introduce multi-buffer support to bpf_xdp_copy helper

> > - rebase on top of bpf-next

> > 

> > Changes since v5:

> > - rebase on top of bpf-next

> > - initialize mb bit in xdp_init_buff() and drop per-driver initialization

> > - drop xdp->mb initialization in xdp_convert_zc_to_xdp_frame()

> > - postpone introduction of frame_length field in XDP ctx to another series

> > - minor changes

> > 

> > Changes since v4:

> > - rebase ontop of bpf-next

> > - introduce xdp_shared_info to build xdp multi-buff instead of using the

> >   skb_shared_info struct

> > - introduce frame_length in xdp ctx

> > - drop previous bpf helpers

> > - fix bpf_xdp_adjust_tail for xdp multi-buff

> > - introduce xdp multi-buff self-tests for bpf_xdp_adjust_tail

> > - fix xdp_return_frame_bulk for xdp multi-buff

> > 

> > Changes since v3:

> > - rebase ontop of bpf-next

> > - add patch 10/13 to copy back paged data from a xdp multi-buff frame to

> >   userspace buffer for xdp multi-buff selftests

> > 

> > Changes since v2:

> > - add throughput measurements

> > - drop bpf_xdp_adjust_mb_header bpf helper

> > - introduce selftest for xdp multibuffer

> > - addressed comments on bpf_xdp_get_frags_count

> > - introduce xdp multi-buff support to cpumaps

> > 

> > Changes since v1:

> > - Fix use-after-free in xdp_return_{buff/frame}

> > - Introduce bpf helpers

> > - Introduce xdp_mb sample program

> > - access skb_shared_info->nr_frags only on the last fragment

> > 

> > Changes since RFC:

> > - squash multi-buffer bit initialization in a single patch

> > - add mvneta non-linear XDP buff support for tx side

> > 

> > [0] https://netdevconf.info/0x14/session.html?talk-the-path-to-tcp-4k-mtu-and-rx-zerocopy

> > [1] https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp-multi-buffer01-design.org

> > [2] https://netdevconf.info/0x14/session.html?tutorial-add-XDP-support-to-a-NIC-driver (XDPmulti-buffers section)

> > 

> > Eelco Chaudron (3):

> >   bpf: add multi-buff support to the bpf_xdp_adjust_tail() API

> >   bpf: add multi-buffer support to xdp copy helpers

> >   bpf: update xdp_adjust_tail selftest to include multi-buffer

> > 

> > Lorenzo Bianconi (15):

> >   net: skbuff: add size metadata to skb_shared_info for xdp

> >   xdp: introduce flags field in xdp_buff/xdp_frame

> >   net: mvneta: update mb bit before passing the xdp buffer to eBPF layer

> >   net: mvneta: simplify mvneta_swbm_add_rx_fragment management

> >   net: xdp: add xdp_update_skb_shared_info utility routine

> >   net: marvell: rely on xdp_update_skb_shared_info utility routine

> >   xdp: add multi-buff support to xdp_return_{buff/frame}

> >   net: mvneta: add multi buffer support to XDP_TX

> >   net: mvneta: enable jumbo frames for XDP

> >   bpf: introduce bpf_xdp_get_buff_len helper

> >   bpf: move user_size out of bpf_test_init

> >   bpf: introduce multibuff support to bpf_prog_test_run_xdp()

> >   bpf: test_run: add xdp_shared_info pointer in bpf_test_finish

> >     signature

> >   net: xdp: introduce bpf_xdp_adjust_data helper

> >   bpf: add bpf_xdp_adjust_data selftest

> > 

> >  drivers/net/ethernet/marvell/mvneta.c         | 204 ++++++++++-------

> >  include/linux/skbuff.h                        |   6 +-

> >  include/net/xdp.h                             |  95 +++++++-

> >  include/uapi/linux/bpf.h                      |  39 ++++

> >  kernel/trace/bpf_trace.c                      |   3 +

> >  net/bpf/test_run.c                            | 117 ++++++++--

> >  net/core/filter.c                             | 213 +++++++++++++++++-

> >  net/core/xdp.c                                |  76 ++++++-

> >  tools/include/uapi/linux/bpf.h                |  39 ++++

> >  .../bpf/prog_tests/xdp_adjust_data.c          |  55 +++++

> >  .../bpf/prog_tests/xdp_adjust_tail.c          | 118 ++++++++++

> >  .../selftests/bpf/prog_tests/xdp_bpf2bpf.c    | 151 +++++++++----

> >  .../bpf/progs/test_xdp_adjust_tail_grow.c     |  10 +-

> >  .../bpf/progs/test_xdp_adjust_tail_shrink.c   |  32 ++-

> >  .../selftests/bpf/progs/test_xdp_bpf2bpf.c    |   2 +-

> >  .../bpf/progs/test_xdp_update_frags.c         |  41 ++++

> >  16 files changed, 1036 insertions(+), 165 deletions(-)

> >  create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_adjust_data.c

> >  create mode 100644 tools/testing/selftests/bpf/progs/test_xdp_update_frags.c

> > 

> > -- 

> > 2.31.1

> > 

> 

>