[PATCHv15,bpf-next,0/6] xdp: add a new helper for dev map multicast support

Message ID 20210120022514.2862872-1-liuhangbin@gmail.com

Message

Hangbin Liu Jan. 20, 2021, 2:25 a.m. UTC
This patch set adds XDP multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because there
may be multiple interfaces you want to exclude.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I implement a new helper, bpf_redirect_map_multi(), which
accepts two maps: the forwarding map and the exclude map. Users who don't
need an exclude map and simply want to stop redirecting back to the ingress
device can use the BPF_F_EXCLUDE_INGRESS flag instead. A minimal usage
sketch follows.
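
Below is a minimal, illustrative sketch of how an XDP program could use the
proposed helper. The map definitions and section names follow common libbpf
conventions and are assumptions for illustration only; the real sample and
selftest programs are added in patches 4-6.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Group A: the ports to forward to */
struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	__uint(max_entries, 32);
} forward_map SEC(".maps");

/* Group B: the ports to exclude (a DEVMAP_HASH keyed by ifindex) */
struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(int));
	__uint(max_entries, 32);
} exclude_map SEC(".maps");

SEC("xdp")
int xdp_redirect_multi(struct xdp_md *ctx)
{
	/* "Forward to all ports in group A that are not also in group B".
	 * Alternatively, pass a NULL exclude map together with the
	 * BPF_F_EXCLUDE_INGRESS flag to only exclude the ingress device.
	 */
	return bpf_redirect_map_multi(&forward_map, &exclude_map, 0);
}

char _license[] SEC("license") = "GPL";

From user space, both maps would be populated with ifindex values via
bpf_map_update_elem(), one entry per port in the group.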

The 1st patch is Jesper's change to run the devmap xdp_prog later, in the
bulking step. The 2nd patch adds a new bpf arg to allow a NULL map pointer.
The 3rd patch adds the new bpf_redirect_map_multi() helper.
Patches 4-6 add a usage sample and selftests.

I did the same perf tests with the following topo:

---------------------             ---------------------
| Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
---------------------             |                   |
                                  |   Host B          |
---------------------             |                   |
| Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
---------------------    vlan2    |          -------- |
                                  | veth1 -- | veth0| |
                                  |          -------- |
                                  --------------------|
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
The veth0 in the netns loads a dummy drop program. The forward_map
max_entries in xdp_redirect_map_multi is modified to 4.

Here is the perf result with 5.10 rc6:

There is about +/- 0.1M deviation for native testing.
Version             | Test                                    | Generic | Native | Native + 2nd
5.10 rc6            | xdp_redirect_map        i40e->i40e      |    2.0M |   9.1M |  8.0M
5.10 rc6            | xdp_redirect_map        i40e->veth      |    1.7M |  11.0M |  9.7M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e      |    1.7M |   7.8M |  6.4M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->veth      |    1.4M |   9.3M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e+veth |    1.0M |   3.2M |  2.7M

Last but not least, thanks a lot to Toke, Jesper, Jiri and Eelco for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v15:
Update bq_xmit_all() logic for patch 01.
Add some comments and remove useless variable for patch 03.
Use bpf_object__find_program_by_title() for patch 04 and 06.

v14:
No code update, just rebase the code on latest bpf-next

v13:
Pass in xdp_prog through __xdp_enqueue() for patch 01. Update related
code in patch 03.

v12:
Add Jesper's xdp_prog patch, rebase my work on this and the latest bpf-next.
Add 2nd xdp_prog test on the sample and selftests.

v11:
Fix bpf_redirect_map_multi() helper description typo.
Add loop limit for devmap_get_next_obj() and dev_map_redirect_multi().

v10:
Rebase the code to latest bpf-next.
Update helper bpf_xdp_redirect_map_multi()
- No need to check map pointer as we will do the check in verifier.

v9:
Update helper bpf_xdp_redirect_map_multi()
- Use ARG_CONST_MAP_PTR_OR_NULL for helper arg2

v8:
a) Update function dev_in_exclude_map():
   - remove duplicate ex_map map_type check in
   - lookup the element in dev map by obj dev index directly instead
     of looping all the map

v7:
a) Fix helper flag check
b) Limit the *ex_map* to use DEVMAP_HASH only and update function
   dev_in_exclude_map() to get better performance.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first one.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new helper bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Hangbin Liu (5):
  bpf: add a new bpf argument type ARG_CONST_MAP_PTR_OR_NULL
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: Add verifier tests for bpf arg
    ARG_CONST_MAP_PTR_OR_NULL
  selftests/bpf: add xdp_redirect_multi test

Jesper Dangaard Brouer (1):
  bpf: run devmap xdp_prog on flush instead of bulk enqueue

 include/linux/bpf.h                           |  21 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  28 ++
 kernel/bpf/devmap.c                           | 232 +++++++++++---
 kernel/bpf/verifier.c                         |  16 +-
 net/core/filter.c                             | 124 ++++++-
 net/core/xdp.c                                |  29 ++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  87 +++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 302 ++++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  28 ++
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       | 111 +++++++
 tools/testing/selftests/bpf/test_verifier.c   |  22 +-
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 208 ++++++++++++
 .../testing/selftests/bpf/verifier/map_ptr.c  |  70 ++++
 .../selftests/bpf/xdp_redirect_multi.c        | 252 +++++++++++++++
 18 files changed, 1488 insertions(+), 50 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c

Comments

Maciej Fijalkowski Jan. 20, 2021, 10:42 p.m. UTC | #1
On Wed, Jan 20, 2021 at 10:25:09AM +0800, Hangbin Liu wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>

> 

> This changes the devmap XDP program support to run the program when the

> bulk queue is flushed instead of before the frame is enqueued. This has

> a couple of benefits:

> 

> - It "sorts" the packets by destination devmap entry, and then runs the

>   same BPF program on all the packets in sequence. This ensures that we

>   keep the XDP program and destination device properties hot in I-cache.

> 

> - It makes the multicast implementation simpler because it can just

>   enqueue packets using bq_enqueue() without having to deal with the

>   devmap program at all.

> 

> The drawback is that if the devmap program drops the packet, the enqueue

> step is redundant. However, arguably this is mostly visible in a

> micro-benchmark, and with more mixed traffic the I-cache benefit should

> win out. The performance impact of just this patch is as follows:

> 

> Using xdp_redirect_map(with a 2nd xdp_prog patch[1]) in sample/bpf and send

> pkts via pktgen cmd:

> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

> 

> There are about +/- 0.1M deviation for native testing, the performance

> improved for the base-case, but some drop back with xdp devmap prog attached.

> 

> Version          | Test                           | Generic | Native | Native + 2nd xdp_prog

> 5.10 rc6         | xdp_redirect_map   i40e->i40e  |    2.0M |   9.1M |  8.0M

> 5.10 rc6         | xdp_redirect_map   i40e->veth  |    1.7M |  11.0M |  9.7M

> 5.10 rc6 + patch | xdp_redirect_map   i40e->i40e  |    2.0M |   9.5M |  7.5M

> 5.10 rc6 + patch | xdp_redirect_map   i40e->veth  |    1.7M |  11.6M |  9.1M

> 

> [1] https://patchwork.ozlabs.org/project/netdev/patch/20201208120159.2278277-1-liuhangbin@gmail.com/


nit: probably would be good to update the link to patch, I see it's v8
already whereas link refers to v4.

> 

> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

> 

> ---

> v15:

> a) do not use unlikely when checking bq->xdp_prog

> b) return sent frames for dev_map_bpf_prog_run()

> 

> v14: no update, only rebase the code

> v13: pass in xdp_prog through __xdp_enqueue()

> v2-v12: no this patch

> ---

>  kernel/bpf/devmap.c | 116 +++++++++++++++++++++++++++-----------------

>  1 file changed, 71 insertions(+), 45 deletions(-)

> 

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c

> index f6e9c68afdd4..13ed68c24aad 100644

> --- a/kernel/bpf/devmap.c

> +++ b/kernel/bpf/devmap.c

> @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue {

>  	struct list_head flush_node;

>  	struct net_device *dev;

>  	struct net_device *dev_rx;

> +	struct bpf_prog *xdp_prog;

>  	unsigned int count;

>  };

>  

> @@ -327,44 +328,93 @@ bool dev_map_can_have_prog(struct bpf_map *map)

>  	return false;

>  }

>  

> +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,

> +				struct xdp_frame **frames, int n,

> +				struct net_device *dev)

> +{

> +	struct xdp_txq_info txq = { .dev = dev };

> +	struct xdp_buff xdp;

> +	int i, nframes = 0;

> +

> +	for (i = 0; i < n; i++) {

> +		struct xdp_frame *xdpf = frames[i];

> +		u32 act;

> +		int err;

> +

> +		xdp_convert_frame_to_buff(xdpf, &xdp);

> +		xdp.txq = &txq;

> +

> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);

> +		switch (act) {

> +		case XDP_PASS:

> +			err = xdp_update_frame_from_buff(&xdp, xdpf);


Bump on John's question.

> +			if (unlikely(err < 0))

> +				xdp_return_frame_rx_napi(xdpf);

> +			else

> +				frames[nframes++] = xdpf;

> +			break;

> +		default:

> +			bpf_warn_invalid_xdp_action(act);

> +			fallthrough;

> +		case XDP_ABORTED:

> +			trace_xdp_exception(dev, xdp_prog, act);

> +			fallthrough;

> +		case XDP_DROP:

> +			xdp_return_frame_rx_napi(xdpf);

> +			break;

> +		}

> +	}

> +	return nframes; /* sent frames count */

> +}

> +

>  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)

>  {

>  	struct net_device *dev = bq->dev;

>  	int sent = 0, drops = 0, err = 0;


if sent is inited below, the assignment above is useless.
Couldn't we do sent = cnt at the declaration over here?

> +	unsigned int cnt = bq->count;

>  	int i;

>  

> -	if (unlikely(!bq->count))

> +	if (unlikely(!cnt))

>  		return;

>  

> -	for (i = 0; i < bq->count; i++) {

> +	for (i = 0; i < cnt; i++) {

>  		struct xdp_frame *xdpf = bq->q[i];

>  

>  		prefetch(xdpf);

>  	}

>  

> -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);

> +	/* Init sent to cnt in case there is no xdp_prog */

> +	sent = cnt;

> +	if (bq->xdp_prog) {

> +		sent = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

> +		if (!sent)

> +			goto out;


Sorry, but 'sent' is a bit confusing to me, actual sending happens below
via ndo_xdp_xmit, right? This hook will not actually send frames.
Can we do a subtle change to have it in separate variable 'to_send' ?

> +	}

> +

> +	/* Backup drops value before xmit as we may need it in error label */

> +	drops = cnt - sent;

> +	sent = dev->netdev_ops->ndo_xdp_xmit(dev, sent, bq->q, flags);

>  	if (sent < 0) {

>  		err = sent;

>  		sent = 0;

>  		goto error;

>  	}

> -	drops = bq->count - sent;

>  out:

> +	drops = cnt - sent;

>  	bq->count = 0;

>  

>  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

>  	bq->dev_rx = NULL;

> +	bq->xdp_prog = NULL;

>  	__list_del_clearprev(&bq->flush_node);

>  	return;

>  error:

>  	/* If ndo_xdp_xmit fails with an errno, no frames have been

>  	 * xmit'ed and it's our responsibility to them free all.

>  	 */

> -	for (i = 0; i < bq->count; i++) {

> +	for (i = 0; i < cnt - drops; i++) {

>  		struct xdp_frame *xdpf = bq->q[i];

> -

>  		xdp_return_frame_rx_napi(xdpf);

> -		drops++;

>  	}

>  	goto out;


Although I'm a huge goto advocate, I feel like this particular usage could
be simplified. Not sure why we had that in the first place.

I gave a shot at rewriting/refactoring whole bq_xmit_all and I feel like
it's more readable. I introduced 'to_send' variable and got rid of 'error'
label.

Thoughts?

I might have missed something, though.

static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
{
	struct net_device *dev = bq->dev;
	unsigned int cnt = bq->count;
	int drops = 0, err = 0;
	int to_send = 0;
	int sent = cnt;
	int i;

	if (unlikely(!cnt))
		return;

	for (i = 0; i < cnt; i++) {
		struct xdp_frame *xdpf = bq->q[i];

		prefetch(xdpf);
	}

	if (bq->xdp_prog) {
		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);
		if (!to_send) {
			sent = 0;
			goto out;
		}
	}

	drops = cnt - to_send;
	sent = dev->netdev_ops->ndo_xdp_xmit(dev, to_send, bq->q, flags);
	if (sent < 0) {
		err = sent;
		sent = 0;

		/* If ndo_xdp_xmit fails with an errno, no frames have been
		 * xmit'ed and it's our responsibility to them free all.
		 */
		for (i = 0; i < cnt - drops; i++) {
			struct xdp_frame *xdpf = bq->q[i];

			xdp_return_frame_rx_napi(xdpf);
		}
	}
out:
	drops = cnt - sent;
	bq->count = 0;

	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);
	bq->dev_rx = NULL;
	bq->xdp_prog = NULL;
	__list_del_clearprev(&bq->flush_node);

	return;
}

>  }

> @@ -408,7 +458,7 @@ struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)

>   * Thus, safe percpu variable access.

>   */

>  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,

> -		       struct net_device *dev_rx)

> +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)

>  {

>  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);

>  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);

> @@ -423,6 +473,14 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,

>  	if (!bq->dev_rx)

>  		bq->dev_rx = dev_rx;

>  

> +	/* Store (potential) xdp_prog that run before egress to dev as

> +	 * part of bulk_queue.  This will be same xdp_prog for all

> +	 * xdp_frame's in bulk_queue, because this per-CPU store must

> +	 * be flushed from net_device drivers NAPI func end.

> +	 */

> +	if (!bq->xdp_prog)

> +		bq->xdp_prog = xdp_prog;

> +

>  	bq->q[bq->count++] = xdpf;

>  

>  	if (!bq->flush_node.prev)

> @@ -430,7 +488,8 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,

>  }

>  

>  static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,

> -			       struct net_device *dev_rx)

> +				struct net_device *dev_rx,

> +				struct bpf_prog *xdp_prog)

>  {

>  	struct xdp_frame *xdpf;

>  	int err;

> @@ -446,42 +505,14 @@ static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,

>  	if (unlikely(!xdpf))

>  		return -EOVERFLOW;

>  

> -	bq_enqueue(dev, xdpf, dev_rx);

> +	bq_enqueue(dev, xdpf, dev_rx, xdp_prog);

>  	return 0;

>  }

>  

> -static struct xdp_buff *dev_map_run_prog(struct net_device *dev,

> -					 struct xdp_buff *xdp,

> -					 struct bpf_prog *xdp_prog)

> -{

> -	struct xdp_txq_info txq = { .dev = dev };

> -	u32 act;

> -

> -	xdp_set_data_meta_invalid(xdp);

> -	xdp->txq = &txq;

> -

> -	act = bpf_prog_run_xdp(xdp_prog, xdp);

> -	switch (act) {

> -	case XDP_PASS:

> -		return xdp;

> -	case XDP_DROP:

> -		break;

> -	default:

> -		bpf_warn_invalid_xdp_action(act);

> -		fallthrough;

> -	case XDP_ABORTED:

> -		trace_xdp_exception(dev, xdp_prog, act);

> -		break;

> -	}

> -

> -	xdp_return_buff(xdp);

> -	return NULL;

> -}

> -

>  int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,

>  		    struct net_device *dev_rx)

>  {

> -	return __xdp_enqueue(dev, xdp, dev_rx);

> +	return __xdp_enqueue(dev, xdp, dev_rx, NULL);

>  }

>  

>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,

> @@ -489,12 +520,7 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,

>  {

>  	struct net_device *dev = dst->dev;

>  

> -	if (dst->xdp_prog) {

> -		xdp = dev_map_run_prog(dev, xdp, dst->xdp_prog);

> -		if (!xdp)

> -			return 0;

> -	}

> -	return __xdp_enqueue(dev, xdp, dev_rx);

> +	return __xdp_enqueue(dev, xdp, dev_rx, dst->xdp_prog);

>  }

>  

>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,

> -- 

> 2.26.2

>
Hangbin Liu Jan. 21, 2021, 3:54 a.m. UTC | #2
Hi Maciej,
On Wed, Jan 20, 2021 at 11:42:38PM +0100, Maciej Fijalkowski wrote:
> > +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,

> > +				struct xdp_frame **frames, int n,

> > +				struct net_device *dev)

> > +{

> > +	struct xdp_txq_info txq = { .dev = dev };

> > +	struct xdp_buff xdp;

> > +	int i, nframes = 0;

> > +

> > +	for (i = 0; i < n; i++) {

> > +		struct xdp_frame *xdpf = frames[i];

> > +		u32 act;

> > +		int err;

> > +

> > +		xdp_convert_frame_to_buff(xdpf, &xdp);

> > +		xdp.txq = &txq;

> > +

> > +		act = bpf_prog_run_xdp(xdp_prog, &xdp);

> > +		switch (act) {

> > +		case XDP_PASS:

> > +			err = xdp_update_frame_from_buff(&xdp, xdpf);

> 

> Bump on John's question.


Hi Jesper, would you please help answer John's question?
> >  

> > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);

> > +	/* Init sent to cnt in case there is no xdp_prog */

> > +	sent = cnt;

> > +	if (bq->xdp_prog) {

> > +		sent = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

> > +		if (!sent)

> > +			goto out;

> 

> Sorry, but 'sent' is a bit confusing to me, actual sending happens below

> via ndo_xdp_xmit, right? This hook will not actually send frames.

> Can we do a subtle change to have it in separate variable 'to_send' ?


Makes sense to me.
> 

> Although I'm a huge goto advocate, I feel like this particular usage could

> be simplified. Not sure why we had that in first place.

> 

> I gave a shot at rewriting/refactoring whole bq_xmit_all and I feel like

> it's more readable. I introduced 'to_send' variable and got rid of 'error'

> label.

> 

> Thoughts?

> 

> I might have missed something, though.

> 

> static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)

> {

> 	struct net_device *dev = bq->dev;

> 	unsigned int cnt = bq->count;

> 	int drops = 0, err = 0;

> 	int to_send = 0;


The to_send also needs to be initialized to cnt.

> 	int sent = cnt;

> 	int i;

> 

> 	if (unlikely(!cnt))

> 		return;

> 

> 	for (i = 0; i < cnt; i++) {

> 		struct xdp_frame *xdpf = bq->q[i];

> 

> 		prefetch(xdpf);

> 	}

> 

> 	if (bq->xdp_prog) {

> 		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

> 		if (!to_send) {

> 			sent = 0;

> 			goto out;

> 		}

> 	}

> 

> 	drops = cnt - to_send;


This line could move into the xdp_prog brackets to save time when there is no xdp_prog.

	if (bq->xdp_prog) {
		to_send = ...
		if (!to_send) {
			...
		}
		drops = cnt - to_send;
	}

> 	sent = dev->netdev_ops->ndo_xdp_xmit(dev, to_send, bq->q, flags);


If we don't have xdp_prog, the to_send should be cnt.

> 	if (sent < 0) {

> 		err = sent;

> 		sent = 0;

> 

> 		/* If ndo_xdp_xmit fails with an errno, no frames have been

> 		 * xmit'ed and it's our responsibility to them free all.

> 		 */

> 		for (i = 0; i < cnt - drops; i++) {

> 			struct xdp_frame *xdpf = bq->q[i];

> 

> 			xdp_return_frame_rx_napi(xdpf);

> 		}

> 	}

> out:

> 	drops = cnt - sent;

> 	bq->count = 0;

> 

> 	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

> 	bq->dev_rx = NULL;

> 	bq->xdp_prog = NULL;

> 	__list_del_clearprev(&bq->flush_node);

> 

> 	return;

> }


Thanks for your code, it looks much clearer now.

Hangbin
Maciej Fijalkowski Jan. 21, 2021, 1:35 p.m. UTC | #3
On Thu, Jan 21, 2021 at 11:54:24AM +0800, Hangbin Liu wrote:
> Hi Maciej,

> On Wed, Jan 20, 2021 at 11:42:38PM +0100, Maciej Fijalkowski wrote:

> > > +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,

> > > +				struct xdp_frame **frames, int n,

> > > +				struct net_device *dev)

> > > +{

> > > +	struct xdp_txq_info txq = { .dev = dev };

> > > +	struct xdp_buff xdp;

> > > +	int i, nframes = 0;

> > > +

> > > +	for (i = 0; i < n; i++) {

> > > +		struct xdp_frame *xdpf = frames[i];

> > > +		u32 act;

> > > +		int err;

> > > +

> > > +		xdp_convert_frame_to_buff(xdpf, &xdp);

> > > +		xdp.txq = &txq;

> > > +

> > > +		act = bpf_prog_run_xdp(xdp_prog, &xdp);

> > > +		switch (act) {

> > > +		case XDP_PASS:

> > > +			err = xdp_update_frame_from_buff(&xdp, xdpf);

> > 

> > Bump on John's question.

> 

> Hi Jesper, would you please help answer John's question?

> > >  

> > > -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);

> > > +	/* Init sent to cnt in case there is no xdp_prog */

> > > +	sent = cnt;

> > > +	if (bq->xdp_prog) {

> > > +		sent = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

> > > +		if (!sent)

> > > +			goto out;

> > 

> > Sorry, but 'sent' is a bit confusing to me, actual sending happens below

> > via ndo_xdp_xmit, right? This hook will not actually send frames.

> > Can we do a subtle change to have it in separate variable 'to_send' ?

> 

> Makes sense to me.

> > 

> > Although I'm a huge goto advocate, I feel like this particular usage could

> > be simplified. Not sure why we had that in first place.

> > 

> > I gave a shot at rewriting/refactoring whole bq_xmit_all and I feel like

> > it's more readable. I introduced 'to_send' variable and got rid of 'error'

> > label.

> > 

> > Thoughts?

> > 

> > I might have missed something, though.

> > 

> > static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)

> > {

> > 	struct net_device *dev = bq->dev;

> > 	unsigned int cnt = bq->count;

> > 	int drops = 0, err = 0;

> > 	int to_send = 0;

> 

> The to_send also need to init to cnt.


So I missed something indeed :P you're correct

> 

> > 	int sent = cnt;

> > 	int i;

> > 

> > 	if (unlikely(!cnt))

> > 		return;

> > 

> > 	for (i = 0; i < cnt; i++) {

> > 		struct xdp_frame *xdpf = bq->q[i];

> > 

> > 		prefetch(xdpf);

> > 	}

> > 

> > 	if (bq->xdp_prog) {

> > 		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

> > 		if (!to_send) {

> > 			sent = 0;

> > 			goto out;

> > 		}

> > 	}

> > 

> > 	drops = cnt - to_send;

> 

> This line could move in to the xdp_prog brackets to save time when no xdp_prog.


Hmm, looks like we can do it.
For the scenario where there was no bq->xdp_prog and ndo_xdp_xmit failed,
we didn't alter the count of frames to be sent, so we would basically free
all of the frames (as drops is 0 and cnt = bq->count). After that we
recalculate drops and the correct value will be reported in the tracepoint.

(needed to explain it to myself)
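
To make that concrete with purely illustrative numbers: with cnt = 8 and no
bq->xdp_prog, drops stays 0 and to_send stays at cnt = 8; if ndo_xdp_xmit
then returns an errno, sent becomes 0 and the free loop runs for
i < cnt - drops = 8, returning every frame; at out:, drops = cnt - sent = 8,
which is exactly what the tracepoint should report.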

> 

> 	if (bq->xdp_prog) {

> 		to_send = ...

> 		if (!to_send) {

> 			...

> 		}

> 		drops = cnt - to_send;

> 	}

> 

> > 	sent = dev->netdev_ops->ndo_xdp_xmit(dev, to_send, bq->q, flags);

> 

> If we don't have xdp_prog, the to_send should be cnt.


Yes, we should init to_send to cnt as you're suggesting above.

> 

> > 	if (sent < 0) {

> > 		err = sent;

> > 		sent = 0;

> > 

> > 		/* If ndo_xdp_xmit fails with an errno, no frames have been

> > 		 * xmit'ed and it's our responsibility to them free all.

> > 		 */

> > 		for (i = 0; i < cnt - drops; i++) {

> > 			struct xdp_frame *xdpf = bq->q[i];

> > 

> > 			xdp_return_frame_rx_napi(xdpf);

> > 		}

> > 	}

> > out:

> > 	drops = cnt - sent;

> > 	bq->count = 0;

> > 

> > 	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

> > 	bq->dev_rx = NULL;

> > 	bq->xdp_prog = NULL;

> > 	__list_del_clearprev(&bq->flush_node);

> > 

> > 	return;

> > }

> 

> Thanks for your code, looks much clear now.


Good to hear! I agree on your points as well.

> 

> Hangbin
Hangbin Liu Jan. 22, 2021, 7:46 a.m. UTC | #4
This patch set adds XDP multicast support, which has been discussed
before[0]. The goal is to be able to implement an OVS-like data plane in
XDP, i.e., a software switch that can forward XDP frames to multiple ports.

To achieve this, an application needs to specify a group of interfaces
to forward a packet to. It is also common to want to exclude one or more
physical interfaces from the forwarding operation - e.g., to forward a
packet to all interfaces in the multicast group except the interface it
arrived on. While this could be done simply by adding more groups, this
quickly leads to a combinatorial explosion in the number of groups an
application has to maintain.

To avoid the combinatorial explosion, we propose to include the ability
to specify an "exclude group" as part of the forwarding operation. This
needs to be a group (instead of just a single port index), because there
may be multiple interfaces you want to exclude.

Thus, the logical forwarding operation becomes a "set difference"
operation, i.e. "forward to all ports in group A that are not also in
group B". This series implements such an operation using device maps to
represent the groups. This means that the XDP program specifies two
device maps, one containing the list of netdevs to redirect to, and the
other containing the exclude list.

To achieve this, I implement a new helper, bpf_redirect_map_multi(), which
accepts two maps: the forwarding map and the exclude map. Users who don't
need an exclude map and simply want to stop redirecting back to the ingress
device can use the BPF_F_EXCLUDE_INGRESS flag instead.

The 1st patch is Jesper's change to run the devmap xdp_prog later, in the
bulking step. The 2nd patch adds a new bpf arg to allow a NULL map pointer.
The 3rd patch adds the new bpf_redirect_map_multi() helper.
Patches 4-6 add a usage sample and selftests.

I did the same perf tests with the following topo:

---------------------             ---------------------
| Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |
---------------------             |                   |
                                  |   Host B          |
---------------------             |                   |
| Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |
---------------------    vlan2    |          -------- |
                                  | veth1 -- | veth0| |
                                  |          -------- |
                                  --------------------|
On Host A:
# pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

On Host B (Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):
Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.
The veth0 in the netns loads a dummy drop program. The forward_map
max_entries in xdp_redirect_map_multi is modified to 4.

Here is the perf result with 5.10 rc6:

There is about +/- 0.1M deviation for native testing.
Version             | Test                                    | Generic | Native | Native + 2nd
5.10 rc6            | xdp_redirect_map        i40e->i40e      |    2.0M |   9.1M |  8.0M
5.10 rc6            | xdp_redirect_map        i40e->veth      |    1.7M |  11.0M |  9.7M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1   | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e      |    1.7M |   7.8M |  6.4M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->veth      |    1.4M |   9.3M |  7.5M
5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e+veth |    1.0M |   3.2M |  2.7M

Last but not least, thanks a lot to Toke, Jesper, Jiri and Eelco for
suggestions and help on implementation.

[0] https://xdp-project.net/#Handling-multicast

v16:
refactor bq_xmit_all logic and remove error label for patch 01

v15:
Update bq_xmit_all() logic for patch 01.
Add some comments and remove useless variable for patch 03.
Use bpf_object__find_program_by_title() for patch 04 and 06.

v14:
No code update, just rebase the code on latest bpf-next

v13:
Pass in xdp_prog through __xdp_enqueue() for patch 01. Update related
code in patch 03.

v12:
Add Jesper's xdp_prog patch, rebase my work on this and the latest bpf-next.
Add 2nd xdp_prog test on the sample and selftests.

v11:
Fix bpf_redirect_map_multi() helper description typo.
Add loop limit for devmap_get_next_obj() and dev_map_redirect_multi().

v10:
Rebase the code to latest bpf-next.
Update helper bpf_xdp_redirect_map_multi()
- No need to check map pointer as we will do the check in verifier.

v9:
Update helper bpf_xdp_redirect_map_multi()
- Use ARG_CONST_MAP_PTR_OR_NULL for helper arg2

v8:
a) Update function dev_in_exclude_map():
   - remove duplicate ex_map map_type check in
   - lookup the element in dev map by obj dev index directly instead
     of looping all the map

v7:
a) Fix helper flag check
b) Limit the *ex_map* to use DEVMAP_HASH only and update function
   dev_in_exclude_map() to get better performance.

v6: converted helper return types from int to long

v5:
a) Check devmap_get_next_key() return value.
b) Pass through flags to __bpf_tx_xdp_map() instead of bool value.
c) In function dev_map_enqueue_multi(), consume xdpf for the last
   obj instead of the first one.
d) Update helper description and code comments to explain that we
   use NULL target value to distinguish multicast and unicast
   forwarding.
e) Update memory model, memory id and frame_sz in xdpf_clone().
f) Split the tests from sample and add a bpf kernel selftest patch.

v4: Fix bpf_xdp_redirect_map_multi_proto arg2_type typo

v3: Based on Toke's suggestion, do the following update
a) Update bpf_redirect_map_multi() description in bpf.h.
b) Fix exclude_ifindex checking order in dev_in_exclude_map().
c) Fix one more xdpf clone in dev_map_enqueue_multi().
d) Go find next one in dev_map_enqueue_multi() if the interface is not
   able to forward instead of abort the whole loop.
e) Remove READ_ONCE/WRITE_ONCE for ex_map.

v2: Add new helper bpf_xdp_redirect_map_multi() which could accept
include/exclude maps directly.

Hangbin Liu (5):
  bpf: add a new bpf argument type ARG_CONST_MAP_PTR_OR_NULL
  xdp: add a new helper for dev map multicast support
  sample/bpf: add xdp_redirect_map_multicast test
  selftests/bpf: Add verifier tests for bpf arg
    ARG_CONST_MAP_PTR_OR_NULL
  selftests/bpf: add xdp_redirect_multi test

Jesper Dangaard Brouer (1):
  bpf: run devmap xdp_prog on flush instead of bulk enqueue

 include/linux/bpf.h                           |  21 ++
 include/linux/filter.h                        |   1 +
 include/net/xdp.h                             |   1 +
 include/uapi/linux/bpf.h                      |  28 ++
 kernel/bpf/devmap.c                           | 252 ++++++++++++---
 kernel/bpf/verifier.c                         |  16 +-
 net/core/filter.c                             | 124 ++++++-
 net/core/xdp.c                                |  29 ++
 samples/bpf/Makefile                          |   3 +
 samples/bpf/xdp_redirect_map_multi_kern.c     |  87 +++++
 samples/bpf/xdp_redirect_map_multi_user.c     | 302 ++++++++++++++++++
 tools/include/uapi/linux/bpf.h                |  28 ++
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../bpf/progs/xdp_redirect_multi_kern.c       | 111 +++++++
 tools/testing/selftests/bpf/test_verifier.c   |  21 +-
 .../selftests/bpf/test_xdp_redirect_multi.sh  | 208 ++++++++++++
 .../testing/selftests/bpf/verifier/map_ptr.c  |  70 ++++
 .../selftests/bpf/xdp_redirect_multi.c        | 252 +++++++++++++++
 18 files changed, 1497 insertions(+), 60 deletions(-)
 create mode 100644 samples/bpf/xdp_redirect_map_multi_kern.c
 create mode 100644 samples/bpf/xdp_redirect_map_multi_user.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_redirect_multi_kern.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_redirect_multi.sh
 create mode 100644 tools/testing/selftests/bpf/xdp_redirect_multi.c
Maciej Fijalkowski Jan. 22, 2021, 10:50 a.m. UTC | #5
On Fri, Jan 22, 2021 at 03:46:47PM +0800, Hangbin Liu wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>

> 

> This changes the devmap XDP program support to run the program when the

> bulk queue is flushed instead of before the frame is enqueued. This has

> a couple of benefits:

> 

> - It "sorts" the packets by destination devmap entry, and then runs the

>   same BPF program on all the packets in sequence. This ensures that we

>   keep the XDP program and destination device properties hot in I-cache.

> 

> - It makes the multicast implementation simpler because it can just

>   enqueue packets using bq_enqueue() without having to deal with the

>   devmap program at all.

> 

> The drawback is that if the devmap program drops the packet, the enqueue

> step is redundant. However, arguably this is mostly visible in a

> micro-benchmark, and with more mixed traffic the I-cache benefit should

> win out. The performance impact of just this patch is as follows:

> 

> Using xdp_redirect_map(with a 2nd xdp_prog patch[1]) in sample/bpf and send

> pkts via pktgen cmd:

> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

> 

> There are about +/- 0.1M deviation for native testing, the performance

> improved for the base-case, but some drop back with xdp devmap prog attached.

> 

> Version          | Test                           | Generic | Native | Native + 2nd xdp_prog

> 5.10 rc6         | xdp_redirect_map   i40e->i40e  |    2.0M |   9.1M |  8.0M

> 5.10 rc6         | xdp_redirect_map   i40e->veth  |    1.7M |  11.0M |  9.7M

> 5.10 rc6 + patch | xdp_redirect_map   i40e->i40e  |    2.0M |   9.5M |  7.5M

> 5.10 rc6 + patch | xdp_redirect_map   i40e->veth  |    1.7M |  11.6M |  9.1M

> 

> [1] https://lore.kernel.org/bpf/20210122025007.2968381-1-liuhangbin@gmail.com

> 

> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

> 

> ---

> v16:

> a) refactor bq_xmit_all logic and remove error label

> 

> v15:

> a) do not use unlikely when checking bq->xdp_prog

> b) return sent frames for dev_map_bpf_prog_run()

> 

> v14: no update, only rebase the code

> v13: pass in xdp_prog through __xdp_enqueue()

> v2-v12: no this patch

> ---

>  kernel/bpf/devmap.c | 136 ++++++++++++++++++++++++++------------------

>  1 file changed, 81 insertions(+), 55 deletions(-)

> 

> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c

> index f6e9c68afdd4..c24fcffbbfad 100644

> --- a/kernel/bpf/devmap.c

> +++ b/kernel/bpf/devmap.c

> @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue {

>  	struct list_head flush_node;

>  	struct net_device *dev;

>  	struct net_device *dev_rx;

> +	struct bpf_prog *xdp_prog;

>  	unsigned int count;

>  };

>  

> @@ -327,46 +328,95 @@ bool dev_map_can_have_prog(struct bpf_map *map)

>  	return false;

>  }

>  

> +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,

> +				struct xdp_frame **frames, int n,

> +				struct net_device *dev)

> +{

> +	struct xdp_txq_info txq = { .dev = dev };

> +	struct xdp_buff xdp;

> +	int i, nframes = 0;

> +

> +	for (i = 0; i < n; i++) {

> +		struct xdp_frame *xdpf = frames[i];

> +		u32 act;

> +		int err;

> +

> +		xdp_convert_frame_to_buff(xdpf, &xdp);

> +		xdp.txq = &txq;

> +

> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);

> +		switch (act) {

> +		case XDP_PASS:

> +			err = xdp_update_frame_from_buff(&xdp, xdpf);

> +			if (unlikely(err < 0))

> +				xdp_return_frame_rx_napi(xdpf);

> +			else

> +				frames[nframes++] = xdpf;

> +			break;

> +		default:

> +			bpf_warn_invalid_xdp_action(act);

> +			fallthrough;

> +		case XDP_ABORTED:

> +			trace_xdp_exception(dev, xdp_prog, act);

> +			fallthrough;

> +		case XDP_DROP:

> +			xdp_return_frame_rx_napi(xdpf);

> +			break;

> +		}

> +	}

> +	return nframes; /* sent frames count */

> +}

> +

>  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)

>  {

>  	struct net_device *dev = bq->dev;

> -	int sent = 0, drops = 0, err = 0;

> +	unsigned int cnt = bq->count;

> +	int drops = 0, err = 0;

> +	int to_sent = cnt;


Hmm, if I were super picky, I'd like to have this variable named
"to_send", as we discussed. Could you change that?

With that, you can add my:

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>


to the next revision.

> +	int sent = cnt;

>  	int i;

>  

> -	if (unlikely(!bq->count))

> +	if (unlikely(!cnt))

>  		return;

>  

> -	for (i = 0; i < bq->count; i++) {

> +	for (i = 0; i < cnt; i++) {

>  		struct xdp_frame *xdpf = bq->q[i];

>  

>  		prefetch(xdpf);

>  	}

>  

> -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);

> +	if (bq->xdp_prog) {

> +		to_sent = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

> +		if (!to_sent) {

> +			sent = 0;

> +			goto out;

> +		}

> +		drops = cnt - to_sent;

> +	}

> +

> +	sent = dev->netdev_ops->ndo_xdp_xmit(dev, to_sent, bq->q, flags);

>  	if (sent < 0) {

>  		err = sent;

>  		sent = 0;

> -		goto error;

> +

> +		/* If ndo_xdp_xmit fails with an errno, no frames have been

> +		 * xmit'ed and it's our responsibility to them free all.

> +		 */

> +		for (i = 0; i < cnt - drops; i++) {

> +			struct xdp_frame *xdpf = bq->q[i];

> +

> +			xdp_return_frame_rx_napi(xdpf);

> +		}

>  	}

> -	drops = bq->count - sent;

>  out:

> +	drops = cnt - sent;

>  	bq->count = 0;

>  

>  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

>  	bq->dev_rx = NULL;

> +	bq->xdp_prog = NULL;


One more question: do you really have to do that for each bq_xmit_all()
call? Couldn't you clear it in __dev_flush()?

Or IOW - what's the rationale behind storing xdp_prog in
xdp_dev_bulk_queue? Why can't you propagate dst->xdp_prog and rely on
that without the local pointer?

You probably have an answer for that, so maybe include it in the commit
message.

BTW, same question for clearing dev_rx. To me this will be the same for all
bq_xmit_all() calls that happen within the same NAPI cycle.

>  	__list_del_clearprev(&bq->flush_node);

>  	return;

> -error:

> -	/* If ndo_xdp_xmit fails with an errno, no frames have been

> -	 * xmit'ed and it's our responsibility to them free all.

> -	 */

> -	for (i = 0; i < bq->count; i++) {

> -		struct xdp_frame *xdpf = bq->q[i];

> -

> -		xdp_return_frame_rx_napi(xdpf);

> -		drops++;

> -	}

> -	goto out;

>  }

>  

>  /* __dev_flush is called from xdp_do_flush() which _must_ be signaled

> @@ -408,7 +458,7 @@ struct bpf_dtab_netdev *__dev_map_lookup_elem(struct bpf_map *map, u32 key)

>   * Thus, safe percpu variable access.

>   */

>  static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,

> -		       struct net_device *dev_rx)

> +		       struct net_device *dev_rx, struct bpf_prog *xdp_prog)

>  {

>  	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);

>  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);

> @@ -423,6 +473,14 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,

>  	if (!bq->dev_rx)

>  		bq->dev_rx = dev_rx;

>  

> +	/* Store (potential) xdp_prog that run before egress to dev as

> +	 * part of bulk_queue.  This will be same xdp_prog for all

> +	 * xdp_frame's in bulk_queue, because this per-CPU store must

> +	 * be flushed from net_device drivers NAPI func end.

> +	 */

> +	if (!bq->xdp_prog)

> +		bq->xdp_prog = xdp_prog;

> +

>  	bq->q[bq->count++] = xdpf;

>  

>  	if (!bq->flush_node.prev)

> @@ -430,7 +488,8 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,

>  }

>  

>  static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,

> -			       struct net_device *dev_rx)

> +				struct net_device *dev_rx,

> +				struct bpf_prog *xdp_prog)

>  {

>  	struct xdp_frame *xdpf;

>  	int err;

> @@ -446,42 +505,14 @@ static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,

>  	if (unlikely(!xdpf))

>  		return -EOVERFLOW;

>  

> -	bq_enqueue(dev, xdpf, dev_rx);

> +	bq_enqueue(dev, xdpf, dev_rx, xdp_prog);

>  	return 0;

>  }

>  

> -static struct xdp_buff *dev_map_run_prog(struct net_device *dev,

> -					 struct xdp_buff *xdp,

> -					 struct bpf_prog *xdp_prog)

> -{

> -	struct xdp_txq_info txq = { .dev = dev };

> -	u32 act;

> -

> -	xdp_set_data_meta_invalid(xdp);

> -	xdp->txq = &txq;

> -

> -	act = bpf_prog_run_xdp(xdp_prog, xdp);

> -	switch (act) {

> -	case XDP_PASS:

> -		return xdp;

> -	case XDP_DROP:

> -		break;

> -	default:

> -		bpf_warn_invalid_xdp_action(act);

> -		fallthrough;

> -	case XDP_ABORTED:

> -		trace_xdp_exception(dev, xdp_prog, act);

> -		break;

> -	}

> -

> -	xdp_return_buff(xdp);

> -	return NULL;

> -}

> -

>  int dev_xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,

>  		    struct net_device *dev_rx)

>  {

> -	return __xdp_enqueue(dev, xdp, dev_rx);

> +	return __xdp_enqueue(dev, xdp, dev_rx, NULL);

>  }

>  

>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,

> @@ -489,12 +520,7 @@ int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_buff *xdp,

>  {

>  	struct net_device *dev = dst->dev;

>  

> -	if (dst->xdp_prog) {

> -		xdp = dev_map_run_prog(dev, xdp, dst->xdp_prog);

> -		if (!xdp)

> -			return 0;

> -	}

> -	return __xdp_enqueue(dev, xdp, dev_rx);

> +	return __xdp_enqueue(dev, xdp, dev_rx, dst->xdp_prog);

>  }

>  

>  int dev_map_generic_redirect(struct bpf_dtab_netdev *dst, struct sk_buff *skb,

> -- 

> 2.26.2

>
Toke Høiland-Jørgensen Jan. 22, 2021, 1:38 p.m. UTC | #6
Maciej Fijalkowski <maciej.fijalkowski@intel.com> writes:

> On Fri, Jan 22, 2021 at 03:46:47PM +0800, Hangbin Liu wrote:

>> From: Jesper Dangaard Brouer <brouer@redhat.com>

>> 

>> This changes the devmap XDP program support to run the program when the

>> bulk queue is flushed instead of before the frame is enqueued. This has

>> a couple of benefits:

>> 

>> - It "sorts" the packets by destination devmap entry, and then runs the

>>   same BPF program on all the packets in sequence. This ensures that we

>>   keep the XDP program and destination device properties hot in I-cache.

>> 

>> - It makes the multicast implementation simpler because it can just

>>   enqueue packets using bq_enqueue() without having to deal with the

>>   devmap program at all.

>> 

>> The drawback is that if the devmap program drops the packet, the enqueue

>> step is redundant. However, arguably this is mostly visible in a

>> micro-benchmark, and with more mixed traffic the I-cache benefit should

>> win out. The performance impact of just this patch is as follows:

>> 

>> Using xdp_redirect_map(with a 2nd xdp_prog patch[1]) in sample/bpf and send

>> pkts via pktgen cmd:

>> ./pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -t 10 -s 64

>> 

>> There are about +/- 0.1M deviation for native testing, the performance

>> improved for the base-case, but some drop back with xdp devmap prog attached.

>> 

>> Version          | Test                           | Generic | Native | Native + 2nd xdp_prog

>> 5.10 rc6         | xdp_redirect_map   i40e->i40e  |    2.0M |   9.1M |  8.0M

>> 5.10 rc6         | xdp_redirect_map   i40e->veth  |    1.7M |  11.0M |  9.7M

>> 5.10 rc6 + patch | xdp_redirect_map   i40e->i40e  |    2.0M |   9.5M |  7.5M

>> 5.10 rc6 + patch | xdp_redirect_map   i40e->veth  |    1.7M |  11.6M |  9.1M

>> 

>> [1] https://lore.kernel.org/bpf/20210122025007.2968381-1-liuhangbin@gmail.com

>> 

>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

>> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

>> 

>> ---

>> v16:

>> a) refactor bq_xmit_all logic and remove error label

>> 

>> v15:

>> a) do not use unlikely when checking bq->xdp_prog

>> b) return sent frames for dev_map_bpf_prog_run()

>> 

>> v14: no update, only rebase the code

>> v13: pass in xdp_prog through __xdp_enqueue()

>> v2-v12: no this patch

>> ---

>>  kernel/bpf/devmap.c | 136 ++++++++++++++++++++++++++------------------

>>  1 file changed, 81 insertions(+), 55 deletions(-)

>> 

>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c

>> index f6e9c68afdd4..c24fcffbbfad 100644

>> --- a/kernel/bpf/devmap.c

>> +++ b/kernel/bpf/devmap.c

>> @@ -57,6 +57,7 @@ struct xdp_dev_bulk_queue {

>>  	struct list_head flush_node;

>>  	struct net_device *dev;

>>  	struct net_device *dev_rx;

>> +	struct bpf_prog *xdp_prog;

>>  	unsigned int count;

>>  };

>>  

>> @@ -327,46 +328,95 @@ bool dev_map_can_have_prog(struct bpf_map *map)

>>  	return false;

>>  }

>>  

>> +static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,

>> +				struct xdp_frame **frames, int n,

>> +				struct net_device *dev)

>> +{

>> +	struct xdp_txq_info txq = { .dev = dev };

>> +	struct xdp_buff xdp;

>> +	int i, nframes = 0;

>> +

>> +	for (i = 0; i < n; i++) {

>> +		struct xdp_frame *xdpf = frames[i];

>> +		u32 act;

>> +		int err;

>> +

>> +		xdp_convert_frame_to_buff(xdpf, &xdp);

>> +		xdp.txq = &txq;

>> +

>> +		act = bpf_prog_run_xdp(xdp_prog, &xdp);

>> +		switch (act) {

>> +		case XDP_PASS:

>> +			err = xdp_update_frame_from_buff(&xdp, xdpf);

>> +			if (unlikely(err < 0))

>> +				xdp_return_frame_rx_napi(xdpf);

>> +			else

>> +				frames[nframes++] = xdpf;

>> +			break;

>> +		default:

>> +			bpf_warn_invalid_xdp_action(act);

>> +			fallthrough;

>> +		case XDP_ABORTED:

>> +			trace_xdp_exception(dev, xdp_prog, act);

>> +			fallthrough;

>> +		case XDP_DROP:

>> +			xdp_return_frame_rx_napi(xdpf);

>> +			break;

>> +		}

>> +	}

>> +	return nframes; /* sent frames count */

>> +}

>> +

>>  static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)

>>  {

>>  	struct net_device *dev = bq->dev;

>> -	int sent = 0, drops = 0, err = 0;

>> +	unsigned int cnt = bq->count;

>> +	int drops = 0, err = 0;

>> +	int to_sent = cnt;

>

> Hmm if I would be super picky then I'd like to have this variable as

> "to_send", as we spoke. Could you change that?

>

> With that, you can add my:

>

> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

>

> to next revision.

>

>> +	int sent = cnt;

>>  	int i;

>>  

>> -	if (unlikely(!bq->count))

>> +	if (unlikely(!cnt))

>>  		return;

>>  

>> -	for (i = 0; i < bq->count; i++) {

>> +	for (i = 0; i < cnt; i++) {

>>  		struct xdp_frame *xdpf = bq->q[i];

>>  

>>  		prefetch(xdpf);

>>  	}

>>  

>> -	sent = dev->netdev_ops->ndo_xdp_xmit(dev, bq->count, bq->q, flags);

>> +	if (bq->xdp_prog) {

>> +		to_sent = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev);

>> +		if (!to_sent) {

>> +			sent = 0;

>> +			goto out;

>> +		}

>> +		drops = cnt - to_sent;

>> +	}

>> +

>> +	sent = dev->netdev_ops->ndo_xdp_xmit(dev, to_sent, bq->q, flags);

>>  	if (sent < 0) {

>>  		err = sent;

>>  		sent = 0;

>> -		goto error;

>> +

>> +		/* If ndo_xdp_xmit fails with an errno, no frames have been

>> +		 * xmit'ed and it's our responsibility to them free all.

>> +		 */

>> +		for (i = 0; i < cnt - drops; i++) {

>> +			struct xdp_frame *xdpf = bq->q[i];

>> +

>> +			xdp_return_frame_rx_napi(xdpf);

>> +		}

>>  	}

>> -	drops = bq->count - sent;

>>  out:

>> +	drops = cnt - sent;

>>  	bq->count = 0;

>>  

>>  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

>>  	bq->dev_rx = NULL;

>> +	bq->xdp_prog = NULL;

>

> One more question, do you really have to do that per each bq_xmit_all

> call? Couldn't you clear it in __dev_flush ?

>

> Or IOW - what's the rationale behind storing xdp_prog in

> xdp_dev_bulk_queue. Why can't you propagate the dst->xdp_prog and rely on

> that without that local pointer?

>

> You probably have an answer for that, so maybe include it in commit

> message.

>

> BTW same question for clearing dev_rx. To me this will be the same for all

> bq_xmit_all() calls that will happen within same napi.


I think you're right: When bq_xmit_all() is called from bq_enqueue(),
another packet will always be enqueued immediately after, so clearing
out all of those things in bq_xmit_all() is redundant. This also
includes the list_del on bq->flush_node, BTW.

And while we're getting into the micro-optimisations: in bq_enqueue() we
have two checks:

	if (!bq->dev_rx)
		bq->dev_rx = dev_rx;

	bq->q[bq->count++] = xdpf;

	if (!bq->flush_node.prev)
		list_add(&bq->flush_node, flush_list);


those two if() checks can be collapsed into one, since the list and the
dev_rx field are only ever modified together. This will also be the case
for bq->xdp_prog, so putting all three under the same check in
bq_enqueue() and only clearing them in __dev_flush() would be a win, I
suppose - nice catch! :)

-Toke
Toke Høiland-Jørgensen Jan. 22, 2021, 1:43 p.m. UTC | #7
Hangbin Liu <liuhangbin@gmail.com> writes:

> This patch is for xdp multicast support. which has been discussed before[0],

> The goal is to be able to implement an OVS-like data plane in XDP, i.e.,

> a software switch that can forward XDP frames to multiple ports.

>

> To achieve this, an application needs to specify a group of interfaces

> to forward a packet to. It is also common to want to exclude one or more

> physical interfaces from the forwarding operation - e.g., to forward a

> packet to all interfaces in the multicast group except the interface it

> arrived on. While this could be done simply by adding more groups, this

> quickly leads to a combinatorial explosion in the number of groups an

> application has to maintain.

>

> To avoid the combinatorial explosion, we propose to include the ability

> to specify an "exclude group" as part of the forwarding operation. This

> needs to be a group (instead of just a single port index), because there

> may have multi interfaces you want to exclude.

>

> Thus, the logical forwarding operation becomes a "set difference"

> operation, i.e. "forward to all ports in group A that are not also in

> group B". This series implements such an operation using device maps to

> represent the groups. This means that the XDP program specifies two

> device maps, one containing the list of netdevs to redirect to, and the

> other containing the exclude list.

>

> To achieve this, I re-implement a new helper bpf_redirect_map_multi()

> to accept two maps, the forwarding map and exclude map. If user

> don't want to use exclude map and just want simply stop redirecting back

> to ingress device, they can use flag BPF_F_EXCLUDE_INGRESS.

>

> The 1st patch is Jesper's run devmap xdp_prog later in bulking step.

> The 2st patch add a new bpf arg to allow NULL map pointer.

> The 3rd patch add the new bpf_redirect_map_multi() helper.

> The 4-6 patches are for usage sample and testing purpose.

>

> I did same perf tests with the following topo:

>

> ---------------------             ---------------------

> | Host A (i40e 10G) |  ---------- | eno1(i40e 10G)    |

> ---------------------             |                   |

>                                   |   Host B          |

> ---------------------             |                   |

> | Host C (i40e 10G) |  ---------- | eno2(i40e 10G)    |

> ---------------------    vlan2    |          -------- |

>                                   | veth1 -- | veth0| |

>                                   |          -------- |

>                                   --------------------|

> On Host A:

> # pktgen/pktgen_sample03_burst_single_flow.sh -i eno1 -d $dst_ip -m $dst_mac -s 64

>

> On Host B(Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz, 128G Memory):

> Use xdp_redirect_map and xdp_redirect_map_multi in samples/bpf for testing.

> The veth0 in netns load dummy drop program. The forward_map max_entries in

> xdp_redirect_map_multi is modify to 4.

>

> Here is the perf result with 5.10 rc6:

>

> The are about +/- 0.1M deviation for native testing

> Version             | Test                                    | Generic | Native | Native + 2nd

> 5.10 rc6            | xdp_redirect_map        i40e->i40e      |    2.0M |   9.1M |  8.0M

> 5.10 rc6            | xdp_redirect_map        i40e->veth      |    1.7M |  11.0M |  9.7M

> 5.10 rc6 + patch1   | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M

> 5.10 rc6 + patch1   | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M

> 5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->i40e      |    2.0M |   9.5M |  7.5M

> 5.10 rc6 + patch1-6 | xdp_redirect_map        i40e->veth      |    1.7M |  11.6M |  9.1M

> 5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e      |    1.7M |   7.8M |  6.4M

> 5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->veth      |    1.4M |   9.3M |  7.5M

> 5.10 rc6 + patch1-6 | xdp_redirect_map_multi  i40e->i40e+veth |    1.0M |   3.2M |  2.7M

>

> Last but not least, thanks a lot to Toke, Jesper, Jiri and Eelco for

> suggestions and help on implementation.


Nice work, and thank you for sticking with this! With the last couple of
fixes discussed for patch 1, when you resubmit please add my:

Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>


for the series!

-Toke
Hangbin Liu Jan. 25, 2021, 3:30 a.m. UTC | #8
On Fri, Jan 22, 2021 at 02:38:40PM +0100, Toke Høiland-Jørgensen wrote:
> >>  out:

> >> +	drops = cnt - sent;

> >>  	bq->count = 0;

> >>  

> >>  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

> >>  	bq->dev_rx = NULL;

> >> +	bq->xdp_prog = NULL;

> >

> > One more question, do you really have to do that per each bq_xmit_all

> > call? Couldn't you clear it in __dev_flush ?

> >

> > Or IOW - what's the rationale behind storing xdp_prog in

> > xdp_dev_bulk_queue. Why can't you propagate the dst->xdp_prog and rely on

> > that without that local pointer?

> >

> > You probably have an answer for that, so maybe include it in commit

> > message.

> >

> > BTW same question for clearing dev_rx. To me this will be the same for all

> > bq_xmit_all() calls that will happen within same napi.

> 

> I think you're right: When bq_xmit_all() is called from bq_enqueue(),

> another packet will always be enqueued immediately after, so clearing

> out all of those things in bq_xmit_all() is redundant. This also

> includes the list_del on bq->flush_node, BTW.

> 

> And while we're getting into e micro-optimisations: In bq_enqueue() we

> have two checks:

> 

> 	if (!bq->dev_rx)

> 		bq->dev_rx = dev_rx;

> 

> 	bq->q[bq->count++] = xdpf;

> 

> 	if (!bq->flush_node.prev)

> 		list_add(&bq->flush_node, flush_list);

> 

> 

> those two if() checks can be collapsed into one, since the list and the

> dev_rx field are only ever modified together. This will also be the case

> for bq->xdp_prog, so putting all three under the same check in

> bq_enqueue() and only clearing them in __dev_flush() would be a win, I

> suppose - nice catch! :)


Thanks for the advice. How about modifying it like this:

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index bc38f7193149..217e09533097 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -413,9 +413,6 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
 	bq->count = 0;
 
 	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);
-	bq->dev_rx = NULL;
-	bq->xdp_prog = NULL;
-	__list_del_clearprev(&bq->flush_node);
 	return;
 }
 
@@ -434,8 +431,12 @@ void __dev_flush(void)
 	struct list_head *flush_list = this_cpu_ptr(&dev_flush_list);
 	struct xdp_dev_bulk_queue *bq, *tmp;
 
-	list_for_each_entry_safe(bq, tmp, flush_list, flush_node)
+	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
 		bq_xmit_all(bq, XDP_XMIT_FLUSH);
+		bq->dev_rx = NULL;
+		bq->xdp_prog = NULL;
+		__list_del_clearprev(&bq->flush_node);
+	}
 }
 
 /* rcu_read_lock (from syscall and BPF contexts) ensures that if a delete and/or
@@ -469,22 +470,17 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
 	/* Ingress dev_rx will be the same for all xdp_frame's in
 	 * bulk_queue, because bq stored per-CPU and must be flushed
 	 * from net_device drivers NAPI func end.
+	 *
+	 * Do the same with xdp_prog and flush_list since these fields
+	 * are modified together.
 	 */
-	if (!bq->dev_rx)
+	if (!bq->dev_rx) {
 		bq->dev_rx = dev_rx;
-
-	/* Store (potential) xdp_prog that run before egress to dev as
-	 * part of bulk_queue.  This will be same xdp_prog for all
-	 * xdp_frame's in bulk_queue, because this per-CPU store must
-	 * be flushed from net_device drivers NAPI func end.
-	 */
-	if (!bq->xdp_prog)
 		bq->xdp_prog = xdp_prog;
+		list_add(&bq->flush_node, flush_list);
+	}
 
 	bq->q[bq->count++] = xdpf;
-
-	if (!bq->flush_node.prev)
-		list_add(&bq->flush_node, flush_list);
 }
 
 static inline int __xdp_enqueue(struct net_device *dev, struct xdp_buff *xdp,
Toke Høiland-Jørgensen Jan. 25, 2021, 11:21 a.m. UTC | #9
Hangbin Liu <liuhangbin@gmail.com> writes:

> On Fri, Jan 22, 2021 at 02:38:40PM +0100, Toke Høiland-Jørgensen wrote:

>> >>  out:

>> >> +	drops = cnt - sent;

>> >>  	bq->count = 0;

>> >>  

>> >>  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

>> >>  	bq->dev_rx = NULL;

>> >> +	bq->xdp_prog = NULL;

>> >

>> > One more question, do you really have to do that per each bq_xmit_all

>> > call? Couldn't you clear it in __dev_flush ?

>> >

>> > Or IOW - what's the rationale behind storing xdp_prog in

>> > xdp_dev_bulk_queue. Why can't you propagate the dst->xdp_prog and rely on

>> > that without that local pointer?

>> >

>> > You probably have an answer for that, so maybe include it in commit

>> > message.

>> >

>> > BTW same question for clearing dev_rx. To me this will be the same for all

>> > bq_xmit_all() calls that will happen within same napi.

>> 

>> I think you're right: When bq_xmit_all() is called from bq_enqueue(),

>> another packet will always be enqueued immediately after, so clearing

>> out all of those things in bq_xmit_all() is redundant. This also

>> includes the list_del on bq->flush_node, BTW.

>> 

>> And while we're getting into e micro-optimisations: In bq_enqueue() we

>> have two checks:

>> 

>> 	if (!bq->dev_rx)

>> 		bq->dev_rx = dev_rx;

>> 

>> 	bq->q[bq->count++] = xdpf;

>> 

>> 	if (!bq->flush_node.prev)

>> 		list_add(&bq->flush_node, flush_list);

>> 

>> 

>> those two if() checks can be collapsed into one, since the list and the

>> dev_rx field are only ever modified together. This will also be the case

>> for bq->xdp_prog, so putting all three under the same check in

>> bq_enqueue() and only clearing them in __dev_flush() would be a win, I

>> suppose - nice catch! :)

>

> Thanks for the advice, so how about modify it like:


Yup, exactly! :)

-Toke
Maciej Fijalkowski Jan. 25, 2021, 12:29 p.m. UTC | #10
On Mon, Jan 25, 2021 at 12:21:26PM +0100, Toke Høiland-Jørgensen wrote:
> Hangbin Liu <liuhangbin@gmail.com> writes:

> 

> > On Fri, Jan 22, 2021 at 02:38:40PM +0100, Toke Høiland-Jørgensen wrote:

> >> >>  out:

> >> >> +	drops = cnt - sent;

> >> >>  	bq->count = 0;

> >> >>  

> >> >>  	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, drops, err);

> >> >>  	bq->dev_rx = NULL;

> >> >> +	bq->xdp_prog = NULL;

> >> >

> >> > One more question, do you really have to do that per each bq_xmit_all

> >> > call? Couldn't you clear it in __dev_flush ?

> >> >

> >> > Or IOW - what's the rationale behind storing xdp_prog in

> >> > xdp_dev_bulk_queue. Why can't you propagate the dst->xdp_prog and rely on

> >> > that without that local pointer?

> >> >

> >> > You probably have an answer for that, so maybe include it in commit

> >> > message.

> >> >

> >> > BTW same question for clearing dev_rx. To me this will be the same for all

> >> > bq_xmit_all() calls that will happen within same napi.

> >> 

> >> I think you're right: When bq_xmit_all() is called from bq_enqueue(),

> >> another packet will always be enqueued immediately after, so clearing

> >> out all of those things in bq_xmit_all() is redundant. This also

> >> includes the list_del on bq->flush_node, BTW.

> >> 

> >> And while we're getting into e micro-optimisations: In bq_enqueue() we

> >> have two checks:

> >> 

> >> 	if (!bq->dev_rx)

> >> 		bq->dev_rx = dev_rx;

> >> 

> >> 	bq->q[bq->count++] = xdpf;

> >> 

> >> 	if (!bq->flush_node.prev)

> >> 		list_add(&bq->flush_node, flush_list);

> >> 

> >> 

> >> those two if() checks can be collapsed into one, since the list and the

> >> dev_rx field are only ever modified together. This will also be the case

> >> for bq->xdp_prog, so putting all three under the same check in

> >> bq_enqueue() and only clearing them in __dev_flush() would be a win, I

> >> suppose - nice catch! :)


Huh, nice further optimization! :) Of course I agree on that.

> >

> > Thanks for the advice, so how about modify it like:

> 

> Yup, exactly! :)

> 

> -Toke

>