[RFC,net-next,0/9] net: bridge: Forward offloading

Message ID	20210426170411.1789186-1-tobias@waldekranz.com
Headers	show Return-Path: <netdev-owner@kernel.org> From: Tobias Waldekranz <tobias@waldekranz.com> To: davem@davemloft.net, kuba@kernel.org Cc: andrew@lunn.ch, vivien.didelot@gmail.com, f.fainelli@gmail.com, olteanv@gmail.com, roopa@nvidia.com, nikolay@nvidia.com, jiri@resnulli.us, idosch@idosch.org, stephen@networkplumber.org, netdev@vger.kernel.org, bridge@lists.linux-foundation.org Subject: [RFC net-next 0/9] net: bridge: Forward offloading Date: Mon, 26 Apr 2021 19:04:02 +0200 Message-Id: <20210426170411.1789186-1-tobias@waldekranz.com> MIME-Version: 1.0 Organization: Westermo Content-Transfer-Encoding: 8bit Precedence: bulk
Series	net: bridge: Forward offloading \| expand [RFC,net-next,0/9] net: bridge: Forward offloading [RFC,net-next,1/9] net: dfwd: Constrain existing users to macvlan subordinates [RFC,net-next,2/9] net: bridge: Disambiguate offload_fwd_mark [RFC,net-next,3/9] net: bridge: switchdev: Recycle unused hwdoms [RFC,net-next,4/9] net: bridge: switchdev: Forward offloading [RFC,net-next,5/9] net: dsa: Track port PVIDs [RFC,net-next,6/9] net: dsa: Forward offloading [RFC,net-next,7/9] net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge [RFC,net-next,8/9] net: dsa: mv88e6xxx: Map virtual bridge port in PVT [RFC,net-next,9/9] net: dsa: mv88e6xxx: Forward offloading

Tobias Waldekranz April 26, 2021, 5:04 p.m. UTC

## Overview

   vlan1   vlan2
       \   /
   .-----------.
   |    br0    |
   '-----------'
   /   /   \   \
swp0 swp1 swp2 eth0
  :   :   :
  (hwdom 1)

Up to this point, switchdevs have been trusted with offloading
forwarding between bridge ports, e.g. forwarding a unicast from swp0
to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
series extends forward offloading to include some new classes of
traffic:

- Locally originating flows, i.e. packets that ingress on br0 that are
  to be forwarded to one or several of the ports swp{0,1,2}. Notably
  this also includes routed flows, e.g. a packet ingressing swp0 on
  VLAN 1 which is then routed over to VLAN 2 by the CPU and then
  forwarded to swp1 is "locally originating" from br0's point of view.

- Flows originating from "foreign" interfaces, i.e. an interface that
  is not offloaded by a particular switchdev instance. This includes
  ports belonging to other switchdev instances. A typical example
  would be flows from eth0 towards swp{0,1,2}.

The bridge still looks up its FDB/MDB as usual and then notifies the
switchdev driver that a particular skb should be offloaded if it
matches one of the classes above. It does so by using the _accel
version of dev_queue_xmit, supplying its own netdev as the
"subordinate" device. The driver can react to the presence of the
subordinate in its .ndo_select_queue in what ever way it needs to make
sure to forward the skb in much the same way that it would for packets
ingressing on regular ports.

Hardware domains to which a particular skb has been forwarded are
recorded so that duplicates are avoided.

The main performance benefit is thus seen on multicast flows. Imagine
for example that:

- An IP camera is connected to swp0 (VLAN 1)

- The CPU is acting as a multicast router, routing the group from VLAN
  1 to VLAN 2.

- There are subscribers for the group in question behind both swp1 and
  swp2 (VLAN 2).

With this offloading in place, the bridge need only send a single skb
to the driver, which will send it to the hardware marked in such a way
that the switch will perform the multicast replication according to
the MDB configuration. Naturally, the number of saved skb_clones
increase linearly with the number of subscribed ports.

As an extra benefit, on mv88e6xxx, this also allows the switch to
perform source address learning on these flows, which avoids having to
sync dynamic FDB entries over slow configuration interfaces like MDIO
to avoid flows directed towards the CPU being flooded as unknown
unicast by the switch.


## RFC

- In general, what do you think about this idea?

- hwdom. What do you think about this terminology? Personally I feel
  that we had too many things called offload_fwd_mark, and that as the
  use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
  might be useful to have a separate term for it.

- .dfwd_{add,del}_station. Am I stretching this abstraction too far,
  and if so do you have any suggestion/preference on how to signal the
  offloading from the bridge down to the switchdev driver?

- The way that flooding is implemented in br_forward.c (lazily cloning
  skbs) means that you have to mark the forwarding as completed very
  early (right after should_deliver in maybe_deliver) in order to
  avoid duplicates. Is there some way to move this decision point to a
  later stage that I am missing?

- BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
  compatible with unicast-to-multicast being used on a port. Then
  again, I think that this would also be broken for regular switchdev
  bridge offloading as this flag is not offloaded to the switchdev
  port, so there is no way for the driver to refuse it. Any ideas on
  how to handle this?


## mv88e6xxx Specifics

Since we are now only receiving a single skb for both unicast and
multicast flows, we can tag the packets with the FORWARD command
instead of FROM_CPU. The swich(es) will then forward the packet in
accordance with its ATU, VTU, STU, and PVT configuration - just like
for packets ingressing on user ports.

Crucially, FROM_CPU is still used for:

- Ports in standalone mode.

- Flows that are trapped to the CPU and software-forwarded by a
  bridge. Note that these flows match neither of the classes discussed
  in the overview.

- Packets that are sent directly to a port netdev without going
  through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
  socket.

We thus have a pretty clean separation where the data plane uses
FORWARDs and the control plane uses TO_/FROM_CPU.

The barrier between different bridges is enforced by port based VLANs
on mv88e6xxx, which in essence is a mapping from a source device/port
pair to an allowed set of egress ports. In order to have a FORWARD
frame (which carries a _source_ device/port) correctly mapped by the
PVT, we must use a unique pair for each bridge.

Fortunately, there is typically lots of unused address space in most
switch trees. When was the last time you saw an mv88e6xxx product
using more than 4 chips? Even if you found one with 16 (!) devices,
you would still have room to allocate 16*16 virtual ports to software
bridges.

Therefore, the mv88e6xxx driver will allocate a virtual device/port
pair to each bridge that it offloads. All members of the same bridge
are then configured to allow packets from this virtual port in their
PVTs.

Tobias Waldekranz (9):
  net: dfwd: Constrain existing users to macvlan subordinates
  net: bridge: Disambiguate offload_fwd_mark
  net: bridge: switchdev: Recycle unused hwdoms
  net: bridge: switchdev: Forward offloading
  net: dsa: Track port PVIDs
  net: dsa: Forward offloading
  net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge
  net: dsa: mv88e6xxx: Map virtual bridge port in PVT
  net: dsa: mv88e6xxx: Forward offloading

 MAINTAINERS                                   |   1 +
 drivers/net/dsa/mv88e6xxx/Makefile            |   1 +
 drivers/net/dsa/mv88e6xxx/chip.c              |  61 ++++++-
 drivers/net/dsa/mv88e6xxx/dst.c               | 160 ++++++++++++++++++
 drivers/net/dsa/mv88e6xxx/dst.h               |  14 ++
 .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +
 include/linux/dsa/mv88e6xxx.h                 |  13 ++
 include/net/dsa.h                             |  13 ++
 net/bridge/br_forward.c                       |  11 +-
 net/bridge/br_if.c                            |   4 +-
 net/bridge/br_private.h                       |  54 +++++-
 net/bridge/br_switchdev.c                     | 141 +++++++++++----
 net/dsa/port.c                                |  16 +-
 net/dsa/slave.c                               |  36 +++-
 net/dsa/tag_dsa.c                             |  33 +++-
 17 files changed, 510 insertions(+), 57 deletions(-)
 create mode 100644 drivers/net/dsa/mv88e6xxx/dst.c
 create mode 100644 drivers/net/dsa/mv88e6xxx/dst.h
 create mode 100644 include/linux/dsa/mv88e6xxx.h

Nikolay Aleksandrov April 27, 2021, 10:35 a.m. UTC | #1

On 26/04/2021 20:04, Tobias Waldekranz wrote:
> Allow switchdevs to forward frames from the CPU in accordance with the

> bridge configuration in the same way as is done between bridge

> ports. This means that the bridge will only send a single skb towards

> one of the ports under the switchdev's control, and expects the driver

> to deliver the packet to all eligible ports in its domain.

> 

> Primarily this improves the performance of multicast flows with

> multiple subscribers, as it allows the hardware to perform the frame

> replication.

> 

> The basic flow between the driver and the bridge is as follows:

> 

> - The switchdev accepts the offload by returning a non-null pointer

>   from .ndo_dfwd_add_station when the port is added to the bridge.

> 

> - The bridge sends offloadable skbs to one of the ports under the

>   switchdev's control using dev_queue_xmit_accel.

> 

> - The switchdev notices the offload by checking for a non-NULL

>   "sb_dev" in the core's call to .ndo_select_queue.

> 

> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

> ---

>  net/bridge/br_forward.c   | 11 +++++++-

>  net/bridge/br_private.h   | 27 ++++++++++++++++++

>  net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--

>  3 files changed, 93 insertions(+), 4 deletions(-)

> 


Hi,
Please try to find a way to reduce the number of new tests in the fast path.
This specific feature might help these devices, but the new tests hurt everybody else.
I don't mind the control plane changes, but I'd like to minimize the fast-path impact.

Do you need "accel_priv" to be a pointer, I mean can't derive sb_dev from the port alone ?
Either way - you can mark the port via its internal flags if it can accelerate, those are
used frequently and are in a hot cache line (by the way that reminds me that the port
offload mark/hwdom should be moved in the first cache line).

For example you could possibly drop fwd_accel, add the bitmap in a union with sb_dev pointer
in br_input_skb_cb which can be set at __br_forward at the accel check and pass it down to
avoid the final test. Furthermore since the hwdoms are bits and if the port accel is a bit
you could probably reduce the nbp_switchdev_can_accel() helper to one test with a few bitops.

In nbp_switchdev_allowed_egress() I'd make the hwdom tests rely on skb's offload_fwd_mark
so in the software forwarding case we could avoid them.

I might be missing something above, but we have to try and reduce these tests as much as
possible, also the port's first cache line is quite crowded so avoiding any new fields
would be best, i.e. at some point we'll move the hwdom/offload mark there to avoid pulling
in the last cache line of net_bridge_port, so it'd be best to avoid having to move accel_priv
there too.

Cheers,
 Nik

> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c

> index 6e9b049ae521..b4fb3b0bb1ec 100644

> --- a/net/bridge/br_forward.c

> +++ b/net/bridge/br_forward.c

> @@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,

>  

>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)

>  {

> +	struct net_device *sb_dev = NULL;

> +

>  	skb_push(skb, ETH_HLEN);

>  	if (!is_skb_forwardable(skb->dev, skb))

>  		goto drop;

> @@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb

>  		skb_set_network_header(skb, depth);

>  	}

>  

> -	dev_queue_xmit(skb);

> +	if (br_switchdev_accels_skb(skb))

> +		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;

> +

> +	dev_queue_xmit_accel(skb, sb_dev);

>  

>  	return 0;

>  

> @@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,

>  		indev = NULL;

>  	}

>  

> +	nbp_switchdev_frame_mark_accel(to, skb);

> +

>  	NF_HOOK(NFPROTO_BRIDGE, br_hook,

>  		net, NULL, skb, indev, skb->dev,

>  		br_forward_finish);

> @@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(

>  	if (!should_deliver(p, skb))

>  		return prev;

>  

> +	nbp_switchdev_frame_mark_fwd(p, skb);

> +

>  	if (!prev)

>  		goto out;

>  

> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h

> index aba92864d285..933e951b0d7a 100644

> --- a/net/bridge/br_private.h

> +++ b/net/bridge/br_private.h

> @@ -332,6 +332,7 @@ struct net_bridge_port {

>  #endif

>  #ifdef CONFIG_NET_SWITCHDEV

>  	int				hwdom;

> +	void				*accel_priv;

>  #endif

>  	u16				group_fwd_mask;

>  	u16				backup_redirected_cnt;

> @@ -506,7 +507,9 @@ struct br_input_skb_cb {

>  #endif

>  

>  #ifdef CONFIG_NET_SWITCHDEV

> +	u8 fwd_accel:1;

>  	int src_hwdom;

> +	br_hwdom_map_t fwd_hwdoms;

>  #endif

>  };

>  

> @@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }

>  

>  /* br_switchdev.c */

>  #ifdef CONFIG_NET_SWITCHDEV

> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)

> +{

> +	return BR_INPUT_SKB_CB(skb)->fwd_accel;

> +}

> +

> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

> +				    struct sk_buff *skb);

> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

> +				  struct sk_buff *skb);

>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>  			      struct sk_buff *skb);

>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,

> @@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)

>  	skb->offload_fwd_mark = 0;

>  }

>  #else

> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)

> +{

> +	return false;

> +}

> +

> +static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

> +						  struct sk_buff *skb)

> +{

> +}

> +

> +static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

> +						struct sk_buff *skb)

> +{

> +}

> +

>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>  					    struct sk_buff *skb)

>  {

> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c

> index 54bd7205bfb5..c903171ad291 100644

> --- a/net/bridge/br_switchdev.c

> +++ b/net/bridge/br_switchdev.c

> @@ -8,6 +8,26 @@

>  

>  #include "br_private.h"

>  

> +static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,

> +				    const struct sk_buff *skb)

> +{

> +	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);

> +}

> +

> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

> +				    struct sk_buff *skb)

> +{

> +	if (nbp_switchdev_can_accel(p, skb))

> +		BR_INPUT_SKB_CB(skb)->fwd_accel = true;

> +}

> +

> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

> +				  struct sk_buff *skb)

> +{

> +	if (nbp_switchdev_can_accel(p, skb))

> +		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);

> +}

> +

>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>  			      struct sk_buff *skb)

>  {

> @@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,

>  				  const struct sk_buff *skb)

>  {

> -	return !skb->offload_fwd_mark ||

> -	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;

> +	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);

> +

> +	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&

> +		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);

>  }

>  

>  /* Flags that can be offloaded to hardware */

> @@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)

>  	return switchdev_port_obj_del(dev, &v.obj);

>  }

>  

> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)

> +{

> +	void *priv;

> +

> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))

> +		return;

> +

> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);

> +	if (!IS_ERR_OR_NULL(priv))

> +		p->accel_priv = priv;

> +}

> +

> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)

> +{

> +	if (!p->accel_priv)

> +		return;

> +

> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);

> +	p->accel_priv = NULL;

> +}

> +

>  static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)

>  {

>  	struct net_bridge *br = joining->br;

> @@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)

>  		return err;

>  	}

>  

> -	return nbp_switchdev_hwdom_set(p);

> +	err = nbp_switchdev_hwdom_set(p);

> +	if (err)

> +		return err;

> +

> +	if (p->hwdom)

> +		nbp_switchdev_fwd_offload_add(p);

> +

> +	return 0;

>  }

>  

>  void nbp_switchdev_del(struct net_bridge_port *p)

>  {

>  	ASSERT_RTNL();

>  

> +	if (p->accel_priv)

> +		nbp_switchdev_fwd_offload_del(p);

> +

>  	if (p->hwdom)

>  		nbp_switchdev_hwdom_put(p);

>  }

>

Tobias Waldekranz April 28, 2021, 10:47 p.m. UTC | #2

On Tue, Apr 27, 2021 at 13:35, Nikolay Aleksandrov <nikolay@nvidia.com> wrote:
> On 26/04/2021 20:04, Tobias Waldekranz wrote:

>> Allow switchdevs to forward frames from the CPU in accordance with the

>> bridge configuration in the same way as is done between bridge

>> ports. This means that the bridge will only send a single skb towards

>> one of the ports under the switchdev's control, and expects the driver

>> to deliver the packet to all eligible ports in its domain.

>> 

>> Primarily this improves the performance of multicast flows with

>> multiple subscribers, as it allows the hardware to perform the frame

>> replication.

>> 

>> The basic flow between the driver and the bridge is as follows:

>> 

>> - The switchdev accepts the offload by returning a non-null pointer

>>   from .ndo_dfwd_add_station when the port is added to the bridge.

>> 

>> - The bridge sends offloadable skbs to one of the ports under the

>>   switchdev's control using dev_queue_xmit_accel.

>> 

>> - The switchdev notices the offload by checking for a non-NULL

>>   "sb_dev" in the core's call to .ndo_select_queue.

>> 

>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

>> ---

>>  net/bridge/br_forward.c   | 11 +++++++-

>>  net/bridge/br_private.h   | 27 ++++++++++++++++++

>>  net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--

>>  3 files changed, 93 insertions(+), 4 deletions(-)

>> 

>

> Hi,

> Please try to find a way to reduce the number of new tests in the fast path.

> This specific feature might help these devices, but the new tests hurt everybody else.

> I don't mind the control plane changes, but I'd like to minimize the fast-path impact.


Wholeheartedly agree.

> Do you need "accel_priv" to be a pointer, I mean can't derive sb_dev from the port alone ?

> Either way - you can mark the port via its internal flags if it can accelerate, those are

> used frequently and are in a hot cache line (by the way that reminds me that the port

> offload mark/hwdom should be moved in the first cache line).


I need to stash accel_priv somewhere as .ndo_dfwd_del_station expects it
sent back when unregistering the offload. But there is no need for it to
be part of the fast-path. Would it be ok to add a BR_FORWARD_OFFLOAD to
p->flags, which would be used in the fast-path, while also keeping
accel_priv on the port, but on a colder line?

> For example you could possibly drop fwd_accel, add the bitmap in a union with sb_dev pointer

> in br_input_skb_cb which can be set at __br_forward at the accel check and pass it down to

> avoid the final test.


Great idea! I will add it in v1.

> Furthermore since the hwdoms are bits and if the port accel is a bit

> you could probably reduce the nbp_switchdev_can_accel() helper to one test with a few bitops.


Not sure I follow. The current code has two tests:

1. Is offloading enabled on the port. (To be done using p->flags in v1)
2. Is offloading allowed for this frame to this port.

The port can be part of a hwdom that does not support forward
offloading; indeed only one driver would support it initially. So how do
I avoid having to test the conditions individually?

> In nbp_switchdev_allowed_egress() I'd make the hwdom tests rely on skb's offload_fwd_mark

> so in the software forwarding case we could avoid them.


Flipping the left and right side of the expression is possible, but I
think that would only impact the case where the frame is _not_ allowed
to egress. Is that what you mean? Otherwise you still need to test for
the condition that we have forwarded to this port's hwdom already, to
avoid sending duplicates on the wire. This is independent of
skb->offload_fwd_mark as both Rx-offloaded and non-Rx-offloaded frames
can still be Tx-offloaded to other hwdoms.

A typical example would be a broadcast frame ingressing the bridge from
eth0 in the figure from the cover letter. skb->offload_fwd_mark would
always be 0, but you still need to test fwd_hwdoms to skip over swp{1,2}
after you have sent the skb to swp0.

> I might be missing something above, but we have to try and reduce these tests as much as

> possible, also the port's first cache line is quite crowded so avoiding any new fields

> would be best, i.e. at some point we'll move the hwdom/offload mark there to avoid pulling

> in the last cache line of net_bridge_port, so it'd be best to avoid having to move accel_priv

> there too.

>

> Cheers,

>  Nik

>

>> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c

>> index 6e9b049ae521..b4fb3b0bb1ec 100644

>> --- a/net/bridge/br_forward.c

>> +++ b/net/bridge/br_forward.c

>> @@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,

>>  

>>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)

>>  {

>> +	struct net_device *sb_dev = NULL;

>> +

>>  	skb_push(skb, ETH_HLEN);

>>  	if (!is_skb_forwardable(skb->dev, skb))

>>  		goto drop;

>> @@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb

>>  		skb_set_network_header(skb, depth);

>>  	}

>>  

>> -	dev_queue_xmit(skb);

>> +	if (br_switchdev_accels_skb(skb))

>> +		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;

>> +

>> +	dev_queue_xmit_accel(skb, sb_dev);

>>  

>>  	return 0;

>>  

>> @@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,

>>  		indev = NULL;

>>  	}

>>  

>> +	nbp_switchdev_frame_mark_accel(to, skb);

>> +

>>  	NF_HOOK(NFPROTO_BRIDGE, br_hook,

>>  		net, NULL, skb, indev, skb->dev,

>>  		br_forward_finish);

>> @@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(

>>  	if (!should_deliver(p, skb))

>>  		return prev;

>>  

>> +	nbp_switchdev_frame_mark_fwd(p, skb);

>> +

>>  	if (!prev)

>>  		goto out;

>>  

>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h

>> index aba92864d285..933e951b0d7a 100644

>> --- a/net/bridge/br_private.h

>> +++ b/net/bridge/br_private.h

>> @@ -332,6 +332,7 @@ struct net_bridge_port {

>>  #endif

>>  #ifdef CONFIG_NET_SWITCHDEV

>>  	int				hwdom;

>> +	void				*accel_priv;

>>  #endif

>>  	u16				group_fwd_mask;

>>  	u16				backup_redirected_cnt;

>> @@ -506,7 +507,9 @@ struct br_input_skb_cb {

>>  #endif

>>  

>>  #ifdef CONFIG_NET_SWITCHDEV

>> +	u8 fwd_accel:1;

>>  	int src_hwdom;

>> +	br_hwdom_map_t fwd_hwdoms;

>>  #endif

>>  };

>>  

>> @@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }

>>  

>>  /* br_switchdev.c */

>>  #ifdef CONFIG_NET_SWITCHDEV

>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)

>> +{

>> +	return BR_INPUT_SKB_CB(skb)->fwd_accel;

>> +}

>> +

>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>> +				    struct sk_buff *skb);

>> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

>> +				  struct sk_buff *skb);

>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>  			      struct sk_buff *skb);

>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,

>> @@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)

>>  	skb->offload_fwd_mark = 0;

>>  }

>>  #else

>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)

>> +{

>> +	return false;

>> +}

>> +

>> +static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>> +						  struct sk_buff *skb)

>> +{

>> +}

>> +

>> +static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

>> +						struct sk_buff *skb)

>> +{

>> +}

>> +

>>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>  					    struct sk_buff *skb)

>>  {

>> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c

>> index 54bd7205bfb5..c903171ad291 100644

>> --- a/net/bridge/br_switchdev.c

>> +++ b/net/bridge/br_switchdev.c

>> @@ -8,6 +8,26 @@

>>  

>>  #include "br_private.h"

>>  

>> +static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,

>> +				    const struct sk_buff *skb)

>> +{

>> +	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);

>> +}

>> +

>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>> +				    struct sk_buff *skb)

>> +{

>> +	if (nbp_switchdev_can_accel(p, skb))

>> +		BR_INPUT_SKB_CB(skb)->fwd_accel = true;

>> +}

>> +

>> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

>> +				  struct sk_buff *skb)

>> +{

>> +	if (nbp_switchdev_can_accel(p, skb))

>> +		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);

>> +}

>> +

>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>  			      struct sk_buff *skb)

>>  {

>> @@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,

>>  				  const struct sk_buff *skb)

>>  {

>> -	return !skb->offload_fwd_mark ||

>> -	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;

>> +	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);

>> +

>> +	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&

>> +		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);

>>  }

>>  

>>  /* Flags that can be offloaded to hardware */

>> @@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)

>>  	return switchdev_port_obj_del(dev, &v.obj);

>>  }

>>  

>> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)

>> +{

>> +	void *priv;

>> +

>> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))

>> +		return;

>> +

>> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);

>> +	if (!IS_ERR_OR_NULL(priv))

>> +		p->accel_priv = priv;

>> +}

>> +

>> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)

>> +{

>> +	if (!p->accel_priv)

>> +		return;

>> +

>> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);

>> +	p->accel_priv = NULL;

>> +}

>> +

>>  static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)

>>  {

>>  	struct net_bridge *br = joining->br;

>> @@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)

>>  		return err;

>>  	}

>>  

>> -	return nbp_switchdev_hwdom_set(p);

>> +	err = nbp_switchdev_hwdom_set(p);

>> +	if (err)

>> +		return err;

>> +

>> +	if (p->hwdom)

>> +		nbp_switchdev_fwd_offload_add(p);

>> +

>> +	return 0;

>>  }

>>  

>>  void nbp_switchdev_del(struct net_bridge_port *p)

>>  {

>>  	ASSERT_RTNL();

>>  

>> +	if (p->accel_priv)

>> +		nbp_switchdev_fwd_offload_del(p);

>> +

>>  	if (p->hwdom)

>>  		nbp_switchdev_hwdom_put(p);

>>  }

>>

Nikolay Aleksandrov April 29, 2021, 9:16 a.m. UTC | #3

On 29/04/2021 01:47, Tobias Waldekranz wrote:
> On Tue, Apr 27, 2021 at 13:35, Nikolay Aleksandrov <nikolay@nvidia.com> wrote:

>> On 26/04/2021 20:04, Tobias Waldekranz wrote:

>>> Allow switchdevs to forward frames from the CPU in accordance with the

>>> bridge configuration in the same way as is done between bridge

>>> ports. This means that the bridge will only send a single skb towards

>>> one of the ports under the switchdev's control, and expects the driver

>>> to deliver the packet to all eligible ports in its domain.

>>>

>>> Primarily this improves the performance of multicast flows with

>>> multiple subscribers, as it allows the hardware to perform the frame

>>> replication.

>>>

>>> The basic flow between the driver and the bridge is as follows:

>>>

>>> - The switchdev accepts the offload by returning a non-null pointer

>>>   from .ndo_dfwd_add_station when the port is added to the bridge.

>>>

>>> - The bridge sends offloadable skbs to one of the ports under the

>>>   switchdev's control using dev_queue_xmit_accel.

>>>

>>> - The switchdev notices the offload by checking for a non-NULL

>>>   "sb_dev" in the core's call to .ndo_select_queue.

>>>

>>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

>>> ---

>>>  net/bridge/br_forward.c   | 11 +++++++-

>>>  net/bridge/br_private.h   | 27 ++++++++++++++++++

>>>  net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--

>>>  3 files changed, 93 insertions(+), 4 deletions(-)

>>>

>>

>> Hi,

>> Please try to find a way to reduce the number of new tests in the fast path.

>> This specific feature might help these devices, but the new tests hurt everybody else.

>> I don't mind the control plane changes, but I'd like to minimize the fast-path impact.

> 

> Wholeheartedly agree.

> 

>> Do you need "accel_priv" to be a pointer, I mean can't derive sb_dev from the port alone ?

>> Either way - you can mark the port via its internal flags if it can accelerate, those are

>> used frequently and are in a hot cache line (by the way that reminds me that the port

>> offload mark/hwdom should be moved in the first cache line).

> 

> I need to stash accel_priv somewhere as .ndo_dfwd_del_station expects it

> sent back when unregistering the offload. But there is no need for it to

> be part of the fast-path. Would it be ok to add a BR_FORWARD_OFFLOAD to

> p->flags, which would be used in the fast-path, while also keeping

> accel_priv on the port, but on a colder line?

> 


About the flag - yes, that is what I'm proposing. Use an internal port flag for it
and for the tests.

About the pointer - it is certainly not appropriate to use net_bridge_port for a void pointer
coming in from a driver or external place. I see that it's always the bridge device so can
we just compare the result of the add op to the bridge dev and set only the flag based on it?
Then on the del path if the flag is set we know it's the bridge and use it as sb_dev.

>> For example you could possibly drop fwd_accel, add the bitmap in a union with sb_dev pointer

>> in br_input_skb_cb which can be set at __br_forward at the accel check and pass it down to

>> avoid the final test.

> 

> Great idea! I will add it in v1.

> 

Actually you can also use a static key to avoid all checks and effects of this feature on
the bridge fast-path. You can enable it when the first device that can accelerate shows
up and disable it when the last one leaves.

By the way __br_forward() can take one more argument (sb_dev) and always set it because
that cache line is dirty anyway due to the tstamp zeroing, but that doesn't matter if
static keys are used. Just noting it. :)

>> Furthermore since the hwdoms are bits and if the port accel is a bit

>> you could probably reduce the nbp_switchdev_can_accel() helper to one test with a few bitops.

> 

> Not sure I follow. The current code has two tests:

> 

> 1. Is offloading enabled on the port. (To be done using p->flags in v1)

> 2. Is offloading allowed for this frame to this port.

> 

> The port can be part of a hwdom that does not support forward

> offloading; indeed only one driver would support it initially. So how do

> I avoid having to test the conditions individually?

> 


Coming to think of it with the port bit test first it should be fine.


For the sake of fun here's one way that can turn it into one test,
obviously I haven't tested anything:
 - reserve a few most significant bits of hwdom as "feature" bits, say 4
 - add a "feature" bit which encodes ability to accelerate
 => test becomes p->hwdom | (src_hwdom & hwdom_bitmask) > (accel feature bit) | p->hwdom
It's very hacky and _not_ to be used. :)

>> In nbp_switchdev_allowed_egress() I'd make the hwdom tests rely on skb's offload_fwd_mark

>> so in the software forwarding case we could avoid them.

> 

> Flipping the left and right side of the expression is possible, but I

> think that would only impact the case where the frame is _not_ allowed

> to egress. Is that what you mean? Otherwise you still need to test for

> the condition that we have forwarded to this port's hwdom already, to

> avoid sending duplicates on the wire. This is independent of

> skb->offload_fwd_mark as both Rx-offloaded and non-Rx-offloaded frames

> can still be Tx-offloaded to other hwdoms.

> 

> A typical example would be a broadcast frame ingressing the bridge from

> eth0 in the figure from the cover letter. skb->offload_fwd_mark would

> always be 0, but you still need to test fwd_hwdoms to skip over swp{1,2}

> after you have sent the skb to swp0.

> 


Yeah, you're right I was still thinking only of offloaded skbs, didn't consider
mixing the two.

>> I might be missing something above, but we have to try and reduce these tests as much as

>> possible, also the port's first cache line is quite crowded so avoiding any new fields

>> would be best, i.e. at some point we'll move the hwdom/offload mark there to avoid pulling

>> in the last cache line of net_bridge_port, so it'd be best to avoid having to move accel_priv

>> there too.

>>

>> Cheers,

>>  Nik

>>

>>> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c

>>> index 6e9b049ae521..b4fb3b0bb1ec 100644

>>> --- a/net/bridge/br_forward.c

>>> +++ b/net/bridge/br_forward.c

>>> @@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,

>>>  

>>>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)

>>>  {

>>> +	struct net_device *sb_dev = NULL;

>>> +

>>>  	skb_push(skb, ETH_HLEN);

>>>  	if (!is_skb_forwardable(skb->dev, skb))

>>>  		goto drop;

>>> @@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb

>>>  		skb_set_network_header(skb, depth);

>>>  	}

>>>  

>>> -	dev_queue_xmit(skb);

>>> +	if (br_switchdev_accels_skb(skb))

>>> +		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;

>>> +

>>> +	dev_queue_xmit_accel(skb, sb_dev);

>>>  

>>>  	return 0;

>>>  

>>> @@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,

>>>  		indev = NULL;

>>>  	}

>>>  

>>> +	nbp_switchdev_frame_mark_accel(to, skb);

>>> +

>>>  	NF_HOOK(NFPROTO_BRIDGE, br_hook,

>>>  		net, NULL, skb, indev, skb->dev,

>>>  		br_forward_finish);

>>> @@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(

>>>  	if (!should_deliver(p, skb))

>>>  		return prev;

>>>  

>>> +	nbp_switchdev_frame_mark_fwd(p, skb);

>>> +

>>>  	if (!prev)

>>>  		goto out;

>>>  

>>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h

>>> index aba92864d285..933e951b0d7a 100644

>>> --- a/net/bridge/br_private.h

>>> +++ b/net/bridge/br_private.h

>>> @@ -332,6 +332,7 @@ struct net_bridge_port {

>>>  #endif

>>>  #ifdef CONFIG_NET_SWITCHDEV

>>>  	int				hwdom;

>>> +	void				*accel_priv;

>>>  #endif

>>>  	u16				group_fwd_mask;

>>>  	u16				backup_redirected_cnt;

>>> @@ -506,7 +507,9 @@ struct br_input_skb_cb {

>>>  #endif

>>>  

>>>  #ifdef CONFIG_NET_SWITCHDEV

>>> +	u8 fwd_accel:1;

>>>  	int src_hwdom;

>>> +	br_hwdom_map_t fwd_hwdoms;

>>>  #endif

>>>  };

>>>  

>>> @@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }

>>>  

>>>  /* br_switchdev.c */

>>>  #ifdef CONFIG_NET_SWITCHDEV

>>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)

>>> +{

>>> +	return BR_INPUT_SKB_CB(skb)->fwd_accel;

>>> +}

>>> +

>>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>>> +				    struct sk_buff *skb);

>>> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

>>> +				  struct sk_buff *skb);

>>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>>  			      struct sk_buff *skb);

>>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,

>>> @@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)

>>>  	skb->offload_fwd_mark = 0;

>>>  }

>>>  #else

>>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)

>>> +{

>>> +	return false;

>>> +}

>>> +

>>> +static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>>> +						  struct sk_buff *skb)

>>> +{

>>> +}

>>> +

>>> +static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

>>> +						struct sk_buff *skb)

>>> +{

>>> +}

>>> +

>>>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>>  					    struct sk_buff *skb)

>>>  {

>>> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c

>>> index 54bd7205bfb5..c903171ad291 100644

>>> --- a/net/bridge/br_switchdev.c

>>> +++ b/net/bridge/br_switchdev.c

>>> @@ -8,6 +8,26 @@

>>>  

>>>  #include "br_private.h"

>>>  

>>> +static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,

>>> +				    const struct sk_buff *skb)

>>> +{

>>> +	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);

>>> +}

>>> +

>>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>>> +				    struct sk_buff *skb)

>>> +{

>>> +	if (nbp_switchdev_can_accel(p, skb))

>>> +		BR_INPUT_SKB_CB(skb)->fwd_accel = true;

>>> +}

>>> +

>>> +void  (const struct net_bridge_port *p,

>>> +				  struct sk_buff *skb)

>>> +{

>>> +	if (nbp_switchdev_can_accel(p, skb))

>>> +		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);

>>> +}

>>> +

>>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>>  			      struct sk_buff *skb)

>>>  {

>>> @@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,

>>>  				  const struct sk_buff *skb)

>>>  {

>>> -	return !skb->offload_fwd_mark ||

>>> -	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;

>>> +	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);

>>> +

>>> +	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&

>>> +		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);

>>>  }

>>>  

>>>  /* Flags that can be offloaded to hardware */

>>> @@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)

>>>  	return switchdev_port_obj_del(dev, &v.obj);

>>>  }

>>>  

>>> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)

>>> +{

>>> +	void *priv;

>>> +

>>> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))

>>> +		return;

>>> +

>>> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);

>>> +	if (!IS_ERR_OR_NULL(priv))

>>> +		p->accel_priv = priv;

>>> +}

>>> +

>>> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)

>>> +{

>>> +	if (!p->accel_priv)

>>> +		return;

>>> +

>>> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);

>>> +	p->accel_priv = NULL;

>>> +}

>>> +

>>>  static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)

>>>  {

>>>  	struct net_bridge *br = joining->br;

>>> @@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)

>>>  		return err;

>>>  	}

>>>  

>>> -	return nbp_switchdev_hwdom_set(p);

>>> +	err = nbp_switchdev_hwdom_set(p);

>>> +	if (err)

>>> +		return err;

>>> +

>>> +	if (p->hwdom)

>>> +		nbp_switchdev_fwd_offload_add(p);

>>> +

>>> +	return 0;

>>>  }

>>>  

>>>  void nbp_switchdev_del(struct net_bridge_port *p)

>>>  {

>>>  	ASSERT_RTNL();

>>>  

>>> +	if (p->accel_priv)

>>> +		nbp_switchdev_fwd_offload_del(p);

>>> +

>>>  	if (p->hwdom)

>>>  		nbp_switchdev_hwdom_put(p);

>>>  }

>>>

Tobias Waldekranz April 29, 2021, 2:55 p.m. UTC | #4

On Thu, Apr 29, 2021 at 12:16, Nikolay Aleksandrov <nikolay@nvidia.com> wrote:
> On 29/04/2021 01:47, Tobias Waldekranz wrote:

>> On Tue, Apr 27, 2021 at 13:35, Nikolay Aleksandrov <nikolay@nvidia.com> wrote:

>>> On 26/04/2021 20:04, Tobias Waldekranz wrote:

>>>> Allow switchdevs to forward frames from the CPU in accordance with the

>>>> bridge configuration in the same way as is done between bridge

>>>> ports. This means that the bridge will only send a single skb towards

>>>> one of the ports under the switchdev's control, and expects the driver

>>>> to deliver the packet to all eligible ports in its domain.

>>>>

>>>> Primarily this improves the performance of multicast flows with

>>>> multiple subscribers, as it allows the hardware to perform the frame

>>>> replication.

>>>>

>>>> The basic flow between the driver and the bridge is as follows:

>>>>

>>>> - The switchdev accepts the offload by returning a non-null pointer

>>>>   from .ndo_dfwd_add_station when the port is added to the bridge.

>>>>

>>>> - The bridge sends offloadable skbs to one of the ports under the

>>>>   switchdev's control using dev_queue_xmit_accel.

>>>>

>>>> - The switchdev notices the offload by checking for a non-NULL

>>>>   "sb_dev" in the core's call to .ndo_select_queue.

>>>>

>>>> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

>>>> ---

>>>>  net/bridge/br_forward.c   | 11 +++++++-

>>>>  net/bridge/br_private.h   | 27 ++++++++++++++++++

>>>>  net/bridge/br_switchdev.c | 59 +++++++++++++++++++++++++++++++++++++--

>>>>  3 files changed, 93 insertions(+), 4 deletions(-)

>>>>

>>>

>>> Hi,

>>> Please try to find a way to reduce the number of new tests in the fast path.

>>> This specific feature might help these devices, but the new tests hurt everybody else.

>>> I don't mind the control plane changes, but I'd like to minimize the fast-path impact.

>> 

>> Wholeheartedly agree.

>> 

>>> Do you need "accel_priv" to be a pointer, I mean can't derive sb_dev from the port alone ?

>>> Either way - you can mark the port via its internal flags if it can accelerate, those are

>>> used frequently and are in a hot cache line (by the way that reminds me that the port

>>> offload mark/hwdom should be moved in the first cache line).

>> 

>> I need to stash accel_priv somewhere as .ndo_dfwd_del_station expects it

>> sent back when unregistering the offload. But there is no need for it to

>> be part of the fast-path. Would it be ok to add a BR_FORWARD_OFFLOAD to

>> p->flags, which would be used in the fast-path, while also keeping

>> accel_priv on the port, but on a colder line?

>> 

>

> About the flag - yes, that is what I'm proposing. Use an internal port flag for it

> and for the tests.

>

> About the pointer - it is certainly not appropriate to use net_bridge_port for a void pointer

> coming in from a driver or external place. I see that it's always the bridge device so can

> we just compare the result of the add op to the bridge dev and set only the flag based on it?

> Then on the del path if the flag is set we know it's the bridge and use it as sb_dev.


I am not an expert on this API, but I believe that the returned pointer
is supposed to be treated as an opaque cookie that the driver can use to
associate with anything it likes. On mv88e6xxx I saw no need for it, so
I just returned the device pointer to indicate success. But other
drivers might have use of it if we add a getter - similar to
macvlan_accel_priv.

Now that I think of it, the DSA tagger might also benefit from it. It
would avoid lots of pointer chasing on egress. I will probably try this
out in v1 and see what you think.

>>> For example you could possibly drop fwd_accel, add the bitmap in a union with sb_dev pointer

>>> in br_input_skb_cb which can be set at __br_forward at the accel check and pass it down to

>>> avoid the final test.

>> 

>> Great idea! I will add it in v1.

>> 

> Actually you can also use a static key to avoid all checks and effects of this feature on

> the bridge fast-path. You can enable it when the first device that can accelerate shows

> up and disable it when the last one leaves.


That would be great. Never used the static key stuff before, but it
looks pretty straightforward. These are always global, right?  Since you
rewrite the actual .text when you toggle the keys? So it would have to
be enabled as soon as an offloadable port joins any bridge in the
system. Anyway, I will read up on it and add it in v1.

> By the way __br_forward() can take one more argument (sb_dev) and always set it because

> that cache line is dirty anyway due to the tstamp zeroing, but that doesn't matter if

> static keys are used. Just noting it. :)

>

>>> Furthermore since the hwdoms are bits and if the port accel is a bit

>>> you could probably reduce the nbp_switchdev_can_accel() helper to one test with a few bitops.

>> 

>> Not sure I follow. The current code has two tests:

>> 

>> 1. Is offloading enabled on the port. (To be done using p->flags in v1)

>> 2. Is offloading allowed for this frame to this port.

>> 

>> The port can be part of a hwdom that does not support forward

>> offloading; indeed only one driver would support it initially. So how do

>> I avoid having to test the conditions individually?

>> 

>

> Coming to think of it with the port bit test first it should be fine.

>

>

> For the sake of fun here's one way that can turn it into one test,

> obviously I haven't tested anything:

>  - reserve a few most significant bits of hwdom as "feature" bits, say 4

>  - add a "feature" bit which encodes ability to accelerate

>  => test becomes p->hwdom | (src_hwdom & hwdom_bitmask) > (accel feature bit) | p->hwdom

> It's very hacky and _not_ to be used. :)


I stand corrected - and in awe :)

>>> In nbp_switchdev_allowed_egress() I'd make the hwdom tests rely on skb's offload_fwd_mark

>>> so in the software forwarding case we could avoid them.

>> 

>> Flipping the left and right side of the expression is possible, but I

>> think that would only impact the case where the frame is _not_ allowed

>> to egress. Is that what you mean? Otherwise you still need to test for

>> the condition that we have forwarded to this port's hwdom already, to

>> avoid sending duplicates on the wire. This is independent of

>> skb->offload_fwd_mark as both Rx-offloaded and non-Rx-offloaded frames

>> can still be Tx-offloaded to other hwdoms.

>> 

>> A typical example would be a broadcast frame ingressing the bridge from

>> eth0 in the figure from the cover letter. skb->offload_fwd_mark would

>> always be 0, but you still need to test fwd_hwdoms to skip over swp{1,2}

>> after you have sent the skb to swp0.

>> 

>

> Yeah, you're right I was still thinking only of offloaded skbs, didn't consider

> mixing the two.

>

>>> I might be missing something above, but we have to try and reduce these tests as much as

>>> possible, also the port's first cache line is quite crowded so avoiding any new fields

>>> would be best, i.e. at some point we'll move the hwdom/offload mark there to avoid pulling

>>> in the last cache line of net_bridge_port, so it'd be best to avoid having to move accel_priv

>>> there too.

>>>

>>> Cheers,

>>>  Nik

>>>

>>>> diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c

>>>> index 6e9b049ae521..b4fb3b0bb1ec 100644

>>>> --- a/net/bridge/br_forward.c

>>>> +++ b/net/bridge/br_forward.c

>>>> @@ -32,6 +32,8 @@ static inline int should_deliver(const struct net_bridge_port *p,

>>>>  

>>>>  int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb)

>>>>  {

>>>> +	struct net_device *sb_dev = NULL;

>>>> +

>>>>  	skb_push(skb, ETH_HLEN);

>>>>  	if (!is_skb_forwardable(skb->dev, skb))

>>>>  		goto drop;

>>>> @@ -48,7 +50,10 @@ int br_dev_queue_push_xmit(struct net *net, struct sock *sk, struct sk_buff *skb

>>>>  		skb_set_network_header(skb, depth);

>>>>  	}

>>>>  

>>>> -	dev_queue_xmit(skb);

>>>> +	if (br_switchdev_accels_skb(skb))

>>>> +		sb_dev = BR_INPUT_SKB_CB(skb)->brdev;

>>>> +

>>>> +	dev_queue_xmit_accel(skb, sb_dev);

>>>>  

>>>>  	return 0;

>>>>  

>>>> @@ -105,6 +110,8 @@ static void __br_forward(const struct net_bridge_port *to,

>>>>  		indev = NULL;

>>>>  	}

>>>>  

>>>> +	nbp_switchdev_frame_mark_accel(to, skb);

>>>> +

>>>>  	NF_HOOK(NFPROTO_BRIDGE, br_hook,

>>>>  		net, NULL, skb, indev, skb->dev,

>>>>  		br_forward_finish);

>>>> @@ -174,6 +181,8 @@ static struct net_bridge_port *maybe_deliver(

>>>>  	if (!should_deliver(p, skb))

>>>>  		return prev;

>>>>  

>>>> +	nbp_switchdev_frame_mark_fwd(p, skb);

>>>> +

>>>>  	if (!prev)

>>>>  		goto out;

>>>>  

>>>> diff --git a/net/bridge/br_private.h b/net/bridge/br_private.h

>>>> index aba92864d285..933e951b0d7a 100644

>>>> --- a/net/bridge/br_private.h

>>>> +++ b/net/bridge/br_private.h

>>>> @@ -332,6 +332,7 @@ struct net_bridge_port {

>>>>  #endif

>>>>  #ifdef CONFIG_NET_SWITCHDEV

>>>>  	int				hwdom;

>>>> +	void				*accel_priv;

>>>>  #endif

>>>>  	u16				group_fwd_mask;

>>>>  	u16				backup_redirected_cnt;

>>>> @@ -506,7 +507,9 @@ struct br_input_skb_cb {

>>>>  #endif

>>>>  

>>>>  #ifdef CONFIG_NET_SWITCHDEV

>>>> +	u8 fwd_accel:1;

>>>>  	int src_hwdom;

>>>> +	br_hwdom_map_t fwd_hwdoms;

>>>>  #endif

>>>>  };

>>>>  

>>>> @@ -1597,6 +1600,15 @@ static inline void br_sysfs_delbr(struct net_device *dev) { return; }

>>>>  

>>>>  /* br_switchdev.c */

>>>>  #ifdef CONFIG_NET_SWITCHDEV

>>>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)

>>>> +{

>>>> +	return BR_INPUT_SKB_CB(skb)->fwd_accel;

>>>> +}

>>>> +

>>>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>>>> +				    struct sk_buff *skb);

>>>> +void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

>>>> +				  struct sk_buff *skb);

>>>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>>>  			      struct sk_buff *skb);

>>>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,

>>>> @@ -1619,6 +1631,21 @@ static inline void br_switchdev_frame_unmark(struct sk_buff *skb)

>>>>  	skb->offload_fwd_mark = 0;

>>>>  }

>>>>  #else

>>>> +static inline bool br_switchdev_accels_skb(struct sk_buff *skb)

>>>> +{

>>>> +	return false;

>>>> +}

>>>> +

>>>> +static inline void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>>>> +						  struct sk_buff *skb)

>>>> +{

>>>> +}

>>>> +

>>>> +static inline void nbp_switchdev_frame_mark_fwd(const struct net_bridge_port *p,

>>>> +						struct sk_buff *skb)

>>>> +{

>>>> +}

>>>> +

>>>>  static inline void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>>>  					    struct sk_buff *skb)

>>>>  {

>>>> diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c

>>>> index 54bd7205bfb5..c903171ad291 100644

>>>> --- a/net/bridge/br_switchdev.c

>>>> +++ b/net/bridge/br_switchdev.c

>>>> @@ -8,6 +8,26 @@

>>>>  

>>>>  #include "br_private.h"

>>>>  

>>>> +static bool nbp_switchdev_can_accel(const struct net_bridge_port *p,

>>>> +				    const struct sk_buff *skb)

>>>> +{

>>>> +	return p->accel_priv && (p->hwdom != BR_INPUT_SKB_CB(skb)->src_hwdom);

>>>> +}

>>>> +

>>>> +void nbp_switchdev_frame_mark_accel(const struct net_bridge_port *p,

>>>> +				    struct sk_buff *skb)

>>>> +{

>>>> +	if (nbp_switchdev_can_accel(p, skb))

>>>> +		BR_INPUT_SKB_CB(skb)->fwd_accel = true;

>>>> +}

>>>> +

>>>> +void  (const struct net_bridge_port *p,

>>>> +				  struct sk_buff *skb)

>>>> +{

>>>> +	if (nbp_switchdev_can_accel(p, skb))

>>>> +		set_bit(p->hwdom, BR_INPUT_SKB_CB(skb)->fwd_hwdoms);

>>>> +}

>>>> +

>>>>  void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>>>  			      struct sk_buff *skb)

>>>>  {

>>>> @@ -18,8 +38,10 @@ void nbp_switchdev_frame_mark(const struct net_bridge_port *p,

>>>>  bool nbp_switchdev_allowed_egress(const struct net_bridge_port *p,

>>>>  				  const struct sk_buff *skb)

>>>>  {

>>>> -	return !skb->offload_fwd_mark ||

>>>> -	       BR_INPUT_SKB_CB(skb)->src_hwdom != p->hwdom;

>>>> +	struct br_input_skb_cb *cb = BR_INPUT_SKB_CB(skb);

>>>> +

>>>> +	return !test_bit(p->hwdom, cb->fwd_hwdoms) &&

>>>> +		(!skb->offload_fwd_mark || cb->src_hwdom != p->hwdom);

>>>>  }

>>>>  

>>>>  /* Flags that can be offloaded to hardware */

>>>> @@ -125,6 +147,27 @@ int br_switchdev_port_vlan_del(struct net_device *dev, u16 vid)

>>>>  	return switchdev_port_obj_del(dev, &v.obj);

>>>>  }

>>>>  

>>>> +static void nbp_switchdev_fwd_offload_add(struct net_bridge_port *p)

>>>> +{

>>>> +	void *priv;

>>>> +

>>>> +	if (!(p->dev->features & NETIF_F_HW_L2FW_DOFFLOAD))

>>>> +		return;

>>>> +

>>>> +	priv = p->dev->netdev_ops->ndo_dfwd_add_station(p->dev, p->br->dev);

>>>> +	if (!IS_ERR_OR_NULL(priv))

>>>> +		p->accel_priv = priv;

>>>> +}

>>>> +

>>>> +static void nbp_switchdev_fwd_offload_del(struct net_bridge_port *p)

>>>> +{

>>>> +	if (!p->accel_priv)

>>>> +		return;

>>>> +

>>>> +	p->dev->netdev_ops->ndo_dfwd_del_station(p->dev, p->accel_priv);

>>>> +	p->accel_priv = NULL;

>>>> +}

>>>> +

>>>>  static int nbp_switchdev_hwdom_set(struct net_bridge_port *joining)

>>>>  {

>>>>  	struct net_bridge *br = joining->br;

>>>> @@ -176,13 +219,23 @@ int nbp_switchdev_add(struct net_bridge_port *p)

>>>>  		return err;

>>>>  	}

>>>>  

>>>> -	return nbp_switchdev_hwdom_set(p);

>>>> +	err = nbp_switchdev_hwdom_set(p);

>>>> +	if (err)

>>>> +		return err;

>>>> +

>>>> +	if (p->hwdom)

>>>> +		nbp_switchdev_fwd_offload_add(p);

>>>> +

>>>> +	return 0;

>>>>  }

>>>>  

>>>>  void nbp_switchdev_del(struct net_bridge_port *p)

>>>>  {

>>>>  	ASSERT_RTNL();

>>>>  

>>>> +	if (p->accel_priv)

>>>> +		nbp_switchdev_fwd_offload_del(p);

>>>> +

>>>>  	if (p->hwdom)

>>>>  		nbp_switchdev_hwdom_put(p);

>>>>  }

>>>>

Ido Schimmel May 2, 2021, 2:58 p.m. UTC | #5

On Mon, Apr 26, 2021 at 07:04:02PM +0200, Tobias Waldekranz wrote:
> ## Overview

> 

>    vlan1   vlan2

>        \   /

>    .-----------.

>    |    br0    |

>    '-----------'

>    /   /   \   \

> swp0 swp1 swp2 eth0

>   :   :   :

>   (hwdom 1)

> 

> Up to this point, switchdevs have been trusted with offloading

> forwarding between bridge ports, e.g. forwarding a unicast from swp0

> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This

> series extends forward offloading to include some new classes of

> traffic:

> 

> - Locally originating flows, i.e. packets that ingress on br0 that are

>   to be forwarded to one or several of the ports swp{0,1,2}. Notably

>   this also includes routed flows, e.g. a packet ingressing swp0 on

>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then

>   forwarded to swp1 is "locally originating" from br0's point of view.

> 

> - Flows originating from "foreign" interfaces, i.e. an interface that

>   is not offloaded by a particular switchdev instance. This includes

>   ports belonging to other switchdev instances. A typical example

>   would be flows from eth0 towards swp{0,1,2}.

> 

> The bridge still looks up its FDB/MDB as usual and then notifies the

> switchdev driver that a particular skb should be offloaded if it

> matches one of the classes above. It does so by using the _accel

> version of dev_queue_xmit, supplying its own netdev as the

> "subordinate" device. The driver can react to the presence of the

> subordinate in its .ndo_select_queue in what ever way it needs to make

> sure to forward the skb in much the same way that it would for packets

> ingressing on regular ports.

> 

> Hardware domains to which a particular skb has been forwarded are

> recorded so that duplicates are avoided.

> 

> The main performance benefit is thus seen on multicast flows. Imagine

> for example that:

> 

> - An IP camera is connected to swp0 (VLAN 1)

> 

> - The CPU is acting as a multicast router, routing the group from VLAN

>   1 to VLAN 2.

> 

> - There are subscribers for the group in question behind both swp1 and

>   swp2 (VLAN 2).


IIUC, this falls under the first use case ("Locally originating flows").
Do you have a need for this optimization in the forwarding case? Asking
because it might allow us to avoid unnecessary modifications to the
forwarding path. I have yet to look at the code, so maybe it's not a big
deal.

> 

> With this offloading in place, the bridge need only send a single skb

> to the driver, which will send it to the hardware marked in such a way

> that the switch will perform the multicast replication according to

> the MDB configuration. Naturally, the number of saved skb_clones

> increase linearly with the number of subscribed ports.


Yes, this is clear. FWIW, Spectrum has something similar. You can send
packets as either "data" or "control". Data packets are injected via the
CPU port and forwarded according to the hardware database. Control
packets are sent as-is via the specified front panel port, bypassing the
hardware data path. mlxsw is always sending packets as "control".

> 

> As an extra benefit, on mv88e6xxx, this also allows the switch to

> perform source address learning on these flows, which avoids having to

> sync dynamic FDB entries over slow configuration interfaces like MDIO

> to avoid flows directed towards the CPU being flooded as unknown

> unicast by the switch.


Since you are not syncing FDBs, it is possible that you are needlessly
flooding locally generated packets. This optimization avoids it.

> 

> 

> ## RFC

> 

> - In general, what do you think about this idea?


Looks sane to me

> 

> - hwdom. What do you think about this terminology? Personally I feel

>   that we had too many things called offload_fwd_mark, and that as the

>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it

>   might be useful to have a separate term for it.


Sounds OK

> 

> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,

>   and if so do you have any suggestion/preference on how to signal the

>   offloading from the bridge down to the switchdev driver?


I was not aware of this interface before the RFC, but your use case
seems to fit the kdoc: "Called by upper layer devices to accelerate
switching or other station functionality into hardware".

Do you expect this optimization to only work when physical netdevs are
enslaved to the bridge? What about LAG/VLANs?

> 

> - The way that flooding is implemented in br_forward.c (lazily cloning

>   skbs) means that you have to mark the forwarding as completed very

>   early (right after should_deliver in maybe_deliver) in order to

>   avoid duplicates. Is there some way to move this decision point to a

>   later stage that I am missing?

> 

> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not

>   compatible with unicast-to-multicast being used on a port. Then

>   again, I think that this would also be broken for regular switchdev

>   bridge offloading as this flag is not offloaded to the switchdev

>   port, so there is no way for the driver to refuse it. Any ideas on

>   how to handle this?

> 

> 

> ## mv88e6xxx Specifics

> 

> Since we are now only receiving a single skb for both unicast and

> multicast flows, we can tag the packets with the FORWARD command

> instead of FROM_CPU. The swich(es) will then forward the packet in

> accordance with its ATU, VTU, STU, and PVT configuration - just like

> for packets ingressing on user ports.

> 

> Crucially, FROM_CPU is still used for:

> 

> - Ports in standalone mode.

> 

> - Flows that are trapped to the CPU and software-forwarded by a

>   bridge. Note that these flows match neither of the classes discussed

>   in the overview.

> 

> - Packets that are sent directly to a port netdev without going

>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET

>   socket.

> 

> We thus have a pretty clean separation where the data plane uses

> FORWARDs and the control plane uses TO_/FROM_CPU.

> 

> The barrier between different bridges is enforced by port based VLANs

> on mv88e6xxx, which in essence is a mapping from a source device/port

> pair to an allowed set of egress ports. In order to have a FORWARD

> frame (which carries a _source_ device/port) correctly mapped by the

> PVT, we must use a unique pair for each bridge.

> 

> Fortunately, there is typically lots of unused address space in most

> switch trees. When was the last time you saw an mv88e6xxx product

> using more than 4 chips? Even if you found one with 16 (!) devices,

> you would still have room to allocate 16*16 virtual ports to software

> bridges.

> 

> Therefore, the mv88e6xxx driver will allocate a virtual device/port

> pair to each bridge that it offloads. All members of the same bridge

> are then configured to allow packets from this virtual port in their

> PVTs.

> 

> Tobias Waldekranz (9):

>   net: dfwd: Constrain existing users to macvlan subordinates

>   net: bridge: Disambiguate offload_fwd_mark

>   net: bridge: switchdev: Recycle unused hwdoms

>   net: bridge: switchdev: Forward offloading

>   net: dsa: Track port PVIDs

>   net: dsa: Forward offloading

>   net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge

>   net: dsa: mv88e6xxx: Map virtual bridge port in PVT

>   net: dsa: mv88e6xxx: Forward offloading

> 

>  MAINTAINERS                                   |   1 +

>  drivers/net/dsa/mv88e6xxx/Makefile            |   1 +

>  drivers/net/dsa/mv88e6xxx/chip.c              |  61 ++++++-

>  drivers/net/dsa/mv88e6xxx/dst.c               | 160 ++++++++++++++++++

>  drivers/net/dsa/mv88e6xxx/dst.h               |  14 ++

>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +

>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +

>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +

>  include/linux/dsa/mv88e6xxx.h                 |  13 ++

>  include/net/dsa.h                             |  13 ++

>  net/bridge/br_forward.c                       |  11 +-

>  net/bridge/br_if.c                            |   4 +-

>  net/bridge/br_private.h                       |  54 +++++-

>  net/bridge/br_switchdev.c                     | 141 +++++++++++----

>  net/dsa/port.c                                |  16 +-

>  net/dsa/slave.c                               |  36 +++-

>  net/dsa/tag_dsa.c                             |  33 +++-

>  17 files changed, 510 insertions(+), 57 deletions(-)

>  create mode 100644 drivers/net/dsa/mv88e6xxx/dst.c

>  create mode 100644 drivers/net/dsa/mv88e6xxx/dst.h

>  create mode 100644 include/linux/dsa/mv88e6xxx.h

> 

> -- 

> 2.25.1

>

Tobias Waldekranz May 3, 2021, 9:44 a.m. UTC | #6

On Sun, May 02, 2021 at 17:58, Ido Schimmel <idosch@idosch.org> wrote:
> On Mon, Apr 26, 2021 at 07:04:02PM +0200, Tobias Waldekranz wrote:

>> ## Overview

>> 

>>    vlan1   vlan2

>>        \   /

>>    .-----------.

>>    |    br0    |

>>    '-----------'

>>    /   /   \   \

>> swp0 swp1 swp2 eth0

>>   :   :   :

>>   (hwdom 1)

>> 

>> Up to this point, switchdevs have been trusted with offloading

>> forwarding between bridge ports, e.g. forwarding a unicast from swp0

>> to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This

>> series extends forward offloading to include some new classes of

>> traffic:

>> 

>> - Locally originating flows, i.e. packets that ingress on br0 that are

>>   to be forwarded to one or several of the ports swp{0,1,2}. Notably

>>   this also includes routed flows, e.g. a packet ingressing swp0 on

>>   VLAN 1 which is then routed over to VLAN 2 by the CPU and then

>>   forwarded to swp1 is "locally originating" from br0's point of view.

>> 

>> - Flows originating from "foreign" interfaces, i.e. an interface that

>>   is not offloaded by a particular switchdev instance. This includes

>>   ports belonging to other switchdev instances. A typical example

>>   would be flows from eth0 towards swp{0,1,2}.

>> 

>> The bridge still looks up its FDB/MDB as usual and then notifies the

>> switchdev driver that a particular skb should be offloaded if it

>> matches one of the classes above. It does so by using the _accel

>> version of dev_queue_xmit, supplying its own netdev as the

>> "subordinate" device. The driver can react to the presence of the

>> subordinate in its .ndo_select_queue in what ever way it needs to make

>> sure to forward the skb in much the same way that it would for packets

>> ingressing on regular ports.

>> 

>> Hardware domains to which a particular skb has been forwarded are

>> recorded so that duplicates are avoided.

>> 

>> The main performance benefit is thus seen on multicast flows. Imagine

>> for example that:

>> 

>> - An IP camera is connected to swp0 (VLAN 1)

>> 

>> - The CPU is acting as a multicast router, routing the group from VLAN

>>   1 to VLAN 2.

>> 

>> - There are subscribers for the group in question behind both swp1 and

>>   swp2 (VLAN 2).

>

> IIUC, this falls under the first use case ("Locally originating flows").

> Do you have a need for this optimization in the forwarding case? Asking

> because it might allow us to avoid unnecessary modifications to the

> forwarding path. I have yet to look at the code, so maybe it's not a big

> deal.


Routed multicast is the most pressing issue. But being able to avoid
issues with learning on flows from the CPU (locally originating and from
foreign interfaces) is a close second. Yes you can handle the second
issue by syncing FDBs but it means lots of extra traffic over an already
congested interface (MDIO).

The overhead is pretty small in this version, and with Nikolay's
suggestions about hiding it behind a static key, it should go down to 0
in v1.

>> With this offloading in place, the bridge need only send a single skb

>> to the driver, which will send it to the hardware marked in such a way

>> that the switch will perform the multicast replication according to

>> the MDB configuration. Naturally, the number of saved skb_clones

>> increase linearly with the number of subscribed ports.

>

> Yes, this is clear. FWIW, Spectrum has something similar. You can send

> packets as either "data" or "control". Data packets are injected via the

> CPU port and forwarded according to the hardware database. Control

> packets are sent as-is via the specified front panel port, bypassing the

> hardware data path. mlxsw is always sending packets as "control".


Marvell has the same concept, but they call "data" "forward" and
"control" "from CPU". mv88e6xxx has thus far also been limited to only
sending control frames. I imagine that many chips will be able to make
use of this acceleration.

>> As an extra benefit, on mv88e6xxx, this also allows the switch to

>> perform source address learning on these flows, which avoids having to

>> sync dynamic FDB entries over slow configuration interfaces like MDIO

>> to avoid flows directed towards the CPU being flooded as unknown

>> unicast by the switch.

>

> Since you are not syncing FDBs, it is possible that you are needlessly

> flooding locally generated packets. This optimization avoids it.


Since the switchdev driver muxes incoming frames to the right port
netdev, the bridge will know the correct port to use on egress. It is
more that return traffic towards the CPU will be flooded by the switch
as unknown unicast.

.-----.   .--------.
| CPU +---0 Switch 1--- A
'-----'   | (fdb)  2--- B
          '--------'

If a ping is sent from CPU to A, A's reply will also be flooded to B
because the CPU's SA has never been seen on a "forward"/"data" packet
and therefore has never been added to the FDB.

>> 

>> 

>> ## RFC

>> 

>> - In general, what do you think about this idea?

>

> Looks sane to me


Glad to hear it, thanks!

>> - hwdom. What do you think about this terminology? Personally I feel

>>   that we had too many things called offload_fwd_mark, and that as the

>>   use of the bridge internal ID (nbp->offload_fwd_mark) expands, it

>>   might be useful to have a separate term for it.

>

> Sounds OK

>

>> 

>> - .dfwd_{add,del}_station. Am I stretching this abstraction too far,

>>   and if so do you have any suggestion/preference on how to signal the

>>   offloading from the bridge down to the switchdev driver?

>

> I was not aware of this interface before the RFC, but your use case

> seems to fit the kdoc: "Called by upper layer devices to accelerate

> switching or other station functionality into hardware".

>

> Do you expect this optimization to only work when physical netdevs are

> enslaved to the bridge? What about LAG/VLANs?


LAGs should definitely work once the .ndo_dfwd_{add,del}_station helpers
are in place.

Stacked VLANs could also be made to work. But they may require some
extra work.

In v1, the bridge will always send offloaded frames with the VLAN
information intact, even if the port is configured to egress the VID
untagged. This is needed so that the driver can determine the correct
VID to use in cases where multiple VIDs are set to egress untagged.

If any kind of VID translation takes place I think things get very
sticky. Then again, that is maybe not all that defined without this
change applied either?

What is the typical use-case for using an "external" stacked VLAN device
over configuring the VLAN inside the bridge?

>> 

>> - The way that flooding is implemented in br_forward.c (lazily cloning

>>   skbs) means that you have to mark the forwarding as completed very

>>   early (right after should_deliver in maybe_deliver) in order to

>>   avoid duplicates. Is there some way to move this decision point to a

>>   later stage that I am missing?

>> 

>> - BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not

>>   compatible with unicast-to-multicast being used on a port. Then

>>   again, I think that this would also be broken for regular switchdev

>>   bridge offloading as this flag is not offloaded to the switchdev

>>   port, so there is no way for the driver to refuse it. Any ideas on

>>   how to handle this?

>> 

>> 

>> ## mv88e6xxx Specifics

>> 

>> Since we are now only receiving a single skb for both unicast and

>> multicast flows, we can tag the packets with the FORWARD command

>> instead of FROM_CPU. The swich(es) will then forward the packet in

>> accordance with its ATU, VTU, STU, and PVT configuration - just like

>> for packets ingressing on user ports.

>> 

>> Crucially, FROM_CPU is still used for:

>> 

>> - Ports in standalone mode.

>> 

>> - Flows that are trapped to the CPU and software-forwarded by a

>>   bridge. Note that these flows match neither of the classes discussed

>>   in the overview.

>> 

>> - Packets that are sent directly to a port netdev without going

>>   through the bridge, e.g. lldpd sending out PDU via an AF_PACKET

>>   socket.

>> 

>> We thus have a pretty clean separation where the data plane uses

>> FORWARDs and the control plane uses TO_/FROM_CPU.

>> 

>> The barrier between different bridges is enforced by port based VLANs

>> on mv88e6xxx, which in essence is a mapping from a source device/port

>> pair to an allowed set of egress ports. In order to have a FORWARD

>> frame (which carries a _source_ device/port) correctly mapped by the

>> PVT, we must use a unique pair for each bridge.

>> 

>> Fortunately, there is typically lots of unused address space in most

>> switch trees. When was the last time you saw an mv88e6xxx product

>> using more than 4 chips? Even if you found one with 16 (!) devices,

>> you would still have room to allocate 16*16 virtual ports to software

>> bridges.

>> 

>> Therefore, the mv88e6xxx driver will allocate a virtual device/port

>> pair to each bridge that it offloads. All members of the same bridge

>> are then configured to allow packets from this virtual port in their

>> PVTs.

>> 

>> Tobias Waldekranz (9):

>>   net: dfwd: Constrain existing users to macvlan subordinates

>>   net: bridge: Disambiguate offload_fwd_mark

>>   net: bridge: switchdev: Recycle unused hwdoms

>>   net: bridge: switchdev: Forward offloading

>>   net: dsa: Track port PVIDs

>>   net: dsa: Forward offloading

>>   net: dsa: mv88e6xxx: Allocate a virtual DSA port for each bridge

>>   net: dsa: mv88e6xxx: Map virtual bridge port in PVT

>>   net: dsa: mv88e6xxx: Forward offloading

>> 

>>  MAINTAINERS                                   |   1 +

>>  drivers/net/dsa/mv88e6xxx/Makefile            |   1 +

>>  drivers/net/dsa/mv88e6xxx/chip.c              |  61 ++++++-

>>  drivers/net/dsa/mv88e6xxx/dst.c               | 160 ++++++++++++++++++

>>  drivers/net/dsa/mv88e6xxx/dst.h               |  14 ++

>>  .../net/ethernet/intel/fm10k/fm10k_netdev.c   |   3 +

>>  drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +

>>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   3 +

>>  include/linux/dsa/mv88e6xxx.h                 |  13 ++

>>  include/net/dsa.h                             |  13 ++

>>  net/bridge/br_forward.c                       |  11 +-

>>  net/bridge/br_if.c                            |   4 +-

>>  net/bridge/br_private.h                       |  54 +++++-

>>  net/bridge/br_switchdev.c                     | 141 +++++++++++----

>>  net/dsa/port.c                                |  16 +-

>>  net/dsa/slave.c                               |  36 +++-

>>  net/dsa/tag_dsa.c                             |  33 +++-

>>  17 files changed, 510 insertions(+), 57 deletions(-)

>>  create mode 100644 drivers/net/dsa/mv88e6xxx/dst.c

>>  create mode 100644 drivers/net/dsa/mv88e6xxx/dst.h

>>  create mode 100644 include/linux/dsa/mv88e6xxx.h

>> 

>> -- 

>> 2.25.1

>>

Vladimir Oltean May 6, 2021, 10:59 a.m. UTC | #7

On Mon, May 03, 2021 at 11:44:21AM +0200, Tobias Waldekranz wrote:
> > Do you expect this optimization to only work when physical netdevs are

> > enslaved to the bridge? What about LAG/VLANs?

> 

> LAGs should definitely work once the .ndo_dfwd_{add,del}_station helpers

> are in place.

> 

> Stacked VLANs could also be made to work. But they may require some

> extra work.

> 

> In v1, the bridge will always send offloaded frames with the VLAN

> information intact, even if the port is configured to egress the VID

> untagged. This is needed so that the driver can determine the correct

> VID to use in cases where multiple VIDs are set to egress untagged.

> 

> If any kind of VID translation takes place I think things get very

> sticky. Then again, that is maybe not all that defined without this

> change applied either?

> 

> What is the typical use-case for using an "external" stacked VLAN device

> over configuring the VLAN inside the bridge?

I think it will be very sticky to support forwarding offload for this
setup anyway:

            br0
swp0.100  swp0.101   swp1.100

Need to understand what the use case is. The correct behavior IMO is for
the physical switch port to remain standalone and for the bridge to know
that the bridge port (swp0.100) is not offloaded.
FWIW I had an attempt to handle situations like this here:
https://patchwork.kernel.org/project/netdevbpf/patch/20210224114350.2791260-17-olteanv@gmail.com/

I will let Tobias finish with his forwarding offload patch set before
rebasing and resending mine, his work looks a lot better at this point.

[RFC,net-next,0/9] net: bridge: Forward offloading

Message

Comments