mbox series

[v4,net-next,00/15] Allow forwarding for the software bridge data path to be offloaded to capable devices

Message ID 20210718214434.3938850-1-vladimir.oltean@nxp.com
Headers show
Series Allow forwarding for the software bridge data path to be offloaded to capable devices | expand

Message

Vladimir Oltean July 18, 2021, 9:44 p.m. UTC
I would like these patches to simmer for a few days and get some
feedback / confirmation that there are no regressions from switchdev
driver maintainers.

===========================================================

The data plane of the software bridge can be partially offloaded to
switchdev, in the sense that we can trust the accelerator to:
(a) look up its FDB (which is more or less in sync with the software
    bridge FDB) for selecting the destination ports for a packet
(b) replicate the frame in hardware in case it's a multicast/broadcast,
    instead of the software bridge having to clone it and send the
    clones to each net device one at a time. This reduces the bandwidth
    needed between the CPU and the accelerator, as well as the CPU time
    spent.

The data path forwarding offload is managed per "hardware domain" - a
generalization of the "offload_fwd_mark" concept which is being
introduced in this series. Every packet is delivered only once to each
hardware domain.

===========================================================

Message for v4:

The biggest change compared to the previous series is not present in the
patches, but is rather a lack of them. Previously we were replaying
switchdev objects on the public notifier chain, but that was a mistake
in my reasoning and it was reverted for v4. Therefore, we are now
passing the notifier blocks as arguments to switchdev_bridge_port_offload()
for all drivers. This alone gets rid of 7 patches compared to v3.

Other changes are:
- Take more care for the case where mlxsw leaves a VLAN or LAG upper
  that is a bridge port, make sure that switchdev_bridge_port_unoffload()
  gets called for that case
- A couple of DSA bug fixes
- Add change logs for all patches
- Copy all switchdev driver maintainers on the changes relevant to them

===========================================================

Message for v3:
https://patchwork.kernel.org/project/netdevbpf/cover/20210712152142.800651-1-vladimir.oltean@nxp.com/

In this submission I have introduced a "native switchdev" driver API to
signal whether the TX forwarding offload is supported or not. This comes
after a third person has said that the macvlan offload framework used
for v2 and v1 was simply too convoluted.

This large patch set is submitted for discussion purposes (it is
provided in its entirety so it can be applied & tested on net-next).
It is only minimally tested, and yet I will not copy all switchdev
driver maintainers until we agree on the viability of this approach.

The major changes compared to v2:
- The introduction of switchdev_bridge_port_offload() and
  switchdev_bridge_port_unoffload() as two major API changes from the
  perspective of a switchdev driver. All drivers were converted to call
  these.
- Augment switchdev_bridge_port_{,un}offload to also handle the
  switchdev object replays on port join/leave.
- Augment switchdev_bridge_port_offload to also signal whether the TX
  forwarding offload is supported.

===========================================================

Message for v2:
https://patchwork.kernel.org/project/netdevbpf/cover/20210703115705.1034112-1-vladimir.oltean@nxp.com/

For this series I have taken Tobias' work from here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210426170411.1789186-1-tobias@waldekranz.com/
and made the following changes:
- I collected and integrated (hopefully all of) Nikolay's, Ido's and my
  feedback on the bridge driver changes. Otherwise, the structure of the
  bridge changes is pretty much the same as Tobias left it.
- I basically rewrote the DSA infrastructure for the data plane
  forwarding offload, based on the commonalities with another switch
  driver for which I implemented this feature (not submitted here)
- I adapted mv88e6xxx to use the new infrastructure, hopefully it still
  works but I didn't test that

In addition, Tobias said in the original cover letter:

===========================================================

## Overview

   vlan1   vlan2
       \   /
   .-----------.
   |    br0    |
   '-----------'
   /   /   \   \
swp0 swp1 swp2 eth0
  :   :   :
  (hwdom 1)

Up to this point, switchdevs have been trusted with offloading
forwarding between bridge ports, e.g. forwarding a unicast from swp0
to swp1 or flooding a broadcast from swp2 to swp1 and swp0. This
series extends forward offloading to include some new classes of
traffic:

- Locally originating flows, i.e. packets that ingress on br0 that are
  to be forwarded to one or several of the ports swp{0,1,2}. Notably
  this also includes routed flows, e.g. a packet ingressing swp0 on
  VLAN 1 which is then routed over to VLAN 2 by the CPU and then
  forwarded to swp1 is "locally originating" from br0's point of view.

- Flows originating from "foreign" interfaces, i.e. an interface that
  is not offloaded by a particular switchdev instance. This includes
  ports belonging to other switchdev instances. A typical example
  would be flows from eth0 towards swp{0,1,2}.

The bridge still looks up its FDB/MDB as usual and then notifies the
switchdev driver that a particular skb should be offloaded if it
matches one of the classes above. It does so by using the _accel
version of dev_queue_xmit, supplying its own netdev as the
"subordinate" device. The driver can react to the presence of the
subordinate in its .ndo_select_queue in what ever way it needs to make
sure to forward the skb in much the same way that it would for packets
ingressing on regular ports.

Hardware domains to which a particular skb has been forwarded are
recorded so that duplicates are avoided.

The main performance benefit is thus seen on multicast flows. Imagine
for example that:

- An IP camera is connected to swp0 (VLAN 1)

- The CPU is acting as a multicast router, routing the group from VLAN
  1 to VLAN 2.

- There are subscribers for the group in question behind both swp1 and
  swp2 (VLAN 2).

With this offloading in place, the bridge need only send a single skb
to the driver, which will send it to the hardware marked in such a way
that the switch will perform the multicast replication according to
the MDB configuration. Naturally, the number of saved skb_clones
increase linearly with the number of subscribed ports.

As an extra benefit, on mv88e6xxx, this also allows the switch to
perform source address learning on these flows, which avoids having to
sync dynamic FDB entries over slow configuration interfaces like MDIO
to avoid flows directed towards the CPU being flooded as unknown
unicast by the switch.


## RFC

- In general, what do you think about this idea?

- hwdom. What do you think about this terminology? Personally I feel
  that we had too many things called offload_fwd_mark, and that as the
  use of the bridge internal ID (nbp->offload_fwd_mark) expands, it
  might be useful to have a separate term for it.

- .dfwd_{add,del}_station. Am I stretching this abstraction too far,
  and if so do you have any suggestion/preference on how to signal the
  offloading from the bridge down to the switchdev driver?

- The way that flooding is implemented in br_forward.c (lazily cloning
  skbs) means that you have to mark the forwarding as completed very
  early (right after should_deliver in maybe_deliver) in order to
  avoid duplicates. Is there some way to move this decision point to a
  later stage that I am missing?

- BR_MULTICAST_TO_UNICAST. Right now, I expect that this series is not
  compatible with unicast-to-multicast being used on a port. Then
  again, I think that this would also be broken for regular switchdev
  bridge offloading as this flag is not offloaded to the switchdev
  port, so there is no way for the driver to refuse it. Any ideas on
  how to handle this?


## mv88e6xxx Specifics

Since we are now only receiving a single skb for both unicast and
multicast flows, we can tag the packets with the FORWARD command
instead of FROM_CPU. The swich(es) will then forward the packet in
accordance with its ATU, VTU, STU, and PVT configuration - just like
for packets ingressing on user ports.

Crucially, FROM_CPU is still used for:

- Ports in standalone mode.

- Flows that are trapped to the CPU and software-forwarded by a
  bridge. Note that these flows match neither of the classes discussed
  in the overview.

- Packets that are sent directly to a port netdev without going
  through the bridge, e.g. lldpd sending out PDU via an AF_PACKET
  socket.

We thus have a pretty clean separation where the data plane uses
FORWARDs and the control plane uses TO_/FROM_CPU.

The barrier between different bridges is enforced by port based VLANs
on mv88e6xxx, which in essence is a mapping from a source device/port
pair to an allowed set of egress ports. In order to have a FORWARD
frame (which carries a _source_ device/port) correctly mapped by the
PVT, we must use a unique pair for each bridge.

Fortunately, there is typically lots of unused address space in most
switch trees. When was the last time you saw an mv88e6xxx product
using more than 4 chips? Even if you found one with 16 (!) devices,
you would still have room to allocate 16*16 virtual ports to software
bridges.

Therefore, the mv88e6xxx driver will allocate a virtual device/port
pair to each bridge that it offloads. All members of the same bridge
are then configured to allow packets from this virtual port in their
PVTs.
====================

Cc: Vadym Kochan <vkochan@marvell.com>
Cc: Taras Chornyi <tchornyi@marvell.com>
Cc: Ioana Ciornei <ioana.ciornei@nxp.com>
Cc: Lars Povlsen <lars.povlsen@microchip.com>
Cc: Steen Hegelund <Steen.Hegelund@microchip.com>
Cc: UNGLinuxDriver@microchip.com
Cc: Claudiu Manoil <claudiu.manoil@nxp.com>
Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>
Cc: Grygorii Strashko <grygorii.strashko@ti.com>

Tobias Waldekranz (4):
  net: bridge: disambiguate offload_fwd_mark
  net: bridge: switchdev: recycle unused hwdoms
  net: bridge: switchdev: allow the TX data plane forwarding to be
    offloaded
  net: dsa: tag_dsa: offload the bridge forwarding process

Vladimir Oltean (11):
  net: dpaa2-switch: use extack in dpaa2_switch_port_bridge_join
  net: dpaa2-switch: refactor prechangeupper sanity checks
  mlxsw: spectrum: refactor prechangeupper sanity checks
  mlxsw: spectrum: refactor leaving an 8021q upper that is a bridge port
  net: marvell: prestera: refactor prechangeupper sanity checks
  net: switchdev: guard against multiple switchdev obj replays on same
    bridge port
  net: bridge: switchdev: let drivers inform which bridge ports are
    offloaded
  net: bridge: switchdev object replay helpers for everybody
  net: dsa: track the number of switches in a tree
  net: dsa: add support for bridge TX forwarding offload
  net: dsa: mv88e6xxx: map virtual bridges with forwarding offload in
    the PVT

 drivers/net/dsa/mv88e6xxx/chip.c              |  78 +++-
 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  69 +++-
 .../ethernet/marvell/prestera/prestera_main.c |  99 +++--
 .../marvell/prestera/prestera_switchdev.c     |  42 ++-
 .../marvell/prestera/prestera_switchdev.h     |   7 +-
 .../net/ethernet/mellanox/mlxsw/spectrum.c    | 347 ++++++++++++------
 .../net/ethernet/mellanox/mlxsw/spectrum.h    |   4 +
 .../mellanox/mlxsw/spectrum_switchdev.c       |  28 +-
 .../microchip/sparx5/sparx5_switchdev.c       |  48 ++-
 drivers/net/ethernet/mscc/ocelot_net.c        | 115 ++++--
 drivers/net/ethernet/rocker/rocker.h          |   9 +-
 drivers/net/ethernet/rocker/rocker_main.c     |  34 +-
 drivers/net/ethernet/rocker/rocker_ofdpa.c    |  42 ++-
 drivers/net/ethernet/ti/am65-cpsw-nuss.c      |  34 +-
 drivers/net/ethernet/ti/am65-cpsw-switchdev.c |  14 +-
 drivers/net/ethernet/ti/am65-cpsw-switchdev.h |   3 +
 drivers/net/ethernet/ti/cpsw_new.c            |  32 +-
 drivers/net/ethernet/ti/cpsw_switchdev.c      |   4 +-
 drivers/net/ethernet/ti/cpsw_switchdev.h      |   3 +
 include/linux/if_bridge.h                     |  63 ++--
 include/net/dsa.h                             |  21 ++
 net/bridge/br_fdb.c                           |   1 -
 net/bridge/br_forward.c                       |   9 +
 net/bridge/br_if.c                            |  11 +-
 net/bridge/br_mdb.c                           |   1 -
 net/bridge/br_private.h                       |  84 ++++-
 net/bridge/br_switchdev.c                     | 287 +++++++++++++--
 net/bridge/br_vlan.c                          |  11 +-
 net/dsa/dsa2.c                                |   4 +
 net/dsa/dsa_priv.h                            |   6 +
 net/dsa/port.c                                | 192 +++++++---
 net/dsa/tag_dsa.c                             |  52 ++-
 32 files changed, 1406 insertions(+), 348 deletions(-)

Comments

Florian Fainelli July 19, 2021, 2:16 a.m. UTC | #1
On 7/18/2021 2:44 PM, Vladimir Oltean wrote:
> For symmetry with mlxsw_sp_port_lag_leave(), introduce a small function

> called mlxsw_sp_port_vlan_leave() which checks whether the 8021q upper

> we're leaving is a bridge port, and if it is, stop offloading that

> bridge too.

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

-- 
Florian
Florian Fainelli July 19, 2021, 2:17 a.m. UTC | #2
On 7/18/2021 2:44 PM, Vladimir Oltean wrote:
> Prepare the drivers which support LAG offload but don't have support for

> switchdev object replay yet, i.e. the mlxsw and prestera drivers, to

> deal with bridge switchdev objects being replayed on the LAG bridge port

> multiple times, once for each time a physical port beneath the LAG calls

> switchdev_bridge_port_offload().

> 

> Cc: Vadym Kochan <vkochan@marvell.com>

> Cc: Taras Chornyi <tchornyi@marvell.com>

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

-- 
Florian
Florian Fainelli July 19, 2021, 2:43 a.m. UTC | #3
On 7/18/2021 2:44 PM, Vladimir Oltean wrote:
> From: Tobias Waldekranz <tobias@waldekranz.com>

> 

> Allow switchdevs to forward frames from the CPU in accordance with the

> bridge configuration in the same way as is done between bridge

> ports. This means that the bridge will only send a single skb towards

> one of the ports under the switchdev's control, and expects the driver

> to deliver the packet to all eligible ports in its domain.

> 

> Primarily this improves the performance of multicast flows with

> multiple subscribers, as it allows the hardware to perform the frame

> replication.

> 

> The basic flow between the driver and the bridge is as follows:

> 

> - When joining a bridge port, the switchdev driver calls

>    switchdev_bridge_port_offload() with tx_fwd_offload = true.

> 

> - The bridge sends offloadable skbs to one of the ports under the

>    switchdev's control using skb->offload_fwd_mark = true.

> 

> - The switchdev driver checks the skb->offload_fwd_mark field and lets

>    its FDB lookup select the destination port mask for this packet.

> 

> v1->v2:

> - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long

> - introduce a static key "br_switchdev_fwd_offload_used" to minimize the

>    impact of the newly introduced feature on all the setups which don't

>    have hardware that can make use of it

> - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache

>    line access

> - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in

>    __br_forward()

> - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge

>    is being used

> - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP

> 

> v2->v3:

> - replace the solution based on .ndo_dfwd_add_station with a solution

>    based on switchdev_bridge_port_offload

> - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD

> v3->v4: rebase

> 

> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

-- 
Florian
Florian Fainelli July 19, 2021, 2:47 a.m. UTC | #4
On 7/18/2021 2:44 PM, Vladimir Oltean wrote:
> From: Tobias Waldekranz <tobias@waldekranz.com>

> 

> Allow the DSA tagger to generate FORWARD frames for offloaded skbs

> sent from a bridge that we offload, allowing the switch to handle any

> frame replication that may be required. This also means that source

> address learning takes place on packets sent from the CPU, meaning

> that return traffic no longer needs to be flooded as unknown unicast.

> 

> Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


This looks pretty complicated to but if this is how it has to work, it 
has to. For tag_brcm.c we can simply indicate that the frame to be 
transmitted should have a specific bitmask of egress ports.
-- 
Florian
Florian Fainelli July 19, 2021, 2:51 a.m. UTC | #5
On 7/18/2021 2:44 PM, Vladimir Oltean wrote:
> For a DSA switch, to offload the forwarding process of a bridge device

> means to send the packets coming from the software bridge as data plane

> packets. This is contrary to everything that DSA has done so far,

> because the current taggers only know to send control packets (ones that

> target a specific destination port), whereas data plane packets are

> supposed to be forwarded according to the FDB lookup, much like packets

> ingressing on any regular ingress port. If the FDB lookup process

> returns multiple destination ports (flooding, multicast), then

> replication is also handled by the switch hardware - the bridge only

> sends a single packet and avoids the skb_clone().

> 

> DSA plays a substantial role in backing the forwarding offload, and

> leaves relatively few things up to the switch driver. In particular, DSA

> keeps for each bridge port a zero-based index (the number of the

> bridge). Multiple ports enslaved to the same bridge have a pointer to

> the same accel_priv structure.

> 

> The tagger can check if the packet that is being transmitted on has

> skb->offload_fwd_mark = true or not. If it does, it can be sure that the

> packet belongs to the data plane of a bridge, further information about

> which can be obtained based on dp->bridge_dev and dp->bridge_num.

> It can then compose a DSA tag for injecting a data plane packet into

> that bridge number.

> 

> For the switch driver side, we offer two new dsa_switch_ops methods,

> called .port_bridge_fwd_offload_{add,del}, which are modeled after

> .port_bridge_{join,leave}.

> These methods are provided in case the driver needs to configure the

> hardware to treat packets coming from that bridge software interface as

> data plane packets. The switchdev <-> bridge interaction happens during

> the netdev_master_upper_dev_link() call, so to switch drivers, the

> effect is that the .port_bridge_fwd_offload_add() method is called

> immediately after .port_bridge_join().

> 

> If the bridge number exceeds the number of bridges for which the switch

> driver can offload the TX data plane (and this includes the case where

> the driver can offload none), DSA falls back to simply returning

> tx_fwd_offload = false in the switchdev_bridge_port_offload() call.

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> ---


Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

-- 
Florian
Vladimir Oltean July 19, 2021, 7:22 a.m. UTC | #6
On Mon, Jul 19, 2021 at 12:44:30AM +0300, Vladimir Oltean wrote:
>  static int nbp_switchdev_add(struct net_bridge_port *p,

>  			     struct netdev_phys_item_id ppid,

> +			     bool tx_fwd_offload,

>  			     struct netlink_ext_ack *extack)

>  {

> +	int err;

> +

>  	if (p->offload_count) {

>  		/* Prevent unsupported configurations such as a bridge port

>  		 * which is a bonding interface, and the member ports are from

> @@ -189,7 +228,16 @@ static int nbp_switchdev_add(struct net_bridge_port *p,

>  	p->ppid = ppid;

>  	p->offload_count = 1;

>  

> -	return nbp_switchdev_hwdom_set(p);

> +	err = nbp_switchdev_hwdom_set(p);

> +	if (err)

> +		return err;

> +

> +	if (tx_fwd_offload) {

> +		p->flags |= BR_TX_FWD_OFFLOAD;

> +		static_branch_inc(&br_switchdev_fwd_offload_used);

> +	}

> +

> +	return 0;

>  }

>  

>  static void nbp_switchdev_del(struct net_bridge_port *p,

> @@ -210,6 +258,8 @@ static void nbp_switchdev_del(struct net_bridge_port *p,

>  

>  	if (p->hwdom)

>  		nbp_switchdev_hwdom_put(p);

> +

> +	p->flags &= ~BR_TX_FWD_OFFLOAD;

>  }


Not the end of the world, but the static_branch_dec(&br_switchdev_fwd_offload_used)
was lost here in a rebase. Not a functional issue per se, but it is on
my list of things I would like to fix when I resend.
Vladimir Oltean July 19, 2021, 7:41 a.m. UTC | #7
On Sun, Jul 18, 2021 at 07:47:22PM -0700, Florian Fainelli wrote:
> On 7/18/2021 2:44 PM, Vladimir Oltean wrote:

> > From: Tobias Waldekranz <tobias@waldekranz.com>

> >

> > Allow the DSA tagger to generate FORWARD frames for offloaded skbs

> > sent from a bridge that we offload, allowing the switch to handle any

> > frame replication that may be required. This also means that source

> > address learning takes place on packets sent from the CPU, meaning

> > that return traffic no longer needs to be flooded as unknown unicast.

> >

> > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>

> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

>

> This looks pretty complicated to but if this is how it has to work, it has

> to. For tag_brcm.c we can simply indicate that the frame to be transmitted

> should have a specific bitmask of egress ports.


Complicated in the sense that we need to nail the VLAN ID so that
the FDB / MDB is looked up correctly by the accelerator, to ensure that
it produces a result that is in sync with the software tables?

What you are proposing is not really TX forwarding offload but TX
replication offload. A CPU-injected packet targeting multiple egress
ports is still a control plane packet nonetheless, with all features
that characterize one:
- Ingress stage of the CPU port is bypassed (no hardware address
  learning for that MAC SA)
- FDB lookup is bypassed (we trust the software). This is also perhaps
  an advantage, because for example, if we have a MAC address learned
  towards the CPU port, and then we inject a packet from the CPU towards
  that destination MAC address, then a data plane packet would be
  dropped due to source port pruning (source == destination port), but a
  control plane packet would be sent regardless.
- Can inject into a BLOCKING egress port (we trust the software not to
  do that)

Whereas this patch set is really about laying the ground for data plane
packets to be safely created and sent by the network stack. There are
switches which have a clear distinction between the control plane and
the data plane, and injecting a control packet is a fairly expensive
operation. So it would be very good to support this operating mode,
regardless of whatever else we do.

I can look into adding support for your use case with just the
replication offload, since it should be possible nonetheless, and if you
really don't have the option to send a data plane packet then it is a
valid approach too, however I believe that the brick wall will be where
to encode the destination bit mask in the egress skb. For the full TX
forwarding offload we managed to dodge that because we already had
skb->offload_fwd_mark, but that's just one bit and we would need more.
I'm thinking we would need to add another bit (skb->offload_tx_replication)
and then add a struct list_head tx_dev to the skb which contains all the
net devices that the packet was not cloned to?
Vladimir Oltean July 19, 2021, 8:19 a.m. UTC | #8
On Mon, Jul 19, 2021 at 12:44:29AM +0300, Vladimir Oltean wrote:
> Note that:

> (c) I do not expect a lot of functional change introduced for drivers in

>     this patch, because:

>     - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),

>       so br_vlan_replay() should not do anything for the new drivers on

>       which we call it. The existing drivers where there was even a

>       slight possibility for there to exist a VLAN on a bridge port

>       before they join it are already guarded against this: mlxsw and

>       prestera deny joining LAG interfaces that are members of a bridge.

>     - br_fdb_replay() should now notify of local FDB entries, but I

>       patched all drivers except DSA to ignore these new entries in

>       commit 2c4eca3ef716 ("net: bridge: switchdev: include local flag

>       in FDB notifications"). Driver authors can lift this restriction

>       as they wish.

>     - br_mdb_replay() should now fix the issue described in commit

>       2c4eca3ef716 ("net: bridge: switchdev: include local flag in FDB

>       notifications") for all drivers, I don't see any downside.


I really meant commit 4f2673b3a2b6 ("net: bridge: add helper to replay
port and host-joined mdb entries"), sorry for the copy-pasta mistake.
Ioana Ciornei July 19, 2021, 9:17 a.m. UTC | #9
On Mon, Jul 19, 2021 at 12:44:20AM +0300, Vladimir Oltean wrote:
> We need to propagate the extack argument for

> dpaa2_switch_port_bridge_join to use it in a future patch, and it looks

> like there is already an error message there which is currently printed

> to the console. Move it over netlink so it is properly transmitted to

> user space.

> 

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>

> Cc: Ioana Ciornei <ioana.ciornei@nxp.com>

> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>


Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>

Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Ioana Ciornei July 19, 2021, 9:26 a.m. UTC | #10
On Mon, Jul 19, 2021 at 12:44:29AM +0300, Vladimir Oltean wrote:
> Starting with commit 4f2673b3a2b6 ("net: bridge: add helper to replay

> port and host-joined mdb entries"), DSA has introduced some bridge

> helpers that replay switchdev events (FDB/MDB/VLAN additions and

> deletions) that can be lost by the switchdev drivers in a variety of

> circumstances:

> 

> - an IP multicast group was host-joined on the bridge itself before any

>   switchdev port joined the bridge, leading to the host MDB entries

>   missing in the hardware database.

> - during the bridge creation process, the MAC address of the bridge was

>   added to the FDB as an entry pointing towards the bridge device

>   itself, but with no switchdev ports being part of the bridge yet, this

>   local FDB entry would remain unknown to the switchdev hardware

>   database.

> - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface,

>   before any switchdev port joined that LAG, leading to the hardware

>   database missing those entries.

> - a switchdev port left a LAG that is a bridge port, while the LAG

>   remained part of the bridge, and all FDB/MDB/VLAN entries remained

>   installed in the hardware database of the switchdev port.

> 

> Also, since commit 0d2cfbd41c4a ("net: bridge: ignore switchdev events

> for LAG ports which didn't request replay"), DSA introduced a method,

> based on a const void *ctx, to ensure that two switchdev ports under the

> same LAG that is a bridge port do not see the same MDB/VLAN entry being

> replayed twice by the bridge, once for every bridge port that joins the

> LAG.

> 

> With so many ordering corner cases being possible, it seems unreasonable

> to expect a switchdev driver writer to get it right from the first try.

> Therefore, now that we are past the beta testing period for the bridge

> replay helpers used in DSA only, it is time to roll them out to all

> switchdev drivers.

> 

> To convert the switchdev object replay helpers from "pull mode" (where

> the driver asks for them) to a "push mode" (where the bridge offers them

> automatically), the biggest problem is that the bridge needs to be aware

> when a switchdev port joins and leaves, even when the switchdev is only

> indirectly a bridge port (for example when the bridge port is a LAG

> upper of the switchdev).

> 

> Luckily, we already have a hook for that, in the form of the newly

> introduced switchdev_bridge_port_offload() and

> switchdev_bridge_port_unoffload() calls. These offer a natural place for

> hooking the object addition and deletion replays.

> 

> Extend the above 2 functions with:

> - pointers to the switchdev atomic notifier (for FDB replays) and the

>   blocking notifier (for MDB and VLAN replays).

> - the "const void *ctx" argument required for drivers to be able to

>   disambiguate between which port is targeted, when multiple ports are

>   lowers of the same LAG that is a bridge port. Most of the drivers pass

>   NULL to this argument, except the ones that support LAG offload and have

>   the proper context check already in place in the switchdev blocking

>   notifier handler.

> 

> am65_cpsw and cpsw had the same name for the switchdev notifiers, so I

> renamed the am65_cpsw ones with an am65_ prefix.

> 

> Also unexport the replay helpers, since nobody except the bridge calls

> them directly now.

> 

> Note that:

> (a) we abuse the terminology slightly, because FDB entries are not

>     "switchdev objects", but we count them as objects nonetheless.

>     With no direct way to prove it, I think they are not modeled as

>     switchdev objects because those can only be installed by the bridge

>     to the hardware (as opposed to FDB entries which can be propagated

>     in the other direction too). This is merely an abuse of terms, FDB

>     entries are replayed too, despite not being objects.

> (b) the bridge does not attempt to sync port attributes to newly joined

>     ports, just the countable stuff (the objects). The reason for this

>     is simple: no universal and symmetric way to sync and unsync them is

>     known. For example, VLAN filtering: what to do on unsync, disable or

>     leave it enabled? Similarly, STP state, ageing timer, etc etc. What

>     a switchdev port does when it becomes standalone again is not really

>     up to the bridge's competence, and the driver should deal with it.

>     On the other hand, replaying deletions of switchdev objects can be

>     seen a matter of cleanup and therefore be treated by the bridge,

>     hence this patch.

> (c) I do not expect a lot of functional change introduced for drivers in

>     this patch, because:

>     - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(),

>       so br_vlan_replay() should not do anything for the new drivers on

>       which we call it. The existing drivers where there was even a

>       slight possibility for there to exist a VLAN on a bridge port

>       before they join it are already guarded against this: mlxsw and

>       prestera deny joining LAG interfaces that are members of a bridge.

>     - br_fdb_replay() should now notify of local FDB entries, but I

>       patched all drivers except DSA to ignore these new entries in

>       commit 2c4eca3ef716 ("net: bridge: switchdev: include local flag

>       in FDB notifications"). Driver authors can lift this restriction

>       as they wish.

>     - br_mdb_replay() should now fix the issue described in commit

>       2c4eca3ef716 ("net: bridge: switchdev: include local flag in FDB

>       notifications") for all drivers, I don't see any downside.

> 

> Cc: Vadym Kochan <vkochan@marvell.com>

> Cc: Taras Chornyi <tchornyi@marvell.com>

> Cc: Ioana Ciornei <ioana.ciornei@nxp.com>

> Cc: Lars Povlsen <lars.povlsen@microchip.com>

> Cc: Steen Hegelund <Steen.Hegelund@microchip.com>

> Cc: UNGLinuxDriver@microchip.com

> Cc: Claudiu Manoil <claudiu.manoil@nxp.com>

> Cc: Alexandre Belloni <alexandre.belloni@bootlin.com>

> Cc: Grygorii Strashko <grygorii.strashko@ti.com>

> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>


Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
Ido Schimmel July 20, 2021, 11:24 a.m. UTC | #11
On Mon, Jul 19, 2021 at 12:44:19AM +0300, Vladimir Oltean wrote:
>  drivers/net/dsa/mv88e6xxx/chip.c              |  78 +++-

>  .../ethernet/freescale/dpaa2/dpaa2-switch.c   |  69 +++-

>  .../ethernet/marvell/prestera/prestera_main.c |  99 +++--

>  .../marvell/prestera/prestera_switchdev.c     |  42 ++-

>  .../marvell/prestera/prestera_switchdev.h     |   7 +-

>  .../net/ethernet/mellanox/mlxsw/spectrum.c    | 347 ++++++++++++------

>  .../net/ethernet/mellanox/mlxsw/spectrum.h    |   4 +

>  .../mellanox/mlxsw/spectrum_switchdev.c       |  28 +-

>  .../microchip/sparx5/sparx5_switchdev.c       |  48 ++-

>  drivers/net/ethernet/mscc/ocelot_net.c        | 115 ++++--

>  drivers/net/ethernet/rocker/rocker.h          |   9 +-

>  drivers/net/ethernet/rocker/rocker_main.c     |  34 +-

>  drivers/net/ethernet/rocker/rocker_ofdpa.c    |  42 ++-

>  drivers/net/ethernet/ti/am65-cpsw-nuss.c      |  34 +-

>  drivers/net/ethernet/ti/am65-cpsw-switchdev.c |  14 +-

>  drivers/net/ethernet/ti/am65-cpsw-switchdev.h |   3 +

>  drivers/net/ethernet/ti/cpsw_new.c            |  32 +-

>  drivers/net/ethernet/ti/cpsw_switchdev.c      |   4 +-

>  drivers/net/ethernet/ti/cpsw_switchdev.h      |   3 +

>  include/linux/if_bridge.h                     |  63 ++--

>  include/net/dsa.h                             |  21 ++

>  net/bridge/br_fdb.c                           |   1 -

>  net/bridge/br_forward.c                       |   9 +

>  net/bridge/br_if.c                            |  11 +-

>  net/bridge/br_mdb.c                           |   1 -

>  net/bridge/br_private.h                       |  84 ++++-

>  net/bridge/br_switchdev.c                     | 287 +++++++++++++--

>  net/bridge/br_vlan.c                          |  11 +-

>  net/dsa/dsa2.c                                |   4 +

>  net/dsa/dsa_priv.h                            |   6 +

>  net/dsa/port.c                                | 192 +++++++---

>  net/dsa/tag_dsa.c                             |  52 ++-

>  32 files changed, 1406 insertions(+), 348 deletions(-)


Too many things are squashed into this one patchset. It needs to be
split. According to the title, the patchset is focused on improving
performance, but there are no performance numbers that I could see and
most of the patches deal with the replay stuff instead.

The TX forwarding offload in mv88e6xxx is not related to the replay
stuff and should be added in a separate patchset. This can be done by
first adding the switchdev_bridge_port_offload() /
switchdev_bridge_port_unoffload() APIs that only take care of setting /
unsetting the hardware domain for the bridge port. Then, in a different
patchset, these APIs can be augmented with a parameter for the replay
stuff. It should be easier to review that way and require less
unnecessary surgeries in drivers that do not require the added
functionality.

Thanks
Vladimir Oltean July 20, 2021, 1:20 p.m. UTC | #12
On Tue, Jul 20, 2021 at 02:24:29PM +0300, Ido Schimmel wrote:
> Too many things are squashed into this one patchset. It needs to be

> split.

>

> The TX forwarding offload in mv88e6xxx is not related to the replay

> stuff and should be added in a separate patchset. This can be done by

> first adding the switchdev_bridge_port_offload() /

> switchdev_bridge_port_unoffload() APIs that only take care of setting /

> unsetting the hardware domain for the bridge port. Then, in a different

> patchset, these APIs can be augmented with a parameter for the replay

> stuff. It should be easier to review that way and require less

> unnecessary surgeries in drivers that do not require the added

> functionality.


Fair point. I will submit patches 1-10 and 11-15 separately.

> According to the title, the patchset is focused on improving

> performance, but there are no performance numbers that I could see and

> most of the patches deal with the replay stuff instead.


Maybe, but the truth is that it is not really the performance
improvement that I care about. The performance quote is from Tobias'
original cover letter, which I took as-is. I can build a synthetic test
for multicasting on 10 mv88e6xxx ports or something like that, or maybe
Tobias can provide a more relevant example out of Westermo's use cases.
But it would be silly if this patchset's acceptance would depend on the
numbers. This is one of those cases where completely different interests
led me and Tobias to the the same solution.

I don't want to bore you to death with details, but for some switches
(DSA or otherwise), being able to send bridge packets as they are (data
plane packets) instead of what they aren't (control plane packets) is a
matter of functionality and not performance. Such switches only use
control plane packets for link-local packet traps, and sending/receiving
a control packet is expensive.

For this class of switches (some may call them "dumb", but whatever),
this patch series makes the difference between supporting and not
supporting local IP termination through a VLAN-aware bridge, bridging
with a foreign interface, bridging with software upper interfaces like
LAG, etc.
Ido Schimmel July 20, 2021, 1:51 p.m. UTC | #13
On Tue, Jul 20, 2021 at 04:20:26PM +0300, Vladimir Oltean wrote:
> On Tue, Jul 20, 2021 at 02:24:29PM +0300, Ido Schimmel wrote:

> > Too many things are squashed into this one patchset. It needs to be

> > split.

> >

> > The TX forwarding offload in mv88e6xxx is not related to the replay

> > stuff and should be added in a separate patchset. This can be done by

> > first adding the switchdev_bridge_port_offload() /

> > switchdev_bridge_port_unoffload() APIs that only take care of setting /

> > unsetting the hardware domain for the bridge port. Then, in a different

> > patchset, these APIs can be augmented with a parameter for the replay

> > stuff. It should be easier to review that way and require less

> > unnecessary surgeries in drivers that do not require the added

> > functionality.

> 

> Fair point. I will submit patches 1-10 and 11-15 separately.


Not sure if you mean in that order or not, but I suggested first getting
the TX forwarding offload (patches 11-15) in and then extending the new
APIs with replay argument so that drivers can opt-in. This should reduce
the complexity of the second patchset and make it less likely to
introduce bugs.

> 

> > According to the title, the patchset is focused on improving

> > performance, but there are no performance numbers that I could see and

> > most of the patches deal with the replay stuff instead.

> 

> Maybe, but the truth is that it is not really the performance

> improvement that I care about. The performance quote is from Tobias'

> original cover letter, which I took as-is. I can build a synthetic test

> for multicasting on 10 mv88e6xxx ports or something like that, or maybe

> Tobias can provide a more relevant example out of Westermo's use cases.

> But it would be silly if this patchset's acceptance would depend on the

> numbers. This is one of those cases where completely different interests

> led me and Tobias to the the same solution.

> 

> I don't want to bore you to death with details, but for some switches

> (DSA or otherwise), being able to send bridge packets as they are (data

> plane packets) instead of what they aren't (control plane packets) is a

> matter of functionality and not performance. Such switches only use

> control plane packets for link-local packet traps, and sending/receiving

> a control packet is expensive.

> 

> For this class of switches (some may call them "dumb", but whatever),

> this patch series makes the difference between supporting and not

> supporting local IP termination through a VLAN-aware bridge, bridging

> with a foreign interface, bridging with software upper interfaces like

> LAG, etc.


OK, so this can be mentioned in the cover letter as well as an argument
for the feature. Wanted to make sure the patches were actually tested
given Tobias was the first to publish them and I'm not sure if he tested
them in the new form or if you have the required hardware.