Message ID | 20210718214434.3938850-1-vladimir.oltean@nxp.com |
---|---|
Headers | show |
Series | Allow forwarding for the software bridge data path to be offloaded to capable devices | expand |
On 7/18/2021 2:44 PM, Vladimir Oltean wrote: > For symmetry with mlxsw_sp_port_lag_leave(), introduce a small function > called mlxsw_sp_port_vlan_leave() which checks whether the 8021q upper > we're leaving is a bridge port, and if it is, stop offloading that > bridge too. > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> -- Florian
On 7/18/2021 2:44 PM, Vladimir Oltean wrote: > Prepare the drivers which support LAG offload but don't have support for > switchdev object replay yet, i.e. the mlxsw and prestera drivers, to > deal with bridge switchdev objects being replayed on the LAG bridge port > multiple times, once for each time a physical port beneath the LAG calls > switchdev_bridge_port_offload(). > > Cc: Vadym Kochan <vkochan@marvell.com> > Cc: Taras Chornyi <tchornyi@marvell.com> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> -- Florian
On 7/18/2021 2:44 PM, Vladimir Oltean wrote: > From: Tobias Waldekranz <tobias@waldekranz.com> > > Allow switchdevs to forward frames from the CPU in accordance with the > bridge configuration in the same way as is done between bridge > ports. This means that the bridge will only send a single skb towards > one of the ports under the switchdev's control, and expects the driver > to deliver the packet to all eligible ports in its domain. > > Primarily this improves the performance of multicast flows with > multiple subscribers, as it allows the hardware to perform the frame > replication. > > The basic flow between the driver and the bridge is as follows: > > - When joining a bridge port, the switchdev driver calls > switchdev_bridge_port_offload() with tx_fwd_offload = true. > > - The bridge sends offloadable skbs to one of the ports under the > switchdev's control using skb->offload_fwd_mark = true. > > - The switchdev driver checks the skb->offload_fwd_mark field and lets > its FDB lookup select the destination port mask for this packet. > > v1->v2: > - convert br_input_skb_cb::fwd_hwdoms to a plain unsigned long > - introduce a static key "br_switchdev_fwd_offload_used" to minimize the > impact of the newly introduced feature on all the setups which don't > have hardware that can make use of it > - introduce a check for nbp->flags & BR_FWD_OFFLOAD to optimize cache > line access > - reorder nbp_switchdev_frame_mark_accel() and br_handle_vlan() in > __br_forward() > - do not strip VLAN on egress if forwarding offload on VLAN-aware bridge > is being used > - propagate errors from .ndo_dfwd_add_station() if not EOPNOTSUPP > > v2->v3: > - replace the solution based on .ndo_dfwd_add_station with a solution > based on switchdev_bridge_port_offload > - rename BR_FWD_OFFLOAD to BR_TX_FWD_OFFLOAD > v3->v4: rebase > > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> -- Florian
On 7/18/2021 2:44 PM, Vladimir Oltean wrote: > From: Tobias Waldekranz <tobias@waldekranz.com> > > Allow the DSA tagger to generate FORWARD frames for offloaded skbs > sent from a bridge that we offload, allowing the switch to handle any > frame replication that may be required. This also means that source > address learning takes place on packets sent from the CPU, meaning > that return traffic no longer needs to be flooded as unknown unicast. > > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> This looks pretty complicated to but if this is how it has to work, it has to. For tag_brcm.c we can simply indicate that the frame to be transmitted should have a specific bitmask of egress ports. -- Florian
On 7/18/2021 2:44 PM, Vladimir Oltean wrote: > For a DSA switch, to offload the forwarding process of a bridge device > means to send the packets coming from the software bridge as data plane > packets. This is contrary to everything that DSA has done so far, > because the current taggers only know to send control packets (ones that > target a specific destination port), whereas data plane packets are > supposed to be forwarded according to the FDB lookup, much like packets > ingressing on any regular ingress port. If the FDB lookup process > returns multiple destination ports (flooding, multicast), then > replication is also handled by the switch hardware - the bridge only > sends a single packet and avoids the skb_clone(). > > DSA plays a substantial role in backing the forwarding offload, and > leaves relatively few things up to the switch driver. In particular, DSA > keeps for each bridge port a zero-based index (the number of the > bridge). Multiple ports enslaved to the same bridge have a pointer to > the same accel_priv structure. > > The tagger can check if the packet that is being transmitted on has > skb->offload_fwd_mark = true or not. If it does, it can be sure that the > packet belongs to the data plane of a bridge, further information about > which can be obtained based on dp->bridge_dev and dp->bridge_num. > It can then compose a DSA tag for injecting a data plane packet into > that bridge number. > > For the switch driver side, we offer two new dsa_switch_ops methods, > called .port_bridge_fwd_offload_{add,del}, which are modeled after > .port_bridge_{join,leave}. > These methods are provided in case the driver needs to configure the > hardware to treat packets coming from that bridge software interface as > data plane packets. The switchdev <-> bridge interaction happens during > the netdev_master_upper_dev_link() call, so to switch drivers, the > effect is that the .port_bridge_fwd_offload_add() method is called > immediately after .port_bridge_join(). > > If the bridge number exceeds the number of bridges for which the switch > driver can offload the TX data plane (and this includes the case where > the driver can offload none), DSA falls back to simply returning > tx_fwd_offload = false in the switchdev_bridge_port_offload() call. > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > --- Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> -- Florian
On Mon, Jul 19, 2021 at 12:44:30AM +0300, Vladimir Oltean wrote: > static int nbp_switchdev_add(struct net_bridge_port *p, > struct netdev_phys_item_id ppid, > + bool tx_fwd_offload, > struct netlink_ext_ack *extack) > { > + int err; > + > if (p->offload_count) { > /* Prevent unsupported configurations such as a bridge port > * which is a bonding interface, and the member ports are from > @@ -189,7 +228,16 @@ static int nbp_switchdev_add(struct net_bridge_port *p, > p->ppid = ppid; > p->offload_count = 1; > > - return nbp_switchdev_hwdom_set(p); > + err = nbp_switchdev_hwdom_set(p); > + if (err) > + return err; > + > + if (tx_fwd_offload) { > + p->flags |= BR_TX_FWD_OFFLOAD; > + static_branch_inc(&br_switchdev_fwd_offload_used); > + } > + > + return 0; > } > > static void nbp_switchdev_del(struct net_bridge_port *p, > @@ -210,6 +258,8 @@ static void nbp_switchdev_del(struct net_bridge_port *p, > > if (p->hwdom) > nbp_switchdev_hwdom_put(p); > + > + p->flags &= ~BR_TX_FWD_OFFLOAD; > } Not the end of the world, but the static_branch_dec(&br_switchdev_fwd_offload_used) was lost here in a rebase. Not a functional issue per se, but it is on my list of things I would like to fix when I resend.
On Sun, Jul 18, 2021 at 07:47:22PM -0700, Florian Fainelli wrote: > On 7/18/2021 2:44 PM, Vladimir Oltean wrote: > > From: Tobias Waldekranz <tobias@waldekranz.com> > > > > Allow the DSA tagger to generate FORWARD frames for offloaded skbs > > sent from a bridge that we offload, allowing the switch to handle any > > frame replication that may be required. This also means that source > > address learning takes place on packets sent from the CPU, meaning > > that return traffic no longer needs to be flooded as unknown unicast. > > > > Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > > This looks pretty complicated to but if this is how it has to work, it has > to. For tag_brcm.c we can simply indicate that the frame to be transmitted > should have a specific bitmask of egress ports. Complicated in the sense that we need to nail the VLAN ID so that the FDB / MDB is looked up correctly by the accelerator, to ensure that it produces a result that is in sync with the software tables? What you are proposing is not really TX forwarding offload but TX replication offload. A CPU-injected packet targeting multiple egress ports is still a control plane packet nonetheless, with all features that characterize one: - Ingress stage of the CPU port is bypassed (no hardware address learning for that MAC SA) - FDB lookup is bypassed (we trust the software). This is also perhaps an advantage, because for example, if we have a MAC address learned towards the CPU port, and then we inject a packet from the CPU towards that destination MAC address, then a data plane packet would be dropped due to source port pruning (source == destination port), but a control plane packet would be sent regardless. - Can inject into a BLOCKING egress port (we trust the software not to do that) Whereas this patch set is really about laying the ground for data plane packets to be safely created and sent by the network stack. There are switches which have a clear distinction between the control plane and the data plane, and injecting a control packet is a fairly expensive operation. So it would be very good to support this operating mode, regardless of whatever else we do. I can look into adding support for your use case with just the replication offload, since it should be possible nonetheless, and if you really don't have the option to send a data plane packet then it is a valid approach too, however I believe that the brick wall will be where to encode the destination bit mask in the egress skb. For the full TX forwarding offload we managed to dodge that because we already had skb->offload_fwd_mark, but that's just one bit and we would need more. I'm thinking we would need to add another bit (skb->offload_tx_replication) and then add a struct list_head tx_dev to the skb which contains all the net devices that the packet was not cloned to?
On Mon, Jul 19, 2021 at 12:44:29AM +0300, Vladimir Oltean wrote: > Note that: > (c) I do not expect a lot of functional change introduced for drivers in > this patch, because: > - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(), > so br_vlan_replay() should not do anything for the new drivers on > which we call it. The existing drivers where there was even a > slight possibility for there to exist a VLAN on a bridge port > before they join it are already guarded against this: mlxsw and > prestera deny joining LAG interfaces that are members of a bridge. > - br_fdb_replay() should now notify of local FDB entries, but I > patched all drivers except DSA to ignore these new entries in > commit 2c4eca3ef716 ("net: bridge: switchdev: include local flag > in FDB notifications"). Driver authors can lift this restriction > as they wish. > - br_mdb_replay() should now fix the issue described in commit > 2c4eca3ef716 ("net: bridge: switchdev: include local flag in FDB > notifications") for all drivers, I don't see any downside. I really meant commit 4f2673b3a2b6 ("net: bridge: add helper to replay port and host-joined mdb entries"), sorry for the copy-pasta mistake.
On Mon, Jul 19, 2021 at 12:44:20AM +0300, Vladimir Oltean wrote: > We need to propagate the extack argument for > dpaa2_switch_port_bridge_join to use it in a future patch, and it looks > like there is already an error message there which is currently printed > to the console. Move it over netlink so it is properly transmitted to > user space. > > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> > Cc: Ioana Ciornei <ioana.ciornei@nxp.com> > Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com>
On Mon, Jul 19, 2021 at 12:44:29AM +0300, Vladimir Oltean wrote: > Starting with commit 4f2673b3a2b6 ("net: bridge: add helper to replay > port and host-joined mdb entries"), DSA has introduced some bridge > helpers that replay switchdev events (FDB/MDB/VLAN additions and > deletions) that can be lost by the switchdev drivers in a variety of > circumstances: > > - an IP multicast group was host-joined on the bridge itself before any > switchdev port joined the bridge, leading to the host MDB entries > missing in the hardware database. > - during the bridge creation process, the MAC address of the bridge was > added to the FDB as an entry pointing towards the bridge device > itself, but with no switchdev ports being part of the bridge yet, this > local FDB entry would remain unknown to the switchdev hardware > database. > - a VLAN/FDB/MDB was added to a bridge port that is a LAG interface, > before any switchdev port joined that LAG, leading to the hardware > database missing those entries. > - a switchdev port left a LAG that is a bridge port, while the LAG > remained part of the bridge, and all FDB/MDB/VLAN entries remained > installed in the hardware database of the switchdev port. > > Also, since commit 0d2cfbd41c4a ("net: bridge: ignore switchdev events > for LAG ports which didn't request replay"), DSA introduced a method, > based on a const void *ctx, to ensure that two switchdev ports under the > same LAG that is a bridge port do not see the same MDB/VLAN entry being > replayed twice by the bridge, once for every bridge port that joins the > LAG. > > With so many ordering corner cases being possible, it seems unreasonable > to expect a switchdev driver writer to get it right from the first try. > Therefore, now that we are past the beta testing period for the bridge > replay helpers used in DSA only, it is time to roll them out to all > switchdev drivers. > > To convert the switchdev object replay helpers from "pull mode" (where > the driver asks for them) to a "push mode" (where the bridge offers them > automatically), the biggest problem is that the bridge needs to be aware > when a switchdev port joins and leaves, even when the switchdev is only > indirectly a bridge port (for example when the bridge port is a LAG > upper of the switchdev). > > Luckily, we already have a hook for that, in the form of the newly > introduced switchdev_bridge_port_offload() and > switchdev_bridge_port_unoffload() calls. These offer a natural place for > hooking the object addition and deletion replays. > > Extend the above 2 functions with: > - pointers to the switchdev atomic notifier (for FDB replays) and the > blocking notifier (for MDB and VLAN replays). > - the "const void *ctx" argument required for drivers to be able to > disambiguate between which port is targeted, when multiple ports are > lowers of the same LAG that is a bridge port. Most of the drivers pass > NULL to this argument, except the ones that support LAG offload and have > the proper context check already in place in the switchdev blocking > notifier handler. > > am65_cpsw and cpsw had the same name for the switchdev notifiers, so I > renamed the am65_cpsw ones with an am65_ prefix. > > Also unexport the replay helpers, since nobody except the bridge calls > them directly now. > > Note that: > (a) we abuse the terminology slightly, because FDB entries are not > "switchdev objects", but we count them as objects nonetheless. > With no direct way to prove it, I think they are not modeled as > switchdev objects because those can only be installed by the bridge > to the hardware (as opposed to FDB entries which can be propagated > in the other direction too). This is merely an abuse of terms, FDB > entries are replayed too, despite not being objects. > (b) the bridge does not attempt to sync port attributes to newly joined > ports, just the countable stuff (the objects). The reason for this > is simple: no universal and symmetric way to sync and unsync them is > known. For example, VLAN filtering: what to do on unsync, disable or > leave it enabled? Similarly, STP state, ageing timer, etc etc. What > a switchdev port does when it becomes standalone again is not really > up to the bridge's competence, and the driver should deal with it. > On the other hand, replaying deletions of switchdev objects can be > seen a matter of cleanup and therefore be treated by the bridge, > hence this patch. > (c) I do not expect a lot of functional change introduced for drivers in > this patch, because: > - nbp_vlan_init() is called _after_ netdev_master_upper_dev_link(), > so br_vlan_replay() should not do anything for the new drivers on > which we call it. The existing drivers where there was even a > slight possibility for there to exist a VLAN on a bridge port > before they join it are already guarded against this: mlxsw and > prestera deny joining LAG interfaces that are members of a bridge. > - br_fdb_replay() should now notify of local FDB entries, but I > patched all drivers except DSA to ignore these new entries in > commit 2c4eca3ef716 ("net: bridge: switchdev: include local flag > in FDB notifications"). Driver authors can lift this restriction > as they wish. > - br_mdb_replay() should now fix the issue described in commit > 2c4eca3ef716 ("net: bridge: switchdev: include local flag in FDB > notifications") for all drivers, I don't see any downside. > > Cc: Vadym Kochan <vkochan@marvell.com> > Cc: Taras Chornyi <tchornyi@marvell.com> > Cc: Ioana Ciornei <ioana.ciornei@nxp.com> > Cc: Lars Povlsen <lars.povlsen@microchip.com> > Cc: Steen Hegelund <Steen.Hegelund@microchip.com> > Cc: UNGLinuxDriver@microchip.com > Cc: Claudiu Manoil <claudiu.manoil@nxp.com> > Cc: Alexandre Belloni <alexandre.belloni@bootlin.com> > Cc: Grygorii Strashko <grygorii.strashko@ti.com> > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Ioana Ciornei <ioana.ciornei@nxp.com> # dpaa2-switch
On Mon, Jul 19, 2021 at 12:44:19AM +0300, Vladimir Oltean wrote: > drivers/net/dsa/mv88e6xxx/chip.c | 78 +++- > .../ethernet/freescale/dpaa2/dpaa2-switch.c | 69 +++- > .../ethernet/marvell/prestera/prestera_main.c | 99 +++-- > .../marvell/prestera/prestera_switchdev.c | 42 ++- > .../marvell/prestera/prestera_switchdev.h | 7 +- > .../net/ethernet/mellanox/mlxsw/spectrum.c | 347 ++++++++++++------ > .../net/ethernet/mellanox/mlxsw/spectrum.h | 4 + > .../mellanox/mlxsw/spectrum_switchdev.c | 28 +- > .../microchip/sparx5/sparx5_switchdev.c | 48 ++- > drivers/net/ethernet/mscc/ocelot_net.c | 115 ++++-- > drivers/net/ethernet/rocker/rocker.h | 9 +- > drivers/net/ethernet/rocker/rocker_main.c | 34 +- > drivers/net/ethernet/rocker/rocker_ofdpa.c | 42 ++- > drivers/net/ethernet/ti/am65-cpsw-nuss.c | 34 +- > drivers/net/ethernet/ti/am65-cpsw-switchdev.c | 14 +- > drivers/net/ethernet/ti/am65-cpsw-switchdev.h | 3 + > drivers/net/ethernet/ti/cpsw_new.c | 32 +- > drivers/net/ethernet/ti/cpsw_switchdev.c | 4 +- > drivers/net/ethernet/ti/cpsw_switchdev.h | 3 + > include/linux/if_bridge.h | 63 ++-- > include/net/dsa.h | 21 ++ > net/bridge/br_fdb.c | 1 - > net/bridge/br_forward.c | 9 + > net/bridge/br_if.c | 11 +- > net/bridge/br_mdb.c | 1 - > net/bridge/br_private.h | 84 ++++- > net/bridge/br_switchdev.c | 287 +++++++++++++-- > net/bridge/br_vlan.c | 11 +- > net/dsa/dsa2.c | 4 + > net/dsa/dsa_priv.h | 6 + > net/dsa/port.c | 192 +++++++--- > net/dsa/tag_dsa.c | 52 ++- > 32 files changed, 1406 insertions(+), 348 deletions(-) Too many things are squashed into this one patchset. It needs to be split. According to the title, the patchset is focused on improving performance, but there are no performance numbers that I could see and most of the patches deal with the replay stuff instead. The TX forwarding offload in mv88e6xxx is not related to the replay stuff and should be added in a separate patchset. This can be done by first adding the switchdev_bridge_port_offload() / switchdev_bridge_port_unoffload() APIs that only take care of setting / unsetting the hardware domain for the bridge port. Then, in a different patchset, these APIs can be augmented with a parameter for the replay stuff. It should be easier to review that way and require less unnecessary surgeries in drivers that do not require the added functionality. Thanks
On Tue, Jul 20, 2021 at 02:24:29PM +0300, Ido Schimmel wrote: > Too many things are squashed into this one patchset. It needs to be > split. > > The TX forwarding offload in mv88e6xxx is not related to the replay > stuff and should be added in a separate patchset. This can be done by > first adding the switchdev_bridge_port_offload() / > switchdev_bridge_port_unoffload() APIs that only take care of setting / > unsetting the hardware domain for the bridge port. Then, in a different > patchset, these APIs can be augmented with a parameter for the replay > stuff. It should be easier to review that way and require less > unnecessary surgeries in drivers that do not require the added > functionality. Fair point. I will submit patches 1-10 and 11-15 separately. > According to the title, the patchset is focused on improving > performance, but there are no performance numbers that I could see and > most of the patches deal with the replay stuff instead. Maybe, but the truth is that it is not really the performance improvement that I care about. The performance quote is from Tobias' original cover letter, which I took as-is. I can build a synthetic test for multicasting on 10 mv88e6xxx ports or something like that, or maybe Tobias can provide a more relevant example out of Westermo's use cases. But it would be silly if this patchset's acceptance would depend on the numbers. This is one of those cases where completely different interests led me and Tobias to the the same solution. I don't want to bore you to death with details, but for some switches (DSA or otherwise), being able to send bridge packets as they are (data plane packets) instead of what they aren't (control plane packets) is a matter of functionality and not performance. Such switches only use control plane packets for link-local packet traps, and sending/receiving a control packet is expensive. For this class of switches (some may call them "dumb", but whatever), this patch series makes the difference between supporting and not supporting local IP termination through a VLAN-aware bridge, bridging with a foreign interface, bridging with software upper interfaces like LAG, etc.
On Tue, Jul 20, 2021 at 04:20:26PM +0300, Vladimir Oltean wrote: > On Tue, Jul 20, 2021 at 02:24:29PM +0300, Ido Schimmel wrote: > > Too many things are squashed into this one patchset. It needs to be > > split. > > > > The TX forwarding offload in mv88e6xxx is not related to the replay > > stuff and should be added in a separate patchset. This can be done by > > first adding the switchdev_bridge_port_offload() / > > switchdev_bridge_port_unoffload() APIs that only take care of setting / > > unsetting the hardware domain for the bridge port. Then, in a different > > patchset, these APIs can be augmented with a parameter for the replay > > stuff. It should be easier to review that way and require less > > unnecessary surgeries in drivers that do not require the added > > functionality. > > Fair point. I will submit patches 1-10 and 11-15 separately. Not sure if you mean in that order or not, but I suggested first getting the TX forwarding offload (patches 11-15) in and then extending the new APIs with replay argument so that drivers can opt-in. This should reduce the complexity of the second patchset and make it less likely to introduce bugs. > > > According to the title, the patchset is focused on improving > > performance, but there are no performance numbers that I could see and > > most of the patches deal with the replay stuff instead. > > Maybe, but the truth is that it is not really the performance > improvement that I care about. The performance quote is from Tobias' > original cover letter, which I took as-is. I can build a synthetic test > for multicasting on 10 mv88e6xxx ports or something like that, or maybe > Tobias can provide a more relevant example out of Westermo's use cases. > But it would be silly if this patchset's acceptance would depend on the > numbers. This is one of those cases where completely different interests > led me and Tobias to the the same solution. > > I don't want to bore you to death with details, but for some switches > (DSA or otherwise), being able to send bridge packets as they are (data > plane packets) instead of what they aren't (control plane packets) is a > matter of functionality and not performance. Such switches only use > control plane packets for link-local packet traps, and sending/receiving > a control packet is expensive. > > For this class of switches (some may call them "dumb", but whatever), > this patch series makes the difference between supporting and not > supporting local IP termination through a VLAN-aware bridge, bridging > with a foreign interface, bridging with software upper interfaces like > LAG, etc. OK, so this can be mentioned in the cover letter as well as an argument for the feature. Wanted to make sure the patches were actually tested given Tobias was the first to publish them and I'm not sure if he tested them in the new form or if you have the required hardware.