Message ID: 20210819160723.2186424-1-vladimir.oltean@nxp.com
Series: Make SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE blocking
Hi Vlad, On Thu, Aug 19, 2021 at 09:15:17PM +0300, Vlad Buslov wrote: > On Thu 19 Aug 2021 at 19:07, Vladimir Oltean <vladimir.oltean@nxp.com> wrote: > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c > > index 0c38c2e319be..ea7c3f07f6fe 100644 > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c > > @@ -276,6 +276,55 @@ mlx5_esw_bridge_port_obj_attr_set(struct net_device *dev, > > return err; > > } > > > > +static struct mlx5_bridge_switchdev_fdb_work * > > +mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add, > > + struct switchdev_notifier_fdb_info *fdb_info, > > + struct mlx5_esw_bridge_offloads *br_offloads); > > + > > +static int > > +mlx5_esw_bridge_fdb_event(struct net_device *dev, unsigned long event, > > + struct switchdev_notifier_info *info, > > + struct mlx5_esw_bridge_offloads *br_offloads) > > +{ > > + struct switchdev_notifier_fdb_info *fdb_info; > > + struct mlx5_bridge_switchdev_fdb_work *work; > > + struct mlx5_eswitch *esw = br_offloads->esw; > > + u16 vport_num, esw_owner_vhca_id; > > + struct net_device *upper, *rep; > > + > > + upper = netdev_master_upper_dev_get_rcu(dev); > > + if (!upper) > > + return 0; > > + if (!netif_is_bridge_master(upper)) > > + return 0; > > + > > + rep = mlx5_esw_bridge_rep_vport_num_vhca_id_get(dev, esw, > > + &vport_num, > > + &esw_owner_vhca_id); > > + if (!rep) > > + return 0; > > + > > + /* only handle the event on peers */ > > + if (mlx5_esw_bridge_is_local(dev, rep, esw)) > > + return 0; > > This check is only needed for SWITCHDEV_FDB_DEL_TO_BRIDGE case. Here it > breaks the offload. Very good point, thanks for looking. I copied the entire atomic notifier handler and deleted the code which wasn't needed, but I actually took a break while converting mlx5, and so I forgot to delete this part when I came back. > > + > > + fdb_info = container_of(info, struct switchdev_notifier_fdb_info, info); > > + > > + work = mlx5_esw_bridge_init_switchdev_fdb_work(dev, > > + event == SWITCHDEV_FDB_ADD_TO_DEVICE, > > + fdb_info, > > Here FDB info can already be deallocated[1] since this is now executing > asynchronously and races with fdb_rcu_free() that is scheduled to be > called after rcu grace period by fdb_delete(). I am incredibly lucky that you caught this, apparently I needed to add an msleep(1000) to see it as well. It is not the struct switchdev_notifier_fdb_info *fdb_info that gets freed under RCU. It is fdb_info->addr (the MAC address), since switchdev_deferred_enqueue only performs a shallow copy. I will address that in v3. > > @@ -415,9 +470,7 @@ static int mlx5_esw_bridge_switchdev_event(struct notifier_block *nb, > > /* only handle the event on peers */ > > if (mlx5_esw_bridge_is_local(dev, rep, esw)) > > break; > > I really like the idea of completely remove the driver wq from FDB > handling code, but I'm not yet too familiar with bridge internals to > easily determine whether same approach can be applied to > SWITCHDEV_FDB_{ADD|DEL}_TO_BRIDGE event after this series is accepted. > It seems that all current users already generate these events from > blocking context, so would it be a trivial change for me to do in your > opinion? That would allow me to get rid of mlx5_esw_bridge_offloads->wq > in our driver. 
If all callers really are in blocking context (and they do appear to be) you can even forgo the switchdev_deferred_enqueue that switchdev_fdb_add_to_device does, and just call_switchdev_blocking_notifiers() directly. Then you move the bridge handler from br_switchdev_event() to br_switchdev_blocking_event(). It should be even simpler than this conversion.
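[ For illustration only, a minimal sketch of what such a direct call might look like from a driver that already runs in process context. The helper name example_fdb_del_to_bridge and the assumption that rtnl is held are mine, not part of the series; the bridge-side handler would then live in br_switchdev_blocking_event() instead of br_switchdev_event(), as suggested above. ]

#include <linux/notifier.h>
#include <linux/rtnetlink.h>
#include <net/switchdev.h>

/* Sketch: notify SWITCHDEV_FDB_DEL_TO_BRIDGE on the blocking chain directly,
 * skipping switchdev_deferred_enqueue(), assuming all callers are blocking.
 */
static int example_fdb_del_to_bridge(struct net_device *brport_dev,
				     const unsigned char *addr, u16 vid)
{
	struct switchdev_notifier_fdb_info info = {
		.addr = addr,
		.vid = vid,
		.offloaded = true,
	};
	int err;

	ASSERT_RTNL();	/* assumed: every caller is in process context */

	err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_DEL_TO_BRIDGE,
						brport_dev, &info.info, NULL);
	return notifier_to_errno(err);
}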
On Fri 20 Aug 2021 at 02:18, Vladimir Oltean <olteanv@gmail.com> wrote: > Hi Vlad, > > On Thu, Aug 19, 2021 at 09:15:17PM +0300, Vlad Buslov wrote: >> On Thu 19 Aug 2021 at 19:07, Vladimir Oltean <vladimir.oltean@nxp.com> wrote: >> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c >> > index 0c38c2e319be..ea7c3f07f6fe 100644 >> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c >> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/rep/bridge.c >> > @@ -276,6 +276,55 @@ mlx5_esw_bridge_port_obj_attr_set(struct net_device *dev, >> > return err; >> > } >> > >> > +static struct mlx5_bridge_switchdev_fdb_work * >> > +mlx5_esw_bridge_init_switchdev_fdb_work(struct net_device *dev, bool add, >> > + struct switchdev_notifier_fdb_info *fdb_info, >> > + struct mlx5_esw_bridge_offloads *br_offloads); >> > + >> > +static int >> > +mlx5_esw_bridge_fdb_event(struct net_device *dev, unsigned long event, >> > + struct switchdev_notifier_info *info, >> > + struct mlx5_esw_bridge_offloads *br_offloads) >> > +{ >> > + struct switchdev_notifier_fdb_info *fdb_info; >> > + struct mlx5_bridge_switchdev_fdb_work *work; >> > + struct mlx5_eswitch *esw = br_offloads->esw; >> > + u16 vport_num, esw_owner_vhca_id; >> > + struct net_device *upper, *rep; >> > + >> > + upper = netdev_master_upper_dev_get_rcu(dev); >> > + if (!upper) >> > + return 0; >> > + if (!netif_is_bridge_master(upper)) >> > + return 0; >> > + >> > + rep = mlx5_esw_bridge_rep_vport_num_vhca_id_get(dev, esw, >> > + &vport_num, >> > + &esw_owner_vhca_id); >> > + if (!rep) >> > + return 0; >> > + >> > + /* only handle the event on peers */ >> > + if (mlx5_esw_bridge_is_local(dev, rep, esw)) >> > + return 0; >> >> This check is only needed for SWITCHDEV_FDB_DEL_TO_BRIDGE case. Here it >> breaks the offload. > > Very good point, thanks for looking. I copied the entire atomic notifier > handler and deleted the code which wasn't needed, but I actually took a > break while converting mlx5, and so I forgot to delete this part when I > came back. > >> > + >> > + fdb_info = container_of(info, struct switchdev_notifier_fdb_info, info); >> > + >> > + work = mlx5_esw_bridge_init_switchdev_fdb_work(dev, >> > + event == SWITCHDEV_FDB_ADD_TO_DEVICE, >> > + fdb_info, >> >> Here FDB info can already be deallocated[1] since this is now executing >> asynchronously and races with fdb_rcu_free() that is scheduled to be >> called after rcu grace period by fdb_delete(). > > I am incredibly lucky that you caught this, apparently I needed to add > an msleep(1000) to see it as well. > > It is not the struct switchdev_notifier_fdb_info *fdb_info that gets > freed under RCU. It is fdb_info->addr (the MAC address), since > switchdev_deferred_enqueue only performs a shallow copy. I will address > that in v3. > >> > @@ -415,9 +470,7 @@ static int mlx5_esw_bridge_switchdev_event(struct notifier_block *nb, >> > /* only handle the event on peers */ >> > if (mlx5_esw_bridge_is_local(dev, rep, esw)) >> > break; >> >> I really like the idea of completely remove the driver wq from FDB >> handling code, but I'm not yet too familiar with bridge internals to >> easily determine whether same approach can be applied to >> SWITCHDEV_FDB_{ADD|DEL}_TO_BRIDGE event after this series is accepted. >> It seems that all current users already generate these events from >> blocking context, so would it be a trivial change for me to do in your >> opinion? 
That would allow me to get rid of mlx5_esw_bridge_offloads->wq >> in our driver. > > If all callers really are in blocking context (and they do appear to be) > you can even forgo the switchdev_deferred_enqueue that switchdev_fdb_add_to_device > does, and just call_switchdev_blocking_notifiers() directly. Then you > move the bridge handler from br_switchdev_event() to br_switchdev_blocking_event(). > It should be even simpler than this conversion. Thanks for your advice! I'll start looking into it as soon as this series is accepted.
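[ As context for the fdb_info->addr lifetime issue discussed above: the usual way drivers with a private workqueue avoid the use-after-free is to deep-copy the MAC into memory they own before deferring. A minimal sketch follows; the structure and function names are illustrative, not taken from any in-tree driver. ]

#include <linux/etherdevice.h>
#include <linux/netdevice.h>
#include <linux/slab.h>
#include <linux/workqueue.h>
#include <net/switchdev.h>

struct example_fdb_work {
	struct work_struct work;
	struct net_device *dev;
	unsigned char addr[ETH_ALEN];	/* owned copy, not fdb_info->addr */
	u16 vid;
	bool add;
};

static struct example_fdb_work *
example_fdb_work_alloc(struct net_device *dev,
		       const struct switchdev_notifier_fdb_info *fdb_info,
		       bool add)
{
	struct example_fdb_work *w;

	w = kzalloc(sizeof(*w), GFP_ATOMIC);
	if (!w)
		return NULL;

	/* fdb_info->addr may be freed after an RCU grace period once the
	 * atomic notifier returns, so copy it rather than keep the pointer.
	 */
	ether_addr_copy(w->addr, fdb_info->addr);
	w->vid = fdb_info->vid;
	w->add = add;
	w->dev = dev;
	dev_hold(dev);	/* dropped by the work function */

	return w;
}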
On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > Problem statement: > > Any time a driver needs to create a private association between a bridge > upper interface and use that association within its > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > entries deleted by the bridge when the port leaves. The issue is that > all switchdev drivers schedule a work item to have sleepable context, > and that work item can be actually scheduled after the port has left the > bridge, which means the association might have already been broken by > the time the scheduled FDB work item attempts to use it. This is handled in mlxsw by telling the device to flush the FDB entries pointing to the {port, FID} when the VLAN is deleted (synchronously). > > The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER > mechanism to make the FDB notifiers emitted from the fastpath be > scheduled in sleepable context. All drivers are converted to handle > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block > handler (or register a blocking switchdev notifier handler if they > didn't have one). This solves the aforementioned problem because the > bridge waits for the switchdev deferred work items to finish before a > port leaves (del_nbp calls switchdev_deferred_process), whereas a work > item privately scheduled by the driver will obviously not be waited upon > by the bridge, leading to the possibility of having the race. How the problem is solved if after this patchset drivers still queue a work item? DSA supports learning, but does not report the entries to the bridge. How are these entries deleted when a port leaves the bridge? > > This is a dependency for the "DSA FDB isolation" posted here. It was > split out of that series hence the numbering starts directly at v2. > > https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/ What is FDB isolation? Cover letter says: "There are use cases which need FDB isolation between standalone ports and bridged ports, as well as isolation between ports of different bridges". Does it mean that DSA currently forwards packets between ports even if they are member in different bridges or standalone? 
> > Vladimir Oltean (5): > net: switchdev: move SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to the blocking > notifier chain > net: bridge: switchdev: make br_fdb_replay offer sleepable context to > consumers > net: switchdev: drop the atomic notifier block from > switchdev_bridge_port_{,un}offload > net: switchdev: don't assume RCU context in > switchdev_handle_fdb_{add,del}_to_device > net: dsa: handle SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE synchronously > > .../ethernet/freescale/dpaa2/dpaa2-switch.c | 86 +++++------ > .../marvell/prestera/prestera_switchdev.c | 110 +++++++------- > .../mellanox/mlx5/core/en/rep/bridge.c | 59 +++++++- > .../mellanox/mlxsw/spectrum_switchdev.c | 61 +++++++- > .../microchip/sparx5/sparx5_switchdev.c | 78 +++++----- > drivers/net/ethernet/mscc/ocelot_net.c | 3 - > drivers/net/ethernet/rocker/rocker_main.c | 73 ++++----- > drivers/net/ethernet/rocker/rocker_ofdpa.c | 4 +- > drivers/net/ethernet/ti/am65-cpsw-nuss.c | 4 +- > drivers/net/ethernet/ti/am65-cpsw-switchdev.c | 57 ++++---- > drivers/net/ethernet/ti/cpsw_new.c | 4 +- > drivers/net/ethernet/ti/cpsw_switchdev.c | 60 ++++---- > drivers/s390/net/qeth_l2_main.c | 10 +- > include/net/switchdev.h | 30 +++- > net/bridge/br.c | 5 +- > net/bridge/br_fdb.c | 40 ++++- > net/bridge/br_private.h | 4 - > net/bridge/br_switchdev.c | 18 +-- > net/dsa/dsa.c | 15 -- > net/dsa/dsa_priv.h | 15 -- > net/dsa/port.c | 3 - > net/dsa/slave.c | 138 ++++++------------ > net/switchdev/switchdev.c | 61 +++++++- > 23 files changed, 529 insertions(+), 409 deletions(-) > > -- > 2.25.1 >
On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > > Problem statement: > > > > Any time a driver needs to create a private association between a bridge > > upper interface and use that association within its > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > > entries deleted by the bridge when the port leaves. The issue is that > > all switchdev drivers schedule a work item to have sleepable context, > > and that work item can be actually scheduled after the port has left the > > bridge, which means the association might have already been broken by > > the time the scheduled FDB work item attempts to use it. > > This is handled in mlxsw by telling the device to flush the FDB entries > pointing to the {port, FID} when the VLAN is deleted (synchronously). Again, central solution vs mlxsw solution. If a port leaves a LAG that is offloaded but the LAG does not leave the bridge, the driver still needs to initiate the VLAN deletion. I really don't like that, it makes switchdev drivers bloated. As long as you call switchdev_bridge_port_unoffload and you populate the blocking notifier pointer, you will get replays of item deletion from the bridge. > > The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER > > mechanism to make the FDB notifiers emitted from the fastpath be > > scheduled in sleepable context. All drivers are converted to handle > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block > > handler (or register a blocking switchdev notifier handler if they > > didn't have one). This solves the aforementioned problem because the > > bridge waits for the switchdev deferred work items to finish before a > > port leaves (del_nbp calls switchdev_deferred_process), whereas a work > > item privately scheduled by the driver will obviously not be waited upon > > by the bridge, leading to the possibility of having the race. > > How the problem is solved if after this patchset drivers still queue a > work item? It's only a problem if you bank on any stateful association between FDB entries and your ports (aka you expect that port->bridge_dev still holds the same value in the atomic handler as in the deferred work item). I think drivers don't do this at the moment, otherwise they would be broken. When they need that, they will convert to synchronous handling and all will be fine. > DSA supports learning, but does not report the entries to the bridge. Why is this relevant exactly? > How are these entries deleted when a port leaves the bridge? dsa_port_fast_age does the following (a) deletes the hardware learned entries on a port, in all VLANs (b) notifies the bridge to also flush its software FDB on that port It is called (a) when the STP state changes from a learning-capable state (LEARNING, FORWARDING) to a non-learning capable state (BLOCKING, LISTENING) (b) when learning is turned off by the user (c) when learning is turned off by the port becoming standalone after leaving a bridge (actually same code path as b) So the FDB of a port is also flushed when a single switch port leaves a LAG that is the actual bridge port (maybe not ideal, but I don't know any better). > > This is a dependency for the "DSA FDB isolation" posted here. It was > > split out of that series hence the numbering starts directly at v2. > > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/ > > What is FDB isolation? 
Cover letter says: "There are use cases which > need FDB isolation between standalone ports and bridged ports, as well > as isolation between ports of different bridges". FDB isolation means exactly what it says: that the hardware FDB lookup of ports that are standalone, or under one bridge, is unable to find FDB entries (same MAC address, same VID) learned on another port from another bridge. > Does it mean that DSA currently forwards packets between ports even if > they are member in different bridges or standalone? No, that is plain forwarding isolation in my understanding of terms, and we have had that for many years now.
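[ As a side note on the dsa_port_fast_age() behaviour described earlier in this reply, here is a rough sketch of that flow. The real function lives in net/dsa/port.c; the notifier payload and the struct members used below are assumptions made for illustration. ]

#include <net/dsa.h>
#include <net/switchdev.h>

static void example_port_fast_age(struct dsa_port *dp)
{
	struct dsa_switch *ds = dp->ds;
	struct switchdev_notifier_fdb_info info = {};	/* assumed: vid 0 = all VLANs */

	if (!ds->ops->port_fast_age)
		return;

	/* (a) flush the hardware-learned entries on this port, in all VLANs */
	ds->ops->port_fast_age(ds, dp->index);

	/* (b) ask the bridge to flush its software FDB for this port too,
	 * so the hardware and software databases stay in sync
	 */
	call_switchdev_notifiers(SWITCHDEV_FDB_FLUSH_TO_BRIDGE,
				 dp->slave, &info.info, NULL);
}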
On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > > Problem statement: > > > > Any time a driver needs to create a private association between a bridge > > upper interface and use that association within its > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > > entries deleted by the bridge when the port leaves. The issue is that > > all switchdev drivers schedule a work item to have sleepable context, > > and that work item can be actually scheduled after the port has left the > > bridge, which means the association might have already been broken by > > the time the scheduled FDB work item attempts to use it. > > This is handled in mlxsw by telling the device to flush the FDB entries > pointing to the {port, FID} when the VLAN is deleted (synchronously). If you have FDB entries pointing to bridge ports that are foreign interfaces and you offload them, do you catch the VLAN deletion on the foreign port and flush your entries towards it at that time?
On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote: > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > > > Problem statement: > > > > > > Any time a driver needs to create a private association between a bridge > > > upper interface and use that association within its > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > > > entries deleted by the bridge when the port leaves. The issue is that > > > all switchdev drivers schedule a work item to have sleepable context, > > > and that work item can be actually scheduled after the port has left the > > > bridge, which means the association might have already been broken by > > > the time the scheduled FDB work item attempts to use it. > > > > This is handled in mlxsw by telling the device to flush the FDB entries > > pointing to the {port, FID} when the VLAN is deleted (synchronously). > > Again, central solution vs mlxsw solution. Again, a solution is forced on everyone regardless if it benefits them or not. List is bombarded with version after version until patches are applied. *EXHAUSTING*. With these patches, except DSA, everyone gets another queue_work() for each FDB entry. In some cases, it completely misses the purpose of the patchset. Want a central solution? Make sure it is properly integrated. "Don't have the energy"? Ask for help. Do not try to force a solution on everyone and motivate them to change the code by doing a poor conversion yourself. I don't accept "this will have to do". > > If a port leaves a LAG that is offloaded but the LAG does not leave the > bridge, the driver still needs to initiate the VLAN deletion. I really > don't like that, it makes switchdev drivers bloated. > > As long as you call switchdev_bridge_port_unoffload and you populate the > blocking notifier pointer, you will get replays of item deletion from > the bridge. > > > > The solution is to modify switchdev to use its embedded SWITCHDEV_F_DEFER > > > mechanism to make the FDB notifiers emitted from the fastpath be > > > scheduled in sleepable context. All drivers are converted to handle > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE from their blocking notifier block > > > handler (or register a blocking switchdev notifier handler if they > > > didn't have one). This solves the aforementioned problem because the > > > bridge waits for the switchdev deferred work items to finish before a > > > port leaves (del_nbp calls switchdev_deferred_process), whereas a work > > > item privately scheduled by the driver will obviously not be waited upon > > > by the bridge, leading to the possibility of having the race. > > > > How the problem is solved if after this patchset drivers still queue a > > work item? > > It's only a problem if you bank on any stateful association between FDB > entries and your ports (aka you expect that port->bridge_dev still holds > the same value in the atomic handler as in the deferred work item). I > think drivers don't do this at the moment, otherwise they would be > broken. > > When they need that, they will convert to synchronous handling and all > will be fine. > > > DSA supports learning, but does not report the entries to the bridge. > > Why is this relevant exactly? Because I wanted to make sure that FDB entries that are not present in the bridge are also flushed. > > > How are these entries deleted when a port leaves the bridge? 
> > dsa_port_fast_age does the following > (a) deletes the hardware learned entries on a port, in all VLANs > (b) notifies the bridge to also flush its software FDB on that port > > It is called > (a) when the STP state changes from a learning-capable state (LEARNING, > FORWARDING) to a non-learning capable state (BLOCKING, LISTENING) > (b) when learning is turned off by the user > (c) when learning is turned off by the port becoming standalone after > leaving a bridge (actually same code path as b) > > So the FDB of a port is also flushed when a single switch port leaves a > LAG that is the actual bridge port (maybe not ideal, but I don't know > any better). > > > > This is a dependency for the "DSA FDB isolation" posted here. It was > > > split out of that series hence the numbering starts directly at v2. > > > > > > https://patchwork.kernel.org/project/netdevbpf/cover/20210818120150.892647-1-vladimir.oltean@nxp.com/ > > > > What is FDB isolation? Cover letter says: "There are use cases which > > need FDB isolation between standalone ports and bridged ports, as well > > as isolation between ports of different bridges". > > FDB isolation means exactly what it says: that the hardware FDB lookup > of ports that are standalone, or under one bridge, is unable to find FDB entries > (same MAC address, same VID) learned on another port from another bridge. > > > Does it mean that DSA currently forwards packets between ports even if > > they are member in different bridges or standalone? > > No, that is plain forwarding isolation in my understanding of terms, and > we have had that for many years now. So if I have {00:01:02:03:04:05, 5} in br0, but not in br1 and now a packet with this DMAC/VID needs to be forwarded in br1 it will be dropped instead of being flooded?
On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote: > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > > > Problem statement: > > > > > > Any time a driver needs to create a private association between a bridge > > > upper interface and use that association within its > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > > > entries deleted by the bridge when the port leaves. The issue is that > > > all switchdev drivers schedule a work item to have sleepable context, > > > and that work item can be actually scheduled after the port has left the > > > bridge, which means the association might have already been broken by > > > the time the scheduled FDB work item attempts to use it. > > > > This is handled in mlxsw by telling the device to flush the FDB entries > > pointing to the {port, FID} when the VLAN is deleted (synchronously). > > If you have FDB entries pointing to bridge ports that are foreign > interfaces and you offload them, do you catch the VLAN deletion on the > foreign port and flush your entries towards it at that time? Yes, that's how VXLAN offload works. VLAN addition is used to determine the mapping between VNI and VLAN.
On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote: > On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote: > > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > > > > Problem statement: > > > > > > > > Any time a driver needs to create a private association between a bridge > > > > upper interface and use that association within its > > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > > > > entries deleted by the bridge when the port leaves. The issue is that > > > > all switchdev drivers schedule a work item to have sleepable context, > > > > and that work item can be actually scheduled after the port has left the > > > > bridge, which means the association might have already been broken by > > > > the time the scheduled FDB work item attempts to use it. > > > > > > This is handled in mlxsw by telling the device to flush the FDB entries > > > pointing to the {port, FID} when the VLAN is deleted (synchronously). > > > > Again, central solution vs mlxsw solution. > > Again, a solution is forced on everyone regardless if it benefits them > or not. List is bombarded with version after version until patches are > applied. *EXHAUSTING*. So if I replace "bombarded" with a more neutral word, isn't that how it's done though? What would you do if you wanted to achieve something but the framework stood in your way? Would you work around it to avoid bombarding the list? > With these patches, except DSA, everyone gets another queue_work() for > each FDB entry. In some cases, it completely misses the purpose of the > patchset. I also fail to see the point. Patch 3 will have to make things worse before they get better. It is like that in DSA too, and made more reasonable only in the last patch from the series. If I saw any middle-ground way, like keeping the notifiers on the atomic chain for unconverted drivers, I would have done it. But what do you do if more than one driver listens for one event, one driver wants it blocking, the other wants it atomic. Do you make the bridge emit it twice? That's even worse than having one useless queue_work() in some drivers. So if you think I can avoid that please tell me how. > Want a central solution? Make sure it is properly integrated. "Don't > have the energy"? Ask for help. Do not try to force a solution on > everyone and motivate them to change the code by doing a poor conversion > yourself. > > I don't accept "this will have to do". So I can make many suppositions about what I did wrong, but I would prefer that you tell me. Is it the timing, as we're late in the development cycle? Maybe, and that would make a lot of sense, but I don't want to assume anything that has not been said. Is it that I converted too few drivers? You said I'm bombarding the list. Can I convert more drivers with less code? I would be absolutely glad to. I have more driver conversions unsubmitted, some tested on hardware. Is it that I didn't ask for help? I still believe that it is best I leave the driver maintainers to do the rest of the conversion, at their own pace and with hardware to test and find issues I can not using just code analysis and non-expert knowledge. After all, with all due respect to the net-next tree, I sent these patches to a development git tree, not to a production facility. > > > What is FDB isolation? 
Cover letter says: "There are use cases which > > > need FDB isolation between standalone ports and bridged ports, as well > > > as isolation between ports of different bridges". > > > > FDB isolation means exactly what it says: that the hardware FDB lookup > > of ports that are standalone, or under one bridge, is unable to find FDB entries > > (same MAC address, same VID) learned on another port from another bridge. > > > > > Does it mean that DSA currently forwards packets between ports even if > > > they are member in different bridges or standalone? > > > > No, that is plain forwarding isolation in my understanding of terms, and > > we have had that for many years now. > > So if I have {00:01:02:03:04:05, 5} in br0, but not in br1 and now a > packet with this DMAC/VID needs to be forwarded in br1 it will be > dropped instead of being flooded? Yes.
On 20/08/2021 20:06, Vladimir Oltean wrote: > On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote: >> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote: >>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: >>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: >>>>> Problem statement: >>>>> >>>>> Any time a driver needs to create a private association between a bridge >>>>> upper interface and use that association within its >>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB >>>>> entries deleted by the bridge when the port leaves. The issue is that >>>>> all switchdev drivers schedule a work item to have sleepable context, >>>>> and that work item can be actually scheduled after the port has left the >>>>> bridge, which means the association might have already been broken by >>>>> the time the scheduled FDB work item attempts to use it. >>>> >>>> This is handled in mlxsw by telling the device to flush the FDB entries >>>> pointing to the {port, FID} when the VLAN is deleted (synchronously). >>> >>> Again, central solution vs mlxsw solution. >> >> Again, a solution is forced on everyone regardless if it benefits them >> or not. List is bombarded with version after version until patches are >> applied. *EXHAUSTING*. > > So if I replace "bombarded" with a more neutral word, isn't that how > it's done though? What would you do if you wanted to achieve something > but the framework stood in your way? Would you work around it to avoid > bombarding the list? > >> With these patches, except DSA, everyone gets another queue_work() for >> each FDB entry. In some cases, it completely misses the purpose of the >> patchset. > > I also fail to see the point. Patch 3 will have to make things worse > before they get better. It is like that in DSA too, and made more > reasonable only in the last patch from the series. > > If I saw any middle-ground way, like keeping the notifiers on the atomic > chain for unconverted drivers, I would have done it. But what do you do > if more than one driver listens for one event, one driver wants it > blocking, the other wants it atomic. Do you make the bridge emit it > twice? That's even worse than having one useless queue_work() in some > drivers. > > So if you think I can avoid that please tell me how. > Hi, I don't like the double-queuing for each fdb for everyone either, it's forcing them to rework it asap due to inefficiency even though that shouldn't be necessary. In the long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually. For most drivers this is introducing more work (as in processing) rather than helping them right now, give them the option to convert to it on their own accord or bite the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking I don't see why not convert everyone to just execute their otherwise queued work. I'm sure driver maintainers would appreciate such help and would test and review it. You're halfway there already.. Cheers, Nik
On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote: > Hi, > I don't like the double-queuing for each fdb for everyone either, it's forcing them > to rework it asap due to inefficiency even though that shouldn't be necessary. Let's be honest, with the vast majority of drivers having absurdities such as the "if (!fdb_info->added_by_user || fdb_info->is_local) => nothing to do here, bye" check placed _inside_ the actual work item (and therefore scheduling for nothing for entries dynamically learned by the bridge), it's hard to believe that driver authors cared too much about inefficiency when mindlessly copy-pasting that snippet from mlxsw [ which for the record does call mlxsw_sp_span_respin for dynamically learned FDB entries, so that driver doesn't schedule for nothing like the rest - although maybe even mlxsw could call mlxsw_sp_port_dev_lower_find_rcu instead of mlxsw_sp_port_dev_lower_find, and could save a queue_work for FDB entries on foreign && non-VXLAN ports. Who knows?! ] Now I get to care for them. But I can see how a partial conversion could leave things in an even more absurd position. I don't want to contribute to the absurdity. > In the > long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually. > For most drivers this is introducing more work (as in processing) rather than helping > them right now, give them the option to convert to it on their own accord or bite > the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking > I don't see why not convert everyone to just execute their otherwise queued work. > I'm sure driver maintainers would appreciate such help and would test and review it. You're > halfway there already.. Agree, this needs more work. Thanks for looking.
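[ To make the inefficiency above concrete, a hedged sketch of the cheaper ordering: the filtering happens in the atomic notifier, so queue_work() is only paid for entries the driver actually cares about. example_dev_check() and example_schedule_fdb_work() are invented names standing in for a driver's own helpers. ]

#include <linux/netdevice.h>
#include <net/switchdev.h>

/* Hypothetical driver helpers, declared only so the sketch is self-contained. */
bool example_dev_check(const struct net_device *dev);
bool example_schedule_fdb_work(struct net_device *dev, unsigned long event,
			       const struct switchdev_notifier_fdb_info *fdb_info);

static int example_fdb_event(struct notifier_block *nb,
			     unsigned long event, void *ptr)
{
	struct switchdev_notifier_info *info = ptr;
	struct net_device *dev = switchdev_notifier_info_to_dev(info);
	struct switchdev_notifier_fdb_info *fdb_info;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		fdb_info = container_of(info,
					struct switchdev_notifier_fdb_info,
					info);
		/* Filter in atomic context, not inside the work item */
		if (!fdb_info->added_by_user || fdb_info->is_local)
			return NOTIFY_DONE;
		if (!example_dev_check(dev))
			return NOTIFY_DONE;
		if (!example_schedule_fdb_work(dev, event, fdb_info))
			return NOTIFY_BAD;
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}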
On Fri, Aug 20, 2021 at 07:11:15PM +0300, Ido Schimmel wrote: > On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote: > > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > > > > Problem statement: > > > > > > > > Any time a driver needs to create a private association between a bridge > > > > upper interface and use that association within its > > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > > > > entries deleted by the bridge when the port leaves. The issue is that > > > > all switchdev drivers schedule a work item to have sleepable context, > > > > and that work item can be actually scheduled after the port has left the > > > > bridge, which means the association might have already been broken by > > > > the time the scheduled FDB work item attempts to use it. > > > > > > This is handled in mlxsw by telling the device to flush the FDB entries > > > pointing to the {port, FID} when the VLAN is deleted (synchronously). > > > > If you have FDB entries pointing to bridge ports that are foreign > > interfaces and you offload them, do you catch the VLAN deletion on the > > foreign port and flush your entries towards it at that time? > > Yes, that's how VXLAN offload works. VLAN addition is used to determine > the mapping between VNI and VLAN. I was only able to follow as far as: mlxsw_sp_switchdev_blocking_event -> mlxsw_sp_switchdev_handle_vxlan_obj_del -> mlxsw_sp_switchdev_vxlan_vlans_del -> mlxsw_sp_switchdev_vxlan_vlan_del -> ??? where are the FDB entries flushed? I was expecting to see something along the lines of mlxsw_sp_switchdev_blocking_event -> mlxsw_sp_port_vlans_del -> mlxsw_sp_bridge_port_vlan_del -> mlxsw_sp_port_vlan_bridge_leave -> mlxsw_sp_bridge_port_fdb_flush but that is exactly on the other branch of the "if (netif_is_vxlan(dev))" condition (and also, mlxsw_sp_bridge_port_fdb_flush flushes an externally-facing port, not really what I needed to know, see below). Anyway, it also seems to me that we are referring to slightly different things by "foreign" interfaces. To me, a "foreign" interface is one towards which there is no hardware data path. Like for example if you have a mlxsw port in a plain L2 bridge with an Intel card. The data path is the CPU and that was my question: do you track FDB entries towards those interfaces (implicitly: towards the CPU)? You've answered about VXLAN, which is quite not "foreign" in the sense I am thinking about, because mlxsw does have a hardware data path towards a VXLAN interface (as you've mentioned, it associates a VID with each VNI). I've been searching through the mlxsw driver and I don't see that this is being done, so I'm guessing you might wonder/ask why you would want to do that in the first place. If you bridge a mlxsw port with an Intel card, then (from another thread where you've said that mlxsw always injects control packets where hardware learning is not performed) my guess is that the MAC addresses learned on the Intel bridge port will never be learned on the mlxsw device. So every packet that ingresses the mlxsw and must egress the Intel card will reach the CPU through flooding (and will consequently be flooded in the entire broadcast domain of the mlxsw side of the bridge). Right?
On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote: > On 20/08/2021 20:06, Vladimir Oltean wrote: > > On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote: > >> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote: > >>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > >>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > >>>>> Problem statement: > >>>>> > >>>>> Any time a driver needs to create a private association between a bridge > >>>>> upper interface and use that association within its > >>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > >>>>> entries deleted by the bridge when the port leaves. The issue is that > >>>>> all switchdev drivers schedule a work item to have sleepable context, > >>>>> and that work item can be actually scheduled after the port has left the > >>>>> bridge, which means the association might have already been broken by > >>>>> the time the scheduled FDB work item attempts to use it. > >>>> > >>>> This is handled in mlxsw by telling the device to flush the FDB entries > >>>> pointing to the {port, FID} when the VLAN is deleted (synchronously). > >>> > >>> Again, central solution vs mlxsw solution. > >> > >> Again, a solution is forced on everyone regardless if it benefits them > >> or not. List is bombarded with version after version until patches are > >> applied. *EXHAUSTING*. > > > > So if I replace "bombarded" with a more neutral word, isn't that how > > it's done though? What would you do if you wanted to achieve something > > but the framework stood in your way? Would you work around it to avoid > > bombarding the list? > > > >> With these patches, except DSA, everyone gets another queue_work() for > >> each FDB entry. In some cases, it completely misses the purpose of the > >> patchset. > > > > I also fail to see the point. Patch 3 will have to make things worse > > before they get better. It is like that in DSA too, and made more > > reasonable only in the last patch from the series. > > > > If I saw any middle-ground way, like keeping the notifiers on the atomic > > chain for unconverted drivers, I would have done it. But what do you do > > if more than one driver listens for one event, one driver wants it > > blocking, the other wants it atomic. Do you make the bridge emit it > > twice? That's even worse than having one useless queue_work() in some > > drivers. > > > > So if you think I can avoid that please tell me how. > > > > Hi, > I don't like the double-queuing for each fdb for everyone either, it's forcing them > to rework it asap due to inefficiency even though that shouldn't be necessary. In the > long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually. The fundamental problem is that these operations need to be deferred in the first place. It would have been much better if user space could get a synchronous feedback. It all stems from the fact that control plane operations need to be done under a spin lock because the shared databases (e.g., FDB, MDB) or states (e.g., STP) that they are updating can also be updated from the data plane in softIRQ. I don't have a clean solution to this problem without doing a surgery in the bridge driver. Deferring updates from the data plane using a work queue and converting the spin locks to mutexes. This will also allow us to emit netlink notifications from process context and convert GFP_ATOMIC to GFP_KERNEL. Is that something you consider as acceptable? 
Does anybody have a better idea? > For most drivers this is introducing more work (as in processing) rather than helping > them right now, give them the option to convert to it on their own accord or bite > the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking > I don't see why not convert everyone to just execute their otherwise queued work. > I'm sure driver maintainers would appreciate such help and would test and review it. You're > halfway there already.. > > Cheers, > Nik > > > > > > > >
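[ Purely to illustrate the direction being discussed (not actual or proposed bridge code), a sketch of the general pattern of deferring data-plane learning to process context so the shared database can sit behind a mutex; every name here is invented. ]

#include <linux/etherdevice.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct example_learn_req {
	struct list_head list;
	unsigned char addr[ETH_ALEN];
	u16 vid;
};

struct example_db {
	struct mutex lock;		/* protects @entries, process context only */
	struct list_head entries;
	spinlock_t pending_lock;	/* protects @pending, taken from softirq */
	struct list_head pending;
	struct work_struct work;
};

/* Data path (softirq): only record the request and defer the real update */
static void example_learn_from_datapath(struct example_db *db,
					const unsigned char *addr, u16 vid)
{
	struct example_learn_req *req = kzalloc(sizeof(*req), GFP_ATOMIC);

	if (!req)
		return;

	ether_addr_copy(req->addr, addr);
	req->vid = vid;

	spin_lock(&db->pending_lock);
	list_add_tail(&req->list, &db->pending);
	spin_unlock(&db->pending_lock);

	schedule_work(&db->work);
}

/* Process context: apply pending requests under a sleepable lock, where
 * GFP_KERNEL, blocking notifiers and extack become possible.
 */
static void example_learn_work(struct work_struct *work)
{
	struct example_db *db = container_of(work, struct example_db, work);
	struct example_learn_req *req, *tmp;
	LIST_HEAD(pending);

	spin_lock_bh(&db->pending_lock);
	list_splice_init(&db->pending, &pending);
	spin_unlock_bh(&db->pending_lock);

	mutex_lock(&db->lock);
	list_for_each_entry_safe(req, tmp, &pending, list) {
		/* ... update the database, emit notifications ... */
		list_del(&req->list);
		kfree(req);
	}
	mutex_unlock(&db->lock);
}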
On Sat, Aug 21, 2021 at 10:09:14PM +0300, Vladimir Oltean wrote: > On Fri, Aug 20, 2021 at 07:11:15PM +0300, Ido Schimmel wrote: > > On Fri, Aug 20, 2021 at 01:49:48PM +0300, Vladimir Oltean wrote: > > > On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > > > > On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > > > > > Problem statement: > > > > > > > > > > Any time a driver needs to create a private association between a bridge > > > > > upper interface and use that association within its > > > > > SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > > > > > entries deleted by the bridge when the port leaves. The issue is that > > > > > all switchdev drivers schedule a work item to have sleepable context, > > > > > and that work item can be actually scheduled after the port has left the > > > > > bridge, which means the association might have already been broken by > > > > > the time the scheduled FDB work item attempts to use it. > > > > > > > > This is handled in mlxsw by telling the device to flush the FDB entries > > > > pointing to the {port, FID} when the VLAN is deleted (synchronously). > > > > > > If you have FDB entries pointing to bridge ports that are foreign > > > interfaces and you offload them, do you catch the VLAN deletion on the > > > foreign port and flush your entries towards it at that time? > > > > Yes, that's how VXLAN offload works. VLAN addition is used to determine > > the mapping between VNI and VLAN. > > I was only able to follow as far as: > > mlxsw_sp_switchdev_blocking_event > -> mlxsw_sp_switchdev_handle_vxlan_obj_del > -> mlxsw_sp_switchdev_vxlan_vlans_del > -> mlxsw_sp_switchdev_vxlan_vlan_del > -> ??? where are the FDB entries flushed? mlxsw_sp_switchdev_blocking_event -> mlxsw_sp_switchdev_handle_vxlan_obj_del -> mlxsw_sp_switchdev_vxlan_vlans_del -> mlxsw_sp_switchdev_vxlan_vlan_del -> mlxsw_sp_bridge_vxlan_leave -> mlxsw_sp_nve_fid_disable -> mlxsw_sp_nve_fdb_flush_by_fid > > I was expecting to see something along the lines of > > mlxsw_sp_switchdev_blocking_event > -> mlxsw_sp_port_vlans_del > -> mlxsw_sp_bridge_port_vlan_del > -> mlxsw_sp_port_vlan_bridge_leave > -> mlxsw_sp_bridge_port_fdb_flush > > but that is exactly on the other branch of the "if (netif_is_vxlan(dev))" > condition (and also, mlxsw_sp_bridge_port_fdb_flush flushes an externally-facing > port, not really what I needed to know, see below). > > Anyway, it also seems to me that we are referring to slightly different > things by "foreign" interfaces. To me, a "foreign" interface is one > towards which there is no hardware data path. Like for example if you > have a mlxsw port in a plain L2 bridge with an Intel card. The data path > is the CPU and that was my question: do you track FDB entries towards > those interfaces (implicitly: towards the CPU)? You've answered about > VXLAN, which is quite not "foreign" in the sense I am thinking about, > because mlxsw does have a hardware data path towards a VXLAN interface > (as you've mentioned, it associates a VID with each VNI). > > I've been searching through the mlxsw driver and I don't see that this > is being done, so I'm guessing you might wonder/ask why you would want > to do that in the first place. If you bridge a mlxsw port with an Intel > card, then (from another thread where you've said that mlxsw always > injects control packets where hardware learning is not performed) my > guess is that the MAC addresses learned on the Intel bridge port will > never be learned on the mlxsw device. 
So every packet that ingresses the > mlxsw and must egress the Intel card will reach the CPU through flooding > (and will consequently be flooded in the entire broadcast domain of the > mlxsw side of the bridge). Right? I can see how this use case makes sense on systems where the difference in performance between the ASIC and the CPU is not huge, but it doesn't make much sense with Spectrum and I have yet to get requests to support it (might change). Keep in mind that Spectrum is able to forward several Bpps with a switching capacity of several Tbps. It is usually connected to a weak CPU (e.g., low-end ARM, Intel Atom) through a PCI bus with a bandwidth of several Gbps. There is usually one "Intel card" on such systems which is connected to the management network that is separated from the data plane network. If we were to support it, FDB entries towards "foreign" interfaces would be programmed to trap packets to the CPU. For now, for correctness / rigor purposes, I would prefer simply returning an error / warning via extack when such topologies are configured.
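[ A hedged sketch of the "reject with extack" option mentioned above, e.g. from a driver's PRECHANGEUPPER handler. example_port_dev_check() and example_bridge_has_foreign_ports() are hypothetical helpers, not existing APIs. ]

#include <linux/if_bridge.h>
#include <linux/netdevice.h>
#include <linux/netlink.h>

/* Hypothetical driver helpers, declared only so the sketch is self-contained. */
bool example_port_dev_check(const struct net_device *dev);
bool example_bridge_has_foreign_ports(const struct net_device *br_dev);

static int example_netdevice_event(struct notifier_block *nb,
				   unsigned long event, void *ptr)
{
	struct netdev_notifier_changeupper_info *info = ptr;
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	struct netlink_ext_ack *extack;

	if (event != NETDEV_PRECHANGEUPPER || !example_port_dev_check(dev))
		return NOTIFY_DONE;

	extack = netdev_notifier_info_to_extack(&info->info);

	/* Refuse the topology instead of silently software-forwarding it */
	if (info->linking && netif_is_bridge_master(info->upper_dev) &&
	    example_bridge_has_foreign_ports(info->upper_dev)) {
		NL_SET_ERR_MSG_MOD(extack,
				   "Bridging with foreign interfaces is not offloadable");
		return notifier_from_errno(-EOPNOTSUPP);
	}

	return NOTIFY_DONE;
}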
On 22/08/2021 09:48, Ido Schimmel wrote: > On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote: >> On 20/08/2021 20:06, Vladimir Oltean wrote: >>> On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote: >>>> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote: >>>>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: >>>>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: >>>>>>> Problem statement: >>>>>>> >>>>>>> Any time a driver needs to create a private association between a bridge >>>>>>> upper interface and use that association within its >>>>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB >>>>>>> entries deleted by the bridge when the port leaves. The issue is that >>>>>>> all switchdev drivers schedule a work item to have sleepable context, >>>>>>> and that work item can be actually scheduled after the port has left the >>>>>>> bridge, which means the association might have already been broken by >>>>>>> the time the scheduled FDB work item attempts to use it. >>>>>> >>>>>> This is handled in mlxsw by telling the device to flush the FDB entries >>>>>> pointing to the {port, FID} when the VLAN is deleted (synchronously). >>>>> >>>>> Again, central solution vs mlxsw solution. >>>> >>>> Again, a solution is forced on everyone regardless if it benefits them >>>> or not. List is bombarded with version after version until patches are >>>> applied. *EXHAUSTING*. >>> >>> So if I replace "bombarded" with a more neutral word, isn't that how >>> it's done though? What would you do if you wanted to achieve something >>> but the framework stood in your way? Would you work around it to avoid >>> bombarding the list? >>> >>>> With these patches, except DSA, everyone gets another queue_work() for >>>> each FDB entry. In some cases, it completely misses the purpose of the >>>> patchset. >>> >>> I also fail to see the point. Patch 3 will have to make things worse >>> before they get better. It is like that in DSA too, and made more >>> reasonable only in the last patch from the series. >>> >>> If I saw any middle-ground way, like keeping the notifiers on the atomic >>> chain for unconverted drivers, I would have done it. But what do you do >>> if more than one driver listens for one event, one driver wants it >>> blocking, the other wants it atomic. Do you make the bridge emit it >>> twice? That's even worse than having one useless queue_work() in some >>> drivers. >>> >>> So if you think I can avoid that please tell me how. >>> >> >> Hi, >> I don't like the double-queuing for each fdb for everyone either, it's forcing them >> to rework it asap due to inefficiency even though that shouldn't be necessary. In the >> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually. > > The fundamental problem is that these operations need to be deferred in > the first place. It would have been much better if user space could get > a synchronous feedback. > > It all stems from the fact that control plane operations need to be done > under a spin lock because the shared databases (e.g., FDB, MDB) or > states (e.g., STP) that they are updating can also be updated from the > data plane in softIRQ. > Right, but changing that, as you've noted below, would require moving the delaying to the bridge, I'd like to avoid that. > I don't have a clean solution to this problem without doing a surgery in > the bridge driver. 
Deferring updates from the data plane using a work > queue and converting the spin locks to mutexes. This will also allow us > to emit netlink notifications from process context and convert > GFP_ATOMIC to GFP_KERNEL. > > Is that something you consider as acceptable? Does anybody have a better > idea? > Moving the delays to the bridge for this purpose does not sound like a good solution, I'd prefer the delaying to be done by the interested third party as in this case rather than the bridge. If there's a solution that avoids delaying and doesn't hurt the software fast-path then of course I'll be ok with that. >> For most drivers this is introducing more work (as in processing) rather than helping >> them right now, give them the option to convert to it on their own accord or bite >> the bullet and convert everyone so the change won't affect them, it holds rtnl, it is blocking >> I don't see why not convert everyone to just execute their otherwise queued work. >> I'm sure driver maintainers would appreciate such help and would test and review it. You're >> halfway there already.. >> >> Cheers, >> Nik >> >> >> >> >> >> >> >>
On Sun, Aug 22, 2021 at 12:12:02PM +0300, Nikolay Aleksandrov wrote: > On 22/08/2021 09:48, Ido Schimmel wrote: > > On Sat, Aug 21, 2021 at 02:36:26AM +0300, Nikolay Aleksandrov wrote: > >> On 20/08/2021 20:06, Vladimir Oltean wrote: > >>> On Fri, Aug 20, 2021 at 07:09:18PM +0300, Ido Schimmel wrote: > >>>> On Fri, Aug 20, 2021 at 12:37:23PM +0300, Vladimir Oltean wrote: > >>>>> On Fri, Aug 20, 2021 at 12:16:10PM +0300, Ido Schimmel wrote: > >>>>>> On Thu, Aug 19, 2021 at 07:07:18PM +0300, Vladimir Oltean wrote: > >>>>>>> Problem statement: > >>>>>>> > >>>>>>> Any time a driver needs to create a private association between a bridge > >>>>>>> upper interface and use that association within its > >>>>>>> SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE handler, we have an issue with FDB > >>>>>>> entries deleted by the bridge when the port leaves. The issue is that > >>>>>>> all switchdev drivers schedule a work item to have sleepable context, > >>>>>>> and that work item can be actually scheduled after the port has left the > >>>>>>> bridge, which means the association might have already been broken by > >>>>>>> the time the scheduled FDB work item attempts to use it. > >>>>>> > >>>>>> This is handled in mlxsw by telling the device to flush the FDB entries > >>>>>> pointing to the {port, FID} when the VLAN is deleted (synchronously). > >>>>> > >>>>> Again, central solution vs mlxsw solution. > >>>> > >>>> Again, a solution is forced on everyone regardless if it benefits them > >>>> or not. List is bombarded with version after version until patches are > >>>> applied. *EXHAUSTING*. > >>> > >>> So if I replace "bombarded" with a more neutral word, isn't that how > >>> it's done though? What would you do if you wanted to achieve something > >>> but the framework stood in your way? Would you work around it to avoid > >>> bombarding the list? > >>> > >>>> With these patches, except DSA, everyone gets another queue_work() for > >>>> each FDB entry. In some cases, it completely misses the purpose of the > >>>> patchset. > >>> > >>> I also fail to see the point. Patch 3 will have to make things worse > >>> before they get better. It is like that in DSA too, and made more > >>> reasonable only in the last patch from the series. > >>> > >>> If I saw any middle-ground way, like keeping the notifiers on the atomic > >>> chain for unconverted drivers, I would have done it. But what do you do > >>> if more than one driver listens for one event, one driver wants it > >>> blocking, the other wants it atomic. Do you make the bridge emit it > >>> twice? That's even worse than having one useless queue_work() in some > >>> drivers. > >>> > >>> So if you think I can avoid that please tell me how. > >>> > >> > >> Hi, > >> I don't like the double-queuing for each fdb for everyone either, it's forcing them > >> to rework it asap due to inefficiency even though that shouldn't be necessary. In the > >> long run I hope everyone would migrate to such scheme, but perhaps we can do it gradually. > > > > The fundamental problem is that these operations need to be deferred in > > the first place. It would have been much better if user space could get > > a synchronous feedback. > > > > It all stems from the fact that control plane operations need to be done > > under a spin lock because the shared databases (e.g., FDB, MDB) or > > states (e.g., STP) that they are updating can also be updated from the > > data plane in softIRQ. 
> > > > Right, but changing that, as you've noted below, would require moving > the delaying to the bridge, I'd like to avoid that. > > > I don't have a clean solution to this problem without doing a surgery in > > the bridge driver. Deferring updates from the data plane using a work > > queue and converting the spin locks to mutexes. This will also allow us > > to emit netlink notifications from process context and convert > > GFP_ATOMIC to GFP_KERNEL. > > > > Is that something you consider as acceptable? Does anybody have a better > > idea? > > > > Moving the delays to the bridge for this purpose does not sound like a good solution, > I'd prefer the delaying to be done by the interested third party as in this case rather > than the bridge. If there's a solution that avoids delaying and doesn't hurt the software > fast-path then of course I'll be ok with that. Maybe emitting two notifiers, one atomic and one blocking, per FDB add/del event is not such a stupid idea after all. Here's an alternative I've been cooking. Obviously it still has pros and cons. Hopefully by reading the commit message you get the basic idea and I don't need to post the full series. -----------------------------[ cut here ]----------------------------- From 9870699f0fafeb6175af3462173a957ece551322 Mon Sep 17 00:00:00 2001 From: Vladimir Oltean <vladimir.oltean@nxp.com> Date: Sat, 21 Aug 2021 15:57:40 +0300 Subject: [PATCH] net: switchdev: add an option for SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE to be deferred Most existing switchdev drivers either talk to firmware, or to a device over a bus where the I/O is sleepable (SPI, I2C, MDIO etc). So there exists a pattern where drivers make a sleepable context for offloading the given FDB entry by registering an ordered workqueue and scheduling work items on it, and doing all the work from there. This solution works, but there are some issues with it: 1. It creates large amounts of duplication between switchdev drivers, and they don't even copy all the right patterns from each other. For example: * DSA, dpaa2-switch, rocker allocate an ordered workqueue with the WQ_MEM_RECLAIM flag and no one knows why. * dpaa2-switch, sparx5, am65_cpsw, cpsw, rocker, prestera, all have this check, or one very similar to it: if (!fdb_info->added_by_user || fdb_info->is_local) break; /* do nothing and exit */ within the actually scheduled workqueue item. That is to say, they schedule and take the rtnl_mutex for nothing - every single time that an FDB entry is dynamically learned by the software bridge and they are not interested in it. Same thing for the *_dev_check function - the function which checks if an FDB entry was learned on a network interface owned by the driver. 2. The work items scheduled privately by the driver are not synchronous with bridge events (i.e. the bridge will not wait for the driver to finish deleting an FDB entry before e.g. calling del_nbp and deleting that interface as a bridge port). This might matter for middle layers like DSA which construct their own API to their downstream consumers on top of the switchdev primitives. With the current switchdev API design, it is not possible to guarantee that the bridge which generated an FDB entry deletion is still the upper interface by the time that the work item is scheduled and the FDB deletion is actually executed. To obtain this guarantee it would be necessary to introduce a refcounting system where the reference to the bridge is kept by DSA for as long as there are pending bridge FDB entry additions/deletions. 
Not really ideal if we look at the big picture.

3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are deferred by drivers even from code paths that are initially blocking (are running in process context):

br_fdb_add -> __br_fdb_add -> fdb_add_entry -> fdb_notify -> br_switchdev_fdb_notify

It seems fairly trivial to move the fdb_notify call outside of the atomic section of fdb_add_entry, but with switchdev offering only an API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would still have to defer these events and are unable to provide synchronous feedback to user space (error codes, extack).

The above issues would warrant an attempt to fix a central problem, and make switchdev expose an API that is easier to consume rather than having drivers implement lateral workarounds.

In this case, we must notice that

(a) switchdev already has the concept of notifiers emitted from the fast path that are still processed by drivers from blocking context. This is accomplished through the SWITCHDEV_F_DEFER flag which is used by e.g. SWITCHDEV_OBJ_ID_HOST_MDB.

(b) the bridge del_nbp() function already calls switchdev_deferred_process(). So if we could hook into that, we could have a chance that the bridge simply waits for our FDB entry offloading procedure to finish before it calls netdev_upper_dev_unlink() - which is almost immediately afterwards, and also when switchdev drivers typically break their stateful associations between the bridge upper and private data.

So it is in fact possible to use switchdev's generic switchdev_deferred_enqueue mechanism to get a sleepable callback, and from there we can call_switchdev_blocking_notifiers().

To address all requirements:

- drivers that are unconverted from atomic to blocking still work
- drivers that currently have a private workqueue are not worse off
- drivers that want the bridge to wait for their deferred work can use the bridge's defer mechanism
- a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested parties does not get deferred for no reason, because this takes the rtnl_mutex and schedules a worker thread for nothing

it looks like we can in fact start off by emitting SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in struct switchdev_notifier_fdb_info called "needs_defer", and any interested party can set this to true.

This way:

- unconverted drivers do their work (i.e. schedule their private work item) based on the atomic notifier, and do not set "needs_defer"
- converted drivers only mark "needs_defer" and treat a separate notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED
- SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED

Additionally, code paths that are blocking right now, like br_fdb_replay, could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all consumers of the replayed FDB events support that (right now, that is DSA and dpaa2-switch).

Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set needs_defer as appropriate, then the notifiers emitted from process context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED directly, and we would also have fully blocking context all the way down, with the opportunity for error propagation and extack.
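[ To illustrate how a converted driver might consume the proposed needs_defer mechanism, here is a hedged sketch. The switchdev names follow the patch below; example_dev_check() and example_program_fdb() are invented stand-ins for a driver's own helpers. ]

#include <linux/netdevice.h>
#include <net/switchdev.h>

/* Hypothetical driver helpers, declared only so the sketch is self-contained. */
bool example_dev_check(const struct net_device *dev);
int example_program_fdb(struct net_device *dev, const unsigned char *addr,
			u16 vid, bool add);

/* Atomic chain: only decide whether the entry is interesting and, if so,
 * ask switchdev to replay it on the blocking chain via needs_defer.
 */
static int example_switchdev_event(struct notifier_block *nb,
				   unsigned long event, void *ptr)
{
	struct switchdev_notifier_info *info = ptr;
	struct net_device *dev = switchdev_notifier_info_to_dev(info);
	struct switchdev_notifier_fdb_info *fdb_info;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		fdb_info = container_of(info,
					struct switchdev_notifier_fdb_info,
					info);
		if (!fdb_info->added_by_user || fdb_info->is_local)
			return NOTIFY_DONE;
		if (!example_dev_check(dev))
			return NOTIFY_DONE;
		fdb_info->needs_defer = true;
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}

/* Blocking chain: sleepable context, rtnl held by switchdev per the commit
 * message, so the hardware can be programmed directly.
 */
static int example_switchdev_blocking_event(struct notifier_block *nb,
					    unsigned long event, void *ptr)
{
	struct switchdev_notifier_info *info = ptr;
	struct net_device *dev = switchdev_notifier_info_to_dev(info);
	struct switchdev_notifier_fdb_info *fdb_info;
	int err;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED:
	case SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED:
		if (!example_dev_check(dev))
			return NOTIFY_DONE;
		fdb_info = container_of(info,
					struct switchdev_notifier_fdb_info,
					info);
		err = example_program_fdb(dev, fdb_info->addr, fdb_info->vid,
					  event == SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED);
		return notifier_from_errno(err);
	default:
		return NOTIFY_DONE;
	}
}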
Some disadvantages of this solution though: - A driver now needs to check whether it is interested in an event twice: first on the atomic call chain, then again on the blocking call chain (because it is a notifier chain, it is potentially not the only driver subscribed to it, it may be listening to another driver's needs_defer request). The flip side: on sistems with mixed switchdev setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB entries on foreign interfaces), there are some "synergies", and the SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as opposed to what would happen if each driver scheduled its own private work item. - Right now drivers take rtnl_lock() as soon as their private work item runs. They need the rtnl_lock for the call to call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). But it doesn't really seem necessary for them to perform the actual hardware manipulation (adding the FDB entry) with the rtnl_lock held (anyway most do that). But with the new option of servicing SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken top-level by switchdev, so even if these drivers wanted to be more self-conscious, they couldn't. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> --- include/net/switchdev.h | 26 ++++++++++++++- net/bridge/br_switchdev.c | 6 ++-- net/switchdev/switchdev.c | 69 +++++++++++++++++++++++++++++++++++++++ 3 files changed, 96 insertions(+), 5 deletions(-) diff --git a/include/net/switchdev.h b/include/net/switchdev.h index 6764fb7692e2..67ddb80c828f 100644 --- a/include/net/switchdev.h +++ b/include/net/switchdev.h @@ -193,6 +193,8 @@ enum switchdev_notifier_type { SWITCHDEV_FDB_DEL_TO_BRIDGE, SWITCHDEV_FDB_ADD_TO_DEVICE, SWITCHDEV_FDB_DEL_TO_DEVICE, + SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, /* Blocking. */ + SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED, /* Blocking. 
*/ SWITCHDEV_FDB_OFFLOADED, SWITCHDEV_FDB_FLUSH_TO_BRIDGE, @@ -222,7 +224,8 @@ struct switchdev_notifier_fdb_info { u16 vid; u8 added_by_user:1, is_local:1, - offloaded:1; + offloaded:1, + needs_defer:1; }; struct switchdev_notifier_port_obj_info { @@ -283,6 +286,13 @@ int switchdev_port_obj_add(struct net_device *dev, int switchdev_port_obj_del(struct net_device *dev, const struct switchdev_obj *obj); +int +switchdev_fdb_add_to_device(struct net_device *dev, + struct switchdev_notifier_fdb_info *fdb_info); +int +switchdev_fdb_del_to_device(struct net_device *dev, + struct switchdev_notifier_fdb_info *fdb_info); + int register_switchdev_notifier(struct notifier_block *nb); int unregister_switchdev_notifier(struct notifier_block *nb); int call_switchdev_notifiers(unsigned long val, struct net_device *dev, @@ -386,6 +396,20 @@ static inline int switchdev_port_obj_del(struct net_device *dev, return -EOPNOTSUPP; } +static inline int +switchdev_fdb_add_to_device(struct net_device *dev, + struct switchdev_notifier_fdb_info *fdb_info) +{ + return -EOPNOTSUPP; +} + +static inline int +switchdev_fdb_del_to_device(struct net_device *dev, + struct switchdev_notifier_fdb_info *fdb_info) +{ + return -EOPNOTSUPP; +} + static inline int register_switchdev_notifier(struct notifier_block *nb) { return 0; diff --git a/net/bridge/br_switchdev.c b/net/bridge/br_switchdev.c index 7e62904089c8..687100ca7088 100644 --- a/net/bridge/br_switchdev.c +++ b/net/bridge/br_switchdev.c @@ -140,12 +140,10 @@ br_switchdev_fdb_notify(struct net_bridge *br, switch (type) { case RTM_DELNEIGH: - call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE, - dev, &info.info, NULL); + switchdev_fdb_del_to_device(dev, &info); break; case RTM_NEWNEIGH: - call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE, - dev, &info.info, NULL); + switchdev_fdb_add_to_device(dev, &info); break; } } diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c index 0b2c18efc079..d2f0bfc8a0b4 100644 --- a/net/switchdev/switchdev.c +++ b/net/switchdev/switchdev.c @@ -378,6 +378,75 @@ int call_switchdev_blocking_notifiers(unsigned long val, struct net_device *dev, } EXPORT_SYMBOL_GPL(call_switchdev_blocking_notifiers); +static void switchdev_fdb_add_deferred(struct net_device *dev, const void *data) +{ + const struct switchdev_notifier_fdb_info *fdb_info = data; + struct switchdev_notifier_fdb_info tmp = *fdb_info; + int err; + + ASSERT_RTNL(); + err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, + dev, &tmp.info, NULL); + err = notifier_to_errno(err); + if (err && err != -EOPNOTSUPP) + netdev_err(dev, "failed to add FDB entry: %pe\n", ERR_PTR(err)); +} + +static void switchdev_fdb_del_deferred(struct net_device *dev, const void *data) +{ + const struct switchdev_notifier_fdb_info *fdb_info = data; + struct switchdev_notifier_fdb_info tmp = *fdb_info; + int err; + + ASSERT_RTNL(); + err = call_switchdev_blocking_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED, + dev, &tmp.info, NULL); + err = notifier_to_errno(err); + if (err && err != -EOPNOTSUPP) + netdev_err(dev, "failed to delete FDB entry: %pe\n", + ERR_PTR(err)); +} + +int +switchdev_fdb_add_to_device(struct net_device *dev, + struct switchdev_notifier_fdb_info *fdb_info) +{ + int err; + + err = call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_DEVICE, dev, + &fdb_info->info, NULL); + err = notifier_to_errno(err); + if (err) + return err; + + if (!fdb_info->needs_defer) + return 0; + + return switchdev_deferred_enqueue(dev, fdb_info, sizeof(*fdb_info), + 
switchdev_fdb_add_deferred); +} +EXPORT_SYMBOL_GPL(switchdev_fdb_add_to_device); + +int +switchdev_fdb_del_to_device(struct net_device *dev, + struct switchdev_notifier_fdb_info *fdb_info) +{ + int err; + + err = call_switchdev_notifiers(SWITCHDEV_FDB_DEL_TO_DEVICE, dev, + &fdb_info->info, NULL); + err = notifier_to_errno(err); + if (err) + return err; + + if (!fdb_info->needs_defer) + return 0; + + return switchdev_deferred_enqueue(dev, fdb_info, sizeof(*fdb_info), + switchdev_fdb_del_deferred); +} +EXPORT_SYMBOL_GPL(switchdev_fdb_del_to_device); + struct switchdev_nested_priv { bool (*check_cb)(const struct net_device *dev); bool (*foreign_dev_check_cb)(const struct net_device *dev, -----------------------------[ cut here ]-----------------------------
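To make the driver side of the conversion more concrete, here is a minimal sketch of what the two notifier handlers of a converted driver might look like under this proposal. All foo_* names (foo_port_dev_check, foo_port_fdb_add, foo_port_fdb_del) are made up for illustration and are not part of the patch:

/* Sketch only: a converted driver under the proposal above. The atomic
 * handler just filters and sets "needs_defer"; the blocking handler runs
 * in sleepable context (rtnl_mutex held by switchdev) and can program the
 * hardware directly.
 */
static int foo_switchdev_event(struct notifier_block *nb,
			       unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		if (!foo_port_dev_check(dev))
			return NOTIFY_DONE;
		if (!fdb_info->added_by_user || fdb_info->is_local)
			return NOTIFY_DONE;
		/* Ask switchdev to replay this event on the blocking chain. */
		fdb_info->needs_defer = true;
		return NOTIFY_OK;
	default:
		return NOTIFY_DONE;
	}
}

static int foo_switchdev_blocking_event(struct notifier_block *nb,
					unsigned long event, void *ptr)
{
	struct net_device *dev = switchdev_notifier_info_to_dev(ptr);
	struct switchdev_notifier_fdb_info *fdb_info = ptr;
	int err;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED:
	case SWITCHDEV_FDB_DEL_TO_DEVICE_DEFERRED:
		break;
	default:
		return NOTIFY_DONE;
	}

	if (!foo_port_dev_check(dev))
		return NOTIFY_DONE;

	/* Sleepable context: talk to the hardware without a private workqueue. */
	if (event == SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED)
		err = foo_port_fdb_add(dev, fdb_info->addr, fdb_info->vid);
	else
		err = foo_port_fdb_del(dev, fdb_info->addr, fdb_info->vid);

	return notifier_from_errno(err);
}

The filtering on the atomic chain is what keeps dynamically learned entries that nobody is interested in from taking the rtnl_mutex or scheduling any deferred work at all.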
On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote: > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are > deferred by drivers even from code paths that are initially blocking > (are running in process context): > > br_fdb_add > -> __br_fdb_add > -> fdb_add_entry > -> fdb_notify > -> br_switchdev_fdb_notify > > It seems fairly trivial to move the fdb_notify call outside of the > atomic section of fdb_add_entry, but with switchdev offering only an > API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would > still have to defer these events and are unable to provide > synchronous feedback to user space (error codes, extack). > > The above issues would warrant an attempt to fix a central problem, and > make switchdev expose an API that is easier to consume rather than > having drivers implement lateral workarounds. > > In this case, we must notice that > > (a) switchdev already has the concept of notifiers emitted from the fast > path that are still processed by drivers from blocking context. This > is accomplished through the SWITCHDEV_F_DEFER flag which is used by > e.g. SWITCHDEV_OBJ_ID_HOST_MDB. > > (b) the bridge del_nbp() function already calls switchdev_deferred_process(). > So if we could hook into that, we could have a chance that the > bridge simply waits for our FDB entry offloading procedure to finish > before it calls netdev_upper_dev_unlink() - which is almost > immediately afterwards, and also when switchdev drivers typically > break their stateful associations between the bridge upper and > private data. > > So it is in fact possible to use switchdev's generic > switchdev_deferred_enqueue mechanism to get a sleepable callback, and > from there we can call_switchdev_blocking_notifiers(). > > To address all requirements: > > - drivers that are unconverted from atomic to blocking still work > - drivers that currently have a private workqueue are not worse off > - drivers that want the bridge to wait for their deferred work can use > the bridge's defer mechanism > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested > parties does not get deferred for no reason, because this takes the > rtnl_mutex and schedules a worker thread for nothing > > it looks like we can in fact start off by emitting > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in > struct switchdev_notifier_fdb_info called "needs_defer", and any > interested party can set this to true. > > This way: > > - unconverted drivers do their work (i.e. schedule their private work > item) based on the atomic notifier, and do not set "needs_defer" > - converted drivers only mark "needs_defer" and treat a separate > notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not > generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > Additionally, code paths that are blocking right not, like br_fdb_replay, > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all > consumers of the replayed FDB events support that (right now, that is > DSA and dpaa2-switch). > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > needs_defer as appropriate, then the notifiers emitted from process > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > directly, and we would also have fully blocking context all the way > down, with the opportunity for error propagation and extack. 
IIUC, at this stage all the FDB notifications drivers get are blocking, either from the work queue (because they were deferred) or directly from process context. If so, how do we synchronize the two and ensure drivers get the notifications at the correct order? I was thinking of adding all the notifications to the 'deferred' list when 'hash_lock' is held and then calling switchdev_deferred_process() directly in process context. It's not very pretty (do we return an error only for the entry the user added or for any other entry we flushed from the list?), but I don't have a better idea right now. > > Some disadvantages of this solution though: > > - A driver now needs to check whether it is interested in an event > twice: first on the atomic call chain, then again on the blocking call > chain (because it is a notifier chain, it is potentially not the only > driver subscribed to it, it may be listening to another driver's > needs_defer request). The flip side: on sistems with mixed switchdev > setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB > entries on foreign interfaces), there are some "synergies", and the > SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as > opposed to what would happen if each driver scheduled its own private > work item. > > - Right now drivers take rtnl_lock() as soon as their private work item > runs. They need the rtnl_lock for the call to > call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). I think RCU is enough? > But it doesn't really seem necessary for them to perform the actual > hardware manipulation (adding the FDB entry) with the rtnl_lock held > (anyway most do that). But with the new option of servicing > SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken > top-level by switchdev, so even if these drivers wanted to be more > self-conscious, they couldn't. Yes, I want to remove this dependency in mlxsw assuming notifications remain atomic. The more pressing issue is actually removing it from the learning path.
On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote: > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote: > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are > > deferred by drivers even from code paths that are initially blocking > > (are running in process context): > > > > br_fdb_add > > -> __br_fdb_add > > -> fdb_add_entry > > -> fdb_notify > > -> br_switchdev_fdb_notify > > > > It seems fairly trivial to move the fdb_notify call outside of the > > atomic section of fdb_add_entry, but with switchdev offering only an > > API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would > > still have to defer these events and are unable to provide > > synchronous feedback to user space (error codes, extack). > > > > The above issues would warrant an attempt to fix a central problem, and > > make switchdev expose an API that is easier to consume rather than > > having drivers implement lateral workarounds. > > > > In this case, we must notice that > > > > (a) switchdev already has the concept of notifiers emitted from the fast > > path that are still processed by drivers from blocking context. This > > is accomplished through the SWITCHDEV_F_DEFER flag which is used by > > e.g. SWITCHDEV_OBJ_ID_HOST_MDB. > > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process(). > > So if we could hook into that, we could have a chance that the > > bridge simply waits for our FDB entry offloading procedure to finish > > before it calls netdev_upper_dev_unlink() - which is almost > > immediately afterwards, and also when switchdev drivers typically > > break their stateful associations between the bridge upper and > > private data. > > > > So it is in fact possible to use switchdev's generic > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and > > from there we can call_switchdev_blocking_notifiers(). > > > > To address all requirements: > > > > - drivers that are unconverted from atomic to blocking still work > > - drivers that currently have a private workqueue are not worse off > > - drivers that want the bridge to wait for their deferred work can use > > the bridge's defer mechanism > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested > > parties does not get deferred for no reason, because this takes the > > rtnl_mutex and schedules a worker thread for nothing > > > > it looks like we can in fact start off by emitting > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in > > struct switchdev_notifier_fdb_info called "needs_defer", and any > > interested party can set this to true. > > > > This way: > > > > - unconverted drivers do their work (i.e. schedule their private work > > item) based on the atomic notifier, and do not set "needs_defer" > > - converted drivers only mark "needs_defer" and treat a separate > > notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not > > generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > Additionally, code paths that are blocking right not, like br_fdb_replay, > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all > > consumers of the replayed FDB events support that (right now, that is > > DSA and dpaa2-switch). 
> > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > > needs_defer as appropriate, then the notifiers emitted from process > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > directly, and we would also have fully blocking context all the way > > down, with the opportunity for error propagation and extack. > > IIUC, at this stage all the FDB notifications drivers get are blocking, > either from the work queue (because they were deferred) or directly from > process context. If so, how do we synchronize the two and ensure drivers > get the notifications at the correct order? What does 'at this stage' mean? Does it mean 'assuming the patch we're discussing now gets accepted'? If that's what it means, then 'at this stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE, then would set needs_defer, then would receive the blocking FDB_ADD_TO_DEVICE. Thinking a bit more - this two-stage notification process ends up being more efficient for br_fdb_replay too. We don't queue up FDB entries except if the driver tells us that it wants to process them in blocking context. > I was thinking of adding all the notifications to the 'deferred' list > when 'hash_lock' is held and then calling switchdev_deferred_process() > directly in process context. It's not very pretty (do we return an error > only for the entry the user added or for any other entry we flushed from > the list?), but I don't have a better idea right now. I was thinking to add a switchdev_fdb_add_to_device_now(). As opposed to the switchdev_fdb_add_to_device() which defers, this does not defer at all but just call_blocking_switchdev_notifiers(). So it would not go through switchdev_deferred_enqueue. For the code path I talked above, we would temporarily drop the spin_lock, then call switchdev_fdb_add_to_device_now(), then if that fails, take the spin_lock again and delete the software fdb entry we've just added. So as long as we use a _now() variant and don't resynchronize with the deferred work, we shouldn't have any ordering issues, or am I misunderstanding your question? > > > > > Some disadvantages of this solution though: > > > > - A driver now needs to check whether it is interested in an event > > twice: first on the atomic call chain, then again on the blocking call > > chain (because it is a notifier chain, it is potentially not the only > > driver subscribed to it, it may be listening to another driver's > > needs_defer request). The flip side: on sistems with mixed switchdev > > setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB > > entries on foreign interfaces), there are some "synergies", and the > > SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as > > opposed to what would happen if each driver scheduled its own private > > work item. > > > > - Right now drivers take rtnl_lock() as soon as their private work item > > runs. They need the rtnl_lock for the call to > > call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). > > I think RCU is enough? Maybe, I haven't experimented with it. I thought br_fdb_offloaded_set would notify back rtnetlink, but it looks like it doesn't. > > But it doesn't really seem necessary for them to perform the actual > > hardware manipulation (adding the FDB entry) with the rtnl_lock held > > (anyway most do that). 
But with the new option of servicing > > SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken > > top-level by switchdev, so even if these drivers wanted to be more > > self-conscious, they couldn't. > > Yes, I want to remove this dependency in mlxsw assuming notifications > remain atomic. The more pressing issue is actually removing it from the > learning path. Bah, I understand where you're coming from, but it would be tricky to remove the rtnl_lock from switchdev_deferred_process_work (that's what it boils down to). My switchdev_handle_fdb_add_to_device helper currently assumes rcu_read_lock(). With the blocking variant of SWITCHDEV_FDB_ADD_TO_DEVICE, it would still need to traverse the netdev adjacency lists, so it would need the rtnl_mutex for that. If we remove the rtnl_lock from switchdev_deferred_process_work we'd have to add it back in DSA and to any other callers of switchdev_handle_fdb_add_to_device.
On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote: > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote: > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote: > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are > > > deferred by drivers even from code paths that are initially blocking > > > (are running in process context): > > > > > > br_fdb_add > > > -> __br_fdb_add > > > -> fdb_add_entry > > > -> fdb_notify > > > -> br_switchdev_fdb_notify > > > > > > It seems fairly trivial to move the fdb_notify call outside of the > > > atomic section of fdb_add_entry, but with switchdev offering only an > > > API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would > > > still have to defer these events and are unable to provide > > > synchronous feedback to user space (error codes, extack). > > > > > > The above issues would warrant an attempt to fix a central problem, and > > > make switchdev expose an API that is easier to consume rather than > > > having drivers implement lateral workarounds. > > > > > > In this case, we must notice that > > > > > > (a) switchdev already has the concept of notifiers emitted from the fast > > > path that are still processed by drivers from blocking context. This > > > is accomplished through the SWITCHDEV_F_DEFER flag which is used by > > > e.g. SWITCHDEV_OBJ_ID_HOST_MDB. > > > > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process(). > > > So if we could hook into that, we could have a chance that the > > > bridge simply waits for our FDB entry offloading procedure to finish > > > before it calls netdev_upper_dev_unlink() - which is almost > > > immediately afterwards, and also when switchdev drivers typically > > > break their stateful associations between the bridge upper and > > > private data. > > > > > > So it is in fact possible to use switchdev's generic > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and > > > from there we can call_switchdev_blocking_notifiers(). > > > > > > To address all requirements: > > > > > > - drivers that are unconverted from atomic to blocking still work > > > - drivers that currently have a private workqueue are not worse off > > > - drivers that want the bridge to wait for their deferred work can use > > > the bridge's defer mechanism > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested > > > parties does not get deferred for no reason, because this takes the > > > rtnl_mutex and schedules a worker thread for nothing > > > > > > it looks like we can in fact start off by emitting > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in > > > struct switchdev_notifier_fdb_info called "needs_defer", and any > > > interested party can set this to true. > > > > > > This way: > > > > > > - unconverted drivers do their work (i.e. 
schedule their private work > > > item) based on the atomic notifier, and do not set "needs_defer" > > > - converted drivers only mark "needs_defer" and treat a separate > > > notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not > > > generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > > > Additionally, code paths that are blocking right not, like br_fdb_replay, > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all > > > consumers of the replayed FDB events support that (right now, that is > > > DSA and dpaa2-switch). > > > > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > > > needs_defer as appropriate, then the notifiers emitted from process > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > directly, and we would also have fully blocking context all the way > > > down, with the opportunity for error propagation and extack. > > > > IIUC, at this stage all the FDB notifications drivers get are blocking, > > either from the work queue (because they were deferred) or directly from > > process context. If so, how do we synchronize the two and ensure drivers > > get the notifications at the correct order? > > What does 'at this stage' mean? Does it mean 'assuming the patch we're > discussing now gets accepted'? If that's what it means, then 'at this > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE, > then would set needs_defer, then would receive the blocking > FDB_ADD_TO_DEVICE. I meant after: "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set needs_defer as appropriate, then the notifiers emitted from process context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED directly, and we would also have fully blocking context all the way down, with the opportunity for error propagation and extack." IIUC, after the conversion the 'needs_defer' is gone and all the FDB events are blocking? Either from syscall context or the workqueue. If so, I'm not sure how we synchronize the two. That is, making sure that an event from syscall context does not reach drivers before an earlier event that was added to the 'deferred' list. I mean, in syscall context we are holding RTNL so whatever is already on the 'deferred' list cannot be dequeued and processed. > > Thinking a bit more - this two-stage notification process ends up being > more efficient for br_fdb_replay too. We don't queue up FDB entries > except if the driver tells us that it wants to process them in blocking > context. > > > I was thinking of adding all the notifications to the 'deferred' list > > when 'hash_lock' is held and then calling switchdev_deferred_process() > > directly in process context. It's not very pretty (do we return an error > > only for the entry the user added or for any other entry we flushed from > > the list?), but I don't have a better idea right now. > > I was thinking to add a switchdev_fdb_add_to_device_now(). As opposed to > the switchdev_fdb_add_to_device() which defers, this does not defer at > all but just call_blocking_switchdev_notifiers(). So it would not go > through switchdev_deferred_enqueue. For the code path I talked above, > we would temporarily drop the spin_lock, then call > switchdev_fdb_add_to_device_now(), then if that fails, take the > spin_lock again and delete the software fdb entry we've just added. 
> > So as long as we use a _now() variant and don't resynchronize with the > deferred work, we shouldn't have any ordering issues, or am I > misunderstanding your question? Not sure I'm following. I tried to explain above. > > > > > > > > > Some disadvantages of this solution though: > > > > > > - A driver now needs to check whether it is interested in an event > > > twice: first on the atomic call chain, then again on the blocking call > > > chain (because it is a notifier chain, it is potentially not the only > > > driver subscribed to it, it may be listening to another driver's > > > needs_defer request). The flip side: on sistems with mixed switchdev > > > setups (dpaa2-switch + DSA, and DSA sniffs dynamically learned FDB > > > entries on foreign interfaces), there are some "synergies", and the > > > SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED event is only emitted once, as > > > opposed to what would happen if each driver scheduled its own private > > > work item. > > > > > > - Right now drivers take rtnl_lock() as soon as their private work item > > > runs. They need the rtnl_lock for the call to > > > call_switchdev_notifiers(SWITCHDEV_FDB_OFFLOADED). > > > > I think RCU is enough? > > Maybe, I haven't experimented with it. I thought br_fdb_offloaded_set > would notify back rtnetlink, but it looks like it doesn't. You mean emit a RTM_NEWNEIGH? This can be done even without RTNL (from the data path, for example) > > > > But it doesn't really seem necessary for them to perform the actual > > > hardware manipulation (adding the FDB entry) with the rtnl_lock held > > > (anyway most do that). But with the new option of servicing > > > SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, the rtnl_lock is taken > > > top-level by switchdev, so even if these drivers wanted to be more > > > self-conscious, they couldn't. > > > > Yes, I want to remove this dependency in mlxsw assuming notifications > > remain atomic. The more pressing issue is actually removing it from the > > learning path. > > Bah, I understand where you're coming from, but it would be tricky to > remove the rtnl_lock from switchdev_deferred_process_work (that's what > it boils down to). My switchdev_handle_fdb_add_to_device helper currently > assumes rcu_read_lock(). With the blocking variant of SWITCHDEV_FDB_ADD_TO_DEVICE, > it would still need to traverse the netdev adjacency lists, so it would > need the rtnl_mutex for that. If we remove the rtnl_lock from > switchdev_deferred_process_work we'd have to add it back in DSA and to > any other callers of switchdev_handle_fdb_add_to_device.
On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote: > On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote: > > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote: > > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote: > > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are > > > > deferred by drivers even from code paths that are initially blocking > > > > (are running in process context): > > > > > > > > br_fdb_add > > > > -> __br_fdb_add > > > > -> fdb_add_entry > > > > -> fdb_notify > > > > -> br_switchdev_fdb_notify > > > > > > > > It seems fairly trivial to move the fdb_notify call outside of the > > > > atomic section of fdb_add_entry, but with switchdev offering only an > > > > API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would > > > > still have to defer these events and are unable to provide > > > > synchronous feedback to user space (error codes, extack). > > > > > > > > The above issues would warrant an attempt to fix a central problem, and > > > > make switchdev expose an API that is easier to consume rather than > > > > having drivers implement lateral workarounds. > > > > > > > > In this case, we must notice that > > > > > > > > (a) switchdev already has the concept of notifiers emitted from the fast > > > > path that are still processed by drivers from blocking context. This > > > > is accomplished through the SWITCHDEV_F_DEFER flag which is used by > > > > e.g. SWITCHDEV_OBJ_ID_HOST_MDB. > > > > > > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process(). > > > > So if we could hook into that, we could have a chance that the > > > > bridge simply waits for our FDB entry offloading procedure to finish > > > > before it calls netdev_upper_dev_unlink() - which is almost > > > > immediately afterwards, and also when switchdev drivers typically > > > > break their stateful associations between the bridge upper and > > > > private data. > > > > > > > > So it is in fact possible to use switchdev's generic > > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and > > > > from there we can call_switchdev_blocking_notifiers(). > > > > > > > > To address all requirements: > > > > > > > > - drivers that are unconverted from atomic to blocking still work > > > > - drivers that currently have a private workqueue are not worse off > > > > - drivers that want the bridge to wait for their deferred work can use > > > > the bridge's defer mechanism > > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested > > > > parties does not get deferred for no reason, because this takes the > > > > rtnl_mutex and schedules a worker thread for nothing > > > > > > > > it looks like we can in fact start off by emitting > > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in > > > > struct switchdev_notifier_fdb_info called "needs_defer", and any > > > > interested party can set this to true. > > > > > > > > This way: > > > > > > > > - unconverted drivers do their work (i.e. 
schedule their private work > > > > item) based on the atomic notifier, and do not set "needs_defer" > > > > - converted drivers only mark "needs_defer" and treat a separate > > > > notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not > > > > generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > > > > > Additionally, code paths that are blocking right not, like br_fdb_replay, > > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all > > > > consumers of the replayed FDB events support that (right now, that is > > > > DSA and dpaa2-switch). > > > > > > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > > > > needs_defer as appropriate, then the notifiers emitted from process > > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > directly, and we would also have fully blocking context all the way > > > > down, with the opportunity for error propagation and extack. > > > > > > IIUC, at this stage all the FDB notifications drivers get are blocking, > > > either from the work queue (because they were deferred) or directly from > > > process context. If so, how do we synchronize the two and ensure drivers > > > get the notifications at the correct order? > > > > What does 'at this stage' mean? Does it mean 'assuming the patch we're > > discussing now gets accepted'? If that's what it means, then 'at this > > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE, > > then would set needs_defer, then would receive the blocking > > FDB_ADD_TO_DEVICE. > > I meant after: > > "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > needs_defer as appropriate, then the notifiers emitted from process > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > directly, and we would also have fully blocking context all the way > down, with the opportunity for error propagation and extack." > > IIUC, after the conversion the 'needs_defer' is gone and all the FDB > events are blocking? Either from syscall context or the workqueue. We would not delete 'needs_defer'. It still offers a useful preliminary filtering mechanism for the fast path (and for br_fdb_replay). In retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer' instead of jumping to blocking context (if we care so much about performance). If a FDB event does not need to be processed by anyone (dynamically learned entry on a switchdev port), the bridge notifies the atomic call chain for the sake of it, but not the blocking chain. > If so, I'm not sure how we synchronize the two. That is, making sure > that an event from syscall context does not reach drivers before an > earlier event that was added to the 'deferred' list. > > I mean, in syscall context we are holding RTNL so whatever is already on > the 'deferred' list cannot be dequeued and processed. So switchdev_deferred_process() has ASSERT_RTNL. If we call switchdev_deferred_process() right before adding the blocking FDB entry in process context (and we already hold rtnl_mutex), I though that would be enough to ensure we have a synchronization point: Everything that was scheduled before is flushed now, everything that is scheduled while we are running will run after we unlock the rtnl_mutex. Is that not the order we expect? 
I mean, if there is a fast path FDB entry being learned / deleted while user space, say, adds that same FDB entry as static, how is the relative ordering ensured between the two?
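For reference, the synchronization point I am describing would look roughly like this. This is only a sketch, not actual bridge code: switchdev_fdb_add_to_device_now() is the hypothetical non-deferring variant mentioned earlier in the thread, and the surrounding function name is made up:

/* Hypothetical process-context path: flush whatever the fast path has
 * already queued, then notify the blocking chain directly. Anything
 * queued while we run is processed only after rtnl_mutex is released.
 */
static int foo_fdb_add_from_user(struct net_device *brport_dev,
				 struct switchdev_notifier_fdb_info *fdb_info)
{
	ASSERT_RTNL();

	/* Synchronization point: drain the 'deferred' list first. */
	switchdev_deferred_process();

	/* Blocking notification, with synchronous error reporting. */
	return switchdev_fdb_add_to_device_now(brport_dev, fdb_info);
}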
On Mon, Aug 23, 2021 at 02:00:46PM +0300, Vladimir Oltean wrote: > On Mon, Aug 23, 2021 at 01:47:57PM +0300, Ido Schimmel wrote: > > On Sun, Aug 22, 2021 at 08:44:49PM +0300, Vladimir Oltean wrote: > > > On Sun, Aug 22, 2021 at 08:06:00PM +0300, Ido Schimmel wrote: > > > > On Sun, Aug 22, 2021 at 04:31:45PM +0300, Vladimir Oltean wrote: > > > > > 3. There is a larger issue that SWITCHDEV_FDB_ADD_TO_DEVICE events are > > > > > deferred by drivers even from code paths that are initially blocking > > > > > (are running in process context): > > > > > > > > > > br_fdb_add > > > > > -> __br_fdb_add > > > > > -> fdb_add_entry > > > > > -> fdb_notify > > > > > -> br_switchdev_fdb_notify > > > > > > > > > > It seems fairly trivial to move the fdb_notify call outside of the > > > > > atomic section of fdb_add_entry, but with switchdev offering only an > > > > > API where the SWITCHDEV_FDB_ADD_TO_DEVICE is atomic, drivers would > > > > > still have to defer these events and are unable to provide > > > > > synchronous feedback to user space (error codes, extack). > > > > > > > > > > The above issues would warrant an attempt to fix a central problem, and > > > > > make switchdev expose an API that is easier to consume rather than > > > > > having drivers implement lateral workarounds. > > > > > > > > > > In this case, we must notice that > > > > > > > > > > (a) switchdev already has the concept of notifiers emitted from the fast > > > > > path that are still processed by drivers from blocking context. This > > > > > is accomplished through the SWITCHDEV_F_DEFER flag which is used by > > > > > e.g. SWITCHDEV_OBJ_ID_HOST_MDB. > > > > > > > > > > (b) the bridge del_nbp() function already calls switchdev_deferred_process(). > > > > > So if we could hook into that, we could have a chance that the > > > > > bridge simply waits for our FDB entry offloading procedure to finish > > > > > before it calls netdev_upper_dev_unlink() - which is almost > > > > > immediately afterwards, and also when switchdev drivers typically > > > > > break their stateful associations between the bridge upper and > > > > > private data. > > > > > > > > > > So it is in fact possible to use switchdev's generic > > > > > switchdev_deferred_enqueue mechanism to get a sleepable callback, and > > > > > from there we can call_switchdev_blocking_notifiers(). > > > > > > > > > > To address all requirements: > > > > > > > > > > - drivers that are unconverted from atomic to blocking still work > > > > > - drivers that currently have a private workqueue are not worse off > > > > > - drivers that want the bridge to wait for their deferred work can use > > > > > the bridge's defer mechanism > > > > > - a SWITCHDEV_FDB_ADD_TO_DEVICE event which does not have any interested > > > > > parties does not get deferred for no reason, because this takes the > > > > > rtnl_mutex and schedules a worker thread for nothing > > > > > > > > > > it looks like we can in fact start off by emitting > > > > > SWITCHDEV_FDB_ADD_TO_DEVICE on the atomic chain. But we add a new bit in > > > > > struct switchdev_notifier_fdb_info called "needs_defer", and any > > > > > interested party can set this to true. > > > > > > > > > > This way: > > > > > > > > > > - unconverted drivers do their work (i.e. 
schedule their private work > > > > > item) based on the atomic notifier, and do not set "needs_defer" > > > > > - converted drivers only mark "needs_defer" and treat a separate > > > > > notifier, on the blocking chain, called SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > > - SWITCHDEV_FDB_ADD_TO_DEVICE events with no interested party do not > > > > > generate any follow-up SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > > > > > > > Additionally, code paths that are blocking right not, like br_fdb_replay, > > > > > could notify only SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED, as long as all > > > > > consumers of the replayed FDB events support that (right now, that is > > > > > DSA and dpaa2-switch). > > > > > > > > > > Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > > > > > needs_defer as appropriate, then the notifiers emitted from process > > > > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > > > > directly, and we would also have fully blocking context all the way > > > > > down, with the opportunity for error propagation and extack. > > > > > > > > IIUC, at this stage all the FDB notifications drivers get are blocking, > > > > either from the work queue (because they were deferred) or directly from > > > > process context. If so, how do we synchronize the two and ensure drivers > > > > get the notifications at the correct order? > > > > > > What does 'at this stage' mean? Does it mean 'assuming the patch we're > > > discussing now gets accepted'? If that's what it means, then 'at this > > > stage' all drivers would first receive the atomic FDB_ADD_TO_DEVICE, > > > then would set needs_defer, then would receive the blocking > > > FDB_ADD_TO_DEVICE. > > > > I meant after: > > > > "Once all consumers of SWITCHDEV_FDB_ADD_TO_DEVICE are converted to set > > needs_defer as appropriate, then the notifiers emitted from process > > context by the bridge could call SWITCHDEV_FDB_ADD_TO_DEVICE_DEFERRED > > directly, and we would also have fully blocking context all the way > > down, with the opportunity for error propagation and extack." > > > > IIUC, after the conversion the 'needs_defer' is gone and all the FDB > > events are blocking? Either from syscall context or the workqueue. > > We would not delete 'needs_defer'. It still offers a useful preliminary > filtering mechanism for the fast path (and for br_fdb_replay). In > retrospect, the SWITCHDEV_OBJ_ID_HOST_MDB would also benefit from 'needs_defer' > instead of jumping to blocking context (if we care so much about performance). > > If a FDB event does not need to be processed by anyone (dynamically > learned entry on a switchdev port), the bridge notifies the atomic call > chain for the sake of it, but not the blocking chain. > > > If so, I'm not sure how we synchronize the two. That is, making sure > > that an event from syscall context does not reach drivers before an > > earlier event that was added to the 'deferred' list. > > > > I mean, in syscall context we are holding RTNL so whatever is already on > > the 'deferred' list cannot be dequeued and processed. > > So switchdev_deferred_process() has ASSERT_RTNL. If we call > switchdev_deferred_process() right before adding the blocking FDB entry > in process context (and we already hold rtnl_mutex), I though that would > be enough to ensure we have a synchronization point: Everything that was > scheduled before is flushed now, everything that is scheduled while we > are running will run after we unlock the rtnl_mutex. 
Is that not the > order we expect? I mean, if there is a fast path FDB entry being learned > / deleted while user space say adds that same FDB entry as static, how > is the relative ordering ensured between the two?

I was thinking about the following case:

t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock'
t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in response to STP state. Notifications are added to 'deferred' list
t2 - switchdev_deferred_process() is called in syscall context
t3 - <MAC1,VID1,P1> is notified as blocking

Updates to the SW FDB are protected by 'hash_lock', but updates to the HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but it will exist in HW.

Another case assuming switchdev_deferred_process() is called first:

t0 - switchdev_deferred_process() is called in syscall context
t1 - <MAC1,VID1,P1> is learned under 'hash_lock'. Notification is added to 'deferred' list
t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to <MAC1,VID1,P2>
t3 - <MAC1,VID1,P2> is notified as blocking
t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred' list is processed)

In this case, the HW will have <MAC1,VID1,P1>, but SW will have <MAC1,VID1,P2>
On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote: > I was thinking about the following case: > > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock' > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in > response to STP state. Notifications are added to 'deferred' list > t2 - switchdev_deferred_process() is called in syscall context > t3 - <MAC1,VID1,P1> is notified as blocking > > Updates to the SW FDB are protected by 'hash_lock', but updates to the > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but > it will exist in HW. > > Another case assuming switchdev_deferred_process() is called first: > > t0 - switchdev_deferred_process() is called in syscall context > t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added > to 'deferred' list > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to > <MAC1,VID1,P2> > t3 - <MAC1,VID1,P2> is notified as blocking > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred' > list is processed) > > In this case, the HW will have <MAC1,VID1,P1>, but SW will have > <MAC1,VID1,P2>

Ok, so if the hardware FDB entry needs to be updated under the same hash_lock as the software FDB entry, then it seems that the goal of updating the hardware FDB synchronously and in a sleepable manner can only be met if the data path defers the learning to sleepable context too. That in turn means that there will be 'dead time' between the reception of a packet from a given {MAC SA, VID} flow and the learning of that address, which I don't think is really desirable. So I don't know if it is actually realistic to do this.

Can we drop it from the requirements of this change, or do you feel like it's not worth it to make my change if this problem is not solved?

There is of course the option of going half-way too, just like for SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the atomic chain, switchdev throws as many errors as it reasonably can, then you defer the actual installation, which means a hardware access.
On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote: > On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote: > > I was thinking about the following case: > > > > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock' > > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in > > response to STP state. Notifications are added to 'deferred' list > > t2 - switchdev_deferred_process() is called in syscall context > > t3 - <MAC1,VID1,P1> is notified as blocking > > > > Updates to the SW FDB are protected by 'hash_lock', but updates to the > > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but > > it will exist in HW. > > > > Another case assuming switchdev_deferred_process() is called first: > > > > t0 - switchdev_deferred_process() is called in syscall context > > t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added > > to 'deferred' list > > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to > > <MAC1,VID1,P2> > > t3 - <MAC1,VID1,P2> is notified as blocking > > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred' > > list is processed) > > > > In this case, the HW will have <MAC1,VID1,P1>, but SW will have > > <MAC1,VID1,P2> > > Ok, so if the hardware FDB entry needs to be updated under the same > hash_lock as the software FDB entry, then it seems that the goal of > updating the hardware FDB synchronously and in a sleepable manner is if > the data path defers the learning to sleepable context too. That in turn > means that there will be 'dead time' between the reception of a packet > from a given {MAC SA, VID} flow and the learning of that address. So I > don't think that is really desirable. So I don't know if it is actually > realistic to do this. > > Can we drop it from the requirements of this change, or do you feel like > it's not worth it to make my change if this problem is not solved? I didn't pose it as a requirement, but as a desirable goal that I don't know how to achieve w/o a surgery in the bridge driver that Nik and you (understandably) don't like. Regarding a more practical solution, earlier versions (not what you posted yesterday) have the undesirable properties of being both asynchronous (current state) and mandating RTNL to be held. If we are going with the asynchronous model, then I think we should have a model that doesn't force RTNL and allows batching. I have the following proposal, which I believe solves your problem and allows for batching without RTNL: The pattern of enqueuing a work item per-entry is not very smart. Instead, it is better to to add the notification info to a list (protected by a spin lock) and scheduling a single work item whose purpose is to dequeue entries from this list and batch process them. Inside the work item you would do something like: spin_lock_bh() list_splice_init() spin_unlock_bh() mutex_lock() // rtnl or preferably private lock list_for_each_entry_safe() // process entry cond_resched() mutex_unlock() In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will instruct the driver to flush all its pending FDB notifications. You don't strictly need this notification because of the netdev_upper_dev_unlink() that follows, but it helps in making things more structured. Pros: 1. Solves your problem? 2. Pattern is not worse than what we currently have 3. Does not force RTNL 4. Allows for batching. 
For example, mlxsw has the ability to program up to 64 entries in one transaction with the device. I assume other devices in the same grade have similar capabilities Cons: 1. Asynchronous 2. Pattern we will see in multiple drivers? Can consider migrating it into switchdev itself at some point 3. Something I missed / overlooked > There is of course the option of going half-way too, just like for > SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the > atomic chain, the switchdev throws as many errors as it can reasonably > can, then you defer the actual installation which means a hardware access. Yes, the above proposal has the same property. You can throw errors before enqueueing the notification info on your list.
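In driver terms, the work item I have in mind would look roughly like this. It is only a sketch with made-up foo_* types and helpers (foo_hw_fdb_add/foo_hw_fdb_del stand in for the actual device transaction), meant to show the shape of the pattern rather than any particular driver:

/* FDB events are queued on a driver-private list from the atomic
 * notifier and consumed by a single work item, in batches, without RTNL.
 */
struct foo_fdb_entry {
	struct list_head list;
	unsigned char addr[ETH_ALEN];
	u16 vid;
	bool add;
};

struct foo_switch {
	spinlock_t fdb_lock;		/* protects fdb_list */
	struct list_head fdb_list;	/* pending foo_fdb_entry items */
	struct mutex hw_lock;		/* private lock for device access */
	struct work_struct fdb_work;
};

static void foo_fdb_work(struct work_struct *work)
{
	struct foo_switch *fs = container_of(work, struct foo_switch, fdb_work);
	struct foo_fdb_entry *entry, *tmp;
	LIST_HEAD(pending);

	spin_lock_bh(&fs->fdb_lock);
	list_splice_init(&fs->fdb_list, &pending);
	spin_unlock_bh(&fs->fdb_lock);

	mutex_lock(&fs->hw_lock);
	list_for_each_entry_safe(entry, tmp, &pending, list) {
		/* A real driver could coalesce a burst of entries into a
		 * single transaction with the device here.
		 */
		if (entry->add)
			foo_hw_fdb_add(fs, entry->addr, entry->vid);
		else
			foo_hw_fdb_del(fs, entry->addr, entry->vid);

		list_del(&entry->list);
		kfree(entry);
		cond_resched();
	}
	mutex_unlock(&fs->hw_lock);
}

The atomic notifier handler would then only allocate a foo_fdb_entry with GFP_ATOMIC, add it to fdb_list under the spin lock and schedule fdb_work.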
On 23/08/2021 18:18, Ido Schimmel wrote: > On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote: >> On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote: >>> I was thinking about the following case: >>> >>> t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock' >>> t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in >>> response to STP state. Notifications are added to 'deferred' list >>> t2 - switchdev_deferred_process() is called in syscall context >>> t3 - <MAC1,VID1,P1> is notified as blocking >>> >>> Updates to the SW FDB are protected by 'hash_lock', but updates to the >>> HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but >>> it will exist in HW. >>> >>> Another case assuming switchdev_deferred_process() is called first: >>> >>> t0 - switchdev_deferred_process() is called in syscall context >>> t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added >>> to 'deferred' list >>> t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to >>> <MAC1,VID1,P2> >>> t3 - <MAC1,VID1,P2> is notified as blocking >>> t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred' >>> list is processed) >>> >>> In this case, the HW will have <MAC1,VID1,P1>, but SW will have >>> <MAC1,VID1,P2> >> >> Ok, so if the hardware FDB entry needs to be updated under the same >> hash_lock as the software FDB entry, then it seems that the goal of >> updating the hardware FDB synchronously and in a sleepable manner is if >> the data path defers the learning to sleepable context too. That in turn >> means that there will be 'dead time' between the reception of a packet >> from a given {MAC SA, VID} flow and the learning of that address. So I >> don't think that is really desirable. So I don't know if it is actually >> realistic to do this. >> >> Can we drop it from the requirements of this change, or do you feel like >> it's not worth it to make my change if this problem is not solved? > > I didn't pose it as a requirement, but as a desirable goal that I don't > know how to achieve w/o a surgery in the bridge driver that Nik and you > (understandably) don't like. > > Regarding a more practical solution, earlier versions (not what you > posted yesterday) have the undesirable properties of being both > asynchronous (current state) and mandating RTNL to be held. If we are > going with the asynchronous model, then I think we should have a model > that doesn't force RTNL and allows batching. > > I have the following proposal, which I believe solves your problem and > allows for batching without RTNL: > > The pattern of enqueuing a work item per-entry is not very smart. > Instead, it is better to to add the notification info to a list > (protected by a spin lock) and scheduling a single work item whose > purpose is to dequeue entries from this list and batch process them. > > Inside the work item you would do something like: > > spin_lock_bh() > list_splice_init() > spin_unlock_bh() > > mutex_lock() // rtnl or preferably private lock > list_for_each_entry_safe() > // process entry > cond_resched() > mutex_unlock() > > In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some > new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will > instruct the driver to flush all its pending FDB notifications. You > don't strictly need this notification because of the > netdev_upper_dev_unlink() that follows, but it helps in making things > more structured. 
> I was also thinking about a solution along these lines, I like this proposition. > Pros: > > 1. Solves your problem? > 2. Pattern is not worse than what we currently have > 3. Does not force RTNL > 4. Allows for batching. For example, mlxsw has the ability to program up > to 64 entries in one transaction with the device. I assume other devices > in the same grade have similar capabilities Batching would help a lot even if we don't remove rtnl, on loaded systems rtnl itself is a bottleneck and we've seen crazy delays in commands because of contention. That coupled with the ability to program multiple entries would be a nice win. > > Cons: > > 1. Asynchronous > 2. Pattern we will see in multiple drivers? Can consider migrating it > into switchdev itself at some point > 3. Something I missed / overlooked > >> There is of course the option of going half-way too, just like for >> SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the >> atomic chain, the switchdev throws as many errors as it can reasonably >> can, then you defer the actual installation which means a hardware access. > > Yes, the above proposal has the same property. You can throw errors > before enqueueing the notification info on your list. > Thanks, Nik
On Mon, Aug 23, 2021 at 06:18:08PM +0300, Ido Schimmel wrote: > On Mon, Aug 23, 2021 at 05:29:53PM +0300, Vladimir Oltean wrote: > > On Mon, Aug 23, 2021 at 03:16:48PM +0300, Ido Schimmel wrote: > > > I was thinking about the following case: > > > > > > t0 - <MAC1,VID1,P1> is added in syscall context under 'hash_lock' > > > t1 - br_fdb_delete_by_port() flushes entries under 'hash_lock' in > > > response to STP state. Notifications are added to 'deferred' list > > > t2 - switchdev_deferred_process() is called in syscall context > > > t3 - <MAC1,VID1,P1> is notified as blocking > > > > > > Updates to the SW FDB are protected by 'hash_lock', but updates to the > > > HW FDB are not. In this case, <MAC1,VID1,P1> does not exist in SW, but > > > it will exist in HW. > > > > > > Another case assuming switchdev_deferred_process() is called first: > > > > > > t0 - switchdev_deferred_process() is called in syscall context > > > t1 - <MAC1,VID,P1> is learned under 'hash_lock'. Notification is added > > > to 'deferred' list > > > t2 - <MAC1,VID1,P1> is modified in syscall context under 'hash_lock' to > > > <MAC1,VID1,P2> > > > t3 - <MAC1,VID1,P2> is notified as blocking > > > t4 - <MAC1,VID1,P1> is notified as blocking (next time the 'deferred' > > > list is processed) > > > > > > In this case, the HW will have <MAC1,VID1,P1>, but SW will have > > > <MAC1,VID1,P2> > > > > Ok, so if the hardware FDB entry needs to be updated under the same > > hash_lock as the software FDB entry, then it seems that the goal of > > updating the hardware FDB synchronously and in a sleepable manner is if > > the data path defers the learning to sleepable context too. That in turn > > means that there will be 'dead time' between the reception of a packet > > from a given {MAC SA, VID} flow and the learning of that address. So I > > don't think that is really desirable. So I don't know if it is actually > > realistic to do this. > > > > Can we drop it from the requirements of this change, or do you feel like > > it's not worth it to make my change if this problem is not solved? > > I didn't pose it as a requirement, but as a desirable goal that I don't > know how to achieve w/o a surgery in the bridge driver that Nik and you > (understandably) don't like. > > Regarding a more practical solution, earlier versions (not what you > posted yesterday) have the undesirable properties of being both > asynchronous (current state) and mandating RTNL to be held. If we are > going with the asynchronous model, then I think we should have a model > that doesn't force RTNL and allows batching. > > I have the following proposal, which I believe solves your problem and > allows for batching without RTNL: > > The pattern of enqueuing a work item per-entry is not very smart. > Instead, it is better to to add the notification info to a list > (protected by a spin lock) and scheduling a single work item whose > purpose is to dequeue entries from this list and batch process them. I don't have hardware where FDB entries can be installed in bulk, so this is new to me. Might make sense though where you are in fact talking to firmware, and the firmware is in fact still committing to hardware one by one, you are still reducing the number of round trips. > Inside the work item you would do something like: > > spin_lock_bh() > list_splice_init() > spin_unlock_bh() > > mutex_lock() // rtnl or preferably private lock > list_for_each_entry_safe() > // process entry > cond_resched() > mutex_unlock() When is the work item scheduled in your proposal? 
I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is there
some sort of timer to allow for some batching to occur?

>
> In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> instruct the driver to flush all its pending FDB notifications. You
> don't strictly need this notification because of the
> netdev_upper_dev_unlink() that follows, but it helps in making things
> more structured.
>
> Pros:
>
> 1. Solves your problem?
> 2. Pattern is not worse than what we currently have
> 3. Does not force RTNL
> 4. Allows for batching. For example, mlxsw has the ability to program up
> to 64 entries in one transaction with the device. I assume other devices
> in the same grade have similar capabilities
>
> Cons:
>
> 1. Asynchronous
> 2. Pattern we will see in multiple drivers? Can consider migrating it
> into switchdev itself at some point

I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave() and
this will solve the problem in the same way, will it not?

It's not that I don't have driver-level solutions and hook points.
My concern is that there are way too many moving parts, and the entrance
barrier for a new switchdev driver keeps getting higher even for basic
stuff.

For example, I need to maintain a DSA driver and a switchdev driver for
the exact same class of hardware (ocelot is switchdev, felix is DSA, but
the hardware is the same), and it is so annoying that the interaction
with switchdev is this verbose and open-coded; it just leads to so much
duplication of basic patterns. When I add support for
SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE in ocelot, I really don't want to add a
boatload of code, all copied from DSA.

> 3. Something I missed / overlooked
>
> > There is of course the option of going half-way too, just like for
> > SWITCHDEV_PORT_ATTR_SET. You notify it once, synchronously, on the
> > atomic chain, the switchdev throws as many errors as it reasonably
> > can, then you defer the actual installation, which means a hardware
> > access.
>
> Yes, the above proposal has the same property. You can throw errors
> before enqueueing the notification info on your list.
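For reference, a minimal sketch of the consumer side of the list-plus-
single-work-item pattern proposed above. This is not code from any of the
drivers in this thread: the foo_priv/foo_fdb_entry structures and the
foo_hw_fdb_program() accessor are hypothetical, and error handling is
omitted.

#include <linux/etherdevice.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/sched.h>
#include <linux/slab.h>
#include <linux/spinlock.h>
#include <linux/workqueue.h>

struct foo_fdb_entry {
	struct list_head list;
	unsigned char addr[ETH_ALEN];
	u16 vid;
	bool add;			/* ADD_TO_DEVICE vs DEL_TO_DEVICE */
};

struct foo_priv {
	spinlock_t fdb_lock;		/* protects fdb_list */
	struct list_head fdb_list;	/* entries queued by the notifier */
	struct mutex hw_lock;		/* serializes hardware access */
	struct work_struct fdb_work;
	struct workqueue_struct *wq;
	struct notifier_block switchdev_nb;
};

/* hypothetical accessor; a batching-capable device could take the whole
 * list in one transaction instead of one entry at a time
 */
int foo_hw_fdb_program(struct foo_priv *priv, const unsigned char *addr,
		       u16 vid, bool add);

static void foo_fdb_work(struct work_struct *work)
{
	struct foo_priv *priv = container_of(work, struct foo_priv, fdb_work);
	struct foo_fdb_entry *entry, *tmp;
	LIST_HEAD(fdb_list);

	/* Grab everything queued so far in one go. */
	spin_lock_bh(&priv->fdb_lock);
	list_splice_init(&priv->fdb_list, &fdb_list);
	spin_unlock_bh(&priv->fdb_lock);

	/* Program the hardware outside the spinlock, under a private
	 * mutex rather than RTNL.
	 */
	mutex_lock(&priv->hw_lock);
	list_for_each_entry_safe(entry, tmp, &fdb_list, list) {
		foo_hw_fdb_program(priv, entry->addr, entry->vid, entry->add);
		list_del(&entry->list);
		kfree(entry);
		cond_resched();
	}
	mutex_unlock(&priv->hw_lock);
}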
On Mon, Aug 23, 2021 at 06:42:44PM +0300, Vladimir Oltean wrote:
> On Mon, Aug 23, 2021 at 06:18:08PM +0300, Ido Schimmel wrote:
> > [...]
> >
> > I have the following proposal, which I believe solves your problem and
> > allows for batching without RTNL:
> >
> > The pattern of enqueuing a work item per-entry is not very smart.
> > Instead, it is better to add the notification info to a list
> > (protected by a spin lock) and schedule a single work item whose
> > purpose is to dequeue entries from this list and batch process them.
>
> I don't have hardware where FDB entries can be installed in bulk, so
> this is new to me. It might make sense, though: where you are in fact
> talking to firmware, and the firmware is still committing to hardware
> one entry at a time, you are at least reducing the number of round
> trips.
Yes

> > Inside the work item you would do something like:
> >
> > spin_lock_bh()
> > list_splice_init()
> > spin_unlock_bh()
> >
> > mutex_lock() // rtnl or preferably private lock
> > list_for_each_entry_safe()
> > // process entry
> > cond_resched()
> > mutex_unlock()
>
> When is the work item scheduled in your proposal?

Calling queue_work() whenever you get a notification. The work item
might already be queued, which is fine.

> I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> there some sort of timer to allow for some batching to occur?

You can add a hysteresis timer if you want, but I don't think it's
necessary. Assuming user space is programming entries at a high rate,
then by the time you finish a batch, you will have a new one enqueued.

> > In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> > new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> > instruct the driver to flush all its pending FDB notifications. You
> > don't strictly need this notification because of the
> > netdev_upper_dev_unlink() that follows, but it helps in making things
> > more structured.
> >
> > [...]
>
> I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave()
> and this will solve the problem in the same way, will it not?

Problem is that you will deadlock if your work item tries to take RTNL.

> It's not that I don't have driver-level solutions and hook points.
> My concern is that there are way too many moving parts, and the entrance
> barrier for a new switchdev driver keeps getting higher even for basic
> stuff.

I understand the frustration, but that's my best proposal at the moment.
IMO, it doesn't make things worse and has some nice advantages.
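The producer side of the same sketch would look roughly like the following,
continuing the hypothetical foo_priv/foo_fdb_entry structures from the
earlier example. The deep copy matters because fdb_info->addr is only valid
for the duration of the notifier call, and queue_work() is a no-op when the
work item is already pending; port filtering and error handling are omitted.

#include <linux/notifier.h>
#include <net/switchdev.h>

static int foo_switchdev_event(struct notifier_block *nb, unsigned long event,
			       void *ptr)
{
	struct foo_priv *priv = container_of(nb, struct foo_priv, switchdev_nb);
	struct switchdev_notifier_info *info = ptr;
	struct switchdev_notifier_fdb_info *fdb_info;
	struct foo_fdb_entry *entry;

	switch (event) {
	case SWITCHDEV_FDB_ADD_TO_DEVICE:
	case SWITCHDEV_FDB_DEL_TO_DEVICE:
		fdb_info = container_of(info,
					struct switchdev_notifier_fdb_info,
					info);

		/* Deep-copy the address and VID; the notifier data is not
		 * ours to keep after this call returns.
		 */
		entry = kzalloc(sizeof(*entry), GFP_ATOMIC);
		if (!entry)
			return NOTIFY_BAD;

		ether_addr_copy(entry->addr, fdb_info->addr);
		entry->vid = fdb_info->vid;
		entry->add = event == SWITCHDEV_FDB_ADD_TO_DEVICE;

		spin_lock_bh(&priv->fdb_lock);
		list_add_tail(&entry->list, &priv->fdb_list);
		spin_unlock_bh(&priv->fdb_lock);

		/* Kick the single work item; if it is already queued, the
		 * new entry will simply be picked up by the next batch.
		 */
		queue_work(priv->wq, &priv->fdb_work);
		break;
	}

	return NOTIFY_DONE;
}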
On Mon, Aug 23, 2021 at 07:02:15PM +0300, Ido Schimmel wrote:
> > > Inside the work item you would do something like:
> > >
> > > spin_lock_bh()
> > > list_splice_init()
> > > spin_unlock_bh()
> > >
> > > mutex_lock() // rtnl or preferably private lock
> > > list_for_each_entry_safe()
> > > // process entry
> > > cond_resched()
> > > mutex_unlock()
> >
> > When is the work item scheduled in your proposal?
>
> Calling queue_work() whenever you get a notification. The work item
> might already be queued, which is fine.
>
> > I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> > there some sort of timer to allow for some batching to occur?
>
> You can add a hysteresis timer if you want, but I don't think it's
> necessary. Assuming user space is programming entries at a high rate,
> then by the time you finish a batch, you will have a new one enqueued.

With the current model, nobody really stops any driver from doing that if
they so wish; no switchdev or bridge changes are needed. We have maximum
flexibility now, with this async model. Yet it just so happens that no one
is exploiting it, and instead the existing options are poorly utilized by
most drivers.

> > > In del_nbp(), after br_fdb_delete_by_port(), the bridge will emit some
> > > new blocking event (e.g., SWITCHDEV_FDB_FLUSH_TO_DEVICE) that will
> > > instruct the driver to flush all its pending FDB notifications.
> > >
> > > [...]
> >
> > I can already flush_workqueue(dsa_owq) in dsa_port_pre_bridge_leave()
> > and this will solve the problem in the same way, will it not?
>
> Problem is that you will deadlock if your work item tries to take RTNL.

I think we agreed that the rtnl_lock could be dropped from driver FDB work
items. I have not tried that yet, though.

> > It's not that I don't have driver-level solutions and hook points.
> > My concern is that there are way too many moving parts, and the
> > entrance barrier for a new switchdev driver keeps getting higher even
> > for basic stuff.
>
> I understand the frustration, but that's my best proposal at the moment.
> IMO, it doesn't make things worse and has some nice advantages.

Reconsidering my options, I don't want to reduce the available
optimizations that other switchdev drivers can make in the name of a
simpler baseline. I am also not smart enough to rework the bridge data
path. I will probably do something like flush_workqueue() in the
PRECHANGEUPPER handler, see what other common patterns there might be, and
try to synthesize them in library code (a la switchdev_handle_*) that can
be used by drivers that wish to, and ignored by drivers that don't.
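A rough sketch of the "flush_workqueue() in the PRECHANGEUPPER handler"
idea mentioned above, with a hypothetical driver-wide workqueue foo_owq
(modeled on dsa_owq) and a hypothetical foo_port_dev_check() helper. This
only avoids the deadlock if the deferred work items themselves do not take
RTNL, since NETDEV_PRECHANGEUPPER is delivered under RTNL.

#include <linux/netdevice.h>
#include <linux/notifier.h>
#include <linux/workqueue.h>

static struct workqueue_struct *foo_owq;	/* hypothetical ordered wq */

bool foo_port_dev_check(const struct net_device *dev);	/* hypothetical */

static int foo_netdevice_event(struct notifier_block *nb, unsigned long event,
			       void *ptr)
{
	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
	struct netdev_notifier_changeupper_info *info = ptr;

	if (event != NETDEV_PRECHANGEUPPER || !foo_port_dev_check(dev))
		return NOTIFY_DONE;

	/* The port is about to leave a bridge: make sure every deferred
	 * FDB work item has reached the hardware before the unlink.
	 */
	if (netif_is_bridge_master(info->upper_dev) && !info->linking)
		flush_workqueue(foo_owq);

	return NOTIFY_DONE;
}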
On Mon, Aug 23, 2021 at 07:02:15PM +0300, Ido Schimmel wrote:
> > When is the work item scheduled in your proposal?
>
> Calling queue_work() whenever you get a notification. The work item
> might already be queued, which is fine.
>
> > I assume not only when SWITCHDEV_FDB_FLUSH_TO_DEVICE is emitted. Is
> > there some sort of timer to allow for some batching to occur?
>
> You can add a hysteresis timer if you want, but I don't think it's
> necessary. Assuming user space is programming entries at a high rate,
> then by the time you finish a batch, you will have a new one enqueued.

I tried to do something similar in DSA. There we have .ndo_fdb_dump
because we don't sync the hardware FDB. We also have some drivers where
the FDB flush on a port is a very slow procedure, because the FDB needs
to be walked element by element to see what needs to be deleted. So I
wanted to defer the FDB flush to a background work queue, and let the
port leave the bridge quickly without blocking the rtnl_mutex.

But it gets really nasty really quickly. The FDB flush workqueue cannot
run concurrently with ndo_fdb_dump, for reasons that have to do with
hardware access. Also, any fdb_add or fdb_del would need to flush the
FDB flush workqueue, for the same reasons. All of this is currently
implicitly serialized by the rtnl_mutex. Your hardware/firmware might be
smarter, but I think that if you drop the rtnl_mutex requirement, you
will be seriously surprised by the amount of extra concurrency you need
to handle. In the end I scrapped everything, and I'm happy with a
synchronous FDB flush even if it's slow. YMMV of course.
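As an illustration of the serialization problem described above (not actual
DSA code): once rtnl_mutex no longer guards these paths implicitly, every
entry point that touches the hardware FDB has to take a driver-private
lock. A hypothetical .ndo_fdb_dump implementation sharing the same mutex as
the deferred flush work might look like this, with foo_hw_fdb_dump() being
a made-up hardware walker.

#include <linux/netdevice.h>
#include <linux/netlink.h>

/* hypothetical hardware FDB walker, also called with hw_lock held from
 * the deferred flush work
 */
int foo_hw_fdb_dump(struct foo_priv *priv, struct sk_buff *skb,
		    struct netlink_callback *cb, int *idx);

static int foo_ndo_fdb_dump(struct sk_buff *skb, struct netlink_callback *cb,
			    struct net_device *dev,
			    struct net_device *filter_dev, int *idx)
{
	struct foo_priv *priv = netdev_priv(dev);
	int err;

	/* Without RTNL, the only thing stopping this dump from racing
	 * with the deferred FDB flush is the driver-private hw_lock.
	 */
	mutex_lock(&priv->hw_lock);
	err = foo_hw_fdb_dump(priv, skb, cb, idx);
	mutex_unlock(&priv->hw_lock);

	return err;
}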