[RFC,0/4] Faster ndo_fdb_dump for drivers with shared FDB

Message ID 20210821210018.1314952-1-vladimir.oltean@nxp.com

Message

Vladimir Oltean Aug. 21, 2021, 9 p.m. UTC
I have a board where it is painfully slow to run "bridge fdb". It has 16
switch ports which are accessed over an I2C controller -> I2C mux 1 ->
I2C mux 2 -> I2C-to-SPI bridge.

It doesn't help either that we traverse the hardware FDB of each switch
once for every netdev, even though after the first traversal we already
know all there is to know. In fact, I hacked up some rtnetlink and DSA
changes, and with those, the time to run 'bridge fdb' on this board
decreases from 207 seconds to 26 seconds (2 FDB traversals instead of
16), turning something intolerable into 'tolerable'.

I don't know how much we care about .ndo_fdb_dump implemented directly
by drivers (and that's where I expect this to be most useful), because
of SWITCHDEV_FDB_ADD_TO_BRIDGE and all that. So this is an RFC, in case
it is helpful for somebody, at least during debugging.
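
For a concrete picture of what patch 1 is about, here is a minimal
sketch of a typed dump context cast over the netlink callback scratch
area; the struct layout and field names are assumptions made for
illustration, not the actual patch contents:

```c
#include <linux/build_bug.h>
#include <linux/netlink.h>

/* Hypothetical sketch: a typed view over the opaque dump state in
 * struct netlink_callback, so that rtnl_fdb_dump() and the drivers
 * stop open-coding cb->args[]. Field names are illustrative only.
 */
struct ndo_fdb_dump_context {
	unsigned long ifindex;	/* resume point: last fully dumped netdev */
	unsigned long fdb_idx;	/* resume point: last dumped FDB entry */
};

static inline struct ndo_fdb_dump_context *
ndo_fdb_dump_ctx(struct netlink_callback *cb)
{
	/* cb->ctx is a fixed-size scratch area reserved for dump state */
	BUILD_BUG_ON(sizeof(struct ndo_fdb_dump_context) > sizeof(cb->ctx));
	return (struct ndo_fdb_dump_context *)cb->ctx;
}
```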

Vladimir Oltean (4):
  net: rtnetlink: create a netlink cb context struct for fdb dump
  net: rtnetlink: add a minimal state machine for dumping shared FDBs
  net: dsa: implement a shared FDB dump procedure
  net: dsa: sja1105: implement shared FDB dump

 drivers/net/dsa/sja1105/sja1105_main.c        |  50 +++--
 .../ethernet/freescale/dpaa2/dpaa2-switch.c   |   9 +-
 drivers/net/ethernet/mscc/ocelot.c            |   5 +-
 drivers/net/ethernet/mscc/ocelot_net.c        |   4 +
 drivers/net/vxlan.c                           |   8 +-
 include/linux/rtnetlink.h                     |  25 +++
 include/net/dsa.h                             |  17 ++
 net/bridge/br_fdb.c                           |   6 +-
 net/core/rtnetlink.c                          | 105 +++++++---
 net/dsa/dsa2.c                                |   2 +
 net/dsa/dsa_priv.h                            |   1 +
 net/dsa/slave.c                               | 189 ++++++++++++++++--
 net/dsa/switch.c                              |   8 +
 13 files changed, 368 insertions(+), 61 deletions(-)

Comments

Vladimir Oltean Sept. 23, 2021, 3:25 p.m. UTC | #1
Roopa, Andrew, Florian,

On Sun, Aug 22, 2021 at 12:00:14AM +0300, Vladimir Oltean wrote:
> I have a board where it is painfully slow to run "bridge fdb". It has 16
> switch ports which are accessed over an I2C controller -> I2C mux 1 ->
> I2C mux 2 -> I2C-to-SPI bridge.
> [rest of cover letter and diffstat snipped]

Does something like this have any chance of being accepted?
https://patchwork.kernel.org/project/netdevbpf/cover/20210821210018.1314952-1-vladimir.oltean@nxp.com/
Florian Fainelli Sept. 23, 2021, 10:29 p.m. UTC | #2
On 9/23/21 8:25 AM, Vladimir Oltean wrote:
> Roopa, Andrew, Florian,
>
> On Sun, Aug 22, 2021 at 12:00:14AM +0300, Vladimir Oltean wrote:
>> I have a board where it is painfully slow to run "bridge fdb". It has 16
>> switch ports which are accessed over an I2C controller -> I2C mux 1 ->
>> I2C mux 2 -> I2C-to-SPI bridge.
>> [rest of cover letter and diffstat snipped]
>
> Does something like this have any chance of being accepted?
> https://patchwork.kernel.org/project/netdevbpf/cover/20210821210018.1314952-1-vladimir.oltean@nxp.com/

Had not seen the link you just posted; the premise of speeding up the
FDB dump sounds good to me, especially given that we typically have
these slow buses to work with.

This question is probably trivial, and I admittedly missed reviewing
your latest work properly: how do we keep the bridge's FDB and the
hardware FDB in sync, given that switches don't typically tell us when
they learn new addresses?
--
Florian
Vladimir Oltean Sept. 23, 2021, 10:49 p.m. UTC | #3
On Thu, Sep 23, 2021 at 03:29:07PM -0700, Florian Fainelli wrote:
> > Does something like this have any chance of being accepted?
> > https://patchwork.kernel.org/project/netdevbpf/cover/20210821210018.1314952-1-vladimir.oltean@nxp.com/
>
> Had not seen the link you just posted; the premise of speeding up the
> FDB dump sounds good to me, especially given that we typically have
> these slow buses to work with.

I didn't copy you because you were on vacation.

> This question is probably trivial, and I admittedly missed reviewing
> your latest work properly: how do we keep the bridge's FDB and the
> hardware FDB in sync, given that switches don't typically tell us when
> they learn new addresses?

We don't, that's the premise, you didn't miss anything there. In the
switchdev world, an interrupt seems to be the primary mechanism, and we
do have DSA switches which are theoretically capable of notifying us of
new addresses, though not through interrupts but over Ethernet, of
course. Ocelot, for example, supports sending "learn frames" to the
CPU: these are just like flooded frames, except that a "flooded" frame
is one with an unknown MAC DA, and a "learn" frame is one with an
unknown MAC SA. I've been thinking about whether it's worth adding
support in DSA for learn frames coming from Ocelot/Felix, and it
doesn't really look like it. When the DSA tagging protocol receives a
"learn" frame, it needs to consume_skb() it, then trigger a switchdev
atomic notifier for that MAC SA (SWITCHDEV_FDB_ADD_TO_BRIDGE). Sadly,
that is only the beginning of a long series of complications, because
we also need to know when the hardware has forgotten the address, and
"forget" frames don't appear to be a thing. So, because the risk of
having an address expire in hardware while it is still present in
software is kind of high, the only option left is to make the hardware
entry static, and
(a) delete it manually when the software entry expires
(b) set up a second alert which triggers when a MAC SA is received on a
    port other than the destination port pointed to by an existing FDB
    entry - in short, a "station migration alert". Because the FDB
    entry is static, we need to migrate it by hand, in software.
So it all looks kind of clunky. Whereas what we do now is basically
assume that the number of frames with unknown MAC DA reaching the CPU
via flooding is roughly equal to, and distributed similarly to, the
number of frames with unknown MAC SA reaching the CPU. I have not yet
encountered a use case where the two differ so dramatically that only
SWITCHDEV_FDB_ADD_TO_BRIDGE events could solve it.
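
For reference, a rough sketch of the learn-frame RX path described
above; dsa_handle_learn_frame() and its surrounding plumbing are
invented here for illustration, while consume_skb(),
SWITCHDEV_FDB_ADD_TO_BRIDGE and call_switchdev_notifiers() are the
real kernel interfaces:

```c
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
#include <net/switchdev.h>

/* Hypothetical sketch: report the unknown MAC SA of a "learn" frame
 * to the bridge, then drop the frame itself.
 */
static int dsa_handle_learn_frame(struct net_device *brport_dev,
				  struct sk_buff *skb)
{
	struct switchdev_notifier_fdb_info info = {
		.addr = eth_hdr(skb)->h_source,	/* the unknown MAC SA */
		.vid = 0,			/* would come from the tag */
		.offloaded = true,
	};
	int err;

	/* Atomic notifier consumed by the bridge driver */
	err = call_switchdev_notifiers(SWITCHDEV_FDB_ADD_TO_BRIDGE,
				       brport_dev, &info.info, NULL);

	/* The frame exists only to carry the MAC SA, so drop it */
	consume_skb(skb);

	return notifier_to_errno(err);
}
```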


Anyway, huge digression: the point is that DSA doesn't synchronize
FDBs, and that is the reason we have an .ndo_fdb_dump implementation in
the first place. Theoretically, if all hardware drivers were perfect,
.ndo_fdb_dump would only be implemented for the bridge, vxlan, things
like that. That is why I asked Roopa whether hacking up rtnl_fdb_dump
in this way, turning it into a state machine even more complicated than
it already is, is acceptable. We aren't its primary use case, I think.
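
Purely as an illustration of the state machine idea, the dump loop
could track states along these lines (the enum and its names are
invented here, not taken from the patches):

```c
/* Hypothetical states rtnl_fdb_dump() could cycle through per netdev;
 * invented names, for illustration only.
 */
enum rtnl_fdb_dump_state {
	RTNL_FDB_DUMP_SINGLE,		/* dump this netdev by itself */
	RTNL_FDB_DUMP_SHARED_PREPARE,	/* first netdev sharing an FDB */
	RTNL_FDB_DUMP_SHARED_REPLAY,	/* replay the cached shared FDB */
	RTNL_FDB_DUMP_SHARED_DONE,	/* last netdev: free the cache */
};
```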
Nikolay Aleksandrov Sept. 24, 2021, 10:03 a.m. UTC | #4
On 24/09/2021 01:49, Vladimir Oltean wrote:
> We don't, that's the premise, you didn't miss anything there. [snip]
>
> Anyway, huge digression: the point is that DSA doesn't synchronize
> FDBs, and that is the reason we have an .ndo_fdb_dump implementation
> in the first place. Theoretically, if all hardware drivers were
> perfect, .ndo_fdb_dump would only be implemented for the bridge,
> vxlan, things like that. That is why I asked Roopa whether hacking up
> rtnl_fdb_dump in this way, turning it into a state machine even more
> complicated than it already is, is acceptable. We aren't its primary
> use case, I think.

Hi Vladimir,
I glanced over the patches, and the obvious question that comes first
is: have you tried pushing all of this complexity closer to the driver
which needs it? rtnl_fdb_dump() can already "resume" and passes all the
necessary resume indices to ndo_fdb_dump(), so it seems to me that all
of this can be hidden away. I doubt there will be many users overall,
so it would be nice to avoid adding the complexity, as you put it, and
supporting it in the core. I'd imagine a hacked driver would simply
cache the dump for as long as needed (that's important to define well,
more below) and just return it for the next couple of devices which
share it. Basically you'd have the same type of solution you've done
here, just with the details hidden in the layer which needs it.

Now the hard part seems to be figuring out when to finish in this case.
Prepare should be a simple check whether a shared fdb list is
populated; finish would need to be inferred. One way to do that is
based on a transaction/dump id which is tracked for each shared device
and the last dump. Another is to pass a new argument to ndo_fdb_dump()
saying whether it's dumping a single device or doing a full dump, since
that's the case that's difficult to differentiate. If you're doing a
single dump, you obviously do a normal fdb dump without caching. If
you're doing a full dump (all devices), you check whether the list is
populated; if it is and this is the last device, you free the cached
shared list (the finish phase). That would make the core change very
small and would push the complexity to be maintained where it's needed.
Actually, you have the netlink_callback struct passed down to
ndo_fdb_dump(), so that can probably already be used to infer whether
you're doing a single- or multi-device dump, and there can be zero core
changes.
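
As a rough illustration of this driver-side approach, an ndo_fdb_dump
implementation could look as follows; every example_* name and the
cache structure are invented for the sketch, only the .ndo_fdb_dump
signature is the real one:

```c
#include <linux/netdevice.h>
#include <linux/netlink.h>

/* Hypothetical driver-side cache. The first port of a full dump
 * populates it; the last port frees it (the inferred "finish" phase).
 */
struct shared_fdb_cache {
	struct list_head entries;	/* FDB entries read from hardware */
	bool populated;
};

struct example_switch {
	struct shared_fdb_cache fdb_cache;
	/* ... */
};

static int example_ndo_fdb_dump(struct sk_buff *skb,
				struct netlink_callback *cb,
				struct net_device *dev,
				struct net_device *filter_dev,
				int *idx)
{
	struct example_switch *sw = example_switch_from_netdev(dev);
	struct shared_fdb_cache *cache = &sw->fdb_cache;
	int err;

	/* "Prepare": traverse the hardware FDB only once per full dump */
	if (!cache->populated) {
		err = example_hw_fdb_read_all(sw, &cache->entries);
		if (err)
			return err;
		cache->populated = true;
	}

	/* Replay the cached entries belonging to this port into skb,
	 * honouring the usual resume index in *idx
	 */
	err = example_fdb_cache_replay(cache, dev, skb, cb, idx);

	/* "Finish": free the cache after the switch's last port */
	if (example_is_last_port(sw, dev))
		example_fdb_cache_destroy(cache);

	return err;
}
```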

Also, one current downside of this set is the multiple walks over the
net device list for a single dump, especially for setups with thousands
of net devices. Actually, I might be missing something, but what
happens if a dump is terminated before it's finished? It looks like the
finish phase will never be reached. That would be a problem for both
solutions; you might have to add a netlink ->done() callback anyway.
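
A minimal sketch of that cleanup hook, reusing the cache from the
sketch above; the .done member of struct netlink_dump_control is the
real mechanism, while example_cache_from_cb() and the cache helpers are
invented:

```c
/* Hypothetical ->done() hook: free the cached shared FDB even when
 * userspace aborts the dump before the last device is reached.
 */
static int example_fdb_dump_done(struct netlink_callback *cb)
{
	struct shared_fdb_cache *cache = example_cache_from_cb(cb);

	if (cache && cache->populated)
		example_fdb_cache_destroy(cache);

	return 0;
}
```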

Cheers,
 Nik
Roopa Prabhu Sept. 25, 2021, 2:31 p.m. UTC | #5
On 9/24/21 3:03 AM, Nikolay Aleksandrov wrote:
> On 24/09/2021 01:49, Vladimir Oltean wrote:
[snip]
> Hi Vladimir,
> I glanced over the patches, and the obvious question that comes first
> is: have you tried pushing all of this complexity closer to the driver
> which needs it? [snip]
> That would be a problem for both solutions; you might have to add a
> netlink ->done() callback anyway.

+1, the series conceptually looks good, but the core fdb dump is
already unnecessarily complex. Moving this to the driver, with
indicators passed as flags, is preferred (you could possibly also mark
a port as designated for shared fdb operations, again in the driver).
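
As a rough sketch of what such driver-visible indicators could look
like (both flag names are invented for illustration and do not exist in
the kernel):

```c
#include <linux/bits.h>

/* Hypothetical indicator flags rtnl_fdb_dump() could pass down to
 * ndo_fdb_dump() so that all shared-FDB logic stays in the driver.
 */
#define NDO_FDB_DUMP_F_FULL_DUMP	BIT(0)	/* walking all netdevs */
#define NDO_FDB_DUMP_F_LAST_DEV		BIT(1)	/* last netdev of the walk */
```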