Message ID: 20201108131953.2462644-1-olteanv@gmail.com
Series: Offload learnt bridge addresses to DSA
On Mon, Nov 09, 2020 at 12:59:39AM +0100, Andrew Lunn wrote:
> On Sun, Nov 08, 2020 at 07:23:55PM +0200, Vladimir Oltean wrote:
> > On Sun, Nov 08, 2020 at 10:09:25PM +0800, DENG Qingfang wrote:
> > > Can it be turned off for switches that support SA learning from CPU?
> >
> > Is there a good reason I would add another property per switch and not
> > just do it unconditionally?
>
> Just throwing out ideas, i've no idea if they are relevant. I wonder
> if we can get into issues with fast ageing with a topology change? We
> don't have too much control over the hardware. I think some devices
> just flush everything, or maybe just one port. So we have different
> life times for CPU port database entries and user port database
> entries?

A quick scan for "port_fast_age" did not find any implementers who do
not act upon the "port" argument.

> We might also run into bugs with flushing removing static database
> entries which should not be. But that would be a bug.

I can imagine that happening when there are multiple bridges spanning a
DSA switch, each bridge also contains a "foreign" interface, and the 2
bridging domains service 2 stations that have the same MAC address. In
that case, since fdb_add and fdb_del are not reference-counted on the
shared DSA CPU port, we would indeed trigger this bug. I was on the
fence about whether to include the reference counting patch I have for
host MDBs, and to make these addresses refcounted as well. What do you
think?

> Also, dumping the database might run into bugs since we have not had
> entries for the CPU port before.

I don't see what conditions can make this happen.

> We also need to make sure the static entries get removed correctly
> when a host moves. The mv88e6xxx will not replace a static entry with
> a dynamically learned one. It will probably raise an ATU violation
> interrupt that frames have come in on the wrong port.

This is a good one.
Currently every implementer of .port_fdb_add assumes a static entry is
what we want, but that is not the case here. We want an entry that can
expire, or that the switch can move to a different port when there is
evidence that it's wrong. Should we add more arguments to the API?

> What about switches which do not implement port_fdb_add? Do these
> patches at least do something sensible?

dsa_slave_switchdev_event
 -> dsa_slave_switchdev_event_work
  -> dsa_port_fdb_add
   -> dsa_port_notify(DSA_NOTIFIER_FDB_ADD)
    -> dsa_switch_fdb_add
     -> if (!ds->ops->port_fdb_add)
            return -EOPNOTSUPP;

-> an error is printed with dev_dbg, and
dsa_fdb_offload_notify(switchdev_work) is not called. On
dsa_port_fdb_del error, there is also an attempt to call dev_close(),
but only on user ports, which the CPU port is not.

So we do something almost sensible, but mostly by mistake, it seems. I
think the simplest would be to avoid all this nonsense right away in
dsa_slave_switchdev_event:

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -2188,6 +2188,9 @@ static int dsa_slave_switchdev_event(struct notifier_block *unused,
 		dp = p->dp->cpu_dp;
 	}
 
+	if (!dp->ds->ops->port_fdb_add || !dp->ds->ops->port_fdb_del)
+		return NOTIFY_DONE;
+
 	switchdev_work = kzalloc(sizeof(*switchdev_work), GFP_ATOMIC);
 	if (!switchdev_work)
 		return NOTIFY_BAD;
On Mon, Nov 09, 2020 at 09:09:37AM +0100, Tobias Waldekranz wrote:
> On Mon, Nov 09, 2020 at 02:30, Vladimir Oltean <olteanv@gmail.com> wrote:
> > On Mon, Nov 09, 2020 at 12:59:39AM +0100, Andrew Lunn wrote:
> >> We also need to make sure the static entries get removed correctly
> >> when a host moves. The mv88e6xxx will not replace a static entry with
> >> a dynamically learned one. It will probably raise an ATU violation
> >> interrupt that frames have come in on the wrong port.
> >
> > This is a good one. Currently every implementer of .port_fdb_add assumes
> > a static entry is what we want, but that is not the case here. We want
> > an entry that can expire or the switch can move it to a different port
> > when there is evidence that it's wrong. Should we add more arguments to
> > the API?
>
> I don't think that would help. You would essentially be trading one
> situation where station moves cause loss of traffic for another
> one. But now you have also increased the background load of an already
> choked resource, the MDIO bus.

In practice, DSA switches are already very demanding of their management
interface throughput, for PTP and things like that. I do expect that if
you spent any significant amount of time with DSA, you already know the
ins and outs of your MDIO/SPI/I2C controller and it would already be
optimized for efficiency. But ok, we can add this to the list of cons.

> At least on mv88e6xxx, your only option to allow the hardware to move
> the station to another port autonomously is to add the entry as a
> dynamically learnt one. However, since the switch does not perform any
> SA learning on the CPU port in this world, the entry would have to be
> refreshed by software, otherwise it would just age out.
>
> Then you run into this situation:
>
> A and B are communicating.
>
>        br0
>   .----'|'----.
>   |     |    |
> swp0  swp1  wlan0
>   |           |
>   A           B
>
> The switch's FDB:
> A: swp0
> B: cpu0 (due to this patchset)
>
> Now B roams to an AP somewhere behind swp1 and continues to communicate
> with A.
>
>        br0
>   .----'|'----.
>   |     |    |
> swp0  swp1  wlan0
>   |     |
>   A     B
>
> The switch's FDB:
> A: swp0
> B: swp1
>
> But br0 sees none of this, so at whatever interval we choose we will
> refresh the FDB, moving the station back to the cpu:
>
> A: swp0
> B: cpu0

No, br0 should see some traffic from station B. Not the unicast traffic
towards station A, of course (because that has already been learnt to go
towards swp0), but some broadcast ARP, or some multicast ND. This is the
big assumption behind any solution: that the stations are not silent and
make their presence known somehow.

> So now you have traded the issue of having to wait for the hardware to
> age out its entry, to the issue of having to wait for br0 to age out its
> entry. Right?

That's the thing. The software bridge will never expire its entry in
br_fdb_update if traffic is continuously coming in; fdb->updated will
just keep getting larger and larger after each incoming packet. But the
hardware bridge is not aware of this traffic. So:

- if the hardware bridge has a dynamic entry installed (one that's
  subject to ageing), that entry will eventually expire within 5
  minutes, when its software equivalent won't. Then no switchdev event
  will ever come back to update the hardware bridge, since from the
  software's perspective it was never supposed to expire. It's as if we
  _do_ want the entry to be static. But:

- if the hardware bridge has a static entry installed, then that entry
  might become wrong and cause connectivity loss until the software
  bridge figures it out.

It's what Andrew described as a 'hybrid' entry. We would want a 'static'
entry (one that doesn't age out based on a timer) that is 'weak' (can be
overridden when traffic comes in on a different port). I'm not sure
either that such a thing exists.
So for now, static entries are the best we've got. Let's re-run the
simulation knowing that we're working with static addresses towards the
CPU, to see how bad things are.

AP 1:
+------------------------------------------------------------------------+
|                                   br0                                  |
+------------------------------------------------------------------------+
+------------+ +------------+ +------------+ +------------+ +------------+
|    swp0    | |    swp1    | |    swp2    | |    swp3    | |   wlan0    |
+------------+ +------------+ +------------+ +------------+ +------------+
      |                                                        ^      ^
      |                                                        |      |
      |                                                 Client A      Client B
      |
      |
+------------+ +------------+ +------------+ +------------+ +------------+
|    swp0    | |    swp1    | |    swp2    | |    swp3    | |   wlan0    |
+------------+ +------------+ +------------+ +------------+ +------------+
+------------------------------------------------------------------------+
|                                   br0                                  |
+------------------------------------------------------------------------+
AP 2

- br0 of AP 1 will learn that Clients A and B are reachable via wlan0.
  The DSA switch will snoop these packets and add static entries towards
  the CPU port.

- the hardware fdb of the DSA switch, as well as br0 on AP 2, will learn
  that Clients A and B are reachable through swp0, because of our
  assumption of non-silent stations. There are no static entries
  involved on AP 2 for now.

Client B disconnects from AP 1 and roams to AP 2.
AP 1:
+------------------------------------------------------------------------+
|                                   br0                                  |
+------------------------------------------------------------------------+
+------------+ +------------+ +------------+ +------------+ +------------+
|    swp0    | |    swp1    | |    swp2    | |    swp3    | |   wlan0    |
+------------+ +------------+ +------------+ +------------+ +------------+
      |                                                               ^
      |                                                               |
      |                                                        Client A
      |
      |                                                        Client B
      |                                                               |
      |                                                               v
+------------+ +------------+ +------------+ +------------+ +------------+
|    swp0    | |    swp1    | |    swp2    | |    swp3    | |   wlan0    |
+------------+ +------------+ +------------+ +------------+ +------------+
+------------------------------------------------------------------------+
|                                   br0                                  |
+------------------------------------------------------------------------+
AP 2

- br0 of AP 1 still knows that Client A is reachable via wlan0 (no
  change).

- In the general case, br0 of AP 1 will _NOT_ know that Client B has
  left wlan0. So there is still a static entry for Client B towards the
  CPU port.

- Right now, any attempt from Client A to directly address Client B via
  unicast would result, if the FDB were to be consulted, in packet
  drops, because the switch on AP 1 would say 'wait a minute, I'm
  receiving a packet for Client B from the CPU port, but Client B is
  reachable via the CPU port!'. Luckily for us, the switches that we're
  working with do not look up the FDB for CPU-injected traffic,
  remember? So I don't think this is a problem: unicast packets would be
  delivered to wherever the software bridge wanted. Right now, even the
  software bridge has a wrong impression of where Client B is.

- remember the assumption that Client B is not silent at startup. So
  some broadcast packets with Client B's source MAC address will reach
  the Ethernet segment. The hardware switch on AP 1 will have no problem
  accepting these packets, since they are broadcast/multicast. They will
  reach the software bridge.
At this point, the software bridge finally learns the new destination
for Client B, and it emits a new SWITCHDEV_FDB_ADD_TO_DEVICE event.
Today we ignore that, because added_by_user will be false. That's what
we do wrong/incompletely in this RFC patch set. We should keep track of
static addresses installed on the CPU port, and if we ever receive a
!added_by_user notification on a DSA switch port for one of those
addresses, we should update the hardware FDB of the DSA switch on AP 1.

So there you have it, it's not that bad. More work needs to be done, but
IMO it's still workable.

But now maybe it makes more sense to treat the switches that perform
hardware SA learning on the CPU port separately, after I've digested
this a bit.
On Mon, Nov 09, 2020 at 12:03, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Mon, Nov 09, 2020 at 09:09:37AM +0100, Tobias Waldekranz wrote:
>> one. But now you have also increased the background load of an already
>> choked resource, the MDIO bus.
>
> In practice, DSA switches are already very demanding of their management
> interface throughput, for PTP and things like that. I do expect that if
> you spent any significant amount of time with DSA, you already know the
> ins and outs of your MDIO/SPI/I2C controller and it would already be
> optimized for efficiency. But ok, we can add this to the list of cons.

You are arguing for my position though, no? Yes it is demanding; that is
why we must allocate it carefully.

> So there you have it, it's not that bad. More work needs to be done, but
> IMO it's still workable.

If you bypass learning on all frames sent from the CPU (as today), yes,
I agree that you should be able to solve it with static entries. But I
think that you will have lots of weird problems with initial packet
loss, as the FDB updates are not synchronous with the packet flow. I.e.
the bridge will tell DSA to update the entry, but the update in hardware
will occur some time later, when the workqueue actually performs the
operation.

> But now maybe it makes more sense to treat the switches that perform
> hardware SA learning on the CPU port separately, after I've digested
> this a bit.

Yes, please. Because it will be impossible to add tx forward offloading
otherwise.
On 08.11.20 18:23, Vladimir Oltean wrote:
> On Sun, Nov 08, 2020 at 10:09:25PM +0800, DENG Qingfang wrote:
>> Can it be turned off for switches that support SA learning from CPU?
>
> Is there a good reason I would add another property per switch and not
> just do it unconditionally?

I have a similar concern for a future patch, where I want to turn on or
off whether the device driver listens to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
for a certain interface. (Options will be: static MACs only, learning in
the device, or learning in the bridge with notifications to the device.)

What about 'bridge link set dev $netdev learning_sync on self',
respectively the corresponding netlink message?
Hi Alexandra,

On Wed, Nov 11, 2020 at 11:13:03AM +0100, Alexandra Winter wrote:
> On 08.11.20 18:23, Vladimir Oltean wrote:
>> On Sun, Nov 08, 2020 at 10:09:25PM +0800, DENG Qingfang wrote:
>>> Can it be turned off for switches that support SA learning from CPU?
>>
>> Is there a good reason I would add another property per switch and not
>> just do it unconditionally?
>>
> I have a similar concern for a future patch, where I want to turn on or
> off whether the device driver listens to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
> for a certain interface. (Options will be: static MACs only, learning in
> the device, or learning in the bridge with notifications to the device.)
> What about 'bridge link set dev $netdev learning_sync on self',
> respectively the corresponding netlink message?

My understanding is that "learning_sync" is for pushing learnt addresses
from device to bridge, not from bridge to device.
On 11.11.20 11:36, Vladimir Oltean wrote:
> Hi Alexandra,
>
> On Wed, Nov 11, 2020 at 11:13:03AM +0100, Alexandra Winter wrote:
>> On 08.11.20 18:23, Vladimir Oltean wrote:
>>> On Sun, Nov 08, 2020 at 10:09:25PM +0800, DENG Qingfang wrote:
>>>> Can it be turned off for switches that support SA learning from CPU?
>>>
>>> Is there a good reason I would add another property per switch and not
>>> just do it unconditionally?
>>>
>> I have a similar concern for a future patch, where I want to turn on or
>> off whether the device driver listens to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
>> for a certain interface. (Options will be: static MACs only, learning in
>> the device, or learning in the bridge with notifications to the device.)
>> What about 'bridge link set dev $netdev learning_sync on self',
>> respectively the corresponding netlink message?
>
> My understanding is that "learning_sync" is for pushing learnt addresses
> from device to bridge, not from bridge to device.
>
uh, sorry, copy-paste error. I meant:
'bridge link set dev $netdev learning on self'
On Wed, Nov 11, 2020 at 03:14:26PM +0100, Alexandra Winter wrote:
> On 11.11.20 11:36, Vladimir Oltean wrote:
>> Hi Alexandra,
>>
>> On Wed, Nov 11, 2020 at 11:13:03AM +0100, Alexandra Winter wrote:
>>> On 08.11.20 18:23, Vladimir Oltean wrote:
>>>> On Sun, Nov 08, 2020 at 10:09:25PM +0800, DENG Qingfang wrote:
>>>>> Can it be turned off for switches that support SA learning from CPU?
>>>>
>>>> Is there a good reason I would add another property per switch and not
>>>> just do it unconditionally?
>>>>
>>> I have a similar concern for a future patch, where I want to turn on or
>>> off whether the device driver listens to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
>>> for a certain interface. (Options will be: static MACs only, learning in
>>> the device, or learning in the bridge with notifications to the device.)
>>> What about 'bridge link set dev $netdev learning_sync on self',
>>> respectively the corresponding netlink message?
>>
>> My understanding is that "learning_sync" is for pushing learnt addresses
>> from device to bridge, not from bridge to device.
>>
> uh, sorry, copy-paste error. I meant:
> 'bridge link set dev $netdev learning on self'

Even with "learning" instead of "learning_sync", I don't understand what
the "self" modifier would mean and how it would help, sorry.
On 11/9/2020 4:38 AM, Vladimir Oltean wrote:
> On Mon, Nov 09, 2020 at 02:31:11PM +0200, Vladimir Oltean wrote:
>> I need to sit on this for a while. How many DSA drivers do we have that
>> don't do SA learning in hardware for CPU-injected packets? ocelot/felix
>> and mv88e6xxx? Who else? Because if there aren't that many (or any at
>> all except for these two), then I could try to spend some time and see
>> how Felix behaves when I send FORWARD frames to it. Then we could go on
>> full blast with the other alternative, to force-enable address learning
>> from the CPU port, and declare this one as too complicated and not worth
>> the effort.
>
> In fact I'm not sure that I should be expecting an answer to this
> question. We can evaluate the other alternative in parallel. Would you
> be so kind to send some sort of RFC for your TX-side offload_fwd_mark so
> that I could test with the hardware I have, and get a better understanding
> of the limitations there?

For Broadcom switches, the ARL (Address Resolution Logic, where learning
happens) is bypassed when packets ingress the CPU port with opcode 1,
which is what net/dsa/tag_brcm.c uses. When opcode 0 is used, address
learning occurs. The reason why opcode 1 is used is Advanced Congestion
Buffering (ACB), which requires us to steer packets towards a particular
switch port and egress queue number within that port. With opcode 0 we
would not be able to do that.

We could make the opcode dependent on the switch/DSA master, since not
all combinations support ACB, but given that we have 3 or 4 kinds of
Ethernet switches within DSA that do not do learning from the CPU port,
I guess we need a solution to that problem somehow.

--
Florian
On 12.11.20 14:49, Vladimir Oltean wrote:
> On Wed, Nov 11, 2020 at 03:14:26PM +0100, Alexandra Winter wrote:
>> On 11.11.20 11:36, Vladimir Oltean wrote:
>>> Hi Alexandra,
>>>
>>> On Wed, Nov 11, 2020 at 11:13:03AM +0100, Alexandra Winter wrote:
>>>> On 08.11.20 18:23, Vladimir Oltean wrote:
>>>>> On Sun, Nov 08, 2020 at 10:09:25PM +0800, DENG Qingfang wrote:
>>>>>> Can it be turned off for switches that support SA learning from CPU?
>>>>>
>>>>> Is there a good reason I would add another property per switch and not
>>>>> just do it unconditionally?
>>>>>
>>>> I have a similar concern for a future patch, where I want to turn on or
>>>> off whether the device driver listens to SWITCHDEV_FDB_{ADD,DEL}_TO_DEVICE
>>>> for a certain interface. (Options will be: static MACs only, learning in
>>>> the device, or learning in the bridge with notifications to the device.)
>>>> What about 'bridge link set dev $netdev learning_sync on self',
>>>> respectively the corresponding netlink message?
>>>
>>> My understanding is that "learning_sync" is for pushing learnt addresses
>>> from device to bridge, not from bridge to device.
>>>
>> uh, sorry, copy-paste error. I meant:
>> 'bridge link set dev $netdev learning on self'
>
> Even with "learning" instead of "learning_sync", I don't understand what
> the "self" modifier would mean and how it would help, sorry.

With the self modifier, the command is not executed by the Linux bridge
but instead sent to the device driver of the respective bridge port. So
AFAIU 'learning on self' would mean that instead of only the bridge
doing the learning on this bridge port, the device itself should do the
learning. Which sounds to me like a good fit for SA learning from CPU.

It seems that the self modifier is not used on hardware switches today,
but rather on virtualized network cards, where it is e.g. used to turn
VEPA mode on or off per virtual interface.
In drivers/s390/net/qeth we use 'learning_sync on self' to control
whether a virtualized interface should synchronize the card's fdb with
an attached Linux bridge.