[RFC,0/4] net: dsa: link aggregation support

Message ID 20201027105117.23052-1-tobias@waldekranz.com

Message

Tobias Waldekranz Oct. 27, 2020, 10:51 a.m. UTC
This series starts by adding the generic support required to offload
link aggregates to drivers built on top of the DSA subsystem. It then
implements offloading for the mv88e6xxx driver, i.e. Marvell's
LinkStreet family.

Posting this as an RFC as there are some topics that I would like
feedback on before going further with testing. Thus far I've done some
basic tests to verify that:

- A LAG can be used as a regular interface.
- Bridging a LAG with other DSA ports works as expected.
- Load balancing is done by the hardware, both in single- and
  multi-chip systems.
- Load balancing is dynamically reconfigured when the state of
  individual links change.

Testing has been done on two systems:

1. Single-chip system with one Peridot (6390X).
2. Multi-chip system with one Agate (6352) daisy-chained with an Opal
   (6097F).

I would really appreciate feedback on the following:

All LAG configuration is cached in `struct dsa_lag`s. I realize that
the standard M.O. of DSA is to read back information from hardware
when required. With LAGs this becomes very tricky though. For example,
the change of a link state on one switch will require re-balancing of
LAG hash buckets on another one, which in turn depends on the total
number of active links in the LAG. Do you agree that this is
motivated?
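
For concreteness, a rough sketch of the kind of state being cached
(field names here are illustrative and may differ from the series):

/* Illustrative sketch, not the exact struct from the series */
struct dsa_lag {
	struct net_device *dev;	/* LAG netdev created by team/bond */
	int id;			/* hardware trunk/LAG ID */

	struct list_head ports;		/* all member ports */
	struct list_head tx_ports;	/* members with an active link */
	int num_tx;			/* sizes the hash bucket split */

	refcount_t refcount;	/* shared by ports on multiple chips */
};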

The LAG driver ops all receive the LAG netdev as an argument when this
information is already available through the port's lag pointer. This
was done to match the way that the bridge netdev is passed to all VLAN
ops even though it is in the port's bridge_dev. Is there a reason for
this or should I just remove it from the LAG ops?

At least on mv88e6xxx, the exact source port is not available when
packets are received on the CPU. The way I see it, there are two ways
around that problem:

- Inject the packet directly on the LAG device (what this series
  does). Feels right because it matches all that we actually know; the
  packet came in on the LAG. It does complicate dsa_switch_rcv
  somewhat as we can no longer assume that skb->dev is a DSA port.

- Inject the packet on "the designated port", i.e. some port in the
  LAG. This lets us keep the current Rx path untouched. The problem is
  that (a) the port would have to be dynamically updated to match the
  expectations of the LAG driver (team/bond) as links are
  enabled/disabled and (b) we would be presenting a lie because
  packets would appear to ingress on netdevs that they might not in
  fact have been physically received on.
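
As a minimal sketch of the first option (function and field names are
assumptions for illustration, not necessarily what the series uses),
the tagger would do something like:

static struct sk_buff *dsa_rcv_sketch(struct sk_buff *skb,
				      struct dsa_switch *ds,
				      int source, bool from_lag)
{
	if (from_lag)
		/* All we actually know is "came in on this LAG";
		 * ds->lags is a hypothetical per-switch LAG table. */
		skb->dev = ds->lags[source].dev;
	else
		skb->dev = dsa_to_port(ds, source)->slave;

	return skb;
}

Everything downstream of the tagger then has to cope with skb->dev
possibly not being a DSA port netdev.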

(mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA?  It
seems like all chips capable of doing EDSA are using that, except for
the Peridot.

(mv88e6xxx) The cross-chip PVT changes required to allow a LAG to
communicate with the other ports do not feel quite right, but I'm
unsure about what the proper way of doing it would be. Any ideas?

(mv88e6xxx) Marvell has historically used the idiosyncratic term
"trunk" to refer to link aggregates. Somewhere around the Peridot they
have switched and are now referring to the same registers/tables using
the term "LAG". In this series I've stuck to using LAG for all generic
stuff, and only used trunk for driver-internal functions. Do we want
to rename everything to use the LAG nomenclature?

Thanks,
Tobias

Tobias Waldekranz (4):
  net: dsa: mv88e6xxx: use ethertyped dsa for 6390/6390X
  net: dsa: link aggregation support
  net: dsa: mv88e6xxx: link aggregation support
  net: dsa: tag_edsa: support reception of packets from lag devices

 drivers/net/dsa/mv88e6xxx/chip.c    | 232 +++++++++++++++++++++++++++-
 drivers/net/dsa/mv88e6xxx/chip.h    |   4 +
 drivers/net/dsa/mv88e6xxx/global2.c |   8 +-
 drivers/net/dsa/mv88e6xxx/global2.h |   5 +
 drivers/net/dsa/mv88e6xxx/port.c    |  21 +++
 drivers/net/dsa/mv88e6xxx/port.h    |   5 +
 include/net/dsa.h                   |  68 ++++++++
 net/dsa/dsa.c                       |  23 +--
 net/dsa/dsa2.c                      |   3 +
 net/dsa/dsa_priv.h                  |  16 ++
 net/dsa/port.c                      | 146 +++++++++++++++++
 net/dsa/slave.c                     |  53 ++++++-
 net/dsa/switch.c                    |  64 ++++++++
 net/dsa/tag_edsa.c                  |  12 +-
 14 files changed, 635 insertions(+), 25 deletions(-)

Comments

Vladimir Oltean Oct. 27, 2020, 12:27 p.m. UTC | #1
Hi Tobias,

On Tue, Oct 27, 2020 at 11:51:13AM +0100, Tobias Waldekranz wrote:
> I would really appreciate feedback on the following:
>
> All LAG configuration is cached in `struct dsa_lag`s. I realize that
> the standard M.O. of DSA is to read back information from hardware
> when required. With LAGs this becomes very tricky though. For example,
> the change of a link state on one switch will require re-balancing of
> LAG hash buckets on another one, which in turn depends on the total
> number of active links in the LAG. Do you agree that this is
> motivated?

I don't really have an issue with that.

> The LAG driver ops all receive the LAG netdev as an argument when this
> information is already available through the port's lag pointer. This
> was done to match the way that the bridge netdev is passed to all VLAN
> ops even though it is in the port's bridge_dev. Is there a reason for
> this or should I just remove it from the LAG ops?

Maybe because on "leave", the bridge/LAG net device pointer inside
struct dsa_port is first set to NULL, then the DSA notifier is called?

> At least on mv88e6xxx, the exact source port is not available when
> packets are received on the CPU. The way I see it, there are two ways
> around that problem:
>
> - Inject the packet directly on the LAG device (what this series
>   does). Feels right because it matches all that we actually know; the
>   packet came in on the LAG. It does complicate dsa_switch_rcv
>   somewhat as we can no longer assume that skb->dev is a DSA port.
>
> - Inject the packet on "the designated port", i.e. some port in the
>   LAG. This lets us keep the current Rx path untouched. The problem is
>   that (a) the port would have to be dynamically updated to match the
>   expectations of the LAG driver (team/bond) as links are
>   enabled/disabled and (b) we would be presenting a lie because
>   packets would appear to ingress on netdevs that they might not in
>   fact have been physically received on.

Since ocelot/felix does not have this restriction, and supports
individual port addressing even under a LAG, you can imagine I am not
very happy to see the RX data path punishing everyone else that is not
mv88e6xxx.

> (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA?  It
> seems like all chips capable of doing EDSA are using that, except for
> the Peridot.

I have no documentation whatsoever for mv88e6xxx, but just wondering,
what is the benefit brought by EDSA here vs DSA? Does DSA have the
same restriction when the ports are in a LAG?
Andrew Lunn Oct. 27, 2020, 2 p.m. UTC | #2
> (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA?  It
> seems like all chips capable of doing EDSA are using that, except for
> the Peridot.

Hi Tobias

Marvell removed the ability to use EDSA, in the way we do in Linux
DSA, on Peridot. One of the configuration bits is gone. So I had to
use DSA.  It could be I just don't understand how to configure
Peridot, and EDSA could be used, but I never figured it out.

I will get back to your other questions.

  Andrew
Andrew Lunn Oct. 27, 2020, 2:29 p.m. UTC | #3
> > (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA?  It
> > seems like all chips capable of doing EDSA are using that, except for
> > the Peridot.
> 
> I have no documentation whatsoever for mv88e6xxx, but just wondering,
> what is the benefit brought by EDSA here vs DSA? Does DSA have the
> same restriction when the ports are in a LAG?

Hi Vladimir

One advantage of EDSA is that it uses a well known Ether Type. It was
easy for me to add support to tcpdump to spot this Ether type, decode
the EDSA header, and pass the rest of the frame on for further
processing as normal.

With DSA, you cannot look at the packet and know it is DSA, and then
correctly decode it. So tcpdump just shows the packet as undecodable.

Florian fixed this basic problem a while ago, since not being able to
decode packets is a problem for all taggers except EDSA. So now there
is extra meta information inserted into the pcap file, which gives
tcpdump the hint it needs to do the extra decoding of the tagger
header. But before that was added, EDSA was much easier to debug than
DSA because of tcpdump's decoding.

	Andrew
Marek Behún Oct. 27, 2020, 2:52 p.m. UTC | #4
On Tue, 27 Oct 2020 11:51:14 +0100
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> The policy is to use ethertyped DSA for all devices that are capable
> of doing so, which the Peridot is.

What is the benefit here?
Marek Behún Oct. 27, 2020, 2:54 p.m. UTC | #5
On Tue, 27 Oct 2020 15:52:13 +0100
Marek Behun <marek.behun@nic.cz> wrote:

> On Tue, 27 Oct 2020 11:51:14 +0100
> Tobias Waldekranz <tobias@waldekranz.com> wrote:
>
> > The policy is to use ethertyped DSA for all devices that are capable
> > of doing so, which the Peridot is.
>
> What is the benefit here?

Also, when you are changing something for 6390, please do the same
change for the non-industrial version of Peridot (6190, 6190X), for
6290 and 6191.

And since Topaz (6341 and 6141) are basically smaller Peridots (with 6
ports instead of 11), such a change should also go there.

But again, what is the benefit here?

Marek
Marek Behún Oct. 27, 2020, 2:58 p.m. UTC | #6
On Tue, 27 Oct 2020 15:54:36 +0100
Marek Behun <marek.behun@nic.cz> wrote:

> But again, what is the benefit here?


OK, you need this for the LAG support, somehow those emails went to
another folder, sorry :)
Tobias Waldekranz Oct. 27, 2020, 2:59 p.m. UTC | #7
On Tue, Oct 27, 2020 at 14:27, Vladimir Oltean <olteanv@gmail.com> wrote:
>> The LAG driver ops all receive the LAG netdev as an argument when this
>> information is already available through the port's lag pointer. This
>> was done to match the way that the bridge netdev is passed to all VLAN
>> ops even though it is in the port's bridge_dev. Is there a reason for
>> this or should I just remove it from the LAG ops?
>
> Maybe because on "leave", the bridge/LAG net device pointer inside
> struct dsa_port is first set to NULL, then the DSA notifier is called?

Right, that makes sense. For LAGs I keep ds->lag set until the leave ops
have run. But perhaps I should change it to match the VLAN ops?

> Since ocelot/felix does not have this restriction, and supports
> individual port addressing even under a LAG, you can imagine I am not
> very happy to see the RX data path punishing everyone else that is not
> mv88e6xxx.

I understand that, for sure. Though to be clear, the only penalty in
terms of performance is an extra call to dsa_slave_check, which is just
a load and compare on skb->dev->netdev_ops.
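
For reference, the check has the same shape as the existing
dsa_slave_dev_check() in net/dsa/slave.c, roughly:

bool dsa_slave_dev_check(const struct net_device *dev)
{
	/* One pointer load and compare per received packet */
	return dev->netdev_ops == &dsa_slave_netdev_ops;
}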

>> (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA?  It
>> seems like all chips capable of doing EDSA are using that, except for
>> the Peridot.
>
> I have no documentation whatsoever for mv88e6xxx, but just wondering,
> what is the benefit brought by EDSA here vs DSA? Does DSA have the
> same restriction when the ports are in a LAG?

The same restrictions apply I'm afraid. The only difference is that you
prepend a proper ethertype before the tag.

The idea (as far as I know) is that you can trap control traffic (TO_CPU
in DSA parlance) to the CPU and receive it (E)DSA-tagged to implement
things like STP and LLDP, but you receive the data traffic (FORWARD)
untagged or with an 802.1Q tag.

This means you can use standard VLAN accelerators on NICs to
remove/insert the 1Q tags. In a routing scenario this can bring a
significant speed-up as you skip two memcpys per packet to remove and
insert the tag.
Marek Behún Oct. 27, 2020, 3:05 p.m. UTC | #8
When I first read about port trunking in the Peridot documentation, I
immediately thought that this could be used to transparently offload
that which is called Bonding in Linux...

Is this what you want to eventually do?

BTW, I thought about using port trunking to solve the multi-CPU DSA
issue as well. On Turris Omnia we have 2 switch ports connected to the
CPU. So I could trunk these 2 switch ports, and on the other side
create a bonding interface from eth0 and eth1.

Andrew, what do you think about this? Is this something that can be
done? Or is it too complicated?

Marek
Tobias Waldekranz Oct. 27, 2020, 3:09 p.m. UTC | #9
On Tue, Oct 27, 2020 at 15:00, Andrew Lunn <andrew@lunn.ch> wrote:
>> (mv88e6xxx) What is the policy regarding the use of DSA vs. EDSA?  It
>> seems like all chips capable of doing EDSA are using that, except for
>> the Peridot.
>
> Hi Tobias
>
> Marvell removed the ability to use EDSA, in the way we do in Linux
> DSA, on Peridot. One of the configuration bits is gone. So I had to
> use DSA.  It could be I just don't understand how to configure
> Peridot, and EDSA could be used, but I never figured it out.
>
> I will get back to your other questions.
>
>   Andrew

Hi Andrew,

Very interesting! Which bit is that?

Looking at the datasheets for Agate and Peridot side-by-side, the
sections named "Ether Type DSA Tag" seem to be identical.

Thanks,
Tobias
Andrew Lunn Oct. 27, 2020, 3:23 p.m. UTC | #10
On Tue, Oct 27, 2020 at 04:05:30PM +0100, Marek Behun wrote:
> When I first read about port trunking in the Peridot documentation, I
> immediately thought that this could be used to transparently offload
> that which is called Bonding in Linux...
> 
> Is this what you want to eventually do?
> 
> BTW, I thought about using port trunking to solve the multi-CPU DSA
> issue as well. On Turris Omnia we have 2 switch ports connected to the
> CPU. So I could trunk these 2 swtich ports, and on the other side
> create a bonding interface from eth0 and eth1.
> 
> Andrew, what do you think about this? Is this something that can be
> done? Or is it too complicated?
 
Hi Marek

Trunking is something I've looked at once, but never had time to work
on. There are three different use cases I thought of:

1) trunk user ports, with team/bonding controlling it
2) trunk DSA ports, i.e. the ports between switches in a D in DSA setup
3) trunk CPU ports.

What Tobias is implementing here is 1). This seems like a good first
step.

I'm not sure 3) is even possible. Or it might depend on the switch
generation. On the 6352, for example, the CPU Dest field is a port
number. It does not appear to allow for a trunk. 6390 moved this
register, but as far as I know, it did not add trunk support.  It
might be possible to have multiple SoC interfaces sending frames to
the switch using DSA tags, but I don't see a way to have the switch
send frames to the SoC using multiple ports.

     Andrew
Tobias Waldekranz Oct. 27, 2020, 6:25 p.m. UTC | #11
On Tue, Oct 27, 2020 at 16:23, Andrew Lunn <andrew@lunn.ch> wrote:
> Hi Marek
>
> Trunking is something I've looked at once, but never had time to work
> on. There are three different use cases I thought of:
>
> 1) trunk user ports, with team/bonding controlling it
> 2) trunk DSA ports, i.e. the ports between switches in a D in DSA setup
> 3) trunk CPU ports.
>
> What Tobias is implementing here is 1). This seems like a good first
> step.
>
> I'm not sure 3) is even possible. Or it might depend on the switch
> generation. On the 6352, for example, the CPU Dest field is a port
> number. It does not appear to allow for a trunk. 6390 moved this
> register, but as far as I know, it did not add trunk support.  It
> might be possible to have multiple SoC interfaces sending frames to
> the switch using DSA tags, but I don't see a way to have the switch
> send frames to the SoC using multiple ports.

I think that (2) and (3) are essentially the same problem, i.e. creating
LAGs out of DSA links, be they switch-to-switch or switch-to-CPU
connections. I think you are correct that the CPU port cannot be a
LAG/trunk, but I believe that limitation only applies to TO_CPU packets.
If the CPU ports are configured as a LAG, all FORWARDs, i.e. the bulk
traffic, would benefit from the load-balancing. Something like this:

.-----. TO_CPU, FORWARD .-----. TO_CPU, FORWARD .-----.
|     +-----------------+     +-----------------+     |
| CPU |                 | sw0 |                 | sw1 |
|     +-----------------+     +-----------------+     |
'-----'    FORWARD      '-+-+-'    FORWARD      '-+-+-'
                          | |                     | |
                       swp1 swp2               swp3 swp4

So the links selected as the CPU ports will see a marginally higher load
due to all TO_CPU traffic being sent over them. But the hashing is not
that great on this hardware anyway (DA/SA only) so some imbalance is
unavoidable.
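
To illustrate the limitation (the exact hardware hash is not public;
this is just the general shape of a DA/SA-only scheme):

/* Illustration only, not the real hardware hash */
static u8 lag_hash_sketch(const u8 *da, const u8 *sa)
{
	u8 hash = 0;
	int i;

	/* Only the MAC addresses are mixed in, so every flow between
	 * the same two stations always selects the same bucket, and
	 * therefore the same LAG member. */
	for (i = 0; i < ETH_ALEN; i++)
		hash ^= da[i] ^ sa[i];

	return hash % 8;	/* 8 buckets shared out over the members */
}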

In order for this to work on transmit, we need to add forward offloading
to the bridge so that we can, for example, send one FORWARD from the CPU
to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.

	Tobias
Marek Behún Oct. 27, 2020, 6:33 p.m. UTC | #12
> In order for this to work on transmit, we need to add forward offloading
> to the bridge so that we can, for example, send one FORWARD from the CPU
> to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.

Wouldn't this be solved if the CPU master interface was a bonding interface?
Vladimir Oltean Oct. 27, 2020, 7 p.m. UTC | #13
On Tue, Oct 27, 2020 at 07:25:16PM +0100, Tobias Waldekranz wrote:
> > 1) trunk user ports, with team/bonding controlling it
> > 2) trunk DSA ports, i.e. the ports between switches in a D in DSA setup
> > 3) trunk CPU ports.
[...]
> I think that (2) and (3) are essentially the same problem, i.e. creating
> LAGs out of DSA links, be they switch-to-switch or switch-to-CPU
> connections. I think you are correct that the CPU port cannot be a
> LAG/trunk, but I believe that limitation only applies to TO_CPU packets.

Which would still be ok? They are called "slow protocol PDUs" for a reason.

> In order for this to work on transmit, we need to add forward offloading
> to the bridge so that we can, for example, send one FORWARD from the CPU
> to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.

That surely sounds like an interesting (and tough to implement)
optimization to increase the throughput, but why would it be _needed_
for things to work? What's wrong with 4 FROM_CPU packets?
Vladimir Oltean Oct. 27, 2020, 7:04 p.m. UTC | #14
On Tue, Oct 27, 2020 at 07:33:37PM +0100, Marek Behun wrote:
> > In order for this to work on transmit, we need to add forward offloading
> > to the bridge so that we can, for example, send one FORWARD from the CPU
> > to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.
> 
> Wouldn't this be solved if the CPU master interface was a bonding interface?

I don't see how you would do that. Would DSA keep returning -EPROBE_DEFER
until user space decides to set up a bond over the master interfaces?
How would you even describe that in device tree?
Tobias Waldekranz Oct. 27, 2020, 7:21 p.m. UTC | #15
On Tue, Oct 27, 2020 at 21:04, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Oct 27, 2020 at 07:33:37PM +0100, Marek Behun wrote:
>> > In order for this to work on transmit, we need to add forward offloading
>> > to the bridge so that we can, for example, send one FORWARD from the CPU
>> > to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.
>>
>> Wouldn't this be solved if the CPU master interface was a bonding interface?
>
> I don't see how you would do that. Would DSA keep returning -EPROBE_DEFER
> until user space decides to set up a bond over the master interfaces?
> How would you even describe that in device tree?

Yeah, that would be very hard indeed. Since this is going to be
completely transparent to the user, I think the best way is to just set
up the hardware to see the two CPU ports as a LAG whenever you find
e.g. "cpu0" and "cpu1", but have no representation of it as a separate
netdev.

DSA will have an rx_handler attached to both ports anyway, so it can
just run the same handler for both. On Tx it can just load-balance in
software like team does.
Tobias Waldekranz Oct. 27, 2020, 7:37 p.m. UTC | #16
On Tue, Oct 27, 2020 at 21:00, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Oct 27, 2020 at 07:25:16PM +0100, Tobias Waldekranz wrote:
>> > 1) trunk user ports, with team/bonding controlling it
>> > 2) trunk DSA ports, i.e. the ports between switches in a D in DSA setup
>> > 3) trunk CPU ports.
> [...]
>> I think that (2) and (3) are essentially the same problem, i.e. creating
>> LAGs out of DSA links, be they switch-to-switch or switch-to-CPU
>> connections. I think you are correct that the CPU port cannot be a
>> LAG/trunk, but I believe that limitation only applies to TO_CPU packets.
>
> Which would still be ok? They are called "slow protocol PDUs" for a reason.

Oh yes, completely agree. That was the point I was trying to make :)

>> In order for this to work on transmit, we need to add forward offloading
>> to the bridge so that we can, for example, send one FORWARD from the CPU
>> to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.
>
> That surely sounds like an interesting (and tough to implement)
> optimization to increase the throughput, but why would it be _needed_
> for things to work? What's wrong with 4 FROM_CPU packets?

We have internal patches that do this, and I can confirm that it is
tough :) I really would like to figure out a way to solve this that
would also be acceptable upstream. I have some ideas, it is on my TODO.

In a single-chip system I agree that it is not needed, the CPU can do
the load-balancing in software. But in order to have the hardware do
load-balancing on a switch-to-switch LAG, you need to send a FORWARD.

FROM_CPUs would just follow whatever is in the device mapping table. You
essentially have the inverse of the TO_CPU problem, but on Tx FROM_CPU
would make up 100% of traffic.

Other than that there are some things that, while strictly speaking
possible to do without FORWARDs, become much easier to deal with:

- Multicast routing. This is one case where performance _really_ suffers
  from having to skb_clone() to each recipient.

- Bridging between virtual interfaces and DSA ports. Typical example is
  an L2 VPN tunnel or one end of a veth pair. On FROM_CPUs, the switch
  can not perform SA learning, which means that once you bridge traffic
  from the VPN out to a DSA port, the return traffic will be classified
  as unknown unicast by the switch and be flooded everywhere.
Vladimir Oltean Oct. 27, 2020, 8:02 p.m. UTC | #17
On Tue, Oct 27, 2020 at 08:37:58PM +0100, Tobias Waldekranz wrote:
> >> In order for this to work on transmit, we need to add forward offloading
> >> to the bridge so that we can, for example, send one FORWARD from the CPU
> >> to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.

[...]

> In a single-chip system I agree that it is not needed, the CPU can do
> the load-balancing in software. But in order to have the hardware do
> load-balancing on a switch-to-switch LAG, you need to send a FORWARD.
>
> FROM_CPUs would just follow whatever is in the device mapping table. You
> essentially have the inverse of the TO_CPU problem, but on Tx FROM_CPU
> would make up 100% of traffic.

Woah, hold on, could you explain in more detail for non-expert people
like myself to understand.

So FROM_CPU frames (what tag_edsa.c uses now in xmit) can encode a
_single_ destination port in the frame header.

Whereas the FORWARD frames encode a _source_ port in the frame header.
You inject FORWARD frames from the CPU port, and you just let the L2
forwarding process select the adequate destination ports (or LAG, if
any ports are under one) _automatically_. The reason why you do this, is
because you want to take advantage of the switch's flooding abilities in
order to replicate the packet into 4 packets. So you will avoid cloning
that packet in the bridge in the first place.

But correct me if I'm wrong, sending a FORWARD frame from the CPU is a
slippery slope, since you're never sure that the switch will perform the
replication exactly as you intended to. The switch will replicate a
FORWARD frame by looking up the FDB, and we don't even attempt in DSA to
keep the FDB in sync between software and hardware. And that's why we
send FROM_CPU frames in tag_edsa.c and not FORWARD frames.

What you are really looking for is hardware where the destination field
for FROM_CPU packets is not a single port index, but a port mask.

Right?

Also, this problem is completely orthogonal to LAG? Where does LAG even
come into play here?

> Other than that there are some things that, while strictly speaking
> possible to do without FORWARDs, become much easier to deal with:
>
> - Multicast routing. This is one case where performance _really_ suffers
>   from having to skb_clone() to each recipient.
>
> - Bridging between virtual interfaces and DSA ports. Typical example is
>   an L2 VPN tunnel or one end of a veth pair. On FROM_CPUs, the switch
>   can not perform SA learning, which means that once you bridge traffic
>   from the VPN out to a DSA port, the return traffic will be classified
>   as unknown unicast by the switch and be flooded everywhere.

And how is this going to solve that problem? You mean that the switch
learns only from FORWARD, but not from FROM_CPU?

Why don't you attempt to solve this more generically somehow? Your
switch is not the only one that can't perform source address learning
for injected traffic, there are tons more, some are not even DSA. We
can't have everybody roll their own solution.
Tobias Waldekranz Oct. 27, 2020, 8:53 p.m. UTC | #18
On Tue, Oct 27, 2020 at 22:02, Vladimir Oltean <olteanv@gmail.com> wrote:
> On Tue, Oct 27, 2020 at 08:37:58PM +0100, Tobias Waldekranz wrote:
>> >> In order for this to work on transmit, we need to add forward offloading
>> >> to the bridge so that we can, for example, send one FORWARD from the CPU
>> >> to send an ARP broadcast to swp1..4 instead of four FROM_CPUs.
>
> [...]
>
>> In a single-chip system I agree that it is not needed, the CPU can do
>> the load-balancing in software. But in order to have the hardware do
>> load-balancing on a switch-to-switch LAG, you need to send a FORWARD.
>>
>> FROM_CPUs would just follow whatever is in the device mapping table. You
>> essentially have the inverse of the TO_CPU problem, but on Tx FROM_CPU
>> would make up 100% of traffic.
>
> Woah, hold on, could you explain in more detail for non-expert people
> like myself to understand.
>
> So FROM_CPU frames (what tag_edsa.c uses now in xmit) can encode a
> _single_ destination port in the frame header.

Correct.

> Whereas the FORWARD frames encode a _source_ port in the frame header.
> You inject FORWARD frames from the CPU port, and you just let the L2
> forwarding process select the adequate destination ports (or LAG, if
> any ports are under one) _automatically_. The reason why you do this, is
> because you want to take advantage of the switch's flooding abilities in
> order to replicate the packet into 4 packets. So you will avoid cloning
> that packet in the bridge in the first place.

Exactly so.

> But correct me if I'm wrong, sending a FORWARD frame from the CPU is a
> slippery slope, since you're never sure that the switch will perform the
> replication exactly as you intended to. The switch will replicate a
> FORWARD frame by looking up the FDB, and we don't even attempt in DSA to
> keep the FDB in sync between software and hardware. And that's why we
> send FROM_CPU frames in tag_edsa.c and not FORWARD frames.

I'm not sure if I agree that it's a slippery slope. The whole point of
the switchdev effort is to sync the switch with the bridge. We trust the
fabric to do all the steps you describe for _all_ other ports.

> What you are really looking for is hardware where the destination field
> for FROM_CPU packets is not a single port index, but a port mask.
>
> Right?

Sure, if that's available it's great. Chips from Marvell's Prestera line
can do this, and many others I'm sure. Alas, LinkStreet devices cannot,
and I still want the best performance I can get in that case.

> Also, this problem is completely orthogonal to LAG? Where does LAG even
> come into play here?

It matters if you set up switch-to-switch LAGs. FROM_CPU packets encode
the final device/port, and switches will forward those packets according
to their device mapping tables, which select a _single_ local port to
use to reach a remote device/port. So all FROM_CPU packets to a given
device/port will always travel through the same set of ports.

In the FORWARD case, you look up the destination in the FDB of each
device, find that it is located on the other side of a LAG, and the
hardware will perform load-balancing.
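
For readers without a datasheet: the tag's 2-bit mode field is what is
being discussed here. The values below are the Marvell tag's frame
modes (as named in the mainline tagger); the comments summarize this
thread:

enum dsa_cmd {
	DSA_CMD_TO_CPU     = 0,	/* trapped; exact source port known */
	DSA_CMD_FROM_CPU   = 1,	/* injected; one fixed destination, no learning */
	DSA_CMD_TO_SNIFFER = 2,	/* mirrored traffic */
	DSA_CMD_FORWARD    = 3,	/* switched; FDB lookup picks egress, may be a LAG */
};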

>> Other than that there are some things that, while strictly speaking
>> possible to do without FORWARDs, become much easier to deal with:
>>
>> - Multicast routing. This is one case where performance _really_ suffers
>>   from having to skb_clone() to each recipient.
>>
>> - Bridging between virtual interfaces and DSA ports. Typical example is
>>   an L2 VPN tunnel or one end of a veth pair. On FROM_CPUs, the switch
>>   can not perform SA learning, which means that once you bridge traffic
>>   from the VPN out to a DSA port, the return traffic will be classified
>>   as unknown unicast by the switch and be flooded everywhere.
>
> And how is this going to solve that problem? You mean that the switch
> learns only from FORWARD, but not from FROM_CPU?

Yes, so when you send the FORWARD the switch knows that the station is
located somewhere behind the CPU port. It does not know exactly where,
i.e. it has no knowledge of the VPN tunnel or anything. It just directs
it towards the CPU and the bridge's FDB will take care of the rest.

> Why don't you attempt to solve this more generically somehow? Your
> switch is not the only one that can't perform source address learning
> for injected traffic, there are tons more, some are not even DSA. We
> can't have everybody roll their own solution.

Who said anything about rolling my own solution? I'm going for a generic
solution where a netdev can announce to the bridge it is being added to
that it can offload forwarding of packets for all ports belonging to the
same switchdev device. Most probably modeled after how the macvlan
offloading stuff is done.

In the case of mv88e6xxx that would kill two birds with one stone -
great! In other cases you might have to have the DSA subsystem listen to
new neighbors appearing on the bridge and sync those to hardware or
something. Hopefully someone working with that kind of hardware can
solve that problem.
Vladimir Oltean Oct. 27, 2020, 10:32 p.m. UTC | #19
On Tue, Oct 27, 2020 at 09:53:45PM +0100, Tobias Waldekranz wrote:
> So all FROM_CPU packets to a given device/port will always travel
> through the same set of ports.

Ah, this is the part that didn't click for me.
For the simple case where you have a multi-CPU port system:

                     Switch 0
                   +--------------------------------+
DSA master #1      |CPU port #1                     |
   +------+        +------+                 +-------+
   | eth0 | ---->  |      | -----\    ----- | sw0p0 | ------>
   +------+        +------+       \  /      +-------+
                   |               \/               |
 DSA master #2     |CPU port #2    /\               |
   +------+        +------+       /  \      +-------|
   | eth1 | ---->  |      | -----/    \-----| sw0p1 | ------>
   +------+        +------+                 +-------+
                   |                                |
                   +--------------------------------+

you can have Linux do load balancing of CPU ports on TX for many streams
being delivered to the same egress port (sw0p0).

But if you have a cascade:

                     Switch 0                                 Switch 1
                   +--------------------------------+       +--------------------------------+
DSA master #1      |CPU port #1         DSA link #1 |       |DSA link #1                     |
   +------+        +------+                 +-------+       +------+                 +-------+
   | eth0 | ---->  |      | -----\    ----- |       | ----> |      | -----\    ----- | sw1p0 | ---->
   +------+        +------+       \  /      +-------+       +------+       \  /      +-------+
                   |               \/               |       |               \/               |
 DSA master #2     |CPU port #2    /\   DSA link #2 |       |DSA link #2    /\               |
   +------+        +------+       /  \      +-------|       +------+       /  \      +-------|
   | eth1 | ---->  |      | -----/    \-----|       | ----> |      | -----/    \-----| sw1p1 | ---->
   +------+        +------+                 +-------+       +------+                 +-------+
                   |                                |       |                                |
                   +--------------------------------+       +--------------------------------+

then you have no good way to spread the same load (many streams all
delivered to the same egress port, sw1p0) between DSA link #1 and DSA
link #2. DSA link #1 will get congested, while DSA link #2 will remain
unused.

And this all happens because for FROM_CPU packets, the hardware is
configured in mv88e6xxx_devmap_setup to deliver all packets with a
non-local switch ID towards the same "routing" port, right?

Whereas for FORWARD frames, the destination port for non-local switch ID
will not be established based on mv88e6xxx_devmap_setup, but based on
FDB lookup of {DMAC, VID}. In the second case above, this is the only
way for your hardware that the FDB could select the LAG as the
destination based on the FDB. Then, the hash code would be determined
from the packet, and the appropriate egress port within the LAG would be
selected.

So, to answer my own question:
Q: What does using FORWARD frames to offload TX flooding from the bridge
   have to do with a LAG between 2 switches?
A: Nothing, they would just both need FORWARD frames to be used.

> > Why don't you attempt to solve this more generically somehow? Your
> > switch is not the only one that can't perform source address learning
> > for injected traffic, there are tons more, some are not even DSA. We
> > can't have everybody roll their own solution.
> 
> Who said anything about rolling my own solution? I'm going for a generic
> solution where a netdev can announce to the bridge it is being added to
> that it can offload forwarding of packets for all ports belonging to the
> same switchdev device. Most probably modeled after how the macvlan
> offloading stuff is done.

The fact that I have no idea how the macvlan offloading is done does not
really help me, but here, the fact that I understood nothing doesn't
appear to stem from that.

"a netdev can announce to the bridge it is being added to that it can
offload forwarding of packets for all ports belonging to the same
switchdev device"

What do you mean? skb->offload_fwd_mark? Or are you still talking about
its TX-side equivalent here, which is what we've been talking about in
these past few mails? If so, I'm confused by you calling it "offload
forwarding of packets", I was expecting a description more in the lines
of "offload flooding of packets coming from host" or something like
that.

> In the case of mv88e6xxx that would kill two birds with one stone -
> great! In other cases you might have to have the DSA subsystem listen to
> new neighbors appearing on the bridge and sync those to hardware or
> something. Hopefully someone working with that kind of hardware can
> solve that problem.

If by "neighbors" you mean that you bridge a DSA swp0 with an e1000
eth0, then that is not going to be enough. The CPU port of swp0 will
need to learn not eth0's MAC address, but in fact the MAC address of all
stations that might be connected to eth0. There might even be a network
switch connected to eth0, not just a directly connected link partner.
So there are potentially many MAC addresses to be learnt, and all are
unknown off-hand.
I admit I haven't actually looked at implementing this, but I would
expect that what needs to be done is that the local (master) FDB of the
bridge (which would get populated on the RX side of the "foreign
interface" through software learning) would need to get offloaded in its
entirety towards all switchdev ports, via a new switchdev "host FDB"
object or something of that kind (where a "host FDB" entry offloaded on
a port would mean "see this {DMAC, VID} pair? send it to the CPU").
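
Purely as a hypothetical sketch of that idea, by analogy with the
existing SWITCHDEV_OBJ_ID_HOST_MDB object (none of this exists today):

struct switchdev_obj_host_fdb {		/* hypothetical object */
	struct switchdev_obj obj;
	unsigned char addr[ETH_ALEN];	/* DMAC learnt on a foreign port */
	u16 vid;			/* VLAN it was learnt in */
};

An entry offloaded to a port would mean exactly what is described
above: "see this {DMAC, VID} pair? send it to the CPU".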

With your FORWARD frames life-hack you can eschew all of that, good for
you. I was just speculatively hoping you might be interested in tackling
the hard way.

Anyway, this discussion has started mixing up basic stuff (like
resolving your source address learning issue on the CPU port, when
bridged with a foreign interface) with advanced / optimization stuff
(LAG, offload flooding from host), the only commonality appearing to be
a need for FORWARD frames. Can you even believe we are still commenting
on a series about something as mundane as link aggregation on DSA user
ports? At least I can't. I'll go off and start reviewing your patches,
before we manage to lose everybody along the way.
Andrew Lunn Oct. 27, 2020, 10:36 p.m. UTC | #20
Hi Tobias

> All LAG configuration is cached in `struct dsa_lag`s. I realize that
> the standard M.O. of DSA is to read back information from hardware
> when required. With LAGs this becomes very tricky though. For example,
> the change of a link state on one switch will require re-balancing of
> LAG hash buckets on another one, which in turn depends on the total
> number of active links in the LAG. Do you agree that this is
> motivated?

As you say, DSA tries to be stateless and not allocate any
memory. That keeps things simple. If you cannot allocate the needed
memory, you need to ensure you leave the system untouched. And that
needs to happen all the way up the stack when you have nested devices
etc. That is why many APIs have a prepare phase and then a commit
phase. The prepare phase allocates all the needed memory, can fail,
but does not otherwise touch the running system. The commit phase
cannot fail, since it has everything it needs.

If you are dynamically allocating dsa_lag structures, at run time, you
need to think about this. But the number of LAGs is limited by the
number of ports. So I would consider just allocating the worst case
number at probe, and KISS for runtime.
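
Something like this (field names hypothetical), so that runtime
join/leave never has to allocate:

static int dsa_lag_table_alloc(struct dsa_switch_tree *dst,
			       unsigned int num_ports)
{
	/* Worst case: every port is a LAG of its own. Allocating all
	 * slots at probe means join/leave can never hit -ENOMEM.
	 * dst->lags and dst->lags_len are assumed fields here. */
	dst->lags = kcalloc(num_ports, sizeof(*dst->lags), GFP_KERNEL);
	if (!dst->lags)
		return -ENOMEM;

	dst->lags_len = num_ports;
	return 0;
}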

> At least on mv88e6xxx, the exact source port is not available when
> packets are received on the CPU. The way I see it, there are two ways
> around that problem:

Does that break team/bonding? Do any of the algorithms send packets on
specific ports to make sure they are alive? I've not studied how
team/bonding works, but it must have a way to determine if a link has
failed and it needs to fail over.

> (mv88e6xxx) The cross-chip PVT changes required to allow a LAG to
> communicate with the other ports do not feel quite right, but I'm
> unsure about what the proper way of doing it would be. Any ideas?

Vivien implemented all that. I hope he can help you, I've no real idea
how that all works.

> (mv88e6xxx) Marvell has historically used the idiosyncratic term
> "trunk" to refer to link aggregates. Somewhere around the Peridot they
> have switched and are now referring to the same registers/tables using
> the term "LAG". In this series I've stuck to using LAG for all generic
> stuff, and only used trunk for driver-internal functions. Do we want
> to rename everything to use the LAG nomenclature?

Where possible, I would keep to the datasheet terminology. So any 6352
specific function should use 6352 terminology. Any 6390 specific
function should use 6390 terminology. For code which supports a range
of generations, we have used the terminology from the first device
which had the feature. In practice, this probably means trunk is going
to be used most of the time, and LAG in just 6390 code. Often, the
glue code in chip.c uses Linux stack terminology.

   Andrew
Tobias Waldekranz Oct. 28, 2020, 12:27 a.m. UTC | #21
On Wed, Oct 28, 2020 at 00:32, Vladimir Oltean <olteanv@gmail.com> wrote:
> And this all happens because for FROM_CPU packets, the hardware is
> configured in mv88e6xxx_devmap_setup to deliver all packets with a
> non-local switch ID towards the same "routing" port, right?

Precisely.

> Whereas for FORWARD frames, the destination port for non-local switch ID
> will not be established based on mv88e6xxx_devmap_setup, but based on
> FDB lookup of {DMAC, VID}. In the second case above, this is the only
> way for your hardware that the FDB could select the LAG as the
> destination based on the FDB. Then, the hash code would be determined
> from the packet, and the appropriate egress port within the LAG would be
> selected.

That's it!

> What do you mean? skb->offload_fwd_mark? Or are you still talking about
> its TX-side equivalent here, which is what we've been talking about in
> these past few mails? If so, I'm confused by you calling it "offload
> forwarding of packets", I was expecting a description more in the lines
> of "offload flooding of packets coming from host" or something like
> that.

I'm still talking about the TX-equivalent. I chose my words carefully
because it is not _only_ for flooding, although that is the main
benefit.

If I've understood the basics of macvlan offloading correctly, it uses
the ndo_dfwd_add/del_station ops to ask the lower device if it can
offload transmissions on behalf of the macvlan device. If the lower is
capable, the macvlan code will use dev_queue_xmit_accel to specify that
the skb is being forwarded from a "subordinate" device. For a bridge,
that would mean "forward this packet to the relevant ports, given the
current configuration".

This is just one possible approach though.
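
For context, the hooks in question look roughly like this in
include/linux/netdevice.h (excerpted from struct net_device_ops;
how a bridge/DSA analogue would look is an open question):

	/* Offload TX on behalf of an upper device ("station") */
	void *(*ndo_dfwd_add_station)(struct net_device *pdev,
				      struct net_device *vdev);
	void (*ndo_dfwd_del_station)(struct net_device *pdev,
				     void *priv);

and the accelerated transmit entry point is:

	int dev_queue_xmit_accel(struct sk_buff *skb,
				 struct net_device *sb_dev);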

>> In the case of mv88e6xxx that would kill two birds with one stone -
>> great! In other cases you might have to have the DSA subsystem listen to
>> new neighbors appearing on the bridge and sync those to hardware or
>> something. Hopefully someone working with that kind of hardware can
>> solve that problem.
>
> If by "neighbors" you mean that you bridge a DSA swp0 with an e1000
> eth0, then that is not going to be enough. The CPU port of swp0 will
> need to learn not eth0's MAC address, but in fact the MAC address of all
> stations that might be connected to eth0. There might even be a network
> switch connected to eth0, not just a directly connected link partner.
> So there are potentially many MAC addresses to be learnt, and all are
> unknown off-hand.

Yep, hence the "technically possible, but hard" remark I made earlier :)

> I admit I haven't actually looked at implementing this, but I would
> expect that what needs to be done is that the local (master) FDB of the
> bridge (which would get populated on the RX side of the "foreign
> interface" through software learning) would need to get offloaded in its
> entirety towards all switchdev ports, via a new switchdev "host FDB"
> object or something of that kind (where a "host FDB" entry offloaded on
> a port would mean "see this {DMAC, VID} pair? send it to the CPU").
>
> With your FORWARD frames life-hack you can eschew all of that, good for
> you. I was just speculatively hoping you might be interested in tackling
> the hard way.

Being able to set host FDB entries like we can for host MDB is useful
for other things as well, so I might very well be willing to do it.

> Anyway, this discussion has started mixing up basic stuff (like
> resolving your source address learning issue on the CPU port, when
> bridged with a foreign interface) with advanced / optimization stuff
> (LAG, offload flooding from host), the only commonality appearing to be
> a need for FORWARD frames. Can you even believe we are still commenting
> on a series about something as mundane as link aggregation on DSA user
> ports? At least I can't. I'll go off and start reviewing your patches,
> before we manage to lose everybody along the way.

Agreed, we went deep down the rabbit hole! This might not have been the
most natural place for these discussions, but it was fun nonetheless :)
Tobias Waldekranz Oct. 28, 2020, 12:45 a.m. UTC | #22
On Tue, Oct 27, 2020 at 23:36, Andrew Lunn <andrew@lunn.ch> wrote:
> If you are dynamically allocating dsa_lag structures, at run time, you
> need to think about this. But the number of LAGs is limited by the
> number of ports. So I would consider just allocating the worst case
> number at probe, and KISS for runtime.

Oh OK, yeah that just makes stuff easier so that's absolutely fine. I
got the sense that the overall movement within DSA was in the opposite
direction. E.g. didn't the dst use to have an array of ds pointers?
Whereas now you iterate through dst->ports to find them?

>> At least on mv88e6xxx, the exact source port is not available when
>> packets are received on the CPU. The way I see it, there are two ways
>> around that problem:
>
> Does that break team/bonding? Do any of the algorithms send packets on
> specific ports to make sure they are alive? I've not studied how
> team/bonding works, but it must have a way to determine if a link has
> failed and it needs to fail over.

This limitation only applies to FORWARD packets. TO_CPU packets will
still contain the exact device/port. So you have to make sure that the
control packets are trapped to the CPU rather than forwarded (e.g. by
setting the Resvd2CPU bits in Global2).
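
For reference, the trapping boils down to something like the existing
mgmt_rsvd2cpu() helpers in mv88e6xxx/global2.c, simplified:

static int rsvd2cpu_sketch(struct mv88e6xxx_chip *chip)
{
	/* One bit per 01:80:c2:00:00:0x DA: treat the whole reserved
	 * multicast range (STP, LACP, LLDP, ...) as management, so it
	 * is trapped as TO_CPU instead of forwarded. */
	return mv88e6xxx_g2_write(chip, MV88E6XXX_G2_MGMT_EN_0X, 0xffff);
}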

> Where possible, I would keep to the datasheet terminology. So any 6352
> specific function should use 6352 terminology. Any 6390 specific
> function should use 6390 terminology. For code which supports a range
> of generations, we have used the terminology from the first device
> which had the feature. In practice, this probably means trunk is going
> to be used most of the time, and LAG in just 6390 code. Often, the
> glue code in chip.c uses Linux stack terminology.

Fair enough, trunking it is then. I don't expect we'll have anything
mv88e6xxx specific using the LAG term in that case. From what I can
tell, the trunk settings have not changed since at least 6095.
Andrew Lunn Oct. 28, 2020, 1:03 a.m. UTC | #23
On Wed, Oct 28, 2020 at 01:45:11AM +0100, Tobias Waldekranz wrote:
> On Tue, Oct 27, 2020 at 23:36, Andrew Lunn <andrew@lunn.ch> wrote:
> > If you are dynamically allocating dsa_lag structures, at run time, you
> > need to think about this. But the number of LAGs is limited by the
> > number of ports. So I would consider just allocating the worst case
> > number at probe, and KISS for runtime.
> 
> Oh OK, yeah that just makes stuff easier so that's absolutely fine. I
> got the sense that the overall movement within DSA was in the opposite
> direction. E.g. didn't the dst use to have an array of ds pointers?
> Whereas now you iterate through dst->ports to find them?

Yes, but they are all allocated at probe time. It saved a bit of heap
at the cost of some extra code.

   Andrew
Marek Behún Oct. 28, 2020, 10:35 p.m. UTC | #24
On Tue, 27 Oct 2020 19:25:16 +0100
Tobias Waldekranz <tobias@waldekranz.com> wrote:

> .-----. TO_CPU, FORWARD .-----. TO_CPU, FORWARD .-----.
> |     +-----------------+     +-----------------+     |
> | CPU |                 | sw0 |                 | sw1 |
> |     +-----------------+     +-----------------+     |
> '-----'    FORWARD      '-+-+-'    FORWARD      '-+-+-'
>                           | |                     | |
>                        swp1 swp2               swp3 swp4
> 
> So the links selected as the CPU ports will see a marginally higher load
> due to all TO_CPU traffic being sent over them. But the hashing is not
> that great on this hardware anyway (DA/SA only) so some imbalance is
> unavoidable.

The hashing is horrible :( On Turris Omnia we have 5 user ports and 2
CPU ports, and I suspect that for most of our users there is at most
one peer MAC address on the other side of a user port. So if such a
user has 5 devices, one connected to each switch port, there are 5
(DA,SA) pairs, and thus 2^5 = 32 different assignments of (DA,SA)
pairs to CPU ports.

With probability 2/32 = 6.25%, traffic from all 5 user ports would go
via one port; with probability 10/32 = 31.25%, traffic from 4 user
ports would go via one port.

That is not good balancing :)

Marek
Florian Fainelli Nov. 11, 2020, 4:28 a.m. UTC | #25
On 10/27/2020 3:51 AM, Tobias Waldekranz wrote:
> This series starts by adding the generic support required to offload
> link aggregates to drivers built on top of the DSA subsystem. It then
> implements offloading for the mv88e6xxx driver, i.e. Marvell's
> LinkStreet family.

[...]

> All LAG configuration is cached in `struct dsa_lag`s. I realize that
> the standard M.O. of DSA is to read back information from hardware
> when required. With LAGs this becomes very tricky though. For example,
> the change of a link state on one switch will require re-balancing of
> LAG hash buckets on another one, which in turn depends on the total
> number of active links in the LAG. Do you agree that this is
> motivated?

Yes, this makes sense, I did something quite similar in this branch
nearly 3 years ago, it was tested to the point where the switch was
programmed correctly but I had not configured the CPU port to support
2Gbits/sec (doh) to verify that we got 2x the desired throughput:

https://github.com/ffainelli/linux/commits/b53-bond

Your patch looks definitely more complete.

[...]

> (mv88e6xxx) Marvell has historically used the idiosyncratic term
> "trunk" to refer to link aggregates. Somewhere around the Peridot they
> have switched and are now referring to the same registers/tables using
> the term "LAG". In this series I've stuck to using LAG for all generic
> stuff, and only used trunk for driver-internal functions. Do we want
> to rename everything to use the LAG nomenclature?

Yes please!
-- 
Florian
Vladimir Oltean Nov. 19, 2020, 10:51 a.m. UTC | #26
On Tue, Oct 27, 2020 at 11:51:13AM +0100, Tobias Waldekranz wrote:
> This series starts by adding the generic support required to offload
> link aggregates to drivers built on top of the DSA subsystem. It then
> implements offloading for the mv88e6xxx driver, i.e. Marvell's
> LinkStreet family.

[...]

I have tested these patches on ocelot/felix and all is OK there; I would
appreciate it if you could resend as non-RFC. In the case of my hardware,
it appears that I don't need the .port_lag_change callback, and that the
source port that is being put in the DSA header is still the physical
port ID and not the logical port ID (the LAG number). That being said,
the framework you've built is clearly nice and works well even with
bridging a bond.
Tobias Waldekranz Nov. 19, 2020, 11:52 a.m. UTC | #27
On Thu Nov 19, 2020 at 1:51 PM CET, Vladimir Oltean wrote:
> I have tested these patches on ocelot/felix and all is OK there; I would
> appreciate it if you could resend as non-RFC. In the case of my hardware,

For sure, I am working on it as we speak. I was mostly waiting for the
dsa-tag-unification series to make its way to net-next first as v1
depends on that. But then I remembered that I had to test against the
bonding driver (I have used team up to this point), and there is some
bug there that I need to squash first.

> it appears that I don't need the .port_lag_change callback, and that the

Ok, does ocelot automatically rebalance the LAG based on link state? I
took a quick look through the datasheet for another switch from
Vitesse, and it explicitly states that you need to update a table on
link changes.

I.e. in this situation:

    br0
   /  |
 lag  |
 /|\  |
1 2 3 4
| | |  \
| | |   B
| | |
1 2 3
  A

If you unplug cable 1, does the hardware rebalance all flows between
A<->B to only use 2 and 3 without software assistance? If not, you
will lose 1/3 of your flows.

> source port that is being put in the DSA header is still the physical
> port ID and not the logical port ID (the LAG number). That being said,

Ok, yeah I really wish this was true for mv88e6xxx as well.

> the framework you've built is clearly nice and works well even with
> bridging a bond.

Thank you!
Vladimir Oltean Nov. 19, 2020, 6:12 p.m. UTC | #28
On Thu, Nov 19, 2020 at 12:52:14PM +0100, Tobias Waldekranz wrote:
> > it appears that I don't need the .port_lag_change callback, and that the
>
> Ok, does ocelot automatically rebalance the LAG based on link state? I
> took a quick look through the datasheet for another switch from
> Vitesse, and it explicitly states that you need to update a table on
> link changes.
>
> I.e. in this situation:
>
>     br0
>    /  |
>  lag  |
>  /|\  |
> 1 2 3 4
> | | |  \
> | | |   B
> | | |
> 1 2 3
>   A
>
> If you unplug cable 1, does the hardware rebalance all flows between
> A<->B to only use 2 and 3 without software assistance? If not, you
> will lose 1/3 of your flows.

Yes, you're right, the switch doesn't rebalance the aggregation codes
across the remaining ports automatically. In my mind I was subconsciously
hoping that would be the case, because I need to make use of the
information in struct dsa_lag in non-DSA code (long story, but the
drivers/net/dsa/ocelot code shares the implementation with
drivers/net/ethernet/mscc/ocelot* which is a switchdev-only driver).
It doesn't mean that keeping state in dp->lag is the wrong thing to do,
it's just that this is an extra challenge for an already odd driver,
that I will have to see how I deal with :D