Message ID: 20201218193033.6138-1-jarod@redhat.com
State: New
Series: [RFC,net-next] bonding: add a vlan+srcmac tx hashing option
Jarod Wilson <jarod@redhat.com> wrote:

>This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. They're currently running a proprietary solution that effectively achieves load-balancing of VMs and bandwidth utilization improvements with a similar form of transmission algorithm.
>
>Basically, each VM has its own vlan, so it always sends its traffic out the same interface, unless that interface fails. Traffic gets split between the interfaces, maintaining a consistent path, with failover still available if an interface goes down.
>
>This has been rudimentarily tested to provide similar results, suitable for them to use to move off their current proprietary solution.
>
>Still on the TODO list, if this even looks sane to begin with, is fleshing out Documentation/networking/bonding.rst.

I'm sure you're aware, but any final submission will also need to include netlink and iproute2 support.

>Cc: Jay Vosburgh <j.vosburgh@gmail.com>
>Cc: Veaceslav Falico <vfalico@gmail.com>
>Cc: Andy Gospodarek <andy@greyhouse.net>
>Cc: "David S. Miller" <davem@davemloft.net>
>Cc: Jakub Kicinski <kuba@kernel.org>
>Cc: Thomas Davis <tadavis@lbl.gov>
>Cc: netdev@vger.kernel.org
>Signed-off-by: Jarod Wilson <jarod@redhat.com>
>---
> drivers/net/bonding/bond_main.c    | 27 +++++++++++++++++++++++++--
> drivers/net/bonding/bond_options.c |  1 +
> include/linux/netdevice.h          |  1 +
> include/uapi/linux/if_bonding.h    |  1 +
> 4 files changed, 28 insertions(+), 2 deletions(-)
>
>[...]
>
>+static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
>+{
>+	struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
>+	u32 srcmac = mac_hdr->h_source[5];
>+	u16 vlan;
>+
>+	if (!skb_vlan_tag_present(skb))
>+		return srcmac;
>+
>+	vlan = skb_vlan_tag_get(skb);
>+
>+	return srcmac ^ vlan;

For the common configuration wherein multiple VLANs are configured atop a single interface (and thus by default end up with the same MAC address), this seems like a fairly weak hash. The TCI is 16 bits (12 of which are the VID), but only 8 are used from the MAC, which will be a constant.

Is this algorithm copying the proprietary solution you mention?

-J

>+}
>
>[...]
>--
>2.29.2

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
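[Editor's note: Jay's objection can be made concrete with a few lines of userspace C. When several VLANs are stacked on one bond, mac[5] is the same constant for all of them, so the proposed fold degenerates to constant ^ TCI and all the entropy comes from the VID bits. The sketch below contrasts that with a fold over the full MAC plus the 12-bit VID; hash_wider and its mixing constants are invented for illustration and are not from the patch.]

    #include <stdint.h>
    #include <stdio.h>

    #define VLAN_VID_MASK 0x0fff	/* low 12 bits of the TCI are the VID */

    /* The patch's fold: last MAC byte XOR the full 16-bit TCI. */
    static uint32_t hash_as_proposed(const uint8_t *mac, uint16_t tci)
    {
        return (uint32_t)mac[5] ^ tci;
    }

    /* Invented wider fold: mix all six MAC bytes, then the VID only. */
    static uint32_t hash_wider(const uint8_t *mac, uint16_t tci)
    {
        uint32_t h = 0;

        for (int i = 0; i < 6; i++)
            h = (h << 5) ^ (h >> 27) ^ mac[i];	/* cheap rotate-and-xor */
        return h ^ (tci & VLAN_VID_MASK);
    }

    int main(void)
    {
        /* VLANs atop one interface share its MAC by default. */
        const uint8_t mac[6] = { 0x52, 0x54, 0x00, 0xaa, 0xbb, 0x05 };
        const uint16_t vids[] = { 100, 101, 200 };

        for (int i = 0; i < 3; i++)
            printf("vid %4u: proposed 0x%04x  wider 0x%08x\n",
                   (unsigned)vids[i],
                   (unsigned)hash_as_proposed(mac, vids[i]),
                   (unsigned)hash_wider(mac, vids[i]));
        return 0;
    }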
Fri, Dec 18, 2020 at 08:30:33PM CET, jarod@redhat.com wrote:

>This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. They're currently running a proprietary solution that effectively achieves load-balancing of VMs and bandwidth utilization improvements with a similar form of transmission algorithm.
>
>Basically, each VM has its own vlan, so it always sends its traffic out the same interface, unless that interface fails. Traffic gets split between the interfaces, maintaining a consistent path, with failover still available if an interface goes down.
>
>This has been rudimentarily tested to provide similar results, suitable for them to use to move off their current proprietary solution.
>
>Still on the TODO list, if this even looks sane to begin with, is fleshing out Documentation/networking/bonding.rst.

Jarod, did you consider using team driver instead ? :)
On Mon, Dec 28, 2020 at 11:11:45AM +0100, Jiri Pirko wrote:
> Fri, Dec 18, 2020 at 08:30:33PM CET, jarod@redhat.com wrote:
> >This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. [...]
>
> Jarod, did you consider using team driver instead ? :)

That's actually one of the things that was suggested, since team I believe already has support for this, but the user really wants to use bonding. We're finding that a lot of users really still prefer bonding over team.

--
Jarod Wilson
jarod@redhat.com
On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
> Jarod Wilson <jarod@redhat.com> wrote:
>
> >This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. They're currently running a proprietary solution that effectively achieves load-balancing of VMs and bandwidth utilization improvements with a similar form of transmission algorithm.
> >
> >Basically, each VM has its own vlan, so it always sends its traffic out the same interface, unless that interface fails. Traffic gets split between the interfaces, maintaining a consistent path, with failover still available if an interface goes down.
> >
> >This has been rudimentarily tested to provide similar results, suitable for them to use to move off their current proprietary solution.
> >
> >Still on the TODO list, if this even looks sane to begin with, is fleshing out Documentation/networking/bonding.rst.
>
> I'm sure you're aware, but any final submission will also need to include netlink and iproute2 support.

I believe everything for netlink support is already included, but I'll double-check that before submitting something for inclusion consideration.

> [...]
>
> >+static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
> >+{
> >+	struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
> >+	u32 srcmac = mac_hdr->h_source[5];
> >+	u16 vlan;
> >+
> >+	if (!skb_vlan_tag_present(skb))
> >+		return srcmac;
> >+
> >+	vlan = skb_vlan_tag_get(skb);
> >+
> >+	return srcmac ^ vlan;
>
> For the common configuration wherein multiple VLANs are configured atop a single interface (and thus by default end up with the same MAC address), this seems like a fairly weak hash. The TCI is 16 bits (12 of which are the VID), but only 8 are used from the MAC, which will be a constant.
>
> Is this algorithm copying the proprietary solution you mention?

I've not actually seen the code in question, so I can't be 100% certain, but this is exactly how it was described to me, and testing seems to bear out that it behaves at least similarly enough for the user.

They like simplicity, and the very basic hash suits their needs, which are basically just getting some load-balancing with failover, without having to have LACP, running on some older switches that can't do LACP.

--
Jarod Wilson
jarod@redhat.com
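[Editor's note: to make the "consistent path" behavior concrete — in balance-xor-style modes the transmit slave is chosen roughly as bond_xmit_hash() modulo the slave count, so a fixed (source MAC, VLAN) pair keeps landing on the same slave until the slave count changes, e.g. on failover. The VM MACs and VIDs below are made up for the example.]

    #include <stdint.h>
    #include <stdio.h>

    /* The patch's fold, lifted out for demonstration. */
    static uint32_t vlan_srcmac_hash(uint8_t mac_last_byte, uint16_t tci)
    {
        return (uint32_t)mac_last_byte ^ tci;
    }

    int main(void)
    {
        const unsigned slave_count = 2;	/* two bonded interfaces */
        const struct { const char *vm; uint8_t mac5; uint16_t vid; } vms[] = {
            { "vm-a", 0x11, 100 },
            { "vm-b", 0x22, 200 },
            { "vm-c", 0x33, 300 },
        };

        for (int i = 0; i < 3; i++) {
            uint32_t h = vlan_srcmac_hash(vms[i].mac5, vms[i].vid);

            /* Deterministic per VM+VLAN -> a stable egress interface. */
            printf("%s vid %u -> hash 0x%x -> slave %u\n",
                   vms[i].vm, (unsigned)vms[i].vid,
                   (unsigned)h, (unsigned)(h % slave_count));
        }
        return 0;
    }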
Fri, Jan 08, 2021 at 12:58:13AM CET, jarod@redhat.com wrote:

>On Mon, Dec 28, 2020 at 11:11:45AM +0100, Jiri Pirko wrote:
>> Fri, Dec 18, 2020 at 08:30:33PM CET, jarod@redhat.com wrote:
>> >This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. [...]
>>
>> Jarod, did you consider using team driver instead ? :)
>
>That's actually one of the things that was suggested, since team I believe already has support for this, but the user really wants to use bonding. We're finding that a lot of users really still prefer bonding over team.

Do you know the reason, other than "nostalgia"?
On Fri, Jan 08, 2021 at 02:12:56PM +0100, Jiri Pirko wrote:
> Fri, Jan 08, 2021 at 12:58:13AM CET, jarod@redhat.com wrote:
> >On Mon, Dec 28, 2020 at 11:11:45AM +0100, Jiri Pirko wrote:
> >> Fri, Dec 18, 2020 at 08:30:33PM CET, jarod@redhat.com wrote:
> >> >This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. [...]
> >>
> >> Jarod, did you consider using team driver instead ? :)
> >
> >That's actually one of the things that was suggested, since team I believe already has support for this, but the user really wants to use bonding. We're finding that a lot of users really still prefer bonding over team.
>
> Do you know the reason, other than "nostalgia"?

I've heard a few different reasons that come to mind:

1) nostalgia is definitely one -- "we know bonding here"

2) support -- "the things I'm running say I need bonding to properly support failover in their environment". How accurate this is, I don't actually know.

3) monitoring -- "my monitoring solution knows about bonding, but not about team". This is probably easily fixed, but may or may not be in the user's direct control.

4) footprint -- "bonding does the job w/o team's userspace footprint". I think this one is kind of hard for team to do anything about; bonding really does have a smaller userspace footprint, which is a plus for embedded-type applications and high-security environments looking to keep things as minimal as possible.

I think I've heard a few "we tried team years ago and it didn't work" as well, which of course is ridiculous as a reason not to try something again, since a lot can change in a few years in this world.

--
Jarod Wilson
jarod@redhat.com
On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
> > Jarod Wilson <jarod@redhat.com> wrote:
> >
> > >This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. [...]
> >
> > I'm sure you're aware, but any final submission will also need to include netlink and iproute2 support.
>
> I believe everything for netlink support is already included, but I'll double-check that before submitting something for inclusion consideration.

I'm not certain if what you actually meant was that I'd have to patch iproute2 as well, which I've definitely stumbled onto today, but it's a 2-line patch, and everything seems to be working fine with it:

$ sudo ip link set bond0 type bond xmit_hash_policy 5
$ ip -d link show bond0
11: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
    bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy vlansrc resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
$ grep Hash /proc/net/bonding/bond0
Transmit Hash Policy: vlansrc (5)

Nothing bad seems to happen on an older kernel if one tries to set the new hash; you just get told that it's an invalid argument.

I *think* this is all ready for submission then, so I'll get both the kernel and iproute2 patches out soon.

--
Jarod Wilson
jarod@redhat.com
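[Editor's note: the 2-line iproute2 change itself isn't shown in the thread. Its likely shape, guessed from how iproute2 keeps bond option names in string tables indexed by the kernel's policy values, is a new entry before the NULL terminator; the table and helper below are reconstructions for illustration, not the actual iproute2 source.]

    #include <stdio.h>
    #include <string.h>

    /* Index in this table == BOND_XMIT_POLICY_* value in the kernel. */
    static const char *xmit_hash_policy_tbl[] = {
        "layer2", "layer3+4", "layer2+3", "encap2+3", "encap3+4",
        "vlansrc",	/* the new entry: BOND_XMIT_POLICY_VLAN_SRCMAC == 5 */
        NULL,
    };

    /* Minimal stand-in for iproute2's table lookup. */
    static int get_index(const char **tbl, const char *name)
    {
        for (int i = 0; tbl[i]; i++)
            if (!strcmp(tbl[i], name))
                return i;
        return -1;
    }

    int main(void)
    {
        printf("vlansrc -> policy %d\n",
               get_index(xmit_hash_policy_tbl, "vlansrc"));
        return 0;
    }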
Jarod Wilson <jarod@redhat.com> wrote:

>On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
>> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
>> > Jarod Wilson <jarod@redhat.com> wrote:
>> >
>> > >This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. [...]
>> >
>> > I'm sure you're aware, but any final submission will also need to include netlink and iproute2 support.
>>
>> I believe everything for netlink support is already included, but I'll double-check that before submitting something for inclusion consideration.
>
>I'm not certain if what you actually meant was that I'd have to patch iproute2 as well, which I've definitely stumbled onto today, but it's a 2-line patch, and everything seems to be working fine with it:

Yes, that's what I meant.

>$ sudo ip link set bond0 type bond xmit_hash_policy 5

Does the above work with the text label (presumably "vlansrc") as well as the number, and does "ip link add test type bond help" print the correct text for XMIT_HASH_POLICY?

-J

>$ ip -d link show bond0
>11: bond0: <BROADCAST,MULTICAST,MASTER> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
>    link/ether ce:85:5e:24:ce:90 brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
>    bond mode balance-xor miimon 0 updelay 0 downdelay 0 peer_notify_delay 0 use_carrier 1 arp_interval 0 arp_validate none arp_all_targets any primary_reselect always fail_over_mac none xmit_hash_policy vlansrc resend_igmp 1 num_grat_arp 1 all_slaves_active 0 min_links 0 lp_interval 1 packets_per_slave 1 lacp_rate slow ad_select stable tlb_dynamic_lb 1 addrgenmode eui64 numtxqueues 16 numrxqueues 16 gso_max_size 65536 gso_max_segs 65535
>$ grep Hash /proc/net/bonding/bond0
>Transmit Hash Policy: vlansrc (5)
>
>Nothing bad seems to happen on an older kernel if one tries to set the new hash; you just get told that it's an invalid argument.
>
>I *think* this is all ready for submission then, so I'll get both the kernel and iproute2 patches out soon.
>
>--
>Jarod Wilson
>jarod@redhat.com

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
On Tue, Jan 12, 2021 at 01:39:10PM -0800, Jay Vosburgh wrote:
> Jarod Wilson <jarod@redhat.com> wrote:
>
> >On Thu, Jan 07, 2021 at 07:03:40PM -0500, Jarod Wilson wrote:
> >> On Fri, Dec 18, 2020 at 04:18:59PM -0800, Jay Vosburgh wrote:
> >> > Jarod Wilson <jarod@redhat.com> wrote:
> >> >
> >> > >This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. [...]
> >> >
> >> > I'm sure you're aware, but any final submission will also need to include netlink and iproute2 support.
> >>
> >> I believe everything for netlink support is already included, but I'll double-check that before submitting something for inclusion consideration.
> >
> >I'm not certain if what you actually meant was that I'd have to patch iproute2 as well, which I've definitely stumbled onto today, but it's a 2-line patch, and everything seems to be working fine with it:
>
> Yes, that's what I meant.
>
> >$ sudo ip link set bond0 type bond xmit_hash_policy 5
>
> Does the above work with the text label (presumably "vlansrc") as well as the number, and does "ip link add test type bond help" print the correct text for XMIT_HASH_POLICY?

All of the above looks correct to me, output below. Before submitting, I could rename it from vlansrc to vlan+srcmac or some variation thereof, if that's desired. I tried to keep it relatively short, though it's perhaps a bit too terse as I have it now, and the other modes' names do include a +.

$ sudo modprobe bonding mode=2 max_bonds=1 xmit_hash_policy=0
$ sudo ip link set bond0 type bond xmit_hash_policy vlansrc
$ cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v4.18.0-272.el8.vstx.x86_64

Bonding Mode: load balancing (xor)
Transmit Hash Policy: vlansrc (5)
MII Status: down
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0
Peer Notification Delay (ms): 0

$ sudo ip link add test type bond help
Usage: ... bond [ mode BONDMODE ] [ active_slave SLAVE_DEV ] [ clear_active_slave ]
                [ miimon MIIMON ] [ updelay UPDELAY ] [ downdelay DOWNDELAY ]
                [ peer_notify_delay DELAY ]
                [ use_carrier USE_CARRIER ] [ arp_interval ARP_INTERVAL ]
                [ arp_validate ARP_VALIDATE ] [ arp_all_targets ARP_ALL_TARGETS ]
                [ arp_ip_target [ ARP_IP_TARGET, ... ] ]
                [ primary SLAVE_DEV ] [ primary_reselect PRIMARY_RESELECT ]
                [ fail_over_mac FAIL_OVER_MAC ]
                [ xmit_hash_policy XMIT_HASH_POLICY ]
                [ resend_igmp RESEND_IGMP ]
                [ num_grat_arp|num_unsol_na NUM_GRAT_ARP|NUM_UNSOL_NA ]
                [ all_slaves_active ALL_SLAVES_ACTIVE ]
                [ min_links MIN_LINKS ]
                [ lp_interval LP_INTERVAL ]
                [ packets_per_slave PACKETS_PER_SLAVE ]
                [ tlb_dynamic_lb TLB_DYNAMIC_LB ]
                [ lacp_rate LACP_RATE ]
                [ ad_select AD_SELECT ]
                [ ad_user_port_key PORTKEY ]
                [ ad_actor_sys_prio SYSPRIO ]
                [ ad_actor_system LLADDR ]

BONDMODE := balance-rr|active-backup|balance-xor|broadcast|802.3ad|balance-tlb|balance-alb
ARP_VALIDATE := none|active|backup|all
ARP_ALL_TARGETS := any|all
PRIMARY_RESELECT := always|better|failure
FAIL_OVER_MAC := none|active|follow
XMIT_HASH_POLICY := layer2|layer2+3|layer3+4|encap2+3|encap3+4|vlansrc
LACP_RATE := slow|fast
AD_SELECT := stable|bandwidth|count

--
Jarod Wilson
jarod@redhat.com
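[Editor's note: if the vlan+srcmac spelling were chosen, the change would be confined to the user-visible string in the patch's bond_options.c table; the BOND_XMIT_POLICY_VLAN_SRCMAC value and the rest of the plumbing stay the same. A sketch of the full table after such a rename (the pre-existing entries are from the kernel's bond_xmit_hashtype_tbl):]

    static const struct bond_opt_value bond_xmit_hashtype_tbl[] = {
        { "layer2",      BOND_XMIT_POLICY_LAYER2,      BOND_VALFLAG_DEFAULT},
        { "layer3+4",    BOND_XMIT_POLICY_LAYER34,     0},
        { "layer2+3",    BOND_XMIT_POLICY_LAYER23,     0},
        { "encap2+3",    BOND_XMIT_POLICY_ENCAP23,     0},
        { "encap3+4",    BOND_XMIT_POLICY_ENCAP34,     0},
        { "vlan+srcmac", BOND_XMIT_POLICY_VLAN_SRCMAC, 0},	/* renamed */
        { NULL,          -1,                           0},
    };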
Hi Team,

I have a question about bonding. During testing bonding mode 4 scenarios, I find that there is a very low probability that the pointer is null. The following information is displayed:

[99359.795934] bond0: (slave eth13.2001): Port 2 did not find a suitable aggregator
[99359.796960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
[99359.798127] Mem abort info:
[99359.798526]   ESR = 0x96000004
[99359.798938]   EC = 0x25: DABT (current EL), IL = 32 bits
[99359.799673]   SET = 0, FnV = 0
[99359.800106]   EA = 0, S1PTW = 0
[99359.800554] Data abort info:
[99359.800952]   ISV = 0, ISS = 0x00000004
[99359.801522]   CM = 0, WnR = 0
[99359.801970] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000c64e6000
[99359.802876] [0000000000000020] pgd=0000000000000000
[99359.803555] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[99359.804369] Modules linked in: bonding hns3(-) hclgevf hnae3 [last unloaded: bonding]
[99359.805494] CPU: 1 PID: 951 Comm: kworker/u10:2 Not tainted 5.7.0-rc4+ #1
[99359.806455] Hardware name: linux,dummy-virt (DT)
[99359.807107] Workqueue: bond0 bond_3ad_state_machine_handler [bonding]
[99359.808056] pstate: 60c00005 (nZCv daif +PAN +UAO)
[99359.808722] pc : bond_3ad_state_machine_handler+0x7fc/0xdb8 [bonding]
[99359.809652] lr : bond_3ad_state_machine_handler+0x7f4/0xdb8 [bonding]
[99359.810535] sp : ffff80001882bd20
[99359.811012] x29: ffff80001882bd20 x28: ffff000085939a38
[99359.811791] x27: ffff00008649bb68 x26: 00000000aaaaaaab
[99359.812871] x25: ffff800009401000 x24: ffff800009408de4
[99359.814049] x23: ffff80001882bd98 x22: ffff00008649b880
[99359.815210] x21: 0000000000000000 x20: ffff000085939a00
[99359.816401] x19: ffff00008649b880 x18: ffff800012572988
[99359.817637] x17: 0000000000000000 x16: 0000000000000000
[99359.818870] x15: ffff80009882b987 x14: 726f746167657267
[99359.820090] x13: 676120656c626174 x12: 697573206120646e
[99359.821374] x11: 696620746f6e2064 x10: 696420322074726f
[99359.822659] x9 : 50203a2931303032 x8 : 0000000000081391
[99359.823891] x7 : ffff8000108e3ad0 x6 : ffff8000128858bb
[99359.825109] x5 : 0000000000000000 x4 : 0000000000000000
[99359.826262] x3 : 00000000ffffffff x2 : 906b329bb5362a00
[99359.827394] x1 : 906b329bb5362a00 x0 : 0000000000000000
[99359.828540] Call trace:
[99359.829071]  bond_3ad_state_machine_handler+0x7fc/0xdb8 [bonding]
[99359.830367]  process_one_work+0x15c/0x4a0
[99359.831216]  worker_thread+0x50/0x478
[99359.832022]  kthread+0x130/0x160
[99359.832716]  ret_from_fork+0x10/0x18
[99359.833487] Code: 910c0021 95f704bb f9403f80 b5ffe300 (f9401000)
[99359.834742] ---[ end trace c7a8e02914afc4e0 ]---
[99359.835817] Kernel panic - not syncing: Fatal exception in interrupt
[99359.837334] SMP: stopping secondary CPUs
[99359.838277] Kernel Offset: disabled
[99359.839086] CPU features: 0x080002,22208218
[99359.840053] Memory Limit: none
[99359.840783] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

The test procedure is as follows:
1. Configure bonding and set it to mode 4.
   echo "4" > /sys/class/net/bond0/bonding/mode
   ifconfig bond0 up

2. Configure two VLANs and add them to the bonding in step 1.
   vconfig add eth0 2001
   vconfig add eth1 2001
   ifenslave bond0 eth0.2001 eth1.2001

3. Unload the network device driver and bonding driver.
   rmmod hns3
   rmmod hclge
   rmmod hnae3
   rmmod bonding.ko

4. Repeat the preceding steps for a long time.

By checking the logic in ad_port_selection_logic(), I find that if we enter the branch "Port %d did not find a suitable aggregator", the value of port->aggregator will be NULL, causing the problem.

So I'd like to ask under what circumstances this branch is taken, and what should be done in this case?

The detailed code analysis is as follows:

static void ad_port_selection_logic(struct port *port, bool *update_slave_arr)
{
[...]
	/* if the port is connected to other aggregator, detach it */
	if (port->aggregator) {
		/* detach the port from its former aggregator */
		temp_aggregator = port->aggregator;
		for (curr_port = temp_aggregator->lag_ports; curr_port;
		     last_port = curr_port,
		     curr_port = curr_port->next_port_in_aggregator) {
			if (curr_port == port) {
				temp_aggregator->num_of_ports--;
				/* if it is the first port attached to the
				 * aggregator
				 */
				if (!last_port) {
					temp_aggregator->lag_ports =
						port->next_port_in_aggregator;
				} else {
					/* not the first port attached to the
					 * aggregator
					 */
					last_port->next_port_in_aggregator =
						port->next_port_in_aggregator;
				}

				/* clear the port's relations to this
				 * aggregator
				 */
				port->aggregator = NULL;

----analysis: set port->aggregator NULL at the beginning of ad_port_selection_logic().

				port->next_port_in_aggregator = NULL;
				port->actor_port_aggregator_identifier = 0;

				slave_dbg(bond->dev, port->slave->dev, "Port %d left LAG %d\n",
					  port->actor_port_number,
					  temp_aggregator->aggregator_identifier);
				/* if the aggregator is empty, clear its
				 * parameters, and set it ready to be attached
				 */
				if (!temp_aggregator->lag_ports)
					ad_clear_agg(temp_aggregator);
				break;
			}
		}
		if (!curr_port) {
			/* meaning: the port was related to an aggregator
			 * but was not on the aggregator port list
			 */
			net_warn_ratelimited("%s: (slave %s): Warning: Port %d was related to aggregator %d but was not on its port list\n",
					     port->slave->bond->dev->name,
					     port->slave->dev->name,
					     port->actor_port_number,
					     port->aggregator->aggregator_identifier);
		}
	}
	/* search on all aggregators for a suitable aggregator for this port */
	bond_for_each_slave(bond, slave, iter) {
		aggregator = &(SLAVE_AD_INFO(slave)->aggregator);

		/* keep a free aggregator for later use(if needed) */
		if (!aggregator->lag_ports) {
			if (!free_aggregator)
				free_aggregator = aggregator;

----analysis: Save free_aggregator if a free aggregator is found. But in this case, no free aggregator is available.

			continue;
		}
		/* check if current aggregator suits us */
		if (((aggregator->actor_oper_aggregator_key == port->actor_oper_port_key) && /* if all parameters match AND */
		     MAC_ADDRESS_EQUAL(&(aggregator->partner_system), &(port->partner_oper.system)) &&
		     (aggregator->partner_system_priority == port->partner_oper.system_priority) &&
		     (aggregator->partner_oper_aggregator_key == port->partner_oper.key)
		    ) &&
		    ((!MAC_ADDRESS_EQUAL(&(port->partner_oper.system), &(null_mac_addr)) && /* partner answers */
		      !aggregator->is_individual)  /* but is not individual OR */
		    )
		   ) {
			/* attach to the founded aggregator */
			port->aggregator = aggregator;

----analysis: If a proper aggregator is found, port->aggregator is assigned here.

			port->actor_port_aggregator_identifier =
				port->aggregator->aggregator_identifier;
			port->next_port_in_aggregator = aggregator->lag_ports;
			port->aggregator->num_of_ports++;
			aggregator->lag_ports = port;
			slave_dbg(bond->dev, slave->dev, "Port %d joined LAG %d (existing LAG)\n",
				  port->actor_port_number,
				  port->aggregator->aggregator_identifier);

			/* mark this port as selected */
			port->sm_vars |= AD_PORT_SELECTED;
			found = 1;

----analysis: Set found to 1 if port->aggregator is assigned.

			break;
		}
	}

	/* the port couldn't find an aggregator - attach it to a new
	 * aggregator
	 */
	if (!found) {
		if (free_aggregator) {
			/* assign port a new aggregator */
			port->aggregator = free_aggregator;

----analysis: No proper free_aggregator is found. Therefore, port->aggregator cannot be assigned here.

			port->actor_port_aggregator_identifier =
				port->aggregator->aggregator_identifier;

			/* update the new aggregator's parameters
			 * if port was responsed from the end-user
			 */
			if (port->actor_oper_port_key & AD_DUPLEX_KEY_MASKS)
				/* if port is full duplex */
				port->aggregator->is_individual = false;
			else
				port->aggregator->is_individual = true;

			port->aggregator->actor_admin_aggregator_key =
				port->actor_admin_port_key;
			port->aggregator->actor_oper_aggregator_key =
				port->actor_oper_port_key;
			port->aggregator->partner_system =
				port->partner_oper.system;
			port->aggregator->partner_system_priority =
				port->partner_oper.system_priority;
			port->aggregator->partner_oper_aggregator_key = port->partner_oper.key;
			port->aggregator->receive_state = 1;
			port->aggregator->transmit_state = 1;
			port->aggregator->lag_ports = port;
			port->aggregator->num_of_ports++;

			/* mark this port as selected */
			port->sm_vars |= AD_PORT_SELECTED;

			slave_dbg(bond->dev, port->slave->dev, "Port %d joined LAG %d (new LAG)\n",
				  port->actor_port_number,
				  port->aggregator->aggregator_identifier);
		} else {
			slave_err(bond->dev, port->slave->dev,
				  "Port %d did not find a suitable aggregator\n",
				  port->actor_port_number);

----analysis: The fault scenario goes to this branch, and port->aggregator remains NULL.

		}
	}
	/* if all aggregator's ports are READY_N == TRUE, set ready=TRUE
	 * in all aggregator's ports, else set ready=FALSE in all
	 * aggregator's ports
	 */
	__set_agg_ports_ready(port->aggregator,
			      __agg_ports_are_ready(port->aggregator));

----analysis: port->aggregator is still NULL, which causes the problem.

	aggregator = __get_first_agg(port);
	ad_agg_selection_logic(aggregator, update_slave_arr);

	if (!port->aggregator->is_active)
		port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;
}

Thanks.
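[Editor's note: the faulting address in the log above — 0000000000000020 — is itself a hint that a struct member is being read through a NULL pointer: the fault lands at that member's offset. A self-contained illustration of the pattern; the mock layout and the 0x20 offset are invented and make no claim about struct aggregator's real layout.]

    #include <stddef.h>
    #include <stdio.h>

    struct aggregator {
        char pad[0x20];	/* stand-in for the fields before the one read */
        int is_active;	/* sits at offset 0x20 in this mock layout */
    };

    int main(void)
    {
        printf("offsetof(struct aggregator, is_active) = %#zx\n",
               offsetof(struct aggregator, is_active));
        /* In the kernel, agg->is_active with agg == NULL faults at
         * exactly this small address; userspace would segfault too. */
        return 0;
    }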
Ping...
Any comments? Thanks!

On 2021/1/15 10:02, moyufeng wrote:
> Hi Team,
>
> I have a question about bonding. During testing bonding mode 4 scenarios, I find that there is a very low probability that the pointer is null. The following information is displayed:
>
> [99359.795934] bond0: (slave eth13.2001): Port 2 did not find a suitable aggregator
> [99359.796960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
> [...]
>
> The test procedure is as follows:
> 1. Configure bonding and set it to mode 4.
> 2. Configure two VLANs and add them to the bonding in step 1.
> 3. Unload the network device driver and bonding driver.
> 4. Repeat the preceding steps for a long time.
>
> By checking the logic in ad_port_selection_logic(), I find that if we enter the branch "Port %d did not find a suitable aggregator", the value of port->aggregator will be NULL, causing the problem.
>
> So I'd like to ask under what circumstances this branch is taken, and what should be done in this case?
>
> The detailed code analysis is as follows:
> [...]
>
> Thanks.
moyufeng <moyufeng@huawei.com> wrote:

>Ping...
>Any comments? Thanks!
>
>On 2021/1/15 10:02, moyufeng wrote:
>> Hi Team,
>>
>> I have a question about bonding. During testing bonding mode 4 scenarios, I find that there is a very low probability that the pointer is null. The following information is displayed:
>>
>> [99359.795934] bond0: (slave eth13.2001): Port 2 did not find a suitable aggregator
>> [99359.796960] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020
>> [...]
>>
>> The test procedure is as follows:
>> 1. Configure bonding and set it to mode 4.
>>    echo "4" > /sys/class/net/bond0/bonding/mode
>>    ifconfig bond0 up
>>
>> 2. Configure two VLANs and add them to the bonding in step 1.
>>    vconfig add eth0 2001
>>    vconfig add eth1 2001
>>    ifenslave bond0 eth0.2001 eth1.2001
>>
>> 3. Unload the network device driver and bonding driver.
>>    rmmod hns3
>>    rmmod hclge
>>    rmmod hnae3
>>    rmmod bonding.ko

Are you running the above in a script, and can you share the entire thing?

Does the issue occur with the current net-next?

>> 4. Repeat the preceding steps for a long time.

When you run this test, what are the network interfaces eth0 and eth1 connected to, and are those ports configured for VLAN 2001 and LACP?

>> By checking the logic in ad_port_selection_logic(), I find that if we enter the branch "Port %d did not find a suitable aggregator", the value of port->aggregator will be NULL, causing the problem.
>>
>> So I'd like to ask under what circumstances this branch is taken, and what should be done in this case?

Well, in principle, this shouldn't ever happen. Every port structure contains an aggregator structure, so there should always be one available somewhere. I'm going to speculate that there's a race condition somewhere in the teardown processing vs the LACP state machine that invalidates this presumption.

>> The detailed code analysis is as follows:

[...]

>> /* if all aggregator's ports are READY_N == TRUE, set ready=TRUE
>>  * in all aggregator's ports, else set ready=FALSE in all
>>  * aggregator's ports
>>  */
>> __set_agg_ports_ready(port->aggregator,
>>		       __agg_ports_are_ready(port->aggregator));
>>
>> ----analysis: port->aggregator is still NULL, which causes the problem.
>>
>> aggregator = __get_first_agg(port);
>> ad_agg_selection_logic(aggregator, update_slave_arr);
>>
>> if (!port->aggregator->is_active)
>>	port->actor_oper_port_state &= ~LACP_STATE_SYNCHRONIZATION;

Correct, if the "did not find a suitable aggregator" path is taken, port->aggregator is NULL and bad things happen in the above block.

This is something that needs to be fixed, but I'm also concerned that there are other issues lurking, so I'd like to be able to reproduce this.

-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
On 2021/1/30 3:11, Jay Vosburgh wrote:
> moyufeng <moyufeng@huawei.com> wrote:
>
>> Ping...
>> Any comments? Thanks!
>>
>> On 2021/1/15 10:02, moyufeng wrote:
>>> Hi Team,
>>>
>>> I have a question about bonding. During testing bonding mode 4 scenarios, I find that there is a very low probability that the pointer is null. The following information is displayed:
>>>
>>> [99359.795934] bond0: (slave eth13.2001): Port 2 did not find a suitable aggregator
>>> [...]
>>>
>>> The test procedure is as follows:
>>> 1. Configure bonding and set it to mode 4.
>>> 2. Configure two VLANs and add them to the bonding in step 1.
>>> 3. Unload the network device driver and bonding driver.
>
> Are you running the above in a script, and can you share the entire thing?

Thanks for your reply. Yes, it's a script, as below. However, the recurrence probability is very low. So far, this problem has occurred only once :(

#start
#!/bin/bash
Version=$(uname -r)
KoPath=/lib/modules/"${Version}"

while [ 1 ];do
	insmod ${KoPath}/hnae3.ko
	insmod ${KoPath}/hclgevf.ko
	insmod ${KoPath}/hns3.ko
	insmod ${KoPath}/bonding.ko

	ifconfig eth0 up
	ifconfig eth1 up
	echo "4" > /sys/class/net/bond0/bonding/mode
	ifconfig bond0 192.168.1.101
	echo 100 > "/sys/class/net/bond0/bonding/miimon"
	sleep 8

	vconfig add eth0 2001
	vconfig add eth1 2001
	ifconfig eth0.2001 192.168.2.101 up
	ifconfig eth1.2001 192.168.3.101 up
	sleep 8

	ifenslave bond0 eth0.2001 eth13.2001
	sleep 8

	rmmod hns3
	rmmod hclge
	rmmod hclgevf
	rmmod hnae3
	rmmod bonding.ko
	sleep 5
done
#end

> Does the issue occur with the current net-next?

Yes

>>> 4. Repeat the preceding steps for a long time.
>
> When you run this test, what are the network interfaces eth0 and eth1 connected to, and are those ports configured for VLAN 2001 and LACP?

Yes, the configurations at both ends are the same.

>>> By checking the logic in ad_port_selection_logic(), I find that if we enter the branch "Port %d did not find a suitable aggregator", the value of port->aggregator will be NULL, causing the problem.
>>>
>>> So I'd like to ask under what circumstances this branch is taken, and what should be done in this case?
>
> Well, in principle, this shouldn't ever happen. Every port structure contains an aggregator structure, so there should always be one available somewhere. I'm going to speculate that there's a race condition somewhere in the teardown processing vs the LACP state machine that invalidates this presumption.

I agree with your inference. But I don't have a better way to prove it, since the recurrence probability is too low.

>>> The detailed code analysis is as follows:
>
> [...]
>
> Correct, if the "did not find a suitable aggregator" path is taken, port->aggregator is NULL and bad things happen in the above block.
>
> This is something that needs to be fixed, but I'm also concerned that there are other issues lurking, so I'd like to be able to reproduce this.
>
> -J

I will keep trying to reproduce the problem and look for ways to improve the reproduction probability. Since this path is incorrect, could you help fix it? Thank you! :)

> ---
> -Jay Vosburgh, jay.vosburgh@canonical.com
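[Editor's note: no fix is posted in the thread as captured here. Purely as a sketch of the minimal mitigation Jay's analysis implies — bail out of ad_port_selection_logic() when no aggregator could be attached, rather than dereferencing port->aggregator below — and not a claim about how the suspected teardown race itself should be resolved:]

    } else {
        slave_err(bond->dev, port->slave->dev,
                  "Port %d did not find a suitable aggregator\n",
                  port->actor_port_number);
        /* Sketch: nothing sane can be done below without an
         * aggregator; returning avoids the NULL dereference in
         * __set_agg_ports_ready() and in the is_active check.
         */
        return;
    }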
This comes from an end-user request, where they're running multiple VMs on hosts with bonded interfaces connected to some interesting switch topologies, where 802.3ad isn't an option. They're currently running a proprietary solution that effectively achieves load-balancing of VMs and bandwidth utilization improvements with a similar form of transmission algorithm.

Basically, each VM has its own vlan, so it always sends its traffic out the same interface, unless that interface fails. Traffic gets split between the interfaces, maintaining a consistent path, with failover still available if an interface goes down.

This has been rudimentarily tested to provide similar results, suitable for them to use to move off their current proprietary solution.

Still on the TODO list, if this even looks sane to begin with, is fleshing out Documentation/networking/bonding.rst.

Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Thomas Davis <tadavis@lbl.gov>
Cc: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
---
 drivers/net/bonding/bond_main.c    | 27 +++++++++++++++++++++++++--
 drivers/net/bonding/bond_options.c |  1 +
 include/linux/netdevice.h          |  1 +
 include/uapi/linux/if_bonding.h    |  1 +
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 5fe5232cc3f3..151ce8c7a56f 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -164,7 +164,7 @@ module_param(xmit_hash_policy, charp, 0);
 MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 802.3ad hashing method; "
 		 "0 for layer 2 (default), 1 for layer 3+4, "
 		 "2 for layer 2+3, 3 for encap layer 2+3, "
-		 "4 for encap layer 3+4");
+		 "4 for encap layer 3+4, 5 for vlan+srcmac");
 module_param(arp_interval, int, 0);
 MODULE_PARM_DESC(arp_interval, "arp interval in milliseconds");
 module_param_array(arp_ip_target, charp, NULL, 0);
@@ -1434,6 +1434,8 @@ static enum netdev_lag_hash bond_lag_hash_type(struct bonding *bond,
 		return NETDEV_LAG_HASH_E23;
 	case BOND_XMIT_POLICY_ENCAP34:
 		return NETDEV_LAG_HASH_E34;
+	case BOND_XMIT_POLICY_VLAN_SRCMAC:
+		return NETDEV_LAG_HASH_VLAN_SRCMAC;
 	default:
 		return NETDEV_LAG_HASH_UNKNOWN;
 	}
@@ -3494,6 +3496,20 @@ static bool bond_flow_ip(struct sk_buff *skb, struct flow_keys *fk,
 	return true;
 }

+static inline u32 bond_vlan_srcmac_hash(struct sk_buff *skb)
+{
+	struct ethhdr *mac_hdr = (struct ethhdr *)skb_mac_header(skb);
+	u32 srcmac = mac_hdr->h_source[5];
+	u16 vlan;
+
+	if (!skb_vlan_tag_present(skb))
+		return srcmac;
+
+	vlan = skb_vlan_tag_get(skb);
+
+	return srcmac ^ vlan;
+}
+
 /* Extract the appropriate headers based on bond's xmit policy */
 static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 			      struct flow_keys *fk)
@@ -3501,10 +3517,14 @@ static bool bond_flow_dissect(struct bonding *bond, struct sk_buff *skb,
 	bool l34 = bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER34;
 	int noff, proto = -1;

-	if (bond->params.xmit_policy > BOND_XMIT_POLICY_LAYER23) {
+	switch (bond->params.xmit_policy) {
+	case BOND_XMIT_POLICY_ENCAP23:
+	case BOND_XMIT_POLICY_ENCAP34:
 		memset(fk, 0, sizeof(*fk));
 		return __skb_flow_dissect(NULL, skb, &flow_keys_bonding,
 					  fk, NULL, 0, 0, 0, 0);
+	default:
+		break;
 	}

 	fk->ports.ports = 0;
@@ -3556,6 +3576,9 @@ u32 bond_xmit_hash(struct bonding *bond, struct sk_buff *skb)
 	    skb->l4_hash)
 		return skb->hash;

+	if (bond->params.xmit_policy == BOND_XMIT_POLICY_VLAN_SRCMAC)
+		return bond_vlan_srcmac_hash(skb);
+
 	if (bond->params.xmit_policy == BOND_XMIT_POLICY_LAYER2 ||
 	    !bond_flow_dissect(bond, skb, &flow))
 		return bond_eth_hash(skb);
diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index a4e4e15f574d..9826fe46fca1 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -101,6 +101,7 @@ static const struct bond_opt_value bond_xmit_hashtype_tbl[] = {
 	{ "layer2+3", BOND_XMIT_POLICY_LAYER23, 0},
 	{ "encap2+3", BOND_XMIT_POLICY_ENCAP23, 0},
 	{ "encap3+4", BOND_XMIT_POLICY_ENCAP34, 0},
+	{ "vlansrc",  BOND_XMIT_POLICY_VLAN_SRCMAC, 0},
 	{ NULL,       -1,                       0},
 };

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7bf167993c05..551eac4ab747 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2633,6 +2633,7 @@ enum netdev_lag_hash {
 	NETDEV_LAG_HASH_L23,
 	NETDEV_LAG_HASH_E23,
 	NETDEV_LAG_HASH_E34,
+	NETDEV_LAG_HASH_VLAN_SRCMAC,
 	NETDEV_LAG_HASH_UNKNOWN,
 };

diff --git a/include/uapi/linux/if_bonding.h b/include/uapi/linux/if_bonding.h
index 45f3750aa861..e8eb4ad03cf1 100644
--- a/include/uapi/linux/if_bonding.h
+++ b/include/uapi/linux/if_bonding.h
@@ -94,6 +94,7 @@
 #define BOND_XMIT_POLICY_LAYER23	2 /* layer 2+3 (IP ^ MAC) */
 #define BOND_XMIT_POLICY_ENCAP23	3 /* encapsulated layer 2+3 */
 #define BOND_XMIT_POLICY_ENCAP34	4 /* encapsulated layer 3+4 */
+#define BOND_XMIT_POLICY_VLAN_SRCMAC	5 /* vlan + source MAC */

 /* 802.3ad port state definitions (43.4.2.2 in the 802.3ad standard) */
 #define LACP_STATE_LACP_ACTIVITY	0x1