Message ID | 1604303803-30660-1-git-send-email-i@liuyulong.me |
---|---|
State | New |
Series | [v2] net: bonding: alb disable balance for IPv6 multicast related mac |
On Mon, 2 Nov 2020 15:56:43 +0800 LIU Yulong wrote:
> According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined to
> construct the multicast destination MAC address for IPv6 multicast traffic.
> The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such
> rule. The work steps [6] are:
> *) Let's assume a destination address of 2001:db8:1:1::1.
> *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
>    format of ff02::1:ffXX:XXXX.
> *) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
>    directly from the last 24 bits of the destination address.
> *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
> *) This, being a multicast address, can be mapped to a multicast MAC
>    address, using the format 33-33-XX-XX-XX-XX
> *) Resulting in 33-33-ff-00-00-01.
> *) This is a MAC address that is only being listened for by nodes
>    sharing the same last 24 bits.
> *) In other words, while there is a chance for a "address collision",
>    it is a vast improvement over ARP's guaranteed "collision".
> Kernel related code can be found at [3][4][5].

Please make sure you keep maintainers CCed on your postings, adding bond
maintainers now.

> +static inline bool is_ipv6_multicast_ether_addr(const u8 *addr)
> +{
> +	return (addr[0] == 0x33) && (addr[1] == 0x33);
> +}

nit: brackets are not necessary here.
On Tue, 3 Nov 2020 13:05:59 -0800 Jakub Kicinski wrote:
> On Mon, 2 Nov 2020 15:56:43 +0800 LIU Yulong wrote:
> > According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined to
> > construct the multicast destination MAC address for IPv6 multicast traffic.
> > The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such
> > rule. The work steps [6] are:
> > *) Let's assume a destination address of 2001:db8:1:1::1.
> > *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
> >    format of ff02::1:ffXX:XXXX.
> > *) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
> >    directly from the last 24 bits of the destination address.
> > *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
> > *) This, being a multicast address, can be mapped to a multicast MAC
> >    address, using the format 33-33-XX-XX-XX-XX
> > *) Resulting in 33-33-ff-00-00-01.
> > *) This is a MAC address that is only being listened for by nodes
> >    sharing the same last 24 bits.
> > *) In other words, while there is a chance for a "address collision",
> >    it is a vast improvement over ARP's guaranteed "collision".
> > Kernel related code can be found at [3][4][5].
>
> Please make sure you keep maintainers CCed on your postings, adding bond
> maintainers now.

Looks like no reviews are coming in, so I had a closer look.

It's concerning that we'll disable load balancing for all IPv6 multicast
addresses now. AFAIU you're only concerned about 33:33:ff:00:00:01, can
we not compare against that?

The way the comparison is written now it does a single 64bit comparison
per address, so it's the same number of instructions to compare the top
two bytes or two full addresses.
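The point about the 64-bit comparison can be illustrated with a small user-space sketch. It is not the kernel's `ether_addr_equal_64bits()`; the name `ether_addr_match_masked` and the mask arrays are hypothetical and exist only for this illustration. The sketch assumes both buffers are readable for 8 bytes (6 address bytes plus 2 of padding, as with the `mac_v6_allmcast[ETH_ALEN + 2]` array in the patch), and shows that matching the full address or only the 33:33 prefix is the same XOR-and-mask operation with a different mask.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical masks: compare all six address bytes, or only the first two.
 * The last two bytes are padding and are never compared.
 */
static const uint8_t mask_full[8]   = { 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0, 0 };
static const uint8_t mask_prefix[8] = { 0xff, 0xff, 0, 0, 0, 0, 0, 0 };

/* Hypothetical helper: true if addr and want agree on every byte selected
 * by mask_bytes. All three buffers must be readable for 8 bytes.
 */
static bool ether_addr_match_masked(const uint8_t addr[8],
				    const uint8_t want[8],
				    const uint8_t mask_bytes[8])
{
	uint64_t a, w, mask;

	/* memcpy avoids alignment and aliasing problems in user space */
	memcpy(&a, addr, sizeof(a));
	memcpy(&w, want, sizeof(w));
	memcpy(&mask, mask_bytes, sizeof(mask));

	/* Building the mask from bytes keeps this independent of byte order */
	return ((a ^ w) & mask) == 0;
}
```

With `mask_full` this behaves like an exact match against one address; with `mask_prefix` it accepts every address starting with 33:33. The cost of the comparison is the same either way, so the choice between the two is about scope rather than speed.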
Jakub Kicinski <kuba@kernel.org> wrote:

>On Tue, 3 Nov 2020 13:05:59 -0800 Jakub Kicinski wrote:
>> On Mon, 2 Nov 2020 15:56:43 +0800 LIU Yulong wrote:
>> > According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined to
>> > construct the multicast destination MAC address for IPv6 multicast traffic.
>> > The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such
>> > rule. The work steps [6] are:
>> > *) Let's assume a destination address of 2001:db8:1:1::1.
>> > *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
>> >    format of ff02::1:ffXX:XXXX.
>> > *) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
>> >    directly from the last 24 bits of the destination address.
>> > *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
>> > *) This, being a multicast address, can be mapped to a multicast MAC
>> >    address, using the format 33-33-XX-XX-XX-XX
>> > *) Resulting in 33-33-ff-00-00-01.
>> > *) This is a MAC address that is only being listened for by nodes
>> >    sharing the same last 24 bits.
>> > *) In other words, while there is a chance for a "address collision",
>> >    it is a vast improvement over ARP's guaranteed "collision".
>> > Kernel related code can be found at [3][4][5].
>>
>> Please make sure you keep maintainers CCed on your postings, adding bond
>> maintainers now.
>
>Looks like no reviews are coming in, so I had a closer look.
>
>It's concerning that we'll disable load balancing for all IPv6 multicast
>addresses now. AFAIU you're only concerned about 33:33:ff:00:00:01, can
>we not compare against that?

	It's not fixed as 33:33:ff:00:00:01, that's just the example.
The first two octets are fixed as 33:33, and the remaining four are
derived from the SNMA, which in turn comes from the destination IPv6
address.

	I can't decide if this is genuinely a reasonable change overall,
or if the described topology is simply untenable in the environment that
the balance-alb mode creates. My specific concern is that the alb mode
will periodically rebalance its TX load, so outgoing traffic will
migrate from one bond port to another from time to time. It's unclear
to me how the described topology that's broken by the multicast traffic
being TX balanced is not also broken by the alb TX side rebalances.

	-J

>The way the comparison is written now it does a single 64bit comparison
>per address, so it's the same number of instructions to compare the top
>two bytes or two full addresses.

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
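The derivation described in this reply (destination IPv6 address, then SNMA, then 33:33 MAC) can be sketched in user-space C. This is only an illustration under stated assumptions, not the kernel code referenced in the patch; `snma_from_target` and `ipv6_mc_to_mac` are hypothetical names, and the sketch relies on the POSIX `struct in6_addr` from `<arpa/inet.h>`.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

/* Hypothetical helper: build the solicited-node multicast address
 * ff02::1:ffXX:XXXX, where XX:XXXX are the low 24 bits of the target.
 */
static void snma_from_target(const struct in6_addr *target,
			     struct in6_addr *snma)
{
	/* First 13 bytes of ff02:0:0:0:0:1:ff00::/104 */
	static const uint8_t prefix[13] = {
		0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff
	};

	memcpy(snma->s6_addr, prefix, sizeof(prefix));
	/* Copy the last 3 bytes (24 bits) of the target address */
	memcpy(snma->s6_addr + 13, target->s6_addr + 13, 3);
}

/* Hypothetical helper: map an IPv6 multicast address to its Ethernet
 * destination, 33:33 followed by the low 32 bits of the address
 * (RFC 2464, section 7).
 */
static void ipv6_mc_to_mac(const struct in6_addr *mc, uint8_t mac[6])
{
	mac[0] = 0x33;
	mac[1] = 0x33;
	memcpy(mac + 2, mc->s6_addr + 12, 4);
}
```

Feeding 2001:db8:1:1::1 through both helpers gives the SNMA ff02::1:ff00:1 and the MAC 33:33:ff:00:00:01 used as the example in this thread. A target whose low 24 bits differ gives a different MAC, which is why a comparison against one fixed address would not cover the general case.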
Yes, 33:33:ff:00:00:01 is just an example; the destination MAC address can
vary. The code of the current solution is simple, but it may indeed need
more attention to real-world topologies.

The current solution follows the handling of the ARP protocol in IPv4 [1].
Just as TX balancing is disabled for ARP in IPv4, for IPv6 it is disabled
for the all-nodes multicast [2] (when there is no multicast domain, it can
be considered as all, aka broadcast [3]). But please note that the MAC
"33:33:00:00:00:01" is the destination for IPv6 RA (Router Advertisement).

I have an alternative, which is to verify the packet type: if it is ICMPv6
and the type is 135 (Neighbor Solicitation), we disable the TX balance. A new
if-condition would be added right below the all-nodes multicast check.

[1] https://github.com/torvalds/linux/blob/master/drivers/net/bonding/bond_alb.c#L1423
[2] https://github.com/torvalds/linux/blob/master/drivers/net/bonding/bond_alb.c#L1431
[3] https://en.wikipedia.org/wiki/Solicited-node_multicast_address

On Mon, Nov 9, 2020 at 5:37 AM Jay Vosburgh <jay.vosburgh@canonical.com> wrote:
>
> Jakub Kicinski <kuba@kernel.org> wrote:
>
> >On Tue, 3 Nov 2020 13:05:59 -0800 Jakub Kicinski wrote:
> >> On Mon, 2 Nov 2020 15:56:43 +0800 LIU Yulong wrote:
> >> > According to the RFC 2464 [1] the prefix "33:33:xx:xx:xx:xx" is defined to
> >> > construct the multicast destination MAC address for IPv6 multicast traffic.
> >> > The NDP (Neighbor Discovery Protocol for IPv6)[2] will comply with such
> >> > rule. The work steps [6] are:
> >> > *) Let's assume a destination address of 2001:db8:1:1::1.
> >> > *) This is mapped into the "Solicited Node Multicast Address" (SNMA)
> >> >    format of ff02::1:ffXX:XXXX.
> >> > *) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
> >> >    directly from the last 24 bits of the destination address.
> >> > *) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
> >> > *) This, being a multicast address, can be mapped to a multicast MAC
> >> >    address, using the format 33-33-XX-XX-XX-XX
> >> > *) Resulting in 33-33-ff-00-00-01.
> >> > *) This is a MAC address that is only being listened for by nodes
> >> >    sharing the same last 24 bits.
> >> > *) In other words, while there is a chance for a "address collision",
> >> >    it is a vast improvement over ARP's guaranteed "collision".
> >> > Kernel related code can be found at [3][4][5].
> >>
> >> Please make sure you keep maintainers CCed on your postings, adding bond
> >> maintainers now.
> >
> >Looks like no reviews are coming in, so I had a closer look.
> >
> >It's concerning that we'll disable load balancing for all IPv6 multicast
> >addresses now. AFAIU you're only concerned about 33:33:ff:00:00:01, can
> >we not compare against that?
>
>         It's not fixed as 33:33:ff:00:00:01, that's just the example.
> The first two octets are fixed as 33:33, and the remaining four are
> derived from the SNMA, which in turn comes from the destination IPv6
> address.
>
>         I can't decide if this is genuinely a reasonable change overall,
> or if the described topology is simply untenable in the environment that
> the balance-alb mode creates. My specific concern is that the alb mode
> will periodically rebalance its TX load, so outgoing traffic will
> migrate from one bond port to another from time to time. It's unclear
> to me how the described topology that's broken by the multicast traffic
> being TX balanced is not also broken by the alb TX side rebalances.
>
>         -J
>
> >The way the comparison is written now it does a single 64bit comparison
> >per address, so it's the same number of instructions to compare the top
> >two bytes or two full addresses.
> >
>
> ---
>         -Jay Vosburgh, jay.vosburgh@canonical.com
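The alternative proposed in this reply (match ICMPv6 type 135 rather than the whole 33:33 range) could look roughly like the following user-space sketch, which works on a raw Ethernet frame buffer. It is not bonding driver code: `frame_is_neigh_solicit` and the `EX_` constants are hypothetical names, an in-kernel version would operate on the skb instead, and the sketch assumes at most one 802.1Q tag and no IPv6 extension headers before the ICMPv6 header.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_ETH_P_8021Q		0x8100	/* 802.1Q VLAN tag */
#define EX_ETH_P_IPV6		0x86DD	/* IPv6 ethertype */
#define EX_IPPROTO_ICMPV6	58	/* IPv6 next-header value for ICMPv6 */
#define EX_NEIGH_SOLICIT	135	/* ICMPv6 type: Neighbor Solicitation */

/* Hypothetical helper: true if the frame carries an ICMPv6 Neighbor
 * Solicitation directly after the fixed IPv6 header.
 */
static bool frame_is_neigh_solicit(const uint8_t *frame, size_t len)
{
	size_t off = 12;	/* skip destination and source MAC */
	uint16_t ethertype;

	if (len < off + 2)
		return false;
	ethertype = (uint16_t)(frame[off] << 8 | frame[off + 1]);
	off += 2;

	if (ethertype == EX_ETH_P_8021Q) {
		/* Skip the 2-byte TCI, then read the inner ethertype */
		if (len < off + 4)
			return false;
		ethertype = (uint16_t)(frame[off + 2] << 8 | frame[off + 3]);
		off += 4;
	}
	if (ethertype != EX_ETH_P_IPV6)
		return false;

	/* Fixed IPv6 header is 40 bytes; next header is at byte offset 6 */
	if (len < off + 40 + 1)
		return false;
	if (frame[off + 6] != EX_IPPROTO_ICMPV6)
		return false;

	/* First byte of the ICMPv6 header is the message type */
	return frame[off + 40] == EX_NEIGH_SOLICIT;
}
```

Compared with the prefix check in the patch, a test like this would leave other IPv6 multicast traffic, such as Router Advertisements to 33:33:00:00:00:01, subject to whatever balancing rules already apply, at the cost of parsing further into the packet.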
diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
index c3091e0..eda9046 100644
--- a/drivers/net/bonding/bond_alb.c
+++ b/drivers/net/bonding/bond_alb.c
@@ -24,9 +24,6 @@
 #include <net/bonding.h>
 #include <net/bond_alb.h>
 
-static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = {
-	0x33, 0x33, 0x00, 0x00, 0x00, 0x01
-};
 static const int alb_delta_in_ticks = HZ / ALB_TIMER_TICKS_PER_SEC;
 
 #pragma pack(1)
@@ -1425,10 +1422,9 @@ struct slave *bond_xmit_alb_slave_get(struct bonding *bond,
 			break;
 		}
 
-		/* IPv6 uses all-nodes multicast as an equivalent to
-		 * broadcasts in IPv4.
+		/* IPv6 multicast destinations should not be tx-balanced.
 		 */
-		if (ether_addr_equal_64bits(eth_data->h_dest, mac_v6_allmcast)) {
+		if (is_ipv6_multicast_ether_addr(eth_data->h_dest)) {
 			do_tx_balance = false;
 			break;
 		}
diff --git a/include/linux/etherdevice.h b/include/linux/etherdevice.h
index 2e5debc..ac74a99 100644
--- a/include/linux/etherdevice.h
+++ b/include/linux/etherdevice.h
@@ -178,6 +178,18 @@ static inline bool is_unicast_ether_addr(const u8 *addr)
 }
 
 /**
+ * is_ipv6_multicast_ether_addr - Determine if the Ethernet address is for
+ *				   IPv6 multicast (rfc2464).
+ * @addr: Pointer to a six-byte array containing the Ethernet address
+ *
+ * Return true if the address is a multicast for IPv6.
+ */
+static inline bool is_ipv6_multicast_ether_addr(const u8 *addr)
+{
+	return (addr[0] == 0x33) && (addr[1] == 0x33);
+}
+
+/**
  * is_valid_ether_addr - Determine if the given Ethernet address is valid
  * @addr: Pointer to a six-byte array containing the Ethernet address
  *
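To make the change in scope concrete, the sketch below restates the new helper in user-space C (with `uint8_t` standing in for the kernel's `u8`, and without the extra parentheses noted in review) next to an exact-match check equivalent in effect to the removed all-nodes comparison. The name `is_ipv6_allnodes_ether_addr` is hypothetical and not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* User-space copy of the helper added by the patch: matches the whole
 * 33:33:xx:xx:xx:xx range from RFC 2464.
 */
static bool is_ipv6_multicast_ether_addr(const uint8_t *addr)
{
	return addr[0] == 0x33 && addr[1] == 0x33;
}

/* Hypothetical exact-match check, mirroring what the old comparison
 * against mac_v6_allmcast accepted: only 33:33:00:00:00:01.
 */
static bool is_ipv6_allnodes_ether_addr(const uint8_t *addr)
{
	static const uint8_t allnodes[6] = {
		0x33, 0x33, 0x00, 0x00, 0x00, 0x01
	};

	return memcmp(addr, allnodes, sizeof(allnodes)) == 0;
}
```

For the Neighbor Solicitation destination 33:33:ff:00:00:01 from the commit message, the old style of check returns false and the new helper returns true. Both return true for 33:33:00:00:00:01, and the new helper also returns true for every other IPv6 multicast destination, which is the broadening raised in review.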
According to RFC 2464 [1], the prefix "33:33:xx:xx:xx:xx" is defined to
construct the multicast destination MAC address for IPv6 multicast traffic.
NDP (Neighbor Discovery Protocol for IPv6) [2] complies with this
rule. The work steps [6] are:
*) Let's assume a destination address of 2001:db8:1:1::1.
*) This is mapped into the "Solicited Node Multicast Address" (SNMA)
   format of ff02::1:ffXX:XXXX.
*) The XX:XXXX represent the last 24 bits of the SNMA, and are derived
   directly from the last 24 bits of the destination address.
*) Resulting in a SNMA ff02::1:ff00:0001, or ff02::1:ff00:1.
*) This, being a multicast address, can be mapped to a multicast MAC
   address, using the format 33-33-XX-XX-XX-XX
*) Resulting in 33-33-ff-00-00-01.
*) This is a MAC address that is only being listened for by nodes
   sharing the same last 24 bits.
*) In other words, while there is a chance for a "address collision",
   it is a vast improvement over ARP's guaranteed "collision".
Kernel related code can be found at [3][4][5].

The current bond alb code lets packets to such MAC ranges leak into TX
balancing, which causes the physical network to fail to determine the
return tunnel of the reply packet in a Spine-and-Leaf data center
architecture. The basic topology looks like this:

            +-------------+
        +---| Border Leaf |-----+
tunnel-1|   +-------------+     | tunnel-2
        |                       |
    +---+----+           +------+-+
    | Leaf1  +-----X-----+ Leaf2  |  tunnel-3 has loop avoidance
    +--------+  tunnel-3 +-+------+
             |             |
             +----+   +----+
          +--+nic1+---+nic2+---+
          |  +----+   +----+   |
          |       bond6        |
          |        HOST        |
          +--------------------+

While nic1 is sending normal IPv6 traffic to the gateway in the Border
Leaf, nic2 (slave) also sends NS packets out periodically, automatically
and implicitly. This is an example packet sent from the slave nic2 which
breaks the traffic:

ac:1f:6b:90:5c:eb > 33:33:ff:00:00:01, ethertype 802.1Q (0x8100), length 90:
vlan 205, p 0, ethertype IPv6, (hlim 255, next-header ICMPv6 (58) payload
length: 32) fe80::f816:3eff:feba:2d8c > ff02::1:ff00:1: [icmp6 sum ok] ICMP6,
neighbor solicitation, length 32, who has 240e:980:2f00:4000::1
	  source link-address option (1), length 8 (1): fa:16:3e:ba:2d:8c

The packet source MAC "ac:1f:6b:90:5c:eb" is the nic2 MAC. Its original
value should be "fa:16:3e:ba:2d:8c", but it was changed by the alb related
MAC address mechanism [8]. MAC "fa:16:3e:ba:2d:8c" is the virtual device MAC
from a cloud service inside a kernel network namespace; the topology is
described in [7].

MAC "fa:16:3e:ba:2d:8c" was first learnt at Leaf1 based on the underlay
mechanism (BGP EVPN). When this example packet is sent to the Border Leaf
and the reply comes back with dst_mac "fa:16:3e:ba:2d:8c", Leaf2 tries to
send the packet back through tunnel-3, where it is dropped because of the
loop defense. All the original normal IPv6 traffic is then led to tunnel-2
and dropped. The link is now broken.

This patch addresses the issue by checking the entire MAC range defined by
RFC 2464. It adds a new helper to check whether the first two octets are
33:33. If the destination MAC matches, TX balancing is not applied.

[1] https://tools.ietf.org/html/rfc2464#section-7
[2] https://tools.ietf.org/html/rfc4861
[3] linux.git/tree/include/net/if_inet6.h#n209-n221
[4] linux.git/tree/net/ipv6/ndisc.c#n291
[5] linux.git/tree/net/ipv6/ndisc.c#n346-n348
[6] https://en.citizendium.org/wiki/Neighbor_Discovery
[7] https://docs.openstack.org/neutron/latest/admin/deploy-ovs-selfservice.html#architecture
[8] linux.git/tree/drivers/net/bonding/bond_alb.c#n1320

Signed-off-by: LIU Yulong <i@liuyulong.me>
---
 drivers/net/bonding/bond_alb.c |  8 ++------
 include/linux/etherdevice.h    | 12 ++++++++++++
 2 files changed, 14 insertions(+), 6 deletions(-)