Message ID: 20210206050240.48410-1-saeed@kernel.org
Series: mlx5 updates 2021-02-04
Hi,

I didn't receive the cover letter, so I'm replying on this one. :-)

This is nice. One thing is not clear to me yet. From the samples on
the cover letter:

$ tc -s filter show dev enp8s0f0_1 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac 0a:40:bd:30:89:99
  src_mac ca:2e:a7:3f:f5:0f
  eth_type ipv4
  ip_tos 0/0x3
  ip_flags nofrag
  in_hw in_hw_count 1
        action order 1: tunnel_key set
        src_ip 7.7.7.5
        dst_ip 7.7.7.1
        ...

$ tc -s filter show dev vxlan_sys_4789 ingress
filter protocol ip pref 4 flower chain 0
filter protocol ip pref 4 flower chain 0 handle 0x1
  dst_mac ca:2e:a7:3f:f5:0f
  src_mac 0a:40:bd:30:89:99
  eth_type ipv4
  enc_dst_ip 7.7.7.5
  enc_src_ip 7.7.7.1
  enc_key_id 98
  enc_dst_port 4789
  enc_tos 0
  ...

These operations imply that 7.7.7.5 is configured on some interface on
the host. Most likely the VF representor itself, as that aids with ARP
resolution. Is that so?

Thanks,
Marcelo
On Sat 06 Feb 2021 at 20:13, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> This is nice. One thing is not clear to me yet. From the samples on
> the cover letter:
> [...]
> These operations imply that 7.7.7.5 is configured on some interface on
> the host. Most likely the VF representor itself, as that aids with ARP
> resolution. Is that so?

Hi Marcelo,

The tunnel endpoint IP address is configured on the VF that is
represented by the enp8s0f0_0 representor in the example rules. The VF
is on the host.

Regards,
Vlad
On Mon, Feb 08, 2021 at 10:21:21AM +0200, Vlad Buslov wrote:
> The tunnel endpoint IP address is configured on VF that is represented
> by enp8s0f0_0 representor in example rules. The VF is on host.

That's interesting and odd. The VF would be isolated by a netns and
not be visible by whoever is administrating the VF representor. Some
cooperation between the two entities (host and container, say) is
needed then, right? Because the host needs to know the endpoint IP
address that the container will be using, and vice-versa. If so, why
not offload the tunnel actions via the VF itself and avoid this need
for cooperation? Container privileges maybe?

Thx,
Marcelo
On Mon 08 Feb 2021 at 15:25, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> That's interesting and odd. The VF would be isolated by a netns and
> not be visible by whoever is administrating the VF representor. Some
> cooperation between the two entities (host and container, say) is
> needed then, right? Because the host needs to know the endpoint IP
> address that the container will be using, and vice-versa. If so, why
> not offload the tunnel actions via the VF itself and avoid this need
> for cooperation? Container privileges maybe?

As I wrote in the previous email, the tunnel endpoint VF is on the host
(not in a namespace/container, VM, etc.).
On Mon, Feb 08, 2021 at 03:31:50PM +0200, Vlad Buslov wrote:
> As I wrote in previous email, tunnel endpoint VF is on host (not in
> namespace/container, VM, etc.).

Right. I assumed it was just for simplicity of testing.
Okay, I think I can see some use cases for this. Thanks.

Cheers,
Marcelo
On Mon, 8 Feb 2021 10:21:21 +0200 Vlad Buslov wrote:
> > These operations imply that 7.7.7.5 is configured on some interface on
> > the host. Most likely the VF representor itself, as that aids with ARP
> > resolution. Is that so?
>
> Hi Marcelo,
>
> The tunnel endpoint IP address is configured on VF that is represented
> by enp8s0f0_0 representor in example rules. The VF is on host.

This is very confusing, are you saying that 7.7.7.5 is configured on
both the VF and the VF rep? Could you provide a full picture of the
config with IP addresses and routing?
On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed <saeed@kernel.org> wrote:
> From: Saeed Mahameed <saeedm@nvidia.com>
>
> This series adds the support for VF tunneling.
>
> Vlad Buslov says:
> =================
> Implement support for VF tunneling
>
> Abstract
>
> Currently, mlx5 only supports configuration with tunnel endpoint IP address on
> uplink representor. Remove implicit and explicit assumptions of tunnel always
> being terminated on uplink and implement necessary infrastructure for
> configuring tunnels on VF representors and updating rules on such tunnels
> according to routing changes.
>
> SW TC model

Maybe before the SW TC model, you can explain the SW model (TC is a
vehicle to implement the SW model).

SW model for VST and "classic" v-switch tunnel setup:

For example, in the VST model, each virtio/vf/sf vport has a vlan such
that the v-switch tags packets going out "south" of the vport towards
the uplink, untags packets going "north" from the uplink into the
vport (and does nothing for east-west traffic).

In a similar manner, in a "classic" v-switch tunnel setup, each
virtio/vf/sf vport is somehow associated with VNI/s marking the
tenant/s it belongs to. Same-tenant east-west traffic on the host
doesn't go through any encap/decap. The v-switch adds the relevant
tunnel MD to packets/skbs sent "southward" by the end-point and
forwards it to the VTEP, which applies encap and sends the packets to
the wire. On RX, the VTEP decaps the tunnel info from the packet, adds
it as MD to the skb and forwards the packet up into the stack, where
the vsw hooks it, matches on the MD + inner tuple and then forwards it
to the relevant endpoint.

HW offloads for VST and "classic" v-switch tunnel setup: more or less
straightforward based on the above.

> From TC perspective VF tunnel configuration requires two rules in both
> directions:
>
> TX rules
>
> 1. Rule that redirects packets from UL to VF rep that has the tunnel
>    endpoint IP address:
> 2. Rule that decapsulates the tunneled flow and redirects to destination VF
>    representor:
>
> RX rules
>
> 1. Rule that encapsulates the tunneled flow and redirects packets from
>    source VF rep to tunnel device:
> 2. Rule that redirects from tunnel device to UL rep:

Sorry, I am not managing to follow and catch up a SW model from the TC
rules.. I think we need these two to begin with:

[1] Motivation for enhanced v-switch tunnel setup:
[2] SW model for enhanced v-switch tunnel setup:

> HW offloads model

A clear SW model before the HW offloads model..
On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed <saeed@kernel.org> wrote:
> Vlad Buslov says:
> Implement support for VF tunneling
>
> Currently, mlx5 only supports configuration with tunnel endpoint IP address on
> uplink representor. Remove implicit and explicit assumptions of tunnel always
> being terminated on uplink and implement necessary infrastructure for
> configuring tunnels on VF representors and updating rules on such tunnels
> according to routing changes.
>
> SW TC model

Maybe before the SW TC model, you can explain the vswitch SW model (TC
is a vehicle to implement the SW model).

SW model for VST and "classic" v-switch tunnel setup:

For example, in the VST model, each virtio/vf/sf vport has a vlan such
that the v-switch tags packets going out "south" of the vport towards
the uplink, untags packets going "north" from the uplink, matches on
the vport tag and forwards them to the vport (and does nothing for
east-west traffic).

In a similar manner, in a "classic" v-switch tunnel setup, each
virtio/vf/sf vport is somehow associated with VNI/s marking the
tenant/s it belongs to. Same-tenant east-west traffic on the host
doesn't go through any encap/decap. The v-switch adds the relevant
tunnel MD to packets/skbs sent "southward" by the end-point and
forwards it to the VTEP, which applies encap based on the MD (LWT
scheme) and sends the packets to the wire. On RX, the VTEP decaps the
tunnel info from the packet, adds it as MD to the skb and forwards the
packet up into the stack, where the vsw hooks it, matches on the MD +
inner tuple and then forwards it to the relevant endpoint.

HW offloads for VST and "classic" v-switch tunnel setup: more or less
straightforward based on the above.

> From TC perspective VF tunnel configuration requires two rules in both
> directions:
>
> TX rules
>
> 1. Rule that redirects packets from UL to VF rep that has the tunnel
>    endpoint IP address:
> 2. Rule that decapsulates the tunneled flow and redirects to destination VF
>    representor:
>
> RX rules
>
> 1. Rule that encapsulates the tunneled flow and redirects packets from
>    source VF rep to tunnel device:
> 2. Rule that redirects from tunnel device to UL rep:

Mmm, it's kinda hard managing to follow and catch up a SW model from
the TC rules.. I think we need these two to begin with (in whatever
order works better for you):

[1] Motivation for enhanced v-switch tunnel setup:
[2] SW model for enhanced v-switch tunnel setup:

> HW offloads model

A clear SW model before the HW offloads model..

> 25 files changed, 3812 insertions(+), 1057 deletions(-)

for adding almost 4K LOCs
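[Editor's note: to make the four rule types the cover letter enumerates concrete, they might be sketched roughly as below. This is a hedged sketch only: the device names and VNI 98 are taken from the thread's examples, while the exact match keys, actions, and rule placement are assumptions, not the actual rules installed by the series.]

```shell
# Sketch of the four rule types (assumed tc-flower syntax, not the
# series' actual output).

# Redirect tunneled traffic arriving on the uplink rep to the VF rep
# that owns the tunnel endpoint IP (7.7.7.5):
tc filter add dev enp8s0f0 ingress protocol ip flower \
    dst_ip 7.7.7.5 ip_proto udp dst_port 4789 \
    action mirred egress redirect dev enp8s0f0_0

# Decapsulate on the tunnel device and redirect to the destination
# VF rep:
tc filter add dev vxlan_sys_4789 ingress protocol ip flower \
    enc_dst_ip 7.7.7.5 enc_src_ip 7.7.7.1 enc_key_id 98 enc_dst_port 4789 \
    action tunnel_key unset \
    action mirred egress redirect dev enp8s0f0_1

# Encapsulate traffic from the source VF rep and hand it to the tunnel
# device:
tc filter add dev enp8s0f0_1 ingress protocol ip flower \
    action tunnel_key set src_ip 7.7.7.5 dst_ip 7.7.7.1 id 98 dst_port 4789 \
    action mirred egress redirect dev vxlan_sys_4789

# Redirect the now-encapsulated traffic from the tunnel device out via
# the uplink rep (placement on egress is an assumption):
tc filter add dev vxlan_sys_4789 egress protocol ip flower \
    action mirred egress redirect dev enp8s0f0
```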
On Tue, Feb 9, 2021 at 10:42 AM Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed <saeed@kernel.org> wrote:
> > Vlad Buslov says:
> > Implement support for VF tunneling
> [...]
> maybe before SW TC model, you can explain the vswitch SW model (TC is
> a vehicle to implement the SW model).

I thought my earlier post missed the list, so I reposted, but realized
now it didn't; feel free to address either of the posts.
On Mon 08 Feb 2021 at 22:22, Jakub Kicinski <kuba@kernel.org> wrote:
> This is very confusing, are you saying that the 7.7.7.5 is configured
> both on VF and VFrep? Could you provide a full picture of the config
> with IP addresses and routing?

Hi Jakub,

No, the tunnel IP is configured only on the VF. That particular VF is
in the host namespace. When mlx5 resolves tunneling, the code checks
whether the tunnel endpoint IP address is on such an mlx5 VF: since the
VF is in the same namespace as the eswitch manager (i.e. on the host),
the route returned by ip_route_output_key() is resolved through
rt->dst.dev == tunnel VF device. After establishing that the tunnel is
on a VF, the goal is to process the two resulting TC rules (in both
directions) fully in hardware, without exposing the packet on the
tunneling device or the tunnel VF in software; that is what all the
infrastructure in this series implements.

So, to summarize with the IP addresses from the TC examples presented
in the cover letter, we have underlay network 7.7.7.0/24 in the host
namespace with the tunnel endpoint IP address on a VF:

$ ip a show dev enp8s0f0v0
1537: enp8s0f0v0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 52:e5:6d:f2:00:69 brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0np0v0
    inet 7.7.7.5/24 scope global enp8s0f0v0
       valid_lft forever preferred_lft forever
    inet6 fe80::50e5:6dff:fef2:69/64 scope link
       valid_lft forever preferred_lft forever

Like all VFs in the switchdev model, the tunnel VF is controlled
through a representor that doesn't have any IP address assigned:

$ ip a show dev enp8s0f0_0
1534: enp8s0f0_0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
    link/ether 96:98:b1:59:aa:5e brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0npf0vf0
    inet6 fe80::9498:b1ff:fe59:aa5e/64 scope link
       valid_lft forever preferred_lft forever

User VFs have IP addresses from the overlay network (5.5.5.0/24 in my
tests) and are in namespaces/VMs, while only their representors are on
the host, attached to the same v-switch bridge as the tunnel VF
representor:

$ sudo ip netns exec ns0 ip a show dev enp8s0f0v1
1538: enp8s0f0v1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 9e:cf:b5:69:84:d1 brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0np0v1
    inet 5.5.5.5/24 scope global enp8s0f0v1
       valid_lft forever preferred_lft forever
    inet6 fe80::9ccf:b5ff:fe69:84d1/64 scope link
       valid_lft forever preferred_lft forever

$ ip a show dev enp8s0f0_1
1535: enp8s0f0_1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP group default qlen 1000
    link/ether 06:96:1e:23:df:a4 brd ff:ff:ff:ff:ff:ff
    altname enp8s0f0npf0vf1

OVS bridge ports:

$ sudo ovs-vsctl list-ports ovs-br
enp8s0f0
enp8s0f0_0
enp8s0f0_1
enp8s0f0_2
vxlan0

The TC rules from the cover letter are installed by OVS, configured
according to the description above, when running iperf traffic from the
namespaced VF enp8s0f0v1 to another machine connected over the uplink
port:

$ sudo ip netns exec ns0 iperf3 -c 5.5.5.1 -t 10000
Connecting to host 5.5.5.1, port 5201
[  5] local 5.5.5.5 port 34486 connected to 5.5.5.1 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   158 MBytes  1.32 Gbits/sec   41    771 KBytes

Hope this clarifies things, and sorry for the confusion!

Regards,
Vlad
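[Editor's note: the addressing scheme described above could be reproduced roughly as follows. This is a sketch under assumptions: the device names, the ovs-br bridge, VNI 98, and the 7.7.7.1 remote endpoint come from the thread; everything else (switchdev mode already enabled, representors already created, vxlan0 options) is assumed.]

```shell
# Tunnel endpoint VF stays in the host namespace with the underlay
# address the rules resolve routes against.
ip addr add 7.7.7.5/24 dev enp8s0f0v0
ip link set enp8s0f0v0 up

# User VF moves into a namespace and gets an overlay address.
ip netns add ns0
ip link set enp8s0f0v1 netns ns0
ip netns exec ns0 ip addr add 5.5.5.5/24 dev enp8s0f0v1
ip netns exec ns0 ip link set enp8s0f0v1 up

# Uplink, representors and a vxlan port (VNI 98 in the cover-letter
# rules) all attach to the same OVS bridge.
ovs-vsctl add-br ovs-br
ovs-vsctl add-port ovs-br enp8s0f0
ovs-vsctl add-port ovs-br enp8s0f0_0
ovs-vsctl add-port ovs-br enp8s0f0_1
ovs-vsctl add-port ovs-br vxlan0 -- set interface vxlan0 type=vxlan \
    options:remote_ip=7.7.7.1 options:key=98 options:dst_port=4789
```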
On Tue, Feb 9, 2021 at 4:26 PM Vlad Buslov <vladbu@nvidia.com> wrote:
> On Mon 08 Feb 2021 at 22:22, Jakub Kicinski <kuba@kernel.org> wrote:
> > This is very confusing, are you saying that the 7.7.7.5 is configured
> > both on VF and VFrep? Could you provide a full picture of the config
> > with IP addresses and routing?
>
> No, tunnel IP is configured on VF. That particular VF is in host [..]

What's the motivation for that? Isn't that introducing a 3x slowdown?
On Tue, 9 Feb 2021 16:22:26 +0200 Vlad Buslov wrote:
> No, tunnel IP is configured on VF. That particular VF is in host
> namespace. When mlx5 resolves tunneling the code checks if tunnel
> endpoint IP address is on such mlx5 VF, since the VF is in same
> namespace as eswitch manager (e.g. on host) and route returned by
> ip_route_output_key() is resolved through rt->dst.dev==tunVF device.
> [...]
> $ ip a show dev enp8s0f0v0
>     inet 7.7.7.5/24 scope global enp8s0f0v0

Isn't this 100% the wrong way around? Disable the offloads. Does the
traffic hit the VF encapsulated?

IIUC SW will do this:

     PHY port
        |
 device |            ,-----.
 -------|------------|------|----------
 kernel |            |      |
     (UL/PF)       (VFr)   (VF)
        |            |      |
   [TC ing]>redir ---`      V

And the packet never hits encap.
On Tue 09 Feb 2021 at 20:05, Jakub Kicinski <kuba@kernel.org> wrote:
> Isn't this 100% the wrong way around. Disable the offloads. Does the
> traffic hit the VF encapsulated?
> [...]
> And the packet never hits encap.

We can look at dumps on every stage (produced by running exactly the
same test with OVS option other_config:tc-policy=skip_hw):

1. Traffic arrives at UL with vxlan encapsulation:

$ sudo tcpdump -ni enp8s0f0 -vvv -c 3
dropped privs to tcpdump
tcpdump: listening on enp8s0f0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:01:28.619346 IP (tos 0x0, ttl 64, id 65187, offset 0, flags [none], proto UDP (17), length 102)
    7.7.7.1.52277 > 7.7.7.5.vxlan: [udp sum ok] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 43919, offset 0, flags [DF], proto TCP (6), length 52)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0x467b (correct), seq 2194968387, ack 2680742983, win 24576, options [nop,nop,TS val 1092282319 ecr 348802330], length 0
21:01:28.619505 IP (tos 0x0, ttl 64, id 888, offset 0, flags [none], proto UDP (17), length 1500)
    7.7.7.5.40092 > 7.7.7.1.vxlan: [no cksum] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 6662, offset 0, flags [DF], proto TCP (6), length 1450)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [.], cksum 0x8025 (correct), seq 673837:675235, ack 0, win 502, options [nop,nop,TS val 348802333 ecr 1092282319], length 1398
21:01:28.619506 IP (tos 0x0, ttl 64, id 889, offset 0, flags [none], proto UDP (17), length 1500)
    7.7.7.5.40092 > 7.7.7.1.vxlan: [no cksum] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 6663, offset 0, flags [DF], proto TCP (6), length 1450)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [.], cksum 0x19d1 (correct), seq 675235:676633, ack 0, win 502, options [nop,nop,TS val 348802333 ecr 1092282319], length 1398

2. By TC rule, traffic is redirected to the tunnel VF that has IP
address 7.7.7.5 (still encapsulated, as there is no decap action
attached to the filter on enp8s0f0):

$ sudo tcpdump -ni enp8s0f0v0 -vvv -c 3
dropped privs to tcpdump
tcpdump: listening on enp8s0f0v0, link-type EN10MB (Ethernet), capture size 262144 bytes
21:03:41.524244 IP (tos 0x0, ttl 64, id 48184, offset 0, flags [none], proto UDP (17), length 1500)
    7.7.7.5.40092 > 7.7.7.1.vxlan: [no cksum] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 52619, offset 0, flags [DF], proto TCP (6), length 1450)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [.], cksum 0xaddb (correct), seq 279895999:279897397, ack 2194968387, win 502, options [nop,nop,TS val 348935238 ecr 1092415214], length 1398
21:03:41.568055 IP (tos 0x0, ttl 64, id 701, offset 0, flags [none], proto UDP (17), length 102)
    7.7.7.1.52277 > 7.7.7.5.vxlan: [udp sum ok] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 44938, offset 0, flags [DF], proto TCP (6), length 52)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0xc623 (correct), seq 1, ack 1398, win 24576, options [nop,nop,TS val 1092415267 ecr 348935238], length 0
21:03:41.568384 IP (tos 0x0, ttl 64, id 48191, offset 0, flags [none], proto UDP (17), length 1500)
    7.7.7.5.40092 > 7.7.7.1.vxlan: [no cksum] VXLAN, flags [I] (0x08), vni 98
IP (tos 0x0, ttl 64, id 52620, offset 0, flags [DF], proto TCP (6), length 1450)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [.], cksum 0xe1b9 (correct), seq 1398:2796, ack 1, win 502, options [nop,nop,TS val 348935282 ecr 1092415267], length 1398

3. Traffic gets to the tunnel device, where it is decapsulated and
redirected to the destination VF by the TC rule on vxlan_sys_4789:

$ sudo tcpdump -ni vxlan_sys_4789 -vvv -c 3
dropped privs to tcpdump
tcpdump: listening on vxlan_sys_4789, link-type EN10MB (Ethernet), capture size 262144 bytes
21:07:39.836141 IP (tos 0x0, ttl 64, id 15565, offset 0, flags [DF], proto TCP (6), length 52)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0xbe91 (correct), seq 2194968387, ack 4279285947, win 24576, options [nop,nop,TS val 1092653536 ecr 349173547], length 0
21:07:39.836202 IP (tos 0x0, ttl 64, id 50774, offset 0, flags [DF], proto TCP (6), length 64360)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [P.], cksum 0x0f6b (incorrect -> 0x1d69), seq 746533:810841, ack 0, win 502, options [nop,nop,TS val 349173550 ecr 1092653536], length 64308
21:07:39.836449 IP (tos 0x0, ttl 64, id 15566, offset 0, flags [DF], proto TCP (6), length 52)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0x610f (correct), seq 0, ack 89473, win 24576, options [nop,nop,TS val 1092653536 ecr 349173548], length 0

4. Decapsulated payload appears on the namespaced VF with IP address
5.5.5.5:

$ sudo ip netns exec ns0 tcpdump -ni enp8s0f0v1 -vvv -c 3
yp_bind_client_create_v3: RPC: Unable to send
dropped privs to tcpdump
tcpdump: listening on enp8s0f0v1, link-type EN10MB (Ethernet), capture size 262144 bytes
21:09:06.758107 IP (tos 0x0, ttl 64, id 27527, offset 0, flags [DF], proto TCP (6), length 32206)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [P.], cksum 0x91d0 (incorrect -> 0x2a2a), seq 1198920825:1198952979, ack 2194968387, win 502, options [nop,nop,TS val 349260472 ecr 1092740448], length 32154
21:09:06.758697 IP (tos 0x0, ttl 64, id 3008, offset 0, flags [DF], proto TCP (6), length 64)
    5.5.5.1.targus-getdata1 > 5.5.5.5.34538: Flags [.], cksum 0x6a1a (correct), seq 1, ack 4294942132, win 24576, options [nop,nop,TS val 1092740458 ecr 349260463,nop,nop,sack 1 {0:32154}], length 0
21:09:06.758748 IP (tos 0x0, ttl 64, id 27550, offset 0, flags [DF], proto TCP (6), length 25216)
    5.5.5.5.34538 > 5.5.5.1.targus-getdata1: Flags [P.], cksum 0x7682 (incorrect -> 0x7627), seq 4294942132:0, ack 1, win 502, options [nop,nop,TS val 349260473 ecr 1092740458], length 25164

As you can see from the dump, Tx is symmetrical. And that is exactly
the behavior we are reproducing with offloads. So I guess the correct
diagram would be:

     PHY port
        |
 device |            ,(vxlan)
 -------|------------|------|----------
 kernel |            |      |
     (UL/PF)       (VFr)   (VF)
        |            |      |
   [TC ing]>redir ---`      V

Regards,
Vlad
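[Editor's note: the software-path comparison above relies on OVS's tc-policy knob, which is named in the message. On a similar setup it could be toggled roughly like this (a sketch; `ovs-br` and the skip_hw value come from the thread, the restart step is an assumption about when the option is read):]

```shell
# Force OVS to install its tc rules in software only (skip_hw), so the
# datapath can be observed with tcpdump as in the dumps above.
ovs-vsctl set Open_vSwitch . other_config:tc-policy=skip_hw
# A restart of ovs-vswitchd may be needed before the policy applies.

# Restore the default policy (hardware offload with software fallback)
# afterwards:
ovs-vsctl remove Open_vSwitch . other_config tc-policy
```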
On Tue, 9 Feb 2021 21:17:11 +0200 Vlad Buslov wrote:
> 4. Decapsulated payload appears on namespaced VF with IP address
> 5.5.5.5:
>
> $ sudo ip netns exec ns0 tcpdump -ni enp8s0f0v1 -vvv -c 3

So there are two VFs? Hm, completely missed that. Could you *please*
provide an ascii diagram for the entire flow? None of those dumps
you're showing gives us the high-level picture, and it's quite hard
to follow which enpsfyxz interface is what.
On Tue 09 Feb 2021 at 21:50, Jakub Kicinski <kuba@kernel.org> wrote:
> So there are two VFs? Hm, completely missed that. Could you *please*
> provide an ascii diagram for the entire flow? None of those dumps
> you're showing gives us the high level picture, and it's quite hard
> to follow which enpsfyxz interface is what.

Sure. Here it is:

+----------------------------------------------------------------------+
|                               OVS br                                 |
|                                                                      |
|  [TC ingress]    [TC ingress]    [TC ingress]    [TC ingress]        |
| +------------+  +------------+  +------------+  +------------+       |
| |   UL rep   |  |  VF0 rep   |  |   vxlan    |  |  VF1 rep   |       |
| +-----^------+  +-----^------+  +-----^------+  +-----^------+       |
|       |               |               |               |              |
| Kernel|               |               |               |              |
|       |         +-----v------+        |      +--------v-----------+  |
|       |         |    VF0     |        |      | namespace          |  |
|       |         +-----^------+        |      |  +------------+    |  |
|       |               |               |      |  |    VF1     |    |  |
|       |               |               |      |  +-----^------+    |  |
|       |               |               |      +--------|-----------+  |
+-------|---------------|---------------|---------------|--------------+
|       |               |               |               |              |
|       +---------------+---------------+---------------+              |
|                              Hardware                                |
+-------^--------------------------------------------------------------+
        |
      (wire)
On Tue, Feb 09, 2021 at 06:10:59PM +0200, Or Gerlitz wrote:
> On Tue, Feb 9, 2021 at 4:26 PM Vlad Buslov <vladbu@nvidia.com> wrote:
> > No, tunnel IP is configured on VF. That particular VF is in host [..]
>
> What's the motivation for that? isn't that introducing 3x slow down?

Vlad please correct me if I'm wrong.

I think this boils down to not using the uplink representor as a real
interface. This way, the host can make use of 7.7.7.5 for other stuff
as well without passing (heavy) traffic through representor ports,
which are not meant for it.

So the host can have the IP 7.7.7.5 and also decapsulate vxlan traffic
on it, which wouldn't be possible/recommended otherwise.

Another place where this becomes visible is with VF LAG. When we bond
the uplink representors, add an IP to the bond and do vxlan decap,
that IP is meant only for the decap process and shouldn't be used for
heavier traffic, as it's passing through representor ports.

Then, the tc config for decap needs to be done on the VF0 rep and not
on VF0 itself, because that would be a security problem: one VF (which
could be on a netns) could steer packets to another VF at will.
On Wed 10 Feb 2021 at 15:56, Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> wrote:
> On Tue, Feb 09, 2021 at 06:10:59PM +0200, Or Gerlitz wrote:
>> On Tue, Feb 9, 2021 at 4:26 PM Vlad Buslov <vladbu@nvidia.com> wrote:
>> > On Mon 08 Feb 2021 at 22:22, Jakub Kicinski <kuba@kernel.org> wrote:
>> > > On Mon, 8 Feb 2021 10:21:21 +0200 Vlad Buslov wrote:
>> > >> > These operations imply that 7.7.7.5 is configured on some interface on
>> > >> > the host. Most likely the VF representor itself, as that aids with ARP
>> > >> > resolution. Is that so?
>> > >> The tunnel endpoint IP address is configured on VF that is represented
>> > >> by enp8s0f0_0 representor in example rules. The VF is on host.
>> > > This is very confusing, are you saying that the 7.7.7.5 is configured
>> > > both on VF and VFrep? Could you provide a full picture of the config
>> > > with IP addresses and routing?
>> > No, tunnel IP is configured on VF. That particular VF is in host [..]
>>
>> What's the motivation for that? isn't that introducing 3x slow down?
>
> Vlad please correct me if I'm wrong.
>
> I think this boils down to not using the uplink representor as a real
> interface. This way, the host can make use of 7.7.7.5 for other stuff
> as well without passing (heavy) traffic through representor ports,
> which are not meant for it.
>
> So the host can have the IP 7.7.7.5 and also decapsulate vxlan traffic
> on it, which wouldn't be possible/recommended otherwise.
>
> Another moment that this gets visible is with VF LAG. When we bond the
> uplink representors, add an IP to it and do vxlan decap, that IP is
> meant only for the decap process and shouldn't be used for heavier
> traffic as its passing through representor ports.
>
> Then, tc config for decap need to be done on VF0rep and not on VF0
> itself because that would be a security problem: one VF (which could
> be on a netns) could steer packets to another VF at will.
While the on-host VF (the one with IP 7.7.7.5 in my examples) is
intended to be used for unencapsulated control traffic as well, we
don't expect such traffic to consume significant bandwidth, so traffic
load on the representor wasn't the main motivation. I didn't want to go
into the details in the cover letter because they are mostly
OVS-specific and this series is groundwork for features to come.

The main motivation is to be able to apply policy on both the underlay
network (UL) and the overlay network (tunnel netdev), as that allows us
to subject overlay and underlay traffic to different sets of OVS rules.
For example, underlay traffic may be subject to vlan encap/decap,
security policy, or any other flow rule the user may define.

Hope this also answers some of Or's questions from this thread.
On Tue 09 Feb 2021 at 10:42, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Sat, Feb 6, 2021 at 7:10 AM Saeed Mahameed <saeed@kernel.org> wrote:
>
>> Vlad Buslov says:
>
>> Implement support for VF tunneling
>
>> Currently, mlx5 only supports configuration with tunnel endpoint IP address on
>> uplink representor. Remove implicit and explicit assumptions of tunnel always
>> being terminated on uplink and implement necessary infrastructure for
>> configuring tunnels on VF representors and updating rules on such tunnels
>> according to routing changes.
>
>> SW TC model
>
> maybe before SW TC model, you can explain the vswitch SW model (TC is
> a vehicle to implement the SW model).
>
> SW model for VST and "classic" v-switch tunnel setup:
>
> For example, in VST model, each virtio/vf/sf vport has a vlan
> such that the v-switch tags packets going out "south" of the
> vport towards the uplink, untags packets going "north" from
> the uplink, matches on the vport tag and forwards them to
> the vport (and does nothing for east-west traffic).
>
> In a similar manner, in "classic" v-switch tunnel setup, each
> virtio/vf/sf vport is somehow associated with VNI/s marking the
> tenant/s it belongs to. Same tenant east-west traffic on the
> host doesn't go through any encap/decap. The v-switch adds the
> relevant tunnel MD to packets/skbs sent "southward" by the end-point
> and forwards it to the VTEP which applies encap based on the MD (LWT
> scheme) and sends the packets to the wire. On RX, the VTEP decaps
> the tunnel info from the packet, adds it as MD to the skb and
> forwards the packet up into the stack where the vsw hooks it, matches
> on the MD + inner tuple and then forwards it to the relevant endpoint.

Moving tunnel endpoint to VF doesn't change anything in this high-level
description.
> HW offloads for VST and "classic" v-switch tunnel setup:
>
> more or less straight forward based on the above
>
>> From TC perspective VF tunnel configuration requires two rules in both
>> directions:
>
>> TX rules
>> 1. Rule that encapsulates the tunneled flow and redirects packets from
>> source VF rep to tunnel device:
>> 2. Rule that redirects from tunnel device to UL rep:
>
>> RX rules
>> 1. Rule that redirects packets from UL to VF rep that has the tunnel
>> endpoint IP address:
>> 2. Rule that decapsulates the tunneled flow and redirects to destination VF
>> representor:
>
> mmm it's kinda hard managing to follow and catch up a SW model from TC rules..
>
> I think we need these two to begin with (in whatever order that works
> better for you)
>
> [1] Motivation for enhanced v-switch tunnel setup:
>
> [2] SW model for enhanced v-switch tunnel setup:
>
>> HW offloads model
>
> a clear SW model before HW offloads model..

Hope my replies to Jakub and Marcelo also address these.

>
>> 25 files changed, 3812 insertions(+), 1057 deletions(-)
>
> for adding almost 4K LOCs
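As a rough illustration of the rule model quoted above, the TX (encap) and RX (decap) sides might look like the following manual tc commands. This is a hedged sketch using the interface names and tunnel parameters from the examples in this thread (enp8s0f0 uplink, enp8s0f0_0 tunnel-endpoint VF rep, enp8s0f0_1 workload VF rep, vxlan_sys_4789, VNI 98, endpoint 7.7.7.5, remote 7.7.7.1); in practice OVS installs these rules, and the exact matches differ.

```shell
# TX rule 1: on the source VF rep, attach tunnel metadata and
# redirect to the vxlan device, which performs the encapsulation.
tc filter add dev enp8s0f0_1 ingress protocol ip flower \
    action tunnel_key set src_ip 7.7.7.5 dst_ip 7.7.7.1 \
        id 98 dst_port 4789 \
    action mirred egress redirect dev vxlan_sys_4789
# (TX rule 2, tunnel device to UL rep, follows the route lookup
# for 7.7.7.1 and is not shown here.)

# RX rule 1: on the uplink rep, steer encapsulated packets whose
# outer destination is the tunnel endpoint to the VF rep owning it.
tc filter add dev enp8s0f0 ingress protocol ip flower \
    dst_ip 7.7.7.5 ip_proto udp dst_port 4789 \
    action mirred egress redirect dev enp8s0f0_0

# RX rule 2: on the vxlan device, match the tunnel metadata,
# decapsulate, and redirect to the destination VF rep.
tc filter add dev vxlan_sys_4789 ingress protocol ip flower \
    enc_src_ip 7.7.7.1 enc_dst_ip 7.7.7.5 \
    enc_key_id 98 enc_dst_port 4789 \
    action tunnel_key unset \
    action mirred egress redirect dev enp8s0f0_1
```

Note how the two RX rules together replace what a single rule on the uplink rep achieved when the endpoint lived on the uplink: the extra UL-to-VF-rep hop is what the routing infrastructure in this series keeps up to date.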
On Wed, 10 Feb 2021 13:25:05 +0200 Vlad Buslov wrote:
> On Tue 09 Feb 2021 at 21:50, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Tue, 9 Feb 2021 21:17:11 +0200 Vlad Buslov wrote:
> >> 4. Decapsulated payload appears on namespaced VF with IP address
> >> 5.5.5.5:
> >>
> >> $ sudo ip netns exec ns0 tcpdump -ni enp8s0f0v1 -vvv -c 3
> >
> > So there are two VFs? Hm, completely missed that. Could you *please*
> > provide an ascii diagram for the entire flow? None of those dumps
> > you're showing gives us the high level picture, and it's quite hard
> > to follow which enpsfyxz interface is what.
>
> Sure. Here it is:

Thanks a lot, that clarifies it!
From: Saeed Mahameed <saeedm@nvidia.com>

Hi Jakub,

This series adds the support for VF tunneling. For more information
please see tag log below.

Please pull and let me know if there is any problem.

v1->v2:
 - build error: Added the missing function 'mlx5_vport_get_other_func_cap'
   in patch 2

Thanks,
Saeed.

---

The following changes since commit 4d469ec8ec05e1fa4792415de1a95b28871ff2fa:

  Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue (2021-02-04 21:26:28 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2021-02-04

for you to fetch changes up to 8914add2c9e5518f6a864936658bba5752510b39:

  net/mlx5e: Handle FIB events to update tunnel endpoint device (2021-02-05 20:53:39 -0800)

----------------------------------------------------------------
mlx5-updates-2021-02-04

Vlad Buslov says:
=================

Implement support for VF tunneling

Abstract

Currently, mlx5 only supports configuration with tunnel endpoint IP address on
uplink representor. Remove implicit and explicit assumptions of tunnel always
being terminated on uplink and implement necessary infrastructure for
configuring tunnels on VF representors and updating rules on such tunnels
according to routing changes.

SW TC model