Message ID | 20201204102901.109709-2-marekx.majtyka@intel.com |
---|---|
State | New |
Headers | show |
Series | New netdev feature flags for XDP | expand |
alardam@gmail.com writes: > From: Marek Majtyka <marekx.majtyka@intel.com> > > Implement support for checking what kind of xdp functionality a netdev > supports. Previously, there was no way to do this other than to try > to create an AF_XDP socket on the interface or load an XDP program and see > if it worked. This commit changes this by adding a new variable which > describes all xdp supported functions on pretty detailed level: I like the direction this is going! :) > - aborted > - drop > - pass > - tx > - redirect Drivers can in principle implement support for the XDP_REDIRECT return code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit() for being the *target* of a redirect. While my quick grepping doesn't turn up any drivers that do only one of these right now, I think we've had examples of it in the past, so it would probably be better to split the redirect feature flag in two. This would also make it trivial to replace the check in __xdp_enqueue() (in devmap.c) from looking at whether the ndo is defined, and just checking the flag. It would be great if you could do this as part of this series. Maybe we could even make the 'redirect target' flag be set automatically if a driver implements ndo_xdp_xmit? > - zero copy > - hardware offload. > > Zerocopy mode requires that redirect xdp operation is implemented > in a driver and the driver supports also zero copy mode. > Full mode requires that all xdp operation are implemented in the driver. > Basic mode is just full mode without redirect operation. > > Initially, these new flags are disabled for all drivers by default. > > Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com> > --- > .../networking/netdev-xdp-properties.rst | 42 ++++++++ > include/linux/netdevice.h | 2 + > include/linux/xdp_properties.h | 53 +++++++++++ > include/net/xdp.h | 95 +++++++++++++++++++ > include/net/xdp_sock_drv.h | 10 ++ > include/uapi/linux/ethtool.h | 1 + > include/uapi/linux/xdp_properties.h | 32 +++++++ > net/ethtool/common.c | 11 +++ > net/ethtool/common.h | 4 + > net/ethtool/strset.c | 5 + > 10 files changed, 255 insertions(+) > create mode 100644 Documentation/networking/netdev-xdp-properties.rst > create mode 100644 include/linux/xdp_properties.h > create mode 100644 include/uapi/linux/xdp_properties.h > > diff --git a/Documentation/networking/netdev-xdp-properties.rst b/Documentation/networking/netdev-xdp-properties.rst > new file mode 100644 > index 000000000000..4a434a1c512b > --- /dev/null > +++ b/Documentation/networking/netdev-xdp-properties.rst > @@ -0,0 +1,42 @@ > +.. SPDX-License-Identifier: GPL-2.0 > + > +===================== > +Netdev XDP properties > +===================== > + > + * XDP PROPERTIES FLAGS > + > +Following netdev xdp properties flags can be retrieve over netlink ethtool > +interface the same way as netdev feature flags. These properties flags are > +read only and cannot be change in the runtime. > + > + > +* XDP_ABORTED > + > +This property informs if netdev supports xdp aborted action. > + > +* XDP_DROP > + > +This property informs if netdev supports xdp drop action. > + > +* XDP_PASS > + > +This property informs if netdev supports xdp pass action. > + > +* XDP_TX > + > +This property informs if netdev supports xdp tx action. > + > +* XDP_REDIRECT > + > +This property informs if netdev supports xdp redirect action. > +It assumes the all beforehand mentioned flags are enabled. > + > +* XDP_ZEROCOPY > + > +This property informs if netdev driver supports xdp zero copy. > +It assumes the all beforehand mentioned flags are enabled. Nit: I think 'XDP_ZEROCOPY' can lead people to think that this is zero-copy support for all XDP operations, which is obviously not the case. So maybe 'XDP_SOCK_ZEROCOPY' (and update the description to mention AF_XDP sockets explicitly)? -Toke
On 12/4/20 1:46 PM, Maciej Fijalkowski wrote: > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote: >> alardam@gmail.com writes: >>> From: Marek Majtyka <marekx.majtyka@intel.com> >>> >>> Implement support for checking what kind of xdp functionality a netdev >>> supports. Previously, there was no way to do this other than to try >>> to create an AF_XDP socket on the interface or load an XDP program and see >>> if it worked. This commit changes this by adding a new variable which >>> describes all xdp supported functions on pretty detailed level: >> >> I like the direction this is going! :) >> >>> - aborted >>> - drop >>> - pass >>> - tx I strongly think we should _not_ merge any native XDP driver patchset that does not support/implement the above return codes. Could we instead group them together and call this something like XDP_BASE functionality to not give a wrong impression? If this is properly documented that these are basic must-have _requirements_, then users and driver developers both know what the expectations are. >>> - redirect >> >> Drivers can in principle implement support for the XDP_REDIRECT return >> code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit() >> for being the *target* of a redirect. While my quick grepping doesn't >> turn up any drivers that do only one of these right now, I think we've >> had examples of it in the past, so it would probably be better to split >> the redirect feature flag in two. >> >> This would also make it trivial to replace the check in __xdp_enqueue() >> (in devmap.c) from looking at whether the ndo is defined, and just >> checking the flag. It would be great if you could do this as part of >> this series. >> >> Maybe we could even make the 'redirect target' flag be set automatically >> if a driver implements ndo_xdp_xmit? > > +1 > >>> - zero copy >>> - hardware offload. One other thing that is quite annoying to figure out sometimes and not always obvious from reading the driver code (and it may even differ depending on how the driver was built :/) is how much XDP headroom a driver really provides. We tried to standardize on a minimum guaranteed amount, but unfortunately not everyone seems to implement it, but I think it would be very useful to query this from application side, for example, consider that an app inserts a BPF prog at XDP doing custom encap shortly before XDP_TX so it would be useful to know which of the different encaps it implements are realistically possible on the underlying XDP supported dev. Thanks, Daniel
Daniel Borkmann <daniel@iogearbox.net> writes: > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote: >> On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote: >>> alardam@gmail.com writes: >>>> From: Marek Majtyka <marekx.majtyka@intel.com> >>>> >>>> Implement support for checking what kind of xdp functionality a netdev >>>> supports. Previously, there was no way to do this other than to try >>>> to create an AF_XDP socket on the interface or load an XDP program and see >>>> if it worked. This commit changes this by adding a new variable which >>>> describes all xdp supported functions on pretty detailed level: >>> >>> I like the direction this is going! :) >>> >>>> - aborted >>>> - drop >>>> - pass >>>> - tx > > I strongly think we should _not_ merge any native XDP driver patchset > that does not support/implement the above return codes. Could we > instead group them together and call this something like XDP_BASE > functionality to not give a wrong impression? If this is properly > documented that these are basic must-have _requirements_, then users > and driver developers both know what the expectations are. I think there may have been drivers that only did DROP/PASS on first merge; but adding TX to the "base set" is fine by me, as long as it's actually enforced ;) (As in, we originally said the same thing about the full feature set and that never really worked out). >>>> - redirect >>> >>> Drivers can in principle implement support for the XDP_REDIRECT return >>> code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit() >>> for being the *target* of a redirect. While my quick grepping doesn't >>> turn up any drivers that do only one of these right now, I think we've >>> had examples of it in the past, so it would probably be better to split >>> the redirect feature flag in two. >>> >>> This would also make it trivial to replace the check in __xdp_enqueue() >>> (in devmap.c) from looking at whether the ndo is defined, and just >>> checking the flag. It would be great if you could do this as part of >>> this series. >>> >>> Maybe we could even make the 'redirect target' flag be set automatically >>> if a driver implements ndo_xdp_xmit? >> >> +1 >> >>>> - zero copy >>>> - hardware offload. > > One other thing that is quite annoying to figure out sometimes and not always > obvious from reading the driver code (and it may even differ depending on how > the driver was built :/) is how much XDP headroom a driver really provides. > > We tried to standardize on a minimum guaranteed amount, but unfortunately not > everyone seems to implement it, but I think it would be very useful to query > this from application side, for example, consider that an app inserts a BPF > prog at XDP doing custom encap shortly before XDP_TX so it would be useful to > know which of the different encaps it implements are realistically possible on > the underlying XDP supported dev. How many distinct values are there in reality? Enough to express this in a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need an additional field to get the exact value? If we implement the latter we also run the risk of people actually implementing all sorts of weird values, whereas if we constrain it to a few distinct values it's easier to push back against adding new values (as it'll be obvious from the addition of new flags). -Toke
On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote: > Daniel Borkmann <daniel@iogearbox.net> writes: [...] >> We tried to standardize on a minimum guaranteed amount, but unfortunately not >> everyone seems to implement it, but I think it would be very useful to query >> this from application side, for example, consider that an app inserts a BPF >> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to >> know which of the different encaps it implements are realistically possible on >> the underlying XDP supported dev. > > How many distinct values are there in reality? Enough to express this in > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need > an additional field to get the exact value? If we implement the latter > we also run the risk of people actually implementing all sorts of weird > values, whereas if we constrain it to a few distinct values it's easier > to push back against adding new values (as it'll be obvious from the > addition of new flags). It's not everywhere straight forward to determine unfortunately, see also [0,1] as some data points where Jesper looked into in the past, so in some cases it might differ depending on the build/runtime config.. [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/ [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/
On Fri, 4 Dec 2020 23:19:55 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote: > On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote: > > Daniel Borkmann <daniel@iogearbox.net> writes: > [...] > >> We tried to standardize on a minimum guaranteed amount, but unfortunately not > >> everyone seems to implement it, but I think it would be very useful to query > >> this from application side, for example, consider that an app inserts a BPF > >> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to > >> know which of the different encaps it implements are realistically possible on > >> the underlying XDP supported dev. > > > > How many distinct values are there in reality? Enough to express this in > > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need > > an additional field to get the exact value? If we implement the latter > > we also run the risk of people actually implementing all sorts of weird > > values, whereas if we constrain it to a few distinct values it's easier > > to push back against adding new values (as it'll be obvious from the > > addition of new flags). > > It's not everywhere straight forward to determine unfortunately, see also [0,1] > as some data points where Jesper looked into in the past, so in some cases it > might differ depending on the build/runtime config.. > > [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/ > [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/ Yes, unfortunately drivers have already gotten creative in this area, and variations have sneaked in. I remember that we were forced to allow SFC driver to use 128 bytes headroom, to avoid a memory corruption. I tried hard to have the minimum 192 bytes as it is 3 cachelines, but I failed to enforce this. It might be valuable to expose info on the drivers headroom size, as this will allow end-users to take advantage of this (instead of having to use the lowest common headroom) and up-front in userspace rejecting to load on e.g. SFC that have this annoying limitation. BUT thinking about what the drivers headroom size MEANS to userspace, I'm not sure it is wise to give this info to userspace. The XDP-headroom is used for several kernel internal things, that limit the available space for growing packet-headroom. E.g. (1) xdp_frame is something that we likely need to grow (even-though I'm pushing back), E.g. (2) metadata area which Saeed is looking to populate from driver code (also reduce packet-headroom for encap-headers). So, userspace cannot use the XDP-headroom size to much... -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Daniel Borkmann <daniel@iogearbox.net> writes: > On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote: >> Daniel Borkmann <daniel@iogearbox.net> writes: > [...] >>> We tried to standardize on a minimum guaranteed amount, but unfortunately not >>> everyone seems to implement it, but I think it would be very useful to query >>> this from application side, for example, consider that an app inserts a BPF >>> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to >>> know which of the different encaps it implements are realistically possible on >>> the underlying XDP supported dev. >> >> How many distinct values are there in reality? Enough to express this in >> a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need >> an additional field to get the exact value? If we implement the latter >> we also run the risk of people actually implementing all sorts of weird >> values, whereas if we constrain it to a few distinct values it's easier >> to push back against adding new values (as it'll be obvious from the >> addition of new flags). > > It's not everywhere straight forward to determine unfortunately, see also [0,1] > as some data points where Jesper looked into in the past, so in some cases it > might differ depending on the build/runtime config.. > > [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/ > [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/ Right, well in that case maybe we should just expose the actual headroom as a separate netlink attribute? Although I suppose that would require another round of driver changes since Jesper's patch you linked above only puts this into xdp_buff at XDP program runtime. Jesper, WDYT? -Toke
Jesper Dangaard Brouer <brouer@redhat.com> writes: > On Fri, 4 Dec 2020 23:19:55 +0100 > Daniel Borkmann <daniel@iogearbox.net> wrote: > >> On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote: >> > Daniel Borkmann <daniel@iogearbox.net> writes: >> [...] >> >> We tried to standardize on a minimum guaranteed amount, but unfortunately not >> >> everyone seems to implement it, but I think it would be very useful to query >> >> this from application side, for example, consider that an app inserts a BPF >> >> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to >> >> know which of the different encaps it implements are realistically possible on >> >> the underlying XDP supported dev. >> > >> > How many distinct values are there in reality? Enough to express this in >> > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need >> > an additional field to get the exact value? If we implement the latter >> > we also run the risk of people actually implementing all sorts of weird >> > values, whereas if we constrain it to a few distinct values it's easier >> > to push back against adding new values (as it'll be obvious from the >> > addition of new flags). >> >> It's not everywhere straight forward to determine unfortunately, see also [0,1] >> as some data points where Jesper looked into in the past, so in some cases it >> might differ depending on the build/runtime config.. >> >> [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/ >> [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/ > > Yes, unfortunately drivers have already gotten creative in this area, > and variations have sneaked in. I remember that we were forced to > allow SFC driver to use 128 bytes headroom, to avoid a memory > corruption. I tried hard to have the minimum 192 bytes as it is 3 > cachelines, but I failed to enforce this. > > It might be valuable to expose info on the drivers headroom size, as > this will allow end-users to take advantage of this (instead of having > to use the lowest common headroom) and up-front in userspace rejecting > to load on e.g. SFC that have this annoying limitation. > > BUT thinking about what the drivers headroom size MEANS to userspace, > I'm not sure it is wise to give this info to userspace. The > XDP-headroom is used for several kernel internal things, that limit the > available space for growing packet-headroom. E.g. (1) xdp_frame is > something that we likely need to grow (even-though I'm pushing back), > E.g. (2) metadata area which Saeed is looking to populate from driver > code (also reduce packet-headroom for encap-headers). So, userspace > cannot use the XDP-headroom size to much... (Ah, you had already replied, sorry seems I missed that). Can we calculate a number from the headroom that is meaningful for userspace? I suppose that would be "total number of bytes available for metadata+packet extension"? Even with growing data structures, any particular kernel should be able to inform userspace of the current value, no? -Toke
On Fri, 4 Dec 2020 16:21:08 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote: > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote: > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote: > >> alardam@gmail.com writes: > >>> From: Marek Majtyka <marekx.majtyka@intel.com> > >>> > >>> Implement support for checking what kind of xdp functionality a netdev > >>> supports. Previously, there was no way to do this other than to try > >>> to create an AF_XDP socket on the interface or load an XDP program and see > >>> if it worked. This commit changes this by adding a new variable which > >>> describes all xdp supported functions on pretty detailed level: > >> > >> I like the direction this is going! :) (Me too, don't get discouraged by our nitpicking, keep working on this! :-)) > >> > >>> - aborted > >>> - drop > >>> - pass > >>> - tx > > I strongly think we should _not_ merge any native XDP driver patchset > that does not support/implement the above return codes. I agree, with above statement. > Could we instead group them together and call this something like > XDP_BASE functionality to not give a wrong impression? I disagree. I can accept that XDP_BASE include aborted+drop+pass. I think we need to keep XDP_TX action separate, because I think that there are use-cases where the we want to disable XDP_TX due to end-user policy or hardware limitations. Use-case(1): Cloud-provider want to give customers (running VMs) ability to load XDP program for DDoS protection (only), but don't want to allow customer to use XDP_TX (that can implement LB or cheat their VM isolation policy). Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue resources, as the use-case is only DDoS. Today we have this problem with the ixgbe hardware, that cannot load XDP programs on systems with more than 192 CPUs. > If this is properly documented that these are basic must-have > _requirements_, then users and driver developers both know what the > expectations are. We can still document that XDP_TX is a must-have requirement, when a driver implements XDP. > >>> - redirect > >> -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
Jesper Dangaard Brouer wrote: > On Fri, 4 Dec 2020 16:21:08 +0100 > Daniel Borkmann <daniel@iogearbox.net> wrote: > > > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote: > > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote: > > >> alardam@gmail.com writes: > > >>> From: Marek Majtyka <marekx.majtyka@intel.com> > > >>> > > >>> Implement support for checking what kind of xdp functionality a netdev > > >>> supports. Previously, there was no way to do this other than to try > > >>> to create an AF_XDP socket on the interface or load an XDP program and see > > >>> if it worked. This commit changes this by adding a new variable which > > >>> describes all xdp supported functions on pretty detailed level: > > >> > > >> I like the direction this is going! :) > > (Me too, don't get discouraged by our nitpicking, keep working on this! :-)) > > > >> > > >>> - aborted > > >>> - drop > > >>> - pass > > >>> - tx > > > > I strongly think we should _not_ merge any native XDP driver patchset > > that does not support/implement the above return codes. > > I agree, with above statement. > > > Could we instead group them together and call this something like > > XDP_BASE functionality to not give a wrong impression? > > I disagree. I can accept that XDP_BASE include aborted+drop+pass. > > I think we need to keep XDP_TX action separate, because I think that > there are use-cases where the we want to disable XDP_TX due to end-user > policy or hardware limitations. How about we discover this at load time though. Meaning if the program doesn't use XDP_TX then the hardware can skip resource allocations for it. I think we could have verifier or extra pass discover the use of XDP_TX and then pass a bit down to driver to enable/disable TX caps. > > Use-case(1): Cloud-provider want to give customers (running VMs) ability > to load XDP program for DDoS protection (only), but don't want to allow > customer to use XDP_TX (that can implement LB or cheat their VM > isolation policy). Not following. What interface do they want to allow loading on? If its the VM interface then I don't see how it matters. From outside the VM there should be no way to discover if its done in VM or in tc or some other stack. If its doing some onloading/offloading I would assume they need to ensure the isolation, etc. is still maintained because you can't let one VMs program work on other VMs packets safely. So what did I miss, above doesn't make sense to me. > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue > resources, as the use-case is only DDoS. Today we have this problem > with the ixgbe hardware, that cannot load XDP programs on systems with > more than 192 CPUs. The ixgbe issues is just a bug or missing-feature in my opinion. I think we just document that XDP_TX consumes resources and if users care they shouldn't use XD_TX in programs and in that case hardware should via program discovery not allocate the resource. This seems cleaner in my opinion then more bits for features. > > > > If this is properly documented that these are basic must-have > > _requirements_, then users and driver developers both know what the > > expectations are. > > We can still document that XDP_TX is a must-have requirement, when a > driver implements XDP. +1 > > > > >>> - redirect > > >> > > > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer >
On Mon, 2020-12-07 at 12:52 -0800, John Fastabend wrote: > Jesper Dangaard Brouer wrote: > > On Fri, 4 Dec 2020 16:21:08 +0100 > > Daniel Borkmann <daniel@iogearbox.net> wrote: > > > > > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote: > > > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland- > > > > Jørgensen wrote: > > > > > alardam@gmail.com writes: > > > > > > From: Marek Majtyka <marekx.majtyka@intel.com> > > > > > > > > > > > > Implement support for checking what kind of xdp > > > > > > functionality a netdev > > > > > > supports. Previously, there was no way to do this other > > > > > > than to try > > > > > > to create an AF_XDP socket on the interface or load an XDP > > > > > > program and see > > > > > > if it worked. This commit changes this by adding a new > > > > > > variable which > > > > > > describes all xdp supported functions on pretty detailed > > > > > > level: > > > > > > > > > > I like the direction this is going! :) > > > > (Me too, don't get discouraged by our nitpicking, keep working on > > this! :-)) > > > > > > > > > > > > > - aborted > > > > > > - drop > > > > > > - pass > > > > > > - tx > > > > > > I strongly think we should _not_ merge any native XDP driver > > > patchset > > > that does not support/implement the above return codes. > > > > I agree, with above statement. > > > > > Could we instead group them together and call this something like > > > XDP_BASE functionality to not give a wrong impression? > > > > I disagree. I can accept that XDP_BASE include aborted+drop+pass. > > XDP_BASE is a weird name i vote: XDP_FLAG_RX, XDP_FLAG_TX, XDP_FLAG_REDIRECT, XDP_FLAG_AF_XDP, XDP_FLAG_AFXDP_ZC > > I think we need to keep XDP_TX action separate, because I think > > that > > there are use-cases where the we want to disable XDP_TX due to end- > > user > > policy or hardware limitations. > > How about we discover this at load time though. Meaning if the > program > doesn't use XDP_TX then the hardware can skip resource allocations > for > it. I think we could have verifier or extra pass discover the use of > XDP_TX and then pass a bit down to driver to enable/disable TX caps. > +1, how about we also attach some attributes to the program that would tell the kernel/driver how to prepare configure itself for the new program ? Attributes like how much headroom the program needs, what meta data driver must provide, should the driver do csum on tx, etc .. some attribute can be extracted from the byte code/logic others are stated explicitly in some predefined section in the XDP prog itself. On a second thought, this could be disruptive, users will eventually want to replace XDP progs, and they might want a persistent config prior to loading/reloading any prog to avoid reconfigs (packet drops) between progs. > > Use-case(1): Cloud-provider want to give customers (running VMs) > > ability > > to load XDP program for DDoS protection (only), but don't want to > > allow > > customer to use XDP_TX (that can implement LB or cheat their VM > > isolation policy). > > Not following. What interface do they want to allow loading on? If > its > the VM interface then I don't see how it matters. From outside the > VM there should be no way to discover if its done in VM or in tc or > some other stack. > > If its doing some onloading/offloading I would assume they need to > ensure the isolation, etc. is still maintained because you can't > let one VMs program work on other VMs packets safely. > > So what did I miss, above doesn't make sense to me. > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue > > resources, as the use-case is only DDoS. Today we have this > > problem > > with the ixgbe hardware, that cannot load XDP programs on systems > > with > > more than 192 CPUs. > > The ixgbe issues is just a bug or missing-feature in my opinion. > > I think we just document that XDP_TX consumes resources and if users > care they shouldn't use XD_TX in programs and in that case hardware > should via program discovery not allocate the resource. This seems > cleaner in my opinion then more bits for features. > > > > > > If this is properly documented that these are basic must-have > > > _requirements_, then users and driver developers both know what > > > the > > > expectations are. > > > > We can still document that XDP_TX is a must-have requirement, when > > a > > driver implements XDP. > > +1 > Ho about xdp redirect ? do we still need to load a no-op program on the egress netdev so it would allocate the xdp tx/redirect queues ? Adding the above discovery feature will break xdp redirect native mode and will require to have a special flag for xdp_redirect, so it actually makes more sense to have a unique knob to turn on XDP tx, for the redirect use case.
On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote: > Jesper Dangaard Brouer wrote: > > On Fri, 4 Dec 2020 16:21:08 +0100 > > Daniel Borkmann <daniel@iogearbox.net> wrote: > > > > > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote: > > > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote: > > > >> alardam@gmail.com writes: > > > >>> From: Marek Majtyka <marekx.majtyka@intel.com> > > > >>> > > > >>> Implement support for checking what kind of xdp functionality a netdev > > > >>> supports. Previously, there was no way to do this other than to try > > > >>> to create an AF_XDP socket on the interface or load an XDP program and see > > > >>> if it worked. This commit changes this by adding a new variable which > > > >>> describes all xdp supported functions on pretty detailed level: > > > >> > > > >> I like the direction this is going! :) > > > > (Me too, don't get discouraged by our nitpicking, keep working on this! :-)) > > > > > >> > > > >>> - aborted > > > >>> - drop > > > >>> - pass > > > >>> - tx > > > > > > I strongly think we should _not_ merge any native XDP driver patchset > > > that does not support/implement the above return codes. > > > > I agree, with above statement. > > > > > Could we instead group them together and call this something like > > > XDP_BASE functionality to not give a wrong impression? > > > > I disagree. I can accept that XDP_BASE include aborted+drop+pass. > > > > I think we need to keep XDP_TX action separate, because I think that > > there are use-cases where the we want to disable XDP_TX due to end-user > > policy or hardware limitations. > > How about we discover this at load time though. Meaning if the program > doesn't use XDP_TX then the hardware can skip resource allocations for > it. I think we could have verifier or extra pass discover the use of > XDP_TX and then pass a bit down to driver to enable/disable TX caps. +1 > > > > > Use-case(1): Cloud-provider want to give customers (running VMs) ability > > to load XDP program for DDoS protection (only), but don't want to allow > > customer to use XDP_TX (that can implement LB or cheat their VM > > isolation policy). > > Not following. What interface do they want to allow loading on? If its > the VM interface then I don't see how it matters. From outside the > VM there should be no way to discover if its done in VM or in tc or > some other stack. > > If its doing some onloading/offloading I would assume they need to > ensure the isolation, etc. is still maintained because you can't > let one VMs program work on other VMs packets safely. > > So what did I miss, above doesn't make sense to me. > > > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue > > resources, as the use-case is only DDoS. Today we have this problem > > with the ixgbe hardware, that cannot load XDP programs on systems with > > more than 192 CPUs. > > The ixgbe issues is just a bug or missing-feature in my opinion. Not a bug, rather HW limitation? > > I think we just document that XDP_TX consumes resources and if users > care they shouldn't use XD_TX in programs and in that case hardware > should via program discovery not allocate the resource. This seems > cleaner in my opinion then more bits for features. But what if I'm with some limited HW that actually has a support for XDP and I would like to utilize XDP_TX? Not all drivers that support XDP consume Tx resources. Recently igb got support and it shares Tx queues between netstack and XDP. I feel like we should have a sort-of best effort approach in case we stumble upon the XDP_TX in prog being loaded and query the driver if it would be able to provide the Tx resources on the current system, given that normally we tend to have a queue per core. In that case igb would say yes, ixgbe would say no and prog would be rejected. > > > > > > > > If this is properly documented that these are basic must-have > > > _requirements_, then users and driver developers both know what the > > > expectations are. > > > > We can still document that XDP_TX is a must-have requirement, when a > > driver implements XDP. > > +1 > > > > > > > > >>> - redirect > > > >> > > > > > > -- > > Best regards, > > Jesper Dangaard Brouer > > MSc.CS, Principal Kernel Engineer at Red Hat > > LinkedIn: http://www.linkedin.com/in/brouer > > > >
On 12/7/20 1:52 PM, John Fastabend wrote: >> >> I think we need to keep XDP_TX action separate, because I think that >> there are use-cases where the we want to disable XDP_TX due to end-user >> policy or hardware limitations. > > How about we discover this at load time though. Meaning if the program > doesn't use XDP_TX then the hardware can skip resource allocations for > it. I think we could have verifier or extra pass discover the use of > XDP_TX and then pass a bit down to driver to enable/disable TX caps. > This was discussed in the context of virtio_net some months back - it is hard to impossible to know a program will not return XDP_TX (e.g., value comes from a map). Flipping that around, what if the program attach indicates whether XDP_TX could be returned. If so, driver manages the resource needs. If not, no resource needed and if the program violates that and returns XDP_TX the packet is dropped.
On Mon, 7 Dec 2020 18:01:00 -0700 David Ahern <dsahern@gmail.com> wrote: > On 12/7/20 1:52 PM, John Fastabend wrote: > >> > >> I think we need to keep XDP_TX action separate, because I think that > >> there are use-cases where the we want to disable XDP_TX due to end-user > >> policy or hardware limitations. > > > > How about we discover this at load time though. Nitpick at XDP "attach" time. The general disconnect between BPF and XDP is that BPF can verify at "load" time (as kernel knows what it support) while XDP can have different support/features per driver, and cannot do this until attachment time. (See later issue with tail calls). (All other BPF-hooks don't have this issue) > > Meaning if the program > > doesn't use XDP_TX then the hardware can skip resource allocations for > > it. I think we could have verifier or extra pass discover the use of > > XDP_TX and then pass a bit down to driver to enable/disable TX caps. > > > > This was discussed in the context of virtio_net some months back - it is > hard to impossible to know a program will not return XDP_TX (e.g., value > comes from a map). It is hard, and sometimes not possible. For maps the workaround is that BPF-programmer adds a bound check on values from the map. If not doing that the verifier have to assume all possible return codes are used by BPF-prog. The real nemesis is program tail calls, that can be added dynamically after the XDP program is attached. It is at attachment time that changing the NIC resources is possible. So, for program tail calls the verifier have to assume all possible return codes are used by BPF-prog. BPF now have function calls and function replace right(?) How does this affect this detection of possible return codes? > Flipping that around, what if the program attach indicates whether > XDP_TX could be returned. If so, driver manages the resource needs. If > not, no resource needed and if the program violates that and returns > XDP_TX the packet is dropped. I do like this idea, as IMHO we do need something that is connected with the BPF-prog, that describe what resources the program request (either like above via detecting this in verifier, or simply manually configuring this in the BPF-prog ELF file) The main idea is that we all (I assume) want to provide a better end-user interface/experience. By direct feedback to the end-user that "loading+attaching" this XDP BPF-prog will not work, as e.g. driver don't support a specific return code. Thus, we need to reject "loading+attaching" if features cannot be satisfied. We need a solution as; today it is causing frustration for end-users that packets can be (silently) dropped by XDP, e.g. if driver don't support XDP_REDIRECT. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
On Mon, 07 Dec 2020 12:52:22 -0800 John Fastabend <john.fastabend@gmail.com> wrote: > > Use-case(1): Cloud-provider want to give customers (running VMs) ability > > to load XDP program for DDoS protection (only), but don't want to allow > > customer to use XDP_TX (that can implement LB or cheat their VM > > isolation policy). > > Not following. What interface do they want to allow loading on? If its > the VM interface then I don't see how it matters. From outside the > VM there should be no way to discover if its done in VM or in tc or > some other stack. > > If its doing some onloading/offloading I would assume they need to > ensure the isolation, etc. is still maintained because you can't > let one VMs program work on other VMs packets safely. > > So what did I miss, above doesn't make sense to me. The Cloud-provider want to load customer provided BPF-code on the physical Host-OS NIC (that support XDP). The customer can get access to a web-interface where they can write or upload their BPF-prog. As multiple customers can upload BPF-progs, the Cloud-provider have to write a BPF-prog dispatcher that runs these multiple program. This could be done via BPF tail-calls, or via Toke's libxdp[1], or via devmap XDP-progs per egress port. The Cloud-provider don't fully trust customers BPF-prog. They already pre-filtered traffic to the given VM, so they can allow customers freedom to see traffic and do XDP_PASS and XDP_DROP. They administratively (via ethtool) want to disable the XDP_REDIRECT and XDP_TX driver feature, as it can be used for violation their VM isolation policy between customers. Is the use-case more clear now? [1] https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
On 12/8/20 10:00 AM, Jesper Dangaard Brouer wrote: > On Mon, 07 Dec 2020 12:52:22 -0800 > John Fastabend <john.fastabend@gmail.com> wrote: > >>> Use-case(1): Cloud-provider want to give customers (running VMs) ability >>> to load XDP program for DDoS protection (only), but don't want to allow >>> customer to use XDP_TX (that can implement LB or cheat their VM >>> isolation policy). >> >> Not following. What interface do they want to allow loading on? If its >> the VM interface then I don't see how it matters. From outside the >> VM there should be no way to discover if its done in VM or in tc or >> some other stack. >> >> If its doing some onloading/offloading I would assume they need to >> ensure the isolation, etc. is still maintained because you can't >> let one VMs program work on other VMs packets safely. >> >> So what did I miss, above doesn't make sense to me. > > The Cloud-provider want to load customer provided BPF-code on the > physical Host-OS NIC (that support XDP). The customer can get access > to a web-interface where they can write or upload their BPF-prog. > > As multiple customers can upload BPF-progs, the Cloud-provider have to > write a BPF-prog dispatcher that runs these multiple program. This > could be done via BPF tail-calls, or via Toke's libxdp[1], or via > devmap XDP-progs per egress port. > > The Cloud-provider don't fully trust customers BPF-prog. They already > pre-filtered traffic to the given VM, so they can allow customers > freedom to see traffic and do XDP_PASS and XDP_DROP. They > administratively (via ethtool) want to disable the XDP_REDIRECT and > XDP_TX driver feature, as it can be used for violation their VM > isolation policy between customers. > > Is the use-case more clear now? I think we're talking about two different things. The use case as I understood it in (1) mentioned to be able to disable XDP_TX for NICs that are deployed in the VM. This would be a no-go as-is since that would mean my basic assumption for attaching XDP progs is gone in that today return codes pass/drop/tx is pretty much available everywhere on native XDP supported NICs. And if you've tried it on major cloud providers like AWS or Azure that offer SRIOV-based networking that works okay and further restricting this would mean breakage of existing programs. What you mean here is "offload" from guest to host which is a different use case than what likely John and I read from your description in (1). Such program should then be loaded via BPF offload API. Meaning, if offload is used and the host is then configured to disallow XDP_TX for such requests from guests, then these get rejected through such facility, but if the /same/ program was loaded as regular native XDP where it's still running in the guest, then it must succeed. These are two entirely different things. It's not clear to me whether some ethtool XDP properties flag is the right place to describe this (plus this needs to differ between offloaded / non-offloaded progs) or whether this should be an implementation detail for things like virtio_net e.g. via virtio_has_feature(). Feels more like the latter to me which already has such a facility in place.
Jesper Dangaard Brouer <jbrouer@redhat.com> writes: > On Mon, 7 Dec 2020 18:01:00 -0700 > David Ahern <dsahern@gmail.com> wrote: > >> On 12/7/20 1:52 PM, John Fastabend wrote: >> >> >> >> I think we need to keep XDP_TX action separate, because I think that >> >> there are use-cases where the we want to disable XDP_TX due to end-user >> >> policy or hardware limitations. >> > >> > How about we discover this at load time though. > > Nitpick at XDP "attach" time. The general disconnect between BPF and > XDP is that BPF can verify at "load" time (as kernel knows what it > support) while XDP can have different support/features per driver, and > cannot do this until attachment time. (See later issue with tail calls). > (All other BPF-hooks don't have this issue) > >> > Meaning if the program >> > doesn't use XDP_TX then the hardware can skip resource allocations for >> > it. I think we could have verifier or extra pass discover the use of >> > XDP_TX and then pass a bit down to driver to enable/disable TX caps. >> > >> >> This was discussed in the context of virtio_net some months back - it is >> hard to impossible to know a program will not return XDP_TX (e.g., value >> comes from a map). > > It is hard, and sometimes not possible. For maps the workaround is > that BPF-programmer adds a bound check on values from the map. If not > doing that the verifier have to assume all possible return codes are > used by BPF-prog. > > The real nemesis is program tail calls, that can be added dynamically > after the XDP program is attached. It is at attachment time that > changing the NIC resources is possible. So, for program tail calls the > verifier have to assume all possible return codes are used by BPF-prog. We actually had someone working on a scheme for how to express this for programs some months ago, but unfortunately that stalled out (Jesper already knows this, but FYI to the rest of you). In any case, I view this as a "next step". Just exposing the feature bits to userspace will help users today, and as a side effect, this also makes drivers declare what they support, which we can then incorporate into the core code to, e.g., reject attachment of programs that won't work anyway. But let's do this in increments and not make the perfect the enemy of the good here. > BPF now have function calls and function replace right(?) How does > this affect this detection of possible return codes? It does have the same issue as tail calls, in that the return code of the function being replaced can obviously change. However, the verifier knows the target of a replace, so it can propagate any constraints put upon the caller if we implement it that way. -Toke
Toke Høiland-Jørgensen wrote: > Jesper Dangaard Brouer <jbrouer@redhat.com> writes: > > > On Mon, 7 Dec 2020 18:01:00 -0700 > > David Ahern <dsahern@gmail.com> wrote: > > > >> On 12/7/20 1:52 PM, John Fastabend wrote: > >> >> > >> >> I think we need to keep XDP_TX action separate, because I think that > >> >> there are use-cases where the we want to disable XDP_TX due to end-user > >> >> policy or hardware limitations. > >> > > >> > How about we discover this at load time though. > > > > Nitpick at XDP "attach" time. The general disconnect between BPF and > > XDP is that BPF can verify at "load" time (as kernel knows what it > > support) while XDP can have different support/features per driver, and > > cannot do this until attachment time. (See later issue with tail calls). > > (All other BPF-hooks don't have this issue) > > > >> > Meaning if the program > >> > doesn't use XDP_TX then the hardware can skip resource allocations for > >> > it. I think we could have verifier or extra pass discover the use of > >> > XDP_TX and then pass a bit down to driver to enable/disable TX caps. > >> > > >> > >> This was discussed in the context of virtio_net some months back - it is > >> hard to impossible to know a program will not return XDP_TX (e.g., value > >> comes from a map). > > > > It is hard, and sometimes not possible. For maps the workaround is > > that BPF-programmer adds a bound check on values from the map. If not > > doing that the verifier have to assume all possible return codes are > > used by BPF-prog. > > > > The real nemesis is program tail calls, that can be added dynamically > > after the XDP program is attached. It is at attachment time that > > changing the NIC resources is possible. So, for program tail calls the > > verifier have to assume all possible return codes are used by BPF-prog. > > We actually had someone working on a scheme for how to express this for > programs some months ago, but unfortunately that stalled out (Jesper > already knows this, but FYI to the rest of you). In any case, I view > this as a "next step". Just exposing the feature bits to userspace will > help users today, and as a side effect, this also makes drivers declare > what they support, which we can then incorporate into the core code to, > e.g., reject attachment of programs that won't work anyway. But let's > do this in increments and not make the perfect the enemy of the good > here. > > > BPF now have function calls and function replace right(?) How does > > this affect this detection of possible return codes? > > It does have the same issue as tail calls, in that the return code of > the function being replaced can obviously change. However, the verifier > knows the target of a replace, so it can propagate any constraints put > upon the caller if we implement it that way. OK I'm convinced its not possible to tell at attach time if a program will use XDP_TX or not in general. And in fact for most real programs it likely will not be knowable. At least most programs I look at these days use either tail calls or function calls so seems like a dead end. Also above somewhere it was pointed out that XDP_REDIRECT would want the queues and it seems even more challenging to sort that out.
> On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote: > > Jesper Dangaard Brouer wrote: > > > On Fri, 4 Dec 2020 16:21:08 +0100 > > > Daniel Borkmann <daniel@iogearbox.net> wrote: [...] pruning the thread to answer Jesper. > > > > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue > > > resources, as the use-case is only DDoS. Today we have this problem > > > with the ixgbe hardware, that cannot load XDP programs on systems with > > > more than 192 CPUs. > > > > The ixgbe issues is just a bug or missing-feature in my opinion. > > Not a bug, rather HW limitation? Well hardware has some max queue limit. Likely <192 otherwise I would have kept doing queue per core on up to 192. But, ideally we should still load and either share queues across multiple cores or restirct down to a subset of CPUs. Do you need 192 cores for a 10gbps nic, probably not. Yes, it requires some extra care, but should be doable if someone cares enough. I gather current limitation/bug is because no one has that configuration and/or has complained loud enough. > > > > > I think we just document that XDP_TX consumes resources and if users > > care they shouldn't use XD_TX in programs and in that case hardware > > should via program discovery not allocate the resource. This seems > > cleaner in my opinion then more bits for features. > > But what if I'm with some limited HW that actually has a support for XDP > and I would like to utilize XDP_TX? > > Not all drivers that support XDP consume Tx resources. Recently igb got > support and it shares Tx queues between netstack and XDP. Makes sense to me. > > I feel like we should have a sort-of best effort approach in case we > stumble upon the XDP_TX in prog being loaded and query the driver if it > would be able to provide the Tx resources on the current system, given > that normally we tend to have a queue per core. Why do we need to query? I guess you want some indication from the driver its not going to be running in the ideal NIC configuraition? I guess printing a warning would be the normal way to show that. But, maybe your point is you want something easier to query? > > In that case igb would say yes, ixgbe would say no and prog would be > rejected. I think the driver should load even if it can't meet the queue per core quota. Refusing to load at all or just dropping packets on the floor is not very friendly. I think we agree on that point.
On Tue, Dec 08, 2020 at 10:03:51PM -0800, John Fastabend wrote: > > On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote: > > > Jesper Dangaard Brouer wrote: > > > > On Fri, 4 Dec 2020 16:21:08 +0100 > > > > Daniel Borkmann <daniel@iogearbox.net> wrote: > > [...] pruning the thread to answer Jesper. I think you meant me, but thanks anyway for responding :) > > > > > > > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue > > > > resources, as the use-case is only DDoS. Today we have this problem > > > > with the ixgbe hardware, that cannot load XDP programs on systems with > > > > more than 192 CPUs. > > > > > > The ixgbe issues is just a bug or missing-feature in my opinion. > > > > Not a bug, rather HW limitation? > > Well hardware has some max queue limit. Likely <192 otherwise I would > have kept doing queue per core on up to 192. But, ideally we should Data sheet states its 128 Tx qs for ixgbe. > still load and either share queues across multiple cores or restirct > down to a subset of CPUs. And that's the missing piece of logic, I suppose. > Do you need 192 cores for a 10gbps nic, probably not. Let's hear from Jesper :p > Yes, it requires some extra care, but should be doable > if someone cares enough. I gather current limitation/bug is because > no one has that configuration and/or has complained loud enough. I would say we're safe for queue per core approach for newer devices where we have thousands of queues to play with. Older devices combined with big cpu count can cause us some problems. Wondering if drivers could have a problem when user would do something weird as limiting the queue count to a lower value than cpu count and then changing the irq affinity? > > > > > > > > > I think we just document that XDP_TX consumes resources and if users > > > care they shouldn't use XD_TX in programs and in that case hardware > > > should via program discovery not allocate the resource. This seems > > > cleaner in my opinion then more bits for features. > > > > But what if I'm with some limited HW that actually has a support for XDP > > and I would like to utilize XDP_TX? > > > > Not all drivers that support XDP consume Tx resources. Recently igb got > > support and it shares Tx queues between netstack and XDP. > > Makes sense to me. > > > > > I feel like we should have a sort-of best effort approach in case we > > stumble upon the XDP_TX in prog being loaded and query the driver if it > > would be able to provide the Tx resources on the current system, given > > that normally we tend to have a queue per core. > > Why do we need to query? I guess you want some indication from the > driver its not going to be running in the ideal NIC configuraition? > I guess printing a warning would be the normal way to show that. But, > maybe your point is you want something easier to query? I meant that given Jesper's example, what should we do? You don't have Tx resources to pull at all. Should we have a data path for that case that would share Tx qs between XDP/netstack? Probably not. > > > > > In that case igb would say yes, ixgbe would say no and prog would be > > rejected. > > I think the driver should load even if it can't meet the queue per > core quota. Refusing to load at all or just dropping packets on the > floor is not very friendly. I think we agree on that point. Agreed on that. But it needs some work. I can dabble on that a bit.
John Fastabend <john.fastabend@gmail.com> writes: > Toke Høiland-Jørgensen wrote: >> Jesper Dangaard Brouer <jbrouer@redhat.com> writes: >> >> > On Mon, 7 Dec 2020 18:01:00 -0700 >> > David Ahern <dsahern@gmail.com> wrote: >> > >> >> On 12/7/20 1:52 PM, John Fastabend wrote: >> >> >> >> >> >> I think we need to keep XDP_TX action separate, because I think that >> >> >> there are use-cases where the we want to disable XDP_TX due to end-user >> >> >> policy or hardware limitations. >> >> > >> >> > How about we discover this at load time though. >> > >> > Nitpick at XDP "attach" time. The general disconnect between BPF and >> > XDP is that BPF can verify at "load" time (as kernel knows what it >> > support) while XDP can have different support/features per driver, and >> > cannot do this until attachment time. (See later issue with tail calls). >> > (All other BPF-hooks don't have this issue) >> > >> >> > Meaning if the program >> >> > doesn't use XDP_TX then the hardware can skip resource allocations for >> >> > it. I think we could have verifier or extra pass discover the use of >> >> > XDP_TX and then pass a bit down to driver to enable/disable TX caps. >> >> > >> >> >> >> This was discussed in the context of virtio_net some months back - it is >> >> hard to impossible to know a program will not return XDP_TX (e.g., value >> >> comes from a map). >> > >> > It is hard, and sometimes not possible. For maps the workaround is >> > that BPF-programmer adds a bound check on values from the map. If not >> > doing that the verifier have to assume all possible return codes are >> > used by BPF-prog. >> > >> > The real nemesis is program tail calls, that can be added dynamically >> > after the XDP program is attached. It is at attachment time that >> > changing the NIC resources is possible. So, for program tail calls the >> > verifier have to assume all possible return codes are used by BPF-prog. >> >> We actually had someone working on a scheme for how to express this for >> programs some months ago, but unfortunately that stalled out (Jesper >> already knows this, but FYI to the rest of you). In any case, I view >> this as a "next step". Just exposing the feature bits to userspace will >> help users today, and as a side effect, this also makes drivers declare >> what they support, which we can then incorporate into the core code to, >> e.g., reject attachment of programs that won't work anyway. But let's >> do this in increments and not make the perfect the enemy of the good >> here. >> >> > BPF now have function calls and function replace right(?) How does >> > this affect this detection of possible return codes? >> >> It does have the same issue as tail calls, in that the return code of >> the function being replaced can obviously change. However, the verifier >> knows the target of a replace, so it can propagate any constraints put >> upon the caller if we implement it that way. > > OK I'm convinced its not possible to tell at attach time if a program > will use XDP_TX or not in general. And in fact for most real programs it > likely will not be knowable. At least most programs I look at these days > use either tail calls or function calls so seems like a dead end. > > Also above somewhere it was pointed out that XDP_REDIRECT would want > the queues and it seems even more challenging to sort that out. Yeah. Doesn't mean that all hope is lost for "reject stuff that doesn't work". We could either do pessimistic return code detection (if we don't know for sure assume all codes are used), or we could add metadata where the program declares what it wants to do... -Toke
On Wed, 9 Dec 2020 10:54:54 +0100 Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote: > On Tue, Dec 08, 2020 at 10:03:51PM -0800, John Fastabend wrote: > > > On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote: > > > > Jesper Dangaard Brouer wrote: > > > > > On Fri, 4 Dec 2020 16:21:08 +0100 > > > > > Daniel Borkmann <daniel@iogearbox.net> wrote: > > > > [...] pruning the thread to answer Jesper. > > I think you meant me, but thanks anyway for responding :) I was about to say that ;-) > > > > > > > > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue > > > > > resources, as the use-case is only DDoS. Today we have this problem > > > > > with the ixgbe hardware, that cannot load XDP programs on systems with > > > > > more than 192 CPUs. > > > > > > > > The ixgbe issues is just a bug or missing-feature in my opinion. > > > > > > Not a bug, rather HW limitation? > > > > Well hardware has some max queue limit. Likely <192 otherwise I would > > have kept doing queue per core on up to 192. But, ideally we should > > Data sheet states its 128 Tx qs for ixgbe. I likely remember wrong, maybe it was only ~96 CPUs. I do remember that some TX queue were reserved for something else, and QA reported issues (as I don't have this high end system myself). > > still load and either share queues across multiple cores or restirct > > down to a subset of CPUs. > > And that's the missing piece of logic, I suppose. > > > Do you need 192 cores for a 10gbps nic, probably not. > > Let's hear from Jesper :p LOL - of-cause you don't need 192 cores. With XDP I will claim that you only need 2 cores (with high GHz) to forward 10gbps wirespeed small packets. The point is that this only works, when we avoid atomic lock operations per packet and bulk NIC PCIe tail/doorbell. It was actually John's invention/design to have a dedicated TX queue per core to avoid the atomic lock operation per packet when queuing packets to the NIC. 10G @64B give budget of 67.2 ns (241 cycles @ 3.60GHz) Atomic lock operation use:[1] - Type:spin_lock_unlock Per elem: 34 cycles(tsc) 9.485 ns - Type:spin_lock_unlock_irqsave Per elem: 61 cycles(tsc) 17.125 ns (And atomic can affect Inst per cycle) But I have redesigned the ndo_xdp_xmit call to take a bulk of packets (up-to 16) so it should not be a problem to solve this by sharing TX-queue and talking a lock per 16 packets. I still recommend that, for fallback case, you allocated a number a TX-queue and distribute this across CPUs to avoid hitting a congested lock (above measurements are the optimal non-congested atomic lock operation) [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c > > Yes, it requires some extra care, but should be doable > > if someone cares enough. I gather current limitation/bug is because > > no one has that configuration and/or has complained loud enough. > > I would say we're safe for queue per core approach for newer devices where > we have thousands of queues to play with. Older devices combined with big > cpu count can cause us some problems. > > Wondering if drivers could have a problem when user would do something > weird as limiting the queue count to a lower value than cpu count and then > changing the irq affinity? Not sure what you mean. But for XDP RX-side we use softirq NAPI guarantee to guard against concurrent access to our (per-cpu) data structures. > > > > > > > > > > > > > I think we just document that XDP_TX consumes resources and if users > > > > care they shouldn't use XD_TX in programs and in that case hardware > > > > should via program discovery not allocate the resource. This seems > > > > cleaner in my opinion then more bits for features. > > > > > > But what if I'm with some limited HW that actually has a support for XDP > > > and I would like to utilize XDP_TX? > > > > > > Not all drivers that support XDP consume Tx resources. Recently igb got > > > support and it shares Tx queues between netstack and XDP. > > > > Makes sense to me. > > > > > > > > I feel like we should have a sort-of best effort approach in case we > > > stumble upon the XDP_TX in prog being loaded and query the driver if it > > > would be able to provide the Tx resources on the current system, given > > > that normally we tend to have a queue per core. > > > > Why do we need to query? I guess you want some indication from the > > driver its not going to be running in the ideal NIC configuraition? > > I guess printing a warning would be the normal way to show that. But, > > maybe your point is you want something easier to query? > > I meant that given Jesper's example, what should we do? You don't have Tx > resources to pull at all. Should we have a data path for that case that > would share Tx qs between XDP/netstack? Probably not. > I think ixgbe should have a fallback mode, where it allocated e.g. 32 TX-queue for XDP xmits or even just same amount as RX-queues (I think XDP_TX and XDP_REDIRECT can share these TX-queues dedicated to XDP). When in fallback mode a lock need to be taken (sharded across CPUs), but ndo_xdp_xmit will bulk up-to 16 packets, so it should not matter too much. I do think ixgbe should output a dmesg log message, to say it is in XDP fallback mode with X number of TX-queues. For us QA usually collect the dmesg output after a test run. > > > > > > In that case igb would say yes, ixgbe would say no and prog would be > > > rejected. > > > > I think the driver should load even if it can't meet the queue per > > core quota. Refusing to load at all or just dropping packets on the > > floor is not very friendly. I think we agree on that point. > > Agreed on that. But it needs some work. I can dabble on that a bit. > I will really appreciate if Intel can fix this in the ixgbe driver, and implement a fallback method. -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: >>> still load and either share queues across multiple cores or restirct >>> down to a subset of CPUs. >> >> And that's the missing piece of logic, I suppose. >> >>> Do you need 192 cores for a 10gbps nic, probably not. >> >> Let's hear from Jesper :p > > LOL - of-cause you don't need 192 cores. With XDP I will claim that > you only need 2 cores (with high GHz) to forward 10gbps wirespeed small > packets. You don't need 192 for 10G on Rx. However, if you are using XDP_REDIRECT from VM tap devices the next device (presumably the host NIC) does need to be able to handle the redirect. My personal experience with this one is mlx5/ConnectX4-LX with a limit of 63 queues and a server with 96 logical cpus. If the vhost thread for the tap device runs on a cpu that does not have an XDP TX Queue, the packet is dropped. This is a really bizarre case to debug as some packets go out fine while others are dropped.
On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets > (up-to 16) so it should not be a problem to solve this by sharing > TX-queue and talking a lock per 16 packets. I still recommend that, > for fallback case, you allocated a number a TX-queue and distribute > this across CPUs to avoid hitting a congested lock (above measurements > are the optimal non-congested atomic lock operation) I have been meaning to ask you why 16 for the XDP batching? If the netdev budget is 64, why not something higher like 32 or 64?
On Wed, 2020-12-09 at 08:41 -0700, David Ahern wrote: > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > > > still load and either share queues across multiple cores or > > > > restirct > > > > down to a subset of CPUs. > > > > > > And that's the missing piece of logic, I suppose. > > > > > > > Do you need 192 cores for a 10gbps nic, probably not. > > > > > > Let's hear from Jesper :p > > > > LOL - of-cause you don't need 192 cores. With XDP I will claim > > that > > you only need 2 cores (with high GHz) to forward 10gbps wirespeed > > small > > packets. > > You don't need 192 for 10G on Rx. However, if you are using > XDP_REDIRECT > from VM tap devices the next device (presumably the host NIC) does > need > to be able to handle the redirect. > > My personal experience with this one is mlx5/ConnectX4-LX with a > limit This limit was removed from mlx5 https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/ Note: you still need to use ehttool to increase from 64 to 128 or 96 in your case. > of 63 queues and a server with 96 logical cpus. If the vhost thread > for > the tap device runs on a cpu that does not have an XDP TX Queue, the > packet is dropped. This is a really bizarre case to debug as some > packets go out fine while others are dropped. I agree, the user experience horrible.
On 12/9/20 10:15 AM, Saeed Mahameed wrote: >> My personal experience with this one is mlx5/ConnectX4-LX with a >> limit > > This limit was removed from mlx5 > https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/ > Note: you still need to use ehttool to increase from 64 to 128 or 96 in > your case. > I asked you about that commit back in May: https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/ As noted in the thread, it did not work for me.
On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote: > On 12/9/20 10:15 AM, Saeed Mahameed wrote: > > > My personal experience with this one is mlx5/ConnectX4-LX with a > > > limit > > > > This limit was removed from mlx5 > > https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/ > > Note: you still need to use ehttool to increase from 64 to 128 or > > 96 in > > your case. > > > > I asked you about that commit back in May: > :/, sorry i missed this email, must have been the mlnx nvidia email transition. > https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/ > > As noted in the thread, it did not work for me. Still relevant ? I might need to get you some tools to increase #msix in Firmware.
On Wed, 9 Dec 2020 08:44:33 -0700 David Ahern <dsahern@gmail.com> wrote: > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets > > (up-to 16) so it should not be a problem to solve this by sharing > > TX-queue and talking a lock per 16 packets. I still recommend that, > > for fallback case, you allocated a number a TX-queue and distribute > > this across CPUs to avoid hitting a congested lock (above measurements > > are the optimal non-congested atomic lock operation) > > I have been meaning to ask you why 16 for the XDP batching? If the > netdev budget is 64, why not something higher like 32 or 64? Thanks you for asking as there are multiple good reasons and consideration for this 16 batch size. Notice cpumap have batch size 8, which is also an explicit choice. And AF_XDP went in the wrong direction IMHO and I think have 256. I designed this to be a choice in the map code, for the level of bulking it needs/wants. The low level explanation is that these 8 and 16 batch sizes are optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher with 10 elements). I'm betting on that memory backing these 8 or 16 packets have higher chance to remain/being in cache, and I can prefetch them without evicting them from cache again. In some cases the pointer to these packets are queued into a ptr_ring, and it is more optimal to write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring. The general explanation is my goal to do bulking without adding latency. This is explicitly stated in my presentation[1] as of Feb 2016, slide 20. Sure, you/we can likely make the micro-benchmarks look better by using 64 batch size, but that will introduce added latency and likely shoot our-selves in the foot for real workloads. With experience from bufferbloat and real networks, we know that massive TX bulking have bad effects. Still XDP-redirect does massive bulking (NIC flush is after full 64 budget) and we don't have pushback or a queue mechanism (so I know we are already shooting ourselves in the foot) ... Fortunately we now have a PhD student working on queuing for XDP. It is also important to understand that this is an adaptive bulking scheme, which comes from NAPI. We don't wait for packets arriving shortly, we pickup what NIC have available, but by only taking 8 or 16 packets (instead of emptying the entire RX-queue), and then spending some time to send them along, I'm hoping that NIC could have gotten some more frame. For cpumap and veth (in-some-cases) they can start to consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to update their internal queue state (and BQL), and if it gets close to full they can choose to flush/doorbell the NIC earlier. When doing queuing for XDP we need to expose these NIC queue states, and having 4 calls with 16 packets (64 budget) also gives us more chances to get NIC queue state info which the NIC already touch. [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer <brouer@redhat.com> wrote: > > On Wed, 9 Dec 2020 08:44:33 -0700 > David Ahern <dsahern@gmail.com> wrote: > > > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets > > > (up-to 16) so it should not be a problem to solve this by sharing > > > TX-queue and talking a lock per 16 packets. I still recommend that, > > > for fallback case, you allocated a number a TX-queue and distribute > > > this across CPUs to avoid hitting a congested lock (above measurements > > > are the optimal non-congested atomic lock operation) > > > > I have been meaning to ask you why 16 for the XDP batching? If the > > netdev budget is 64, why not something higher like 32 or 64? > > Thanks you for asking as there are multiple good reasons and > consideration for this 16 batch size. Notice cpumap have batch size 8, > which is also an explicit choice. And AF_XDP went in the wrong > direction IMHO and I think have 256. I designed this to be a choice in > the map code, for the level of bulking it needs/wants. FYI, as far as I know, there is nothing in AF_XDP that says bulking should be 256. There is a 256 number in the i40e driver that states the maximum number of packets to be sent within one napi_poll loop. But this is just a maximum number and only for that driver. (In case you wonder, that number was inherited from the original skb Tx implementation in the driver.) The actual batch size is controlled by the application. If it puts 1 packet in the Tx ring and calls send(), the batch size will be 1. If it puts 128 packets in the Tx ring and calls send(), you get a batch size of 128, and so on. It is flexible, so you can trade-off latency with throughput in the way the application desires. Rx batch size has also become flexible now with the introduction of Björn's prefer_busy_poll patch set [1]. [1] https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.topel@gmail.com/ > The low level explanation is that these 8 and 16 batch sizes are > optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher > with 10 elements). I'm betting on that memory backing these 8 or 16 > packets have higher chance to remain/being in cache, and I can prefetch > them without evicting them from cache again. In some cases the pointer > to these packets are queued into a ptr_ring, and it is more optimal to > write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring. > > The general explanation is my goal to do bulking without adding latency. > This is explicitly stated in my presentation[1] as of Feb 2016, slide 20. > Sure, you/we can likely make the micro-benchmarks look better by using > 64 batch size, but that will introduce added latency and likely shoot > our-selves in the foot for real workloads. With experience from > bufferbloat and real networks, we know that massive TX bulking have bad > effects. Still XDP-redirect does massive bulking (NIC flush is after > full 64 budget) and we don't have pushback or a queue mechanism (so I > know we are already shooting ourselves in the foot) ... Fortunately we > now have a PhD student working on queuing for XDP. > > It is also important to understand that this is an adaptive bulking > scheme, which comes from NAPI. We don't wait for packets arriving > shortly, we pickup what NIC have available, but by only taking 8 or 16 > packets (instead of emptying the entire RX-queue), and then spending > some time to send them along, I'm hoping that NIC could have gotten > some more frame. For cpumap and veth (in-some-cases) they can start to > consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH > signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to > update their internal queue state (and BQL), and if it gets close to > full they can choose to flush/doorbell the NIC earlier. When doing > queuing for XDP we need to expose these NIC queue states, and having 4 > calls with 16 packets (64 budget) also gives us more chances to get NIC > queue state info which the NIC already touch. > > > [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > LinkedIn: http://www.linkedin.com/in/brouer > > _______________________________________________ > Intel-wired-lan mailing list > Intel-wired-lan@osuosl.org > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
On 12/9/20 11:48 PM, Saeed Mahameed wrote: > On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote: >> On 12/9/20 10:15 AM, Saeed Mahameed wrote: >>>> My personal experience with this one is mlx5/ConnectX4-LX with a >>>> limit >>> >>> This limit was removed from mlx5 >>> https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/ >>> Note: you still need to use ehttool to increase from 64 to 128 or >>> 96 in >>> your case. >>> >> >> I asked you about that commit back in May: >> > > :/, sorry i missed this email, must have been the mlnx nvidia email > transition. > >> https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/ >> >> As noted in the thread, it did not work for me. > > Still relevant ? I might need to get you some tools to increase #msix > in Firmware. > not for me at the moment, but it would be good to document what a user needs to do - especially if it involves vendor specific tools and steps.
On Thu, 10 Dec 2020 15:14:18 +0100 Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer > <brouer@redhat.com> wrote: > > > > On Wed, 9 Dec 2020 08:44:33 -0700 > > David Ahern <dsahern@gmail.com> wrote: > > > > > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets > > > > (up-to 16) so it should not be a problem to solve this by sharing > > > > TX-queue and talking a lock per 16 packets. I still recommend that, > > > > for fallback case, you allocated a number a TX-queue and distribute > > > > this across CPUs to avoid hitting a congested lock (above measurements > > > > are the optimal non-congested atomic lock operation) > > > > > > I have been meaning to ask you why 16 for the XDP batching? If the > > > netdev budget is 64, why not something higher like 32 or 64? > > > > Thanks you for asking as there are multiple good reasons and > > consideration for this 16 batch size. Notice cpumap have batch size 8, > > which is also an explicit choice. And AF_XDP went in the wrong > > direction IMHO and I think have 256. I designed this to be a choice in > > the map code, for the level of bulking it needs/wants. > > FYI, as far as I know, there is nothing in AF_XDP that says bulking > should be 256. There is a 256 number in the i40e driver that states > the maximum number of packets to be sent within one napi_poll loop. > But this is just a maximum number and only for that driver. (In case > you wonder, that number was inherited from the original skb Tx > implementation in the driver.) Ah, that explains the issue I have on the production system that runs the EDT-pacer[2]. I see that i40e function i40e_clean_tx_irq() ignores napi_budget but uses it own budget, that defaults to 256. Looks like I can adjust this via ethtool -C tx-frames-irq. I turned this down to 64 (32 was giving worse results, and below 16 system acted strange). Now the issue is gone, which was that if TX-DMA completion was running (i40e_clean_tx_irq()) on the same CPU that send packets via FQ-pacer qdisc, then the pacing was not accurate, and was sending too bursty. System have already tuned "net/core/dev_weight" and RX/TX-bias to reduce bulking, as this can influence latency and the EDT-pacing accuracy. (It is a middlebox bridging VLANs and BPF-EDT tiemstamping and FQ-pacing packets to solve bursts overflowing switch ports). sudo sysctl net/core/dev_weight net.core.dev_weight = 1 net.core.dev_weight_rx_bias = 32 net.core.dev_weight_tx_bias = 1 This net.core.dev_weight_tx_bias=1 (together with dev_weight=1) cause qdisc transmit budget to become one packet, cycling through NET_TX_SOFTIRQ which consumes time and gives a little more pacing space for the packets. > The actual batch size is controlled by > the application. If it puts 1 packet in the Tx ring and calls send(), > the batch size will be 1. If it puts 128 packets in the Tx ring and > calls send(), you get a batch size of 128, and so on. It is flexible, > so you can trade-off latency with throughput in the way the > application desires. Rx batch size has also become flexible now with > the introduction of Björn's prefer_busy_poll patch set [1]. > > [1] https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.topel@gmail.com/ This looks like a cool trick, to get even more accurate packet scheduling. I played with the tunings, and could see changed behavior with mpstat, but ended up tuning it off again, as I could not measure a direct correlation with the bpftrace tools[3]. > > The low level explanation is that these 8 and 16 batch sizes are > > optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher > > with 10 elements). I'm betting on that memory backing these 8 or 16 > > packets have higher chance to remain/being in cache, and I can prefetch > > them without evicting them from cache again. In some cases the pointer > > to these packets are queued into a ptr_ring, and it is more optimal to > > write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring. > > > > The general explanation is my goal to do bulking without adding latency. > > This is explicitly stated in my presentation[1] as of Feb 2016, slide 20. > > Sure, you/we can likely make the micro-benchmarks look better by using > > 64 batch size, but that will introduce added latency and likely shoot > > our-selves in the foot for real workloads. With experience from > > bufferbloat and real networks, we know that massive TX bulking have bad > > effects. Still XDP-redirect does massive bulking (NIC flush is after > > full 64 budget) and we don't have pushback or a queue mechanism (so I > > know we are already shooting ourselves in the foot) ... Fortunately we > > now have a PhD student working on queuing for XDP. > > > > It is also important to understand that this is an adaptive bulking > > scheme, which comes from NAPI. We don't wait for packets arriving > > shortly, we pickup what NIC have available, but by only taking 8 or 16 > > packets (instead of emptying the entire RX-queue), and then spending > > some time to send them along, I'm hoping that NIC could have gotten > > some more frame. For cpumap and veth (in-some-cases) they can start to > > consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH > > signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to > > update their internal queue state (and BQL), and if it gets close to > > full they can choose to flush/doorbell the NIC earlier. When doing > > queuing for XDP we need to expose these NIC queue states, and having 4 > > calls with 16 packets (64 budget) also gives us more chances to get NIC > > queue state info which the NIC already touch. > > > > > > [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf [2] https://github.com/netoptimizer/bpf-examples/tree/master/traffic-pacing-edt/ [3] https://github.com/netoptimizer/bpf-examples/tree/master/traffic-pacing-edt/bpftrace -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
On Thu, 2020-12-10 at 08:30 -0700, David Ahern wrote: > On 12/9/20 11:48 PM, Saeed Mahameed wrote: > > On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote: > > > On 12/9/20 10:15 AM, Saeed Mahameed wrote: > > > > > My personal experience with this one is mlx5/ConnectX4-LX > > > > > with a > > > > > limit > > > > > > > > This limit was removed from mlx5 > > > > https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/ > > > > Note: you still need to use ehttool to increase from 64 to 128 > > > > or > > > > 96 in > > > > your case. > > > > > > > > > > I asked you about that commit back in May: > > > > > > > :/, sorry i missed this email, must have been the mlnx nvidia email > > transition. > > > > > https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/ > > > > > > As noted in the thread, it did not work for me. > > > > Still relevant ? I might need to get you some tools to increase > > #msix > > in Firmware. > > > > not for me at the moment, but it would be good to document what a > user > needs to do - especially if it involves vendor specific tools and > steps. Ack, Thanks for pointing that out, I will take action on this.
On Thu, 2020-12-10 at 14:32 +0100, Jesper Dangaard Brouer wrote: > On Wed, 9 Dec 2020 08:44:33 -0700 > David Ahern <dsahern@gmail.com> wrote: > > > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote: > > > But I have redesigned the ndo_xdp_xmit call to take a bulk of > > > packets > > > (up-to 16) so it should not be a problem to solve this by sharing > > > TX-queue and talking a lock per 16 packets. I still recommend > > > that, > > > for fallback case, you allocated a number a TX-queue and > > > distribute > > > this across CPUs to avoid hitting a congested lock (above > > > measurements > > > are the optimal non-congested atomic lock operation) > > > > I have been meaning to ask you why 16 for the XDP batching? If the > > netdev budget is 64, why not something higher like 32 or 64? > > Thanks you for asking as there are multiple good reasons and > consideration for this 16 batch size. Notice cpumap have batch size > 8, > which is also an explicit choice. And AF_XDP went in the wrong > direction IMHO and I think have 256. I designed this to be a choice > in > the map code, for the level of bulking it needs/wants. > > The low level explanation is that these 8 and 16 batch sizes are > optimized towards cache sizes and Intel's Line-Fill-Buffer > (prefetcher > with 10 elements). I'm betting on that memory backing these 8 or 16 > packets have higher chance to remain/being in cache, and I can > prefetch > them without evicting them from cache again. In some cases the > pointer > to these packets are queued into a ptr_ring, and it is more optimal > to > write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the > ptr_ring. > I've warned people about this once or twice on the mailing list, for example re-populating the rx ring, a common mistake is to use the napi budget, which has the exact side effects as you are explaining here Jesper ! these 8/16 numbers are used in more than one place in the stack, xdp, gro, hw buffer re-population, etc.. how can we enforce such numbers and a uniform handling in all drivers? 1. have a clear documentation ? well know defines, for people to copy? 2. for XDP we must keep track on the memory backing of the xdp bulked data as Jesper pointed out, so we always make sure whatever bulk-size we define it always remains cache friendly, especially now where people stated working on multi-buff and other features that will extend the xdp_buff and xdp_frame, do we need a selftest that maybe runs pahole to see the those data strcutre remain within reasonable format/sizes ? > The general explanation is my goal to do bulking without adding > latency. > This is explicitly stated in my presentation[1] as of Feb 2016, slide > 20. > Sure, you/we can likely make the micro-benchmarks look better by > using > 64 batch size, but that will introduce added latency and likely shoot > our-selves in the foot for real workloads. With experience from > bufferbloat and real networks, we know that massive TX bulking have > bad > effects. Still XDP-redirect does massive bulking (NIC flush is after > full 64 budget) and we don't have pushback or a queue mechanism (so I > know we are already shooting ourselves in the foot) ... Fortunately > we > now have a PhD student working on queuing for XDP. > > It is also important to understand that this is an adaptive bulking > scheme, which comes from NAPI. We don't wait for packets arriving > shortly, we pickup what NIC have available, but by only taking 8 or > 16 > packets (instead of emptying the entire RX-queue), and then spending > some time to send them along, I'm hoping that NIC could have gotten > some more frame. For cpumap and veth (in-some-cases) they can start > to > consume packets from these batches, but NIC drivers gets > XDP_XMIT_FLUSH > signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to > update their internal queue state (and BQL), and if it gets close to > full they can choose to flush/doorbell the NIC earlier. When doing > queuing for XDP we need to expose these NIC queue states, and having > 4 > calls with 16 packets (64 budget) also gives us more chances to get > NIC > queue state info which the NIC already touch. > > > [1] > https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
I would like to thank you for your time, comments, nitpicking as well as encouraging. One thing needs clarification I think, that is, that those flags describe driver static feature sets - which are read-only. They have nothing in common with driver runtime configuration change yet. Runtime change of this state can be added but it needs a new variable and it can be done later on if someone needs it. Obviously, it is not possible to make everybody happy, especially with XDP_BASE flags set. To be honest, this XDP_BASE definition is a syntactic sugar for me and I can live without it. We can either remove it completely, from which IMO we all and other developers will suffer later on, or maybe we can agree on these two helper set of flags: XDP_BASE (TX, ABORTED, PASS, DROP) and XDP_LIMITED_BASE(ABORTED,PASS_DROP). What do you think? I am also going to add a new XDP_REDIRECT_TARGET flag and retrieving XDP flags over rtnelink interface. I also think that for completeness, ethtool implementation should be kept together with rtnelink part in order to cover both ip and ethtool tools. Do I have your approval or disagreement? Please let me know. Both AF_XDP_ZEROCOPY and XDP_SOCK_ZEROCOPY are good to me. I will pick the one XDP_SOCK_ZEROCOPY unless there are protests. I don't think that HEADROOM parameter should be passed via the flags. It is by nature a number and an attempt to quantize with flags seems to be an unnatural limitation for the future. Thanks Marek
Marek Majtyka <alardam@gmail.com> writes: > I would like to thank you for your time, comments, nitpicking as well > as encouraging. > > One thing needs clarification I think, that is, that those flags > describe driver static feature sets - which are read-only. They have > nothing in common with driver runtime configuration change yet. > Runtime change of this state can be added but it needs a new variable > and it can be done later on if someone needs it. > > Obviously, it is not possible to make everybody happy, especially with > XDP_BASE flags set. To be honest, this XDP_BASE definition is a > syntactic sugar for me and I can live without it. We can either remove > it completely, from > which IMO we all and other developers will suffer later on, or maybe > we can agree on these two helper set of flags: XDP_BASE (TX, ABORTED, > PASS, DROP) and XDP_LIMITED_BASE(ABORTED,PASS_DROP). > What do you think? > > I am also going to add a new XDP_REDIRECT_TARGET flag and retrieving > XDP flags over rtnelink interface. > > I also think that for completeness, ethtool implementation should be > kept together with rtnelink part in order to cover both ip and > ethtool tools. Do I have your approval or disagreement? Please let me > know. Hi Marek I just realised that it seems no one actually replied to your email. On my part at least that was because I didn't have any objections, so I'm hoping you didn't feel the lack of response was discouraging (and that you're still working on a revision of this series)? :) -Toke
Thanks Toke, In fact, I was waiting for a single confirmation, disagreement or comment. I have it now. As there are no more comments, I am getting down to work right away. Marek On Mon, Feb 1, 2021 at 5:16 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Marek Majtyka <alardam@gmail.com> writes: > > > I would like to thank you for your time, comments, nitpicking as well > > as encouraging. > > > > One thing needs clarification I think, that is, that those flags > > describe driver static feature sets - which are read-only. They have > > nothing in common with driver runtime configuration change yet. > > Runtime change of this state can be added but it needs a new variable > > and it can be done later on if someone needs it. > > > > Obviously, it is not possible to make everybody happy, especially with > > XDP_BASE flags set. To be honest, this XDP_BASE definition is a > > syntactic sugar for me and I can live without it. We can either remove > > it completely, from > > which IMO we all and other developers will suffer later on, or maybe > > we can agree on these two helper set of flags: XDP_BASE (TX, ABORTED, > > PASS, DROP) and XDP_LIMITED_BASE(ABORTED,PASS_DROP). > > What do you think? > > > > I am also going to add a new XDP_REDIRECT_TARGET flag and retrieving > > XDP flags over rtnelink interface. > > > > I also think that for completeness, ethtool implementation should be > > kept together with rtnelink part in order to cover both ip and > > ethtool tools. Do I have your approval or disagreement? Please let me > > know. > > Hi Marek > > I just realised that it seems no one actually replied to your email. On > my part at least that was because I didn't have any objections, so I'm > hoping you didn't feel the lack of response was discouraging (and that > you're still working on a revision of this series)? :) > > -Toke >
Marek Majtyka <alardam@gmail.com> writes: > Thanks Toke, > > In fact, I was waiting for a single confirmation, disagreement or > comment. I have it now. As there are no more comments, I am getting > down to work right away. Awesome! And sorry for not replying straight away - I hate it when I send out something myself and receive no replies, so I suppose I should get better at not doing that myself :) As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I just realised I didn't reply to), I am fine with defining XDP_BASE as a shortcut for TX/ABORTED/PASS/DROP, but think we should skip XDP_LIMITED_BASE and instead require all new drivers to implement the full XDP_BASE set straight away. As long as we're talking about features *implemented* by the driver, at least; i.e., it should still be possible to *deactivate* XDP_TX if you don't want to use the HW resources, but I don't think there's much benefit from defining the LIMITED_BASE set as a shortcut for this mode... -Toke
On Tue, 02 Feb 2021 13:05:34 +0100 Toke Høiland-Jørgensen wrote: > Marek Majtyka <alardam@gmail.com> writes: > > > Thanks Toke, > > > > In fact, I was waiting for a single confirmation, disagreement or > > comment. I have it now. As there are no more comments, I am getting > > down to work right away. > > Awesome! And sorry for not replying straight away - I hate it when I > send out something myself and receive no replies, so I suppose I should > get better at not doing that myself :) > > As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I > just realised I didn't reply to), I am fine with defining XDP_BASE as a > shortcut for TX/ABORTED/PASS/DROP, but think we should skip > XDP_LIMITED_BASE and instead require all new drivers to implement the > full XDP_BASE set straight away. As long as we're talking about > features *implemented* by the driver, at least; i.e., it should still be > possible to *deactivate* XDP_TX if you don't want to use the HW > resources, but I don't think there's much benefit from defining the > LIMITED_BASE set as a shortcut for this mode... I still have mixed feelings about these flags. The first step IMO should be adding validation tests. I bet^W pray every vendor has validation tests but since they are not unified we don't know what level of interoperability we're achieving in practice. That doesn't matter for trivial feature like base actions, but we'll inevitably move on to defining more advanced capabilities and the question of "what supporting X actually mean" will come up (3 years later, when we don't remember ourselves).
On Tue, Feb 2, 2021 at 8:34 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 02 Feb 2021 13:05:34 +0100 Toke Høiland-Jørgensen wrote: > > Marek Majtyka <alardam@gmail.com> writes: > > > > > Thanks Toke, > > > > > > In fact, I was waiting for a single confirmation, disagreement or > > > comment. I have it now. As there are no more comments, I am getting > > > down to work right away. > > > > Awesome! And sorry for not replying straight away - I hate it when I > > send out something myself and receive no replies, so I suppose I should > > get better at not doing that myself :) > > > > As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I > > just realised I didn't reply to), I am fine with defining XDP_BASE as a > > shortcut for TX/ABORTED/PASS/DROP, but think we should skip > > XDP_LIMITED_BASE and instead require all new drivers to implement the > > full XDP_BASE set straight away. As long as we're talking about > > features *implemented* by the driver, at least; i.e., it should still be > > possible to *deactivate* XDP_TX if you don't want to use the HW > > resources, but I don't think there's much benefit from defining the > > LIMITED_BASE set as a shortcut for this mode... > > I still have mixed feelings about these flags. The first step IMO > should be adding validation tests. I bet^W pray every vendor has > validation tests but since they are not unified we don't know what > level of interoperability we're achieving in practice. That doesn't > matter for trivial feature like base actions, but we'll inevitably > move on to defining more advanced capabilities and the question of > "what supporting X actually mean" will come up (3 years later, when > we don't remember ourselves). I am a bit confused now. Did you mean validation tests of those XDP flags, which I am working on or some other validation tests? What should these tests verify? Can you please elaborate more on the topic, please - just a few sentences how are you see it?
On Wed, 3 Feb 2021 13:50:59 +0100 Marek Majtyka wrote: > On Tue, Feb 2, 2021 at 8:34 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Tue, 02 Feb 2021 13:05:34 +0100 Toke Høiland-Jørgensen wrote: > > > Awesome! And sorry for not replying straight away - I hate it when I > > > send out something myself and receive no replies, so I suppose I should > > > get better at not doing that myself :) > > > > > > As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I > > > just realised I didn't reply to), I am fine with defining XDP_BASE as a > > > shortcut for TX/ABORTED/PASS/DROP, but think we should skip > > > XDP_LIMITED_BASE and instead require all new drivers to implement the > > > full XDP_BASE set straight away. As long as we're talking about > > > features *implemented* by the driver, at least; i.e., it should still be > > > possible to *deactivate* XDP_TX if you don't want to use the HW > > > resources, but I don't think there's much benefit from defining the > > > LIMITED_BASE set as a shortcut for this mode... > > > > I still have mixed feelings about these flags. The first step IMO > > should be adding validation tests. I bet^W pray every vendor has > > validation tests but since they are not unified we don't know what > > level of interoperability we're achieving in practice. That doesn't > > matter for trivial feature like base actions, but we'll inevitably > > move on to defining more advanced capabilities and the question of > > "what supporting X actually mean" will come up (3 years later, when > > we don't remember ourselves). > > I am a bit confused now. Did you mean validation tests of those XDP > flags, which I am working on or some other validation tests? > What should these tests verify? Can you please elaborate more on the > topic, please - just a few sentences how are you see it? Conformance tests can be written for all features, whether they have an explicit capability in the uAPI or not. But for those that do IMO the tests should be required. Let me give you an example. This set adds a bit that says Intel NICs can do XDP_TX and XDP_REDIRECT, yet we both know of the Tx queue shenanigans. So can i40e do XDP_REDIRECT or can it not? If we have exhaustive conformance tests we can confidently answer that question. And the answer may not be "yes" or "no", it may actually be "we need more options because many implementations fall in between". I think readable (IOW not written in some insane DSL) tests can also be useful for users who want to check which features their program / deployment will require.
Jakub Kicinski <kuba@kernel.org> writes: > On Wed, 3 Feb 2021 13:50:59 +0100 Marek Majtyka wrote: >> On Tue, Feb 2, 2021 at 8:34 PM Jakub Kicinski <kuba@kernel.org> wrote: >> > On Tue, 02 Feb 2021 13:05:34 +0100 Toke Høiland-Jørgensen wrote: >> > > Awesome! And sorry for not replying straight away - I hate it when I >> > > send out something myself and receive no replies, so I suppose I should >> > > get better at not doing that myself :) >> > > >> > > As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I >> > > just realised I didn't reply to), I am fine with defining XDP_BASE as a >> > > shortcut for TX/ABORTED/PASS/DROP, but think we should skip >> > > XDP_LIMITED_BASE and instead require all new drivers to implement the >> > > full XDP_BASE set straight away. As long as we're talking about >> > > features *implemented* by the driver, at least; i.e., it should still be >> > > possible to *deactivate* XDP_TX if you don't want to use the HW >> > > resources, but I don't think there's much benefit from defining the >> > > LIMITED_BASE set as a shortcut for this mode... >> > >> > I still have mixed feelings about these flags. The first step IMO >> > should be adding validation tests. I bet^W pray every vendor has >> > validation tests but since they are not unified we don't know what >> > level of interoperability we're achieving in practice. That doesn't >> > matter for trivial feature like base actions, but we'll inevitably >> > move on to defining more advanced capabilities and the question of >> > "what supporting X actually mean" will come up (3 years later, when >> > we don't remember ourselves). >> >> I am a bit confused now. Did you mean validation tests of those XDP >> flags, which I am working on or some other validation tests? >> What should these tests verify? Can you please elaborate more on the >> topic, please - just a few sentences how are you see it? > > Conformance tests can be written for all features, whether they have > an explicit capability in the uAPI or not. But for those that do IMO > the tests should be required. > > Let me give you an example. This set adds a bit that says Intel NICs > can do XDP_TX and XDP_REDIRECT, yet we both know of the Tx queue > shenanigans. So can i40e do XDP_REDIRECT or can it not? > > If we have exhaustive conformance tests we can confidently answer that > question. And the answer may not be "yes" or "no", it may actually be > "we need more options because many implementations fall in between". > > I think readable (IOW not written in some insane DSL) tests can also > be useful for users who want to check which features their program / > deployment will require. While I do agree that that kind of conformance test would be great, I don't think it has to hold up this series (the perfect being the enemy of the good, and all that). We have a real problem today that userspace can't tell if a given driver implements, say, XDP_REDIRECT, and so people try to use it and spend days wondering which black hole their packets disappear into. And for things like container migration we need to be able to predict whether a given host supports a feature *before* we start the migration and try to use it. I view the feature flags as a list of features *implemented* by the driver. Which should be pretty static in a given kernel, but may be different than the features currently *enabled* on a given system (due to, e.g., the TX queue stuff). The simple way to expose the latter would be to just have a second set of flags indicating the current configured state; and for that I guess we should at least agree what "enabled" means; and a conformance test would be a way to do this, of course. I don't see why we can't do this in stages, though; start with the first set of flags ('implemented'), move on to the second one ('enabled'), and then to things like making the kernel react to the flags by rejecting insertion into devmaps for invalid interfaces... -Toke
On Wed, 10 Feb 2021 11:53:53 +0100 Toke Høiland-Jørgensen wrote: > >> I am a bit confused now. Did you mean validation tests of those XDP > >> flags, which I am working on or some other validation tests? > >> What should these tests verify? Can you please elaborate more on the > >> topic, please - just a few sentences how are you see it? > > > > Conformance tests can be written for all features, whether they have > > an explicit capability in the uAPI or not. But for those that do IMO > > the tests should be required. > > > > Let me give you an example. This set adds a bit that says Intel NICs > > can do XDP_TX and XDP_REDIRECT, yet we both know of the Tx queue > > shenanigans. So can i40e do XDP_REDIRECT or can it not? > > > > If we have exhaustive conformance tests we can confidently answer that > > question. And the answer may not be "yes" or "no", it may actually be > > "we need more options because many implementations fall in between". > > > > I think readable (IOW not written in some insane DSL) tests can also > > be useful for users who want to check which features their program / > > deployment will require. > > While I do agree that that kind of conformance test would be great, I > don't think it has to hold up this series (the perfect being the enemy > of the good, and all that). We have a real problem today that userspace > can't tell if a given driver implements, say, XDP_REDIRECT, and so > people try to use it and spend days wondering which black hole their > packets disappear into. And for things like container migration we need > to be able to predict whether a given host supports a feature *before* > we start the migration and try to use it. Unless you have a strong definition of what XDP_REDIRECT means the flag itself is not worth much. We're not talking about normal ethtool feature flags which are primarily stack-driven, XDP is implemented mostly by the driver, each vendor can do their own thing. Maybe I've seen one vendor incompatibility too many at my day job to hope for the best... > I view the feature flags as a list of features *implemented* by the > driver. Which should be pretty static in a given kernel, but may be > different than the features currently *enabled* on a given system (due > to, e.g., the TX queue stuff). Hm, maybe I'm not being clear enough. The way XDP_REDIRECT (your example) is implemented across drivers differs in a meaningful ways. Hence the need for conformance testing. We don't have a golden SW standard to fall back on, like we do with HW offloads. Also IDK why those tests are considered such a huge ask. As I said most vendors probably already have them, and so I'd guess do good distros. So let's work together. > The simple way to expose the latter would be to just have a second set > of flags indicating the current configured state; and for that I guess > we should at least agree what "enabled" means; and a conformance test > would be a way to do this, of course. > > I don't see why we can't do this in stages, though; start with the first > set of flags ('implemented'), move on to the second one ('enabled'), and > then to things like making the kernel react to the flags by rejecting > insertion into devmaps for invalid interfaces...
Jakub Kicinski <kuba@kernel.org> writes: > On Wed, 10 Feb 2021 11:53:53 +0100 Toke Høiland-Jørgensen wrote: >> >> I am a bit confused now. Did you mean validation tests of those XDP >> >> flags, which I am working on or some other validation tests? >> >> What should these tests verify? Can you please elaborate more on the >> >> topic, please - just a few sentences how are you see it? >> > >> > Conformance tests can be written for all features, whether they have >> > an explicit capability in the uAPI or not. But for those that do IMO >> > the tests should be required. >> > >> > Let me give you an example. This set adds a bit that says Intel NICs >> > can do XDP_TX and XDP_REDIRECT, yet we both know of the Tx queue >> > shenanigans. So can i40e do XDP_REDIRECT or can it not? >> > >> > If we have exhaustive conformance tests we can confidently answer that >> > question. And the answer may not be "yes" or "no", it may actually be >> > "we need more options because many implementations fall in between". >> > >> > I think readable (IOW not written in some insane DSL) tests can also >> > be useful for users who want to check which features their program / >> > deployment will require. >> >> While I do agree that that kind of conformance test would be great, I >> don't think it has to hold up this series (the perfect being the enemy >> of the good, and all that). We have a real problem today that userspace >> can't tell if a given driver implements, say, XDP_REDIRECT, and so >> people try to use it and spend days wondering which black hole their >> packets disappear into. And for things like container migration we need >> to be able to predict whether a given host supports a feature *before* >> we start the migration and try to use it. > > Unless you have a strong definition of what XDP_REDIRECT means the flag > itself is not worth much. We're not talking about normal ethtool feature > flags which are primarily stack-driven, XDP is implemented mostly by > the driver, each vendor can do their own thing. Maybe I've seen one > vendor incompatibility too many at my day job to hope for the best... I'm totally on board with documenting what a feature means. E.g., for XDP_REDIRECT, whether it's acceptable to fail the redirect in some situations even when it's active, or if there should always be a slow-path fallback. But I disagree that the flag is worthless without it. People are running into real issues with trying to run XDP_REDIRECT programs on a driver that doesn't support it at all, and it's incredibly confusing. The latest example popped up literally yesterday: https://lore.kernel.org/xdp-newbies/CAM-scZPPeu44FeCPGO=Qz=03CrhhfB1GdJ8FNEpPqP_G27c6mQ@mail.gmail.com/ >> I view the feature flags as a list of features *implemented* by the >> driver. Which should be pretty static in a given kernel, but may be >> different than the features currently *enabled* on a given system (due >> to, e.g., the TX queue stuff). > > Hm, maybe I'm not being clear enough. The way XDP_REDIRECT (your > example) is implemented across drivers differs in a meaningful ways. > Hence the need for conformance testing. We don't have a golden SW > standard to fall back on, like we do with HW offloads. I'm not disagreeing that we need to harmonise what "implementing a feature" means. Maybe I'm just not sure what you mean by "conformance testing"? What would that look like, specifically? A script in selftest that sets up a redirect between two interfaces that we tell people to run? Or what? How would you catch, say, that issue where if a machine has more CPUs than the NIC has TXQs things start falling apart? > Also IDK why those tests are considered such a huge ask. As I said most > vendors probably already have them, and so I'd guess do good distros. > So let's work together. I guess what I'm afraid of is that this will end up delaying or stalling a fix for a long-standing issue (which is what I consider this series as shown by the example above). Maybe you can alleviate that by expanding a bit on what you mean? -Toke
On Wed, 10 Feb 2021 23:52:39 +0100 Toke Høiland-Jørgensen wrote: > Jakub Kicinski <kuba@kernel.org> writes: > > On Wed, 10 Feb 2021 11:53:53 +0100 Toke Høiland-Jørgensen wrote: > >> While I do agree that that kind of conformance test would be great, I > >> don't think it has to hold up this series (the perfect being the enemy > >> of the good, and all that). We have a real problem today that userspace > >> can't tell if a given driver implements, say, XDP_REDIRECT, and so > >> people try to use it and spend days wondering which black hole their > >> packets disappear into. And for things like container migration we need > >> to be able to predict whether a given host supports a feature *before* > >> we start the migration and try to use it. > > > > Unless you have a strong definition of what XDP_REDIRECT means the flag > > itself is not worth much. We're not talking about normal ethtool feature > > flags which are primarily stack-driven, XDP is implemented mostly by > > the driver, each vendor can do their own thing. Maybe I've seen one > > vendor incompatibility too many at my day job to hope for the best... > > I'm totally on board with documenting what a feature means. We're trying documentation in devlink etc. and it's not that great. It's never clear and comprehensive enough, barely anyone reads it. > E.g., for > XDP_REDIRECT, whether it's acceptable to fail the redirect in some > situations even when it's active, or if there should always be a > slow-path fallback. > > But I disagree that the flag is worthless without it. People are running > into real issues with trying to run XDP_REDIRECT programs on a driver > that doesn't support it at all, and it's incredibly confusing. The > latest example popped up literally yesterday: > > https://lore.kernel.org/xdp-newbies/CAM-scZPPeu44FeCPGO=Qz=03CrhhfB1GdJ8FNEpPqP_G27c6mQ@mail.gmail.com/ To help such confusion we'd actually have to validate the program against the device caps. But perhaps I'm less concerned about a newcomer not knowing how to use things and more concerned about providing abstractions which will make programs dependably working across vendors and HW generations. > >> I view the feature flags as a list of features *implemented* by the > >> driver. Which should be pretty static in a given kernel, but may be > >> different than the features currently *enabled* on a given system (due > >> to, e.g., the TX queue stuff). > > > > Hm, maybe I'm not being clear enough. The way XDP_REDIRECT (your > > example) is implemented across drivers differs in a meaningful ways. > > Hence the need for conformance testing. We don't have a golden SW > > standard to fall back on, like we do with HW offloads. > > I'm not disagreeing that we need to harmonise what "implementing a > feature" means. Maybe I'm just not sure what you mean by "conformance > testing"? What would that look like, specifically? We developed a pretty good set of tests at my previous job for testing driver XDP as well as checking that the offload conforms to the SW behavior. I assume any vendor who takes quality seriously has comprehensive XDP tests. If those tests were upstream / common so that we could run them against every implementation - the features which are supported by a driver fall out naturally out of the set of tests which passed. And the structure of the capability API could be based on what the tests need to know to make a SKIP vs FAIL decision. Common tests would obviously also ease the validation burden, burden of writing tests on vendors and make it far easier for new implementations to be confidently submitted. > A script in selftest that sets up a redirect between two interfaces > that we tell people to run? Or what? How would you catch, say, that > issue where if a machine has more CPUs than the NIC has TXQs things > start falling apart? selftests should be a good place, but I don't mind the location. The point is having tests which anyone (vendors and users) can run to test their platforms. One of the tests should indeed test if every CPU in the platform can XDP_REDIRECT. Shouldn't it be a rather trivial combination of tun/veth, mh and taskset? > > Also IDK why those tests are considered such a huge ask. As I said most > > vendors probably already have them, and so I'd guess do good distros. > > So let's work together. > > I guess what I'm afraid of is that this will end up delaying or stalling > a fix for a long-standing issue (which is what I consider this series as > shown by the example above). Maybe you can alleviate that by expanding a > bit on what you mean? I hope what I wrote helps a little. I'm not good at explaining. Perhaps I had seen one too many vendor incompatibility to trust that adding a driver API without a validation suite will result in something usable in production settings.
On Thu, Feb 11, 2021 at 5:26 PM Jakub Kicinski <kuba@kernel.org> wrote: > > Perhaps I had seen one too many vendor incompatibility to trust that > adding a driver API without a validation suite will result in something > usable in production settings. I agree with Jakub. I don't see how extra ethtool reporting will help. Anyone who wants to know whether eth0 supports XDP_REDIRECT can already do so: ethtool -S eth0 | grep xdp_redirect I think converging on the same stat names across the drivers will make the whole thing much more user friendly than new apis.
On Fri, Feb 12, 2021 at 3:05 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Feb 11, 2021 at 5:26 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > Perhaps I had seen one too many vendor incompatibility to trust that > > adding a driver API without a validation suite will result in something > > usable in production settings. > > I agree with Jakub. I don't see how extra ethtool reporting will help. > Anyone who wants to know whether eth0 supports XDP_REDIRECT can already do so: > ethtool -S eth0 | grep xdp_redirect Doing things right can never be treated as an addition. It is the other way around. Option -S is for statistics and additionally it can show something (AFAIR there wasn't such counter xdp_redirect, it must be something new, thanks for the info). But nevertheless it cannot cover all needs IMO. Some questions worth to consider: Is this extra reporting function of statistics clearly documented in ethtool? Is this going to be clearly documented? Would it be easier for users/admins to find it? What about zero copy? Can it be available via statistics, too? What about drivers XDP transmit locking flag (latest request from Jesper)? > > I think converging on the same stat names across the drivers will make > the whole thing much more user friendly than new apis.
Marek Majtyka <alardam@gmail.com> writes: > On Fri, Feb 12, 2021 at 3:05 AM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: >> >> On Thu, Feb 11, 2021 at 5:26 PM Jakub Kicinski <kuba@kernel.org> wrote: >> > >> > Perhaps I had seen one too many vendor incompatibility to trust that >> > adding a driver API without a validation suite will result in something >> > usable in production settings. >> >> I agree with Jakub. I don't see how extra ethtool reporting will help. >> Anyone who wants to know whether eth0 supports XDP_REDIRECT can already do so: >> ethtool -S eth0 | grep xdp_redirect > > Doing things right can never be treated as an addition. It is the > other way around. Option -S is for statistics and additionally it can > show something (AFAIR there wasn't such counter xdp_redirect, it must > be something new, thanks for the info). But nevertheless it cannot > cover all needs IMO. > > Some questions worth to consider: > Is this extra reporting function of statistics clearly documented in > ethtool? Is this going to be clearly documented? Would it be easier > for users/admins to find it? > What about zero copy? Can it be available via statistics, too? > What about drivers XDP transmit locking flag (latest request from Jesper)? There is no way the statistics is enough. And saying "just grep for xdp_redirect in ethtool -S" is bordering on active hostility towards users. We need drivers to export explicit features so we can do things like: - Explicitly reject attaching a program that tries to do xdp_redirect on an interface that doesn't support it. - Prevent devices that don't implement ndo_xdp_xmit() from being inserted into a devmap (oh, and this is on thing you can't know at all from the statistics, BTW). - Expose the features in a machine-readable format (like the ethtool flags in your patch) so applications can discover in a reliable way what is available and do proper fallback if features are missing. I can accept that we need some kind of conformance test to define what each flag means (which would be kinda like a selftest for the feature flags), but we definitely need the feature flags themselves! -Toke
diff --git a/Documentation/networking/netdev-xdp-properties.rst b/Documentation/networking/netdev-xdp-properties.rst new file mode 100644 index 000000000000..4a434a1c512b --- /dev/null +++ b/Documentation/networking/netdev-xdp-properties.rst @@ -0,0 +1,42 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Netdev XDP properties +===================== + + * XDP PROPERTIES FLAGS + +Following netdev xdp properties flags can be retrieve over netlink ethtool +interface the same way as netdev feature flags. These properties flags are +read only and cannot be change in the runtime. + + +* XDP_ABORTED + +This property informs if netdev supports xdp aborted action. + +* XDP_DROP + +This property informs if netdev supports xdp drop action. + +* XDP_PASS + +This property informs if netdev supports xdp pass action. + +* XDP_TX + +This property informs if netdev supports xdp tx action. + +* XDP_REDIRECT + +This property informs if netdev supports xdp redirect action. +It assumes the all beforehand mentioned flags are enabled. + +* XDP_ZEROCOPY + +This property informs if netdev driver supports xdp zero copy. +It assumes the all beforehand mentioned flags are enabled. + +* XDP_HW_OFFLOAD + +This property informs if netdev driver supports xdp hw oflloading. diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 52d1cc2bd8a7..2544c7f0e1b7 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -43,6 +43,7 @@ #include <net/xdp.h> #include <linux/netdev_features.h> +#include <linux/xdp_properties.h> #include <linux/neighbour.h> #include <uapi/linux/netdevice.h> #include <uapi/linux/if_bonding.h> @@ -2171,6 +2172,7 @@ struct net_device { /* protected by rtnl_lock */ struct bpf_xdp_entity xdp_state[__MAX_XDP_MODE]; + xdp_properties_t xdp_properties; }; #define to_net_dev(d) container_of(d, struct net_device, dev) diff --git a/include/linux/xdp_properties.h b/include/linux/xdp_properties.h new file mode 100644 index 000000000000..c72c9bcc50de --- /dev/null +++ b/include/linux/xdp_properties.h @@ -0,0 +1,53 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Network device xdp properties. + */ +#ifndef _LINUX_XDP_PROPERTIES_H +#define _LINUX_XDP_PROPERTIES_H + +#include <linux/types.h> +#include <linux/bitops.h> +#include <asm/byteorder.h> + +typedef u64 xdp_properties_t; + +enum { + XDP_F_ABORTED_BIT, + XDP_F_DROP_BIT, + XDP_F_PASS_BIT, + XDP_F_TX_BIT, + XDP_F_REDIRECT_BIT, + XDP_F_ZEROCOPY_BIT, + XDP_F_HW_OFFLOAD_BIT, + + /* + * Add your fresh new property above and remember to update + * xdp_properties_strings [] in net/core/ethtool.c and maybe + * some xdp_properties mask #defines below. Please also describe it + * in Documentation/networking/xdp_properties.rst. + */ + + /**/XDP_PROPERTIES_COUNT +}; + +#define __XDP_F_BIT(bit) ((xdp_properties_t)1 << (bit)) +#define __XDP_F(name) __XDP_F_BIT(XDP_F_##name##_BIT) + +#define XDP_F_ABORTED __XDP_F(ABORTED) +#define XDP_F_DROP __XDP_F(DROP) +#define XDP_F_PASS __XDP_F(PASS) +#define XDP_F_TX __XDP_F(TX) +#define XDP_F_REDIRECT __XDP_F(REDIRECT) +#define XDP_F_ZEROCOPY __XDP_F(ZEROCOPY) +#define XDP_F_HW_OFFLOAD __XDP_F(HW_OFFLOAD) + +#define XDP_F_BASIC (XDP_F_ABORTED | \ + XDP_F_DROP | \ + XDP_F_PASS | \ + XDP_F_TX) + +#define XDP_F_FULL (XDP_F_BASIC | XDP_F_REDIRECT) + +#define XDP_F_FULL_ZC (XDP_F_FULL | XDP_F_ZEROCOPY) + +#endif /* _LINUX_XDP_PROPERTIES_H */ diff --git a/include/net/xdp.h b/include/net/xdp.h index 700ad5db7f5d..a9fabc1282cf 100644 --- a/include/net/xdp.h +++ b/include/net/xdp.h @@ -7,6 +7,7 @@ #define __LINUX_NET_XDP_H__ #include <linux/skbuff.h> /* skb_shared_info */ +#include <linux/xdp_properties.h> /** * DOC: XDP RX-queue information @@ -255,6 +256,100 @@ struct xdp_attachment_info { u32 flags; }; +#if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL) + +static __always_inline void +xdp_set_aborted_property(xdp_properties_t *properties) +{ + *properties |= XDP_F_ABORTED; +} + +static __always_inline void +xdp_set_pass_property(xdp_properties_t *properties) +{ + *properties |= XDP_F_PASS; +} + +static __always_inline void +xdp_set_drop_property(xdp_properties_t *properties) +{ + *properties |= XDP_F_DROP; +} + +static __always_inline void +xdp_set_tx_property(xdp_properties_t *properties) +{ + *properties |= XDP_F_TX; +} + +static __always_inline void +xdp_set_redirect_property(xdp_properties_t *properties) +{ + *properties |= XDP_F_REDIRECT; +} + +static __always_inline void +xdp_set_hw_offload_property(xdp_properties_t *properties) +{ + *properties |= XDP_F_HW_OFFLOAD; +} + +static __always_inline void +xdp_set_basic_properties(xdp_properties_t *properties) +{ + *properties |= XDP_F_BASIC; +} + +static __always_inline void +xdp_set_full_properties(xdp_properties_t *properties) +{ + *properties |= XDP_F_FULL; +} + +#else + +static __always_inline void +xdp_set_aborted_property(xdp_properties_t *properties) +{ +} + +static __always_inline void +xdp_set_pass_property(xdp_properties_t *properties) +{ +} + +static __always_inline void +xdp_set_drop_property(xdp_properties_t *properties) +{ +} + +static __always_inline void +xdp_set_tx_property(xdp_properties_t *properties) +{ +} + +static __always_inline void +xdp_set_redirect_property(xdp_properties_t *properties) +{ +} + +static __always_inline void +xdp_set_hw_offload_property(xdp_properties_t *properties) +{ +} + +static __always_inline void +xdp_set_basic_properties(xdp_properties_t *properties) +{ +} + +static __always_inline void +xdp_set_full_properties(xdp_properties_t *properties) +{ +} + +#endif + struct netdev_bpf; bool xdp_attachment_flags_ok(struct xdp_attachment_info *info, struct netdev_bpf *bpf); diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h index 4e295541e396..48a3b6d165c7 100644 --- a/include/net/xdp_sock_drv.h +++ b/include/net/xdp_sock_drv.h @@ -8,6 +8,7 @@ #include <net/xdp_sock.h> #include <net/xsk_buff_pool.h> +#include <linux/xdp_properties.h> #ifdef CONFIG_XDP_SOCKETS @@ -117,6 +118,11 @@ static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool, xp_dma_sync_for_device(pool, dma, size); } +static inline void xsk_set_zc_property(xdp_properties_t *properties) +{ + *properties |= XDP_F_ZEROCOPY; +} + #else static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries) @@ -242,6 +248,10 @@ static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool, { } +static inline void xsk_set_zc_property(xdp_properties_t *properties) +{ +} + #endif /* CONFIG_XDP_SOCKETS */ #endif /* _LINUX_XDP_SOCK_DRV_H */ diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h index 9ca87bc73c44..dfcb0e2c98b2 100644 --- a/include/uapi/linux/ethtool.h +++ b/include/uapi/linux/ethtool.h @@ -688,6 +688,7 @@ enum ethtool_stringset { ETH_SS_TS_TX_TYPES, ETH_SS_TS_RX_FILTERS, ETH_SS_UDP_TUNNEL_TYPES, + ETH_SS_XDP_PROPERTIES, /* add new constants above here */ ETH_SS_COUNT diff --git a/include/uapi/linux/xdp_properties.h b/include/uapi/linux/xdp_properties.h new file mode 100644 index 000000000000..e85be03eb707 --- /dev/null +++ b/include/uapi/linux/xdp_properties.h @@ -0,0 +1,32 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ + +/* + * Copyright (c) 2020 Intel + */ + +#ifndef __UAPI_LINUX_XDP_PROPERTIES__ +#define __UAPI_LINUX_XDP_PROPERTIES__ + +/* ETH_GSTRING_LEN define is needed. */ +#include <linux/ethtool.h> + +#define XDP_PROPERTIES_ABORTED_STR "xdp-aborted" +#define XDP_PROPERTIES_DROP_STR "xdp-drop" +#define XDP_PROPERTIES_PASS_STR "xdp-pass" +#define XDP_PROPERTIES_TX_STR "xdp-tx" +#define XDP_PROPERTIES_REDIRECT_STR "xdp-redirect" +#define XDP_PROPERTIES_ZEROCOPY_STR "xdp-zerocopy" +#define XDP_PROPERTIES_HW_OFFLOAD_STR "xdp-hw-offload" + +#define DECLARE_XDP_PROPERTIES_TABLE(name) \ + const char name[][ETH_GSTRING_LEN] = { \ + XDP_PROPERTIES_ABORTED_STR, \ + XDP_PROPERTIES_DROP_STR, \ + XDP_PROPERTIES_PASS_STR, \ + XDP_PROPERTIES_TX_STR, \ + XDP_PROPERTIES_REDIRECT_STR, \ + XDP_PROPERTIES_ZEROCOPY_STR, \ + XDP_PROPERTIES_HW_OFFLOAD_STR, \ + } + +#endif /* __UAPI_LINUX_XDP_PROPERTIES__ */ diff --git a/net/ethtool/common.c b/net/ethtool/common.c index 24036e3055a1..8f15f96b8922 100644 --- a/net/ethtool/common.c +++ b/net/ethtool/common.c @@ -4,6 +4,7 @@ #include <linux/net_tstamp.h> #include <linux/phy.h> #include <linux/rtnetlink.h> +#include <uapi/linux/xdp_properties.h> #include "common.h" @@ -283,6 +284,16 @@ const char udp_tunnel_type_names[][ETH_GSTRING_LEN] = { static_assert(ARRAY_SIZE(udp_tunnel_type_names) == __ETHTOOL_UDP_TUNNEL_TYPE_CNT); +const char xdp_properties_strings[XDP_PROPERTIES_COUNT][ETH_GSTRING_LEN] = { + [XDP_F_ABORTED_BIT] = XDP_PROPERTIES_ABORTED_STR, + [XDP_F_DROP_BIT] = XDP_PROPERTIES_DROP_STR, + [XDP_F_PASS_BIT] = XDP_PROPERTIES_PASS_STR, + [XDP_F_TX_BIT] = XDP_PROPERTIES_TX_STR, + [XDP_F_REDIRECT_BIT] = XDP_PROPERTIES_REDIRECT_STR, + [XDP_F_ZEROCOPY_BIT] = XDP_PROPERTIES_ZEROCOPY_STR, + [XDP_F_HW_OFFLOAD_BIT] = XDP_PROPERTIES_HW_OFFLOAD_STR, +}; + /* return false if legacy contained non-0 deprecated fields * maxtxpkt/maxrxpkt. rest of ksettings always updated */ diff --git a/net/ethtool/common.h b/net/ethtool/common.h index 3d9251c95a8b..85a35f8781eb 100644 --- a/net/ethtool/common.h +++ b/net/ethtool/common.h @@ -5,8 +5,10 @@ #include <linux/netdevice.h> #include <linux/ethtool.h> +#include <linux/xdp_properties.h> #define ETHTOOL_DEV_FEATURE_WORDS DIV_ROUND_UP(NETDEV_FEATURE_COUNT, 32) +#define ETHTOOL_XDP_PROPERTIES_WORDS DIV_ROUND_UP(XDP_PROPERTIES_COUNT, 32) /* compose link mode index from speed, type and duplex */ #define ETHTOOL_LINK_MODE(speed, type, duplex) \ @@ -22,6 +24,8 @@ extern const char tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN]; extern const char phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN]; +extern const char +xdp_properties_strings[XDP_PROPERTIES_COUNT][ETH_GSTRING_LEN]; extern const char link_mode_names[][ETH_GSTRING_LEN]; extern const char netif_msg_class_names[][ETH_GSTRING_LEN]; extern const char wol_mode_names[][ETH_GSTRING_LEN]; diff --git a/net/ethtool/strset.c b/net/ethtool/strset.c index 0baad0ce1832..684e751b31a9 100644 --- a/net/ethtool/strset.c +++ b/net/ethtool/strset.c @@ -80,6 +80,11 @@ static const struct strset_info info_template[] = { .count = __ETHTOOL_UDP_TUNNEL_TYPE_CNT, .strings = udp_tunnel_type_names, }, + [ETH_SS_XDP_PROPERTIES] = { + .per_dev = false, + .count = ARRAY_SIZE(xdp_properties_strings), + .strings = xdp_properties_strings, + }, }; struct strset_req_info {