diff mbox series

[v2,bpf,1/5] net: ethtool: add xdp properties flag set

Message ID 20201204102901.109709-2-marekx.majtyka@intel.com
State New
Headers show
Series New netdev feature flags for XDP | expand

Commit Message

Marek Majtyka Dec. 4, 2020, 10:28 a.m. UTC
From: Marek Majtyka <marekx.majtyka@intel.com>

Implement support for checking what kind of xdp functionality a netdev
supports. Previously, there was no way to do this other than to try
to create an AF_XDP socket on the interface or load an XDP program and see
if it worked. This commit changes this by adding a new variable which
describes all xdp supported functions on pretty detailed level:
 - aborted
 - drop
 - pass
 - tx
 - redirect
 - zero copy
 - hardware offload.

Zerocopy mode requires that redirect xdp operation is implemented
in a driver and the driver supports also zero copy mode.
Full mode requires that all xdp operation are implemented in the driver.
Basic mode is just full mode without redirect operation.

Initially, these new flags are disabled for all drivers by default.

Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
---
 .../networking/netdev-xdp-properties.rst      | 42 ++++++++
 include/linux/netdevice.h                     |  2 +
 include/linux/xdp_properties.h                | 53 +++++++++++
 include/net/xdp.h                             | 95 +++++++++++++++++++
 include/net/xdp_sock_drv.h                    | 10 ++
 include/uapi/linux/ethtool.h                  |  1 +
 include/uapi/linux/xdp_properties.h           | 32 +++++++
 net/ethtool/common.c                          | 11 +++
 net/ethtool/common.h                          |  4 +
 net/ethtool/strset.c                          |  5 +
 10 files changed, 255 insertions(+)
 create mode 100644 Documentation/networking/netdev-xdp-properties.rst
 create mode 100644 include/linux/xdp_properties.h
 create mode 100644 include/uapi/linux/xdp_properties.h

Comments

Toke Høiland-Jørgensen Dec. 4, 2020, 12:18 p.m. UTC | #1
alardam@gmail.com writes:

> From: Marek Majtyka <marekx.majtyka@intel.com>
>
> Implement support for checking what kind of xdp functionality a netdev
> supports. Previously, there was no way to do this other than to try
> to create an AF_XDP socket on the interface or load an XDP program and see
> if it worked. This commit changes this by adding a new variable which
> describes all xdp supported functions on pretty detailed level:

I like the direction this is going! :)

>  - aborted
>  - drop
>  - pass
>  - tx
>  - redirect

Drivers can in principle implement support for the XDP_REDIRECT return
code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit()
for being the *target* of a redirect. While my quick grepping doesn't
turn up any drivers that do only one of these right now, I think we've
had examples of it in the past, so it would probably be better to split
the redirect feature flag in two.

This would also make it trivial to replace the check in __xdp_enqueue()
(in devmap.c) from looking at whether the ndo is defined, and just
checking the flag. It would be great if you could do this as part of
this series.

Maybe we could even make the 'redirect target' flag be set automatically
if a driver implements ndo_xdp_xmit?

>  - zero copy
>  - hardware offload.
>
> Zerocopy mode requires that redirect xdp operation is implemented
> in a driver and the driver supports also zero copy mode.
> Full mode requires that all xdp operation are implemented in the driver.
> Basic mode is just full mode without redirect operation.
>
> Initially, these new flags are disabled for all drivers by default.
>
> Signed-off-by: Marek Majtyka <marekx.majtyka@intel.com>
> ---
>  .../networking/netdev-xdp-properties.rst      | 42 ++++++++
>  include/linux/netdevice.h                     |  2 +
>  include/linux/xdp_properties.h                | 53 +++++++++++
>  include/net/xdp.h                             | 95 +++++++++++++++++++
>  include/net/xdp_sock_drv.h                    | 10 ++
>  include/uapi/linux/ethtool.h                  |  1 +
>  include/uapi/linux/xdp_properties.h           | 32 +++++++
>  net/ethtool/common.c                          | 11 +++
>  net/ethtool/common.h                          |  4 +
>  net/ethtool/strset.c                          |  5 +
>  10 files changed, 255 insertions(+)
>  create mode 100644 Documentation/networking/netdev-xdp-properties.rst
>  create mode 100644 include/linux/xdp_properties.h
>  create mode 100644 include/uapi/linux/xdp_properties.h
>
> diff --git a/Documentation/networking/netdev-xdp-properties.rst b/Documentation/networking/netdev-xdp-properties.rst
> new file mode 100644
> index 000000000000..4a434a1c512b
> --- /dev/null
> +++ b/Documentation/networking/netdev-xdp-properties.rst
> @@ -0,0 +1,42 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=====================
> +Netdev XDP properties
> +=====================
> +
> + * XDP PROPERTIES FLAGS
> +
> +Following netdev xdp properties flags can be retrieve over netlink ethtool
> +interface the same way as netdev feature flags. These properties flags are
> +read only and cannot be change in the runtime.
> +
> +
> +*  XDP_ABORTED
> +
> +This property informs if netdev supports xdp aborted action.
> +
> +*  XDP_DROP
> +
> +This property informs if netdev supports xdp drop action.
> +
> +*  XDP_PASS
> +
> +This property informs if netdev supports xdp pass action.
> +
> +*  XDP_TX
> +
> +This property informs if netdev supports xdp tx action.
> +
> +*  XDP_REDIRECT
> +
> +This property informs if netdev supports xdp redirect action.
> +It assumes the all beforehand mentioned flags are enabled.
> +
> +*  XDP_ZEROCOPY
> +
> +This property informs if netdev driver supports xdp zero copy.
> +It assumes the all beforehand mentioned flags are enabled.

Nit: I think 'XDP_ZEROCOPY' can lead people to think that this is
zero-copy support for all XDP operations, which is obviously not the
case. So maybe 'XDP_SOCK_ZEROCOPY' (and update the description to
mention AF_XDP sockets explicitly)?

-Toke
Daniel Borkmann Dec. 4, 2020, 3:21 p.m. UTC | #2
On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:
> On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:
>> alardam@gmail.com writes:
>>> From: Marek Majtyka <marekx.majtyka@intel.com>
>>>
>>> Implement support for checking what kind of xdp functionality a netdev
>>> supports. Previously, there was no way to do this other than to try
>>> to create an AF_XDP socket on the interface or load an XDP program and see
>>> if it worked. This commit changes this by adding a new variable which
>>> describes all xdp supported functions on pretty detailed level:
>>
>> I like the direction this is going! :)
>>
>>>   - aborted
>>>   - drop
>>>   - pass
>>>   - tx

I strongly think we should _not_ merge any native XDP driver patchset that does
not support/implement the above return codes. Could we instead group them together
and call this something like XDP_BASE functionality to not give a wrong impression?
If this is properly documented that these are basic must-have _requirements_, then
users and driver developers both know what the expectations are.

>>>   - redirect
>>
>> Drivers can in principle implement support for the XDP_REDIRECT return
>> code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit()
>> for being the *target* of a redirect. While my quick grepping doesn't
>> turn up any drivers that do only one of these right now, I think we've
>> had examples of it in the past, so it would probably be better to split
>> the redirect feature flag in two.
>>
>> This would also make it trivial to replace the check in __xdp_enqueue()
>> (in devmap.c) from looking at whether the ndo is defined, and just
>> checking the flag. It would be great if you could do this as part of
>> this series.
>>
>> Maybe we could even make the 'redirect target' flag be set automatically
>> if a driver implements ndo_xdp_xmit?
> 
> +1
> 
>>>   - zero copy
>>>   - hardware offload.

One other thing that is quite annoying to figure out sometimes and not always
obvious from reading the driver code (and it may even differ depending on how
the driver was built :/) is how much XDP headroom a driver really provides.

We tried to standardize on a minimum guaranteed amount, but unfortunately not
everyone seems to implement it, but I think it would be very useful to query
this from application side, for example, consider that an app inserts a BPF
prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
know which of the different encaps it implements are realistically possible on
the underlying XDP supported dev.

Thanks,
Daniel
Toke Høiland-Jørgensen Dec. 4, 2020, 5:20 p.m. UTC | #3
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:
>> On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:
>>> alardam@gmail.com writes:
>>>> From: Marek Majtyka <marekx.majtyka@intel.com>
>>>>
>>>> Implement support for checking what kind of xdp functionality a netdev
>>>> supports. Previously, there was no way to do this other than to try
>>>> to create an AF_XDP socket on the interface or load an XDP program and see
>>>> if it worked. This commit changes this by adding a new variable which
>>>> describes all xdp supported functions on pretty detailed level:
>>>
>>> I like the direction this is going! :)
>>>
>>>>   - aborted
>>>>   - drop
>>>>   - pass
>>>>   - tx
>
> I strongly think we should _not_ merge any native XDP driver patchset
> that does not support/implement the above return codes. Could we
> instead group them together and call this something like XDP_BASE
> functionality to not give a wrong impression? If this is properly
> documented that these are basic must-have _requirements_, then users
> and driver developers both know what the expectations are.

I think there may have been drivers that only did DROP/PASS on first
merge; but adding TX to the "base set" is fine by me, as long as it's
actually enforced ;)

(As in, we originally said the same thing about the full feature set and
that never really worked out).

>>>>   - redirect
>>>
>>> Drivers can in principle implement support for the XDP_REDIRECT return
>>> code (and calling xdp_do_redirect()) without implementing ndo_xdp_xmit()
>>> for being the *target* of a redirect. While my quick grepping doesn't
>>> turn up any drivers that do only one of these right now, I think we've
>>> had examples of it in the past, so it would probably be better to split
>>> the redirect feature flag in two.
>>>
>>> This would also make it trivial to replace the check in __xdp_enqueue()
>>> (in devmap.c) from looking at whether the ndo is defined, and just
>>> checking the flag. It would be great if you could do this as part of
>>> this series.
>>>
>>> Maybe we could even make the 'redirect target' flag be set automatically
>>> if a driver implements ndo_xdp_xmit?
>> 
>> +1
>> 
>>>>   - zero copy
>>>>   - hardware offload.
>
> One other thing that is quite annoying to figure out sometimes and not always
> obvious from reading the driver code (and it may even differ depending on how
> the driver was built :/) is how much XDP headroom a driver really provides.
>
> We tried to standardize on a minimum guaranteed amount, but unfortunately not
> everyone seems to implement it, but I think it would be very useful to query
> this from application side, for example, consider that an app inserts a BPF
> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
> know which of the different encaps it implements are realistically possible on
> the underlying XDP supported dev.

How many distinct values are there in reality? Enough to express this in
a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need
an additional field to get the exact value? If we implement the latter
we also run the risk of people actually implementing all sorts of weird
values, whereas if we constrain it to a few distinct values it's easier
to push back against adding new values (as it'll be obvious from the
addition of new flags).

-Toke
Daniel Borkmann Dec. 4, 2020, 10:19 p.m. UTC | #4
On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote:
> Daniel Borkmann <daniel@iogearbox.net> writes:
[...]
>> We tried to standardize on a minimum guaranteed amount, but unfortunately not
>> everyone seems to implement it, but I think it would be very useful to query
>> this from application side, for example, consider that an app inserts a BPF
>> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to
>> know which of the different encaps it implements are realistically possible on
>> the underlying XDP supported dev.
> 
> How many distinct values are there in reality? Enough to express this in
> a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need
> an additional field to get the exact value? If we implement the latter
> we also run the risk of people actually implementing all sorts of weird
> values, whereas if we constrain it to a few distinct values it's easier
> to push back against adding new values (as it'll be obvious from the
> addition of new flags).

It's not everywhere straight forward to determine unfortunately, see also [0,1]
as some data points where Jesper looked into in the past, so in some cases it
might differ depending on the build/runtime config..

   [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/
   [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/
Jesper Dangaard Brouer Dec. 7, 2020, 11:54 a.m. UTC | #5
On Fri, 4 Dec 2020 23:19:55 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote:

> > Daniel Borkmann <daniel@iogearbox.net> writes:  

> [...]

> >> We tried to standardize on a minimum guaranteed amount, but unfortunately not

> >> everyone seems to implement it, but I think it would be very useful to query

> >> this from application side, for example, consider that an app inserts a BPF

> >> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to

> >> know which of the different encaps it implements are realistically possible on

> >> the underlying XDP supported dev.  

> > 

> > How many distinct values are there in reality? Enough to express this in

> > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need

> > an additional field to get the exact value? If we implement the latter

> > we also run the risk of people actually implementing all sorts of weird

> > values, whereas if we constrain it to a few distinct values it's easier

> > to push back against adding new values (as it'll be obvious from the

> > addition of new flags).  

> 

> It's not everywhere straight forward to determine unfortunately, see also [0,1]

> as some data points where Jesper looked into in the past, so in some cases it

> might differ depending on the build/runtime config..

> 

>    [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/

>    [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/


Yes, unfortunately drivers have already gotten creative in this area,
and variations have sneaked in.  I remember that we were forced to
allow SFC driver to use 128 bytes headroom, to avoid a memory
corruption. I tried hard to have the minimum 192 bytes as it is 3
cachelines, but I failed to enforce this.

It might be valuable to expose info on the drivers headroom size, as
this will allow end-users to take advantage of this (instead of having
to use the lowest common headroom) and up-front in userspace rejecting
to load on e.g. SFC that have this annoying limitation.

BUT thinking about what the drivers headroom size MEANS to userspace,
I'm not sure it is wise to give this info to userspace.  The
XDP-headroom is used for several kernel internal things, that limit the
available space for growing packet-headroom.  E.g. (1) xdp_frame is
something that we likely need to grow (even-though I'm pushing back),
E.g. (2) metadata area which Saeed is looking to populate from driver
code (also reduce packet-headroom for encap-headers).  So, userspace
cannot use the XDP-headroom size to much...

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Toke Høiland-Jørgensen Dec. 7, 2020, 12:03 p.m. UTC | #6
Daniel Borkmann <daniel@iogearbox.net> writes:

> On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote:

>> Daniel Borkmann <daniel@iogearbox.net> writes:

> [...]

>>> We tried to standardize on a minimum guaranteed amount, but unfortunately not

>>> everyone seems to implement it, but I think it would be very useful to query

>>> this from application side, for example, consider that an app inserts a BPF

>>> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to

>>> know which of the different encaps it implements are realistically possible on

>>> the underlying XDP supported dev.

>> 

>> How many distinct values are there in reality? Enough to express this in

>> a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need

>> an additional field to get the exact value? If we implement the latter

>> we also run the risk of people actually implementing all sorts of weird

>> values, whereas if we constrain it to a few distinct values it's easier

>> to push back against adding new values (as it'll be obvious from the

>> addition of new flags).

>

> It's not everywhere straight forward to determine unfortunately, see also [0,1]

> as some data points where Jesper looked into in the past, so in some cases it

> might differ depending on the build/runtime config..

>

>    [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/

>    [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/


Right, well in that case maybe we should just expose the actual headroom
as a separate netlink attribute? Although I suppose that would require
another round of driver changes since Jesper's patch you linked above
only puts this into xdp_buff at XDP program runtime.

Jesper, WDYT?

-Toke
Toke Høiland-Jørgensen Dec. 7, 2020, 12:08 p.m. UTC | #7
Jesper Dangaard Brouer <brouer@redhat.com> writes:

> On Fri, 4 Dec 2020 23:19:55 +0100

> Daniel Borkmann <daniel@iogearbox.net> wrote:

>

>> On 12/4/20 6:20 PM, Toke Høiland-Jørgensen wrote:

>> > Daniel Borkmann <daniel@iogearbox.net> writes:  

>> [...]

>> >> We tried to standardize on a minimum guaranteed amount, but unfortunately not

>> >> everyone seems to implement it, but I think it would be very useful to query

>> >> this from application side, for example, consider that an app inserts a BPF

>> >> prog at XDP doing custom encap shortly before XDP_TX so it would be useful to

>> >> know which of the different encaps it implements are realistically possible on

>> >> the underlying XDP supported dev.  

>> > 

>> > How many distinct values are there in reality? Enough to express this in

>> > a few flags (XDP_HEADROOM_128, XDP_HEADROOM_192, etc?), or does it need

>> > an additional field to get the exact value? If we implement the latter

>> > we also run the risk of people actually implementing all sorts of weird

>> > values, whereas if we constrain it to a few distinct values it's easier

>> > to push back against adding new values (as it'll be obvious from the

>> > addition of new flags).  

>> 

>> It's not everywhere straight forward to determine unfortunately, see also [0,1]

>> as some data points where Jesper looked into in the past, so in some cases it

>> might differ depending on the build/runtime config..

>> 

>>    [0] https://lore.kernel.org/bpf/158945314698.97035.5286827951225578467.stgit@firesoul/

>>    [1] https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul/

>

> Yes, unfortunately drivers have already gotten creative in this area,

> and variations have sneaked in.  I remember that we were forced to

> allow SFC driver to use 128 bytes headroom, to avoid a memory

> corruption. I tried hard to have the minimum 192 bytes as it is 3

> cachelines, but I failed to enforce this.

>

> It might be valuable to expose info on the drivers headroom size, as

> this will allow end-users to take advantage of this (instead of having

> to use the lowest common headroom) and up-front in userspace rejecting

> to load on e.g. SFC that have this annoying limitation.

>

> BUT thinking about what the drivers headroom size MEANS to userspace,

> I'm not sure it is wise to give this info to userspace.  The

> XDP-headroom is used for several kernel internal things, that limit the

> available space for growing packet-headroom.  E.g. (1) xdp_frame is

> something that we likely need to grow (even-though I'm pushing back),

> E.g. (2) metadata area which Saeed is looking to populate from driver

> code (also reduce packet-headroom for encap-headers).  So, userspace

> cannot use the XDP-headroom size to much...


(Ah, you had already replied, sorry seems I missed that).

Can we calculate a number from the headroom that is meaningful for
userspace? I suppose that would be "total number of bytes available for
metadata+packet extension"? Even with growing data structures, any
particular kernel should be able to inform userspace of the current
value, no?

-Toke
Jesper Dangaard Brouer Dec. 7, 2020, 12:54 p.m. UTC | #8
On Fri, 4 Dec 2020 16:21:08 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:

> > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:  

> >> alardam@gmail.com writes:  

> >>> From: Marek Majtyka <marekx.majtyka@intel.com>

> >>>

> >>> Implement support for checking what kind of xdp functionality a netdev

> >>> supports. Previously, there was no way to do this other than to try

> >>> to create an AF_XDP socket on the interface or load an XDP program and see

> >>> if it worked. This commit changes this by adding a new variable which

> >>> describes all xdp supported functions on pretty detailed level:  

> >>

> >> I like the direction this is going! :)


(Me too, don't get discouraged by our nitpicking, keep working on this! :-))

> >>  

> >>>   - aborted

> >>>   - drop

> >>>   - pass

> >>>   - tx  

> 

> I strongly think we should _not_ merge any native XDP driver patchset

> that does not support/implement the above return codes. 


I agree, with above statement.

> Could we instead group them together and call this something like

> XDP_BASE functionality to not give a wrong impression?


I disagree.  I can accept that XDP_BASE include aborted+drop+pass.

I think we need to keep XDP_TX action separate, because I think that
there are use-cases where the we want to disable XDP_TX due to end-user
policy or hardware limitations.

Use-case(1): Cloud-provider want to give customers (running VMs) ability
to load XDP program for DDoS protection (only), but don't want to allow
customer to use XDP_TX (that can implement LB or cheat their VM
isolation policy).

Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue
resources, as the use-case is only DDoS.  Today we have this problem
with the ixgbe hardware, that cannot load XDP programs on systems with
more than 192 CPUs.


> If this is properly documented that these are basic must-have

> _requirements_, then users and driver developers both know what the

> expectations are.


We can still document that XDP_TX is a must-have requirement, when a
driver implements XDP.


> >>>   - redirect  

> >>



-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
John Fastabend Dec. 7, 2020, 8:52 p.m. UTC | #9
Jesper Dangaard Brouer wrote:
> On Fri, 4 Dec 2020 16:21:08 +0100

> Daniel Borkmann <daniel@iogearbox.net> wrote:

> 

> > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:

> > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:  

> > >> alardam@gmail.com writes:  

> > >>> From: Marek Majtyka <marekx.majtyka@intel.com>

> > >>>

> > >>> Implement support for checking what kind of xdp functionality a netdev

> > >>> supports. Previously, there was no way to do this other than to try

> > >>> to create an AF_XDP socket on the interface or load an XDP program and see

> > >>> if it worked. This commit changes this by adding a new variable which

> > >>> describes all xdp supported functions on pretty detailed level:  

> > >>

> > >> I like the direction this is going! :)

> 

> (Me too, don't get discouraged by our nitpicking, keep working on this! :-))

> 

> > >>  

> > >>>   - aborted

> > >>>   - drop

> > >>>   - pass

> > >>>   - tx  

> > 

> > I strongly think we should _not_ merge any native XDP driver patchset

> > that does not support/implement the above return codes. 

> 

> I agree, with above statement.

> 

> > Could we instead group them together and call this something like

> > XDP_BASE functionality to not give a wrong impression?

> 

> I disagree.  I can accept that XDP_BASE include aborted+drop+pass.

> 

> I think we need to keep XDP_TX action separate, because I think that

> there are use-cases where the we want to disable XDP_TX due to end-user

> policy or hardware limitations.


How about we discover this at load time though. Meaning if the program
doesn't use XDP_TX then the hardware can skip resource allocations for
it. I think we could have verifier or extra pass discover the use of
XDP_TX and then pass a bit down to driver to enable/disable TX caps.

> 

> Use-case(1): Cloud-provider want to give customers (running VMs) ability

> to load XDP program for DDoS protection (only), but don't want to allow

> customer to use XDP_TX (that can implement LB or cheat their VM

> isolation policy).


Not following. What interface do they want to allow loading on? If its
the VM interface then I don't see how it matters. From outside the
VM there should be no way to discover if its done in VM or in tc or
some other stack.

If its doing some onloading/offloading I would assume they need to
ensure the isolation, etc. is still maintained because you can't
let one VMs program work on other VMs packets safely.

So what did I miss, above doesn't make sense to me.

> 

> Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue

> resources, as the use-case is only DDoS.  Today we have this problem

> with the ixgbe hardware, that cannot load XDP programs on systems with

> more than 192 CPUs.


The ixgbe issues is just a bug or missing-feature in my opinion.

I think we just document that XDP_TX consumes resources and if users
care they shouldn't use XD_TX in programs and in that case hardware
should via program discovery not allocate the resource. This seems
cleaner in my opinion then more bits for features.

> 

> 

> > If this is properly documented that these are basic must-have

> > _requirements_, then users and driver developers both know what the

> > expectations are.

> 

> We can still document that XDP_TX is a must-have requirement, when a

> driver implements XDP.


+1

> 

> 

> > >>>   - redirect  

> > >>

> 

> 

> -- 

> Best regards,

>   Jesper Dangaard Brouer

>   MSc.CS, Principal Kernel Engineer at Red Hat

>   LinkedIn: http://www.linkedin.com/in/brouer

>
Saeed Mahameed Dec. 7, 2020, 10:38 p.m. UTC | #10
On Mon, 2020-12-07 at 12:52 -0800, John Fastabend wrote:
> Jesper Dangaard Brouer wrote:

> > On Fri, 4 Dec 2020 16:21:08 +0100

> > Daniel Borkmann <daniel@iogearbox.net> wrote:

> > 

> > > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:

> > > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-

> > > > Jørgensen wrote:  

> > > > > alardam@gmail.com writes:  

> > > > > > From: Marek Majtyka <marekx.majtyka@intel.com>

> > > > > > 

> > > > > > Implement support for checking what kind of xdp

> > > > > > functionality a netdev

> > > > > > supports. Previously, there was no way to do this other

> > > > > > than to try

> > > > > > to create an AF_XDP socket on the interface or load an XDP

> > > > > > program and see

> > > > > > if it worked. This commit changes this by adding a new

> > > > > > variable which

> > > > > > describes all xdp supported functions on pretty detailed

> > > > > > level:  

> > > > > 

> > > > > I like the direction this is going! :)

> > 

> > (Me too, don't get discouraged by our nitpicking, keep working on

> > this! :-))

> > 

> > > > >  

> > > > > >   - aborted

> > > > > >   - drop

> > > > > >   - pass

> > > > > >   - tx  

> > > 

> > > I strongly think we should _not_ merge any native XDP driver

> > > patchset

> > > that does not support/implement the above return codes. 

> > 

> > I agree, with above statement.

> > 

> > > Could we instead group them together and call this something like

> > > XDP_BASE functionality to not give a wrong impression?

> > 

> > I disagree.  I can accept that XDP_BASE include aborted+drop+pass.

> > 

XDP_BASE is a weird name i vote:  
XDP_FLAG_RX,
XDP_FLAG_TX,
XDP_FLAG_REDIRECT,
XDP_FLAG_AF_XDP,
XDP_FLAG_AFXDP_ZC

> > I think we need to keep XDP_TX action separate, because I think

> > that

> > there are use-cases where the we want to disable XDP_TX due to end-

> > user

> > policy or hardware limitations.

> 

> How about we discover this at load time though. Meaning if the

> program

> doesn't use XDP_TX then the hardware can skip resource allocations

> for

> it. I think we could have verifier or extra pass discover the use of

> XDP_TX and then pass a bit down to driver to enable/disable TX caps.

> 


+1, how about we also attach some attributes to the program that would
tell the kernel/driver how to prepare configure itself for the new
program ?

Attributes like how much headroom the program needs, what meta data
driver must provide, should the driver do csum on tx, etc .. 

some attribute can be extracted from the byte code/logic others are
stated explicitly in some predefined section in the XDP prog itself.

On a second thought, this could be disruptive, users will eventually
want to replace XDP progs, and they might want a persistent config
prior to loading/reloading any prog to avoid reconfigs (packet drops)
between progs.

> > Use-case(1): Cloud-provider want to give customers (running VMs)

> > ability

> > to load XDP program for DDoS protection (only), but don't want to

> > allow

> > customer to use XDP_TX (that can implement LB or cheat their VM

> > isolation policy).

> 

> Not following. What interface do they want to allow loading on? If

> its

> the VM interface then I don't see how it matters. From outside the

> VM there should be no way to discover if its done in VM or in tc or

> some other stack.

> 

> If its doing some onloading/offloading I would assume they need to

> ensure the isolation, etc. is still maintained because you can't

> let one VMs program work on other VMs packets safely.

> 

> So what did I miss, above doesn't make sense to me.

> 

> > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue

> > resources, as the use-case is only DDoS.  Today we have this

> > problem

> > with the ixgbe hardware, that cannot load XDP programs on systems

> > with

> > more than 192 CPUs.

> 

> The ixgbe issues is just a bug or missing-feature in my opinion.

> 

> I think we just document that XDP_TX consumes resources and if users

> care they shouldn't use XD_TX in programs and in that case hardware

> should via program discovery not allocate the resource. This seems

> cleaner in my opinion then more bits for features.

> 

> > 

> > > If this is properly documented that these are basic must-have

> > > _requirements_, then users and driver developers both know what

> > > the

> > > expectations are.

> > 

> > We can still document that XDP_TX is a must-have requirement, when

> > a

> > driver implements XDP.

> 

> +1

> 


Ho about xdp redirect ? 
do we still need to load a no-op program on the egress netdev so it
would allocate the xdp tx/redirect queues ? 

Adding the above discovery feature will break xdp redirect native mode
and will require to have a special flag for xdp_redirect, so it
actually makes more sense to have a unique knob to turn on XDP tx, for
the redirect use case.
Maciej Fijalkowski Dec. 7, 2020, 11:07 p.m. UTC | #11
On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:
> Jesper Dangaard Brouer wrote:

> > On Fri, 4 Dec 2020 16:21:08 +0100

> > Daniel Borkmann <daniel@iogearbox.net> wrote:

> > 

> > > On 12/4/20 1:46 PM, Maciej Fijalkowski wrote:

> > > > On Fri, Dec 04, 2020 at 01:18:31PM +0100, Toke Høiland-Jørgensen wrote:  

> > > >> alardam@gmail.com writes:  

> > > >>> From: Marek Majtyka <marekx.majtyka@intel.com>

> > > >>>

> > > >>> Implement support for checking what kind of xdp functionality a netdev

> > > >>> supports. Previously, there was no way to do this other than to try

> > > >>> to create an AF_XDP socket on the interface or load an XDP program and see

> > > >>> if it worked. This commit changes this by adding a new variable which

> > > >>> describes all xdp supported functions on pretty detailed level:  

> > > >>

> > > >> I like the direction this is going! :)

> > 

> > (Me too, don't get discouraged by our nitpicking, keep working on this! :-))

> > 

> > > >>  

> > > >>>   - aborted

> > > >>>   - drop

> > > >>>   - pass

> > > >>>   - tx  

> > > 

> > > I strongly think we should _not_ merge any native XDP driver patchset

> > > that does not support/implement the above return codes. 

> > 

> > I agree, with above statement.

> > 

> > > Could we instead group them together and call this something like

> > > XDP_BASE functionality to not give a wrong impression?

> > 

> > I disagree.  I can accept that XDP_BASE include aborted+drop+pass.

> > 

> > I think we need to keep XDP_TX action separate, because I think that

> > there are use-cases where the we want to disable XDP_TX due to end-user

> > policy or hardware limitations.

> 

> How about we discover this at load time though. Meaning if the program

> doesn't use XDP_TX then the hardware can skip resource allocations for

> it. I think we could have verifier or extra pass discover the use of

> XDP_TX and then pass a bit down to driver to enable/disable TX caps.


+1

> 

> > 

> > Use-case(1): Cloud-provider want to give customers (running VMs) ability

> > to load XDP program for DDoS protection (only), but don't want to allow

> > customer to use XDP_TX (that can implement LB or cheat their VM

> > isolation policy).

> 

> Not following. What interface do they want to allow loading on? If its

> the VM interface then I don't see how it matters. From outside the

> VM there should be no way to discover if its done in VM or in tc or

> some other stack.

> 

> If its doing some onloading/offloading I would assume they need to

> ensure the isolation, etc. is still maintained because you can't

> let one VMs program work on other VMs packets safely.

> 

> So what did I miss, above doesn't make sense to me.

> 

> > 

> > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue

> > resources, as the use-case is only DDoS.  Today we have this problem

> > with the ixgbe hardware, that cannot load XDP programs on systems with

> > more than 192 CPUs.

> 

> The ixgbe issues is just a bug or missing-feature in my opinion.


Not a bug, rather HW limitation?

> 

> I think we just document that XDP_TX consumes resources and if users

> care they shouldn't use XD_TX in programs and in that case hardware

> should via program discovery not allocate the resource. This seems

> cleaner in my opinion then more bits for features.


But what if I'm with some limited HW that actually has a support for XDP
and I would like to utilize XDP_TX?

Not all drivers that support XDP consume Tx resources. Recently igb got
support and it shares Tx queues between netstack and XDP.

I feel like we should have a sort-of best effort approach in case we
stumble upon the XDP_TX in prog being loaded and query the driver if it
would be able to provide the Tx resources on the current system, given
that normally we tend to have a queue per core.

In that case igb would say yes, ixgbe would say no and prog would be
rejected.

> 

> > 

> > 

> > > If this is properly documented that these are basic must-have

> > > _requirements_, then users and driver developers both know what the

> > > expectations are.

> > 

> > We can still document that XDP_TX is a must-have requirement, when a

> > driver implements XDP.

> 

> +1

> 

> > 

> > 

> > > >>>   - redirect  

> > > >>

> > 

> > 

> > -- 

> > Best regards,

> >   Jesper Dangaard Brouer

> >   MSc.CS, Principal Kernel Engineer at Red Hat

> >   LinkedIn: http://www.linkedin.com/in/brouer

> > 

> 

>
David Ahern Dec. 8, 2020, 1:01 a.m. UTC | #12
On 12/7/20 1:52 PM, John Fastabend wrote:
>>

>> I think we need to keep XDP_TX action separate, because I think that

>> there are use-cases where the we want to disable XDP_TX due to end-user

>> policy or hardware limitations.

> 

> How about we discover this at load time though. Meaning if the program

> doesn't use XDP_TX then the hardware can skip resource allocations for

> it. I think we could have verifier or extra pass discover the use of

> XDP_TX and then pass a bit down to driver to enable/disable TX caps.

> 


This was discussed in the context of virtio_net some months back - it is
hard to impossible to know a program will not return XDP_TX (e.g., value
comes from a map).

Flipping that around, what if the program attach indicates whether
XDP_TX could be returned. If so, driver manages the resource needs. If
not, no resource needed and if the program violates that and returns
XDP_TX the packet is dropped.
Jesper Dangaard Brouer Dec. 8, 2020, 8:28 a.m. UTC | #13
On Mon, 7 Dec 2020 18:01:00 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 12/7/20 1:52 PM, John Fastabend wrote:

> >>

> >> I think we need to keep XDP_TX action separate, because I think that

> >> there are use-cases where the we want to disable XDP_TX due to end-user

> >> policy or hardware limitations.  

> > 

> > How about we discover this at load time though. 


Nitpick at XDP "attach" time. The general disconnect between BPF and
XDP is that BPF can verify at "load" time (as kernel knows what it
support) while XDP can have different support/features per driver, and
cannot do this until attachment time. (See later issue with tail calls).
(All other BPF-hooks don't have this issue)

> > Meaning if the program

> > doesn't use XDP_TX then the hardware can skip resource allocations for

> > it. I think we could have verifier or extra pass discover the use of

> > XDP_TX and then pass a bit down to driver to enable/disable TX caps.

> >   

> 

> This was discussed in the context of virtio_net some months back - it is

> hard to impossible to know a program will not return XDP_TX (e.g., value

> comes from a map).


It is hard, and sometimes not possible.  For maps the workaround is
that BPF-programmer adds a bound check on values from the map. If not
doing that the verifier have to assume all possible return codes are
used by BPF-prog.

The real nemesis is program tail calls, that can be added dynamically
after the XDP program is attached.  It is at attachment time that
changing the NIC resources is possible.  So, for program tail calls the
verifier have to assume all possible return codes are used by BPF-prog.

BPF now have function calls and function replace right(?)  How does
this affect this detection of possible return codes?


> Flipping that around, what if the program attach indicates whether

> XDP_TX could be returned. If so, driver manages the resource needs. If

> not, no resource needed and if the program violates that and returns

> XDP_TX the packet is dropped.


I do like this idea, as IMHO we do need something that is connected
with the BPF-prog, that describe what resources the program request
(either like above via detecting this in verifier, or simply manually
configuring this in the BPF-prog ELF file)

The main idea is that we all (I assume) want to provide a better
end-user interface/experience. By direct feedback to the end-user that
"loading+attaching" this XDP BPF-prog will not work, as e.g. driver
don't support a specific return code.  Thus, we need to reject
"loading+attaching" if features cannot be satisfied.

We need a solution as; today it is causing frustration for end-users
that packets can be (silently) dropped by XDP, e.g. if driver don't
support XDP_REDIRECT.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Jesper Dangaard Brouer Dec. 8, 2020, 9 a.m. UTC | #14
On Mon, 07 Dec 2020 12:52:22 -0800
John Fastabend <john.fastabend@gmail.com> wrote:

> > Use-case(1): Cloud-provider want to give customers (running VMs) ability

> > to load XDP program for DDoS protection (only), but don't want to allow

> > customer to use XDP_TX (that can implement LB or cheat their VM

> > isolation policy).  

> 

> Not following. What interface do they want to allow loading on? If its

> the VM interface then I don't see how it matters. From outside the

> VM there should be no way to discover if its done in VM or in tc or

> some other stack.

> 

> If its doing some onloading/offloading I would assume they need to

> ensure the isolation, etc. is still maintained because you can't

> let one VMs program work on other VMs packets safely.

> 

> So what did I miss, above doesn't make sense to me.


The Cloud-provider want to load customer provided BPF-code on the
physical Host-OS NIC (that support XDP).  The customer can get access
to a web-interface where they can write or upload their BPF-prog.

As multiple customers can upload BPF-progs, the Cloud-provider have to
write a BPF-prog dispatcher that runs these multiple program.  This
could be done via BPF tail-calls, or via Toke's libxdp[1], or via
devmap XDP-progs per egress port.

The Cloud-provider don't fully trust customers BPF-prog.   They already
pre-filtered traffic to the given VM, so they can allow customers
freedom to see traffic and do XDP_PASS and XDP_DROP.  They
administratively (via ethtool) want to disable the XDP_REDIRECT and
XDP_TX driver feature, as it can be used for violation their VM
isolation policy between customers.

Is the use-case more clear now?


[1] https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Daniel Borkmann Dec. 8, 2020, 9:42 a.m. UTC | #15
On 12/8/20 10:00 AM, Jesper Dangaard Brouer wrote:
> On Mon, 07 Dec 2020 12:52:22 -0800

> John Fastabend <john.fastabend@gmail.com> wrote:

> 

>>> Use-case(1): Cloud-provider want to give customers (running VMs) ability

>>> to load XDP program for DDoS protection (only), but don't want to allow

>>> customer to use XDP_TX (that can implement LB or cheat their VM

>>> isolation policy).

>>

>> Not following. What interface do they want to allow loading on? If its

>> the VM interface then I don't see how it matters. From outside the

>> VM there should be no way to discover if its done in VM or in tc or

>> some other stack.

>>

>> If its doing some onloading/offloading I would assume they need to

>> ensure the isolation, etc. is still maintained because you can't

>> let one VMs program work on other VMs packets safely.

>>

>> So what did I miss, above doesn't make sense to me.

> 

> The Cloud-provider want to load customer provided BPF-code on the

> physical Host-OS NIC (that support XDP).  The customer can get access

> to a web-interface where they can write or upload their BPF-prog.

> 

> As multiple customers can upload BPF-progs, the Cloud-provider have to

> write a BPF-prog dispatcher that runs these multiple program.  This

> could be done via BPF tail-calls, or via Toke's libxdp[1], or via

> devmap XDP-progs per egress port.

> 

> The Cloud-provider don't fully trust customers BPF-prog.   They already

> pre-filtered traffic to the given VM, so they can allow customers

> freedom to see traffic and do XDP_PASS and XDP_DROP.  They

> administratively (via ethtool) want to disable the XDP_REDIRECT and

> XDP_TX driver feature, as it can be used for violation their VM

> isolation policy between customers.

> 

> Is the use-case more clear now?


I think we're talking about two different things. The use case as I understood
it in (1) mentioned to be able to disable XDP_TX for NICs that are deployed in
the VM. This would be a no-go as-is since that would mean my basic assumption
for attaching XDP progs is gone in that today return codes pass/drop/tx is
pretty much available everywhere on native XDP supported NICs. And if you've
tried it on major cloud providers like AWS or Azure that offer SRIOV-based
networking that works okay and further restricting this would mean breakage of
existing programs.

What you mean here is "offload" from guest to host which is a different use
case than what likely John and I read from your description in (1). Such program
should then be loaded via BPF offload API. Meaning, if offload is used and the
host is then configured to disallow XDP_TX for such requests from guests, then
these get rejected through such facility, but if the /same/ program was loaded as
regular native XDP where it's still running in the guest, then it must succeed.
These are two entirely different things.

It's not clear to me whether some ethtool XDP properties flag is the right place
to describe this (plus this needs to differ between offloaded / non-offloaded progs)
or whether this should be an implementation detail for things like virtio_net e.g.
via virtio_has_feature(). Feels more like the latter to me which already has such
a facility in place.
Toke Høiland-Jørgensen Dec. 8, 2020, 11:58 a.m. UTC | #16
Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

> On Mon, 7 Dec 2020 18:01:00 -0700

> David Ahern <dsahern@gmail.com> wrote:

>

>> On 12/7/20 1:52 PM, John Fastabend wrote:

>> >>

>> >> I think we need to keep XDP_TX action separate, because I think that

>> >> there are use-cases where the we want to disable XDP_TX due to end-user

>> >> policy or hardware limitations.  

>> > 

>> > How about we discover this at load time though. 

>

> Nitpick at XDP "attach" time. The general disconnect between BPF and

> XDP is that BPF can verify at "load" time (as kernel knows what it

> support) while XDP can have different support/features per driver, and

> cannot do this until attachment time. (See later issue with tail calls).

> (All other BPF-hooks don't have this issue)

>

>> > Meaning if the program

>> > doesn't use XDP_TX then the hardware can skip resource allocations for

>> > it. I think we could have verifier or extra pass discover the use of

>> > XDP_TX and then pass a bit down to driver to enable/disable TX caps.

>> >   

>> 

>> This was discussed in the context of virtio_net some months back - it is

>> hard to impossible to know a program will not return XDP_TX (e.g., value

>> comes from a map).

>

> It is hard, and sometimes not possible.  For maps the workaround is

> that BPF-programmer adds a bound check on values from the map. If not

> doing that the verifier have to assume all possible return codes are

> used by BPF-prog.

>

> The real nemesis is program tail calls, that can be added dynamically

> after the XDP program is attached.  It is at attachment time that

> changing the NIC resources is possible.  So, for program tail calls the

> verifier have to assume all possible return codes are used by BPF-prog.


We actually had someone working on a scheme for how to express this for
programs some months ago, but unfortunately that stalled out (Jesper
already knows this, but FYI to the rest of you). In any case, I view
this as a "next step". Just exposing the feature bits to userspace will
help users today, and as a side effect, this also makes drivers declare
what they support, which we can then incorporate into the core code to,
e.g., reject attachment of programs that won't work anyway. But let's
do this in increments and not make the perfect the enemy of the good
here.

> BPF now have function calls and function replace right(?)  How does

> this affect this detection of possible return codes?


It does have the same issue as tail calls, in that the return code of
the function being replaced can obviously change. However, the verifier
knows the target of a replace, so it can propagate any constraints put
upon the caller if we implement it that way.

-Toke
John Fastabend Dec. 9, 2020, 5:50 a.m. UTC | #17
Toke Høiland-Jørgensen wrote:
> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

> 

> > On Mon, 7 Dec 2020 18:01:00 -0700

> > David Ahern <dsahern@gmail.com> wrote:

> >

> >> On 12/7/20 1:52 PM, John Fastabend wrote:

> >> >>

> >> >> I think we need to keep XDP_TX action separate, because I think that

> >> >> there are use-cases where the we want to disable XDP_TX due to end-user

> >> >> policy or hardware limitations.  

> >> > 

> >> > How about we discover this at load time though. 

> >

> > Nitpick at XDP "attach" time. The general disconnect between BPF and

> > XDP is that BPF can verify at "load" time (as kernel knows what it

> > support) while XDP can have different support/features per driver, and

> > cannot do this until attachment time. (See later issue with tail calls).

> > (All other BPF-hooks don't have this issue)

> >

> >> > Meaning if the program

> >> > doesn't use XDP_TX then the hardware can skip resource allocations for

> >> > it. I think we could have verifier or extra pass discover the use of

> >> > XDP_TX and then pass a bit down to driver to enable/disable TX caps.

> >> >   

> >> 

> >> This was discussed in the context of virtio_net some months back - it is

> >> hard to impossible to know a program will not return XDP_TX (e.g., value

> >> comes from a map).

> >

> > It is hard, and sometimes not possible.  For maps the workaround is

> > that BPF-programmer adds a bound check on values from the map. If not

> > doing that the verifier have to assume all possible return codes are

> > used by BPF-prog.

> >

> > The real nemesis is program tail calls, that can be added dynamically

> > after the XDP program is attached.  It is at attachment time that

> > changing the NIC resources is possible.  So, for program tail calls the

> > verifier have to assume all possible return codes are used by BPF-prog.

> 

> We actually had someone working on a scheme for how to express this for

> programs some months ago, but unfortunately that stalled out (Jesper

> already knows this, but FYI to the rest of you). In any case, I view

> this as a "next step". Just exposing the feature bits to userspace will

> help users today, and as a side effect, this also makes drivers declare

> what they support, which we can then incorporate into the core code to,

> e.g., reject attachment of programs that won't work anyway. But let's

> do this in increments and not make the perfect the enemy of the good

> here.

> 

> > BPF now have function calls and function replace right(?)  How does

> > this affect this detection of possible return codes?

> 

> It does have the same issue as tail calls, in that the return code of

> the function being replaced can obviously change. However, the verifier

> knows the target of a replace, so it can propagate any constraints put

> upon the caller if we implement it that way.


OK I'm convinced its not possible to tell at attach time if a program
will use XDP_TX or not in general. And in fact for most real programs it
likely will not be knowable. At least most programs I look at these days
use either tail calls or function calls so seems like a dead end.

Also above somewhere it was pointed out that XDP_REDIRECT would want
the queues and it seems even more challenging to sort that out.
John Fastabend Dec. 9, 2020, 6:03 a.m. UTC | #18
> On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:

> > Jesper Dangaard Brouer wrote:

> > > On Fri, 4 Dec 2020 16:21:08 +0100

> > > Daniel Borkmann <daniel@iogearbox.net> wrote:


[...] pruning the thread to answer Jesper.

> > > 

> > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue

> > > resources, as the use-case is only DDoS.  Today we have this problem

> > > with the ixgbe hardware, that cannot load XDP programs on systems with

> > > more than 192 CPUs.

> > 

> > The ixgbe issues is just a bug or missing-feature in my opinion.

> 

> Not a bug, rather HW limitation?


Well hardware has some max queue limit. Likely <192 otherwise I would
have kept doing queue per core on up to 192. But, ideally we should
still load and either share queues across multiple cores or restirct
down to a subset of CPUs. Do you need 192 cores for a 10gbps nic,
probably not. Yes, it requires some extra care, but should be doable
if someone cares enough. I gather current limitation/bug is because
no one has that configuration and/or has complained loud enough.

> 

> > 

> > I think we just document that XDP_TX consumes resources and if users

> > care they shouldn't use XD_TX in programs and in that case hardware

> > should via program discovery not allocate the resource. This seems

> > cleaner in my opinion then more bits for features.

> 

> But what if I'm with some limited HW that actually has a support for XDP

> and I would like to utilize XDP_TX?

> 

> Not all drivers that support XDP consume Tx resources. Recently igb got

> support and it shares Tx queues between netstack and XDP.


Makes sense to me.

> 

> I feel like we should have a sort-of best effort approach in case we

> stumble upon the XDP_TX in prog being loaded and query the driver if it

> would be able to provide the Tx resources on the current system, given

> that normally we tend to have a queue per core.


Why do we need to query? I guess you want some indication from the
driver its not going to be running in the ideal NIC configuraition?
I guess printing a warning would be the normal way to show that. But,
maybe your point is you want something easier to query?

> 

> In that case igb would say yes, ixgbe would say no and prog would be

> rejected.


I think the driver should load even if it can't meet the queue per
core quota. Refusing to load at all or just dropping packets on the
floor is not very friendly. I think we agree on that point.
Maciej Fijalkowski Dec. 9, 2020, 9:54 a.m. UTC | #19
On Tue, Dec 08, 2020 at 10:03:51PM -0800, John Fastabend wrote:
> > On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:

> > > Jesper Dangaard Brouer wrote:

> > > > On Fri, 4 Dec 2020 16:21:08 +0100

> > > > Daniel Borkmann <daniel@iogearbox.net> wrote:

> 

> [...] pruning the thread to answer Jesper.


I think you meant me, but thanks anyway for responding :)

> 

> > > > 

> > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue

> > > > resources, as the use-case is only DDoS.  Today we have this problem

> > > > with the ixgbe hardware, that cannot load XDP programs on systems with

> > > > more than 192 CPUs.

> > > 

> > > The ixgbe issues is just a bug or missing-feature in my opinion.

> > 

> > Not a bug, rather HW limitation?

> 

> Well hardware has some max queue limit. Likely <192 otherwise I would

> have kept doing queue per core on up to 192. But, ideally we should


Data sheet states its 128 Tx qs for ixgbe.

> still load and either share queues across multiple cores or restirct

> down to a subset of CPUs.


And that's the missing piece of logic, I suppose.

> Do you need 192 cores for a 10gbps nic, probably not.


Let's hear from Jesper :p

> Yes, it requires some extra care, but should be doable

> if someone cares enough. I gather current limitation/bug is because

> no one has that configuration and/or has complained loud enough.


I would say we're safe for queue per core approach for newer devices where
we have thousands of queues to play with. Older devices combined with big
cpu count can cause us some problems.

Wondering if drivers could have a problem when user would do something
weird as limiting the queue count to a lower value than cpu count and then
changing the irq affinity?

> 

> > 

> > > 

> > > I think we just document that XDP_TX consumes resources and if users

> > > care they shouldn't use XD_TX in programs and in that case hardware

> > > should via program discovery not allocate the resource. This seems

> > > cleaner in my opinion then more bits for features.

> > 

> > But what if I'm with some limited HW that actually has a support for XDP

> > and I would like to utilize XDP_TX?

> > 

> > Not all drivers that support XDP consume Tx resources. Recently igb got

> > support and it shares Tx queues between netstack and XDP.

> 

> Makes sense to me.

> 

> > 

> > I feel like we should have a sort-of best effort approach in case we

> > stumble upon the XDP_TX in prog being loaded and query the driver if it

> > would be able to provide the Tx resources on the current system, given

> > that normally we tend to have a queue per core.

> 

> Why do we need to query? I guess you want some indication from the

> driver its not going to be running in the ideal NIC configuraition?

> I guess printing a warning would be the normal way to show that. But,

> maybe your point is you want something easier to query?


I meant that given Jesper's example, what should we do? You don't have Tx
resources to pull at all. Should we have a data path for that case that
would share Tx qs between XDP/netstack? Probably not.

> 

> > 

> > In that case igb would say yes, ixgbe would say no and prog would be

> > rejected.

> 

> I think the driver should load even if it can't meet the queue per

> core quota. Refusing to load at all or just dropping packets on the

> floor is not very friendly. I think we agree on that point.


Agreed on that. But it needs some work. I can dabble on that a bit.
Toke Høiland-Jørgensen Dec. 9, 2020, 10:26 a.m. UTC | #20
John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:

>> Jesper Dangaard Brouer <jbrouer@redhat.com> writes:

>> 

>> > On Mon, 7 Dec 2020 18:01:00 -0700

>> > David Ahern <dsahern@gmail.com> wrote:

>> >

>> >> On 12/7/20 1:52 PM, John Fastabend wrote:

>> >> >>

>> >> >> I think we need to keep XDP_TX action separate, because I think that

>> >> >> there are use-cases where the we want to disable XDP_TX due to end-user

>> >> >> policy or hardware limitations.  

>> >> > 

>> >> > How about we discover this at load time though. 

>> >

>> > Nitpick at XDP "attach" time. The general disconnect between BPF and

>> > XDP is that BPF can verify at "load" time (as kernel knows what it

>> > support) while XDP can have different support/features per driver, and

>> > cannot do this until attachment time. (See later issue with tail calls).

>> > (All other BPF-hooks don't have this issue)

>> >

>> >> > Meaning if the program

>> >> > doesn't use XDP_TX then the hardware can skip resource allocations for

>> >> > it. I think we could have verifier or extra pass discover the use of

>> >> > XDP_TX and then pass a bit down to driver to enable/disable TX caps.

>> >> >   

>> >> 

>> >> This was discussed in the context of virtio_net some months back - it is

>> >> hard to impossible to know a program will not return XDP_TX (e.g., value

>> >> comes from a map).

>> >

>> > It is hard, and sometimes not possible.  For maps the workaround is

>> > that BPF-programmer adds a bound check on values from the map. If not

>> > doing that the verifier have to assume all possible return codes are

>> > used by BPF-prog.

>> >

>> > The real nemesis is program tail calls, that can be added dynamically

>> > after the XDP program is attached.  It is at attachment time that

>> > changing the NIC resources is possible.  So, for program tail calls the

>> > verifier have to assume all possible return codes are used by BPF-prog.

>> 

>> We actually had someone working on a scheme for how to express this for

>> programs some months ago, but unfortunately that stalled out (Jesper

>> already knows this, but FYI to the rest of you). In any case, I view

>> this as a "next step". Just exposing the feature bits to userspace will

>> help users today, and as a side effect, this also makes drivers declare

>> what they support, which we can then incorporate into the core code to,

>> e.g., reject attachment of programs that won't work anyway. But let's

>> do this in increments and not make the perfect the enemy of the good

>> here.

>> 

>> > BPF now have function calls and function replace right(?)  How does

>> > this affect this detection of possible return codes?

>> 

>> It does have the same issue as tail calls, in that the return code of

>> the function being replaced can obviously change. However, the verifier

>> knows the target of a replace, so it can propagate any constraints put

>> upon the caller if we implement it that way.

>

> OK I'm convinced its not possible to tell at attach time if a program

> will use XDP_TX or not in general. And in fact for most real programs it

> likely will not be knowable. At least most programs I look at these days

> use either tail calls or function calls so seems like a dead end.

>

> Also above somewhere it was pointed out that XDP_REDIRECT would want

> the queues and it seems even more challenging to sort that out.


Yeah. Doesn't mean that all hope is lost for "reject stuff that doesn't
work". We could either do pessimistic return code detection (if we don't
know for sure assume all codes are used), or we could add metadata where
the program declares what it wants to do...

-Toke
Jesper Dangaard Brouer Dec. 9, 2020, 11:52 a.m. UTC | #21
On Wed, 9 Dec 2020 10:54:54 +0100
Maciej Fijalkowski <maciej.fijalkowski@intel.com> wrote:

> On Tue, Dec 08, 2020 at 10:03:51PM -0800, John Fastabend wrote:

> > > On Mon, Dec 07, 2020 at 12:52:22PM -0800, John Fastabend wrote:  

> > > > Jesper Dangaard Brouer wrote:  

> > > > > On Fri, 4 Dec 2020 16:21:08 +0100

> > > > > Daniel Borkmann <daniel@iogearbox.net> wrote:  

> > 

> > [...] pruning the thread to answer Jesper.  

> 

> I think you meant me, but thanks anyway for responding :)


I was about to say that ;-)

> > > > > 

> > > > > Use-case(2): Disable XDP_TX on a driver to save hardware TX-queue

> > > > > resources, as the use-case is only DDoS.  Today we have this problem

> > > > > with the ixgbe hardware, that cannot load XDP programs on systems with

> > > > > more than 192 CPUs.  

> > > > 

> > > > The ixgbe issues is just a bug or missing-feature in my opinion.  

> > > 

> > > Not a bug, rather HW limitation?  

> > 

> > Well hardware has some max queue limit. Likely <192 otherwise I would

> > have kept doing queue per core on up to 192. But, ideally we should  

> 

> Data sheet states its 128 Tx qs for ixgbe.


I likely remember wrong, maybe it was only ~96 CPUs.  I do remember that
some TX queue were reserved for something else, and QA reported issues
(as I don't have this high end system myself).


> > still load and either share queues across multiple cores or restirct

> > down to a subset of CPUs.  

> 

> And that's the missing piece of logic, I suppose.

> 

> > Do you need 192 cores for a 10gbps nic, probably not.  

> 

> Let's hear from Jesper :p


LOL - of-cause you don't need 192 cores.  With XDP I will claim that
you only need 2 cores (with high GHz) to forward 10gbps wirespeed small
packets.

The point is that this only works, when we avoid atomic lock operations
per packet and bulk NIC PCIe tail/doorbell.  It was actually John's
invention/design to have a dedicated TX queue per core to avoid the
atomic lock operation per packet when queuing packets to the NIC.

 10G @64B give budget of 67.2 ns (241 cycles @ 3.60GHz)
 Atomic lock operation use:[1]
 - Type:spin_lock_unlock         Per elem: 34 cycles(tsc) 9.485 ns
 - Type:spin_lock_unlock_irqsave Per elem: 61 cycles(tsc) 17.125 ns
 (And atomic can affect Inst per cycle)

But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
(up-to 16) so it should not be a problem to solve this by sharing
TX-queue and talking a lock per 16 packets.  I still recommend that,
for fallback case,  you allocated a number a TX-queue and distribute
this across CPUs to avoid hitting a congested lock (above measurements
are the optimal non-congested atomic lock operation)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/time_bench_sample.c

> > Yes, it requires some extra care, but should be doable

> > if someone cares enough. I gather current limitation/bug is because

> > no one has that configuration and/or has complained loud enough.  

> 

> I would say we're safe for queue per core approach for newer devices where

> we have thousands of queues to play with. Older devices combined with big

> cpu count can cause us some problems.

> 

> Wondering if drivers could have a problem when user would do something

> weird as limiting the queue count to a lower value than cpu count and then

> changing the irq affinity?


Not sure what you mean.

But for XDP RX-side we use softirq NAPI guarantee to guard against
concurrent access to our (per-cpu) data structures.

> >   

> > >   

> > > > 

> > > > I think we just document that XDP_TX consumes resources and if users

> > > > care they shouldn't use XD_TX in programs and in that case hardware

> > > > should via program discovery not allocate the resource. This seems

> > > > cleaner in my opinion then more bits for features.  

> > > 

> > > But what if I'm with some limited HW that actually has a support for XDP

> > > and I would like to utilize XDP_TX?

> > > 

> > > Not all drivers that support XDP consume Tx resources. Recently igb got

> > > support and it shares Tx queues between netstack and XDP.  

> > 

> > Makes sense to me.

> >   

> > > 

> > > I feel like we should have a sort-of best effort approach in case we

> > > stumble upon the XDP_TX in prog being loaded and query the driver if it

> > > would be able to provide the Tx resources on the current system, given

> > > that normally we tend to have a queue per core.  

> > 

> > Why do we need to query? I guess you want some indication from the

> > driver its not going to be running in the ideal NIC configuraition?

> > I guess printing a warning would be the normal way to show that. But,

> > maybe your point is you want something easier to query?  

> 

> I meant that given Jesper's example, what should we do? You don't have Tx

> resources to pull at all. Should we have a data path for that case that

> would share Tx qs between XDP/netstack? Probably not.

> 


I think ixgbe should have a fallback mode, where it allocated e.g. 32
TX-queue for XDP xmits or even just same amount as RX-queues (I think
XDP_TX and XDP_REDIRECT can share these TX-queues dedicated to XDP).
When in fallback mode a lock need to be taken (sharded across CPUs),
but ndo_xdp_xmit will bulk up-to 16 packets, so it should not matter
too much.

I do think ixgbe should output a dmesg log message, to say it is in XDP
fallback mode with X number of TX-queues.  For us QA usually collect
the dmesg output after a test run.

   
> > > 

> > > In that case igb would say yes, ixgbe would say no and prog would be

> > > rejected.  

> > 

> > I think the driver should load even if it can't meet the queue per

> > core quota. Refusing to load at all or just dropping packets on the

> > floor is not very friendly. I think we agree on that point.  

> 

> Agreed on that. But it needs some work. I can dabble on that a bit.

> 


I will really appreciate if Intel can fix this in the ixgbe driver, and
implement a fallback method.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
David Ahern Dec. 9, 2020, 3:41 p.m. UTC | #22
On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
>>> still load and either share queues across multiple cores or restirct

>>> down to a subset of CPUs.  

>>

>> And that's the missing piece of logic, I suppose.

>>

>>> Do you need 192 cores for a 10gbps nic, probably not.  

>>

>> Let's hear from Jesper :p

> 

> LOL - of-cause you don't need 192 cores.  With XDP I will claim that

> you only need 2 cores (with high GHz) to forward 10gbps wirespeed small

> packets.


You don't need 192 for 10G on Rx. However, if you are using XDP_REDIRECT
from VM tap devices the next device (presumably the host NIC) does need
to be able to handle the redirect.

My personal experience with this one is mlx5/ConnectX4-LX with a limit
of 63 queues and a server with 96 logical cpus. If the vhost thread for
the tap device runs on a cpu that does not have an XDP TX Queue, the
packet is dropped. This is a really bizarre case to debug as some
packets go out fine while others are dropped.
David Ahern Dec. 9, 2020, 3:44 p.m. UTC | #23
On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
> But I have redesigned the ndo_xdp_xmit call to take a bulk of packets

> (up-to 16) so it should not be a problem to solve this by sharing

> TX-queue and talking a lock per 16 packets.  I still recommend that,

> for fallback case,  you allocated a number a TX-queue and distribute

> this across CPUs to avoid hitting a congested lock (above measurements

> are the optimal non-congested atomic lock operation)


I have been meaning to ask you why 16 for the XDP batching? If the
netdev budget is 64, why not something higher like 32 or 64?
Saeed Mahameed Dec. 9, 2020, 5:15 p.m. UTC | #24
On Wed, 2020-12-09 at 08:41 -0700, David Ahern wrote:
> On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:

> > > > still load and either share queues across multiple cores or

> > > > restirct

> > > > down to a subset of CPUs.  

> > > 

> > > And that's the missing piece of logic, I suppose.

> > > 

> > > > Do you need 192 cores for a 10gbps nic, probably not.  

> > > 

> > > Let's hear from Jesper :p

> > 

> > LOL - of-cause you don't need 192 cores.  With XDP I will claim

> > that

> > you only need 2 cores (with high GHz) to forward 10gbps wirespeed

> > small

> > packets.

> 

> You don't need 192 for 10G on Rx. However, if you are using

> XDP_REDIRECT

> from VM tap devices the next device (presumably the host NIC) does

> need

> to be able to handle the redirect.

> 

> My personal experience with this one is mlx5/ConnectX4-LX with a

> limit


This limit was removed from mlx5
https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/
Note: you still need to use ehttool to increase from 64 to 128 or 96 in
your case.

> of 63 queues and a server with 96 logical cpus. If the vhost thread

> for

> the tap device runs on a cpu that does not have an XDP TX Queue, the

> packet is dropped. This is a really bizarre case to debug as some

> packets go out fine while others are dropped.


I agree, the user experience horrible.
David Ahern Dec. 10, 2020, 3:34 a.m. UTC | #25
On 12/9/20 10:15 AM, Saeed Mahameed wrote:
>> My personal experience with this one is mlx5/ConnectX4-LX with a

>> limit

> 

> This limit was removed from mlx5

> https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/

> Note: you still need to use ehttool to increase from 64 to 128 or 96 in

> your case.

> 


I asked you about that commit back in May:

https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/

As noted in the thread, it did not work for me.
Saeed Mahameed Dec. 10, 2020, 6:48 a.m. UTC | #26
On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote:
> On 12/9/20 10:15 AM, Saeed Mahameed wrote:

> > > My personal experience with this one is mlx5/ConnectX4-LX with a

> > > limit

> > 

> > This limit was removed from mlx5

> > https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/

> > Note: you still need to use ehttool to increase from 64 to 128 or

> > 96 in

> > your case.

> > 

> 

> I asked you about that commit back in May:

> 


:/, sorry i missed this email, must have been the mlnx nvidia email
transition.

> https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/

> 

> As noted in the thread, it did not work for me.


Still relevant ? I might need to get you some tools to increase #msix
in Firmware.
Jesper Dangaard Brouer Dec. 10, 2020, 1:32 p.m. UTC | #27
On Wed, 9 Dec 2020 08:44:33 -0700
David Ahern <dsahern@gmail.com> wrote:

> On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:
> > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets
> > (up-to 16) so it should not be a problem to solve this by sharing
> > TX-queue and talking a lock per 16 packets.  I still recommend that,
> > for fallback case,  you allocated a number a TX-queue and distribute
> > this across CPUs to avoid hitting a congested lock (above measurements
> > are the optimal non-congested atomic lock operation)  
> 
> I have been meaning to ask you why 16 for the XDP batching? If the
> netdev budget is 64, why not something higher like 32 or 64?

Thanks you for asking as there are multiple good reasons and
consideration for this 16 batch size.  Notice cpumap have batch size 8,
which is also an explicit choice.  And AF_XDP went in the wrong
direction IMHO and I think have 256.  I designed this to be a choice in
the map code, for the level of bulking it needs/wants.

The low level explanation is that these 8 and 16 batch sizes are
optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher
with 10 elements).  I'm betting on that memory backing these 8 or 16
packets have higher chance to remain/being in cache, and I can prefetch
them without evicting them from cache again.  In some cases the pointer
to these packets are queued into a ptr_ring, and it is more optimal to
write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring.

The general explanation is my goal to do bulking without adding latency.
This is explicitly stated in my presentation[1] as of Feb 2016, slide 20.
Sure, you/we can likely make the micro-benchmarks look better by using
64 batch size, but that will introduce added latency and likely shoot
our-selves in the foot for real workloads.  With experience from
bufferbloat and real networks, we know that massive TX bulking have bad
effects.  Still XDP-redirect does massive bulking (NIC flush is after
full 64 budget) and we don't have pushback or a queue mechanism (so I
know we are already shooting ourselves in the foot) ...  Fortunately we
now have a PhD student working on queuing for XDP.

It is also important to understand that this is an adaptive bulking
scheme, which comes from NAPI.  We don't wait for packets arriving
shortly, we pickup what NIC have available, but by only taking 8 or 16
packets (instead of emptying the entire RX-queue), and then spending
some time to send them along, I'm hoping that NIC could have gotten
some more frame.  For cpumap and veth (in-some-cases) they can start to
consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH
signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to
update their internal queue state (and BQL), and if it gets close to
full they can choose to flush/doorbell the NIC earlier.  When doing
queuing for XDP we need to expose these NIC queue states, and having 4
calls with 16 packets (64 budget) also gives us more chances to get NIC
queue state info which the NIC already touch.


[1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
Magnus Karlsson Dec. 10, 2020, 2:14 p.m. UTC | #28
On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer
<brouer@redhat.com> wrote:
>

> On Wed, 9 Dec 2020 08:44:33 -0700

> David Ahern <dsahern@gmail.com> wrote:

>

> > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:

> > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets

> > > (up-to 16) so it should not be a problem to solve this by sharing

> > > TX-queue and talking a lock per 16 packets.  I still recommend that,

> > > for fallback case,  you allocated a number a TX-queue and distribute

> > > this across CPUs to avoid hitting a congested lock (above measurements

> > > are the optimal non-congested atomic lock operation)

> >

> > I have been meaning to ask you why 16 for the XDP batching? If the

> > netdev budget is 64, why not something higher like 32 or 64?

>

> Thanks you for asking as there are multiple good reasons and

> consideration for this 16 batch size.  Notice cpumap have batch size 8,

> which is also an explicit choice.  And AF_XDP went in the wrong

> direction IMHO and I think have 256.  I designed this to be a choice in

> the map code, for the level of bulking it needs/wants.


FYI, as far as I know, there is nothing in AF_XDP that says bulking
should be 256. There is a 256 number in the i40e driver that states
the maximum number of packets to be sent within one napi_poll loop.
But this is just a maximum number and only for that driver. (In case
you wonder, that number was inherited from the original skb Tx
implementation in the driver.) The actual batch size is controlled by
the application. If it puts 1 packet in the Tx ring and calls send(),
the batch size will be 1. If it puts 128 packets in the Tx ring and
calls send(), you get a batch size of 128, and so on. It is flexible,
so you can trade-off latency with throughput in the way the
application desires. Rx batch size has also become flexible now with
the introduction of Björn's prefer_busy_poll patch set [1].

[1] https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.topel@gmail.com/

> The low level explanation is that these 8 and 16 batch sizes are

> optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher

> with 10 elements).  I'm betting on that memory backing these 8 or 16

> packets have higher chance to remain/being in cache, and I can prefetch

> them without evicting them from cache again.  In some cases the pointer

> to these packets are queued into a ptr_ring, and it is more optimal to

> write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring.

>

> The general explanation is my goal to do bulking without adding latency.

> This is explicitly stated in my presentation[1] as of Feb 2016, slide 20.

> Sure, you/we can likely make the micro-benchmarks look better by using

> 64 batch size, but that will introduce added latency and likely shoot

> our-selves in the foot for real workloads.  With experience from

> bufferbloat and real networks, we know that massive TX bulking have bad

> effects.  Still XDP-redirect does massive bulking (NIC flush is after

> full 64 budget) and we don't have pushback or a queue mechanism (so I

> know we are already shooting ourselves in the foot) ...  Fortunately we

> now have a PhD student working on queuing for XDP.

>

> It is also important to understand that this is an adaptive bulking

> scheme, which comes from NAPI.  We don't wait for packets arriving

> shortly, we pickup what NIC have available, but by only taking 8 or 16

> packets (instead of emptying the entire RX-queue), and then spending

> some time to send them along, I'm hoping that NIC could have gotten

> some more frame.  For cpumap and veth (in-some-cases) they can start to

> consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH

> signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to

> update their internal queue state (and BQL), and if it gets close to

> full they can choose to flush/doorbell the NIC earlier.  When doing

> queuing for XDP we need to expose these NIC queue states, and having 4

> calls with 16 packets (64 budget) also gives us more chances to get NIC

> queue state info which the NIC already touch.

>

>

> [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf

> --

> Best regards,

>   Jesper Dangaard Brouer

>   MSc.CS, Principal Kernel Engineer at Red Hat

>   LinkedIn: http://www.linkedin.com/in/brouer

>

> _______________________________________________

> Intel-wired-lan mailing list

> Intel-wired-lan@osuosl.org

> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan
David Ahern Dec. 10, 2020, 3:30 p.m. UTC | #29
On 12/9/20 11:48 PM, Saeed Mahameed wrote:
> On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote:

>> On 12/9/20 10:15 AM, Saeed Mahameed wrote:

>>>> My personal experience with this one is mlx5/ConnectX4-LX with a

>>>> limit

>>>

>>> This limit was removed from mlx5

>>> https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/

>>> Note: you still need to use ehttool to increase from 64 to 128 or

>>> 96 in

>>> your case.

>>>

>>

>> I asked you about that commit back in May:

>>

> 

> :/, sorry i missed this email, must have been the mlnx nvidia email

> transition.

> 

>> https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/

>>

>> As noted in the thread, it did not work for me.

> 

> Still relevant ? I might need to get you some tools to increase #msix

> in Firmware.

> 


not for me at the moment, but it would be good to document what a user
needs to do - especially if it involves vendor specific tools and steps.
Jesper Dangaard Brouer Dec. 10, 2020, 5:30 p.m. UTC | #30
On Thu, 10 Dec 2020 15:14:18 +0100
Magnus Karlsson <magnus.karlsson@gmail.com> wrote:

> On Thu, Dec 10, 2020 at 2:32 PM Jesper Dangaard Brouer

> <brouer@redhat.com> wrote:

> >

> > On Wed, 9 Dec 2020 08:44:33 -0700

> > David Ahern <dsahern@gmail.com> wrote:

> >  

> > > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:  

> > > > But I have redesigned the ndo_xdp_xmit call to take a bulk of packets

> > > > (up-to 16) so it should not be a problem to solve this by sharing

> > > > TX-queue and talking a lock per 16 packets.  I still recommend that,

> > > > for fallback case,  you allocated a number a TX-queue and distribute

> > > > this across CPUs to avoid hitting a congested lock (above measurements

> > > > are the optimal non-congested atomic lock operation)  

> > >

> > > I have been meaning to ask you why 16 for the XDP batching? If the

> > > netdev budget is 64, why not something higher like 32 or 64?  

> >

> > Thanks you for asking as there are multiple good reasons and

> > consideration for this 16 batch size.  Notice cpumap have batch size 8,

> > which is also an explicit choice.  And AF_XDP went in the wrong

> > direction IMHO and I think have 256.  I designed this to be a choice in

> > the map code, for the level of bulking it needs/wants.  

> 

> FYI, as far as I know, there is nothing in AF_XDP that says bulking

> should be 256. There is a 256 number in the i40e driver that states

> the maximum number of packets to be sent within one napi_poll loop.

> But this is just a maximum number and only for that driver. (In case

> you wonder, that number was inherited from the original skb Tx

> implementation in the driver.) 


Ah, that explains the issue I have on the production system that runs
the EDT-pacer[2].  I see that i40e function i40e_clean_tx_irq() ignores
napi_budget but uses it own budget, that defaults to 256.  Looks like I
can adjust this via ethtool -C tx-frames-irq.   I turned this down to
64 (32 was giving worse results, and below 16 system acted strange).

Now the issue is gone, which was that if TX-DMA completion was running
(i40e_clean_tx_irq()) on the same CPU that send packets via FQ-pacer
qdisc, then the pacing was not accurate, and was sending too bursty.

System have already tuned "net/core/dev_weight" and RX/TX-bias to
reduce bulking, as this can influence latency and the EDT-pacing
accuracy. (It is a middlebox bridging VLANs and BPF-EDT tiemstamping and
FQ-pacing packets to solve bursts overflowing switch ports).

  sudo sysctl net/core/dev_weight
  net.core.dev_weight = 1
  net.core.dev_weight_rx_bias = 32
  net.core.dev_weight_tx_bias = 1

This net.core.dev_weight_tx_bias=1 (together with dev_weight=1) cause
qdisc transmit budget to become one packet, cycling through
NET_TX_SOFTIRQ which consumes time and gives a little more pacing space
for the packets.


> The actual batch size is controlled by

> the application. If it puts 1 packet in the Tx ring and calls send(),

> the batch size will be 1. If it puts 128 packets in the Tx ring and

> calls send(), you get a batch size of 128, and so on. It is flexible,

> so you can trade-off latency with throughput in the way the

> application desires. Rx batch size has also become flexible now with

> the introduction of Björn's prefer_busy_poll patch set [1].

> 

> [1] https://lore.kernel.org/netdev/20201130185205.196029-1-bjorn.topel@gmail.com/


This looks like a cool trick, to get even more accurate packet scheduling.

I played with the tunings, and could see changed behavior with mpstat,
but ended up tuning it off again, as I could not measure a direct
correlation with the bpftrace tools[3].


> > The low level explanation is that these 8 and 16 batch sizes are

> > optimized towards cache sizes and Intel's Line-Fill-Buffer (prefetcher

> > with 10 elements).  I'm betting on that memory backing these 8 or 16

> > packets have higher chance to remain/being in cache, and I can prefetch

> > them without evicting them from cache again.  In some cases the pointer

> > to these packets are queued into a ptr_ring, and it is more optimal to

> > write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the ptr_ring.

> >

> > The general explanation is my goal to do bulking without adding latency.

> > This is explicitly stated in my presentation[1] as of Feb 2016, slide 20.

> > Sure, you/we can likely make the micro-benchmarks look better by using

> > 64 batch size, but that will introduce added latency and likely shoot

> > our-selves in the foot for real workloads.  With experience from

> > bufferbloat and real networks, we know that massive TX bulking have bad

> > effects.  Still XDP-redirect does massive bulking (NIC flush is after

> > full 64 budget) and we don't have pushback or a queue mechanism (so I

> > know we are already shooting ourselves in the foot) ...  Fortunately we

> > now have a PhD student working on queuing for XDP.

> >

> > It is also important to understand that this is an adaptive bulking

> > scheme, which comes from NAPI.  We don't wait for packets arriving

> > shortly, we pickup what NIC have available, but by only taking 8 or 16

> > packets (instead of emptying the entire RX-queue), and then spending

> > some time to send them along, I'm hoping that NIC could have gotten

> > some more frame.  For cpumap and veth (in-some-cases) they can start to

> > consume packets from these batches, but NIC drivers gets XDP_XMIT_FLUSH

> > signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to

> > update their internal queue state (and BQL), and if it gets close to

> > full they can choose to flush/doorbell the NIC earlier.  When doing

> > queuing for XDP we need to expose these NIC queue states, and having 4

> > calls with 16 packets (64 budget) also gives us more chances to get NIC

> > queue state info which the NIC already touch.

> >

> >

> > [1] https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf


[2] https://github.com/netoptimizer/bpf-examples/tree/master/traffic-pacing-edt/

[3] https://github.com/netoptimizer/bpf-examples/tree/master/traffic-pacing-edt/bpftrace


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
Saeed Mahameed Dec. 10, 2020, 6:58 p.m. UTC | #31
On Thu, 2020-12-10 at 08:30 -0700, David Ahern wrote:
> On 12/9/20 11:48 PM, Saeed Mahameed wrote:

> > On Wed, 2020-12-09 at 20:34 -0700, David Ahern wrote:

> > > On 12/9/20 10:15 AM, Saeed Mahameed wrote:

> > > > > My personal experience with this one is mlx5/ConnectX4-LX

> > > > > with a

> > > > > limit

> > > > 

> > > > This limit was removed from mlx5

> > > > https://patchwork.ozlabs.org/project/netdev/patch/20200107191335.12272-5-saeedm@mellanox.com/

> > > > Note: you still need to use ehttool to increase from 64 to 128

> > > > or

> > > > 96 in

> > > > your case.

> > > > 

> > > 

> > > I asked you about that commit back in May:

> > > 

> > 

> > :/, sorry i missed this email, must have been the mlnx nvidia email

> > transition.

> > 

> > > https://lore.kernel.org/netdev/198081c2-cb0d-e1d5-901c-446b63c36706@gmail.com/

> > > 

> > > As noted in the thread, it did not work for me.

> > 

> > Still relevant ? I might need to get you some tools to increase

> > #msix

> > in Firmware.

> > 

> 

> not for me at the moment, but it would be good to document what a

> user

> needs to do - especially if it involves vendor specific tools and

> steps.


Ack, Thanks for pointing that out, I will take action on this.
Saeed Mahameed Dec. 10, 2020, 7:20 p.m. UTC | #32
On Thu, 2020-12-10 at 14:32 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 9 Dec 2020 08:44:33 -0700

> David Ahern <dsahern@gmail.com> wrote:

> 

> > On 12/9/20 4:52 AM, Jesper Dangaard Brouer wrote:

> > > But I have redesigned the ndo_xdp_xmit call to take a bulk of

> > > packets

> > > (up-to 16) so it should not be a problem to solve this by sharing

> > > TX-queue and talking a lock per 16 packets.  I still recommend

> > > that,

> > > for fallback case,  you allocated a number a TX-queue and

> > > distribute

> > > this across CPUs to avoid hitting a congested lock (above

> > > measurements

> > > are the optimal non-congested atomic lock operation)  

> > 

> > I have been meaning to ask you why 16 for the XDP batching? If the

> > netdev budget is 64, why not something higher like 32 or 64?

> 

> Thanks you for asking as there are multiple good reasons and

> consideration for this 16 batch size.  Notice cpumap have batch size

> 8,

> which is also an explicit choice.  And AF_XDP went in the wrong

> direction IMHO and I think have 256.  I designed this to be a choice

> in

> the map code, for the level of bulking it needs/wants.

> 

> The low level explanation is that these 8 and 16 batch sizes are

> optimized towards cache sizes and Intel's Line-Fill-Buffer

> (prefetcher

> with 10 elements).  I'm betting on that memory backing these 8 or 16

> packets have higher chance to remain/being in cache, and I can

> prefetch

> them without evicting them from cache again.  In some cases the

> pointer

> to these packets are queued into a ptr_ring, and it is more optimal

> to

> write cacheline sizes 1 (8 pointers) or 2 (16 pointers) into the

> ptr_ring.

> 


I've warned people about this once or twice on the mailing list, for
example re-populating the rx ring, a common mistake is to use the napi
budget, which has the exact side effects as you are explaining here
Jesper !

these 8/16 numbers are used in more than one place in the stack, xdp,
gro, hw buffer re-population, etc..
how can we enforce such numbers and a uniform handling in all drivers?
1. have a clear documentation ? well know defines, for people to copy?

2. for XDP we must keep track on the memory backing of the xdp bulked
data as Jesper pointed out, so we always make sure whatever bulk-size
we define it always remains cache friendly, especially now where people
stated working on  multi-buff and other features that will extend the
xdp_buff and xdp_frame, do we need a selftest that maybe runs pahole to
see the those data strcutre remain within reasonable format/sizes ?



> The general explanation is my goal to do bulking without adding

> latency.

> This is explicitly stated in my presentation[1] as of Feb 2016, slide

> 20.

> Sure, you/we can likely make the micro-benchmarks look better by

> using

> 64 batch size, but that will introduce added latency and likely shoot

> our-selves in the foot for real workloads.  With experience from

> bufferbloat and real networks, we know that massive TX bulking have

> bad

> effects.  Still XDP-redirect does massive bulking (NIC flush is after

> full 64 budget) and we don't have pushback or a queue mechanism (so I

> know we are already shooting ourselves in the foot) ...  Fortunately

> we

> now have a PhD student working on queuing for XDP.

> 

> It is also important to understand that this is an adaptive bulking

> scheme, which comes from NAPI.  We don't wait for packets arriving

> shortly, we pickup what NIC have available, but by only taking 8 or

> 16

> packets (instead of emptying the entire RX-queue), and then spending

> some time to send them along, I'm hoping that NIC could have gotten

> some more frame.  For cpumap and veth (in-some-cases) they can start

> to

> consume packets from these batches, but NIC drivers gets

> XDP_XMIT_FLUSH

> signal at NAPI-end (xdp_do_flush). Still design allows NIC drivers to

> update their internal queue state (and BQL), and if it gets close to

> full they can choose to flush/doorbell the NIC earlier.  When doing

> queuing for XDP we need to expose these NIC queue states, and having

> 4

> calls with 16 packets (64 budget) also gives us more chances to get

> NIC

> queue state info which the NIC already touch.

> 

> 

> [1] 

> https://people.netfilter.org/hawk/presentations/devconf2016/net_stack_challenges_100G_Feb2016.pdf
Marek Majtyka Jan. 5, 2021, 11:56 a.m. UTC | #33
I would like to thank you for your time, comments, nitpicking as well
as encouraging.

One thing needs clarification I think, that is, that those flags
describe driver static feature sets - which are read-only. They have
nothing in common with driver runtime configuration change yet.
Runtime change of this state can be added but it needs a new variable
and it can be done later on if someone needs it.

Obviously, it is not possible to make everybody happy, especially with
XDP_BASE flags set. To be honest, this XDP_BASE definition is a
syntactic sugar for me and I can live without it. We can either remove
it completely, from
which IMO we all and other developers will suffer later on, or maybe
we can agree on these two helper set of flags: XDP_BASE (TX, ABORTED,
PASS, DROP) and XDP_LIMITED_BASE(ABORTED,PASS_DROP).
What do you think?

I am also going to add a new XDP_REDIRECT_TARGET flag and retrieving
XDP flags over rtnelink interface.

I also think that for completeness, ethtool implementation should be
kept  together with rtnelink part in order to cover both ip and
ethtool tools. Do I have your approval or disagreement? Please let me know.

Both AF_XDP_ZEROCOPY and XDP_SOCK_ZEROCOPY are good to me. I will pick
the one XDP_SOCK_ZEROCOPY unless there are protests.

I don't think that HEADROOM parameter should be passed via the flags.
It is by nature a number and an attempt to quantize with flags seems
to be an unnatural limitation for the future.

Thanks
Marek
Toke Høiland-Jørgensen Feb. 1, 2021, 4:16 p.m. UTC | #34
Marek Majtyka <alardam@gmail.com> writes:

> I would like to thank you for your time, comments, nitpicking as well

> as encouraging.

>

> One thing needs clarification I think, that is, that those flags

> describe driver static feature sets - which are read-only. They have

> nothing in common with driver runtime configuration change yet.

> Runtime change of this state can be added but it needs a new variable

> and it can be done later on if someone needs it.

>

> Obviously, it is not possible to make everybody happy, especially with

> XDP_BASE flags set. To be honest, this XDP_BASE definition is a

> syntactic sugar for me and I can live without it. We can either remove

> it completely, from

> which IMO we all and other developers will suffer later on, or maybe

> we can agree on these two helper set of flags: XDP_BASE (TX, ABORTED,

> PASS, DROP) and XDP_LIMITED_BASE(ABORTED,PASS_DROP).

> What do you think?

>

> I am also going to add a new XDP_REDIRECT_TARGET flag and retrieving

> XDP flags over rtnelink interface.

>

> I also think that for completeness, ethtool implementation should be

> kept  together with rtnelink part in order to cover both ip and

> ethtool tools. Do I have your approval or disagreement? Please let me

> know.


Hi Marek

I just realised that it seems no one actually replied to your email. On
my part at least that was because I didn't have any objections, so I'm
hoping you didn't feel the lack of response was discouraging (and that
you're still working on a revision of this series)? :)

-Toke
Marek Majtyka Feb. 2, 2021, 11:26 a.m. UTC | #35
Thanks Toke,

In fact, I was waiting for a single confirmation, disagreement or
comment. I have it now. As there are no more comments, I am getting
down to work right away.

Marek




On Mon, Feb 1, 2021 at 5:16 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>

> Marek Majtyka <alardam@gmail.com> writes:

>

> > I would like to thank you for your time, comments, nitpicking as well

> > as encouraging.

> >

> > One thing needs clarification I think, that is, that those flags

> > describe driver static feature sets - which are read-only. They have

> > nothing in common with driver runtime configuration change yet.

> > Runtime change of this state can be added but it needs a new variable

> > and it can be done later on if someone needs it.

> >

> > Obviously, it is not possible to make everybody happy, especially with

> > XDP_BASE flags set. To be honest, this XDP_BASE definition is a

> > syntactic sugar for me and I can live without it. We can either remove

> > it completely, from

> > which IMO we all and other developers will suffer later on, or maybe

> > we can agree on these two helper set of flags: XDP_BASE (TX, ABORTED,

> > PASS, DROP) and XDP_LIMITED_BASE(ABORTED,PASS_DROP).

> > What do you think?

> >

> > I am also going to add a new XDP_REDIRECT_TARGET flag and retrieving

> > XDP flags over rtnelink interface.

> >

> > I also think that for completeness, ethtool implementation should be

> > kept  together with rtnelink part in order to cover both ip and

> > ethtool tools. Do I have your approval or disagreement? Please let me

> > know.

>

> Hi Marek

>

> I just realised that it seems no one actually replied to your email. On

> my part at least that was because I didn't have any objections, so I'm

> hoping you didn't feel the lack of response was discouraging (and that

> you're still working on a revision of this series)? :)

>

> -Toke

>
Toke Høiland-Jørgensen Feb. 2, 2021, 12:05 p.m. UTC | #36
Marek Majtyka <alardam@gmail.com> writes:

> Thanks Toke,

>

> In fact, I was waiting for a single confirmation, disagreement or

> comment. I have it now. As there are no more comments, I am getting

> down to work right away.


Awesome! And sorry for not replying straight away - I hate it when I
send out something myself and receive no replies, so I suppose I should
get better at not doing that myself :)

As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I
just realised I didn't reply to), I am fine with defining XDP_BASE as a
shortcut for TX/ABORTED/PASS/DROP, but think we should skip
XDP_LIMITED_BASE and instead require all new drivers to implement the
full XDP_BASE set straight away. As long as we're talking about
features *implemented* by the driver, at least; i.e., it should still be
possible to *deactivate* XDP_TX if you don't want to use the HW
resources, but I don't think there's much benefit from defining the
LIMITED_BASE set as a shortcut for this mode...

-Toke
Jakub Kicinski Feb. 2, 2021, 7:34 p.m. UTC | #37
On Tue, 02 Feb 2021 13:05:34 +0100 Toke Høiland-Jørgensen wrote:
> Marek Majtyka <alardam@gmail.com> writes:

> 

> > Thanks Toke,

> >

> > In fact, I was waiting for a single confirmation, disagreement or

> > comment. I have it now. As there are no more comments, I am getting

> > down to work right away.  

> 

> Awesome! And sorry for not replying straight away - I hate it when I

> send out something myself and receive no replies, so I suppose I should

> get better at not doing that myself :)

> 

> As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I

> just realised I didn't reply to), I am fine with defining XDP_BASE as a

> shortcut for TX/ABORTED/PASS/DROP, but think we should skip

> XDP_LIMITED_BASE and instead require all new drivers to implement the

> full XDP_BASE set straight away. As long as we're talking about

> features *implemented* by the driver, at least; i.e., it should still be

> possible to *deactivate* XDP_TX if you don't want to use the HW

> resources, but I don't think there's much benefit from defining the

> LIMITED_BASE set as a shortcut for this mode...


I still have mixed feelings about these flags. The first step IMO
should be adding validation tests. I bet^W pray every vendor has
validation tests but since they are not unified we don't know what
level of interoperability we're achieving in practice. That doesn't
matter for trivial feature like base actions, but we'll inevitably 
move on to defining more advanced capabilities and the question of
"what supporting X actually mean" will come up (3 years later, when
we don't remember ourselves).
Marek Majtyka Feb. 3, 2021, 12:50 p.m. UTC | #38
On Tue, Feb 2, 2021 at 8:34 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> On Tue, 02 Feb 2021 13:05:34 +0100 Toke Høiland-Jørgensen wrote:

> > Marek Majtyka <alardam@gmail.com> writes:

> >

> > > Thanks Toke,

> > >

> > > In fact, I was waiting for a single confirmation, disagreement or

> > > comment. I have it now. As there are no more comments, I am getting

> > > down to work right away.

> >

> > Awesome! And sorry for not replying straight away - I hate it when I

> > send out something myself and receive no replies, so I suppose I should

> > get better at not doing that myself :)

> >

> > As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I

> > just realised I didn't reply to), I am fine with defining XDP_BASE as a

> > shortcut for TX/ABORTED/PASS/DROP, but think we should skip

> > XDP_LIMITED_BASE and instead require all new drivers to implement the

> > full XDP_BASE set straight away. As long as we're talking about

> > features *implemented* by the driver, at least; i.e., it should still be

> > possible to *deactivate* XDP_TX if you don't want to use the HW

> > resources, but I don't think there's much benefit from defining the

> > LIMITED_BASE set as a shortcut for this mode...

>

> I still have mixed feelings about these flags. The first step IMO

> should be adding validation tests. I bet^W pray every vendor has

> validation tests but since they are not unified we don't know what

> level of interoperability we're achieving in practice. That doesn't

> matter for trivial feature like base actions, but we'll inevitably

> move on to defining more advanced capabilities and the question of

> "what supporting X actually mean" will come up (3 years later, when

> we don't remember ourselves).


I am a bit confused now. Did you mean validation tests of those XDP
flags, which I am working on or some other validation tests?
What should these tests verify? Can you please elaborate more on the
topic, please - just a few sentences how are you see it?
Jakub Kicinski Feb. 3, 2021, 5:02 p.m. UTC | #39
On Wed, 3 Feb 2021 13:50:59 +0100 Marek Majtyka wrote:
> On Tue, Feb 2, 2021 at 8:34 PM Jakub Kicinski <kuba@kernel.org> wrote:

> > On Tue, 02 Feb 2021 13:05:34 +0100 Toke Høiland-Jørgensen wrote:  

> > > Awesome! And sorry for not replying straight away - I hate it when I

> > > send out something myself and receive no replies, so I suppose I should

> > > get better at not doing that myself :)

> > >

> > > As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I

> > > just realised I didn't reply to), I am fine with defining XDP_BASE as a

> > > shortcut for TX/ABORTED/PASS/DROP, but think we should skip

> > > XDP_LIMITED_BASE and instead require all new drivers to implement the

> > > full XDP_BASE set straight away. As long as we're talking about

> > > features *implemented* by the driver, at least; i.e., it should still be

> > > possible to *deactivate* XDP_TX if you don't want to use the HW

> > > resources, but I don't think there's much benefit from defining the

> > > LIMITED_BASE set as a shortcut for this mode...  

> >

> > I still have mixed feelings about these flags. The first step IMO

> > should be adding validation tests. I bet^W pray every vendor has

> > validation tests but since they are not unified we don't know what

> > level of interoperability we're achieving in practice. That doesn't

> > matter for trivial feature like base actions, but we'll inevitably

> > move on to defining more advanced capabilities and the question of

> > "what supporting X actually mean" will come up (3 years later, when

> > we don't remember ourselves).  

> 

> I am a bit confused now. Did you mean validation tests of those XDP

> flags, which I am working on or some other validation tests?

> What should these tests verify? Can you please elaborate more on the

> topic, please - just a few sentences how are you see it?


Conformance tests can be written for all features, whether they have 
an explicit capability in the uAPI or not. But for those that do IMO
the tests should be required.

Let me give you an example. This set adds a bit that says Intel NICs 
can do XDP_TX and XDP_REDIRECT, yet we both know of the Tx queue
shenanigans. So can i40e do XDP_REDIRECT or can it not?

If we have exhaustive conformance tests we can confidently answer that
question. And the answer may not be "yes" or "no", it may actually be
"we need more options because many implementations fall in between".

I think readable (IOW not written in some insane DSL) tests can also 
be useful for users who want to check which features their program /
deployment will require.
Toke Høiland-Jørgensen Feb. 10, 2021, 10:53 a.m. UTC | #40
Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 3 Feb 2021 13:50:59 +0100 Marek Majtyka wrote:

>> On Tue, Feb 2, 2021 at 8:34 PM Jakub Kicinski <kuba@kernel.org> wrote:

>> > On Tue, 02 Feb 2021 13:05:34 +0100 Toke Høiland-Jørgensen wrote:  

>> > > Awesome! And sorry for not replying straight away - I hate it when I

>> > > send out something myself and receive no replies, so I suppose I should

>> > > get better at not doing that myself :)

>> > >

>> > > As for the inclusion of the XDP_BASE / XDP_LIMITED_BASE sets (which I

>> > > just realised I didn't reply to), I am fine with defining XDP_BASE as a

>> > > shortcut for TX/ABORTED/PASS/DROP, but think we should skip

>> > > XDP_LIMITED_BASE and instead require all new drivers to implement the

>> > > full XDP_BASE set straight away. As long as we're talking about

>> > > features *implemented* by the driver, at least; i.e., it should still be

>> > > possible to *deactivate* XDP_TX if you don't want to use the HW

>> > > resources, but I don't think there's much benefit from defining the

>> > > LIMITED_BASE set as a shortcut for this mode...  

>> >

>> > I still have mixed feelings about these flags. The first step IMO

>> > should be adding validation tests. I bet^W pray every vendor has

>> > validation tests but since they are not unified we don't know what

>> > level of interoperability we're achieving in practice. That doesn't

>> > matter for trivial feature like base actions, but we'll inevitably

>> > move on to defining more advanced capabilities and the question of

>> > "what supporting X actually mean" will come up (3 years later, when

>> > we don't remember ourselves).  

>> 

>> I am a bit confused now. Did you mean validation tests of those XDP

>> flags, which I am working on or some other validation tests?

>> What should these tests verify? Can you please elaborate more on the

>> topic, please - just a few sentences how are you see it?

>

> Conformance tests can be written for all features, whether they have 

> an explicit capability in the uAPI or not. But for those that do IMO

> the tests should be required.

>

> Let me give you an example. This set adds a bit that says Intel NICs 

> can do XDP_TX and XDP_REDIRECT, yet we both know of the Tx queue

> shenanigans. So can i40e do XDP_REDIRECT or can it not?

>

> If we have exhaustive conformance tests we can confidently answer that

> question. And the answer may not be "yes" or "no", it may actually be

> "we need more options because many implementations fall in between".

>

> I think readable (IOW not written in some insane DSL) tests can also 

> be useful for users who want to check which features their program /

> deployment will require.


While I do agree that that kind of conformance test would be great, I
don't think it has to hold up this series (the perfect being the enemy
of the good, and all that). We have a real problem today that userspace
can't tell if a given driver implements, say, XDP_REDIRECT, and so
people try to use it and spend days wondering which black hole their
packets disappear into. And for things like container migration we need
to be able to predict whether a given host supports a feature *before*
we start the migration and try to use it.

I view the feature flags as a list of features *implemented* by the
driver. Which should be pretty static in a given kernel, but may be
different than the features currently *enabled* on a given system (due
to, e.g., the TX queue stuff).

The simple way to expose the latter would be to just have a second set
of flags indicating the current configured state; and for that I guess
we should at least agree what "enabled" means; and a conformance test
would be a way to do this, of course.

I don't see why we can't do this in stages, though; start with the first
set of flags ('implemented'), move on to the second one ('enabled'), and
then to things like making the kernel react to the flags by rejecting
insertion into devmaps for invalid interfaces...

-Toke
Jakub Kicinski Feb. 10, 2021, 6:31 p.m. UTC | #41
On Wed, 10 Feb 2021 11:53:53 +0100 Toke Høiland-Jørgensen wrote:
> >> I am a bit confused now. Did you mean validation tests of those XDP

> >> flags, which I am working on or some other validation tests?

> >> What should these tests verify? Can you please elaborate more on the

> >> topic, please - just a few sentences how are you see it?  

> >

> > Conformance tests can be written for all features, whether they have 

> > an explicit capability in the uAPI or not. But for those that do IMO

> > the tests should be required.

> >

> > Let me give you an example. This set adds a bit that says Intel NICs 

> > can do XDP_TX and XDP_REDIRECT, yet we both know of the Tx queue

> > shenanigans. So can i40e do XDP_REDIRECT or can it not?

> >

> > If we have exhaustive conformance tests we can confidently answer that

> > question. And the answer may not be "yes" or "no", it may actually be

> > "we need more options because many implementations fall in between".

> >

> > I think readable (IOW not written in some insane DSL) tests can also 

> > be useful for users who want to check which features their program /

> > deployment will require.  

> 

> While I do agree that that kind of conformance test would be great, I

> don't think it has to hold up this series (the perfect being the enemy

> of the good, and all that). We have a real problem today that userspace

> can't tell if a given driver implements, say, XDP_REDIRECT, and so

> people try to use it and spend days wondering which black hole their

> packets disappear into. And for things like container migration we need

> to be able to predict whether a given host supports a feature *before*

> we start the migration and try to use it.


Unless you have a strong definition of what XDP_REDIRECT means the flag
itself is not worth much. We're not talking about normal ethtool feature
flags which are primarily stack-driven, XDP is implemented mostly by
the driver, each vendor can do their own thing. Maybe I've seen one
vendor incompatibility too many at my day job to hope for the best...

> I view the feature flags as a list of features *implemented* by the

> driver. Which should be pretty static in a given kernel, but may be

> different than the features currently *enabled* on a given system (due

> to, e.g., the TX queue stuff).


Hm, maybe I'm not being clear enough. The way XDP_REDIRECT (your
example) is implemented across drivers differs in a meaningful ways. 
Hence the need for conformance testing. We don't have a golden SW
standard to fall back on, like we do with HW offloads.

Also IDK why those tests are considered such a huge ask. As I said most
vendors probably already have them, and so I'd guess do good distros.
So let's work together.

> The simple way to expose the latter would be to just have a second set

> of flags indicating the current configured state; and for that I guess

> we should at least agree what "enabled" means; and a conformance test

> would be a way to do this, of course.

> 

> I don't see why we can't do this in stages, though; start with the first

> set of flags ('implemented'), move on to the second one ('enabled'), and

> then to things like making the kernel react to the flags by rejecting

> insertion into devmaps for invalid interfaces...
Toke Høiland-Jørgensen Feb. 10, 2021, 10:52 p.m. UTC | #42
Jakub Kicinski <kuba@kernel.org> writes:

> On Wed, 10 Feb 2021 11:53:53 +0100 Toke Høiland-Jørgensen wrote:

>> >> I am a bit confused now. Did you mean validation tests of those XDP

>> >> flags, which I am working on or some other validation tests?

>> >> What should these tests verify? Can you please elaborate more on the

>> >> topic, please - just a few sentences how are you see it?  

>> >

>> > Conformance tests can be written for all features, whether they have 

>> > an explicit capability in the uAPI or not. But for those that do IMO

>> > the tests should be required.

>> >

>> > Let me give you an example. This set adds a bit that says Intel NICs 

>> > can do XDP_TX and XDP_REDIRECT, yet we both know of the Tx queue

>> > shenanigans. So can i40e do XDP_REDIRECT or can it not?

>> >

>> > If we have exhaustive conformance tests we can confidently answer that

>> > question. And the answer may not be "yes" or "no", it may actually be

>> > "we need more options because many implementations fall in between".

>> >

>> > I think readable (IOW not written in some insane DSL) tests can also 

>> > be useful for users who want to check which features their program /

>> > deployment will require.  

>> 

>> While I do agree that that kind of conformance test would be great, I

>> don't think it has to hold up this series (the perfect being the enemy

>> of the good, and all that). We have a real problem today that userspace

>> can't tell if a given driver implements, say, XDP_REDIRECT, and so

>> people try to use it and spend days wondering which black hole their

>> packets disappear into. And for things like container migration we need

>> to be able to predict whether a given host supports a feature *before*

>> we start the migration and try to use it.

>

> Unless you have a strong definition of what XDP_REDIRECT means the flag

> itself is not worth much. We're not talking about normal ethtool feature

> flags which are primarily stack-driven, XDP is implemented mostly by

> the driver, each vendor can do their own thing. Maybe I've seen one

> vendor incompatibility too many at my day job to hope for the best...


I'm totally on board with documenting what a feature means. E.g., for
XDP_REDIRECT, whether it's acceptable to fail the redirect in some
situations even when it's active, or if there should always be a
slow-path fallback.

But I disagree that the flag is worthless without it. People are running
into real issues with trying to run XDP_REDIRECT programs on a driver
that doesn't support it at all, and it's incredibly confusing. The
latest example popped up literally yesterday:

https://lore.kernel.org/xdp-newbies/CAM-scZPPeu44FeCPGO=Qz=03CrhhfB1GdJ8FNEpPqP_G27c6mQ@mail.gmail.com/

>> I view the feature flags as a list of features *implemented* by the

>> driver. Which should be pretty static in a given kernel, but may be

>> different than the features currently *enabled* on a given system (due

>> to, e.g., the TX queue stuff).

>

> Hm, maybe I'm not being clear enough. The way XDP_REDIRECT (your

> example) is implemented across drivers differs in a meaningful ways. 

> Hence the need for conformance testing. We don't have a golden SW

> standard to fall back on, like we do with HW offloads.


I'm not disagreeing that we need to harmonise what "implementing a
feature" means. Maybe I'm just not sure what you mean by "conformance
testing"? What would that look like, specifically? A script in selftest
that sets up a redirect between two interfaces that we tell people to
run? Or what? How would you catch, say, that issue where if a machine
has more CPUs than the NIC has TXQs things start falling apart?

> Also IDK why those tests are considered such a huge ask. As I said most

> vendors probably already have them, and so I'd guess do good distros.

> So let's work together.


I guess what I'm afraid of is that this will end up delaying or stalling
a fix for a long-standing issue (which is what I consider this series as
shown by the example above). Maybe you can alleviate that by expanding a
bit on what you mean?

-Toke
Jakub Kicinski Feb. 12, 2021, 1:26 a.m. UTC | #43
On Wed, 10 Feb 2021 23:52:39 +0100 Toke Høiland-Jørgensen wrote:
> Jakub Kicinski <kuba@kernel.org> writes:

> > On Wed, 10 Feb 2021 11:53:53 +0100 Toke Høiland-Jørgensen wrote:  

> >> While I do agree that that kind of conformance test would be great, I

> >> don't think it has to hold up this series (the perfect being the enemy

> >> of the good, and all that). We have a real problem today that userspace

> >> can't tell if a given driver implements, say, XDP_REDIRECT, and so

> >> people try to use it and spend days wondering which black hole their

> >> packets disappear into. And for things like container migration we need

> >> to be able to predict whether a given host supports a feature *before*

> >> we start the migration and try to use it.  

> >

> > Unless you have a strong definition of what XDP_REDIRECT means the flag

> > itself is not worth much. We're not talking about normal ethtool feature

> > flags which are primarily stack-driven, XDP is implemented mostly by

> > the driver, each vendor can do their own thing. Maybe I've seen one

> > vendor incompatibility too many at my day job to hope for the best...  

> 

> I'm totally on board with documenting what a feature means.


We're trying documentation in devlink etc. and it's not that great.
It's never clear and comprehensive enough, barely anyone reads it.

> E.g., for

> XDP_REDIRECT, whether it's acceptable to fail the redirect in some

> situations even when it's active, or if there should always be a

> slow-path fallback.

> 

> But I disagree that the flag is worthless without it. People are running

> into real issues with trying to run XDP_REDIRECT programs on a driver

> that doesn't support it at all, and it's incredibly confusing. The

> latest example popped up literally yesterday:

> 

> https://lore.kernel.org/xdp-newbies/CAM-scZPPeu44FeCPGO=Qz=03CrhhfB1GdJ8FNEpPqP_G27c6mQ@mail.gmail.com/


To help such confusion we'd actually have to validate the program
against the device caps. But perhaps I'm less concerned about a
newcomer not knowing how to use things and more concerned about
providing abstractions which will make programs dependably working
across vendors and HW generations.

> >> I view the feature flags as a list of features *implemented* by the

> >> driver. Which should be pretty static in a given kernel, but may be

> >> different than the features currently *enabled* on a given system (due

> >> to, e.g., the TX queue stuff).  

> >

> > Hm, maybe I'm not being clear enough. The way XDP_REDIRECT (your

> > example) is implemented across drivers differs in a meaningful ways. 

> > Hence the need for conformance testing. We don't have a golden SW

> > standard to fall back on, like we do with HW offloads.  

> 

> I'm not disagreeing that we need to harmonise what "implementing a

> feature" means. Maybe I'm just not sure what you mean by "conformance

> testing"? What would that look like, specifically? 


We developed a pretty good set of tests at my previous job for testing
driver XDP as well as checking that the offload conforms to the SW
behavior. I assume any vendor who takes quality seriously has
comprehensive XDP tests.

If those tests were upstream / common so that we could run them
against every implementation - the features which are supported by 
a driver fall out naturally out of the set of tests which passed.
And the structure of the capability API could be based on what the
tests need to know to make a SKIP vs FAIL decision.

Common tests would obviously also ease the validation burden, burden of
writing tests on vendors and make it far easier for new implementations
to be confidently submitted.

> A script in selftest that sets up a redirect between two interfaces

> that we tell people to run? Or what? How would you catch, say, that

> issue where if a machine has more CPUs than the NIC has TXQs things

> start falling apart?


selftests should be a good place, but I don't mind the location.
The point is having tests which anyone (vendors and users) can run
to test their platforms. One of the tests should indeed test if every
CPU in the platform can XDP_REDIRECT. Shouldn't it be a rather trivial
combination of tun/veth, mh and taskset?

> > Also IDK why those tests are considered such a huge ask. As I said most

> > vendors probably already have them, and so I'd guess do good distros.

> > So let's work together.  

> 

> I guess what I'm afraid of is that this will end up delaying or stalling

> a fix for a long-standing issue (which is what I consider this series as

> shown by the example above). Maybe you can alleviate that by expanding a

> bit on what you mean?


I hope what I wrote helps a little. I'm not good at explaining. 

Perhaps I had seen one too many vendor incompatibility to trust that
adding a driver API without a validation suite will result in something
usable in production settings.
Alexei Starovoitov Feb. 12, 2021, 2:05 a.m. UTC | #44
On Thu, Feb 11, 2021 at 5:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
>

> Perhaps I had seen one too many vendor incompatibility to trust that

> adding a driver API without a validation suite will result in something

> usable in production settings.


I agree with Jakub. I don't see how extra ethtool reporting will help.
Anyone who wants to know whether eth0 supports XDP_REDIRECT can already do so:
ethtool -S eth0 | grep xdp_redirect

I think converging on the same stat names across the drivers will make
the whole thing much more user friendly than new apis.
Marek Majtyka Feb. 12, 2021, 7:02 a.m. UTC | #45
On Fri, Feb 12, 2021 at 3:05 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>

> On Thu, Feb 11, 2021 at 5:26 PM Jakub Kicinski <kuba@kernel.org> wrote:

> >

> > Perhaps I had seen one too many vendor incompatibility to trust that

> > adding a driver API without a validation suite will result in something

> > usable in production settings.

>

> I agree with Jakub. I don't see how extra ethtool reporting will help.

> Anyone who wants to know whether eth0 supports XDP_REDIRECT can already do so:

> ethtool -S eth0 | grep xdp_redirect


Doing things right can never be treated as an addition. It is the
other way around. Option -S is for statistics and additionally it can
show something (AFAIR there wasn't such counter xdp_redirect, it must
be something new, thanks for the info). But  nevertheless it cannot
cover all needs IMO.

Some questions worth to consider:
Is this extra reporting function of statistics clearly documented in
ethtool? Is this going to be clearly documented? Would it be easier
for users/admins to find it?
What about zero copy? Can it be available via statistics, too?
What about drivers XDP transmit locking flag (latest request from Jesper)?






>

> I think converging on the same stat names across the drivers will make

> the whole thing much more user friendly than new apis.
Toke Høiland-Jørgensen Feb. 16, 2021, 2:30 p.m. UTC | #46
Marek Majtyka <alardam@gmail.com> writes:

> On Fri, Feb 12, 2021 at 3:05 AM Alexei Starovoitov

> <alexei.starovoitov@gmail.com> wrote:

>>

>> On Thu, Feb 11, 2021 at 5:26 PM Jakub Kicinski <kuba@kernel.org> wrote:

>> >

>> > Perhaps I had seen one too many vendor incompatibility to trust that

>> > adding a driver API without a validation suite will result in something

>> > usable in production settings.

>>

>> I agree with Jakub. I don't see how extra ethtool reporting will help.

>> Anyone who wants to know whether eth0 supports XDP_REDIRECT can already do so:

>> ethtool -S eth0 | grep xdp_redirect

>

> Doing things right can never be treated as an addition. It is the

> other way around. Option -S is for statistics and additionally it can

> show something (AFAIR there wasn't such counter xdp_redirect, it must

> be something new, thanks for the info). But  nevertheless it cannot

> cover all needs IMO.

>

> Some questions worth to consider:

> Is this extra reporting function of statistics clearly documented in

> ethtool? Is this going to be clearly documented? Would it be easier

> for users/admins to find it?

> What about zero copy? Can it be available via statistics, too?

> What about drivers XDP transmit locking flag (latest request from Jesper)?



There is no way the statistics is enough. And saying "just grep for
xdp_redirect in ethtool -S" is bordering on active hostility towards
users.

We need drivers to export explicit features so we can do things like:

- Explicitly reject attaching a program that tries to do xdp_redirect on
  an interface that doesn't support it.

- Prevent devices that don't implement ndo_xdp_xmit() from being
  inserted into a devmap (oh, and this is on thing you can't know at all
  from the statistics, BTW).

- Expose the features in a machine-readable format (like the ethtool
  flags in your patch) so applications can discover in a reliable way
  what is available and do proper fallback if features are missing.

I can accept that we need some kind of conformance test to define what
each flag means (which would be kinda like a selftest for the feature
flags), but we definitely need the feature flags themselves!

-Toke
diff mbox series

Patch

diff --git a/Documentation/networking/netdev-xdp-properties.rst b/Documentation/networking/netdev-xdp-properties.rst
new file mode 100644
index 000000000000..4a434a1c512b
--- /dev/null
+++ b/Documentation/networking/netdev-xdp-properties.rst
@@ -0,0 +1,42 @@ 
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Netdev XDP properties
+=====================
+
+ * XDP PROPERTIES FLAGS
+
+Following netdev xdp properties flags can be retrieve over netlink ethtool
+interface the same way as netdev feature flags. These properties flags are
+read only and cannot be change in the runtime.
+
+
+*  XDP_ABORTED
+
+This property informs if netdev supports xdp aborted action.
+
+*  XDP_DROP
+
+This property informs if netdev supports xdp drop action.
+
+*  XDP_PASS
+
+This property informs if netdev supports xdp pass action.
+
+*  XDP_TX
+
+This property informs if netdev supports xdp tx action.
+
+*  XDP_REDIRECT
+
+This property informs if netdev supports xdp redirect action.
+It assumes the all beforehand mentioned flags are enabled.
+
+*  XDP_ZEROCOPY
+
+This property informs if netdev driver supports xdp zero copy.
+It assumes the all beforehand mentioned flags are enabled.
+
+*  XDP_HW_OFFLOAD
+
+This property informs if netdev driver supports xdp hw oflloading.
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 52d1cc2bd8a7..2544c7f0e1b7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -43,6 +43,7 @@ 
 #include <net/xdp.h>
 
 #include <linux/netdev_features.h>
+#include <linux/xdp_properties.h>
 #include <linux/neighbour.h>
 #include <uapi/linux/netdevice.h>
 #include <uapi/linux/if_bonding.h>
@@ -2171,6 +2172,7 @@  struct net_device {
 
 	/* protected by rtnl_lock */
 	struct bpf_xdp_entity	xdp_state[__MAX_XDP_MODE];
+	xdp_properties_t	xdp_properties;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
diff --git a/include/linux/xdp_properties.h b/include/linux/xdp_properties.h
new file mode 100644
index 000000000000..c72c9bcc50de
--- /dev/null
+++ b/include/linux/xdp_properties.h
@@ -0,0 +1,53 @@ 
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Network device xdp properties.
+ */
+#ifndef _LINUX_XDP_PROPERTIES_H
+#define _LINUX_XDP_PROPERTIES_H
+
+#include <linux/types.h>
+#include <linux/bitops.h>
+#include <asm/byteorder.h>
+
+typedef u64 xdp_properties_t;
+
+enum {
+	XDP_F_ABORTED_BIT,
+	XDP_F_DROP_BIT,
+	XDP_F_PASS_BIT,
+	XDP_F_TX_BIT,
+	XDP_F_REDIRECT_BIT,
+	XDP_F_ZEROCOPY_BIT,
+	XDP_F_HW_OFFLOAD_BIT,
+
+	/*
+	 * Add your fresh new property above and remember to update
+	 * xdp_properties_strings [] in net/core/ethtool.c and maybe
+	 * some xdp_properties mask #defines below. Please also describe it
+	 * in Documentation/networking/xdp_properties.rst.
+	 */
+
+	/**/XDP_PROPERTIES_COUNT
+};
+
+#define __XDP_F_BIT(bit)	((xdp_properties_t)1 << (bit))
+#define __XDP_F(name)		__XDP_F_BIT(XDP_F_##name##_BIT)
+
+#define XDP_F_ABORTED		__XDP_F(ABORTED)
+#define XDP_F_DROP		__XDP_F(DROP)
+#define XDP_F_PASS		__XDP_F(PASS)
+#define XDP_F_TX		__XDP_F(TX)
+#define XDP_F_REDIRECT		__XDP_F(REDIRECT)
+#define XDP_F_ZEROCOPY		__XDP_F(ZEROCOPY)
+#define XDP_F_HW_OFFLOAD	__XDP_F(HW_OFFLOAD)
+
+#define XDP_F_BASIC		(XDP_F_ABORTED |	\
+				 XDP_F_DROP |		\
+				 XDP_F_PASS |		\
+				 XDP_F_TX)
+
+#define XDP_F_FULL		(XDP_F_BASIC | XDP_F_REDIRECT)
+
+#define XDP_F_FULL_ZC		(XDP_F_FULL | XDP_F_ZEROCOPY)
+
+#endif /* _LINUX_XDP_PROPERTIES_H */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 700ad5db7f5d..a9fabc1282cf 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -7,6 +7,7 @@ 
 #define __LINUX_NET_XDP_H__
 
 #include <linux/skbuff.h> /* skb_shared_info */
+#include <linux/xdp_properties.h>
 
 /**
  * DOC: XDP RX-queue information
@@ -255,6 +256,100 @@  struct xdp_attachment_info {
 	u32 flags;
 };
 
+#if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
+
+static __always_inline void
+xdp_set_aborted_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_ABORTED;
+}
+
+static __always_inline void
+xdp_set_pass_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_PASS;
+}
+
+static __always_inline void
+xdp_set_drop_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_DROP;
+}
+
+static __always_inline void
+xdp_set_tx_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_TX;
+}
+
+static __always_inline void
+xdp_set_redirect_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_REDIRECT;
+}
+
+static __always_inline void
+xdp_set_hw_offload_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_HW_OFFLOAD;
+}
+
+static __always_inline void
+xdp_set_basic_properties(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_BASIC;
+}
+
+static __always_inline void
+xdp_set_full_properties(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_FULL;
+}
+
+#else
+
+static __always_inline void
+xdp_set_aborted_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_pass_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_drop_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_tx_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_redirect_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_hw_offload_property(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_basic_properties(xdp_properties_t *properties)
+{
+}
+
+static __always_inline void
+xdp_set_full_properties(xdp_properties_t *properties)
+{
+}
+
+#endif
+
 struct netdev_bpf;
 bool xdp_attachment_flags_ok(struct xdp_attachment_info *info,
 			     struct netdev_bpf *bpf);
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 4e295541e396..48a3b6d165c7 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -8,6 +8,7 @@ 
 
 #include <net/xdp_sock.h>
 #include <net/xsk_buff_pool.h>
+#include <linux/xdp_properties.h>
 
 #ifdef CONFIG_XDP_SOCKETS
 
@@ -117,6 +118,11 @@  static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool,
 	xp_dma_sync_for_device(pool, dma, size);
 }
 
+static inline void xsk_set_zc_property(xdp_properties_t *properties)
+{
+	*properties |= XDP_F_ZEROCOPY;
+}
+
 #else
 
 static inline void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
@@ -242,6 +248,10 @@  static inline void xsk_buff_raw_dma_sync_for_device(struct xsk_buff_pool *pool,
 {
 }
 
+static inline void xsk_set_zc_property(xdp_properties_t *properties)
+{
+}
+
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_DRV_H */
diff --git a/include/uapi/linux/ethtool.h b/include/uapi/linux/ethtool.h
index 9ca87bc73c44..dfcb0e2c98b2 100644
--- a/include/uapi/linux/ethtool.h
+++ b/include/uapi/linux/ethtool.h
@@ -688,6 +688,7 @@  enum ethtool_stringset {
 	ETH_SS_TS_TX_TYPES,
 	ETH_SS_TS_RX_FILTERS,
 	ETH_SS_UDP_TUNNEL_TYPES,
+	ETH_SS_XDP_PROPERTIES,
 
 	/* add new constants above here */
 	ETH_SS_COUNT
diff --git a/include/uapi/linux/xdp_properties.h b/include/uapi/linux/xdp_properties.h
new file mode 100644
index 000000000000..e85be03eb707
--- /dev/null
+++ b/include/uapi/linux/xdp_properties.h
@@ -0,0 +1,32 @@ 
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+/*
+ * Copyright (c) 2020 Intel
+ */
+
+#ifndef __UAPI_LINUX_XDP_PROPERTIES__
+#define __UAPI_LINUX_XDP_PROPERTIES__
+
+/* ETH_GSTRING_LEN define is needed. */
+#include <linux/ethtool.h>
+
+#define XDP_PROPERTIES_ABORTED_STR	"xdp-aborted"
+#define XDP_PROPERTIES_DROP_STR		"xdp-drop"
+#define XDP_PROPERTIES_PASS_STR		"xdp-pass"
+#define XDP_PROPERTIES_TX_STR		"xdp-tx"
+#define XDP_PROPERTIES_REDIRECT_STR	"xdp-redirect"
+#define XDP_PROPERTIES_ZEROCOPY_STR	"xdp-zerocopy"
+#define XDP_PROPERTIES_HW_OFFLOAD_STR	"xdp-hw-offload"
+
+#define	DECLARE_XDP_PROPERTIES_TABLE(name)		\
+	const char name[][ETH_GSTRING_LEN] = {		\
+		XDP_PROPERTIES_ABORTED_STR,		\
+		XDP_PROPERTIES_DROP_STR,		\
+		XDP_PROPERTIES_PASS_STR,		\
+		XDP_PROPERTIES_TX_STR,			\
+		XDP_PROPERTIES_REDIRECT_STR,		\
+		XDP_PROPERTIES_ZEROCOPY_STR,		\
+		XDP_PROPERTIES_HW_OFFLOAD_STR,		\
+	}
+
+#endif  /* __UAPI_LINUX_XDP_PROPERTIES__ */
diff --git a/net/ethtool/common.c b/net/ethtool/common.c
index 24036e3055a1..8f15f96b8922 100644
--- a/net/ethtool/common.c
+++ b/net/ethtool/common.c
@@ -4,6 +4,7 @@ 
 #include <linux/net_tstamp.h>
 #include <linux/phy.h>
 #include <linux/rtnetlink.h>
+#include <uapi/linux/xdp_properties.h>
 
 #include "common.h"
 
@@ -283,6 +284,16 @@  const char udp_tunnel_type_names[][ETH_GSTRING_LEN] = {
 static_assert(ARRAY_SIZE(udp_tunnel_type_names) ==
 	      __ETHTOOL_UDP_TUNNEL_TYPE_CNT);
 
+const char xdp_properties_strings[XDP_PROPERTIES_COUNT][ETH_GSTRING_LEN] = {
+	[XDP_F_ABORTED_BIT] =		XDP_PROPERTIES_ABORTED_STR,
+	[XDP_F_DROP_BIT] =		XDP_PROPERTIES_DROP_STR,
+	[XDP_F_PASS_BIT] =		XDP_PROPERTIES_PASS_STR,
+	[XDP_F_TX_BIT] =		XDP_PROPERTIES_TX_STR,
+	[XDP_F_REDIRECT_BIT] =		XDP_PROPERTIES_REDIRECT_STR,
+	[XDP_F_ZEROCOPY_BIT] =		XDP_PROPERTIES_ZEROCOPY_STR,
+	[XDP_F_HW_OFFLOAD_BIT] =	XDP_PROPERTIES_HW_OFFLOAD_STR,
+};
+
 /* return false if legacy contained non-0 deprecated fields
  * maxtxpkt/maxrxpkt. rest of ksettings always updated
  */
diff --git a/net/ethtool/common.h b/net/ethtool/common.h
index 3d9251c95a8b..85a35f8781eb 100644
--- a/net/ethtool/common.h
+++ b/net/ethtool/common.h
@@ -5,8 +5,10 @@ 
 
 #include <linux/netdevice.h>
 #include <linux/ethtool.h>
+#include <linux/xdp_properties.h>
 
 #define ETHTOOL_DEV_FEATURE_WORDS	DIV_ROUND_UP(NETDEV_FEATURE_COUNT, 32)
+#define ETHTOOL_XDP_PROPERTIES_WORDS	DIV_ROUND_UP(XDP_PROPERTIES_COUNT, 32)
 
 /* compose link mode index from speed, type and duplex */
 #define ETHTOOL_LINK_MODE(speed, type, duplex) \
@@ -22,6 +24,8 @@  extern const char
 tunable_strings[__ETHTOOL_TUNABLE_COUNT][ETH_GSTRING_LEN];
 extern const char
 phy_tunable_strings[__ETHTOOL_PHY_TUNABLE_COUNT][ETH_GSTRING_LEN];
+extern const char
+xdp_properties_strings[XDP_PROPERTIES_COUNT][ETH_GSTRING_LEN];
 extern const char link_mode_names[][ETH_GSTRING_LEN];
 extern const char netif_msg_class_names[][ETH_GSTRING_LEN];
 extern const char wol_mode_names[][ETH_GSTRING_LEN];
diff --git a/net/ethtool/strset.c b/net/ethtool/strset.c
index 0baad0ce1832..684e751b31a9 100644
--- a/net/ethtool/strset.c
+++ b/net/ethtool/strset.c
@@ -80,6 +80,11 @@  static const struct strset_info info_template[] = {
 		.count		= __ETHTOOL_UDP_TUNNEL_TYPE_CNT,
 		.strings	= udp_tunnel_type_names,
 	},
+	[ETH_SS_XDP_PROPERTIES] = {
+		.per_dev	= false,
+		.count		= ARRAY_SIZE(xdp_properties_strings),
+		.strings	= xdp_properties_strings,
+	},
 };
 
 struct strset_req_info {