mbox series

[RFC,v2,0/7] tun: Introduce virtio-net hashing feature

Message ID 20231015141644.260646-1-akihiko.odaki@daynix.com
Headers show
Series tun: Introduce virtio-net hashing feature | expand

Message

Akihiko Odaki Oct. 15, 2023, 2:16 p.m. UTC
virtio-net have two usage of hashes: one is RSS and another is hash
reporting. Conventionally the hash calculation was done by the VMM.
However, computing the hash after the queue was chosen defeats the
purpose of RSS.

Another approach is to use eBPF steering program. This approach has
another downside: it cannot report the calculated hash due to the
restrictive nature of eBPF.

Extend the steering program feature by introducing a dedicated program
type: BPF_PROG_TYPE_VNET_HASH. This program type is capable to report
the hash value and the queue to use at the same time.

This is a rewrite of a RFC patch series submitted by Yuri Benditovich that
incorporates feedbacks for the series and V1 of this series:
https://lore.kernel.org/lkml/20210112194143.1494-1-yuri.benditovich@daynix.com/

QEMU patched to use this new feature is available at:
https://github.com/daynix/qemu/tree/akihikodaki/bpf

The QEMU patches will soon be submitted to the upstream as RFC too.

V1 -> V2:
  Changed to introduce a new BPF program type.

Akihiko Odaki (7):
  bpf: Introduce BPF_PROG_TYPE_VNET_HASH
  bpf: Add vnet_hash members to __sk_buff
  skbuff: Introduce SKB_EXT_TUN_VNET_HASH
  virtio_net: Add virtio_net_hdr_v1_hash_from_skb()
  tun: Support BPF_PROG_TYPE_VNET_HASH
  selftests/bpf: Test BPF_PROG_TYPE_VNET_HASH
  vhost_net: Support VIRTIO_NET_F_HASH_REPORT

 Documentation/bpf/bpf_prog_run.rst            |   1 +
 Documentation/bpf/libbpf/program_types.rst    |   2 +
 drivers/net/tun.c                             | 158 +++++--
 drivers/vhost/net.c                           |  16 +-
 include/linux/bpf_types.h                     |   2 +
 include/linux/filter.h                        |   7 +
 include/linux/skbuff.h                        |  10 +
 include/linux/virtio_net.h                    |  22 +
 include/uapi/linux/bpf.h                      |   5 +
 kernel/bpf/verifier.c                         |   6 +
 net/core/filter.c                             |  86 +++-
 net/core/skbuff.c                             |   3 +
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/lib/bpf/libbpf.c                        |   2 +
 tools/testing/selftests/bpf/config            |   1 +
 tools/testing/selftests/bpf/config.aarch64    |   1 -
 .../selftests/bpf/prog_tests/vnet_hash.c      | 385 ++++++++++++++++++
 tools/testing/selftests/bpf/progs/vnet_hash.c |  16 +
 18 files changed, 681 insertions(+), 47 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/vnet_hash.c
 create mode 100644 tools/testing/selftests/bpf/progs/vnet_hash.c

Comments

Akihiko Odaki Oct. 15, 2023, 5:10 p.m. UTC | #1
On 2023/10/16 1:07, Alexei Starovoitov wrote:
> On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 0448700890f7..298634556fab 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -988,6 +988,7 @@ enum bpf_prog_type {
>>          BPF_PROG_TYPE_SK_LOOKUP,
>>          BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
>>          BPF_PROG_TYPE_NETFILTER,
>> +       BPF_PROG_TYPE_VNET_HASH,
> 
> Sorry, we do not add new stable program types anymore.
> 
>> @@ -6111,6 +6112,10 @@ struct __sk_buff {
>>          __u8  tstamp_type;
>>          __u32 :24;              /* Padding, future use. */
>>          __u64 hwtstamp;
>> +
>> +       __u32 vnet_hash_value;
>> +       __u16 vnet_hash_report;
>> +       __u16 vnet_rss_queue;
>>   };
> 
> we also do not add anything to uapi __sk_buff.
> 
>> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
>> +       .get_func_proto         = sk_filter_func_proto,
>> +       .is_valid_access        = sk_filter_is_valid_access,
>> +       .convert_ctx_access     = bpf_convert_ctx_access,
>> +       .gen_ld_abs             = bpf_gen_ld_abs,
>> +};
> 
> and we don't do ctx rewrites like this either.
> 
> Please see how hid-bpf and cgroup rstat are hooking up bpf
> in _unstable_ way.

Can you describe what "stable" and "unstable" mean here? I'm new to BPF 
and I'm worried if it may mean the interface stability.

Let me describe the context. QEMU bundles an eBPF program that is used 
for the "eBPF steering program" feature of tun. Now I'm proposing to 
extend the feature to allow to return some values to the userspace and 
vhost_net. As such, the extension needs to be done in a way that ensures 
interface stability.
Alexei Starovoitov Oct. 16, 2023, 11:53 p.m. UTC | #2
On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2023/10/16 1:07, Alexei Starovoitov wrote:
> > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >>
> >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> >> index 0448700890f7..298634556fab 100644
> >> --- a/include/uapi/linux/bpf.h
> >> +++ b/include/uapi/linux/bpf.h
> >> @@ -988,6 +988,7 @@ enum bpf_prog_type {
> >>          BPF_PROG_TYPE_SK_LOOKUP,
> >>          BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> >>          BPF_PROG_TYPE_NETFILTER,
> >> +       BPF_PROG_TYPE_VNET_HASH,
> >
> > Sorry, we do not add new stable program types anymore.
> >
> >> @@ -6111,6 +6112,10 @@ struct __sk_buff {
> >>          __u8  tstamp_type;
> >>          __u32 :24;              /* Padding, future use. */
> >>          __u64 hwtstamp;
> >> +
> >> +       __u32 vnet_hash_value;
> >> +       __u16 vnet_hash_report;
> >> +       __u16 vnet_rss_queue;
> >>   };
> >
> > we also do not add anything to uapi __sk_buff.
> >
> >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
> >> +       .get_func_proto         = sk_filter_func_proto,
> >> +       .is_valid_access        = sk_filter_is_valid_access,
> >> +       .convert_ctx_access     = bpf_convert_ctx_access,
> >> +       .gen_ld_abs             = bpf_gen_ld_abs,
> >> +};
> >
> > and we don't do ctx rewrites like this either.
> >
> > Please see how hid-bpf and cgroup rstat are hooking up bpf
> > in _unstable_ way.
>
> Can you describe what "stable" and "unstable" mean here? I'm new to BPF
> and I'm worried if it may mean the interface stability.
>
> Let me describe the context. QEMU bundles an eBPF program that is used
> for the "eBPF steering program" feature of tun. Now I'm proposing to
> extend the feature to allow to return some values to the userspace and
> vhost_net. As such, the extension needs to be done in a way that ensures
> interface stability.

bpf is not an option then.
we do not add stable bpf program types or hooks any more.
If a kernel subsystem wants to use bpf it needs to accept the fact
that such bpf extensibility will be unstable and subsystem maintainers
can decide to remove such bpf support in the future.
Willem de Bruijn Oct. 17, 2023, 12:36 a.m. UTC | #3
On Mon, Oct 16, 2023 at 7:53 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >
> > On 2023/10/16 1:07, Alexei Starovoitov wrote:
> > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> > >>
> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> index 0448700890f7..298634556fab 100644
> > >> --- a/include/uapi/linux/bpf.h
> > >> +++ b/include/uapi/linux/bpf.h
> > >> @@ -988,6 +988,7 @@ enum bpf_prog_type {
> > >>          BPF_PROG_TYPE_SK_LOOKUP,
> > >>          BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> > >>          BPF_PROG_TYPE_NETFILTER,
> > >> +       BPF_PROG_TYPE_VNET_HASH,
> > >
> > > Sorry, we do not add new stable program types anymore.
> > >
> > >> @@ -6111,6 +6112,10 @@ struct __sk_buff {
> > >>          __u8  tstamp_type;
> > >>          __u32 :24;              /* Padding, future use. */
> > >>          __u64 hwtstamp;
> > >> +
> > >> +       __u32 vnet_hash_value;
> > >> +       __u16 vnet_hash_report;
> > >> +       __u16 vnet_rss_queue;
> > >>   };
> > >
> > > we also do not add anything to uapi __sk_buff.
> > >
> > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
> > >> +       .get_func_proto         = sk_filter_func_proto,
> > >> +       .is_valid_access        = sk_filter_is_valid_access,
> > >> +       .convert_ctx_access     = bpf_convert_ctx_access,
> > >> +       .gen_ld_abs             = bpf_gen_ld_abs,
> > >> +};
> > >
> > > and we don't do ctx rewrites like this either.
> > >
> > > Please see how hid-bpf and cgroup rstat are hooking up bpf
> > > in _unstable_ way.
> >
> > Can you describe what "stable" and "unstable" mean here? I'm new to BPF
> > and I'm worried if it may mean the interface stability.
> >
> > Let me describe the context. QEMU bundles an eBPF program that is used
> > for the "eBPF steering program" feature of tun. Now I'm proposing to
> > extend the feature to allow to return some values to the userspace and
> > vhost_net. As such, the extension needs to be done in a way that ensures
> > interface stability.
>
> bpf is not an option then.
> we do not add stable bpf program types or hooks any more.
> If a kernel subsystem wants to use bpf it needs to accept the fact
> that such bpf extensibility will be unstable and subsystem maintainers
> can decide to remove such bpf support in the future.

Based on hooks for tracepoints and kfuncs, correct?

Perhaps the existing stable flow dissector type is extensible to
optionally compute the hash and report hash and hash type. Else we
probably should revisit the previous version of this series.
Jason Wang Oct. 17, 2023, 2:38 a.m. UTC | #4
On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> >
> > On 2023/10/16 1:07, Alexei Starovoitov wrote:
> > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> > >>
> > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > >> index 0448700890f7..298634556fab 100644
> > >> --- a/include/uapi/linux/bpf.h
> > >> +++ b/include/uapi/linux/bpf.h
> > >> @@ -988,6 +988,7 @@ enum bpf_prog_type {
> > >>          BPF_PROG_TYPE_SK_LOOKUP,
> > >>          BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> > >>          BPF_PROG_TYPE_NETFILTER,
> > >> +       BPF_PROG_TYPE_VNET_HASH,
> > >
> > > Sorry, we do not add new stable program types anymore.
> > >
> > >> @@ -6111,6 +6112,10 @@ struct __sk_buff {
> > >>          __u8  tstamp_type;
> > >>          __u32 :24;              /* Padding, future use. */
> > >>          __u64 hwtstamp;
> > >> +
> > >> +       __u32 vnet_hash_value;
> > >> +       __u16 vnet_hash_report;
> > >> +       __u16 vnet_rss_queue;
> > >>   };
> > >
> > > we also do not add anything to uapi __sk_buff.
> > >
> > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
> > >> +       .get_func_proto         = sk_filter_func_proto,
> > >> +       .is_valid_access        = sk_filter_is_valid_access,
> > >> +       .convert_ctx_access     = bpf_convert_ctx_access,
> > >> +       .gen_ld_abs             = bpf_gen_ld_abs,
> > >> +};
> > >
> > > and we don't do ctx rewrites like this either.
> > >
> > > Please see how hid-bpf and cgroup rstat are hooking up bpf
> > > in _unstable_ way.
> >
> > Can you describe what "stable" and "unstable" mean here? I'm new to BPF
> > and I'm worried if it may mean the interface stability.
> >
> > Let me describe the context. QEMU bundles an eBPF program that is used
> > for the "eBPF steering program" feature of tun. Now I'm proposing to
> > extend the feature to allow to return some values to the userspace and
> > vhost_net. As such, the extension needs to be done in a way that ensures
> > interface stability.
>
> bpf is not an option then.
> we do not add stable bpf program types or hooks any more.

Does this mean eBPF could not be used for any new use cases other than
the existing ones?

> If a kernel subsystem wants to use bpf it needs to accept the fact
> that such bpf extensibility will be unstable and subsystem maintainers
> can decide to remove such bpf support in the future.

I don't see how it is different from the existing ones.

Thanks

>
Alexei Starovoitov Oct. 17, 2023, 7:03 p.m. UTC | #5
On Mon, Oct 16, 2023 at 7:38 PM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> > >
> > > On 2023/10/16 1:07, Alexei Starovoitov wrote:
> > > > On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
> > > >>
> > > >> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > >> index 0448700890f7..298634556fab 100644
> > > >> --- a/include/uapi/linux/bpf.h
> > > >> +++ b/include/uapi/linux/bpf.h
> > > >> @@ -988,6 +988,7 @@ enum bpf_prog_type {
> > > >>          BPF_PROG_TYPE_SK_LOOKUP,
> > > >>          BPF_PROG_TYPE_SYSCALL, /* a program that can execute syscalls */
> > > >>          BPF_PROG_TYPE_NETFILTER,
> > > >> +       BPF_PROG_TYPE_VNET_HASH,
> > > >
> > > > Sorry, we do not add new stable program types anymore.
> > > >
> > > >> @@ -6111,6 +6112,10 @@ struct __sk_buff {
> > > >>          __u8  tstamp_type;
> > > >>          __u32 :24;              /* Padding, future use. */
> > > >>          __u64 hwtstamp;
> > > >> +
> > > >> +       __u32 vnet_hash_value;
> > > >> +       __u16 vnet_hash_report;
> > > >> +       __u16 vnet_rss_queue;
> > > >>   };
> > > >
> > > > we also do not add anything to uapi __sk_buff.
> > > >
> > > >> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
> > > >> +       .get_func_proto         = sk_filter_func_proto,
> > > >> +       .is_valid_access        = sk_filter_is_valid_access,
> > > >> +       .convert_ctx_access     = bpf_convert_ctx_access,
> > > >> +       .gen_ld_abs             = bpf_gen_ld_abs,
> > > >> +};
> > > >
> > > > and we don't do ctx rewrites like this either.
> > > >
> > > > Please see how hid-bpf and cgroup rstat are hooking up bpf
> > > > in _unstable_ way.
> > >
> > > Can you describe what "stable" and "unstable" mean here? I'm new to BPF
> > > and I'm worried if it may mean the interface stability.
> > >
> > > Let me describe the context. QEMU bundles an eBPF program that is used
> > > for the "eBPF steering program" feature of tun. Now I'm proposing to
> > > extend the feature to allow to return some values to the userspace and
> > > vhost_net. As such, the extension needs to be done in a way that ensures
> > > interface stability.
> >
> > bpf is not an option then.
> > we do not add stable bpf program types or hooks any more.
>
> Does this mean eBPF could not be used for any new use cases other than
> the existing ones?

It means that any new use of bpf has to be unstable for the time being.

> > If a kernel subsystem wants to use bpf it needs to accept the fact
> > that such bpf extensibility will be unstable and subsystem maintainers
> > can decide to remove such bpf support in the future.
>
> I don't see how it is different from the existing ones.

Can we remove BPF_CGROUP_RUN_PROG_INET_INGRESS hook along
with BPF_PROG_TYPE_CGROUP_SKB program type?
Obviously not.
We can refactor it. We can move it around, but not remove.
That's the difference in stable vs unstable.
Akihiko Odaki Nov. 18, 2023, 10:38 a.m. UTC | #6
On 2023/10/18 4:19, Akihiko Odaki wrote:
> On 2023/10/18 4:03, Alexei Starovoitov wrote:
>> On Mon, Oct 16, 2023 at 7:38 PM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On Tue, Oct 17, 2023 at 7:53 AM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>>
>>>> On Sun, Oct 15, 2023 at 10:10 AM Akihiko Odaki 
>>>> <akihiko.odaki@daynix.com> wrote:
>>>>>
>>>>> On 2023/10/16 1:07, Alexei Starovoitov wrote:
>>>>>> On Sun, Oct 15, 2023 at 7:17 AM Akihiko Odaki 
>>>>>> <akihiko.odaki@daynix.com> wrote:
>>>>>>>
>>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>>> index 0448700890f7..298634556fab 100644
>>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>>> @@ -988,6 +988,7 @@ enum bpf_prog_type {
>>>>>>>           BPF_PROG_TYPE_SK_LOOKUP,
>>>>>>>           BPF_PROG_TYPE_SYSCALL, /* a program that can execute 
>>>>>>> syscalls */
>>>>>>>           BPF_PROG_TYPE_NETFILTER,
>>>>>>> +       BPF_PROG_TYPE_VNET_HASH,
>>>>>>
>>>>>> Sorry, we do not add new stable program types anymore.
>>>>>>
>>>>>>> @@ -6111,6 +6112,10 @@ struct __sk_buff {
>>>>>>>           __u8  tstamp_type;
>>>>>>>           __u32 :24;              /* Padding, future use. */
>>>>>>>           __u64 hwtstamp;
>>>>>>> +
>>>>>>> +       __u32 vnet_hash_value;
>>>>>>> +       __u16 vnet_hash_report;
>>>>>>> +       __u16 vnet_rss_queue;
>>>>>>>    };
>>>>>>
>>>>>> we also do not add anything to uapi __sk_buff.
>>>>>>
>>>>>>> +const struct bpf_verifier_ops vnet_hash_verifier_ops = {
>>>>>>> +       .get_func_proto         = sk_filter_func_proto,
>>>>>>> +       .is_valid_access        = sk_filter_is_valid_access,
>>>>>>> +       .convert_ctx_access     = bpf_convert_ctx_access,
>>>>>>> +       .gen_ld_abs             = bpf_gen_ld_abs,
>>>>>>> +};
>>>>>>
>>>>>> and we don't do ctx rewrites like this either.
>>>>>>
>>>>>> Please see how hid-bpf and cgroup rstat are hooking up bpf
>>>>>> in _unstable_ way.
>>>>>
>>>>> Can you describe what "stable" and "unstable" mean here? I'm new to 
>>>>> BPF
>>>>> and I'm worried if it may mean the interface stability.
>>>>>
>>>>> Let me describe the context. QEMU bundles an eBPF program that is used
>>>>> for the "eBPF steering program" feature of tun. Now I'm proposing to
>>>>> extend the feature to allow to return some values to the userspace and
>>>>> vhost_net. As such, the extension needs to be done in a way that 
>>>>> ensures
>>>>> interface stability.
>>>>
>>>> bpf is not an option then.
>>>> we do not add stable bpf program types or hooks any more.
>>>
>>> Does this mean eBPF could not be used for any new use cases other than
>>> the existing ones?
>>
>> It means that any new use of bpf has to be unstable for the time being.
> 
> Can you elaborate more about making new use unstable "for the time 
> being?" Is it a temporary situation? What is the rationale for that? 
> Such information will help devise a solution that is best for both of 
> the BPF and network subsystems.
> 
> I would also appreciate if you have some documentation or link to 
> relevant discussions on the mailing list. That will avoid having same 
> discussion you may already have done in the past.

Hi,

The discussion has been stuck for a month, but I'd still like to 
continue figuring out the way best for the whole kernel to implement 
this feature. I summarize the current situation and question that needs 
to be answered before push this forward:

The goal of this RFC is to allow to report hash values calculated with 
eBPF steering program. It's essentially just to report 4 bytes from the 
kernel to the userspace.

Unfortunately, however, it is not acceptable for the BPF subsystem 
because the "stable" BPF is completely fixed these days. The 
"unstable/kfunc" BPF is an alternative, but the eBPF program will be 
shipped with a portable userspace program (QEMU)[1] so the lack of 
interface stability is not tolerable.

Another option is to hardcode the algorithm that was conventionally 
implemented with eBPF steering program in the kernel[2]. It is possible 
because the algorithm strictly follows the virtio-net specification[3]. 
However, there are proposals to add different algorithms to the 
specification[4], and hardcoding the algorithm to the kernel will 
require to add more UAPIs and code each time such a specification change 
happens, which is not good for tuntap.

In short, the proposed feature requires to make either of three compromises:

1. Compromise on the BPF side: Relax the "stable" BPF feature freeze 
once and allow eBPF steering program to report 4 more bytes to the kernel.

2. Compromise on the tuntap side: Implement the algorithm to the kernel, 
and abandon the capability to update the algorithm without changing the 
kernel.

IMHO, I think it's better to make a compromise on the BPF side (option 
1). We should minimize the total UAPI changes in the whole kernel, and 
option 1 is much superior in that sense.

Yet I have to note that such a compromise on the BPF side can risk the 
"stable" BPF feature freeze fragile and let other people complain like 
"you allowed to change stable BPF for this, why do you reject [some 
other request to change stable BPF]?" It is bad for BPF maintainers. (I 
can imagine that introducing and maintaining widely different BPF 
interfaces is too much burden.) And, of course, this requires an 
approval from BPF maintainers.

So I'd like to ask you that which of these compromises you think worse. 
Please also tell me if you have another idea.

Regards,
Akihiko Odaki

[1] https://qemu.readthedocs.io/en/v8.1.0/devel/ebpf_rss.html
[2] 
https://lore.kernel.org/all/20231008052101.144422-1-akihiko.odaki@daynix.com/
[3] 
https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2400003
[4] 
https://lore.kernel.org/all/CACGkMEuBbGKssxNv5AfpaPpWQfk2BHR83rM5AHXN-YVMf2NvpQ@mail.gmail.com/
Song Liu Nov. 18, 2023, 4:08 p.m. UTC | #7
Hi,

A few rookie questions below.

On Sat, Nov 18, 2023 at 2:39 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2023/10/18 4:19, Akihiko Odaki wrote:
> > On 2023/10/18 4:03, Alexei Starovoitov wrote:
[...]
> >
> > I would also appreciate if you have some documentation or link to
> > relevant discussions on the mailing list. That will avoid having same
> > discussion you may already have done in the past.
>
> Hi,
>
> The discussion has been stuck for a month, but I'd still like to
> continue figuring out the way best for the whole kernel to implement
> this feature. I summarize the current situation and question that needs
> to be answered before push this forward:
>
> The goal of this RFC is to allow to report hash values calculated with
> eBPF steering program. It's essentially just to report 4 bytes from the
> kernel to the userspace.

AFAICT, the proposed design is to have BPF generate some data
(namely hash, but could be anything afaict) and consume it from
user space. Instead of updating __sk_buff, can we have the user
space to fetch the data/hash from a bpf map? If this is an option,
I guess we can implement the same feature with BPF tracing
programs?

>
> Unfortunately, however, it is not acceptable for the BPF subsystem
> because the "stable" BPF is completely fixed these days. The
> "unstable/kfunc" BPF is an alternative, but the eBPF program will be
> shipped with a portable userspace program (QEMU)[1] so the lack of
> interface stability is not tolerable.

bpf kfuncs are as stable as exported symbols. Is exported symbols
like stability enough for the use case? (I would assume yes.)

>
> Another option is to hardcode the algorithm that was conventionally
> implemented with eBPF steering program in the kernel[2]. It is possible
> because the algorithm strictly follows the virtio-net specification[3].
> However, there are proposals to add different algorithms to the
> specification[4], and hardcoding the algorithm to the kernel will
> require to add more UAPIs and code each time such a specification change
> happens, which is not good for tuntap.

The requirement looks similar to hid-bpf. Could you explain why that
model is not enough? HID also requires some stability AFAICT.

Thanks,
Song

>
> In short, the proposed feature requires to make either of three compromises:
>
> 1. Compromise on the BPF side: Relax the "stable" BPF feature freeze
> once and allow eBPF steering program to report 4 more bytes to the kernel.
>
> 2. Compromise on the tuntap side: Implement the algorithm to the kernel,
> and abandon the capability to update the algorithm without changing the
> kernel.
>
> IMHO, I think it's better to make a compromise on the BPF side (option
> 1). We should minimize the total UAPI changes in the whole kernel, and
> option 1 is much superior in that sense.
>
> Yet I have to note that such a compromise on the BPF side can risk the
> "stable" BPF feature freeze fragile and let other people complain like
> "you allowed to change stable BPF for this, why do you reject [some
> other request to change stable BPF]?" It is bad for BPF maintainers. (I
> can imagine that introducing and maintaining widely different BPF
> interfaces is too much burden.) And, of course, this requires an
> approval from BPF maintainers.
>
> So I'd like to ask you that which of these compromises you think worse.
> Please also tell me if you have another idea.
>
> Regards,
> Akihiko Odaki
>
> [1] https://qemu.readthedocs.io/en/v8.1.0/devel/ebpf_rss.html
> [2]
> https://lore.kernel.org/all/20231008052101.144422-1-akihiko.odaki@daynix.com/
> [3]
> https://docs.oasis-open.org/virtio/virtio/v1.2/csd01/virtio-v1.2-csd01.html#x1-2400003
> [4]
> https://lore.kernel.org/all/CACGkMEuBbGKssxNv5AfpaPpWQfk2BHR83rM5AHXN-YVMf2NvpQ@mail.gmail.com/
Akihiko Odaki Nov. 19, 2023, 8:03 a.m. UTC | #8
On 2023/11/19 1:08, Song Liu wrote:
> Hi,
> 
> A few rookie questions below.

Thanks for questions.

> 
> On Sat, Nov 18, 2023 at 2:39 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2023/10/18 4:19, Akihiko Odaki wrote:
>>> On 2023/10/18 4:03, Alexei Starovoitov wrote:
> [...]
>>>
>>> I would also appreciate if you have some documentation or link to
>>> relevant discussions on the mailing list. That will avoid having same
>>> discussion you may already have done in the past.
>>
>> Hi,
>>
>> The discussion has been stuck for a month, but I'd still like to
>> continue figuring out the way best for the whole kernel to implement
>> this feature. I summarize the current situation and question that needs
>> to be answered before push this forward:
>>
>> The goal of this RFC is to allow to report hash values calculated with
>> eBPF steering program. It's essentially just to report 4 bytes from the
>> kernel to the userspace.
> 
> AFAICT, the proposed design is to have BPF generate some data
> (namely hash, but could be anything afaict) and consume it from
> user space. Instead of updating __sk_buff, can we have the user
> space to fetch the data/hash from a bpf map? If this is an option,
> I guess we can implement the same feature with BPF tracing
> programs?

Unfortunately no. The communication with the userspace can be done with 
two different means:
- usual socket read/write
- vhost for direct interaction with a KVM guest

The BPF map may be a valid option for socket read/write, but it is not 
for vhost. In-kernel vhost may fetch hash from the BPF map, but I guess 
it's not a standard way to have an interaction between the kernel code 
and a BPF program.

> 
>>
>> Unfortunately, however, it is not acceptable for the BPF subsystem
>> because the "stable" BPF is completely fixed these days. The
>> "unstable/kfunc" BPF is an alternative, but the eBPF program will be
>> shipped with a portable userspace program (QEMU)[1] so the lack of
>> interface stability is not tolerable.
> 
> bpf kfuncs are as stable as exported symbols. Is exported symbols
> like stability enough for the use case? (I would assume yes.)
> 
>>
>> Another option is to hardcode the algorithm that was conventionally
>> implemented with eBPF steering program in the kernel[2]. It is possible
>> because the algorithm strictly follows the virtio-net specification[3].
>> However, there are proposals to add different algorithms to the
>> specification[4], and hardcoding the algorithm to the kernel will
>> require to add more UAPIs and code each time such a specification change
>> happens, which is not good for tuntap.
> 
> The requirement looks similar to hid-bpf. Could you explain why that
> model is not enough? HID also requires some stability AFAICT.

I have little knowledge with hid-bpf, but I assume it is more like a 
"safe" kernel module; in my understanding, it affects the system state 
and is intended to be loaded with some kind of a system daemon. It is 
fine to have the same lifecycle with the kernel for such a BPF program; 
whenever the kernel is updated, the distributor can recompile the BPF 
program with the new kernel headers and ship it along with the kernel 
just as like a kernel module.

In contrast, our intended use case is more like a normal application. 
So, for example, a user may download a container and run QEMU (including 
the BPF program) installed in the container. As such, it is nice if the 
ABI is stable across kernel releases, but it is not guaranteed for 
kfuncs. Such a use case is already covered with the eBPF steering 
program so I want to maintain it if possible.

Regards,
Akihiko Odaki
Song Liu Nov. 22, 2023, 5:25 a.m. UTC | #9
On Mon, Nov 20, 2023 at 12:05 AM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2023/11/20 6:02, Song Liu wrote:
[...]
> >> In contrast, our intended use case is more like a normal application.
> >> So, for example, a user may download a container and run QEMU (including
> >> the BPF program) installed in the container. As such, it is nice if the
> >> ABI is stable across kernel releases, but it is not guaranteed for
> >> kfuncs. Such a use case is already covered with the eBPF steering
> >> program so I want to maintain it if possible.
> >
> > TBH, I don't think stability should be a concern for kfuncs used by QEMU.
> > Many core BPF APIs are now implemented as kfuncs: bpf_dynptr_*,
> > bpf_rcu_*, etc. As long as there are valid use cases,these kfuncs will
> > be supported.
>
> Documentation/bpf/kfuncs.rst still says:
>  > kfuncs provide a kernel <-> kernel API, and thus are not bound by any
>  > of the strict stability restrictions associated with kernel <-> user
>  > UAPIs.
>
> Is it possible to change the statement like as follows:
> "Most kfuncs provide a kernel <-> kernel API, and thus are not bound by
> any of the strict stability restrictions associated with kernel <-> user
> UAPIs. kfuncs that have same stability restrictions associated with
> UAPIs are exceptional, and must be carefully reviewed by subsystem (and
> BPF?) maintainers as any other UAPIs are."

I am afraid this is against the intention to not guarantee UAPI-level stability
for kfuncs.

Thanks,
Song
Akihiko Odaki Dec. 10, 2023, 7:03 a.m. UTC | #10
On 2023/11/22 14:36, Akihiko Odaki wrote:
> On 2023/11/22 14:25, Song Liu wrote:
>> On Mon, Nov 20, 2023 at 12:05 AM Akihiko Odaki 
>> <akihiko.odaki@daynix.com> wrote:
>>>
>>> On 2023/11/20 6:02, Song Liu wrote:
>> [...]
>>>>> In contrast, our intended use case is more like a normal application.
>>>>> So, for example, a user may download a container and run QEMU 
>>>>> (including
>>>>> the BPF program) installed in the container. As such, it is nice if 
>>>>> the
>>>>> ABI is stable across kernel releases, but it is not guaranteed for
>>>>> kfuncs. Such a use case is already covered with the eBPF steering
>>>>> program so I want to maintain it if possible.
>>>>
>>>> TBH, I don't think stability should be a concern for kfuncs used by 
>>>> QEMU.
>>>> Many core BPF APIs are now implemented as kfuncs: bpf_dynptr_*,
>>>> bpf_rcu_*, etc. As long as there are valid use cases,these kfuncs will
>>>> be supported.
>>>
>>> Documentation/bpf/kfuncs.rst still says:
>>>   > kfuncs provide a kernel <-> kernel API, and thus are not bound by 
>>> any
>>>   > of the strict stability restrictions associated with kernel <-> user
>>>   > UAPIs.
>>>
>>> Is it possible to change the statement like as follows:
>>> "Most kfuncs provide a kernel <-> kernel API, and thus are not bound by
>>> any of the strict stability restrictions associated with kernel <-> user
>>> UAPIs. kfuncs that have same stability restrictions associated with
>>> UAPIs are exceptional, and must be carefully reviewed by subsystem (and
>>> BPF?) maintainers as any other UAPIs are."
>>
>> I am afraid this is against the intention to not guarantee UAPI-level 
>> stability
>> for kfuncs.
> 
> Is it possible to ensure that a QEMU binary with the eBPF program 
> included works on different kernel versions without UAPI-level stability 
> then? Otherwise, I think we need to think of the minimal UAPI addition 
> that exposes the feature I propose, and the two options I presented 
> first are the candidates of such: the stable BPF change or tuntap 
> interface change.
> 
> Regards,
> Akihiko Odaki

Now the discussion is stale again so let me summarize the discussion:

A tuntap device can have an eBPF steering program to let the userspace 
decide which tuntap queue should be used for each packet. QEMU uses this 
feature to implement the RSS algorithm for virtio-net emulation. Now, 
the virtio specification has a new feature to report hash values 
calculated with the RSS algorithm. The goal of this RFC is to report 
such hash values from the eBPF steering program to the userspace.

There are currently three ideas to implement the proposal:

1. Abandon eBPF steering program and implement RSS in the kernel.

It is possible to implement the RSS algorithm in the kernel as it's 
strictly defined in the specification. However, there are proposals for 
relevant virtio specification changes, and abandoning eBPF steering 
program will loose the ability to implement those changes in the 
userspace. There are concerns that this lead to more UAPI changes in the 
end.

2. Add BPF kfuncs.

Adding BPF kfuncs is *the* standard way to add BPF interfaces. hid-bpf 
is a good reference for this.

The problem with BPF kfuncs is that kfuncs are not considered as stable 
as UAPI. In my understanding, it is not problematic for things like 
hid-bpf because programs using those kfuncs affect the entire system 
state and expected to be centrally managed. Such BPF programs can be 
updated along with the kernel in a manner similar to kernel modules.

The use case of tuntap steering/hash reporting is somewhat different 
though; the eBPF program is more like a part of application (QEMU or 
potentially other VMM) and thus needs to be portable. For example, a 
user may expect a Debian container with QEMU installed to work on Fedora.

BPF kfuncs do still provide some level of stability, but there is no 
documentation that tell how stable they are. The worst case scenario I 
can imagine is that a future legitimate BPF change breaks QEMU, letting 
the "no regressions" rule force the change to be reverted. Some 
assurance that kind scenario will not happen is necessary in my opinion.

3. Add BPF program type derived from the conventional steering program type

In principle, it's just to add a feature to report four more bytes to 
the conventional steering program. However, BPF program types are frozen 
for feature additions and the proposed change will break the feature freeze.

So what's next? I'm inclined to option 3 due to its minimal ABI/API 
change, but I'm also fine with option 2 if it is possible to guarantee 
the ABI/API stability necessary to run pre-built QEMUs on future kernel 
versions by e.g., explicitly stating the stability of kfuncs. If no 
objection arises, I'll resend this series with the RFC prefix dropped 
for upstream inclusion. If it's decided to go for option 1 or 2, I'll 
post a new version of the series implementing the idea.

Regards,
Akihiko Odaki
Song Liu Dec. 11, 2023, 1:40 a.m. UTC | #11
On Sat, Dec 9, 2023 at 11:03 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
> On 2023/11/22 14:36, Akihiko Odaki wrote:
> > On 2023/11/22 14:25, Song Liu wrote:
[...]
>
> Now the discussion is stale again so let me summarize the discussion:
>
> A tuntap device can have an eBPF steering program to let the userspace
> decide which tuntap queue should be used for each packet. QEMU uses this
> feature to implement the RSS algorithm for virtio-net emulation. Now,
> the virtio specification has a new feature to report hash values
> calculated with the RSS algorithm. The goal of this RFC is to report
> such hash values from the eBPF steering program to the userspace.
>
> There are currently three ideas to implement the proposal:
>
> 1. Abandon eBPF steering program and implement RSS in the kernel.
>
> It is possible to implement the RSS algorithm in the kernel as it's
> strictly defined in the specification. However, there are proposals for
> relevant virtio specification changes, and abandoning eBPF steering
> program will loose the ability to implement those changes in the
> userspace. There are concerns that this lead to more UAPI changes in the
> end.
>
> 2. Add BPF kfuncs.
>
> Adding BPF kfuncs is *the* standard way to add BPF interfaces. hid-bpf
> is a good reference for this.
>
> The problem with BPF kfuncs is that kfuncs are not considered as stable
> as UAPI. In my understanding, it is not problematic for things like
> hid-bpf because programs using those kfuncs affect the entire system
> state and expected to be centrally managed. Such BPF programs can be
> updated along with the kernel in a manner similar to kernel modules.
>
> The use case of tuntap steering/hash reporting is somewhat different
> though; the eBPF program is more like a part of application (QEMU or
> potentially other VMM) and thus needs to be portable. For example, a
> user may expect a Debian container with QEMU installed to work on Fedora.
>
> BPF kfuncs do still provide some level of stability, but there is no
> documentation that tell how stable they are. The worst case scenario I
> can imagine is that a future legitimate BPF change breaks QEMU, letting
> the "no regressions" rule force the change to be reverted. Some
> assurance that kind scenario will not happen is necessary in my opinion.

I don't think we can provide stability guarantees before seeing something
being used in the field. How do we know it will be useful forever? If a
couple years later, there is only one person using it somewhere in the
world, why should we keep supporting it? If there are millions of virtual
machines using it, why would you worry about it being removed?

>
> 3. Add BPF program type derived from the conventional steering program type
>
> In principle, it's just to add a feature to report four more bytes to
> the conventional steering program. However, BPF program types are frozen
> for feature additions and the proposed change will break the feature freeze.
>
> So what's next? I'm inclined to option 3 due to its minimal ABI/API
> change, but I'm also fine with option 2 if it is possible to guarantee
> the ABI/API stability necessary to run pre-built QEMUs on future kernel
> versions by e.g., explicitly stating the stability of kfuncs. If no
> objection arises, I'll resend this series with the RFC prefix dropped
> for upstream inclusion. If it's decided to go for option 1 or 2, I'll
> post a new version of the series implementing the idea.

Probably a dumb question, but does this RFC fall into option 3? If
that's the case, I seriously don't think it's gonna happen.

I would recommend you give option 2 a try and share the code. This is
probably the best way to move the discussion forward.

Thanks,
Song
Akihiko Odaki Dec. 11, 2023, 5:04 a.m. UTC | #12
On 2023/12/11 10:40, Song Liu wrote:
> On Sat, Dec 9, 2023 at 11:03 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>>
>> On 2023/11/22 14:36, Akihiko Odaki wrote:
>>> On 2023/11/22 14:25, Song Liu wrote:
> [...]
>>
>> Now the discussion is stale again so let me summarize the discussion:
>>
>> A tuntap device can have an eBPF steering program to let the userspace
>> decide which tuntap queue should be used for each packet. QEMU uses this
>> feature to implement the RSS algorithm for virtio-net emulation. Now,
>> the virtio specification has a new feature to report hash values
>> calculated with the RSS algorithm. The goal of this RFC is to report
>> such hash values from the eBPF steering program to the userspace.
>>
>> There are currently three ideas to implement the proposal:
>>
>> 1. Abandon eBPF steering program and implement RSS in the kernel.
>>
>> It is possible to implement the RSS algorithm in the kernel as it's
>> strictly defined in the specification. However, there are proposals for
>> relevant virtio specification changes, and abandoning eBPF steering
>> program will loose the ability to implement those changes in the
>> userspace. There are concerns that this lead to more UAPI changes in the
>> end.
>>
>> 2. Add BPF kfuncs.
>>
>> Adding BPF kfuncs is *the* standard way to add BPF interfaces. hid-bpf
>> is a good reference for this.
>>
>> The problem with BPF kfuncs is that kfuncs are not considered as stable
>> as UAPI. In my understanding, it is not problematic for things like
>> hid-bpf because programs using those kfuncs affect the entire system
>> state and expected to be centrally managed. Such BPF programs can be
>> updated along with the kernel in a manner similar to kernel modules.
>>
>> The use case of tuntap steering/hash reporting is somewhat different
>> though; the eBPF program is more like a part of application (QEMU or
>> potentially other VMM) and thus needs to be portable. For example, a
>> user may expect a Debian container with QEMU installed to work on Fedora.
>>
>> BPF kfuncs do still provide some level of stability, but there is no
>> documentation that tell how stable they are. The worst case scenario I
>> can imagine is that a future legitimate BPF change breaks QEMU, letting
>> the "no regressions" rule force the change to be reverted. Some
>> assurance that kind scenario will not happen is necessary in my opinion.
> 
> I don't think we can provide stability guarantees before seeing something
> being used in the field. How do we know it will be useful forever? If a
> couple years later, there is only one person using it somewhere in the
> world, why should we keep supporting it? If there are millions of virtual
> machines using it, why would you worry about it being removed?

I have a different opinion about providing stability guarantees; I 
believe it is safe to provide such a guarantee without actual use in a 
field. We develop features expecting there are real uses, and if it 
turns out otherwise, we can break the stated guarantee since there is no 
real use cases anyway. It is fine even breaking UAPIs in such a case, 
which is stated in Documentation/admin-guide/reporting-regressions.rst.

So I rather feel easy about guaranteeing UAPI stability; we can just 
guarantee the UAPI-level stability for a particular kfunc and use it 
from QEMU expecting the stability. If the feature is found not useful, 
QEMU and the kernel can just remove it.

I'm more concerned about the other case, which means that there will be 
wide uses of this feature. A kernel developer may assume the stability 
of the interface is like one of kernel internal APIs 
(Documentation/bpf/kfuncs.rst says kfuncs are like EXPORT_SYMBOL_GPL) 
and decide to change it, breaking old QEMU binaries and that's something 
I would like to avoid.

Regarding the breakage scenario, I think we can avoid the kfuncs removal 
just by saying "we won't remove them". I'm more worried the case that a 
change in the BPF kfunc infrastucture requires to recompile the binary.

So, in short, I don't think we can say "kfuncs are like 
EXPORT_SYMBOL_GPL" and "you can freely use kfuncs in a normal userspace 
application like QEMU" at the same time.

> 
>>
>> 3. Add BPF program type derived from the conventional steering program type
>>
>> In principle, it's just to add a feature to report four more bytes to
>> the conventional steering program. However, BPF program types are frozen
>> for feature additions and the proposed change will break the feature freeze.
>>
>> So what's next? I'm inclined to option 3 due to its minimal ABI/API
>> change, but I'm also fine with option 2 if it is possible to guarantee
>> the ABI/API stability necessary to run pre-built QEMUs on future kernel
>> versions by e.g., explicitly stating the stability of kfuncs. If no
>> objection arises, I'll resend this series with the RFC prefix dropped
>> for upstream inclusion. If it's decided to go for option 1 or 2, I'll
>> post a new version of the series implementing the idea.
> 
> Probably a dumb question, but does this RFC fall into option 3? If
> that's the case, I seriously don't think it's gonna happen.

Yes, it's option 3.

> 
> I would recommend you give option 2 a try and share the code. This is
> probably the best way to move the discussion forward.

I'd like to add a documentation change to say the added kfuncs are 
exceptional cases that are not like EXPORT_SYMBOL_GPL in that case. Will 
it work?

Regards,
Akihiko Odaki
Song Liu Dec. 11, 2023, 5:40 p.m. UTC | #13
On Sun, Dec 10, 2023 at 9:04 PM Akihiko Odaki <akihiko.odaki@daynix.com> wrote:
>
[...]
> >
> > I don't think we can provide stability guarantees before seeing something
> > being used in the field. How do we know it will be useful forever? If a
> > couple years later, there is only one person using it somewhere in the
> > world, why should we keep supporting it? If there are millions of virtual
> > machines using it, why would you worry about it being removed?
>
> I have a different opinion about providing stability guarantees; I
> believe it is safe to provide such a guarantee without actual use in a
> field. We develop features expecting there are real uses, and if it
> turns out otherwise, we can break the stated guarantee since there is no
> real use cases anyway. It is fine even breaking UAPIs in such a case,
> which is stated in Documentation/admin-guide/reporting-regressions.rst.
>
> So I rather feel easy about guaranteeing UAPI stability; we can just
> guarantee the UAPI-level stability for a particular kfunc and use it
> from QEMU expecting the stability. If the feature is found not useful,
> QEMU and the kernel can just remove it.

It appears that we more or less agree that this feature may not be
something we will support forever.

> I'm more concerned about the other case, which means that there will be
> wide uses of this feature. A kernel developer may assume the stability
> of the interface is like one of kernel internal APIs
> (Documentation/bpf/kfuncs.rst says kfuncs are like EXPORT_SYMBOL_GPL)
> and decide to change it, breaking old QEMU binaries and that's something
> I would like to avoid.
>
> Regarding the breakage scenario, I think we can avoid the kfuncs removal
> just by saying "we won't remove them". I'm more worried the case that a
> change in the BPF kfunc infrastucture requires to recompile the binary.
>
> So, in short, I don't think we can say "kfuncs are like
> EXPORT_SYMBOL_GPL" and "you can freely use kfuncs in a normal userspace
> application like QEMU" at the same time.
>
> >
[...]
>
> >
> > I would recommend you give option 2 a try and share the code. This is
> > probably the best way to move the discussion forward.
>
> I'd like to add a documentation change to say the added kfuncs are
> exceptional cases that are not like EXPORT_SYMBOL_GPL in that case. Will
> it work?

It will not.

The BPF community had a lot of discussions about the stability of BPF APIs, for
example in [1]. Therefore, this is not a light decision.

AFAICT, what is being proposed in this RFC is way less stable than many kfuncs
we added recently. We are not changing the stability guarantee for this. Let's
invest our time wisely and work on more meaningful things, for example, a
prototype that may actually get accepted.

Thanks,
Song

[1] https://lore.kernel.org/bpf/20221207205537.860248-1-joannelkoong@gmail.com/