
[RFC,0/6] eBPF RSS support for virtio-net

Message ID 20201102185115.7425-1-andrew@daynix.com

Message

Andrew Melnychenko Nov. 2, 2020, 6:51 p.m. UTC
The basic idea is to use eBPF to calculate packet hashes and steer packets in TAP.
RSS (Receive Side Scaling) is used to distribute network packets to guest virtqueues
by calculating a packet hash.
eBPF RSS allows us to use RSS with vhost TAP.

This set of patches introduces the use of eBPF for packet steering
and RSS hash calculation:
* RSS (Receive Side Scaling) is used to distribute network packets to
guest virtqueues by calculating a packet hash
* eBPF RSS is expected to be faster than the existing 'software'
implementation in QEMU
* Additionally, it adds support for using RSS with vhost
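For readers unfamiliar with RSS, the computation it performs can be sketched generically as follows. This is not QEMU's implementation; the function names are invented, and device details such as masking versus modulo for the table lookup vary, but it shows the Toeplitz hash and indirection-table lookup that RSS-style steering is built on:

```c
#include <stdint.h>
#include <stddef.h>

/* Toeplitz hash, as commonly used for RSS: for each input bit (MSB
 * first), if the bit is set, XOR the hash with a 32-bit window into
 * the key, then slide the window one key bit to the left.
 * `key` must be at least len + 4 bytes long. */
static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data,
                              size_t len)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {
            if (data[i] & (1u << b)) {
                hash ^= window;
            }
            /* slide the window: shift in the next key bit */
            window = (window << 1) | ((key[4 + i] >> b) & 1);
        }
    }
    return hash;
}

/* The hash then indexes an indirection table whose entries name the
 * destination virtqueue. */
static uint16_t rss_select_queue(uint32_t hash,
                                 const uint16_t *indirection_table,
                                 unsigned table_len)
{
    return indirection_table[hash % table_len];
}
```

In practice the input bytes are the packet's IP addresses and ports, and the guest programs both the key and the indirection table through the virtio-net control virtqueue.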

Supported kernels: 5.8+

Implementation notes:
The Linux TAP TUNSETSTEERINGEBPF ioctl is used to attach the eBPF program.
eBPF support is added to QEMU directly through the bpf(2) system call; see
bpf(2) for details.
The eBPF program is part of QEMU and is embedded as an array of BPF
instructions.
The program can be recompiled with the provided Makefile.ebpf (the
'linuxhdrs' path needs adjusting), although recompiling is not required
to build QEMU with eBPF support.
Changes are made to virtio-net and vhost: eBPF RSS is used primarily,
while 'software' RSS is used when hash population is required and as a
fallback option.
For vhost, the hash population feature is not reported to the guest.
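The load-and-attach mechanism described above can be sketched roughly as below. This is a standalone illustration, not QEMU's actual loader: the trivial two-instruction "always queue 0" program and the function names are invented, and error handling is omitted. The real series ships a full RSS program as the instruction array.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/bpf.h>
#include <linux/if_tun.h>

#ifndef TUNSETSTEERINGEBPF          /* needs kernel 4.15+ headers */
#define TUNSETSTEERINGEBPF _IOR('T', 224, int)
#endif

/* Trivial steering program: r0 = 0; exit -- steer everything to queue 0. */
static struct bpf_insn steering_prog[] = {
    { .code = 0xb7, .dst_reg = 0, .imm = 0 },  /* BPF_MOV64_IMM(r0, 0) */
    { .code = 0x95 },                          /* BPF_EXIT_INSN()      */
};

/* Load the instruction array via the bpf(2) syscall; TUN steering
 * programs use the socket-filter program type. */
static int load_steering_prog(void)
{
    union bpf_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insn_cnt  = sizeof(steering_prog) / sizeof(steering_prog[0]);
    attr.insns     = (uintptr_t)steering_prog;
    attr.license   = (uintptr_t)"GPL";
    return (int)syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}

/* Attach the loaded program to an open TAP file descriptor. */
static int attach_steering_prog(int tap_fd, int prog_fd)
{
    return ioctl(tap_fd, TUNSETSTEERINGEBPF, &prog_fd);
}
```

Loading requires appropriate privileges (CAP_SYS_ADMIN, or CAP_BPF on newer kernels), which is one of the fallback cases discussed in this thread.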

Please also see the documentation in PATCH 6/6.

I am sending these patches as an RFC to initiate discussion and get
feedback on the following points:
* Fallback when eBPF is not supported by the kernel
* Live migration to the kernel that doesn't have eBPF support
* Integration with current QEMU build
* Additional usage for eBPF for packet filtering

Known issues:
* Hash population is not supported by eBPF RSS: 'software' RSS is used
as a fallback; also, the hash population feature is not reported to
guests with vhost.
* Big-endian BPF support: for now, eBPF is disabled on big-endian systems.
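The fallback policy implied by these notes might be summarised as the following hypothetical helper (names invented; the real patches wire this logic into virtio-net's feature negotiation rather than a single function):

```c
#include <stdbool.h>

typedef enum { RSS_NONE, RSS_EBPF, RSS_SOFTWARE } RssMode;

/* Prefer eBPF RSS; fall back to QEMU's software RSS when the kernel
 * lacks eBPF support or when hash population is requested, since the
 * steering program cannot write the hash into the vnet header. */
static RssMode choose_rss_mode(bool rss_requested, bool kernel_has_ebpf,
                               bool hash_report_requested)
{
    if (!rss_requested) {
        return RSS_NONE;
    }
    if (kernel_has_ebpf && !hash_report_requested) {
        return RSS_EBPF;
    }
    return RSS_SOFTWARE;
}
```

Note that the software path is incompatible with vhost (packets bypass QEMU), which is why hash population is not offered to vhost guests.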

Andrew (6):
  Added SetSteeringEBPF method for NetClientState.
  ebpf: Added basic eBPF API.
  ebpf: Added eBPF RSS program.
  ebpf: Added eBPF RSS loader.
  virtio-net: Added eBPF RSS to virtio-net.
  docs: Added eBPF documentation.

 MAINTAINERS                    |   6 +
 configure                      |  36 +++
 docs/ebpf.rst                  |  29 ++
 docs/ebpf_rss.rst              | 129 ++++++++
 ebpf/EbpfElf_to_C.py           |  67 ++++
 ebpf/Makefile.ebpf             |  38 +++
 ebpf/ebpf-stub.c               |  28 ++
 ebpf/ebpf.c                    | 107 +++++++
 ebpf/ebpf.h                    |  35 +++
 ebpf/ebpf_rss.c                | 178 +++++++++++
 ebpf/ebpf_rss.h                |  30 ++
 ebpf/meson.build               |   1 +
 ebpf/rss.bpf.c                 | 470 ++++++++++++++++++++++++++++
 ebpf/trace-events              |   4 +
 ebpf/trace.h                   |   2 +
 ebpf/tun_rss_steering.h        | 556 +++++++++++++++++++++++++++++++++
 hw/net/vhost_net.c             |   2 +
 hw/net/virtio-net.c            | 120 ++++++-
 include/hw/virtio/virtio-net.h |   4 +
 include/net/net.h              |   2 +
 meson.build                    |   3 +
 net/tap-bsd.c                  |   5 +
 net/tap-linux.c                |  19 ++
 net/tap-solaris.c              |   5 +
 net/tap-stub.c                 |   5 +
 net/tap.c                      |   9 +
 net/tap_int.h                  |   1 +
 net/vhost-vdpa.c               |   2 +
 28 files changed, 1889 insertions(+), 4 deletions(-)
 create mode 100644 docs/ebpf.rst
 create mode 100644 docs/ebpf_rss.rst
 create mode 100644 ebpf/EbpfElf_to_C.py
 create mode 100755 ebpf/Makefile.ebpf
 create mode 100644 ebpf/ebpf-stub.c
 create mode 100644 ebpf/ebpf.c
 create mode 100644 ebpf/ebpf.h
 create mode 100644 ebpf/ebpf_rss.c
 create mode 100644 ebpf/ebpf_rss.h
 create mode 100644 ebpf/meson.build
 create mode 100644 ebpf/rss.bpf.c
 create mode 100644 ebpf/trace-events
 create mode 100644 ebpf/trace.h
 create mode 100644 ebpf/tun_rss_steering.h

Comments

Jason Wang Nov. 3, 2020, 9:02 a.m. UTC | #1
On 2020/11/3 2:51 a.m., Andrew Melnychenko wrote:
> [...]
> I am sending those patches as RFC to initiate the discussions and get
> feedback on the following points:
> * Fallback when eBPF is not supported by the kernel



Yes, and it could also be a lack of CAP_BPF.


> * Live migration to the kernel that doesn't have eBPF support

Is there anything that needs special treatment here?


> * Integration with current QEMU build

Yes, a question here:

1) Any reason for not using libbpf? E.g. it has been shipped with some
distros.
2) It would be better if we can avoid shipping bytecode.


> * Additional usage for eBPF for packet filtering

Another interesting topic is implementing MAC/VLAN filters. And in the
future, I plan to add MAC-based steering. All of this could be done via
eBPF.


> Know issues:
> * hash population not supported by eBPF RSS: 'software' RSS used

Is this because there's no way to write to the vnet header in steering BPF?


> as a fallback, also, hash population feature is not reported to guests
> with vhost.
> * big-endian BPF support: for now, eBPF is disabled for big-endian systems.

Are there any blockers for this?

Just some quick questions after a glance at the code. Will go through
it tomorrow.

Thanks


> [...]
Yuri Benditovich Nov. 3, 2020, 10:32 a.m. UTC | #2
On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com> wrote:

> [...]
> > * Live migration to the kernel that doesn't have eBPF support
>
> Is there anything that we needs special treatment here?

Possible case: rss=on, vhost=on, source system with kernel 5.8 (everything
works) -> dest. system 5.6 (bpf does not work): the adapter functions, but
the steering does not use the proper queues.

> > * Integration with current QEMU build
>
> Yes, a question here:
>
> 1) Any reason for not using libbpf, e.g it has been shipped with some
> distros

We intentionally do not use libbpf, as it is present only on some distros.
We can switch to libbpf, but this will disable BPF if libbpf is not
installed.


> 2) It would be better if we can avoid shipping bytecodes

This creates new dependencies: llvm + clang + ...
We would prefer byte code and the ability to generate it if the
prerequisites are installed.


> > * Additional usage for eBPF for packet filtering
>
> Another interesting topics in to implement mac/vlan filters. And in the
> future, I plan to add mac based steering. All of these could be done via
> eBPF.

No problem, we can cooperate if needed


> > Know issues:
> > * hash population not supported by eBPF RSS: 'software' RSS used
>
> Is this because there's not way to write to vnet header in STERRING BPF?

Yes. We plan to submit kernel changes to cooperate with BPF and populate
the hash; this work is in progress.


> > as a fallback, also, hash population feature is not reported to guests
> > with vhost.
> > * big-endian BPF support: for now, eBPF is disabled for big-endian
> > systems.
>
> Are there any blocker for this?


No, can be added in v2


> Just some quick questions after a glance of the codes. Will go through
> them tomorrow.
>
> Thanks

> [...]
Daniel P. Berrangé Nov. 3, 2020, 11:56 a.m. UTC | #3
On Tue, Nov 03, 2020 at 12:32:43PM +0200, Yuri Benditovich wrote:
> On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com> wrote:
> [...]
> > 1) Any reason for not using libbpf, e.g it has been shipped with some
> > distros
>
> We intentionally do not use libbpf, as it present only on some distros.
> We can switch to libbpf, but this will disable bpf if libbpf is not
> installed

If we were modifying existing functionality, then introducing a dependency
on libbpf would be a problem, as you'd be breaking existing QEMU users
on distros without libbpf.

This is brand new functionality though, so it is fine to place a
requirement on libbpf. If distros don't ship that library and they
want BPF features in QEMU, then those distros should take responsibility
for adding libbpf to their package set.

> > 2) It would be better if we can avoid shipping bytecodes
>
> This creates new dependencies: llvm + clang + ...
> We would prefer byte code and ability to generate it if prerequisites are
> installed.

I've double checked with Fedora, and generating the BPF program from
source is a mandatory requirement for QEMU. Pre-generated BPF bytecode
is not permitted.

There was also a question raised about kernel ABI compatibility
for BPF programs:

  https://lwn.net/Articles/831402/

  "The basic problem is that when BPF is compiled, it uses a set
   of kernel headers that describe various kernel data structures
   for that particular version, which may be different from those
   on the kernel where the program is run. Until relatively recently,
   that was solved by distributing the BPF as C code along with the
   Clang compiler to build the BPF on the system where it was going
   to be run."

Is this not an issue for QEMU's usage of BPF here?

The dependency on llvm is unfortunate for people who build with GCC,
but at least they can opt out via a configure switch if they really
want to. As that LWN article notes, GCC will gain BPF support.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Jason Wang Nov. 4, 2020, 2:07 a.m. UTC | #4
On 2020/11/3 6:32 p.m., Yuri Benditovich wrote:
> On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com> wrote:
> [...]
> > Is there anything that we needs special treatment here?
>
> Possible case: rss=on, vhost=on, source system with kernel 5.8
> (everything works) -> dest. system 5.6 (bpf does not work), the
> adapter functions, but all the steering does not use proper queues.



Right, I think we need to disable vhost on dest.


> > 1) Any reason for not using libbpf, e.g it has been shipped with some
> > distros
>
> We intentionally do not use libbpf, as it present only on some distros.
> We can switch to libbpf, but this will disable bpf if libbpf is not
> installed



That's better I think.


> > 2) It would be better if we can avoid shipping bytecodes
>
> This creates new dependencies: llvm + clang + ...
> We would prefer byte code and ability to generate it if prerequisites
> are installed.



It's probably OK if we treat the bytecode as a kind of firmware.

But in the long run it's still worth considering: the QEMU source is
used for development, and llvm/clang should be a common requirement for
generating eBPF bytecode for the host.


> > Another interesting topics in to implement mac/vlan filters. And in the
> > future, I plan to add mac based steering. All of these could be done via
> > eBPF.
>
> No problem, we can cooperate if needed

>

>

> > Is this because there's not way to write to vnet header in
> > STERRING BPF?
>
> Yes. We plan to submit changes for kernel to cooperate with BPF and
> populate the hash, this work is in progress



That would require a new type of eBPF program and may need some work on
the verifier.

Btw, macvtap still lacks even a steering eBPF program. Would you want
to post a patch to support that?


> > Are there any blocker for this?
>
> No, can be added in v2



Cool.

Thanks


> [...]

>     >   create mode 100755 ebpf/Makefile.ebpf

>     >   create mode 100644 ebpf/ebpf-stub.c

>     >   create mode 100644 ebpf/ebpf.c

>     >   create mode 100644 ebpf/ebpf.h

>     >   create mode 100644 ebpf/ebpf_rss.c

>     >   create mode 100644 ebpf/ebpf_rss.h

>     >   create mode 100644 ebpf/meson.build

>     >   create mode 100644 ebpf/rss.bpf.c

>     >   create mode 100644 ebpf/trace-events

>     >   create mode 100644 ebpf/trace.h

>     >   create mode 100644 ebpf/tun_rss_steering.h

>     >

>
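[The `bpf_fixup_mapfd()` relocation step discussed in this series — patching real map file descriptors into a shipped array of eBPF instructions before `bpf_prog_load()` — can be illustrated with a reduced, self-contained sketch. The struct layouts and names below are simplified stand-ins, not QEMU's actual `ebpf/ebpf.c` code:]

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Simplified eBPF instruction, mirroring the layout of struct bpf_insn. */
struct insn {
    uint8_t  code;   /* opcode */
    uint8_t  regs;   /* dst_reg in low nibble, src_reg in high nibble */
    int16_t  off;
    int32_t  imm;
};

#define BPF_LD_IMM64_OP   0x18  /* BPF_LD | BPF_IMM | BPF_DW */
#define BPF_PSEUDO_MAP_FD 1     /* src_reg value marking a map reference */

/* One relocation entry: "instruction at 'offset' refers to map 'name'". */
struct fixup_mapfd {
    const char *name;
    size_t offset;
};

/* Patch every relocation entry matching 'name' with the real map fd.
 * Returns how many instructions were fixed up. */
static int fixup_mapfd(const struct fixup_mapfd *table, size_t table_len,
                       struct insn *prog, size_t prog_len,
                       const char *name, int map_fd)
{
    int fixed = 0;
    for (size_t i = 0; i < table_len; i++) {
        if (strcmp(table[i].name, name) != 0 || table[i].offset >= prog_len) {
            continue;
        }
        struct insn *ins = &prog[table[i].offset];
        if (ins->code == BPF_LD_IMM64_OP) {
            /* keep dst_reg, mark src_reg as a pseudo map fd reference */
            ins->regs = (uint8_t)((ins->regs & 0x0f) | (BPF_PSEUDO_MAP_FD << 4));
            ins->imm = map_fd;  /* low 32 bits of the ld_imm64 pair */
            fixed++;
        }
    }
    return fixed;
}
```

In the real loader, the relocation table is emitted alongside the bytecode by the generator script, and the patched array is then handed to the `bpf(2)` `BPF_PROG_LOAD` command.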
Jason Wang Nov. 4, 2020, 2:15 a.m. UTC | #5
On 2020/11/3 下午7:56, Daniel P. Berrangé wrote:
> On Tue, Nov 03, 2020 at 12:32:43PM +0200, Yuri Benditovich wrote:
>> On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>>> On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
>>>> Basic idea is to use eBPF to calculate and steer packets in TAP.
>>>> RSS(Receive Side Scaling) is used to distribute network packets to guest
>>> virtqueues
>>>> by calculating packet hash.
>>>> eBPF RSS allows us to use RSS with vhost TAP.
>>>>
>>>> This set of patches introduces the usage of eBPF for packet steering
>>>> and RSS hash calculation:
>>>> * RSS(Receive Side Scaling) is used to distribute network packets to
>>>> guest virtqueues by calculating packet hash
>>>> * eBPF RSS suppose to be faster than already existing 'software'
>>>> implementation in QEMU
>>>> * Additionally adding support for the usage of RSS with vhost
>>>>
>>>> Supported kernels: 5.8+
>>>>
>>>> Implementation notes:
>>>> Linux TAP TUNSETSTEERINGEBPF ioctl was used to set the eBPF program.
>>>> Added eBPF support to qemu directly through a system call, see the
>>>> bpf(2) for details.
>>>> The eBPF program is part of the qemu and presented as an array of bpf
>>>> instructions.
>>>> The program can be recompiled by provided Makefile.ebpf(need to adjust
>>>> 'linuxhdrs'),
>>>> although it's not required to build QEMU with eBPF support.
>>>> Added changes to virtio-net and vhost, primary eBPF RSS is used.
>>>> 'Software' RSS used in the case of hash population and as a fallback
>>> option.
>>>> For vhost, the hash population feature is not reported to the guest.
>>>>
>>>> Please also see the documentation in PATCH 6/6.
>>>>
>>>> I am sending those patches as RFC to initiate the discussions and get
>>>> feedback on the following points:
>>>> * Fallback when eBPF is not supported by the kernel
>>>
>>> Yes, and it could also a lacking of CAP_BPF.
>>>
>>>
>>>> * Live migration to the kernel that doesn't have eBPF support
>>>
>>> Is there anything that we needs special treatment here?
>>>
>>
>> Possible case: rss=on, vhost=on, source system with kernel 5.8 (everything
>> works) -> dest. system 5.6 (bpf does not work), the adapter functions, but
>> all the steering does not use proper queues.
>>
>>
>>
>>
>>>> * Integration with current QEMU build
>>>
>>> Yes, a question here:
>>>
>>> 1) Any reason for not using libbpf, e.g it has been shipped with some
>>> distros
>>>
>> We intentionally do not use libbpf, as it present only on some distros.
>> We can switch to libbpf, but this will disable bpf if libbpf is not
>> installed
> If we were modifying existing funtionality then introducing a dep on
> libbpf would be a problem as you'd be breaking existing QEMU users
> on distros without libbpf.
>
> This is brand new functionality though, so it is fine to place a
> requirement on libbpf. If distros don't ship that library and they
> want BPF features in QEMU, then those distros should take responsibility
> for adding libbpf to their package set.
>
>>> 2) It would be better if we can avoid shipping bytecodes
>>>
>>
>> This creates new dependencies: llvm + clang + ...
>> We would prefer byte code and ability to generate it if prerequisites are
>> installed.
> I've double checked with Fedora, and generating the BPF program from
> source is a mandatory requirement for QEMU. Pre-generated BPF bytecode
> is not permitted.
>
> There was also a question raised about the kernel ABI compatibility
> for BPF programs ?
>
>    https://lwn.net/Articles/831402/
>
>    "The basic problem is that when BPF is compiled, it uses a set
>     of kernel headers that describe various kernel data structures
>     for that particular version, which may be different from those
>     on the kernel where the program is run. Until relatively recently,
>     that was solved by distributing the BPF as C code along with the
>     Clang compiler to build the BPF on the system where it was going
>     to be run."
>
> Is this not an issue for QEMU's usage of BPF here ?


That's a good point. Actually, DPDK ships RSS bytecodes, but I don't know
how it works.

But as mentioned in the link, if we generate the code with BTF, that
would be fine.

Thanks


>
> The dependancy on llvm is unfortunate for people who build with GCC,
> but at least they can opt-out via a configure switch if they really
> want to. As that LWN article notes, GCC will gain BPF support
>
>
> Regards,
> Daniel
Jason Wang Nov. 4, 2020, 2:49 a.m. UTC | #6
On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
> From: Andrew <andrew@daynix.com>
>
> For now, that method supported only by Linux TAP.
> Linux TAP uses TUNSETSTEERINGEBPF ioctl.
> TUNSETSTEERINGBPF was added 3 years ago.
> Qemu checks if it was defined before using.
>
> Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
> ---
>   include/net/net.h |  2 ++
>   net/tap-bsd.c     |  5 +++++
>   net/tap-linux.c   | 19 +++++++++++++++++++
>   net/tap-solaris.c |  5 +++++
>   net/tap-stub.c    |  5 +++++
>   net/tap.c         |  9 +++++++++
>   net/tap_int.h     |  1 +
>   7 files changed, 46 insertions(+)
>
> diff --git a/include/net/net.h b/include/net/net.h
> index 897b2d7595..d8a41fb010 100644
> --- a/include/net/net.h
> +++ b/include/net/net.h
> @@ -60,6 +60,7 @@ typedef int (SetVnetBE)(NetClientState *, bool);
>   typedef struct SocketReadState SocketReadState;
>   typedef void (SocketReadStateFinalize)(SocketReadState *rs);
>   typedef void (NetAnnounce)(NetClientState *);
> +typedef bool (SetSteeringEBPF)(NetClientState *, int);
>   
>   typedef struct NetClientInfo {
>       NetClientDriver type;
> @@ -81,6 +82,7 @@ typedef struct NetClientInfo {
>       SetVnetLE *set_vnet_le;
>       SetVnetBE *set_vnet_be;
>       NetAnnounce *announce;
> +    SetSteeringEBPF *set_steering_ebpf;
>   } NetClientInfo;
>   
>   struct NetClientState {
> diff --git a/net/tap-bsd.c b/net/tap-bsd.c
> index 77aaf674b1..4f64f31e98 100644
> --- a/net/tap-bsd.c
> +++ b/net/tap-bsd.c
> @@ -259,3 +259,8 @@ int tap_fd_get_ifname(int fd, char *ifname)
>   {
>       return -1;
>   }
> +
> +int tap_fd_set_steering_ebpf(int fd, int prog_fd)
> +{
> +    return -1;
> +}
> diff --git a/net/tap-linux.c b/net/tap-linux.c
> index b0635e9e32..196373019f 100644
> --- a/net/tap-linux.c
> +++ b/net/tap-linux.c
> @@ -31,6 +31,7 @@
>   
>   #include <net/if.h>
>   #include <sys/ioctl.h>
> +#include <linux/if_tun.h> /* TUNSETSTEERINGEBPF */
>   
>   #include "qapi/error.h"
>   #include "qemu/error-report.h"
> @@ -316,3 +317,21 @@ int tap_fd_get_ifname(int fd, char *ifname)
>       pstrcpy(ifname, sizeof(ifr.ifr_name), ifr.ifr_name);
>       return 0;
>   }
> +
> +int tap_fd_set_steering_ebpf(int fd, int prog_fd)
> +{
> +#ifdef TUNSETSTEERINGEBPF


I'm not sure how much this can help.

But looking at tap-linux.h, I wonder whether we need to pull in the
TUN/TAP uapi headers.

Thanks


> +    if (ioctl(fd, TUNSETSTEERINGEBPF, (void *) &prog_fd) != 0) {
> +        error_report("Issue while setting TUNSETSTEERINGEBPF:"
> +                    " %s with fd: %d, prog_fd: %d",
> +                    strerror(errno), fd, prog_fd);
> +
> +       return -1;
> +    }
> +
> +    return 0;
> +#else
> +    error_report("TUNSETSTEERINGEBPF is not supported");
> +    return -1;
> +#endif
> +}
> diff --git a/net/tap-solaris.c b/net/tap-solaris.c
> index 0475a58207..d85224242b 100644
> --- a/net/tap-solaris.c
> +++ b/net/tap-solaris.c
> @@ -255,3 +255,8 @@ int tap_fd_get_ifname(int fd, char *ifname)
>   {
>       return -1;
>   }
> +
> +int tap_fd_set_steering_ebpf(int fd, int prog_fd)
> +{
> +    return -1;
> +}
> diff --git a/net/tap-stub.c b/net/tap-stub.c
> index de525a2e69..a0fa25804b 100644
> --- a/net/tap-stub.c
> +++ b/net/tap-stub.c
> @@ -85,3 +85,8 @@ int tap_fd_get_ifname(int fd, char *ifname)
>   {
>       return -1;
>   }
> +
> +int tap_fd_set_steering_ebpf(int fd, int prog_fd)
> +{
> +    return -1;
> +}
> diff --git a/net/tap.c b/net/tap.c
> index c46ff66184..81f50017bd 100644
> --- a/net/tap.c
> +++ b/net/tap.c
> @@ -337,6 +337,14 @@ static void tap_poll(NetClientState *nc, bool enable)
>       tap_write_poll(s, enable);
>   }
>   
> +static bool tap_set_steering_ebpf(NetClientState *nc, int prog_fd)
> +{
> +    TAPState *s = DO_UPCAST(TAPState, nc, nc);
> +    assert(nc->info->type == NET_CLIENT_DRIVER_TAP);
> +
> +    return tap_fd_set_steering_ebpf(s->fd, prog_fd) == 0;
> +}
> +
>   int tap_get_fd(NetClientState *nc)
>   {
>       TAPState *s = DO_UPCAST(TAPState, nc, nc);
> @@ -362,6 +370,7 @@ static NetClientInfo net_tap_info = {
>       .set_vnet_hdr_len = tap_set_vnet_hdr_len,
>       .set_vnet_le = tap_set_vnet_le,
>       .set_vnet_be = tap_set_vnet_be,
> +    .set_steering_ebpf = tap_set_steering_ebpf,
>   };
>   
>   static TAPState *net_tap_fd_init(NetClientState *peer,
> diff --git a/net/tap_int.h b/net/tap_int.h
> index 225a49ea48..547f8a5a28 100644
> --- a/net/tap_int.h
> +++ b/net/tap_int.h
> @@ -44,5 +44,6 @@ int tap_fd_set_vnet_be(int fd, int vnet_is_be);
>   int tap_fd_enable(int fd);
>   int tap_fd_disable(int fd);
>   int tap_fd_get_ifname(int fd, char *ifname);
> +int tap_fd_set_steering_ebpf(int fd, int prog_fd);
>   
>   #endif /* NET_TAP_INT_H */
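[The patch above adds `set_steering_ebpf` to `NetClientInfo` as an optional callback: only the Linux TAP backend implements it, so callers NULL-check before dispatching. A reduced model of that pattern — types here are simplified stand-ins for QEMU's `NetClientInfo`/`NetClientState`, and the TAP-like backend fakes the ioctl result:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Reduced stand-ins for QEMU's NetClientInfo / NetClientState. */
typedef struct NetClientState NetClientState;
typedef bool (SetSteeringEBPF)(NetClientState *, int);

typedef struct NetClientInfo {
    SetSteeringEBPF *set_steering_ebpf;  /* NULL when the backend lacks support */
} NetClientInfo;

struct NetClientState {
    const NetClientInfo *info;
};

/* TAP-like backend: pretend TUNSETSTEERINGEBPF succeeded for valid fds. */
static bool tap_like_set_steering_ebpf(NetClientState *nc, int prog_fd)
{
    (void)nc;
    return prog_fd >= 0;
}

/* Caller side: attach the steering program only if the backend supports it. */
static bool attach_steering_prog(NetClientState *nc, int prog_fd)
{
    if (nc->info->set_steering_ebpf != NULL) {
        return nc->info->set_steering_ebpf(nc, prog_fd);
    }
    return false;  /* e.g. tap-bsd, tap-solaris, tap-stub return failure */
}
```

This mirrors why the BSD/Solaris/stub variants in the patch simply return -1: the dispatch degrades to "not supported" rather than failing hard.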
Jason Wang Nov. 4, 2020, 3:15 a.m. UTC | #7
On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
> From: Andrew <andrew@daynix.com>
>
> Also, added maintainers information.
>
> Signed-off-by: Yuri Benditovich <yuri.benditovich@daynix.com>
> Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
> ---
>   MAINTAINERS       |   6 +++
>   docs/ebpf.rst     |  29 +++++++++++
>   docs/ebpf_rss.rst | 129 ++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 164 insertions(+)
>   create mode 100644 docs/ebpf.rst
>   create mode 100644 docs/ebpf_rss.rst
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2c22bbca5a..464b3f3c95 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3111,6 +3111,12 @@ S: Maintained
>   F: hw/semihosting/
>   F: include/hw/semihosting/
>
> +EBPF:
> +M: Andrew Melnychenko <andrew@daynix.com>
> +M: Yuri Benditovich <yuri.benditovich@daynix.com>
> +S: Maintained
> +F: ebpf/*
> +
>   Build and test automation
>   -------------------------
>   Build and test automation
> diff --git a/docs/ebpf.rst b/docs/ebpf.rst
> new file mode 100644
> index 0000000000..e45d085432
> --- /dev/null
> +++ b/docs/ebpf.rst
> @@ -0,0 +1,29 @@
> +===========================
> +eBPF qemu support
> +===========================
> +
> +eBPF support (CONFIG_EBPF) is enabled automatically by 'configure' script
> +if 'bpf' system call is available.
> +To disable eBPF support use './configure --disable-bpf'
> +
> +Basic eBPF functionality is located in ebpf/ebpf.c and ebpf/ebpf.h.
> +There are basic functions to load the eBPF program into the kernel.
> +Mostly, functions name are self-explanatory:
> +
> +- `bpf_create_map()`, `bpf_lookup_element()`, `bpf_update_element()`, `bpf_delete_element()` - manages eBPF maps. On error, a basic error message would be reported and returned -1. On success, 0 would be returned(`bpf_create_map()` returns map's file descriptor).
> +- `bpf_prog_load()` - load the program. The program has to have proper map file descriptors if there are used. On error - the log eBPF would be reported. On success, the program file descriptor returned.
> +- `bpf_fixup_mapfd()` - would place map file descriptor into the program according to 'relocate array' of 'struct fixup_mapfd_t'. The function would return how many instructions were 'fixed' aka how many relocations was occurred.
> +
> +Simplified workflow would look like this:
> +
> +.. code:: C
> +
> +    int map1 = bpf_create_map(...);
> +    int map2 = bpf_create_map(...);
> +
> +    bpf_fixup_mapfd(<fixup table>, ARRAY_SIZE(<fixup table>), <instructions pointer>, ARRAY_SIZE(<instructions pointer>), <map1 name>, map1);
> +    bpf_fixup_mapfd(<fixup table>, ARRAY_SIZE(<fixup table>), <instructions pointer>, ARRAY_SIZE(<instructions pointer>), <map2 name>, map2);
> +
> +    int prog = bpf_prog_load(<program type>, <instructions pointer>, ARRAY_SIZE(<instructions pointer>), "GPL");
> +
> +See the bpf(2) for details.
> diff --git a/docs/ebpf_rss.rst b/docs/ebpf_rss.rst
> new file mode 100644
> index 0000000000..96fee391b8
> --- /dev/null
> +++ b/docs/ebpf_rss.rst
> @@ -0,0 +1,129 @@
> +===========================
> +eBPF RSS virtio-net support
> +===========================
> +
> +RSS(Receive Side Scaling) is used to distribute network packets to guest virtqueues
> +by calculating packet hash. Usually every queue is processed then by a specific guest CPU core.
> +
> +For now there are 2 RSS implementations in qemu:
> +- 'software' RSS (functions if qemu receives network packets, i.e. vhost=off)
> +- eBPF RSS (can function with also with vhost=on)
> +
> +If steering BPF is not set for kernel's TUN module, the TUN uses automatic selection
> +of rx virtqueue based on lookup table built according to calculated symmetric hash
> +of transmitted packets.
> +If steering BPF is set for TUN the BPF code calculates the hash of packet header and
> +returns the virtqueue number to place the packet to.
> +
> +Simplified decision formula:
> +
> +.. code:: C
> +
> +    queue_index = indirection_table[hash(<packet data>)%<indirection_table size>]
> +
> +
> +Not for all packets, the hash can/should be calculated.
> +
> +Note: currently, eBPF RSS does not support hash reporting.
> +
> +eBPF RSS turned on by different combinations of vhost-net, vitrio-net and tap configurations:
> +
> +- eBPF is used:
> +
> +        tap,vhost=off & virtio-net-pci,rss=on,hash=off
> +
> +- eBPF is used:
> +
> +        tap,vhost=on & virtio-net-pci,rss=on,hash=off
> +
> +- 'software' RSS is used:
> +
> +        tap,vhost=off & virtio-net-pci,rss=on,hash=on
> +
> +- eBPF is used, hash population feature is not reported to the guest:
> +
> +        tap,vhost=on & virtio-net-pci,rss=on,hash=on
> +
> +If CONFIG_EBPF is not set then only 'software' RSS is supported.
> +Also 'software' RSS, as a fallback, is used if the eBPF program failed to load or set to TUN.
> +
> +RSS eBPF program
> +----------------
> +
> +RSS program located in ebpf/tun_rss_steering.h as an array of 'struct bpf_insn'.
> +So the program is part of the qemu binary.
> +Initially, the eBPF program was compiled by clang and source code located at ebpf/rss.bpf.c.
> +Prerequisites to recompile the eBPF program (regenerate ebpf/tun_rss_steering.h):
> +
> +        llvm, clang, kernel source tree, python3 + (pip3 pyelftools)
> +        Adjust 'linuxhdrs' in Makefile.ebpf to reflect the location of the kernel source tree
> +
> +        $ cd ebpf
> +        $ make -f Makefile.ebpf
> +
> +Note the python script for convertation from eBPF ELF object to '.h' file - Ebpf_to_C.py:
> +
> +        $ python EbpfElf_to_C.py rss.bpf.o tun_rss_steering
> +
> +The first argument of the script is ELF object, second - section name where the eBPF program located.
> +The script would generate <section name>.h file with eBPF instructions and 'relocate array'.
> +'relocate array' is an array of 'struct fixup_mapfd_t' with the name of the eBPF map and instruction offset where the file descriptor of the map should be placed.
> +

Do we still need this if we decide to use llvm/clang toolchain? (I guess
not)

Thanks

> +Current eBPF RSS implementation uses 'bounded loops' with 'backward jump instructions' which present in the last kernels.
> +Overall eBPF RSS works on kernels 5.8+.
> +
> +eBPF RSS implementation
> +-----------------------
> +
> +eBPF RSS loading functionality located in ebpf/ebpf_rss.c and ebpf/ebpf_rss.h.
> +
> +The `struct EBPFRSSContext` structure that holds 4 file descriptors:
> +
> +- program_fd - file descriptor of the eBPF RSS program.
> +- map_configuration - file descriptor of the 'configuration' map. This map contains one element of 'struct EBPFRSSConfig'. This configuration determines eBPF program behavior.
> +- map_toeplitz_key - file descriptor of the 'Toeplitz key' map. One element of the 40byte key prepared for the hashing algorithm.
> +- map_indirections_table - 128 elements of queue indexes.
> +
> +`struct EBPFRSSConfig` fields:
> +
> +- redirect - "boolean" value, should the hash be calculated, on false  - `default_queue` would be used as the final decision.
> +- populate_hash - for now, not used. eBPF RSS doesn't support hash reporting.
> +- hash_types - binary mask of different hash types. See `VIRTIO_NET_RSS_HASH_TYPE_*` defines. If for packet hash should not be calculated - `default_queue` would be used.
> +- indirections_len - length of the indirections table, maximum 128.
> +- default_queue - the queue index that used for packet that shouldn't be hashed. For some packets, the hash can't be calculated(g.e ARP).
> +
> +Functions:
> +
> +- `ebpf_rss_init()` - sets program_fd to -1, which indicates that EBPFRSSContext is not loaded.
> +- `ebpf_rss_load()` - creates 3 maps and loads eBPF program from tun_rss_steering.h. Returns 'true' on success. After that, program_fd can be used to set steering for TAP.
> +- `ebpf_rss_set_all()` - sets values for eBPF maps. `indirections_table` length is in EBPFRSSConfig. `toeplitz_key` is VIRTIO_NET_RSS_MAX_KEY_SIZE aka 40 bytes array.
> +- `ebpf_rss_unload()` - close all file descriptors and set program_fd to -1.
> +
> +Simplified eBPF RSS workflow:
> +
> +.. code:: C
> +
> +    struct EBPFRSSConfig config;
> +    config.redirect = 1;
> +    config.hash_types = VIRTIO_NET_RSS_HASH_TYPE_UDPv4 | VIRTIO_NET_RSS_HASH_TYPE_TCPv4;
> +    config.indirections_len = VIRTIO_NET_RSS_MAX_TABLE_LEN;
> +    config.default_queue = 0;
> +
> +    uint16_t table[VIRTIO_NET_RSS_MAX_TABLE_LEN] = {...};
> +    uint8_t key[VIRTIO_NET_RSS_MAX_KEY_SIZE] = {...};
> +
> +    struct EBPFRSSContext ctx;
> +    ebpf_rss_init(&ctx);
> +    ebpf_rss_load(&ctx);
> +    ebpf_rss_set_all(&ctx, &config, table, key);
> +    if (net_client->info->set_steering_ebpf != NULL) {
> +        net_client->info->set_steering_ebpf(net_client, ctx->program_fd);
> +    }
> +    ...
> +    ebpf_unload(&ctx);
> +
> +
> +NetClientState SetSteeringEBPF()
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For now, `set_steering_ebpf()` method supported by Linux TAP NetClientState. The method requires an eBPF program file descriptor as an argument.
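[The decision formula in the quoted documentation (`queue_index = indirection_table[hash % size]`) combines a Toeplitz hash of selected packet fields with an indirection-table lookup. The sketch below is a generic illustration of that pipeline, not the contents of QEMU's rss.bpf.c; the key is the well-known default from the Microsoft RSS verification suite, while in QEMU the key actually comes from the guest:]

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* The default 40-byte key from the Microsoft RSS verification suite
 * (illustrative only; QEMU receives the key from the guest via virtio). */
static const uint8_t rss_default_key[40] = {
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67,
    0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb,
    0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30,
    0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Generic Toeplitz hash: for every set bit of the input (MSB first),
 * XOR in the current top 32 bits of the sliding key window.
 * The key must be at least len + 4 bytes long. */
static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *in, size_t len)
{
    uint32_t hash = 0;
    uint32_t window = (uint32_t)key[0] << 24 | (uint32_t)key[1] << 16 |
                      (uint32_t)key[2] << 8  | key[3];
    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (in[i] & (1u << bit)) {
                hash ^= window;
            }
            /* slide the window left by one, pulling in the next key bit */
            window = (window << 1) | ((key[i + 4] >> bit) & 1);
        }
    }
    return hash;
}

/* RSS decision: queue = indirection_table[hash % table_len] */
static uint16_t rss_select_queue(const uint8_t *key,
                                 const uint8_t *pkt_fields, size_t fields_len,
                                 const uint16_t *ind_table, size_t table_len)
{
    uint32_t hash = toeplitz_hash(key, pkt_fields, fields_len);
    return ind_table[hash % table_len];
}
```

For real traffic, `pkt_fields` would be the concatenated addresses/ports selected by `hash_types`; a correct implementation should reproduce the test vectors from the Microsoft RSS specification.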
Daniel P. Berrangé Nov. 4, 2020, 9:31 a.m. UTC | #8
On Wed, Nov 04, 2020 at 10:07:52AM +0800, Jason Wang wrote:
> 
> On 2020/11/3 下午6:32, Yuri Benditovich wrote:
> > 
> > 
> > On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com
> > <mailto:jasowang@redhat.com>> wrote:
> > 
> > 
> >     On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
> >     > Basic idea is to use eBPF to calculate and steer packets in TAP.
> >     > RSS(Receive Side Scaling) is used to distribute network packets
> >     to guest virtqueues
> >     > by calculating packet hash.
> >     > eBPF RSS allows us to use RSS with vhost TAP.
> >     >
> >     > This set of patches introduces the usage of eBPF for packet steering
> >     > and RSS hash calculation:
> >     > * RSS(Receive Side Scaling) is used to distribute network packets to
> >     > guest virtqueues by calculating packet hash
> >     > * eBPF RSS suppose to be faster than already existing 'software'
> >     > implementation in QEMU
> >     > * Additionally adding support for the usage of RSS with vhost
> >     >
> >     > Supported kernels: 5.8+
> >     >
> >     > Implementation notes:
> >     > Linux TAP TUNSETSTEERINGEBPF ioctl was used to set the eBPF program.
> >     > Added eBPF support to qemu directly through a system call, see the
> >     > bpf(2) for details.
> >     > The eBPF program is part of the qemu and presented as an array
> >     of bpf
> >     > instructions.
> >     > The program can be recompiled by provided Makefile.ebpf(need to
> >     adjust
> >     > 'linuxhdrs'),
> >     > although it's not required to build QEMU with eBPF support.
> >     > Added changes to virtio-net and vhost, primary eBPF RSS is used.
> >     > 'Software' RSS used in the case of hash population and as a
> >     fallback option.
> >     > For vhost, the hash population feature is not reported to the guest.
> >     >
> >     > Please also see the documentation in PATCH 6/6.
> >     >
> >     > I am sending those patches as RFC to initiate the discussions
> >     and get
> >     > feedback on the following points:
> >     > * Fallback when eBPF is not supported by the kernel
> > 
> > 
> >     Yes, and it could also a lacking of CAP_BPF.
> > 
> > 
> >     > * Live migration to the kernel that doesn't have eBPF support
> > 
> > 
> >     Is there anything that we needs special treatment here?
> > 
> > Possible case: rss=on, vhost=on, source system with kernel 5.8
> > (everything works) -> dest. system 5.6 (bpf does not work), the adapter
> > functions, but all the steering does not use proper queues.
> 
> 
> Right, I think we need to disable vhost on dest.
> 
> 
> > 
> > 
> > 
> >     > * Integration with current QEMU build
> > 
> > 
> >     Yes, a question here:
> > 
> >     1) Any reason for not using libbpf, e.g it has been shipped with some
> >     distros
> > 
> > 
> > We intentionally do not use libbpf, as it present only on some distros.
> > We can switch to libbpf, but this will disable bpf if libbpf is not
> > installed
> 
> 
> That's better I think.
> 
> 
> >     2) It would be better if we can avoid shipping bytecodes
> > 
> > 
> > 
> > This creates new dependencies: llvm + clang + ...
> > We would prefer byte code and ability to generate it if prerequisites
> > are installed.
> 
> 
> It's probably ok if we treat the bytecode as a kind of firmware.

That is explicitly *not* OK for inclusion in Fedora. They require that
BPF is compiled from source, and rejected my suggestion that it could
be considered a kind of firmware and thus have an exception from building
from source.

> But in the long run, it's still worthwhile consider the qemu source is used
> for development and llvm/clang should be a common requirement for generating
> eBPF bytecode for host.

So we need to do this right straight away before this merges.

Regards,
Daniel
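[The vhost/rss/hash combinations in the quoted documentation, plus the fallback and migration concerns discussed above, reduce to a small selection policy: use the eBPF steering program when the kernel accepts it, fall back to in-QEMU 'software' RSS when QEMU sees the packets, and report failure otherwise. The enum and function names here are illustrative, not QEMU's:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef enum {
    RSS_IMPL_NONE,      /* no RSS possible (e.g. vhost=on, no eBPF) */
    RSS_IMPL_EBPF,      /* steering program attached to TAP */
    RSS_IMPL_SOFTWARE,  /* per-packet steering inside QEMU */
} RssImpl;

/* Pick an RSS implementation for a given configuration.
 * hash_report: guest wants the hash populated in the vnet header,
 * which the current eBPF program cannot do.
 * ebpf_ok: kernel accepted the steering program (5.8+, CAP_BPF, ...). */
static RssImpl select_rss_impl(bool vhost, bool hash_report, bool ebpf_ok)
{
    if (hash_report && !vhost) {
        /* only in-QEMU RSS can populate the hash today */
        return RSS_IMPL_SOFTWARE;
    }
    if (ebpf_ok) {
        /* with vhost + hash_report, eBPF is still used, but the hash
         * population feature is simply not offered to the guest */
        return RSS_IMPL_EBPF;
    }
    /* eBPF unavailable: software RSS only works when QEMU sees packets */
    return vhost ? RSS_IMPL_NONE : RSS_IMPL_SOFTWARE;
}
```

The `RSS_IMPL_NONE` leg is exactly the migration hazard raised in the thread: a vhost=on guest migrated to a pre-5.8 destination loses proper steering unless vhost is disabled there.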
Yuri Benditovich Nov. 4, 2020, 9:34 a.m. UTC | #9
On Wed, Nov 4, 2020 at 4:49 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
> > From: Andrew <andrew@daynix.com>
> >
> > For now, that method supported only by Linux TAP.
> > Linux TAP uses TUNSETSTEERINGEBPF ioctl.
> > TUNSETSTEERINGBPF was added 3 years ago.
> > Qemu checks if it was defined before using.
> >
> > Signed-off-by: Andrew Melnychenko <andrew@daynix.com>
> > ---
> >   include/net/net.h |  2 ++
> >   net/tap-bsd.c     |  5 +++++
> >   net/tap-linux.c   | 19 +++++++++++++++++++
> >   net/tap-solaris.c |  5 +++++
> >   net/tap-stub.c    |  5 +++++
> >   net/tap.c         |  9 +++++++++
> >   net/tap_int.h     |  1 +
> >   7 files changed, 46 insertions(+)
> >
> > diff --git a/include/net/net.h b/include/net/net.h
> > index 897b2d7595..d8a41fb010 100644
> > --- a/include/net/net.h
> > +++ b/include/net/net.h
> > @@ -60,6 +60,7 @@ typedef int (SetVnetBE)(NetClientState *, bool);
> >   typedef struct SocketReadState SocketReadState;
> >   typedef void (SocketReadStateFinalize)(SocketReadState *rs);
> >   typedef void (NetAnnounce)(NetClientState *);
> > +typedef bool (SetSteeringEBPF)(NetClientState *, int);
> >
> >   typedef struct NetClientInfo {
> >       NetClientDriver type;
> > @@ -81,6 +82,7 @@ typedef struct NetClientInfo {
> >       SetVnetLE *set_vnet_le;
> >       SetVnetBE *set_vnet_be;
> >       NetAnnounce *announce;
> > +    SetSteeringEBPF *set_steering_ebpf;
> >   } NetClientInfo;
> >
> >   struct NetClientState {
> > diff --git a/net/tap-bsd.c b/net/tap-bsd.c
> > index 77aaf674b1..4f64f31e98 100644
> > --- a/net/tap-bsd.c
> > +++ b/net/tap-bsd.c
> > @@ -259,3 +259,8 @@ int tap_fd_get_ifname(int fd, char *ifname)
> >   {
> >       return -1;
> >   }
> > +
> > +int tap_fd_set_steering_ebpf(int fd, int prog_fd)
> > +{
> > +    return -1;
> > +}
> > diff --git a/net/tap-linux.c b/net/tap-linux.c
> > index b0635e9e32..196373019f 100644
> > --- a/net/tap-linux.c
> > +++ b/net/tap-linux.c
> > @@ -31,6 +31,7 @@
> >
> >   #include <net/if.h>
> >   #include <sys/ioctl.h>
> > +#include <linux/if_tun.h> /* TUNSETSTEERINGEBPF */
> >
> >   #include "qapi/error.h"
> >   #include "qemu/error-report.h"
> > @@ -316,3 +317,21 @@ int tap_fd_get_ifname(int fd, char *ifname)
> >       pstrcpy(ifname, sizeof(ifr.ifr_name), ifr.ifr_name);
> >       return 0;
> >   }
> > +
> > +int tap_fd_set_steering_ebpf(int fd, int prog_fd)
> > +{
> > +#ifdef TUNSETSTEERINGEBPF
>
>
> I'm not sure how much this can help.
>
> But looking at tap-linux.h, I wonder do we need to pull TUN/TAP uapi
> headers.
>
> Thanks
>

Agree, we just need to add this define to tap-linux.h


>
>
> > +    if (ioctl(fd, TUNSETSTEERINGEBPF, (void *) &prog_fd) != 0) {
> > +        error_report("Issue while setting TUNSETSTEERINGEBPF:"
> > +                    " %s with fd: %d, prog_fd: %d",
> > +                    strerror(errno), fd, prog_fd);
> > +
> > +       return -1;
> > +    }
> > +
> > +    return 0;
> > +#else
> > +    error_report("TUNSETSTEERINGEBPF is not supported");
> > +    return -1;
> > +#endif
> > +}
> > diff --git a/net/tap-solaris.c b/net/tap-solaris.c
> > index 0475a58207..d85224242b 100644
> > --- a/net/tap-solaris.c
> > +++ b/net/tap-solaris.c
> > @@ -255,3 +255,8 @@ int tap_fd_get_ifname(int fd, char *ifname)
> >   {
> >       return -1;
> >   }
> > +
> > +int tap_fd_set_steering_ebpf(int fd, int prog_fd)
> > +{
> > +    return -1;
> > +}
> > diff --git a/net/tap-stub.c b/net/tap-stub.c
> > index de525a2e69..a0fa25804b 100644
> > --- a/net/tap-stub.c
> > +++ b/net/tap-stub.c
> > @@ -85,3 +85,8 @@ int tap_fd_get_ifname(int fd, char *ifname)
> >   {
> >       return -1;
> >   }
> > +
> > +int tap_fd_set_steering_ebpf(int fd, int prog_fd)
> > +{
> > +    return -1;
> > +}
> > diff --git a/net/tap.c b/net/tap.c
> > index c46ff66184..81f50017bd 100644
> > --- a/net/tap.c
> > +++ b/net/tap.c
> > @@ -337,6 +337,14 @@ static void tap_poll(NetClientState *nc, bool
> enable)
> >       tap_write_poll(s, enable);
> >   }
> >
> > +static bool tap_set_steering_ebpf(NetClientState *nc, int prog_fd)
> > +{
> > +    TAPState *s = DO_UPCAST(TAPState, nc, nc);
> > +    assert(nc->info->type == NET_CLIENT_DRIVER_TAP);
> > +
> > +    return tap_fd_set_steering_ebpf(s->fd, prog_fd) == 0;
> > +}
> > +
> >   int tap_get_fd(NetClientState *nc)
> >   {
> >       TAPState *s = DO_UPCAST(TAPState, nc, nc);
> > @@ -362,6 +370,7 @@ static NetClientInfo net_tap_info = {
> >       .set_vnet_hdr_len = tap_set_vnet_hdr_len,
> >       .set_vnet_le = tap_set_vnet_le,
> >       .set_vnet_be = tap_set_vnet_be,
> > +    .set_steering_ebpf = tap_set_steering_ebpf,
> >   };
> >
> >   static TAPState *net_tap_fd_init(NetClientState *peer,
> > diff --git a/net/tap_int.h b/net/tap_int.h
> > index 225a49ea48..547f8a5a28 100644
> > --- a/net/tap_int.h
> > +++ b/net/tap_int.h
> > @@ -44,5 +44,6 @@ int tap_fd_set_vnet_be(int fd, int vnet_is_be);
> >   int tap_fd_enable(int fd);
> >   int tap_fd_disable(int fd);
> >   int tap_fd_get_ifname(int fd, char *ifname);
> > +int tap_fd_set_steering_ebpf(int fd, int prog_fd);
> >
> >   #endif /* NET_TAP_INT_H */
>
>
Yuri Benditovich Nov. 4, 2020, 11:49 a.m. UTC | #10
On Wed, Nov 4, 2020 at 4:08 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/11/3 下午6:32, Yuri Benditovich wrote:
> >
> >
> > On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com
> > <mailto:jasowang@redhat.com>> wrote:
> >
> >
> >     On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
> >     > Basic idea is to use eBPF to calculate and steer packets in TAP.
> >     > RSS(Receive Side Scaling) is used to distribute network packets
> >     to guest virtqueues
> >     > by calculating packet hash.
> >     > eBPF RSS allows us to use RSS with vhost TAP.
> >     >
> >     > This set of patches introduces the usage of eBPF for packet
> steering
> >     > and RSS hash calculation:
> >     > * RSS(Receive Side Scaling) is used to distribute network packets
> to
> >     > guest virtqueues by calculating packet hash
> >     > * eBPF RSS suppose to be faster than already existing 'software'
> >     > implementation in QEMU
> >     > * Additionally adding support for the usage of RSS with vhost
> >     >
> >     > Supported kernels: 5.8+
> >     >
> >     > Implementation notes:
> >     > Linux TAP TUNSETSTEERINGEBPF ioctl was used to set the eBPF
> program.
> >     > Added eBPF support to qemu directly through a system call, see the
> >     > bpf(2) for details.
> >     > The eBPF program is part of the qemu and presented as an array
> >     of bpf
> >     > instructions.
> >     > The program can be recompiled by provided Makefile.ebpf(need to
> >     adjust
> >     > 'linuxhdrs'),
> >     > although it's not required to build QEMU with eBPF support.
> >     > Added changes to virtio-net and vhost, primary eBPF RSS is used.
> >     > 'Software' RSS used in the case of hash population and as a
> >     fallback option.
> >     > For vhost, the hash population feature is not reported to the
> guest.
> >     >
> >     > Please also see the documentation in PATCH 6/6.
> >     >
> >     > I am sending those patches as RFC to initiate the discussions
> >     and get
> >     > feedback on the following points:
> >     > * Fallback when eBPF is not supported by the kernel
> >
> >
> >     Yes, and it could also a lacking of CAP_BPF.
> >
> >
> >     > * Live migration to the kernel that doesn't have eBPF support
> >
> >
> >     Is there anything that we needs special treatment here?
> >
> > Possible case: rss=on, vhost=on, source system with kernel 5.8
> > (everything works) -> dest. system 5.6 (bpf does not work), the
> > adapter functions, but all the steering does not use proper queues.
>
>
> Right, I think we need to disable vhost on dest.
>
>
Is it acceptable to disable vhost at migration time?


> >
> >
> >
> >     > * Integration with current QEMU build
> >
> >
> >     Yes, a question here:
> >
> >     1) Any reason for not using libbpf, e.g it has been shipped with some
> >     distros
> >
> >
> > We intentionally do not use libbpf, as it present only on some distros.
> > We can switch to libbpf, but this will disable bpf if libbpf is not
> > installed
>
>
> That's better I think.
>

We think the preferred way is to have the eBPF code built into QEMU (not
distributed as a separate file).

Our initial idea was to avoid libbpf because it:
1. Does not create an additional dependency at build time or run time
2. Gives us a smaller footprint for the loadable eBPF blob inside QEMU
3. Does not add too much code to QEMU

We can switch to libbpf, but in that case:
1. The presence of the dynamic library is not guaranteed on the target system
2. The static library is large
3. libbpf consumes an eBPF ELF object, which is significantly bigger than just
the array of instructions (maybe we can reduce the ELF to some suitable size
and still have it built in)

Please let us know whether you still think libbpf is better and why.

Thanks


>
> >     2) It would be better if we can avoid shipping bytecodes
> >
> >
> >
> > This creates new dependencies: llvm + clang + ...
> > We would prefer byte code and ability to generate it if prerequisites
> > are installed.
>
>
> It's probably ok if we treat the bytecode as a kind of firmware.
>
> But in the long run, it's still worthwhile consider the qemu source is
> used for development and llvm/clang should be a common requirement for
> generating eBPF bytecode for host.
>
>
> >
> >
> >     > * Additional usage for eBPF for packet filtering
> >
> >
> >     Another interesting topics in to implement mac/vlan filters. And
> >     in the
> >     future, I plan to add mac based steering. All of these could be
> >     done via
> >     eBPF.
> >
> >
> > No problem, we can cooperate if needed
> >
> >
> >     >
> >     > Know issues:
> >     > * hash population not supported by eBPF RSS: 'software' RSS used
> >
> >
> >     Is this because there's not way to write to vnet header in
> >     STERRING BPF?
> >
> > Yes. We plan to submit changes for kernel to cooperate with BPF and
> > populate the hash, this work is in progress
>
>
> That would require a new type of eBPF program and may need some work on
> verifier.
>
>
Maybe we need to allow loading an additional program type in tun.c, not only
socket filter (to use bpf_set_hash).
Also, vhost and tun in the kernel need to be aware of the header extension for
hash population.


> Btw, macvtap is still lacking even steering ebpf program. Would you want
> to post a patch to support that?
>
>
Probably after we have fully functioning BPF with TAP/TUN.


>
> >
> >     > as a fallback, also, hash population feature is not reported to
> >     guests
> >     > with vhost.
> >     > * big-endian BPF support: for now, eBPF is disabled for
> >     big-endian systems.
> >
> >
> >     Are there any blocker for this?
> >
> >
> > No, can be added in v2
>
>
> Cool.
>
> Thanks
>
>
> >
> >     Just some quick questions after a glance of the codes. Will go
> >     through
> >     them tomorrow.
> >
> >     Thanks
> >
> >
> >     >
> >     > Andrew (6):
> >     >    Added SetSteeringEBPF method for NetClientState.
> >     >    ebpf: Added basic eBPF API.
> >     >    ebpf: Added eBPF RSS program.
> >     >    ebpf: Added eBPF RSS loader.
> >     >    virtio-net: Added eBPF RSS to virtio-net.
> >     >    docs: Added eBPF documentation.
> >     >
> >     >   MAINTAINERS                    |   6 +
> >     >   configure                      |  36 +++
> >     >   docs/ebpf.rst                  |  29 ++
> >     >   docs/ebpf_rss.rst              | 129 ++++++++
> >     >   ebpf/EbpfElf_to_C.py           |  67 ++++
> >     >   ebpf/Makefile.ebpf             |  38 +++
> >     >   ebpf/ebpf-stub.c               |  28 ++
> >     >   ebpf/ebpf.c                    | 107 +++++++
> >     >   ebpf/ebpf.h                    |  35 +++
> >     >   ebpf/ebpf_rss.c                | 178 +++++++++++
> >     >   ebpf/ebpf_rss.h                |  30 ++
> >     >   ebpf/meson.build               |   1 +
> >     >   ebpf/rss.bpf.c                 | 470 ++++++++++++++++++++++++++++
> >     >   ebpf/trace-events              |   4 +
> >     >   ebpf/trace.h                   |   2 +
> >     >   ebpf/tun_rss_steering.h        | 556
> >     +++++++++++++++++++++++++++++++++
> >     >   hw/net/vhost_net.c             |   2 +
> >     >   hw/net/virtio-net.c            | 120 ++++++-
> >     >   include/hw/virtio/virtio-net.h |   4 +
> >     >   include/net/net.h              |   2 +
> >     >   meson.build                    |   3 +
> >     >   net/tap-bsd.c                  |   5 +
> >     >   net/tap-linux.c                |  19 ++
> >     >   net/tap-solaris.c              |   5 +
> >     >   net/tap-stub.c                 |   5 +
> >     >   net/tap.c                      |   9 +
> >     >   net/tap_int.h                  |   1 +
> >     >   net/vhost-vdpa.c               |   2 +
> >     >   28 files changed, 1889 insertions(+), 4 deletions(-)
> >     >   create mode 100644 docs/ebpf.rst
> >     >   create mode 100644 docs/ebpf_rss.rst
> >     >   create mode 100644 ebpf/EbpfElf_to_C.py
> >     >   create mode 100755 ebpf/Makefile.ebpf
> >     >   create mode 100644 ebpf/ebpf-stub.c
> >     >   create mode 100644 ebpf/ebpf.c
> >     >   create mode 100644 ebpf/ebpf.h
> >     >   create mode 100644 ebpf/ebpf_rss.c
> >     >   create mode 100644 ebpf/ebpf_rss.h
> >     >   create mode 100644 ebpf/meson.build
> >     >   create mode 100644 ebpf/rss.bpf.c
> >     >   create mode 100644 ebpf/trace-events
> >     >   create mode 100644 ebpf/trace.h
> >     >   create mode 100644 ebpf/tun_rss_steering.h
> >     >
> >
>
>
Daniel P. Berrangé Nov. 4, 2020, 12:04 p.m. UTC | #11
On Wed, Nov 04, 2020 at 01:49:05PM +0200, Yuri Benditovich wrote:
> On Wed, Nov 4, 2020 at 4:08 AM Jason Wang <jasowang@redhat.com> wrote:
> > On 2020/11/3 下午6:32, Yuri Benditovich wrote:
> > > On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com
> > > <mailto:jasowang@redhat.com>> wrote:
> > >     On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
[...]
> > > Possible case: rss=on, vhost=on, source system with kernel 5.8
> > > (everything works) -> dest. system 5.6 (bpf does not work), the
> > > adapter functions, but all the steering does not use proper queues.
> >
> > Right, I think we need to disable vhost on dest.
>
> Is this acceptable to disable vhost at time of migration?
>
[...]
> We think the preferred way is to have an eBPF code built-in in QEMU (not
> distribute it as a separate file).
>
> Our initial idea was to not use the libbpf because it:
> 1. Does not create additional dependency during build time and during
> run-time
> 2. Gives us smaller footprint of loadable eBPF blob inside qemu
> 3. Do not add too much code to QEMU
>
> We can switch to libbpf, in this case:
> 1. Presence of dynamic library is not guaranteed on the target system

Again, if a distro or users want to use this feature in
QEMU they should be expected to build the library.

> 2. Static library is large

QEMU doesn't support static linking for system emulators.  It may
happen to work at times but there's no expectation in this respect.

> 3. libbpf uses eBPF ELF which is significantly bigger than just the array
> or instructions (May be we succeed to reduce the ELF to some suitable size
> and still have it built-in)
>
> Please let us know whether you still think libbpf is better and why.


It looks like both Clang and GCC compilers for BPF are moving towards
a world where they use BTF to get compile once, run everywhere portability
for the compiled bytecode. IIUC the libbpf is what is responsible for
processing the BTF data when loading it into the running kernel. This
all looks like a good thing in general. 

If we introduce BPF to QEMU without using libbpf, and then later decide
we absolutely need libbpf features, it creates an upgrade back compat
issue for existing deployments. It is better to use libbpf right from
the start, so we're set up to take full advantage of what it offers
long term.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
Jason Wang Nov. 5, 2020, 3:46 a.m. UTC | #12
On 2020/11/4 下午5:31, Daniel P. Berrangé wrote:
> On Wed, Nov 04, 2020 at 10:07:52AM +0800, Jason Wang wrote:
>> On 2020/11/3 下午6:32, Yuri Benditovich wrote:
>>>
>>> On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com
>>> <mailto:jasowang@redhat.com>> wrote:
>>>
>>>
>>>      On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
>>>      > Basic idea is to use eBPF to calculate and steer packets in TAP.
>>>      > RSS(Receive Side Scaling) is used to distribute network packets
>>>      to guest virtqueues
>>>      > by calculating packet hash.
>>>      > eBPF RSS allows us to use RSS with vhost TAP.
>>>      >
>>>      > This set of patches introduces the usage of eBPF for packet steering
>>>      > and RSS hash calculation:
>>>      > * RSS(Receive Side Scaling) is used to distribute network packets to
>>>      > guest virtqueues by calculating packet hash
>>>      > * eBPF RSS suppose to be faster than already existing 'software'
>>>      > implementation in QEMU
>>>      > * Additionally adding support for the usage of RSS with vhost
>>>      >
>>>      > Supported kernels: 5.8+
>>>      >
>>>      > Implementation notes:
>>>      > Linux TAP TUNSETSTEERINGEBPF ioctl was used to set the eBPF program.
>>>      > Added eBPF support to qemu directly through a system call, see the
>>>      > bpf(2) for details.
>>>      > The eBPF program is part of the qemu and presented as an array
>>>      of bpf
>>>      > instructions.
>>>      > The program can be recompiled by provided Makefile.ebpf(need to
>>>      adjust
>>>      > 'linuxhdrs'),
>>>      > although it's not required to build QEMU with eBPF support.
>>>      > Added changes to virtio-net and vhost, primary eBPF RSS is used.
>>>      > 'Software' RSS used in the case of hash population and as a
>>>      fallback option.
>>>      > For vhost, the hash population feature is not reported to the guest.
>>>      >
>>>      > Please also see the documentation in PATCH 6/6.
>>>      >
>>>      > I am sending those patches as RFC to initiate the discussions
>>>      and get
>>>      > feedback on the following points:
>>>      > * Fallback when eBPF is not supported by the kernel
>>>
>>>
>>>      Yes, and it could also a lacking of CAP_BPF.
>>>
>>>
>>>      > * Live migration to the kernel that doesn't have eBPF support
>>>
>>>
>>>      Is there anything that we needs special treatment here?
>>>
>>> Possible case: rss=on, vhost=on, source system with kernel 5.8
>>> (everything works) -> dest. system 5.6 (bpf does not work), the adapter
>>> functions, but all the steering does not use proper queues.
>>
>> Right, I think we need to disable vhost on dest.
>>
>>
>>>
>>>
>>>      > * Integration with current QEMU build
>>>
>>>
>>>      Yes, a question here:
>>>
>>>      1) Any reason for not using libbpf, e.g it has been shipped with some
>>>      distros
>>>
>>>
>>> We intentionally do not use libbpf, as it present only on some distros.
>>> We can switch to libbpf, but this will disable bpf if libbpf is not
>>> installed
>>
>> That's better I think.
>>
>>
>>>      2) It would be better if we can avoid shipping bytecodes
>>>
>>>
>>>
>>> This creates new dependencies: llvm + clang + ...
>>> We would prefer byte code and ability to generate it if prerequisites
>>> are installed.
>>
>> It's probably ok if we treat the bytecode as a kind of firmware.
> That is explicitly *not* OK for inclusion in Fedora. They require that
> BPF is compiled from source, and rejected my suggestion that it could
> be considered a kind of firmware and thus have an exception from building
> from source.


Please refer to what was done in DPDK:

http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235

I don't think what is proposed here is any different.

It's still bytecode that lives in an array.


>
>> But in the long run, it's still worthwhile consider the qemu source is used
>> for development and llvm/clang should be a common requirement for generating
>> eBPF bytecode for host.
> So we need to do this right straight way before this merges.


Yes.

Thanks


>
> Regards,
> Daniel
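[Editor's sketch] To make the "bytecode that lives in an array" representation concrete, here is a minimal, hypothetical C sketch of an eBPF program embedded as a static array of 8-byte instructions. The struct mirrors `struct bpf_insn` from `<linux/bpf.h>` but is redefined locally; the trivial "return 0" program and all names are illustrative, not the series' actual RSS program.

```c
#include <stdint.h>
#include <assert.h>

/* Mirrors the 8-byte eBPF instruction layout of struct bpf_insn from
 * <linux/bpf.h>; redefined locally so the sketch is self-contained. */
struct ebpf_insn {
    uint8_t  code;       /* opcode */
    uint8_t  dst_reg:4;  /* destination register */
    uint8_t  src_reg:4;  /* source register */
    int16_t  off;        /* signed offset */
    int32_t  imm;        /* signed immediate */
};

/* The smallest valid program ("return 0"), embedded the same way the
 * series embeds its generated RSS steering program:
 *   0xb7 = BPF_ALU64 | BPF_MOV | BPF_K  (mov r0, imm)
 *   0x95 = BPF_JMP | BPF_EXIT           (exit)
 * A real loader would pass this array and its length to the bpf(2)
 * syscall with BPF_PROG_LOAD, then hand the returned fd to the TAP
 * device via the TUNSETSTEERINGEBPF ioctl, as the cover letter
 * describes. */
static const struct ebpf_insn prog[] = {
    { .code = 0xb7, .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
    { .code = 0x95, .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
};
```

Per the cover letter, such an array is regenerated from C source by the provided Makefile.ebpf when llvm/clang are available, rather than hand-written.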
Jason Wang Nov. 5, 2020, 3:52 a.m. UTC | #13
On 2020/11/5 上午11:46, Jason Wang wrote:
>>
>> It's probably ok if we treat the bytecode as a kind of firmware.
> That is explicitly *not* OK for inclusion in Fedora. They require that
> BPF is compiled from source, and rejected my suggestion that it could
> be considered a kind of firmware and thus have an exception from building
> from source. 


Actually, there's another advantage: if we treat it as firmware (which it
actually is), we can upgrade it independently of QEMU.

Thanks
Jason Wang Nov. 5, 2020, 3:56 a.m. UTC | #14
On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
> From: Andrew<andrew@daynix.com>
>
> Also, added maintainers information.
>
> Signed-off-by: Yuri Benditovich<yuri.benditovich@daynix.com>
> Signed-off-by: Andrew Melnychenko<andrew@daynix.com>
> ---
>   MAINTAINERS       |   6 +++
>   docs/ebpf.rst     |  29 +++++++++++
>   docs/ebpf_rss.rst | 129 ++++++++++++++++++++++++++++++++++++++++++++++
>   3 files changed, 164 insertions(+)
>   create mode 100644 docs/ebpf.rst
>   create mode 100644 docs/ebpf_rss.rst
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2c22bbca5a..464b3f3c95 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3111,6 +3111,12 @@ S: Maintained
>   F: hw/semihosting/
>   F: include/hw/semihosting/
>   
> +EBPF:
> +M: Andrew Melnychenko<andrew@daynix.com>
> +M: Yuri Benditovich<yuri.benditovich@daynix.com>
> +S: Maintained
> +F: ebpf/*
> +


If it's possible, I would like to be one of the maintainers, or at least
a reviewer :)

Thanks
Yuri Benditovich Nov. 5, 2020, 9:11 a.m. UTC | #15
On Thu, Nov 5, 2020 at 5:52 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/11/5 上午11:46, Jason Wang wrote:
> >>
> >> It's probably ok if we treat the bytecode as a kind of firmware.
> > That is explicitly *not* OK for inclusion in Fedora. They require that
> > BPF is compiled from source, and rejected my suggestion that it could
> > be considered a kind of firmware and thus have an exception from building
> > from source.
>
>
> Actually, there's another advantages. If we treat it as firmware,
> (actually it is). It allows us to upgrade it independently with qemu.
>
Hi Jason,
I think it is a big disadvantage to have the BPF binary outside of QEMU.
It is compiled against common structures (for example, the RSS
configuration) defined in QEMU, and if it is not built into QEMU then
nobody is responsible for keeping the BPF program and QEMU compatible.
Just an array of instructions (as of today) is ~2K; a full object file (if
we use libbpf) is ~8K, so size is not a big problem.
If we keep the entire object in QEMU, it is guaranteed to be 100%
compatible.

Thanks
>
>
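[Editor's sketch] The compatibility argument above can be illustrated with a small example: QEMU and the eBPF program are compiled against one header describing the shared data layout, so keeping the bytecode in-tree guarantees the two sides agree. The struct below is hypothetical; its field names are illustrative, not the series' actual RSS configuration.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Hypothetical shared header: QEMU writes this struct into a BPF map,
 * and the steering program reads it.  If the bytecode ships inside the
 * QEMU tree, both sides are built from this exact definition and the
 * layout cannot drift. */
struct rss_config {
    uint32_t hash_types;            /* bitmask of enabled hash types */
    uint16_t indirection_table_len; /* number of indirection entries */
    uint16_t default_queue;         /* queue for unclassified packets */
    uint8_t  toeplitz_key[40];      /* RSS hash key */
} __attribute__((packed));

/* Compile-time layout guard: an externally shipped binary could not be
 * checked against QEMU's structures this way. */
_Static_assert(sizeof(struct rss_config) == 48, "layout drifted");
```

This is exactly the guarantee that is lost when the object file is shipped as a separate external artifact.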
Yuri Benditovich Nov. 5, 2020, 9:40 a.m. UTC | #16
On Thu, Nov 5, 2020 at 5:56 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/11/3 上午2:51, Andrew Melnychenko wrote:
> > From: Andrew<andrew@daynix.com>
> >
> > Also, added maintainers information.
> >
> > Signed-off-by: Yuri Benditovich<yuri.benditovich@daynix.com>
> > Signed-off-by: Andrew Melnychenko<andrew@daynix.com>
> > ---
> >   MAINTAINERS       |   6 +++
> >   docs/ebpf.rst     |  29 +++++++++++
> >   docs/ebpf_rss.rst | 129 ++++++++++++++++++++++++++++++++++++++++++++++
> >   3 files changed, 164 insertions(+)
> >   create mode 100644 docs/ebpf.rst
> >   create mode 100644 docs/ebpf_rss.rst
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 2c22bbca5a..464b3f3c95 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3111,6 +3111,12 @@ S: Maintained
> >   F: hw/semihosting/
> >   F: include/hw/semihosting/
> >
> > +EBPF:
> > +M: Andrew Melnychenko<andrew@daynix.com>
> > +M: Yuri Benditovich<yuri.benditovich@daynix.com>
> > +S: Maintained
> > +F: ebpf/*
> > +
>
>
> If it's possible, I would like to be one of the maintainer or at least
> reviewer :)
>
With pleasure. We did not know who would want to maintain the eBPF-related
code, so we added ourselves as maintainers.
If you agree, we'll list you as a maintainer and ourselves as reviewers, so
that we are informed about changes before they happen.

Thanks



> Thanks
>
>
Daniel P. Berrangé Nov. 5, 2020, 10:01 a.m. UTC | #17
On Thu, Nov 05, 2020 at 11:46:18AM +0800, Jason Wang wrote:
> 
> On 2020/11/4 下午5:31, Daniel P. Berrangé wrote:
> > On Wed, Nov 04, 2020 at 10:07:52AM +0800, Jason Wang wrote:
> > > On 2020/11/3 下午6:32, Yuri Benditovich wrote:
> > > > 
> > > > On Tue, Nov 3, 2020 at 11:02 AM Jason Wang <jasowang@redhat.com
> > > > <mailto:jasowang@redhat.com>> wrote:
> > > > [...]
> > > >      2) It would be better if we can avoid shipping bytecodes
> > > > 
> > > > 
> > > > 
> > > > This creates new dependencies: llvm + clang + ...
> > > > We would prefer byte code and ability to generate it if prerequisites
> > > > are installed.
> > > 
> > > It's probably ok if we treat the bytecode as a kind of firmware.
> > That is explicitly *not* OK for inclusion in Fedora. They require that
> > BPF is compiled from source, and rejected my suggestion that it could
> > be considered a kind of firmware and thus have an exception from building
> > from source.
> 
> 
> Please refer what it was done in DPDK:
> 
> http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235
> 
> I don't think what proposed here makes anything different.

I'm not convinced that what DPDK does is acceptable to Fedora either,
based on the responses I've received when asking about BPF handling
during build.  It wouldn't surprise me, however, if this was simply
missed by reviewers when accepting DPDK into Fedora, because it is
not entirely obvious unless you are looking closely.


Regards,
Daniel
Daniel P. Berrangé Nov. 5, 2020, 1:19 p.m. UTC | #18
On Thu, Nov 05, 2020 at 10:01:09AM +0000, Daniel P. Berrangé wrote:
> On Thu, Nov 05, 2020 at 11:46:18AM +0800, Jason Wang wrote:
> > > > [...]
> > > > It's probably ok if we treat the bytecode as a kind of firmware.
> > > That is explicitly *not* OK for inclusion in Fedora. They require that
> > > BPF is compiled from source, and rejected my suggestion that it could
> > > be considered a kind of firmware and thus have an exception from building
> > > from source.
> > 
> > 
> > Please refer what it was done in DPDK:
> > 
> > http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235
> > 
> > I don't think what proposed here makes anything different.
> 
> > I'm not convinced that what DPDK does is acceptable to Fedora either,
> > based on the responses I've received when asking about BPF handling
> > during build.  It wouldn't surprise me, however, if this was simply
> > missed by reviewers when accepting DPDK into Fedora, because it is
> > not entirely obvious unless you are looking closely.

FWIW, I'm pushing back against the idea that we have to compile the
BPF code from master source, as I think it is reasonable to have the
program embedded as a static array in the source code similar to what
DPDK does.  It doesn't feel much different from other places where apps
use generated sources, and don't build them from the original source
every time. eg "configure" is never re-generated from "configure.ac"
by Fedora packagers, they just use the generated "configure" script
as-is.

Regards,
Daniel
Yuri Benditovich Nov. 5, 2020, 3:13 p.m. UTC | #19
First of all, thank you for all your feedback.

Please help me summarize, so that we understand better what to do in v2.
The major questions are:
1. Building the eBPF from source during the QEMU build vs. regenerating it
on demand and keeping it in the repository
Solution 1a (~ as in v1): keep the instructions or ELF in a header file,
generated outside the QEMU build. In general we'll need both BE and LE
binaries.
Solution 1b: build the ELF or instructions during the QEMU build if llvm +
clang exist. Then we will have only one (BE or LE, depending on the current
QEMU build).
We are fine with either solution - I believe you know the requirements
better.

2. Use libbpf or not
In general we do not see any advantage in using libbpf. It works with
object files (it does ELF parsing at load time), but it does not do any
magic.
Solution 2a: switch to libbpf, generate object files (LE and BE) from
source, and keep them inside QEMU (~8K each) or alongside it.
Solution 2b (as in v1): use a python script to parse the object into
instructions (~2K each).
We'd prefer not to use libbpf at the moment.
If for some reason we find it useful in the future, we can switch to it;
this does not create any incompatibility. It would, however, add a
dependency on libbpf.so.

3. Keep the instructions or ELF inside QEMU, or as a separate external file
Solution 3a (~ as in v1): a built-in array of instructions or ELF. If we
generate them outside the QEMU build, we keep two arrays of instructions or
ELFs (BE and LE).
Solution 3b: install them as separate files (/usr/share/qemu).
We'd prefer 3a:
 Then there is a guarantee that the eBPF is built with exactly the same
config structures as QEMU (QEMU creates a mapping of its structures, and
the eBPF program uses them).
 No need to handle scenarios like 'file not found', 'file is not
suitable', etc.

4. Is there a real demand for eBPF on big-endian systems?
If not, we can enable eBPF only for LE builds.

Jason, Daniel, Michael,
Can you please let us know what you think, and why?
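[Editor's sketch] As background for the options above: the hash the steering program computes is a Toeplitz-style hash over the packet's flow tuple, keyed by the RSS key the guest supplies. A minimal userspace sketch of that computation follows; it is not the series' actual eBPF code, and it assumes the key buffer is at least four bytes longer than the input.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Toeplitz hash sketch: for every set bit of the input (MSB first),
 * XOR in the 32-bit window of the key aligned with that bit.
 * 'key' must provide at least len + 4 bytes. */
static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data,
                              size_t len)
{
    uint32_t hash = 0;
    uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
                      ((uint32_t)key[2] << 8)  |  (uint32_t)key[3];
    size_t key_byte = 4;

    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit)) {
                hash ^= window;
            }
            /* slide the window one bit, pulling in the next key bit */
            window = (window << 1) | ((key[key_byte] >> bit) & 1u);
        }
        key_byte++;
    }
    return hash;
}
```

In the virtio-net RSS scheme the resulting hash indexes an indirection table to pick the destination virtqueue, which is what the steering program (eBPF or the 'software' fallback) does per packet.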

On Thu, Nov 5, 2020 at 3:19 PM Daniel P. Berrangé <berrange@redhat.com>
wrote:

> [earlier quoted thread snipped]
>
> FWIW, I'm pushing back against the idea that we have to compile the
> BPF code from master source, as I think it is reasonable to have the
> program embedded as a static array in the source code similar to what
> DPDK does.  It doesn't feel much different from other places where apps
> use generated sources, and don't build them from the original source
> every time. eg "configure" is never re-generated from "configure.ac"
> by Fedora packagers, they just use the generated "configure" script
> as-is.
>
> Regards,
> Daniel
<div dir="ltr">First of all, thank you for all your feedbacks<div><br></div><div>Please help me to summarize and let us understand better what we do in v2:</div><div>Major questions are: </div><div>1. Building eBPF from source during qemu build vs. regenerating it on demand and keeping in the repository</div><div>Solution 1a (~ as in v1): keep instructions or ELF in H file, generate it out of qemu build. In general we&#39;ll need to have BE and LE binaries.</div><div>Solution 1b: build ELF or instructions during QEMU build if llvm + clang exist. Then we will have only one (BE or LE, depending on current QEMU build)</div><div>We agree with any solution - I believe you know the requirements better. </div><div><br></div><div>2. Use libbpf or not</div><div>In general we do not see any advantage of using libbpf. It works with object files (does ELF parsing at time of loading), but it does not do any magic.</div><div>Solution 2a. Switch to libbpf, generate object files (LE and BE) from source, keep them inside QEMU (~8k each) or aside<br></div><div>Solution 2b. (as in v1) Use python script to parse object -&gt; instructions (~2k each)</div><div>We&#39;d prefer not to use libbpf at the moment.</div><div>If due to some unknown reason we&#39;ll find it useful in future, we can switch to it, this does not create any incompatibility. Then this will create a dependency on libbpf.so  </div><div><br></div><div>3. Keep instructions or ELF inside QEMU or as separate external file</div><div>Solution 3a (~as in v1): Built-in array of instructions or ELF. 
If we generate them out of QEMU build - keep 2 arrays or instructions or ELF (BE and LE), </div><div>Solution 3b: Install them as separate files (/usr/share/qemu).</div><div>We&#39;d prefer 3a:</div><div> Then there is a guarantee that the eBPF is built with exactly the same config structures as QEMU (qemu creates a mapping of its structures, eBPF uses them).</div><div> No need to take care on scenarios like &#39;file not found&#39;, &#39;file is not suitable&#39; etc</div><div><br></div><div>4. Is there some real request to have the eBPF for big-endian?<br></div><div>If no, we can enable eBPF only for LE builds</div><div><br></div><div>Jason, Daniel, Michael</div><div>Can you please let us know what you think and why?</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Nov 5, 2020 at 3:19 PM Daniel P. Berrangé &lt;<a href="mailto:berrange@redhat.com" target="_blank">berrange@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On Thu, Nov 05, 2020 at 10:01:09AM +0000, Daniel P. Berrangé wrote:<br>
&gt; On Thu, Nov 05, 2020 at 11:46:18AM +0800, Jason Wang wrote:<br>
&gt; &gt; <br>
&gt; &gt; On 2020/11/4 下午5:31, Daniel P. Berrangé wrote:<br>
&gt; &gt; &gt; On Wed, Nov 04, 2020 at 10:07:52AM +0800, Jason Wang wrote:<br>
&gt; &gt; &gt; &gt; On 2020/11/3 下午6:32, Yuri Benditovich wrote:<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; On Tue, Nov 3, 2020 at 11:02 AM Jason Wang &lt;<a href="mailto:jasowang@redhat.com" target="_blank">jasowang@redhat.com</a><br>
&gt; &gt; &gt; &gt; &gt; &lt;mailto:<a href="mailto:jasowang@redhat.com" target="_blank">jasowang@redhat.com</a>&gt;&gt; wrote:<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt;      On 2020/11/3 上午2:51, Andrew Melnychenko wrote:<br>
&gt; &gt; &gt; &gt; &gt;      &gt; Basic idea is to use eBPF to calculate and steer packets in TAP.<br>
&gt; &gt; &gt; &gt; &gt;      &gt; RSS(Receive Side Scaling) is used to distribute network packets<br>
&gt; &gt; &gt; &gt; &gt;      to guest virtqueues<br>
&gt; &gt; &gt; &gt; &gt;      &gt; by calculating packet hash.<br>
&gt; &gt; &gt; &gt; &gt;      &gt; eBPF RSS allows us to use RSS with vhost TAP.<br>
&gt; &gt; &gt; &gt; &gt;      &gt;<br>
&gt; &gt; &gt; &gt; &gt;      &gt; This set of patches introduces the usage of eBPF for packet steering<br>
&gt; &gt; &gt; &gt; &gt;      &gt; and RSS hash calculation:<br>
&gt; &gt; &gt; &gt; &gt;      &gt; * RSS(Receive Side Scaling) is used to distribute network packets to<br>
&gt; &gt; &gt; &gt; &gt;      &gt; guest virtqueues by calculating packet hash<br>
&gt; &gt; &gt; &gt; &gt;      &gt; * eBPF RSS suppose to be faster than already existing &#39;software&#39;<br>
&gt; &gt; &gt; &gt; &gt;      &gt; implementation in QEMU<br>
&gt; &gt; &gt; &gt; &gt;      &gt; * Additionally adding support for the usage of RSS with vhost<br>
&gt; &gt; &gt; &gt; &gt;      &gt;<br>
&gt; &gt; &gt; &gt; &gt;      &gt; Supported kernels: 5.8+<br>
&gt; &gt; &gt; &gt; &gt;      &gt;<br>
&gt; &gt; &gt; &gt; &gt;      &gt; Implementation notes:<br>
&gt; &gt; &gt; &gt; &gt;      &gt; Linux TAP TUNSETSTEERINGEBPF ioctl was used to set the eBPF program.<br>
&gt; &gt; &gt; &gt; &gt;      &gt; Added eBPF support to qemu directly through a system call, see the<br>
&gt; &gt; &gt; &gt; &gt;      &gt; bpf(2) for details.<br>
&gt; &gt; &gt; &gt; &gt;      &gt; The eBPF program is part of the qemu and presented as an array<br>
&gt; &gt; &gt; &gt; &gt;      of bpf<br>
&gt; &gt; &gt; &gt; &gt;      &gt; instructions.<br>
&gt; &gt; &gt; &gt; &gt;      &gt; The program can be recompiled by provided Makefile.ebpf(need to<br>
&gt; &gt; &gt; &gt; &gt;      adjust<br>
&gt; &gt; &gt; &gt; &gt;      &gt; &#39;linuxhdrs&#39;),<br>
&gt; &gt; &gt; &gt; &gt;      &gt; although it&#39;s not required to build QEMU with eBPF support.<br>
&gt; &gt; &gt; &gt; &gt;      &gt; Added changes to virtio-net and vhost, primary eBPF RSS is used.<br>
&gt; &gt; &gt; &gt; &gt;      &gt; &#39;Software&#39; RSS used in the case of hash population and as a<br>
&gt; &gt; &gt; &gt; &gt;      fallback option.<br>
&gt; &gt; &gt; &gt; &gt;      &gt; For vhost, the hash population feature is not reported to the guest.<br>
&gt; &gt; &gt; &gt; &gt;      &gt;<br>
&gt; &gt; &gt; &gt; &gt;      &gt; Please also see the documentation in PATCH 6/6.<br>
&gt; &gt; &gt; &gt; &gt;      &gt;<br>
&gt; &gt; &gt; &gt; &gt;      &gt; I am sending those patches as RFC to initiate the discussions<br>
&gt; &gt; &gt; &gt; &gt;      and get<br>
&gt; &gt; &gt; &gt; &gt;      &gt; feedback on the following points:<br>
&gt; &gt; &gt; &gt; &gt;      &gt; * Fallback when eBPF is not supported by the kernel<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt;      Yes, and it could also a lacking of CAP_BPF.<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt;      &gt; * Live migration to the kernel that doesn&#39;t have eBPF support<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt;      Is there anything that we needs special treatment here?<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; Possible case: rss=on, vhost=on, source system with kernel 5.8<br>
&gt; &gt; &gt; &gt; &gt; (everything works) -&gt; dest. system 5.6 (bpf does not work), the adapter<br>
&gt; &gt; &gt; &gt; &gt; functions, but all the steering does not use proper queues.<br>
&gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; Right, I think we need to disable vhost on dest.<br>
&gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt;      &gt; * Integration with current QEMU build<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt;      Yes, a question here:<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt;      1) Any reason for not using libbpf, e.g it has been shipped with some<br>
&gt; &gt; &gt; &gt; &gt;      distros<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; We intentionally do not use libbpf, as it present only on some distros.<br>
&gt; &gt; &gt; &gt; &gt; We can switch to libbpf, but this will disable bpf if libbpf is not<br>
&gt; &gt; &gt; &gt; &gt; installed<br>
&gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; That&#39;s better I think.<br>
&gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt;      2) It would be better if we can avoid shipping bytecodes<br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; &gt; This creates new dependencies: llvm + clang + ...<br>
&gt; &gt; &gt; &gt; &gt; We would prefer byte code and ability to generate it if prerequisites<br>
&gt; &gt; &gt; &gt; &gt; are installed.<br>
&gt; &gt; &gt; &gt; <br>
&gt; &gt; &gt; &gt; It&#39;s probably ok if we treat the bytecode as a kind of firmware.<br>
&gt; &gt; &gt; That is explicitly *not* OK for inclusion in Fedora. They require that<br>
&gt; &gt; &gt; BPF is compiled from source, and rejected my suggestion that it could<br>
&gt; &gt; &gt; be considered a kind of firmware and thus have an exception from building<br>
&gt; &gt; &gt; from source.<br>
&gt; &gt; <br>
&gt; &gt; <br>
&gt; &gt; Please refer what it was done in DPDK:<br>
&gt; &gt; <br>
&gt; &gt; <a href="http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235" rel="noreferrer" target="_blank">http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235</a><br>
&gt; &gt; <br>
&gt; &gt; I don&#39;t think what proposed here makes anything different.<br>
&gt; <br>
&gt; I&#39;m not convinced that what DPDK does is acceptable to Fedora either<br>
&gt; based on the responses I&#39;ve received when asking about BPF handling<br>
&gt; during build.  I wouldn&#39;t suprise me, however, if this was simply<br>
&gt; missed by reviewers when accepting DPDK into Fedora, because it is<br>
&gt; not entirely obvious unless you are looking closely.<br>
<br>
FWIW, I&#39;m pushing back against the idea that we have to compile the<br>
BPF code from master source, as I think it is reasonable to have the<br>
program embedded as a static array in the source code similar to what<br>
DPDK does.  It doesn&#39;t feel much different from other places where apps<br>
use generated sources, and don&#39;t build them from the original source<br>
every time. eg &quot;configure&quot; is never re-generated from &quot;<a href="http://configure.ac" rel="noreferrer" target="_blank">configure.ac</a>&quot;<br>
by Fedora packagers, they just use the generated &quot;configure&quot; script<br>
as-is.<br>
<br>
Regards,<br>
Daniel<br>
-- <br>
|: <a href="https://berrange.com" rel="noreferrer" target="_blank">https://berrange.com</a>      -o-    <a href="https://www.flickr.com/photos/dberrange" rel="noreferrer" target="_blank">https://www.flickr.com/photos/dberrange</a> :|<br>
|: <a href="https://libvirt.org" rel="noreferrer" target="_blank">https://libvirt.org</a>         -o-            <a href="https://fstop138.berrange.com" rel="noreferrer" target="_blank">https://fstop138.berrange.com</a> :|<br>
|: <a href="https://entangle-photo.org" rel="noreferrer" target="_blank">https://entangle-photo.org</a>    -o-    <a href="https://www.instagram.com/dberrange" rel="noreferrer" target="_blank">https://www.instagram.com/dberrange</a> :|<br>
<br>
</blockquote></div>
Jason Wang Nov. 9, 2020, 2:13 a.m. UTC | #20
On 2020/11/5 11:13 PM, Yuri Benditovich wrote:
> First of all, thank you for all your feedbacks
>
> Please help me to summarize and let us understand better what we do in v2:
> Major questions are:
> 1. Building eBPF from source during qemu build vs. regenerating it on
> demand and keeping in the repository
> Solution 1a (~ as in v1): keep instructions or ELF in H file, generate
> it out of qemu build. In general we'll need to have BE and LE binaries.
> Solution 1b: build ELF or instructions during QEMU build if llvm +
> clang exist. Then we will have only one (BE or LE, depending on
> current QEMU build)
> We agree with any solution - I believe you know the requirements better.

I think we can go with 1a. (See Daniel's comment)

>
> 2. Use libbpf or not
> In general we do not see any advantage of using libbpf. It works with
> object files (does ELF parsing at time of loading), but it does not do
> any magic.
> Solution 2a. Switch to libbpf, generate object files (LE and BE) from
> source, keep them inside QEMU (~8k each) or aside

Can we simply use dynamic linking here?

> Solution 2b. (as in v1) Use python script to parse object ->
> instructions (~2k each)
> We'd prefer not to use libbpf at the moment.
> If due to some unknown reason we'll find it useful in future, we can
> switch to it, this does not create any incompatibility. Then this will
> create a dependency on libbpf.so

I think we need to care about compatibility. E.g. we need to enable BTF,
and I don't know how hard it would be to add BTF support in the current
design. It would probably be OK if it's not a lot of effort.

>
> 3. Keep instructions or ELF inside QEMU or as separate external file
> Solution 3a (~as in v1): Built-in array of instructions or ELF. If we
> generate them out of QEMU build - keep 2 arrays or instructions or ELF
> (BE and LE),
> Solution 3b: Install them as separate files (/usr/share/qemu).
> We'd prefer 3a:
>  Then there is a guarantee that the eBPF is built with exactly the
> same config structures as QEMU (qemu creates a mapping of its
> structures, eBPF uses them).
>  No need to take care on scenarios like 'file not found', 'file is not
> suitable' etc

Yes, let's go with 3a for upstream.

>
> 4. Is there some real request to have the eBPF for big-endian?
> If no, we can enable eBPF only for LE builds

We can go with LE first.

Thanks

>
> Jason, Daniel, Michael
> Can you please let us know what you think and why?

>
> On Thu, Nov 5, 2020 at 3:19 PM Daniel P. Berrangé <berrange@redhat.com
> <mailto:berrange@redhat.com>> wrote:
>
>     On Thu, Nov 05, 2020 at 10:01:09AM +0000, Daniel P. Berrangé wrote:
>     > On Thu, Nov 05, 2020 at 11:46:18AM +0800, Jason Wang wrote:
>     > >
>     > > On 2020/11/4 5:31 PM, Daniel P. Berrangé wrote:
>     > > > On Wed, Nov 04, 2020 at 10:07:52AM +0800, Jason Wang wrote:
>     > > > > On 2020/11/3 6:32 PM, Yuri Benditovich wrote:
>     > > > > >
>     > > > > > On Tue, Nov 3, 2020 at 11:02 AM Jason Wang
>     <jasowang@redhat.com <mailto:jasowang@redhat.com>
>     > > > > > <mailto:jasowang@redhat.com
>     <mailto:jasowang@redhat.com>>> wrote:
>     > > > > >
>     > > > > >
>     > > > > >      On 2020/11/3 2:51 AM, Andrew Melnychenko wrote:
>     > > > > >      > Basic idea is to use eBPF to calculate and steer
>     packets in TAP.
>     > > > > >      > RSS(Receive Side Scaling) is used to distribute
>     network packets
>     > > > > >      to guest virtqueues
>     > > > > >      > by calculating packet hash.
>     > > > > >      > eBPF RSS allows us to use RSS with vhost TAP.
>     > > > > >      >
>     > > > > >      > This set of patches introduces the usage of eBPF
>     for packet steering
>     > > > > >      > and RSS hash calculation:
>     > > > > >      > * RSS(Receive Side Scaling) is used to distribute
>     network packets to
>     > > > > >      > guest virtqueues by calculating packet hash
>     > > > > >      > * eBPF RSS suppose to be faster than already
>     existing 'software'
>     > > > > >      > implementation in QEMU
>     > > > > >      > * Additionally adding support for the usage of
>     RSS with vhost
>     > > > > >      >
>     > > > > >      > Supported kernels: 5.8+
>     > > > > >      >
>     > > > > >      > Implementation notes:
>     > > > > >      > Linux TAP TUNSETSTEERINGEBPF ioctl was used to
>     set the eBPF program.
>     > > > > >      > Added eBPF support to qemu directly through a
>     system call, see the
>     > > > > >      > bpf(2) for details.
>     > > > > >      > The eBPF program is part of the qemu and
>     presented as an array
>     > > > > >      of bpf
>     > > > > >      > instructions.
>     > > > > >      > The program can be recompiled by provided
>     Makefile.ebpf(need to
>     > > > > >      adjust
>     > > > > >      > 'linuxhdrs'),
>     > > > > >      > although it's not required to build QEMU with
>     eBPF support.
>     > > > > >      > Added changes to virtio-net and vhost, primary
>     eBPF RSS is used.
>     > > > > >      > 'Software' RSS used in the case of hash
>     population and as a
>     > > > > >      fallback option.
>     > > > > >      > For vhost, the hash population feature is not
>     reported to the guest.
>     > > > > >      >
>     > > > > >      > Please also see the documentation in PATCH 6/6.
>     > > > > >      >
>     > > > > >      > I am sending those patches as RFC to initiate the
>     discussions
>     > > > > >      and get
>     > > > > >      > feedback on the following points:
>     > > > > >      > * Fallback when eBPF is not supported by the kernel
>     > > > > >
>     > > > > >
>     > > > > >      Yes, and it could also a lacking of CAP_BPF.
>     > > > > >
>     > > > > >
>     > > > > >      > * Live migration to the kernel that doesn't have
>     eBPF support
>     > > > > >
>     > > > > >
>     > > > > >      Is there anything that we needs special treatment here?
>     > > > > >
>     > > > > > Possible case: rss=on, vhost=on, source system with
>     kernel 5.8
>     > > > > > (everything works) -> dest. system 5.6 (bpf does not
>     work), the adapter
>     > > > > > functions, but all the steering does not use proper queues.
>     > > > >
>     > > > > Right, I think we need to disable vhost on dest.
>     > > > >
>     > > > >
>     > > > > >
>     > > > > >
>     > > > > >      > * Integration with current QEMU build
>     > > > > >
>     > > > > >
>     > > > > >      Yes, a question here:
>     > > > > >
>     > > > > >      1) Any reason for not using libbpf, e.g it has been
>     shipped with some
>     > > > > >      distros
>     > > > > >
>     > > > > >
>     > > > > > We intentionally do not use libbpf, as it present only
>     on some distros.
>     > > > > > We can switch to libbpf, but this will disable bpf if
>     libbpf is not
>     > > > > > installed
>     > > > >
>     > > > > That's better I think.
>     > > > >
>     > > > >
>     > > > > >      2) It would be better if we can avoid shipping
>     bytecodes
>     > > > > >
>     > > > > >
>     > > > > >
>     > > > > > This creates new dependencies: llvm + clang + ...
>     > > > > > We would prefer byte code and ability to generate it if
>     prerequisites
>     > > > > > are installed.
>     > > > >
>     > > > > It's probably ok if we treat the bytecode as a kind of
>     firmware.
>     > > > That is explicitly *not* OK for inclusion in Fedora. They
>     require that
>     > > > BPF is compiled from source, and rejected my suggestion that
>     it could
>     > > > be considered a kind of firmware and thus have an exception
>     from building
>     > > > from source.
>     > >
>     > >
>     > > Please refer what it was done in DPDK:
>     > >
>     > > http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235
>     > >
>     > > I don't think what proposed here makes anything different.
>     >
>     > I'm not convinced that what DPDK does is acceptable to Fedora either
>     > based on the responses I've received when asking about BPF handling
>     > during build.  I wouldn't suprise me, however, if this was simply
>     > missed by reviewers when accepting DPDK into Fedora, because it is
>     > not entirely obvious unless you are looking closely.
>
>     FWIW, I'm pushing back against the idea that we have to compile the
>     BPF code from master source, as I think it is reasonable to have the
>     program embedded as a static array in the source code similar to what
>     DPDK does.  It doesn't feel much different from other places where
>     apps
>     use generated sources, and don't build them from the original source
>     every time. eg "configure" is never re-generated from
>     "configure.ac <http://configure.ac>"
>     by Fedora packagers, they just use the generated "configure" script
>     as-is.
>
>     Regards,
>     Daniel
>     -- 
>     |: https://berrange.com     -o-
>     https://www.flickr.com/photos/dberrange :|
>     |: https://libvirt.org        -o- https://fstop138.berrange.com :|
>     |: https://entangle-photo.org   -o-
>     https://www.instagram.com/dberrange :|
>
Yuri Benditovich Nov. 9, 2020, 1:33 p.m. UTC | #21
On Mon, Nov 9, 2020 at 4:14 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/11/5 11:13 PM, Yuri Benditovich wrote:
> > First of all, thank you for all your feedbacks
> >
> > Please help me to summarize and let us understand better what we do in
> v2:
> > Major questions are:
> > 1. Building eBPF from source during qemu build vs. regenerating it on
> > demand and keeping in the repository
> > Solution 1a (~ as in v1): keep instructions or ELF in H file, generate
> > it out of qemu build. In general we'll need to have BE and LE binaries.
> > Solution 1b: build ELF or instructions during QEMU build if llvm +
> > clang exist. Then we will have only one (BE or LE, depending on
> > current QEMU build)
> > We agree with any solution - I believe you know the requirements better.
>
> I think we can go with 1a. (See Danial's comment)
>
> >
> > 2. Use libbpf or not
> > In general we do not see any advantage of using libbpf. It works with
> > object files (does ELF parsing at time of loading), but it does not do
> > any magic.
> > Solution 2a. Switch to libbpf, generate object files (LE and BE) from
> > source, keep them inside QEMU (~8k each) or aside
>
> Can we simply use dynamic linking here?
>

Can you please explain where exactly you suggest using dynamic linking?

> > Solution 2b. (as in v1) Use python script to parse object ->
> > instructions (~2k each)
> > We'd prefer not to use libbpf at the moment.
> > If due to some unknown reason we'll find it useful in future, we can
> > switch to it, this does not create any incompatibility. Then this will
> > create a dependency on libbpf.so
>
> I think we need to care about compatibility. E.g we need to enable BTF
> so I don't know how hard if we add BTF support in the current design. It
> would be probably OK it's not a lot of effort.
>

As far as we understand, BTF helps in BPF debugging, and libbpf supports it
as is.
Without libbpf, in v1 we load only the BPF instructions.
If you think BTF is mandatory (BTW, why?), I think it is better to
switch to libbpf and keep the entire ELF in the QEMU data.


<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 9, 2020 at 4:14 AM Jason Wang &lt;<a href="mailto:jasowang@redhat.com">jasowang@redhat.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
On 2020/11/5 下午11:13, Yuri Benditovich wrote:<br>
&gt; First of all, thank you for all your feedbacks<br>
&gt;<br>
&gt; Please help me to summarize and let us understand better what we do in v2:<br>
&gt; Major questions are:<br>
&gt; 1. Building eBPF from source during qemu build vs. regenerating it on <br>
&gt; demand and keeping in the repository<br>
&gt; Solution 1a (~ as in v1): keep instructions or ELF in H file, generate <br>
&gt; it out of qemu build. In general we&#39;ll need to have BE and LE binaries.<br>
&gt; Solution 1b: build ELF or instructions during QEMU build if llvm + <br>
&gt; clang exist. Then we will have only one (BE or LE, depending on <br>
&gt; current QEMU build)<br>
&gt; We agree with any solution - I believe you know the requirements better.<br>
<br>
<br>
I think we can go with 1a. (See Danial&#39;s comment)<br>
<br>
<br>
&gt;<br>
&gt; 2. Use libbpf or not<br>
&gt; In general we do not see any advantage of using libbpf. It works with <br>
&gt; object files (does ELF parsing at time of loading), but it does not do <br>
&gt; any magic.<br>
&gt; Solution 2a. Switch to libbpf, generate object files (LE and BE) from <br>
&gt; source, keep them inside QEMU (~8k each) or aside<br>
<br>
<br>
Can we simply use dynamic linking here?<br>
<br></blockquote><div><br></div><div>Can you please explain, where exactly you suggest to use dynamic linking?<br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
&gt; Solution 2b. (as in v1) Use python script to parse object -&gt; <br>
&gt; instructions (~2k each)<br>
&gt; We&#39;d prefer not to use libbpf at the moment.<br>
&gt; If due to some unknown reason we&#39;ll find it useful in future, we can <br>
&gt; switch to it, this does not create any incompatibility. Then this will <br>
&gt; create a dependency on libbpf.so<br>
<br>
<br>
I think we need to care about compatibility. E.g we need to enable BTF <br>
so I don&#39;t know how hard if we add BTF support in the current design. It <br>
would be probably OK it&#39;s not a lot of effort.<br>
<br></blockquote><div><br></div><div>As far as we understand BTF helps in BPF debugging and libbpf supports it as is.<br></div><div>Without libbpf we in v1 load the BPF instructions only.</div><div>If you think the BTF is mandatory (BTW, why?) I think it is better to switch to libbpf and keep the entire ELF in the qemu data.</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<br>
&gt;<br>
&gt; 3. Keep instructions or ELF inside QEMU or as separate external file<br>
&gt; Solution 3a (~as in v1): Built-in array of instructions or ELF. If we <br>
&gt; generate them out of QEMU build - keep 2 arrays or instructions or ELF <br>
&gt; (BE and LE),<br>
&gt; Solution 3b: Install them as separate files (/usr/share/qemu).<br>
&gt; We&#39;d prefer 3a:<br>
&gt;  Then there is a guarantee that the eBPF is built with exactly the <br>
&gt; same config structures as QEMU (qemu creates a mapping of its <br>
&gt; structures, eBPF uses them).<br>
&gt;  No need to take care on scenarios like &#39;file not found&#39;, &#39;file is not <br>
&gt; suitable&#39; etc<br>
<br>
<br>
Yes, let&#39;s go 3a for upstream.<br>
<br>
<br>
&gt;<br>
&gt; 4. Is there some real request to have the eBPF for big-endian?<br>
&gt; If no, we can enable eBPF only for LE builds<br>
<br>
<br>
We can go with LE first.<br>
<br>
Thanks<br>
<br>
<br>
&gt;<br>
&gt; Jason, Daniel, Michael<br>
&gt; Can you please let us know what you think and why?<br>
&gt;<br>
&gt; On Thu, Nov 5, 2020 at 3:19 PM Daniel P. Berrangé &lt;<a href="mailto:berrange@redhat.com" target="_blank">berrange@redhat.com</a> <br>
&gt; &lt;mailto:<a href="mailto:berrange@redhat.com" target="_blank">berrange@redhat.com</a>&gt;&gt; wrote:<br>
&gt;<br>
&gt;     On Thu, Nov 05, 2020 at 10:01:09AM +0000, Daniel P. Berrangé wrote:<br>
&gt;     &gt; On Thu, Nov 05, 2020 at 11:46:18AM +0800, Jason Wang wrote:<br>
&gt;     &gt; &gt;<br>
&gt;     &gt; &gt; On 2020/11/4 下午5:31, Daniel P. Berrangé wrote:<br>
&gt;     &gt; &gt; &gt; On Wed, Nov 04, 2020 at 10:07:52AM +0800, Jason Wang wrote:<br>
&gt;     &gt; &gt; &gt; &gt; On 2020/11/3 下午6:32, Yuri Benditovich wrote:<br>
&gt;     &gt; &gt; &gt; &gt; &gt;<br>
&gt;     &gt; &gt; &gt; &gt; &gt; On Tue, Nov 3, 2020 at 11:02 AM Jason Wang<br>
&gt;     &lt;<a href="mailto:jasowang@redhat.com" target="_blank">jasowang@redhat.com</a> &lt;mailto:<a href="mailto:jasowang@redhat.com" target="_blank">jasowang@redhat.com</a>&gt;<br>
&gt;     &gt; &gt; &gt; &gt; &gt; &lt;mailto:<a href="mailto:jasowang@redhat.com" target="_blank">jasowang@redhat.com</a><br>
&gt;     &lt;mailto:<a href="mailto:jasowang@redhat.com" target="_blank">jasowang@redhat.com</a>&gt;&gt;&gt; wrote:<br>
&gt;     &gt; &gt; &gt; &gt; &gt;<br>
&gt;     &gt; &gt; &gt; &gt; &gt;<br>
&gt;     &gt; &gt; &gt; &gt; &gt;      On 2020/11/3 上午2:51, Andrew Melnychenko wrote:<br>
>     > > > > >      > [... cover letter quoted in full; trimmed here, see above ...]
>     > > > > >      > I am sending those patches as RFC to initiate the
>     > > > > >      > discussions and get
>     > > > > >      > feedback on the following points:
>     > > > > >      > * Fallback when eBPF is not supported by the kernel
>     > > > > >
>     > > > > >      Yes, and it could also be a lack of CAP_BPF.
>     > > > > >
>     > > > > >      > * Live migration to the kernel that doesn't have eBPF support
>     > > > > >
>     > > > > >      Is there anything that needs special treatment here?
>     > > > > >
>     > > > > > Possible case: rss=on, vhost=on, source system with kernel 5.8
>     > > > > > (everything works) -> dest. system 5.6 (bpf does not work), the adapter
>     > > > > > functions, but all the steering does not use proper queues.
>     > > > >
>     > > > > Right, I think we need to disable vhost on dest.
>     > > > >
>     > > > > >
>     > > > > >      > * Integration with current QEMU build
>     > > > > >
>     > > > > >      Yes, a question here:
>     > > > > >
>     > > > > >      1) Any reason for not using libbpf, e.g it has been shipped with
>     > > > > >      some distros
>     > > > > >
>     > > > > > We intentionally do not use libbpf, as it is present only on some distros.
>     > > > > > We can switch to libbpf, but this will disable bpf if libbpf is not
>     > > > > > installed
>     > > > >
>     > > > > That's better I think.
>     > > > >
>     > > > > >      2) It would be better if we can avoid shipping bytecodes
>     > > > > >
>     > > > > > This creates new dependencies: llvm + clang + ...
>     > > > > > We would prefer byte code and the ability to generate it if
>     > > > > > prerequisites are installed.
>     > > > >
>     > > > > It's probably ok if we treat the bytecode as a kind of firmware.
>     > > > That is explicitly *not* OK for inclusion in Fedora. They require that
>     > > > BPF is compiled from source, and rejected my suggestion that it could
>     > > > be considered a kind of firmware and thus have an exception from
>     > > > building from source.
>     > >
>     > > Please refer to what was done in DPDK:
>     > >
>     > > http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235
>     > >
>     > > I don't think what is proposed here makes anything different.
>     >
>     > I'm not convinced that what DPDK does is acceptable to Fedora either
>     > based on the responses I've received when asking about BPF handling
>     > during build.  It wouldn't surprise me, however, if this was simply
>     > missed by reviewers when accepting DPDK into Fedora, because it is
>     > not entirely obvious unless you are looking closely.
>
>     FWIW, I'm pushing back against the idea that we have to compile the
>     BPF code from master source, as I think it is reasonable to have the
>     program embedded as a static array in the source code similar to what
>     DPDK does.  It doesn't feel much different from other places where apps
>     use generated sources, and don't build them from the original source
>     every time. eg "configure" is never re-generated from "configure.ac"
>     by Fedora packagers, they just use the generated "configure" script
>     as-is.
>
>     Regards,
>     Daniel
>     --
>     |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
>     |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
>     |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
>
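For context on the hashing and steering points discussed above: the per-packet hash that both the in-QEMU "software" RSS and the eBPF program have to compute is the Toeplitz hash (as used by virtio-net RSS, following Microsoft's RSS definition), and the receive queue is then picked from the guest-programmed indirection table. A rough, hypothetical sketch of the math, not the series' actual code:

```python
def toeplitz_hash(key: bytes, data: bytes) -> int:
    """Toeplitz hash: for every set bit of the input (MSB first), XOR in
    the 32-bit window of the key that starts at that bit position.
    `key` is the RSS key (40 bytes for virtio-net), `data` is the
    concatenation of the packet's address/port fields in network order."""
    key_int = int.from_bytes(key, "big")
    key_bits = len(key) * 8
    result = 0
    for i, byte in enumerate(data):
        for b in range(8):
            if byte & (0x80 >> b):
                result ^= (key_int >> (key_bits - 32 - (i * 8 + b))) & 0xFFFFFFFF
    return result

def rss_queue(pkt_hash: int, indirection_table: list) -> int:
    """Steer the packet: index the guest-supplied indirection table
    with the hash (the table length is a power of two)."""
    return indirection_table[pkt_hash % len(indirection_table)]
```

This is what the eBPF steering program has to reproduce per packet so that vhost TAP queue selection matches what the 'software' RSS fallback would have chosen.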
Jason Wang Nov. 10, 2020, 2:23 a.m. UTC | #22
On 2020/11/9 下午9:33, Yuri Benditovich wrote:
>
> On Mon, Nov 9, 2020 at 4:14 AM Jason Wang <jasowang@redhat.com> wrote:
>
>     On 2020/11/5 下午11:13, Yuri Benditovich wrote:
>     > First of all, thank you for all your feedbacks
>     >
>     > Please help me to summarize and let us understand better what we
>     > do in v2:
>     > Major questions are:
>     > 1. Building eBPF from source during qemu build vs. regenerating
>     > it on demand and keeping in the repository
>     > Solution 1a (~ as in v1): keep instructions or ELF in H file,
>     > generate it out of qemu build. In general we'll need to have BE
>     > and LE binaries.
>     > Solution 1b: build ELF or instructions during QEMU build if llvm +
>     > clang exist. Then we will have only one (BE or LE, depending on
>     > current QEMU build)
>     > We agree with any solution - I believe you know the requirements
>     > better.
>
>     I think we can go with 1a. (See Daniel's comment)
>
>     >
>     > 2. Use libbpf or not
>     > In general we do not see any advantage of using libbpf. It works
>     > with object files (does ELF parsing at time of loading), but it
>     > does not do any magic.
>     > Solution 2a. Switch to libbpf, generate object files (LE and BE)
>     > from source, keep them inside QEMU (~8k each) or aside
>
>     Can we simply use dynamic linking here?
>
> Can you please explain, where exactly you suggest to use dynamic linking?


Yes. If I understand your 2a properly, you meant static linking of
libbpf. So what I want to ask is the possibility of dynamic linking of
libbpf here.


>
>     > Solution 2b. (as in v1) Use python script to parse object ->
>     > instructions (~2k each)
>     > We'd prefer not to use libbpf at the moment.
>     > If due to some unknown reason we'll find it useful in future, we
>     > can switch to it, this does not create any incompatibility. Then
>     > this will create a dependency on libbpf.so
>
>     I think we need to care about compatibility. E.g we need to enable
>     BTF, so I don't know how hard it is if we add BTF support in the
>     current design. It would probably be OK if it's not a lot of effort.
>
> As far as we understand BTF helps in BPF debugging and libbpf supports
> it as is.
> Without libbpf we in v1 load the BPF instructions only.
> If you think the BTF is mandatory (BTW, why?) I think it is better to
> switch to libbpf and keep the entire ELF in the qemu data.


It is used to make sure the BPF program can compile once and run everywhere.

This is explained in detail here:
https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html

Thanks
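For reference on solution 2b above: the v1 "python script to parse object -> instructions" essentially pulls one section's bytes out of the clang-produced object file. A minimal, hypothetical sketch of that extraction for a little-endian ELF64 object (this is not the script from the series; offsets follow the ELF-64 header and section-header layouts):

```python
import struct

def elf64_section(blob: bytes, wanted: str) -> bytes:
    """Return the contents of the named section from a little-endian ELF64 image."""
    assert blob[:4] == b"\x7fELF", "not an ELF file"
    # ELF64 header fields we need: section-header table offset (e_shoff at 0x28),
    # entry count (e_shnum at 0x3C), and the index of the section-name string
    # table (e_shstrndx at 0x3E). Entry size is assumed to be the usual 64 bytes.
    e_shoff, = struct.unpack_from("<Q", blob, 0x28)
    e_shnum, = struct.unpack_from("<H", blob, 0x3C)
    e_shstrndx, = struct.unpack_from("<H", blob, 0x3E)

    def shdr(i):
        # sh_name, sh_type, sh_flags, sh_addr, sh_offset, sh_size
        name, _type, _flags, _addr, off, size = struct.unpack_from(
            "<IIQQQQ", blob, e_shoff + i * 64)
        return name, off, size

    # Section names live in the string table section at index e_shstrndx.
    _, str_off, str_size = shdr(e_shstrndx)
    strtab = blob[str_off:str_off + str_size]

    for i in range(e_shnum):
        name_off, off, size = shdr(i)
        name = strtab[name_off:strtab.index(b"\x00", name_off)].decode()
        if name == wanted:
            return blob[off:off + size]
    raise KeyError(wanted)
```

With something like this, the build step for 2b can extract the program section and dump it as an instruction array, without linking against libbpf at all; libbpf does the same parsing (plus relocations and BTF handling) at load time.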


> [...]
Yuri Benditovich Nov. 10, 2020, 8 a.m. UTC | #23
On Tue, Nov 10, 2020 at 4:23 AM Jason Wang <jasowang@redhat.com> wrote:

>
> On 2020/11/9 下午9:33, Yuri Benditovich wrote:
> > [...]

> >
> >     Can we simply use dynamic linking here?
> >
> > Can you please explain, where exactly you suggest to use dynamic linking?
>
> Yes. If I understand your 2a properly, you meant static linking of
> libbpf. So what I want to ask is the possibility of dynamic linking of
> libbpf here.
>

As Daniel explained above, QEMU is always linked dynamically against libraries.
Also I see the libbpf package does not even contain the static library.
If the build environment contains libbpf, the libbpf.so becomes a runtime
dependency, just as with other libs.


>
> > [...]

> > As far as we understand BTF helps in BPF debugging and libbpf supports
> > it as is.
> > Without libbpf we in v1 load the BPF instructions only.
> > If you think the BTF is mandatory (BTW, why?) I think it is better to
> > switch to libbpf and keep the entire ELF in the qemu data.
>
> It is used to make sure the BPF program can compile once and run everywhere.
>
> This is explained in detail here:
> https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html

Thank you, then there is no question, we need to use libbpf.


> Thanks

>

>

> >

> >

> >     >

> >     > 3. Keep instructions or ELF inside QEMU or as separate external

> file

> >     > Solution 3a (~as in v1): Built-in array of instructions or ELF.

> >     If we

> >     > generate them out of QEMU build - keep 2 arrays or instructions

> >     or ELF

> >     > (BE and LE),

> >     > Solution 3b: Install them as separate files (/usr/share/qemu).

> >     > We'd prefer 3a:

> >     >  Then there is a guarantee that the eBPF is built with exactly the

> >     > same config structures as QEMU (qemu creates a mapping of its

> >     > structures, eBPF uses them).

> >     >  No need to take care on scenarios like 'file not found', 'file

> >     is not

> >     > suitable' etc

> >

> >

> >     Yes, let's go 3a for upstream.

> >

> >

> >     >

> >     > 4. Is there some real request to have the eBPF for big-endian?

> >     > If no, we can enable eBPF only for LE builds

> >

> >

> >     We can go with LE first.

> >

> >     Thanks

> >

> >

> >     >

> >     > Jason, Daniel, Michael

> >     > Can you please let us know what you think and why?

> >     >

> >     > On Thu, Nov 5, 2020 at 3:19 PM Daniel P. Berrangé

> >     <berrange@redhat.com <mailto:berrange@redhat.com>

> >     > <mailto:berrange@redhat.com <mailto:berrange@redhat.com>>> wrote:

> >     >

> >     >     On Thu, Nov 05, 2020 at 10:01:09AM +0000, Daniel P. Berrangé

> >     wrote:

> >     >     > On Thu, Nov 05, 2020 at 11:46:18AM +0800, Jason Wang wrote:

> >     >     > >

> >     >     > > On 2020/11/4 下午5:31, Daniel P. Berrangé wrote:

> >     >     > > > On Wed, Nov 04, 2020 at 10:07:52AM +0800, Jason Wang

> >     wrote:

> >     >     > > > > On 2020/11/3 下午6:32, Yuri Benditovich wrote:

> >     >     > > > > >

> >     >     > > > > > On Tue, Nov 3, 2020 at 11:02 AM Jason Wang

> >     >     <jasowang@redhat.com <mailto:jasowang@redhat.com>

> >     <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>

> >     >     > > > > > <mailto:jasowang@redhat.com

> >     <mailto:jasowang@redhat.com>

> >     >     <mailto:jasowang@redhat.com <mailto:jasowang@redhat.com>>>>

> >     wrote:

> >     >     > > > > >

> >     >     > > > > >

> >     >     > > > > >      On 2020/11/3 上午2:51, Andrew Melnychenko wrote:

> >     >     > > > > >      > Basic idea is to use eBPF to calculate and

> >     steer

> >     >     packets in TAP.

> >     >     > > > > >      > RSS(Receive Side Scaling) is used to

> distribute

> >     >     network packets

> >     >     > > > > >      to guest virtqueues

> >     >     > > > > >      > by calculating packet hash.

> >     >     > > > > >      > eBPF RSS allows us to use RSS with vhost TAP.

> >     >     > > > > >      >

> >     >     > > > > >      > This set of patches introduces the usage of

> >     eBPF

> >     >     for packet steering

> >     >     > > > > >      > and RSS hash calculation:

> >     >     > > > > >      > * RSS(Receive Side Scaling) is used to

> >     distribute

> >     >     network packets to

> >     >     > > > > >      > guest virtqueues by calculating packet hash

> >     >     > > > > >      > * eBPF RSS suppose to be faster than already

> >     >     existing 'software'

> >     >     > > > > >      > implementation in QEMU

> >     >     > > > > >      > * Additionally adding support for the usage of

> >     >     RSS with vhost

> >     >     > > > > >      >

> >     >     > > > > >      > Supported kernels: 5.8+

> >     >     > > > > >      >

> >     >     > > > > >      > Implementation notes:

> >     >     > > > > >      > Linux TAP TUNSETSTEERINGEBPF ioctl was used to

> >     >     set the eBPF program.

> >     >     > > > > >      > Added eBPF support to qemu directly through a

> >     >     system call, see the

> >     >     > > > > >      > bpf(2) for details.

> >     >     > > > > >      > The eBPF program is part of the qemu and

> >     >     presented as an array

> >     >     > > > > >      of bpf

> >     >     > > > > >      > instructions.

> >     >     > > > > >      > The program can be recompiled by provided

> >     >     Makefile.ebpf(need to

> >     >     > > > > >      adjust

> >     >     > > > > >      > 'linuxhdrs'),

> >     >     > > > > >      > although it's not required to build QEMU with

> >     >     eBPF support.

> >     >     > > > > >      > Added changes to virtio-net and vhost, primary

> >     >     eBPF RSS is used.

> >     >     > > > > >      > 'Software' RSS used in the case of hash

> >     >     population and as a

> >     >     > > > > >      fallback option.

> >     >     > > > > >      > For vhost, the hash population feature is not

> >     >     reported to the guest.

> >     >     > > > > >      >

> >     >     > > > > >      > Please also see the documentation in PATCH

> 6/6.

> >     >     > > > > >      >

> >     >     > > > > >      > I am sending those patches as RFC to

> >     initiate the

> >     >     discussions

> >     >     > > > > >      and get

> >     >     > > > > >      > feedback on the following points:

> >     >     > > > > >      > * Fallback when eBPF is not supported by

> >     the kernel

> >     >     > > > > >

> >     >     > > > > >

> >     >     > > > > >      Yes, and it could also a lacking of CAP_BPF.

> >     >     > > > > >

> >     >     > > > > >

> >     >     > > > > >      > * Live migration to the kernel that doesn't

> >     have

> >     >     eBPF support

> >     >     > > > > >

> >     >     > > > > >

> >     >     > > > > >      Is there anything that we needs special

> >     treatment here?

> >     >     > > > > >

> >     >     > > > > > Possible case: rss=on, vhost=on, source system with

> >     >     kernel 5.8

> >     >     > > > > > (everything works) -> dest. system 5.6 (bpf does not

> >     >     work), the adapter

> >     >     > > > > > functions, but all the steering does not use

> >     proper queues.

> >     >     > > > >

> >     >     > > > > Right, I think we need to disable vhost on dest.

> >     >     > > > >

> >     >     > > > >

> >     >     > > > > >

> >     >     > > > > >

> >     >     > > > > >      > * Integration with current QEMU build

> >     >     > > > > >

> >     >     > > > > >

> >     >     > > > > >      Yes, a question here:

> >     >     > > > > >

> >     >     > > > > >      1) Any reason for not using libbpf, e.g it

> >     has been

> >     >     shipped with some

> >     >     > > > > >      distros

> >     >     > > > > >

> >     >     > > > > >

> >     >     > > > > > We intentionally do not use libbpf, as it present

> only

> >     >     on some distros.

> >     >     > > > > > We can switch to libbpf, but this will disable bpf if

> >     >     libbpf is not

> >     >     > > > > > installed

> >     >     > > > >

> >     >     > > > > That's better I think.

> >     >     > > > >

> >     >     > > > >

> >     >     > > > > >      2) It would be better if we can avoid shipping bytecodes
> >     >     > > > > >
> >     >     > > > > >
> >     >     > > > > >
> >     >     > > > > > This creates new dependencies: llvm + clang + ...
> >     >     > > > > > We would prefer byte code and the ability to regenerate it
> >     >     > > > > > if the prerequisites are installed.
> >     >     > > > >
> >     >     > > > > It's probably OK if we treat the bytecode as a kind of firmware.
> >     >     > > > That is explicitly *not* OK for inclusion in Fedora. They require that
> >     >     > > > BPF is compiled from source, and they rejected my suggestion that it
> >     >     > > > could be considered a kind of firmware and thus have an exception from
> >     >     > > > building from source.
> >     >     > >
> >     >     > >
> >     >     > > Please refer to what was done in DPDK:
> >     >     > >
> >     >     > > http://git.dpdk.org/dpdk/tree/doc/guides/nics/tap.rst#n235
> >     >     > >
> >     >     > > I don't think what is proposed here is any different.
> >     >     >
> >     >     > I'm not convinced that what DPDK does is acceptable to Fedora either,
> >     >     > based on the responses I've received when asking about BPF handling
> >     >     > during build.  It wouldn't surprise me, however, if this was simply
> >     >     > missed by reviewers when accepting DPDK into Fedora, because it is
> >     >     > not entirely obvious unless you are looking closely.

> >     >
> >     >     FWIW, I'm pushing back against the idea that we have to compile the
> >     >     BPF code from the master source, as I think it is reasonable to have
> >     >     the program embedded as a static array in the source code, similar to
> >     >     what DPDK does.  It doesn't feel much different from other places
> >     >     where apps use generated sources and don't build them from the
> >     >     original source every time, e.g. "configure" is never re-generated
> >     >     from "configure.ac" by Fedora packagers; they just use the generated
> >     >     "configure" script as-is.
> >     >
> >     >     Regards,
> >     >     Daniel
> >     >     --
> >     >     |: https://berrange.com      -o- https://www.flickr.com/photos/dberrange :|
> >     >     |: https://libvirt.org       -o- https://fstop138.berrange.com :|
> >     >     |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
> >     >
> >
>

>