diff mbox series

net: tun: fix tun_xdp_one() for IFF_TUN mode

Message ID 03ee62602dd7b7101f78e0802249a6e2e4c10b7f.camel@infradead.org
State Superseded
Headers show
Series net: tun: fix tun_xdp_one() for IFF_TUN mode | expand

Commit Message

David Woodhouse June 19, 2021, 1:33 p.m. UTC
From: David Woodhouse <dwmw@amazon.co.uk>

In tun_get_user(), skb->protocol is either taken from the tun_pi header
or inferred from the first byte of the packet in IFF_TUN mode, while
eth_type_trans() is called only in the IFF_TAP mode where the payload
is expected to be an Ethernet frame.

The alternative path in tun_xdp_one() was unconditionally using
eth_type_trans(), which corrupts packets in IFF_TUN mode. Fix it to
do the correct thing for IFF_TUN mode, as tun_get_user() does.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")
---
How is my userspace application going to know that the kernel has this
fix? Should we add a flag to TUN_FEATURES to show that vhost-net in
*IFF_TUN* mode is supported?

 drivers/net/tun.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 43 insertions(+), 1 deletion(-)

Comments

Jason Wang June 21, 2021, 7 a.m. UTC | #1
在 2021/6/19 下午9:33, David Woodhouse 写道:
> From: David Woodhouse <dwmw@amazon.co.uk>

>

> In tun_get_user(), skb->protocol is either taken from the tun_pi header

> or inferred from the first byte of the packet in IFF_TUN mode, while

> eth_type_trans() is called only in the IFF_TAP mode where the payload

> is expected to be an Ethernet frame.

>

> The alternative path in tun_xdp_one() was unconditionally using

> eth_type_trans(), which corrupts packets in IFF_TUN mode. Fix it to

> do the correct thing for IFF_TUN mode, as tun_get_user() does.

>

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")

> ---

> How is my userspace application going to know that the kernel has this

> fix? Should we add a flag to TUN_FEATURES to show that vhost-net in

> *IFF_TUN* mode is supported?



I think it's probably too late to fix? Since it should work before 
043d222f93ab.

The only way is to backport this fix to stable.


>

>   drivers/net/tun.c | 44 +++++++++++++++++++++++++++++++++++++++++++-

>   1 file changed, 43 insertions(+), 1 deletion(-)

>

> diff --git a/drivers/net/tun.c b/drivers/net/tun.c

> index 4cf38be26dc9..f812dcdc640e 100644

> --- a/drivers/net/tun.c

> +++ b/drivers/net/tun.c

> @@ -2394,8 +2394,50 @@ static int tun_xdp_one(struct tun_struct *tun,

>   		err = -EINVAL;

>   		goto out;

>   	}

> +	switch (tun->flags & TUN_TYPE_MASK) {

> +	case IFF_TUN:

> +		if (tun->flags & IFF_NO_PI) {

> +			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;

> +

> +			switch (ip_version) {

> +			case 4:

> +				skb->protocol = htons(ETH_P_IP);

> +				break;

> +			case 6:

> +				skb->protocol = htons(ETH_P_IPV6);

> +				break;

> +			default:

> +				atomic_long_inc(&tun->dev->rx_dropped);

> +				kfree_skb(skb);

> +				err = -EINVAL;

> +				goto out;

> +			}

> +		} else {

> +			struct tun_pi *pi = (struct tun_pi *)skb->data;

> +			if (!pskb_may_pull(skb, sizeof(*pi))) {

> +				atomic_long_inc(&tun->dev->rx_dropped);

> +				kfree_skb(skb);

> +				err = -ENOMEM;

> +				goto out;

> +			}

> +			skb_pull_inline(skb, sizeof(*pi));

> +			skb->protocol = pi->proto;

> +		}

> +

> +		skb_reset_mac_header(skb);

> +		skb->dev = tun->dev;

> +		break;

> +	case IFF_TAP:

> +		if (!pskb_may_pull(skb, ETH_HLEN)) {

> +			atomic_long_inc(&tun->dev->rx_dropped);

> +			kfree_skb(skb);

> +			err = -ENOMEM;

> +			goto out;

> +		}

> +		skb->protocol = eth_type_trans(skb, tun->dev);

> +		break;



I wonder whether we can have some codes unification with tun_get_user().

Thanks


> +	}

>   

> -	skb->protocol = eth_type_trans(skb, tun->dev);

>   	skb_reset_network_header(skb);

>   	skb_probe_transport_header(skb);

>   	skb_record_rx_queue(skb, tfile->queue_index);
David Woodhouse June 21, 2021, 10:52 a.m. UTC | #2
On Mon, 2021-06-21 at 15:00 +0800, Jason Wang wrote:
> I think it's probably too late to fix? Since it should work before 

> 043d222f93ab.

> 

> The only way is to backport this fix to stable.


Yeah, I assumed the fix would be backported; if not then the "does the
kernel have it" check is fairly trivial.

I *can* avoid it for now by just using TUNSNDBUF to reduce the sndbuf
and then we never take the XDP path at all.

My initial crappy hacks are slowly turning into something that I might
actually want to commit to mainline (once I've fixed endianness and
memory ordering issues):
https://gitlab.com/openconnect/openconnect/-/compare/master...vhost

I have a couple of remaining problems using vhost-net directly from
userspace though.

Firstly, I don't think I can set IFF_VNET_HDR on the tun device after
opening it. So my model of "open the tun device, then *see* if we can
use vhost to accelerate it" doesn't work.

I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features
instead, but that gives me a weird failure mode where it drops around
half the incoming packets, and I haven't yet worked out why.

Of course I don't *actually* want a vnet header at all but the vhost
code really assumes that *someone* will add one; if I *don't* set
VHOST_NET_F_VIRTIO_NET_HDR then it always *assumes* it can read ten
bytes more from the tun socket than the 'peek' says, and barfs when it
can't. (Or such was my initial half-thought-through diagnosis before I
made it go away by setting IFF_VNET_HDR, at least).


Secondly, I need to pull numbers out of my posterior for the
VHOST_SET_MEM_TABLE call. This works for x86_64:

	vmem->nregions = 1;
	vmem->regions[0].guest_phys_addr = 4096;
	vmem->regions[0].memory_size = 0x7fffffffe000;
	vmem->regions[0].userspace_addr = 4096;
	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

Is there a way to bypass that and just unconditionally set a 1:1
mapping of *all* userspace address space? 



It's possible that one or the other of those problems will result in a
new advertised "feature" which is so simple (like a 1:1 map) that we
can call it a bugfix and backport it along with the tun fix I already
posted, and the presence of *that* can indicate that the tun bug is
fixed :)
David Woodhouse June 21, 2021, 2:50 p.m. UTC | #3
On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote:
> 

> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after

> opening it. So my model of "open the tun device, then *see* if we can

> use vhost to accelerate it" doesn't work.

> 

> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features

> instead, but that gives me a weird failure mode where it drops around

> half the incoming packets, and I haven't yet worked out why.


FWIW that problem also goes away if I set TUNSNDBUF and avoid the XDP
data path.
David Woodhouse June 21, 2021, 8:43 p.m. UTC | #4
On Mon, 2021-06-21 at 15:50 +0100, David Woodhouse wrote:
> On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote:

> > 

> > Firstly, I don't think I can set IFF_VNET_HDR on the tun device after

> > opening it. So my model of "open the tun device, then *see* if we can

> > use vhost to accelerate it" doesn't work.

> > 

> > I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features

> > instead, but that gives me a weird failure mode where it drops around

> > half the incoming packets, and I haven't yet worked out why.

> 

> FWIW that problem also goes away if I set TUNSNDBUF and avoid the XDP

> data path.


Looks like there are two problems there.

Firstly, vhost_net_build_xdp() doesn't cope well with sock_hlen being
zero. It reads those zero bytes into its buffer, then points 'gso' at
the buffer with no valid data in it, and checks gso->flags for the
NEEDS_CSUM flag.

Secondly, tun_xdp_one() doesn't cope with receiving packets without the
virtio header either. While tun_get_user() correctly checks
IFF_VNET_HDR, tun_xdp_one() does not, and treats the start of my IP
packets as if they were a virtio_net_hdr.

I'll look at turning my code into a test case for kernel selftests.
Jason Wang June 22, 2021, 4:34 a.m. UTC | #5
在 2021/6/21 下午6:52, David Woodhouse 写道:
> On Mon, 2021-06-21 at 15:00 +0800, Jason Wang wrote:

>> I think it's probably too late to fix? Since it should work before

>> 043d222f93ab.

>>

>> The only way is to backport this fix to stable.

> Yeah, I assumed the fix would be backported; if not then the "does the

> kernel have it" check is fairly trivial.

>

> I *can* avoid it for now by just using TUNSNDBUF to reduce the sndbuf

> and then we never take the XDP path at all.

>

> My initial crappy hacks are slowly turning into something that I might

> actually want to commit to mainline (once I've fixed endianness and

> memory ordering issues):

> https://gitlab.com/openconnect/openconnect/-/compare/master...vhost

>

> I have a couple of remaining problems using vhost-net directly from

> userspace though.

>

> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after

> opening it. So my model of "open the tun device, then *see* if we can

> use vhost to accelerate it" doesn't work.



Yes, IFF_VNET_HDR is set during TUN_SET_IFF which can't be changed 
afterwards.


>

> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features

> instead, but that gives me a weird failure mode where it drops around

> half the incoming packets, and I haven't yet worked out why.

>

> Of course I don't *actually* want a vnet header at all but the vhost

> code really assumes that *someone* will add one; if I *don't* set

> VHOST_NET_F_VIRTIO_NET_HDR then it always *assumes* it can read ten

> bytes more from the tun socket than the 'peek' says, and barfs when it

> can't. (Or such was my initial half-thought-through diagnosis before I

> made it go away by setting IFF_VNET_HDR, at least).



Yes, vhost always assumes there's a vnet header.


>

>

> Secondly, I need to pull numbers out of my posterior for the

> VHOST_SET_MEM_TABLE call. This works for x86_64:

>

> 	vmem->nregions = 1;

> 	vmem->regions[0].guest_phys_addr = 4096;

> 	vmem->regions[0].memory_size = 0x7fffffffe000;

> 	vmem->regions[0].userspace_addr = 4096;

> 	if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

>

> Is there a way to bypass that and just unconditionally set a 1:1

> mapping of *all* userspace address space?



Memory Table is one of the basic abstraction of the vhost. Basically, 
you only need to map the userspace buffers. This is how DPDK virtio-user 
PMD did. Vhost will validate the addresses through access_ok() during 
VHOST_SET_MEM_TABLE.

The range of all usersapce space seems architecture specific, I'm not 
sure if it's worth to bother.

Thanks


>

>

>

> It's possible that one or the other of those problems will result in a

> new advertised "feature" which is so simple (like a 1:1 map) that we

> can call it a bugfix and backport it along with the tun fix I already

> posted, and the presence of *that* can indicate that the tun bug is

> fixed :)
Jason Wang June 22, 2021, 4:34 a.m. UTC | #6
在 2021/6/21 下午10:50, David Woodhouse 写道:
> On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote:

>> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after

>> opening it. So my model of "open the tun device, then *see* if we can

>> use vhost to accelerate it" doesn't work.

>>

>> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features

>> instead, but that gives me a weird failure mode where it drops around

>> half the incoming packets, and I haven't yet worked out why.

> FWIW that problem also goes away if I set TUNSNDBUF and avoid the XDP

> data path.



That looks a workaround.

Thanks
Jason Wang June 22, 2021, 4:52 a.m. UTC | #7
在 2021/6/22 上午4:43, David Woodhouse 写道:
> On Mon, 2021-06-21 at 15:50 +0100, David Woodhouse wrote:

>> On Mon, 2021-06-21 at 11:52 +0100, David Woodhouse wrote:

>>> Firstly, I don't think I can set IFF_VNET_HDR on the tun device after

>>> opening it. So my model of "open the tun device, then *see* if we can

>>> use vhost to accelerate it" doesn't work.

>>>

>>> I tried setting VHOST_NET_F_VIRTIO_NET_HDR in the vhost features

>>> instead, but that gives me a weird failure mode where it drops around

>>> half the incoming packets, and I haven't yet worked out why.

>> FWIW that problem also goes away if I set TUNSNDBUF and avoid the XDP

>> data path.

> Looks like there are two problems there.

>

> Firstly, vhost_net_build_xdp() doesn't cope well with sock_hlen being

> zero. It reads those zero bytes into its buffer, then points 'gso' at

> the buffer with no valid data in it, and checks gso->flags for the

> NEEDS_CSUM flag.

>

> Secondly, tun_xdp_one() doesn't cope with receiving packets without the

> virtio header either. While tun_get_user() correctly checks

> IFF_VNET_HDR, tun_xdp_one() does not, and treats the start of my IP

> packets as if they were a virtio_net_hdr.

>

> I'll look at turning my code into a test case for kernel selftests.



I cook two patches. Please see and check if they fix the problem. 
(compile test only for me).

Thanks


>

>
From 5d785027da87e40138407d76841f5cdd64914541 Mon Sep 17 00:00:00 2001
From: Jason Wang <jasowang@redhat.com>
Date: Tue, 22 Jun 2021 12:07:59 +0800
Subject: [PATCH 1/2] vhost_net: validate gso metadata only if socket has vnet
 header

When sock_hlen is zero, there's no need to validate the socket vnet
header since it doesn't support that.

Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index df82b124170e..5034c4949bc4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -725,7 +725,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
 	hdr = buf;
 	gso = &hdr->gso;
 
-	if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
+	if (nvq->sock_hlen && (gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) &&
 	    vhost16_to_cpu(vq, gso->csum_start) +
 	    vhost16_to_cpu(vq, gso->csum_offset) + 2 >
 	    vhost16_to_cpu(vq, gso->hdr_len)) {
David Woodhouse June 22, 2021, 7:24 a.m. UTC | #8
On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote:
> 

> 

> I cook two patches. Please see and check if they fix the problem. 

> (compile test only for me).


I did the second one slightly differently (below) but those are what I
came up with too, which seems to be working.

@@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun,
 {
        unsigned int datasize = xdp->data_end - xdp->data;
        struct tun_xdp_hdr *hdr = xdp->data_hard_start;
-       struct virtio_net_hdr *gso = &hdr->gso;
+       struct virtio_net_hdr *gso = NULL;
        struct bpf_prog *xdp_prog;
        struct sk_buff *skb = NULL;
        u32 rxhash = 0, act;
@@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun,
        bool skb_xdp = false;
        struct page *page;
 
+       if (tun->flags & IFF_VNET_HDR)
+               gso = &hdr->gso;
+
        xdp_prog = rcu_dereference(tun->xdp_prog);
        if (xdp_prog) {
-               if (gso->gso_type) {
+               if (gso && gso->gso_type) {
                        skb_xdp = true;
                        goto build;
                }
@@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun,
        skb_reserve(skb, xdp->data - xdp->data_hard_start);
        skb_put(skb, xdp->data_end - xdp->data);
 
-       if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
+       if (!gso)
+               skb_reset_mac_header(skb);
+       else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {
                atomic_long_inc(&tun->rx_frame_errors);
                kfree_skb(skb);
David Woodhouse June 22, 2021, 7:28 a.m. UTC | #9
On Tue, 2021-06-22 at 12:34 +0800, Jason Wang wrote:
> > 

> > Secondly, I need to pull numbers out of my posterior for the

> > VHOST_SET_MEM_TABLE call. This works for x86_64:

> > 

> >        vmem->nregions = 1;

> >        vmem->regions[0].guest_phys_addr = 4096;

> >        vmem->regions[0].memory_size = 0x7fffffffe000;

> >        vmem->regions[0].userspace_addr = 4096;

> >        if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

> > 

> > Is there a way to bypass that and just unconditionally set a 1:1

> > mapping of *all* userspace address space?

> 

> 

> Memory Table is one of the basic abstraction of the vhost. Basically, 

> you only need to map the userspace buffers. This is how DPDK virtio-user 

> PMD did. Vhost will validate the addresses through access_ok() during 

> VHOST_SET_MEM_TABLE.

> 

> The range of all usersapce space seems architecture specific, I'm not 

> sure if it's worth to bother.


The buffers are just malloc'd. I just need a full 1:1 mapping of all
"guest" memory to userspace addresses, and was trying to avoid having
to map them on demand *just* because I don't know the full range of
possible addresses that malloc will return, in advance.

I'm tempted to add a new feature for that 1:1 access, with no ->umem or
->iotlb at all. And then I can use it as a key to know that the XDP
bugs are fixed too :)
Jason Wang June 22, 2021, 7:51 a.m. UTC | #10
在 2021/6/22 下午3:24, David Woodhouse 写道:
> On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote:

>>

>> I cook two patches. Please see and check if they fix the problem.

>> (compile test only for me).

> I did the second one slightly differently (below) but those are what I

> came up with too, which seems to be working.

>

> @@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun,

>   {

>          unsigned int datasize = xdp->data_end - xdp->data;

>          struct tun_xdp_hdr *hdr = xdp->data_hard_start;

> -       struct virtio_net_hdr *gso = &hdr->gso;

> +       struct virtio_net_hdr *gso = NULL;

>          struct bpf_prog *xdp_prog;

>          struct sk_buff *skb = NULL;

>          u32 rxhash = 0, act;

> @@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun,

>          bool skb_xdp = false;

>          struct page *page;

>   

> +       if (tun->flags & IFF_VNET_HDR)

> +               gso = &hdr->gso;

> +

>          xdp_prog = rcu_dereference(tun->xdp_prog);

>          if (xdp_prog) {

> -               if (gso->gso_type) {

> +               if (gso && gso->gso_type) {

>                          skb_xdp = true;

>                          goto build;

>                  }

> @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun,

>          skb_reserve(skb, xdp->data - xdp->data_hard_start);

>          skb_put(skb, xdp->data_end - xdp->data);

>   

> -       if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {

> +       if (!gso)

> +               skb_reset_mac_header(skb);

> +       else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {

>                  atomic_long_inc(&tun->rx_frame_errors);

>                  kfree_skb(skb);



This should work as well.

Thanks
Jason Wang June 22, 2021, 8 a.m. UTC | #11
在 2021/6/22 下午3:28, David Woodhouse 写道:
> On Tue, 2021-06-22 at 12:34 +0800, Jason Wang wrote:

>>> Secondly, I need to pull numbers out of my posterior for the

>>> VHOST_SET_MEM_TABLE call. This works for x86_64:

>>>

>>>         vmem->nregions = 1;

>>>         vmem->regions[0].guest_phys_addr = 4096;

>>>         vmem->regions[0].memory_size = 0x7fffffffe000;

>>>         vmem->regions[0].userspace_addr = 4096;

>>>         if (ioctl(vpninfo->vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) {

>>>

>>> Is there a way to bypass that and just unconditionally set a 1:1

>>> mapping of *all* userspace address space?

>>

>> Memory Table is one of the basic abstraction of the vhost. Basically,

>> you only need to map the userspace buffers. This is how DPDK virtio-user

>> PMD did. Vhost will validate the addresses through access_ok() during

>> VHOST_SET_MEM_TABLE.

>>

>> The range of all usersapce space seems architecture specific, I'm not

>> sure if it's worth to bother.

> The buffers are just malloc'd. I just need a full 1:1 mapping of all

> "guest" memory to userspace addresses, and was trying to avoid having

> to map them on demand *just* because I don't know the full range of

> possible addresses that malloc will return, in advance.

>

> I'm tempted to add a new feature for that 1:1 access, with no ->umem or

> ->iotlb at all. And then I can use it as a key to know that the XDP

> bugs are fixed too :)



This means we need validate the userspace address each time before vhost 
tries to use that. This will de-gradate the performance. So we still 
need to figure out the legal userspace address range which might not be 
easy.

Thanks


>
David Woodhouse June 22, 2021, 8:10 a.m. UTC | #12
On 22 June 2021 08:51:43 BST, Jason Wang <jasowang@redhat.com> wrote:
>

>在 2021/6/22 下午3:24, David Woodhouse 写道:

>> On Tue, 2021-06-22 at 12:52 +0800, Jason Wang wrote:

>>>

>>> I cook two patches. Please see and check if they fix the problem.

>>> (compile test only for me).

>> I did the second one slightly differently (below) but those are what

>I

>> came up with too, which seems to be working.

>>

>> @@ -2331,7 +2344,7 @@ static int tun_xdp_one(struct tun_struct *tun,

>>   {

>>          unsigned int datasize = xdp->data_end - xdp->data;

>>          struct tun_xdp_hdr *hdr = xdp->data_hard_start;

>> -       struct virtio_net_hdr *gso = &hdr->gso;

>> +       struct virtio_net_hdr *gso = NULL;

>>          struct bpf_prog *xdp_prog;

>>          struct sk_buff *skb = NULL;

>>          u32 rxhash = 0, act;

>> @@ -2340,9 +2353,12 @@ static int tun_xdp_one(struct tun_struct *tun,

>>          bool skb_xdp = false;

>>          struct page *page;

>>   

>> +       if (tun->flags & IFF_VNET_HDR)

>> +               gso = &hdr->gso;

>> +

>>          xdp_prog = rcu_dereference(tun->xdp_prog);

>>          if (xdp_prog) {

>> -               if (gso->gso_type) {

>> +               if (gso && gso->gso_type) {

>>                          skb_xdp = true;

>>                          goto build;

>>                  }

>> @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct

>*tun,

>>          skb_reserve(skb, xdp->data - xdp->data_hard_start);

>>          skb_put(skb, xdp->data_end - xdp->data);

>>   

>> -       if (virtio_net_hdr_to_skb(skb, gso,

>tun_is_little_endian(tun))) {

>> +       if (!gso)

>> +               skb_reset_mac_header(skb);

>> +       else if (virtio_net_hdr_to_skb(skb, gso,

>tun_is_little_endian(tun))) {

>>                  atomic_long_inc(&tun->rx_frame_errors);

>>                  kfree_skb(skb);

>

>

>This should work as well.


I'll rip out the rest of my debugging hacks and check that these two are sufficient, and that I hadn't accidentally papered over something else as I debugged it.

Then I'll look at the case of having no virtio_net_hdr on either side, and also different sized headers on the tun device (why does it even support that?).

And the test case, of course. 

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.
David Woodhouse June 22, 2021, 8:29 a.m. UTC | #13
On Tue, 2021-06-22 at 16:00 +0800, Jason Wang wrote:
> > I'm tempted to add a new feature for that 1:1 access, with no ->umem or

> > ->iotlb at all. And then I can use it as a key to know that the XDP

> > bugs are fixed too :)

> 

> 

> This means we need validate the userspace address each time before vhost 

> tries to use that. This will de-gradate the performance. So we still 

> need to figure out the legal userspace address range which might not be 

> easy.


Easier from the kernel than from userspace though :)

But I don't know that we need it. Isn't a call to access_ok() going to
be faster than what translate_desc() does to look things up anyway?

In the 1:1 mode, the access_ok() is all that's needed since there's no
translation.

@@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
        u64 s = 0;
        int ret = 0;
 
+       if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) {
+               if (!access_ok((void __user *)addr, len))
+                       return -EFAULT;
+
+               iov[0].iov_len = len;
+               iov[0].iov_base = (void __user *)addr;
+               return 1;
+       }
        while ((u64)len > s) {
                u64 size;
                if (unlikely(ret >= iov_size)) {
David Woodhouse June 22, 2021, 11:36 a.m. UTC | #14
On Tue, 2021-06-22 at 15:51 +0800, Jason Wang wrote:
> > @@ -2388,14 +2406,18 @@ static int tun_xdp_one(struct tun_struct *tun,

> >           skb_reserve(skb, xdp->data - xdp->data_hard_start);

> >           skb_put(skb, xdp->data_end - xdp->data);

> >    

> > -       if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {

> > +       if (!gso)

> > +               skb_reset_mac_header(skb);

> > +       else if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {

> >                   atomic_long_inc(&tun->rx_frame_errors);

> >                   kfree_skb(skb);

> 

> 

> This should work as well.


Actually there's no need for the skb_reset_mac_header() as that's going
to happen anyway just a few lines down — either from eth_type_trans()
or explicitly in the IFF_TUN code path that I added in my first patch.

I also stripped out a lot more code from vhost_net_build_xdp() in the
!sock_hlen case where it was pointless.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/vhost-net

I'll repost the whole series after I take look at whether I can get it
to work without *either* side doing the vnet header, and add a test
case based on my userspace code in 
https://gitlab.com/openconnect/openconnect/-/blob/vhost/vhost.c
Jason Wang June 23, 2021, 3:39 a.m. UTC | #15
在 2021/6/22 下午4:29, David Woodhouse 写道:
> On Tue, 2021-06-22 at 16:00 +0800, Jason Wang wrote:

>>> I'm tempted to add a new feature for that 1:1 access, with no ->umem or

>>> ->iotlb at all. And then I can use it as a key to know that the XDP

>>> bugs are fixed too :)

>>

>> This means we need validate the userspace address each time before vhost

>> tries to use that. This will de-gradate the performance. So we still

>> need to figure out the legal userspace address range which might not be

>> easy.

> Easier from the kernel than from userspace though :)



Yes.


>

> But I don't know that we need it. Isn't a call to access_ok() going to

> be faster than what translate_desc() does to look things up anyway?



Right.


>

> In the 1:1 mode, the access_ok() is all that's needed since there's no

> translation.

>

> @@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,

>          u64 s = 0;

>          int ret = 0;

>   

> +       if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) {



Using vhost_has_feature() is kind of tricky since it's used for virtio 
feature negotiation.

We probably need to use backend_features instead.

I think we should probably do more:

1) forbid the feature to be set when mem table / IOTLB has at least one 
mapping
2) forbid the mem table / IOTLB updating after the feature is set

Thanks


> +               if (!access_ok((void __user *)addr, len))

> +                       return -EFAULT;

> +

> +               iov[0].iov_len = len;

> +               iov[0].iov_base = (void __user *)addr;

> +               return 1;

> +       }

>          while ((u64)len > s) {

>                  u64 size;

>                  if (unlikely(ret >= iov_size)) {
David Woodhouse June 24, 2021, 12:39 p.m. UTC | #16
On Wed, 2021-06-23 at 11:39 +0800, Jason Wang wrote:
> > 

> > In the 1:1 mode, the access_ok() is all that's needed since there's no

> > translation.

> > 

> > @@ -2038,6 +2065,14 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,

> >           u64 s = 0;

> >           int ret = 0;

> >    

> > +       if (vhost_has_feature(vq, VHOST_F_IDENTITY_MAPPING)) {

> 

> 

> Using vhost_has_feature() is kind of tricky since it's used for virtio 

> feature negotiation.

> 

> We probably need to use backend_features instead.

> 

> I think we should probably do more:

> 

> 1) forbid the feature to be set when mem table / IOTLB has at least one 

> mapping

> 2) forbid the mem table / IOTLB updating after the feature is set


Yes, that all makes sense. I confess I hadn't actually *implemented*
the feature at all; the only time I'd typed 'VHOST_F_IDENTITY_MAPPING'
was to show that snippet of "patch" as an example of what
translate_desc() would do. I just wanted *something* to put in the
'if()' statement as a placeholder, so I used that.

It could *even* have just been 'if (!umem)' but that might let it
happen by *accident* which probably isn't a good idea.
Jason Wang June 25, 2021, 5 a.m. UTC | #17
在 2021/6/24 下午8:30, David Woodhouse 写道:
> From: David Woodhouse <dwmw@amazon.co.uk>

>

> The vhost-net driver was making wild assumptions about the header length

> of the underlying tun/tap socket.



It's by design to depend on the userspace to co-ordinate the vnet header 
setting with the underlying sockets.


>   Then it was discarding packets if

> the number of bytes it got from sock_recvmsg() didn't precisely match

> its guess.



Anything that is broken by this? The failure is a hint for the userspace 
that something is wrong during the coordination.


>

> Fix it to get the correct information along with the socket itself.



I'm not sure what is fixed by this. It looks to me it tires to let 
packet go even if the userspace set the wrong attributes to tap or 
vhost. This is even sub-optimal than failing explicitly fail the RX.


> As a side-effect, this means that tun_get_socket() won't work if the

> tun file isn't actually connected to a device, since there's no 'tun'

> yet in that case to get the information from.



This may break the existing application. Vhost-net is tied to the socket 
instead of the device that the socket is loosely coupled.


>

> On the receive side, where the tun device generates the virtio_net_hdr

> but VIRITO_NET_F_MSG_RXBUF was negotiated and vhost-net needs to fill

> in the 'num_buffers' field on top of the existing virtio_net_hdr, fix

> that to use 'sock_hlen - 2' as the location, which means that it goes

> in the right place regardless of whether the tun device is using an

> additional tun_pi header or not. In this case, the user should have

> configured the tun device with a vnet hdr size of 12, to make room.

>

> Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support")

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>

> ---

>   drivers/net/tap.c      |  5 ++++-

>   drivers/net/tun.c      | 16 +++++++++++++++-

>   drivers/vhost/net.c    | 31 +++++++++++++++----------------

>   include/linux/if_tap.h |  4 ++--

>   include/linux/if_tun.h |  4 ++--

>   5 files changed, 38 insertions(+), 22 deletions(-)

>

> diff --git a/drivers/net/tap.c b/drivers/net/tap.c

> index 8e3a28ba6b28..2170a0d3d34c 100644

> --- a/drivers/net/tap.c

> +++ b/drivers/net/tap.c

> @@ -1246,7 +1246,7 @@ static const struct proto_ops tap_socket_ops = {

>    * attached to a device.  The returned object works like a packet socket, it

>    * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for

>    * holding a reference to the file for as long as the socket is in use. */

> -struct socket *tap_get_socket(struct file *file)

> +struct socket *tap_get_socket(struct file *file, size_t *hlen)

>   {

>   	struct tap_queue *q;

>   	if (file->f_op != &tap_fops)

> @@ -1254,6 +1254,9 @@ struct socket *tap_get_socket(struct file *file)

>   	q = file->private_data;

>   	if (!q)

>   		return ERR_PTR(-EBADFD);

> +	if (hlen)

> +		*hlen = (q->flags & IFF_VNET_HDR) ? q->vnet_hdr_sz : 0;

> +

>   	return &q->sock;

>   }

>   EXPORT_SYMBOL_GPL(tap_get_socket);

> diff --git a/drivers/net/tun.c b/drivers/net/tun.c

> index 4cf38be26dc9..67b406fa0881 100644

> --- a/drivers/net/tun.c

> +++ b/drivers/net/tun.c

> @@ -3649,7 +3649,7 @@ static void tun_cleanup(void)

>    * attached to a device.  The returned object works like a packet socket, it

>    * can be used for sock_sendmsg/sock_recvmsg.  The caller is responsible for

>    * holding a reference to the file for as long as the socket is in use. */

> -struct socket *tun_get_socket(struct file *file)

> +struct socket *tun_get_socket(struct file *file, size_t *hlen)

>   {

>   	struct tun_file *tfile;

>   	if (file->f_op != &tun_fops)

> @@ -3657,6 +3657,20 @@ struct socket *tun_get_socket(struct file *file)

>   	tfile = file->private_data;

>   	if (!tfile)

>   		return ERR_PTR(-EBADFD);

> +

> +	if (hlen) {

> +		struct tun_struct *tun = tun_get(tfile);

> +		size_t len = 0;

> +

> +		if (!tun)

> +			return ERR_PTR(-ENOTCONN);

> +		if (tun->flags & IFF_VNET_HDR)

> +			len += READ_ONCE(tun->vnet_hdr_sz);

> +		if (!(tun->flags & IFF_NO_PI))

> +			len += sizeof(struct tun_pi);

> +		tun_put(tun);

> +		*hlen = len;

> +	}

>   	return &tfile->socket;

>   }

>   EXPORT_SYMBOL_GPL(tun_get_socket);

> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c

> index df82b124170e..b92a7144ed90 100644

> --- a/drivers/vhost/net.c

> +++ b/drivers/vhost/net.c

> @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)

>   

>   	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?

>   		vq->log : NULL;

> -	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);

> +	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&

> +		(vhost_hlen || sock_hlen >= sizeof(num_buffers));



So this change the behavior. When mergeable buffer is enabled, userspace 
expects the vhost to merge buffers. If the feature is disabled silently, 
it violates virtio spec.

If anything wrong in the setup, userspace just breaks itself.

E.g if sock_hlen is less that struct virtio_net_hdr_mrg_buf. The packet 
header might be overwrote by the vnet header.


>   

>   	do {

>   		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,

> @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)

>   			}

>   		} else {

>   			/* Header came from socket; we'll need to patch

> -			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF

> +			 * ->num_buffers over the last two bytes if

> +			 * VIRTIO_NET_F_MRG_RXBUF is enabled.

>   			 */

> -			iov_iter_advance(&fixup, sizeof(hdr));

> +			iov_iter_advance(&fixup, sock_hlen - 2);



I'm not sure what did the above code want to fix. It doesn't change 
anything if vnet header is set correctly in TUN. It only prevents the 
the packet header from being rewrote.

Thanks
Jason Wang June 25, 2021, 6:58 a.m. UTC | #18
在 2021/6/24 下午8:30, David Woodhouse 写道:
> From: David Woodhouse <dwmw@amazon.co.uk>

>

> Sometimes it's just a data packet. The virtio_net_hdr processing should be

> conditional on IFF_VNET_HDR, just as it is in tun_get_user().

>

> Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()")

> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>



Acked-by: Jason Wang <jasowang@redhat.com>



> ---

>   drivers/net/tun.c | 9 ++++++---

>   1 file changed, 6 insertions(+), 3 deletions(-)

>

> diff --git a/drivers/net/tun.c b/drivers/net/tun.c

> index 67b406fa0881..9acd448e6dfc 100644

> --- a/drivers/net/tun.c

> +++ b/drivers/net/tun.c

> @@ -2331,7 +2331,7 @@ static int tun_xdp_one(struct tun_struct *tun,

>   {

>   	unsigned int datasize = xdp->data_end - xdp->data;

>   	struct tun_xdp_hdr *hdr = xdp->data_hard_start;

> -	struct virtio_net_hdr *gso = &hdr->gso;

> +	struct virtio_net_hdr *gso = NULL;

>   	struct bpf_prog *xdp_prog;

>   	struct sk_buff *skb = NULL;

>   	u32 rxhash = 0, act;

> @@ -2340,9 +2340,12 @@ static int tun_xdp_one(struct tun_struct *tun,

>   	bool skb_xdp = false;

>   	struct page *page;

>   

> +	if (tun->flags & IFF_VNET_HDR)

> +		gso = &hdr->gso;

> +

>   	xdp_prog = rcu_dereference(tun->xdp_prog);

>   	if (xdp_prog) {

> -		if (gso->gso_type) {

> +		if (gso && gso->gso_type) {

>   			skb_xdp = true;

>   			goto build;

>   		}

> @@ -2388,7 +2391,7 @@ static int tun_xdp_one(struct tun_struct *tun,

>   	skb_reserve(skb, xdp->data - xdp->data_hard_start);

>   	skb_put(skb, xdp->data_end - xdp->data);

>   

> -	if (virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {

> +	if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) {

>   		atomic_long_inc(&tun->rx_frame_errors);

>   		kfree_skb(skb);

>   		err = -EINVAL;
David Woodhouse June 25, 2021, 8:23 a.m. UTC | #19
On Fri, 2021-06-25 at 13:00 +0800, Jason Wang wrote:
> 在 2021/6/24 下午8:30, David Woodhouse 写道:

> > From: David Woodhouse <dwmw@amazon.co.uk>

> > 

> > The vhost-net driver was making wild assumptions about the header length

> > of the underlying tun/tap socket.

> 

> 

> It's by design to depend on the userspace to co-ordinate the vnet header 

> setting with the underlying sockets.

> 

> 

> >   Then it was discarding packets if

> > the number of bytes it got from sock_recvmsg() didn't precisely match

> > its guess.

> 

> 

> Anything that is broken by this? The failure is a hint for the userspace 

> that something is wrong during the coordination.


I am not a fan of this approach. I firmly believe that for a given
configuration, the kernel should either *work* or it should gracefully
refuse to set it up that way. And the requirements should be clearly
documented.

Having been on the receiving end of this "hint" of which you speak, I
found it distinctly suboptimal as a user interface. I was left
scrabbling around trying to find a set of options which *would* work,
and it was only through debugging the kernel that I managed to work out
that I:

  • MUST set IFF_NO_PI
  • MUST use TUNSETSNDBUF to reduce the sndbuf from INT_MAX
  • MUST use a virtio_net_hdr that I don't want

If my application failed to do any of those things, I got a silent
failure to transport any packets. The only thing I could do *without*
debugging the kernel was tcpdump on the 'tun0' interface and see if the
TX packets I put into the ring were even making it to the interface,
and what they looked like if they did. (Losing the first 14 bytes and
having the *next* 14 bytes scribbled on by an Ethernet header was a fun
one.)





> > 

> > Fix it to get the correct information along with the socket itself.

> 

> 

> I'm not sure what is fixed by this. It looks to me it tires to let 

> packet go even if the userspace set the wrong attributes to tap or 

> vhost. This is even sub-optimal than failing explicitly fail the RX.


I'm OK with explicit failure. But once I'd let it *get* the information
from the underlying socket in order to decide whether it should fail or
not, it turned out to be easy enough just to make those configs work
anyway.

The main case where that "easy enough" is stretched a little (IMO) was
when there's a tun_pi header. I have one more of your emails to reply
to after this, and I'll address that there.


> 

> > As a side-effect, this means that tun_get_socket() won't work if the

> > tun file isn't actually connected to a device, since there's no 'tun'

> > yet in that case to get the information from.

> 

> 

> This may break the existing application. Vhost-net is tied to the socket 

> instead of the device that the socket is loosely coupled.


Hm. Perhaps the PI and vnet hdr should be considered an option of the
*socket* (which is tied to the tfile), not purely an option of the
underlying device?

Or maybe it's sufficient just to get the flags from *either* tfile->tun 
or tfile->detached, so that it works when the queue is detached. I'll
take a look.

I suppose we could even have a fallback that makes stuff up like we do
today. If the user attempts to attach a tun file descriptor to vhost
without ever calling TUNSETIFF on it first, *then* we make the same
assumptions we do today?

> > --- a/drivers/vhost/net.c

> > +++ b/drivers/vhost/net.c

> > @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)

> >   

> >   	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?

> >   		vq->log : NULL;

> > -	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);

> > +	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&

> > +		(vhost_hlen || sock_hlen >= sizeof(num_buffers));

> 

> 

> So this change the behavior. When mergeable buffer is enabled, userspace 

> expects the vhost to merge buffers. If the feature is disabled silently, 

> it violates virtio spec.

> 

> If anything wrong in the setup, userspace just breaks itself.

> 

> E.g if sock_hlen is less that struct virtio_net_hdr_mrg_buf. The packet 

> header might be overwrote by the vnet header.


This wasn't intended to change the behaviour of any code path that is
already working today. If *either* vhost or the underlying device have
provided a vnet header, we still merge.

If *neither* provide a vnet hdr, there's nowhere to put num_buffers and
we can't merge.

That code path doesn't work at all today, but does after my patches.
But you're right, we should explicitly refuse to negotiate
VIRITO_NET_F_MSG_RXBUF in that case.

> 

> >   

> >   	do {

> >   		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,

> > @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)

> >   			}

> >   		} else {

> >   			/* Header came from socket; we'll need to patch

> > -			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF

> > +			 * ->num_buffers over the last two bytes if

> > +			 * VIRTIO_NET_F_MRG_RXBUF is enabled.

> >   			 */

> > -			iov_iter_advance(&fixup, sizeof(hdr));

> > +			iov_iter_advance(&fixup, sock_hlen - 2);

> 

> 

> I'm not sure what did the above code want to fix. It doesn't change 

> anything if vnet header is set correctly in TUN. It only prevents the 

> the packet header from being rewrote.

> 


It fixes the case where the virtio_net_hdr isn't at the start of the
tun header, because the tun actually puts the tun_pi struct *first*,
and *then* the virtio_net_hdr. 

The num_buffers field needs to go at the *end* of sock_hlen. Not at a
fixed offset from the *start* of it.

At least, that's true unless we want to just declare that we *only*
support TUN with the IFF_NO_PI flag. (qv).
Willem de Bruijn June 25, 2021, 6:13 p.m. UTC | #20
On Thu, Jun 24, 2021 at 8:30 AM David Woodhouse <dwmw2@infradead.org> wrote:
>

> From: David Woodhouse <dwmw@amazon.co.uk>

>

> The vhost-net driver was making wild assumptions about the header length


If respinning, please more concretely describe which configuration is
currently broken.

IFF_NO_PI + IFF_VNET_HDR, if I understand correctly. But I got that
from the discussion, not the commit message.

> of the underlying tun/tap socket. Then it was discarding packets if

> the number of bytes it got from sock_recvmsg() didn't precisely match

> its guess.

>

> Fix it to get the correct information along with the socket itself.

> As a side-effect, this means that tun_get_socket() won't work if the

> tun file isn't actually connected to a device, since there's no 'tun'

> yet in that case to get the information from.

>

> On the receive side, where the tun device generates the virtio_net_hdr

> but VIRITO_NET_F_MSG_RXBUF was negotiated and vhost-net needs to fill


Nit: VIRTIO_NET_F_MSG_RXBUF

> in the 'num_buffers' field on top of the existing virtio_net_hdr, fix

> that to use 'sock_hlen - 2' as the location, which means that it goes


Please use sizeof(hdr.num_buffers) instead of a raw constant 2, to
self document the code.

Should this be an independent one-line fix?
David Woodhouse June 25, 2021, 6:55 p.m. UTC | #21
On Fri, 2021-06-25 at 14:13 -0400, Willem de Bruijn wrote:
> On Thu, Jun 24, 2021 at 8:30 AM David Woodhouse <dwmw2@infradead.org>

> wrote:

> > 

> > From: David Woodhouse <dwmw@amazon.co.uk>

> > 

> > The vhost-net driver was making wild assumptions about the header

> > length

> 

> If respinning, please more concretely describe which configuration is

> currently broken.


Fairly much all of them. Here's a test run on the 5.12.8 kernel:

$ sudo ./test_vhost_net 
TEST: (hdr 0, xdp 0, pi 0, features 0) RESULT: -1
TEST: (hdr 10, xdp 0, pi 0, features 0) RESULT: 0
TEST: (hdr 12, xdp 0, pi 0, features 0) RESULT: -1
TEST: (hdr 20, xdp 0, pi 0, features 0) RESULT: -1
TEST: (hdr 0, xdp 1, pi 0, features 0) RESULT: -1
TEST: (hdr 10, xdp 1, pi 0, features 0) RESULT: -1
TEST: (hdr 12, xdp 1, pi 0, features 0) RESULT: -1
TEST: (hdr 20, xdp 1, pi 0, features 0) RESULT: -1
TEST: (hdr 0, xdp 0, pi 1, features 0) RESULT: -1
TEST: (hdr 10, xdp 0, pi 1, features 0) RESULT: -1
TEST: (hdr 12, xdp 0, pi 1, features 0) RESULT: -1
TEST: (hdr 20, xdp 0, pi 1, features 0) RESULT: -1
TEST: (hdr 0, xdp 1, pi 1, features 0) RESULT: -1
TEST: (hdr 10, xdp 1, pi 1, features 0) RESULT: -1
TEST: (hdr 12, xdp 1, pi 1, features 0) RESULT: -1
TEST: (hdr 20, xdp 1, pi 1, features 0) RESULT: -1
TEST: (hdr 0, xdp 0, pi 0, features 100000000) RESULT: -1
TEST: (hdr 10, xdp 0, pi 0, features 100000000) RESULT: -1
TEST: (hdr 12, xdp 0, pi 0, features 100000000) RESULT: 0
TEST: (hdr 20, xdp 0, pi 0, features 100000000) RESULT: -1
TEST: (hdr 0, xdp 1, pi 0, features 100000000) RESULT: -1
TEST: (hdr 10, xdp 1, pi 0, features 100000000) RESULT: -1
TEST: (hdr 12, xdp 1, pi 0, features 100000000) RESULT: -1
TEST: (hdr 20, xdp 1, pi 0, features 100000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 1, features 100000000) RESULT: -1
TEST: (hdr 10, xdp 0, pi 1, features 100000000) RESULT: -1
TEST: (hdr 12, xdp 0, pi 1, features 100000000) RESULT: -1
TEST: (hdr 20, xdp 0, pi 1, features 100000000) RESULT: -1
TEST: (hdr 0, xdp 1, pi 1, features 100000000) RESULT: -1
TEST: (hdr 10, xdp 1, pi 1, features 100000000) RESULT: -1
TEST: (hdr 12, xdp 1, pi 1, features 100000000) RESULT: -1
TEST: (hdr 20, xdp 1, pi 1, features 100000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 0, features 8000000) RESULT: 0
TEST: (hdr 0, xdp 1, pi 0, features 8000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 1, features 8000000) RESULT: -1
TEST: (hdr 0, xdp 1, pi 1, features 8000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 0, features 108000000) RESULT: 0
TEST: (hdr 0, xdp 1, pi 0, features 108000000) RESULT: -1
TEST: (hdr 0, xdp 0, pi 1, features 108000000) RESULT: -1
TEST: (hdr 0, xdp 1, pi 1, features 108000000) RESULT: -1

> IFF_NO_PI + IFF_VNET_HDR, if I understand correctly. 


That's fairly much the only one that *did* work. As long as you use
TUNSETSNDBUF which has the undocumented side-effect of turning off the
XDP path.

> > On the receive side, where the tun device generates the virtio_net_hdr

> > but VIRITO_NET_F_MSG_RXBUF was negotiated and vhost-net needs to fill

> 

> Nit: VIRTIO_NET_F_MSG_RXBUF


Thanks.

> > in the 'num_buffers' field on top of the existing virtio_net_hdr, fix

> > that to use 'sock_hlen - 2' as the location, which means that it goes

> 

> Please use sizeof(hdr.num_buffers) instead of a raw constant 2, to

> self document the code.


Makes sense.

> Should this be an independent one-line fix?


I don't think so; it's very much intertwined with the way it makes
assumptions about someone else's data.
Jason Wang June 28, 2021, 4:22 a.m. UTC | #22
在 2021/6/25 下午4:23, David Woodhouse 写道:
> On Fri, 2021-06-25 at 13:00 +0800, Jason Wang wrote:

>> 在 2021/6/24 下午8:30, David Woodhouse 写道:

>>> From: David Woodhouse <dwmw@amazon.co.uk>

>>>

>>> The vhost-net driver was making wild assumptions about the header length

>>> of the underlying tun/tap socket.

>>

>> It's by design to depend on the userspace to co-ordinate the vnet header

>> setting with the underlying sockets.

>>

>>

>>>    Then it was discarding packets if

>>> the number of bytes it got from sock_recvmsg() didn't precisely match

>>> its guess.

>>

>> Anything that is broken by this? The failure is a hint for the userspace

>> that something is wrong during the coordination.

> I am not a fan of this approach. I firmly believe that for a given

> configuration, the kernel should either *work* or it should gracefully

> refuse to set it up that way. And the requirements should be clearly

> documented.



That works only if all the logic were implemented in the same module but 
not the case in the e.g networking stack that a packet need to iterate 
several modules.

E.g in this case, the vnet header size of the TAP could be changed at 
anytime via TUNSETVNETHDRSZ, and tuntap is unaware of the existence of 
vhost_net. This makes it impossible to do refuse in the case of setup 
(SET_BACKEND).


>

> Having been on the receiving end of this "hint" of which you speak, I

> found it distinctly suboptimal as a user interface. I was left

> scrabbling around trying to find a set of options which *would* work,

> and it was only through debugging the kernel that I managed to work out

> that I:

>

>    • MUST set IFF_NO_PI

>    • MUST use TUNSETSNDBUF to reduce the sndbuf from INT_MAX

>    • MUST use a virtio_net_hdr that I don't want

>

> If my application failed to do any of those things, I got a silent

> failure to transport any packets.



Yes, this is because the bug when using vhost_net + PI/TUN. And I guess 
the reason is that nobody tries to use that combination in the past.

I'm not even sure if it's a valid setup since vhost-net is a virtio-net 
kernel server which is not expected to handle L3 packets or PI header 
(which is Linux specific and out of the scope virtio spec).


>   The only thing I could do *without*

> debugging the kernel was tcpdump on the 'tun0' interface and see if the

> TX packets I put into the ring were even making it to the interface,

> and what they looked like if they did. (Losing the first 14 bytes and

> having the *next* 14 bytes scribbled on by an Ethernet header was a fun

> one.)



The tricky part is that, the networking stack thinks the packet is 
successfully received but it was actually dropped by vhost-net.

And there's no obvious userspace API to report such dropping as 
statistics counters or trace-points. Maybe we can tweak the vhost for a 
better logging in this case.


>

>

>

>

>

>>> Fix it to get the correct information along with the socket itself.

>>

>> I'm not sure what is fixed by this. It looks to me it tires to let

>> packet go even if the userspace set the wrong attributes to tap or

>> vhost. This is even sub-optimal than failing explicitly fail the RX.

> I'm OK with explicit failure. But once I'd let it *get* the information

> from the underlying socket in order to decide whether it should fail or

> not, it turned out to be easy enough just to make those configs work

> anyway.



The problem is that this change may make some wrong configuration 
"works" silently at the level of vhost or TAP. When using this for VM, 
it would make the debugging even harder.


>

> The main case where that "easy enough" is stretched a little (IMO) was

> when there's a tun_pi header. I have one more of your emails to reply

> to after this, and I'll address that there.

>

>

>>> As a side-effect, this means that tun_get_socket() won't work if the

>>> tun file isn't actually connected to a device, since there's no 'tun'

>>> yet in that case to get the information from.

>>

>> This may break the existing application. Vhost-net is tied to the socket

>> instead of the device that the socket is loosely coupled.

> Hm. Perhaps the PI and vnet hdr should be considered an option of the

> *socket* (which is tied to the tfile), not purely an option of the

> underlying device?



Though this is how it is done in macvtap. It's probably too late to 
change tuntap.


>

> Or maybe it's sufficient just to get the flags from *either* tfile->tun

> or tfile->detached, so that it works when the queue is detached. I'll

> take a look.

>

> I suppose we could even have a fallback that makes stuff up like we do

> today. If the user attempts to attach a tun file descriptor to vhost

> without ever calling TUNSETIFF on it first, *then* we make the same

> assumptions we do today?



Then I would rather keep the using the assumption:

1) the value get from get_socket() might not be correct
2) the complexity or risk for bring a very little improvement of the 
debug-ability (which is still suspicious).


>

>>> --- a/drivers/vhost/net.c

>>> +++ b/drivers/vhost/net.c

>>> @@ -1143,7 +1143,8 @@ static void handle_rx(struct vhost_net *net)

>>>    

>>>    	vq_log = unlikely(vhost_has_feature(vq, VHOST_F_LOG_ALL)) ?

>>>    		vq->log : NULL;

>>> -	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF);

>>> +	mergeable = vhost_has_feature(vq, VIRTIO_NET_F_MRG_RXBUF) &&

>>> +		(vhost_hlen || sock_hlen >= sizeof(num_buffers));

>>

>> So this change the behavior. When mergeable buffer is enabled, userspace

>> expects the vhost to merge buffers. If the feature is disabled silently,

>> it violates virtio spec.

>>

>> If anything wrong in the setup, userspace just breaks itself.

>>

>> E.g if sock_hlen is less that struct virtio_net_hdr_mrg_buf. The packet

>> header might be overwrote by the vnet header.

> This wasn't intended to change the behaviour of any code path that is

> already working today. If *either* vhost or the underlying device have

> provided a vnet header, we still merge.

>

> If *neither* provide a vnet hdr, there's nowhere to put num_buffers and

> we can't merge.

>

> That code path doesn't work at all today, but does after my patches.



It looks to me it's a bug that userspace can keep working in this case. 
After mrg rx buffer is negotiated, userspace should always assumes the 
vhost-net to provide num_buffers.

> But you're right, we should explicitly refuse to negotiate

> VIRITO_NET_F_MSG_RXBUF in that case.



This would be very hard:

1) VHOST_SET_FEATURES and VHOST_NET_SET_BACKEND are two different ioctls
2) vhost_net is not tightly coupled with tuntap, vnet header size could 
be changed by userspace at any time


>

>>>    

>>>    	do {

>>>    		sock_len = vhost_net_rx_peek_head_len(net, sock->sk,

>>> @@ -1213,9 +1214,10 @@ static void handle_rx(struct vhost_net *net)

>>>    			}

>>>    		} else {

>>>    			/* Header came from socket; we'll need to patch

>>> -			 * ->num_buffers over if VIRTIO_NET_F_MRG_RXBUF

>>> +			 * ->num_buffers over the last two bytes if

>>> +			 * VIRTIO_NET_F_MRG_RXBUF is enabled.

>>>    			 */

>>> -			iov_iter_advance(&fixup, sizeof(hdr));

>>> +			iov_iter_advance(&fixup, sock_hlen - 2);

>>

>> I'm not sure what did the above code want to fix. It doesn't change

>> anything if vnet header is set correctly in TUN. It only prevents the

>> the packet header from being rewrote.

>>

> It fixes the case where the virtio_net_hdr isn't at the start of the

> tun header, because the tun actually puts the tun_pi struct *first*,

> and *then* the virtio_net_hdr.



Right.


> The num_buffers field needs to go at the *end* of sock_hlen. Not at a

> fixed offset from the *start* of it.

>

> At least, that's true unless we want to just declare that we *only*

> support TUN with the IFF_NO_PI flag. (qv).



Yes, that's a good question. This is probably a hint that "vhost-net is 
never designed to work of PI", and even if it's not true, I'm not sure 
if it's too late to fix.

Thanks
diff mbox series

Patch

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4cf38be26dc9..f812dcdc640e 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -2394,8 +2394,50 @@  static int tun_xdp_one(struct tun_struct *tun,
 		err = -EINVAL;
 		goto out;
 	}
+	switch (tun->flags & TUN_TYPE_MASK) {
+	case IFF_TUN:
+		if (tun->flags & IFF_NO_PI) {
+			u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0;
+
+			switch (ip_version) {
+			case 4:
+				skb->protocol = htons(ETH_P_IP);
+				break;
+			case 6:
+				skb->protocol = htons(ETH_P_IPV6);
+				break;
+			default:
+				atomic_long_inc(&tun->dev->rx_dropped);
+				kfree_skb(skb);
+				err = -EINVAL;
+				goto out;
+			}
+		} else {
+			struct tun_pi *pi = (struct tun_pi *)skb->data;
+			if (!pskb_may_pull(skb, sizeof(*pi))) {
+				atomic_long_inc(&tun->dev->rx_dropped);
+				kfree_skb(skb);
+				err = -ENOMEM;
+				goto out;
+			}
+			skb_pull_inline(skb, sizeof(*pi));
+			skb->protocol = pi->proto;
+		}
+
+		skb_reset_mac_header(skb);
+		skb->dev = tun->dev;
+		break;
+	case IFF_TAP:
+		if (!pskb_may_pull(skb, ETH_HLEN)) {
+			atomic_long_inc(&tun->dev->rx_dropped);
+			kfree_skb(skb);
+			err = -ENOMEM;
+			goto out;
+		}
+		skb->protocol = eth_type_trans(skb, tun->dev);
+		break;
+	}
 
-	skb->protocol = eth_type_trans(skb, tun->dev);
 	skb_reset_network_header(skb);
 	skb_probe_transport_header(skb);
 	skb_record_rx_queue(skb, tfile->queue_index);