From patchwork Sat Jun 19 13:33:04 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 464234 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.3 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_SANE_2 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 93D8BC2B9F4 for ; Sat, 19 Jun 2021 13:33:16 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6A1246135A for ; Sat, 19 Jun 2021 13:33:16 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234300AbhFSNfZ (ORCPT ); Sat, 19 Jun 2021 09:35:25 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:40210 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234262AbhFSNfX (ORCPT ); Sat, 19 Jun 2021 09:35:23 -0400 Received: from bombadil.infradead.org (bombadil.infradead.org [IPv6:2607:7c80:54:e::133]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 41E98C061574 for ; Sat, 19 Jun 2021 06:33:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=bombadil.20210309; h=Mime-Version:Content-Type:Date:Cc:To: From:Subject:Message-ID:Sender:Reply-To:Content-Transfer-Encoding:Content-ID: Content-Description:In-Reply-To:References; bh=+wmEC5cj3eQG19iKl2wZKKGtJDAMZHW6C/R4ejQoPb8=; b=2e1oyiKpCsDGYqdvGP7ImzBy+p QqQn88lGj8vE2qJlKXycVbwVOYIQ0huJWNqSJMVs7TJ6bUqZLWG8GqTQRmE46LogHT85GVCE+wU8a 4qrdJgYcODkgW2OJHljdBCrNeX2oSameb7bojpTldQnYajwVLtcfIo73eUe6pmsWfgDspGa4CJTZf ICM4vkG72k2VjpN7wt/8WXzX9+Dzzj2NCtZy1GT+2tEK3+SZ5rKfaD0y1/488AbgmScmwCydKYH25 9rprJrIlSMlDMv0xA9/9stBy9cb6/XWFozkDkMUb7zZqolVmOlM/y4fGIzf3NLYUaT0TtsTm2oK7e kgJ+88gQ==; Received: from [2001:8b0:10b:1::3ae] (helo=u3832b3a9db3152.ant.amazon.com) by bombadil.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1lub5u-00H3fB-D7; Sat, 19 Jun 2021 13:33:07 +0000 Message-ID: <03ee62602dd7b7101f78e0802249a6e2e4c10b7f.camel@infradead.org> Subject: [PATCH] net: tun: fix tun_xdp_one() for IFF_TUN mode From: David Woodhouse To: netdev Cc: Jason Wang Date: Sat, 19 Jun 2021 14:33:04 +0100 X-Mailer: Evolution 3.28.5-0ubuntu0.18.04.2 Mime-Version: 1.0 X-SRS-Rewrite: SMTP reverse-path rewritten from by bombadil.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: David Woodhouse In tun_get_user(), skb->protocol is either taken from the tun_pi header or inferred from the first byte of the packet in IFF_TUN mode, while eth_type_trans() is called only in the IFF_TAP mode where the payload is expected to be an Ethernet frame. The alternative path in tun_xdp_one() was unconditionally using eth_type_trans(), which corrupts packets in IFF_TUN mode. Fix it to do the correct thing for IFF_TUN mode, as tun_get_user() does. Signed-off-by: David Woodhouse Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()") Signed-off-by: Jason Wang Acked-by: Jason Wang --- How is my userspace application going to know that the kernel has this fix? Should we add a flag to TUN_FEATURES to show that vhost-net in *IFF_TUN* mode is supported? drivers/net/tun.c | 44 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 43 insertions(+), 1 deletion(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 4cf38be26dc9..f812dcdc640e 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2394,8 +2394,50 @@ static int tun_xdp_one(struct tun_struct *tun, err = -EINVAL; goto out; } + switch (tun->flags & TUN_TYPE_MASK) { + case IFF_TUN: + if (tun->flags & IFF_NO_PI) { + u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0; + + switch (ip_version) { + case 4: + skb->protocol = htons(ETH_P_IP); + break; + case 6: + skb->protocol = htons(ETH_P_IPV6); + break; + default: + atomic_long_inc(&tun->dev->rx_dropped); + kfree_skb(skb); + err = -EINVAL; + goto out; + } + } else { + struct tun_pi *pi = (struct tun_pi *)skb->data; + if (!pskb_may_pull(skb, sizeof(*pi))) { + atomic_long_inc(&tun->dev->rx_dropped); + kfree_skb(skb); + err = -ENOMEM; + goto out; + } + skb_pull_inline(skb, sizeof(*pi)); + skb->protocol = pi->proto; + } + + skb_reset_mac_header(skb); + skb->dev = tun->dev; + break; + case IFF_TAP: + if (!pskb_may_pull(skb, ETH_HLEN)) { + atomic_long_inc(&tun->dev->rx_dropped); + kfree_skb(skb); + err = -ENOMEM; + goto out; + } + skb->protocol = eth_type_trans(skb, tun->dev); + break; + } - skb->protocol = eth_type_trans(skb, tun->dev); skb_reset_network_header(skb); skb_probe_transport_header(skb); skb_record_rx_queue(skb, tfile->queue_index); From patchwork Thu Jun 24 12:30:03 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 466736 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 67C6BC48BDF for ; Thu, 24 Jun 2021 12:30:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4A66D613EC for ; Thu, 24 Jun 2021 12:30:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229505AbhFXMco (ORCPT ); Thu, 24 Jun 2021 08:32:44 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45162 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231407AbhFXMc3 (ORCPT ); Thu, 24 Jun 2021 08:32:29 -0400 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id B37BEC061574 for ; Thu, 24 Jun 2021 05:30:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=c8vL8HZX/zIZPP6WgQpljiTWkbKb1nnMUaed62zzHUA=; b=Ez7x7SmJShwnSTPgGFFS6Wj+EI vmI37361G8xYO1ae6YTdUBPWjn3aGuKc3tzidjN29CfGo1orKgSEqR36CGcemn8a0CiO2HtMmESZ2 tsHNqki6tziqUo7tX9t5s3P7nrzVQwtPj2uCbDt4tk0bXLpGQDvzT2+t0F4b2ONj/FJ7okMk9xdvz s6zwA36S53mnvX8e7BJNWvUiHFdTtwMW9etzcPlkj1CXGeaqY7x5hpyNP+14StRdzFNpvWJnTwy3o GJCTT0Ewno6lJfgQPh/A9vX13vY3QDZVWgTDjhn3qxgJ+McbwsyDL54LzuoY3GL77WcJM7rRWSJWD Ros65bfw==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1lwOUZ-00BEEO-6D; Thu, 24 Jun 2021 12:30:06 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1lwOUf-005Sf1-L6; Thu, 24 Jun 2021 13:30:05 +0100 From: David Woodhouse To: netdev@vger.kernel.org Cc: Jason Wang , =?utf-8?q?Eugenio_P=C3=A9rez?= , Willem de Bruijn Subject: [PATCH v3 3/5] vhost_net: remove virtio_net_hdr validation, let tun/tap do it themselves Date: Thu, 24 Jun 2021 13:30:03 +0100 Message-Id: <20210624123005.1301761-3-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20210624123005.1301761-1-dwmw2@infradead.org> References: <03ee62602dd7b7101f78e0802249a6e2e4c10b7f.camel@infradead.org> <20210624123005.1301761-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: David Woodhouse When the underlying socket isn't configured with a virtio_net_hdr, the existing code in vhost_net_build_xdp() would attempt to validate uninitialised data, by copying zero bytes (sock_hlen) into the local copy of the header and then trying to validate that. Fixing it is somewhat non-trivial because the tun device might put a struct tun_pi *before* the virtio_net_hdr, which makes it hard to find. So just stop messing with someone else's data in vhost_net_build_xdp(), and let tap and tun validate it for themselves, as they do in the non-XDP case anyway. This means that the 'gso' member of struct tun_xdp_hdr can die, leaving only 'int buflen'. The socket header of sock_hlen is still copied separately from the data payload because there may be a gap between them to ensure suitable alignment of the latter. Fixes: 0a0be13b8fe2 ("vhost_net: batch submitting XDP buffers to underlayer sockets") Signed-off-by: David Woodhouse --- drivers/net/tap.c | 25 ++++++++++++++++++++++--- drivers/net/tun.c | 21 ++++++++++++++++++--- drivers/vhost/net.c | 30 +++++++++--------------------- include/linux/if_tun.h | 1 - 4 files changed, 49 insertions(+), 28 deletions(-) diff --git a/drivers/net/tap.c b/drivers/net/tap.c index 2170a0d3d34c..d1b1f1de374e 100644 --- a/drivers/net/tap.c +++ b/drivers/net/tap.c @@ -1132,16 +1132,35 @@ static const struct file_operations tap_fops = { static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp) { struct tun_xdp_hdr *hdr = xdp->data_hard_start; - struct virtio_net_hdr *gso = &hdr->gso; + struct virtio_net_hdr *gso = NULL; int buflen = hdr->buflen; int vnet_hdr_len = 0; struct tap_dev *tap; struct sk_buff *skb; int err, depth; - if (q->flags & IFF_VNET_HDR) + if (q->flags & IFF_VNET_HDR) { vnet_hdr_len = READ_ONCE(q->vnet_hdr_sz); + if (xdp->data != xdp->data_hard_start + sizeof(*hdr) + vnet_hdr_len) { + err = -EINVAL; + goto err; + } + + gso = (void *)&hdr[1]; + if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && + tap16_to_cpu(q, gso->csum_start) + + tap16_to_cpu(q, gso->csum_offset) + 2 > + tap16_to_cpu(q, gso->hdr_len)) + gso->hdr_len = cpu_to_tap16(q, + tap16_to_cpu(q, gso->csum_start) + + tap16_to_cpu(q, gso->csum_offset) + 2); + + if (tap16_to_cpu(q, gso->hdr_len) > xdp->data_end - xdp->data) { + err = -EINVAL; + goto err; + } + } skb = build_skb(xdp->data_hard_start, buflen); if (!skb) { err = -ENOMEM; @@ -1155,7 +1174,7 @@ static int tap_get_user_xdp(struct tap_queue *q, struct xdp_buff *xdp) skb_reset_mac_header(skb); skb->protocol = eth_hdr(skb)->h_proto; - if (vnet_hdr_len) { + if (gso) { err = virtio_net_hdr_to_skb(skb, gso, tap_is_little_endian(q)); if (err) goto err_kfree; diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 9acd448e6dfc..1b553f79adb0 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -2331,6 +2331,7 @@ static int tun_xdp_one(struct tun_struct *tun, { unsigned int datasize = xdp->data_end - xdp->data; struct tun_xdp_hdr *hdr = xdp->data_hard_start; + void *tun_hdr = &hdr[1]; struct virtio_net_hdr *gso = NULL; struct bpf_prog *xdp_prog; struct sk_buff *skb = NULL; @@ -2340,8 +2341,22 @@ static int tun_xdp_one(struct tun_struct *tun, bool skb_xdp = false; struct page *page; - if (tun->flags & IFF_VNET_HDR) - gso = &hdr->gso; + if (tun->flags & IFF_VNET_HDR) { + gso = tun_hdr; + tun_hdr += sizeof(*gso); + + if (tun_hdr > xdp->data) { + atomic_long_inc(&tun->rx_frame_errors); + return -EINVAL; + } + + if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && + tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2 > tun16_to_cpu(tun, gso->hdr_len)) + gso->hdr_len = cpu_to_tun16(tun, tun16_to_cpu(tun, gso->csum_start) + tun16_to_cpu(tun, gso->csum_offset) + 2); + + if (tun16_to_cpu(tun, gso->hdr_len) > datasize) + return -EINVAL; + } xdp_prog = rcu_dereference(tun->xdp_prog); if (xdp_prog) { @@ -2389,7 +2404,7 @@ static int tun_xdp_one(struct tun_struct *tun, } skb_reserve(skb, xdp->data - xdp->data_hard_start); - skb_put(skb, xdp->data_end - xdp->data); + skb_put(skb, datasize); if (gso && virtio_net_hdr_to_skb(skb, gso, tun_is_little_endian(tun))) { atomic_long_inc(&tun->rx_frame_errors); diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index b92a7144ed90..7cae18151c60 100644 --- a/drivers/vhost/net.c +++ b/drivers/vhost/net.c @@ -690,7 +690,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, dev); struct socket *sock = vhost_vq_get_backend(vq); struct page_frag *alloc_frag = &net->page_frag; - struct virtio_net_hdr *gso; struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp]; struct tun_xdp_hdr *hdr; size_t len = iov_iter_count(from); @@ -715,29 +714,18 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq, return -ENOMEM; buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset; - copied = copy_page_from_iter(alloc_frag->page, - alloc_frag->offset + - offsetof(struct tun_xdp_hdr, gso), - sock_hlen, from); - if (copied != sock_hlen) - return -EFAULT; - hdr = buf; - gso = &hdr->gso; - - if ((gso->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM) && - vhost16_to_cpu(vq, gso->csum_start) + - vhost16_to_cpu(vq, gso->csum_offset) + 2 > - vhost16_to_cpu(vq, gso->hdr_len)) { - gso->hdr_len = cpu_to_vhost16(vq, - vhost16_to_cpu(vq, gso->csum_start) + - vhost16_to_cpu(vq, gso->csum_offset) + 2); - - if (vhost16_to_cpu(vq, gso->hdr_len) > len) - return -EINVAL; + if (sock_hlen) { + copied = copy_page_from_iter(alloc_frag->page, + alloc_frag->offset + + sizeof(struct tun_xdp_hdr), + sock_hlen, from); + if (copied != sock_hlen) + return -EFAULT; + + len -= sock_hlen; } - len -= sock_hlen; copied = copy_page_from_iter(alloc_frag->page, alloc_frag->offset + pad, len, from); diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h index 8a7debd3f663..8d78b6bbc228 100644 --- a/include/linux/if_tun.h +++ b/include/linux/if_tun.h @@ -21,7 +21,6 @@ struct tun_msg_ctl { struct tun_xdp_hdr { int buflen; - struct virtio_net_hdr gso; }; #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE) From patchwork Thu Jun 24 12:30:04 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 466735 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E4D99C49EA5 for ; Thu, 24 Jun 2021 12:30:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C4700613B1 for ; Thu, 24 Jun 2021 12:30:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231474AbhFXMdM (ORCPT ); Thu, 24 Jun 2021 08:33:12 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45166 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231432AbhFXMc3 (ORCPT ); Thu, 24 Jun 2021 08:32:29 -0400 Received: from desiato.infradead.org (desiato.infradead.org [IPv6:2001:8b0:10b:1:d65d:64ff:fe57:4e05]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BBB46C061760 for ; Thu, 24 Jun 2021 05:30:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=desiato.20200630; h=Sender:Content-Transfer-Encoding: MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc:To:From: Reply-To:Content-Type:Content-ID:Content-Description; bh=gGZaECOQuaiwn+numAlhP+P5/O9waNNMX3buiX5lvnY=; b=TyilwAzLNJY8NmlbJJpbhaa2wG htWpR1HaS/9kimT13H+1s11hni0lPrtgrlIEuW3Q8fb1WrGs8gMrGRXgI8mQ/vB46j7yL4/y7/viL OqVf7s7COcTismmD+2tA4dq2vDKU+YviyBc0wo/ZflMje/bW81r9TGzsY9kt4GaBUexIOMyxkjaST ut9KTKxm+ickd4VtxLdc9G43mWRr3GgIO1cl92iZdgUWgETEKJAdn8QkJ0dZB648D3/gP0WERoGyK 6It6vejArXP4hA+Q+rouC4qfoEd9zmH5bSvlkqZwmV1KmXIlfZG0gT1XVqYfLgRIBY4qU1W4Bb1Zj iMnu6b4A==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by desiato.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1lwOUZ-00BEEP-6t; Thu, 24 Jun 2021 12:30:06 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1lwOUf-005Sf5-NC; Thu, 24 Jun 2021 13:30:05 +0100 From: David Woodhouse To: netdev@vger.kernel.org Cc: Jason Wang , =?utf-8?q?Eugenio_P=C3=A9rez?= , Willem de Bruijn Subject: [PATCH v3 4/5] net: tun: fix tun_xdp_one() for IFF_TUN mode Date: Thu, 24 Jun 2021 13:30:04 +0100 Message-Id: <20210624123005.1301761-4-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20210624123005.1301761-1-dwmw2@infradead.org> References: <03ee62602dd7b7101f78e0802249a6e2e4c10b7f.camel@infradead.org> <20210624123005.1301761-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by desiato.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: David Woodhouse In tun_get_user(), skb->protocol is either taken from the tun_pi header or inferred from the first byte of the packet in IFF_TUN mode, while eth_type_trans() is called only in the IFF_TAP mode where the payload is expected to be an Ethernet frame. The equivalent code path in tun_xdp_one() was unconditionally using eth_type_trans(), which is the wrong thing to do in IFF_TUN mode and corrupts packets. Pull the logic out to a separate tun_skb_set_protocol() function, and call it from both tun_get_user() and tun_xdp_one(). XX: It is not entirely clear to me why it's OK to call eth_type_trans() in some cases without first checking that enough of the Ethernet header is linearly present by calling pskb_may_pull(). Such a check was never present in the tun_xdp_one() code path, and commit 96aa1b22bd6bb ("tun: correct header offsets in napi frags mode") deliberately added it *only* for the IFF_NAPI_FRAGS mode. I would like to see specific explanations of *why* it's ever valid and necessary (is it so much faster?) to skip the pskb_may_pull() by setting the 'no_pull_check' flag to tun_skb_set_protocol(), but for now I'll settle for faithfully preserving the existing behaviour and pretending it's someone else's problem. Cc: Willem de Bruijn Fixes: 043d222f93ab ("tuntap: accept an array of XDP buffs through sendmsg()") Signed-off-by: David Woodhouse --- drivers/net/tun.c | 95 +++++++++++++++++++++++++++++++---------------- 1 file changed, 63 insertions(+), 32 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 1b553f79adb0..9379fa86fae9 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -1641,6 +1641,47 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun, return NULL; } +static int tun_skb_set_protocol(struct tun_struct *tun, struct sk_buff *skb, + __be16 pi_proto, bool no_pull_check) +{ + switch (tun->flags & TUN_TYPE_MASK) { + case IFF_TUN: + if (tun->flags & IFF_NO_PI) { + u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0; + + switch (ip_version) { + case 4: + pi_proto = htons(ETH_P_IP); + break; + case 6: + pi_proto = htons(ETH_P_IPV6); + break; + default: + return -EINVAL; + } + } + + skb_reset_mac_header(skb); + skb->protocol = pi_proto; + skb->dev = tun->dev; + break; + case IFF_TAP: + /* As an optimisation, no_pull_check can be set in the cases + * where the caller *knows* that eth_type_trans() will be OK + * because the Ethernet header is linearised at skb->data. + * + * XX: Or so I was reliably assured when I moved this code + * and didn't make it unconditional. dwmw2. + */ + if (!no_pull_check && !pskb_may_pull(skb, ETH_HLEN)) + return -ENOMEM; + + skb->protocol = eth_type_trans(skb, tun->dev); + break; + } + return 0; +} + /* Get packet from user space buffer */ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, void *msg_control, struct iov_iter *from, @@ -1784,37 +1825,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile, return -EINVAL; } - switch (tun->flags & TUN_TYPE_MASK) { - case IFF_TUN: - if (tun->flags & IFF_NO_PI) { - u8 ip_version = skb->len ? (skb->data[0] >> 4) : 0; - - switch (ip_version) { - case 4: - pi.proto = htons(ETH_P_IP); - break; - case 6: - pi.proto = htons(ETH_P_IPV6); - break; - default: - atomic_long_inc(&tun->dev->rx_dropped); - kfree_skb(skb); - return -EINVAL; - } - } - - skb_reset_mac_header(skb); - skb->protocol = pi.proto; - skb->dev = tun->dev; - break; - case IFF_TAP: - if (frags && !pskb_may_pull(skb, ETH_HLEN)) { - err = -ENOMEM; - goto drop; - } - skb->protocol = eth_type_trans(skb, tun->dev); - break; - } + err = tun_skb_set_protocol(tun, skb, pi.proto, !frags); + if (err) + goto drop; /* copy skb_ubuf_info for callback when skb has no error */ if (zerocopy) { @@ -2335,12 +2348,24 @@ static int tun_xdp_one(struct tun_struct *tun, struct virtio_net_hdr *gso = NULL; struct bpf_prog *xdp_prog; struct sk_buff *skb = NULL; + __be16 proto = 0; u32 rxhash = 0, act; int buflen = hdr->buflen; int err = 0; bool skb_xdp = false; struct page *page; + if (!(tun->flags & IFF_NO_PI)) { + struct tun_pi *pi = tun_hdr; + tun_hdr += sizeof(*pi); + + if (tun_hdr > xdp->data) { + atomic_long_inc(&tun->rx_frame_errors); + return -EINVAL; + } + proto = pi->proto; + } + if (tun->flags & IFF_VNET_HDR) { gso = tun_hdr; tun_hdr += sizeof(*gso); @@ -2413,7 +2438,13 @@ static int tun_xdp_one(struct tun_struct *tun, goto out; } - skb->protocol = eth_type_trans(skb, tun->dev); + err = tun_skb_set_protocol(tun, skb, proto, true); + if (err) { + atomic_long_inc(&tun->dev->rx_dropped); + kfree_skb(skb); + goto out; + } + skb_reset_network_header(skb); skb_probe_transport_header(skb); skb_record_rx_queue(skb, tfile->queue_index); From patchwork Thu Jun 24 12:30:05 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: David Woodhouse X-Patchwork-Id: 466734 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.7 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID, DKIM_VALID_AU, HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 81CF7C49EA6 for ; Thu, 24 Jun 2021 12:30:57 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 698B96054E for ; Thu, 24 Jun 2021 12:30:57 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231499AbhFXMdP (ORCPT ); Thu, 24 Jun 2021 08:33:15 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:45266 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231434AbhFXMc4 (ORCPT ); Thu, 24 Jun 2021 08:32:56 -0400 Received: from casper.infradead.org (casper.infradead.org [IPv6:2001:8b0:10b:1236::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DA197C061760 for ; Thu, 24 Jun 2021 05:30:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=infradead.org; s=casper.20170209; h=Sender:Content-Transfer-Encoding: Content-Type:MIME-Version:References:In-Reply-To:Message-Id:Date:Subject:Cc: To:From:Reply-To:Content-ID:Content-Description; bh=Dotab6s+ZXq1EdvTLh3ORHHPjdm3vVWOuLMYB2a0tY4=; b=bZPXsvArhhZP9BVxyU9dgeYY0g 3QwXDVdiUCA3hBsRpPW3tG0Qx6KP61kKH0DT08wF/W5q02Yc9eHXjXHglxcd2MQDowFOeLTvwZdQ2 x9UtbHCPLKq4frwlM9rw0/zuvJM2F32HsD+kIi13CW7/PDSFFemSOIHVZKJ0sp4rm9I3LiBdwGkzE W4c33rUWHOBkme1i8+ruHo+SwcZSXfQLk92w9+mYb3QYFT11BVK4sIhPQTB69Y1UlavIreoW9HxAK QODF0ltjoN/vtzZW44QqUp9SNPbyPFyBnRPcEKMGA3D8IqzNHRUc+6munolnelEhYsjEpX/VFleN5 gDaKUORA==; Received: from i7.infradead.org ([2001:8b0:10b:1:21e:67ff:fecb:7a92]) by casper.infradead.org with esmtpsa (Exim 4.94.2 #2 (Red Hat Linux)) id 1lwOUg-00GZAi-7x; Thu, 24 Jun 2021 12:30:12 +0000 Received: from dwoodhou by i7.infradead.org with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1lwOUf-005Sf9-Ny; Thu, 24 Jun 2021 13:30:05 +0100 From: David Woodhouse To: netdev@vger.kernel.org Cc: Jason Wang , =?utf-8?q?Eugenio_P=C3=A9rez?= , Willem de Bruijn Subject: [PATCH v3 5/5] vhost_net: Add self test with tun device Date: Thu, 24 Jun 2021 13:30:05 +0100 Message-Id: <20210624123005.1301761-5-dwmw2@infradead.org> X-Mailer: git-send-email 2.31.1 In-Reply-To: <20210624123005.1301761-1-dwmw2@infradead.org> References: <03ee62602dd7b7101f78e0802249a6e2e4c10b7f.camel@infradead.org> <20210624123005.1301761-1-dwmw2@infradead.org> MIME-Version: 1.0 Sender: David Woodhouse X-SRS-Rewrite: SMTP reverse-path rewritten from by casper.infradead.org. See http://www.infradead.org/rpr.html Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: David Woodhouse This creates a tun device and brings it up, and finds out the link-local address that the kernel automatically assigns to it. It then sets up vhost-net on the tun device, uses that to send a ping to the kernel's assigned link-local addresss, and waits for a reply. If everything is working correctly, it will get a response and manage to understand it. If the virtio_net_hdr and other pieces are not working as expected, then it fails (and times out after 1 second). The test is repeated in various combinations of vhost-net feature flags, tun vhdr length, PI enabled, and XDP/non-XDP code paths. Most of which didn't work before the patch series that added this test, but do now. Signed-off-by: David Woodhouse --- tools/testing/selftests/Makefile | 1 + tools/testing/selftests/vhost/Makefile | 16 + tools/testing/selftests/vhost/config | 2 + .../testing/selftests/vhost/test_vhost_net.c | 530 ++++++++++++++++++ 4 files changed, 549 insertions(+) create mode 100644 tools/testing/selftests/vhost/Makefile create mode 100644 tools/testing/selftests/vhost/config create mode 100644 tools/testing/selftests/vhost/test_vhost_net.c diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile index 6c575cf34a71..300c03cfd0c7 100644 --- a/tools/testing/selftests/Makefile +++ b/tools/testing/selftests/Makefile @@ -71,6 +71,7 @@ TARGETS += user TARGETS += vDSO TARGETS += vm TARGETS += x86 +TARGETS += vhost TARGETS += zram #Please keep the TARGETS list alphabetically sorted # Run "make quicktest=1 run_tests" or diff --git a/tools/testing/selftests/vhost/Makefile b/tools/testing/selftests/vhost/Makefile new file mode 100644 index 000000000000..f5e565d80733 --- /dev/null +++ b/tools/testing/selftests/vhost/Makefile @@ -0,0 +1,16 @@ +# SPDX-License-Identifier: GPL-2.0 +all: + +include ../lib.mk + +.PHONY: all clean + +BINARIES := test_vhost_net + +test_vhost_net: test_vhost_net.c ../kselftest.h ../kselftest_harness.h + $(CC) $(CFLAGS) -g $< -o $@ + +TEST_PROGS += $(BINARIES) +EXTRA_CLEAN := $(BINARIES) + +all: $(BINARIES) diff --git a/tools/testing/selftests/vhost/config b/tools/testing/selftests/vhost/config new file mode 100644 index 000000000000..6391c1f32c34 --- /dev/null +++ b/tools/testing/selftests/vhost/config @@ -0,0 +1,2 @@ +CONFIG_VHOST_NET=y +CONFIG_TUN=y diff --git a/tools/testing/selftests/vhost/test_vhost_net.c b/tools/testing/selftests/vhost/test_vhost_net.c new file mode 100644 index 000000000000..747f0e5e4f57 --- /dev/null +++ b/tools/testing/selftests/vhost/test_vhost_net.c @@ -0,0 +1,530 @@ +// SPDX-License-Identifier: LGPL-2.1 + +#include "../kselftest_harness.h" +#include "../../../virtio/asm/barrier.h" + +#include + +#include +#include + +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include +#include +#include +#include +#include + +#include +#include +#include + +static unsigned char hexnybble(char hex) +{ + switch (hex) { + case '0'...'9': + return hex - '0'; + case 'a'...'f': + return 10 + hex - 'a'; + case 'A'...'F': + return 10 + hex - 'A'; + default: + exit (KSFT_SKIP); + } +} + +static unsigned char hexchar(char *hex) +{ + return (hexnybble(hex[0]) << 4) | hexnybble(hex[1]); +} + +int open_tun(int vnet_hdr_sz, int pi, struct in6_addr *addr) +{ + int tun_fd = open("/dev/net/tun", O_RDWR); + if (tun_fd == -1) + return -1; + + struct ifreq ifr = { 0 }; + + ifr.ifr_flags = IFF_TUN; + if (!pi) + ifr.ifr_flags |= IFF_NO_PI; + if (vnet_hdr_sz) + ifr.ifr_flags |= IFF_VNET_HDR; + + if (ioctl(tun_fd, TUNSETIFF, (void *)&ifr) < 0) + goto out_tun; + + if (vnet_hdr_sz && + ioctl(tun_fd, TUNSETVNETHDRSZ, &vnet_hdr_sz) < 0) + goto out_tun; + + int sockfd = socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP); + if (sockfd == -1) + goto out_tun; + + if (ioctl(sockfd, SIOCGIFFLAGS, (void *)&ifr) < 0) + goto out_sock; + + ifr.ifr_flags |= IFF_UP; + if (ioctl(sockfd, SIOCSIFFLAGS, (void *)&ifr) < 0) + goto out_sock; + + close(sockfd); + + FILE *inet6 = fopen("/proc/net/if_inet6", "r"); + if (!inet6) + goto out_tun; + + char buf[80]; + while (fgets(buf, sizeof(buf), inet6)) { + size_t len = strlen(buf), namelen = strlen(ifr.ifr_name); + if (!strncmp(buf, "fe80", 4) && + buf[len - namelen - 2] == ' ' && + !strncmp(buf + len - namelen - 1, ifr.ifr_name, namelen)) { + for (int i = 0; i < 16; i++) { + addr->s6_addr[i] = hexchar(buf + i*2); + } + fclose(inet6); + return tun_fd; + } + } + /* Not found */ + fclose(inet6); + out_sock: + close(sockfd); + out_tun: + close(tun_fd); + return -1; +} + +#define RING_SIZE 32 +#define RING_MASK(x) ((x) & (RING_SIZE-1)) + +struct pkt_buf { + unsigned char data[2048]; +}; + +struct test_vring { + struct vring_desc desc[RING_SIZE]; + struct vring_avail avail; + __virtio16 avail_ring[RING_SIZE]; + struct vring_used used; + struct vring_used_elem used_ring[RING_SIZE]; + struct pkt_buf pkts[RING_SIZE]; +} rings[2]; + +static int setup_vring(int vhost_fd, int tun_fd, int call_fd, int kick_fd, int idx) +{ + struct test_vring *vring = &rings[idx]; + int ret; + + memset(vring, 0, sizeof(vring)); + + struct vhost_vring_state vs = { }; + vs.index = idx; + vs.num = RING_SIZE; + if (ioctl(vhost_fd, VHOST_SET_VRING_NUM, &vs) < 0) { + perror("VHOST_SET_VRING_NUM"); + return -1; + } + + vs.num = 0; + if (ioctl(vhost_fd, VHOST_SET_VRING_BASE, &vs) < 0) { + perror("VHOST_SET_VRING_BASE"); + return -1; + } + + struct vhost_vring_addr va = { }; + va.index = idx; + va.desc_user_addr = (uint64_t)vring->desc; + va.avail_user_addr = (uint64_t)&vring->avail; + va.used_user_addr = (uint64_t)&vring->used; + if (ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &va) < 0) { + perror("VHOST_SET_VRING_ADDR"); + return -1; + } + + struct vhost_vring_file vf = { }; + vf.index = idx; + vf.fd = tun_fd; + if (ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &vf) < 0) { + perror("VHOST_NET_SET_BACKEND"); + return -1; + } + + vf.fd = call_fd; + if (ioctl(vhost_fd, VHOST_SET_VRING_CALL, &vf) < 0) { + perror("VHOST_SET_VRING_CALL"); + return -1; + } + + vf.fd = kick_fd; + if (ioctl(vhost_fd, VHOST_SET_VRING_KICK, &vf) < 0) { + perror("VHOST_SET_VRING_KICK"); + return -1; + } + + return 0; +} + +int setup_vhost(int vhost_fd, int tun_fd, int call_fd, int kick_fd, uint64_t want_features) +{ + int ret; + + if (ioctl(vhost_fd, VHOST_SET_OWNER, NULL) < 0) { + perror("VHOST_SET_OWNER"); + return -1; + } + + uint64_t features; + if (ioctl(vhost_fd, VHOST_GET_FEATURES, &features) < 0) { + perror("VHOST_GET_FEATURES"); + return -1; + } + + if ((features & want_features) != want_features) + return KSFT_SKIP; + + if (ioctl(vhost_fd, VHOST_SET_FEATURES, &want_features) < 0) { + perror("VHOST_SET_FEATURES"); + return -1; + } + + struct vhost_memory *vmem = alloca(sizeof(*vmem) + sizeof(vmem->regions[0])); + + memset(vmem, 0, sizeof(*vmem) + sizeof(vmem->regions[0])); + vmem->nregions = 1; + /* + * I just want to map the *whole* of userspace address space. But + * from userspace I don't know what that is. On x86_64 it would be: + * + * vmem->regions[0].guest_phys_addr = 4096; + * vmem->regions[0].memory_size = 0x7fffffffe000; + * vmem->regions[0].userspace_addr = 4096; + * + * For now, just ensure we put everything inside a single BSS region. + */ + vmem->regions[0].guest_phys_addr = (uint64_t)&rings; + vmem->regions[0].userspace_addr = (uint64_t)&rings; + vmem->regions[0].memory_size = sizeof(rings); + + if (ioctl(vhost_fd, VHOST_SET_MEM_TABLE, vmem) < 0) { + perror("VHOST_SET_MEM_TABLE"); + return -1; + } + + if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 0)) + return -1; + + if (setup_vring(vhost_fd, tun_fd, call_fd, kick_fd, 1)) + return -1; + + return 0; +} + + +static char ping_payload[16] = "VHOST TEST PACKT"; + +static inline uint32_t csum_partial(uint16_t *buf, int nwords) +{ + uint32_t sum = 0; + for(sum=0; nwords>0; nwords--) + sum += ntohs(*buf++); + return sum; +} + +static inline uint16_t csum_finish(uint32_t sum) +{ + sum = (sum >> 16) + (sum &0xffff); + sum += (sum >> 16); + return htons((uint16_t)(~sum)); +} + +static int create_icmp_echo(unsigned char *data, struct in6_addr *dst, + struct in6_addr *src, uint16_t id, uint16_t seq) +{ + const int icmplen = ICMP_MINLEN + sizeof(ping_payload); + const int plen = sizeof(struct ip6_hdr) + icmplen; + struct ip6_hdr *iph = (void *)data; + struct icmp6_hdr *icmph = (void *)(data + sizeof(*iph)); + + /* IPv6 Header */ + iph->ip6_flow = htonl((6 << 28) + /* version 6 */ + (0 << 20) + /* traffic class */ + (0 << 0)); /* flow ID */ + iph->ip6_nxt = IPPROTO_ICMPV6; + iph->ip6_plen = htons(icmplen); + iph->ip6_hlim = 128; + iph->ip6_src = *src; + iph->ip6_dst = *dst; + + /* ICMPv6 echo request */ + icmph->icmp6_type = ICMP6_ECHO_REQUEST; + icmph->icmp6_code = 0; + icmph->icmp6_data16[0] = htons(id); /* ID */ + icmph->icmp6_data16[1] = htons(seq); /* sequence */ + + /* Some arbitrary payload */ + memcpy(&icmph[1], ping_payload, sizeof(ping_payload)); + + /* + * IPv6 upper-layer checksums include a pseudo-header + * for IPv6 which contains the source address, the + * destination address, the upper-layer packet length + * and next-header field. See RFC8200 ยง8.1. The + * checksum is as follows: + * + * checksum 32 bytes of real IPv6 header: + * src addr (16 bytes) + * dst addr (16 bytes) + * 8 bytes more: + * length of ICMPv6 in bytes (be32) + * 3 bytes of 0 + * next header byte (IPPROTO_ICMPV6) + * Then the actual ICMPv6 bytes + */ + uint32_t sum = csum_partial((uint16_t *)&iph->ip6_src, 8); /* 8 uint16_t */ + sum += csum_partial((uint16_t *)&iph->ip6_dst, 8); /* 8 uint16_t */ + + /* The easiest way to checksum the following 8-byte + * part of the pseudo-header without horridly violating + * C type aliasing rules is *not* to build it in memory + * at all. We know the length fits in 16 bits so the + * partial checksum of 00 00 LL LL 00 00 00 NH ends up + * being just LLLL + NH. + */ + sum += IPPROTO_ICMPV6; + sum += ICMP_MINLEN + sizeof(ping_payload); + + sum += csum_partial((uint16_t *)icmph, icmplen / 2); + icmph->icmp6_cksum = csum_finish(sum); + return plen; +} + + +static int check_icmp_response(unsigned char *data, uint32_t len, + struct in6_addr *dst, struct in6_addr *src) +{ + struct ip6_hdr *iph = (void *)data; + return ( len >= 41 && (ntohl(iph->ip6_flow) >> 28)==6 /* IPv6 header */ + && iph->ip6_nxt == IPPROTO_ICMPV6 /* IPv6 next header field = ICMPv6 */ + && !memcmp(&iph->ip6_src, src, 16) /* source == magic address */ + && !memcmp(&iph->ip6_dst, dst, 16) /* source == magic address */ + && len >= 40 + ICMP_MINLEN + sizeof(ping_payload) /* No short-packet segfaults */ + && data[40] == ICMP6_ECHO_REPLY /* ICMPv6 reply */ + && !memcmp(&data[40 + ICMP_MINLEN], ping_payload, sizeof(ping_payload)) /* Same payload in response */ + ); + +} + +#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__ +#define vio16(x) (x) +#define vio32(x) (x) +#define vio64(x) (x) +#else +#define vio16(x) __builtin_bswap16(x) +#define vio32(x) __builtin_bswap32(x) +#define vio64(x) __builtin_bswap64(x) +#endif + + +int test_vhost(int vnet_hdr_sz, int pi, int xdp, uint64_t features) +{ + int call_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK); + int kick_fd = eventfd(0, EFD_CLOEXEC|EFD_NONBLOCK); + int vhost_fd = open("/dev/vhost-net", O_RDWR); + int tun_fd = -1; + int ret = KSFT_SKIP; + + if (call_fd < 0 || kick_fd < 0 || vhost_fd < 0) + goto err; + + memset(rings, 0, sizeof(rings)); + + /* Pick up the link-local address that the kernel + * assigns to the tun device. */ + struct in6_addr tun_addr; + tun_fd = open_tun(vnet_hdr_sz, pi, &tun_addr); + if (tun_fd < 0) + goto err; + + int pi_offset = -1; + int data_offset = vnet_hdr_sz; + + /* The tun device puts PI *first*, before the vnet hdr */ + if (pi) { + pi_offset = 0; + data_offset += sizeof(struct tun_pi); + }; + + /* If vhost is going a vnet hdr it comes before all else */ + if (features & (1ULL << VHOST_NET_F_VIRTIO_NET_HDR)) { + int vhost_hdr_sz = (features & ((1ULL << VIRTIO_NET_F_MRG_RXBUF) | + (1ULL << VIRTIO_F_VERSION_1))) ? + sizeof(struct virtio_net_hdr_mrg_rxbuf) : + sizeof(struct virtio_net_hdr); + + data_offset += vhost_hdr_sz; + if (pi_offset != -1) + pi_offset += vhost_hdr_sz; + } + + if (!xdp) { + int sndbuf = RING_SIZE * 2048; + if (ioctl(tun_fd, TUNSETSNDBUF, &sndbuf) < 0) { + perror("TUNSETSNDBUF"); + ret = -1; + goto err; + } + } + + ret = setup_vhost(vhost_fd, tun_fd, call_fd, kick_fd, features); + if (ret) + goto err; + + /* A fake link-local address for the userspace end */ + struct in6_addr local_addr = { 0 }; + local_addr.s6_addr16[0] = htons(0xfe80); + local_addr.s6_addr16[7] = htons(1); + + + /* Set up RX and TX descriptors; the latter with ping packets ready to + * send to the kernel, but don't actually send them yet. */ + for (int i = 0; i < RING_SIZE; i++) { + struct pkt_buf *pkt = &rings[1].pkts[i]; + if (pi_offset != -1) { + struct tun_pi *pi = (void *)&pkt->data[pi_offset]; + pi->proto = htons(ETH_P_IPV6); + } + int plen = create_icmp_echo(&pkt->data[data_offset], &tun_addr, + &local_addr, 0x4747, i); + + rings[1].desc[i].addr = vio64((uint64_t)pkt); + rings[1].desc[i].len = vio32(plen + data_offset); + rings[1].avail_ring[i] = vio16(i); + + pkt = &rings[0].pkts[i]; + rings[0].desc[i].addr = vio64((uint64_t)pkt); + rings[0].desc[i].len = vio32(sizeof(*pkt)); + rings[0].desc[i].flags = vio16(VRING_DESC_F_WRITE); + rings[0].avail_ring[i] = vio16(i); + } + barrier(); + rings[1].avail.idx = vio16(1); + + uint16_t rx_seen_used = 0; + struct timeval tv = { 1, 0 }; + while (1) { + fd_set rfds = { 0 }; + FD_SET(call_fd, &rfds); + + rings[0].avail.idx = vio16(rx_seen_used + RING_SIZE); + barrier(); + eventfd_write(kick_fd, 1); + + if (select(call_fd + 1, &rfds, NULL, NULL, &tv) <= 0) { + ret = -1; + goto err; + } + + uint16_t rx_used_idx = vio16(rings[0].used.idx); + barrier(); + + while(rx_used_idx != rx_seen_used) { + uint32_t desc = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].id); + uint32_t len = vio32(rings[0].used_ring[RING_MASK(rx_seen_used)].len); + + if (desc >= RING_SIZE || len < data_offset) + return -1; + + uint64_t addr = vio64(rings[0].desc[desc].addr); + if (!addr) + return -1; + + if (len > data_offset && + (pi_offset == -1 || + ((struct tun_pi *)(addr + pi_offset))->proto == htons(ETH_P_IPV6)) && + check_icmp_response((void *)(addr + data_offset), len - data_offset, + &local_addr, &tun_addr)) { + ret = 0; + goto err; + } + + /* Give the same buffer back */ + rings[0].avail_ring[RING_MASK(rx_seen_used++)] = vio32(desc); + } + barrier(); + + uint64_t ev_val; + eventfd_read(call_fd, &ev_val); + } + + err: + if (call_fd != -1) + close(call_fd); + if (kick_fd != -1) + close(kick_fd); + if (vhost_fd != -1) + close(vhost_fd); + if (tun_fd != -1) + close(tun_fd); + + printf("TEST: (hdr %d, xdp %d, pi %d, features %llx) RESULT: %d\n", + vnet_hdr_sz, xdp, pi, (unsigned long long)features, ret); + return ret; +} + +/* For iterating over all permutations. */ +#define VHDR_LEN_BITS 3 /* Tun vhdr length selection */ +#define XDP_BIT 4 /* Don't TUNSETSNDBUF, so we use XDP */ +#define PI_BIT 8 /* Don't set IFF_NO_PI */ +#define VIRTIO_V1_BIT 16 /* Use VIRTIO_F_VERSION_1 feature */ +#define VHOST_HDR_BIT 32 /* Use VHOST_NET_F_VIRTIO_NET_HDR */ + +unsigned int tun_vhdr_lens[] = { 0, 10, 12, 20 }; + +int main(void) +{ + int result = KSFT_SKIP; + int i, ret; + + for (i = 0; i < 64; i++) { + uint64_t features = 0; + + if (i & VIRTIO_V1_BIT) + features |= (1ULL << VIRTIO_F_VERSION_1); +#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__ + else + continue; /* We'd need vio16 et al not to byteswap */ +#endif + + if (i & VHOST_HDR_BIT) { + features |= (1ULL << VHOST_NET_F_VIRTIO_NET_HDR); + + /* Even though the test actually passes at the time of + * writing, don't bother to try asking tun *and* vhost + * both to handle a virtio_net_hdr at the same time. + * That's just silly. */ + if (i & VHDR_LEN_BITS) + continue; + } + + ret = test_vhost(tun_vhdr_lens[i & VHDR_LEN_BITS], + !!(i & PI_BIT), !!(i & XDP_BIT), features); + if (ret < result) + result = ret; + } + + return result; +}