diff mbox series

[net] ip_gre: set dev->hard_header_len properly

Message ID 20201008012154.11149-1-xiyou.wangcong@gmail.com
State New
Headers show
Series [net] ip_gre: set dev->hard_header_len properly | expand

Commit Message

Cong Wang Oct. 8, 2020, 1:21 a.m. UTC
GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
conditionally. When it is set, it assumes the outer IP header is
already created before ipgre_xmit().

This is not true when we send packets through a raw packet socket,
where L2 headers are supposed to be constructed by user. Packet
socket calls dev_validate_header() to validate the header. But
GRE tunnel does not set dev->hard_header_len, so that check can
be simply bypassed, therefore uninit memory could be passed down
to ipgre_xmit().

Fix this by setting dev->hard_header_len whenever sets header_ops,
as dev->hard_header_len is supposed to be the length of the header
created by dev->header_ops->create() anyway.

Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com
Cc: William Tu <u9012063@gmail.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/ipv4/ip_gre.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Willem de Bruijn Oct. 8, 2020, 11:48 a.m. UTC | #1
On Wed, Oct 7, 2020 at 9:22 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
> conditionally. When it is set, it assumes the outer IP header is
> already created before ipgre_xmit().
>
> This is not true when we send packets through a raw packet socket,
> where L2 headers are supposed to be constructed by user. Packet
> socket calls dev_validate_header() to validate the header. But
> GRE tunnel does not set dev->hard_header_len, so that check can
> be simply bypassed, therefore uninit memory could be passed down
> to ipgre_xmit().

If dev->hard_header_len is zero, the packet socket will not reserve
room for the link layer header, so skb->data points to network_header.
But I don't see any uninitialized packet data?

> Fix this by setting dev->hard_header_len whenever sets header_ops,
> as dev->hard_header_len is supposed to be the length of the header
> created by dev->header_ops->create() anyway.

Agreed. Xie has made similar changes to lapbether recently in
commit c7ca03c216ac.

> Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com
> Cc: William Tu <u9012063@gmail.com>
> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/ipv4/ip_gre.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
> index 4e31f23e4117..43b62095559e 100644
> --- a/net/ipv4/ip_gre.c
> +++ b/net/ipv4/ip_gre.c
> @@ -987,10 +987,12 @@ static int ipgre_tunnel_init(struct net_device *dev)
>                                 return -EINVAL;
>                         dev->flags = IFF_BROADCAST;
>                         dev->header_ops = &ipgre_header_ops;
> +                       dev->hard_header_len = tunnel->hlen + sizeof(*iph);
>                 }
>  #endif
>         } else if (!tunnel->collect_md) {
>                 dev->header_ops = &ipgre_header_ops;
> +               dev->hard_header_len = tunnel->hlen + sizeof(*iph);

Unrelated to this patch, I do wonder if ipgre_header does the
right thing when tunnel->hlen > sizeof(gre_base_hdr),
i.e., for gre tunnels with optional fields.

Currently it appears to not initialize those.

>         }
>
>         return ip_tunnel_init(dev);
> --
> 2.28.0
>
Cong Wang Oct. 8, 2020, 5:33 p.m. UTC | #2
On Thu, Oct 8, 2020 at 4:49 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Oct 7, 2020 at 9:22 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
> > conditionally. When it is set, it assumes the outer IP header is
> > already created before ipgre_xmit().
> >
> > This is not true when we send packets through a raw packet socket,
> > where L2 headers are supposed to be constructed by user. Packet
> > socket calls dev_validate_header() to validate the header. But
> > GRE tunnel does not set dev->hard_header_len, so that check can
> > be simply bypassed, therefore uninit memory could be passed down
> > to ipgre_xmit().
>
> If dev->hard_header_len is zero, the packet socket will not reserve
> room for the link layer header, so skb->data points to network_header.
> But I don't see any uninitialized packet data?

The uninit data is allocated by packet_alloc_skb(), if dev->hard_header_len
is 0 and 'len' is anything between [0, tunnel->hlen + sizeof(struct iphdr)),
dev_validate_header() still returns true obviously but only 'len'
bytes are copied
from user-space by skb_copy_datagram_from_iter(). Therefore, those bytes
within range (len, tunnel->hlen + sizeof(struct iphdr)] are uninitialized.

Thanks.
Willem de Bruijn Oct. 8, 2020, 7:04 p.m. UTC | #3
On Thu, Oct 8, 2020 at 1:34 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 4:49 AM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Wed, Oct 7, 2020 at 9:22 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
> > > conditionally. When it is set, it assumes the outer IP header is
> > > already created before ipgre_xmit().
> > >
> > > This is not true when we send packets through a raw packet socket,
> > > where L2 headers are supposed to be constructed by user. Packet
> > > socket calls dev_validate_header() to validate the header. But
> > > GRE tunnel does not set dev->hard_header_len, so that check can
> > > be simply bypassed, therefore uninit memory could be passed down
> > > to ipgre_xmit().
> >
> > If dev->hard_header_len is zero, the packet socket will not reserve
> > room for the link layer header, so skb->data points to network_header.
> > But I don't see any uninitialized packet data?
>
> The uninit data is allocated by packet_alloc_skb(), if dev->hard_header_len
> is 0 and 'len' is anything between [0, tunnel->hlen + sizeof(struct iphdr)),
> dev_validate_header() still returns true obviously but only 'len'
> bytes are copied
> from user-space by skb_copy_datagram_from_iter(). Therefore, those bytes
> within range (len, tunnel->hlen + sizeof(struct iphdr)] are uninitialized.

With dev->hard_header_len of zero, packet_alloc_skb() only allocates len bytes.

With SOCK_RAW, the writer is expected to write the ip and gre header
and include these in the send len argument. The only difference I see
is that with hard_header_len the data starts reserve bytes before
skb_network_header, and an additional tail has been allocated that is
not used.

But this also fixes a potentially more serious bug. With SOCK_DGRAM,
dev_hard_header/ipgre_header just assumes that there is enough room in
the packet to skb_push(skb, t->hlen + sizeof(*iph)). Which may be
false if this header length had not been reserved.

Though I've mainly looked at packet_snd. Perhaps you are referring to
tpacket_snd?
Xie He Oct. 8, 2020, 7:16 p.m. UTC | #4
On Thu, Oct 8, 2020 at 12:04 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 1:34 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > The uninit data is allocated by packet_alloc_skb(), if dev->hard_header_len
> > is 0 and 'len' is anything between [0, tunnel->hlen + sizeof(struct iphdr)),
> > dev_validate_header() still returns true obviously but only 'len'
> > bytes are copied
> > from user-space by skb_copy_datagram_from_iter(). Therefore, those bytes
> > within range (len, tunnel->hlen + sizeof(struct iphdr)] are uninitialized.
>
> With dev->hard_header_len of zero, packet_alloc_skb() only allocates len bytes.
>
> With SOCK_RAW, the writer is expected to write the ip and gre header
> and include these in the send len argument. The only difference I see
> is that with hard_header_len the data starts reserve bytes before
> skb_network_header, and an additional tail has been allocated that is
> not used.
>
> But this also fixes a potentially more serious bug. With SOCK_DGRAM,
> dev_hard_header/ipgre_header just assumes that there is enough room in
> the packet to skb_push(skb, t->hlen + sizeof(*iph)). Which may be
> false if this header length had not been reserved.
>
> Though I've mainly looked at packet_snd. Perhaps you are referring to
> tpacket_snd?

I think what Cong means is that hard_header_len has to be set properly
to prevent an AF_PACKET/RAW user from sending a frame that is too
short (shorter than the header length). When an AF_PACKET/RAW user
sends a frame shorter than the header length, and the code on the
sending path still expects a full header, it will read uninitialized
data.

If my understanding is right, I agree on this part.

However, there's something I don't understand in the GRE code. The
ipgre_header function only creates an IP header (20 bytes) + a GRE
base header (4 bytes), but pushes and returns "t->hlen +
sizeof(*iph)". What is t->hlen? It seems to me it is the sum of
t->tun_hlen and t->encap_hlen. What are these two?
Willem de Bruijn Oct. 8, 2020, 7:18 p.m. UTC | #5
On Thu, Oct 8, 2020 at 3:04 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 1:34 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > On Thu, Oct 8, 2020 at 4:49 AM Willem de Bruijn
> > <willemdebruijn.kernel@gmail.com> wrote:
> > >
> > > On Wed, Oct 7, 2020 at 9:22 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > >
> > > > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
> > > > conditionally. When it is set, it assumes the outer IP header is
> > > > already created before ipgre_xmit().
> > > >
> > > > This is not true when we send packets through a raw packet socket,
> > > > where L2 headers are supposed to be constructed by user. Packet
> > > > socket calls dev_validate_header() to validate the header. But
> > > > GRE tunnel does not set dev->hard_header_len, so that check can
> > > > be simply bypassed, therefore uninit memory could be passed down
> > > > to ipgre_xmit().
> > >
> > > If dev->hard_header_len is zero, the packet socket will not reserve
> > > room for the link layer header, so skb->data points to network_header.
> > > But I don't see any uninitialized packet data?
> >
> > The uninit data is allocated by packet_alloc_skb(), if dev->hard_header_len
> > is 0 and 'len' is anything between [0, tunnel->hlen + sizeof(struct iphdr)),
> > dev_validate_header() still returns true obviously but only 'len'
> > bytes are copied
> > from user-space by skb_copy_datagram_from_iter(). Therefore, those bytes
> > within range (len, tunnel->hlen + sizeof(struct iphdr)] are uninitialized.

Oh, do you mean that ipgre_xmit will process undefined data because it
calls skb_pull(skb, tunnel->hlen + sizeof(struct iphdr)) ?
Willem de Bruijn Oct. 8, 2020, 7:19 p.m. UTC | #6
On Thu, Oct 8, 2020 at 3:17 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 12:04 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Thu, Oct 8, 2020 at 1:34 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > The uninit data is allocated by packet_alloc_skb(), if dev->hard_header_len
> > > is 0 and 'len' is anything between [0, tunnel->hlen + sizeof(struct iphdr)),
> > > dev_validate_header() still returns true obviously but only 'len'
> > > bytes are copied
> > > from user-space by skb_copy_datagram_from_iter(). Therefore, those bytes
> > > within range (len, tunnel->hlen + sizeof(struct iphdr)] are uninitialized.
> >
> > With dev->hard_header_len of zero, packet_alloc_skb() only allocates len bytes.
> >
> > With SOCK_RAW, the writer is expected to write the ip and gre header
> > and include these in the send len argument. The only difference I see
> > is that with hard_header_len the data starts reserve bytes before
> > skb_network_header, and an additional tail has been allocated that is
> > not used.
> >
> > But this also fixes a potentially more serious bug. With SOCK_DGRAM,
> > dev_hard_header/ipgre_header just assumes that there is enough room in
> > the packet to skb_push(skb, t->hlen + sizeof(*iph)). Which may be
> > false if this header length had not been reserved.
> >
> > Though I've mainly looked at packet_snd. Perhaps you are referring to
> > tpacket_snd?
>
> I think what Cong means is that hard_header_len has to be set properly
> to prevent an AF_PACKET/RAW user from sending a frame that is too
> short (shorter than the header length). When an AF_PACKET/RAW user
> sends a frame shorter than the header length, and the code on the
> sending path still expects a full header, it will read uninitialized
> data.

Yes, that makes sense, thanks. Our message crossed :)

> If my understanding is right, I agree on this part.
>
> However, there's something I don't understand in the GRE code. The
> ipgre_header function only creates an IP header (20 bytes) + a GRE
> base header (4 bytes), but pushes and returns "t->hlen +
> sizeof(*iph)". What is t->hlen?

GRE is variable length depending on flags:

        tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);


> It seems to me it is the sum of
> t->tun_hlen and t->encap_hlen. What are these two?
Cong Wang Oct. 8, 2020, 7:50 p.m. UTC | #7
On Thu, Oct 8, 2020 at 12:18 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 3:04 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Thu, Oct 8, 2020 at 1:34 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > >
> > > On Thu, Oct 8, 2020 at 4:49 AM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > >
> > > > On Wed, Oct 7, 2020 at 9:22 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > > > >
> > > > > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
> > > > > conditionally. When it is set, it assumes the outer IP header is
> > > > > already created before ipgre_xmit().
> > > > >
> > > > > This is not true when we send packets through a raw packet socket,
> > > > > where L2 headers are supposed to be constructed by user. Packet
> > > > > socket calls dev_validate_header() to validate the header. But
> > > > > GRE tunnel does not set dev->hard_header_len, so that check can
> > > > > be simply bypassed, therefore uninit memory could be passed down
> > > > > to ipgre_xmit().
> > > >
> > > > If dev->hard_header_len is zero, the packet socket will not reserve
> > > > room for the link layer header, so skb->data points to network_header.
> > > > But I don't see any uninitialized packet data?
> > >
> > > The uninit data is allocated by packet_alloc_skb(), if dev->hard_header_len
> > > is 0 and 'len' is anything between [0, tunnel->hlen + sizeof(struct iphdr)),
> > > dev_validate_header() still returns true obviously but only 'len'
> > > bytes are copied
> > > from user-space by skb_copy_datagram_from_iter(). Therefore, those bytes
> > > within range (len, tunnel->hlen + sizeof(struct iphdr)] are uninitialized.
>
> Oh, do you mean that ipgre_xmit will process undefined data because it
> calls skb_pull(skb, tunnel->hlen + sizeof(struct iphdr)) ?

The syzbot report has the information for both of your questions:
https://syzkaller.appspot.com/text?tag=CrashReport&x=11845568500000

It clearly shows packet_snd() and ipgre_xmit().

Thanks.
Xie He Oct. 8, 2020, 8:10 p.m. UTC | #8
On Thu, Oct 8, 2020 at 12:20 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 3:17 PM Xie He <xie.he.0141@gmail.com> wrote:
> >
> > However, there's something I don't understand in the GRE code. The
> > ipgre_header function only creates an IP header (20 bytes) + a GRE
> > base header (4 bytes), but pushes and returns "t->hlen +
> > sizeof(*iph)". What is t->hlen?
>
> GRE is variable length depending on flags:
>
>         tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);
>
>
> > It seems to me it is the sum of
> > t->tun_hlen and t->encap_hlen. What are these two?

OK. I understand that t->tun_hlen is the GRE header length. What is
t->encap_hlen?
Willem de Bruijn Oct. 8, 2020, 8:19 p.m. UTC | #9
On Wed, Oct 7, 2020 at 9:22 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> GRE tunnel has its own header_ops, ipgre_header_ops, and sets it
> conditionally. When it is set, it assumes the outer IP header is
> already created before ipgre_xmit().
>
> This is not true when we send packets through a raw packet socket,
> where L2 headers are supposed to be constructed by user. Packet
> socket calls dev_validate_header() to validate the header. But
> GRE tunnel does not set dev->hard_header_len, so that check can
> be simply bypassed, therefore uninit memory could be passed down
> to ipgre_xmit().
>
> Fix this by setting dev->hard_header_len whenever sets header_ops,
> as dev->hard_header_len is supposed to be the length of the header
> created by dev->header_ops->create() anyway.
>
> Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com
> Cc: William Tu <u9012063@gmail.com>
> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Acked-by: Willem de Bruijn <willemb@google.com>

> The syzbot report has the information for both of your questions:
> https://syzkaller.appspot.com/text?tag=CrashReport&x=11845568500000
>
> It clearly shows packet_snd() and ipgre_xmit().

Thanks. I hadn't thought to check that (clearly).
Willem de Bruijn Oct. 8, 2020, 8:31 p.m. UTC | #10
On Thu, Oct 8, 2020 at 4:11 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 12:20 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Thu, Oct 8, 2020 at 3:17 PM Xie He <xie.he.0141@gmail.com> wrote:
> > >
> > > However, there's something I don't understand in the GRE code. The
> > > ipgre_header function only creates an IP header (20 bytes) + a GRE
> > > base header (4 bytes), but pushes and returns "t->hlen +
> > > sizeof(*iph)". What is t->hlen?
> >
> > GRE is variable length depending on flags:
> >
> >         tunnel->tun_hlen = gre_calc_hlen(tunnel->parms.o_flags);
> >
> >
> > > It seems to me it is the sum of
> > > t->tun_hlen and t->encap_hlen. What are these two?
>
> OK. I understand that t->tun_hlen is the GRE header length. What is
> t->encap_hlen?

I've looked at that closely either.

Appears to be to account for additional FOU/GUE encap:

"
commit 56328486539ddd07cbaafec7a542a2c8a3043623
Author: Tom Herbert <therbert@google.com>
Date:   Wed Sep 17 12:25:58 2014 -0700
    net: Changes to ip_tunnel to support foo-over-udp encapsulation

    This patch changes IP tunnel to support (secondary) encapsulation,
    Foo-over-UDP. Changes include:

    1) Adding tun_hlen as the tunnel header length, encap_hlen as the
       encapsulation header length, and hlen becomes the grand total
       of these.
    2) Added common netlink define to support FOU encapsulation.
    3) Routines to perform FOU encapsulation.

    Signed-off-by: Tom Herbert <therbert@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
"
Xie He Oct. 8, 2020, 9:35 p.m. UTC | #11
On Thu, Oct 8, 2020 at 1:32 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 4:11 PM Xie He <xie.he.0141@gmail.com> wrote:
> >
> > OK. I understand that t->tun_hlen is the GRE header length. What is
> > t->encap_hlen?
>
> I've looked at that closely either.
>
> Appears to be to account for additional FOU/GUE encap:
>
> "
> commit 56328486539ddd07cbaafec7a542a2c8a3043623
> Author: Tom Herbert <therbert@google.com>
> Date:   Wed Sep 17 12:25:58 2014 -0700
>     net: Changes to ip_tunnel to support foo-over-udp encapsulation
>
>     This patch changes IP tunnel to support (secondary) encapsulation,
>     Foo-over-UDP. Changes include:
>
>     1) Adding tun_hlen as the tunnel header length, encap_hlen as the
>        encapsulation header length, and hlen becomes the grand total
>        of these.
>     2) Added common netlink define to support FOU encapsulation.
>     3) Routines to perform FOU encapsulation.
>
>     Signed-off-by: Tom Herbert <therbert@google.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> "

I see the ipgre_xmit function would pull the header our header_ops
creates, and then call __gre_xmit. __gre_xmit will call
gre_build_header to complete the GRE header. gre_build_header expects
to find the base GRE header after pushing tunnel->tun_hlen. However,
if tunnel->encap_hlen is not 0, it couldn't find the base GRE header
there. Is there a problem?

Where exactly should we put the tunnel->encap_hlen header? Before the
GRE header or after?
Willem de Bruijn Oct. 8, 2020, 9:47 p.m. UTC | #12
On Thu, Oct 8, 2020 at 5:36 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 1:32 PM Willem de Bruijn
> <willemdebruijn.kernel@gmail.com> wrote:
> >
> > On Thu, Oct 8, 2020 at 4:11 PM Xie He <xie.he.0141@gmail.com> wrote:
> > >
> > > OK. I understand that t->tun_hlen is the GRE header length. What is
> > > t->encap_hlen?
> >
> > I've looked at that closely either.
> >
> > Appears to be to account for additional FOU/GUE encap:
> >
> > "
> > commit 56328486539ddd07cbaafec7a542a2c8a3043623
> > Author: Tom Herbert <therbert@google.com>
> > Date:   Wed Sep 17 12:25:58 2014 -0700
> >     net: Changes to ip_tunnel to support foo-over-udp encapsulation
> >
> >     This patch changes IP tunnel to support (secondary) encapsulation,
> >     Foo-over-UDP. Changes include:
> >
> >     1) Adding tun_hlen as the tunnel header length, encap_hlen as the
> >        encapsulation header length, and hlen becomes the grand total
> >        of these.
> >     2) Added common netlink define to support FOU encapsulation.
> >     3) Routines to perform FOU encapsulation.
> >
> >     Signed-off-by: Tom Herbert <therbert@google.com>
> >     Signed-off-by: David S. Miller <davem@davemloft.net>
> > "
>
> I see the ipgre_xmit function would pull the header our header_ops
> creates, and then call __gre_xmit. __gre_xmit will call
> gre_build_header to complete the GRE header. gre_build_header expects
> to find the base GRE header after pushing tunnel->tun_hlen. However,
> if tunnel->encap_hlen is not 0, it couldn't find the base GRE header
> there. Is there a problem?
>
> Where exactly should we put the tunnel->encap_hlen header? Before the
> GRE header or after?

The L4 tunnel infra uses the two callbacks encap_hlen (e.g.,
fou_encap_hlen) and build_header (fou_build_header) in struct
ip_tunnel_encap_ops to first allocate more space and later call back
into the specific implementation to fill that data. build_header is
called from __gre6_xmit -> ip6_tnl_xmit (or its ipv4 equivalent).
This happens after gre has pushed its header, so the headers
will come before that.
Xie He Oct. 8, 2020, 9:54 p.m. UTC | #13
On Thu, Oct 8, 2020 at 2:48 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> > I see the ipgre_xmit function would pull the header our header_ops
> > creates, and then call __gre_xmit. __gre_xmit will call
> > gre_build_header to complete the GRE header. gre_build_header expects
> > to find the base GRE header after pushing tunnel->tun_hlen. However,
> > if tunnel->encap_hlen is not 0, it couldn't find the base GRE header
> > there. Is there a problem?
> >
> > Where exactly should we put the tunnel->encap_hlen header? Before the
> > GRE header or after?
>
> The L4 tunnel infra uses the two callbacks encap_hlen (e.g.,
> fou_encap_hlen) and build_header (fou_build_header) in struct
> ip_tunnel_encap_ops to first allocate more space and later call back
> into the specific implementation to fill that data. build_header is
> called from __gre6_xmit -> ip6_tnl_xmit (or its ipv4 equivalent).
> This happens after gre has pushed its header, so the headers
> will come before that.

OK. If the t->encap_hlen header needs to be placed before the GRE
header, then I think the ipgre_header function should leave some space
before the GRE header to place the t->encap_hlen header, rather than
leaving space after the GRE header.
Xie He Oct. 8, 2020, 11:40 p.m. UTC | #14
On Thu, Oct 8, 2020 at 2:54 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> OK. If the t->encap_hlen header needs to be placed before the GRE
> header, then I think the ipgre_header function should leave some space
> before the GRE header to place the t->encap_hlen header, rather than
> leaving space after the GRE header.

I found another possible issue. Shouldn't we update hard_header_len
every time t->tun_hlen and t->hlen are updated in ipgre_link_update?
Cong Wang Oct. 9, 2020, 5:43 p.m. UTC | #15
On Thu, Oct 8, 2020 at 4:40 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> I found another possible issue. Shouldn't we update hard_header_len
> every time t->tun_hlen and t->hlen are updated in ipgre_link_update?

Good catch. It should be updated there like ->needed_headroom.
I will update my patch.

Thanks.
Xie He Oct. 9, 2020, 7:41 p.m. UTC | #16
On Fri, Oct 9, 2020 at 10:44 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 4:40 PM Xie He <xie.he.0141@gmail.com> wrote:
> >
> > I found another possible issue. Shouldn't we update hard_header_len
> > every time t->tun_hlen and t->hlen are updated in ipgre_link_update?
>
> Good catch. It should be updated there like ->needed_headroom.
> I will update my patch.

Thanks. But there is still something that I don't understand. What is
needed_headroom used for? If we are requesting space for t->encap_hlen
and t->tun_hlen in hard_header_len. Do we still need to use
needed_headroom?

Also, if we update hard_header_len or needed_headroom in
ipgre_link_update, would there be racing issues if they are updated
while skbs are being sent?

If these are indeed issues, it might not be easy to fix this driver.
Willem, do you have any thoughts?
Xie He Oct. 9, 2020, 7:51 p.m. UTC | #17
On Fri, Oct 9, 2020 at 12:41 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> Thanks. But there is still something that I don't understand. What is
> needed_headroom used for? If we are requesting space for t->encap_hlen
> and t->tun_hlen in hard_header_len. Do we still need to use
> needed_headroom?

It seems to me that the original code includes t->encap_hlen,
t->tun_hlen, and the IP header length in needed_headroom. (Right?) If
we are including these in hard_header_len, we need to move them out of
needed_headroom.

> Also, if we update hard_header_len or needed_headroom in
> ipgre_link_update, would there be racing issues if they are updated
> while skbs are being sent?
>
> If these are indeed issues, it might not be easy to fix this driver.
> Willem, do you have any thoughts?
Cong Wang Oct. 9, 2020, 8:38 p.m. UTC | #18
On Fri, Oct 9, 2020 at 12:51 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 12:41 PM Xie He <xie.he.0141@gmail.com> wrote:
> >
> > Thanks. But there is still something that I don't understand. What is
> > needed_headroom used for? If we are requesting space for t->encap_hlen
> > and t->tun_hlen in hard_header_len. Do we still need to use
> > needed_headroom?
>
> It seems to me that the original code includes t->encap_hlen,
> t->tun_hlen, and the IP header length in needed_headroom. (Right?) If
> we are including these in hard_header_len, we need to move them out of
> needed_headroom.

Interesting point. I think needed_headroom is 0 until we call
ipgre_changelink(), but needed_headroom is already being used
in multiple places for skb_cow_head() in the same file, I guess
they should be replaced with hard_head_len because for GRE tunnel
those are its link-layer header. What makes it more complicated
is that header_ops is actually set conditionally, so should be
hard_header_len/needed_head_room accordingly.

Thanks.
Cong Wang Oct. 10, 2020, 1:07 a.m. UTC | #19
On Fri, Oct 9, 2020 at 1:38 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Interesting point. I think needed_headroom is 0 until we call
> ipgre_changelink(), but needed_headroom is already being used
> in multiple places for skb_cow_head() in the same file, I guess
> they should be replaced with hard_head_len because for GRE tunnel
> those are its link-layer header. What makes it more complicated
> is that header_ops is actually set conditionally, so should be
> hard_header_len/needed_head_room accordingly.

Looking a bit deeper, I doubt the ipgre_header_ops->create is necessary,
because 1) all other tunnels devices do not have it (ip_tunnel_header_ops
only contains ->parse_protocol); 2) GRE headers are pushed in xmit
anyway, so at least SOCK_DGRAM does not need it; 3) ipgre_header()
creates the GRE header, later ipgre_xmit() pulls it back, then __gre_xmit()
builds GRE header again...

Please correct me if I am wrong.

Thanks.
Xie He Oct. 10, 2020, 3:10 a.m. UTC | #20
On Fri, Oct 9, 2020 at 6:07 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Looking a bit deeper, I doubt the ipgre_header_ops->create is necessary,
> because 1) all other tunnels devices do not have it (ip_tunnel_header_ops
> only contains ->parse_protocol); 2) GRE headers are pushed in xmit
> anyway, so at least SOCK_DGRAM does not need it; 3) ipgre_header()
> creates the GRE header, later ipgre_xmit() pulls it back, then __gre_xmit()
> builds GRE header again...

Haha, I also don't like ipgre_header_ops->create and want to get rid
of it. We are thinking the same :)
Cong Wang Oct. 10, 2020, 6:58 p.m. UTC | #21
On Fri, Oct 9, 2020 at 8:10 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 6:07 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >
> > Looking a bit deeper, I doubt the ipgre_header_ops->create is necessary,
> > because 1) all other tunnels devices do not have it (ip_tunnel_header_ops
> > only contains ->parse_protocol); 2) GRE headers are pushed in xmit
> > anyway, so at least SOCK_DGRAM does not need it; 3) ipgre_header()
> > creates the GRE header, later ipgre_xmit() pulls it back, then __gre_xmit()
> > builds GRE header again...
>
> Haha, I also don't like ipgre_header_ops->create and want to get rid
> of it. We are thinking the same :)

Great!

>
> From what I see, the flow when sending skbs (when header_ops is used)
> is like this:
>
> 1) First ipgre_header creates the IP header and the GRE base header,
> leaving space for the GRE optional fields and the "encap" header (the
> space for the "encap" header should be left before the GRE header, but
> it is incorrectly left after the GRE header).
>
> 2) Then ipgre_xmit pulls the header created by ipgre_header (but
> leaves the data there). Then it calls __gre_xmit.
>
> 3) __gre_xmit calls gre_build_header. gre_build_header will push back
> the GRE header, read the GRE base header and build the GRE optional
> fields.
>
> 4) Then __gre_xmit calls ip_tunnel_xmit. ip_tunnel_xmit will build the
> "encap" header by calling ip_tunnel_encap.
>
> So what ipgre_header does is actually creating the IP header and the
> GRE header, and leaving some holes for the GRE optional fields and the
> "encap" header to be built later.

Right. For the encap header, I'd guess it is because this is relatively new.

>
> This seems so weird to me. If a user is using an AF_PACKET/RAW socket,
> the user is supposed to do what the header_ops->create function does
> (that is, creating two headers and leaving two holes to be filled in
> later). I think no user would actually do that. That is so weird.

Well, AF_PACKET RAW socket is supposed to construct L2 headers
from the user buffer, and for a tunnel device these headers are indeed its
L2. If users do not want to do this, they can switch to DGRAM anyway.

I know how inconvenient it is to construct a GRE tunnel header, I guess
this is why all other tunnel devices do not provide a header_ops::create.
GRE tunnel is not a special case, this is why I agree on removing its
->create() although it does look like all tunnel devices should let users
construct headers by definition of RAW.

Of course, it may be too late to change this behavior even if we really
want, users may already assume there is always no tunnel header needed
to construct.

Thanks.
Xie He Oct. 10, 2020, 9:49 p.m. UTC | #22
On Sat, Oct 10, 2020 at 11:58 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 8:10 PM Xie He <xie.he.0141@gmail.com> wrote:
> >
> > This seems so weird to me. If a user is using an AF_PACKET/RAW socket,
> > the user is supposed to do what the header_ops->create function does
> > (that is, creating two headers and leaving two holes to be filled in
> > later). I think no user would actually do that. That is so weird.
>
> Well, AF_PACKET RAW socket is supposed to construct L2 headers
> from the user buffer, and for a tunnel device these headers are indeed its
> L2. If users do not want to do this, they can switch to DGRAM anyway.
>
> I know how inconvenient it is to construct a GRE tunnel header, I guess
> this is why all other tunnel devices do not provide a header_ops::create.
> GRE tunnel is not a special case, this is why I agree on removing its
> ->create() although it does look like all tunnel devices should let users
> construct headers by definition of RAW.
>
> Of course, it may be too late to change this behavior even if we really
> want, users may already assume there is always no tunnel header needed
> to construct.

Another reason why tunnel devices usually don't provide
header_ops->create might be to keep consistency with TAP devices. TAP
devices are just like tunnel (TUN) devices except that they use an
additional Ethernet header after the tunnel headers. I guess that TAP
devices would usually use Ethernet's header_ops to expose only the
Ethernet header to the user, but hide the outer tunnel headers from
the user and let them be constructed on the transmission path (so that
TAP devices would appear exactly like Ethernet devices). If we want to
keep TUN devices consistent with TAP devices, we should hide the
tunnel headers from the user for TUN devices, too.

Actually, a lot of devices expose a fake L2 header in header_ops,
instead of their real L2 header anyway. Wi-Fi devices usually pretend
to be Ethernet devices and expose an Ethernet header in header_ops. So
I think it is acceptable to not expose the real L2 header in
header_ops. (This is also what needed_headroom is created for - to
request additional header space for headers not exposed in
header_ops.)
Xie He Oct. 11, 2020, 3:55 a.m. UTC | #23
On Sat, Oct 10, 2020 at 2:49 PM Xie He <xie.he.0141@gmail.com> wrote:
>
> Another reason why tunnel devices usually don't provide
> header_ops->create might be to keep consistency with TAP devices. TAP
> devices are just like tunnel (TUN) devices except that they use an
> additional Ethernet header after the tunnel headers. I guess that TAP
> devices would usually use Ethernet's header_ops to expose only the
> Ethernet header to the user, but hide the outer tunnel headers from
> the user and let them be constructed on the transmission path (so that
> TAP devices would appear exactly like Ethernet devices). If we want to
> keep TUN devices consistent with TAP devices, we should hide the
> tunnel headers from the user for TUN devices, too.

Actually there's a "Universal TUN/TAP driver" in the kernel, which
passes L3 packets or Ethernet frames (that are being sent) back to the
user space and lets a user space program add the tunnel headers.

https://www.kernel.org/doc/Documentation/networking/tuntap.rst

In this case, we are not able to construct the tunnel headers until we
pass the packets / Ethernet frames back to the user space to the user
space program. We can only hide the tunnel headers.

To keep other TUN/TAP devices in the kernel consistent with the
"Universal TUN/TAP driver", we should hide the tunnel headers from the
user for those devices, too.

> Actually, a lot of devices expose a fake L2 header in header_ops,
> instead of their real L2 header anyway. Wi-Fi devices usually pretend
> to be Ethernet devices and expose an Ethernet header in header_ops. So
> I think it is acceptable to not expose the real L2 header in
> header_ops. (This is also what needed_headroom is created for - to
> request additional header space for headers not exposed in
> header_ops.)
Xie He Oct. 11, 2020, 2:35 p.m. UTC | #24
On Sat, Oct 10, 2020 at 11:58 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> Well, AF_PACKET RAW socket is supposed to construct L2 headers
> from the user buffer, and for a tunnel device these headers are indeed its
> L2. If users do not want to do this, they can switch to DGRAM anyway.
>
> I know how inconvenient it is to construct a GRE tunnel header, I guess
> this is why all other tunnel devices do not provide a header_ops::create.
> GRE tunnel is not a special case, this is why I agree on removing its
> ->create() although it does look like all tunnel devices should let users
> construct headers by definition of RAW.
>
> Of course, it may be too late to change this behavior even if we really
> want, users may already assume there is always no tunnel header needed
> to construct.

Sorry. I was re-thinking about whether we should expose certain
headers in header_ops or not. Now my thoughts are a little bit
different, so I want to re-reply to your email.

1.

The first reason why I think we should not expose the GRE header in
header_ops, is that in this way we can keep the un-exposed headers of
GRE devices consistent with those of GRETAP devices. (This is just
like the consistency between "universal TUN devices" and "universal
TAP devices".) This is similar to what I wrote in my previous replies.

However, after thinking I think this only applies to GRE/GRETAP but
not to some other tunnel devices. For example, for PPP, we can't
actually transmit Ethernet frames directly over PPP (but need to
encapsulate them in BCP over PPP), so there's no way to keep the
un-exposed headers of PPP devices consistent with those of "PPP TAP"
(PPP bridging) devices anyway.

2.

Second, I think the headers before the GRE header (the outer IP header
and the "encap" header) should actually belong to the underlying
network being tunneled through, and should not be viewed as L2 headers
of the tunneling network. As for the GRE header, it might be better
viewed as a "bridge" between the tunneled network and the tunneling
network, and belongs neither of the two, so it should not be viewed as
the L2 header of the tunneling network either.

However, for PPP tunnels, I think the PPP header clearly belongs to
the tunneling network's L2 header (because PPP can run directly over a
hardware link, and even if it is used for a tunnel, it needs a
specific underlying network's protocol, such as SSH or PPPoE, to
transport). So I think it is better to expose the PPP header in
header_ops.
diff mbox series

Patch

diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 4e31f23e4117..43b62095559e 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -987,10 +987,12 @@  static int ipgre_tunnel_init(struct net_device *dev)
 				return -EINVAL;
 			dev->flags = IFF_BROADCAST;
 			dev->header_ops = &ipgre_header_ops;
+			dev->hard_header_len = tunnel->hlen + sizeof(*iph);
 		}
 #endif
 	} else if (!tunnel->collect_md) {
 		dev->header_ops = &ipgre_header_ops;
+		dev->hard_header_len = tunnel->hlen + sizeof(*iph);
 	}
 
 	return ip_tunnel_init(dev);