Message ID | 20201011191129.991-1-xiyou.wangcong@gmail.com |
---|---|
State | New |
Headers | show |
Series | [net,v2] ip_gre: set dev->hard_header_len and dev->needed_headroom properly | expand |
On Sun, Oct 11, 2020 at 12:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it > conditionally. When it is set, it assumes the outer IP header is > already created before ipgre_xmit(). > > This is not true when we send packets through a raw packet socket, > where L2 headers are supposed to be constructed by user. Packet > socket calls dev_validate_header() to validate the header. But > GRE tunnel does not set dev->hard_header_len, so that check can > be simply bypassed, therefore uninit memory could be passed down > to ipgre_xmit(). Similar for dev->needed_headroom. > > dev->hard_header_len is supposed to be the length of the header > created by dev->header_ops->create(), so it should be used whenever > header_ops is set, and dev->needed_headroom should be used when it > is not set. Hi, thanks for attempting to fix this tunnel. Are we still considering removing header_ops->create? As said in my email sent previously today, I want to remove header_ops->create because 1) this keeps the un-exposed headers of GRE devices consistent with those of GRETAP devices, and 2) I think the GRE header (and the headers before the GRE header) is not actually the L2 header of the tunnel (the Wikipedia page for "Generic Routing Encapsulation" doesn't consider this protocol to be at L2 either). I'm not sure if you still agree to remove header_ops->create. Do you still agree but think it'd be better to do that in a separate patch? Removing header_ops->create would simplify the fixing of the issue you are trying to fix, too, because that way we would no longer need to use header_ops or hard_header_len. Also, I'm worried that changing hard_header_len (or needed_headroom) in ipgre_link_update would have racing issues. If we remove header_ops, we no longer need to use hard_header_len and we can just set needed_headroom to the maximum value, so that we no longer need to update them in ipgre_link_update.
On Sun, Oct 11, 2020 at 3:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it > conditionally. When it is set, it assumes the outer IP header is > already created before ipgre_xmit(). > > This is not true when we send packets through a raw packet socket, > where L2 headers are supposed to be constructed by user. Packet > socket calls dev_validate_header() to validate the header. But > GRE tunnel does not set dev->hard_header_len, so that check can > be simply bypassed, therefore uninit memory could be passed down > to ipgre_xmit(). Similar for dev->needed_headroom. > > dev->hard_header_len is supposed to be the length of the header > created by dev->header_ops->create(), so it should be used whenever > header_ops is set, and dev->needed_headroom should be used when it > is not set. > > Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com > Cc: Xie He <xie.he.0141@gmail.com> > Cc: William Tu <u9012063@gmail.com> > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> > --- > Note, there are some other suspicious use of dev->hard_header_len in > the same file, but let's leave them to a separate patch if really > needed. There is agreement that hard_header_len should be the length of link layer headers visible to the upper layers, needed_headroom the additional room required for headers that are not exposed, i.e., those pushed inside ndo_start_xmit. The link layer header length also has to agree with the interface hardware type (ARPHRD_..). Tunnel devices have not always been consistent in this, but today "bare" ip tunnel devices without additional headers (ipip, sit, ..) do match this and advertise 0 byte hard_header_len. Bareudp, vxlan and geneve also conform to this. Known exception that probably needs to be addressed is sit, which still advertises LL_MAX_HEADER and so has exposed quite a few syzkaller issues. Side note, it is not entirely clear to me what sets ARPHRD_TUNNEL et al apart from ARPHRD_NONE and why they are needed. GRE devices advertise ARPHRD_IPGRE and GRETAP advertise ARPHRD_ETHER. The second makes sense, as it appears as an Ethernet device. The first should match "bare" ip tunnel devices, if following the above logic. Indeed, this is what commit e271c7b4420d ("gre: do not keep the GRE header around in collect medata mode") implements. It changes dev->type to ARPHRD_NONE in collect_md mode. Some of the inconsistency comes from the various modes of the GRE driver. Which brings us to ipgre_header_ops. It is set only in two special cases. Commit 6a5f44d7a048 ("[IPV4] ip_gre: sendto/recvfrom NBMA address") added ipgre_header_ops.parse to be able to receive the inner ip source address with PF_PACKET recvfrom. And apparently relies on ipgre_header_ops.create to be able to set an address, which implies SOCK_DGRAM. The other special case, CONFIG_NET_IPGRE_BROADCAST, predates git. Its implementation starts with the beautiful comment "/* Nice toy. Unfortunately, useless in real life :-)". From the rest of that detailed comment, it is not clear to me why it would need to expose the headers. The example does not use packet sockets. A packet socket cannot know devices details such as which configurable mode a device may be in. And different modes conflict with the basic rule that for a given well defined link layer type, i.e., dev->type, header length can be expected to be consistent. In an ideal world these exceptions would not exist, therefore. Unfortunately, this is legacy behavior that will have to continue to be supported. I agree that then swapping dev->needed_headroom for dev->hard_header_len at init is then a good approach. Underlying functions like ip_tunnel_xmit can modify needed_headroom at runtime, not sure how that affects runtime changes in ipgre_link_update: " max_headroom = LL_RESERVED_SPACE(rt->dst.dev) + sizeof(struct iphdr) + rt->dst.header_len + ip_encap_hlen(&tunnel->encap); if (max_headroom > dev->needed_headroom) dev->needed_headroom = max_headroom; " > net/ipv4/ip_gre.c | 14 +++++++++++--- > 1 file changed, 11 insertions(+), 3 deletions(-) > > diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c > index 4e31f23e4117..82fee0010353 100644 > --- a/net/ipv4/ip_gre.c > +++ b/net/ipv4/ip_gre.c > @@ -626,8 +626,7 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb, > > if (dev->header_ops) { > /* Need space for new headers */ > - if (skb_cow_head(skb, dev->needed_headroom - > - (tunnel->hlen + sizeof(struct iphdr)))) > + if (skb_cow_head(skb, dev->hard_header_len)) > goto free_skb; > > tnl_params = (const struct iphdr *)skb->data; > @@ -748,7 +747,11 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu) > len = tunnel->tun_hlen - len; > tunnel->hlen = tunnel->hlen + len; > > - dev->needed_headroom = dev->needed_headroom + len; > + if (dev->header_ops) > + dev->hard_header_len += len; > + else > + dev->needed_headroom += len; > + > if (set_mtu) > dev->mtu = max_t(int, dev->mtu - len, 68); > > @@ -944,6 +947,7 @@ static void __gre_tunnel_init(struct net_device *dev) > tunnel->parms.iph.protocol = IPPROTO_GRE; > > tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen; > + dev->needed_headroom = tunnel->hlen + sizeof(tunnel->parms.iph); > > dev->features |= GRE_FEATURES; > dev->hw_features |= GRE_FEATURES; > @@ -987,10 +991,14 @@ static int ipgre_tunnel_init(struct net_device *dev) > return -EINVAL; > dev->flags = IFF_BROADCAST; > dev->header_ops = &ipgre_header_ops; > + dev->hard_header_len = tunnel->hlen + sizeof(*iph); > + dev->needed_headroom = 0; here and below, perhaps more descriptive of what is being done, something like: /* If device has header_ops.create, then headers are part of hard_header_len */ swap(dev->needed_headroom, dev->hard_header_len) > } > #endif > } else if (!tunnel->collect_md) { > dev->header_ops = &ipgre_header_ops; > + dev->hard_header_len = tunnel->hlen + sizeof(*iph); > + dev->needed_headroom = 0; > } > > return ip_tunnel_init(dev); > -- > 2.28.0 >
On Sun, Oct 11, 2020 at 4:42 PM Xie He <xie.he.0141@gmail.com> wrote: > > On Sun, Oct 11, 2020 at 12:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it > > conditionally. When it is set, it assumes the outer IP header is > > already created before ipgre_xmit(). > > > > This is not true when we send packets through a raw packet socket, > > where L2 headers are supposed to be constructed by user. Packet > > socket calls dev_validate_header() to validate the header. But > > GRE tunnel does not set dev->hard_header_len, so that check can > > be simply bypassed, therefore uninit memory could be passed down > > to ipgre_xmit(). Similar for dev->needed_headroom. > > > > dev->hard_header_len is supposed to be the length of the header > > created by dev->header_ops->create(), so it should be used whenever > > header_ops is set, and dev->needed_headroom should be used when it > > is not set. > > Hi, thanks for attempting to fix this tunnel. Are we still considering > removing header_ops->create? > > As said in my email sent previously today, I want to remove > header_ops->create because 1) this keeps the un-exposed headers of GRE > devices consistent with those of GRETAP devices, and 2) I think the > GRE header (and the headers before the GRE header) is not actually the > L2 header of the tunnel (the Wikipedia page for "Generic Routing > Encapsulation" doesn't consider this protocol to be at L2 either). > > I'm not sure if you still agree to remove header_ops->create. Do you > still agree but think it'd be better to do that in a separate patch? > > Removing header_ops->create would simplify the fixing of the issue you > are trying to fix, too, because that way we would no longer need to > use header_ops or hard_header_len. Also, I'm worried that changing > hard_header_len (or needed_headroom) in ipgre_link_update would have > racing issues. If we remove header_ops, we no longer need to use > hard_header_len and we can just set needed_headroom to the maximum > value, so that we no longer need to update them in ipgre_link_update. Our messages crossed. It seems there are legacy expectations that sendto/recvfrom packet sockets allow writing/reading the outer IP address, as of commit 6a5f44d7a048 ("[IPV4] ip_gre: sendto/recvfrom NBMA address"). That is the express purpose of that commit. The behavior is inconsistent with other tunnels, as you also point out, and probably only rarely used if at all. I would love to get rid of it, but given that we cannot be certain that it is unused, I'm afraid that we have to continue to support this special case.
On Sun, Oct 11, 2020 at 5:00 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > On Sun, Oct 11, 2020 at 3:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > GRE tunnel has its own header_ops, ipgre_header_ops, and sets it > > conditionally. When it is set, it assumes the outer IP header is > > already created before ipgre_xmit(). > > > > This is not true when we send packets through a raw packet socket, > > where L2 headers are supposed to be constructed by user. Packet > > socket calls dev_validate_header() to validate the header. But > > GRE tunnel does not set dev->hard_header_len, so that check can > > be simply bypassed, therefore uninit memory could be passed down > > to ipgre_xmit(). Similar for dev->needed_headroom. > > > > dev->hard_header_len is supposed to be the length of the header > > created by dev->header_ops->create(), so it should be used whenever > > header_ops is set, and dev->needed_headroom should be used when it > > is not set. > > > > Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com > > Cc: Xie He <xie.he.0141@gmail.com> > > Cc: William Tu <u9012063@gmail.com> > > Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> > I agree that then swapping dev->needed_headroom for > dev->hard_header_len at init is then a good approach. > > Underlying functions like ip_tunnel_xmit can modify needed_headroom at > runtime, not sure how that affects runtime changes in > ipgre_link_update: Never mind, the patch only selects between needed_headroom or hard_header_link at ipgre_tunnel_init. > > @@ -987,10 +991,14 @@ static int ipgre_tunnel_init(struct net_device *dev) > > return -EINVAL; > > dev->flags = IFF_BROADCAST; > > dev->header_ops = &ipgre_header_ops; > > + dev->hard_header_len = tunnel->hlen + sizeof(*iph); > > + dev->needed_headroom = 0; > > here and below, perhaps more descriptive of what is being done, something like: > > /* If device has header_ops.create, then headers are part of hard_header_len */ > swap(dev->needed_headroom, dev->hard_header_len) Arguably nitpicking. Otherwise, patch looks great to me, thanks, so Acked-by: Willem de Bruijn <willemb@google.com>
On Sun, Oct 11, 2020 at 2:07 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > On Sun, Oct 11, 2020 at 4:42 PM Xie He <xie.he.0141@gmail.com> wrote: > > > > Hi, thanks for attempting to fix this tunnel. Are we still considering > > removing header_ops->create? > > > > As said in my email sent previously today, I want to remove > > header_ops->create because 1) this keeps the un-exposed headers of GRE > > devices consistent with those of GRETAP devices, and 2) I think the > > GRE header (and the headers before the GRE header) is not actually the > > L2 header of the tunnel (the Wikipedia page for "Generic Routing > > Encapsulation" doesn't consider this protocol to be at L2 either). > > > > I'm not sure if you still agree to remove header_ops->create. Do you > > still agree but think it'd be better to do that in a separate patch? > > > > Removing header_ops->create would simplify the fixing of the issue you > > are trying to fix, too, because that way we would no longer need to > > use header_ops or hard_header_len. Also, I'm worried that changing > > hard_header_len (or needed_headroom) in ipgre_link_update would have > > racing issues. If we remove header_ops, we no longer need to use > > hard_header_len and we can just set needed_headroom to the maximum > > value, so that we no longer need to update them in ipgre_link_update. > > Our messages crossed. > > It seems there are legacy expectations that sendto/recvfrom packet > sockets allow writing/reading the outer IP address, as of commit > 6a5f44d7a048 ("[IPV4] ip_gre: sendto/recvfrom NBMA address"). That is > the express purpose of that commit. > > The behavior is inconsistent with other tunnels, as you also point > out, and probably only rarely used if at all. I would love to get rid > of it, but given that we cannot be certain that it is unused, I'm afraid > that we have to continue to support this special case. OK. At least we agree that in an ideal world header_ops for GRE devices should be removed. Thanks, Willem.
On Sun, Oct 11, 2020 at 2:01 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > There is agreement that hard_header_len should be the length of link > layer headers visible to the upper layers, needed_headroom the > additional room required for headers that are not exposed, i.e., those > pushed inside ndo_start_xmit. > > The link layer header length also has to agree with the interface > hardware type (ARPHRD_..). > > Tunnel devices have not always been consistent in this, but today > "bare" ip tunnel devices without additional headers (ipip, sit, ..) do > match this and advertise 0 byte hard_header_len. Bareudp, vxlan and > geneve also conform to this. Known exception that probably needs to be > addressed is sit, which still advertises LL_MAX_HEADER and so has > exposed quite a few syzkaller issues. Side note, it is not entirely > clear to me what sets ARPHRD_TUNNEL et al apart from ARPHRD_NONE and > why they are needed. > > GRE devices advertise ARPHRD_IPGRE and GRETAP advertise ARPHRD_ETHER. > The second makes sense, as it appears as an Ethernet device. The first > should match "bare" ip tunnel devices, if following the above logic. > Indeed, this is what commit e271c7b4420d ("gre: do not keep the GRE > header around in collect medata mode") implements. It changes > dev->type to ARPHRD_NONE in collect_md mode. > > Some of the inconsistency comes from the various modes of the GRE > driver. Which brings us to ipgre_header_ops. It is set only in two > special cases. > > Commit 6a5f44d7a048 ("[IPV4] ip_gre: sendto/recvfrom NBMA address") > added ipgre_header_ops.parse to be able to receive the inner ip source > address with PF_PACKET recvfrom. And apparently relies on > ipgre_header_ops.create to be able to set an address, which implies > SOCK_DGRAM. > > The other special case, CONFIG_NET_IPGRE_BROADCAST, predates git. Its > implementation starts with the beautiful comment "/* Nice toy. > Unfortunately, useless in real life :-)". From the rest of that > detailed comment, it is not clear to me why it would need to expose > the headers. The example does not use packet sockets. > > A packet socket cannot know devices details such as which configurable > mode a device may be in. And different modes conflict with the basic > rule that for a given well defined link layer type, i.e., dev->type, > header length can be expected to be consistent. In an ideal world > these exceptions would not exist, therefore. Nice explanation of the situation. I agree with you. Thanks!
On Sun, Oct 11, 2020 at 12:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > @@ -626,8 +626,7 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb, > > if (dev->header_ops) { > /* Need space for new headers */ > - if (skb_cow_head(skb, dev->needed_headroom - > - (tunnel->hlen + sizeof(struct iphdr)))) > + if (skb_cow_head(skb, dev->hard_header_len)) > goto free_skb; > > tnl_params = (const struct iphdr *)skb->data; As I understand, the skb_cow functions are for ensuring enough header space before skb->data. (Right?) However, at this stage our skb->data is already at the outer IP header, I think we don't need to request additional header space before the outer IP header.
On Sun, Oct 11, 2020 at 3:46 PM Xie He <xie.he.0141@gmail.com> wrote: > > On Sun, Oct 11, 2020 at 12:11 PM Cong Wang <xiyou.wangcong@gmail.com> wrote: > > > > @@ -626,8 +626,7 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb, > > > > if (dev->header_ops) { > > /* Need space for new headers */ > > - if (skb_cow_head(skb, dev->needed_headroom - > > - (tunnel->hlen + sizeof(struct iphdr)))) > > + if (skb_cow_head(skb, dev->hard_header_len)) > > goto free_skb; > > > > tnl_params = (const struct iphdr *)skb->data; > > As I understand, the skb_cow functions are for ensuring enough header > space before skb->data. (Right?) However, at this stage our skb->data > is already at the outer IP header, I think we don't need to request > additional header space before the outer IP header. Good point, I thought skb_headroom() == dev->hard_header_len, but skb->data already points to the tunnel header like you said, so we should pass 0 to skb_cow_head() here. Thanks.
On Sun, Oct 11, 2020 at 2:01 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > There is agreement that hard_header_len should be the length of link > layer headers visible to the upper layers, needed_headroom the > additional room required for headers that are not exposed, i.e., those > pushed inside ndo_start_xmit. > > The link layer header length also has to agree with the interface > hardware type (ARPHRD_..). > > Tunnel devices have not always been consistent in this, but today > "bare" ip tunnel devices without additional headers (ipip, sit, ..) do > match this and advertise 0 byte hard_header_len. Bareudp, vxlan and > geneve also conform to this. Known exception that probably needs to be > addressed is sit, which still advertises LL_MAX_HEADER and so has > exposed quite a few syzkaller issues. Side note, it is not entirely > clear to me what sets ARPHRD_TUNNEL et al apart from ARPHRD_NONE and > why they are needed. > > GRE devices advertise ARPHRD_IPGRE and GRETAP advertise ARPHRD_ETHER. > The second makes sense, as it appears as an Ethernet device. The first > should match "bare" ip tunnel devices, if following the above logic. > Indeed, this is what commit e271c7b4420d ("gre: do not keep the GRE > header around in collect medata mode") implements. It changes > dev->type to ARPHRD_NONE in collect_md mode. > > Some of the inconsistency comes from the various modes of the GRE > driver. Which brings us to ipgre_header_ops. It is set only in two > special cases. > > Commit 6a5f44d7a048 ("[IPV4] ip_gre: sendto/recvfrom NBMA address") > added ipgre_header_ops.parse to be able to receive the inner ip source > address with PF_PACKET recvfrom. And apparently relies on > ipgre_header_ops.create to be able to set an address, which implies > SOCK_DGRAM. > > The other special case, CONFIG_NET_IPGRE_BROADCAST, predates git. Its > implementation starts with the beautiful comment "/* Nice toy. > Unfortunately, useless in real life :-)". From the rest of that > detailed comment, it is not clear to me why it would need to expose > the headers. The example does not use packet sockets. > > A packet socket cannot know devices details such as which configurable > mode a device may be in. And different modes conflict with the basic > rule that for a given well defined link layer type, i.e., dev->type, > header length can be expected to be consistent. In an ideal world > these exceptions would not exist, therefore. > > Unfortunately, this is legacy behavior that will have to continue to > be supported. Thanks for your explanation. So header_ops for GRE devices is only used in 2 special situations. In normal situations, header_ops is not used for GRE devices. And we consider not using header_ops should be the ideal arrangement for GRE devices. Can we create a new dev->type (like ARPHRD_IPGRE_SPECIAL) for GRE devices that use header_ops? I guess changing dev->type will not affect the interface to the user space? This way we can solve the problem of the same dev->type having different hard_header_len values. Also, for the second special situation, if there's no obvious reason to use header_ops, maybe we can consider removing header_ops for this situation.
On Wed, Oct 14, 2020 at 4:52 AM Xie He <xie.he.0141@gmail.com> wrote: > > On Sun, Oct 11, 2020 at 2:01 PM Willem de Bruijn > <willemdebruijn.kernel@gmail.com> wrote: > > > > There is agreement that hard_header_len should be the length of link > > layer headers visible to the upper layers, needed_headroom the > > additional room required for headers that are not exposed, i.e., those > > pushed inside ndo_start_xmit. > > > > The link layer header length also has to agree with the interface > > hardware type (ARPHRD_..). > > > > Tunnel devices have not always been consistent in this, but today > > "bare" ip tunnel devices without additional headers (ipip, sit, ..) do > > match this and advertise 0 byte hard_header_len. Bareudp, vxlan and > > geneve also conform to this. Known exception that probably needs to be > > addressed is sit, which still advertises LL_MAX_HEADER and so has > > exposed quite a few syzkaller issues. Side note, it is not entirely > > clear to me what sets ARPHRD_TUNNEL et al apart from ARPHRD_NONE and > > why they are needed. > > > > GRE devices advertise ARPHRD_IPGRE and GRETAP advertise ARPHRD_ETHER. > > The second makes sense, as it appears as an Ethernet device. The first > > should match "bare" ip tunnel devices, if following the above logic. > > Indeed, this is what commit e271c7b4420d ("gre: do not keep the GRE > > header around in collect medata mode") implements. It changes > > dev->type to ARPHRD_NONE in collect_md mode. > > > > Some of the inconsistency comes from the various modes of the GRE > > driver. Which brings us to ipgre_header_ops. It is set only in two > > special cases. > > > > Commit 6a5f44d7a048 ("[IPV4] ip_gre: sendto/recvfrom NBMA address") > > added ipgre_header_ops.parse to be able to receive the inner ip source > > address with PF_PACKET recvfrom. And apparently relies on > > ipgre_header_ops.create to be able to set an address, which implies > > SOCK_DGRAM. > > > > The other special case, CONFIG_NET_IPGRE_BROADCAST, predates git. Its > > implementation starts with the beautiful comment "/* Nice toy. > > Unfortunately, useless in real life :-)". From the rest of that > > detailed comment, it is not clear to me why it would need to expose > > the headers. The example does not use packet sockets. > > > > A packet socket cannot know devices details such as which configurable > > mode a device may be in. And different modes conflict with the basic > > rule that for a given well defined link layer type, i.e., dev->type, > > header length can be expected to be consistent. In an ideal world > > these exceptions would not exist, therefore. > > > > Unfortunately, this is legacy behavior that will have to continue to > > be supported. > > Thanks for your explanation. So header_ops for GRE devices is only > used in 2 special situations. In normal situations, header_ops is not > used for GRE devices. And we consider not using header_ops should be > the ideal arrangement for GRE devices. > > Can we create a new dev->type (like ARPHRD_IPGRE_SPECIAL) for GRE > devices that use header_ops? I guess changing dev->type will not > affect the interface to the user space? This way we can solve the > problem of the same dev->type having different hard_header_len values. But does that address any real issue? If anything, it would make sense to keep ARHPHRD_IPGRE for tunnels that expect headers and switch to ARPHRD_NONE for those that do not. As the collect_md commit I mentioned above does. > Also, for the second special situation, if there's no obvious reason > to use header_ops, maybe we can consider removing header_ops for this > situation. Unfortunately, there's no knowing if some application is using this broadcast mode *with* a process using packet sockets.
On Wed, Oct 14, 2020 at 8:12 AM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > On Wed, Oct 14, 2020 at 4:52 AM Xie He <xie.he.0141@gmail.com> wrote: > > > > On Sun, Oct 11, 2020 at 2:01 PM Willem de Bruijn > > <willemdebruijn.kernel@gmail.com> wrote: > > > > > > There is agreement that hard_header_len should be the length of link > > > layer headers visible to the upper layers, needed_headroom the > > > additional room required for headers that are not exposed, i.e., those > > > pushed inside ndo_start_xmit. > > > > > > The link layer header length also has to agree with the interface > > > hardware type (ARPHRD_..). > > > > > > Tunnel devices have not always been consistent in this, but today > > > "bare" ip tunnel devices without additional headers (ipip, sit, ..) do > > > match this and advertise 0 byte hard_header_len. Bareudp, vxlan and > > > geneve also conform to this. Known exception that probably needs to be > > > addressed is sit, which still advertises LL_MAX_HEADER and so has > > > exposed quite a few syzkaller issues. Side note, it is not entirely > > > clear to me what sets ARPHRD_TUNNEL et al apart from ARPHRD_NONE and > > > why they are needed. > > > > > > GRE devices advertise ARPHRD_IPGRE and GRETAP advertise ARPHRD_ETHER. > > > The second makes sense, as it appears as an Ethernet device. The first > > > should match "bare" ip tunnel devices, if following the above logic. > > > Indeed, this is what commit e271c7b4420d ("gre: do not keep the GRE > > > header around in collect medata mode") implements. It changes > > > dev->type to ARPHRD_NONE in collect_md mode. > > > > > > Some of the inconsistency comes from the various modes of the GRE > > > driver. Which brings us to ipgre_header_ops. It is set only in two > > > special cases. > > > > > > Commit 6a5f44d7a048 ("[IPV4] ip_gre: sendto/recvfrom NBMA address") > > > added ipgre_header_ops.parse to be able to receive the inner ip source > > > address with PF_PACKET recvfrom. And apparently relies on > > > ipgre_header_ops.create to be able to set an address, which implies > > > SOCK_DGRAM. > > > > > > The other special case, CONFIG_NET_IPGRE_BROADCAST, predates git. Its > > > implementation starts with the beautiful comment "/* Nice toy. > > > Unfortunately, useless in real life :-)". From the rest of that > > > detailed comment, it is not clear to me why it would need to expose > > > the headers. The example does not use packet sockets. > > > > > > A packet socket cannot know devices details such as which configurable > > > mode a device may be in. And different modes conflict with the basic > > > rule that for a given well defined link layer type, i.e., dev->type, > > > header length can be expected to be consistent. In an ideal world > > > these exceptions would not exist, therefore. > > > > > > Unfortunately, this is legacy behavior that will have to continue to > > > be supported. > > > > Thanks for your explanation. So header_ops for GRE devices is only > > used in 2 special situations. In normal situations, header_ops is not > > used for GRE devices. And we consider not using header_ops should be > > the ideal arrangement for GRE devices. > > > > Can we create a new dev->type (like ARPHRD_IPGRE_SPECIAL) for GRE > > devices that use header_ops? I guess changing dev->type will not > > affect the interface to the user space? This way we can solve the > > problem of the same dev->type having different hard_header_len values. > > But does that address any real issue? It doesn't address any issue visible when using. Just to solve the problem of the same dev->type having different hard_header_len values which you mentioned. Making this change will not affect the user in any way. So I think it is valuable to make this change. > If anything, it would make sense to keep ARHPHRD_IPGRE for tunnels > that expect headers and switch to ARPHRD_NONE for those that do not. > As the collect_md commit I mentioned above does. I thought we agreed that ideally GRE devices would not have header_ops. Currently GRE devices (in normal situations) indeed do not use header_ops (and use ARHPHRD_IPGRE as dev->type). I think we should keep this behavior. To solve the problem of the same dev->type having different hard_header_len values which you mentioned. I think we should create a new dev->type (ARPHRD_IPGRE_SPECIAL) for GRE devices that use header_ops. Also, for collect_md, I think we should use ARHPHRD_IPGRE. I see no reason to use ARPHRD_NONE. > > Also, for the second special situation, if there's no obvious reason > > to use header_ops, maybe we can consider removing header_ops for this > > situation. > > Unfortunately, there's no knowing if some application is using this > broadcast mode *with* a process using packet sockets. We can't always keep the interface to the user space unchanged when fixing problems. When we fix drivers by adding hard_header_len or removing hard_header_len, we ARE changing the interface. I did these fixes a lot. I also changed skb->protocol when sending skbs for some drivers, which in fact was also changing the interface. It is not possible to keep the interface strictly unchanged, otherwise a lot of problems will be impossible to fix.
On Wed, Oct 14, 2020 at 3:48 PM Xie He <xie.he.0141@gmail.com> wrote: > > On Wed, Oct 14, 2020 at 8:12 AM Willem de Bruijn > <willemdebruijn.kernel@gmail.com> wrote: > > > > On Wed, Oct 14, 2020 at 4:52 AM Xie He <xie.he.0141@gmail.com> wrote: > > > > > > On Sun, Oct 11, 2020 at 2:01 PM Willem de Bruijn > > > <willemdebruijn.kernel@gmail.com> wrote: > > > > > > > > There is agreement that hard_header_len should be the length of link > > > > layer headers visible to the upper layers, needed_headroom the > > > > additional room required for headers that are not exposed, i.e., those > > > > pushed inside ndo_start_xmit. > > > > > > > > The link layer header length also has to agree with the interface > > > > hardware type (ARPHRD_..). > > > > > > > > Tunnel devices have not always been consistent in this, but today > > > > "bare" ip tunnel devices without additional headers (ipip, sit, ..) do > > > > match this and advertise 0 byte hard_header_len. Bareudp, vxlan and > > > > geneve also conform to this. Known exception that probably needs to be > > > > addressed is sit, which still advertises LL_MAX_HEADER and so has > > > > exposed quite a few syzkaller issues. Side note, it is not entirely > > > > clear to me what sets ARPHRD_TUNNEL et al apart from ARPHRD_NONE and > > > > why they are needed. > > > > > > > > GRE devices advertise ARPHRD_IPGRE and GRETAP advertise ARPHRD_ETHER. > > > > The second makes sense, as it appears as an Ethernet device. The first > > > > should match "bare" ip tunnel devices, if following the above logic. > > > > Indeed, this is what commit e271c7b4420d ("gre: do not keep the GRE > > > > header around in collect medata mode") implements. It changes > > > > dev->type to ARPHRD_NONE in collect_md mode. > > > > > > > > Some of the inconsistency comes from the various modes of the GRE > > > > driver. Which brings us to ipgre_header_ops. It is set only in two > > > > special cases. > > > > > > > > Commit 6a5f44d7a048 ("[IPV4] ip_gre: sendto/recvfrom NBMA address") > > > > added ipgre_header_ops.parse to be able to receive the inner ip source > > > > address with PF_PACKET recvfrom. And apparently relies on > > > > ipgre_header_ops.create to be able to set an address, which implies > > > > SOCK_DGRAM. > > > > > > > > The other special case, CONFIG_NET_IPGRE_BROADCAST, predates git. Its > > > > implementation starts with the beautiful comment "/* Nice toy. > > > > Unfortunately, useless in real life :-)". From the rest of that > > > > detailed comment, it is not clear to me why it would need to expose > > > > the headers. The example does not use packet sockets. > > > > > > > > A packet socket cannot know devices details such as which configurable > > > > mode a device may be in. And different modes conflict with the basic > > > > rule that for a given well defined link layer type, i.e., dev->type, > > > > header length can be expected to be consistent. In an ideal world > > > > these exceptions would not exist, therefore. > > > > > > > > Unfortunately, this is legacy behavior that will have to continue to > > > > be supported. > > > > > > Thanks for your explanation. So header_ops for GRE devices is only > > > used in 2 special situations. In normal situations, header_ops is not > > > used for GRE devices. And we consider not using header_ops should be > > > the ideal arrangement for GRE devices. > > > > > > Can we create a new dev->type (like ARPHRD_IPGRE_SPECIAL) for GRE > > > devices that use header_ops? I guess changing dev->type will not > > > affect the interface to the user space? This way we can solve the > > > problem of the same dev->type having different hard_header_len values. > > > > But does that address any real issue? > > It doesn't address any issue visible when using. Just to solve the > problem of the same dev->type having different hard_header_len values > which you mentioned. Making this change will not affect the user in > any way. So I think it is valuable to make this change. > > > If anything, it would make sense to keep ARHPHRD_IPGRE for tunnels > > that expect headers and switch to ARPHRD_NONE for those that do not. > > As the collect_md commit I mentioned above does. > > I thought we agreed that ideally GRE devices would not have > header_ops. Currently GRE devices (in normal situations) indeed do not > use header_ops (and use ARHPHRD_IPGRE as dev->type). I think we should > keep this behavior. > > To solve the problem of the same dev->type having different > hard_header_len values which you mentioned. I think we should create a > new dev->type (ARPHRD_IPGRE_SPECIAL) for GRE devices that use > header_ops. > > Also, for collect_md, I think we should use ARHPHRD_IPGRE. I see no > reason to use ARPHRD_NONE. What does ARPHRD_IPGRE define beyond ARPHRD_NONE? And same for ARPHRD_TUNNEL variants. If they are indistinguishable, they are the same and might as well have the same label. > > > Also, for the second special situation, if there's no obvious reason > > > to use header_ops, maybe we can consider removing header_ops for this > > > situation. > > > > Unfortunately, there's no knowing if some application is using this > > broadcast mode *with* a process using packet sockets. > > We can't always keep the interface to the user space unchanged when > fixing problems. When we fix drivers by adding hard_header_len or > removing hard_header_len, we ARE changing the interface. I did these > fixes a lot. I also changed skb->protocol when sending skbs for some > drivers, which in fact was also changing the interface. It is not > possible to keep the interface strictly unchanged, otherwise a lot of > problems will be impossible to fix. I understand that for bug fixes this is sometimes unavoidable. I don't think code cleanup is reason enough, though.
On Wed, Oct 14, 2020 at 1:19 PM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > On Wed, Oct 14, 2020 at 3:48 PM Xie He <xie.he.0141@gmail.com> wrote: > > > > I thought we agreed that ideally GRE devices would not have > > header_ops. Currently GRE devices (in normal situations) indeed do not > > use header_ops (and use ARHPHRD_IPGRE as dev->type). I think we should > > keep this behavior. > > > > To solve the problem of the same dev->type having different > > hard_header_len values which you mentioned. I think we should create a > > new dev->type (ARPHRD_IPGRE_SPECIAL) for GRE devices that use > > header_ops. > > > > Also, for collect_md, I think we should use ARHPHRD_IPGRE. I see no > > reason to use ARPHRD_NONE. > > What does ARPHRD_IPGRE define beyond ARPHRD_NONE? And same for > ARPHRD_TUNNEL variants. If they are indistinguishable, they are the > same and might as well have the same label. It is indeed reasonable to keep devices indistinguishable to each other having the same dev->type label. But I see a lot of devices in the kernel without header_ops having different dev->type labels. For example, ARPHRD_SLIP should be the same as ARPHRD_RAWIP. One feature distinguishing these devices might be their dev->mtu. GRE devices may have their special dev->mtu determined by the maximum IP packet size and the GRE header length. For ARPHRD_TUNNEL, it may also have its own dev->mtu. I also see it has header_ops->parse_protocol (but it doesn't have header_ops->create).
On Wed, Oct 14, 2020 at 6:38 PM Xie He <xie.he.0141@gmail.com> wrote: > > On Wed, Oct 14, 2020 at 1:19 PM Willem de Bruijn > <willemdebruijn.kernel@gmail.com> wrote: > > > > On Wed, Oct 14, 2020 at 3:48 PM Xie He <xie.he.0141@gmail.com> wrote: > > > > > > I thought we agreed that ideally GRE devices would not have > > > header_ops. Currently GRE devices (in normal situations) indeed do not > > > use header_ops (and use ARHPHRD_IPGRE as dev->type). I think we should > > > keep this behavior. > > > > > > To solve the problem of the same dev->type having different > > > hard_header_len values which you mentioned. I think we should create a > > > new dev->type (ARPHRD_IPGRE_SPECIAL) for GRE devices that use > > > header_ops. > > > > > > Also, for collect_md, I think we should use ARHPHRD_IPGRE. I see no > > > reason to use ARPHRD_NONE. > > > > What does ARPHRD_IPGRE define beyond ARPHRD_NONE? And same for > > ARPHRD_TUNNEL variants. If they are indistinguishable, they are the > > same and might as well have the same label. > > It is indeed reasonable to keep devices indistinguishable to each > other having the same dev->type label. But I see a lot of devices in > the kernel without header_ops having different dev->type labels. For > example, ARPHRD_SLIP should be the same as ARPHRD_RAWIP. One feature > distinguishing these devices might be their dev->mtu. > > GRE devices may have their special dev->mtu determined by the maximum > IP packet size and the GRE header length. > > For ARPHRD_TUNNEL, it may also have its own dev->mtu. I also see it > has header_ops->parse_protocol (but it doesn't have > header_ops->create). Actually I think dev->type can be seen from user space. For example, when you type "ip link", it will display the link type for you. So I think it is useful to keep different dev->type labels without merging them even if they appear to have no difference.
On Wed, Oct 14, 2020 at 10:25 PM Xie He <xie.he.0141@gmail.com> wrote: > > On Wed, Oct 14, 2020 at 6:38 PM Xie He <xie.he.0141@gmail.com> wrote: > > > > On Wed, Oct 14, 2020 at 1:19 PM Willem de Bruijn > > <willemdebruijn.kernel@gmail.com> wrote: > > > > > > On Wed, Oct 14, 2020 at 3:48 PM Xie He <xie.he.0141@gmail.com> wrote: > > > > > > > > I thought we agreed that ideally GRE devices would not have > > > > header_ops. Currently GRE devices (in normal situations) indeed do not > > > > use header_ops (and use ARHPHRD_IPGRE as dev->type). I think we should > > > > keep this behavior. > > > > > > > > To solve the problem of the same dev->type having different > > > > hard_header_len values which you mentioned. I think we should create a > > > > new dev->type (ARPHRD_IPGRE_SPECIAL) for GRE devices that use > > > > header_ops. > > > > > > > > Also, for collect_md, I think we should use ARHPHRD_IPGRE. I see no > > > > reason to use ARPHRD_NONE. > > > > > > What does ARPHRD_IPGRE define beyond ARPHRD_NONE? And same for > > > ARPHRD_TUNNEL variants. If they are indistinguishable, they are the > > > same and might as well have the same label. > > > > It is indeed reasonable to keep devices indistinguishable to each > > other having the same dev->type label. But I see a lot of devices in > > the kernel without header_ops having different dev->type labels. For > > example, ARPHRD_SLIP should be the same as ARPHRD_RAWIP. One feature > > distinguishing these devices might be their dev->mtu. > > > > GRE devices may have their special dev->mtu determined by the maximum > > IP packet size and the GRE header length. > > > > For ARPHRD_TUNNEL, it may also have its own dev->mtu. I also see it > > has header_ops->parse_protocol (but it doesn't have > > header_ops->create). > > Actually I think dev->type can be seen from user space. For example, > when you type "ip link", it will display the link type for you. So I > think it is useful to keep different dev->type labels without merging > them even if they appear to have no difference. Ah, indeed. These constants are matched in iproute2 in lib/ll_types.c to string representations. Good catch. Yes, then they have to stay as is.
On Thu, Oct 15, 2020 at 6:42 AM Willem de Bruijn <willemdebruijn.kernel@gmail.com> wrote: > > On Wed, Oct 14, 2020 at 10:25 PM Xie He <xie.he.0141@gmail.com> wrote: > > > > Actually I think dev->type can be seen from user space. For example, > > when you type "ip link", it will display the link type for you. So I > > think it is useful to keep different dev->type labels without merging > > them even if they appear to have no difference. > > Ah, indeed. These constants are matched in iproute2 in lib/ll_types.c > to string representations. > > Good catch. Yes, then they have to stay as is. So in this case it may be better to keep dev->type as ARHPHRD_IPGRE for GRE devices with or without header_ops. This way we can avoid the need to update iproute2. We can still consider changing GRE devices in collect_md mode from ARPHRD_NONE to ARHPHRD_IPGRE. The original author who changed it to ARPHRD_NONE might assume ARHPHRD_IPGRE is for GRE devices with header_ops, so he felt he needed to distinguish GRE devices without header_ops by dev->type. However, ARHPHRD_IPGRE is actually already used for GRE devices with header_ops AND without header_ops. So it doesn't hurt to use ARHPHRD_IPGRE for GRE devices in collect_md mode, too. This would also make iproute2 to correctly display GRE devices in collect_md mode as GRE devices.
On Thu, Oct 15, 2020 at 3:19 PM Xie He <xie.he.0141@gmail.com> wrote: > > On Thu, Oct 15, 2020 at 6:42 AM Willem de Bruijn > <willemdebruijn.kernel@gmail.com> wrote: > > > > On Wed, Oct 14, 2020 at 10:25 PM Xie He <xie.he.0141@gmail.com> wrote: > > > > > > Actually I think dev->type can be seen from user space. For example, > > > when you type "ip link", it will display the link type for you. So I > > > think it is useful to keep different dev->type labels without merging > > > them even if they appear to have no difference. > > > > Ah, indeed. These constants are matched in iproute2 in lib/ll_types.c > > to string representations. > > > > Good catch. Yes, then they have to stay as is. > > So in this case it may be better to keep dev->type as ARHPHRD_IPGRE > for GRE devices with or without header_ops. This way we can avoid the > need to update iproute2. > > We can still consider changing GRE devices in collect_md mode from > ARPHRD_NONE to ARHPHRD_IPGRE. The original author who changed it to > ARPHRD_NONE might assume ARHPHRD_IPGRE is for GRE devices with > header_ops, so he felt he needed to distinguish GRE devices without > header_ops by dev->type. However, ARHPHRD_IPGRE is actually already > used for GRE devices with header_ops AND without header_ops. So it > doesn't hurt to use ARHPHRD_IPGRE for GRE devices in collect_md mode, > too. This would also make iproute2 to correctly display GRE devices in > collect_md mode as GRE devices. That's true. Whenever you make user visible changes there is some chance of breakage. I'm not sure it's worth the risk here. The present state is perhaps not ideal, but not broken.
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c index 4e31f23e4117..82fee0010353 100644 --- a/net/ipv4/ip_gre.c +++ b/net/ipv4/ip_gre.c @@ -626,8 +626,7 @@ static netdev_tx_t ipgre_xmit(struct sk_buff *skb, if (dev->header_ops) { /* Need space for new headers */ - if (skb_cow_head(skb, dev->needed_headroom - - (tunnel->hlen + sizeof(struct iphdr)))) + if (skb_cow_head(skb, dev->hard_header_len)) goto free_skb; tnl_params = (const struct iphdr *)skb->data; @@ -748,7 +747,11 @@ static void ipgre_link_update(struct net_device *dev, bool set_mtu) len = tunnel->tun_hlen - len; tunnel->hlen = tunnel->hlen + len; - dev->needed_headroom = dev->needed_headroom + len; + if (dev->header_ops) + dev->hard_header_len += len; + else + dev->needed_headroom += len; + if (set_mtu) dev->mtu = max_t(int, dev->mtu - len, 68); @@ -944,6 +947,7 @@ static void __gre_tunnel_init(struct net_device *dev) tunnel->parms.iph.protocol = IPPROTO_GRE; tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen; + dev->needed_headroom = tunnel->hlen + sizeof(tunnel->parms.iph); dev->features |= GRE_FEATURES; dev->hw_features |= GRE_FEATURES; @@ -987,10 +991,14 @@ static int ipgre_tunnel_init(struct net_device *dev) return -EINVAL; dev->flags = IFF_BROADCAST; dev->header_ops = &ipgre_header_ops; + dev->hard_header_len = tunnel->hlen + sizeof(*iph); + dev->needed_headroom = 0; } #endif } else if (!tunnel->collect_md) { dev->header_ops = &ipgre_header_ops; + dev->hard_header_len = tunnel->hlen + sizeof(*iph); + dev->needed_headroom = 0; } return ip_tunnel_init(dev);
GRE tunnel has its own header_ops, ipgre_header_ops, and sets it conditionally. When it is set, it assumes the outer IP header is already created before ipgre_xmit(). This is not true when we send packets through a raw packet socket, where L2 headers are supposed to be constructed by user. Packet socket calls dev_validate_header() to validate the header. But GRE tunnel does not set dev->hard_header_len, so that check can be simply bypassed, therefore uninit memory could be passed down to ipgre_xmit(). Similar for dev->needed_headroom. dev->hard_header_len is supposed to be the length of the header created by dev->header_ops->create(), so it should be used whenever header_ops is set, and dev->needed_headroom should be used when it is not set. Reported-and-tested-by: syzbot+4a2c52677a8a1aa283cb@syzkaller.appspotmail.com Cc: Xie He <xie.he.0141@gmail.com> Cc: William Tu <u9012063@gmail.com> Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> --- Note, there are some other suspicious use of dev->hard_header_len in the same file, but let's leave them to a separate patch if really needed. net/ipv4/ip_gre.c | 14 +++++++++++--- 1 file changed, 11 insertions(+), 3 deletions(-)