Message ID | e5d05e56bf41de82f10d33229b8a8f6b49290e98.1690332693.git.yan@cloudflare.com |
---|---|
State | New |
Series | bpf: return proper error codes for lwt redirect |
On Wed, Jul 26, 2023 at 04:39:08PM +0300, Dan Carpenter wrote:
> I'm not positive I understand the code in ip_finish_output2(). I think
> instead of looking for LWTUNNEL_XMIT_DONE it should instead look for
> != LWTUNNEL_XMIT_CONTINUE. It's unfortunate that NET_XMIT_DROP and
> LWTUNNEL_XMIT_CONTINUE are the both 0x1. Why don't we just change that
> instead?
>
I considered about changing lwt side logic. But it would bring larger
impact since there are multiple types of encaps on this hook, not just
bpf redirect. Changing bpf return values is a minimum change on the
other hand. In addition, returning value of NET_RX_DROP and
NET_XMIT_CN are the same, so if we don't do something in bpf redirect,
there is no way to distinguish them later: the former is considered as
an error, while "CN" is considered as non-error.

> Also there seems to be a leak in lwtunnel_xmit(). Should that return
> LWTUNNEL_XMIT_CONTINUE or should it call kfree_skb() before returning?
>
> Something like the following?
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 11652e464f5d..375790b672bc 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -112,6 +112,9 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev);
>  #define NET_XMIT_CN 0x02 /* congestion notification */
>  #define NET_XMIT_MASK 0x0f /* qdisc flags in net/sch_generic.h */
>
> +#define LWTUNNEL_XMIT_DONE NET_XMIT_SUCCESS
> +#define LWTUNNEL_XMIT_CONTINUE 0x3
> +
>  /* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It
>   * indicates that the device will soon be dropping packets, or already drops
>   * some packets of the same priority; prompting us to send less aggressively. */
> diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
> index 6f15e6fa154e..8ab032ee04d0 100644
> --- a/include/net/lwtunnel.h
> +++ b/include/net/lwtunnel.h
> @@ -16,12 +16,6 @@
>  #define LWTUNNEL_STATE_INPUT_REDIRECT BIT(1)
>  #define LWTUNNEL_STATE_XMIT_REDIRECT BIT(2)
>
> -enum {
> -	LWTUNNEL_XMIT_DONE,
> -	LWTUNNEL_XMIT_CONTINUE,
> -};
> -
> -
>  struct lwtunnel_state {
>  	__u16 type;
>  	__u16 flags;
> diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> index 711cd3b4347a..732415d1287d 100644
> --- a/net/core/lwtunnel.c
> +++ b/net/core/lwtunnel.c
> @@ -371,7 +371,7 @@ int lwtunnel_xmit(struct sk_buff *skb)
>
>  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
>  	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> -		return 0;
> +		return LWTUNNEL_XMIT_CONTINUE;

You are correct this path would leak skb. Return continue (or drop)
would avoid the leak. Personally I'd prefer drop instead to signal the
error setup. Since this is a separate issue, do you want to send a
separate patch on this? Or I am happy to do it if you prefer.

>
>  	ret = -EOPNOTSUPP;
>  	rcu_read_lock();
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 6e70839257f7..4be50a211b14 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -216,7 +216,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
>  	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
>  		int res = lwtunnel_xmit(skb);
>
> -		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
> +		if (res != LWTUNNEL_XMIT_CONTINUE)
>  			return res;

Unfortunately we cannot return res directly here when res > 0. This is
the final reason why I didn't patch here. Return values here can be
propagated back to sendmsg syscall, so returning a positive value
would break the syscall convention.

best,
Yan

>  	}
>
> diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
> index 1e8c90e97608..016b0a513259 100644
> --- a/net/ipv6/ip6_output.c
> +++ b/net/ipv6/ip6_output.c
> @@ -113,7 +113,7 @@ static int ip6_finish_output2(struct net *net, struct sock *sk, struct sk_buff *
>  	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
>  		int res = lwtunnel_xmit(skb);
>
> -		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
> +		if (res != LWTUNNEL_XMIT_CONTINUE)
>  			return res;
>  	}
>
On Wed, Jul 26, 2023 at 07:14:56AM -0700, Yan Zhai wrote:
> On Wed, Jul 26, 2023 at 04:39:08PM +0300, Dan Carpenter wrote:
> > I'm not positive I understand the code in ip_finish_output2(). I think
> > instead of looking for LWTUNNEL_XMIT_DONE it should instead look for
> > != LWTUNNEL_XMIT_CONTINUE. It's unfortunate that NET_XMIT_DROP and
> > LWTUNNEL_XMIT_CONTINUE are the both 0x1. Why don't we just change that
> > instead?
> >
> I considered about changing lwt side logic. But it would bring larger
> impact since there are multiple types of encaps on this hook, not just
> bpf redirect. Changing bpf return values is a minimum change on the
> other hand. In addition, returning value of NET_RX_DROP and
> NET_XMIT_CN are the same, so if we don't do something in bpf redirect,
> there is no way to distinguish them later: the former is considered as
> an error, while "CN" is considered as non-error.

Uh, NET_RX/XMIT_DROP values are 1. NET_XMIT_CN is 2.

I'm not an expert but I think what happens is that we treat NET_XMIT_CN
as success so that it takes a while for the resend to happen.
Eventually the TCP layer will detect it as a dropped packet.

>
> > Also there seems to be a leak in lwtunnel_xmit(). Should that return
> > LWTUNNEL_XMIT_CONTINUE or should it call kfree_skb() before returning?
> >
> > Something like the following?
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 11652e464f5d..375790b672bc 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -112,6 +112,9 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev);
> >  #define NET_XMIT_CN 0x02 /* congestion notification */
> >  #define NET_XMIT_MASK 0x0f /* qdisc flags in net/sch_generic.h */
> >
> > +#define LWTUNNEL_XMIT_DONE NET_XMIT_SUCCESS
> > +#define LWTUNNEL_XMIT_CONTINUE 0x3
> > +
> >  /* NET_XMIT_CN is special. It does not guarantee that this packet is lost. It
> >   * indicates that the device will soon be dropping packets, or already drops
> >   * some packets of the same priority; prompting us to send less aggressively. */
> > diff --git a/include/net/lwtunnel.h b/include/net/lwtunnel.h
> > index 6f15e6fa154e..8ab032ee04d0 100644
> > --- a/include/net/lwtunnel.h
> > +++ b/include/net/lwtunnel.h
> > @@ -16,12 +16,6 @@
> >  #define LWTUNNEL_STATE_INPUT_REDIRECT BIT(1)
> >  #define LWTUNNEL_STATE_XMIT_REDIRECT BIT(2)
> >
> > -enum {
> > -	LWTUNNEL_XMIT_DONE,
> > -	LWTUNNEL_XMIT_CONTINUE,
> > -};
> > -
> > -
> >  struct lwtunnel_state {
> >  	__u16 type;
> >  	__u16 flags;
> > diff --git a/net/core/lwtunnel.c b/net/core/lwtunnel.c
> > index 711cd3b4347a..732415d1287d 100644
> > --- a/net/core/lwtunnel.c
> > +++ b/net/core/lwtunnel.c
> > @@ -371,7 +371,7 @@ int lwtunnel_xmit(struct sk_buff *skb)
> >
> >  	if (lwtstate->type == LWTUNNEL_ENCAP_NONE ||
> >  	    lwtstate->type > LWTUNNEL_ENCAP_MAX)
> > -		return 0;
> > +		return LWTUNNEL_XMIT_CONTINUE;
>
> You are correct this path would leak skb. Return continue (or drop)
> would avoid the leak. Personally I'd prefer drop instead to signal the
> error setup. Since this is a separate issue, do you want to send a
> separate patch on this? Or I am happy to do it if you prefer.
>

I don't know which makes sense so I'll leave that up to you.

> >
> >  	ret = -EOPNOTSUPP;
> >  	rcu_read_lock();
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index 6e70839257f7..4be50a211b14 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -216,7 +216,7 @@ static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *s
> >  	if (lwtunnel_xmit_redirect(dst->lwtstate)) {
> >  		int res = lwtunnel_xmit(skb);
> >
> > -		if (res < 0 || res == LWTUNNEL_XMIT_DONE)
> > +		if (res != LWTUNNEL_XMIT_CONTINUE)
> >  			return res;
>
> Unfortunately we cannot return res directly here when res > 0. This is
> the final reason why I didn't patch here. Return values here can be
> propagated back to sendmsg syscall, so returning a positive value
> would break the syscall convention.

The neigh_output() function is going to return NET_XMIT_DROP so this
already happens. Is that not what we want to happen?

I guess my concern is that eventually people will introduce new bugs.
Fixing incorrect error codes is something that I do several times per
week. :P

regards,
dan carpenter
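
To make the value collision discussed above concrete, here is a minimal userspace sketch. The constants are the ones quoted in the thread and in the proposed diff; the two helper functions are hypothetical names introduced only for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Values as quoted in the thread (include/linux/netdevice.h). */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01 /* skb dropped */
#define NET_XMIT_CN      0x02 /* congestion notification */

/* The enum that the proposed diff removes from include/net/lwtunnel.h. */
enum {
	LWTUNNEL_XMIT_DONE,     /* 0 */
	LWTUNNEL_XMIT_CONTINUE, /* 1, the same value as NET_XMIT_DROP */
};

/* With the current enum, a drop status cannot be told apart from CONTINUE. */
static_assert(NET_XMIT_DROP == LWTUNNEL_XMIT_CONTINUE,
	      "NET_XMIT_DROP and LWTUNNEL_XMIT_CONTINUE collide");

/* Hypothetical helper: the pre-patch test in ip_finish_output2(). */
static bool old_check_returns_early(int res)
{
	return res < 0 || res == LWTUNNEL_XMIT_DONE;
}

/* Hypothetical helper: the test proposed in the quoted diff. */
static bool proposed_check_returns_early(int res)
{
	return res != LWTUNNEL_XMIT_CONTINUE;
}
```

With these values, a NET_XMIT_DROP that leaks out of a redirect looks like "continue" to both checks, so the caller would carry on with an skb that may already be gone. Changing only the comparison does not help unless the numeric value of LWTUNNEL_XMIT_CONTINUE also moves, which is what the quoted diff does by setting it to 0x3.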
On 7/31/23 4:01 PM, Yan Zhai wrote:
> What I commented was an exact same issue at different location: BPF
> reroute may trigger the crash as well, since it also returns
> dev_queue_xmit status in bpf_xmit. Need to fix this, or instead fixing
> LWTUNNEL_XMIT_CONTINUE value and correct the behavior at lwtunnel_xmit
> rather than bpf_xmit.

Ah. I think I got it. You meant the bpf_lwt_xmit_reroute() /
BPF_LWT_REROUTE case? It would be clearer if some of these names were
quoted instead. "reroute" could mean many things. Please put details
comment in v5. Thanks.

>
> Yan
>
>>
>>> As Dan suggested, packets might not have been freed when this is
>>> returned from drivers. The caller of dev_queue_xmit might need to free
>>> skb when this happens.
>>>
>>> Yan
>>
On Mon, Jul 31, 2023 at 9:26 AM Dan Carpenter <dan.carpenter@linaro.org> wrote:
>
> I'm not a networking person, but I was looking at some use after free
> static checker warnings.

Did you refer to the gist I posted or something new?

>
> Apparently the rule with xmit functions is that if they return a value
> > 15 then that means the skb was not freed. Otherwise it's supposed to
> be freed. So like NETDEV_TX_BUSY is 0x10 so it's not freed.
>
> This is checked with using the dev_xmit_complete() function. So I feel
> like it would make sense for LWTUNNEL_XMIT_CONTINUE to return higher
> than 15.

Yes I am adopting your suggestion in v5. Dealing with NETDEV_TX_BUSY
would be left as another item (potentially more suited for netdev
rather than bpf). Would be great to find a reproduction of memleak.

>
> Because that's the bug right? The original code was assuming that
> everything besides LWTUNNEL_XMIT_DONE was freed.
>
> regards,
> dan carpenter
>
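
The "skb was consumed" rule mentioned above can be sketched in userspace as follows. This assumes dev_xmit_complete() boils down to a comparison against NET_XMIT_MASK, as recalled from include/linux/netdevice.h; xmit_complete() and EXAMPLE_LWT_CONTINUE are hypothetical names, and 0x100 is only an illustrative value above the mask, since the thread itself only says "higher than 15".

```c
#include <assert.h>
#include <stdbool.h>

#define NET_XMIT_MASK  0x0f /* qdisc flags in net/sch_generic.h */
#define NETDEV_TX_BUSY 0x10 /* driver did not consume the skb */

/* Illustrative only: any value above NET_XMIT_MASK would do here. */
#define EXAMPLE_LWT_CONTINUE 0x100

static_assert(NETDEV_TX_BUSY > NET_XMIT_MASK,
	      "NETDEV_TX_BUSY must sit above the qdisc status range");

/*
 * Userspace stand-in for the idea behind dev_xmit_complete(): negative
 * errnos and small NET_XMIT_* codes mean the skb has been consumed,
 * anything at or above the mask means the caller still owns it.
 */
static bool xmit_complete(int rc)
{
	return rc < NET_XMIT_MASK;
}
```

Under this rule, a CONTINUE code placed above the mask can never be mistaken for a status that implies the skb was freed, which appears to be the motivation for moving LWTUNNEL_XMIT_CONTINUE out of the 0..15 range.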
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b828c7a75be2..520d808eec15 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -87,6 +87,8 @@ void netdev_sw_irq_coalesce_default_on(struct net_device *dev);
 #define NET_RX_SUCCESS 0 /* keep 'em coming, baby */
 #define NET_RX_DROP 1 /* packet dropped */

+#define net_rx_errno(e) ((e) == NET_RX_DROP ? -ENOBUFS : (e))
+
 #define MAX_NEST_DEV 8

 /*
diff --git a/net/core/filter.c b/net/core/filter.c
index 06ba0e56e369..564104543737 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2095,7 +2095,9 @@ static const struct bpf_func_proto bpf_csum_level_proto = {

 static inline int __bpf_rx_skb(struct net_device *dev, struct sk_buff *skb)
 {
-	return dev_forward_skb_nomtu(dev, skb);
+	int ret = dev_forward_skb_nomtu(dev, skb);
+
+	return net_rx_errno(ret);
 }

 static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
@@ -2108,7 +2110,7 @@ static inline int __bpf_rx_skb_no_mac(struct net_device *dev,
 		ret = netif_rx(skb);
 	}

-	return ret;
+	return net_rx_errno(ret);
 }

 static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
@@ -2129,6 +2131,9 @@ static inline int __bpf_tx_skb(struct net_device *dev, struct sk_buff *skb)
 	ret = dev_queue_xmit(skb);
 	dev_xmit_recursion_dec();

+	if (unlikely(ret > 0))
+		ret = net_xmit_errno(ret);
+
 	return ret;
 }
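
As a rough illustration of what the patch above does to return values, the following userspace sketch reproduces the two mappings. net_rx_errno() is copied from the patch; net_xmit_errno() is assumed to match the existing kernel macro, and tx_status_to_errno() is a hypothetical stand-in for the tail of __bpf_tx_skb().

```c
#include <assert.h>
#include <errno.h>

#define NET_RX_SUCCESS 0
#define NET_RX_DROP    1    /* packet dropped */
#define NET_XMIT_DROP  0x01 /* skb dropped */
#define NET_XMIT_CN    0x02 /* congestion notification */

/* CN has to stay distinguishable from DROP for the mapping to work. */
static_assert(NET_XMIT_CN != NET_XMIT_DROP,
	      "NET_XMIT_CN and NET_XMIT_DROP must differ");

/* As added by the patch. */
#define net_rx_errno(e) ((e) == NET_RX_DROP ? -ENOBUFS : (e))

/* Assumption: mirrors the macro already present in netdevice.h. */
#define net_xmit_errno(e) ((e) != NET_XMIT_CN ? -ENOBUFS : 0)

/* Hypothetical stand-in for the end of __bpf_tx_skb() after the patch. */
static int tx_status_to_errno(int ret)
{
	if (ret > 0)
		ret = net_xmit_errno(ret);
	return ret;
}
```

After this normalization, the BPF redirect paths only ever hand back 0 or a negative errno, so no positive NET_RX_* or NET_XMIT_* code can be confused with LWTUNNEL_XMIT_CONTINUE or propagate up to sendmsg().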