Message ID | 161098881526.108067.7603213364270807261.stgit@firesoul |
---|---|
Headers | show |
Series | bpf: New approach for BPF MTU handling | expand |
On 1/18/21 5:54 PM, Jesper Dangaard Brouer wrote: > This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs. > > The SKB object is complex and the skb->len value (accessible from > BPF-prog) also include the length of any extra GRO/GSO segments, but > without taking into account that these GRO/GSO segments get added > transport (L4) and network (L3) headers before being transmitted. Thus, > this BPF-helper is created such that the BPF-programmer don't need to > handle these details in the BPF-prog. > > The API is designed to help the BPF-programmer, that want to do packet > context size changes, which involves other helpers. These other helpers > usually does a delta size adjustment. This helper also support a delta > size (len_diff), which allow BPF-programmer to reuse arguments needed by > these other helpers, and perform the MTU check prior to doing any actual > size adjustment of the packet context. > > It is on purpose, that we allow the len adjustment to become a negative > result, that will pass the MTU check. This might seem weird, but it's not > this helpers responsibility to "catch" wrong len_diff adjustments. Other > helpers will take care of these checks, if BPF-programmer chooses to do > actual size adjustment. > > V12: > - Simplify segment check that calls skb_gso_validate_network_len. > - Helpers should return long > > V9: > - Use dev->hard_header_len (instead of ETH_HLEN) > - Annotate with unlikely req from Daniel > - Fix logic error using skb_gso_validate_network_len from Daniel > > V6: > - Took John's advice and dropped BPF_MTU_CHK_RELAX > - Returned MTU is kept at L3-level (like fib_lookup) > > V4: Lot of changes > - ifindex 0 now use current netdev for MTU lookup > - rename helper from bpf_mtu_check to bpf_check_mtu > - fix bug for GSO pkt length (as skb->len is total len) > - remove __bpf_len_adj_positive, simply allow negative len adj > > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> > --- > include/uapi/linux/bpf.h | 67 ++++++++++++++++++++++++ > net/core/filter.c | 111 ++++++++++++++++++++++++++++++++++++++++ > tools/include/uapi/linux/bpf.h | 67 ++++++++++++++++++++++++ > 3 files changed, 245 insertions(+) > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > index 05bfc8c843dc..f17381a337ec 100644 > --- a/include/uapi/linux/bpf.h > +++ b/include/uapi/linux/bpf.h > @@ -3839,6 +3839,61 @@ union bpf_attr { > * Return > * A pointer to a struct socket on success or NULL if the file is > * not a socket. > + * > + * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags) > + * Description > + * Check ctx packet size against MTU of net device (based on > + * *ifindex*). This helper will likely be used in combination with > + * helpers that adjust/change the packet size. The argument > + * *len_diff* can be used for querying with a planned size > + * change. This allows to check MTU prior to changing packet ctx. > + * > + * Specifying *ifindex* zero means the MTU check is performed > + * against the current net device. This is practical if this isn't > + * used prior to redirect. > + * > + * The Linux kernel route table can configure MTUs on a more > + * specific per route level, which is not provided by this helper. > + * For route level MTU checks use the **bpf_fib_lookup**\ () > + * helper. > + * > + * *ctx* is either **struct xdp_md** for XDP programs or > + * **struct sk_buff** for tc cls_act programs. > + * > + * The *flags* argument can be a combination of one or more of the > + * following values: > + * > + * **BPF_MTU_CHK_SEGS** > + * This flag will only works for *ctx* **struct sk_buff**. > + * If packet context contains extra packet segment buffers > + * (often knows as GSO skb), then MTU check is harder to > + * check at this point, because in transmit path it is > + * possible for the skb packet to get re-segmented > + * (depending on net device features). This could still be > + * a MTU violation, so this flag enables performing MTU > + * check against segments, with a different violation > + * return code to tell it apart. Check cannot use len_diff. > + * > + * On return *mtu_len* pointer contains the MTU value of the net > + * device. Remember the net device configured MTU is the L3 size, > + * which is returned here and XDP and TX length operate at L2. > + * Helper take this into account for you, but remember when using > + * MTU value in your BPF-code. On input *mtu_len* must be a valid > + * pointer and be initialized (to zero), else verifier will reject > + * BPF program. > + * > + * Return > + * * 0 on success, and populate MTU value in *mtu_len* pointer. > + * > + * * < 0 if any input argument is invalid (*mtu_len* not updated) > + * > + * MTU violations return positive values, but also populate MTU > + * value in *mtu_len* pointer, as this can be needed for > + * implementing PMTU handing: > + * > + * * **BPF_MTU_CHK_RET_FRAG_NEEDED** > + * * **BPF_MTU_CHK_RET_SEGS_TOOBIG** > + * > */ [...] > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) > +{ > + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > + struct net_device *dev = skb->dev; > + int skb_len, dev_len; > + int mtu; > + > + if (unlikely(flags & ~(BPF_MTU_CHK_SEGS))) > + return -EINVAL; > + > + dev = __dev_via_ifindex(dev, ifindex); > + if (unlikely(!dev)) > + return -ENODEV; > + > + mtu = READ_ONCE(dev->mtu); > + > + dev_len = mtu + dev->hard_header_len; > + skb_len = skb->len + len_diff; /* minus result pass check */ > + if (skb_len <= dev_len) { > + ret = BPF_MTU_CHK_RET_SUCCESS; > + goto out; > + } > + /* At this point, skb->len exceed MTU, but as it include length of all > + * segments, it can still be below MTU. The SKB can possibly get > + * re-segmented in transmit path (see validate_xmit_skb). Thus, user > + * must choose if segs are to be MTU checked. > + */ > + if (skb_is_gso(skb)) { > + ret = BPF_MTU_CHK_RET_SUCCESS; > + > + if (flags & BPF_MTU_CHK_SEGS && > + !skb_gso_validate_network_len(skb, mtu)) > + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; I think that looks okay overall now. One thing that will easily slip through is that in the helper description you mentioned 'Check cannot use len_diff.' for BPF_MTU_CHK_SEGS flag. So right now for non-zero len_diff the user will still get BPF_MTU_CHK_RET_SUCCESS if the current length check via skb_gso_validate_network_len(skb, mtu) passes. If it cannot be checked, maybe enforce len_diff == 0 for gso skbs on BPF_MTU_CHK_SEGS? > + } > +out: > + /* BPF verifier guarantees valid pointer */ > + *mtu_len = mtu; > + > + return ret; > +}
On 1/18/21 5:54 PM, Jesper Dangaard Brouer wrote: > Adding selftest for BPF-helper bpf_check_mtu(). Making sure > it can be used from both XDP and TC. > [...] (small nit: your subject lines are mixed up with 'bpf/selftests' vs 'selftests/bpf')
On Sat, 23 Jan 2021 02:35:41 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote: > > + * The *flags* argument can be a combination of one or more of the > > + * following values: > > + * > > + * **BPF_MTU_CHK_SEGS** > > + * This flag will only works for *ctx* **struct sk_buff**. > > + * If packet context contains extra packet segment buffers > > + * (often knows as GSO skb), then MTU check is harder to > > + * check at this point, because in transmit path it is > > + * possible for the skb packet to get re-segmented > > + * (depending on net device features). This could still be > > + * a MTU violation, so this flag enables performing MTU > > + * check against segments, with a different violation > > + * return code to tell it apart. Check cannot use len_diff. > > + * > > + * On return *mtu_len* pointer contains the MTU value of the net > > + * device. Remember the net device configured MTU is the L3 size, > > + * which is returned here and XDP and TX length operate at L2. > > + * Helper take this into account for you, but remember when using > > + * MTU value in your BPF-code. On input *mtu_len* must be a valid > > + * pointer and be initialized (to zero), else verifier will reject > > + * BPF program. > > + * > > + * Return > > + * * 0 on success, and populate MTU value in *mtu_len* pointer. > > + * > > + * * < 0 if any input argument is invalid (*mtu_len* not updated) > > + * > > + * MTU violations return positive values, but also populate MTU > > + * value in *mtu_len* pointer, as this can be needed for > > + * implementing PMTU handing: > > + * > > + * * **BPF_MTU_CHK_RET_FRAG_NEEDED** > > + * * **BPF_MTU_CHK_RET_SEGS_TOOBIG** > > + * > > */ > [...] > > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, > > + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) > > +{ > > + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; > > + struct net_device *dev = skb->dev; > > + int skb_len, dev_len; > > + int mtu; > > + > > + if (unlikely(flags & ~(BPF_MTU_CHK_SEGS))) > > + return -EINVAL; > > + > > + dev = __dev_via_ifindex(dev, ifindex); > > + if (unlikely(!dev)) > > + return -ENODEV; > > + > > + mtu = READ_ONCE(dev->mtu); > > + > > + dev_len = mtu + dev->hard_header_len; > > + skb_len = skb->len + len_diff; /* minus result pass check */ > > + if (skb_len <= dev_len) { > > + ret = BPF_MTU_CHK_RET_SUCCESS; > > + goto out; > > + } > > + /* At this point, skb->len exceed MTU, but as it include length of all > > + * segments, it can still be below MTU. The SKB can possibly get > > + * re-segmented in transmit path (see validate_xmit_skb). Thus, user > > + * must choose if segs are to be MTU checked. > > + */ > > + if (skb_is_gso(skb)) { > > + ret = BPF_MTU_CHK_RET_SUCCESS; > > + > > + if (flags & BPF_MTU_CHK_SEGS && > > + !skb_gso_validate_network_len(skb, mtu)) > > + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; > > I think that looks okay overall now. One thing that will easily slip through > is that in the helper description you mentioned 'Check cannot use len_diff.' > for BPF_MTU_CHK_SEGS flag. So right now for non-zero len_diff the user > will still get BPF_MTU_CHK_RET_SUCCESS if the current length check via > skb_gso_validate_network_len(skb, mtu) passes. If it cannot be checked, > maybe enforce len_diff == 0 for gso skbs on BPF_MTU_CHK_SEGS? Ok. Do you want/think this can be enforced by the verifier or are you simply requesting that the helper will return -EINVAL (or another errno)? -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer
On 1/25/21 9:41 AM, Jesper Dangaard Brouer wrote: > On Sat, 23 Jan 2021 02:35:41 +0100 > Daniel Borkmann <daniel@iogearbox.net> wrote: > >>> + * The *flags* argument can be a combination of one or more of the >>> + * following values: >>> + * >>> + * **BPF_MTU_CHK_SEGS** >>> + * This flag will only works for *ctx* **struct sk_buff**. >>> + * If packet context contains extra packet segment buffers >>> + * (often knows as GSO skb), then MTU check is harder to >>> + * check at this point, because in transmit path it is >>> + * possible for the skb packet to get re-segmented >>> + * (depending on net device features). This could still be >>> + * a MTU violation, so this flag enables performing MTU >>> + * check against segments, with a different violation >>> + * return code to tell it apart. Check cannot use len_diff. >>> + * >>> + * On return *mtu_len* pointer contains the MTU value of the net >>> + * device. Remember the net device configured MTU is the L3 size, >>> + * which is returned here and XDP and TX length operate at L2. >>> + * Helper take this into account for you, but remember when using >>> + * MTU value in your BPF-code. On input *mtu_len* must be a valid >>> + * pointer and be initialized (to zero), else verifier will reject >>> + * BPF program. >>> + * >>> + * Return >>> + * * 0 on success, and populate MTU value in *mtu_len* pointer. >>> + * >>> + * * < 0 if any input argument is invalid (*mtu_len* not updated) >>> + * >>> + * MTU violations return positive values, but also populate MTU >>> + * value in *mtu_len* pointer, as this can be needed for >>> + * implementing PMTU handing: >>> + * >>> + * * **BPF_MTU_CHK_RET_FRAG_NEEDED** >>> + * * **BPF_MTU_CHK_RET_SEGS_TOOBIG** >>> + * >>> */ >> [...] >>> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb, >>> + u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags) >>> +{ >>> + int ret = BPF_MTU_CHK_RET_FRAG_NEEDED; >>> + struct net_device *dev = skb->dev; >>> + int skb_len, dev_len; >>> + int mtu; >>> + >>> + if (unlikely(flags & ~(BPF_MTU_CHK_SEGS))) >>> + return -EINVAL; >>> + >>> + dev = __dev_via_ifindex(dev, ifindex); >>> + if (unlikely(!dev)) >>> + return -ENODEV; >>> + >>> + mtu = READ_ONCE(dev->mtu); >>> + >>> + dev_len = mtu + dev->hard_header_len; >>> + skb_len = skb->len + len_diff; /* minus result pass check */ >>> + if (skb_len <= dev_len) { >>> + ret = BPF_MTU_CHK_RET_SUCCESS; >>> + goto out; >>> + } >>> + /* At this point, skb->len exceed MTU, but as it include length of all >>> + * segments, it can still be below MTU. The SKB can possibly get >>> + * re-segmented in transmit path (see validate_xmit_skb). Thus, user >>> + * must choose if segs are to be MTU checked. >>> + */ >>> + if (skb_is_gso(skb)) { >>> + ret = BPF_MTU_CHK_RET_SUCCESS; >>> + >>> + if (flags & BPF_MTU_CHK_SEGS && >>> + !skb_gso_validate_network_len(skb, mtu)) >>> + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; >> >> I think that looks okay overall now. One thing that will easily slip through >> is that in the helper description you mentioned 'Check cannot use len_diff.' >> for BPF_MTU_CHK_SEGS flag. So right now for non-zero len_diff the user >> will still get BPF_MTU_CHK_RET_SUCCESS if the current length check via >> skb_gso_validate_network_len(skb, mtu) passes. If it cannot be checked, >> maybe enforce len_diff == 0 for gso skbs on BPF_MTU_CHK_SEGS? > > Ok. Do you want/think this can be enforced by the verifier or are you > simply requesting that the helper will return -EINVAL (or another errno)? Simple -EINVAL should be fine in this case. Generally, we can detect this from verifier side but I don't think the extra complexity is worth it especially given this is dependent on BPF_MTU_CHK_SEGS and otherwise can be non-zero. Thanks, Daniel
On Mon, 25 Jan 2021 23:27:22 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote: > >>> + /* At this point, skb->len exceed MTU, but as it include length of all > >>> + * segments, it can still be below MTU. The SKB can possibly get > >>> + * re-segmented in transmit path (see validate_xmit_skb). Thus, user > >>> + * must choose if segs are to be MTU checked. > >>> + */ > >>> + if (skb_is_gso(skb)) { > >>> + ret = BPF_MTU_CHK_RET_SUCCESS; > >>> + > >>> + if (flags & BPF_MTU_CHK_SEGS && > >>> + !skb_gso_validate_network_len(skb, mtu)) > >>> + ret = BPF_MTU_CHK_RET_SEGS_TOOBIG; > >> > >> I think that looks okay overall now. One thing that will easily slip through > >> is that in the helper description you mentioned 'Check cannot use len_diff.' > >> for BPF_MTU_CHK_SEGS flag. So right now for non-zero len_diff the user > >> will still get BPF_MTU_CHK_RET_SUCCESS if the current length check via > >> skb_gso_validate_network_len(skb, mtu) passes. If it cannot be checked, > >> maybe enforce len_diff == 0 for gso skbs on BPF_MTU_CHK_SEGS? > > > > Ok. Do you want/think this can be enforced by the verifier or are you > > simply requesting that the helper will return -EINVAL (or another errno)? > > Simple -EINVAL should be fine in this case. Generally, we can detect this from > verifier side but I don't think the extra complexity is worth it especially given > this is dependent on BPF_MTU_CHK_SEGS and otherwise can be non-zero. Luckily this was also my choice in V13 that I've already send out. https://lore.kernel.org/netdev/161159457239.321749.9067604476261493815.stgit@firesoul/ -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer