[bpf-next,V12,0/7] bpf: New approach for BPF MTU handling

Message ID	161098881526.108067.7603213364270807261.stgit@firesoul
Headers	show Return-Path: <netdev-owner@kernel.org> Subject: [PATCH bpf-next V12 0/7] bpf: New approach for BPF MTU handling From: Jesper Dangaard Brouer <brouer@redhat.com> To: bpf@vger.kernel.org Cc: Jesper Dangaard Brouer <brouer@redhat.com>, netdev@vger.kernel.org, Daniel Borkmann <borkmann@iogearbox.net>, Alexei Starovoitov <alexei.starovoitov@gmail.com>, maze@google.com, lmb@cloudflare.com, shaun@tigera.io, Lorenzo Bianconi <lorenzo@kernel.org>, marek@cloudflare.com, John Fastabend <john.fastabend@gmail.com>, Jakub Kicinski <kuba@kernel.org>, eyal.birger@gmail.com, colrack@gmail.com Date: Mon, 18 Jan 2021 17:54:09 +0100 Message-ID: <161098881526.108067.7603213364270807261.stgit@firesoul> User-Agent: StGit/0.19 MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Precedence: bulk
Series	bpf: New approach for BPF MTU handling \| expand [bpf-next,V12,0/7] bpf: New approach for BPF MTU handling [bpf-next,V12,2/7] bpf: fix bpf_fib_lookup helper MTU check for SKB ctx [bpf-next,V12,3/7] bpf: bpf_fib_lookup return MTU value as output when looked up [bpf-next,V12,5/7] bpf: drop MTU check when doing TC-BPF redirect to ingress

Jesper Dangaard Brouer Jan. 18, 2021, 4:54 p.m. UTC

This patchset drops all the MTU checks in TC BPF-helpers that limits
growing the packet size. This is done because these BPF-helpers doesn't
take redirect into account, which can result in their MTU check being done
against the wrong netdev.

The new approach is to give BPF-programs knowledge about the MTU on a
netdev (via ifindex) and fib route lookup level. Meaning some BPF-helpers
are added and extended to make it possible to do MTU checks in the
BPF-code.

If BPF-prog doesn't comply with the MTU then the packet will eventually
get dropped as some other layer. In some cases the existing kernel MTU
checks will drop the packet, but there are also cases where BPF can bypass
these checks. Specifically doing TC-redirect from ingress step
(sch_handle_ingress) into egress code path (basically calling
dev_queue_xmit()). It is left up to driver code to handle these kind of
MTU violations.

One advantage of this approach is that it ingress-to-egress BPF-prog can
send information via packet data. With the MTU checks removed in the
helpers, and also not done in skb_do_redirect() call, this allows for an
ingress BPF-prog to communicate with an egress BPF-prog via packet data,
as long as egress BPF-prog remove this prior to transmitting packet.

This patchset is primarily focused on TC-BPF, but I've made sure that the
MTU BPF-helpers also works for XDP BPF-programs.

V2: Change BPF-helper API from lookup to check.
V3: Drop enforcement of MTU in net-core, leave it to drivers.
V4: Keep sanity limit + netdev "up" checks + rename BPF-helper.
V5: Fix uninit variable + name struct output member mtu_result.
V6: Use bpf_check_mtu() in selftest
V7: Fix logic using tot_len and add another selftest
V8: Add better selftests for BPF-helper bpf_check_mtu
V9: Remove patch that use skb_set_redirected
V10: Fix selftests and 'tot_len' MTU check like XDP
V11: Fix nitpicks in selftests
V12: Adjustments requested by Daniel

---

Jesper Dangaard Brouer (7):
      bpf: Remove MTU check in __bpf_skb_max_len
      bpf: fix bpf_fib_lookup helper MTU check for SKB ctx
      bpf: bpf_fib_lookup return MTU value as output when looked up
      bpf: add BPF-helper for MTU checking
      bpf: drop MTU check when doing TC-BPF redirect to ingress
      selftests/bpf: use bpf_check_mtu in selftest test_cls_redirect
      bpf/selftests: tests using bpf_check_mtu BPF-helper


 include/linux/netdevice.h                          |   32 +++
 include/uapi/linux/bpf.h                           |   78 +++++++
 net/core/dev.c                                     |   32 +--
 net/core/filter.c                                  |  164 +++++++++++++--
 tools/include/uapi/linux/bpf.h                     |   78 +++++++
 tools/testing/selftests/bpf/prog_tests/check_mtu.c |  216 ++++++++++++++++++++
 tools/testing/selftests/bpf/progs/test_check_mtu.c |  198 ++++++++++++++++++
 .../selftests/bpf/progs/test_cls_redirect.c        |    7 +
 8 files changed, 760 insertions(+), 45 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/check_mtu.c
 create mode 100644 tools/testing/selftests/bpf/progs/test_check_mtu.c

--

Daniel Borkmann Jan. 23, 2021, 1:35 a.m. UTC | #1

On 1/18/21 5:54 PM, Jesper Dangaard Brouer wrote:
> This BPF-helper bpf_check_mtu() works for both XDP and TC-BPF programs.

> 

> The SKB object is complex and the skb->len value (accessible from

> BPF-prog) also include the length of any extra GRO/GSO segments, but

> without taking into account that these GRO/GSO segments get added

> transport (L4) and network (L3) headers before being transmitted. Thus,

> this BPF-helper is created such that the BPF-programmer don't need to

> handle these details in the BPF-prog.

> 

> The API is designed to help the BPF-programmer, that want to do packet

> context size changes, which involves other helpers. These other helpers

> usually does a delta size adjustment. This helper also support a delta

> size (len_diff), which allow BPF-programmer to reuse arguments needed by

> these other helpers, and perform the MTU check prior to doing any actual

> size adjustment of the packet context.

> 

> It is on purpose, that we allow the len adjustment to become a negative

> result, that will pass the MTU check. This might seem weird, but it's not

> this helpers responsibility to "catch" wrong len_diff adjustments. Other

> helpers will take care of these checks, if BPF-programmer chooses to do

> actual size adjustment.

> 

> V12:

>   - Simplify segment check that calls skb_gso_validate_network_len.

>   - Helpers should return long

> 

> V9:

> - Use dev->hard_header_len (instead of ETH_HLEN)

> - Annotate with unlikely req from Daniel

> - Fix logic error using skb_gso_validate_network_len from Daniel

> 

> V6:

> - Took John's advice and dropped BPF_MTU_CHK_RELAX

> - Returned MTU is kept at L3-level (like fib_lookup)

> 

> V4: Lot of changes

>   - ifindex 0 now use current netdev for MTU lookup

>   - rename helper from bpf_mtu_check to bpf_check_mtu

>   - fix bug for GSO pkt length (as skb->len is total len)

>   - remove __bpf_len_adj_positive, simply allow negative len adj

> 

> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

> ---

>   include/uapi/linux/bpf.h       |   67 ++++++++++++++++++++++++

>   net/core/filter.c              |  111 ++++++++++++++++++++++++++++++++++++++++

>   tools/include/uapi/linux/bpf.h |   67 ++++++++++++++++++++++++

>   3 files changed, 245 insertions(+)

> 

> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h

> index 05bfc8c843dc..f17381a337ec 100644

> --- a/include/uapi/linux/bpf.h

> +++ b/include/uapi/linux/bpf.h

> @@ -3839,6 +3839,61 @@ union bpf_attr {

>    *	Return

>    *		A pointer to a struct socket on success or NULL if the file is

>    *		not a socket.

> + *

> + * long bpf_check_mtu(void *ctx, u32 ifindex, u32 *mtu_len, s32 len_diff, u64 flags)

> + *	Description

> + *		Check ctx packet size against MTU of net device (based on

> + *		*ifindex*).  This helper will likely be used in combination with

> + *		helpers that adjust/change the packet size.  The argument

> + *		*len_diff* can be used for querying with a planned size

> + *		change. This allows to check MTU prior to changing packet ctx.

> + *

> + *		Specifying *ifindex* zero means the MTU check is performed

> + *		against the current net device.  This is practical if this isn't

> + *		used prior to redirect.

> + *

> + *		The Linux kernel route table can configure MTUs on a more

> + *		specific per route level, which is not provided by this helper.

> + *		For route level MTU checks use the **bpf_fib_lookup**\ ()

> + *		helper.

> + *

> + *		*ctx* is either **struct xdp_md** for XDP programs or

> + *		**struct sk_buff** for tc cls_act programs.

> + *

> + *		The *flags* argument can be a combination of one or more of the

> + *		following values:

> + *

> + *		**BPF_MTU_CHK_SEGS**

> + *			This flag will only works for *ctx* **struct sk_buff**.

> + *			If packet context contains extra packet segment buffers

> + *			(often knows as GSO skb), then MTU check is harder to

> + *			check at this point, because in transmit path it is

> + *			possible for the skb packet to get re-segmented

> + *			(depending on net device features).  This could still be

> + *			a MTU violation, so this flag enables performing MTU

> + *			check against segments, with a different violation

> + *			return code to tell it apart. Check cannot use len_diff.

> + *

> + *		On return *mtu_len* pointer contains the MTU value of the net

> + *		device.  Remember the net device configured MTU is the L3 size,

> + *		which is returned here and XDP and TX length operate at L2.

> + *		Helper take this into account for you, but remember when using

> + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid

> + *		pointer and be initialized (to zero), else verifier will reject

> + *		BPF program.

> + *

> + *	Return

> + *		* 0 on success, and populate MTU value in *mtu_len* pointer.

> + *

> + *		* < 0 if any input argument is invalid (*mtu_len* not updated)

> + *

> + *		MTU violations return positive values, but also populate MTU

> + *		value in *mtu_len* pointer, as this can be needed for

> + *		implementing PMTU handing:

> + *

> + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**

> + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**

> + *

>    */

[...]
> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,

> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)

> +{

> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;

> +	struct net_device *dev = skb->dev;

> +	int skb_len, dev_len;

> +	int mtu;

> +

> +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))

> +		return -EINVAL;

> +

> +	dev = __dev_via_ifindex(dev, ifindex);

> +	if (unlikely(!dev))

> +		return -ENODEV;

> +

> +	mtu = READ_ONCE(dev->mtu);

> +

> +	dev_len = mtu + dev->hard_header_len;

> +	skb_len = skb->len + len_diff; /* minus result pass check */

> +	if (skb_len <= dev_len) {

> +		ret = BPF_MTU_CHK_RET_SUCCESS;

> +		goto out;

> +	}

> +	/* At this point, skb->len exceed MTU, but as it include length of all

> +	 * segments, it can still be below MTU.  The SKB can possibly get

> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user

> +	 * must choose if segs are to be MTU checked.

> +	 */

> +	if (skb_is_gso(skb)) {

> +		ret = BPF_MTU_CHK_RET_SUCCESS;

> +

> +		if (flags & BPF_MTU_CHK_SEGS &&

> +		    !skb_gso_validate_network_len(skb, mtu))

> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;


I think that looks okay overall now. One thing that will easily slip through
is that in the helper description you mentioned 'Check cannot use len_diff.'
for BPF_MTU_CHK_SEGS flag. So right now for non-zero len_diff the user
will still get BPF_MTU_CHK_RET_SUCCESS if the current length check via
skb_gso_validate_network_len(skb, mtu) passes. If it cannot be checked,
maybe enforce len_diff == 0 for gso skbs on BPF_MTU_CHK_SEGS?

> +	}

> +out:

> +	/* BPF verifier guarantees valid pointer */

> +	*mtu_len = mtu;

> +

> +	return ret;

> +}

Daniel Borkmann Jan. 23, 2021, 1:49 a.m. UTC | #2

On 1/18/21 5:54 PM, Jesper Dangaard Brouer wrote:
> Adding selftest for BPF-helper bpf_check_mtu(). Making sure

> it can be used from both XDP and TC.

> 

[...]

(small nit: your subject lines are mixed up with 'bpf/selftests' vs
  'selftests/bpf')

Jesper Dangaard Brouer Jan. 25, 2021, 8:41 a.m. UTC | #3

On Sat, 23 Jan 2021 02:35:41 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> > + *		The *flags* argument can be a combination of one or more of the

> > + *		following values:

> > + *

> > + *		**BPF_MTU_CHK_SEGS**

> > + *			This flag will only works for *ctx* **struct sk_buff**.

> > + *			If packet context contains extra packet segment buffers

> > + *			(often knows as GSO skb), then MTU check is harder to

> > + *			check at this point, because in transmit path it is

> > + *			possible for the skb packet to get re-segmented

> > + *			(depending on net device features).  This could still be

> > + *			a MTU violation, so this flag enables performing MTU

> > + *			check against segments, with a different violation

> > + *			return code to tell it apart. Check cannot use len_diff.

> > + *

> > + *		On return *mtu_len* pointer contains the MTU value of the net

> > + *		device.  Remember the net device configured MTU is the L3 size,

> > + *		which is returned here and XDP and TX length operate at L2.

> > + *		Helper take this into account for you, but remember when using

> > + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid

> > + *		pointer and be initialized (to zero), else verifier will reject

> > + *		BPF program.

> > + *

> > + *	Return

> > + *		* 0 on success, and populate MTU value in *mtu_len* pointer.

> > + *

> > + *		* < 0 if any input argument is invalid (*mtu_len* not updated)

> > + *

> > + *		MTU violations return positive values, but also populate MTU

> > + *		value in *mtu_len* pointer, as this can be needed for

> > + *		implementing PMTU handing:

> > + *

> > + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**

> > + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**

> > + *

> >    */  

> [...]

> > +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,

> > +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)

> > +{

> > +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;

> > +	struct net_device *dev = skb->dev;

> > +	int skb_len, dev_len;

> > +	int mtu;

> > +

> > +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))

> > +		return -EINVAL;

> > +

> > +	dev = __dev_via_ifindex(dev, ifindex);

> > +	if (unlikely(!dev))

> > +		return -ENODEV;

> > +

> > +	mtu = READ_ONCE(dev->mtu);

> > +

> > +	dev_len = mtu + dev->hard_header_len;

> > +	skb_len = skb->len + len_diff; /* minus result pass check */

> > +	if (skb_len <= dev_len) {

> > +		ret = BPF_MTU_CHK_RET_SUCCESS;

> > +		goto out;

> > +	}

> > +	/* At this point, skb->len exceed MTU, but as it include length of all

> > +	 * segments, it can still be below MTU.  The SKB can possibly get

> > +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user

> > +	 * must choose if segs are to be MTU checked.

> > +	 */

> > +	if (skb_is_gso(skb)) {

> > +		ret = BPF_MTU_CHK_RET_SUCCESS;

> > +

> > +		if (flags & BPF_MTU_CHK_SEGS &&

> > +		    !skb_gso_validate_network_len(skb, mtu))

> > +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;  

> 

> I think that looks okay overall now. One thing that will easily slip through

> is that in the helper description you mentioned 'Check cannot use len_diff.'

> for BPF_MTU_CHK_SEGS flag. So right now for non-zero len_diff the user

> will still get BPF_MTU_CHK_RET_SUCCESS if the current length check via

> skb_gso_validate_network_len(skb, mtu) passes. If it cannot be checked,

> maybe enforce len_diff == 0 for gso skbs on BPF_MTU_CHK_SEGS?


Ok. Do you want/think this can be enforced by the verifier or are you
simply requesting that the helper will return -EINVAL (or another errno)?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

Daniel Borkmann Jan. 25, 2021, 10:27 p.m. UTC | #4

On 1/25/21 9:41 AM, Jesper Dangaard Brouer wrote:
> On Sat, 23 Jan 2021 02:35:41 +0100

> Daniel Borkmann <daniel@iogearbox.net> wrote:

> 

>>> + *		The *flags* argument can be a combination of one or more of the

>>> + *		following values:

>>> + *

>>> + *		**BPF_MTU_CHK_SEGS**

>>> + *			This flag will only works for *ctx* **struct sk_buff**.

>>> + *			If packet context contains extra packet segment buffers

>>> + *			(often knows as GSO skb), then MTU check is harder to

>>> + *			check at this point, because in transmit path it is

>>> + *			possible for the skb packet to get re-segmented

>>> + *			(depending on net device features).  This could still be

>>> + *			a MTU violation, so this flag enables performing MTU

>>> + *			check against segments, with a different violation

>>> + *			return code to tell it apart. Check cannot use len_diff.

>>> + *

>>> + *		On return *mtu_len* pointer contains the MTU value of the net

>>> + *		device.  Remember the net device configured MTU is the L3 size,

>>> + *		which is returned here and XDP and TX length operate at L2.

>>> + *		Helper take this into account for you, but remember when using

>>> + *		MTU value in your BPF-code.  On input *mtu_len* must be a valid

>>> + *		pointer and be initialized (to zero), else verifier will reject

>>> + *		BPF program.

>>> + *

>>> + *	Return

>>> + *		* 0 on success, and populate MTU value in *mtu_len* pointer.

>>> + *

>>> + *		* < 0 if any input argument is invalid (*mtu_len* not updated)

>>> + *

>>> + *		MTU violations return positive values, but also populate MTU

>>> + *		value in *mtu_len* pointer, as this can be needed for

>>> + *		implementing PMTU handing:

>>> + *

>>> + *		* **BPF_MTU_CHK_RET_FRAG_NEEDED**

>>> + *		* **BPF_MTU_CHK_RET_SEGS_TOOBIG**

>>> + *

>>>     */

>> [...]

>>> +BPF_CALL_5(bpf_skb_check_mtu, struct sk_buff *, skb,

>>> +	   u32, ifindex, u32 *, mtu_len, s32, len_diff, u64, flags)

>>> +{

>>> +	int ret = BPF_MTU_CHK_RET_FRAG_NEEDED;

>>> +	struct net_device *dev = skb->dev;

>>> +	int skb_len, dev_len;

>>> +	int mtu;

>>> +

>>> +	if (unlikely(flags & ~(BPF_MTU_CHK_SEGS)))

>>> +		return -EINVAL;

>>> +

>>> +	dev = __dev_via_ifindex(dev, ifindex);

>>> +	if (unlikely(!dev))

>>> +		return -ENODEV;

>>> +

>>> +	mtu = READ_ONCE(dev->mtu);

>>> +

>>> +	dev_len = mtu + dev->hard_header_len;

>>> +	skb_len = skb->len + len_diff; /* minus result pass check */

>>> +	if (skb_len <= dev_len) {

>>> +		ret = BPF_MTU_CHK_RET_SUCCESS;

>>> +		goto out;

>>> +	}

>>> +	/* At this point, skb->len exceed MTU, but as it include length of all

>>> +	 * segments, it can still be below MTU.  The SKB can possibly get

>>> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user

>>> +	 * must choose if segs are to be MTU checked.

>>> +	 */

>>> +	if (skb_is_gso(skb)) {

>>> +		ret = BPF_MTU_CHK_RET_SUCCESS;

>>> +

>>> +		if (flags & BPF_MTU_CHK_SEGS &&

>>> +		    !skb_gso_validate_network_len(skb, mtu))

>>> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;

>>

>> I think that looks okay overall now. One thing that will easily slip through

>> is that in the helper description you mentioned 'Check cannot use len_diff.'

>> for BPF_MTU_CHK_SEGS flag. So right now for non-zero len_diff the user

>> will still get BPF_MTU_CHK_RET_SUCCESS if the current length check via

>> skb_gso_validate_network_len(skb, mtu) passes. If it cannot be checked,

>> maybe enforce len_diff == 0 for gso skbs on BPF_MTU_CHK_SEGS?

> 

> Ok. Do you want/think this can be enforced by the verifier or are you

> simply requesting that the helper will return -EINVAL (or another errno)?


Simple -EINVAL should be fine in this case. Generally, we can detect this from
verifier side but I don't think the extra complexity is worth it especially given
this is dependent on BPF_MTU_CHK_SEGS and otherwise can be non-zero.

Thanks,
Daniel

Jesper Dangaard Brouer Jan. 26, 2021, 9:13 a.m. UTC | #5

On Mon, 25 Jan 2021 23:27:22 +0100
Daniel Borkmann <daniel@iogearbox.net> wrote:

> >>> +	/* At this point, skb->len exceed MTU, but as it include length of all

> >>> +	 * segments, it can still be below MTU.  The SKB can possibly get

> >>> +	 * re-segmented in transmit path (see validate_xmit_skb).  Thus, user

> >>> +	 * must choose if segs are to be MTU checked.

> >>> +	 */

> >>> +	if (skb_is_gso(skb)) {

> >>> +		ret = BPF_MTU_CHK_RET_SUCCESS;

> >>> +

> >>> +		if (flags & BPF_MTU_CHK_SEGS &&

> >>> +		    !skb_gso_validate_network_len(skb, mtu))

> >>> +			ret = BPF_MTU_CHK_RET_SEGS_TOOBIG;  

> >>

> >> I think that looks okay overall now. One thing that will easily slip through

> >> is that in the helper description you mentioned 'Check cannot use len_diff.'

> >> for BPF_MTU_CHK_SEGS flag. So right now for non-zero len_diff the user

> >> will still get BPF_MTU_CHK_RET_SUCCESS if the current length check via

> >> skb_gso_validate_network_len(skb, mtu) passes. If it cannot be checked,

> >> maybe enforce len_diff == 0 for gso skbs on BPF_MTU_CHK_SEGS?  

> > 

> > Ok. Do you want/think this can be enforced by the verifier or are you

> > simply requesting that the helper will return -EINVAL (or another errno)?  

> 

> Simple -EINVAL should be fine in this case. Generally, we can detect this from

> verifier side but I don't think the extra complexity is worth it especially given

> this is dependent on BPF_MTU_CHK_SEGS and otherwise can be non-zero.


Luckily this was also my choice in V13 that I've already send out.

https://lore.kernel.org/netdev/161159457239.321749.9067604476261493815.stgit@firesoul/
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

[bpf-next,V12,0/7] bpf: New approach for BPF MTU handling

Message

Comments