
[bpf-next,v2,0/8] Support defragmenting IPv(4|6) packets in BPF

Message ID cover.1677526810.git.dxu@dxuuu.xyz

Message

Daniel Xu Feb. 27, 2023, 7:51 p.m. UTC
=== Context ===

In the context of a middlebox, fragmented packets are tricky to handle.
The full 5-tuple of a packet is often only available in the first
fragment which makes enforcing consistent policy difficult. There are
really only two stateless options, neither of which is very nice:

1. Enforce policy on first fragment and accept all subsequent fragments.
   This works but may let in certain attacks or allow data exfiltration.

2. Enforce policy on first fragment and drop all subsequent fragments.
   This does not really work b/c some protocols may rely on
   fragmentation. For example, DNS may rely on oversized UDP packets for
   large responses.
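
To make the two options concrete, here is a minimal user-space C sketch
(not code from this patchset). It classifies an IPv4 packet from its
flags/fragment-offset field, as laid out in RFC 791, and applies each
stateless option. classify_frag(), stateless_accept_rest() and
stateless_drop_rest() are names made up for illustration, and the field
is assumed to already be in host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IPv4 flags/fragment-offset field (host byte order), per RFC 791. */
#define IPV4_MF      0x2000u /* "more fragments" flag */
#define IPV4_OFFMASK 0x1fffu /* fragment offset, in 8-byte units */

enum frag_kind { NOT_FRAGMENT, FIRST_FRAGMENT, LATER_FRAGMENT };

/* Hypothetical helper: decide what kind of packet we are looking at. */
static inline enum frag_kind classify_frag(uint16_t frag_off_host)
{
	uint16_t off = frag_off_host & IPV4_OFFMASK;
	bool more = (frag_off_host & IPV4_MF) != 0;

	if (off != 0)
		return LATER_FRAGMENT;	/* no L4 header here, so no ports */
	return more ? FIRST_FRAGMENT : NOT_FRAGMENT;
}

/* Option 1: policy on the first fragment, accept everything after it. */
static inline bool stateless_accept_rest(enum frag_kind kind, bool policy_allows)
{
	return kind == LATER_FRAGMENT ? true : policy_allows;
}

/* Option 2: policy on the first fragment, drop everything after it. */
static inline bool stateless_drop_rest(enum frag_kind kind, bool policy_allows)
{
	return kind == LATER_FRAGMENT ? false : policy_allows;
}

/* Small usage example. */
static inline void frag_policy_examples(void)
{
	/* A later fragment gets in under option 1 even if policy says no. */
	assert(stateless_accept_rest(classify_frag(0x00b9), false));
	/* A later fragment is lost under option 2 even if policy says yes. */
	assert(!stateless_drop_rest(classify_frag(0x00b9), true));
}
```

In both options the verdict for a later fragment ignores policy_allows
entirely, which is exactly the problem described above.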

So stateful tracking is the only sane option. RFC 8900 [0] calls this
out as well in section 6.3:

    Middleboxes [...] should process IP fragments in a manner that is
    consistent with [RFC0791] and [RFC8200]. In many cases, middleboxes
    must maintain state in order to achieve this goal.
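
As a rough illustration of what "maintaining state" could mean at its
simplest, the sketch below remembers the verdict reached on a first
fragment and reuses it for later fragments of the same datagram. It is
a user-space toy and not what this patchset does (the patchset
reassembles instead). IPv4 fragments of one datagram share source,
destination, protocol and identification, so those form the key. All
names (frag_key, frag_table, remember_verdict(), lookup_verdict()) are
hypothetical, and the toy has no timeouts, no eviction policy and no
answer for fragments that arrive before the first one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TABLE_SLOTS 64u

/* Fields that identify one fragmented IPv4 datagram (RFC 791). */
struct frag_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t id;
	uint8_t proto;
};

struct frag_entry {
	struct frag_key key;
	bool in_use;
	bool allow;
};

struct frag_table {
	struct frag_entry slots[TABLE_SLOTS];
};

static inline bool key_eq(const struct frag_key *a, const struct frag_key *b)
{
	return a->saddr == b->saddr && a->daddr == b->daddr &&
	       a->id == b->id && a->proto == b->proto;
}

/* Toy hash: picks one slot per key, collisions simply overwrite. */
static inline uint32_t key_slot(const struct frag_key *k)
{
	uint32_t h = k->saddr ^ (k->daddr * 2654435761u) ^
		     ((uint32_t)k->id << 8) ^ k->proto;

	return h % TABLE_SLOTS;
}

/* Called for a first fragment, once policy has been evaluated. */
static inline void remember_verdict(struct frag_table *t,
				    const struct frag_key *k, bool allow)
{
	struct frag_entry *e;

	assert(t && k);
	e = &t->slots[key_slot(k)];
	e->key = *k;
	e->in_use = true;
	e->allow = allow;
}

/* Called for a later fragment; returns false if no verdict is known. */
static inline bool lookup_verdict(const struct frag_table *t,
				  const struct frag_key *k, bool *allow)
{
	const struct frag_entry *e;

	assert(t && k && allow);
	e = &t->slots[key_slot(k)];
	if (!e->in_use || !key_eq(&e->key, k))
		return false;
	*allow = e->allow;
	return true;
}
```

Even this toy needs expiry and out-of-order handling before it is
usable, which is the "possibly poorly" risk of reimplementing what the
kernel already does.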

=== BPF related bits ===

However, when policy is enforced through BPF, the prog is run before the
kernel reassembles fragmented packets. This leaves BPF developers in an
awkward place: implement reassembly (possibly poorly) or use a stateless
method as described above.

Fortunately, the kernel has robust support for fragmented IP packets.
This patchset wraps the existing defragmentation facilities in kfuncs so
that BPF progs running on middleboxes can reassemble fragmented packets
before applying policy.

=== Patchset details ===

This patchset is (hopefully) relatively straightforward from a BPF perspective.
One thing I'd like to call out is the skb_copy()ing of the prog skb. I
did this to maintain the invariant that the ctx remains valid after prog
has run. This is relevant b/c ip_defrag() and ip_check_defrag() may
consume the skb if the skb is a fragment.

Originally I did play around with teaching the verifier about kfuncs
that may consume the ctx and disallowing ctx accesses in ret != 0
branches. It worked ok, but it seemed too complex to modify the
surrounding assumptions about ctx validity.

[0]: https://datatracker.ietf.org/doc/html/rfc8900

===

Changes from v1:
* Add support for ipv6 defragmentation


Daniel Xu (8):
  ip: frags: Return actual error codes from ip_check_defrag()
  bpf: verifier: Support KF_CHANGES_PKT flag
  bpf, net, frags: Add bpf_ip_check_defrag() kfunc
  net: ipv6: Factor ipv6_frag_rcv() to take netns and user
  bpf: net: ipv6: Add bpf_ipv6_frag_rcv() kfunc
  bpf: selftests: Support not connecting client socket
  bpf: selftests: Support custom type and proto for client sockets
  bpf: selftests: Add defrag selftests

 Documentation/bpf/kfuncs.rst                  |   7 +
 drivers/net/macvlan.c                         |   2 +-
 include/linux/btf.h                           |   1 +
 include/net/ip.h                              |  11 +
 include/net/ipv6.h                            |   1 +
 include/net/ipv6_frag.h                       |   1 +
 include/net/transp_v6.h                       |   1 +
 kernel/bpf/verifier.c                         |   8 +
 net/ipv4/Makefile                             |   1 +
 net/ipv4/ip_fragment.c                        |  15 +-
 net/ipv4/ip_fragment_bpf.c                    |  98 ++++++
 net/ipv6/Makefile                             |   1 +
 net/ipv6/af_inet6.c                           |   4 +
 net/ipv6/reassembly.c                         |  16 +-
 net/ipv6/reassembly_bpf.c                     | 143 ++++++++
 net/packet/af_packet.c                        |   2 +-
 tools/testing/selftests/bpf/Makefile          |   3 +-
 .../selftests/bpf/generate_udp_fragments.py   |  90 +++++
 .../selftests/bpf/ip_check_defrag_frags.h     |  57 +++
 tools/testing/selftests/bpf/network_helpers.c |  26 +-
 tools/testing/selftests/bpf/network_helpers.h |   3 +
 .../bpf/prog_tests/ip_check_defrag.c          | 327 ++++++++++++++++++
 .../selftests/bpf/progs/bpf_tracing_net.h     |   1 +
 .../selftests/bpf/progs/ip_check_defrag.c     | 133 +++++++
 24 files changed, 931 insertions(+), 21 deletions(-)
 create mode 100644 net/ipv4/ip_fragment_bpf.c
 create mode 100644 net/ipv6/reassembly_bpf.c
 create mode 100755 tools/testing/selftests/bpf/generate_udp_fragments.py
 create mode 100644 tools/testing/selftests/bpf/ip_check_defrag_frags.h
 create mode 100644 tools/testing/selftests/bpf/prog_tests/ip_check_defrag.c
 create mode 100644 tools/testing/selftests/bpf/progs/ip_check_defrag.c

Comments

Alexei Starovoitov March 7, 2023, 4:17 a.m. UTC | #1
On Tue, Feb 28, 2023 at 3:17 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
>
> > Have you considered to skb redirect to another netdev that does ip defrag?
> > Like macvlan does it under some conditions. This can be generalized.
>
> I had not considered that yet. Are you suggesting adding a new
> passthrough netdev thing that defrags? I looked at the macvlan driver
> and it looks like it defrags to handle some multicast corner case.

Something like that. A netdev that bpf prog can redirect to.
It will consume ip frags and eventually will produce reassembled skb.

The kernel ip_defrag logic has timeouts, counters, rhashtable
with thresholds, etc. All of them are per netns.
Just another ip_defrag_user will still share rhashtable
with its limits. The kernel can even do icmp_send().
ip_defrag is not a kfunc. It's a big block with plenty of kernel
wide side effects.
I really don't think we can alloc_skb, copy_skb, and ip_defrag it.
It messes with the stack too much.
It's also not clear to me when skb is reassembled and how bpf sees it.
"redirect into reassembling netdev" and attaching bpf prog to consume
that skb is much cleaner imo.
Maybe there are other ways to use ip_defrag, but certainly not like
a synchronous api helper.

Daniel Xu March 7, 2023, 7:48 p.m. UTC | #2
Hi Alexei,

(cc netfilter maintainers)

On Mon, Mar 06, 2023 at 08:17:20PM -0800, Alexei Starovoitov wrote:
> On Tue, Feb 28, 2023 at 3:17 PM Daniel Xu <dxu@dxuuu.xyz> wrote:
> >
> > > Have you considered to skb redirect to another netdev that does ip defrag?
> > > Like macvlan does it under some conditions. This can be generalized.
> >
> > I had not considered that yet. Are you suggesting adding a new
> > passthrough netdev thing that defrags? I looked at the macvlan driver
> > and it looks like it defrags to handle some multicast corner case.
> 
> Something like that. A netdev that bpf prog can redirect to.
> It will consume ip frags and eventually will produce reassembled skb.
> 
> The kernel ip_defrag logic has timeouts, counters, rhashtable
> with thresholds, etc. All of them are per netns.
> Just another ip_defrag_user will still share rhashtable
> with its limits. The kernel can even do icmp_send().
> ip_defrag is not a kfunc. It's a big block with plenty of kernel
> wide side effects.
> I really don't think we can alloc_skb, copy_skb, and ip_defrag it.
> It messes with the stack too much.
> It's also not clear to me when skb is reassembled and how bpf sees it.
> "redirect into reassembling netdev" and attaching bpf prog to consume
> that skb is much cleaner imo.
> Maybe there are other ways to use ip_defrag, but certainly not like
> a synchronous api helper.

I was giving the virtual netdev idea some thought this morning and I
thought I'd give the netfilter approach a deeper look.

Florian Westphal March 7, 2023, 8:11 p.m. UTC | #3
Daniel Xu <dxu@dxuuu.xyz> wrote:
> From my reading (I'll run some tests later) it looks like netfilter
> will defrag all ipv4/ipv6 packets in any netns with conntrack enabled.
> It appears to do so in NF_INET_PRE_ROUTING.

Yes, and output.

> One thing we would need though are (probably kfunc) wrappers around
> nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs
> are not transitively depending on defrag support from other netfilter
> modules.
>
> The exact mechanism would probably need some thinking, as the above
> functions kinda rely on module_init() and module_exit() semantics. We
> cannot make the prog bump the refcnt every time it runs -- it would
> overflow.  And it would be nice to automatically free the refcnt when
> prog is unloaded.

Probably add a flag attribute that is evaluated at BPF_LINK time, so
progs can say they need defrag enabled.  Same could be used to request
conntrack enablement.
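
A toy user-space model of that idea, to show why it sidesteps the
overflow concern: the reference is taken once when the link is created
and dropped once when it goes away, so running the prog never touches
the count. Everything here is hypothetical (LINK_F_NEEDS_DEFRAG,
netns_model, link_model and the function names); it is not the
netfilter or BPF link API.

```c
#include <assert.h>
#include <stdbool.h>

#define LINK_F_NEEDS_DEFRAG 0x1u /* hypothetical link flag */

/* Stand-in for per-netns defrag enablement state. */
struct netns_model {
	unsigned int defrag_users;
};

/* Stand-in for an attached BPF link. */
struct link_model {
	struct netns_model *ns;
	bool holds_defrag;
};

/* Attach: evaluate the flag once and take at most one reference. */
static inline void link_attach(struct link_model *l, struct netns_model *ns,
			       unsigned int flags)
{
	assert(l && ns);
	l->ns = ns;
	l->holds_defrag = (flags & LINK_F_NEEDS_DEFRAG) != 0;
	if (l->holds_defrag)
		ns->defrag_users++;
}

/* Running the prog does no refcount work at all. */
static inline void prog_run(const struct link_model *l)
{
	(void)l;
}

/* Detach (or prog unload): drop the reference taken at attach time. */
static inline void link_detach(struct link_model *l)
{
	assert(l);
	if (l->holds_defrag) {
		l->ns->defrag_users--;
		l->holds_defrag = false;
	}
}

static inline bool defrag_enabled(const struct netns_model *ns)
{
	return ns->defrag_users > 0;
}
```

The count is bounded by the number of live links rather than by the
number of packets seen, and it is released automatically when the link
is torn down.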

Will need some glue on netfilter side to handle DEFRAG=m, but we already
have plenty of those.

Alexei Starovoitov March 7, 2023, 9:18 p.m. UTC | #4
On Tue, Mar 7, 2023 at 12:11 PM Florian Westphal <fw@strlen.de> wrote:
>
> Daniel Xu <dxu@dxuuu.xyz> wrote:
> > From my reading (I'll run some tests later) it looks like netfilter
> > will defrag all ipv4/ipv6 packets in any netns with conntrack enabled.
> > It appears to do so in NF_INET_PRE_ROUTING.
>
> Yes, and output.
>
> > One thing we would need though are (probably kfunc) wrappers around
> > nf_defrag_ipv4_enable() and nf_defrag_ipv6_enable() to ensure BPF progs
> > are not transitively depending on defrag support from other netfilter
> > modules.
> >
> > The exact mechanism would probably need some thinking, as the above
> > functions kinda rely on module_init() and module_exit() semantics. We
> > cannot make the prog bump the refcnt every time it runs -- it would
> > overflow.  And it would be nice to automatically free the refcnt when
> > prog is unloaded.
>
> Probably add a flag attribute that is evaluated at BPF_LINK time, so
> progs can say they need defrag enabled.  Same could be used to request
> conntrack enablement.
>
> Will need some glue on netfilter side to handle DEFRAG=m, but we already
> have plenty of those.

All makes perfect sense to me.
It's cleaner than a special netdevice.
ipv4_conntrack_defrag() is pretty neat. I didn't know about it.
If we can reuse it as-is that would be ideal.
Conceptually it fits perfectly.
If we cannot reuse it (for whatever unlikely reason) I would
argue that TC hook should gain similar functionality.