mbox series

[RFC,v2,0/3] Allow sk_lookup UDP return traffic to egress when setting src port/address.

Message ID 20240920-reverse-sk-lookup-v2-0-916a48c47d56@cloudflare.com
Headers show
Series Allow sk_lookup UDP return traffic to egress when setting src port/address. | expand

Message

Tiago Lam Sept. 20, 2024, 5:02 p.m. UTC
Currently, sk_lookup allows an ebpf program to run on the ingress socket
lookup path, and accept traffic not only on a range of addresses, but
also on a range of ports. At Cloudflare we use sk_lookup for two main
cases:
1. Sharing a single port between multiple services - i.e. two services
   (or more) use disjoint IP ranges but share the same port;
2. Receiving traffic on all ports - i.e. a service which accepts traffic
   on specific IP ranges but any port [1].

However, one main challenge we face while using sk_lookup for these use
cases is how to source return UDP traffic:
- On point 1. above, sometimes this range of addresses are not local
  (i.e. there's no local routes for these in the server), which means we
  need IP_TRANSPARENT set to be able to egress traffic from addresses
  we've received traffic on (or simply IP_FREEBIND in the case of IPv6);
- And on point 2. above, allowing traffic to a range of ports means a
  service could get traffic on multiple ports, but currently there's no
  way to set the source UDP port egress traffic should be sourced from -
  it's possible to receive the original destination port using the
  IP_ORIGDSTADDR ancilliary message in recvmsg, but not set it in
  sendmsg.

Both of these limitations can be worked around, but in a sub-optimal
way. Using IP_TRANSPARENT, for instance, requires special privileges.
And while one could use UDP connected sockets to send return traffic,
creating a connected socket for each different address a UDP traffic is
received on does have performance implications.

Given sk_lookup allows services to accept traffic on a range of
addresses or ports, it seems sensible to also allow return traffic to
proceed through as well, without needing extra configurations / set ups.

This patch sets out to fix both of this issues by:
1. Allowing users to set the src address/port egress traffic should be
   sent from, when calling sendmsg();
2. Validating that this egress traffic comes from a socket that matches
   an ingress socket in sk_lookup.
   - If it does, traffic is allowed to proceed;
   - Otherwise it falls back to the regular egress path.

The downsides to this is that this runs on the egress hot path, although
this work tries to minimise its impact by only performing the reverse
socket lookup when necessary (i.e. only when the src address/port are
modified). Further performance measurements are to be taken, but we're
reaching out early for feedback to see what the technical concerns are
and if we can address them.

[1] https://blog.cloudflare.com/how-we-built-spectrum/

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
---
Changes in v2:
- Amended commit messages and cover letter to make the intent and
  implementation clearer (Willem de Bruijn);
- Fixed socket comparison by not using socket cookies and comparing them
  directly (Eric Dumazet);
- Fixed misspellings and checkpatch.pl warnings on line lengths (Simon
  Horman);
- Fixed usage of start_server_addr() and gcc compilation (Philo Lu);
- Link to v1: https://lore.kernel.org/r/20240913-reverse-sk-lookup-v1-0-e721ea003d4c@cloudflare.com

---
Tiago Lam (3):
      ipv4: Support setting src port in sendmsg().
      ipv6: Support setting src port in sendmsg().
      bpf: Add sk_lookup test to use ORIGDSTADDR cmsg.

 include/net/ip.h                                   |  1 +
 net/ipv4/ip_sockglue.c                             | 11 +++
 net/ipv4/udp.c                                     | 35 +++++++++-
 net/ipv6/datagram.c                                | 79 ++++++++++++++++++++++
 net/ipv6/udp.c                                     |  8 ++-
 tools/testing/selftests/bpf/prog_tests/sk_lookup.c | 67 ++++++++++++------
 6 files changed, 176 insertions(+), 25 deletions(-)
---
base-commit: 6562a89739bbefddb5495c09aaab67c1c3756f36
change-id: 20240909-reverse-sk-lookup-f7bf36292bc4

Best regards,

Comments

Willem de Bruijn Sept. 21, 2024, 9:12 a.m. UTC | #1
Tiago Lam wrote:
> sendmsg() doesn't currently allow users to set the src port from which
> egress traffic should be sent from. This is possible if a user wants to
> configure the src address from which egress traffic should be sent from
> - with the IP_PKTINFO ancillary message, a user is currently able to
>   specify a source address to egress from when calling sendmsg().
> However, this still requires the user to set the IP_TRANSPARENT flag
> using setsockopt(), which happens to require special privileges in the
> case of IPv4.
> 
> To support users setting the src port for egress traffic when using
> sendmsg(), this patch extends the ancillary messages supported by
> sendmsg() to support the IP_ORIGDSTADDR ancillary message, reusing the
> same cmsg and struct used in recvmsg() - which already supports
> specifying a port.
> 
> Additionally, to avoid having to have special configurations, such as
> IP_TRANSPARENT, this patch allows egress traffic that's been configured
> using (the newly added) IP_ORIGDSTADDR to proceed if there's an ingress
> socket lookup (sk_lookup) that matches that traffic - by performing a
> reserve sk_lookup. Thus, if the sk_lookup reverse call returns a socket
> that matches the egress socket, we also let the egress traffic through -
> following the principle of, allowing return traffic to proceed if
> ingress traffic is allowed in. In case no match is found in the reverse
> sk_lookup, traffic falls back to the regular egress path.
> 
> This reverse lookup is only performed in case an sk_lookup ebpf program
> is attached and the source address and/or port for the return traffic
> have been modified using the (newly added) IP_ORIGDSTADDR in sendmsg.
> 
> Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
> Signed-off-by: Tiago Lam <tiagolam@cloudflare.com>
> ---
>  include/net/ip.h       |  1 +
>  net/ipv4/ip_sockglue.c | 11 +++++++++++
>  net/ipv4/udp.c         | 35 ++++++++++++++++++++++++++++++++++-
>  3 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/ip.h b/include/net/ip.h
> index c5606cadb1a5..e5753abd7247 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -75,6 +75,7 @@ static inline unsigned int ip_hdrlen(const struct sk_buff *skb)
>  struct ipcm_cookie {
>  	struct sockcm_cookie	sockc;
>  	__be32			addr;
> +	__be16			port;
>  	int			oif;
>  	struct ip_options_rcu	*opt;
>  	__u8			protocol;
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index cf377377b52d..6e55bd25b5f7 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -297,6 +297,17 @@ int ip_cmsg_send(struct sock *sk, struct msghdr *msg, struct ipcm_cookie *ipc,
>  			ipc->addr = info->ipi_spec_dst.s_addr;
>  			break;
>  		}
> +		case IP_ORIGDSTADDR:

Should this just be IP_SRCADDR?

> +		{
> +			struct sockaddr_in *dst_addr;
> +
> +			if (cmsg->cmsg_len != CMSG_LEN(sizeof(struct sockaddr_in)))
> +				return -EINVAL;
> +			dst_addr = (struct sockaddr_in *)CMSG_DATA(cmsg);
> +			ipc->port = dst_addr->sin_port;
> +			ipc->addr = dst_addr->sin_addr.s_addr;
> +			break;
> +		}
>  		case IP_TTL:
>  			if (cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
>  				return -EINVAL;
> diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
> index 49c622e743e8..208cee40c0ec 100644
> --- a/net/ipv4/udp.c
> +++ b/net/ipv4/udp.c
> @@ -1060,6 +1060,7 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  	DECLARE_SOCKADDR(struct sockaddr_in *, usin, msg->msg_name);
>  	struct flowi4 fl4_stack;
>  	struct flowi4 *fl4;
> +	__u8 flow_flags = inet_sk_flowi_flags(sk);
>  	int ulen = len;
>  	struct ipcm_cookie ipc;
>  	struct rtable *rt = NULL;
> @@ -1179,6 +1180,39 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  		}
>  	}
>  
> +	/* If we're egressing with a different source address and/or port, we
> +	 * perform a reverse socket lookup.  The rationale behind this is that
> +	 * we can allow return UDP traffic that has ingressed through sk_lookup
> +	 * to also egress correctly. In case this the reverse lookup fails.
> +	 *
> +	 * The lookup is performed if either source address and/or port
> +	 * changed, and neither is "0".
> +	 */
> +	if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
> +	    !connected &&
> +	    (ipc.port && ipc.addr) &&
> +	    (inet->inet_saddr != ipc.addr || inet->inet_sport != ipc.port)) {
> +		struct sock *sk_egress;
> +
> +		bpf_sk_lookup_run_v4(sock_net(sk), IPPROTO_UDP,
> +				     daddr, dport, ipc.addr, ntohs(ipc.port),
> +				     1, &sk_egress);

Does this need to use a bpf helper rather than a normal route lookup
function?

I don't know this func, but the sk is returned without a reference
taken?

> +		if (IS_ERR_OR_NULL(sk_egress) || sk_egress != sk) {
> +			net_info_ratelimited("No reverse socket lookup match for local addr %pI4:%d remote addr %pI4:%d\n",
> +					     &ipc.addr, ntohs(ipc.port), &daddr,
> +					     ntohs(dport));

No need for logging to the kernel log when syscalls can just return an
error.

> +		} else {
> +			/* Override the source port to use with the one we got
> +			 * in cmsg, and tell routing to let us use a non-local
> +			 * address. Otherwise route lookups will fail with
> +			 * non-local source address when IP_TRANSPARENT isn't
> +			 * set.
> +			 */
> +			inet->inet_sport = ipc.port;
> +			flow_flags |= FLOWI_FLAG_ANYSRC;
> +		}
> +	}
> +
>  	saddr = ipc.addr;
>  	ipc.addr = faddr = daddr;
>  
> @@ -1223,7 +1257,6 @@ int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  
>  	if (!rt) {
>  		struct net *net = sock_net(sk);
> -		__u8 flow_flags = inet_sk_flowi_flags(sk);
>  
>  		fl4 = &fl4_stack;
>  
> 
> -- 
> 2.34.1
>
Jakub Sitnicki Sept. 23, 2024, 2:34 p.m. UTC | #2
Just an FYI to the reviewers -

Tiago is out this week, so his reponses will be delayed.