Message ID | 20210305015655.14249-1-xiyou.wangcong@gmail.com |
---|---|
Series | sockmap: introduce BPF_SK_SKB_VERDICT and support UDP |
Cong Wang wrote:
> From: Cong Wang <cong.wang@bytedance.com>
>
> UDP already has udp_sendmsg() which takes lock_sock() inside.
> We have to build ->sendmsg_locked() on top of it, by adding
> a new parameter for whether the sock has been locked.
>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> Cc: Lorenz Bauer <lmb@cloudflare.com>
> Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> ---
>  include/net/udp.h  |  1 +
>  net/ipv4/af_inet.c |  1 +
>  net/ipv4/udp.c     | 30 +++++++++++++++++++++++-------
>  3 files changed, 25 insertions(+), 7 deletions(-)

[...]

> -int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> +static int __udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len, bool locked)
> {

lock_sock() is also taken by BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK() in
udp_sendmsg(),

	if (cgroup_bpf_enabled(BPF_CGROUP_UDP4_SENDMSG) && !connected) {
		err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
					(struct sockaddr *)usin, &ipc.addr);

so that will also need to be handled.

It also looks like sk_dst_set() wants the sock lock to be held, but I'm not
seeing how it's covered in the current code,

	static inline void
	__sk_dst_set(struct sock *sk, struct dst_entry *dst)
	{
		struct dst_entry *old_dst;

		sk_tx_queue_clear(sk);
		sk->sk_dst_pending_confirm = 0;
		old_dst = rcu_dereference_protected(sk->sk_dst_cache,
						    lockdep_sock_is_held(sk));
		rcu_assign_pointer(sk->sk_dst_cache, dst);
		dst_release(old_dst);
	}

I guess this could trip lockdep now; I'll dig a bit more Monday and see
if it's actually the case.

In general I don't really like code that wraps locks in 'if' branches
like this. It seems fragile to me. I didn't walk every path in the code
to see if a lock is taken in any of the called functions, but it looks
like ip_send_skb() can call into netfilter code and may try to take
the sock lock.

Do we need this locked send at all? We use it in sk_psock_backlog,
but that routine needs an optimization rewrite for TCP anyways.
It's dropping a lot of performance on the floor for no good reason.

.John
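For reference, the conditional-locking pattern under discussion looks
roughly like this. This is a minimal sketch based on the patch
description, not the actual diff; do_udp_send() is a placeholder for
the real udp_sendmsg() body:

	/* Sketch of the locked-flag pattern (illustrative only): the
	 * caller says whether it already holds the socket lock, and
	 * every lock/unlock is wrapped in an 'if'.
	 */
	static int __udp_sendmsg(struct sock *sk, struct msghdr *msg,
				 size_t len, bool locked)
	{
		int err;

		if (!locked)
			lock_sock(sk);
		/* Nothing reached from here may call lock_sock() again;
		 * BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK() does, which is
		 * the conflict pointed out above.
		 */
		err = do_udp_send(sk, msg, len); /* placeholder */
		if (!locked)
			release_sock(sk);
		return err;
	}

	int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
	{
		return __udp_sendmsg(sk, msg, len, false);
	}

	static int udp_sendmsg_locked(struct sock *sk, struct msghdr *msg,
				      size_t len)
	{
		return __udp_sendmsg(sk, msg, len, true);
	}

The fragility is that correctness now depends on auditing every callee
of __udp_sendmsg() for nested lock_sock() calls, on both branches.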
On Fri, Mar 5, 2021 at 5:21 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Cong Wang wrote:
> > From: Cong Wang <cong.wang@bytedance.com>
> >
> > UDP already has udp_sendmsg() which takes lock_sock() inside.
> > We have to build ->sendmsg_locked() on top of it, by adding
> > a new parameter for whether the sock has been locked.
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > ---
> >  include/net/udp.h  |  1 +
> >  net/ipv4/af_inet.c |  1 +
> >  net/ipv4/udp.c     | 30 +++++++++++++++++++++++-------
> >  3 files changed, 25 insertions(+), 7 deletions(-)
>
> [...]
>
> > -int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > +static int __udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len, bool locked)
> > {
>
> lock_sock() is also taken by BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK() in
> udp_sendmsg(),
>
> 	if (cgroup_bpf_enabled(BPF_CGROUP_UDP4_SENDMSG) && !connected) {
> 		err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> 					(struct sockaddr *)usin, &ipc.addr);
>
> so that will also need to be handled.

Indeed, good catch!

>
> It also looks like sk_dst_set() wants the sock lock to be held, but I'm not
> seeing how it's covered in the current code,
>
> 	static inline void
> 	__sk_dst_set(struct sock *sk, struct dst_entry *dst)
> 	{
> 		struct dst_entry *old_dst;
>
> 		sk_tx_queue_clear(sk);
> 		sk->sk_dst_pending_confirm = 0;
> 		old_dst = rcu_dereference_protected(sk->sk_dst_cache,
> 						    lockdep_sock_is_held(sk));
> 		rcu_assign_pointer(sk->sk_dst_cache, dst);
> 		dst_release(old_dst);
> 	}

I do not see how __sk_dst_set() is called in udp_sendmsg().

>
> I guess this could trip lockdep now; I'll dig a bit more Monday and see
> if it's actually the case.
>
> In general I don't really like code that wraps locks in 'if' branches
> like this. It seems fragile to me. I didn't walk every path in the code

I do not like it either; actually I spent quite some time trying to
get rid of this lock_sock, and it is definitely not easy. The comment
in sk_psock_backlog() is clearly wrong: we do not lock_sock to keep
sk_socket, we lock it to protect other structures like
ingress_{skb,msg}.

> to see if a lock is taken in any of the called functions, but it looks
> like ip_send_skb() can call into netfilter code and may try to take
> the sock lock.

Are you saying skb_send_sock_locked() is buggy? If so, clearly not
my fault.

>
> Do we need this locked send at all? We use it in sk_psock_backlog,
> but that routine needs an optimization rewrite for TCP anyways.
> It's dropping a lot of performance on the floor for no good reason.

At least for ingress_msg. It is not as easy as adding a queue lock here,
because we probably want to retrieve it atomically together with the
receive queue.

Thanks.
Cong Wang wrote:
> On Fri, Mar 5, 2021 at 5:21 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Cong Wang wrote:
> > > From: Cong Wang <cong.wang@bytedance.com>
> > >
> > > UDP already has udp_sendmsg() which takes lock_sock() inside.
> > > We have to build ->sendmsg_locked() on top of it, by adding
> > > a new parameter for whether the sock has been locked.
> > >
> > > Cc: John Fastabend <john.fastabend@gmail.com>
> > > Cc: Daniel Borkmann <daniel@iogearbox.net>
> > > Cc: Jakub Sitnicki <jakub@cloudflare.com>
> > > Cc: Lorenz Bauer <lmb@cloudflare.com>
> > > Signed-off-by: Cong Wang <cong.wang@bytedance.com>
> > > ---
> > >  include/net/udp.h  |  1 +
> > >  net/ipv4/af_inet.c |  1 +
> > >  net/ipv4/udp.c     | 30 +++++++++++++++++++++++-------
> > >  3 files changed, 25 insertions(+), 7 deletions(-)
> >
> > [...]
> >
> > > -int udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > > +static int __udp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len, bool locked)
> > > {
> >
> > lock_sock() is also taken by BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK() in
> > udp_sendmsg(),
> >
> > 	if (cgroup_bpf_enabled(BPF_CGROUP_UDP4_SENDMSG) && !connected) {
> > 		err = BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk,
> > 					(struct sockaddr *)usin, &ipc.addr);
> >
> > so that will also need to be handled.
>
> Indeed, good catch!

This is going to get tricky though, because we can't exactly drop the
lock and try to reclaim it. We would have no guarantee some other core
didn't grab the lock from the backlog side.

> >
> > It also looks like sk_dst_set() wants the sock lock to be held, but I'm not
> > seeing how it's covered in the current code,
> >
> > 	static inline void
> > 	__sk_dst_set(struct sock *sk, struct dst_entry *dst)
> > 	{
> > 		struct dst_entry *old_dst;
> >
> > 		sk_tx_queue_clear(sk);
> > 		sk->sk_dst_pending_confirm = 0;
> > 		old_dst = rcu_dereference_protected(sk->sk_dst_cache,
> > 						    lockdep_sock_is_held(sk));
> > 		rcu_assign_pointer(sk->sk_dst_cache, dst);
> > 		dst_release(old_dst);
> > 	}
>
> I do not see how __sk_dst_set() is called in udp_sendmsg().

The path I was probably looking at is,

  udp_sendmsg()
    sk_dst_check()
      sk_dst_reset()
        sk_dst_set(sk, NULL)

but that does a cmpxchg; only __sk_dst_set() actually has the
lockdep_sock_is_held(sk) check. So it should be OK.

> >
> > I guess this could trip lockdep now; I'll dig a bit more Monday and see
> > if it's actually the case.
> >
> > In general I don't really like code that wraps locks in 'if' branches
> > like this. It seems fragile to me. I didn't walk every path in the code
>
> I do not like it either; actually I spent quite some time trying to
> get rid of this lock_sock, and it is definitely not easy. The comment
> in sk_psock_backlog() is clearly wrong: we do not lock_sock to keep
> sk_socket, we lock it to protect other structures like
> ingress_{skb,msg}.

The comment comes from the early days, before psock was ref-counted,
and can be removed.

> > to see if a lock is taken in any of the called functions, but it looks
> > like ip_send_skb() can call into netfilter code and may try to take
> > the sock lock.
>
> Are you saying skb_send_sock_locked() is buggy? If so, clearly not
> my fault.

Except this path only exists on the UDP side, I think:

  udp_sendmsg()
    udp_send_skb()
      ip_send_skb()
        ...

TCP has some extra queuing logic in there that makes this work.

> >
> > Do we need this locked send at all? We use it in sk_psock_backlog,
> > but that routine needs an optimization rewrite for TCP anyways.
> > It's dropping a lot of performance on the floor for no good reason.
>
> At least for ingress_msg. It is not as easy as adding a queue lock here,
> because we probably want to retrieve it atomically together with the
> receive queue.

Agree. I'll try a bit harder tomorrow and see if I can come up with
anything; I intended to do this today, but got busy with some other
things. Best case is we find some way to drop that sock_lock
altogether here.

>
> Thanks.
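One direction that would fit what is being described above, purely as
an illustration and not something proposed in this thread: give the
psock ingress queues their own spinlock so the backlog path no longer
needs lock_sock(). All type, field, and helper names here are
hypothetical:

	/* Illustration only: a dedicated lock for the psock ingress
	 * queues. Cong's caveat still applies: this alone does not make
	 * dequeue atomic with the sock receive queue.
	 */
	struct psock_ingress {
		spinlock_t		lock;
		struct list_head	msg_list;	/* ingress_msg */
		struct sk_buff_head	skb_queue;	/* ingress_skb */
	};

	static void psock_ingress_enqueue(struct psock_ingress *in,
					  struct sk_msg *msg)
	{
		spin_lock_bh(&in->lock);
		list_add_tail(&msg->list, &in->msg_list);
		spin_unlock_bh(&in->lock);
	}

	static struct sk_msg *psock_ingress_dequeue(struct psock_ingress *in)
	{
		struct sk_msg *msg;

		spin_lock_bh(&in->lock);
		msg = list_first_entry_or_null(&in->msg_list,
					       struct sk_msg, list);
		if (msg)
			list_del(&msg->list);
		spin_unlock_bh(&in->lock);
		return msg;
	}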
From: Cong Wang <cong.wang@bytedance.com>

We have thousands of services connected to a daemon on every host via
AF_UNIX dgram sockets. After they were moved into VMs, we had to add a
proxy to forward these communications from the VM to the host, because
rewriting thousands of them is not practical. This proxy uses an
AF_UNIX socket connected to the services and a UDP socket to connect
to the host. It is inefficient because data is copied between kernel
space and user space twice, and we cannot use splice(), which only
supports TCP. Therefore, we want to use sockmap to do the splicing
without going to user space at all (after the initial setup).

Currently sockmap only fully supports TCP; UDP is partially supported,
as it is only allowed to be added into a sockmap. This patchset, as
the second part of the original large patchset, extends sockmap with:
1) cross-protocol support with BPF_SK_SKB_VERDICT; 2) full UDP
support.

At a high level, ->sendmsg_locked() and ->read_sock() are required for
each protocol to support sockmap redirection, and in order to do the
sock proto update, a new op, ->update_proto(), is introduced, which
each protocol is also required to implement. A BPF ->recvmsg() is also
needed to replace the original ->recvmsg() to retrieve skmsg. Please
see each patch for more details.

To see the big picture, the original patchset is available here:
https://github.com/congwang/linux/tree/sockmap
This patchset is also available here:
https://github.com/congwang/linux/tree/sockmap2

---
v3: export tcp/udp_update_proto()
    rename the new op to sk->sk_prot->psock_update_sk_prot()
    improve changelogs

v2: separate from the original large patchset
    rebase to the latest bpf-next
    split the UDP test case
    move the inet_csk_has_ulp() check to tcp_bpf.c
    clean up udp_read_sock()

Cong Wang (9):
  sock_map: introduce BPF_SK_SKB_VERDICT
  sock: introduce sk->sk_prot->psock_update_sk_prot()
  udp: implement ->sendmsg_locked()
  udp: implement ->read_sock() for sockmap
  udp: add ->read_sock() and ->sendmsg_locked() to ipv6
  skmsg: extract __tcp_bpf_recvmsg() and tcp_bpf_wait_data()
  udp: implement udp_bpf_recvmsg() for sockmap
  sock_map: update sock type checks for UDP
  selftests/bpf: add a test case for udp sockmap

 include/linux/skmsg.h                         |  25 ++--
 include/net/ipv6.h                            |   1 +
 include/net/sock.h                            |   3 +
 include/net/tcp.h                             |   3 +-
 include/net/udp.h                             |   4 +
 include/uapi/linux/bpf.h                      |   1 +
 kernel/bpf/syscall.c                          |   1 +
 net/core/skmsg.c                              | 113 +++++++++++++-
 net/core/sock_map.c                           |  52 ++++---
 net/ipv4/af_inet.c                            |   2 +
 net/ipv4/tcp_bpf.c                            | 130 +++-------------
 net/ipv4/tcp_ipv4.c                           |   3 +
 net/ipv4/udp.c                                |  68 ++++++++-
 net/ipv4/udp_bpf.c                            |  79 +++++++++-
 net/ipv6/af_inet6.c                           |   2 +
 net/ipv6/tcp_ipv6.c                           |   3 +
 net/ipv6/udp.c                                |  30 +++-
 net/tls/tls_sw.c                              |   4 +-
 tools/bpf/bpftool/common.c                    |   1 +
 tools/bpf/bpftool/prog.c                      |   1 +
 tools/include/uapi/linux/bpf.h                |   1 +
 .../selftests/bpf/prog_tests/sockmap_listen.c | 140 ++++++++++++++++++
 .../selftests/bpf/progs/test_sockmap_listen.c |  22 +++
 23 files changed, 519 insertions(+), 170 deletions(-)
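For patch 2, the changelog implies a proto op shaped roughly like the
following. The signature is inferred from the names above and may
differ from the actual diff; udp_bpf_prot is a hypothetical name
standing in for the sockmap-aware UDP proto:

	/* Hypothetical sketch: the op a protocol fills into its struct
	 * proto so sock_map can switch sk->sk_prot without
	 * protocol-specific knowledge. Error handling and callback
	 * restore are omitted.
	 */
	extern struct proto udp_bpf_prot;	/* hypothetical */

	int udp_bpf_update_proto(struct sock *sk, bool restore)
	{
		struct sk_psock *psock = sk_psock(sk);

		if (restore) {
			/* put the original proto back */
			WRITE_ONCE(sk->sk_prot, psock->sk_proto);
			return 0;
		}
		/* switch to the sockmap-aware UDP proto */
		WRITE_ONCE(sk->sk_prot, &udp_bpf_prot);
		return 0;
	}

sock_map would then reject any socket whose proto leaves the op NULL,
which is presumably how the "sock type checks for UDP" in patch 8 can
stay generic.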