From patchwork Tue Dec 1 14:44:08 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 336662 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2C99FC83016 for ; Tue, 1 Dec 2020 14:46:15 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id C88B72076C for ; Tue, 1 Dec 2020 14:46:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="krr1JUgT" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2403930AbgLAOp6 (ORCPT ); Tue, 1 Dec 2020 09:45:58 -0500 Received: from smtp-fw-33001.amazon.com ([207.171.190.10]:64506 "EHLO smtp-fw-33001.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2391592AbgLAOpz (ORCPT ); Tue, 1 Dec 2020 09:45:55 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606833954; x=1638369954; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=08Dy0Jbl2BG1YozdQOoj2MRFQHEUaiVq4zp0lSiWm0g=; b=krr1JUgTAnpD3LKbKVJLLWdru64XitxJaKCipCIBgCUAgcZgzXAJK0Ga 2Tr3Umf2R+/7pJcbgGN9hzdMqaKSnxceZO8aUIx3qj4AlQc6cV79Irp9Q l8qfw7lebxI8Pi+9sF4dUOAySXQQBYQm5As1LEmaqxcMtvGun8rTp1Y/B 8=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="99546809" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2c-4e7c8266.us-west-2.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-33001.sea14.amazon.com with ESMTP; 01 Dec 2020 14:45:13 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan2.pdx.amazon.com [10.236.137.194]) by email-inbound-relay-2c-4e7c8266.us-west-2.amazon.com (Postfix) with ESMTPS id 663EBA1D1F; Tue, 1 Dec 2020 14:45:12 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:45:11 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:45:07 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 01/11] tcp: Keep TCP_CLOSE sockets in the reuseport group. Date: Tue, 1 Dec 2020 23:44:08 +0900 Message-ID: <20201201144418.35045-2-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch is a preparation patch to migrate incoming connections in the later commits and adds a field (num_closed_socks) to the struct sock_reuseport to keep TCP_CLOSE sockets in the reuseport group. When we close a listening socket, to migrate its connections to another listener in the same reuseport group, we have to handle two kinds of child sockets. One is that a listening socket has a reference to, and the other is not. The former is the TCP_ESTABLISHED/TCP_SYN_RECV sockets, and they are in the accept queue of their listening socket. So, we can pop them out and push them into another listener's queue at close() or shutdown() syscalls. On the other hand, the latter, the TCP_NEW_SYN_RECV socket is during the three-way handshake and not in the accept queue. Thus, we cannot access such sockets at close() or shutdown() syscalls. Accordingly, we have to migrate immature sockets after their listening socket has been closed. Currently, if their listening socket has been closed, TCP_NEW_SYN_RECV sockets are freed at receiving the final ACK or retransmitting SYN+ACKs. At that time, if we could select a new listener from the same reuseport group, no connection would be aborted. However, it is impossible because reuseport_detach_sock() sets NULL to sk_reuseport_cb and forbids access to the reuseport group from closed sockets. This patch allows TCP_CLOSE sockets to remain in the reuseport group and to have access to it while any child socket references to them. The point is that reuseport_detach_sock() is called twice from inet_unhash() and sk_destruct(). At first, it moves the socket backwards in socks[] and increments num_closed_socks. Later, when all migrated connections are accepted, it removes the socket from socks[], decrements num_closed_socks, and sets NULL to sk_reuseport_cb. By this change, closed sockets can keep sk_reuseport_cb until all child requests have been freed or accepted. Consequently calling listen() after shutdown() can cause EADDRINUSE or EBUSY in reuseport_add_sock() or inet_csk_bind_conflict() which expect that such sockets should not have the reuseport group. Therefore, this patch also loosens such validation rules so that the socket can listen again if it has the same reuseport group with other listening sockets. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/sock_reuseport.h | 5 ++- net/core/sock_reuseport.c | 79 +++++++++++++++++++++++++++------ net/ipv4/inet_connection_sock.c | 7 ++- 3 files changed, 74 insertions(+), 17 deletions(-) diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 505f1e18e9bf..0e558ca7afbf 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -13,8 +13,9 @@ extern spinlock_t reuseport_lock; struct sock_reuseport { struct rcu_head rcu; - u16 max_socks; /* length of socks */ - u16 num_socks; /* elements in socks */ + u16 max_socks; /* length of socks */ + u16 num_socks; /* elements in socks */ + u16 num_closed_socks; /* closed elements in socks */ /* The last synq overflow event timestamp of this * reuse->socks[] group. */ diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index bbdd3c7b6cb5..fd133516ac0e 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -98,16 +98,21 @@ static struct sock_reuseport *reuseport_grow(struct sock_reuseport *reuse) return NULL; more_reuse->num_socks = reuse->num_socks; + more_reuse->num_closed_socks = reuse->num_closed_socks; more_reuse->prog = reuse->prog; more_reuse->reuseport_id = reuse->reuseport_id; more_reuse->bind_inany = reuse->bind_inany; more_reuse->has_conns = reuse->has_conns; + more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); memcpy(more_reuse->socks, reuse->socks, reuse->num_socks * sizeof(struct sock *)); - more_reuse->synq_overflow_ts = READ_ONCE(reuse->synq_overflow_ts); + memcpy(more_reuse->socks + + (more_reuse->max_socks - more_reuse->num_closed_socks), + reuse->socks + reuse->num_socks, + reuse->num_closed_socks * sizeof(struct sock *)); - for (i = 0; i < reuse->num_socks; ++i) + for (i = 0; i < reuse->max_socks; ++i) rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb, more_reuse); @@ -129,6 +134,25 @@ static void reuseport_free_rcu(struct rcu_head *head) kfree(reuse); } +static int reuseport_sock_index(struct sock_reuseport *reuse, struct sock *sk, + bool closed) +{ + int left, right; + + if (!closed) { + left = 0; + right = reuse->num_socks; + } else { + left = reuse->max_socks - reuse->num_closed_socks; + right = reuse->max_socks; + } + + for (; left < right; left++) + if (reuse->socks[left] == sk) + return left; + return -1; +} + /** * reuseport_add_sock - Add a socket to the reuseport group of another. * @sk: New socket to add to the group. @@ -153,12 +177,23 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) lockdep_is_held(&reuseport_lock)); old_reuse = rcu_dereference_protected(sk->sk_reuseport_cb, lockdep_is_held(&reuseport_lock)); - if (old_reuse && old_reuse->num_socks != 1) { + + if (old_reuse == reuse) { + int i = reuseport_sock_index(reuse, sk, true); + + if (i == -1) { + spin_unlock_bh(&reuseport_lock); + return -EBUSY; + } + + reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks]; + reuse->num_closed_socks--; + } else if (old_reuse && old_reuse->num_socks != 1) { spin_unlock_bh(&reuseport_lock); return -EBUSY; } - if (reuse->num_socks == reuse->max_socks) { + if (reuse->num_socks + reuse->num_closed_socks == reuse->max_socks) { reuse = reuseport_grow(reuse); if (!reuse) { spin_unlock_bh(&reuseport_lock); @@ -174,8 +209,9 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) spin_unlock_bh(&reuseport_lock); - if (old_reuse) + if (old_reuse && old_reuse != reuse) call_rcu(&old_reuse->rcu, reuseport_free_rcu); + return 0; } EXPORT_SYMBOL(reuseport_add_sock); @@ -199,17 +235,34 @@ void reuseport_detach_sock(struct sock *sk) */ bpf_sk_reuseport_detach(sk); - rcu_assign_pointer(sk->sk_reuseport_cb, NULL); + if (sk->sk_protocol != IPPROTO_TCP || sk->sk_state == TCP_LISTEN) { + i = reuseport_sock_index(reuse, sk, false); + if (i == -1) + goto out; + + reuse->num_socks--; + reuse->socks[i] = reuse->socks[reuse->num_socks]; - for (i = 0; i < reuse->num_socks; i++) { - if (reuse->socks[i] == sk) { - reuse->socks[i] = reuse->socks[reuse->num_socks - 1]; - reuse->num_socks--; - if (reuse->num_socks == 0) - call_rcu(&reuse->rcu, reuseport_free_rcu); - break; + if (sk->sk_protocol == IPPROTO_TCP) { + reuse->num_closed_socks++; + reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk; + } else { + rcu_assign_pointer(sk->sk_reuseport_cb, NULL); } + } else { + i = reuseport_sock_index(reuse, sk, true); + if (i == -1) + goto out; + + reuse->socks[i] = reuse->socks[reuse->max_socks - reuse->num_closed_socks]; + reuse->num_closed_socks--; + + rcu_assign_pointer(sk->sk_reuseport_cb, NULL); } + + if (reuse->num_socks + reuse->num_closed_socks == 0) + call_rcu(&reuse->rcu, reuseport_free_rcu); +out: spin_unlock_bh(&reuseport_lock); } EXPORT_SYMBOL(reuseport_detach_sock); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index f60869acbef0..1451aa9712b0 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -138,6 +138,7 @@ static int inet_csk_bind_conflict(const struct sock *sk, bool reuse = sk->sk_reuse; bool reuseport = !!sk->sk_reuseport; kuid_t uid = sock_i_uid((struct sock *)sk); + struct sock_reuseport *reuseport_cb = rcu_access_pointer(sk->sk_reuseport_cb); /* * Unlike other sk lookup places we do not check @@ -156,14 +157,16 @@ static int inet_csk_bind_conflict(const struct sock *sk, if ((!relax || (!reuseport_ok && reuseport && sk2->sk_reuseport && - !rcu_access_pointer(sk->sk_reuseport_cb) && + (!reuseport_cb || + reuseport_cb == rcu_access_pointer(sk2->sk_reuseport_cb)) && (sk2->sk_state == TCP_TIME_WAIT || uid_eq(uid, sock_i_uid(sk2))))) && inet_rcv_saddr_equal(sk, sk2, true)) break; } else if (!reuseport_ok || !reuseport || !sk2->sk_reuseport || - rcu_access_pointer(sk->sk_reuseport_cb) || + (reuseport_cb && + reuseport_cb != rcu_access_pointer(sk2->sk_reuseport_cb)) || (sk2->sk_state != TCP_TIME_WAIT && !uid_eq(uid, sock_i_uid(sk2)))) { if (inet_rcv_saddr_equal(sk, sk2, true)) From patchwork Tue Dec 1 14:44:09 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 335630 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AE55CC83012 for ; Tue, 1 Dec 2020 14:46:14 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4BF692080A for ; Tue, 1 Dec 2020 14:46:14 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="A9yvdEQD" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2403914AbgLAOpy (ORCPT ); Tue, 1 Dec 2020 09:45:54 -0500 Received: from smtp-fw-9101.amazon.com ([207.171.184.25]:50785 "EHLO smtp-fw-9101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389661AbgLAOpx (ORCPT ); Tue, 1 Dec 2020 09:45:53 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606833953; x=1638369953; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=fZzd58Z28I00jRpPznhhNdsFg37FRopxhjk7qpzG9hQ=; b=A9yvdEQDmYfr8dL1oBC2jzHXQ1E1j8OqYZBL5h0vRhblkjevVO3CYbBZ FUwfDWxsviJlGcsYxysepaMxZ8QaOXJqJbMD8Ra6NvsrOkyoDYjG2v0F7 aZFxhTv/YNDinyIgM390BA04yhIq5ME5wqAlaW82hYhMHk5mHsrC8kIk/ E=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="92542011" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2a-d0be17ee.us-west-2.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9101.sea19.amazon.com with ESMTP; 01 Dec 2020 14:45:28 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2a-d0be17ee.us-west-2.amazon.com (Postfix) with ESMTPS id D1734A206F; Tue, 1 Dec 2020 14:45:27 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:45:27 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:45:22 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 02/11] bpf: Define migration types for SO_REUSEPORT. Date: Tue, 1 Dec 2020 23:44:09 +0900 Message-ID: <20201201144418.35045-3-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org As noted in the preceding commit, there are two migration types. In addition to that, the kernel will run the same eBPF program to select a listener for SYN packets. This patch defines three types to signal the kernel and the eBPF program if it is receiving a new request or migrating ESTABLISHED/SYN_RECV sockets in the accept queue or NEW_SYN_RECV socket during 3WHS. Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 14 ++++++++++++++ tools/include/uapi/linux/bpf.h | 14 ++++++++++++++ 2 files changed, 28 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 162999b12790..85278deff439 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4380,6 +4380,20 @@ struct sk_msg_md { __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */ }; +/* Migration type for SO_REUSEPORT enabled TCP sockets. + * + * BPF_SK_REUSEPORT_MIGRATE_NO : Select a listener for SYN packets. + * BPF_SK_REUSEPORT_MIGRATE_QUEUE : Migrate ESTABLISHED and SYN_RECV sockets in + * the accept queue at close() or shutdown(). + * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the + * final ACK of 3WHS or retransmitting SYN+ACKs. + */ +enum { + BPF_SK_REUSEPORT_MIGRATE_NO, + BPF_SK_REUSEPORT_MIGRATE_QUEUE, + BPF_SK_REUSEPORT_MIGRATE_REQUEST, +}; + struct sk_reuseport_md { /* * Start of directly accessible data. It begins from diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 162999b12790..85278deff439 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4380,6 +4380,20 @@ struct sk_msg_md { __bpf_md_ptr(struct bpf_sock *, sk); /* current socket */ }; +/* Migration type for SO_REUSEPORT enabled TCP sockets. + * + * BPF_SK_REUSEPORT_MIGRATE_NO : Select a listener for SYN packets. + * BPF_SK_REUSEPORT_MIGRATE_QUEUE : Migrate ESTABLISHED and SYN_RECV sockets in + * the accept queue at close() or shutdown(). + * BPF_SK_REUSEPORT_MIGRATE_REQUEST : Migrate NEW_SYN_RECV socket at receiving the + * final ACK of 3WHS or retransmitting SYN+ACKs. + */ +enum { + BPF_SK_REUSEPORT_MIGRATE_NO, + BPF_SK_REUSEPORT_MIGRATE_QUEUE, + BPF_SK_REUSEPORT_MIGRATE_REQUEST, +}; + struct sk_reuseport_md { /* * Start of directly accessible data. It begins from From patchwork Tue Dec 1 14:44:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 335629 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C1FA9C64E90 for ; Tue, 1 Dec 2020 14:46:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 781BD2084C for ; Tue, 1 Dec 2020 14:46:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="toMeHb3I" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2403955AbgLAOq2 (ORCPT ); Tue, 1 Dec 2020 09:46:28 -0500 Received: from smtp-fw-4101.amazon.com ([72.21.198.25]:62322 "EHLO smtp-fw-4101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2403942AbgLAOq2 (ORCPT ); Tue, 1 Dec 2020 09:46:28 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606833985; x=1638369985; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=7Fal7tXCFu+HPZpsdNg00Jpy9tCvxXaYm2D1klkIH5Q=; b=toMeHb3IrlMj19perBC/I+QAk7dS5Nh72Chlq1gWBwPdZKuB1THWjYM7 Uga6O/DGq/uwmoBR46Urw7rmIuKUp0rNFsrTF3EUL5nrmRzzh9Sjb6gDS N2HAJ79Odt5WWNgZkUTa57rks7uaZlsCHOC5THZ2qfbS6KG3AI1C9a33+ M=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="66957220" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-2a-53356bf6.us-west-2.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-4101.iad4.amazon.com with ESMTP; 01 Dec 2020 14:45:43 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan2.pdx.amazon.com [10.236.137.194]) by email-inbound-relay-2a-53356bf6.us-west-2.amazon.com (Postfix) with ESMTPS id C0F6CA1E28; Tue, 1 Dec 2020 14:45:42 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:45:42 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:45:37 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 03/11] tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues. Date: Tue, 1 Dec 2020 23:44:10 +0900 Message-ID: <20201201144418.35045-4-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch lets reuseport_detach_sock() return a pointer of struct sock, which is used only by inet_unhash(). If it is not NULL, inet_csk_reqsk_queue_migrate() migrates TCP_ESTABLISHED/TCP_SYN_RECV sockets from the closing listener to the selected one. Listening sockets hold incoming connections as a linked list of struct request_sock in the accept queue, and each request has reference to a full socket and its listener. In inet_csk_reqsk_queue_migrate(), we only unlink the requests from the closing listener's queue and relink them to the head of the new listener's queue. We do not process each request and its reference to the listener, so the migration completes in O(1) time complexity. However, in the case of TCP_SYN_RECV sockets, we take special care in the next commit. By default, the kernel selects a new listener randomly. In order to pick out a different socket every time, we select the last element of socks[] as the new listener. This behaviour is based on how the kernel moves sockets in socks[]. (See also [1]) Basically, in order to redistribute sockets evenly, we have to use an eBPF program called in the later commit, but as the side effect of such default selection, the kernel can redistribute old requests evenly to new listeners for a specific case where the application replaces listeners by generations. For example, we call listen() for four sockets (A, B, C, D), and close the first two by turns. The sockets move in socks[] like below. socks[0] : A <-. socks[0] : D socks[0] : D socks[1] : B | => socks[1] : B <-. => socks[1] : C socks[2] : C | socks[2] : C --' socks[3] : D --' Then, if C and D have newer settings than A and B, and each socket has a request (a, b, c, d) in their accept queue, we can redistribute old requests evenly to new listeners. socks[0] : A (a) <-. socks[0] : D (a + d) socks[0] : D (a + d) socks[1] : B (b) | => socks[1] : B (b) <-. => socks[1] : C (b + c) socks[2] : C (c) | socks[2] : C (c) --' socks[3] : D (d) --' Here, (A, D) or (B, C) can have different application settings, but they MUST have the same settings at the socket API level; otherwise, unexpected error may happen. For instance, if only the new listeners have TCP_SAVE_SYN, old requests do not have SYN data, so the application will face inconsistency and cause an error. Therefore, if there are different kinds of sockets, we must attach an eBPF program described in later commits. Link: https://lore.kernel.org/netdev/CAEfhGiyG8Y_amDZ2C8dQoQqjZJMHjTY76b=KBkTKcBtA=dhdGQ@mail.gmail.com/ Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/inet_connection_sock.h | 1 + include/net/sock_reuseport.h | 2 +- net/core/sock_reuseport.c | 10 +++++++++- net/ipv4/inet_connection_sock.c | 30 ++++++++++++++++++++++++++++++ net/ipv4/inet_hashtables.c | 9 +++++++-- 5 files changed, 48 insertions(+), 4 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 7338b3865a2a..2ea2d743f8fc 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -260,6 +260,7 @@ struct dst_entry *inet_csk_route_child_sock(const struct sock *sk, struct sock *inet_csk_reqsk_queue_add(struct sock *sk, struct request_sock *req, struct sock *child); +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk); void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req, unsigned long timeout); struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child, diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 0e558ca7afbf..09a1b1539d4c 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -31,7 +31,7 @@ struct sock_reuseport { extern int reuseport_alloc(struct sock *sk, bool bind_inany); extern int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany); -extern void reuseport_detach_sock(struct sock *sk); +extern struct sock *reuseport_detach_sock(struct sock *sk); extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, struct sk_buff *skb, diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index fd133516ac0e..60d7c1f28809 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -216,9 +216,11 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) } EXPORT_SYMBOL(reuseport_add_sock); -void reuseport_detach_sock(struct sock *sk) +struct sock *reuseport_detach_sock(struct sock *sk) { struct sock_reuseport *reuse; + struct bpf_prog *prog; + struct sock *nsk = NULL; int i; spin_lock_bh(&reuseport_lock); @@ -242,8 +244,12 @@ void reuseport_detach_sock(struct sock *sk) reuse->num_socks--; reuse->socks[i] = reuse->socks[reuse->num_socks]; + prog = rcu_dereference(reuse->prog); if (sk->sk_protocol == IPPROTO_TCP) { + if (reuse->num_socks && !prog) + nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i]; + reuse->num_closed_socks++; reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk; } else { @@ -264,6 +270,8 @@ void reuseport_detach_sock(struct sock *sk) call_rcu(&reuse->rcu, reuseport_free_rcu); out: spin_unlock_bh(&reuseport_lock); + + return nsk; } EXPORT_SYMBOL(reuseport_detach_sock); diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 1451aa9712b0..b27241ea96bd 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -992,6 +992,36 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk, } EXPORT_SYMBOL(inet_csk_reqsk_queue_add); +void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk) +{ + struct request_sock_queue *old_accept_queue, *new_accept_queue; + + old_accept_queue = &inet_csk(sk)->icsk_accept_queue; + new_accept_queue = &inet_csk(nsk)->icsk_accept_queue; + + spin_lock(&old_accept_queue->rskq_lock); + spin_lock(&new_accept_queue->rskq_lock); + + if (old_accept_queue->rskq_accept_head) { + if (new_accept_queue->rskq_accept_head) + old_accept_queue->rskq_accept_tail->dl_next = + new_accept_queue->rskq_accept_head; + else + new_accept_queue->rskq_accept_tail = old_accept_queue->rskq_accept_tail; + + new_accept_queue->rskq_accept_head = old_accept_queue->rskq_accept_head; + old_accept_queue->rskq_accept_head = NULL; + old_accept_queue->rskq_accept_tail = NULL; + + WRITE_ONCE(nsk->sk_ack_backlog, nsk->sk_ack_backlog + sk->sk_ack_backlog); + WRITE_ONCE(sk->sk_ack_backlog, 0); + } + + spin_unlock(&new_accept_queue->rskq_lock); + spin_unlock(&old_accept_queue->rskq_lock); +} +EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate); + struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child, struct request_sock *req, bool own_req) { diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 45fb450b4522..545538a6bfac 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -681,6 +681,7 @@ void inet_unhash(struct sock *sk) { struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo; struct inet_listen_hashbucket *ilb = NULL; + struct sock *nsk; spinlock_t *lock; if (sk_unhashed(sk)) @@ -696,8 +697,12 @@ void inet_unhash(struct sock *sk) if (sk_unhashed(sk)) goto unlock; - if (rcu_access_pointer(sk->sk_reuseport_cb)) - reuseport_detach_sock(sk); + if (rcu_access_pointer(sk->sk_reuseport_cb)) { + nsk = reuseport_detach_sock(sk); + if (nsk) + inet_csk_reqsk_queue_migrate(sk, nsk); + } + if (ilb) { inet_unhash2(hashinfo, sk); ilb->count--; From patchwork Tue Dec 1 14:44:11 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 336661 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 60BC2C83013 for ; Tue, 1 Dec 2020 14:46:52 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id F00052080A for ; Tue, 1 Dec 2020 14:46:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="Qt6rUKEy" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2403966AbgLAOqn (ORCPT ); Tue, 1 Dec 2020 09:46:43 -0500 Received: from smtp-fw-9101.amazon.com ([207.171.184.25]:51228 "EHLO smtp-fw-9101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389033AbgLAOqm (ORCPT ); Tue, 1 Dec 2020 09:46:42 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606834002; x=1638370002; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=uyJ//ZGcPRgBO7Ln6GyuNrk95n/KiRMWJ9SsDfr+V+U=; b=Qt6rUKEy0tbmlAkaTMJzDpcg6ErA7P8Awa4E9UGGWlfZrlp1DpqWDHUh qzoEx0ILVJcili2xppDn24JwlhPFLRJbJj5o+/2DWSqqY1d+5wSMSYnwM 6y7KobekHNLwRDXg5SCbB43jT3tbMhtKfoA4o17IRo+4sBJzFQF4XREm8 U=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="92542266" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2b-c300ac87.us-west-2.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9101.sea19.amazon.com with ESMTP; 01 Dec 2020 14:45:58 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan2.pdx.amazon.com [10.236.137.194]) by email-inbound-relay-2b-c300ac87.us-west-2.amazon.com (Postfix) with ESMTPS id ECBBCA20D1; Tue, 1 Dec 2020 14:45:57 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:45:57 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:45:52 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 04/11] tcp: Migrate TFO requests causing RST during TCP_SYN_RECV. Date: Tue, 1 Dec 2020 23:44:11 +0900 Message-ID: <20201201144418.35045-5-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org A TFO request socket is only freed after BOTH 3WHS has completed (or aborted) and the child socket has been accepted (or its listener has been closed). Hence, depending on the order, there can be two kinds of request sockets in the accept queue. 3WHS -> accept : TCP_ESTABLISHED accept -> 3WHS : TCP_SYN_RECV Unlike TCP_ESTABLISHED socket, accept() does not free the request socket for TCP_SYN_RECV socket. It is freed later at reqsk_fastopen_remove(). Also, it accesses request_sock.rsk_listener. So, in order to complete TFO socket migration, we have to set the current listener to it at accept() before reqsk_fastopen_remove(). Moreover, if TFO request caused RST before 3WHS has completed, it is held in the listener's TFO queue to prevent DDoS attack. Thus, we also have to migrate the requests in TFO queue. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- net/ipv4/inet_connection_sock.c | 35 ++++++++++++++++++++++++++++++++- 1 file changed, 34 insertions(+), 1 deletion(-) diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index b27241ea96bd..361efe55b1ad 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -500,6 +500,16 @@ struct sock *inet_csk_accept(struct sock *sk, int flags, int *err, bool kern) tcp_rsk(req)->tfo_listener) { spin_lock_bh(&queue->fastopenq.lock); if (tcp_rsk(req)->tfo_listener) { + if (req->rsk_listener != sk) { + /* TFO request was migrated to another listener so + * the new listener must be used in reqsk_fastopen_remove() + * to hold requests which cause RST. + */ + sock_put(req->rsk_listener); + sock_hold(sk); + req->rsk_listener = sk; + } + /* We are still waiting for the final ACK from 3WHS * so can't free req now. Instead, we set req->sk to * NULL to signify that the child socket is taken @@ -954,7 +964,6 @@ static void inet_child_forget(struct sock *sk, struct request_sock *req, if (sk->sk_protocol == IPPROTO_TCP && tcp_rsk(req)->tfo_listener) { BUG_ON(rcu_access_pointer(tcp_sk(child)->fastopen_rsk) != req); - BUG_ON(sk != req->rsk_listener); /* Paranoid, to prevent race condition if * an inbound pkt destined for child is @@ -995,6 +1004,7 @@ EXPORT_SYMBOL(inet_csk_reqsk_queue_add); void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk) { struct request_sock_queue *old_accept_queue, *new_accept_queue; + struct fastopen_queue *old_fastopenq, *new_fastopenq; old_accept_queue = &inet_csk(sk)->icsk_accept_queue; new_accept_queue = &inet_csk(nsk)->icsk_accept_queue; @@ -1019,6 +1029,29 @@ void inet_csk_reqsk_queue_migrate(struct sock *sk, struct sock *nsk) spin_unlock(&new_accept_queue->rskq_lock); spin_unlock(&old_accept_queue->rskq_lock); + + old_fastopenq = &old_accept_queue->fastopenq; + new_fastopenq = &new_accept_queue->fastopenq; + + spin_lock_bh(&old_fastopenq->lock); + spin_lock_bh(&new_fastopenq->lock); + + new_fastopenq->qlen += old_fastopenq->qlen; + old_fastopenq->qlen = 0; + + if (old_fastopenq->rskq_rst_head) { + if (new_fastopenq->rskq_rst_head) + old_fastopenq->rskq_rst_tail->dl_next = new_fastopenq->rskq_rst_head; + else + old_fastopenq->rskq_rst_tail = new_fastopenq->rskq_rst_tail; + + new_fastopenq->rskq_rst_head = old_fastopenq->rskq_rst_head; + old_fastopenq->rskq_rst_head = NULL; + old_fastopenq->rskq_rst_tail = NULL; + } + + spin_unlock_bh(&new_fastopenq->lock); + spin_unlock_bh(&old_fastopenq->lock); } EXPORT_SYMBOL(inet_csk_reqsk_queue_migrate); From patchwork Tue Dec 1 14:44:12 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 335628 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A625CC64E7B for ; Tue, 1 Dec 2020 14:47:31 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 483B920758 for ; Tue, 1 Dec 2020 14:47:31 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="bVsFVuX0" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2391629AbgLAOq4 (ORCPT ); Tue, 1 Dec 2020 09:46:56 -0500 Received: from smtp-fw-9101.amazon.com ([207.171.184.25]:51364 "EHLO smtp-fw-9101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389033AbgLAOqz (ORCPT ); Tue, 1 Dec 2020 09:46:55 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606834014; x=1638370014; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=sbHECF02s6Bqq1okkw9Zu8kr/lAkSusVFYHxoYY5tRk=; b=bVsFVuX0dZ2PdNLys5uFwn54J/SOK6QQMbRXB1cAMZJXG7rQbfuiXtlK OKdcQiyzxRiZs0fjd/GNgAN1y8/E7Y2K+Sxi7EtUdAulXy9YrK7ZAA6Qa OdELC8TDW4D0Nel0sQdw7yK2ni4ITVecOjw1AZwICgtbWBd7PC9dZdGPk k=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="92542374" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2a-1c1b5cdd.us-west-2.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9101.sea19.amazon.com with ESMTP; 01 Dec 2020 14:46:13 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan2.pdx.amazon.com [10.236.137.194]) by email-inbound-relay-2a-1c1b5cdd.us-west-2.amazon.com (Postfix) with ESMTPS id E8F0AA188D; Tue, 1 Dec 2020 14:46:12 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:46:12 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:46:07 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 05/11] tcp: Migrate TCP_NEW_SYN_RECV requests. Date: Tue, 1 Dec 2020 23:44:12 +0900 Message-ID: <20201201144418.35045-6-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch renames reuseport_select_sock() to __reuseport_select_sock() and adds two wrapper function of it to pass the migration type defined in the previous commit. reuseport_select_sock : BPF_SK_REUSEPORT_MIGRATE_NO reuseport_select_migrated_sock : BPF_SK_REUSEPORT_MIGRATE_REQUEST As mentioned before, we have to select a new listener for TCP_NEW_SYN_RECV requests at receiving the final ACK or sending a SYN+ACK. Therefore, this patch also changes the code to call reuseport_select_migrated_sock() even if the listening socket is TCP_CLOSE. If we can pick out a listening socket from the reuseport group, we rewrite request_sock.rsk_listener and resume processing the request. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- include/net/inet_connection_sock.h | 12 +++++++++++ include/net/request_sock.h | 13 ++++++++++++ include/net/sock_reuseport.h | 8 +++---- net/core/sock_reuseport.c | 34 ++++++++++++++++++++++++------ net/ipv4/inet_connection_sock.c | 13 ++++++++++-- net/ipv4/tcp_ipv4.c | 9 ++++++-- net/ipv6/tcp_ipv6.c | 9 ++++++-- 7 files changed, 81 insertions(+), 17 deletions(-) diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h index 2ea2d743f8fc..1e0958f5eb21 100644 --- a/include/net/inet_connection_sock.h +++ b/include/net/inet_connection_sock.h @@ -272,6 +272,18 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk) reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue); } +static inline void inet_csk_reqsk_queue_migrated(struct sock *sk, + struct sock *nsk, + struct request_sock *req) +{ + reqsk_queue_migrated(&inet_csk(sk)->icsk_accept_queue, + &inet_csk(nsk)->icsk_accept_queue, + req); + sock_put(sk); + sock_hold(nsk); + req->rsk_listener = nsk; +} + static inline int inet_csk_reqsk_queue_len(const struct sock *sk) { return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue); diff --git a/include/net/request_sock.h b/include/net/request_sock.h index 29e41ff3ec93..d18ba0b857cc 100644 --- a/include/net/request_sock.h +++ b/include/net/request_sock.h @@ -226,6 +226,19 @@ static inline void reqsk_queue_added(struct request_sock_queue *queue) atomic_inc(&queue->qlen); } +static inline void reqsk_queue_migrated(struct request_sock_queue *old_accept_queue, + struct request_sock_queue *new_accept_queue, + const struct request_sock *req) +{ + atomic_dec(&old_accept_queue->qlen); + atomic_inc(&new_accept_queue->qlen); + + if (req->num_timeout == 0) { + atomic_dec(&old_accept_queue->young); + atomic_inc(&new_accept_queue->young); + } +} + static inline int reqsk_queue_len(const struct request_sock_queue *queue) { return atomic_read(&queue->qlen); diff --git a/include/net/sock_reuseport.h b/include/net/sock_reuseport.h index 09a1b1539d4c..a48259a974be 100644 --- a/include/net/sock_reuseport.h +++ b/include/net/sock_reuseport.h @@ -32,10 +32,10 @@ extern int reuseport_alloc(struct sock *sk, bool bind_inany); extern int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany); extern struct sock *reuseport_detach_sock(struct sock *sk); -extern struct sock *reuseport_select_sock(struct sock *sk, - u32 hash, - struct sk_buff *skb, - int hdr_len); +extern struct sock *reuseport_select_sock(struct sock *sk, u32 hash, + struct sk_buff *skb, int hdr_len); +extern struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash, + struct sk_buff *skb); extern int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog); extern int reuseport_detach_prog(struct sock *sk); diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 60d7c1f28809..b4fe0829c9ab 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -202,7 +202,7 @@ int reuseport_add_sock(struct sock *sk, struct sock *sk2, bool bind_inany) } reuse->socks[reuse->num_socks] = sk; - /* paired with smp_rmb() in reuseport_select_sock() */ + /* paired with smp_rmb() in __reuseport_select_sock() */ smp_wmb(); reuse->num_socks++; rcu_assign_pointer(sk->sk_reuseport_cb, reuse); @@ -313,12 +313,13 @@ static struct sock *run_bpf_filter(struct sock_reuseport *reuse, u16 socks, * @hdr_len: BPF filter expects skb data pointer at payload data. If * the skb does not yet point at the payload, this parameter represents * how far the pointer needs to advance to reach the payload. + * @migration: represents if it is selecting a listener for SYN or + * migrating ESTABLISHED/SYN_RECV sockets or NEW_SYN_RECV socket. * Returns a socket that should receive the packet (or NULL on error). */ -struct sock *reuseport_select_sock(struct sock *sk, - u32 hash, - struct sk_buff *skb, - int hdr_len) +struct sock *__reuseport_select_sock(struct sock *sk, u32 hash, + struct sk_buff *skb, int hdr_len, + u8 migration) { struct sock_reuseport *reuse; struct bpf_prog *prog; @@ -332,13 +333,19 @@ struct sock *reuseport_select_sock(struct sock *sk, if (!reuse) goto out; - prog = rcu_dereference(reuse->prog); socks = READ_ONCE(reuse->num_socks); if (likely(socks)) { /* paired with smp_wmb() in reuseport_add_sock() */ smp_rmb(); - if (!prog || !skb) + prog = rcu_dereference(reuse->prog); + if (!prog) + goto select_by_hash; + + if (migration) + goto out; + + if (!skb) goto select_by_hash; if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) @@ -367,8 +374,21 @@ struct sock *reuseport_select_sock(struct sock *sk, rcu_read_unlock(); return sk2; } + +struct sock *reuseport_select_sock(struct sock *sk, u32 hash, + struct sk_buff *skb, int hdr_len) +{ + return __reuseport_select_sock(sk, hash, skb, hdr_len, BPF_SK_REUSEPORT_MIGRATE_NO); +} EXPORT_SYMBOL(reuseport_select_sock); +struct sock *reuseport_select_migrated_sock(struct sock *sk, u32 hash, + struct sk_buff *skb) +{ + return __reuseport_select_sock(sk, hash, skb, 0, BPF_SK_REUSEPORT_MIGRATE_REQUEST); +} +EXPORT_SYMBOL(reuseport_select_migrated_sock); + int reuseport_attach_prog(struct sock *sk, struct bpf_prog *prog) { struct sock_reuseport *reuse; diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c index 361efe55b1ad..e71653c6eae2 100644 --- a/net/ipv4/inet_connection_sock.c +++ b/net/ipv4/inet_connection_sock.c @@ -743,8 +743,17 @@ static void reqsk_timer_handler(struct timer_list *t) struct request_sock_queue *queue = &icsk->icsk_accept_queue; int max_syn_ack_retries, qlen, expire = 0, resend = 0; - if (inet_sk_state_load(sk_listener) != TCP_LISTEN) - goto drop; + if (inet_sk_state_load(sk_listener) != TCP_LISTEN) { + sk_listener = reuseport_select_migrated_sock(sk_listener, + req_to_sk(req)->sk_hash, NULL); + if (!sk_listener) { + sk_listener = req->rsk_listener; + goto drop; + } + inet_csk_reqsk_queue_migrated(req->rsk_listener, sk_listener, req); + icsk = inet_csk(sk_listener); + queue = &icsk->icsk_accept_queue; + } max_syn_ack_retries = icsk->icsk_syn_retries ? : net->ipv4.sysctl_tcp_synack_retries; /* Normally all the openreqs are young and become mature diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index e4b31e70bd30..9a9aa27c6069 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1973,8 +1973,13 @@ int tcp_v4_rcv(struct sk_buff *skb) goto csum_error; } if (unlikely(sk->sk_state != TCP_LISTEN)) { - inet_csk_reqsk_queue_drop_and_put(sk, req); - goto lookup; + nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb); + if (!nsk) { + inet_csk_reqsk_queue_drop_and_put(sk, req); + goto lookup; + } + inet_csk_reqsk_queue_migrated(sk, nsk, req); + sk = nsk; } /* We own a reference on the listener, increase it again * as we might lose it too soon. diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c index 992cbf3eb9e3..ff11f3c0cb96 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1635,8 +1635,13 @@ INDIRECT_CALLABLE_SCOPE int tcp_v6_rcv(struct sk_buff *skb) goto csum_error; } if (unlikely(sk->sk_state != TCP_LISTEN)) { - inet_csk_reqsk_queue_drop_and_put(sk, req); - goto lookup; + nsk = reuseport_select_migrated_sock(sk, req_to_sk(req)->sk_hash, skb); + if (!nsk) { + inet_csk_reqsk_queue_drop_and_put(sk, req); + goto lookup; + } + inet_csk_reqsk_queue_migrated(sk, nsk, req); + sk = nsk; } sock_hold(sk); refcounted = true; From patchwork Tue Dec 1 14:44:13 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 336660 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 82A43C83013 for ; Tue, 1 Dec 2020 14:47:32 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 4F4A820758 for ; Tue, 1 Dec 2020 14:47:32 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="Bebrghwc" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2403977AbgLAOrM (ORCPT ); Tue, 1 Dec 2020 09:47:12 -0500 Received: from smtp-fw-33001.amazon.com ([207.171.190.10]:64982 "EHLO smtp-fw-33001.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2389033AbgLAOrK (ORCPT ); Tue, 1 Dec 2020 09:47:10 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606834030; x=1638370030; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=1djFn+jagjiayQ5HBXRcUZo+k3FTO2zB5G1LxznygzM=; b=BebrghwcFbLcs8l7zp8gK+OwTT8RQ9OijlVHh3PeFJyEjJBo4/ypR4+0 MVqbr7mGN6WbhxOg7M3cRZJDqIAKURcGM6/ulm7ewrzXRT3OJBVYxufIO N0fK2F/L0EzHdqHkPQ7nOarHAsglbAO1llXPO3ve7hF7aynxs7glCHVR2 k=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="99547297" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2b-c7131dcf.us-west-2.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-33001.sea14.amazon.com with ESMTP; 01 Dec 2020 14:46:29 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan2.pdx.amazon.com [10.236.137.194]) by email-inbound-relay-2b-c7131dcf.us-west-2.amazon.com (Postfix) with ESMTPS id 3DF29A1BB0; Tue, 1 Dec 2020 14:46:28 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:46:27 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:46:23 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 06/11] bpf: Introduce two attach types for BPF_PROG_TYPE_SK_REUSEPORT. Date: Tue, 1 Dec 2020 23:44:13 +0900 Message-ID: <20201201144418.35045-7-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This commit adds new bpf_attach_type for BPF_PROG_TYPE_SK_REUSEPORT to check if the attached eBPF program is capable of migrating sockets. When the eBPF program is attached, the kernel runs it for socket migration only if the expected_attach_type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. The kernel will change the behaviour depending on the returned value: - SK_PASS with selected_sk, select it as a new listener - SK_PASS with selected_sk NULL, fall back to the random selection - SK_DROP, cancel the migration Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 2 ++ kernel/bpf/syscall.c | 8 ++++++++ tools/include/uapi/linux/bpf.h | 2 ++ 3 files changed, 12 insertions(+) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index 85278deff439..cfc207ae7782 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -241,6 +241,8 @@ enum bpf_attach_type { BPF_XDP_CPUMAP, BPF_SK_LOOKUP, BPF_XDP, + BPF_SK_REUSEPORT_SELECT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, __MAX_BPF_ATTACH_TYPE }; diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index f3fe9f53f93c..a0796a8de5ea 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -2036,6 +2036,14 @@ bpf_prog_load_check_attach(enum bpf_prog_type prog_type, if (expected_attach_type == BPF_SK_LOOKUP) return 0; return -EINVAL; + case BPF_PROG_TYPE_SK_REUSEPORT: + switch (expected_attach_type) { + case BPF_SK_REUSEPORT_SELECT: + case BPF_SK_REUSEPORT_SELECT_OR_MIGRATE: + return 0; + default: + return -EINVAL; + } case BPF_PROG_TYPE_EXT: if (expected_attach_type) return -EINVAL; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 85278deff439..cfc207ae7782 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -241,6 +241,8 @@ enum bpf_attach_type { BPF_XDP_CPUMAP, BPF_SK_LOOKUP, BPF_XDP, + BPF_SK_REUSEPORT_SELECT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, __MAX_BPF_ATTACH_TYPE }; From patchwork Tue Dec 1 14:44:14 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 335627 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9B19AC83019 for ; Tue, 1 Dec 2020 14:47:33 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 440D220758 for ; Tue, 1 Dec 2020 14:47:33 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="KvAw6aUC" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404002AbgLAOr2 (ORCPT ); Tue, 1 Dec 2020 09:47:28 -0500 Received: from smtp-fw-2101.amazon.com ([72.21.196.25]:5495 "EHLO smtp-fw-2101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2403991AbgLAOr2 (ORCPT ); Tue, 1 Dec 2020 09:47:28 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606834047; x=1638370047; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=TmovmZFg0uGwgULxNT+qzpNNcmmliehv0wNj2j0+gi0=; b=KvAw6aUCHc/716dOdupPYvAJtlZ+wQEQGdqJbFKMXRbKg0NWowF3Eoh8 hS6RPxUyI9sGz/SjUrV6AhGJi8or9v5KGLczAHz5OT2HaH3kI0OYSt7ie JohLu7mvwZ0xpyJ9/+uMZ0Kgu34Ninwel72QIfs/RzcBtm4k9tiwS1s3v Y=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="66687217" Received: from iad12-co-svc-p1-lb1-vlan3.amazon.com (HELO email-inbound-relay-2a-41350382.us-west-2.amazon.com) ([10.43.8.6]) by smtp-border-fw-out-2101.iad2.amazon.com with ESMTP; 01 Dec 2020 14:46:43 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2a-41350382.us-west-2.amazon.com (Postfix) with ESMTPS id D4D33C2CDA; Tue, 1 Dec 2020 14:46:42 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:46:42 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:46:37 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 07/11] libbpf: Set expected_attach_type for BPF_PROG_TYPE_SK_REUSEPORT. Date: Tue, 1 Dec 2020 23:44:14 +0900 Message-ID: <20201201144418.35045-8-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This commit introduces a new section (sk_reuseport/migrate) and sets expected_attach_type to two each section in BPF_PROG_TYPE_SK_REUSEPORT program. Signed-off-by: Kuniyuki Iwashima --- tools/lib/bpf/libbpf.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c index 28baee7ba1ca..bbb3902a0e41 100644 --- a/tools/lib/bpf/libbpf.c +++ b/tools/lib/bpf/libbpf.c @@ -8237,7 +8237,10 @@ static struct bpf_link *attach_iter(const struct bpf_sec_def *sec, static const struct bpf_sec_def section_defs[] = { BPF_PROG_SEC("socket", BPF_PROG_TYPE_SOCKET_FILTER), - BPF_PROG_SEC("sk_reuseport", BPF_PROG_TYPE_SK_REUSEPORT), + BPF_EAPROG_SEC("sk_reuseport/migrate", BPF_PROG_TYPE_SK_REUSEPORT, + BPF_SK_REUSEPORT_SELECT_OR_MIGRATE), + BPF_EAPROG_SEC("sk_reuseport", BPF_PROG_TYPE_SK_REUSEPORT, + BPF_SK_REUSEPORT_SELECT), SEC_DEF("kprobe/", KPROBE, .attach_fn = attach_kprobe), BPF_PROG_SEC("uprobe/", BPF_PROG_TYPE_KPROBE), From patchwork Tue Dec 1 14:44:15 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 336659 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 783BFC64E7B for ; Tue, 1 Dec 2020 14:48:05 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 178FB20758 for ; Tue, 1 Dec 2020 14:48:05 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="knXeO2CQ" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404015AbgLAOrn (ORCPT ); Tue, 1 Dec 2020 09:47:43 -0500 Received: from smtp-fw-9101.amazon.com ([207.171.184.25]:51548 "EHLO smtp-fw-9101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2388491AbgLAOrm (ORCPT ); Tue, 1 Dec 2020 09:47:42 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606834062; x=1638370062; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=vaoGiMEYdHqZuIObsn8FiWWZW4pERkBSeGbRWDs9t6A=; b=knXeO2CQ4ICS2zr7ufDTOLDpm6SML1jsEfT5gMBXqHMq5PQY9FpXQk9r YnYTrVMbST3J56MWvgGxcS2icTz4I/vUIT/NbJXHcOc4QQ2F7IsgE002T raFser05ZAV6btOLwvHcL8CLmqNF9eNesSA40v4j7p3blKt3yWZr5jLcH E=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="92542548" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2b-4ff6265a.us-west-2.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9101.sea19.amazon.com with ESMTP; 01 Dec 2020 14:47:01 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2b-4ff6265a.us-west-2.amazon.com (Postfix) with ESMTPS id 653DCA244D; Tue, 1 Dec 2020 14:46:58 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:46:57 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:46:53 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 08/11] bpf: Add migration to sk_reuseport_(kern|md). Date: Tue, 1 Dec 2020 23:44:15 +0900 Message-ID: <20201201144418.35045-9-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds u8 migration field to sk_reuseport_kern and sk_reuseport_md to signal the eBPF program if the kernel calls it for selecting a listener for SYN or migrating sockets in the accept queue or an immature socket during 3WHS. Note that this field is accessible only if the attached type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Link: https://lore.kernel.org/netdev/20201123003828.xjpjdtk4ygl6tg6h@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/linux/bpf.h | 1 + include/linux/filter.h | 4 ++-- include/uapi/linux/bpf.h | 1 + net/core/filter.c | 15 ++++++++++++--- net/core/sock_reuseport.c | 2 +- tools/include/uapi/linux/bpf.h | 1 + 6 files changed, 18 insertions(+), 6 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 581b2a2e78eb..244f823f1f84 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -1897,6 +1897,7 @@ struct sk_reuseport_kern { u32 hash; u32 reuseport_id; bool bind_inany; + u8 migration; }; bool bpf_tcp_sock_is_valid_access(int off, int size, enum bpf_access_type type, struct bpf_insn_access_aux *info); diff --git a/include/linux/filter.h b/include/linux/filter.h index 1b62397bd124..15d5bf13a905 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -967,12 +967,12 @@ void bpf_warn_invalid_xdp_action(u32 act); #ifdef CONFIG_INET struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, - u32 hash); + u32 hash, u8 migration); #else static inline struct sock * bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, - u32 hash) + u32 hash, u8 migration) { return NULL; } diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index cfc207ae7782..efe342bf3dbc 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -4419,6 +4419,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + __u8 migration; /* Migration type */ }; #define BPF_TAG_SIZE 8 diff --git a/net/core/filter.c b/net/core/filter.c index 2ca5eecebacf..0a0634787bb4 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -9853,7 +9853,7 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf, static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, struct sock_reuseport *reuse, struct sock *sk, struct sk_buff *skb, - u32 hash) + u32 hash, u8 migration) { reuse_kern->skb = skb; reuse_kern->sk = sk; @@ -9862,16 +9862,17 @@ static void bpf_init_reuseport_kern(struct sk_reuseport_kern *reuse_kern, reuse_kern->hash = hash; reuse_kern->reuseport_id = reuse->reuseport_id; reuse_kern->bind_inany = reuse->bind_inany; + reuse_kern->migration = migration; } struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, struct bpf_prog *prog, struct sk_buff *skb, - u32 hash) + u32 hash, u8 migration) { struct sk_reuseport_kern reuse_kern; enum sk_action action; - bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash); + bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration); action = BPF_PROG_RUN(prog, &reuse_kern); if (action == SK_PASS) @@ -10010,6 +10011,10 @@ sk_reuseport_is_valid_access(int off, int size, case offsetof(struct sk_reuseport_md, hash): return size == size_default; + case bpf_ctx_range(struct sk_reuseport_md, migration): + return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE && + size == sizeof(__u8); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10082,6 +10087,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, bind_inany): SK_REUSEPORT_LOAD_FIELD(bind_inany); break; + + case offsetof(struct sk_reuseport_md, migration): + SK_REUSEPORT_LOAD_FIELD(migration); + break; } return insn - insn_buf; diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index b4fe0829c9ab..96d65b4c6974 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -349,7 +349,7 @@ struct sock *__reuseport_select_sock(struct sock *sk, u32 hash, goto select_by_hash; if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) - sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash); + sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration); else sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len); diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index cfc207ae7782..efe342bf3dbc 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -4419,6 +4419,7 @@ struct sk_reuseport_md { __u32 ip_protocol; /* IP protocol. e.g. IPPROTO_TCP, IPPROTO_UDP */ __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ + __u8 migration; /* Migration type */ }; #define BPF_TAG_SIZE 8 From patchwork Tue Dec 1 14:44:16 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 335626 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AAC4CC83012 for ; Tue, 1 Dec 2020 14:48:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 156B1204EA for ; Tue, 1 Dec 2020 14:48:06 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="MY3HlCgP" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404025AbgLAOr7 (ORCPT ); Tue, 1 Dec 2020 09:47:59 -0500 Received: from smtp-fw-9101.amazon.com ([207.171.184.25]:51548 "EHLO smtp-fw-9101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2391677AbgLAOr7 (ORCPT ); Tue, 1 Dec 2020 09:47:59 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606834078; x=1638370078; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=NUzw8SbA3vX4dvNrDqnZIUBmoUk7I8Pv3AQv8zbmeyA=; b=MY3HlCgPnVO7uCNWjKoMDZVRNDV6H93CNQ4OtQQwUK7EA/RTVSV6zEcq +Ss4owcWpdw9OrT6L6gvuGo9Mxkbgno6i1GrR1OrhDe9fu0IwBwr0HHTL dZv1Xiyo7OsTUzPmlPYRmKze8/pgzW+MwnguIek1isyt2rNOdA+/KA3ui w=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="92542631" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2b-baacba05.us-west-2.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9101.sea19.amazon.com with ESMTP; 01 Dec 2020 14:47:19 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan2.pdx.amazon.com [10.236.137.194]) by email-inbound-relay-2b-baacba05.us-west-2.amazon.com (Postfix) with ESMTPS id 05C32A1960; Tue, 1 Dec 2020 14:47:18 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:47:18 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:47:08 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 09/11] bpf: Support bpf_get_socket_cookie_sock() for BPF_PROG_TYPE_SK_REUSEPORT. Date: Tue, 1 Dec 2020 23:44:16 +0900 Message-ID: <20201201144418.35045-10-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org We will call sock_reuseport.prog for socket migration in the next commit, so the eBPF program has to know which listener is closing in order to select the new listener. Currently, we can get a unique ID for each listener in the userspace by calling bpf_map_lookup_elem() for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map. This patch makes the sk pointer available in sk_reuseport_md so that we can get the ID by BPF_FUNC_get_socket_cookie() in the eBPF program. Link: https://lore.kernel.org/netdev/20201119001154.kapwihc2plp4f7zc@kafai-mbp.dhcp.thefacebook.com/ Suggested-by: Martin KaFai Lau Signed-off-by: Kuniyuki Iwashima --- include/uapi/linux/bpf.h | 8 ++++++++ net/core/filter.c | 12 +++++++++++- tools/include/uapi/linux/bpf.h | 8 ++++++++ 3 files changed, 27 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h index efe342bf3dbc..3e9b8bd42b4e 100644 --- a/include/uapi/linux/bpf.h +++ b/include/uapi/linux/bpf.h @@ -1650,6 +1650,13 @@ union bpf_attr { * A 8-byte long non-decreasing number on success, or 0 if the * socket field is missing inside *skb*. * + * u64 bpf_get_socket_cookie(struct bpf_sock *sk) + * Description + * Equivalent to bpf_get_socket_cookie() helper that accepts + * *skb*, but gets socket from **struct bpf_sock** context. + * Return + * A 8-byte long non-decreasing number. + * * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) * Description * Equivalent to bpf_get_socket_cookie() helper that accepts @@ -4420,6 +4427,7 @@ struct sk_reuseport_md { __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ __u8 migration; /* Migration type */ + __bpf_md_ptr(struct bpf_sock *, sk); /* current listening socket */ }; #define BPF_TAG_SIZE 8 diff --git a/net/core/filter.c b/net/core/filter.c index 0a0634787bb4..1059d31847ef 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -4628,7 +4628,7 @@ static const struct bpf_func_proto bpf_get_socket_cookie_sock_proto = { .func = bpf_get_socket_cookie_sock, .gpl_only = false, .ret_type = RET_INTEGER, - .arg1_type = ARG_PTR_TO_CTX, + .arg1_type = ARG_PTR_TO_SOCKET, }; BPF_CALL_1(bpf_get_socket_cookie_sock_ops, struct bpf_sock_ops_kern *, ctx) @@ -9982,6 +9982,8 @@ sk_reuseport_func_proto(enum bpf_func_id func_id, return &sk_reuseport_load_bytes_proto; case BPF_FUNC_skb_load_bytes_relative: return &sk_reuseport_load_bytes_relative_proto; + case BPF_FUNC_get_socket_cookie: + return &bpf_get_socket_cookie_sock_proto; default: return bpf_base_func_proto(func_id); } @@ -10015,6 +10017,10 @@ sk_reuseport_is_valid_access(int off, int size, return prog->expected_attach_type == BPF_SK_REUSEPORT_SELECT_OR_MIGRATE && size == sizeof(__u8); + case offsetof(struct sk_reuseport_md, sk): + info->reg_type = PTR_TO_SOCKET; + return size == sizeof(__u64); + /* Fields that allow narrowing */ case bpf_ctx_range(struct sk_reuseport_md, eth_protocol): if (size < sizeof_field(struct sk_buff, protocol)) @@ -10091,6 +10097,10 @@ static u32 sk_reuseport_convert_ctx_access(enum bpf_access_type type, case offsetof(struct sk_reuseport_md, migration): SK_REUSEPORT_LOAD_FIELD(migration); break; + + case offsetof(struct sk_reuseport_md, sk): + SK_REUSEPORT_LOAD_FIELD(sk); + break; } return insn - insn_buf; diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index efe342bf3dbc..3e9b8bd42b4e 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -1650,6 +1650,13 @@ union bpf_attr { * A 8-byte long non-decreasing number on success, or 0 if the * socket field is missing inside *skb*. * + * u64 bpf_get_socket_cookie(struct bpf_sock *sk) + * Description + * Equivalent to bpf_get_socket_cookie() helper that accepts + * *skb*, but gets socket from **struct bpf_sock** context. + * Return + * A 8-byte long non-decreasing number. + * * u64 bpf_get_socket_cookie(struct bpf_sock_addr *ctx) * Description * Equivalent to bpf_get_socket_cookie() helper that accepts @@ -4420,6 +4427,7 @@ struct sk_reuseport_md { __u32 bind_inany; /* Is sock bound to an INANY address? */ __u32 hash; /* A hash of the packet 4 tuples */ __u8 migration; /* Migration type */ + __bpf_md_ptr(struct bpf_sock *, sk); /* current listening socket */ }; #define BPF_TAG_SIZE 8 From patchwork Tue Dec 1 14:44:17 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 335625 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-20.2 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E2996C64E7B for ; Tue, 1 Dec 2020 14:48:41 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 84AD020758 for ; Tue, 1 Dec 2020 14:48:41 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="vWyY2DKG" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404051AbgLAOsM (ORCPT ); Tue, 1 Dec 2020 09:48:12 -0500 Received: from smtp-fw-6002.amazon.com ([52.95.49.90]:63105 "EHLO smtp-fw-6002.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2404026AbgLAOsL (ORCPT ); Tue, 1 Dec 2020 09:48:11 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606834092; x=1638370092; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=L3BDjC4r+KlsYF9bVtRPpv4evDg4B9zvwuwACl+vsMQ=; b=vWyY2DKGcYFohSyYJaYvhI62sUUUbqHilbziB0cFZT58M2SjHUCRfnQV mQxPkBkvYNjxffA9OIpsMakKDjx6wdd6M0gsoQ7uUctsMdoogQW8Sgj+S 8k+u0CiIBiRaIm3IrerkdZugCeNbP7AnLS55JNxNcMFM4XZFm9VOmAKFS c=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="68318988" Received: from iad12-co-svc-p1-lb1-vlan2.amazon.com (HELO email-inbound-relay-2c-2225282c.us-west-2.amazon.com) ([10.43.8.2]) by smtp-border-fw-out-6002.iad6.amazon.com with ESMTP; 01 Dec 2020 14:47:30 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2c-2225282c.us-west-2.amazon.com (Postfix) with ESMTPS id A5B4BA1EBA; Tue, 1 Dec 2020 14:47:28 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:47:27 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:47:23 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 10/11] bpf: Call bpf_run_sk_reuseport() for socket migration. Date: Tue, 1 Dec 2020 23:44:17 +0900 Message-ID: <20201201144418.35045-11-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch supports socket migration by eBPF. If the attached type is BPF_SK_REUSEPORT_SELECT_OR_MIGRATE, we can select a new listener by BPF_FUNC_sk_select_reuseport(). Also, we can cancel migration by returning SK_DROP. This feature is useful when listeners have different settings at the socket API level or when we want to free resources as soon as possible. There are two noteworthy points. The first is that we select a listening socket in reuseport_detach_sock() and __reuseport_select_sock(), but we do not have struct skb at closing a listener or retransmitting a SYN+ACK. However, some helper functions do not expect skb is NULL (e.g. skb_header_pointer() in BPF_FUNC_skb_load_bytes(), skb_tail_pointer() in BPF_FUNC_skb_load_bytes_relative()). So, we allocate an empty skb temporarily before running the eBPF program. The second is that we do not have struct request_sock in unhash path, and the sk_hash of the listener is always zero. Thus, we pass zero as hash to bpf_run_sk_reuseport(). Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- net/core/filter.c | 19 +++++++++++++++++++ net/core/sock_reuseport.c | 19 ++++++++++--------- net/ipv4/inet_hashtables.c | 2 +- 3 files changed, 30 insertions(+), 10 deletions(-) diff --git a/net/core/filter.c b/net/core/filter.c index 1059d31847ef..2f2fb77cdb72 100644 --- a/net/core/filter.c +++ b/net/core/filter.c @@ -9871,10 +9871,29 @@ struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk, { struct sk_reuseport_kern reuse_kern; enum sk_action action; + bool allocated = false; + + if (migration) { + /* cancel migration for possibly incapable eBPF program */ + if (prog->expected_attach_type != BPF_SK_REUSEPORT_SELECT_OR_MIGRATE) + return ERR_PTR(-ENOTSUPP); + + if (!skb) { + allocated = true; + skb = alloc_skb(0, GFP_ATOMIC); + if (!skb) + return ERR_PTR(-ENOMEM); + } + } else if (!skb) { + return NULL; /* fall back to select by hash */ + } bpf_init_reuseport_kern(&reuse_kern, reuse, sk, skb, hash, migration); action = BPF_PROG_RUN(prog, &reuse_kern); + if (allocated) + kfree_skb(skb); + if (action == SK_PASS) return reuse_kern.selected_sk; else diff --git a/net/core/sock_reuseport.c b/net/core/sock_reuseport.c index 96d65b4c6974..6b475897b496 100644 --- a/net/core/sock_reuseport.c +++ b/net/core/sock_reuseport.c @@ -247,8 +247,15 @@ struct sock *reuseport_detach_sock(struct sock *sk) prog = rcu_dereference(reuse->prog); if (sk->sk_protocol == IPPROTO_TCP) { - if (reuse->num_socks && !prog) - nsk = i == reuse->num_socks ? reuse->socks[i - 1] : reuse->socks[i]; + if (reuse->num_socks) { + if (prog) + nsk = bpf_run_sk_reuseport(reuse, sk, prog, NULL, 0, + BPF_SK_REUSEPORT_MIGRATE_QUEUE); + + if (!nsk) + nsk = i == reuse->num_socks ? + reuse->socks[i - 1] : reuse->socks[i]; + } reuse->num_closed_socks++; reuse->socks[reuse->max_socks - reuse->num_closed_socks] = sk; @@ -342,15 +349,9 @@ struct sock *__reuseport_select_sock(struct sock *sk, u32 hash, if (!prog) goto select_by_hash; - if (migration) - goto out; - - if (!skb) - goto select_by_hash; - if (prog->type == BPF_PROG_TYPE_SK_REUSEPORT) sk2 = bpf_run_sk_reuseport(reuse, sk, prog, skb, hash, migration); - else + else if (!skb) sk2 = run_bpf_filter(reuse, socks, prog, skb, hdr_len); select_by_hash: diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c index 545538a6bfac..59f58740c20d 100644 --- a/net/ipv4/inet_hashtables.c +++ b/net/ipv4/inet_hashtables.c @@ -699,7 +699,7 @@ void inet_unhash(struct sock *sk) if (rcu_access_pointer(sk->sk_reuseport_cb)) { nsk = reuseport_detach_sock(sk); - if (nsk) + if (!IS_ERR_OR_NULL(nsk)) inet_csk_reqsk_queue_migrate(sk, nsk); } From patchwork Tue Dec 1 14:44:18 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Kuniyuki Iwashima X-Patchwork-Id: 336658 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.4 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNWANTED_LANGUAGE_BODY, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 699B2C64E7A for ; Tue, 1 Dec 2020 14:48:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 15D2D2080A for ; Tue, 1 Dec 2020 14:48:07 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=amazon.co.jp header.i=@amazon.co.jp header.b="ca0KbJdC" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S2404038AbgLAOsE (ORCPT ); Tue, 1 Dec 2020 09:48:04 -0500 Received: from smtp-fw-9101.amazon.com ([207.171.184.25]:51644 "EHLO smtp-fw-9101.amazon.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S2404026AbgLAOsB (ORCPT ); Tue, 1 Dec 2020 09:48:01 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=amazon.co.jp; i=@amazon.co.jp; q=dns/txt; s=amazon201209; t=1606834080; x=1638370080; h=from:to:cc:subject:date:message-id:in-reply-to: references:mime-version; bh=PfwueMD1yNf+zcMEg3MW90GiebWy2GJDZGjeZTQJa5M=; b=ca0KbJdC0l2tKx5UGNCxiFGke9LAbF3MR21tSg8Yrsn2qnu+vvQLG41x VlOb101LWJMPxqvEAhmb4O+tZ7FHF/wvdKjRQbURkoJA2niISUyx4vdvw oRsiR0wrCusDMCcOspeZE0f35yQ+jgJDai5gH1IyzQlLvCv7A8sl0Sdhj E=; X-IronPort-AV: E=Sophos;i="5.78,384,1599523200"; d="scan'208";a="92542747" Received: from sea32-co-svc-lb4-vlan3.sea.corp.amazon.com (HELO email-inbound-relay-2c-397e131e.us-west-2.amazon.com) ([10.47.23.38]) by smtp-border-fw-out-9101.sea19.amazon.com with ESMTP; 01 Dec 2020 14:47:44 +0000 Received: from EX13MTAUWB001.ant.amazon.com (pdx1-ws-svc-p6-lb9-vlan3.pdx.amazon.com [10.236.137.198]) by email-inbound-relay-2c-397e131e.us-west-2.amazon.com (Postfix) with ESMTPS id 2EE14A21DF; Tue, 1 Dec 2020 14:47:43 +0000 (UTC) Received: from EX13D04ANC001.ant.amazon.com (10.43.157.89) by EX13MTAUWB001.ant.amazon.com (10.43.161.207) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:47:42 +0000 Received: from 38f9d3582de7.ant.amazon.com (10.43.162.146) by EX13D04ANC001.ant.amazon.com (10.43.157.89) with Microsoft SMTP Server (TLS) id 15.0.1497.2; Tue, 1 Dec 2020 14:47:37 +0000 From: Kuniyuki Iwashima To: "David S . Miller" , Jakub Kicinski , Eric Dumazet , Alexei Starovoitov , Daniel Borkmann , Martin KaFai Lau CC: Benjamin Herrenschmidt , Kuniyuki Iwashima , Kuniyuki Iwashima , , , , Subject: [PATCH v1 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Date: Tue, 1 Dec 2020 23:44:18 +0900 Message-ID: <20201201144418.35045-12-kuniyu@amazon.co.jp> X-Mailer: git-send-email 2.17.2 (Apple Git-113) In-Reply-To: <20201201144418.35045-1-kuniyu@amazon.co.jp> References: <20201201144418.35045-1-kuniyu@amazon.co.jp> MIME-Version: 1.0 X-Originating-IP: [10.43.162.146] X-ClientProxiedBy: EX13D36UWA004.ant.amazon.com (10.43.160.175) To EX13D04ANC001.ant.amazon.com (10.43.157.89) Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE. Reviewed-by: Benjamin Herrenschmidt Signed-off-by: Kuniyuki Iwashima --- .../bpf/prog_tests/migrate_reuseport.c | 164 ++++++++++++++++++ .../bpf/progs/test_migrate_reuseport_kern.c | 54 ++++++ 2 files changed, 218 insertions(+) create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c diff --git a/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c new file mode 100644 index 000000000000..87c72d9ccadd --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c @@ -0,0 +1,164 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. call listen() for 5 server sockets. + * 2. update a map to migrate all child socket + * to the last server socket (migrate_map[cookie] = 4) + * 3. call connect() for 25 client sockets. + * 4. call close() for first 4 server sockets. + * 5. call accept() for the last server socket. + * + * Author: Kuniyuki Iwashima + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#define NUM_SOCKS 5 +#define LOCALHOST "127.0.0.1" +#define err_exit(condition, message) \ + do { \ + if (condition) { \ + perror("ERROR: " message " "); \ + exit(1); \ + } \ + } while (0) + +__u64 server_fds[NUM_SOCKS]; +int prog_fd, reuseport_map_fd, migrate_map_fd; + + +void setup_bpf(void) +{ + struct bpf_object *obj; + struct bpf_program *prog; + struct bpf_map *reuseport_map, *migrate_map; + int err; + + obj = bpf_object__open("test_migrate_reuseport_kern.o"); + err_exit(libbpf_get_error(obj), "opening BPF object file failed"); + + err = bpf_object__load(obj); + err_exit(err, "loading BPF object failed"); + + prog = bpf_program__next(NULL, obj); + err_exit(!prog, "loading BPF program failed"); + + reuseport_map = bpf_object__find_map_by_name(obj, "reuseport_map"); + err_exit(!reuseport_map, "loading BPF reuseport_map failed"); + + migrate_map = bpf_object__find_map_by_name(obj, "migrate_map"); + err_exit(!migrate_map, "loading BPF migrate_map failed"); + + prog_fd = bpf_program__fd(prog); + reuseport_map_fd = bpf_map__fd(reuseport_map); + migrate_map_fd = bpf_map__fd(migrate_map); +} + +void test_listen(void) +{ + struct sockaddr_in addr; + socklen_t addr_len = sizeof(addr); + int i, err, optval = 1, migrated_to = NUM_SOCKS - 1; + __u64 value; + + addr.sin_family = AF_INET; + addr.sin_port = htons(80); + inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr); + + for (i = 0; i < NUM_SOCKS; i++) { + server_fds[i] = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + err_exit(server_fds[i] == -1, "socket() for listener sockets failed"); + + err = setsockopt(server_fds[i], SOL_SOCKET, SO_REUSEPORT, + &optval, sizeof(optval)); + err_exit(err == -1, "setsockopt() for SO_REUSEPORT failed"); + + if (i == 0) { + err = setsockopt(server_fds[i], SOL_SOCKET, SO_ATTACH_REUSEPORT_EBPF, + &prog_fd, sizeof(prog_fd)); + err_exit(err == -1, "setsockopt() for SO_ATTACH_REUSEPORT_EBPF failed"); + } + + err = bind(server_fds[i], (struct sockaddr *)&addr, addr_len); + err_exit(err == -1, "bind() failed"); + + err = listen(server_fds[i], 32); + err_exit(err == -1, "listen() failed"); + + err = bpf_map_update_elem(reuseport_map_fd, &i, &server_fds[i], BPF_NOEXIST); + err_exit(err == -1, "updating BPF reuseport_map failed"); + + err = bpf_map_lookup_elem(reuseport_map_fd, &i, &value); + err_exit(err == -1, "looking up BPF reuseport_map failed"); + + printf("fd[%d] (cookie: %llu) -> fd[%d]\n", i, value, migrated_to); + err = bpf_map_update_elem(migrate_map_fd, &value, &migrated_to, BPF_NOEXIST); + err_exit(err == -1, "updating BPF migrate_map failed"); + } +} + +void test_connect(void) +{ + struct sockaddr_in addr; + socklen_t addr_len = sizeof(addr); + int i, err, client_fd; + + addr.sin_family = AF_INET; + addr.sin_port = htons(80); + inet_pton(AF_INET, LOCALHOST, &addr.sin_addr.s_addr); + + for (i = 0; i < NUM_SOCKS * 5; i++) { + client_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + err_exit(client_fd == -1, "socket() for listener sockets failed"); + + err = connect(client_fd, (struct sockaddr *)&addr, addr_len); + err_exit(err == -1, "connect() failed"); + + close(client_fd); + } +} + +void test_close(void) +{ + int i; + + for (i = 0; i < NUM_SOCKS - 1; i++) + close(server_fds[i]); +} + +void test_accept(void) +{ + struct sockaddr_in addr; + socklen_t addr_len = sizeof(addr); + int cnt, client_fd; + + fcntl(server_fds[NUM_SOCKS - 1], F_SETFL, O_NONBLOCK); + + for (cnt = 0; cnt < NUM_SOCKS * 5; cnt++) { + client_fd = accept(server_fds[NUM_SOCKS - 1], (struct sockaddr *)&addr, &addr_len); + err_exit(client_fd == -1, "accept() failed"); + } + + printf("%d accepted, %d is expected\n", cnt, NUM_SOCKS * 5); +} + +int main(void) +{ + setup_bpf(); + test_listen(); + test_connect(); + test_close(); + test_accept(); + close(server_fds[NUM_SOCKS - 1]); + return 0; +} diff --git a/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c b/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c new file mode 100644 index 000000000000..28d007b3a7a7 --- /dev/null +++ b/tools/testing/selftests/bpf/progs/test_migrate_reuseport_kern.c @@ -0,0 +1,54 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Check if we can migrate child sockets. + * + * 1. If reuse_md->migration is 0 (SYN packet), + * return SK_PASS without selecting a listener. + * 2. If reuse_md->migration is not 0 (socket migration), + * select a listener (reuseport_map[migrate_map[cookie]]) + * + * Author: Kuniyuki Iwashima + */ + +#include +#include + +#define NULL ((void *)0) + +int _version SEC("version") = 1; + +struct bpf_map_def SEC("maps") reuseport_map = { + .type = BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, + .key_size = sizeof(int), + .value_size = sizeof(__u64), + .max_entries = 256, +}; + +struct bpf_map_def SEC("maps") migrate_map = { + .type = BPF_MAP_TYPE_HASH, + .key_size = sizeof(__u64), + .value_size = sizeof(int), + .max_entries = 256, +}; + +SEC("sk_reuseport/migrate") +int select_by_skb_data(struct sk_reuseport_md *reuse_md) +{ + int *key, flags = 0; + __u64 cookie; + + if (!reuse_md->migration) + return SK_PASS; + + cookie = bpf_get_socket_cookie(reuse_md->sk); + + key = bpf_map_lookup_elem(&migrate_map, &cookie); + if (key == NULL) + return SK_DROP; + + bpf_sk_select_reuseport(reuse_md, &reuseport_map, key, flags); + + return SK_PASS; +} + +char _license[] SEC("license") = "GPL";