From patchwork Fri Sep 11 13:51:57 2020
X-Patchwork-Submitter: Paolo Abeni
X-Patchwork-Id: 261059
From: Paolo Abeni
To: netdev@vger.kernel.org
Cc: "David S. Miller", Eric Dumazet, mptcp@lists.01.org
Subject: [PATCH net-next 02/13] mptcp: set data_ready status bit in subflow_check_data_avail()
Date: Fri, 11 Sep 2020 15:51:57 +0200
Message-Id: <10b65904849e08dfccab8da33219d2081341f581.1599832097.git.pabeni@redhat.com>

This simplifies mptcp_subflow_data_available() and will make the
follow-up patches simpler.

Additionally, remove the unneeded checks on subflow copied_seq: we
always move whole skbs out of the subflows.
Signed-off-by: Paolo Abeni
---
 net/mptcp/subflow.c | 19 ++++++++-----------
 1 file changed, 8 insertions(+), 11 deletions(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 7ae1d3604047..53b455c3c229 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -825,6 +825,8 @@ static bool subflow_check_data_avail(struct sock *ssk)
 	pr_debug("msk=%p ssk=%p data_avail=%d skb=%p", subflow->conn, ssk,
 		 subflow->data_avail, skb_peek(&ssk->sk_receive_queue));
+	if (!skb_peek(&ssk->sk_receive_queue))
+		subflow->data_avail = 0;
 	if (subflow->data_avail)
 		return true;
 
@@ -849,6 +851,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 		subflow->map_data_len = skb->len;
 		subflow->map_subflow_seq = tcp_sk(ssk)->copied_seq -
 					   subflow->ssn_offset;
+		subflow->data_avail = 1;
 		return true;
 	}
 
@@ -876,8 +879,10 @@ static bool subflow_check_data_avail(struct sock *ssk)
 		ack_seq = mptcp_subflow_get_mapped_dsn(subflow);
 		pr_debug("msk ack_seq=%llx subflow ack_seq=%llx", old_ack,
 			 ack_seq);
-		if (ack_seq == old_ack)
+		if (ack_seq == old_ack) {
+			subflow->data_avail = 1;
 			break;
+		}
 
 		/* only accept in-sequence mapping. Old values are spurious
 		 * retransmission; we can hit "future" values on active backup
@@ -922,13 +927,13 @@ static bool subflow_check_data_avail(struct sock *ssk)
 	ssk->sk_error_report(ssk);
 	tcp_set_state(ssk, TCP_CLOSE);
 	tcp_send_active_reset(ssk, GFP_ATOMIC);
+	subflow->data_avail = 0;
 	return false;
 }
 
 bool mptcp_subflow_data_available(struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
-	struct sk_buff *skb;
 
 	/* check if current mapping is still valid */
 	if (subflow->map_valid &&
@@ -941,15 +946,7 @@ bool mptcp_subflow_data_available(struct sock *sk)
 			 subflow->map_data_len);
 	}
 
-	if (!subflow_check_data_avail(sk)) {
-		subflow->data_avail = 0;
-		return false;
-	}
-
-	skb = skb_peek(&sk->sk_receive_queue);
-	subflow->data_avail = skb &&
-		before(tcp_sk(sk)->copied_seq, TCP_SKB_CB(skb)->end_seq);
-	return subflow->data_avail;
+	return subflow_check_data_avail(sk);
 }
 
 /* If ssk has an mptcp parent socket, use the mptcp rcvbuf occupancy,

From patchwork Fri Sep 11 13:51:58 2020
X-Patchwork-Submitter: Paolo Abeni
X-Patchwork-Id: 261051
From: Paolo Abeni
To: netdev@vger.kernel.org
Cc: "David S. Miller", Eric Dumazet, mptcp@lists.01.org
Subject: [PATCH net-next 03/13] mptcp: trigger msk processing even for OoO data
Date: Fri, 11 Sep 2020 15:51:58 +0200
Message-Id: <10d123b243cb850962982ae7d46eeac49ca63332.1599832097.git.pabeni@redhat.com>

This is a prerequisite for receiving data from multiple subflows
without re-injection. Instead of dropping the OoO - "future" - data
in subflow_check_data_avail(), call into __mptcp_move_skbs() and let
the msk drop it.

To avoid code duplication, factor out the
mptcp_subflow_discard_data() helper.

Note that __mptcp_move_skbs() can now find multiple subflows with
data available (including to-be-discarded data), so it must update
the byte counter incrementally.

Signed-off-by: Paolo Abeni
---
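[Editor's note: the following stand-alone C sketch is not part of the
patch. It models the tri-state data_avail contract this patch
introduces; classify() is a made-up helper for illustration, while
before64()/after64() follow the definitions in net/mptcp/protocol.h
and the comparison mirrors the ack_seq/old_ack logic added to
subflow_check_data_avail() below.]

#include <stdint.h>
#include <stdio.h>

enum mptcp_data_avail {
	MPTCP_SUBFLOW_NODATA,		/* nothing usable on this subflow */
	MPTCP_SUBFLOW_DATA_AVAIL,	/* mapping is exactly in sequence */
	MPTCP_SUBFLOW_OOO_DATA		/* "future" data: handled at msk level */
};

static int before64(uint64_t seq1, uint64_t seq2)
{
	return (int64_t)(seq1 - seq2) < 0;	/* wraparound-safe */
}

#define after64(seq2, seq1)	before64(seq1, seq2)

/* old_ack: msk-level ack (next expected DSN); ack_seq: mapping's DSN */
static enum mptcp_data_avail classify(uint64_t old_ack, uint64_t ack_seq)
{
	if (ack_seq == old_ack)
		return MPTCP_SUBFLOW_DATA_AVAIL;
	if (after64(ack_seq, old_ack))
		return MPTCP_SUBFLOW_OOO_DATA;
	/* old data: spurious retransmission, discarded via
	 * mptcp_subflow_discard_data()
	 */
	return MPTCP_SUBFLOW_NODATA;
}

int main(void)
{
	printf("%d %d %d\n",
	       classify(1000, 1000),	/* 1: in sequence */
	       classify(1000, 4000),	/* 2: future, msk reorders or discards */
	       classify(1000, 500));	/* 0: old, dropped on the subflow */
	return 0;
}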
 net/mptcp/protocol.c | 33 +++++++++++++++----
 net/mptcp/protocol.h |  9 ++++-
 net/mptcp/subflow.c  | 78 ++++++++++++++++++++++++--------------------
 3 files changed, 78 insertions(+), 42 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 854a8b3b9ecd..95573c6f7762 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -167,7 +167,8 @@ static bool mptcp_subflow_dsn_valid(const struct mptcp_sock *msk,
 		return true;
 
 	subflow->data_avail = 0;
-	return mptcp_subflow_data_available(ssk);
+	mptcp_subflow_data_available(ssk);
+	return subflow->data_avail == MPTCP_SUBFLOW_DATA_AVAIL;
 }
 
 static void mptcp_check_data_fin_ack(struct sock *sk)
@@ -313,11 +314,18 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 	struct tcp_sock *tp;
 	bool done = false;
 
-	if (!mptcp_subflow_dsn_valid(msk, ssk)) {
-		*bytes = 0;
+	pr_debug("msk=%p ssk=%p data avail=%d valid=%d empty=%d",
+		 msk, ssk, subflow->data_avail,
+		 mptcp_subflow_dsn_valid(msk, ssk),
+		 !skb_peek(&ssk->sk_receive_queue));
+	if (subflow->data_avail == MPTCP_SUBFLOW_OOO_DATA) {
+		mptcp_subflow_discard_data(ssk, subflow->map_data_len);
 		return false;
 	}
 
+	if (!mptcp_subflow_dsn_valid(msk, ssk))
+		return false;
+
 	tp = tcp_sk(ssk);
 	do {
 		u32 map_remaining, offset;
@@ -376,7 +384,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 		}
 	} while (more_data_avail);
 
-	*bytes = moved;
+	*bytes += moved;
 
 	/* If the moves have caught up with the DATA_FIN sequence number
 	 * it's time to ack the DATA_FIN and change socket state, but
@@ -415,9 +423,17 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 
 void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct mptcp_sock *msk = mptcp_sk(sk);
+	bool wake;
 
-	set_bit(MPTCP_DATA_READY, &msk->flags);
+	/* move_skbs_to_msk below can legitly clear the data_avail flag,
+	 * but we will need later to properly woke the reader, cache its
+	 * value
+	 */
+	wake = subflow->data_avail == MPTCP_SUBFLOW_DATA_AVAIL;
+	if (wake)
+		set_bit(MPTCP_DATA_READY, &msk->flags);
 
 	if (atomic_read(&sk->sk_rmem_alloc) < READ_ONCE(sk->sk_rcvbuf) &&
 	    move_skbs_to_msk(msk, ssk))
@@ -438,7 +454,8 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 		move_skbs_to_msk(msk, ssk);
 	}
 wake:
-	sk->sk_data_ready(sk);
+	if (wake)
+		sk->sk_data_ready(sk);
 }
 
 static void __mptcp_flush_join_list(struct mptcp_sock *msk)
@@ -1281,6 +1298,9 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		set_bit(MPTCP_DATA_READY, &msk->flags);
 	}
 out_err:
+	pr_debug("msk=%p data_ready=%d rx queue empty=%d copied=%d",
+		 msk, test_bit(MPTCP_DATA_READY, &msk->flags),
+		 skb_queue_empty(&sk->sk_receive_queue), copied);
 	mptcp_rcv_space_adjust(msk, copied);
 
 	release_sock(sk);
@@ -2308,6 +2328,7 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 	sock_poll_wait(file, sock, wait);
 
 	state = inet_sk_state_load(sk);
+	pr_debug("msk=%p state=%d flags=%lx", msk, state, msk->flags);
 	if (state == TCP_LISTEN)
 		return mptcp_check_readable(msk);
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 60b27d44c184..981e395abb46 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -268,6 +268,12 @@ mptcp_subflow_rsk(const struct request_sock *rsk)
 	return (struct mptcp_subflow_request_sock *)rsk;
}
 
+enum mptcp_data_avail {
+	MPTCP_SUBFLOW_NODATA,
+	MPTCP_SUBFLOW_DATA_AVAIL,
+	MPTCP_SUBFLOW_OOO_DATA
+};
+
 /* MPTCP subflow context */
 struct mptcp_subflow_context
 {
 	struct	list_head node;/* conn_list of subflows */
@@ -292,10 +298,10 @@ struct mptcp_subflow_context {
 		map_valid : 1,
 		mpc_map : 1,
 		backup : 1,
-		data_avail : 1,
 		rx_eof : 1,
 		use_64bit_ack : 1, /* Set when we received a 64-bit DSN */
 		can_ack : 1;	/* only after processing the remote a key */
+	enum mptcp_data_avail data_avail;
 	u32	remote_nonce;
 	u64	thmac;
 	u32	local_nonce;
@@ -347,6 +353,7 @@ int mptcp_is_enabled(struct net *net);
 void mptcp_subflow_fully_established(struct mptcp_subflow_context *subflow,
 				     struct mptcp_options_received *mp_opt);
 bool mptcp_subflow_data_available(struct sock *sk);
+int mptcp_subflow_discard_data(struct sock *sk, unsigned limit);
 void __init mptcp_subflow_init(void);
 
 /* called with sk socket lock held */
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 53b455c3c229..a983beb8e6a6 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -816,6 +816,40 @@ static int subflow_read_actor(read_descriptor_t *desc,
 	return copy_len;
 }
 
+int mptcp_subflow_discard_data(struct sock *ssk, unsigned limit)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	u32 map_remaining;
+	size_t delta;
+
+	map_remaining = subflow->map_data_len -
+			mptcp_subflow_get_map_offset(subflow);
+	delta = min_t(size_t, limit, map_remaining);
+
+	/* discard mapped data */
+	pr_debug("discarding %zu bytes, current map len=%d", delta,
+		 map_remaining);
+	if (delta) {
+		read_descriptor_t desc = {
+			.count = delta,
+		};
+		int ret;
+
+		ret = tcp_read_sock(ssk, &desc, subflow_read_actor);
+		if (ret < 0) {
+			ssk->sk_err = -ret;
+			return ret;
+		}
+		if (ret < delta)
+			return 0;
+		if (delta == map_remaining) {
+			subflow->data_avail = 0;
+			subflow->map_valid = 0;
+		}
+	}
+	return 0;
+}
+
 static bool subflow_check_data_avail(struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -832,8 +866,6 @@ static bool subflow_check_data_avail(struct sock *ssk)
 	msk = mptcp_sk(subflow->conn);
 	for (;;) {
-		u32 map_remaining;
-		size_t delta;
 		u64 ack_seq;
 		u64 old_ack;
 
@@ -851,7 +883,7 @@ static bool subflow_check_data_avail(struct sock *ssk)
 			subflow->map_data_len = skb->len;
 			subflow->map_subflow_seq = tcp_sk(ssk)->copied_seq -
 						   subflow->ssn_offset;
-			subflow->data_avail = 1;
+			subflow->data_avail = MPTCP_SUBFLOW_DATA_AVAIL;
 			return true;
 		}
 
@@ -880,43 +912,19 @@ static bool subflow_check_data_avail(struct sock *ssk)
 		pr_debug("msk ack_seq=%llx subflow ack_seq=%llx", old_ack,
 			 ack_seq);
 		if (ack_seq == old_ack) {
-			subflow->data_avail = 1;
+			subflow->data_avail = MPTCP_SUBFLOW_DATA_AVAIL;
+			break;
+		} else if (after64(ack_seq, old_ack)) {
+			subflow->data_avail = MPTCP_SUBFLOW_OOO_DATA;
 			break;
 		}
 
 		/* only accept in-sequence mapping. Old values are spurious
-		 * retransmission; we can hit "future" values on active backup
-		 * subflow switch, we relay on retransmissions to get
-		 * in-sequence data.
-		 * Cuncurrent subflows support will require subflow data
-		 * reordering
+		 * retransmission
 		 */
-		map_remaining = subflow->map_data_len -
-				mptcp_subflow_get_map_offset(subflow);
-		if (before64(ack_seq, old_ack))
-			delta = min_t(size_t, old_ack - ack_seq, map_remaining);
-		else
-			delta = min_t(size_t, ack_seq - old_ack, map_remaining);
-
-		/* discard mapped data */
-		pr_debug("discarding %zu bytes, current map len=%d", delta,
-			 map_remaining);
-		if (delta) {
-			read_descriptor_t desc = {
-				.count = delta,
-			};
-			int ret;
-
-			ret = tcp_read_sock(ssk, &desc, subflow_read_actor);
-			if (ret < 0) {
-				ssk->sk_err = -ret;
-				goto fatal;
-			}
-			if (ret < delta)
-				return false;
-			if (delta == map_remaining)
-				subflow->map_valid = 0;
-		}
+		if (mptcp_subflow_discard_data(ssk, old_ack - ack_seq))
+			goto fatal;
+		return false;
 	}
 
 	return true;

From patchwork Fri Sep 11 13:52:00 2020
X-Patchwork-Submitter: Paolo Abeni
X-Patchwork-Id: 261054
From: Paolo Abeni
To: netdev@vger.kernel.org
Miller" , Eric Dumazet , mptcp@lists.01.org Subject: [PATCH net-next 05/13] mptcp: introduce and use mptcp_try_coalesce() Date: Fri, 11 Sep 2020 15:52:00 +0200 Message-Id: <24130f7d45c2a2e595b45b3d27b0290e9c0a1318.1599832097.git.pabeni@redhat.com> In-Reply-To: References: MIME-Version: 1.0 X-Scanned-By: MIMEDefang 2.79 on 10.5.11.16 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Factor-out existing code, will be re-used by the next patch. Signed-off-by: Paolo Abeni --- net/mptcp/protocol.c | 31 +++++++++++++++++++------------ 1 file changed, 19 insertions(+), 12 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 4f12a8ce0ddd..5a2ff333e426 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -110,6 +110,22 @@ static int __mptcp_socket_create(struct mptcp_sock *msk) return 0; } +static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to, + struct sk_buff *from) +{ + bool fragstolen; + int delta; + + if (MPTCP_SKB_CB(from)->offset || + !skb_try_coalesce(to, from, &fragstolen, &delta)) + return false; + + kfree_skb_partial(from, fragstolen); + atomic_add(delta, &sk->sk_rmem_alloc); + sk_mem_charge(sk, delta); + return true; +} + static void __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk, struct sk_buff *skb, unsigned int offset, size_t copy_len) @@ -121,24 +137,15 @@ static void __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk, skb_ext_reset(skb); skb_orphan(skb); + MPTCP_SKB_CB(skb)->offset = offset; msk->ack_seq += copy_len; tail = skb_peek_tail(&sk->sk_receive_queue); - if (offset == 0 && tail) { - bool fragstolen; - int delta; - - if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) { - kfree_skb_partial(skb, fragstolen); - atomic_add(delta, &sk->sk_rmem_alloc); - sk_mem_charge(sk, delta); - return; - } - } + if (tail && mptcp_try_coalesce(sk, tail, skb)) + return; skb_set_owner_r(skb, sk); __skb_queue_tail(&sk->sk_receive_queue, skb); - MPTCP_SKB_CB(skb)->offset = offset; } static void mptcp_stop_timer(struct sock *sk) From patchwork Fri Sep 11 13:52:01 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 261048 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-9.9 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH, MAILING_LIST_MULTI, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4BE8FC433E2 for ; Fri, 11 Sep 2020 16:50:42 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D2FBB20770 for ; Fri, 11 Sep 2020 16:50:41 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b="Z+Jt5khy" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726425AbgIKQuU (ORCPT ); Fri, 11 Sep 2020 12:50:20 -0400 Received: from us-smtp-2.mimecast.com ([205.139.110.61]:59812 "EHLO us-smtp-delivery-1.mimecast.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1726177AbgIKPGZ (ORCPT ); Fri, 11 Sep 2020 11:06:25 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; 
From: Paolo Abeni
To: netdev@vger.kernel.org
Cc: "David S. Miller", Eric Dumazet, mptcp@lists.01.org
Subject: [PATCH net-next 06/13] mptcp: move ooo skbs into msk out of order queue
Date: Fri, 11 Sep 2020 15:52:01 +0200
Message-Id: <906a26a2292583b90dc82e955d85fa1e0f3fddce.1599832097.git.pabeni@redhat.com>

Add an RB-tree to cope with OoO (at the MPTCP level) data.
__mptcp_move_skb() inserts "future" data into the RB tree,
coalescing skbs when the MPTCP DSNs allow it.

To simplify sequence accounting, move the DSN inside the skb cb.

After successfully enqueuing in-sequence data, check if we can use
any data from the RB tree.

Additionally, move the data_fin check after spooling data from the
OoO tree; otherwise we could miss shutdown events.

The RB tree code is copied as verbatim as possible from
tcp_data_queue_ofo(), with a few simplifications due to the fact that
MPTCP doesn't need to cope with sacks. All bugs here are added by me.

Signed-off-by: Paolo Abeni
---
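[Editor's note: the following stand-alone C sketch is not part of the
patch. It models the ordering and coalescing rules used by the OoO
queue below: before64()/after64() follow net/mptcp/protocol.h, and
try_coalesce() mirrors mptcp_ooo_try_coalesce(), which merges only
perfectly adjacent DSN ranges. struct seg and the demo values in
main() are invented stand-ins; the real queue is an rb-tree of
sk_buffs keyed by MPTCP_SKB_CB(skb)->map_seq.]

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool before64(uint64_t seq1, uint64_t seq2)
{
	return (int64_t)(seq1 - seq2) < 0;	/* wraparound-safe */
}

#define after64(seq2, seq1)	before64(seq1, seq2)

struct seg {			/* stand-in for MPTCP_SKB_CB(skb) */
	uint64_t map_seq;	/* first DSN covered by the skb */
	uint64_t end_seq;	/* one past the last covered DSN */
};

/* merge 'from' into 'to' only when the data is exactly adjacent */
static bool try_coalesce(struct seg *to, const struct seg *from)
{
	if (from->map_seq != to->end_seq)
		return false;
	to->end_seq = from->end_seq;
	return true;
}

int main(void)
{
	struct seg tail = { .map_seq = 100, .end_seq = 200 };
	struct seg next = { .map_seq = 200, .end_seq = 300 };
	struct seg hole = { .map_seq = 400, .end_seq = 500 };
	uint64_t ack_seq = 100;	/* msk->ack_seq: next in-sequence DSN */

	printf("adjacent merge: %d\n", try_coalesce(&tail, &next));	/* 1 */
	printf("gap not merged: %d\n", try_coalesce(&tail, &hole));	/* 0 */

	/* duplicate test from mptcp_ofo_queue(): fully-acked data is dropped */
	printf("dup? %d\n", !after64(hole.end_seq, ack_seq));		/* 0 */

	/* the comparisons stay correct across u64 wraparound */
	printf("wrap: %d\n", before64(~0ULL - 5, 10));			/* 1 */
	return 0;
}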
 net/mptcp/protocol.c | 264 ++++++++++++++++++++++++++++++++++---------
 net/mptcp/protocol.h |   2 +
 net/mptcp/subflow.c  |   1 +
 3 files changed, 211 insertions(+), 56 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 5a2ff333e426..6fbfbab51660 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -32,6 +32,8 @@ struct mptcp6_sock {
 #endif
 
 struct mptcp_skb_cb {
+	u64 map_seq;
+	u64 end_seq;
 	u32 offset;
 };
 
@@ -110,6 +112,12 @@ static int __mptcp_socket_create(struct mptcp_sock *msk)
 	return 0;
 }
 
+static void mptcp_drop(struct sock *sk, struct sk_buff *skb)
+{
+	sk_drops_add(sk, skb);
+	__kfree_skb(skb);
+}
+
 static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 			       struct sk_buff *from)
 {
@@ -120,16 +128,127 @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 	    !skb_try_coalesce(to, from, &fragstolen, &delta))
 		return false;
 
+	MPTCP_SKB_CB(to)->end_seq = MPTCP_SKB_CB(from)->end_seq;
 	kfree_skb_partial(from, fragstolen);
 	atomic_add(delta, &sk->sk_rmem_alloc);
 	sk_mem_charge(sk, delta);
 	return true;
 }
 
-static void __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
-			     struct sk_buff *skb,
-			     unsigned int offset, size_t copy_len)
+static bool mptcp_ooo_try_coalesce(struct mptcp_sock *msk, struct sk_buff *to,
+				   struct sk_buff *from)
+{
+	if (MPTCP_SKB_CB(from)->map_seq != MPTCP_SKB_CB(to)->end_seq)
+		return false;
+
+	return mptcp_try_coalesce((struct sock *)msk, to, from);
+}
+
+/* "inspired" by tcp_data_queue_ofo(), main differences:
+ * - use mptcp seqs
+ * - don't cope with sacks
+ */
+static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 {
+	struct sock *sk = (struct sock *)msk;
+	struct rb_node **p, *parent;
+	u64 seq, end_seq, max_seq;
+	struct sk_buff *skb1;
+
+	seq = MPTCP_SKB_CB(skb)->map_seq;
+	end_seq = MPTCP_SKB_CB(skb)->end_seq;
+	max_seq = tcp_space(sk);
+	max_seq = max_seq > 0 ? max_seq + msk->ack_seq : msk->ack_seq;
+
+	if (after64(seq, max_seq)) {
+		/* out of window */
+		mptcp_drop(sk, skb);
+		return;
+	}
+
+	p = &msk->out_of_order_queue.rb_node;
+	if (RB_EMPTY_ROOT(&msk->out_of_order_queue)) {
+		rb_link_node(&skb->rbnode, NULL, p);
+		rb_insert_color(&skb->rbnode, &msk->out_of_order_queue);
+		msk->ooo_last_skb = skb;
+		goto end;
+	}
+
+	/* with 2 subflows, adding at end of ooo queue is quite likely
+	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
+	 */
+	if (mptcp_ooo_try_coalesce(msk, msk->ooo_last_skb, skb))
+		return;
+
+	/* Can avoid an rbtree lookup if we are adding skb after ooo_last_skb */
+	if (!before64(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) {
+		parent = &msk->ooo_last_skb->rbnode;
+		p = &parent->rb_right;
+		goto insert;
+	}
+
+	/* Find place to insert this segment. Handle overlaps on the way. */
+	parent = NULL;
+	while (*p) {
+		parent = *p;
+		skb1 = rb_to_skb(parent);
+		if (before64(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
+			p = &parent->rb_left;
+			continue;
+		}
+		if (before64(seq, MPTCP_SKB_CB(skb1)->end_seq)) {
+			if (!after64(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) {
+				/* All the bits are present. Drop. */
+				mptcp_drop(sk, skb);
+				return;
+			}
+			if (after64(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
+				/* partial overlap:
+				 *     |     skb      |
+				 *  |     skb1    |
+				 * continue traversing
+				 */
+			} else {
+				/* skb's seq == skb1's seq and skb covers skb1.
+				 * Replace skb1 with skb.
+				 */
+				rb_replace_node(&skb1->rbnode, &skb->rbnode,
+						&msk->out_of_order_queue);
+				mptcp_drop(sk, skb1);
+				goto merge_right;
+			}
+		} else if (mptcp_ooo_try_coalesce(msk, skb1, skb)) {
+			return;
+		}
+		p = &parent->rb_right;
+	}
+insert:
+	/* Insert segment into RB tree. */
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, &msk->out_of_order_queue);
+
+merge_right:
+	/* Remove other segments covered by skb. */
+	while ((skb1 = skb_rb_next(skb)) != NULL) {
+		if (before64(end_seq, MPTCP_SKB_CB(skb1)->end_seq))
+			break;
+		rb_erase(&skb1->rbnode, &msk->out_of_order_queue);
+		mptcp_drop(sk, skb1);
+	}
+	/* If there is no skb after us, we are the last_skb ! */
+	if (!skb1)
+		msk->ooo_last_skb = skb;
+
+end:
+	skb_condense(skb);
+	skb_set_owner_r(skb, sk);
+}
+
+static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
+			     struct sk_buff *skb, unsigned int offset,
+			     size_t copy_len)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
 	struct sock *sk = (struct sock *)msk;
 	struct sk_buff *tail;
 
@@ -137,15 +256,35 @@ static void __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 	skb_ext_reset(skb);
 	skb_orphan(skb);
+
+	/* the skb map_seq accounts for the skb offset:
+	 * mptcp_subflow_get_mapped_dsn() is based on the current tp->copied_seq
+	 * value
+	 */
+	MPTCP_SKB_CB(skb)->map_seq = mptcp_subflow_get_mapped_dsn(subflow);
+	MPTCP_SKB_CB(skb)->end_seq = MPTCP_SKB_CB(skb)->map_seq + copy_len;
 	MPTCP_SKB_CB(skb)->offset = offset;
 
-	msk->ack_seq += copy_len;
-	tail = skb_peek_tail(&sk->sk_receive_queue);
-	if (tail && mptcp_try_coalesce(sk, tail, skb))
-		return;
+	if (MPTCP_SKB_CB(skb)->map_seq == msk->ack_seq) {
+		/* in sequence */
+		msk->ack_seq += copy_len;
+		tail = skb_peek_tail(&sk->sk_receive_queue);
+		if (tail && mptcp_try_coalesce(sk, tail, skb))
+			return true;
 
-	skb_set_owner_r(skb, sk);
-	__skb_queue_tail(&sk->sk_receive_queue, skb);
+		skb_set_owner_r(skb, sk);
+		__skb_queue_tail(&sk->sk_receive_queue, skb);
+		return true;
+	} else if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq)) {
+		mptcp_data_queue_ofo(msk, skb);
+		return false;
+	}
+
+	/* old data, keep it simple and drop the whole pkt, sender
+	 * will retransmit as needed, if needed.
+	 */
+	mptcp_drop(sk, skb);
+	return false;
 }
 
 static void mptcp_stop_timer(struct sock *sk)
@@ -156,28 +295,6 @@ static void mptcp_stop_timer(struct sock *sk)
 	mptcp_sk(sk)->timer_ival = 0;
 }
 
-/* both sockets must be locked */
-static bool mptcp_subflow_dsn_valid(const struct mptcp_sock *msk,
-				    struct sock *ssk)
-{
-	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
-	u64 dsn = mptcp_subflow_get_mapped_dsn(subflow);
-
-	/* revalidate data sequence number.
-	 *
-	 * mptcp_subflow_data_available() is usually called
-	 * without msk lock.  Its unlikely (but possible)
-	 * that msk->ack_seq has been advanced since the last
-	 * call found in-sequence data.
-	 */
-	if (likely(dsn == msk->ack_seq))
-		return true;
-
-	subflow->data_avail = 0;
-	mptcp_subflow_data_available(ssk);
-	return subflow->data_avail == MPTCP_SUBFLOW_DATA_AVAIL;
-}
-
 static void mptcp_check_data_fin_ack(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -321,18 +438,7 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 	struct tcp_sock *tp;
 	bool done = false;
 
-	pr_debug("msk=%p ssk=%p data avail=%d valid=%d empty=%d",
-		 msk, ssk, subflow->data_avail,
-		 mptcp_subflow_dsn_valid(msk, ssk),
-		 !skb_peek(&ssk->sk_receive_queue));
-	if (subflow->data_avail == MPTCP_SUBFLOW_OOO_DATA) {
-		mptcp_subflow_discard_data(ssk, subflow->map_data_len);
-		return false;
-	}
-
-	if (!mptcp_subflow_dsn_valid(msk, ssk))
-		return false;
-
+	pr_debug("msk=%p ssk=%p", msk, ssk);
 	tp = tcp_sk(ssk);
 	do {
 		u32 map_remaining, offset;
@@ -370,9 +476,9 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 		if (tp->urg_data)
 			done = true;
 
-		__mptcp_move_skb(msk, ssk, skb, offset, len);
+		if (__mptcp_move_skb(msk, ssk, skb, offset, len))
+			moved += len;
 		seq += len;
-		moved += len;
 
 		if (WARN_ON_ONCE(map_remaining < len))
 			break;
@@ -393,18 +499,47 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 
 	*bytes += moved;
 
-	/* If the moves have caught up with the DATA_FIN sequence number
-	 * it's time to ack the DATA_FIN and change socket state, but
-	 * this is not a good place to change state. Let the workqueue
-	 * do it.
-	 */
-	if (mptcp_pending_data_fin(sk, NULL) &&
-	    schedule_work(&msk->work))
-		sock_hold(sk);
-
 	return done;
 }
 
+static bool mptcp_ofo_queue(struct mptcp_sock *msk)
+{
+	struct sock *sk = (struct sock *)msk;
+	struct sk_buff *skb, *tail;
+	bool moved = false;
+	struct rb_node *p;
+	u64 end_seq;
+
+	p = rb_first(&msk->out_of_order_queue);
+	while (p) {
+		skb = rb_to_skb(p);
+		if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq))
+			break;
+
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, &msk->out_of_order_queue);
+
+		if (unlikely(!after64(MPTCP_SKB_CB(skb)->end_seq,
+				      msk->ack_seq))) {
+			mptcp_drop(sk, skb);
+			continue;
+		}
+
+		end_seq = MPTCP_SKB_CB(skb)->end_seq;
+		tail = skb_peek_tail(&sk->sk_receive_queue);
+		if (!tail || !mptcp_ooo_try_coalesce(msk, tail, skb)) {
+			int delta = msk->ack_seq - MPTCP_SKB_CB(skb)->map_seq;
+
+			/* skip overlapping data, if any */
+			MPTCP_SKB_CB(skb)->offset += delta;
+			__skb_queue_tail(&sk->sk_receive_queue, skb);
+		}
+		msk->ack_seq = end_seq;
+		moved = true;
+	}
+	return moved;
+}
+
 /* In most cases we will be able to lock the mptcp socket.  If its already
  * owned, we need to defer to the work queue to avoid ABBA deadlock.
  */
@@ -420,8 +555,19 @@ static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
 		return false;
 
 	/* must re-check after taking the lock */
-	if (!READ_ONCE(sk->sk_lock.owned))
+	if (!READ_ONCE(sk->sk_lock.owned)) {
 		__mptcp_move_skbs_from_subflow(msk, ssk, &moved);
+		mptcp_ofo_queue(msk);
+
+		/* If the moves have caught up with the DATA_FIN sequence number
+		 * it's time to ack the DATA_FIN and change socket state, but
+		 * this is not a good place to change state. Let the workqueue
+		 * do it.
+		 */
+		if (mptcp_pending_data_fin(sk, NULL) &&
+		    schedule_work(&msk->work))
+			sock_hold(sk);
+	}
 
 	spin_unlock_bh(&sk->sk_lock.slock);
 
@@ -1218,7 +1364,11 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
 		release_sock(ssk);
 	} while (!done);
 
-	return moved > 0;
+	if (mptcp_ofo_queue(msk) || moved > 0) {
+		mptcp_check_data_fin((struct sock *)msk);
+		return true;
+	}
+	return false;
 }
 
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
@@ -1530,6 +1680,7 @@ static int __mptcp_init_sock(struct sock *sk)
 	INIT_LIST_HEAD(&msk->rtx_queue);
 	__set_bit(MPTCP_SEND_SPACE, &msk->flags);
 	INIT_WORK(&msk->work, mptcp_worker);
+	msk->out_of_order_queue = RB_ROOT;
 	msk->first = NULL;
 	inet_csk(sk)->icsk_sync_mss = mptcp_sync_mss;
 
@@ -1869,6 +2020,7 @@ static void mptcp_destroy(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
 
+	skb_rbtree_purge(&msk->out_of_order_queue);
 	mptcp_token_destroy(msk);
 	if (msk->cached_ext)
 		__skb_ext_put(msk->cached_ext);
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 981e395abb46..e20154a33fa7 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -204,6 +204,8 @@ struct mptcp_sock {
 	bool		snd_data_fin_enable;
 	spinlock_t	join_list_lock;
 	struct work_struct work;
+	struct sk_buff  *ooo_last_skb;
+	struct rb_root  out_of_order_queue;
 	struct list_head conn_list;
 	struct list_head rtx_queue;
 	struct list_head join_list;
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index a983beb8e6a6..1f048a5bf120 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -434,6 +434,7 @@ static void mptcp_sock_destruct(struct sock *sk)
 		sock_orphan(sk);
 	}
 
+	skb_rbtree_purge(&mptcp_sk(sk)->out_of_order_queue);
 	mptcp_token_destroy(mptcp_sk(sk));
 	inet_sock_destruct(sk);
 }

From patchwork Fri Sep 11 13:52:03 2020
X-Patchwork-Submitter: Paolo Abeni
X-Patchwork-Id: 261049
From: Paolo Abeni
To: netdev@vger.kernel.org
Cc: "David S. Miller", Eric Dumazet, mptcp@lists.01.org
Subject: [PATCH net-next 08/13] mptcp: add OoO related mibs
Date: Fri, 11 Sep 2020 15:52:03 +0200

Add a bunch of MPTCP mibs related to MPTCP OoO data processing.

Signed-off-by: Paolo Abeni
---
 net/mptcp/mib.c      |  5 +++++
 net/mptcp/mib.h      |  5 +++++
 net/mptcp/protocol.c | 24 +++++++++++++++++++++++-
 net/mptcp/subflow.c  |  1 +
 4 files changed, 34 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/mib.c b/net/mptcp/mib.c
index 0a6a15f3456d..056986c7a228 100644
--- a/net/mptcp/mib.c
+++ b/net/mptcp/mib.c
@@ -22,6 +22,11 @@ static const struct snmp_mib mptcp_snmp_list[] = {
 	SNMP_MIB_ITEM("MPJoinAckHMacFailure", MPTCP_MIB_JOINACKMAC),
 	SNMP_MIB_ITEM("DSSNotMatching", MPTCP_MIB_DSSNOMATCH),
 	SNMP_MIB_ITEM("InfiniteMapRx", MPTCP_MIB_INFINITEMAPRX),
+	SNMP_MIB_ITEM("OFOQueueTail", MPTCP_MIB_OFOQUEUETAIL),
+	SNMP_MIB_ITEM("OFOQueue", MPTCP_MIB_OFOQUEUE),
+	SNMP_MIB_ITEM("OFOMerge", MPTCP_MIB_OFOMERGE),
+	SNMP_MIB_ITEM("NoDSSInWindow", MPTCP_MIB_NODSSWINDOW),
+	SNMP_MIB_ITEM("DuplicateData", MPTCP_MIB_DUPDATA),
 	SNMP_MIB_SENTINEL
 };
 
diff --git a/net/mptcp/mib.h b/net/mptcp/mib.h
index d7de340fc997..937a177729f1 100644
--- a/net/mptcp/mib.h
+++ b/net/mptcp/mib.h
@@ -15,6 +15,11 @@ enum linux_mptcp_mib_field {
 	MPTCP_MIB_JOINACKMAC,		/* HMAC was wrong on ACK + MP_JOIN */
 	MPTCP_MIB_DSSNOMATCH,		/* Received a new mapping that did not match the previous one */
 	MPTCP_MIB_INFINITEMAPRX,	/* Received an infinite mapping */
+	MPTCP_MIB_OFOQUEUETAIL,		/* Segments inserted into OoO queue tail */
+	MPTCP_MIB_OFOQUEUE,		/* Segments inserted into OoO queue */
+	MPTCP_MIB_OFOMERGE,		/* Segments merged in OoO queue */
+	MPTCP_MIB_NODSSWINDOW,		/* Segments not in MPTCP windows */
+	MPTCP_MIB_DUPDATA,		/* Segments discarded due to duplicate DSS */
 	__MPTCP_MIB_MAX
 };
 
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 6fbfbab51660..ec9c38d3acc7 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -128,6 +128,9 @@ static bool mptcp_try_coalesce(struct sock *sk, struct sk_buff *to,
 	    !skb_try_coalesce(to, from, &fragstolen, &delta))
 		return false;
 
+	pr_debug("colesced seq %llx into %llx new len %d new end seq %llx",
+		 MPTCP_SKB_CB(from)->map_seq, MPTCP_SKB_CB(to)->map_seq,
+		 to->len, MPTCP_SKB_CB(from)->end_seq);
 	MPTCP_SKB_CB(to)->end_seq = MPTCP_SKB_CB(from)->end_seq;
 	kfree_skb_partial(from, fragstolen);
 	atomic_add(delta, &sk->sk_rmem_alloc);
@@ -160,13 +163,17 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 	max_seq = tcp_space(sk);
 	max_seq = max_seq > 0 ? max_seq + msk->ack_seq : msk->ack_seq;
 
+	pr_debug("msk=%p seq=%llx limit=%llx empty=%d", msk, seq, max_seq,
+		 RB_EMPTY_ROOT(&msk->out_of_order_queue));
 	if (after64(seq, max_seq)) {
 		/* out of window */
 		mptcp_drop(sk, skb);
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_NODSSWINDOW);
 		return;
 	}
 
 	p = &msk->out_of_order_queue.rb_node;
+	MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUE);
 	if (RB_EMPTY_ROOT(&msk->out_of_order_queue)) {
 		rb_link_node(&skb->rbnode, NULL, p);
 		rb_insert_color(&skb->rbnode, &msk->out_of_order_queue);
@@ -177,11 +184,15 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 	/* with 2 subflows, adding at end of ooo queue is quite likely
 	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 	 */
-	if (mptcp_ooo_try_coalesce(msk, msk->ooo_last_skb, skb))
+	if (mptcp_ooo_try_coalesce(msk, msk->ooo_last_skb, skb)) {
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOMERGE);
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUETAIL);
 		return;
+	}
 
 	/* Can avoid an rbtree lookup if we are adding skb after ooo_last_skb */
 	if (!before64(seq, MPTCP_SKB_CB(msk->ooo_last_skb)->end_seq)) {
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOQUEUETAIL);
 		parent = &msk->ooo_last_skb->rbnode;
 		p = &parent->rb_right;
 		goto insert;
@@ -200,6 +211,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 			if (!after64(end_seq, MPTCP_SKB_CB(skb1)->end_seq)) {
 				/* All the bits are present. Drop. */
 				mptcp_drop(sk, skb);
+				MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
 				return;
 			}
 			if (after64(seq, MPTCP_SKB_CB(skb1)->map_seq)) {
@@ -215,13 +227,16 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 				rb_replace_node(&skb1->rbnode, &skb->rbnode,
 						&msk->out_of_order_queue);
 				mptcp_drop(sk, skb1);
+				MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
 				goto merge_right;
 			}
 		} else if (mptcp_ooo_try_coalesce(msk, skb1, skb)) {
+			MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_OFOMERGE);
 			return;
 		}
 		p = &parent->rb_right;
 	}
+
 insert:
 	/* Insert segment into RB tree. */
 	rb_link_node(&skb->rbnode, parent, p);
@@ -234,6 +249,7 @@ static void mptcp_data_queue_ofo(struct mptcp_sock *msk, struct sk_buff *skb)
 			break;
 		rb_erase(&skb1->rbnode, &msk->out_of_order_queue);
 		mptcp_drop(sk, skb1);
+		MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
 	}
 	/* If there is no skb after us, we are the last_skb ! */
 	if (!skb1)
@@ -283,6 +299,7 @@ static bool __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 	/* old data, keep it simple and drop the whole pkt, sender
 	 * will retransmit as needed, if needed.
 	 */
+	MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
 	mptcp_drop(sk, skb);
 	return false;
 }
@@ -511,6 +528,7 @@ static bool mptcp_ofo_queue(struct mptcp_sock *msk)
 	u64 end_seq;
 
 	p = rb_first(&msk->out_of_order_queue);
+	pr_debug("msk=%p empty=%d", msk, RB_EMPTY_ROOT(&msk->out_of_order_queue));
 	while (p) {
 		skb = rb_to_skb(p);
 		if (after64(MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq))
@@ -522,6 +540,7 @@ static bool mptcp_ofo_queue(struct mptcp_sock *msk)
 		if (unlikely(!after64(MPTCP_SKB_CB(skb)->end_seq,
 				      msk->ack_seq))) {
 			mptcp_drop(sk, skb);
+			MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_DUPDATA);
 			continue;
 		}
 
@@ -531,6 +550,9 @@ static bool mptcp_ofo_queue(struct mptcp_sock *msk)
 			int delta = msk->ack_seq - MPTCP_SKB_CB(skb)->map_seq;
 
 			/* skip overlapping data, if any */
+			pr_debug("uncoalesced seq=%llx ack seq=%llx delta=%d",
+				 MPTCP_SKB_CB(skb)->map_seq, msk->ack_seq,
+				 delta);
 			MPTCP_SKB_CB(skb)->offset += delta;
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
 		}
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index c4c174749865..d304ce1743eb 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -816,6 +816,7 @@ static void mptcp_subflow_discard_data(struct sock *ssk, struct sk_buff *skb,
 	pr_debug("discarding=%d len=%d seq=%d", incr, skb->len,
 		 subflow->map_subflow_seq);
+	MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_DUPDATA);
 	tcp_sk(ssk)->copied_seq += incr;
 	if (!before(tcp_sk(ssk)->copied_seq, TCP_SKB_CB(skb)->end_seq))
 		sk_eat_skb(ssk, skb);

From patchwork Fri Sep 11 13:52:05 2020
X-Patchwork-Submitter: Paolo Abeni
X-Patchwork-Id: 261050
From: Paolo Abeni
To: netdev@vger.kernel.org
Cc: "David S. Miller", Eric Dumazet, mptcp@lists.01.org
Subject: [PATCH net-next 10/13] mptcp: allow creating non-backup subflows
Date: Fri, 11 Sep 2020 15:52:05 +0200

Currently the 'backup' attribute of the local endpoint is ignored.
Let's use it for the MP_JOIN handshake.

Signed-off-by: Paolo Abeni
---
 net/mptcp/subflow.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 9edcce21715b..58f2349930a5 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -20,6 +20,7 @@
 #include <net/ip6_route.h>
 #endif
 #include <net/mptcp.h>
+#include <uapi/linux/mptcp.h>
 #include "protocol.h"
 #include "mib.h"
 
@@ -1090,7 +1091,7 @@ int __mptcp_subflow_connect(struct sock *sk, const struct mptcp_addr_info *loc,
 	subflow->remote_token = remote_token;
 	subflow->local_id = local_id;
 	subflow->request_join = 1;
-	subflow->request_bkup = 1;
+	subflow->request_bkup = !!(loc->flags & MPTCP_PM_ADDR_FLAG_BACKUP);
 	mptcp_info2sockaddr(remote, &addr);
 
 	err = kernel_connect(sf, (struct sockaddr *)&addr, addrlen, O_NONBLOCK);

From patchwork Fri Sep 11 13:52:06 2020
X-Patchwork-Submitter: Paolo Abeni
X-Patchwork-Id: 261052
From: Paolo Abeni
To: netdev@vger.kernel.org
Cc: "David S. Miller", Eric Dumazet, mptcp@lists.01.org
Subject: [PATCH net-next 11/13] mptcp: allow picking different xmit subflows
Date: Fri, 11 Sep 2020 15:52:06 +0200
Message-Id: <3b8e364293d3cbb0348f20ca14301200aa43bc24.1599832097.git.pabeni@redhat.com>

Update the scheduler to a less trivial heuristic: cache the last used
subflow, and try to send on it a reasonably long burst of data. When
the burst or the subflow send space is exhausted, pick the subflow
with the lower ratio between write space and send buffer - that is,
the subflow with the greater relative amount of free space.

Signed-off-by: Paolo Abeni
---
 net/mptcp/protocol.c | 109 ++++++++++++++++++++++++++++++++++++-------
 net/mptcp/protocol.h |   6 ++-
 2 files changed, 97 insertions(+), 18 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index ec9c38d3acc7..148c4e685ecd 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -1031,41 +1031,103 @@ static void mptcp_nospace(struct mptcp_sock *msk)
 	}
 }
 
+static bool mptcp_subflow_active(struct mptcp_subflow_context *subflow)
+{
+	struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
+
+	/* can't send if JOIN hasn't completed yet (i.e. is usable for mptcp) */
+	if (subflow->request_join && !subflow->fully_established)
+		return false;
+
+	/* only send if our side has not closed yet */
+	return ((1 << ssk->sk_state) & (TCPF_ESTABLISHED | TCPF_CLOSE_WAIT));
+}
+
+#define MPTCP_SEND_BURST_SIZE	((1 << 16) - \
+				 sizeof(struct tcphdr) - \
+				 MAX_TCP_OPTION_SPACE - \
+				 sizeof(struct ipv6hdr) - \
+				 sizeof(struct frag_hdr))
+
+struct subflow_send_info {
+	struct sock *ssk;
+	uint64_t ratio;
+};
+
 static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk,
 					   u32 *sndbuf)
 {
+	struct subflow_send_info send_info[2];
 	struct mptcp_subflow_context *subflow;
-	struct sock *sk = (struct sock *)msk;
-	struct sock *backup = NULL;
-	bool free;
+	int i, nr_active = 0;
+	int64_t ratio, pace;
+	struct sock *ssk;
 
-	sock_owned_by_me(sk);
+	sock_owned_by_me((struct sock *)msk);
 
 	*sndbuf = 0;
 	if (!mptcp_ext_cache_refill(msk))
 		return NULL;
 
-	mptcp_for_each_subflow(msk, subflow) {
-		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
-
-		free = sk_stream_is_writeable(subflow->tcp_sock);
-		if (!free) {
-			mptcp_nospace(msk);
+	if (__mptcp_check_fallback(msk)) {
+		if (!msk->first)
 			return NULL;
+		*sndbuf = msk->first->sk_sndbuf;
+		return sk_stream_memory_free(msk->first) ? msk->first : NULL;
+	}
+
+	/* re-use last subflow, if the burst allow that */
+	if (msk->last_snd && msk->snd_burst > 0 &&
+	    sk_stream_memory_free(msk->last_snd) &&
+	    mptcp_subflow_active(mptcp_subflow_ctx(msk->last_snd))) {
+		mptcp_for_each_subflow(msk, subflow) {
+			ssk = mptcp_subflow_tcp_sock(subflow);
+			*sndbuf = max(tcp_sk(ssk)->snd_wnd, *sndbuf);
 		}
+		return msk->last_snd;
+	}
+
+	/* pick the subflow with the lower wmem/wspace ratio */
+	for (i = 0; i < 2; ++i) {
+		send_info[i].ssk = NULL;
+		send_info[i].ratio = -1;
+	}
+	mptcp_for_each_subflow(msk, subflow) {
+		ssk = mptcp_subflow_tcp_sock(subflow);
+		if (!mptcp_subflow_active(subflow))
+			continue;
+		nr_active += !subflow->backup;
 		*sndbuf = max(tcp_sk(ssk)->snd_wnd, *sndbuf);
-		if (subflow->backup) {
-			if (!backup)
-				backup = ssk;
+		if (!sk_stream_memory_free(subflow->tcp_sock))
+			continue;
 
+		pace = READ_ONCE(ssk->sk_pacing_rate);
+		if (!pace)
 			continue;
-		}
 
-		return ssk;
+		ratio = (int64_t)READ_ONCE(ssk->sk_wmem_queued) << 32 / pace;
+		if (ratio < send_info[subflow->backup].ratio) {
+			send_info[subflow->backup].ssk = ssk;
+			send_info[subflow->backup].ratio = ratio;
+		}
 	}
 
-	return backup;
+	pr_debug("msk=%p nr_active=%d ssk=%p:%lld backup=%p:%lld",
+		 msk, nr_active, send_info[0].ssk, send_info[0].ratio,
+		 send_info[1].ssk, send_info[1].ratio);
+
+	/* pick the best backup if no other subflow is active */
+	if (!nr_active)
+		send_info[0].ssk = send_info[1].ssk;
+
+	if (send_info[0].ssk) {
+		msk->last_snd = send_info[0].ssk;
+		msk->snd_burst = min_t(int, MPTCP_SEND_BURST_SIZE,
+				       sk_stream_wspace(msk->last_snd));
+		return msk->last_snd;
+	}
+	return NULL;
 }
 
 static void ssk_check_wmem(struct mptcp_sock *msk)
@@ -1160,6 +1222,10 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			break;
 		}
 
+		/* burst can be negative, we will try move to the next subflow
+		 * at selection time, if possible.
+		 */
+		msk->snd_burst -= ret;
 		copied += ret;
 
 		tx_ok = msg_data_left(msg);
@@ -1375,6 +1441,11 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
 	unsigned int moved = 0;
 	bool done;
 
+	/* avoid looping forever below on racing close */
+	if (((struct sock *)msk)->sk_state == TCP_CLOSE)
+		return false;
+
+	__mptcp_flush_join_list(msk);
 	do {
 		struct sock *ssk = mptcp_subflow_recv_lookup(msk);
 
@@ -1539,9 +1610,15 @@ static struct sock *mptcp_subflow_get_retrans(const struct mptcp_sock *msk)
 
 	sock_owned_by_me((const struct sock *)msk);
 
+	if (__mptcp_check_fallback(msk))
+		return msk->first;
+
 	mptcp_for_each_subflow(msk, subflow) {
 		struct sock *ssk = mptcp_subflow_tcp_sock(subflow);
 
+		if (!mptcp_subflow_active(subflow))
+			continue;
+
 		/* still data outstanding at TCP level? Don't retransmit. */
 		if (!tcp_write_queue_empty(ssk))
 			return NULL;
 
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index cfa5e1b9521b..493bd2c13bc6 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -196,6 +196,8 @@ struct mptcp_sock {
 	u64		write_seq;
 	u64		ack_seq;
 	u64		rcv_data_fin_seq;
+	struct sock	*last_snd;
+	int		snd_burst;
 	atomic64_t	snd_una;
 	unsigned long	timer_ival;
 	u32		token;
@@ -473,12 +475,12 @@ static inline bool before64(__u64 seq1, __u64 seq2)
 
 void mptcp_diag_subflow_init(struct tcp_ulp_ops *ops);
 
-static inline bool __mptcp_check_fallback(struct mptcp_sock *msk)
+static inline bool __mptcp_check_fallback(const struct mptcp_sock *msk)
 {
 	return test_bit(MPTCP_FALLBACK_DONE, &msk->flags);
 }
 
-static inline bool mptcp_check_fallback(struct sock *sk)
+static inline bool mptcp_check_fallback(const struct sock *sk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
 	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
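[Editor's note: a small stand-alone C model of the transmit-subflow
picker from patch 11 above; struct flow and the sample numbers are
invented for illustration. One caution on the hunk as posted:
`wmem << 32 / pace` parses in C as `wmem << (32 / pace)` because `/`
binds tighter than `<<`, so the explicitly parenthesized
`((uint64_t)wmem << 32) / pace` below is presumably the ratio the
changelog describes.]

#include <stdint.h>
#include <stdio.h>

struct flow {
	const char *name;
	uint32_t wmem_queued;	/* bytes already queued, like ssk->sk_wmem_queued */
	uint64_t pacing_rate;	/* bytes/sec, like ssk->sk_pacing_rate */
	int backup;		/* 1 if this is a backup subflow */
};

/* smaller ratio == queued data drains sooner == better candidate */
static uint64_t ratio(const struct flow *f)
{
	return ((uint64_t)f->wmem_queued << 32) / f->pacing_rate;
}

int main(void)
{
	struct flow flows[] = {
		{ "wifi",       64 * 1024, 10 * 1000 * 1000, 0 },
		{ "lte",        16 * 1024,  2 * 1000 * 1000, 0 },
		{ "dsl-backup",  4 * 1024,      1000 * 1000, 1 },
	};
	/* [0] tracks the best active subflow, [1] the best backup */
	const struct flow *best[2] = { NULL, NULL };

	for (unsigned int i = 0; i < sizeof(flows) / sizeof(flows[0]); i++) {
		const struct flow *f = &flows[i];

		if (!best[f->backup] || ratio(f) < ratio(best[f->backup]))
			best[f->backup] = f;
	}
	/* fall back to the best backup only if no active subflow qualified */
	if (!best[0])
		best[0] = best[1];
	printf("send on: %s\n", best[0] ? best[0]->name : "(none)");
	return 0;
}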