From patchwork Mon Jan 11 22:24:08 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Heath Caldwell
X-Patchwork-Id: 361721
Return-Path:
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level:
X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH,
	DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,
	INCLUDES_CR_TRAILER,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no
	version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id D92AFC433E0
	for ; Tue, 12 Jan 2021 00:35:56 +0000 (UTC)
Received: from vger.kernel.org (vger.kernel.org [23.128.96.18])
	by mail.kernel.org (Postfix) with ESMTP id 9545522D6E
	for ; Tue, 12 Jan 2021 00:35:56 +0000 (UTC)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S2405346AbhALAZ0 (ORCPT ); Mon, 11 Jan 2021 19:25:26 -0500
Received: from lindbergh.monkeyblade.net ([23.128.96.19]:35400 "EHLO
	lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S2390908AbhAKXCy (ORCPT ); Mon, 11 Jan 2021 18:02:54 -0500
Received: from mx0a-00190b01.pphosted.com (mx0a-00190b01.pphosted.com
	[IPv6:2620:100:9001:583::1]) by lindbergh.monkeyblade.net (Postfix)
	with ESMTPS id 8B02DC061786 for ; Mon, 11 Jan 2021 15:02:13 -0800 (PST)
Received: from pps.filterd (m0122333.ppops.net [127.0.0.1])
	by mx0a-00190b01.pphosted.com (8.16.0.43/8.16.0.43) with SMTP
	id 10BMNTNi010313; Mon, 11 Jan 2021 22:24:16 GMT
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=akamai.com;
	h=from : to : cc : subject : date : message-id : in-reply-to :
	references : mime-version : content-transfer-encoding : content-type;
	s=jan2016.eng; bh=q4ubsMsFa832LtHHGYTOI5j9eDgqlUJBNQI7iRUdbWo=;
	b=oS9I6huPEAc8IrdPYR+WrGpJwtUfwLANDhu93wLY91fNFuYMyevw03D//ppZZWTHMlR9
	LyD/aai7X00UOZrCRzBYZz3dv0sm3WNPoMeEkg/MkrQ2GjFqR6pzUSs74n/QB6nlZ4Bs
	0vEHBeblSIwoFJ/iFAyOZQ1gGsumd3qrVgCO/+qa/CSOPMhE0tmzTSDhYtWP+lhzJuc+
	a6M+YIvKTKOd4kFi2HvtH4g+cNgprvfqXmRVcXVXaIImSuZkIHZsrIKSOwhpFgYBINVn
	0g24qPmhbfVlA45pfmmoH+2snrJBjx1TWWmSgj1ZZAnVPizwqVNm4l1LUFIq0G6kvM5y
	iQ==
Received: from prod-mail-ppoint5 (prod-mail-ppoint5.akamai.com
	[184.51.33.60] (may be forged)) by mx0a-00190b01.pphosted.com with ESMTP
	id 35y5m4t82g-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES256-GCM-SHA384
	bits=256 verify=NOT); Mon, 11 Jan 2021 22:24:16 +0000
Received: from pps.filterd (prod-mail-ppoint5.akamai.com [127.0.0.1])
	by prod-mail-ppoint5.akamai.com (8.16.0.43/8.16.0.43) with SMTP
	id 10BMJeKl013411; Mon, 11 Jan 2021 14:24:15 -0800
Received: from email.msg.corp.akamai.com ([172.27.123.53])
	by prod-mail-ppoint5.akamai.com with ESMTP id 35ybbe4hpn-1
	(version=TLSv1.2 cipher=ECDHE-RSA-AES256-SHA384 bits=256 verify=NOT);
	Mon, 11 Jan 2021 14:24:15 -0800
Received: from usma1ex-cas5.msg.corp.akamai.com (172.27.123.53)
	by usma1ex-dag3mb4.msg.corp.akamai.com (172.27.123.56) with Microsoft
	SMTP Server (TLS) id 15.0.1497.2; Mon, 11 Jan 2021 17:24:14 -0500
Received: from bos-lhvedt.bos01.corp.akamai.com (172.28.223.201)
	by usma1ex-cas5.msg.corp.akamai.com (172.27.123.53) with Microsoft
	SMTP Server id 15.0.1497.2 via Frontend Transport;
	Mon, 11 Jan 2021 17:24:14 -0500
Received: by bos-lhvedt.bos01.corp.akamai.com (Postfix, from userid 33863)
	id AC78316004D; Mon, 11 Jan 2021 17:24:14 -0500 (EST)
From: Heath Caldwell
To:
CC: Eric Dumazet, Yuchung Cheng, Josh Hunt, Ji Li, Heath Caldwell
Subject: [PATCH net-next 1/4] net: account for overhead when restricting
	SO_RCVBUF
Date: Mon, 11 Jan 2021 17:24:08 -0500
Message-ID: <20210111222411.232916-2-hcaldwel@akamai.com>
X-Mailer: git-send-email 2.28.0
In-Reply-To: <20210111222411.232916-1-hcaldwel@akamai.com>
References: <20210111222411.232916-1-hcaldwel@akamai.com>
MIME-Version: 1.0
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.343, 18.0.737
	definitions=2021-01-11_32:2021-01-11,2021-01-11 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 phishscore=0
	malwarescore=0 mlxscore=0 mlxlogscore=999 suspectscore=0 spamscore=0
	adultscore=0 bulkscore=0 classifier=spam adjust=0 reason=mlx scancount=1
	engine=8.12.0-2009150000 definitions=main-2101110124
X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.343, 18.0.737
	definitions=2021-01-11_32:2021-01-11,2021-01-11 signatures=0
X-Proofpoint-Spam-Details: rule=notspam policy=default score=0 adultscore=0
	spamscore=0 clxscore=1015 phishscore=0 impostorscore=0 bulkscore=0
	lowpriorityscore=0 suspectscore=0 mlxscore=0 mlxlogscore=999
	malwarescore=0 priorityscore=1501 classifier=spam adjust=0 reason=mlx
	scancount=1 engine=8.12.0-2009150000 definitions=main-2101110125
X-Agari-Authentication-Results: mx.akamai.com; spf=${SPFResult} (sender IP is
	184.51.33.60) smtp.mailfrom=hcaldwel@akamai.com
	smtp.helo=prod-mail-ppoint5
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

When restricting the value supplied for SO_RCVBUF to be no greater than
what is specified by sysctl_rmem_max, properly cap the value to the
*available* space that sysctl_rmem_max would provide, rather than to the
raw value of sysctl_rmem_max.

Without this change, it is possible to cause sk_rcvbuf to be assigned a
value larger than sysctl_rmem_max via setsockopt() for SO_RCVBUF.
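
The overshoot described above can be modeled in user space. The following is a minimal sketch, not kernel code: `rcvbuf_before()`, `rcvbuf_after()` and the `min_rcvbuf` parameter (standing in for SOCK_MIN_RCVBUF) are hypothetical names, and the fixed factor of 2 is the non-TCP overhead factor used by this patch.

```c
#include <assert.h>
#include <limits.h>

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Pre-patch model: the request is capped against rmem_max while still in
 * "available" units, then doubled, so the result can exceed rmem_max.
 * min_rcvbuf is a stand-in for SOCK_MIN_RCVBUF (assumption).
 */
static int rcvbuf_before(int val, int rmem_max, int min_rcvbuf)
{
	assert(val >= 0 && rmem_max >= 0);
	val = min_int(val, rmem_max);
	val = min_int(val, INT_MAX / 2);	/* keep val * 2 within int */
	return max_int(val * 2, min_rcvbuf);
}

/* Post-patch model: rmem_max is first scaled into "available" units
 * (halved), so the doubled result can never exceed rmem_max.
 */
static int rcvbuf_after(int val, int rmem_max, int min_rcvbuf)
{
	assert(rmem_max >= 0);
	val = min_int(max_int(val, 0), rmem_max / 2);
	return max_int(val * 2, min_rcvbuf);
}
```

With rmem_max = 212992 and a request of 200000, the first model yields 400000 while the second yields 212992. The value 2304 used for `min_rcvbuf` in any call is only a placeholder; the real minimum depends on the kernel build.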

To illustrate:

If an application calls setsockopt() to set SO_RCVBUF to some value, R,
such that:

    sysctl_rmem_max / 2 < R < sysctl_rmem_max

and sk_rcvbuf will be assigned to some value, V, such that:

    V = R * 2

which produces:

    R = V / 2

Then,

    sysctl_rmem_max / 2 < V / 2 < sysctl_rmem_max

which produces:

    sysctl_rmem_max < V < 2 * sysctl_rmem_max

For example:

If sysctl_rmem_max has a value of 212992, and an application calls
setsockopt() to set SO_RCVBUF to 200000 (which is less than
sysctl_rmem_max, but greater than sysctl_rmem_max/2), then, without this
change, sk_rcvbuf would be set to 2*200000 = 400000, which is larger than
sysctl_rmem_max.

This change restricts the domain of R to [0, sysctl_rmem_max/2], removing
the possibility for V to be greater than sysctl_rmem_max.

Also, abstract the actions of converting "buffer" space to and from
"available" space and clarify comments.

Signed-off-by: Heath Caldwell
---
 net/core/sock.c | 83 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 60 insertions(+), 23 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index bbcd4b97eddd..0a9c19f52989 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -778,25 +778,46 @@ void sock_set_keepalive(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_set_keepalive);
 
+/* Convert a buffer size value (which accounts for overhead) to the amount of
+ * space which would be available for data in a buffer that size.
+ */
+static inline int sock_buf_size_to_available(struct sock *sk, int buf_size)
+{
+	return buf_size / 2;
+}
+
+/* Convert a size value for an amount of data ("available") to the size of
+ * buffer necessary to accommodate that amount of data (accounting for
+ * overhead).
+ */
+static inline int sock_available_to_buf_size(struct sock *sk, int available)
+{
+	return available * 2;
+}
+
+/* Applications likely assume that successfully setting SO_RCVBUF will allow for
+ * the requested amount of data to be received on the socket. Applications are
+ * not expected to account for implementation specific overhead which may also
+ * take up space in the receive buffer.
+ *
+ * In other words: applications supply a value in "available" space - that is,
+ * *not* including overhead - to SO_RCVBUF, which must be converted to "buffer"
+ * space - that is, *including* overhead - to obtain the effective size
+ * required.
+ *
+ * val is in "available" space.
+ */
 static void __sock_set_rcvbuf(struct sock *sk, int val)
 {
-	/* Ensure val * 2 fits into an int, to prevent max_t() from treating it
-	 * as a negative value.
-	 */
-	val = min_t(int, val, INT_MAX / 2);
+	int buf_size;
+
+	/* Cap val to what would be available in a maximum sized buffer: */
+	val = min(val, sock_buf_size_to_available(sk, INT_MAX));
+	buf_size = sock_available_to_buf_size(sk, val);
+
 	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
 
-	/* We double it on the way in to account for "struct sk_buff" etc.
-	 * overhead. Applications assume that the SO_RCVBUF setting they make
-	 * will allow that much actual data to be received on that socket.
-	 *
-	 * Applications are unaware that "struct sk_buff" and other overheads
-	 * allocate from the receive buffer during socket buffer allocation.
-	 *
-	 * And after considering the possible alternatives, returning the value
-	 * we actually used in getsockopt is the most desirable behavior.
-	 */
-	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
+	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, buf_size, SOCK_MIN_RCVBUF));
 }
 
 void sock_set_rcvbuf(struct sock *sk, int val)
@@ -906,12 +927,27 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		goto set_sndbuf;
 
 	case SO_RCVBUF:
-		/* Don't error on this BSD doesn't and if you think
-		 * about it this is right. Otherwise apps have to
-		 * play 'guess the biggest size' games. RCVBUF/SNDBUF
-		 * are treated in BSD as hints
+		/* val is in "available" space - that is, it is a requested
+		 * amount of space to be available in the receive buffer *not*
+		 * including any overhead.
+		 *
+		 * sysctl_rmem_max is in "buffer" space - that is, it specifies
+		 * a buffer size *including* overhead. It must be scaled into
+		 * "available" space, which is what __sock_set_rcvbuf() expects.
+		 *
+		 * Don't return an error when val exceeds scaled sysctl_rmem_max
+		 * (or, maybe more clearly: when val scaled into "buffer" space
+		 * would exceed sysctl_rmem_max). Instead, just cap the
+		 * requested value to what sysctl_rmem_max would make available.
+		 *
+		 * Floor negative values to 0.
 		 */
-		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max));
+		__sock_set_rcvbuf(sk,
+				  min(max(val, 0),
+				      sock_buf_size_to_available(sk,
+								 min_t(u32,
+								       sysctl_rmem_max,
+								       INT_MAX))));
 		break;
 
 	case SO_RCVBUFFORCE:
@@ -920,9 +956,6 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 			break;
 		}
 
-		/* No negative values (to prevent underflow, as val will be
-		 * multiplied by 2).
-		 */
 		__sock_set_rcvbuf(sk, max(val, 0));
 		break;
 
@@ -1333,6 +1366,10 @@ int sock_getsockopt(struct socket *sock, int level, int optname,
 		break;
 
 	case SO_RCVBUF:
+		/* The actual value, in "buffer" space, is supplied for
+		 * getsockopt(), even though the value supplied to setsockopt()
+		 * is in "available" space.
+		 */
 		v.val = sk->sk_rcvbuf;
 		break;
 

From patchwork Mon Jan 11 22:24:09 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Heath Caldwell
X-Patchwork-Id: 362758
From: Heath Caldwell
To:
CC: Eric Dumazet, Yuchung Cheng, Josh Hunt, Ji Li, Heath Caldwell
Subject: [PATCH net-next 2/4] net: tcp: consistently account for overhead for
	SO_RCVBUF for TCP
Date: Mon, 11 Jan 2021 17:24:09 -0500
Message-ID: <20210111222411.232916-3-hcaldwel@akamai.com>
X-Mailer: git-send-email 2.28.0
In-Reply-To: <20210111222411.232916-1-hcaldwel@akamai.com>
References: <20210111222411.232916-1-hcaldwel@akamai.com>
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

When setting SO_RCVBUF for TCP sockets, account for overhead in accord with
sysctl_tcp_adv_win_scale. This makes the receive buffer overhead
accounting for SO_RCVBUF consistent with how it is accounted elsewhere for
TCP sockets.
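
The conversion between buffer space and window space can be sketched in user space as follows. This is an illustration under assumptions: `win_from_space()` mirrors the arithmetic of tcp_win_from_space() with the scale passed in explicitly, and the positive-scale branch of `space_from_win()` is derived algebraically here from win = space - (space >> s); it is not copied from the patch. Both function names are hypothetical.

```c
#include <assert.h>

/* Window obtainable from `space` bytes of buffer for a given
 * tcp_adv_win_scale value (passed in explicitly for this sketch).
 */
static int win_from_space(int space, int adv_win_scale)
{
	assert(space >= 0 && adv_win_scale > -31 && adv_win_scale < 31);
	return adv_win_scale <= 0 ?
		space >> -adv_win_scale :
		space - (space >> adv_win_scale);
}

/* Buffer space needed to advertise a window of `win` bytes.
 * For s > 0: win = space * (2^s - 1) / 2^s, so
 * space = win * 2^s / (2^s - 1). The divisor is never zero because
 * s == 0 takes the first branch. The caller is assumed to pass values
 * whose result fits in an int.
 */
static int space_from_win(int win, int adv_win_scale)
{
	long long w = win;

	assert(win >= 0 && adv_win_scale > -31 && adv_win_scale < 31);
	if (adv_win_scale <= 0)
		return (int)(w << -adv_win_scale);
	return (int)((w << adv_win_scale) / ((1LL << adv_win_scale) - 1));
}
```

With a scale of 1, a 212992-byte buffer maps to a 106496-byte window and back again. Because both directions use integer arithmetic, the round trip is only approximate for values that are not multiples of 2^s.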

Signed-off-by: Heath Caldwell
---
 include/net/tcp.h | 17 +++++++++++++++++
 net/core/sock.c   |  6 ++++++
 2 files changed, 23 insertions(+)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 78d13c88720f..9961de3fbf09 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1408,6 +1408,23 @@ static inline int tcp_win_from_space(const struct sock *sk, int space)
 		space - (space>>tcp_adv_win_scale);
 }
 
+/* Calculate the amount of buffer space which would allow for the advertisement
+ * of a window size of win, accounting for overhead.
+ *
+ * This is the inverse of tcp_win_from_space().
+ */
+static inline int tcp_space_from_win(const struct sock *sk, int win)
+{
+	int tcp_adv_win_scale = sock_net(sk)->ipv4.sysctl_tcp_adv_win_scale;
+
+	return tcp_adv_win_scale <= 0 ?
+		win<<(-tcp_adv_win_scale) :
+		/* Division by zero is avoided because the above expression is
+		 * used when tcp_adv_win_scale == 0.
+		 */
+		(win<[...]
[...]
+	if (sk->sk_protocol == IPPROTO_TCP)
+		return tcp_win_from_space(sk, buf_size);
+
 	return buf_size / 2;
 }
 
@@ -792,6 +795,9 @@ static inline int sock_buf_size_to_available(struct sock *sk, int buf_size)
  */
 static inline int sock_available_to_buf_size(struct sock *sk, int available)
 {
+	if (sk->sk_protocol == IPPROTO_TCP)
+		return tcp_space_from_win(sk, available);
+
 	return available * 2;
 }
 

From patchwork Mon Jan 11 22:24:10 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Heath Caldwell
X-Patchwork-Id: 362756
From: Heath Caldwell
To:
CC: Eric Dumazet, Yuchung Cheng, Josh Hunt, Ji Li, Heath Caldwell
Subject: [PATCH net-next 3/4] tcp: consistently account for overhead in
	rcv_wscale calculation
Date: Mon, 11 Jan 2021 17:24:10 -0500
Message-ID: <20210111222411.232916-4-hcaldwel@akamai.com>
X-Mailer: git-send-email 2.28.0
In-Reply-To: <20210111222411.232916-1-hcaldwel@akamai.com>
References: <20210111222411.232916-1-hcaldwel@akamai.com>
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

When calculating the window scale to use for the advertised window for a
TCP connection, adjust the size values used to extend the maximum possible
window value so that overhead is properly accounted. In other words:
convert the maximum value candidates from buffer size space into
advertised window space.

This adjustment keeps the scale of the maximum value consistent - that is,
keeps it in window space.

Without this adjustment, the window scale used could be larger than
necessary, reducing granularity for the advertised window.
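
The effect on the window scale can be sketched with a small user-space model of the `clamp_t(int, ilog2(space) - 15, 0, TCP_MAX_WSCALE)` step in tcp_select_initial_window(). The helper names `ilog2_u32()` and `rcv_wscale_for()` are hypothetical, the window_clamp limit is left out, and the example values assume tcp_adv_win_scale = 1 (window = half of the buffer).

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WSCALE 14	/* largest shift permitted for the TCP window scale option */

/* Integer log2 (index of the highest set bit); v must be non-zero. */
static int ilog2_u32(uint32_t v)
{
	int n = -1;

	assert(v > 0);
	while (v) {
		v >>= 1;
		n++;
	}
	return n;
}

/* Window scale chosen for a given maximum possible window. A 16-bit
 * window field covers values below 2^16, so the shift is ilog2 - 15,
 * clamped to [0, MAX_WSCALE].
 */
static int rcv_wscale_for(uint32_t max_window)
{
	int scale = ilog2_u32(max_window) - 15;

	if (scale < 0)
		scale = 0;
	if (scale > MAX_WSCALE)
		scale = MAX_WSCALE;
	return scale;
}
```

Feeding the raw buffer size 6291456 gives a scale of 7, while the corresponding window size 3145728 gives 6. A smaller scale means each unit of the 16-bit window field represents fewer bytes, which is the granularity benefit the commit message describes.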

Signed-off-by: Heath Caldwell
---
 net/ipv4/tcp_output.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index f322e798a351..1d2773cd02c8 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -240,8 +240,12 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 	*rcv_wscale = 0;
 	if (wscale_ok) {
 		/* Set window scaling on max possible window */
-		space = max_t(u32, space, sock_net(sk)->ipv4.sysctl_tcp_rmem[2]);
-		space = max_t(u32, space, sysctl_rmem_max);
+		space = max_t(u32, space,
+			      tcp_win_from_space(
+				      sk,
+				      sock_net(sk)->ipv4.sysctl_tcp_rmem[2]));
+		space = max_t(u32, space,
+			      tcp_win_from_space(sk, sysctl_rmem_max));
 		space = min_t(u32, space, *window_clamp);
 		*rcv_wscale = clamp_t(int, ilog2(space) - 15,
 				      0, TCP_MAX_WSCALE);

From patchwork Mon Jan 11 22:24:11 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Heath Caldwell
X-Patchwork-Id: 362757
From: Heath Caldwell
To:
CC: Eric Dumazet, Yuchung Cheng, Josh Hunt, Ji Li, Heath Caldwell
Subject: [PATCH net-next 4/4] tcp: remove limit on initial receive window
Date: Mon, 11 Jan 2021 17:24:11 -0500
Message-ID: <20210111222411.232916-5-hcaldwel@akamai.com>
X-Mailer: git-send-email 2.28.0
In-Reply-To: <20210111222411.232916-1-hcaldwel@akamai.com>
References: <20210111222411.232916-1-hcaldwel@akamai.com>
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

Remove the 64KB limit imposed on the initial receive window.

The limit was added by commit a337531b942b ("tcp: up initial rmem to 128KB
and SYN rwin to around 64KB").

This change removes that limit so that the initial receive window can be
arbitrarily large (within existing limits and depending on the current
configuration).

The arbitrary, internal limit can interfere with research because it
irremediably restricts the receive window at the beginning of a connection
below what would be expected when explicitly configuring the receive buffer
size.

-

Here is a scenario to illustrate how the limit might cause undesirable
behavior:

Consider an installation where all parts of a network are either controlled
or sufficiently monitored and there is a desired use case where a 1MB
object is transmitted over a newly created TCP connection in a single
initial burst.

Let MSS be 1460 bytes.

The initial cwnd would need to be at least:

                |- 1048576 bytes -|
    cwnd_init = | --------------- | = 719 packets
                |  1460 bytes/pkt |

Let us say that it was determined that the network could handle bursts of
800 full sized packets at the frequency which the connections under
consideration would be expected to occur, so the sending host is configured
to use an initial cwnd of 800 for these connections.

In order for the receiver to be able to receive a 1MB burst, it needs to
have a sufficiently large receive buffer for the connection. Considering
overhead, let us say that the receiver is configured to initially use a
receive buffer of 2148K for TCP connections:

    net.ipv4.tcp_rmem = 4096 2199552 6291456

Let rtt be 50 milliseconds.

If the entire object is sent in a single burst, then the theoretically
highest achievable throughput (discounting handshake and request) should
be:

                   bits   1048576 bytes   8 bits
    T_upperbound = ---- = ------------- * ------ =~ 168 Mbit/s
                   rtt       0.05 s       1 byte

But, if flow control limits throughput because the receive window is
initially limited to 64KB and grows at a rate of quadrupling every
rtt (maybe not accurate but seems to be optimistic from observation), we
should expect the highest achievable throughput to be limited to:

    bytes_sent = 65536 * (1 + 4)^(t / rtt)

When bytes_sent = object size = 1048576:

    1048576 = 65536 * (1 + 4)^(t / rtt)
    t = rtt * log_5(16)

                           1048576 bytes                 8 bits
    T_limited = -------------------------------------- * ------
                       /     |- rtt * log_5(16) -| \     1 byte
                rtt * (  1 + |  ---------------- |  )
                       \     |        rtt        | /

                 1048576 bytes     8 bits
              = ---------------- * ------
                0.05 s * (1 + 2)   1 byte

              =~ 55.9 Mbit/s

In short: for this scenario, the 64KB limit on the initial receive window
increases the achievable acknowledged delivery time from 1 rtt
to (optimistically) 3 rtts, reducing the achievable throughput from
168 Mbit/s to 55.9 Mbit/s.

Here is an experimental illustration:

A time sequence chart of a packet capture taken on the sender for a
scenario similar to what is described above, where the receiver had the
64KB limit in place:

    Symbols:
    .:'     - Data packets
    _-      - Window advertised by receiver

    y-axis  - Relative sequence number
    x-axis  - Time from sending of first data packet, in seconds

    [Chart: relative sequence number (0 to 3212891) against time
     (0.000 to 0.195 s)]

Note that the sender was not able to send the object in a single initial
burst and that it took around 4 rtts for the object to be fully
acknowledged.

A time sequence chart of a packet capture taken for the same scenario, but
with the limit removed:

    [Chart: relative sequence number (0 to 2147035) against time
     (0.000 to 0.057 s)]

Note that the sender was able to send the entire object in a single burst
and that it was fully acknowledged after a little over 1 rtt.

Signed-off-by: Heath Caldwell
---
 net/ipv4/tcp_output.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 1d2773cd02c8..d7ab1f5f071e 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
 	if (sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows)
 		(*rcv_wnd) = min(space, MAX_TCP_WINDOW);
 	else
-		(*rcv_wnd) = min_t(u32, space, U16_MAX);
+		(*rcv_wnd) = space;
 
 	if (init_rcv_wnd)
 		*rcv_wnd = min(*rcv_wnd, init_rcv_wnd * mss);
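
The arithmetic in the commit message of patch 4/4 can be reproduced with a few helper functions. This is a sketch under the same simplifying model as the message (the window is multiplied by a fixed factor once per rtt); the function names are hypothetical, and the loop in `rtts_of_growth()` is an integer stand-in for the ceiling of log_5(16) used there.

```c
#include <assert.h>
#include <stdint.h>

/* Packets needed to carry `bytes` at `mss` bytes per packet, rounded up. */
static uint64_t initial_cwnd_packets(uint64_t bytes, uint64_t mss)
{
	assert(mss > 0);
	return (bytes + mss - 1) / mss;
}

/* Number of round trips of window growth until a single window covers
 * `bytes`, assuming the window is multiplied by `growth` once per rtt.
 */
static unsigned int rtts_of_growth(uint64_t window, uint64_t growth,
				   uint64_t bytes)
{
	unsigned int n = 0;

	assert(window > 0 && growth > 1);
	while (window < bytes) {
		window *= growth;
		n++;
	}
	return n;
}

/* Throughput in bit/s when `bytes` are fully acknowledged after `rtts`
 * round trips of `rtt_s` seconds each.
 */
static double throughput_bps(uint64_t bytes, double rtt_s, unsigned int rtts)
{
	assert(rtt_s > 0.0 && rtts > 0);
	return (double)bytes * 8.0 / (rtt_s * (double)rtts);
}
```

For the scenario in the message, `initial_cwnd_packets(1048576, 1460)` gives 719, `rtts_of_growth(65536, 5, 1048576)` gives 2, and the two throughput figures correspond to `throughput_bps(1048576, 0.05, 1)` and `throughput_bps(1048576, 0.05, 1 + 2)`.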