[net-next] tcp: optimise receiver buffer autotuning initialisation for high latency connections

Message ID 20201204180622.14285-1-abuehaze@amazon.com
State New

Commit Message

Mohamed Abuelfotoh, Hazem Dec. 4, 2020, 6:06 p.m. UTC
Previously receiver buffer auto-tuning started after receiving
    one advertised window amount of data. After the initial
    receiver buffer was raised by
    commit a337531b942b ("tcp: up initial rmem to 128KB
    and SYN rwin to around 64KB"), it may
    take too long for TCP autotuning to start raising
    the receiver buffer size.
    Commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
    tried to decrease the threshold at which TCP auto-tuning starts,
    but it doesn't work well in some environments
    where the receiver has a large MTU (9001) configured,
    especially within environments where RTT is high.
    To address this issue this patch relies on RCV_MSS
    so auto-tuning can start early regardless of
    the receiver's configured MTU.

    Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
    Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")

Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>
---
 net/ipv4/tcp_input.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

Comments

Mohamed Abuelfotoh, Hazem Dec. 4, 2020, 6:19 p.m. UTC | #1
Hey Team,

I am sending this e-mail as a follow-up to provide more context about the patch that I proposed in my previous e-mail.


1) We have received a customer complaint [1] about degraded download speed from Google endpoints after they upgraded their Ubuntu kernel from 4.14 to 5.4. These customers were getting around 80MB/s on kernel 4.14, which dropped to 3MB/s after the upgrade to kernel 5.4.
2) We tried to reproduce the issue locally between EC2 instances within the same region but couldn't; however, we were able to reproduce it when fetching data from a Google endpoint.
3) The issue could only be reproduced in regions where we have a high RTT (around 12 msec or more) to Google endpoints.
4) We have found some workarounds that can be applied on the receiver side and have proven to be effective; I am listing them below:
            A) Decrease the TCP socket default rmem from 131072 to 87380.
            B) Decrease the MTU from 9001 to 1500.
            C) Change sysctl_tcp_adv_win_scale from the default 1 to 0 or 2.
            D) We have also found that disabling net.ipv4.tcp_moderate_rcvbuf on kernel 4.14 gives exactly the same bad performance.
5) We bisected the kernel to understand when this behaviour was introduced and found that commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")[2], which was merged into mainline kernel 4.19.86, is the culprit behind this download performance degradation. The commit made two main changes:
A) Raising the initial TCP receive buffer size and receive window.
B) Changing the way in which TCP Dynamic Right Sizing (DRS) is kicked off.

6) The above patch introduced a regression that caused receive window scaling to take a long time after the initial receiver buffer and receive window were raised. An additional fix for that went in as commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")[3].

7) Commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner") tried to decrease the initial rcvq_space.space, which TCP's internal auto-tuning uses to grow socket buffers based on how much data the kernel estimates the sender can send; it should change over the life of any connection based on the amount of data that the sender is sending. That patch relies on advmss (the MSS configured on the receiver) to determine the initial receive space. Although this works very well for receivers with small MTUs such as 1500, it doesn't help if the receiver is configured to use jumbo frames (9001 MTU), which is the default MTU on AWS EC2 instances. We think this, together with the high RTT (>=12msec) required to see the issue, is why it hasn't been reported before.

8) After further debugging and testing we have found that the issue can only be reproduced under any of the conditions below:
A) Sender (MTU 1500) using bbr/bbrv2 as congestion control algorithm --> Receiver (MTU 9001) with default ipv4.sysctl_tcp_rmem[1] = 131072 running kernel 4.19.86 or later with RTT >=12msec. --> consistently reproducible.
B) Sender (MTU 1500) using cubic as congestion control algorithm with fq as qdisc --> Receiver (MTU 9001) with default ipv4.sysctl_tcp_rmem[1] = 131072 running kernel 4.19.86 or later with RTT >=30msec. --> consistently reproducible.
C) Sender (MTU 1500) using cubic as congestion control algorithm with pfifo_fast as qdisc --> Receiver (MTU 9001) with default ipv4.sysctl_tcp_rmem[1] = 131072 running kernel 4.19.86 or later with RTT >=30msec. --> intermittently reproducible.
D) The sender needs an MTU of 1500. If the sender is using an MTU of 9001 with no MSS clamping, we couldn't reproduce the issue.
E) AWS EC2 instances use 9001 as MTU by default, hence they are likely more impacted by this.


9) With some kernel hacking and packet capture analysis we found that the main issue is that, under the above-mentioned conditions, the receive window never scales up because TCP receiver autotuning apparently never kicks off. I have attached to this e-mail screenshots showing window scaling with and without the proposed patch.
We also found that all workarounds either decrease the initial rcvq_space (this includes decreasing the receiver MTU from 9001 to 1500, and with it the advertised MSS, or the default receive buffer size from 131072 to 87380) or increase the maximum advertised receive window before TCP autotuning starts scaling (this includes changing net.ipv4.tcp_adv_win_scale from 1 to 0 or 2).

10) It looks like we have a kind of deadlock when the issue happens: the advertised receive window has to exceed rcvq_space for TCP autotuning to kick off, while at the same time, with the initial default configuration, the receive window is not going to exceed rcvq_space because it can only get half of the initial receive socket buffer size.

11) The current code, which is based on the above patch, has one main drawback that should be handled:
A) It relies on the receiver's configured MTU to define the initial receive space (the threshold at which TCP autotuning starts). As mentioned above, this works well with a 1500 MTU because the initial receive space is then lower than the receive window, so TCP autotuning works just fine. It doesn't work with jumbo frames in use on the receiver, because in that case the receiver won't start TCP autotuning, especially with high RTT, and we hit the regression that commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner") was trying to handle.
12) I am proposing the patch below, which relies on RCV_MSS (our guess about the MSS used by the peer, which equals TCP_MSS_DEFAULT, 536 bytes, by default); this should work regardless of the receiver's configured MSS. I am also sharing my iperf test results with and without the patch. I also verified that the connection won't get stuck midway in case of packet loss or a latency spike, which I emulated using tc netem on the sender side.
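
The arithmetic behind items 9 to 12 can be sketched in a few lines of C. This is a simplified model, not the kernel code: the helper names are hypothetical, advmss is assumed to be the MTU minus 40 bytes of IPv4 and TCP headers (8961 for MTU 9001, 1460 for MTU 1500), and the initial receive window is assumed to be exactly the advertisable share of the default rmem, ignoring the clamping and rounding the kernel applies.

```c
#include <assert.h>
#include <stdint.h>

#define TCP_INIT_CWND   10u   /* initial congestion window, in segments */
#define TCP_MSS_DEFAULT 536u  /* rcv_mss guess before any data is seen */

/* Simplified model of the share of a socket buffer that may be advertised
 * as window for a given tcp_adv_win_scale (assumption: no clamping). */
static uint32_t win_from_space(uint32_t space, int adv_win_scale)
{
	assert(adv_win_scale > -32 && adv_win_scale < 32);
	if (adv_win_scale <= 0)
		return space >> (-adv_win_scale);
	return space - (space >> adv_win_scale);
}

/* Initial rcvq_space.space: min(rcv_wnd, TCP_INIT_CWND * mss), where mss is
 * advmss before the proposed patch and rcv_mss after it. */
static uint32_t initial_rcvq_space(uint32_t rcv_wnd, uint32_t mss)
{
	uint32_t by_mss = TCP_INIT_CWND * mss;

	return rcv_wnd < by_mss ? rcv_wnd : by_mss;
}
```

Under these assumptions, win_from_space(131072, 1) is 65536, and initial_rcvq_space(65536, 8961) is also 65536, so with jumbo frames the threshold equals the whole window. With advmss 1460 the threshold is 14600, and with rcv_mss at TCP_MSS_DEFAULT it is 5360, both well below the window. Changing tcp_adv_win_scale to 2 or 0 raises the window to 98304 or 131072, which is above 10 * 8961 = 89610 and is consistent with workaround C.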


Test Results using the same sender & receiver:

-Without our proposed patch

#iperf3 -c xx.xx.xx.xx -t15 -i1 -R
Connecting to host xx.xx.xx.xx, port 5201
Reverse mode, remote host xx.xx.xx.xx is sending
[  4] local 172.31.37.167 port 52838 connected to xx.xx.xx.xx port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   269 KBytes  2.20 Mbits/sec
[  4]   1.00-2.00   sec   332 KBytes  2.72 Mbits/sec
[  4]   2.00-3.00   sec   334 KBytes  2.73 Mbits/sec
[  4]   3.00-4.00   sec   335 KBytes  2.75 Mbits/sec
[  4]   4.00-5.00   sec   332 KBytes  2.72 Mbits/sec
[  4]   5.00-6.00   sec   283 KBytes  2.32 Mbits/sec
[  4]   6.00-7.00   sec   332 KBytes  2.72 Mbits/sec
[  4]   7.00-8.00   sec   335 KBytes  2.75 Mbits/sec
[  4]   8.00-9.00   sec   335 KBytes  2.75 Mbits/sec
[  4]   9.00-10.00  sec   334 KBytes  2.73 Mbits/sec
[  4]  10.00-11.00  sec   332 KBytes  2.72 Mbits/sec
[  4]  11.00-12.00  sec   332 KBytes  2.72 Mbits/sec
[  4]  12.00-13.00  sec   338 KBytes  2.77 Mbits/sec
[  4]  13.00-14.00  sec   334 KBytes  2.73 Mbits/sec
[  4]  14.00-15.00  sec   332 KBytes  2.72 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-15.00  sec  6.07 MBytes  3.39 Mbits/sec    0             sender
[  4]   0.00-15.00  sec  4.90 MBytes  2.74 Mbits/sec                  receiver

iperf Done.


Test downloading from google endpoint:

# wget https://storage.googleapis.com/kubernetes-release/release/v1.18.9/bin/linux/amd64/kubelet
--2020-12-04 16:53:00--  https://storage.googleapis.com/kubernetes-release/release/v1.18.9/bin/linux/amd64/kubelet
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.1.48, 172.217.8.176, 172.217.4.48, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.1.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113320760 (108M) [application/octet-stream]
Saving to: ‘kubelet.45’

100%[===================================================================================================================================================>] 113,320,760 3.04MB/s   in 36s

2020-12-04 16:53:36 (3.02 MB/s) - ‘kubelet’ saved [113320760/113320760]


########################################################################################################################

-With the proposed  patch:

#iperf3 -c xx.xx.xx.xx -t15 -i1 -R
Connecting to host xx.xx.xx.xx, port 5201
Reverse mode, remote host xx.xx.xx.xx is sending
[  4] local 172.31.37.167 port 44514 connected to xx.xx.xx.xx port 5201
[ ID] Interval           Transfer     Bandwidth
[  4]   0.00-1.00   sec   911 KBytes  7.46 Mbits/sec
[  4]   1.00-2.00   sec  8.95 MBytes  75.1 Mbits/sec
[  4]   2.00-3.00   sec  9.57 MBytes  80.3 Mbits/sec
[  4]   3.00-4.00   sec  9.56 MBytes  80.2 Mbits/sec
[  4]   4.00-5.00   sec  9.58 MBytes  80.3 Mbits/sec
[  4]   5.00-6.00   sec  9.58 MBytes  80.4 Mbits/sec
[  4]   6.00-7.00   sec  9.59 MBytes  80.4 Mbits/sec
[  4]   7.00-8.00   sec  9.59 MBytes  80.5 Mbits/sec
[  4]   8.00-9.00   sec  9.58 MBytes  80.4 Mbits/sec
[  4]   9.00-10.00  sec  9.58 MBytes  80.4 Mbits/sec
[  4]  10.00-11.00  sec  9.59 MBytes  80.4 Mbits/sec
[  4]  11.00-12.00  sec  9.59 MBytes  80.5 Mbits/sec
[  4]  12.00-13.00  sec  8.05 MBytes  67.5 Mbits/sec
[  4]  13.00-14.00  sec  9.57 MBytes  80.3 Mbits/sec
[  4]  14.00-15.00  sec  9.57 MBytes  80.3 Mbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-15.00  sec   136 MBytes  76.3 Mbits/sec    0             sender
[  4]   0.00-15.00  sec   134 MBytes  75.2 Mbits/sec                  receiver

iperf Done.

Test downloading from google endpoint:


# wget https://storage.googleapis.com/kubernetes-release/release/v1.18.9/bin/linux/amd64/kubelet
--2020-12-04 16:54:34--  https://storage.googleapis.com/kubernetes-release/release/v1.18.9/bin/linux/amd64/kubelet
Resolving storage.googleapis.com (storage.googleapis.com)... 172.217.0.16, 216.58.192.144, 172.217.6.16, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|172.217.0.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 113320760 (108M) [application/octet-stream]
Saving to: ‘kubelet’

100%[===================================================================================================================================================>] 113,320,760 80.0MB/s   in 1.4s

2020-12-04 16:54:36 (80.0 MB/s) - ‘kubelet.1’ saved [113320760/113320760]
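
A rough cross-check of the numbers above, assuming throughput is capped at one receive window per round trip and that the window stays near the 64KB initial value named in the title of commit a337531b942b. The RTT values used (162 msec between the iperf hosts, 16.7 msec to the Google endpoint) are the ping results reported later in this thread; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on throughput, in bytes per second, when at most window_bytes
 * can be in flight per round trip of rtt_usec microseconds. */
static uint64_t window_limited_bps(uint64_t window_bytes, uint64_t rtt_usec)
{
	assert(rtt_usec > 0);
	return window_bytes * 1000000ull / rtt_usec;
}

/* Window, in bytes, needed to sustain bytes_per_sec over the same RTT. */
static uint64_t required_window(uint64_t bytes_per_sec, uint64_t rtt_usec)
{
	return bytes_per_sec * rtt_usec / 1000000ull;
}
```

With a 65536-byte window, window_limited_bps(65536, 162000) gives 404543 bytes/s (about 3.2 Mbit/s), in the same range as the 2.74 Mbit/s iperf result without the patch, and window_limited_bps(65536, 16700) gives about 3.9 MB/s against the 3.02 MB/s wget result. Taking 80 MB/s as 80000000 bytes/s, required_window(80000000, 16700) is 1336000 bytes, so the patched receiver must have grown its window to well over 1 MB.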

Links:

[1] https://github.com/kubernetes/kops/issues/10206
[2] https://lore.kernel.org/patchwork/patch/1157936/
[3] https://lore.kernel.org/patchwork/patch/1157883/



Thank you.

Hazem

On 04/12/2020, 18:08, "Hazem Mohamed Abuelfotoh" <abuehaze@amazon.com> wrote:

        Previously receiver buffer auto-tuning started after receiving
        one advertised window amount of data. After the initial
        receiver buffer was raised by
        commit a337531b942b ("tcp: up initial rmem to 128KB
        and SYN rwin to around 64KB"), it may
        take too long for TCP autotuning to start raising
        the receiver buffer size.
        Commit 041a14d26715 ("tcp: start receiver buffer autotuning sooner")
        tried to decrease the threshold at which TCP auto-tuning starts,
        but it doesn't work well in some environments
        where the receiver has a large MTU (9001) configured,
        especially within environments where RTT is high.
        To address this issue this patch relies on RCV_MSS
        so auto-tuning can start early regardless of
        the receiver's configured MTU.

        Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
        Fixes: 041a14d26715 ("tcp: start receiver buffer autotuning sooner")

    Signed-off-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com>

    ---
     net/ipv4/tcp_input.c | 3 ++-
     1 file changed, 2 insertions(+), 1 deletion(-)

    diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
    index 389d1b340248..f0ffac9e937b 100644
    --- a/net/ipv4/tcp_input.c
    +++ b/net/ipv4/tcp_input.c
    @@ -504,13 +504,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
     static void tcp_init_buffer_space(struct sock *sk)
     {
     	int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
    +	struct inet_connection_sock *icsk = inet_csk(sk);
     	struct tcp_sock *tp = tcp_sk(sk);
     	int maxwin;

     	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
     		tcp_sndbuf_expand(sk);

    -	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
    +	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
     	tcp_mstamp_refresh(tp);
     	tp->rcvq_space.time = tp->tcp_mstamp;
     	tp->rcvq_space.seq = tp->copied_seq;
    -- 
    2.16.6





Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284

Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
Eric Dumazet Dec. 4, 2020, 6:41 p.m. UTC | #2
Unfortunately a few things are missing in this report.

What is the RTT between hosts in your test ?

What driver is used at the receiving side ?

Usually, this kind of problem comes when (skb->len / skb->truesize) is
pathologically small.
This could be caused by a driver lacking scatter gather support at RX
(a 1500 bytes incoming packet would use 12KB of memory or so, because
driver MTU was set to 9000).
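
To put rough numbers on that ratio, here is a minimal sketch that takes the 1500-byte packet and the "12KB or so" figure above at face value (12288 bytes is an assumption, as are the helper names) and assumes every packet is charged its full truesize against the receive buffer.

```c
#include <assert.h>
#include <stdint.h>

/* skb->len / skb->truesize expressed as a whole percentage. */
static uint32_t len_truesize_percent(uint32_t len, uint32_t truesize)
{
	assert(truesize > 0);
	return len * 100u / truesize;
}

/* Packet bytes that fit in rcvbuf when each packet of len bytes is
 * charged truesize bytes of socket memory. */
static uint32_t payload_capacity(uint32_t rcvbuf, uint32_t len, uint32_t truesize)
{
	assert(truesize > 0);
	return (rcvbuf / truesize) * len;
}
```

Under these assumptions, len_truesize_percent(1500, 12288) is 12, and payload_capacity(131072, 1500, 12288) is 15000: a 128KB receive buffer would hold only ten such packets.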

Also worth noting that if you set MTU to 9000 (instead of standard
1500), you probably need to tweak a few sysctls.

Autotuning is tricky; changing initial values can be good in some
cases, bad in others.

It would be nice if you could send "ss -temoi" output taken at the receiver
while a transfer is in progress.

Eric Dumazet Dec. 7, 2020, 3:25 p.m. UTC | #3
On Sat, Dec 5, 2020 at 1:03 PM Mohamed Abuelfotoh, Hazem
<abuehaze@amazon.com> wrote:
>
> Unfortunately few things are missing in this report.
>
>     What is the RTT between hosts in your test ?
>      >>>>>RTT in my test is 162 msec, but I am able to reproduce it with lower RTTs; for example, I could see the issue downloading from a Google endpoint with an RTT of 16.7 msec. As mentioned in my previous e-mail, the issue is reproducible whenever the RTT exceeds 12 msec, given that the sender is using bbr.
>
>         RTT between hosts where I run the iperf test.
>         # ping 54.199.163.187
>         PING 54.199.163.187 (54.199.163.187) 56(84) bytes of data.
>         64 bytes from 54.199.163.187: icmp_seq=1 ttl=33 time=162 ms
>         64 bytes from 54.199.163.187: icmp_seq=2 ttl=33 time=162 ms
>         64 bytes from 54.199.163.187: icmp_seq=3 ttl=33 time=162 ms
>         64 bytes from 54.199.163.187: icmp_seq=4 ttl=33 time=162 ms
>
>         RTT between my EC2 instances and google endpoint.
>         # ping 172.217.4.240
>         PING 172.217.4.240 (172.217.4.240) 56(84) bytes of data.
>         64 bytes from 172.217.4.240: icmp_seq=1 ttl=101 time=16.7 ms
>         64 bytes from 172.217.4.240: icmp_seq=2 ttl=101 time=16.7 ms
>         64 bytes from 172.217.4.240: icmp_seq=3 ttl=101 time=16.7 ms
>         64 bytes from 172.217.4.240: icmp_seq=4 ttl=101 time=16.7 ms
>
>     What driver is used at the receiving side ?
>       >>>>>>I am using ENA driver version 2.2.10g on the receiver with scatter gathering enabled.
>
>         # ethtool -k eth0 | grep scatter-gather
>         scatter-gather: on
>                 tx-scatter-gather: on
>                 tx-scatter-gather-fraglist: off [fixed]

This ethtool output refers to TX scatter gather, which is not relevant
for this bug.

I see the ENA driver might use 16 KB per incoming packet (if ENA_PAGE_SIZE is 16 KB).

Since I can not reproduce this problem with another NIC on x86, I
really wonder if this is not an issue with ENA driver on PowerPC
perhaps ?
Mohamed Abuelfotoh, Hazem Dec. 7, 2020, 4:09 p.m. UTC | #4
>Since I can not reproduce this problem with another NIC on x86, I
>really wonder if this is not an issue with ENA driver on PowerPC
>perhaps ?

I am able to reproduce it on x86-based EC2 instances using the ENA, Xen netfront, or Intel ixgbevf driver on the receiver, so it's not specific to ENA. We were also able to easily reproduce it between 2 VMs running in VirtualBox on the same physical host, given the environment requirements I mentioned in my first e-mail.

What's the RTT between the sender & receiver in your reproduction? Are you using bbr on the sender side?

Thank you.

Hazem

Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284

Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
Eric Dumazet Dec. 7, 2020, 4:22 p.m. UTC | #5
On Mon, Dec 7, 2020 at 5:09 PM Mohamed Abuelfotoh, Hazem
<abuehaze@amazon.com> wrote:
>
>     >Since I can not reproduce this problem with another NIC on x86, I
>     >really wonder if this is not an issue with ENA driver on PowerPC
>     >perhaps ?
>
> I am able to reproduce it on x86 based EC2 instances using ENA  or  Xen netfront or Intel ixgbevf driver on the receiver so it's not specific to ENA, we were able to easily reproduce it between 2 VMs running in virtual box on the same physical host considering the environment requirements I mentioned in my first e-mail.
>
> What's the RTT between the sender & receiver in your reproduction? Are you using bbr on the sender side?



100ms RTT

Which exact version of linux kernel are you using ?



Mohamed Abuelfotoh, Hazem Dec. 7, 2020, 4:34 p.m. UTC | #6
>100ms RTT
>
>Which exact version of linux kernel are you using ?

On the receiver side I could see the issue with any mainline kernel version >=4.19.86 which is the first kernel version that has patches [1] & [2] included.
On the sender I am using kernel 5.4.0-rc6.

Links:

[1] https://lore.kernel.org/patchwork/patch/1157936/
[2] https://lore.kernel.org/patchwork/patch/1157883/

Thank you.

Hazem



Eric Dumazet Dec. 7, 2020, 5:08 p.m. UTC | #7
On Mon, Dec 7, 2020 at 5:34 PM Neal Cardwell <ncardwell@google.com> wrote:
>
> On Mon, Dec 7, 2020 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Mon, Dec 7, 2020 at 5:09 PM Mohamed Abuelfotoh, Hazem
> > <abuehaze@amazon.com> wrote:
> > >
> > >     >Since I can not reproduce this problem with another NIC on x86, I
> > >     >really wonder if this is not an issue with ENA driver on PowerPC
> > >     >perhaps ?
> > >
> > >
> > > I am able to reproduce it on x86 based EC2 instances using ENA  or  Xen netfront or Intel ixgbevf driver on the receiver so it's not specific to ENA, we were able to easily reproduce it between 2 VMs running in virtual box on the same physical host considering the environment requirements I mentioned in my first e-mail.
> > >
> > > What's the RTT between the sender & receiver in your reproduction? Are you using bbr on the sender side?
> >
> >
> > 100ms RTT
> >
> > Which exact version of linux kernel are you using ?
>
> Thanks for testing this, Eric. Would you be able to share the MTU
> config commands you used, and the tcpdump traces you get? I'm
> surprised that receive buffer autotuning would work for advmss of
> around 6500 or higher.

autotuning might be delayed by one RTT, but this does not match numbers
given by Mohamed (flows stuck in low speed)

autotuning is a heuristic, and because it has one RTT latency, it is
crucial to get proper initial rcvmem values.

People using MTU=9000 should know they have to tune tcp_rmem[1]
accordingly, especially when using drivers consuming one page per
incoming MSS.


(mlx4 driver only uses one 2048-byte fragment for a 1500 MTU packet,
even with MTU set to 9000)

I want to state again that using 536 bytes as a magic value makes no
sense to me.


For the record, Google has increased tcp_rmem[1] when switching to a bigger MTU.

The reason is simple: if we intend to receive 10 MSS, we should allow
for 90000 bytes of payload, which means tcp_rmem[1] set to 180000.
Because of autotuning latency, doubling the value is advised: 360000.
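
The arithmetic above can be written out as a small helper. This is only a sketch of the numbers in this message: the function name and parameters are hypothetical, the MSS is approximated by the MTU (9000) as in the text, and the factor of two between payload and tcp_rmem[1] is taken from the message as given.

```c
#include <stdint.h>

/* Suggested tcp_rmem[1] for receiving n_mss full-sized segments.
 * payload = n_mss * mss; the buffer is sized at twice the payload, and
 * doubled once more when headroom for autotuning latency is wanted.
 * Hypothetical helper mirroring the arithmetic in the message. */
uint32_t suggested_rmem_default(uint32_t mss, uint32_t n_mss,
                                int autotune_headroom)
{
    uint32_t payload = mss * n_mss; /* 10 * 9000 = 90000 */
    uint32_t rmem = 2 * payload;    /* 180000 */

    if (autotune_headroom)
        rmem *= 2;                  /* 360000 */
    return rmem;
}
```

Calling it with (9000, 10, 0) reproduces the 180000 figure, and with (9000, 10, 1) the 360000 figure.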

Another problem with kicking autotuning too fast is that it might
allow bigger sk->sk_rcvbuf values even for small flows, opening more
surface to malicious attacks.

I _think_ that if we want to allow admins to set high MTU without
having to tune tcp_rmem[], we need something different than current
proposal.
Mohamed Abuelfotoh, Hazem Dec. 7, 2020, 5:16 p.m. UTC | #8
>Thanks for testing this, Eric. Would you be able to share the MTU
>config commands you used, and the tcpdump traces you get? I'm
>surprised that receive buffer autotuning would work for advmss of
>around 6500 or higher.


Packet capture before applying the proposed patch

https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-bad-unpatched.pcap?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJNMP5ZZ3I4FAQGAQ%2F20201207%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201207T170123Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=a599a0e0e6632a957e5619007ba5ce4f63c8e8535ea24470b7093fef440a8300

Packet capture after applying the proposed patch

https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-good-patched.pcap?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJNMP5ZZ3I4FAQGAQ%2F20201207%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201207T165831Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=f18ec7246107590e8ac35c24322af699e4c2a73d174067c51cf6b0a06bbbca77

The kernel version, MTU, and configuration of my receiver and sender are attached to this e-mail. Please be aware that EC2 does MSS clamping, so you need to configure the MTU as 1500 on the sender side if you do not have any MSS clamping between sender and receiver.

Thank you.

Hazem


Eric Dumazet Dec. 7, 2020, 5:27 p.m. UTC | #9
On Mon, Dec 7, 2020 at 6:17 PM Mohamed Abuelfotoh, Hazem
<abuehaze@amazon.com> wrote:
>

>     >Thanks for testing this, Eric. Would you be able to share the MTU

>     >config commands you used, and the tcpdump traces you get? I'm

>     >surprised that receive buffer autotuning would work for advmss of

>     >around 6500 or higher.

>

> Packet capture before applying the proposed patch

>

> https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-bad-unpatched.pcap?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJNMP5ZZ3I4FAQGAQ%2F20201207%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201207T170123Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=a599a0e0e6632a957e5619007ba5ce4f63c8e8535ea24470b7093fef440a8300

>

> Packet capture after applying the proposed patch

>

> https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-good-patched.pcap?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJNMP5ZZ3I4FAQGAQ%2F20201207%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201207T165831Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=f18ec7246107590e8ac35c24322af699e4c2a73d174067c51cf6b0a06bbbca77

>

> kernel version & MTU and configuration  from my receiver & sender is attached to this e-mail, please be aware that EC2 is doing MSS clamping so you need to configure MTU as 1500 on the sender side if you don’t have any MSS clamping between sender & receiver.

>

> Thank you.

>

> Hazem


Please try again, with a fixed tcp_rmem[1] on receiver, taking into
account bigger memory requirement for MTU 9000

Rationale : TCP should be ready to receive 10 full frames before
autotuning takes place (these 10 MSS are typically in a single GRO
packet)

At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)

TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.

->

echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem
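
The "50% factor rule" can be illustrated with a userspace model of how a receive buffer size maps to the window that may be advertised. This is a sketch written from my understanding of tcp_win_from_space() and the tcp_adv_win_scale sysctl, assuming tcp_adv_win_scale = 1 (which I believe is the default on the kernels discussed here). The function below is a standalone model, not the kernel helper itself.

```c
/* Portion of a receive buffer of `space` bytes that may be advertised
 * as TCP window for a given tcp_adv_win_scale value. With scale = 1,
 * half of the buffer is reserved for overhead. Standalone model. */
int win_from_space(int space, int adv_win_scale)
{
    return adv_win_scale <= 0 ?
        (space >> (-adv_win_scale)) :
        space - (space >> adv_win_scale);
}
```

Under that assumption, a 180000-byte buffer allows a 90000-byte window (10 frames of 9000 bytes), while the 131072-byte default allows 65536 bytes.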



Greg KH Dec. 7, 2020, 5:46 p.m. UTC | #10
On Mon, Dec 07, 2020 at 04:34:57PM +0000, Mohamed Abuelfotoh, Hazem wrote:
> 100ms RTT
>
> >Which exact version of linux kernel are you using ?
> On the receiver side I could see the issue with any mainline kernel
> version >=4.19.86 which is the first kernel version that has patches
> [1] & [2] included.  On the sender I am using kernel 5.4.0-rc6.


5.4.0-rc6 is a very old and odd kernel to be doing anything with.  Are
you sure you don't mean "5.10-rc6" here?

thanks,

greg k-h
Mohamed Abuelfotoh, Hazem Dec. 7, 2020, 5:54 p.m. UTC | #11
>5.4.0-rc6 is a very old and odd kernel to be doing anything with.  Are
>you sure you don't mean "5.10-rc6" here?

I was able to reproduce it on the latest mainline kernel as well, so anything newer than 4.19.85 is broken.

Thank you.

Hazem

Mohamed Abuelfotoh, Hazem Dec. 7, 2020, 8:09 p.m. UTC | #12
>I want to state again that using 536 bytes as a magic value makes no
>sense to me.

>autotuning might be delayed by one RTT, this does not match numbers
>given by Mohamed (flows stuck in low speed)

>autotuning is an heuristic, and because it has one RTT latency, it is
>crucial to get proper initial rcvmem values.

>People using MTU=9000 should know they have to tune tcp_rmem[1]
>accordingly, especially when using drivers consuming one page per
>incoming MSS.




The magic number would be 10*rcv_mss = 5360, not 536, and in my opinion that is already a large amount of data to send in an attack. If we are talking about a DDoS attack triggering autotuning at 5360 bytes, I'd say the attacker could also trigger it by sending 64KB. I do agree that it would be easier with a lower rcvq_space.space; it is always a tradeoff between security and performance.

Other options would be either to take the configured MTU into account in the rcv_wnd calculation or to check the MTU before calculating the initial rcvspace. We have to make sure that the initial receive space is lower than the initial receive window, so that autotuning works regardless of the MTU configured on the receiver. Only people using jumbo frames would pay the price, if we agree that jumbo frame users can be expected to have machines with more memory. I'd say something like the below should work:

void tcp_init_buffer_space(struct sock *sk)
{
	int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_sock *tp = tcp_sk(sk);
	int maxwin;

	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
		tcp_sndbuf_expand(sk);
	if(tp->advmss < 6000)
		tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
	else
		tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
	tcp_mstamp_refresh(tp);
	tp->rcvq_space.time = tp->tcp_mstamp;
	tp->rcvq_space.seq = tp->copied_seq;
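
To make the effect of the threshold concrete, here is a small userspace model of the initial rcvq_space.space value. It is a sketch under stated assumptions: the helper names are hypothetical, the advmss of 8961 assumes MTU 9001 minus 40 bytes of IPv4 and TCP headers, the 65535-byte rcv_wnd is an illustrative value, and the trigger condition reflects my reading that autotuning only grows the buffer once more than rcvq_space.space bytes are copied in one RTT.

```c
#include <stdint.h>

#define TCP_INIT_CWND 10

/* Initial rcvq_space.space: min(rcv_wnd, TCP_INIT_CWND * mss), where
 * mss is advmss in the current code or rcv_mss in the proposal. */
uint32_t init_rcvq_space(uint32_t rcv_wnd, uint32_t mss)
{
    uint32_t by_mss = TCP_INIT_CWND * mss;

    return rcv_wnd < by_mss ? rcv_wnd : by_mss;
}

/* Simplified trigger model: the sender can deliver at most rcv_wnd
 * bytes per RTT, so the threshold can only be exceeded if it is
 * strictly below rcv_wnd. */
int autotune_can_start(uint32_t rcv_wnd, uint32_t space)
{
    return rcv_wnd > space;
}
```

With these assumed values, advmss = 8961 yields a threshold of 65535, equal to the window, so the trigger never fires; advmss = 1460 yields 14600 and rcv_mss = 536 yields 5360, both below the window.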



I don't think that we should rely on admins manually tuning tcp_rmem[1] when jumbo frames are in use, and Linux users shouldn't have to expect performance degradation after a kernel upgrade. Although [1] is the only public report of this issue, I am pretty sure we will see more users reporting it as the main Linux distributions move to kernel 5.4 as their stable version. In summary, we should come up with something, either the proposed patch or something else, to spare admins the manual work.


Links

[1] https://github.com/kubernetes/kops/issues/10206

Eric Dumazet Dec. 7, 2020, 11:22 p.m. UTC | #13
On Mon, Dec 7, 2020 at 9:09 PM Mohamed Abuelfotoh, Hazem
<abuehaze@amazon.com> wrote:
>

>     >I want to state again that using 536 bytes as a magic value makes no

>     sense to me.

>

>  >autotuning might be delayed by one RTT, this does not match numbers

>  >given by Mohamed (flows stuck in low speed)

>

>   >autotuning is an heuristic, and because it has one RTT latency, it is

>    >crucial to get proper initial rcvmem values.

>

>    >People using MTU=9000 should know they have to tune tcp_rmem[1]

>    >accordingly, especially when using drivers consuming one page per

>    >+incoming MSS.

>

>

>

> The magic number would be 10*rcv_mss=5360 not 536 and in my opinion it's a big amount of data to be sent in security attack so if we are talking about DDos attack triggering Autotuning at 5360 bytes I'd say he will also be able to trigger it sending 64KB but I totally agree that it would be easier with lower rcvq_space.space, it's always a tradeoff between security and performance.




>

> Other options would be to either consider the configured MTU in the rcv_wnd calculation or probably check the MTU before calculating the initial rcvspace. We have to make sure that initial receive space is lower than initial receive window so Autotuning would work regardless the configured MTU on the receiver and only people using Jumbo frames will be paying the price if we agreed that it's expected for Jumbo frame users to have machines with more memory,  I'd say something as below should work:

>

> void tcp_init_buffer_space(struct sock *sk)

> {

>         int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;

>         struct inet_connection_sock *icsk = inet_csk(sk);

>         struct tcp_sock *tp = tcp_sk(sk);

>         int maxwin;

>

>         if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))

>                 tcp_sndbuf_expand(sk);

>         if(tp->advmss < 6000)

>                 tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);


This is just another hack, based on 'magic' numbers.

>         else

>                 tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);

>         tcp_mstamp_refresh(tp);

>         tp->rcvq_space.time = tp->tcp_mstamp;

>         tp->rcvq_space.seq = tp->copied_seq;

>

>

>

> I don't think that we should rely on Admins manually tuning this tcp_rmem[1] with Jumbo frame in use also Linux users shouldn't expect performance degradation after kernel upgrade. although [1] is the only public reporting of this issue, I am pretty sure we will see more users reporting this with Linux Main distributions moving to kernel 5.4 as stable version. In Summary we should come up with something either the proposed patch or something else to avoid admins doing the manual job.

>




Default MTU is 1500, not 9000.

I hinted in my very first reply to you that MTU  9000 is not easy and
needs tuning. We could argue and try to make this less of a pain in
future kernel (net-next)

<quote>Also worth noting that if you set MTU to 9000 (instead of
standard 1500), you probably need to tweak a few sysctls.
</quote>

I think I have asked you multiple times to test appropriate
tcp_rmem[1] settings...

I gave the reason why tcp_rmem[1] set to 131072 is not good for MTU
9000. I would prefer a solution that involves no kernel patch and no
backports, just a matter of educating sysadmins, for increased TCP
performance, especially when really using 9000 MTU...

Your patch would change the behavior of TCP stack for standard
MTU=1500 flows which are still the majority. This is very risky.

Anyway. _if_ we really wanted to change the kernel, ( keeping stupid
tcp_rmem[1] value ) :

In the formula

    tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);

the bug is really in the tp->rcv_wnd term, not the second one.

This is buggy, because tcp_init_buffer_space() ends up with
tp->window_clamp smaller than tp->rcv_wnd, so tcp_grow_window() is not
able to change tp->rcv_ssthresh

The only mechanism allowing tp->window_clamp to change later would be
DRS, so we had better use the proper limit when initializing
tp->rcvq_space.space

This issue disappears if tcp_rmem[1] is slightly above 131072, because
then the following is not needed.

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9e8a6c1aa0190cc248b3b99b073a4c6e45884cf5..81b5d9375860ae583e08045fb25b089c456c60ab 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -534,6 +534,7 @@ static void tcp_init_buffer_space(struct sock *sk)

        tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
        tp->snd_cwnd_stamp = tcp_jiffies32;
+       tp->rcvq_space.space = min(tp->rcv_ssthresh, tp->rcvq_space.space);
 }

 /* 4. Recalculate window clamp after socket hit its memory bounds. */
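
The reasoning behind the one-line change above can be checked with plain arithmetic. The following is a minimal, standalone C sketch (userspace code, not kernel code) of how the initial DRS threshold is chosen with and without the extra min(). The helper name initial_rcvq_space() and every sample number are illustrative assumptions; only the two min() expressions mirror the thread.

```c
#include <stdbool.h>
#include <stdint.h>

#define TCP_INIT_CWND 10 /* initial congestion window, in segments */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

/*
 * Hypothetical helper: returns the value rcvq_space.space would start with.
 * clamp_to_ssthresh selects whether the extra min() from the diff is applied.
 */
static uint32_t initial_rcvq_space(uint32_t rcv_wnd, uint32_t advmss,
				   uint32_t rcv_ssthresh, bool clamp_to_ssthresh)
{
	/* Existing formula: the smaller of the window and 10 segments. */
	uint32_t space = min_u32(rcv_wnd, TCP_INIT_CWND * advmss);

	/* Proposed addition: never start above rcv_ssthresh. */
	if (clamp_to_ssthresh)
		space = min_u32(rcv_ssthresh, space);
	return space;
}
```

With the illustrative values rcv_wnd = 65535, advmss = 8960 and rcv_ssthresh = 56000 (that is, already limited by a window_clamp below rcv_wnd), the unclamped result is 65535, which is larger than the window the receiver can grow to, while the clamped result is 56000. With advmss = 1460 and rcv_ssthresh = 65535, both variants give 14600, so in this sketch standard-MTU flows see no change.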
Mohamed Abuelfotoh, Hazem Dec. 8, 2020, 4:28 p.m. UTC | #14
>Please try again, with a fixed tcp_rmem[1] on receiver, taking into
>account bigger memory requirement for MTU 9000
>
>Rationale : TCP should be ready to receive 10 full frames before
>autotuning takes place (these 10 MSS are typically in a single GRO
>packet)
>
>At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)
>
>TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.
>
>->
>
>echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem

>diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>index 9e8a6c1aa0190cc248b3b99b073a4c6e45884cf5..81b5d9375860ae583e08045fb25b089c456c60ab 100644
>--- a/net/ipv4/tcp_input.c
>+++ b/net/ipv4/tcp_input.c
>@@ -534,6 +534,7 @@ static void tcp_init_buffer_space(struct sock *sk)
>
>        tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
>        tp->snd_cwnd_stamp = tcp_jiffies32;
>+       tp->rcvq_space.space = min(tp->rcv_ssthresh, tp->rcvq_space.space);
>}


Yes, this worked, and it looks like echo "4096 140000 15728640" >/proc/sys/net/ipv4/tcp_rmem is actually enough to trigger TCP autotuning. If the current default tcp_rmem[1] doesn't work well with 9000 MTU, I am curious to know whether there is a specific reason for choosing 131072 as tcp_rmem[1]. I think the number itself has to be divisible by the page size (4K) and by 16KB, given what you said about each jumbo frame consuming up to 16KB.

If the patch I proposed would be risky for users with an MTU of 1500 because of its higher memory footprint, then in my opinion we should get the patch you proposed merged instead of asking admins to do the manual work.
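
The sizing rule quoted at the top of this message can be written out as arithmetic. This is a minimal sketch, assuming the "50% factor rule" means that every byte of expected payload is backed by two bytes of receive buffer, and that one MSS is roughly 9000 bytes at MTU 9000. The function names are hypothetical and the code is userspace C, not kernel code.

```c
#include <stdint.h>

/* Bytes of receive buffer accounted per byte of payload (the 50% rule). */
#define RMEM_PER_PAYLOAD_BYTE 2

/* Receive-buffer bytes needed to hold `frames` full-sized segments. */
static uint32_t rmem_for_frames(uint32_t mss, uint32_t frames)
{
	return frames * mss * RMEM_PER_PAYLOAD_BYTE;
}

/* Full-sized segments that fit in a given tcp_rmem[1] under the same rule. */
static uint32_t frames_for_rmem(uint32_t rmem, uint32_t mss)
{
	return rmem / (mss * RMEM_PER_PAYLOAD_BYTE);
}
```

Under these assumptions rmem_for_frames(9000, 10) gives 180000, the value used in the suggested echo command, while frames_for_rmem(131072, 9000) gives 7, so the default falls short of the 10 full frames named in the rationale. For a 1500-byte segment the same default covers 43 frames.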

Thank you.

Hazem

On 07/12/2020, 17:28, "Eric Dumazet" <edumazet@google.com> wrote:

    On Mon, Dec 7, 2020 at 6:17 PM Mohamed Abuelfotoh, Hazem
    <abuehaze@amazon.com> wrote:
    >
    >     >Thanks for testing this, Eric. Would you be able to share the MTU
    >     >config commands you used, and the tcpdump traces you get? I'm
    >     >surprised that receive buffer autotuning would work for advmss of
    >     >around 6500 or higher.
    >
    > Packet capture before applying the proposed patch
    >
    > https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-bad-unpatched.pcap?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJNMP5ZZ3I4FAQGAQ%2F20201207%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201207T170123Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=a599a0e0e6632a957e5619007ba5ce4f63c8e8535ea24470b7093fef440a8300
    >
    > Packet capture after applying the proposed patch
    >
    > https://tcpautotuningpcaps.s3.eu-west-1.amazonaws.com/sender-bbr-good-patched.pcap?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJNMP5ZZ3I4FAQGAQ%2F20201207%2Feu-west-1%2Fs3%2Faws4_request&X-Amz-Date=20201207T165831Z&X-Amz-Expires=604800&X-Amz-SignedHeaders=host&X-Amz-Signature=f18ec7246107590e8ac35c24322af699e4c2a73d174067c51cf6b0a06bbbca77
    >
    > kernel version & MTU and configuration from my receiver & sender is attached to this e-mail, please be aware that EC2 is doing MSS clamping so you need to configure MTU as 1500 on the sender side if you don’t have any MSS clamping between sender & receiver.
    >
    > Thank you.
    >
    > Hazem


    Please try again, with a fixed tcp_rmem[1] on receiver, taking into
    account bigger memory requirement for MTU 9000

    Rationale : TCP should be ready to receive 10 full frames before
    autotuning takes place (these 10 MSS are typically in a single GRO
    packet)

    At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)

    TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.

    ->

    echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem



    >

    >

    > On 07/12/2020, 16:34, "Neal Cardwell" <ncardwell@google.com> wrote:


    >     On Mon, Dec 7, 2020 at 11:23 AM Eric Dumazet <edumazet@google.com> wrote:
    >     >
    >     > On Mon, Dec 7, 2020 at 5:09 PM Mohamed Abuelfotoh, Hazem
    >     > <abuehaze@amazon.com> wrote:
    >     > >
    >     > >     >Since I can not reproduce this problem with another NIC on x86, I
    >     > >     >really wonder if this is not an issue with ENA driver on PowerPC
    >     > >     >perhaps ?
    >     > >
    >     > > I am able to reproduce it on x86 based EC2 instances using ENA or Xen netfront or Intel ixgbevf driver on the receiver so it's not specific to ENA, we were able to easily reproduce it between 2 VMs running in virtual box on the same physical host considering the environment requirements I mentioned in my first e-mail.
    >     > >
    >     > > What's the RTT between the sender & receiver in your reproduction? Are you using bbr on the sender side?
    >     >
    >     > 100ms RTT
    >     >
    >     > Which exact version of linux kernel are you using ?
    >
    >     Thanks for testing this, Eric. Would you be able to share the MTU
    >     config commands you used, and the tcpdump traces you get? I'm
    >     surprised that receive buffer autotuning would work for advmss of
    >     around 6500 or higher.
    >
    >     thanks,
    >     neal






Amazon Web Services EMEA SARL, 38 avenue John F. Kennedy, L-1855 Luxembourg, R.C.S. Luxembourg B186284

Amazon Web Services EMEA SARL, Irish Branch, One Burlington Plaza, Burlington Road, Dublin 4, Ireland, branch registration number 908705
Mohamed Abuelfotoh, Hazem Dec. 8, 2020, 4:30 p.m. UTC | #15
Feel free to ignore this message, as I sent it before seeing your newly submitted patch (

Thank you.

Hazem



Eric Dumazet Dec. 8, 2020, 4:46 p.m. UTC | #16
On Tue, Dec 8, 2020 at 5:28 PM Mohamed Abuelfotoh, Hazem
<abuehaze@amazon.com> wrote:
>
>     >Please try again, with a fixed tcp_rmem[1] on receiver, taking into
>     >account bigger memory requirement for MTU 9000
>
>     >Rationale : TCP should be ready to receive 10 full frames before
>     >autotuning takes place (these 10 MSS are typically in a single GRO
>     >packet)
>
>     >At 9000 MTU, one frame typically consumes 12KB (or 16KB on some arches/drivers)
>
>     >TCP uses a 50% factor rule, accounting 18000 bytes of kernel memory per MSS.
>
>     >->
>
>     >echo "4096 180000 15728640" >/proc/sys/net/ipv4/tcp_rmem
>
> >diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >index 9e8a6c1aa0190cc248b3b99b073a4c6e45884cf5..81b5d9375860ae583e08045fb25b089c456c60ab 100644
> >--- a/net/ipv4/tcp_input.c
> >+++ b/net/ipv4/tcp_input.c
> >@@ -534,6 +534,7 @@ static void tcp_init_buffer_space(struct sock *sk)
> >
> >        tp->rcv_ssthresh = min(tp->rcv_ssthresh, tp->window_clamp);
> >        tp->snd_cwnd_stamp = tcp_jiffies32;
> >+       tp->rcvq_space.space = min(tp->rcv_ssthresh, tp->rcvq_space.space);
> >}
>

> Yes this worked and it looks like echo "4096 140000 15728640" >/proc/sys/net/ipv4/tcp_rmem is actually enough to trigger TCP autotuning, if the current default tcp_rmem[1] doesn't work well with 9000 MTU I am curious to know  if there is specific reason behind having 131072 specifically   as  tcp_rmem[1]?I think the number itself has to be divisible by page size (4K) and 16KB given what you said that each Jumbo frame packet may consume up to 16KB.



I think the idea behind the value of 131072 was that because TCP RWIN
was set to 65535, we had to reserve twice this amount of memory ->
131072 bytes.

Assuming DRS works well, the exact value should matter only for
unresponsive applications (slow to read/drain the receive queue),
since DRS is delayed for them.
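
The "twice this amount" relationship can be illustrated with the rule that splits a receive buffer between advertised window and bookkeeping overhead. The sketch below is userspace C modeled on the kernel's tcp_win_from_space() helper and the tcp_adv_win_scale sysctl as they looked in kernels of this period; the exact form, and the default scale of 1, are assumptions that are not taken from this thread.

```c
#include <stdint.h>

/*
 * Window that a given amount of receive-buffer space translates to.
 * adv_win_scale = 1 (assumed default) reserves half of the buffer for
 * overhead, which is the 50% split discussed in the thread.
 */
static int32_t win_from_space(int32_t space, int adv_win_scale)
{
	return adv_win_scale <= 0 ? space >> (-adv_win_scale)
				  : space - (space >> adv_win_scale);
}
```

Under that assumption win_from_space(131072, 1) is 65536, just enough to cover a 65535-byte window, and win_from_space(180000, 1) is 90000, which is ten 9000-byte segments.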

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 389d1b340248..f0ffac9e937b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -504,13 +504,14 @@  static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_init_buffer_space(struct sock *sk)
 {
 	int tcp_app_win = sock_net(sk)->ipv4.sysctl_tcp_app_win;
+	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	int maxwin;
 
 	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK))
 		tcp_sndbuf_expand(sk);
 
-	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * tp->advmss);
+	tp->rcvq_space.space = min_t(u32, tp->rcv_wnd, TCP_INIT_CWND * icsk->icsk_ack.rcv_mss);
 	tcp_mstamp_refresh(tp);
 	tp->rcvq_space.time = tp->tcp_mstamp;
 	tp->rcvq_space.seq = tp->copied_seq;