diff mbox series

[net-next] net: tcp: Updating MSS, when the sending window is smaller than MSS.

Message ID 1623058534-78782-1-git-send-email-chengshuyi@linux.alibaba.com
State New
Headers show
Series [net-next] net: tcp: Updating MSS, when the sending window is smaller than MSS. | expand

Commit Message

Shuyi Cheng June 7, 2021, 9:35 a.m. UTC
When the lo network card is used for communication, the tcp server
reduces the size of the receiving buffer, causing the tcp client
to have a delay of 200ms. Examples are as follows:

Suppose that the MTU of the lo network card is 65536, and the tcp server
sets the receive buffer size to 42KB. According to the
tcp_bound_to_half_wnd function, the MSS of the server and client is
21KB. Then, the tcp server sets the buffer size of the connection to
16KB. At this time, the MSS of the server is 8KB, and the MSS of the
client is still 21KB. But it will cause the client to fail to send the
message, that is, tcp_write_xmit fails. Mainly because tcp_snd_wnd_test
failed, and then entered the zero window detection phase, resulting in a
200ms delay.

Therefore, we mainly modify two places. One is the tcp_current_mss
function. When the sending window is smaller than the current mss, mss
needs to be updated. The other is the tcp_bound_to_half_wnd function.
When the sending window is smaller than the current mss, the mss value
should be calculated according to the current sending window, not
max_window.

Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>
---
 include/net/tcp.h     | 11 ++++++++---
 net/ipv4/tcp_output.c |  3 ++-
 2 files changed, 10 insertions(+), 4 deletions(-)

Comments

Eric Dumazet June 7, 2021, 11:34 a.m. UTC | #1
On Mon, Jun 7, 2021 at 11:35 AM Shuyi Cheng
<chengshuyi@linux.alibaba.com> wrote:
>
> When the lo network card is used for communication, the tcp server
> reduces the size of the receiving buffer, causing the tcp client
> to have a delay of 200ms. Examples are as follows:
>
> Suppose that the MTU of the lo network card is 65536, and the tcp server
> sets the receive buffer size to 42KB. According to the
> tcp_bound_to_half_wnd function, the MSS of the server and client is
> 21KB. Then, the tcp server sets the buffer size of the connection to
> 16KB. At this time, the MSS of the server is 8KB, and the MSS of the
> client is still 21KB. But it will cause the client to fail to send the
> message, that is, tcp_write_xmit fails. Mainly because tcp_snd_wnd_test
> failed, and then entered the zero window detection phase, resulting in a
> 200ms delay.
>
> Therefore, we mainly modify two places. One is the tcp_current_mss
> function. When the sending window is smaller than the current mss, mss
> needs to be updated. The other is the tcp_bound_to_half_wnd function.
> When the sending window is smaller than the current mss, the mss value
> should be calculated according to the current sending window, not
> max_window.
>
> Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>
> ---
>  include/net/tcp.h     | 11 ++++++++---
>  net/ipv4/tcp_output.c |  3 ++-
>  2 files changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index e668f1b..fcdef16 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -641,6 +641,11 @@ static inline void tcp_clear_xmit_timers(struct sock *sk)
>  static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
>  {
>         int cutoff;
> +       int window;
> +
> +       window = tp->max_window;
> +       if (tp->snd_wnd && tp->snd_wnd < pktsize)
> +               window = tp->snd_wnd;
>
>         /* When peer uses tiny windows, there is no use in packetizing
>          * to sub-MSS pieces for the sake of SWS or making sure there
> @@ -649,10 +654,10 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
>          * On the other hand, for extremely large MSS devices, handling
>          * smaller than MSS windows in this way does make sense.
>          */
> -       if (tp->max_window > TCP_MSS_DEFAULT)
> -               cutoff = (tp->max_window >> 1);
> +       if (window > TCP_MSS_DEFAULT)
> +               cutoff = (window >> 1);
>         else
> -               cutoff = tp->max_window;
> +               cutoff = window;
>
>         if (cutoff && pktsize > cutoff)
>                 return max_t(int, cutoff, 68U - tp->tcp_header_len);
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index bde781f..88dcdf2 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1833,7 +1833,8 @@ unsigned int tcp_current_mss(struct sock *sk)
>
>         if (dst) {
>                 u32 mtu = dst_mtu(dst);
> -               if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
> +               if (mtu != inet_csk(sk)->icsk_pmtu_cookie ||
> +                   (tp->snd_wnd && tp->snd_wnd < mss_now))
>                         mss_now = tcp_sync_mss(sk, mtu);


I do not think we want to add code in fast path only to cope with
pathological user choices.

Maybe it is time to accept the fact that setting the receive buffer
size to 42KB is dumb nowadays.

TCP needs to have at least two buffers in flight to avoid 'strange' stalls.

If loopback MTU is 64KB  (and TSO/GRO sizes also can reach 64K), maybe
we need to ensure a minimal rcvbuf.

If applications set 42KB window size, they get what they asked for :
damn slow TCP communications.

We do not want to make them happy, while increasing cpu costs for 99%
of other uses cases which are not trying to make TCP miserable.

I would rather add a sysctl or something to ensure rcvbuf has a
minimal sane value, instead of risking subtle TCP regressions.

In 2021, we should not limit ourselves to memory constraints that were
common 40 years ago when TCP was invented.

References :
commit 9eb5bf838d06aa6ddebe4aca6b5cedcf2eb53b86 net: sock: fix
TCP_SKB_MIN_TRUESIZE
commit eea86af6b1e18d6fa8dc959e3ddc0100f27aff9f net: sock: adapt
SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF
Shuyi Cheng June 8, 2021, 2:26 a.m. UTC | #2
On 6/7/21 7:34 PM, Eric Dumazet wrote:
> On Mon, Jun 7, 2021 at 11:35 AM Shuyi Cheng

> <chengshuyi@linux.alibaba.com> wrote:

>>

>> When the lo network card is used for communication, the tcp server

>> reduces the size of the receiving buffer, causing the tcp client

>> to have a delay of 200ms. Examples are as follows:

>>

>> Suppose that the MTU of the lo network card is 65536, and the tcp server

>> sets the receive buffer size to 42KB. According to the

>> tcp_bound_to_half_wnd function, the MSS of the server and client is

>> 21KB. Then, the tcp server sets the buffer size of the connection to

>> 16KB. At this time, the MSS of the server is 8KB, and the MSS of the

>> client is still 21KB. But it will cause the client to fail to send the

>> message, that is, tcp_write_xmit fails. Mainly because tcp_snd_wnd_test

>> failed, and then entered the zero window detection phase, resulting in a

>> 200ms delay.

>>

>> Therefore, we mainly modify two places. One is the tcp_current_mss

>> function. When the sending window is smaller than the current mss, mss

>> needs to be updated. The other is the tcp_bound_to_half_wnd function.

>> When the sending window is smaller than the current mss, the mss value

>> should be calculated according to the current sending window, not

>> max_window.

>>

>> Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>

>> ---

>>   include/net/tcp.h     | 11 ++++++++---

>>   net/ipv4/tcp_output.c |  3 ++-

>>   2 files changed, 10 insertions(+), 4 deletions(-)

>>

>> diff --git a/include/net/tcp.h b/include/net/tcp.h

>> index e668f1b..fcdef16 100644

>> --- a/include/net/tcp.h

>> +++ b/include/net/tcp.h

>> @@ -641,6 +641,11 @@ static inline void tcp_clear_xmit_timers(struct sock *sk)

>>   static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)

>>   {

>>          int cutoff;

>> +       int window;

>> +

>> +       window = tp->max_window;

>> +       if (tp->snd_wnd && tp->snd_wnd < pktsize)

>> +               window = tp->snd_wnd;

>>

>>          /* When peer uses tiny windows, there is no use in packetizing

>>           * to sub-MSS pieces for the sake of SWS or making sure there

>> @@ -649,10 +654,10 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)

>>           * On the other hand, for extremely large MSS devices, handling

>>           * smaller than MSS windows in this way does make sense.

>>           */

>> -       if (tp->max_window > TCP_MSS_DEFAULT)

>> -               cutoff = (tp->max_window >> 1);

>> +       if (window > TCP_MSS_DEFAULT)

>> +               cutoff = (window >> 1);

>>          else

>> -               cutoff = tp->max_window;

>> +               cutoff = window;

>>

>>          if (cutoff && pktsize > cutoff)

>>                  return max_t(int, cutoff, 68U - tp->tcp_header_len);

>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

>> index bde781f..88dcdf2 100644

>> --- a/net/ipv4/tcp_output.c

>> +++ b/net/ipv4/tcp_output.c

>> @@ -1833,7 +1833,8 @@ unsigned int tcp_current_mss(struct sock *sk)

>>

>>          if (dst) {

>>                  u32 mtu = dst_mtu(dst);

>> -               if (mtu != inet_csk(sk)->icsk_pmtu_cookie)

>> +               if (mtu != inet_csk(sk)->icsk_pmtu_cookie ||

>> +                   (tp->snd_wnd && tp->snd_wnd < mss_now))

>>                          mss_now = tcp_sync_mss(sk, mtu);

> 

>

> I do not think we want to add code in fast path only to cope with

> pathological user choices.

> 


Thank you very much for your reply!

I very much agree with your point of view. However, the kernel currently
accepts the unreasonable RCVBUF set by the developer, that is, the set
RCVBUF is smaller than MSS. Perhaps, we can avoid setting RCVBUF to be
smaller than MSS in the sock_setsockopt function. What do you think?

Thanks.

> Maybe it is time to accept the fact that setting the receive buffer

> size to 42KB is dumb nowadays.

> 

> TCP needs to have at least two buffers in flight to avoid 'strange' stalls.

> 

> If loopback MTU is 64KB  (and TSO/GRO sizes also can reach 64K), maybe

> we need to ensure a minimal rcvbuf.

> 

> If applications set 42KB window size, they get what they asked for :

> damn slow TCP communications.

> 

> We do not want to make them happy, while increasing cpu costs for 99%

> of other uses cases which are not trying to make TCP miserable.

> 

> I would rather add a sysctl or something to ensure rcvbuf has a

> minimal sane value, instead of risking subtle TCP regressions.

> 

> In 2021, we should not limit ourselves to memory constraints that were

> common 40 years ago when TCP was invented.

> 

> References :

> commit 9eb5bf838d06aa6ddebe4aca6b5cedcf2eb53b86 net: sock: fix

> TCP_SKB_MIN_TRUESIZE

> commit eea86af6b1e18d6fa8dc959e3ddc0100f27aff9f net: sock: adapt

> SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF

>
Eric Dumazet June 8, 2021, 3:56 p.m. UTC | #3
On Tue, Jun 8, 2021 at 4:26 AM Shuyi Cheng <chengshuyi@linux.alibaba.com> wrote:
>

>

>

> On 6/7/21 7:34 PM, Eric Dumazet wrote:

> > On Mon, Jun 7, 2021 at 11:35 AM Shuyi Cheng

> > <chengshuyi@linux.alibaba.com> wrote:

> >>

> >> When the lo network card is used for communication, the tcp server

> >> reduces the size of the receiving buffer, causing the tcp client

> >> to have a delay of 200ms. Examples are as follows:

> >>

> >> Suppose that the MTU of the lo network card is 65536, and the tcp server

> >> sets the receive buffer size to 42KB. According to the

> >> tcp_bound_to_half_wnd function, the MSS of the server and client is

> >> 21KB. Then, the tcp server sets the buffer size of the connection to

> >> 16KB. At this time, the MSS of the server is 8KB, and the MSS of the

> >> client is still 21KB. But it will cause the client to fail to send the

> >> message, that is, tcp_write_xmit fails. Mainly because tcp_snd_wnd_test

> >> failed, and then entered the zero window detection phase, resulting in a

> >> 200ms delay.

> >>

> >> Therefore, we mainly modify two places. One is the tcp_current_mss

> >> function. When the sending window is smaller than the current mss, mss

> >> needs to be updated. The other is the tcp_bound_to_half_wnd function.

> >> When the sending window is smaller than the current mss, the mss value

> >> should be calculated according to the current sending window, not

> >> max_window.

> >>

> >> Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>

> >> ---

> >>   include/net/tcp.h     | 11 ++++++++---

> >>   net/ipv4/tcp_output.c |  3 ++-

> >>   2 files changed, 10 insertions(+), 4 deletions(-)

> >>

> >> diff --git a/include/net/tcp.h b/include/net/tcp.h

> >> index e668f1b..fcdef16 100644

> >> --- a/include/net/tcp.h

> >> +++ b/include/net/tcp.h

> >> @@ -641,6 +641,11 @@ static inline void tcp_clear_xmit_timers(struct sock *sk)

> >>   static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)

> >>   {

> >>          int cutoff;

> >> +       int window;

> >> +

> >> +       window = tp->max_window;

> >> +       if (tp->snd_wnd && tp->snd_wnd < pktsize)

> >> +               window = tp->snd_wnd;

> >>

> >>          /* When peer uses tiny windows, there is no use in packetizing

> >>           * to sub-MSS pieces for the sake of SWS or making sure there

> >> @@ -649,10 +654,10 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)

> >>           * On the other hand, for extremely large MSS devices, handling

> >>           * smaller than MSS windows in this way does make sense.

> >>           */

> >> -       if (tp->max_window > TCP_MSS_DEFAULT)

> >> -               cutoff = (tp->max_window >> 1);

> >> +       if (window > TCP_MSS_DEFAULT)

> >> +               cutoff = (window >> 1);

> >>          else

> >> -               cutoff = tp->max_window;

> >> +               cutoff = window;

> >>

> >>          if (cutoff && pktsize > cutoff)

> >>                  return max_t(int, cutoff, 68U - tp->tcp_header_len);

> >> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

> >> index bde781f..88dcdf2 100644

> >> --- a/net/ipv4/tcp_output.c

> >> +++ b/net/ipv4/tcp_output.c

> >> @@ -1833,7 +1833,8 @@ unsigned int tcp_current_mss(struct sock *sk)

> >>

> >>          if (dst) {

> >>                  u32 mtu = dst_mtu(dst);

> >> -               if (mtu != inet_csk(sk)->icsk_pmtu_cookie)

> >> +               if (mtu != inet_csk(sk)->icsk_pmtu_cookie ||

> >> +                   (tp->snd_wnd && tp->snd_wnd < mss_now))

> >>                          mss_now = tcp_sync_mss(sk, mtu);

> >

> >

> > I do not think we want to add code in fast path only to cope with

> > pathological user choices.

> >

>

> Thank you very much for your reply!

>

> I very much agree with your point of view. However, the kernel currently

> accepts the unreasonable RCVBUF set by the developer, that is, the set

> RCVBUF is smaller than MSS. Perhaps, we can avoid setting RCVBUF to be

> smaller than MSS in the sock_setsockopt function. What do you think?


I think this is not trivial to make the limit being MSS dependent,
because SO_RCVBUF can be set before connection is attempted.
(So the MSS is not yet known)

I would rather simply add a sysctl to set a floor for TCP SO_RCVBUF,
and default it to a reasonable value (128KB ?)

The sysctl might be needed for some crazy environments (one million
TCP flows on a host with only 8 GB of memory for example...)

>

> Thanks.

>

> > Maybe it is time to accept the fact that setting the receive buffer

> > size to 42KB is dumb nowadays.

> >

> > TCP needs to have at least two buffers in flight to avoid 'strange' stalls.

> >

> > If loopback MTU is 64KB  (and TSO/GRO sizes also can reach 64K), maybe

> > we need to ensure a minimal rcvbuf.

> >

> > If applications set 42KB window size, they get what they asked for :

> > damn slow TCP communications.

> >

> > We do not want to make them happy, while increasing cpu costs for 99%

> > of other uses cases which are not trying to make TCP miserable.

> >

> > I would rather add a sysctl or something to ensure rcvbuf has a

> > minimal sane value, instead of risking subtle TCP regressions.

> >

> > In 2021, we should not limit ourselves to memory constraints that were

> > common 40 years ago when TCP was invented.

> >

> > References :

> > commit 9eb5bf838d06aa6ddebe4aca6b5cedcf2eb53b86 net: sock: fix

> > TCP_SKB_MIN_TRUESIZE

> > commit eea86af6b1e18d6fa8dc959e3ddc0100f27aff9f net: sock: adapt

> > SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF

> >
Shuyi Cheng June 10, 2021, 6 a.m. UTC | #4
On 6/8/21 11:56 PM, Eric Dumazet wrote:
> On Tue, Jun 8, 2021 at 4:26 AM Shuyi Cheng <chengshuyi@linux.alibaba.com> wrote:

>>

>>

>>

>> On 6/7/21 7:34 PM, Eric Dumazet wrote:

>>> On Mon, Jun 7, 2021 at 11:35 AM Shuyi Cheng

>>> <chengshuyi@linux.alibaba.com> wrote:

>>>>

>>>> When the lo network card is used for communication, the tcp server

>>>> reduces the size of the receiving buffer, causing the tcp client

>>>> to have a delay of 200ms. Examples are as follows:

>>>>

>>>> Suppose that the MTU of the lo network card is 65536, and the tcp server

>>>> sets the receive buffer size to 42KB. According to the

>>>> tcp_bound_to_half_wnd function, the MSS of the server and client is

>>>> 21KB. Then, the tcp server sets the buffer size of the connection to

>>>> 16KB. At this time, the MSS of the server is 8KB, and the MSS of the

>>>> client is still 21KB. But it will cause the client to fail to send the

>>>> message, that is, tcp_write_xmit fails. Mainly because tcp_snd_wnd_test

>>>> failed, and then entered the zero window detection phase, resulting in a

>>>> 200ms delay.

>>>>

>>>> Therefore, we mainly modify two places. One is the tcp_current_mss

>>>> function. When the sending window is smaller than the current mss, mss

>>>> needs to be updated. The other is the tcp_bound_to_half_wnd function.

>>>> When the sending window is smaller than the current mss, the mss value

>>>> should be calculated according to the current sending window, not

>>>> max_window.

>>>>

>>>> Signed-off-by: Shuyi Cheng <chengshuyi@linux.alibaba.com>

>>>> ---

>>>>    include/net/tcp.h     | 11 ++++++++---

>>>>    net/ipv4/tcp_output.c |  3 ++-

>>>>    2 files changed, 10 insertions(+), 4 deletions(-)

>>>>

>>>> diff --git a/include/net/tcp.h b/include/net/tcp.h

>>>> index e668f1b..fcdef16 100644

>>>> --- a/include/net/tcp.h

>>>> +++ b/include/net/tcp.h

>>>> @@ -641,6 +641,11 @@ static inline void tcp_clear_xmit_timers(struct sock *sk)

>>>>    static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)

>>>>    {

>>>>           int cutoff;

>>>> +       int window;

>>>> +

>>>> +       window = tp->max_window;

>>>> +       if (tp->snd_wnd && tp->snd_wnd < pktsize)

>>>> +               window = tp->snd_wnd;

>>>>

>>>>           /* When peer uses tiny windows, there is no use in packetizing

>>>>            * to sub-MSS pieces for the sake of SWS or making sure there

>>>> @@ -649,10 +654,10 @@ static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)

>>>>            * On the other hand, for extremely large MSS devices, handling

>>>>            * smaller than MSS windows in this way does make sense.

>>>>            */

>>>> -       if (tp->max_window > TCP_MSS_DEFAULT)

>>>> -               cutoff = (tp->max_window >> 1);

>>>> +       if (window > TCP_MSS_DEFAULT)

>>>> +               cutoff = (window >> 1);

>>>>           else

>>>> -               cutoff = tp->max_window;

>>>> +               cutoff = window;

>>>>

>>>>           if (cutoff && pktsize > cutoff)

>>>>                   return max_t(int, cutoff, 68U - tp->tcp_header_len);

>>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

>>>> index bde781f..88dcdf2 100644

>>>> --- a/net/ipv4/tcp_output.c

>>>> +++ b/net/ipv4/tcp_output.c

>>>> @@ -1833,7 +1833,8 @@ unsigned int tcp_current_mss(struct sock *sk)

>>>>

>>>>           if (dst) {

>>>>                   u32 mtu = dst_mtu(dst);

>>>> -               if (mtu != inet_csk(sk)->icsk_pmtu_cookie)

>>>> +               if (mtu != inet_csk(sk)->icsk_pmtu_cookie ||

>>>> +                   (tp->snd_wnd && tp->snd_wnd < mss_now))

>>>>                           mss_now = tcp_sync_mss(sk, mtu);

>>>

>>>

>>> I do not think we want to add code in fast path only to cope with

>>> pathological user choices.

>>>

>>

>> Thank you very much for your reply!

>>

>> I very much agree with your point of view. However, the kernel currently

>> accepts the unreasonable RCVBUF set by the developer, that is, the set

>> RCVBUF is smaller than MSS. Perhaps, we can avoid setting RCVBUF to be

>> smaller than MSS in the sock_setsockopt function. What do you think?

> 

> I think this is not trivial to make the limit being MSS dependent,

> because SO_RCVBUF can be set before connection is attempted.

> (So the MSS is not yet known)

>


Thank you very much for your reply!

Maybe it's not clear enough that I described it. The scenario where the 
above problem occurs is precisely because the tcp server sets the size 
of RCVBUFF to be smaller after the connection is established. Here is a 
sample code that caused the problem.

# default tcp_rmem is 87380
tcpServerSocket= socket.socket(socket.AF_INET, socket.SOCK_STREAM)
tcpServerSocket.bind(server_addr)
tcpServerSocket.listen()
while True:
     connection,client_addr = tcpServerSocket.accept()
     # Shrink rmem
     connection.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16*1024)

Therefore, when the developer calls the sock_setsockopt function to 
reset RCVBUF, we can use sock to determine the TCP state. When in the 
connected state, it is not allowed to set RCVBUF smaller than mss.

Thanks.


> I would rather simply add a sysctl to set a floor for TCP SO_RCVBUF,

> and default it to a reasonable value (128KB ?)

> 

> The sysctl might be needed for some crazy environments (one million

> TCP flows on a host with only 8 GB of memory for example...)

> 

>>

>> Thanks.

>>

>>> Maybe it is time to accept the fact that setting the receive buffer

>>> size to 42KB is dumb nowadays.

>>>

>>> TCP needs to have at least two buffers in flight to avoid 'strange' stalls.

>>>

>>> If loopback MTU is 64KB  (and TSO/GRO sizes also can reach 64K), maybe

>>> we need to ensure a minimal rcvbuf.

>>>

>>> If applications set 42KB window size, they get what they asked for :

>>> damn slow TCP communications.

>>>

>>> We do not want to make them happy, while increasing cpu costs for 99%

>>> of other uses cases which are not trying to make TCP miserable.

>>>

>>> I would rather add a sysctl or something to ensure rcvbuf has a

>>> minimal sane value, instead of risking subtle TCP regressions.

>>>

>>> In 2021, we should not limit ourselves to memory constraints that were

>>> common 40 years ago when TCP was invented.

>>>

>>> References :

>>> commit 9eb5bf838d06aa6ddebe4aca6b5cedcf2eb53b86 net: sock: fix

>>> TCP_SKB_MIN_TRUESIZE

>>> commit eea86af6b1e18d6fa8dc959e3ddc0100f27aff9f net: sock: adapt

>>> SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF

>>>
Eric Dumazet June 10, 2021, 10:04 a.m. UTC | #5
On Thu, Jun 10, 2021 at 8:00 AM Shuyi Cheng
<chengshuyi@linux.alibaba.com> wrote:

> Thank you very much for your reply!

>

> Maybe it's not clear enough that I described it. The scenario where the

> above problem occurs is precisely because the tcp server sets the size

> of RCVBUFF to be smaller after the connection is established. Here is a

> sample code that caused the problem.

>

> # default tcp_rmem is 87380


Except that this value is overridden at connection establishment.

tcp_rmem[1] is only a floor value, say if you want a reasonable value
even if MSS == 100

> tcpServerSocket= socket.socket(socket.AF_INET, socket.SOCK_STREAM)

> tcpServerSocket.bind(server_addr)

> tcpServerSocket.listen()

> while True:

>      connection,client_addr = tcpServerSocket.accept()

>      # Shrink rmem

>      connection.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 16*1024)

>

> Therefore, when the developer calls the sock_setsockopt function to

> reset RCVBUF, we can use sock to determine the TCP state. When in the

> connected state, it is not allowed to set RCVBUF smaller than mss.

>


Sure, but the application can _also_ set SO_RCVBUF before listen() or connect()

We can not have assumptions about SO_RCVBUF values and socket states.
Otherwise we would have to add some sk_rcvbuf adjustments every time
the socket state is changed.
diff mbox series

Patch

diff --git a/include/net/tcp.h b/include/net/tcp.h
index e668f1b..fcdef16 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -641,6 +641,11 @@  static inline void tcp_clear_xmit_timers(struct sock *sk)
 static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
 {
 	int cutoff;
+	int window;
+
+	window = tp->max_window;
+	if (tp->snd_wnd && tp->snd_wnd < pktsize)
+		window = tp->snd_wnd;
 
 	/* When peer uses tiny windows, there is no use in packetizing
 	 * to sub-MSS pieces for the sake of SWS or making sure there
@@ -649,10 +654,10 @@  static inline int tcp_bound_to_half_wnd(struct tcp_sock *tp, int pktsize)
 	 * On the other hand, for extremely large MSS devices, handling
 	 * smaller than MSS windows in this way does make sense.
 	 */
-	if (tp->max_window > TCP_MSS_DEFAULT)
-		cutoff = (tp->max_window >> 1);
+	if (window > TCP_MSS_DEFAULT)
+		cutoff = (window >> 1);
 	else
-		cutoff = tp->max_window;
+		cutoff = window;
 
 	if (cutoff && pktsize > cutoff)
 		return max_t(int, cutoff, 68U - tp->tcp_header_len);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index bde781f..88dcdf2 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1833,7 +1833,8 @@  unsigned int tcp_current_mss(struct sock *sk)
 
 	if (dst) {
 		u32 mtu = dst_mtu(dst);
-		if (mtu != inet_csk(sk)->icsk_pmtu_cookie)
+		if (mtu != inet_csk(sk)->icsk_pmtu_cookie ||
+		    (tp->snd_wnd && tp->snd_wnd < mss_now))
 			mss_now = tcp_sync_mss(sk, mtu);
 	}