[RFCv2,0/3] tcp: Improve mtu probe preconditions

Message ID	cover.1622025457.git.cdleonard@gmail.com
Headers	show Return-Path: <netdev-owner@kernel.org> From: Leonard Crestez <cdleonard@gmail.com> To: Neal Cardwell <ncardwell@google.com>, Matt Mathis <mattmathis@google.com>, Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net>, Willem de Bruijn <willemb@google.com>, Jakub Kicinski <kuba@kernel.org>, Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>, David Ahern <dsahern@kernel.org>, John Heffner <johnwheffner@gmail.com>, Leonard Crestez <lcrestez@drivenets.com>, Soheil Hassas Yeganeh <soheil@google.com>, Roopa Prabhu <roopa@cumulusnetworks.com>, netdev@vger.kernel.org, linux-kernel@vger.kernel.org Subject: [RFCv2 0/3] tcp: Improve mtu probe preconditions Date: Wed, 26 May 2021 13:38:24 +0300 Message-Id: <cover.1622025457.git.cdleonard@gmail.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	tcp: Improve mtu probe preconditions \| expand [RFCv2,0/3] tcp: Improve mtu probe preconditions [RFCv2,3/3] tcp: Wait for sufficient data in tcp_mtu_probe

Message ID

cover.1622025457.git.cdleonard@gmail.com

Headers

From: Leonard Crestez <cdleonard@gmail.com>
To: Neal Cardwell <ncardwell@google.com>,
	Matt Mathis <mattmathis@google.com>, Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>,
	Willem de Bruijn <willemb@google.com>, Jakub Kicinski <kuba@kernel.org>,
	Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>,
	David Ahern <dsahern@kernel.org>, John Heffner <johnwheffner@gmail.com>,
	Leonard Crestez <lcrestez@drivenets.com>,
	Soheil Hassas Yeganeh <soheil@google.com>,
	Roopa Prabhu <roopa@cumulusnetworks.com>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: [RFCv2 0/3] tcp: Improve mtu probe preconditions
Date: Wed, 26 May 2021 13:38:24 +0300
Message-Id: <cover.1622025457.git.cdleonard@gmail.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Precedence: bulk

Series

tcp: Improve mtu probe preconditions | expand

Message

Leonard Crestez May 26, 2021, 10:38 a.m. UTC

According to RFC4821 Section 7.4 "Protocols MAY delay sending non-probes
in order to accumulate enough data" but in practice linux only sends
probes when a lot of data accumulates on the send side.

Another improvement is to rely on TCP RACK performing timely loss detection
with fewer outstanding packets. If this is enabled the size required for a
probe can be shrunk.

Successive successful mtu probes will result in reducing the cwnd since
it's measured in packets and we send bigger packets. The cwnd value can get
stuck below 11 on low-latency links and this prevents further probing. The
cwnd logic in tcp_mtu_probe can be reworked to be based on the the number of
packets that we actually need to send instead of arbitrary constants.

It is difficult to improve this behavior without introducing unreasonable
delays or even stalls. Looking at the current behavior of tcp_mtu_probe it
already waits in some scenarios: when there is not enough room inside cwnd
or when there is a gap of unacklowledged data between snd_una and snd_nxt.
It appears that it is safe to wait if packets_in_flight() != 0.

Signed-off-by: Leonard Crestez <cdleonard@gmail.com>

---

Previous RFC: https://lore.kernel.org/netdev/cover.1620733594.git.cdleonard@gmail.com/

This series seems to be "correct" this time, I would appreciate any feedback.
It possible my understanding of when it is safe to return 0 from tcp_mtu_probe
is incorrect. It's possible that even current code would interact poorly with
delayed acks in some circumstances?

The tcp_xmit_size_goal changes were dropped. It's still possible to see strange
interactions between tcp_push_one and mtu probing: If the receiver window is
small (60k) the sender will do a "push_one" when half a window is accumulated
(30k) and that would prevent mtu probing even if the sender is writing more
than enough data in a single syscall.

Leonard Crestez (3):
  tcp: Use smaller mtu probes if RACK is enabled
  tcp: Adjust congestion window handling for mtu probe
  tcp: Wait for sufficient data in tcp_mtu_probe

 Documentation/networking/ip-sysctl.rst | 10 ++++
 include/net/netns/ipv4.h               |  2 +
 net/ipv4/sysctl_net_ipv4.c             | 14 ++++++
 net/ipv4/tcp_ipv4.c                    |  2 +
 net/ipv4/tcp_output.c                  | 70 +++++++++++++++++++++-----
 5 files changed, 86 insertions(+), 12 deletions(-)


base-commit: e4e92ee78702b13ad55118d8b66f06e1aef62586

Comments

Neal Cardwell May 26, 2021, 12:11 p.m. UTC | #1

On Wed, May 26, 2021 at 6:38 AM Leonard Crestez <cdleonard@gmail.com> wrote:
>
> RACK allows detecting a loss in rtt + min_rtt / 4 based on just one
> extra packet. If enabled use this instead of relying of fast retransmit.

IMHO it would be worth adding some more text to motivate the change,
to justify the added complexity and risk from the change. The
substance of the change seems to be decreasing the requirement for
PMTU probing from needing roughly 5 packets worth of data to needing
roughly 3 packets worth of data. It's not clear to me as a reader of
this patch by itself that there are lots of applications that very
often only have 3-4 packets worth of data to send and yet can benefit
greatly from PMTU discovery.

> Suggested-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Leonard Crestez <cdleonard@gmail.com>
> ---
>  Documentation/networking/ip-sysctl.rst |  5 +++++
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/sysctl_net_ipv4.c             |  7 +++++++
>  net/ipv4/tcp_ipv4.c                    |  1 +
>  net/ipv4/tcp_output.c                  | 26 +++++++++++++++++++++++++-
>  5 files changed, 39 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
> index a5c250044500..7ab52a105a5d 100644
> --- a/Documentation/networking/ip-sysctl.rst
> +++ b/Documentation/networking/ip-sysctl.rst
> @@ -349,10 +349,15 @@ tcp_mtu_probe_floor - INTEGER
>         If MTU probing is enabled this caps the minimum MSS used for search_low
>         for the connection.
>
>         Default : 48
>
> +tcp_mtu_probe_rack - BOOLEAN
> +       Try to use shorter probes if RACK is also enabled
> +
> +       Default: 1

I  would vote to not have a sysctl for this. If we think it's a good
idea to allow MTU probing with a smaller amount of data if RACK is
enabled (which seems true to me), then this is a low-risk enough
change that we should just change the behavior.

>  tcp_min_snd_mss - INTEGER
>         TCP SYN and SYNACK messages usually advertise an ADVMSS option,
>         as described in RFC 1122 and RFC 6691.
>
>         If this ADVMSS option is smaller than tcp_min_snd_mss,
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index 746c80cd4257..b4ff12f25a7f 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -112,10 +112,11 @@ struct netns_ipv4 {
>  #ifdef CONFIG_NET_L3_MASTER_DEV
>         u8 sysctl_tcp_l3mdev_accept;
>  #endif
>         u8 sysctl_tcp_mtu_probing;
>         int sysctl_tcp_mtu_probe_floor;
> +       int sysctl_tcp_mtu_probe_rack;
>         int sysctl_tcp_base_mss;
>         int sysctl_tcp_min_snd_mss;
>         int sysctl_tcp_probe_threshold;
>         u32 sysctl_tcp_probe_interval;
>
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 4fa77f182dcb..275c91fb9cf8 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -847,10 +847,17 @@ static struct ctl_table ipv4_net_table[] = {
>                 .mode           = 0644,
>                 .proc_handler   = proc_dointvec_minmax,
>                 .extra1         = &tcp_min_snd_mss_min,
>                 .extra2         = &tcp_min_snd_mss_max,
>         },
> +       {
> +               .procname       = "tcp_mtu_probe_rack",
> +               .data           = &init_net.ipv4.sysctl_tcp_mtu_probe_rack,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec,
> +       },
>         {
>                 .procname       = "tcp_probe_threshold",
>                 .data           = &init_net.ipv4.sysctl_tcp_probe_threshold,
>                 .maxlen         = sizeof(int),
>                 .mode           = 0644,
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 4f5b68a90be9..ed8af4a7325b 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -2892,10 +2892,11 @@ static int __net_init tcp_sk_init(struct net *net)
>         net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;
>         net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
>         net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
>         net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
>         net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
> +       net->ipv4.sysctl_tcp_mtu_probe_rack = 1;
>
>         net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
>         net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
>         net->ipv4.sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL;
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index bde781f46b41..9691f435477b 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2311,10 +2311,19 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)
>         }
>
>         return true;
>  }
>
> +/* Check if rack is supported for current connection */
> +static int tcp_mtu_probe_is_rack(const struct sock *sk)
> +{
> +       struct net *net = sock_net(sk);
> +
> +       return (net->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION &&
> +                       net->ipv4.sysctl_tcp_mtu_probe_rack);
> +}

You may want to use the existing helper, tcp_is_rack(), by moving it
to include/net/tcp.h

thanks,
neal

Leonard Crestez June 3, 2021, 1:37 p.m. UTC | #2

On 5/26/21 3:11 PM, Neal Cardwell wrote:
> On Wed, May 26, 2021 at 6:38 AM Leonard Crestez <cdleonard@gmail.com> wrote:

>>

>> RACK allows detecting a loss in rtt + min_rtt / 4 based on just one

>> extra packet. If enabled use this instead of relying of fast retransmit.

> 

> IMHO it would be worth adding some more text to motivate the change,

> to justify the added complexity and risk from the change. The

> substance of the change seems to be decreasing the requirement for

> PMTU probing from needing roughly 5 packets worth of data to needing

> roughly 3 packets worth of data. It's not clear to me as a reader of

> this patch by itself that there are lots of applications that very

> often only have 3-4 packets worth of data to send and yet can benefit

> greatly from PMTU discovery.

> 

>> Suggested-by: Neal Cardwell <ncardwell@google.com>

>> Signed-off-by: Leonard Crestez <cdleonard@gmail.com>

>> ---

>>   Documentation/networking/ip-sysctl.rst |  5 +++++

>>   include/net/netns/ipv4.h               |  1 +

>>   net/ipv4/sysctl_net_ipv4.c             |  7 +++++++

>>   net/ipv4/tcp_ipv4.c                    |  1 +

>>   net/ipv4/tcp_output.c                  | 26 +++++++++++++++++++++++++-

>>   5 files changed, 39 insertions(+), 1 deletion(-)

>>

>> diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst

>> index a5c250044500..7ab52a105a5d 100644

>> --- a/Documentation/networking/ip-sysctl.rst

>> +++ b/Documentation/networking/ip-sysctl.rst

>> @@ -349,10 +349,15 @@ tcp_mtu_probe_floor - INTEGER

>>          If MTU probing is enabled this caps the minimum MSS used for search_low

>>          for the connection.

>>

>>          Default : 48

>>

>> +tcp_mtu_probe_rack - BOOLEAN

>> +       Try to use shorter probes if RACK is also enabled

>> +

>> +       Default: 1

> 

> I  would vote to not have a sysctl for this. If we think it's a good

> idea to allow MTU probing with a smaller amount of data if RACK is

> enabled (which seems true to me), then this is a low-risk enough

> change that we should just change the behavior.

> 

>>   tcp_min_snd_mss - INTEGER

>>          TCP SYN and SYNACK messages usually advertise an ADVMSS option,

>>          as described in RFC 1122 and RFC 6691.

>>

>>          If this ADVMSS option is smaller than tcp_min_snd_mss,

>> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h

>> index 746c80cd4257..b4ff12f25a7f 100644

>> --- a/include/net/netns/ipv4.h

>> +++ b/include/net/netns/ipv4.h

>> @@ -112,10 +112,11 @@ struct netns_ipv4 {

>>   #ifdef CONFIG_NET_L3_MASTER_DEV

>>          u8 sysctl_tcp_l3mdev_accept;

>>   #endif

>>          u8 sysctl_tcp_mtu_probing;

>>          int sysctl_tcp_mtu_probe_floor;

>> +       int sysctl_tcp_mtu_probe_rack;

>>          int sysctl_tcp_base_mss;

>>          int sysctl_tcp_min_snd_mss;

>>          int sysctl_tcp_probe_threshold;

>>          u32 sysctl_tcp_probe_interval;

>>

>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c

>> index 4fa77f182dcb..275c91fb9cf8 100644

>> --- a/net/ipv4/sysctl_net_ipv4.c

>> +++ b/net/ipv4/sysctl_net_ipv4.c

>> @@ -847,10 +847,17 @@ static struct ctl_table ipv4_net_table[] = {

>>                  .mode           = 0644,

>>                  .proc_handler   = proc_dointvec_minmax,

>>                  .extra1         = &tcp_min_snd_mss_min,

>>                  .extra2         = &tcp_min_snd_mss_max,

>>          },

>> +       {

>> +               .procname       = "tcp_mtu_probe_rack",

>> +               .data           = &init_net.ipv4.sysctl_tcp_mtu_probe_rack,

>> +               .maxlen         = sizeof(int),

>> +               .mode           = 0644,

>> +               .proc_handler   = proc_dointvec,

>> +       },

>>          {

>>                  .procname       = "tcp_probe_threshold",

>>                  .data           = &init_net.ipv4.sysctl_tcp_probe_threshold,

>>                  .maxlen         = sizeof(int),

>>                  .mode           = 0644,

>> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c

>> index 4f5b68a90be9..ed8af4a7325b 100644

>> --- a/net/ipv4/tcp_ipv4.c

>> +++ b/net/ipv4/tcp_ipv4.c

>> @@ -2892,10 +2892,11 @@ static int __net_init tcp_sk_init(struct net *net)

>>          net->ipv4.sysctl_tcp_base_mss = TCP_BASE_MSS;

>>          net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;

>>          net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;

>>          net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;

>>          net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;

>> +       net->ipv4.sysctl_tcp_mtu_probe_rack = 1;

>>

>>          net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;

>>          net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;

>>          net->ipv4.sysctl_tcp_keepalive_intvl = TCP_KEEPALIVE_INTVL;

>>

>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c

>> index bde781f46b41..9691f435477b 100644

>> --- a/net/ipv4/tcp_output.c

>> +++ b/net/ipv4/tcp_output.c

>> @@ -2311,10 +2311,19 @@ static bool tcp_can_coalesce_send_queue_head(struct sock *sk, int len)

>>          }

>>

>>          return true;

>>   }

>>

>> +/* Check if rack is supported for current connection */

>> +static int tcp_mtu_probe_is_rack(const struct sock *sk)

>> +{

>> +       struct net *net = sock_net(sk);

>> +

>> +       return (net->ipv4.sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION &&

>> +                       net->ipv4.sysctl_tcp_mtu_probe_rack);

>> +}

> 

> You may want to use the existing helper, tcp_is_rack(), by moving it

> to include/net/tcp.h


OK, for this and other comments.

Initially I though that maybe a more elaborate check is required but it 
seems to be only up to the sender to keep individual timeouts.

--
Regards,
Leonard