diff mbox series

[net] tcp: fix receive window update in tcp_add_backlog()

Message ID 20201005134813.2051883-1-eric.dumazet@gmail.com
State New
Headers show
Series [net] tcp: fix receive window update in tcp_add_backlog() | expand

Commit Message

Eric Dumazet Oct. 5, 2020, 1:48 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

We got reports from GKE customers flows being reset by netfilter
conntrack unless nf_conntrack_tcp_be_liberal is set to 1.

Traces seemed to suggest ACK packet being dropped by the
packet capture, or more likely that ACK were received in the
wrong order.

 wscale=7, SYN and SYNACK not shown here.

 This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
 New right edge of the window -> 51359321+1871*128=51598809

 09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0

 09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
 09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
 09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
 09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
 09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384

 Receiver now allows to send 606*128=77568 from seq 51521241 :
 New right edge of the window -> 51521241+606*128=51598809

 09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0

 09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384

 It seems the sender exceeds RWIN allowance, since 51611353 > 51598809

 09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
 09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040

 09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0

 netfilter conntrack is not happy and sends RST

 09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
 09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0

 Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.
 New right edge of the window -> 51521241+859*128=51631193

Normally TCP stack handles OOO packets just fine, but it
turns out tcp_add_backlog() does not. It can update the window
field of the aggregated packet even if the ACK sequence
of the last received packet is too old.

Many thanks to Alexandre Ferrieux for independently reporting the issue
and suggesting a fix.

Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>
---
 net/ipv4/tcp_ipv4.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

Comments

Soheil Hassas Yeganeh Oct. 5, 2020, 1:50 p.m. UTC | #1
On Mon, Oct 5, 2020 at 9:48 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
>
> We got reports from GKE customers flows being reset by netfilter
> conntrack unless nf_conntrack_tcp_be_liberal is set to 1.
>
> Traces seemed to suggest ACK packet being dropped by the
> packet capture, or more likely that ACK were received in the
> wrong order.
>
>  wscale=7, SYN and SYNACK not shown here.
>
>  This ACK allows the sender to send 1871*128 bytes from seq 51359321 :
>  New right edge of the window -> 51359321+1871*128=51598809
>
>  09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0
>
>  09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408
>  09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0
>  09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768
>  09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0
>  09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
>
>  Receiver now allows to send 606*128=77568 from seq 51521241 :
>  New right edge of the window -> 51521241+606*128=51598809
>
>  09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0
>
>  09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384
>
>  It seems the sender exceeds RWIN allowance, since 51611353 > 51598809
>
>  09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728
>  09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040
>
>  09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0
>
>  netfilter conntrack is not happy and sends RST
>
>  09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0
>  09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0
>
>  Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.
>  New right edge of the window -> 51521241+859*128=51631193
>
> Normally TCP stack handles OOO packets just fine, but it
> turns out tcp_add_backlog() does not. It can update the window
> field of the aggregated packet even if the ACK sequence
> of the last received packet is too old.
>
> Many thanks to Alexandre Ferrieux for independently reporting the issue
> and suggesting a fix.
>
> Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Thanks for the fix!

> ---
>  net/ipv4/tcp_ipv4.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
> index 5084333b5ab647ca8ed296235a1ed6573693b250..592c7396272315c864372c158a7bc8850c6ddc61 100644
> --- a/net/ipv4/tcp_ipv4.c
> +++ b/net/ipv4/tcp_ipv4.c
> @@ -1788,12 +1788,12 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
>
>         __skb_pull(skb, hdrlen);
>         if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
> -               thtail->window = th->window;
> -
>                 TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
>
> -               if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
> +               if (likely(!before(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))) {
>                         TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
> +                       thtail->window = th->window;
> +               }
>
>                 /* We have to update both TCP_SKB_CB(tail)->tcp_flags and
>                  * thtail->fin, so that the fast path in tcp_rcv_established()
> --
> 2.28.0.806.g8561365e88-goog
>
Neal Cardwell Oct. 5, 2020, 2:34 p.m. UTC | #2
On Mon, Oct 5, 2020 at 9:48 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>

> From: Eric Dumazet <edumazet@google.com>

>

> We got reports from GKE customers flows being reset by netfilter

> conntrack unless nf_conntrack_tcp_be_liberal is set to 1.

>

> Traces seemed to suggest ACK packet being dropped by the

> packet capture, or more likely that ACK were received in the

> wrong order.

>

>  wscale=7, SYN and SYNACK not shown here.

>

>  This ACK allows the sender to send 1871*128 bytes from seq 51359321 :

>  New right edge of the window -> 51359321+1871*128=51598809

>

>  09:17:23.389210 IP A > B: Flags [.], ack 51359321, win 1871, options [nop,nop,TS val 10 ecr 999], length 0

>

>  09:17:23.389212 IP B > A: Flags [.], seq 51422681:51424089, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 1408

>  09:17:23.389214 IP A > B: Flags [.], ack 51422681, win 1376, options [nop,nop,TS val 10 ecr 999], length 0

>  09:17:23.389253 IP B > A: Flags [.], seq 51424089:51488857, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 64768

>  09:17:23.389272 IP A > B: Flags [.], ack 51488857, win 859, options [nop,nop,TS val 10 ecr 999], length 0

>  09:17:23.389275 IP B > A: Flags [.], seq 51488857:51521241, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384

>

>  Receiver now allows to send 606*128=77568 from seq 51521241 :

>  New right edge of the window -> 51521241+606*128=51598809

>

>  09:17:23.389296 IP A > B: Flags [.], ack 51521241, win 606, options [nop,nop,TS val 10 ecr 999], length 0

>

>  09:17:23.389308 IP B > A: Flags [.], seq 51521241:51553625, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 32384

>

>  It seems the sender exceeds RWIN allowance, since 51611353 > 51598809

>

>  09:17:23.389346 IP B > A: Flags [.], seq 51553625:51611353, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 57728

>  09:17:23.389356 IP B > A: Flags [.], seq 51611353:51618393, ack 1577, win 268, options [nop,nop,TS val 999 ecr 10], length 7040

>

>  09:17:23.389367 IP A > B: Flags [.], ack 51611353, win 0, options [nop,nop,TS val 10 ecr 999], length 0

>

>  netfilter conntrack is not happy and sends RST

>

>  09:17:23.389389 IP A > B: Flags [R], seq 92176528, win 0, length 0

>  09:17:23.389488 IP B > A: Flags [R], seq 174478967, win 0, length 0

>

>  Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.

>  New right edge of the window -> 51521241+859*128=51631193

>

> Normally TCP stack handles OOO packets just fine, but it

> turns out tcp_add_backlog() does not. It can update the window

> field of the aggregated packet even if the ACK sequence

> of the last received packet is too old.

>

> Many thanks to Alexandre Ferrieux for independently reporting the issue

> and suggesting a fix.

>

> Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue")

> Signed-off-by: Eric Dumazet <edumazet@google.com>

> Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>

> ---

>  net/ipv4/tcp_ipv4.c | 6 +++---

>  1 file changed, 3 insertions(+), 3 deletions(-)


Nice detective work, as always. Thanks for the fix!

Acked-by: Neal Cardwell <ncardwell@google.com>


neal
David Miller Oct. 6, 2020, 1:12 p.m. UTC | #3
From: Eric Dumazet <eric.dumazet@gmail.com>

Date: Mon,  5 Oct 2020 06:48:13 -0700

> From: Eric Dumazet <edumazet@google.com>

> 

> We got reports from GKE customers flows being reset by netfilter

> conntrack unless nf_conntrack_tcp_be_liberal is set to 1.

> 

> Traces seemed to suggest ACK packet being dropped by the

> packet capture, or more likely that ACK were received in the

> wrong order.

> 

>  wscale=7, SYN and SYNACK not shown here.

> 

>  This ACK allows the sender to send 1871*128 bytes from seq 51359321 :

>  New right edge of the window -> 51359321+1871*128=51598809

 ...
>  Now imagine ACK were delivered out of order and tcp_add_backlog() sets window based on wrong packet.

>  New right edge of the window -> 51521241+859*128=51631193

> 

> Normally TCP stack handles OOO packets just fine, but it

> turns out tcp_add_backlog() does not. It can update the window

> field of the aggregated packet even if the ACK sequence

> of the last received packet is too old.

> 

> Many thanks to Alexandre Ferrieux for independently reporting the issue

> and suggesting a fix.

> 

> Fixes: 4f693b55c3d2 ("tcp: implement coalescing on backlog queue")

> Signed-off-by: Eric Dumazet <edumazet@google.com>

> Reported-by: Alexandre Ferrieux <alexandre.ferrieux@orange.com>


Applied and queued up for -stable, thank you.
diff mbox series

Patch

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 5084333b5ab647ca8ed296235a1ed6573693b250..592c7396272315c864372c158a7bc8850c6ddc61 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1788,12 +1788,12 @@  bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 
 	__skb_pull(skb, hdrlen);
 	if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
-		thtail->window = th->window;
-
 		TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
 
-		if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
+		if (likely(!before(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))) {
 			TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
+			thtail->window = th->window;
+		}
 
 		/* We have to update both TCP_SKB_CB(tail)->tcp_flags and
 		 * thtail->fin, so that the fast path in tcp_rcv_established()