From patchwork Mon Jun 7 15:45:34 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: "dust.li" X-Patchwork-Id: 456483 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.8 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, UNPARSEABLE_RELAY, USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0A844C47094 for ; Mon, 7 Jun 2021 15:45:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id E34816102A for ; Mon, 7 Jun 2021 15:45:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231364AbhFGPr3 (ORCPT ); Mon, 7 Jun 2021 11:47:29 -0400 Received: from out30-131.freemail.mail.aliyun.com ([115.124.30.131]:60590 "EHLO out30-131.freemail.mail.aliyun.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231246AbhFGPr2 (ORCPT ); Mon, 7 Jun 2021 11:47:28 -0400 X-Alimail-AntiSpam: AC=PASS; BC=-1|-1; BR=01201311R201e4; CH=green; DM=||false|; DS=||; FP=0|-1|-1|-1|0|-1|-1|-1; HT=alimailimapcm10staff010182156082; MF=dust.li@linux.alibaba.com; NM=1; PH=DS; RN=5; SR=0; TI=SMTPD_---0UbeGQ3Q_1623080735; Received: from localhost(mailfrom:dust.li@linux.alibaba.com fp:SMTPD_---0UbeGQ3Q_1623080735) by smtp.aliyun-inc.com(127.0.0.1); Mon, 07 Jun 2021 23:45:35 +0800 From: Dust Li To: Eric Dumazet , davem@davemloft.net, netdev@vger.kernel.org Cc: Tony Lu , Xuan Zhuo Subject: [PATCH net-next] tcp: avoid spurious loopback retransmit Date: Mon, 7 Jun 2021 23:45:34 +0800 Message-Id: <20210607154534.57034-1-dust.li@linux.alibaba.com> X-Mailer: git-send-email 2.19.1.3.ge56e4f7 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org We found there are pretty much loopback TCP retransmitions on our online servers. Most of them are TLP retransmition. This is because for loopback communication, TLP is usally triggered about 2ms after the last packet was sent if no ACK was received within that period. This condition can be met if there are some kernel tasks running on that CPU for more than 2ms, which delays the softirq to process the sd->backlog. We sampled the loopback TLP retransmit on our online servers, and found an average 2K+ retransmit per hour. But in some cases, this can be much bigger, I found a peak 40 retrans/s on the same server. Actually, those loopback retransmitions are not a big problem as long as they don't happen too frequently. It's just spurious and meanless and waste some CPU cycles. I also write a test module which just busy-loop in the kernel for more then 2ms, and the lo retransmition can be triggered every time we run the busy-loop on the target CPU. With this patch, the retransmition is gone and the throughput is not affected. Signed-off-by: Dust Li Reported-by: kernel test robot --- include/linux/skbuff.h | 7 +++++-- net/ipv4/tcp_output.c | 31 +++++++++++++++++++++++++++---- net/xfrm/xfrm_policy.c | 2 +- 3 files changed, 33 insertions(+), 7 deletions(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index dbf820a50a39..290e0a6a3a47 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -1131,9 +1131,12 @@ struct sk_buff_fclones { * Returns true if skb is a fast clone, and its clone is not freed. * Some drivers call skb_orphan() in their ndo_start_xmit(), * so we also check that this didnt happen. + * For loopback, the skb maybe in the target sock's receive_queue + * we need to ignore that case. */ static inline bool skb_fclone_busy(const struct sock *sk, - const struct sk_buff *skb) + const struct sk_buff *skb, + bool is_loopback) { const struct sk_buff_fclones *fclones; @@ -1141,7 +1144,7 @@ static inline bool skb_fclone_busy(const struct sock *sk, return skb->fclone == SKB_FCLONE_ORIG && refcount_read(&fclones->fclone_ref) > 1 && - READ_ONCE(fclones->skb2.sk) == sk; + is_loopback ? true : READ_ONCE(fclones->skb2.sk) == sk; } /** diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index bde781f46b41..f51a6a565678 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2771,6 +2771,20 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto) return true; } +static int sock_is_loopback(const struct sock *sk) +{ + struct dst_entry *dst; + int loopback = 0; + + rcu_read_lock(); + dst = rcu_dereference(sk->sk_dst_cache); + if (dst && dst->dev && + (dst->dev->features & NETIF_F_LOOPBACK)) + loopback = 1; + rcu_read_unlock(); + return loopback; +} + /* Thanks to skb fast clones, we can detect if a prior transmit of * a packet is still in a qdisc or driver queue. * In this case, there is very little point doing a retransmit ! @@ -2778,15 +2792,24 @@ bool tcp_schedule_loss_probe(struct sock *sk, bool advancing_rto) static bool skb_still_in_host_queue(struct sock *sk, const struct sk_buff *skb) { - if (unlikely(skb_fclone_busy(sk, skb))) { - set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags); - smp_mb__after_atomic(); - if (skb_fclone_busy(sk, skb)) { + bool is_loopback = sock_is_loopback(sk); + + if (unlikely(skb_fclone_busy(sk, skb, is_loopback))) { + if (!is_loopback) { + set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags); + smp_mb__after_atomic(); + if (skb_fclone_busy(sk, skb, is_loopback)) { + NET_INC_STATS(sock_net(sk), + LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES); + return true; + } + } else { NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPSPURIOUS_RTX_HOSTQUEUES); return true; } } + return false; } diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c index ce500f847b99..f8ea62a840e9 100644 --- a/net/xfrm/xfrm_policy.c +++ b/net/xfrm/xfrm_policy.c @@ -2846,7 +2846,7 @@ static int xdst_queue_output(struct net *net, struct sock *sk, struct sk_buff *s struct xfrm_policy *pol = xdst->pols[0]; struct xfrm_policy_queue *pq = &pol->polq; - if (unlikely(skb_fclone_busy(sk, skb))) { + if (unlikely(skb_fclone_busy(sk, skb, false))) { kfree_skb(skb); return 0; }