From patchwork Wed Apr 22 16:13:27 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 220745
Date: Wed, 22 Apr 2020 09:13:27 -0700
In-Reply-To: <20200422161329.56026-1-edumazet@google.com>
Message-Id: <20200422161329.56026-2-edumazet@google.com>
References: <20200422161329.56026-1-edumazet@google.com>
X-Mailer: git-send-email 2.26.1.301.g55bc3eb7cb9-goog
Subject: [PATCH net-next 1/3] net: napi: add hard irqs deferral feature
From: Eric Dumazet
To: "David S . Miller"
Cc: netdev, Eric Dumazet, Luigi Rizzo, Eric Dumazet
Sender: netdev-owner@vger.kernel.org
X-Mailing-List: netdev@vger.kernel.org

Back in commit 3b47d30396ba ("net: gro: add a per device gro flush timer")
we added the ability to arm one high resolution timer, that we used
to keep incomplete packets in the GRO engine a bit longer, hoping that
further frames might be added to them.

Since then, we added the napi_complete_done() interface, and commit
364b6055738b ("net: busy-poll: return busypolling status to drivers")
allowed drivers to avoid re-arming NIC interrupts if we made a promise
that their NAPI poll() handler would be called in the near future.

This infrastructure can be leveraged, thanks to a new device parameter,
which allows arming the napi hrtimer instead of re-arming the device
hard IRQ.

We have noticed that on some servers with 32 RX queues or more, the chit-chat
between the NIC and the host caused by IRQ delivery and re-arming could hurt
throughput by ~20% on a 100Gbit NIC.

In contrast, hrtimers are using local (percpu) resources and might have lower
cost.

The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
as gro_flush_timeout (/sys/class/net/ethX/).

By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.

This patch does not change the prior behavior of gro_flush_timeout
if used alone: NIC hard irqs should be rearmed as before.

One concrete usage can be:

echo 20000 >/sys/class/net/eth1/gro_flush_timeout
echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs

If at least one packet is retired, then we will reset the napi counter
to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
of the queue.

On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
avoidance was only possible if napi->poll() was exhausting its budget
and not calling napi_complete_done().

This feature can also be used to work around some non-optimal NIC irq
coalescing strategies.

Having the ability to insert XX usec delays between each napi->poll()
can increase cache efficiency, since we increase batch sizes.

It also keeps the serving CPUs from idling for too long, reducing tail
latencies.

Co-developed-by: Luigi Rizzo
Signed-off-by: Eric Dumazet
---
 include/linux/netdevice.h |  2 ++
 net/core/dev.c            | 29 ++++++++++++++++++-----------
 net/core/net-sysfs.c      | 18 ++++++++++++++++++
 3 files changed, 38 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0750b54b37651890eb297e9f97fc956fb5cc48c7..5a8d40f1ffe2afce7ca36c290786d87eb730b8fc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -329,6 +329,7 @@ struct napi_struct {
 
 	unsigned long		state;
 	int			weight;
+	int			defer_hard_irqs_count;
 	unsigned long		gro_bitmask;
 	int			(*poll)(struct napi_struct *, int);
 #ifdef CONFIG_NETPOLL
@@ -1995,6 +1996,7 @@ struct net_device {
 
 	struct bpf_prog __rcu	*xdp_prog;
 	unsigned long		gro_flush_timeout;
+	int			napi_defer_hard_irqs;
 	rx_handler_func_t __rcu	*rx_handler;
 	void __rcu		*rx_handler_data;
 
diff --git a/net/core/dev.c b/net/core/dev.c
index fb61522b1ce163963f8d0bf54c7c5fa58fce7b9a..67585484ad32b698c6bc4bf17f5d87c345d77502 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6227,7 +6227,8 @@ EXPORT_SYMBOL(__napi_schedule_irqoff);
 
 bool napi_complete_done(struct napi_struct *n, int work_done)
 {
-	unsigned long flags, val, new;
+	unsigned long flags, val, new, timeout = 0;
+	bool ret = true;
 
 	/*
 	 * 1) Don't let napi dequeue from the cpu poll list
@@ -6239,20 +6240,23 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		     NAPIF_STATE_IN_BUSY_POLL)))
 		return false;
 
-	if (n->gro_bitmask) {
-		unsigned long timeout = 0;
-
-		if (work_done)
+	if (work_done) {
+		if (n->gro_bitmask)
 			timeout = n->dev->gro_flush_timeout;
-
+		n->defer_hard_irqs_count = n->dev->napi_defer_hard_irqs;
+	}
+	if (n->defer_hard_irqs_count > 0) {
+		n->defer_hard_irqs_count--;
+		timeout = n->dev->gro_flush_timeout;
+		if (timeout)
+			ret = false;
+	}
+	if (n->gro_bitmask) {
 		/* When the NAPI instance uses a timeout and keeps postponing
 		 * it, we need to bound somehow the time packets are kept in
 		 * the GRO layer
 		 */
 		napi_gro_flush(n, !!timeout);
-		if (timeout)
-			hrtimer_start(&n->timer, ns_to_ktime(timeout),
-				      HRTIMER_MODE_REL_PINNED);
 	}
 
 	gro_normal_list(n);
@@ -6284,7 +6288,10 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 		return false;
 	}
 
-	return true;
+	if (timeout)
+		hrtimer_start(&n->timer, ns_to_ktime(timeout),
+			      HRTIMER_MODE_REL_PINNED);
+	return ret;
 }
 EXPORT_SYMBOL(napi_complete_done);
 
@@ -6464,7 +6471,7 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
 	/* Note : we use a relaxed variant of napi_schedule_prep() not setting
 	 * NAPI_STATE_MISSED, since we do not react to a device IRQ.
 	 */
-	if (napi->gro_bitmask && !napi_disable_pending(napi) &&
+	if (!napi_disable_pending(napi) &&
 	    !test_and_set_bit(NAPI_STATE_SCHED, &napi->state))
 		__napi_schedule_irqoff(napi);
 
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 0d9e46de205e9e5338b56ecd9441338929e07bc1..f3b650cd09231fd99604f6f66bab454eabaa06be 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -382,6 +382,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
 }
 NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
 
+static int change_napi_defer_hard_irqs(struct net_device *dev, unsigned long val)
+{
+	dev->napi_defer_hard_irqs = val;
+	return 0;
+}
+
+static ssize_t napi_defer_hard_irqs_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t len)
+{
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	return netdev_store(dev, attr, buf, len, change_napi_defer_hard_irqs);
+}
+NETDEVICE_SHOW_RW(napi_defer_hard_irqs, fmt_dec);
+
 static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
 			     const char *buf, size_t len)
 {
@@ -545,6 +562,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {
 	&dev_attr_flags.attr,
 	&dev_attr_tx_queue_len.attr,
 	&dev_attr_gro_flush_timeout.attr,
+	&dev_attr_napi_defer_hard_irqs.attr,
 	&dev_attr_phys_port_id.attr,
 	&dev_attr_phys_port_name.attr,
 	&dev_attr_phys_switch_id.attr,

From patchwork Wed Apr 22 16:13:29 2020
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 220744
Date: Wed, 22 Apr 2020 09:13:29 -0700
In-Reply-To: <20200422161329.56026-1-edumazet@google.com>
Message-Id: <20200422161329.56026-4-edumazet@google.com>
References: <20200422161329.56026-1-edumazet@google.com>
X-Mailer: git-send-email 2.26.1.301.g55bc3eb7cb9-goog
Subject: [PATCH net-next 3/3] net/mlx4_en: use napi_complete_done() in TX completion
From: Eric Dumazet
To: "David S . Miller"
Cc: netdev, Eric Dumazet, Luigi Rizzo, Eric Dumazet
Sender: netdev-owner@vger.kernel.org
X-Mailing-List: netdev@vger.kernel.org

In order to benefit from the new napi_defer_hard_irqs feature,
we need to use the napi_complete_done() variant in this driver.

The RX path is already using it; this patch implements the TX
completion side.

mlx4_en_process_tx_cq() now returns the number of retired packets,
instead of a boolean, so that mlx4_en_poll_tx_cq() can pass
this value to napi_complete_done().

Signed-off-by: Eric Dumazet
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c   |  2 +-
 drivers/net/ethernet/mellanox/mlx4/en_tx.c   | 20 ++++++++++----------
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |  4 ++--
 3 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index db3552f2d0877e37ce8dcf215d4c273e91c2326c..7871392198130fa7d1a09baf26a0a00f1bf2e1f5 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -946,7 +946,7 @@ int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget)
 		xdp_tx_cq = priv->tx_cq[TX_XDP][cq->ring];
 		if (xdp_tx_cq->xdp_busy) {
 			clean_complete = mlx4_en_process_tx_cq(dev, xdp_tx_cq,
-							       budget);
+							       budget) < budget;
 			xdp_tx_cq->xdp_busy = !clean_complete;
 		}
 	}
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
index 4d5ca302c067126b8627cb4809485b45c10e2460..a99d3ed49ed684db5d5b90e78e0767f97ee6cc9d 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
@@ -382,8 +382,8 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
 	return cnt;
 }
 
-bool mlx4_en_process_tx_cq(struct net_device *dev,
-			   struct mlx4_en_cq *cq, int napi_budget)
+int mlx4_en_process_tx_cq(struct net_device *dev,
+			  struct mlx4_en_cq *cq, int napi_budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	struct mlx4_cq *mcq = &cq->mcq;
@@ -405,7 +405,7 @@ bool mlx4_en_process_tx_cq(struct net_device *dev,
 	u32 ring_cons;
 
 	if (unlikely(!priv->port_up))
-		return true;
+		return 0;
 
 	netdev_txq_bql_complete_prefetchw(ring->tx_queue);
 
@@ -480,7 +480,7 @@ bool mlx4_en_process_tx_cq(struct net_device *dev,
 	WRITE_ONCE(ring->cons, ring_cons + txbbs_skipped);
 
 	if (cq->type == TX_XDP)
-		return done < budget;
+		return done;
 
 	netdev_tx_completed_queue(ring->tx_queue, packets, bytes);
 
@@ -492,7 +492,7 @@ bool mlx4_en_process_tx_cq(struct net_device *dev,
 		ring->wake_queue++;
 	}
 
-	return done < budget;
+	return done;
 }
 
 void mlx4_en_tx_irq(struct mlx4_cq *mcq)
@@ -512,14 +512,14 @@ int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget)
 	struct mlx4_en_cq *cq = container_of(napi, struct mlx4_en_cq, napi);
 	struct net_device *dev = cq->dev;
 	struct mlx4_en_priv *priv = netdev_priv(dev);
-	bool clean_complete;
+	int work_done;
 
-	clean_complete = mlx4_en_process_tx_cq(dev, cq, budget);
-	if (!clean_complete)
+	work_done = mlx4_en_process_tx_cq(dev, cq, budget);
+	if (work_done >= budget)
 		return budget;
 
-	napi_complete(napi);
-	mlx4_en_arm_cq(priv, cq);
+	if (napi_complete_done(napi, work_done))
+		mlx4_en_arm_cq(priv, cq);
 
 	return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 630f15977f091c1e28eceb7b6bc33414a69d5694..9f5603612960303c5d9f37603d8f7e51ddee9ac6 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -737,8 +737,8 @@ int mlx4_en_process_rx_cq(struct net_device *dev,
 			  int budget);
 int mlx4_en_poll_rx_cq(struct napi_struct *napi, int budget);
 int mlx4_en_poll_tx_cq(struct napi_struct *napi, int budget);
-bool mlx4_en_process_tx_cq(struct net_device *dev,
-			   struct mlx4_en_cq *cq, int napi_budget);
+int mlx4_en_process_tx_cq(struct net_device *dev,
+			  struct mlx4_en_cq *cq, int napi_budget);
 u32 mlx4_en_free_tx_desc(struct mlx4_en_priv *priv,
 			 struct mlx4_en_tx_ring *ring,
 			 int index, u64 timestamp,