mbox series

[net-next,v10,0/3] implement kthread based napi poll

Message ID 20210204213117.1736289-1-weiwan@google.com
Headers show
Series implement kthread based napi poll | expand

Message

Wei Wang Feb. 4, 2021, 9:31 p.m. UTC
The idea of moving the napi poll process out of softirq context to a
kernel thread based context is not new.
Paolo Abeni and Hannes Frederic Sowa have proposed patches to move napi
poll to kthread back in 2016. And Felix Fietkau has also proposed
patches of similar ideas to use workqueue to process napi poll just a
few weeks ago.

The main reason we'd like to push forward with this idea is that the
scheduler has poor visibility into cpu cycles spent in softirq context,
and is not able to make optimal scheduling decisions of the user threads.
For example, we see in one of the application benchmark where network
load is high, the CPUs handling network softirqs has ~80% cpu util. And
user threads are still scheduled on those CPUs, despite other more idle
cpus available in the system. And we see very high tail latencies. In this
case, we have to explicitly pin away user threads from the CPUs handling
network softirqs to ensure good performance.
With napi poll moved to kthread, scheduler is in charge of scheduling both
the kthreads handling network load, and the user threads, and is able to
make better decisions. In the previous benchmark, if we do this and we
pin the kthreads processing napi poll to specific CPUs, scheduler is
able to schedule user threads away from these CPUs automatically.

And the reason we prefer 1 kthread per napi, instead of 1 workqueue
entity per host, is that kthread is more configurable than workqueue,
and we could leverage existing tuning tools for threads, like taskset,
chrt, etc to tune scheduling class and cpu set, etc. Another reason is
if we eventually want to provide busy poll feature using kernel threads
for napi poll, kthread seems to be more suitable than workqueue.
Furthermore, for large platforms with 2 NICs attached to 2 sockets,
kthread is more flexible to be pinned to different sets of CPUs.  

In this patch series, I revived Paolo and Hannes's patch in 2016 and
made modifications. Then there are changes proposed by Felix, Jakub,
Paolo and myself on top of those, with suggestions from Eric Dumazet.

In terms of performance, I ran tcp_rr tests with 1000 flows with
various request/response sizes, with RFS/RPS disabled, and compared
performance between softirq vs kthread vs workqueue (patchset proposed
by Felix Fietkau).
Host has 56 hyper threads and 100Gbps nic, 8 rx queues and only 1 numa
node. All threads are unpinned.

        req/resp   QPS   50%tile    90%tile    99%tile    99.9%tile
softirq   1B/1B   2.75M   337us       376us      1.04ms     3.69ms
kthread   1B/1B   2.67M   371us       408us      455us      550us
workq     1B/1B   2.56M   384us       435us      673us      822us

softirq 5KB/5KB   1.46M   678us       750us      969us      2.78ms
kthread 5KB/5KB   1.44M   695us       789us      891us      1.06ms
workq   5KB/5KB   1.34M   720us       905us     1.06ms      1.57ms

softirq 1MB/1MB   11.0K   79ms       166ms      306ms       630ms
kthread 1MB/1MB   11.0K   75ms       177ms      303ms       596ms
workq   1MB/1MB   11.0K   79ms       180ms      303ms       587ms

When running workqueue implementation, I found the number of threads
used is usually twice as much as kthread implementation. This probably
introduces higher scheduling cost, which results in higher tail
latencies in most cases.

I also ran an application benchmark, which performs fixed qps remote SSD
read/write operations, with various sizes. Again, both with RFS/RPS
disabled.
The result is as follows:
         op_size  QPS   50%tile 95%tile 99%tile 99.9%tile  
softirq   4K     572.6K   385us   1.5ms  3.16ms   6.41ms
kthread   4K     572.6K   390us   803us  2.21ms   6.83ms
workq     4k     572.6K   384us   763us  3.12ms   6.87ms

softirq   64K    157.9K   736us   1.17ms 3.40ms   13.75ms
kthread   64K    157.9K   745us   1.23ms 2.76ms    9.87ms 
workq     64K    157.9K   746us   1.23ms 2.76ms    9.96ms

softirq   1M     10.98K   2.03ms  3.10ms  3.7ms   11.56ms
kthread   1M     10.98K   2.13ms  3.21ms  4.02ms  13.3ms
workq     1M     10.98K   2.13ms  3.20ms  3.99ms  14.12ms

In this set of tests, the latency is predominant by the SSD operation.
Also, the user threads are much busier compared to tcp_rr tests. We have
to pin the kthreads/workqueue threads to limit to a few CPUs, to not
disturb user threads, and provide some isolation.

Changes since v9:
Small change in napi_poll() in patch 1.
Split napi_kthread_stop() functionality to add separately in
napi_disable() and netif_napi_del() in patch 2.
Add description for napi_set_threaded() and return dev->threaded when
dev->napi_list is empty for threaded sysfs in patch 3.

Changes since v8:
Added description for threaded param in struct net_device in patch 2.

Changes since v7:
Break napi_set_threaded() into 2 parts, one to create kthread called
from netif_napi_add(), the other to set threaded bit in napi_enable(),
to get rid of inconsistency through all napi in 1 dev.
Added documentation for /sys/class/net/<dev>/threaded.

Changes since v6:
Added memory barrier in napi_set_threaded().
Changed /sys/class/net/<dev>/thread to a ternary value.
Change dev->threaded to a bit instead of bool.

Changes since v5:
Removed ASSERT_RTNL() from napi_set_threaded() and removed rtnl_lock()
operation from napi_enable().

Changes since v4:
Recorded the threaded setting in dev and restore it in napi_enable().

Changes since v3:
Merged and rearranged patches in a logical order for easier review. 
Changed sysfs control to be per device.

Changes since v2:
Corrected typo in patch 1, and updated the cover letter with more
detailed and updated test results.

Changes since v1:
Replaced kthread_create() with kthread_run() in patch 5 as suggested by
Felix Fietkau.

Changes since RFC:
Renamed the kthreads to be napi/<dev>-<napi_id> in patch 5 as suggested
by Hannes Frederic Sowa.

Felix Fietkau (1):
  net: extract napi poll functionality to __napi_poll()

Wei Wang (2):
  net: implement threaded-able napi poll loop support
  net: add sysfs attribute to control napi threaded mode

 Documentation/ABI/testing/sysfs-class-net |  15 ++
 include/linux/netdevice.h                 |  23 +--
 net/core/dev.c                            | 212 ++++++++++++++++++++--
 net/core/net-sysfs.c                      |  45 +++++
 4 files changed, 269 insertions(+), 26 deletions(-)

Comments

Wei Wang Feb. 5, 2021, 12:42 a.m. UTC | #1
On Thu, Feb 4, 2021 at 2:17 PM Alexander Duyck
<alexander.duyck@gmail.com> wrote:
>

> On Thu, Feb 4, 2021 at 1:35 PM Wei Wang <weiwan@google.com> wrote:

> >

> > This patch adds a new sysfs attribute to the network device class.

> > Said attribute provides a per-device control to enable/disable the

> > threaded mode for all the napi instances of the given network device,

> > without the need for a device up/down.

> > User sets it to 1 or 0 to enable or disable threaded mode.

> > Note: when switching between threaded and the current softirq based mode

> > for a napi instance, it will not immediately take effect if the napi is

> > currently being polled. The mode switch will happen for the next time

> > napi_schedule() is called.

> >

> > Co-developed-by: Paolo Abeni <pabeni@redhat.com>

> > Signed-off-by: Paolo Abeni <pabeni@redhat.com>

> > Co-developed-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

> > Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

> > Co-developed-by: Felix Fietkau <nbd@nbd.name>

> > Signed-off-by: Felix Fietkau <nbd@nbd.name>

> > Signed-off-by: Wei Wang <weiwan@google.com>

> > ---

> >  Documentation/ABI/testing/sysfs-class-net | 15 +++++

> >  include/linux/netdevice.h                 |  2 +

> >  net/core/dev.c                            | 67 ++++++++++++++++++++++-

> >  net/core/net-sysfs.c                      | 45 +++++++++++++++

> >  4 files changed, 127 insertions(+), 2 deletions(-)

> >

> > diff --git a/Documentation/ABI/testing/sysfs-class-net b/Documentation/ABI/testing/sysfs-class-net

> > index 1f2002df5ba2..1419103d11f9 100644

> > --- a/Documentation/ABI/testing/sysfs-class-net

> > +++ b/Documentation/ABI/testing/sysfs-class-net

> > @@ -337,3 +337,18 @@ Contact:   netdev@vger.kernel.org

> >  Description:

> >                 32-bit unsigned integer counting the number of times the link has

> >                 been down

> > +

> > +What:          /sys/class/net/<iface>/threaded

> > +Date:          Jan 2021

> > +KernelVersion: 5.12

> > +Contact:       netdev@vger.kernel.org

> > +Description:

> > +               Boolean value to control the threaded mode per device. User could

> > +               set this value to enable/disable threaded mode for all napi

> > +               belonging to this device, without the need to do device up/down.

> > +

> > +               Possible values:

> > +               == ==================================

> > +               0  threaded mode disabled for this dev

> > +               1  threaded mode enabled for this dev

> > +               == ==================================

> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h

> > index 99fb4ec9573e..1340327f7abf 100644

> > --- a/include/linux/netdevice.h

> > +++ b/include/linux/netdevice.h

> > @@ -497,6 +497,8 @@ static inline bool napi_complete(struct napi_struct *n)

> >         return napi_complete_done(n, 0);

> >  }

> >

> > +int dev_set_threaded(struct net_device *dev, bool threaded);

> > +

> >  /**

> >   *     napi_disable - prevent NAPI from scheduling

> >   *     @n: NAPI context

> > diff --git a/net/core/dev.c b/net/core/dev.c

> > index a8c5eca17074..9cc9b245419e 100644

> > --- a/net/core/dev.c

> > +++ b/net/core/dev.c

> > @@ -4290,8 +4290,9 @@ static inline void ____napi_schedule(struct softnet_data *sd,

> >

> >         if (test_bit(NAPI_STATE_THREADED, &napi->state)) {

> >                 /* Paired with smp_mb__before_atomic() in

> > -                * napi_enable(). Use READ_ONCE() to guarantee

> > -                * a complete read on napi->thread. Only call

> > +                * napi_enable()/napi_set_threaded().

> > +                * Use READ_ONCE() to guarantee a complete

> > +                * read on napi->thread. Only call

> >                  * wake_up_process() when it's not NULL.

> >                  */

> >                 thread = READ_ONCE(napi->thread);

> > @@ -6743,6 +6744,68 @@ static void init_gro_hash(struct napi_struct *napi)

> >         napi->gro_bitmask = 0;

> >  }

> >

> > +/* Setting/unsetting threaded mode on a napi might not immediately

> > + * take effect, if the current napi instance is actively being

> > + * polled. In this case, the switch between threaded mode and

> > + * softirq mode will happen in the next round of napi_schedule().

> > + * This should not cause hiccups/stalls to the live traffic.

> > + */

> > +static int napi_set_threaded(struct napi_struct *n, bool threaded)

> > +{

> > +       int err = 0;

> > +

> > +       if (threaded == !!test_bit(NAPI_STATE_THREADED, &n->state))

> > +               return 0;

> > +

> > +       if (!threaded) {

> > +               clear_bit(NAPI_STATE_THREADED, &n->state);

> > +               return 0;

> > +       }

>

>

> > +

> > +       if (!n->thread) {

> > +               err = napi_kthread_create(n);

> > +               if (err)

> > +                       return err;

> > +       }

>

> This piece needs to be broken out similar to what we did for the

> napi_add and napi enable. In the case where we are enabling the

> threaded NAPI you should first go through and allocate all the

> threads. Then once all the threads are allocated you then enable them

> by setting the NAPI_STATE_THREADED bit.

>

> I would pull this section out and place it in a loop in

> dev_set_threaded to handle creating the threads before you set

> dev->threaded and then set the threaded flags in the napi instances.

>

> > +

> > +       /* Make sure kthread is created before THREADED bit

> > +        * is set.

> > +        */

> > +       smp_mb__before_atomic();

>

> If we split off the allocation like I mentioned earlier we can

> probably pull out the barrier and place it after the allocation of the

> kthreads.

>

> > +       set_bit(NAPI_STATE_THREADED, &n->state);

> > +

> > +       return 0;

> > +}

> > +

>

> With the change I suggested above you could also drop the return type

> from this because it should always succeed.

>

> > +static void dev_disable_threaded_all(struct net_device *dev)

> > +{

> > +       struct napi_struct *napi;

> > +

> > +       list_for_each_entry(napi, &dev->napi_list, dev_list)

> > +               napi_set_threaded(napi, false);

> > +       dev->threaded = 0;

> > +}

> > +

>

> You might consider renaming this to dev_set_threaded_all and passing a

> boolean "threaded" argument. Then you could reuse this code for both

> enabling and disabling of the threaded setup. Another thing I might

> change would be how we go about toggling the value. Maybe something

> more like:

>

> dev->threaded = 0;

> list_for_each_entry(napi, &dev->napi_list, dev_list)

>     napi_set_threaded(napi, threaded);

> dev->threaded = threaded;

>

> The advantage of doing it this way is that we will be

> enabling/disabling all instances at the same time since we are

> overwriting the dev->threaded first and then taking care of the

> individual flags. If you are enabling then after we have set the bits

> we set dev->threaded which will switch them all on, and when we clear

> dev->threaded it will switch them all off so they should stop checking

> the threaded flags in the napi structs.

>

> > +int dev_set_threaded(struct net_device *dev, bool threaded)

> > +{

> > +       struct napi_struct *napi;

> > +       int ret;

> > +

>

> There are probably a couple changes that need to be made. First I

> would initialize ret to 0 here. More on that below.

>

> Second we need to add a check for if we are actually changing

> anything. If not then we shouldn't be bothering. So something like the

> following at the start should take care of that:

>

> if (!dev->threaded == !threaded)

>     return 0;

>

> I would add a list_for_each_entry loop that does the thread allocation

> I called out above and will break if an error is encountered. That way

> you aren't toggling dev->threaded unless you know all of the instances

> can succeed. Something like:

>

> if (!dev->threaded) {

>     list_for_each_entry(napi, &dev->napi_list, dev_list) {

>         ret =  napi_kthread_create(napi);

>         if (ret)

>             break;

>     }

> }

>

> /* Make sure kthread is created before THREADED bit is set. */

> smp_mb__before_atomic();

>

> > +       dev->threaded = threaded;

> > +       list_for_each_entry(napi, &dev->napi_list, dev_list) {

> > +               ret = napi_set_threaded(napi, threaded);

> > +               if (ret) {

> > +                       /* Error occurred on one of the napi,

> > +                        * reset threaded mode on all napi.

> > +                        */

> > +                       dev_disable_threaded_all(dev);

> > +                       break;

> > +               }

> > +       }

>

> With my suggestion above you could look at just making the

> dev_disable_threaded_all accept a boolean argument for enable/disable

> and then you could just do something like:

>

> dev_set_threaded_all(struct net_device *dev, !ret && threaded);

>


Ack for the above suggestions. I think I used a combined function
napi_set_threaded() to create kthread + set threaded bit here because
I have the knowledge of all the napi instances belonging to dev here.
And if anyone fails, I would just disable them all afterwards. But I
think that might introduce an inconsistent stage where some napi are
in threaded mode, and some are in softirq mode.
I agree it is cleaner to first create kthreads for all napi, then set
the threaded bit for all.

> > +

> > +       return ret;

> > +}

> > +

> >  void netif_napi_add(struct net_device *dev, struct napi_struct *napi,

> >                     int (*poll)(struct napi_struct *, int), int weight)

> >  {

> > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c

> > index daf502c13d6d..969743567257 100644

> > --- a/net/core/net-sysfs.c

> > +++ b/net/core/net-sysfs.c

> > @@ -538,6 +538,50 @@ static ssize_t phys_switch_id_show(struct device *dev,

> >  }

> >  static DEVICE_ATTR_RO(phys_switch_id);

> >

> > +static ssize_t threaded_show(struct device *dev,

> > +                            struct device_attribute *attr, char *buf)

> > +{

> > +       struct net_device *netdev = to_net_dev(dev);

> > +       int ret;

> > +

> > +       if (!rtnl_trylock())

> > +               return restart_syscall();

> > +

> > +       if (!dev_isalive(netdev)) {

> > +               ret = -EINVAL;

> > +               goto unlock;

> > +       }

> > +

> > +       ret = sprintf(buf, fmt_dec, netdev->threaded);

> > +

>

> It seems like the more standard pattern is to initialize ret to

> -EVINAL and just call the sprintf if dev_isalive is true. That would

> allow you to drop the unlock jump label you had to add below.

>

Ack.

> > +unlock:

> > +       rtnl_unlock();

> > +       return ret;

> > +}

> > +

> > +static int modify_napi_threaded(struct net_device *dev, unsigned long val)

> > +{

> > +       int ret;

> > +

> > +       if (list_empty(&dev->napi_list))

> > +               return -EOPNOTSUPP;

> > +

> > +       if (val != 0 && val != 1)

> > +               return -EOPNOTSUPP;

> > +

> > +       ret = dev_set_threaded(dev, val);

> > +

> > +       return ret;

> > +}

> > +

> > +static ssize_t threaded_store(struct device *dev,

> > +                             struct device_attribute *attr,

> > +                             const char *buf, size_t len)

> > +{

> > +       return netdev_store(dev, attr, buf, len, modify_napi_threaded);

> > +}

> > +static DEVICE_ATTR_RW(threaded);

> > +

> >  static struct attribute *net_class_attrs[] __ro_after_init = {

> >         &dev_attr_netdev_group.attr,

> >         &dev_attr_type.attr,

> > @@ -570,6 +614,7 @@ static struct attribute *net_class_attrs[] __ro_after_init = {

> >         &dev_attr_proto_down.attr,

> >         &dev_attr_carrier_up_count.attr,

> >         &dev_attr_carrier_down_count.attr,

> > +       &dev_attr_threaded.attr,

> >         NULL,

> >  };

> >  ATTRIBUTE_GROUPS(net_class);

> > --

> > 2.30.0.365.g02bc693789-goog

> >