mbox series

[net-next,0/5] implement kthread based napi poll

Message ID 20200930192140.4192859-1-weiwan@google.com
Headers show
Series implement kthread based napi poll | expand

Message

Wei Wang Sept. 30, 2020, 7:21 p.m. UTC
The idea of moving the napi poll process out of softirq context to a
kernel thread based context is not new.
Paolo Abeni and Hannes Frederic Sowa have proposed patches to move napi
poll to kthread back in 2016. And Felix Fietkau has also proposed
patches of similar ideas to use workqueue to process napi poll just a
few weeks ago.

The main reason we'd like to push forward with this idea is that the
scheduler has poor visibility into cpu cycles spent in softirq context,
and is not able to make optimal scheduling decisions of the user threads.
For example, we see in one of the application benchmark where network
load is high, the CPUs handling network softirqs has ~80% cpu util. And
user threads are still scheduled on those CPUs, despite other more idle
cpus available in the system. And we see very high tail latencies. In this
case, we have to explicitly pin away user threads from the CPUs handling
network softirqs to ensure good performance.
With napi poll moved to kthread, scheduler is in charge of scheduling both
the kthreads handling network load, and the user threads, and is able to
make better decisions. In the previous benchmark, if we do this and we
pin the kthreads processing napi poll to specific CPUs, scheduler is
able to schedule user threads away from these CPUs automatically.

And the reason we prefer 1 kthread per napi, instead of 1 workqueue
entity per host, is that kthread is more configurable than workqueue,
and we could leverage existing tuning tools for threads, like taskset,
chrt, etc to tune scheduling class and cpu set, etc. Another reason is
if we eventually want to provide busy poll feature using kernel threads
for napi poll, kthread seems to be more suitable than workqueue. 

In this patch series, I revived Paolo and Hannes's patch in 2016 and
left them as the first 2 patches. Then there are changes proposed by
Felix, Jakub, Paolo and myself on top of those, with suggestions from
Eric Dumazet.

In terms of performance, I ran tcp_rr tests with 1000 flows with
various request/response sizes, with RFS/RPS disabled, and compared
performance between softirq vs kthread. Host has 56 hyper threads and
100Gbps nic.

        req/resp   QPS   50%tile    90%tile    99%tile    99.9%tile
softirq   1B/1B   2.19M   284us       987us      1.1ms      1.56ms
kthread   1B/1B   2.14M   295us       987us      1.0ms      1.17ms

softirq 5KB/5KB   1.31M   869us      1.06ms     1.28ms      2.38ms
kthread 5KB/5KB   1.32M   878us      1.06ms     1.26ms      1.66ms

softirq 1MB/1MB  10.78K   84ms       166ms      234ms       294ms
kthread 1MB/1MB  10.83K   82ms       173ms      262ms       320ms

I also ran one application benchmark where the user threads have more
work to do. We do see good amount of tail latency reductions with the
kthread model. 

Paolo Abeni (2):
  net: implement threaded-able napi poll loop support
  net: add sysfs attribute to control napi threaded mode
Felix Fietkau (1):
  net: extract napi poll functionality to __napi_poll()
Jakub Kicinski (1):
  net: modify kthread handler to use __napi_poll()
Wei Wang (1):
  net: improve napi threaded config

 include/linux/netdevice.h |   5 ++
 net/core/dev.c            | 139 +++++++++++++++++++++++++++++++++++---
 net/core/net-sysfs.c      | 100 +++++++++++++++++++++++++++
 3 files changed, 235 insertions(+), 9 deletions(-)

Comments

Jakub Kicinski Sept. 30, 2020, 8:08 p.m. UTC | #1
On Wed, 30 Sep 2020 12:21:35 -0700 Wei Wang wrote:
> With napi poll moved to kthread, scheduler is in charge of scheduling both
> the kthreads handling network load, and the user threads, and is able to
> make better decisions. In the previous benchmark, if we do this and we
> pin the kthreads processing napi poll to specific CPUs, scheduler is
> able to schedule user threads away from these CPUs automatically.
> 
> And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> entity per host, is that kthread is more configurable than workqueue,
> and we could leverage existing tuning tools for threads, like taskset,
> chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> if we eventually want to provide busy poll feature using kernel threads
> for napi poll, kthread seems to be more suitable than workqueue. 

As I said in my reply to the RFC I see better performance with the
workqueue implementation, so I would hold off until we have more
conclusive results there, as this set adds fairly strong uAPI that 
we'll have to support for ever.
Eric Dumazet Oct. 1, 2020, 7:52 a.m. UTC | #2
On Wed, Sep 30, 2020 at 10:08 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 30 Sep 2020 12:21:35 -0700 Wei Wang wrote:
> > With napi poll moved to kthread, scheduler is in charge of scheduling both
> > the kthreads handling network load, and the user threads, and is able to
> > make better decisions. In the previous benchmark, if we do this and we
> > pin the kthreads processing napi poll to specific CPUs, scheduler is
> > able to schedule user threads away from these CPUs automatically.
> >
> > And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> > entity per host, is that kthread is more configurable than workqueue,
> > and we could leverage existing tuning tools for threads, like taskset,
> > chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> > if we eventually want to provide busy poll feature using kernel threads
> > for napi poll, kthread seems to be more suitable than workqueue.
>
> As I said in my reply to the RFC I see better performance with the
> workqueue implementation, so I would hold off until we have more
> conclusive results there, as this set adds fairly strong uAPI that
> we'll have to support for ever.


We can make incremental changes, the kthread implementation looks much
nicer to us.

The unique work queue is a problem on server class platforms, with
NUMA placement.
We now have servers with NIC on different NUMA nodes.

We can not introduce a new model that will make all workload better
without any tuning.
If you really think you can do that, think again.

Even the old ' fix'  (commit 4cd13c21b207e80ddb1144c576500098f2d5f882
"softirq: Let ksoftirqd do its job" )
had severe issues for latency sensitive jobs.

We need to be able to opt-in to threads, and let process scheduler
take decisions.
If we believe the process scheduler takes bad decision, it should be
reported to scheduler experts.

I fully support this implementation, I do not want to wait for yet
another 'work queue' model or scheduler classes.
Felix Fietkau Oct. 1, 2020, 10:01 a.m. UTC | #3
On 2020-09-30 21:21, Wei Wang wrote:
> This commit mainly addresses the threaded config to make the switch
> between softirq based and kthread based NAPI processing not require
> a device down/up.
> It also moves the kthread_create() call to the sysfs handler when user
> tries to enable "threaded" on napi, and properly handles the
> kthread_create() failure. This is because certain drivers do not have
> the napi created and linked to the dev when dev_open() is called. So
> the previous implementation does not work properly there.
> 
> Signed-off-by: Wei Wang <weiwan@google.com>
> ---
> Changes since RFC:
> changed the thread name to napi/<dev>-<napi-id>
> 
>  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
>  net/core/net-sysfs.c |  9 +++-----
>  2 files changed, 31 insertions(+), 27 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b4f33e442b5e..bf878d3a9d89 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>  
>  static int napi_threaded_poll(void *data);
>  
> -static void napi_thread_start(struct napi_struct *n)
> +static int napi_kthread_create(struct napi_struct *n)
>  {
> -	if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> -		n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> -					   n->dev->name, n->napi_id);
> +	int err = 0;
> +
> +	n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> +				   n->dev->name, n->napi_id);
> +	if (IS_ERR(n->thread)) {
> +		err = PTR_ERR(n->thread);
> +		pr_err("kthread_create failed with err %d\n", err);
> +		n->thread = NULL;
> +	}
> +
> +	return err;
If I remember correctly, using kthread_create with no explicit first
wakeup means the task will sit there and contribute to system loadavg
until it is woken up the first time.
Shouldn't we use kthread_run here instead?

- Felix
Wei Wang Oct. 1, 2020, 5:01 p.m. UTC | #4
On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
>
>
> On 2020-09-30 21:21, Wei Wang wrote:
> > This commit mainly addresses the threaded config to make the switch
> > between softirq based and kthread based NAPI processing not require
> > a device down/up.
> > It also moves the kthread_create() call to the sysfs handler when user
> > tries to enable "threaded" on napi, and properly handles the
> > kthread_create() failure. This is because certain drivers do not have
> > the napi created and linked to the dev when dev_open() is called. So
> > the previous implementation does not work properly there.
> >
> > Signed-off-by: Wei Wang <weiwan@google.com>
> > ---
> > Changes since RFC:
> > changed the thread name to napi/<dev>-<napi-id>
> >
> >  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
> >  net/core/net-sysfs.c |  9 +++-----
> >  2 files changed, 31 insertions(+), 27 deletions(-)
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index b4f33e442b5e..bf878d3a9d89 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
> >
> >  static int napi_threaded_poll(void *data);
> >
> > -static void napi_thread_start(struct napi_struct *n)
> > +static int napi_kthread_create(struct napi_struct *n)
> >  {
> > -     if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> > -             n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> > -                                        n->dev->name, n->napi_id);
> > +     int err = 0;
> > +
> > +     n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> > +                                n->dev->name, n->napi_id);
> > +     if (IS_ERR(n->thread)) {
> > +             err = PTR_ERR(n->thread);
> > +             pr_err("kthread_create failed with err %d\n", err);
> > +             n->thread = NULL;
> > +     }
> > +
> > +     return err;
> If I remember correctly, using kthread_create with no explicit first
> wakeup means the task will sit there and contribute to system loadavg
> until it is woken up the first time.
> Shouldn't we use kthread_run here instead?
>

Right. kthread_create() basically creates the thread and leaves it in
sleep mode. I think that is what we want. We rely on the next
___napi_schedule() call to wake up this thread when there is work to
do.

> - Felix
Felix Fietkau Oct. 1, 2020, 5:11 p.m. UTC | #5
On 2020-10-01 19:01, Wei Wang wrote:
> On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
>>
>>
>> On 2020-09-30 21:21, Wei Wang wrote:
>> > This commit mainly addresses the threaded config to make the switch
>> > between softirq based and kthread based NAPI processing not require
>> > a device down/up.
>> > It also moves the kthread_create() call to the sysfs handler when user
>> > tries to enable "threaded" on napi, and properly handles the
>> > kthread_create() failure. This is because certain drivers do not have
>> > the napi created and linked to the dev when dev_open() is called. So
>> > the previous implementation does not work properly there.
>> >
>> > Signed-off-by: Wei Wang <weiwan@google.com>
>> > ---
>> > Changes since RFC:
>> > changed the thread name to napi/<dev>-<napi-id>
>> >
>> >  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
>> >  net/core/net-sysfs.c |  9 +++-----
>> >  2 files changed, 31 insertions(+), 27 deletions(-)
>> >
>> > diff --git a/net/core/dev.c b/net/core/dev.c
>> > index b4f33e442b5e..bf878d3a9d89 100644
>> > --- a/net/core/dev.c
>> > +++ b/net/core/dev.c
>> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>> >
>> >  static int napi_threaded_poll(void *data);
>> >
>> > -static void napi_thread_start(struct napi_struct *n)
>> > +static int napi_kthread_create(struct napi_struct *n)
>> >  {
>> > -     if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
>> > -             n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
>> > -                                        n->dev->name, n->napi_id);
>> > +     int err = 0;
>> > +
>> > +     n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
>> > +                                n->dev->name, n->napi_id);
>> > +     if (IS_ERR(n->thread)) {
>> > +             err = PTR_ERR(n->thread);
>> > +             pr_err("kthread_create failed with err %d\n", err);
>> > +             n->thread = NULL;
>> > +     }
>> > +
>> > +     return err;
>> If I remember correctly, using kthread_create with no explicit first
>> wakeup means the task will sit there and contribute to system loadavg
>> until it is woken up the first time.
>> Shouldn't we use kthread_run here instead?
>>
> 
> Right. kthread_create() basically creates the thread and leaves it in
> sleep mode. I think that is what we want. We rely on the next
> ___napi_schedule() call to wake up this thread when there is work to
> do.
But what if you have a device that's basically idle and napi isn't
scheduled until much later? It will get a confusing loadavg until then.
I'd prefer waking up the thread immediately and filtering going back to
sleep once in the thread function before running the loop if
NAPI_STATE_SCHED wasn't set.

- Felix
Eric Dumazet Oct. 1, 2020, 6:03 p.m. UTC | #6
On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau <nbd@nbd.name> wrote:
>
> On 2020-10-01 19:01, Wei Wang wrote:
> > On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
> >>
> >>
> >> On 2020-09-30 21:21, Wei Wang wrote:
> >> > This commit mainly addresses the threaded config to make the switch
> >> > between softirq based and kthread based NAPI processing not require
> >> > a device down/up.
> >> > It also moves the kthread_create() call to the sysfs handler when user
> >> > tries to enable "threaded" on napi, and properly handles the
> >> > kthread_create() failure. This is because certain drivers do not have
> >> > the napi created and linked to the dev when dev_open() is called. So
> >> > the previous implementation does not work properly there.
> >> >
> >> > Signed-off-by: Wei Wang <weiwan@google.com>
> >> > ---
> >> > Changes since RFC:
> >> > changed the thread name to napi/<dev>-<napi-id>
> >> >
> >> >  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
> >> >  net/core/net-sysfs.c |  9 +++-----
> >> >  2 files changed, 31 insertions(+), 27 deletions(-)
> >> >
> >> > diff --git a/net/core/dev.c b/net/core/dev.c
> >> > index b4f33e442b5e..bf878d3a9d89 100644
> >> > --- a/net/core/dev.c
> >> > +++ b/net/core/dev.c
> >> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
> >> >
> >> >  static int napi_threaded_poll(void *data);
> >> >
> >> > -static void napi_thread_start(struct napi_struct *n)
> >> > +static int napi_kthread_create(struct napi_struct *n)
> >> >  {
> >> > -     if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> >> > -             n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> >> > -                                        n->dev->name, n->napi_id);
> >> > +     int err = 0;
> >> > +
> >> > +     n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> >> > +                                n->dev->name, n->napi_id);
> >> > +     if (IS_ERR(n->thread)) {
> >> > +             err = PTR_ERR(n->thread);
> >> > +             pr_err("kthread_create failed with err %d\n", err);
> >> > +             n->thread = NULL;
> >> > +     }
> >> > +
> >> > +     return err;
> >> If I remember correctly, using kthread_create with no explicit first
> >> wakeup means the task will sit there and contribute to system loadavg
> >> until it is woken up the first time.
> >> Shouldn't we use kthread_run here instead?
> >>
> >
> > Right. kthread_create() basically creates the thread and leaves it in
> > sleep mode. I think that is what we want. We rely on the next
> > ___napi_schedule() call to wake up this thread when there is work to
> > do.
> But what if you have a device that's basically idle and napi isn't
> scheduled until much later? It will get a confusing loadavg until then.
> I'd prefer waking up the thread immediately and filtering going back to
> sleep once in the thread function before running the loop if
> NAPI_STATE_SCHED wasn't set.
>

I was not aware of this kthread_create() impact on loadavg.
This seems like a bug to me. (although I do not care about loadavg)

Do you have pointers on some documentation ?

Probably not a big deal, but this seems quite odd to me.
Felix Fietkau Oct. 1, 2020, 6:37 p.m. UTC | #7
On 2020-10-01 20:03, Eric Dumazet wrote:
> On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau <nbd@nbd.name> wrote:
>>
>> On 2020-10-01 19:01, Wei Wang wrote:
>> > On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
>> >>
>> >>
>> >> On 2020-09-30 21:21, Wei Wang wrote:
>> >> > This commit mainly addresses the threaded config to make the switch
>> >> > between softirq based and kthread based NAPI processing not require
>> >> > a device down/up.
>> >> > It also moves the kthread_create() call to the sysfs handler when user
>> >> > tries to enable "threaded" on napi, and properly handles the
>> >> > kthread_create() failure. This is because certain drivers do not have
>> >> > the napi created and linked to the dev when dev_open() is called. So
>> >> > the previous implementation does not work properly there.
>> >> >
>> >> > Signed-off-by: Wei Wang <weiwan@google.com>
>> >> > ---
>> >> > Changes since RFC:
>> >> > changed the thread name to napi/<dev>-<napi-id>
>> >> >
>> >> >  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
>> >> >  net/core/net-sysfs.c |  9 +++-----
>> >> >  2 files changed, 31 insertions(+), 27 deletions(-)
>> >> >
>> >> > diff --git a/net/core/dev.c b/net/core/dev.c
>> >> > index b4f33e442b5e..bf878d3a9d89 100644
>> >> > --- a/net/core/dev.c
>> >> > +++ b/net/core/dev.c
>> >> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>> >> >
>> >> >  static int napi_threaded_poll(void *data);
>> >> >
>> >> > -static void napi_thread_start(struct napi_struct *n)
>> >> > +static int napi_kthread_create(struct napi_struct *n)
>> >> >  {
>> >> > -     if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
>> >> > -             n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
>> >> > -                                        n->dev->name, n->napi_id);
>> >> > +     int err = 0;
>> >> > +
>> >> > +     n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
>> >> > +                                n->dev->name, n->napi_id);
>> >> > +     if (IS_ERR(n->thread)) {
>> >> > +             err = PTR_ERR(n->thread);
>> >> > +             pr_err("kthread_create failed with err %d\n", err);
>> >> > +             n->thread = NULL;
>> >> > +     }
>> >> > +
>> >> > +     return err;
>> >> If I remember correctly, using kthread_create with no explicit first
>> >> wakeup means the task will sit there and contribute to system loadavg
>> >> until it is woken up the first time.
>> >> Shouldn't we use kthread_run here instead?
>> >>
>> >
>> > Right. kthread_create() basically creates the thread and leaves it in
>> > sleep mode. I think that is what we want. We rely on the next
>> > ___napi_schedule() call to wake up this thread when there is work to
>> > do.
>> But what if you have a device that's basically idle and napi isn't
>> scheduled until much later? It will get a confusing loadavg until then.
>> I'd prefer waking up the thread immediately and filtering going back to
>> sleep once in the thread function before running the loop if
>> NAPI_STATE_SCHED wasn't set.
>>
> 
> I was not aware of this kthread_create() impact on loadavg.
> This seems like a bug to me. (although I do not care about loadavg)
> 
> Do you have pointers on some documentation ?
I don't have any specific documentation pointers, but this is something
I observed on several occasions when playing with kthreads.
Wei Wang Oct. 1, 2020, 7:24 p.m. UTC | #8
On Thu, Oct 1, 2020 at 11:38 AM Felix Fietkau <nbd@nbd.name> wrote:
>
>
> On 2020-10-01 20:03, Eric Dumazet wrote:
> > On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau <nbd@nbd.name> wrote:
> >>
> >> On 2020-10-01 19:01, Wei Wang wrote:
> >> > On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
> >> >>
> >> >>
> >> >> On 2020-09-30 21:21, Wei Wang wrote:
> >> >> > This commit mainly addresses the threaded config to make the switch
> >> >> > between softirq based and kthread based NAPI processing not require
> >> >> > a device down/up.
> >> >> > It also moves the kthread_create() call to the sysfs handler when user
> >> >> > tries to enable "threaded" on napi, and properly handles the
> >> >> > kthread_create() failure. This is because certain drivers do not have
> >> >> > the napi created and linked to the dev when dev_open() is called. So
> >> >> > the previous implementation does not work properly there.
> >> >> >
> >> >> > Signed-off-by: Wei Wang <weiwan@google.com>
> >> >> > ---
> >> >> > Changes since RFC:
> >> >> > changed the thread name to napi/<dev>-<napi-id>
> >> >> >
> >> >> >  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
> >> >> >  net/core/net-sysfs.c |  9 +++-----
> >> >> >  2 files changed, 31 insertions(+), 27 deletions(-)
> >> >> >
> >> >> > diff --git a/net/core/dev.c b/net/core/dev.c
> >> >> > index b4f33e442b5e..bf878d3a9d89 100644
> >> >> > --- a/net/core/dev.c
> >> >> > +++ b/net/core/dev.c
> >> >> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
> >> >> >
> >> >> >  static int napi_threaded_poll(void *data);
> >> >> >
> >> >> > -static void napi_thread_start(struct napi_struct *n)
> >> >> > +static int napi_kthread_create(struct napi_struct *n)
> >> >> >  {
> >> >> > -     if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> >> >> > -             n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> >> >> > -                                        n->dev->name, n->napi_id);
> >> >> > +     int err = 0;
> >> >> > +
> >> >> > +     n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> >> >> > +                                n->dev->name, n->napi_id);
> >> >> > +     if (IS_ERR(n->thread)) {
> >> >> > +             err = PTR_ERR(n->thread);
> >> >> > +             pr_err("kthread_create failed with err %d\n", err);
> >> >> > +             n->thread = NULL;
> >> >> > +     }
> >> >> > +
> >> >> > +     return err;
> >> >> If I remember correctly, using kthread_create with no explicit first
> >> >> wakeup means the task will sit there and contribute to system loadavg
> >> >> until it is woken up the first time.
> >> >> Shouldn't we use kthread_run here instead?
> >> >>
> >> >
> >> > Right. kthread_create() basically creates the thread and leaves it in
> >> > sleep mode. I think that is what we want. We rely on the next
> >> > ___napi_schedule() call to wake up this thread when there is work to
> >> > do.
> >> But what if you have a device that's basically idle and napi isn't
> >> scheduled until much later? It will get a confusing loadavg until then.
> >> I'd prefer waking up the thread immediately and filtering going back to
> >> sleep once in the thread function before running the loop if
> >> NAPI_STATE_SCHED wasn't set.
> >>
> >
> > I was not aware of this kthread_create() impact on loadavg.
> > This seems like a bug to me. (although I do not care about loadavg)
> >
> > Do you have pointers on some documentation ?

I found this link:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
It has a section called "Linux Uninterruptible Tasks" which explains
this behavior specifically. But I don't see a good conclusion on why.
Seems to be a convention.
IMHO, this is actually the problem/decision of the loadavg. It should
not impact how the kernel code is implemented. I think it makes more
sense to only wake up the thread when there is work to do.

> I don't have any specific documentation pointers, but this is something
> I observed on several occasions when playing with kthreads.
>
> From what I can find in the loadavg code it seems that tasks in
> TASK_UNINTERRUPTIBLE state are counted for loadavg alongside actually
> runnable tasks. This seems intentional to me, but I don't know why it
> was made like this.
>
> A kthread does not start the thread function until it has been woken up
> at least once, most likely to give the creating code a chance to perform
> some initializations after successfully creating the thread, before the
> thread function starts doing something. Instead, kthread() sets
> TASK_UNINTERRUPTIBLE and calls schedule() once.
>
> > Probably not a big deal, but this seems quite odd to me.
> I've run into enough users that look at loadavg as a measure of system
> load and would likely start reporting bugs if they observe such
> behavior. I'd like to avoid that.
>
> - Felix
Jakub Kicinski Oct. 1, 2020, 8:26 p.m. UTC | #9
On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote:
> On Wed, Sep 30, 2020 at 10:08 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 30 Sep 2020 12:21:35 -0700 Wei Wang wrote:  
> > > With napi poll moved to kthread, scheduler is in charge of scheduling both
> > > the kthreads handling network load, and the user threads, and is able to
> > > make better decisions. In the previous benchmark, if we do this and we
> > > pin the kthreads processing napi poll to specific CPUs, scheduler is
> > > able to schedule user threads away from these CPUs automatically.
> > >
> > > And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> > > entity per host, is that kthread is more configurable than workqueue,
> > > and we could leverage existing tuning tools for threads, like taskset,
> > > chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> > > if we eventually want to provide busy poll feature using kernel threads
> > > for napi poll, kthread seems to be more suitable than workqueue.  
> >
> > As I said in my reply to the RFC I see better performance with the
> > workqueue implementation, so I would hold off until we have more
> > conclusive results there, as this set adds fairly strong uAPI that
> > we'll have to support for ever.  
> 
> We can make incremental changes, the kthread implementation looks much
> nicer to us.

Having done two implementation of something more wq-like now 
I can say with some confidence that it's quite likely not a 
simple extension of this model. And since we'll likely need
to support switching at runtime there will be a fast-path
synchronization overhead.

> The unique work queue is a problem on server class platforms, with
> NUMA placement.
> We now have servers with NIC on different NUMA nodes.

Are you saying that the wq code is less NUMA friendly than unpinned
threads?

> We can not introduce a new model that will make all workload better
> without any tuning.
> If you really think you can do that, think again.

Has Wei tested the wq implementation with real workloads?

All the cover letter has is some basic netperf runs and a vague
sentence saying "real workload also improved".

I think it's possible to get something that will be a better default
for 90% of workloads. Our current model predates SMP by two decades.
It's pretty bad.

I'm talking about upstream defaults, obviously, maybe you're starting
from a different baseline configuration than the rest of the world..

> Even the old ' fix'  (commit 4cd13c21b207e80ddb1144c576500098f2d5f882
> "softirq: Let ksoftirqd do its job" )
> had severe issues for latency sensitive jobs.
> 
> We need to be able to opt-in to threads, and let process scheduler
> take decisions.
> If we believe the process scheduler takes bad decision, it should be
> reported to scheduler experts.

I wouldn't expect that the scheduler will learn all by itself how to
group processes that run identical code for cache efficiency, and how
to schedule at 10us scale. I hope I'm wrong.

> I fully support this implementation, I do not want to wait for yet
> another 'work queue' model or scheduler classes.

I can't sympathize. I don't understand why you're trying to rush this.
And you're not giving me enough info about your target config to be able
to understand your thinking.
Felix Fietkau Oct. 1, 2020, 8:48 p.m. UTC | #10
On 2020-10-01 21:24, Wei Wang wrote:
> On Thu, Oct 1, 2020 at 11:38 AM Felix Fietkau <nbd@nbd.name> wrote:
>>
>>
>> On 2020-10-01 20:03, Eric Dumazet wrote:
>> > On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau <nbd@nbd.name> wrote:
>> >>
>> >> On 2020-10-01 19:01, Wei Wang wrote:
>> >> > On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
>> >> >>
>> >> >>
>> >> >> On 2020-09-30 21:21, Wei Wang wrote:
>> >> >> > This commit mainly addresses the threaded config to make the switch
>> >> >> > between softirq based and kthread based NAPI processing not require
>> >> >> > a device down/up.
>> >> >> > It also moves the kthread_create() call to the sysfs handler when user
>> >> >> > tries to enable "threaded" on napi, and properly handles the
>> >> >> > kthread_create() failure. This is because certain drivers do not have
>> >> >> > the napi created and linked to the dev when dev_open() is called. So
>> >> >> > the previous implementation does not work properly there.
>> >> >> >
>> >> >> > Signed-off-by: Wei Wang <weiwan@google.com>
>> >> >> > ---
>> >> >> > Changes since RFC:
>> >> >> > changed the thread name to napi/<dev>-<napi-id>
>> >> >> >
>> >> >> >  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
>> >> >> >  net/core/net-sysfs.c |  9 +++-----
>> >> >> >  2 files changed, 31 insertions(+), 27 deletions(-)
>> >> >> >
>> >> >> > diff --git a/net/core/dev.c b/net/core/dev.c
>> >> >> > index b4f33e442b5e..bf878d3a9d89 100644
>> >> >> > --- a/net/core/dev.c
>> >> >> > +++ b/net/core/dev.c
>> >> >> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>> >> >> >
>> >> >> >  static int napi_threaded_poll(void *data);
>> >> >> >
>> >> >> > -static void napi_thread_start(struct napi_struct *n)
>> >> >> > +static int napi_kthread_create(struct napi_struct *n)
>> >> >> >  {
>> >> >> > -     if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
>> >> >> > -             n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
>> >> >> > -                                        n->dev->name, n->napi_id);
>> >> >> > +     int err = 0;
>> >> >> > +
>> >> >> > +     n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
>> >> >> > +                                n->dev->name, n->napi_id);
>> >> >> > +     if (IS_ERR(n->thread)) {
>> >> >> > +             err = PTR_ERR(n->thread);
>> >> >> > +             pr_err("kthread_create failed with err %d\n", err);
>> >> >> > +             n->thread = NULL;
>> >> >> > +     }
>> >> >> > +
>> >> >> > +     return err;
>> >> >> If I remember correctly, using kthread_create with no explicit first
>> >> >> wakeup means the task will sit there and contribute to system loadavg
>> >> >> until it is woken up the first time.
>> >> >> Shouldn't we use kthread_run here instead?
>> >> >>
>> >> >
>> >> > Right. kthread_create() basically creates the thread and leaves it in
>> >> > sleep mode. I think that is what we want. We rely on the next
>> >> > ___napi_schedule() call to wake up this thread when there is work to
>> >> > do.
>> >> But what if you have a device that's basically idle and napi isn't
>> >> scheduled until much later? It will get a confusing loadavg until then.
>> >> I'd prefer waking up the thread immediately and filtering going back to
>> >> sleep once in the thread function before running the loop if
>> >> NAPI_STATE_SCHED wasn't set.
>> >>
>> >
>> > I was not aware of this kthread_create() impact on loadavg.
>> > This seems like a bug to me. (although I do not care about loadavg)
>> >
>> > Do you have pointers on some documentation ?
> 
> I found this link:
> http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
> It has a section called "Linux Uninterruptible Tasks" which explains
> this behavior specifically. But I don't see a good conclusion on why.
> Seems to be a convention.
> IMHO, this is actually the problem/decision of the loadavg. It should
> not impact how the kernel code is implemented. I think it makes more
> sense to only wake up the thread when there is work to do.
There were other users of kthread where the same issue was fixed.
With a quick search, I found these commits:
e890591413819eeb604207ad3261ba617b2ec0bb
3f776e8a25a9d281125490562e1cc5bd7c14cf7c

Please note that one of these describes that a kthread that was created
but not woken was triggering a blocked task warning - so it's not just
the loadavg that matters here.

All the other users of kthread that I looked at also do an initial
wakeup of the thread. Not doing it seems like wrong use of the API to me.

- Felix
Wei Wang Oct. 1, 2020, 10:12 p.m. UTC | #11
On Thu, Oct 1, 2020 at 1:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote:
> > On Wed, Sep 30, 2020 at 10:08 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > > On Wed, 30 Sep 2020 12:21:35 -0700 Wei Wang wrote:
> > > > With napi poll moved to kthread, scheduler is in charge of scheduling both
> > > > the kthreads handling network load, and the user threads, and is able to
> > > > make better decisions. In the previous benchmark, if we do this and we
> > > > pin the kthreads processing napi poll to specific CPUs, scheduler is
> > > > able to schedule user threads away from these CPUs automatically.
> > > >
> > > > And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> > > > entity per host, is that kthread is more configurable than workqueue,
> > > > and we could leverage existing tuning tools for threads, like taskset,
> > > > chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> > > > if we eventually want to provide busy poll feature using kernel threads
> > > > for napi poll, kthread seems to be more suitable than workqueue.
> > >
> > > As I said in my reply to the RFC I see better performance with the
> > > workqueue implementation, so I would hold off until we have more
> > > conclusive results there, as this set adds fairly strong uAPI that
> > > we'll have to support for ever.
> >
> > We can make incremental changes, the kthread implementation looks much
> > nicer to us.
>
> Having done two implementation of something more wq-like now
> I can say with some confidence that it's quite likely not a
> simple extension of this model. And since we'll likely need
> to support switching at runtime there will be a fast-path
> synchronization overhead.
>
> > The unique work queue is a problem on server class platforms, with
> > NUMA placement.
> > We now have servers with NIC on different NUMA nodes.
>
> Are you saying that the wq code is less NUMA friendly than unpinned
> threads?
>
> > We can not introduce a new model that will make all workload better
> > without any tuning.
> > If you really think you can do that, think again.
>
> Has Wei tested the wq implementation with real workloads?
>
> All the cover letter has is some basic netperf runs and a vague
> sentence saying "real workload also improved".
>

Yes. I did a round of testing with workqueue as well. The "real
workload" I mentioned is a google internal application benchmark which
involves networking  as well as disk ops.
There are 2 types of tests there.
1 is sustained tests, where the ops/s is being pushed to very high,
and keeps the overall cpu usage to > 80%, with various sizes of
payload.
In this type of test case, I see a better result with the kthread
model compared to workqueue in the latency metrics, and similar CPU
savings, with some tuning of the kthreads. (e.g., we limit the
kthreads to a pool of CPUs to run on, to avoid mixture with
application threads. I did the same for workqueue as well to be fair.)
The other is trace based tests, where the load is based on the actual
trace taken from the real servers. This kind of test has less load and
ops/s overall. (~25% total cpu usage on the host)
In this test case, I observe a similar amount of latency savings with
both kthread and workqueue, but workqueue seems to have better cpu
saving here, possibly due to less # of threads woken up to process the
load.

And one reason we would like to push forward with 1 kthread per NAPI,
is we are also trying to do busy polling with the kthread. And it
seems a good model to have 1 kthread dedicated to 1 NAPI to begin
with.

> I think it's possible to get something that will be a better default
> for 90% of workloads. Our current model predates SMP by two decades.
> It's pretty bad.
>
> I'm talking about upstream defaults, obviously, maybe you're starting
> from a different baseline configuration than the rest of the world..
>
> > Even the old ' fix'  (commit 4cd13c21b207e80ddb1144c576500098f2d5f882
> > "softirq: Let ksoftirqd do its job" )
> > had severe issues for latency sensitive jobs.
> >
> > We need to be able to opt-in to threads, and let process scheduler
> > take decisions.
> > If we believe the process scheduler takes bad decision, it should be
> > reported to scheduler experts.
>
> I wouldn't expect that the scheduler will learn all by itself how to
> group processes that run identical code for cache efficiency, and how
> to schedule at 10us scale. I hope I'm wrong.
>
> > I fully support this implementation, I do not want to wait for yet
> > another 'work queue' model or scheduler classes.
>
> I can't sympathize. I don't understand why you're trying to rush this.
> And you're not giving me enough info about your target config to be able
> to understand your thinking.
Wei Wang Oct. 1, 2020, 10:42 p.m. UTC | #12
On Thu, Oct 1, 2020 at 1:48 PM Felix Fietkau <nbd@nbd.name> wrote:
>
>
> On 2020-10-01 21:24, Wei Wang wrote:
> > On Thu, Oct 1, 2020 at 11:38 AM Felix Fietkau <nbd@nbd.name> wrote:
> >>
> >>
> >> On 2020-10-01 20:03, Eric Dumazet wrote:
> >> > On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau <nbd@nbd.name> wrote:
> >> >>
> >> >> On 2020-10-01 19:01, Wei Wang wrote:
> >> >> > On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
> >> >> >>
> >> >> >>
> >> >> >> On 2020-09-30 21:21, Wei Wang wrote:
> >> >> >> > This commit mainly addresses the threaded config to make the switch
> >> >> >> > between softirq based and kthread based NAPI processing not require
> >> >> >> > a device down/up.
> >> >> >> > It also moves the kthread_create() call to the sysfs handler when user
> >> >> >> > tries to enable "threaded" on napi, and properly handles the
> >> >> >> > kthread_create() failure. This is because certain drivers do not have
> >> >> >> > the napi created and linked to the dev when dev_open() is called. So
> >> >> >> > the previous implementation does not work properly there.
> >> >> >> >
> >> >> >> > Signed-off-by: Wei Wang <weiwan@google.com>
> >> >> >> > ---
> >> >> >> > Changes since RFC:
> >> >> >> > changed the thread name to napi/<dev>-<napi-id>
> >> >> >> >
> >> >> >> >  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
> >> >> >> >  net/core/net-sysfs.c |  9 +++-----
> >> >> >> >  2 files changed, 31 insertions(+), 27 deletions(-)
> >> >> >> >
> >> >> >> > diff --git a/net/core/dev.c b/net/core/dev.c
> >> >> >> > index b4f33e442b5e..bf878d3a9d89 100644
> >> >> >> > --- a/net/core/dev.c
> >> >> >> > +++ b/net/core/dev.c
> >> >> >> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
> >> >> >> >
> >> >> >> >  static int napi_threaded_poll(void *data);
> >> >> >> >
> >> >> >> > -static void napi_thread_start(struct napi_struct *n)
> >> >> >> > +static int napi_kthread_create(struct napi_struct *n)
> >> >> >> >  {
> >> >> >> > -     if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> >> >> >> > -             n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> >> >> >> > -                                        n->dev->name, n->napi_id);
> >> >> >> > +     int err = 0;
> >> >> >> > +
> >> >> >> > +     n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> >> >> >> > +                                n->dev->name, n->napi_id);
> >> >> >> > +     if (IS_ERR(n->thread)) {
> >> >> >> > +             err = PTR_ERR(n->thread);
> >> >> >> > +             pr_err("kthread_create failed with err %d\n", err);
> >> >> >> > +             n->thread = NULL;
> >> >> >> > +     }
> >> >> >> > +
> >> >> >> > +     return err;
> >> >> >> If I remember correctly, using kthread_create with no explicit first
> >> >> >> wakeup means the task will sit there and contribute to system loadavg
> >> >> >> until it is woken up the first time.
> >> >> >> Shouldn't we use kthread_run here instead?
> >> >> >>
> >> >> >
> >> >> > Right. kthread_create() basically creates the thread and leaves it in
> >> >> > sleep mode. I think that is what we want. We rely on the next
> >> >> > ___napi_schedule() call to wake up this thread when there is work to
> >> >> > do.
> >> >> But what if you have a device that's basically idle and napi isn't
> >> >> scheduled until much later? It will get a confusing loadavg until then.
> >> >> I'd prefer waking up the thread immediately and filtering going back to
> >> >> sleep once in the thread function before running the loop if
> >> >> NAPI_STATE_SCHED wasn't set.
> >> >>
> >> >
> >> > I was not aware of this kthread_create() impact on loadavg.
> >> > This seems like a bug to me. (although I do not care about loadavg)
> >> >
> >> > Do you have pointers on some documentation ?
> >
> > I found this link:
> > http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
> > It has a section called "Linux Uninterruptible Tasks" which explains
> > this behavior specifically. But I don't see a good conclusion on why.
> > Seems to be a convention.
> > IMHO, this is actually the problem/decision of the loadavg. It should
> > not impact how the kernel code is implemented. I think it makes more
> > sense to only wake up the thread when there is work to do.
> There were other users of kthread where the same issue was fixed.
> With a quick search, I found these commits:
> e890591413819eeb604207ad3261ba617b2ec0bb
> 3f776e8a25a9d281125490562e1cc5bd7c14cf7c
>
> Please note that one of these describes that a kthread that was created
> but not woken was triggering a blocked task warning - so it's not just
> the loadavg that matters here.
>
> All the other users of kthread that I looked at also do an initial
> wakeup of the thread. Not doing it seems like wrong use of the API to me.
>

Thanks Felix for digging up the above commits. Very helpful. I will
change it to kthread_run() in v2.

> - Felix
Jakub Kicinski Oct. 1, 2020, 11:46 p.m. UTC | #13
On Thu, 1 Oct 2020 15:12:20 -0700 Wei Wang wrote:
> Yes. I did a round of testing with workqueue as well. The "real
> workload" I mentioned is a google internal application benchmark which
> involves networking  as well as disk ops.
> There are 2 types of tests there.
> 1 is sustained tests, where the ops/s is being pushed to very high,
> and keeps the overall cpu usage to > 80%, with various sizes of
> payload.
> In this type of test case, I see a better result with the kthread
> model compared to workqueue in the latency metrics, and similar CPU
> savings, with some tuning of the kthreads. (e.g., we limit the
> kthreads to a pool of CPUs to run on, to avoid mixture with
> application threads. I did the same for workqueue as well to be fair.)

Can you share relative performance delta of this banchmark?

Could you explain why threads are slower than ksoftirqd if you pin the
application away? From your cover letter it sounded like you want the
scheduler to see the NAPI load, but then you say you pinned the
application away from the NAPI cores for the test, so I'm confused.

> The other is trace based tests, where the load is based on the actual
> trace taken from the real servers. This kind of test has less load and
> ops/s overall. (~25% total cpu usage on the host)
> In this test case, I observe a similar amount of latency savings with
> both kthread and workqueue, but workqueue seems to have better cpu
> saving here, possibly due to less # of threads woken up to process the
> load.
> 
> And one reason we would like to push forward with 1 kthread per NAPI,
> is we are also trying to do busy polling with the kthread. And it
> seems a good model to have 1 kthread dedicated to 1 NAPI to begin
> with.

And you'd pin those busy polling threads to a specific, single CPU, too?
1 cpu : 1 thread : 1 NAPI?
Wei Wang Oct. 2, 2020, 1:44 a.m. UTC | #14
On Thu, Oct 1, 2020 at 4:46 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 1 Oct 2020 15:12:20 -0700 Wei Wang wrote:
> > Yes. I did a round of testing with workqueue as well. The "real
> > workload" I mentioned is a google internal application benchmark which
> > involves networking  as well as disk ops.
> > There are 2 types of tests there.
> > 1 is sustained tests, where the ops/s is being pushed to very high,
> > and keeps the overall cpu usage to > 80%, with various sizes of
> > payload.
> > In this type of test case, I see a better result with the kthread
> > model compared to workqueue in the latency metrics, and similar CPU
> > savings, with some tuning of the kthreads. (e.g., we limit the
> > kthreads to a pool of CPUs to run on, to avoid mixture with
> > application threads. I did the same for workqueue as well to be fair.)
>
> Can you share relative performance delta of this banchmark?
>
> Could you explain why threads are slower than ksoftirqd if you pin the
> application away? From your cover letter it sounded like you want the
> scheduler to see the NAPI load, but then you say you pinned the
> application away from the NAPI cores for the test, so I'm confused.
>

No. We did not explicitly pin the application threads away.
Application threads are free to run anywhere. What we do is we
restrict the NAPI kthreads to only those CPUs handling rx interrupts.
(For us, 8 cpus out of 56.) So the load on those CPUs are very high
when running the test. And the scheduler is smart enough to avoid
using those CPUs for the application threads automatically.
Here is the results of 1 representative test result:
                     cpu/op   50%tile     95%tile       99%tile
base            71.47        417us      1.01ms          2.9ms
kthread         67.84       396us      976us            2.4ms
workqueue   69.68       386us      791us             1.9ms

Actually, I remembered it wrong. It does seem workqueue is doing
better on latencies. But cpu/op wise, kthread seems to be a bit
better.

> > The other is trace based tests, where the load is based on the actual
> > trace taken from the real servers. This kind of test has less load and
> > ops/s overall. (~25% total cpu usage on the host)
> > In this test case, I observe a similar amount of latency savings with
> > both kthread and workqueue, but workqueue seems to have better cpu
> > saving here, possibly due to less # of threads woken up to process the
> > load.
> >
> > And one reason we would like to push forward with 1 kthread per NAPI,
> > is we are also trying to do busy polling with the kthread. And it
> > seems a good model to have 1 kthread dedicated to 1 NAPI to begin
> > with.
>
> And you'd pin those busy polling threads to a specific, single CPU, too?
> 1 cpu : 1 thread : 1 NAPI?
Yes. That is my thought.
Eric Dumazet Oct. 2, 2020, 7:56 a.m. UTC | #15
On Thu, Oct 1, 2020 at 10:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote:

> > The unique work queue is a problem on server class platforms, with
> > NUMA placement.
> > We now have servers with NIC on different NUMA nodes.
>
> Are you saying that the wq code is less NUMA friendly than unpinned
> threads?

Yes this is what I am saying.

Using a single and shared wq wont allow you to make sure :
- work for NIC0 attached on NUMA node#0 will be using CPUS belonging to node#0
- work for NIC1 attached on NUMA node#1 will be using CPUS belonging to node#1


The only way you can tune things with a single wq is tweaking a single cpumask,
that we can change with /sys/devices/virtual/workqueue/{wqname}/cpumask
The same for the nice value with  /sys/devices/virtual/workqueue/{wqname}/nice.

In contrast, having kthreads let you tune things independently, if needed.

Even with a single NIC, you can still need isolation between queues.
We have queues dedicated to a certain kind of traffic/application.

The work queue approach would need to be able to create/delete
independent workqueues.
But we tested the workqueue with a single NIC and our results gave to
kthreads a win over the work queue.

Really, wq concept might be a nice abstraction when each work can be
running for arbitrary durations,
and arbitrary numbers of cpus, but with the NAPI model of up to 64
packets at a time, and a fixed number of queues,
we should not add the work queue overhead.
Jakub Kicinski Oct. 2, 2020, 10:49 p.m. UTC | #16
On Fri, 2 Oct 2020 09:56:31 +0200 Eric Dumazet wrote:
> On Thu, Oct 1, 2020 at 10:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote:  
> 
> > > The unique work queue is a problem on server class platforms, with
> > > NUMA placement.
> > > We now have servers with NIC on different NUMA nodes.  
> >
> > Are you saying that the wq code is less NUMA friendly than unpinned
> > threads?  
> 
> Yes this is what I am saying.
> 
> Using a single and shared wq wont allow you to make sure :
> - work for NIC0 attached on NUMA node#0 will be using CPUS belonging to node#0
> - work for NIC1 attached on NUMA node#1 will be using CPUS belonging to node#1
> 
> 
> The only way you can tune things with a single wq is tweaking a single cpumask,
> that we can change with /sys/devices/virtual/workqueue/{wqname}/cpumask
> The same for the nice value with  /sys/devices/virtual/workqueue/{wqname}/nice.
> 
> In contrast, having kthreads let you tune things independently, if needed.
> 
> Even with a single NIC, you can still need isolation between queues.
> We have queues dedicated to a certain kind of traffic/application.
> 
> The work queue approach would need to be able to create/delete
> independent workqueues.
> But we tested the workqueue with a single NIC and our results gave to
> kthreads a win over the work queue.

Not according to the results Wei posted last night..

> Really, wq concept might be a nice abstraction when each work can be
> running for arbitrary durations,
> and arbitrary numbers of cpus, but with the NAPI model of up to 64
> packets at a time, and a fixed number of queues,

In my experiments the worker threads get stalled sooner or later. 
And unless there is some work stealing going on latency spikes follow.

I would also not discount the variability in processing time. For a
budget of 64 the processing can take 0-500us per round, not counting
outliers.

> we should not add the work queue overhead.

Does this mean you're going to be against the (more fleshed out)
work queue implementation?
Jakub Kicinski Oct. 2, 2020, 10:53 p.m. UTC | #17
On Thu, 1 Oct 2020 18:44:40 -0700 Wei Wang wrote:
> > Can you share relative performance delta of this banchmark?
> >
> > Could you explain why threads are slower than ksoftirqd if you pin the
> > application away? From your cover letter it sounded like you want the
> > scheduler to see the NAPI load, but then you say you pinned the
> > application away from the NAPI cores for the test, so I'm confused.
> 
> No. We did not explicitly pin the application threads away.
> Application threads are free to run anywhere. What we do is we
> restrict the NAPI kthreads to only those CPUs handling rx interrupts.

Whatever. You pin the NAPI threads and hand-tune their number so the
load of the NAPI CPUs is always higher. If the workload changes the
system will get very unhappy.

> (For us, 8 cpus out of 56.) So the load on those CPUs are very high
> when running the test. And the scheduler is smart enough to avoid
> using those CPUs for the application threads automatically.
> Here is the results of 1 representative test result:
>                      cpu/op   50%tile     95%tile       99%tile
> base            71.47        417us      1.01ms          2.9ms
> kthread         67.84       396us      976us            2.4ms
> workqueue   69.68       386us      791us             1.9ms

Did you renice ksoftirqd in "base"?

> Actually, I remembered it wrong. It does seem workqueue is doing
> better on latencies. But cpu/op wise, kthread seems to be a bit
> better.

Q.E.D.
David Miller Oct. 2, 2020, 11 p.m. UTC | #18
From: Wei Wang <weiwan@google.com>
Date: Wed, 30 Sep 2020 12:21:35 -0700

 ...
> And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> entity per host, is that kthread is more configurable than workqueue,
> and we could leverage existing tuning tools for threads, like taskset,
> chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> if we eventually want to provide busy poll feature using kernel threads
> for napi poll, kthread seems to be more suitable than workqueue. 
...

I think we still need to discuss this some more.

Jakub has some ideas and I honestly think the whole workqueue
approach hasn't been fully considered yet.

If this wan't urgent years ago (when it was NACK'd btw), it isn't
urgent for 5.10 so I don't know why we are pushing so hard for
this patch series to go in as-is right now.

Please be patient and let's have a full discussion on this.

Thank you.
Alexei Starovoitov Oct. 2, 2020, 11:15 p.m. UTC | #19
On Fri, Oct 2, 2020 at 4:02 PM David Miller <davem@davemloft.net> wrote:
>
> From: Wei Wang <weiwan@google.com>
> Date: Wed, 30 Sep 2020 12:21:35 -0700
>
>  ...
> > And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> > entity per host, is that kthread is more configurable than workqueue,
> > and we could leverage existing tuning tools for threads, like taskset,
> > chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> > if we eventually want to provide busy poll feature using kernel threads
> > for napi poll, kthread seems to be more suitable than workqueue.
> ...
>
> I think we still need to discuss this some more.
>
> Jakub has some ideas and I honestly think the whole workqueue
> approach hasn't been fully considered yet.

I want to point out that it's not kthread vs wq. I think the mechanism
has to be pluggable. The kernel needs to support both kthread and wq.
Or maybe even the 3rd option. Whatever it might be.
Via sysctl or something.
I suspect for some production workloads wq will perform better.
For the others it will be kthread.
Clearly kthread is more tunable, but not everyone would have
knowledge and desire to do the tunning.
We can argue what should be the default, but that's secondary.

> If this wan't urgent years ago (when it was NACK'd btw), it isn't
> urgent for 5.10 so I don't know why we are pushing so hard for
> this patch series to go in as-is right now.
>
> Please be patient and let's have a full discussion on this.

+1. This is the biggest change to the kernel networking in years.
Let's make it right.
Eric Dumazet Oct. 3, 2020, 3:54 a.m. UTC | #20
On Sat, Oct 3, 2020 at 1:15 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Oct 2, 2020 at 4:02 PM David Miller <davem@davemloft.net> wrote:
> >
> > From: Wei Wang <weiwan@google.com>
> > Date: Wed, 30 Sep 2020 12:21:35 -0700
> >
> >  ...
> > > And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> > > entity per host, is that kthread is more configurable than workqueue,
> > > and we could leverage existing tuning tools for threads, like taskset,
> > > chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> > > if we eventually want to provide busy poll feature using kernel threads
> > > for napi poll, kthread seems to be more suitable than workqueue.
> > ...
> >
> > I think we still need to discuss this some more.
> >
> > Jakub has some ideas and I honestly think the whole workqueue
> > approach hasn't been fully considered yet.
>
> I want to point out that it's not kthread vs wq. I think the mechanism
> has to be pluggable. The kernel needs to support both kthread and wq.
> Or maybe even the 3rd option. Whatever it might be.
> Via sysctl or something.
> I suspect for some production workloads wq will perform better.
> For the others it will be kthread.
> Clearly kthread is more tunable, but not everyone would have
> knowledge and desire to do the tunning.

The exact same arguments can be used against RPS and RFS

RPS went first, and was later augmented with RFS (with very different goals)

They both are opt-in

> We can argue what should be the default, but that's secondary.
>
> > If this wan't urgent years ago (when it was NACK'd btw), it isn't
> > urgent for 5.10 so I don't know why we are pushing so hard for
> > this patch series to go in as-is right now.
> >
> > Please be patient and let's have a full discussion on this.
>
> +1. This is the biggest change to the kernel networking in years.
> Let's make it right.

Sure. I do not think it is urgent.

This has been revived by Felix some weeks ago, and I think we were the
ones spending time on the proposed patch set.
Not giving feedback to Felix would have been "something not right".

We reviewed and tested Felix patches, and came up with something more flexible.

Sure, a WQ is already giving nice results on appliances, because there
you do not need strong isolation.
Would a kthread approach also work well on appliances ? Probably...
Alexei Starovoitov Oct. 3, 2020, 4:17 a.m. UTC | #21
On Sat, Oct 03, 2020 at 05:54:38AM +0200, Eric Dumazet wrote:
> 
> Sure, a WQ is already giving nice results on appliances, because there
> you do not need strong isolation.
> Would a kthread approach also work well on appliances ? Probably...

Right. I think we're on the same page.
The only reason I'm bringing up multiple co-existing approaches now is to make
sure they are designed in from the start instead of as afterthought. Two
implementations already exist, but it doesn't look like that they can co-exist
in the kernel. Like this NAPI_STATE_THREADED bit in this patch set. Should we
burn that bit for kthread approach and another bit for workqueue based? 
I don't know. As the user of the feature I would be happy with any mechanism as
long as I can flip between them in runtime :) Just like RPS and RFS.