Message ID: 20200930192140.4192859-1-weiwan@google.com
Series: implement kthread based napi poll
On Wed, 30 Sep 2020 12:21:35 -0700 Wei Wang wrote:
> With napi poll moved to kthread, scheduler is in charge of scheduling both
> the kthreads handling network load, and the user threads, and is able to
> make better decisions. In the previous benchmark, if we do this and we
> pin the kthreads processing napi poll to specific CPUs, scheduler is
> able to schedule user threads away from these CPUs automatically.
>
> And the reason we prefer 1 kthread per napi, instead of 1 workqueue
> entity per host, is that kthread is more configurable than workqueue,
> and we could leverage existing tuning tools for threads, like taskset,
> chrt, etc to tune scheduling class and cpu set, etc. Another reason is
> if we eventually want to provide busy poll feature using kernel threads
> for napi poll, kthread seems to be more suitable than workqueue.

As I said in my reply to the RFC I see better performance with the
workqueue implementation, so I would hold off until we have more
conclusive results there, as this set adds fairly strong uAPI that
we'll have to support forever.
On Wed, Sep 30, 2020 at 10:08 PM Jakub Kicinski <kuba@kernel.org> wrote:
> On Wed, 30 Sep 2020 12:21:35 -0700 Wei Wang wrote:
> > With napi poll moved to kthread, scheduler is in charge of scheduling both
> > the kthreads handling network load, and the user threads, and is able to
> > make better decisions.
> > [...]
>
> As I said in my reply to the RFC I see better performance with the
> workqueue implementation, so I would hold off until we have more
> conclusive results there, as this set adds fairly strong uAPI that
> we'll have to support forever.

We can make incremental changes, the kthread implementation looks much
nicer to us.

The unique work queue is a problem on server class platforms, with
NUMA placement. We now have servers with NICs on different NUMA nodes.

We can not introduce a new model that will make all workloads better
without any tuning. If you really think you can do that, think again.
Even the old 'fix' (commit 4cd13c21b207e80ddb1144c576500098f2d5f882
"softirq: Let ksoftirqd do its job") had severe issues for latency
sensitive jobs.

We need to be able to opt in to threads, and let the process scheduler
take decisions. If we believe the process scheduler takes bad
decisions, it should be reported to scheduler experts.

I fully support this implementation, I do not want to wait for yet
another 'work queue' model or scheduler classes.
On 2020-09-30 21:21, Wei Wang wrote:
> This commit mainly addresses the threaded config to make the switch
> between softirq based and kthread based NAPI processing not require
> a device down/up.
> It also moves the kthread_create() call to the sysfs handler when user
> tries to enable "threaded" on napi, and properly handles the
> kthread_create() failure. This is because certain drivers do not have
> the napi created and linked to the dev when dev_open() is called. So
> the previous implementation does not work properly there.
>
> Signed-off-by: Wei Wang <weiwan@google.com>
> ---
> Changes since RFC:
> changed the thread name to napi/<dev>-<napi-id>
>
>  net/core/dev.c       | 49 +++++++++++++++++++++++++-------------------
>  net/core/net-sysfs.c |  9 +++-----
>  2 files changed, 31 insertions(+), 27 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b4f33e442b5e..bf878d3a9d89 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers);
>
>  static int napi_threaded_poll(void *data);
>
> -static void napi_thread_start(struct napi_struct *n)
> +static int napi_kthread_create(struct napi_struct *n)
>  {
> -	if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread)
> -		n->thread = kthread_create(napi_threaded_poll, n, "%s-%d",
> -					   n->dev->name, n->napi_id);
> +	int err = 0;
> +
> +	n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d",
> +				   n->dev->name, n->napi_id);
> +	if (IS_ERR(n->thread)) {
> +		err = PTR_ERR(n->thread);
> +		pr_err("kthread_create failed with err %d\n", err);
> +		n->thread = NULL;
> +	}
> +
> +	return err;

If I remember correctly, using kthread_create with no explicit first
wakeup means the task will sit there and contribute to system loadavg
until it is woken up the first time.
Shouldn't we use kthread_run here instead?

- Felix
On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
>
> On 2020-09-30 21:21, Wei Wang wrote:
> > This commit mainly addresses the threaded config to make the switch
> > between softirq based and kthread based NAPI processing not require
> > a device down/up.
> > [...]
>
> If I remember correctly, using kthread_create with no explicit first
> wakeup means the task will sit there and contribute to system loadavg
> until it is woken up the first time.
> Shouldn't we use kthread_run here instead?

Right. kthread_create() basically creates the thread and leaves it in
sleep mode. I think that is what we want. We rely on the next
___napi_schedule() call to wake up this thread when there is work to
do.

> - Felix
On 2020-10-01 19:01, Wei Wang wrote:
> On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote:
>> [...]
>> If I remember correctly, using kthread_create with no explicit first
>> wakeup means the task will sit there and contribute to system loadavg
>> until it is woken up the first time.
>> Shouldn't we use kthread_run here instead?
>
> Right. kthread_create() basically creates the thread and leaves it in
> sleep mode. I think that is what we want. We rely on the next
> ___napi_schedule() call to wake up this thread when there is work to
> do.

But what if you have a device that's basically idle and napi isn't
scheduled until much later? It will get a confusing loadavg until then.
I'd prefer waking up the thread immediately and filtering going back to
sleep once in the thread function before running the loop if
NAPI_STATE_SCHED wasn't set.

- Felix
On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau <nbd@nbd.name> wrote:
>
> On 2020-10-01 19:01, Wei Wang wrote:
> > [...]
> > Right. kthread_create() basically creates the thread and leaves it in
> > sleep mode. I think that is what we want. We rely on the next
> > ___napi_schedule() call to wake up this thread when there is work to
> > do.
> But what if you have a device that's basically idle and napi isn't
> scheduled until much later? It will get a confusing loadavg until then.
> I'd prefer waking up the thread immediately and filtering going back to
> sleep once in the thread function before running the loop if
> NAPI_STATE_SCHED wasn't set.

I was not aware of this kthread_create() impact on loadavg.
This seems like a bug to me. (although I do not care about loadavg)

Do you have pointers on some documentation ?

Probably not a big deal, but this seems quite odd to me.
On 2020-10-01 20:03, Eric Dumazet wrote:
> On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau <nbd@nbd.name> wrote:
>> [...]
>> But what if you have a device that's basically idle and napi isn't
>> scheduled until much later? It will get a confusing loadavg until then.
>> I'd prefer waking up the thread immediately and filtering going back to
>> sleep once in the thread function before running the loop if
>> NAPI_STATE_SCHED wasn't set.
>
> I was not aware of this kthread_create() impact on loadavg.
> This seems like a bug to me. (although I do not care about loadavg)
>
> Do you have pointers on some documentation ?

I don't have any specific documentation pointers, but this is something
I observed on several occasions when playing with kthreads.
On Thu, Oct 1, 2020 at 11:38 AM Felix Fietkau <nbd@nbd.name> wrote:
>
> On 2020-10-01 20:03, Eric Dumazet wrote:
> > [...]
> > I was not aware of this kthread_create() impact on loadavg.
> > This seems like a bug to me. (although I do not care about loadavg)
> >
> > Do you have pointers on some documentation ?

I found this link:
http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
It has a section called "Linux Uninterruptible Tasks" which explains
this behavior specifically. But I don't see a good conclusion on why.
Seems to be a convention.
IMHO, this is actually the problem/decision of the loadavg. It should
not impact how the kernel code is implemented. I think it makes more
sense to only wake up the thread when there is work to do.

> I don't have any specific documentation pointers, but this is something
> I observed on several occasions when playing with kthreads.
>
> From what I can find in the loadavg code it seems that tasks in
> TASK_UNINTERRUPTIBLE state are counted for loadavg alongside actually
> runnable tasks. This seems intentional to me, but I don't know why it
> was made like this.
>
> A kthread does not start the thread function until it has been woken up
> at least once, most likely to give the creating code a chance to perform
> some initializations after successfully creating the thread, before the
> thread function starts doing something. Instead, kthread() sets
> TASK_UNINTERRUPTIBLE and calls schedule() once.
>
> > Probably not a big deal, but this seems quite odd to me.
> I've run into enough users that look at loadavg as a measure of system
> load and would likely start reporting bugs if they observe such
> behavior. I'd like to avoid that.
>
> - Felix
On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote:
> On Wed, Sep 30, 2020 at 10:08 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > [...]
> > As I said in my reply to the RFC I see better performance with the
> > workqueue implementation, so I would hold off until we have more
> > conclusive results there, as this set adds fairly strong uAPI that
> > we'll have to support forever.
>
> We can make incremental changes, the kthread implementation looks much
> nicer to us.

Having done two implementations of something more wq-like now
I can say with some confidence that it's quite likely not a
simple extension of this model. And since we'll likely need
to support switching at runtime there will be a fast-path
synchronization overhead.

> The unique work queue is a problem on server class platforms, with
> NUMA placement.
> We now have servers with NICs on different NUMA nodes.

Are you saying that the wq code is less NUMA friendly than unpinned
threads?

> We can not introduce a new model that will make all workloads better
> without any tuning.
> If you really think you can do that, think again.

Has Wei tested the wq implementation with real workloads?

All the cover letter has is some basic netperf runs and a vague
sentence saying "real workload also improved".

I think it's possible to get something that will be a better default
for 90% of workloads. Our current model predates SMP by two decades.
It's pretty bad.

I'm talking about upstream defaults, obviously; maybe you're starting
from a different baseline configuration than the rest of the world.

> Even the old 'fix' (commit 4cd13c21b207e80ddb1144c576500098f2d5f882
> "softirq: Let ksoftirqd do its job")
> had severe issues for latency sensitive jobs.
>
> We need to be able to opt in to threads, and let the process scheduler
> take decisions.
> If we believe the process scheduler takes bad decisions, it should be
> reported to scheduler experts.

I wouldn't expect that the scheduler will learn all by itself how to
group processes that run identical code for cache efficiency, and how
to schedule at 10us scale. I hope I'm wrong.

> I fully support this implementation, I do not want to wait for yet
> another 'work queue' model or scheduler classes.

I can't sympathize. I don't understand why you're trying to rush this.
And you're not giving me enough info about your target config to be able
to understand your thinking.
On 2020-10-01 21:24, Wei Wang wrote:
> On Thu, Oct 1, 2020 at 11:38 AM Felix Fietkau <nbd@nbd.name> wrote:
>> [...]
>> > I was not aware of this kthread_create() impact on loadavg.
>> > This seems like a bug to me. (although I do not care about loadavg)
>> >
>> > Do you have pointers on some documentation ?
>
> I found this link:
> http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html
> It has a section called "Linux Uninterruptible Tasks" which explains
> this behavior specifically. But I don't see a good conclusion on why.
> Seems to be a convention.
> IMHO, this is actually the problem/decision of the loadavg. It should
> not impact how the kernel code is implemented. I think it makes more
> sense to only wake up the thread when there is work to do.

There were other users of kthread where the same issue was fixed.
With a quick search, I found these commits:

e890591413819eeb604207ad3261ba617b2ec0bb
3f776e8a25a9d281125490562e1cc5bd7c14cf7c

Please note that one of these describes that a kthread that was created
but not woken was triggering a blocked task warning - so it's not just
the loadavg that matters here.

All the other users of kthread that I looked at also do an initial
wakeup of the thread. Not doing it seems like wrong use of the API to me.

- Felix
On Thu, Oct 1, 2020 at 1:26 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote:
> > [...]
> > We can make incremental changes, the kthread implementation looks much
> > nicer to us.
>
> Having done two implementations of something more wq-like now
> I can say with some confidence that it's quite likely not a
> simple extension of this model. And since we'll likely need
> to support switching at runtime there will be a fast-path
> synchronization overhead.
>
> > The unique work queue is a problem on server class platforms, with
> > NUMA placement.
> > We now have servers with NICs on different NUMA nodes.
>
> Are you saying that the wq code is less NUMA friendly than unpinned
> threads?
>
> > We can not introduce a new model that will make all workloads better
> > without any tuning.
> > If you really think you can do that, think again.
>
> Has Wei tested the wq implementation with real workloads?
>
> All the cover letter has is some basic netperf runs and a vague
> sentence saying "real workload also improved".

Yes. I did a round of testing with workqueue as well. The "real
workload" I mentioned is a Google-internal application benchmark which
involves networking as well as disk ops.

There are 2 types of tests there.
1 is sustained tests, where the ops/s is being pushed to very high, and
keeps the overall cpu usage to > 80%, with various sizes of payload. In
this type of test case, I see a better result with the kthread model
compared to workqueue in the latency metrics, and similar CPU savings,
with some tuning of the kthreads. (e.g., we limit the kthreads to a
pool of CPUs to run on, to avoid mixture with application threads. I
did the same for workqueue as well to be fair.)

The other is trace based tests, where the load is based on the actual
trace taken from the real servers. This kind of test has less load and
ops/s overall. (~25% total cpu usage on the host) In this test case, I
observe a similar amount of latency savings with both kthread and
workqueue, but workqueue seems to have better cpu saving here, possibly
due to less # of threads woken up to process the load.

And one reason we would like to push forward with 1 kthread per NAPI,
is we are also trying to do busy polling with the kthread. And it seems
a good model to have 1 kthread dedicated to 1 NAPI to begin with.

> I think it's possible to get something that will be a better default
> for 90% of workloads. Our current model predates SMP by two decades.
> It's pretty bad.
>
> I'm talking about upstream defaults, obviously; maybe you're starting
> from a different baseline configuration than the rest of the world.
>
> > Even the old 'fix' (commit 4cd13c21b207e80ddb1144c576500098f2d5f882
> > "softirq: Let ksoftirqd do its job")
> > had severe issues for latency sensitive jobs.
> >
> > We need to be able to opt in to threads, and let the process scheduler
> > take decisions.
> > If we believe the process scheduler takes bad decisions, it should be
> > reported to scheduler experts.
>
> I wouldn't expect that the scheduler will learn all by itself how to
> group processes that run identical code for cache efficiency, and how
> to schedule at 10us scale. I hope I'm wrong.
>
> > I fully support this implementation, I do not want to wait for yet
> > another 'work queue' model or scheduler classes.
>
> I can't sympathize. I don't understand why you're trying to rush this.
> And you're not giving me enough info about your target config to be
> able to understand your thinking.
On Thu, Oct 1, 2020 at 1:48 PM Felix Fietkau <nbd@nbd.name> wrote: > > > On 2020-10-01 21:24, Wei Wang wrote: > > On Thu, Oct 1, 2020 at 11:38 AM Felix Fietkau <nbd@nbd.name> wrote: > >> > >> > >> On 2020-10-01 20:03, Eric Dumazet wrote: > >> > On Thu, Oct 1, 2020 at 7:12 PM Felix Fietkau <nbd@nbd.name> wrote: > >> >> > >> >> On 2020-10-01 19:01, Wei Wang wrote: > >> >> > On Thu, Oct 1, 2020 at 3:01 AM Felix Fietkau <nbd@nbd.name> wrote: > >> >> >> > >> >> >> > >> >> >> On 2020-09-30 21:21, Wei Wang wrote: > >> >> >> > This commit mainly addresses the threaded config to make the switch > >> >> >> > between softirq based and kthread based NAPI processing not require > >> >> >> > a device down/up. > >> >> >> > It also moves the kthread_create() call to the sysfs handler when user > >> >> >> > tries to enable "threaded" on napi, and properly handles the > >> >> >> > kthread_create() failure. This is because certain drivers do not have > >> >> >> > the napi created and linked to the dev when dev_open() is called. So > >> >> >> > the previous implementation does not work properly there. 
> >> >> >> > > >> >> >> > Signed-off-by: Wei Wang <weiwan@google.com> > >> >> >> > --- > >> >> >> > Changes since RFC: > >> >> >> > changed the thread name to napi/<dev>-<napi-id> > >> >> >> > > >> >> >> > net/core/dev.c | 49 +++++++++++++++++++++++++------------------- > >> >> >> > net/core/net-sysfs.c | 9 +++----- > >> >> >> > 2 files changed, 31 insertions(+), 27 deletions(-) > >> >> >> > > >> >> >> > diff --git a/net/core/dev.c b/net/core/dev.c > >> >> >> > index b4f33e442b5e..bf878d3a9d89 100644 > >> >> >> > --- a/net/core/dev.c > >> >> >> > +++ b/net/core/dev.c > >> >> >> > @@ -1490,17 +1490,24 @@ EXPORT_SYMBOL(netdev_notify_peers); > >> >> >> > > >> >> >> > static int napi_threaded_poll(void *data); > >> >> >> > > >> >> >> > -static void napi_thread_start(struct napi_struct *n) > >> >> >> > +static int napi_kthread_create(struct napi_struct *n) > >> >> >> > { > >> >> >> > - if (test_bit(NAPI_STATE_THREADED, &n->state) && !n->thread) > >> >> >> > - n->thread = kthread_create(napi_threaded_poll, n, "%s-%d", > >> >> >> > - n->dev->name, n->napi_id); > >> >> >> > + int err = 0; > >> >> >> > + > >> >> >> > + n->thread = kthread_create(napi_threaded_poll, n, "napi/%s-%d", > >> >> >> > + n->dev->name, n->napi_id); > >> >> >> > + if (IS_ERR(n->thread)) { > >> >> >> > + err = PTR_ERR(n->thread); > >> >> >> > + pr_err("kthread_create failed with err %d\n", err); > >> >> >> > + n->thread = NULL; > >> >> >> > + } > >> >> >> > + > >> >> >> > + return err; > >> >> >> If I remember correctly, using kthread_create with no explicit first > >> >> >> wakeup means the task will sit there and contribute to system loadavg > >> >> >> until it is woken up the first time. > >> >> >> Shouldn't we use kthread_run here instead? > >> >> >> > >> >> > > >> >> > Right. kthread_create() basically creates the thread and leaves it in > >> >> > sleep mode. I think that is what we want. 
We rely on the next > >> >> > ___napi_schedule() call to wake up this thread when there is work to > >> >> > do. > >> >> But what if you have a device that's basically idle and napi isn't > >> >> scheduled until much later? It will get a confusing loadavg until then. > >> >> I'd prefer waking up the thread immediately and filtering going back to > >> >> sleep once in the thread function before running the loop if > >> >> NAPI_STATE_SCHED wasn't set. > >> >> > >> > > >> > I was not aware of this kthread_create() impact on loadavg. > >> > This seems like a bug to me. (although I do not care about loadavg) > >> > > >> > Do you have pointers on some documentation ? > > > > I found this link: > > http://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html > > It has a section called "Linux Uninterruptible Tasks" which explains > > this behavior specifically. But I don't see a good conclusion on why. > > Seems to be a convention. > > IMHO, this is actually the problem/decision of the loadavg. It should > > not impact how the kernel code is implemented. I think it makes more > > sense to only wake up the thread when there is work to do. > There were other users of kthread where the same issue was fixed. > With a quick search, I found these commits: > e890591413819eeb604207ad3261ba617b2ec0bb > 3f776e8a25a9d281125490562e1cc5bd7c14cf7c > > Please note that one of these describes that a kthread that was created > but not woken was triggering a blocked task warning - so it's not just > the loadavg that matters here. > > All the other users of kthread that I looked at also do an initial > wakeup of the thread. Not doing it seems like wrong use of the API to me. > Thanks Felix for digging up the above commits. Very helpful. I will change it to kthread_run() in v2. > - Felix
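[Editor's note] Felix's point about the initial wakeup can be illustrated with a userspace analogue. The sketch below (pthreads; all names hypothetical, not taken from the patch) shows the kthread_run()-style pattern under discussion: the thread is started right away, parks itself while nothing is scheduled, and only does work once the equivalent of ___napi_schedule() fires:

```c
/* Userspace pthread analogue of the kthread_run() pattern: start the
 * thread immediately, have it park itself until work is scheduled.
 * All names are hypothetical, not taken from the patch. */
#include <pthread.h>
#include <stdbool.h>

struct poller {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	bool scheduled;		/* analogue of NAPI_STATE_SCHED */
	bool stop;
	int polls;		/* rounds of work actually done */
};

static void *poll_thread(void *data)
{
	struct poller *p = data;

	pthread_mutex_lock(&p->lock);
	for (;;) {
		/* Woken at creation time with nothing scheduled:
		 * go straight back to sleep instead of spinning. */
		while (!p->scheduled && !p->stop)
			pthread_cond_wait(&p->cond, &p->lock);
		if (p->scheduled) {
			p->scheduled = false;
			p->polls++;	/* stand-in for napi->poll() */
			continue;
		}
		break;			/* stop requested, nothing pending */
	}
	pthread_mutex_unlock(&p->lock);
	return NULL;
}

int run_demo(void)
{
	struct poller p = { PTHREAD_MUTEX_INITIALIZER,
			    PTHREAD_COND_INITIALIZER, false, false, 0 };
	pthread_t t;

	pthread_create(&t, NULL, poll_thread, &p);	/* "kthread_run" */

	pthread_mutex_lock(&p.lock);			/* "__napi_schedule" */
	p.scheduled = true;
	pthread_cond_signal(&p.cond);
	pthread_mutex_unlock(&p.lock);

	pthread_mutex_lock(&p.lock);			/* shut down */
	p.stop = true;
	pthread_cond_signal(&p.cond);
	pthread_mutex_unlock(&p.lock);
	pthread_join(t, NULL);

	return p.polls;
}
```

This is only an analogue: the loadavg effect itself comes from a never-woken kthread sitting in TASK_UNINTERRUPTIBLE, which a userspace thread blocked on a condition variable does not reproduce. The structural point carries over, though: waking the thread once at creation is harmless because the wait loop sends it straight back to sleep when nothing is scheduled.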
On Thu, 1 Oct 2020 15:12:20 -0700 Wei Wang wrote: > Yes. I did a round of testing with workqueue as well. The "real > workload" I mentioned is a google internal application benchmark which > involves networking as well as disk ops. > There are 2 types of tests there. > 1 is sustained tests, where the ops/s is being pushed to very high, > and keeps the overall cpu usage to > 80%, with various sizes of > payload. > In this type of test case, I see a better result with the kthread > model compared to workqueue in the latency metrics, and similar CPU > savings, with some tuning of the kthreads. (e.g., we limit the > kthreads to a pool of CPUs to run on, to avoid mixture with > application threads. I did the same for workqueue as well to be fair.) Can you share the relative performance delta of this benchmark? Could you explain why threads are slower than ksoftirqd if you pin the application away? From your cover letter it sounded like you want the scheduler to see the NAPI load, but then you say you pinned the application away from the NAPI cores for the test, so I'm confused. > The other is trace based tests, where the load is based on the actual > trace taken from the real servers. This kind of test has less load and > ops/s overall. (~25% total cpu usage on the host) > In this test case, I observe a similar amount of latency savings with > both kthread and workqueue, but workqueue seems to have better cpu > saving here, possibly due to less # of threads woken up to process the > load. > > And one reason we would like to push forward with 1 kthread per NAPI, > is we are also trying to do busy polling with the kthread. And it > seems a good model to have 1 kthread dedicated to 1 NAPI to begin > with. And you'd pin those busy polling threads to a specific, single CPU, too? 1 cpu : 1 thread : 1 NAPI?
On Thu, Oct 1, 2020 at 4:46 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 1 Oct 2020 15:12:20 -0700 Wei Wang wrote: > > Yes. I did a round of testing with workqueue as well. The "real > > workload" I mentioned is a google internal application benchmark which > > involves networking as well as disk ops. > > There are 2 types of tests there. > > 1 is sustained tests, where the ops/s is being pushed to very high, > > and keeps the overall cpu usage to > 80%, with various sizes of > > payload. > > In this type of test case, I see a better result with the kthread > > model compared to workqueue in the latency metrics, and similar CPU > > savings, with some tuning of the kthreads. (e.g., we limit the > > kthreads to a pool of CPUs to run on, to avoid mixture with > > application threads. I did the same for workqueue as well to be fair.) > > Can you share the relative performance delta of this benchmark? > > Could you explain why threads are slower than ksoftirqd if you pin the > application away? From your cover letter it sounded like you want the > scheduler to see the NAPI load, but then you say you pinned the > application away from the NAPI cores for the test, so I'm confused. > No. We did not explicitly pin the application threads away. Application threads are free to run anywhere. What we do is we restrict the NAPI kthreads to only those CPUs handling rx interrupts. (For us, 8 cpus out of 56.) So the load on those CPUs is very high when running the test. And the scheduler is smart enough to avoid using those CPUs for the application threads automatically. Here are the results of one representative test:

           cpu/op  50%tile  95%tile  99%tile
base        71.47    417us   1.01ms    2.9ms
kthread     67.84    396us    976us    2.4ms
workqueue   69.68    386us    791us    1.9ms

Actually, I remembered it wrong. It does seem workqueue is doing better on latencies. But cpu/op wise, kthread seems to be a bit better.
> > The other is trace based tests, where the load is based on the actual > > trace taken from the real servers. This kind of test has less load and > > ops/s overall. (~25% total cpu usage on the host) > > In this test case, I observe a similar amount of latency savings with > > both kthread and workqueue, but workqueue seems to have better cpu > > saving here, possibly due to less # of threads woken up to process the > > load. > > > > And one reason we would like to push forward with 1 kthread per NAPI, > > is we are also trying to do busy polling with the kthread. And it > > seems a good model to have 1 kthread dedicated to 1 NAPI to begin > > with. > > And you'd pin those busy polling threads to a specific, single CPU, too? > 1 cpu : 1 thread : 1 NAPI? Yes. That is my thought.
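[Editor's note] The CPU-pool restriction Wei describes is, mechanically, just an affinity mask on the per-NAPI kthreads. The sketch below shows the CPU_SET()/sched_setaffinity() interface that taskset(1) drives; CPU 0 here is a hypothetical stand-in for the pool of CPUs handling rx interrupts:

```c
/* Sketch of what pinning a task to a CPU pool amounts to: the same
 * affinity interface taskset(1) uses. CPU 0 stands in for the
 * hypothetical rx-irq pool. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Pin a task to CPU 0 and confirm the mask took effect.
 * pid 0 means "the calling thread"; for a napi kthread you would pass
 * its tid, exactly as `taskset -p` does. Returns 1 on success. */
int pin_to_cpu0(pid_t pid)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(0, &set);
	if (sched_setaffinity(pid, sizeof(set), &set))
		return 0;

	/* read the mask back to confirm */
	CPU_ZERO(&set);
	if (sched_getaffinity(pid, sizeof(set), &set))
		return 0;
	return CPU_ISSET(0, &set) && CPU_COUNT(&set) == 1;
}
```

No kernel-side policy is involved: once the NAPI work runs in a thread, the existing affinity tooling applies to it unchanged, which is the configurability argument from the cover letter.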
On Thu, Oct 1, 2020 at 10:26 PM Jakub Kicinski <kuba@kernel.org> wrote: > > On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote: > > The unique work queue is a problem on server class platforms, with > > NUMA placement. > > We now have servers with NIC on different NUMA nodes. > > Are you saying that the wq code is less NUMA friendly than unpinned > threads? Yes, this is what I am saying. Using a single, shared wq won't allow you to make sure: - work for NIC0 attached on NUMA node#0 will be using CPUs belonging to node#0 - work for NIC1 attached on NUMA node#1 will be using CPUs belonging to node#1 The only way you can tune things with a single wq is tweaking a single cpumask, which we can change with /sys/devices/virtual/workqueue/{wqname}/cpumask The same for the nice value with /sys/devices/virtual/workqueue/{wqname}/nice. In contrast, having kthreads lets you tune things independently, if needed. Even with a single NIC, you can still need isolation between queues. We have queues dedicated to a certain kind of traffic/application. The work queue approach would need to be able to create/delete independent workqueues. But we tested the workqueue with a single NIC and our results gave kthreads a win over the work queue. Really, the wq concept might be a nice abstraction when each work item can run for arbitrary durations on arbitrary numbers of cpus, but with the NAPI model of up to 64 packets at a time, and a fixed number of queues, we should not add the work queue overhead.
On Fri, 2 Oct 2020 09:56:31 +0200 Eric Dumazet wrote: > On Thu, Oct 1, 2020 at 10:26 PM Jakub Kicinski <kuba@kernel.org> wrote: > > > > On Thu, 1 Oct 2020 09:52:45 +0200 Eric Dumazet wrote: > > > > The unique work queue is a problem on server class platforms, with > > > NUMA placement. > > > We now have servers with NIC on different NUMA nodes. > > > > Are you saying that the wq code is less NUMA friendly than unpinned > > threads? > > Yes this is what I am saying. > > Using a single and shared wq wont allow you to make sure : > - work for NIC0 attached on NUMA node#0 will be using CPUS belonging to node#0 > - work for NIC1 attached on NUMA node#1 will be using CPUS belonging to node#1 > > > The only way you can tune things with a single wq is tweaking a single cpumask, > that we can change with /sys/devices/virtual/workqueue/{wqname}/cpumask > The same for the nice value with /sys/devices/virtual/workqueue/{wqname}/nice. > > In contrast, having kthreads let you tune things independently, if needed. > > Even with a single NIC, you can still need isolation between queues. > We have queues dedicated to a certain kind of traffic/application. > > The work queue approach would need to be able to create/delete > independent workqueues. > But we tested the workqueue with a single NIC and our results gave to > kthreads a win over the work queue. Not according to the results Wei posted last night.. > Really, wq concept might be a nice abstraction when each work can be > running for arbitrary durations, > and arbitrary numbers of cpus, but with the NAPI model of up to 64 > packets at a time, and a fixed number of queues, In my experiments the worker threads get stalled sooner or later. And unless there is some work stealing going on latency spikes follow. I would also not discount the variability in processing time. For a budget of 64 the processing can take 0-500us per round, not counting outliers. > we should not add the work queue overhead. 
Does this mean you're going to be against the (more fleshed out) work queue implementation?
On Thu, 1 Oct 2020 18:44:40 -0700 Wei Wang wrote: > > Can you share relative performance delta of this banchmark? > > > > Could you explain why threads are slower than ksoftirqd if you pin the > > application away? From your cover letter it sounded like you want the > > scheduler to see the NAPI load, but then you say you pinned the > > application away from the NAPI cores for the test, so I'm confused. > > No. We did not explicitly pin the application threads away. > Application threads are free to run anywhere. What we do is we > restrict the NAPI kthreads to only those CPUs handling rx interrupts. Whatever. You pin the NAPI threads and hand-tune their number so the load of the NAPI CPUs is always higher. If the workload changes the system will get very unhappy. > (For us, 8 cpus out of 56.) So the load on those CPUs are very high > when running the test. And the scheduler is smart enough to avoid > using those CPUs for the application threads automatically. > Here is the results of 1 representative test result: > cpu/op 50%tile 95%tile 99%tile > base 71.47 417us 1.01ms 2.9ms > kthread 67.84 396us 976us 2.4ms > workqueue 69.68 386us 791us 1.9ms Did you renice ksoftirqd in "base"? > Actually, I remembered it wrong. It does seem workqueue is doing > better on latencies. But cpu/op wise, kthread seems to be a bit > better. Q.E.D.
From: Wei Wang <weiwan@google.com> Date: Wed, 30 Sep 2020 12:21:35 -0700 ... > And the reason we prefer 1 kthread per napi, instead of 1 workqueue > entity per host, is that kthread is more configurable than workqueue, > and we could leverage existing tuning tools for threads, like taskset, > chrt, etc to tune scheduling class and cpu set, etc. Another reason is > if we eventually want to provide busy poll feature using kernel threads > for napi poll, kthread seems to be more suitable than workqueue. ... I think we still need to discuss this some more. Jakub has some ideas and I honestly think the whole workqueue approach hasn't been fully considered yet. If this wasn't urgent years ago (when it was NACK'd btw), it isn't urgent for 5.10, so I don't know why we are pushing so hard for this patch series to go in as-is right now. Please be patient and let's have a full discussion on this. Thank you.
On Fri, Oct 2, 2020 at 4:02 PM David Miller <davem@davemloft.net> wrote: > > From: Wei Wang <weiwan@google.com> > Date: Wed, 30 Sep 2020 12:21:35 -0700 > > ... > > And the reason we prefer 1 kthread per napi, instead of 1 workqueue > > entity per host, is that kthread is more configurable than workqueue, > > and we could leverage existing tuning tools for threads, like taskset, > > chrt, etc to tune scheduling class and cpu set, etc. Another reason is > > if we eventually want to provide busy poll feature using kernel threads > > for napi poll, kthread seems to be more suitable than workqueue. > ... > > I think we still need to discuss this some more. > > Jakub has some ideas and I honestly think the whole workqueue > approach hasn't been fully considered yet. I want to point out that it's not kthread vs wq. I think the mechanism has to be pluggable. The kernel needs to support both kthread and wq. Or maybe even a 3rd option, whatever it might be. Via sysctl or something. I suspect for some production workloads wq will perform better. For the others it will be kthread. Clearly kthread is more tunable, but not everyone would have the knowledge and desire to do the tuning. We can argue what should be the default, but that's secondary. > If this wasn't urgent years ago (when it was NACK'd btw), it isn't > urgent for 5.10, so I don't know why we are pushing so hard for > this patch series to go in as-is right now. > > Please be patient and let's have a full discussion on this. +1. This is the biggest change to kernel networking in years. Let's make it right.
On Sat, Oct 3, 2020 at 1:15 AM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Oct 2, 2020 at 4:02 PM David Miller <davem@davemloft.net> wrote: > > > > From: Wei Wang <weiwan@google.com> > > Date: Wed, 30 Sep 2020 12:21:35 -0700 > > > > ... > > > And the reason we prefer 1 kthread per napi, instead of 1 workqueue > > > entity per host, is that kthread is more configurable than workqueue, > > > and we could leverage existing tuning tools for threads, like taskset, > > > chrt, etc to tune scheduling class and cpu set, etc. Another reason is > > > if we eventually want to provide busy poll feature using kernel threads > > > for napi poll, kthread seems to be more suitable than workqueue. > > ... > > > > I think we still need to discuss this some more. > > > > Jakub has some ideas and I honestly think the whole workqueue > > approach hasn't been fully considered yet. > > I want to point out that it's not kthread vs wq. I think the mechanism > has to be pluggable. The kernel needs to support both kthread and wq. > Or maybe even a 3rd option. Whatever it might be. > Via sysctl or something. > I suspect for some production workloads wq will perform better. > For the others it will be kthread. > Clearly kthread is more tunable, but not everyone would have > the knowledge and desire to do the tuning. The exact same arguments can be used against RPS and RFS. RPS went first, and was later augmented with RFS (with very different goals). They both are opt-in. > We can argue what should be the default, but that's secondary. > > > If this wasn't urgent years ago (when it was NACK'd btw), it isn't > > urgent for 5.10, so I don't know why we are pushing so hard for > > this patch series to go in as-is right now. > > > > Please be patient and let's have a full discussion on this. > > +1. This is the biggest change to kernel networking in years. > Let's make it right. Sure. I do not think it is urgent.
This was revived by Felix some weeks ago, and I think we were the ones who spent time on the proposed patch set. Not giving feedback to Felix would have been "something not right". We reviewed and tested Felix's patches, and came up with something more flexible. Sure, a WQ is already giving nice results on appliances, because there you do not need strong isolation. Would a kthread approach also work well on appliances? Probably...
On Sat, Oct 03, 2020 at 05:54:38AM +0200, Eric Dumazet wrote: > > Sure, a WQ is already giving nice results on appliances, because there > you do not need strong isolation. > Would a kthread approach also work well on appliances? Probably... Right. I think we're on the same page. The only reason I'm bringing up multiple co-existing approaches now is to make sure they are designed in from the start instead of as an afterthought. Two implementations already exist, but it doesn't look like they can co-exist in the kernel. Like this NAPI_STATE_THREADED bit in this patch set. Should we burn that bit for the kthread approach and another bit for the workqueue-based one? I don't know. As the user of the feature I would be happy with any mechanism as long as I can flip between them at runtime :) Just like RPS and RFS.
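[Editor's note] Runtime flipping of the kind Alexei asks for is cheap if the mode is a single state bit along the lines of NAPI_STATE_THREADED. A hedged sketch with C11 atomics (the bit layout and names are illustrative, not the kernel's):

```c
/* Sketch of a runtime-switchable mode bit: one bit in an atomic word,
 * flipped from a slow path (e.g. a sysfs write), tested on the fast
 * path. Bit layout and names are illustrative, not the kernel's. */
#include <stdatomic.h>

#define MODE_THREADED (1UL << 0)	/* hypothetical bit numbering */

static atomic_ulong napi_state;

/* slow path: the sysfs "threaded" write handler would end up here */
void set_threaded(int on)
{
	if (on)
		atomic_fetch_or(&napi_state, MODE_THREADED);
	else
		atomic_fetch_and(&napi_state, ~MODE_THREADED);
}

/* fast path: decide between waking the kthread and raising a softirq */
int use_kthread(void)
{
	return !!(atomic_load(&napi_state) & MODE_THREADED);
}
```

Switching then reduces to one atomic bit flip, though as Jakub notes earlier in the thread, the fast path still pays a synchronization cost for testing the bit on every schedule.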