Message ID | 20221208183101.1162006-1-yury.norov@gmail.com |
---|---|
Headers | show |
Series | cpumask: improve on cpumask_local_spread() locality | expand |
On 12/8/2022 10:30 AM, Yury Norov wrote: > cpumask_local_spread() currently checks local node for presence of i'th > CPU, and then if it finds nothing makes a flat search among all non-local > CPUs. We can do it better by checking CPUs per NUMA hops. > > This series is inspired by Tariq Toukan and Valentin Schneider's > "net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity > hints" > > https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/ > > According to their measurements, for mlx5e: > > Bottleneck in RX side is released, reached linerate (~1.8x speedup). > ~30% less cpu util on TX. > > This patch makes cpumask_local_spread() traversing CPUs based on NUMA > distance, just as well, and I expect comparable improvement for its > users, as in case of mlx5e. > > I tested new behavior on my VM with the following NUMA configuration: > > root@debian:~# numactl -H > available: 4 nodes (0-3) > node 0 cpus: 0 1 2 3 > node 0 size: 3869 MB > node 0 free: 3740 MB > node 1 cpus: 4 5 > node 1 size: 1969 MB > node 1 free: 1937 MB > node 2 cpus: 6 7 > node 2 size: 1967 MB > node 2 free: 1873 MB > node 3 cpus: 8 9 10 11 12 13 14 15 > node 3 size: 7842 MB > node 3 free: 7723 MB > node distances: > node 0 1 2 3 > 0: 10 50 30 70 > 1: 50 10 70 30 > 2: 30 70 10 50 > 3: 70 30 50 10 > > And the cpumask_local_spread() for each node and offset traversing looks > like this: > > node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15 > node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7 > node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5 > node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3 > > v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/ > v2: https://lore.kernel.org/all/20221112190946.728270-3-yury.norov@gmail.com/T/ > v3: > - fix typo in find_nth_and_andnot_bit(); > - add 5th patch that simplifies cpumask_local_spread(); > - address various coding style nits. > The whole series look reasonable to me! Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
> Now after moving all NUMA logic into sched_numa_find_nth_cpu(), > else-branch of cpumask_local_spread() is just a function call, and > we can simplify logic by using ternary operator. > > While here, replace BUG() with WARN(). Why make this change? It's still as bad to hit the WARN_ON as it was before. > Signed-off-by: Yury Norov yury.norov@gmail.com > > --- > lib/cpumask.c | 16 ++++++---------- > 1 file changed, 6 insertions(+), 10 deletions(-) > > diff --git a/lib/cpumask.c b/lib/cpumask.c > index 255974cd6734..c7029fb3c372 100644 > --- a/lib/cpumask.c > +++ b/lib/cpumask.c > @@ -127,16 +127,12 @@ unsigned int cpumask_local_spread(unsigned int i, int node) > /* Wrap: we always want a cpu. */ > i %= num_online_cpus(); > > - if (node == NUMA_NO_NODE) { > - cpu = cpumask_nth(i, cpu_online_mask); > - if (cpu < nr_cpu_ids) > - return cpu; > - } else { > - cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node); > - if (cpu < nr_cpu_ids) > - return cpu; > - } > - BUG(); > + cpu = node == NUMA_NO_NODE ? > + cpumask_nth(i, cpu_online_mask) : > + sched_numa_find_nth_cpu(cpu_online_mask, i, node); I find the if version clearer, and cleaner too if you drop the brackets. For the ternary version it would be nice to parenthesize the equality like you did in cmp() in 3/5. > + > + WARN_ON(cpu >= nr_cpu_ids); > > + return cpu; > } > EXPORT_SYMBOL(cpumask_local_spread); > > -- > 2.34.1 Minor nit: cmp() in 3/5 could use a longer name. The file's long, and cmp() doesn't explain _what_ it's comparing. How about cmp_cpumask() or something related to the function using it? Other than the above particularities, the whole series looks good to me. Reviewed-by: Peter Lafreniere <peter@n8pjl.ca>
On 12/8/2022 8:30 PM, Yury Norov wrote: > cpumask_local_spread() currently checks local node for presence of i'th > CPU, and then if it finds nothing makes a flat search among all non-local > CPUs. We can do it better by checking CPUs per NUMA hops. > > This series is inspired by Tariq Toukan and Valentin Schneider's > "net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity > hints" > > https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/ > > According to their measurements, for mlx5e: > > Bottleneck in RX side is released, reached linerate (~1.8x speedup). > ~30% less cpu util on TX. > > This patch makes cpumask_local_spread() traversing CPUs based on NUMA > distance, just as well, and I expect comparable improvement for its > users, as in case of mlx5e. > > I tested new behavior on my VM with the following NUMA configuration: > > root@debian:~# numactl -H > available: 4 nodes (0-3) > node 0 cpus: 0 1 2 3 > node 0 size: 3869 MB > node 0 free: 3740 MB > node 1 cpus: 4 5 > node 1 size: 1969 MB > node 1 free: 1937 MB > node 2 cpus: 6 7 > node 2 size: 1967 MB > node 2 free: 1873 MB > node 3 cpus: 8 9 10 11 12 13 14 15 > node 3 size: 7842 MB > node 3 free: 7723 MB > node distances: > node 0 1 2 3 > 0: 10 50 30 70 > 1: 50 10 70 30 > 2: 30 70 10 50 > 3: 70 30 50 10 > > And the cpumask_local_spread() for each node and offset traversing looks > like this: > > node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15 > node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7 > node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5 > node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3 > > v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/ > v2: https://lore.kernel.org/all/20221112190946.728270-3-yury.norov@gmail.com/T/ > v3: > - fix typo in find_nth_and_andnot_bit(); > - add 5th patch that simplifies cpumask_local_spread(); > - address various coding style nits. > > Yury Norov (5): > lib/find: introduce find_nth_and_andnot_bit > cpumask: introduce cpumask_nth_and_andnot > sched: add sched_numa_find_nth_cpu() > cpumask: improve on cpumask_local_spread() locality > lib/cpumask: reorganize cpumask_local_spread() logic > > include/linux/cpumask.h | 20 ++++++++++++++ > include/linux/find.h | 33 +++++++++++++++++++++++ > include/linux/topology.h | 8 ++++++ > kernel/sched/topology.c | 57 ++++++++++++++++++++++++++++++++++++++++ > lib/cpumask.c | 26 +++++------------- > lib/find_bit.c | 9 +++++++ > 6 files changed, 134 insertions(+), 19 deletions(-) > Acked-by: Tariq Toukan <tariqt@nvidia.com>
On Thu, Dec 08, 2022 at 08:17:22PM +0000, Peter Lafreniere wrote: > > Now after moving all NUMA logic into sched_numa_find_nth_cpu(), > > else-branch of cpumask_local_spread() is just a function call, and > > we can simplify logic by using ternary operator. > > > > While here, replace BUG() with WARN(). > Why make this change? It's still as bad to hit the WARN_ON as it was before. For example, because of this: > Greg, please don't do this > > > ChangeSet@1.614, 2002-09-05 08:33:20-07:00, greg@kroah.com > > USB: storage driver: replace show_trace() with BUG() > > that BUG() thing is _way_ out of line, and has killed a few of my machines > several times for no good reason. It actively hurts debuggability, because > the machine is totally dead after it, and the whole and ONLY point of > BUG() messages is to help debugging and make it clear that we can't handle > something. > > In this case, we _can_ handle it, and we're much better off with a machine > that works and that you can look up the messages with than killing it. > > Rule of thumb: BUG() is only good for something that never happens and > that we really have no other option for (ie state is so corrupt that > continuing is deadly). > > Linus
On Thu, Dec 08, 2022 at 10:22:22PM +0200, Tariq Toukan wrote: > > > On 12/8/2022 8:30 PM, Yury Norov wrote: > > cpumask_local_spread() currently checks local node for presence of i'th > > CPU, and then if it finds nothing makes a flat search among all non-local > > CPUs. We can do it better by checking CPUs per NUMA hops. > > > > This series is inspired by Tariq Toukan and Valentin Schneider's > > "net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity > > hints" > > > > https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/ > > > > According to their measurements, for mlx5e: > > > > Bottleneck in RX side is released, reached linerate (~1.8x speedup). > > ~30% less cpu util on TX. > > > > This patch makes cpumask_local_spread() traversing CPUs based on NUMA > > distance, just as well, and I expect comparable improvement for its > > users, as in case of mlx5e. > > > > I tested new behavior on my VM with the following NUMA configuration: > > > > root@debian:~# numactl -H > > available: 4 nodes (0-3) > > node 0 cpus: 0 1 2 3 > > node 0 size: 3869 MB > > node 0 free: 3740 MB > > node 1 cpus: 4 5 > > node 1 size: 1969 MB > > node 1 free: 1937 MB > > node 2 cpus: 6 7 > > node 2 size: 1967 MB > > node 2 free: 1873 MB > > node 3 cpus: 8 9 10 11 12 13 14 15 > > node 3 size: 7842 MB > > node 3 free: 7723 MB > > node distances: > > node 0 1 2 3 > > 0: 10 50 30 70 > > 1: 50 10 70 30 > > 2: 30 70 10 50 > > 3: 70 30 50 10 > > > > And the cpumask_local_spread() for each node and offset traversing looks > > like this: > > > > node 0: 0 1 2 3 6 7 4 5 8 9 10 11 12 13 14 15 > > node 1: 4 5 8 9 10 11 12 13 14 15 0 1 2 3 6 7 > > node 2: 6 7 0 1 2 3 8 9 10 11 12 13 14 15 4 5 > > node 3: 8 9 10 11 12 13 14 15 4 5 6 7 0 1 2 3 > > > > v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/ > > v2: https://lore.kernel.org/all/20221112190946.728270-3-yury.norov@gmail.com/T/ > > v3: > > - fix typo in find_nth_and_andnot_bit(); > > - add 5th patch that simplifies cpumask_local_spread(); > > - address various coding style nits. > > > > Yury Norov (5): > > lib/find: introduce find_nth_and_andnot_bit > > cpumask: introduce cpumask_nth_and_andnot > > sched: add sched_numa_find_nth_cpu() > > cpumask: improve on cpumask_local_spread() locality > > lib/cpumask: reorganize cpumask_local_spread() logic > > > > include/linux/cpumask.h | 20 ++++++++++++++ > > include/linux/find.h | 33 +++++++++++++++++++++++ > > include/linux/topology.h | 8 ++++++ > > kernel/sched/topology.c | 57 ++++++++++++++++++++++++++++++++++++++++ > > lib/cpumask.c | 26 +++++------------- > > lib/find_bit.c | 9 +++++++ > > 6 files changed, 134 insertions(+), 19 deletions(-) > > > > Acked-by: Tariq Toukan <tariqt@nvidia.com> Thanks Tariq, Jacob and Peter for review. I'll add the series in bitmap-for-next for testing. Still, I think that sched/numa branches would be more suitable. Thanks, Yury
> On Thu, Dec 08, 2022 at 08:17:22PM +0000, Peter Lafreniere wrote: > > > Now after moving all NUMA logic into sched_numa_find_nth_cpu(), > > > else-branch of cpumask_local_spread() is just a function call, and > > > we can simplify logic by using ternary operator. > > > > > > While here, replace BUG() with WARN(). > > Why make this change? It's still as bad to hit the WARN_ON as it was before. > > For example, because of this: > > > Greg, please don't do this > > > > > ChangeSet@1.614, 2002-09-05 08:33:20-07:00, greg@kroah.com > > > USB: storage driver: replace show_trace() with BUG() > > > > that BUG() thing is _way_ out of line, and has killed a few of my machines > > several times for no good reason. It actively hurts debuggability, because > > the machine is totally dead after it, and the whole and ONLY point of > > BUG() messages is to help debugging and make it clear that we can't handle > > something. > > > > In this case, we _can_ handle it, and we're much better off with a machine > > that works and that you can look up the messages with than killing it. > > > > Rule of thumb: BUG() is only good for something that never happens and > > that we really have no other option for (ie state is so corrupt that > > continuing is deadly). > > > > Linus Fair enough. It's not like it'll be hit anyway. My concern was for if any of the 23 callers get an invalid result. I guess that if that causes a crash, then so be it. We have the warning to track down the cause. Thanks for the explanation, Peter Lafreniere <peter@n8pjl.ca>