Message ID | 20221112190946.728270-1-yury.norov@gmail.com
---|---
Series | cpumask: improve on cpumask_local_spread() locality
Hi,

On 12/11/22 11:09, Yury Norov wrote:
> cpumask_local_spread() currently checks the local node for the presence
> of the i'th CPU, and if it finds nothing, makes a flat search among all
> non-local CPUs. We can do better by checking CPUs per NUMA hop.
>
> This series is inspired by Tariq Toukan and Valentin Schneider's "net/mlx5e:
> Improve remote NUMA preferences used for the IRQ affinity hints"
>
> https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/
>
> According to their measurements, for mlx5e:
>
>   The bottleneck on the RX side is released, reaching line rate (~1.8x speedup).
>   ~30% less CPU utilization on TX.
>
> This patch makes cpumask_local_spread() traverse CPUs based on NUMA
> distance in the same way, and I expect a comparable improvement for its
> users, as in the mlx5e case.
>
> I tested the new behavior on my VM with the following NUMA configuration:
>
> root@debian:~# numactl -H
> available: 4 nodes (0-3)
> node 0 cpus: 0 1 2 3
> node 0 size: 3869 MB
> node 0 free: 3740 MB
> node 1 cpus: 4 5
> node 1 size: 1969 MB
> node 1 free: 1937 MB
> node 2 cpus: 6 7
> node 2 size: 1967 MB
> node 2 free: 1873 MB
> node 3 cpus: 8 9 10 11 12 13 14 15
> node 3 size: 7842 MB
> node 3 free: 7723 MB
> node distances:
> node   0   1   2   3
>   0:  10  50  30  70
>   1:  50  10  70  30
>   2:  30  70  10  50
>   3:  70  30  50  10
>
> With that, the cpumask_local_spread() traversal for each node and
> increasing offset looks like this:
>
> node 0:   0  1  2  3  6  7  4  5  8  9 10 11 12 13 14 15
> node 1:   4  5  8  9 10 11 12 13 14 15  0  1  2  3  6  7
> node 2:   6  7  0  1  2  3  8  9 10 11 12 13 14 15  4  5
> node 3:   8  9 10 11 12 13 14 15  4  5  6  7  0  1  2  3

Is this meant as a replacement for [1]?

I like that this is changing an existing interface so that all current
users directly benefit from the change. Now, about half of the users of
cpumask_local_spread() use it in a loop with an incremental @i parameter,
which makes the repeated bsearch a bit of a shame, but then I'm tempted to
say the first point makes it worth it.

[1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/

> v1: https://lore.kernel.org/lkml/20221111040027.621646-5-yury.norov@gmail.com/T/
> v2:
>  - use bsearch() in sched_numa_find_nth_cpu();
>  - fix missing 'static inline' in 3rd patch.
>
> Yury Norov (4):
>   lib/find: introduce find_nth_and_andnot_bit
>   cpumask: introduce cpumask_nth_and_andnot
>   sched: add sched_numa_find_nth_cpu()
>   cpumask: improve on cpumask_local_spread() locality
>
>  include/linux/cpumask.h  | 20 +++++++++++++++
>  include/linux/find.h     | 33 ++++++++++++++++++++++++
>  include/linux/topology.h |  8 ++++++
>  kernel/sched/topology.c  | 55 ++++++++++++++++++++++++++++++++++++++++
>  lib/cpumask.c            | 12 ++-------
>  lib/find_bit.c           |  9 +++++++
>  6 files changed, 127 insertions(+), 10 deletions(-)
>
> --
> 2.34.1
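For illustration, here is a minimal userspace model of the hop-ordered
traversal described above. It is not the kernel implementation; it simply
sorts nodes by distance, using the topology copied from the numactl output,
and reproduces the four orderings listed in the cover letter:

    /* Toy model: for each node, order all nodes by distance and list
     * their CPUs.  Topology data copied from the numactl -H output. */
    #include <stdio.h>

    #define NR_NODES 4

    static const int dist[NR_NODES][NR_NODES] = {
        { 10, 50, 30, 70 },
        { 50, 10, 70, 30 },
        { 30, 70, 10, 50 },
        { 70, 30, 50, 10 },
    };

    static const int node_cpus[NR_NODES][9] = {   /* -1 terminates */
        { 0, 1, 2, 3, -1 },
        { 4, 5, -1 },
        { 6, 7, -1 },
        { 8, 9, 10, 11, 12, 13, 14, 15, -1 },
    };

    int main(void)
    {
        for (int node = 0; node < NR_NODES; node++) {
            int order[NR_NODES] = { 0, 1, 2, 3 };

            /* sort nodes by distance from @node */
            for (int i = 1; i < NR_NODES; i++)
                for (int j = i; j > 0; j--)
                    if (dist[node][order[j]] < dist[node][order[j - 1]]) {
                        int tmp = order[j];
                        order[j] = order[j - 1];
                        order[j - 1] = tmp;
                    }

            printf("node %d:", node);
            for (int i = 0; i < NR_NODES; i++)
                for (const int *cpu = node_cpus[order[i]]; *cpu >= 0; cpu++)
                    printf(" %d", *cpu);
            printf("\n");
        }
        return 0;
    }

Running it prints exactly the four "node N: ..." lines quoted above, which
is a quick way to sanity-check the expected traversal on this topology.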
On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
> Hi,
>
> On 12/11/22 11:09, Yury Norov wrote:
> > cpumask_local_spread() currently checks the local node for the presence
> > of the i'th CPU, and if it finds nothing, makes a flat search among all
> > non-local CPUs. We can do better by checking CPUs per NUMA hop.
[...]
>
> Is this meant as a replacement for [1]?

No. Your series adds an iterator, and in my experience code that uses
iterators of that sort is almost always better and easier to understand
than cpumask_nth()- or cpumask_next()-like users.

The only advantage of my series is that it leaves the existing codebase
untouched.

> I like that this is changing an existing interface so that all current
> users directly benefit from the change. Now, about half of the users of
> cpumask_local_spread() use it in a loop with an incremental @i parameter,
> which makes the repeated bsearch a bit of a shame, but then I'm tempted to
> say the first point makes it worth it.
>
> [1]: https://lore.kernel.org/all/20221028164959.1367250-1-vschneid@redhat.com/

For the very common case of invoking cpumask_local_spread() sequentially
for CPUs 0 through nr_cpu_ids, the complexity of my approach is
O(n * log n), while yours is amortized O(n), which is better. Not a big
deal _now_, as you mentioned in the other email. But we never know how
things will evolve, right?

So, I would take both, and maybe mention in a comment to
cpumask_local_spread() that there is a better alternative for those who
call the function for all CPUs incrementally.

Thanks,
Yury
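To make the complexity argument concrete, here is a self-contained sketch
of the bsearch step behind sched_numa_find_nth_cpu(), assuming cumulative
CPU counts per NUMA hop have already been computed. The function name and
data layout here are illustrative, not the kernel's:

    /* Sketch: given cumulative CPU counts per NUMA hop (hop 0 = local
     * node, each further hop adds the CPUs of the next-nearest nodes),
     * find the hop containing the n'th CPU in O(log nr_hops). */
    #include <stdio.h>

    static int find_hop(const int *cum_cpus, int nr_hops, int n)
    {
        int lo = 0, hi = nr_hops;       /* answer lies in [lo, hi) */

        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;

            if (n < cum_cpus[mid])
                hi = mid;               /* n'th CPU is at hop <= mid */
            else
                lo = mid + 1;           /* n'th CPU is beyond hop mid */
        }
        return lo;
    }

    int main(void)
    {
        /* Node 0 of the topology above: the hops contribute 4, 2, 2
         * and 8 CPUs, giving these cumulative counts: */
        const int cum_cpus[] = { 4, 6, 8, 16 };

        /* The 6th CPU (n = 5) of node 0's order is CPU 7, at hop 1. */
        printf("n=5 -> hop %d\n", find_hop(cum_cpus, 4, 5));
        return 0;
    }

Each cpumask_local_spread()-style call pays this O(log n) search, hence
O(n log n) for a full sweep; an iterator walks the hops once, hence
amortized O(n).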
On 15/11/22 10:32, Yury Norov wrote:
> On Tue, Nov 15, 2022 at 05:24:56PM +0000, Valentin Schneider wrote:
>>
>> Is this meant as a replacement for [1]?
>
> No. Your series adds an iterator, and in my experience code that uses
> iterators of that sort is almost always better and easier to understand
> than cpumask_nth()- or cpumask_next()-like users.
>
> The only advantage of my series is that it leaves the existing codebase
> untouched.

Right

[...]
>
> So, I would take both, and maybe mention in a comment to
> cpumask_local_spread() that there is a better alternative for those who
> call the function for all CPUs incrementally.

Ack, sounds good.

> Thanks,
> Yury
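For reference, the note being agreed on might read roughly as follows;
this wording is hypothetical, sketched from the discussion rather than
taken from any merged patch:

    /**
     * cpumask_local_spread - select the i'th CPU based on NUMA locality
     * @i: index number
     * @node: local NUMA node
     *
     * Returns an online CPU, preferring CPUs on the local node first
     * and then CPUs on nodes at increasing NUMA distance.
     *
     * Note: every call performs a fresh search.  Callers that loop
     * over all CPUs with an incrementing @i may be better served by an
     * iterator-style interface such as the one proposed in [1], which
     * amortizes the cost across the loop.
     */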
On 11/17/2022 2:23 PM, Valentin Schneider wrote:
> On 15/11/22 10:32, Yury Norov wrote:
>> So, I would take both, and maybe mention in a comment to
>> cpumask_local_spread() that there is a better alternative for those who
>> call the function for all CPUs incrementally.
>
> Ack, sounds good.

Good.
Is a respin needed, to add the comment mentioned above?
On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
[...]
> Good.
> Is a respin needed, to add the comment mentioned above?

If you think it's worth the effort.
On 11/30/2022 3:47 AM, Yury Norov wrote:
> On Mon, Nov 28, 2022 at 08:39:24AM +0200, Tariq Toukan wrote:
>> Good.
>> Is a respin needed, to add the comment mentioned above?
>
> If you think it's worth the effort.

No, not sure it is...

I asked because this mail thread had been inactive for a while, with the
patches not yet accepted into the kernel.

If everyone is happy with the series, let's get it into this kernel cycle
while we can. Through which tree should it go?

Thanks,
Tariq
On Wed, Dec 07, 2022 at 02:53:58PM +0200, Tariq Toukan wrote:
[...]
> If everyone is happy with the series, let's get it into this kernel cycle
> while we can. Through which tree should it go?

I maintain the bitmap tree and can take the series there, but it touches
the scheduler and NUMA as well, so I'd prefer to route it through one of
those trees. If it goes through the bitmap tree, I'd like to collect more
reviews and testing first.

Thanks,
Yury