[0/4] cpumask: improve on cpumask_local_spread() locality

Message ID 20221111040027.621646-1-yury.norov@gmail.com

Message

Yury Norov Nov. 11, 2022, 4 a.m. UTC
cpumask_local_spread() currently checks the local node for the presence of
the i'th CPU, and if it finds nothing, makes a flat search among all
non-local CPUs. We can do better by checking CPUs per NUMA hop.

This series is inspired by Valentin Schneider's "net/mlx5e: Improve remote
NUMA preferences used for the IRQ affinity hints"

https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/

According to Valentin's measurements, for mlx5e:

	Bottleneck in RX side is released, reached linerate (~1.8x speedup).
	~30% less cpu util on TX.

This series makes cpumask_local_spread() traverse CPUs based on NUMA
distance in the same way, and I expect comparable improvement for its
users, as in Valentin's case.
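
In rough terms, the reworked cpumask_local_spread() (patch 4) reduces to
the following sketch, built on the sched_numa_find_nth_cpu() helper added
in patch 3 (simplified; see the actual patch for the exact fallback
handling):

	unsigned int cpumask_local_spread(unsigned int i, int node)
	{
		unsigned int cpu;

		/* Wrap: we always want a CPU. */
		i %= num_online_cpus();

		if (node == NUMA_NO_NODE)
			cpu = cpumask_nth(i, cpu_online_mask);
		else
			/* Walk online CPUs in increasing NUMA distance from node. */
			cpu = sched_numa_find_nth_cpu(cpu_online_mask, i, node);

		WARN_ON(cpu >= nr_cpu_ids);
		return cpu;
	}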

I tested it on my VM with the following NUMA configuration:

root@debian:~# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3
node 0 size: 3869 MB
node 0 free: 3740 MB
node 1 cpus: 4 5
node 1 size: 1969 MB
node 1 free: 1937 MB
node 2 cpus: 6 7
node 2 size: 1967 MB
node 2 free: 1873 MB
node 3 cpus: 8 9 10 11 12 13 14 15
node 3 size: 7842 MB
node 3 free: 7723 MB
node distances:
node   0   1   2   3
  0:  10  50  30  70
  1:  50  10  70  30
  2:  30  70  10  50
  3:  70  30  50  10

And the cpumask_local_spread() traversal order for each node and offset
looks like this (each node lists CPUs in order of increasing NUMA distance;
for example, node 1's nearest non-local node is node 3 at distance 30, so
node 3's CPUs come right after node 1's own):

node 0:   0   1   2   3   6   7   4   5   8   9  10  11  12  13  14  15
node 1:   4   5   8   9  10  11  12  13  14  15   0   1   2   3   6   7
node 2:   6   7   0   1   2   3   8   9  10  11  12  13  14  15   4   5
node 3:   8   9  10  11  12  13  14  15   4   5   6   7   0   1   2   3
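
The table above can be reproduced with a trivial loop; this is a
hypothetical debug snippet for illustration, not part of the series:

	static int __init spread_dump(void)
	{
		int node, i;

		for_each_node(node) {
			pr_info("node %d:", node);
			for (i = 0; i < num_online_cpus(); i++)
				pr_cont(" %3u", cpumask_local_spread(i, node));
			pr_cont("\n");
		}
		return 0;
	}
	late_initcall(spread_dump);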

Yury Norov (4):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality

 include/linux/cpumask.h  | 20 +++++++++++++++++++
 include/linux/find.h     | 33 +++++++++++++++++++++++++++++++
 include/linux/topology.h |  8 ++++++++
 kernel/sched/topology.c  | 42 ++++++++++++++++++++++++++++++++++++++++
 lib/cpumask.c            | 12 ++----------
 lib/find_bit.c           |  9 +++++++++
 6 files changed, 114 insertions(+), 10 deletions(-)
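
For context, the semantics of the new helper introduced in patches 1-2 can
be sketched as follows (a reference implementation only; the real
cpumask_nth_and_andnot() wraps find_nth_and_andnot_bit() in lib/find_bit.c
and works word-at-a-time rather than per-CPU):

	/* Sketch: find the cpu'th set CPU in (*srcp1 & *srcp2 & ~*srcp3). */
	static inline unsigned int
	cpumask_nth_and_andnot_sketch(unsigned int cpu,
				      const struct cpumask *srcp1,
				      const struct cpumask *srcp2,
				      const struct cpumask *srcp3)
	{
		unsigned int c, n = 0;

		for_each_cpu_and(c, srcp1, srcp2)
			if (!cpumask_test_cpu(c, srcp3) && n++ == cpu)
				return c;

		return nr_cpu_ids;	/* fewer than cpu + 1 matching CPUs */
	}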

Comments

Yury Norov Nov. 11, 2022, 5:07 p.m. UTC | #1
On Fri, Nov 11, 2022 at 01:42:29PM +0200, Andy Shevchenko wrote:
> On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:
> > The function finds Nth set CPU in a given cpumask starting from a given
> > node.
> > 
> > Leveraging the fact that each hop in sched_domains_numa_masks includes the
> > same or greater number of CPUs than the previous one, we can use binary
> > search on hops instead of a linear walk, which makes the overall complexity
> > O(log n) in terms of the number of cpumask_weight() calls.
> 
> ...
> 
> > +int sched_numa_find_nth_cpu(const struct cpumask *cpus, int cpu, int node)
> > +{
> > +	unsigned int first = 0, mid, last = sched_domains_numa_levels;
> > +	struct cpumask ***masks;
> 
> *** ?
> Hmm... Do we really need such deep indirection?

It's a 2D array of pointers, so - yes.
 
> > +	int w, ret = nr_cpu_ids;
> > +
> > +	rcu_read_lock();
> > +	masks = rcu_dereference(sched_domains_numa_masks);
> > +	if (!masks)
> > +		goto out;
> > +
> > +	while (last >= first) {
> > +		mid = (last + first) / 2;
> > +
> > +		if (cpumask_weight_and(cpus, masks[mid][node]) <= cpu) {
> > +			first = mid + 1;
> > +			continue;
> > +		}
> > +
> > +		w = (mid == 0) ? 0 : cpumask_weight_and(cpus, masks[mid - 1][node]);
> 
> See below.
> 
> > +		if (w <= cpu)
> > +			break;
> > +
> > +		last = mid - 1;
> > +	}
> 
> We have lib/bsearch.h. I haven't really looked deeply into the above, but my
> gut feeling is that it might be useful here. Can you check that?

Yes, we do. I tried it, and it didn't work because the node arrays are
allocated dynamically, and the distance between different pairs of hops
for a given node is not constant, which is a requirement for bsearch.

However, the distance between hop pointers in the 1st-level array should
be constant, so we can try feeding bsearch with that. I'll experiment
with bsearch some more.

> > +	ret = (mid == 0) ?
> > +		cpumask_nth_and(cpu - w, cpus, masks[mid][node]) :
> > +		cpumask_nth_and_andnot(cpu - w, cpus, masks[mid][node], masks[mid - 1][node]);
> 
> You can also shorten this by inverting the conditional:
> 
> 	ret = mid ? ...not 0... : ...for 0...;

Yep, why not.

> > +out:
> 
> out_unlock: ?

Do you think it's better?

> > +	rcu_read_unlock();
> > +	return ret;
> > +}
> 
> -- 
> With Best Regards,
> Andy Shevchenko
>
Yury Norov Nov. 12, 2022, 6:14 p.m. UTC | #2
On Fri, Nov 11, 2022 at 09:07:17AM -0800, Yury Norov wrote:
> On Fri, Nov 11, 2022 at 01:42:29PM +0200, Andy Shevchenko wrote:
> > On Thu, Nov 10, 2022 at 08:00:26PM -0800, Yury Norov wrote:

..

> > We have lib/bsearch.h. I haven't really looked deeply into the above, but my
> > gut feeling is that it might be useful here. Can you check that?
> 
> Yes, we do. I tried it, and it didn't work because the node arrays are
> allocated dynamically, and the distance between different pairs of hops
> for a given node is not constant, which is a requirement for bsearch.
> 
> However, the distance between hop pointers in the 1st-level array should
> be constant, so we can try feeding bsearch with that. I'll experiment
> with bsearch some more.

OK, I tried bsearch on the array of hops, and it works, but it requires
some black pointer magic. I'll send a v2 based on bsearch soon.

Thanks,
Yury
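
For illustration, "bsearch on the array of hops" could take roughly the
following shape. This is only a sketch of the idea discussed above: the
struct and hop_cmp() names are illustrative, bsearch() is the lib/bsearch.c
helper, and sched_domains_numa_masks is indexed as masks[hop][node]:

	struct hop_key {
		const struct cpumask *cpus;	/* mask to search in */
		struct cpumask ***masks;	/* sched_domains_numa_masks */
		int node;
		int cpu;	/* the Nth matching CPU we look for */
		int w;		/* out: weight up to the previous hop */
	};

	/*
	 * 'b' points into the 1st-level array of hop pointers, which has
	 * a constant stride, so bsearch() can walk it even though the
	 * per-node masks behind each hop are allocated dynamically.
	 */
	static int hop_cmp(const void *a, const void *b)
	{
		struct hop_key *k = (struct hop_key *)a;
		struct cpumask **cur_hop = *(struct cpumask ***)b;
		struct cpumask **prev_hop;

		if (cpumask_weight_and(k->cpus, cur_hop[k->node]) <= k->cpu)
			return 1;	/* too few CPUs so far: search right */

		if (b == (const void *)k->masks) {
			k->w = 0;	/* first hop: nothing before it */
			return 0;
		}

		prev_hop = *((struct cpumask ***)b - 1);
		k->w = cpumask_weight_and(k->cpus, prev_hop[k->node]);
		if (k->w <= k->cpu)
			return 0;	/* the Nth CPU lies in this hop */

		return -1;		/* it lies in an earlier hop */
	}

The caller would then do hop = bsearch(&k, masks, sched_domains_numa_levels,
sizeof(masks[0]), hop_cmp) and finish with cpumask_nth_and_andnot(cpu - k.w,
cpus, ...), as in the linear version quoted above.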
Tariq Toukan Nov. 13, 2022, 7:37 a.m. UTC | #3
On 11/11/2022 6:47 PM, Yury Norov wrote:
> On Fri, Nov 11, 2022, 10:25 AM Jakub Kicinski <kuba@kernel.org> wrote:
> 
>     On Thu, 10 Nov 2022 20:00:23 -0800 Yury Norov wrote:
>      > cpumask_local_spread() currently checks the local node for the
>      > presence of the i'th CPU, and if it finds nothing, makes a flat
>      > search among all non-local CPUs. We can do better by checking
>      > CPUs per NUMA hop.
> 
>     Nice.
> 

Thanks for your series.
This improves all of them, with no changes required to the network
device drivers.

>      > This series is inspired by Valentin Schneider's "net/mlx5e: Improve
>      > remote NUMA preferences used for the IRQ affinity hints"
>      >
>      > https://patchwork.kernel.org/project/netdevbpf/patch/20220728191203.4055-3-tariqt@nvidia.com/


Find my very first version here, including the perf testing results:
https://patchwork.kernel.org/project/netdevbpf/list/?series=660413&state=*


>      >
>      > According to Valentin's measurements, for mlx5e:
>      >
>      >       Bottleneck in RX side is released, reached linerate (~1.8x speedup).
>      >       ~30% less cpu util on TX.
>      >
>      > This series makes cpumask_local_spread() traverse CPUs based on NUMA
>      > distance in the same way, and I expect comparable improvement for its
>      > users, as in Valentin's case.
>      >

Right.

>      > I tested it on my VM with the following NUMA configuration:
> 
>     nit: the authorship is a bit more complicated, it'd be good to mention
>     Tariq. Both for the code and attribution of the testing / measurements.
> 
> 
> Sure. Tariq and Valentin, please send your tags as appropriate.
> 

I wonder what fits best here?

As the contribution is based upon previous work that I developed,
probably:
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>

Thanks,
Tariq
Andy Shevchenko Nov. 13, 2022, 12:29 p.m. UTC | #4
On Sun, Nov 13, 2022 at 09:37:59AM +0200, Tariq Toukan wrote:
> On 11/11/2022 6:47 PM, Yury Norov wrote:
> > On Fri, Nov 11, 2022, 10:25 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >     On Thu, 10 Nov 2022 20:00:23 -0800 Yury Norov wrote:

..

> > Sure. Tariq and Valentin, please send your tags as appropriate.
> 
> I wonder what fits best here?
> 
> As the contribution is based upon previous work that I developed,
> probably:
> Signed-off-by: Tariq Toukan <tariqt@nvidia.com>

Then it probably means that one of you (Yury or you) should also add a
Co-developed-by tag.
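
For reference, the kernel's submitting-patches process requires a
Co-developed-by tag to be immediately followed by the co-author's
Signed-off-by, so the resulting tag block might look like this (the
ordering shown is illustrative):

	Co-developed-by: Tariq Toukan <tariqt@nvidia.com>
	Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
	Signed-off-by: Yury Norov <yury.norov@gmail.com>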