[RESEND,0/9] sched: cpumask: improve on cpumask_local_spread() locality

Message ID 20230121042436.2661843-1-yury.norov@gmail.com

Message

Yury Norov Jan. 21, 2023, 4:24 a.m. UTC
cpumask_local_spread() currently checks the local node for the presence of the
i'th CPU, and if it finds none, falls back to a flat search across all
non-local CPUs. We can do better by walking CPUs hop by hop, in order of
increasing NUMA distance from the local node.
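
As a rough sketch of what searching per NUMA hop means (illustrative only;
the actual sched_numa_find_nth_cpu() added by this series differs in detail,
and the RCU locking required around the iteration is elided):

  /* Sketch: return the i'th CPU, preferring smaller NUMA distance to 'node'. */
  static unsigned int numa_spread_sketch(unsigned int i, int node)
  {
          const struct cpumask *mask, *prev = cpu_none_mask;
          unsigned int cpu, n = 0;

          for_each_numa_hop_mask(mask, node) {
                  /* visit only the CPUs that first appear at this hop */
                  for_each_cpu_andnot(cpu, mask, prev) {
                          if (n++ == i)
                                  return cpu;
                  }
                  prev = mask;
          }
          return nr_cpu_ids;
  }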

This has significant performance implications on NUMA machines, for example
when NUMA-aware memory allocations are combined with NUMA-aware IRQ affinity
hints.

Performance tests from patch 8 of this series, for the Mellanox network
driver, show:

  TCP multi-stream, using 16 iperf3 instances pinned to 16 cores (with aRFS on).
  Active cores: 64,65,72,73,80,81,88,89,96,97,104,105,112,113,120,121
  
  +-------------------------+-----------+------------------+------------------+
  |                         | BW (Gbps) | TX side CPU util | RX side CPU util |
  +-------------------------+-----------+------------------+------------------+
  | Baseline                | 52.3      | 6.4 %            | 17.9 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on TX side only | 52.6      | 5.2 %            | 18.5 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on RX side only | 94.9      | 11.9 %           | 27.2 %           |
  +-------------------------+-----------+------------------+------------------+
  | Applied on both sides   | 95.1      | 8.4 %            | 27.3 %           |
  +-------------------------+-----------+------------------+------------------+
  
  The bottleneck on the RX side is removed and line rate is reached (~1.8x
  speedup), with ~30% less CPU utilization on the TX side.

This series was supposed to be included in v6.2, but that didn't happen. It
has spent enough time in -next without any issues, so I hope we'll finally
see it in v6.3.

I believe the best way would be to move it together with the scheduler
patches, but I'm OK with trying the bitmap branch again as well.

Tariq Toukan (1):
  net/mlx5e: Improve remote NUMA preferences used for the IRQ affinity
    hints

Valentin Schneider (2):
  sched/topology: Introduce sched_numa_hop_mask()
  sched/topology: Introduce for_each_numa_hop_mask()

Yury Norov (6):
  lib/find: introduce find_nth_and_andnot_bit
  cpumask: introduce cpumask_nth_and_andnot
  sched: add sched_numa_find_nth_cpu()
  cpumask: improve on cpumask_local_spread() locality
  lib/cpumask: reorganize cpumask_local_spread() logic
  lib/cpumask: update comment for cpumask_local_spread()

 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 18 +++-
 include/linux/cpumask.h                      | 20 +++++
 include/linux/find.h                         | 33 +++++++
 include/linux/topology.h                     | 33 +++++++
 kernel/sched/topology.c                      | 90 ++++++++++++++++++++
 lib/cpumask.c                                | 52 ++++++-----
 lib/find_bit.c                               |  9 ++
 7 files changed, 230 insertions(+), 25 deletions(-)
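
To make the new helpers concrete before diving into the diffs, here is a
hedged sketch of the intended semantics (exact signatures are in the patches):
cpumask_nth_and_andnot() returns the i'th (0-based) set CPU in
(*srcp1 & *srcp2 & ~*srcp3), or a value >= nr_cpu_ids if there are fewer than
i + 1 such CPUs. For example ('hop' and 'prev' are illustrative names, not
from the patches):

  /* the i'th online CPU that is in 'hop' but not already covered by 'prev' */
  unsigned int cpu = cpumask_nth_and_andnot(i, hop, cpu_online_mask, prev);

  if (cpu < nr_cpu_ids)
          pr_debug("found CPU %u at this hop\n", cpu);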

Comments

Tariq Toukan Jan. 22, 2023, 12:57 p.m. UTC | #1
On 21/01/2023 6:24, Yury Norov wrote:
> [...]
> 
> This series was supposed to be included in v6.2, but that didn't happen. It
> has spent enough time in -next without any issues, so I hope we'll finally
> see it in v6.3.
> 
> I believe the best way would be to move it together with the scheduler
> patches, but I'm OK with trying the bitmap branch again as well.

Now that Yury has dropped several controversial bitmap patches from the PR,
the rest are mostly in sched, or new API that's used by sched.

Valentin, what do you think? Can you take it to your sched branch?

Valentin Schneider Jan. 23, 2023, 9:57 a.m. UTC | #2
On 22/01/23 14:57, Tariq Toukan wrote:
> On 21/01/2023 6:24, Yury Norov wrote:
>> [...]
>
> Now that Yury has dropped several controversial bitmap patches from the PR,
> the rest are mostly in sched, or new API that's used by sched.
>
> Valentin, what do you think? Can you take it to your sched branch?
>

I would if I had one :-)

Peter/Ingo, any objections to stashing this in tip/sched/core?
Tariq Toukan Jan. 29, 2023, 8:07 a.m. UTC | #3
On 23/01/2023 11:57, Valentin Schneider wrote:
> On 22/01/23 14:57, Tariq Toukan wrote:
>> [...]
>>
>> Now that Yury has dropped several controversial bitmap patches from the PR,
>> the rest are mostly in sched, or new API that's used by sched.
>>
>> Valentin, what do you think? Can you take it to your sched branch?
>>
> 
> I would if I had one :-)
> 

Oh I see :)

> Peter/Ingo, any objections to stashing this in tip/sched/core?
> 

Hi Peter and Ingo,

Can you please look into it, so we'll have enough time to act (in
case...) during this kernel cycle?

We already missed one kernel...

Thanks,
Tariq
Jakub Kicinski Jan. 30, 2023, 8:22 p.m. UTC | #4
On Sun, 29 Jan 2023 10:07:58 +0200 Tariq Toukan wrote:
> > Peter/Ingo, any objections to stashing this in tip/sched/core?
> 
> Can you please look into it, so we'll have enough time to act (in
> case...) during this kernel cycle?
> 
> We already missed one kernel...

We really need this in linux-next by the end of the week. PTAL.
Jakub Kicinski Feb. 2, 2023, 5:33 p.m. UTC | #5
On Mon, 30 Jan 2023 12:22:06 -0800 Jakub Kicinski wrote:
> [...]
> 
> We really need this in linux-next by the end of the week. PTAL.

Peter, could you please take a look? Linux doesn't have an API for
basic, common sense IRQ distribution on AMD systems. It's important :(
Yury Norov Feb. 2, 2023, 5:37 p.m. UTC | #6
On Thu, Feb 2, 2023 at 9:33 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> [...]
>
> Peter, could you please take a look? Linux doesn't have an API for
> basic, common sense IRQ distribution on AMD systems. It's important :(

FWIW, it's already been in linux-next since mid-December via the
bitmap branch, and no issues have been reported so far.

Thanks,
Yury
Andy Shevchenko Feb. 8, 2023, 4:08 p.m. UTC | #7
On Wed, Feb 08, 2023 at 09:18:14PM +0530, Kalesh Anakkur Purayil wrote:
> On Wed, Feb 8, 2023 at 9:11 PM Pawel Chmielewski <pawel.chmielewski@intel.com> wrote:

...

> > +       u16 v_idx, cpu = 0;
> >
> [Kalesh]: if you initialize v_idx to 0 here, you can avoid the assignment
> below

I would avoid doing this.

The problem is that it will become harder to maintain and more error prone
during development.

So, please leave as is.
Yury Norov Feb. 8, 2023, 4:39 p.m. UTC | #8
On Wed, Feb 08, 2023 at 04:39:05PM +0100, Pawel Chmielewski wrote:
> With the introduction of sched_numa_hop_mask() and
> for_each_numa_hop_mask(), the affinity masks for queue vectors can be
> conveniently set by preferring the CPUs that are closest to the NUMA node
> of the parent PCI device.
> 
> Signed-off-by: Pawel Chmielewski <pawel.chmielewski@intel.com>
> ---
>  drivers/net/ethernet/intel/ice/ice_base.c | 17 ++++++++++++++---
>  1 file changed, 14 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice_base.c b/drivers/net/ethernet/intel/ice/ice_base.c
> index e864634d66bc..fd3550d15c9e 100644
> --- a/drivers/net/ethernet/intel/ice/ice_base.c
> +++ b/drivers/net/ethernet/intel/ice/ice_base.c
> @@ -122,8 +122,6 @@ static int ice_vsi_alloc_q_vector(struct ice_vsi *vsi, u16 v_idx)
>  	if (vsi->type == ICE_VSI_VF)
>  		goto out;
>  	/* only set affinity_mask if the CPU is online */
> -	if (cpu_online(v_idx))
> -		cpumask_set_cpu(v_idx, &q_vector->affinity_mask);
>  
>  	/* This will not be called in the driver load path because the netdev
>  	 * will not be created yet. All other cases with register the NAPI
> @@ -659,8 +657,10 @@ int ice_vsi_wait_one_rx_ring(struct ice_vsi *vsi, bool ena, u16 rxq_idx)
>   */
>  int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
>  {
> +	cpumask_t *aff_mask, *last_aff_mask = cpu_none_mask;
>  	struct device *dev = ice_pf_to_dev(vsi->back);
> -	u16 v_idx;
> +	int numa_node = dev->numa_node;
> +	u16 v_idx, cpu = 0;
>  	int err;
>  
>  	if (vsi->q_vectors[0]) {
> @@ -674,6 +674,17 @@ int ice_vsi_alloc_q_vectors(struct ice_vsi *vsi)
>  			goto err_out;
>  	}
>  
> +	v_idx = 0;
> +	for_each_numa_hop_mask(aff_mask, numa_node) {
> +		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask)
> +			if (v_idx < vsi->num_q_vectors) {
> +				if (cpu_online(cpu))
> +					cpumask_set_cpu(cpu, &vsi->q_vectors[v_idx]->affinity_mask);
> +				v_idx++;
> +			}
                        
                        else
                                goto out;

> +		last_aff_mask = aff_mask;
> +	}
> +
        out:

>  	return 0;
>  
>  err_out:
> -- 
> 2.37.3

Would it make sense to increment v_idx only if the matched CPU is online?
It would create a less sparse array of vectors...

Thanks,
Yury
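
Applied to the hunk above, that suggestion would look roughly like this
(a sketch only, untested, and not necessarily what was finally merged):

	v_idx = 0;
	for_each_numa_hop_mask(aff_mask, numa_node) {
		for_each_cpu_andnot(cpu, aff_mask, last_aff_mask) {
			if (v_idx >= vsi->num_q_vectors)
				goto out;
			/* consume a vector only for an online CPU, keeping
			 * the resulting vector array dense
			 */
			if (cpu_online(cpu))
				cpumask_set_cpu(cpu,
						&vsi->q_vectors[v_idx++]->affinity_mask);
		}
		last_aff_mask = aff_mask;
	}
out:
	return 0;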