[v4,0/4] isolation: limit msix vectors to housekeeping CPUs

Message ID	20200928183529.471328-1-nitesh@redhat.com
Headers	show Return-Path: <SRS0=2emc=DF=vger.kernel.org=netdev-owner@kernel.org> From: Nitesh Narayan Lal <nitesh@redhat.com> To: linux-kernel@vger.kernel.org, netdev@vger.kernel.org, linux-pci@vger.kernel.org, intel-wired-lan@lists.osuosl.org, frederic@kernel.org, mtosatti@redhat.com, sassmann@redhat.com, jesse.brandeburg@intel.com, lihong.yang@intel.com, helgaas@kernel.org, nitesh@redhat.com, jeffrey.t.kirsher@intel.com, jacob.e.keller@intel.com, jlelli@redhat.com, hch@infradead.org, bhelgaas@google.com, mike.marciniszyn@intel.com, dennis.dalessandro@intel.com, thomas.lendacky@amd.com, jiri@nvidia.com, mingo@redhat.com, peterz@infradead.org, juri.lelli@redhat.com, vincent.guittot@linaro.org, lgoncalv@redhat.com Subject: [PATCH v4 0/4] isolation: limit msix vectors to housekeeping CPUs Date: Mon, 28 Sep 2020 14:35:25 -0400 Message-Id: <20200928183529.471328-1-nitesh@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Precedence: bulk
Series	isolation: limit msix vectors to housekeeping CPUs \| expand [v4,0/4] isolation: limit msix vectors to housekeeping CPUs [v4,1/4] sched/isolation: API to get number of housekeeping CPUs [v4,2/4] sched/isolation: Extend nohz_full to isolate managed IRQs [v4,3/4] i40e: Limit msix vectors to housekeeping CPUs [v4,4/4] PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

Nitesh Narayan Lal Sept. 28, 2020, 6:35 p.m. UTC

This is a follow-up posting for "[PATCH v3 0/4] isolation: limit msix vectors
to housekeeping CPUs".

Issue
=====
With the current implementation device drivers while creating their MSIX        
vectors only take num_online_cpus() into consideration which works quite well  
for a non-RT environment, but in an RT environment that has a large number of   
isolated CPUs and very few housekeeping CPUs this could lead to a problem.    
The problem will be triggered when something like tuned will try to move all    
the IRQs from isolated CPUs to the limited number of housekeeping CPUs to       
prevent interruptions for a latency-sensitive workload that will be running
on the isolated CPUs. This failure is caused because of the per CPU vector         
limitation.                                                                     


Proposed Fix
============
In this patch-set, the following changes are proposed:
- A generic API housekeeping_num_online_cpus() which is meant to return the
  online housekeeping CPUs based on the hk_flag passed by the caller.
- i40e: Specifically for the i40e driver the num_online_cpus() used in 
  i40e_init_msix() to calculate numbers msix vectors is replaced with the
  above defined API that returns the online housekeeping CPUs that are meant
  to handle managed IRQ jobs.
- pci_alloc_irq_vector(): With the help of housekeeping_num_online_cpus() the
  max_vecs passed in pci_alloc_irq_vector() is restricted only to the online
  housekeeping CPUs (designated for managed IRQ jobs) strictly in an RT
  environment. However, if the min_vecs exceeds the online housekeeping CPUs,
  max_vecs is limited based on the min_vecs instead.


Future Work
===========

- In the previous upstream discussion [1], it was decided that it would be
  better if we can have a generic framework that can be consumed by all the
  drivers to fix this kind of issue. However, it will be a long term work,
  and since there are RT workloads that are getting impacted by the reported
  issue. We agreed upon the proposed per-device approach for now.


Testing
=======
Functionality:
- To test that the issue is resolved with i40e change I added a tracepoint
  in i40e_init_msix() to find the number of CPUs derived for vector creation
  with and without tuned's realtime-virtual-host profile. As per expectation
  with the profile applied I was only getting the number of housekeeping CPUs
  and all available CPUs without it. Another way to verify is by checking
  the number of IRQs that get created corresponding to a impacted device.
  Similarly did a few more tests with different modes eg with only nohz_full,
  isolcpus etc.

Performance:
- To analyze the performance impact I have targetted the change introduced in 
  pci_alloc_irq_vectors() and compared the results against a vanilla kernel
  (5.9.0-rc3) results.

  Setup Information:
  + I had a couple of 24-core machines connected back to back via a couple of
    mlx5 NICs and I analyzed the average bitrate for server-client TCP and
    UDP transmission via iperf. 
  + To minimize the Bitrate variation of iperf TCP and UDP stream test I have
    applied the tuned's network-throughput profile and disabled HT.
 Test Information:
  + For the environment that had no isolated CPUs:
    I have tested with single stream and 24 streams (same as that of online
    CPUs).
  + For the environment that had 20 isolated CPUs:
    I have tested with single stream, 4 streams (same as that the number of
    housekeeping) and 24 streams (same as that of online CPUs).

 Results:
  # UDP Stream Test:
    + There was no degradation observed in UDP stream tests in both
      environments. (With isolated CPUs and without isolated CPUs after the
      introduction of the patches).
  # TCP Stream Test - No isolated CPUs:
    + No noticeable degradation was observed.
  # TCP Stream Test - With isolated CPUs:
    + Multiple Stream (4)  - Average degradation of around 5-6%
    + Multiple Stream (24) - Average degradation of around 2-3%
    + Single Stream        - Even on a vanilla kernel the Bitrate observed 
                             for a TCP single stream test seem to vary
                             significantly across different runs (eg. the %
                             variation between the best and the worst case on
                             a vanilla kernel was around 8-10%). A similar
                             variation was observed with the kernel that
                             included my patches. No additional degradation
                             was observed.

If there are any suggestions for more performance evaluation, I would
be happy to discuss/perform them.


Changes from v3[2]:
==================
- Moved the logic to limit the max_vecs from pci_alloc_irq_vectors() to
  pci_alloc_irq_vectors_affinity() as that's the exported interface and
  drivers using this API also need to be fixed (suggestion from Bjorn Helgaas).

Changes from v2[3]:
==================
- Renamed hk_num_online_cpus() with housekeeping_num_online_cpus() to keep
  the naming convention consistent (based on a suggestion from Peter
  Zijlstra and Frederic Weisbecker).
- Added an argument "enum hk_flags" to the housekeeping_num_online_cpus() API
  to make it more usable in different use-cases (based on a suggestion from 
  Frederic Weisbecker).
- Replaced cpumask_weight(cpu_online_mask) with num_online_cpus() (suggestion
  from Bjorn Helgaas).
- Modified patch commit messages and comment based on Bjorn Helgaas's
  suggestion.

Changes from v1[4]:
==================
Patch1:                                                                       
- Replaced num_houskeeeping_cpus() with hk_num_online_cpus() and started
  using the cpumask corresponding to HK_FLAG_MANAGED_IRQ to derive the number
  of online housekeeping CPUs. This is based on Frederic Weisbecker's
  suggestion.           
- Since the hk_num_online_cpus() is self-explanatory, got rid of             
  the comment that was added previously.                                     
Patch2:                                                                       
- Added a new patch that is meant to enable managed IRQ isolation for
  nohz_full CPUs. This is based on Frederic Weisbecker's suggestion.              
Patch4 (PCI):                                                                 
- For cases where the min_vecs exceeds the online housekeeping CPUs, instead
  of skipping modification to max_vecs, started restricting it based on the
  min_vecs. This is based on a suggestion from Marcelo Tosatti.                                                                    


[1] https://lore.kernel.org/lkml/20200922095440.GA5217@lenoir/
[2] https://lore.kernel.org/lkml/20200925182654.224004-1-nitesh@redhat.com/
[3] https://lore.kernel.org/lkml/20200923181126.223766-1-nitesh@redhat.com/
[4] https://lore.kernel.org/lkml/20200909150818.313699-1-nitesh@redhat.com/


Nitesh Narayan Lal (4):
  sched/isolation: API to get number of housekeeping CPUs
  sched/isolation: Extend nohz_full to isolate managed IRQs
  i40e: Limit msix vectors to housekeeping CPUs
  PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs

 drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
 drivers/pci/msi.c                           | 18 ++++++++++++++++++
 include/linux/sched/isolation.h             |  9 +++++++++
 kernel/sched/isolation.c                    |  2 +-
 4 files changed, 30 insertions(+), 2 deletions(-)

--

Frederic Weisbecker Oct. 1, 2020, 3:49 p.m. UTC | #1

On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote:
> Nitesh Narayan Lal (4):
>   sched/isolation: API to get number of housekeeping CPUs
>   sched/isolation: Extend nohz_full to isolate managed IRQs
>   i40e: Limit msix vectors to housekeeping CPUs
>   PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
> 
>  drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
>  drivers/pci/msi.c                           | 18 ++++++++++++++++++
>  include/linux/sched/isolation.h             |  9 +++++++++
>  kernel/sched/isolation.c                    |  2 +-
>  4 files changed, 30 insertions(+), 2 deletions(-)

Acked-by: Frederic Weisbecker <frederic@kernel.org>

Peter, if you're ok with the set, I guess this should go through
the scheduler tree?

Thanks.

Nitesh Narayan Lal Oct. 8, 2020, 9:40 p.m. UTC | #2

On 10/1/20 11:49 AM, Frederic Weisbecker wrote:
> On Mon, Sep 28, 2020 at 02:35:25PM -0400, Nitesh Narayan Lal wrote:
>> Nitesh Narayan Lal (4):
>>   sched/isolation: API to get number of housekeeping CPUs
>>   sched/isolation: Extend nohz_full to isolate managed IRQs
>>   i40e: Limit msix vectors to housekeeping CPUs
>>   PCI: Limit pci_alloc_irq_vectors() to housekeeping CPUs
>>
>>  drivers/net/ethernet/intel/i40e/i40e_main.c |  3 ++-
>>  drivers/pci/msi.c                           | 18 ++++++++++++++++++
>>  include/linux/sched/isolation.h             |  9 +++++++++
>>  kernel/sched/isolation.c                    |  2 +-
>>  4 files changed, 30 insertions(+), 2 deletions(-)
> Acked-by: Frederic Weisbecker <frederic@kernel.org>
>
> Peter, if you're ok with the set, I guess this should go through
> the scheduler tree?
>
> Thanks.

Hi Peter,

I wanted follow up to check if you have any concerns/suggestions that I
should address in this patch-set before this can be picked?

Peter Zijlstra Oct. 19, 2020, 11:11 a.m. UTC | #3

On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> >> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> >> +
> >> +	/*
> >> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> >> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> >> +	 * to the housekeeping CPUs from the userspace by changing their
> >> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> >> +	 * running out of IRQ vectors.
> >> +	 */
> >> +	if (hk_cpus < num_online_cpus()) {
> >> +		if (hk_cpus < min_vecs)
> >> +			max_vecs = min_vecs;
> >> +		else if (hk_cpus < max_vecs)
> >> +			max_vecs = hk_cpus;
> > is that:
> >
> > 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> 
> Yes, I think this will do.
> 
> >
> > Also, do we really need to have that conditional on hk_cpus <
> > num_online_cpus()? That is, why can't we do this unconditionally?
> 
> FWIU most of the drivers using this API already restricts the number of
> vectors based on the num_online_cpus, if we do it unconditionally we can
> unnecessary duplicate the restriction for cases where we don't have any
> isolated CPUs.

unnecessary isn't really a concern here, this is a slow path. What's
important is code clarity.

> Also, different driver seems to take different factors into consideration
> along with num_online_cpus while finding the max_vecs to request, for
> example in the case of mlx5:
> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>                MLX5_EQ_VEC_COMP_BASE
> 
> Having hk_cpus < num_online_cpus() helps us ensure that we are only
> changing the behavior when we have isolated CPUs.
> 
> Does that make sense?

That seems to want to allocate N interrupts per cpu (plus some random
static amount, which seems weird, but whatever). This patch breaks that.

So I think it is important to figure out what that driver really wants
in the nohz_full case. If it wants to retain N interrupts per CPU, and
only reduce the number of CPUs, the proposed interface is wrong.

> > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > excluding hotplug is racy.
> 
> The housekeeping_mask should still remain constant, isn't?
> In any case, I can double check this.

The goal is very much to have that dynamically configurable.

Marcelo Tosatti Oct. 19, 2020, 2 p.m. UTC | #4

On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:
> > >> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> > >> +
> > >> +	/*
> > >> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> > >> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> > >> +	 * to the housekeeping CPUs from the userspace by changing their
> > >> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> > >> +	 * running out of IRQ vectors.
> > >> +	 */
> > >> +	if (hk_cpus < num_online_cpus()) {
> > >> +		if (hk_cpus < min_vecs)
> > >> +			max_vecs = min_vecs;
> > >> +		else if (hk_cpus < max_vecs)
> > >> +			max_vecs = hk_cpus;
> > > is that:
> > >
> > > 		max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
> > 
> > Yes, I think this will do.
> > 
> > >
> > > Also, do we really need to have that conditional on hk_cpus <
> > > num_online_cpus()? That is, why can't we do this unconditionally?
> > 
> > FWIU most of the drivers using this API already restricts the number of
> > vectors based on the num_online_cpus, if we do it unconditionally we can
> > unnecessary duplicate the restriction for cases where we don't have any
> > isolated CPUs.
> 
> unnecessary isn't really a concern here, this is a slow path. What's
> important is code clarity.
> 
> > Also, different driver seems to take different factors into consideration
> > along with num_online_cpus while finding the max_vecs to request, for
> > example in the case of mlx5:
> > MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
> >                MLX5_EQ_VEC_COMP_BASE
> > 
> > Having hk_cpus < num_online_cpus() helps us ensure that we are only
> > changing the behavior when we have isolated CPUs.
> > 
> > Does that make sense?
> 
> That seems to want to allocate N interrupts per cpu (plus some random
> static amount, which seems weird, but whatever). This patch breaks that.

On purpose. For the isolated CPUs we don't want network device 
interrupts (in this context).

> So I think it is important to figure out what that driver really wants
> in the nohz_full case. If it wants to retain N interrupts per CPU, and
> only reduce the number of CPUs, the proposed interface is wrong.

It wants N interrupts per non-isolated (AKA housekeeping) CPU.
Zero interrupts for isolated interrupts.

> > > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > > excluding hotplug is racy.
> > 
> > The housekeeping_mask should still remain constant, isn't?
> > In any case, I can double check this.
> 
> The goal is very much to have that dynamically configurable.

Yes, but this patch is a fix for customer bug in the old, static on-boot 
isolation CPU configuration.

---

Discussing the dynamic configuration (not this patch!) case:

Would need to enable/disable interrupts for a particular device 
on a per-CPU basis. Such interface does not exist yet.

Perhaps that is what you are looking for when writing "proposed interface
is wrong" Peter?

Frederic Weisbecker Oct. 19, 2020, 2:21 p.m. UTC | #5

On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
> > > And what are the (desired) semantics vs hotplug? Using a cpumask without
> > > excluding hotplug is racy.
> > 
> > The housekeeping_mask should still remain constant, isn't?
> > In any case, I can double check this.
> 
> The goal is very much to have that dynamically configurable.

Right but unfortunately we are not there before a little while. And the
existing code in these drivers allocating vectors doesn't even take into
account hotplug as you spotted. So I agreed to let Nitesh fix this issue
on top of the existing code until he can look into providing some infrastructure
for these kind of vectors allocation. The first step would be to consolidate
similar code from other drivers, then maybe handle hotplug and later
dynamic housekeeping.

Nitesh Narayan Lal Oct. 19, 2020, 2:25 p.m. UTC | #6

On 10/19/20 10:00 AM, Marcelo Tosatti wrote:
> On Mon, Oct 19, 2020 at 01:11:37PM +0200, Peter Zijlstra wrote:
>> On Sun, Oct 18, 2020 at 02:14:46PM -0400, Nitesh Narayan Lal wrote:

[...]

>>>> Also, do we really need to have that conditional on hk_cpus <
>>>> num_online_cpus()? That is, why can't we do this unconditionally?
>>> FWIU most of the drivers using this API already restricts the number of
>>> vectors based on the num_online_cpus, if we do it unconditionally we can
>>> unnecessary duplicate the restriction for cases where we don't have any
>>> isolated CPUs.
>> unnecessary isn't really a concern here, this is a slow path. What's
>> important is code clarity.

Right, I can skip that check then.

>>
>>> Also, different driver seems to take different factors into consideration
>>> along with num_online_cpus while finding the max_vecs to request, for
>>> example in the case of mlx5:
>>> MLX5_CAP_GEN(dev, num_ports) * num_online_cpus() +
>>>                MLX5_EQ_VEC_COMP_BASE
>>>
>>> Having hk_cpus < num_online_cpus() helps us ensure that we are only
>>> changing the behavior when we have isolated CPUs.
>>>
>>> Does that make sense?
>> That seems to want to allocate N interrupts per cpu (plus some random
>> static amount, which seems weird, but whatever). This patch breaks that.
> On purpose. For the isolated CPUs we don't want network device 
> interrupts (in this context).
>
>> So I think it is important to figure out what that driver really wants
>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>> only reduce the number of CPUs, the proposed interface is wrong.
> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> Zero interrupts for isolated interrupts.

Right, otherwise we may end up in a situation where we run out of per CPU
vectors while we move the IRQs from isolated CPUs to housekeeping.

Peter Zijlstra Oct. 20, 2020, 7:30 a.m. UTC | #7

On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
> > So I think it is important to figure out what that driver really wants
> > in the nohz_full case. If it wants to retain N interrupts per CPU, and
> > only reduce the number of CPUs, the proposed interface is wrong.
> 
> It wants N interrupts per non-isolated (AKA housekeeping) CPU.

Then the patch is wrong and the interface needs changing from @min_vecs,
@max_vecs to something that expresses the N*nr_cpus relation.

Nitesh Narayan Lal Oct. 20, 2020, 1 p.m. UTC | #8

On 10/20/20 3:30 AM, Peter Zijlstra wrote:
> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>> So I think it is important to figure out what that driver really wants
>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>> only reduce the number of CPUs, the proposed interface is wrong.
>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> Then the patch is wrong and the interface needs changing from @min_vecs,
> @max_vecs to something that expresses the N*nr_cpus relation.

Reading Marcelo's comment again I think what is really expected is 1
interrupt per non-isolated (housekeeping) CPU (not N interrupts).

My bad that I missed it initially.

Peter Zijlstra Oct. 20, 2020, 1:41 p.m. UTC | #9

On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
> 
> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
> > On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
> >>> So I think it is important to figure out what that driver really wants
> >>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
> >>> only reduce the number of CPUs, the proposed interface is wrong.
> >> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
> > Then the patch is wrong and the interface needs changing from @min_vecs,
> > @max_vecs to something that expresses the N*nr_cpus relation.
> 
> Reading Marcelo's comment again I think what is really expected is 1
> interrupt per non-isolated (housekeeping) CPU (not N interrupts).

Then what is the point of them asking for N*nr_cpus when there is no
isolation?

Either everybody wants 1 interrupts per CPU and we can do the clamp
unconditionally, in which case we should go fix this user, or they want
multiple per cpu and we should go fix the interface.

It cannot be both.

Thomas Gleixner Oct. 20, 2020, 2:16 p.m. UTC | #10

On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>  
> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
> +
> +	/*
> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
> +	 * to the housekeeping CPUs from the userspace by changing their
> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
> +	 * running out of IRQ vectors.
> +	 */

This is not true for managed interrupts. The interrupts affinity of
those cannot be changed by user space.

> +	if (hk_cpus < num_online_cpus()) {
> +		if (hk_cpus < min_vecs)
> +			max_vecs = min_vecs;
> +		else if (hk_cpus < max_vecs)
> +			max_vecs = hk_cpus;
> +	}

So now with that assume a 16 core machine (HT off for simplicity)

17 Requested interrupts (1 general, 16 queues)

Managed interrupts will allocate

   1  general interrupt which is free movable by user space
   16 managed interrupts for queues (one per CPU)

This allows the driver to have 16 queues, i.e. one queue per CPU. These
interrupts are only used when an application on a CPU issues I/O.

With the above change this will result

   1  general interrupt which is free movable by user space
   1  managed interrupts (possible affinity to all 16 CPUs, but routed
      to housekeeping CPU as long as there is one online)

So the device is now limited to a single queue which also affects the
housekeeping CPUs because now they have to share a single queue.

With larger machines this gets even worse.

So no. This needs way more thought for managed interrupts and you cannot
do that at the PCI layer. Only the affinity spreading mechanism can do
the right thing here.

Thanks,

        tglx

Nitesh Narayan Lal Oct. 20, 2020, 2:39 p.m. UTC | #11

On 10/20/20 9:41 AM, Peter Zijlstra wrote:
> On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
>> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
>>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>>>> So I think it is important to figure out what that driver really wants
>>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>>>> only reduce the number of CPUs, the proposed interface is wrong.
>>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
>>> Then the patch is wrong and the interface needs changing from @min_vecs,
>>> @max_vecs to something that expresses the N*nr_cpus relation.
>> Reading Marcelo's comment again I think what is really expected is 1
>> interrupt per non-isolated (housekeeping) CPU (not N interrupts).
> Then what is the point of them asking for N*nr_cpus when there is no
> isolation?
>
> Either everybody wants 1 interrupts per CPU and we can do the clamp
> unconditionally, in which case we should go fix this user, or they want
> multiple per cpu and we should go fix the interface.
>
> It cannot be both.

Based on my understanding I don't think this is consistent, the number
of interrupts any driver can request varies to an extent that some
consumer of this API even request just one interrupt for its use.

This was one of the reasons why I thought of having a conditional
restriction.

But I agree there is a lack of consistency.

Nitesh Narayan Lal Oct. 20, 2020, 4:18 p.m. UTC | #12

On 10/20/20 10:16 AM, Thomas Gleixner wrote:
> On Mon, Sep 28 2020 at 14:35, Nitesh Narayan Lal wrote:
>>  
>> +	hk_cpus = housekeeping_num_online_cpus(HK_FLAG_MANAGED_IRQ);
>> +
>> +	/*
>> +	 * If we have isolated CPUs for use by real-time tasks, to keep the
>> +	 * latency overhead to a minimum, device-specific IRQ vectors are moved
>> +	 * to the housekeeping CPUs from the userspace by changing their
>> +	 * affinity mask. Limit the vector usage to keep housekeeping CPUs from
>> +	 * running out of IRQ vectors.
>> +	 */
> This is not true for managed interrupts. The interrupts affinity of
> those cannot be changed by user space.

Ah Yes. Perhaps
s/IRQs/non-managed IRQ ?

>
>> +	if (hk_cpus < num_online_cpus()) {
>> +		if (hk_cpus < min_vecs)
>> +			max_vecs = min_vecs;
>> +		else if (hk_cpus < max_vecs)
>> +			max_vecs = hk_cpus;
>> +	}
> So now with that assume a 16 core machine (HT off for simplicity)
>
> 17 Requested interrupts (1 general, 16 queues)
>
> Managed interrupts will allocate
>
>    1  general interrupt which is free movable by user space
>    16 managed interrupts for queues (one per CPU)
>
> This allows the driver to have 16 queues, i.e. one queue per CPU. These
> interrupts are only used when an application on a CPU issues I/O.

Correct.

>
> With the above change this will result
>
>    1  general interrupt which is free movable by user space
>    1  managed interrupts (possible affinity to all 16 CPUs, but routed
>       to housekeeping CPU as long as there is one online)
>
> So the device is now limited to a single queue which also affects the
> housekeeping CPUs because now they have to share a single queue.
>
> With larger machines this gets even worse.

Yes, the change can impact the performance, however, if we don't do that we
may have a latency impact instead. Specifically, on larger systems where
most of the CPUs are isolated as we will definitely fail in moving all of the
IRQs away from the isolated CPUs to the housekeeping.

>
> So no. This needs way more thought for managed interrupts and you cannot
> do that at the PCI layer.

Maybe we should not be doing anything in the case of managed IRQs as they
are anyways pinned to the housekeeping CPUs as long as we have the
'managed_irq' option included in the kernel cmdline.

>  Only the affinity spreading mechanism can do
> the right thing here.

I can definitely explore this further.

However, IMHO we would still need a logic to prevent the devices from
creating excess vectors.

Do you agree?

>
> Thanks,
>
>         tglx
>

Thomas Gleixner Oct. 20, 2020, 6:07 p.m. UTC | #13

On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
> On 10/20/20 10:16 AM, Thomas Gleixner wrote:
>> With the above change this will result
>>
>>    1  general interrupt which is free movable by user space
>>    1  managed interrupts (possible affinity to all 16 CPUs, but routed
>>       to housekeeping CPU as long as there is one online)
>>
>> So the device is now limited to a single queue which also affects the
>> housekeeping CPUs because now they have to share a single queue.
>>
>> With larger machines this gets even worse.
>
> Yes, the change can impact the performance, however, if we don't do that we
> may have a latency impact instead. Specifically, on larger systems where
> most of the CPUs are isolated as we will definitely fail in moving all of the
> IRQs away from the isolated CPUs to the housekeeping.

For non managed interrupts I agree.

>> So no. This needs way more thought for managed interrupts and you cannot
>> do that at the PCI layer.
>
> Maybe we should not be doing anything in the case of managed IRQs as they
> are anyways pinned to the housekeeping CPUs as long as we have the
> 'managed_irq' option included in the kernel cmdline.

Exactly. For the PCI side this vector limiting has to be restricted to
the non managed case.

>>  Only the affinity spreading mechanism can do
>> the right thing here.
>
> I can definitely explore this further.
>
> However, IMHO we would still need a logic to prevent the devices from
> creating excess vectors.

Managed interrupts are preventing exactly that by pinning the interrupts
and queues to one or a set of CPUs, which prevents vector exhaustion on
CPU hotplug.

Non-managed, yes that is and always was a problem. One of the reasons
why managed interrupts exist.

Thanks,

        tglx

Thomas Gleixner Oct. 21, 2020, 8:25 p.m. UTC | #14

On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
>> However, IMHO we would still need a logic to prevent the devices from
>> creating excess vectors.
>
> Managed interrupts are preventing exactly that by pinning the interrupts
> and queues to one or a set of CPUs, which prevents vector exhaustion on
> CPU hotplug.
>
> Non-managed, yes that is and always was a problem. One of the reasons
> why managed interrupts exist.

But why is this only a problem for isolation? The very same problem
exists vs. CPU hotplug and therefore hibernation.

On x86 we have at max. 204 vectors available for device interrupts per
CPU. So assumed the only device interrupt in use is networking then any
machine which has more than 204 network interrupts (queues, aux ...)
active will prevent the machine from hibernation.

Aside of that it's silly to have multiple queues targeted at a single
CPU in case of hotplug. And that's not a theoretical problem.  Some
power management schemes shut down sockets when the utilization of a
system is low enough, e.g. outside of working hours.

The whole point of multi-queue is to have locality so that traffic from
a CPU goes through the CPU local queue. What's the point of having two
or more queues on a CPU in case of hotplug?

The right answer to this is to utilize managed interrupts and have
according logic in your network driver to handle CPU hotplug. When a CPU
goes down, then the queue which is associated to that CPU is quiesced
and the interrupt core shuts down the relevant interrupt instead of
moving it to an online CPU (which causes the whole vector exhaustion
problem on x86). When the CPU comes online again, then the interrupt is
reenabled in the core and the driver reactivates the queue.

Thanks,

        tglx

Nitesh Narayan Lal Oct. 21, 2020, 9:04 p.m. UTC | #15

On 10/21/20 4:25 PM, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
>> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
>>> However, IMHO we would still need a logic to prevent the devices from
>>> creating excess vectors.
>> Managed interrupts are preventing exactly that by pinning the interrupts
>> and queues to one or a set of CPUs, which prevents vector exhaustion on
>> CPU hotplug.
>>
>> Non-managed, yes that is and always was a problem. One of the reasons
>> why managed interrupts exist.
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
>
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.

Yes, that is indeed the case.

>
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.
>
> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
>
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

IIRC then i40e does have something like that where it suspends all IRQs
before hibernation and restores them when the CPU is back online.

I am not particularly sure about the other drivers.

This brings me to another discussion that Peter initiated that is to
perform the proposed restriction without any condition for all non-managed
IRQs.

Something on the lines:

+       if (!pci_is_managed(dev))
+               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);


I am not particularly sure about this because I am not sure what kind of
performance penalty this will have on the drivers in general and if
that will be acceptable at all. Any thoughts?

However, this still doesn't solve the generic problem, and an ideal solution
will be something that you suggested.

Will it be sensible to think about having a generic API that can be
consumed by all the drivers and that can do both the things you mentioned?

>
> Thanks,
>
>         tglx
>
>
>

Jakub Kicinski Oct. 22, 2020, 12:02 a.m. UTC | #16

On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:  
> >> However, IMHO we would still need a logic to prevent the devices from
> >> creating excess vectors.  
> >
> > Managed interrupts are preventing exactly that by pinning the interrupts
> > and queues to one or a set of CPUs, which prevents vector exhaustion on
> > CPU hotplug.
> >
> > Non-managed, yes that is and always was a problem. One of the reasons
> > why managed interrupts exist.  
> 
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
> 
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.
> 
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.
> 
> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
> 
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

I think Mellanox folks made some forays into managed irqs, but I don't
remember/can't find the details now.

For networking the locality / queue per core does not always work,
since the incoming traffic is usually spread based on a hash. Many
applications perform better when network processing is done on a small
subset of CPUs, and application doesn't get interrupted every 100us. 
So we do need extra user control here.

We have a bit of a uAPI problem since people had grown to depend on
IRQ == queue == NAPI to configure their systems. "The right way" out
would be a proper API which allows associating queues with CPUs rather
than IRQs, then we can use managed IRQs and solve many other problems.

Such new API has been in the works / discussions for a while now.

(Magnus keep me honest here, if you disagree the queue API solves this.)

Jacob Keller Oct. 22, 2020, 12:27 a.m. UTC | #17

On 10/21/2020 5:02 PM, Jakub Kicinski wrote:
> On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
>> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
>>> On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:  
>>>> However, IMHO we would still need a logic to prevent the devices from
>>>> creating excess vectors.  
>>>
>>> Managed interrupts are preventing exactly that by pinning the interrupts
>>> and queues to one or a set of CPUs, which prevents vector exhaustion on
>>> CPU hotplug.
>>>
>>> Non-managed, yes that is and always was a problem. One of the reasons
>>> why managed interrupts exist.  
>>
>> But why is this only a problem for isolation? The very same problem
>> exists vs. CPU hotplug and therefore hibernation.
>>
>> On x86 we have at max. 204 vectors available for device interrupts per
>> CPU. So assumed the only device interrupt in use is networking then any
>> machine which has more than 204 network interrupts (queues, aux ...)
>> active will prevent the machine from hibernation.
>>
>> Aside of that it's silly to have multiple queues targeted at a single
>> CPU in case of hotplug. And that's not a theoretical problem.  Some
>> power management schemes shut down sockets when the utilization of a
>> system is low enough, e.g. outside of working hours.
>>
>> The whole point of multi-queue is to have locality so that traffic from
>> a CPU goes through the CPU local queue. What's the point of having two
>> or more queues on a CPU in case of hotplug?
>>
>> The right answer to this is to utilize managed interrupts and have
>> according logic in your network driver to handle CPU hotplug. When a CPU
>> goes down, then the queue which is associated to that CPU is quiesced
>> and the interrupt core shuts down the relevant interrupt instead of
>> moving it to an online CPU (which causes the whole vector exhaustion
>> problem on x86). When the CPU comes online again, then the interrupt is
>> reenabled in the core and the driver reactivates the queue.
> 
> I think Mellanox folks made some forays into managed irqs, but I don't
> remember/can't find the details now.
> 

I remember looking into this a few years ago, and not getting very far
either.

> For networking the locality / queue per core does not always work,
> since the incoming traffic is usually spread based on a hash. Many
> applications perform better when network processing is done on a small
> subset of CPUs, and application doesn't get interrupted every 100us. 
> So we do need extra user control here.
> 
> We have a bit of a uAPI problem since people had grown to depend on
> IRQ == queue == NAPI to configure their systems. "The right way" out
> would be a proper API which allows associating queues with CPUs rather
> than IRQs, then we can use managed IRQs and solve many other problems.
> 

I think we (Intel) hit some of the same issues you mention.

I know I personally would like to see something that lets a lot of the
current driver-specific policy be moved out. I think it should be
possible to significantly simplify the abstraction used by the drivers.

> Such new API has been in the works / discussions for a while now.
> 
> (Magnus keep me honest here, if you disagree the queue API solves this.)
>

Thomas Gleixner Oct. 22, 2020, 8:28 a.m. UTC | #18

On Wed, Oct 21 2020 at 17:02, Jakub Kicinski wrote:
> On Wed, 21 Oct 2020 22:25:48 +0200 Thomas Gleixner wrote:
>> The right answer to this is to utilize managed interrupts and have
>> according logic in your network driver to handle CPU hotplug. When a CPU
>> goes down, then the queue which is associated to that CPU is quiesced
>> and the interrupt core shuts down the relevant interrupt instead of
>> moving it to an online CPU (which causes the whole vector exhaustion
>> problem on x86). When the CPU comes online again, then the interrupt is
>> reenabled in the core and the driver reactivates the queue.
>
> I think Mellanox folks made some forays into managed irqs, but I don't
> remember/can't find the details now.
>
> For networking the locality / queue per core does not always work,
> since the incoming traffic is usually spread based on a hash. Many

That makes it problematic and is fundamentally different from block I/O.

> applications perform better when network processing is done on a small
> subset of CPUs, and application doesn't get interrupted every 100us. 
> So we do need extra user control here.

Ok.

> We have a bit of a uAPI problem since people had grown to depend on
> IRQ == queue == NAPI to configure their systems. "The right way" out
> would be a proper API which allows associating queues with CPUs rather
> than IRQs, then we can use managed IRQs and solve many other problems.
>
> Such new API has been in the works / discussions for a while now.

If there is anything which needs to be done/extended on the irq side
please let me know.

Thanks

        tglx

Marcelo Tosatti Oct. 22, 2020, 12:28 p.m. UTC | #19

On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote:
> On Tue, Oct 20 2020 at 20:07, Thomas Gleixner wrote:
> > On Tue, Oct 20 2020 at 12:18, Nitesh Narayan Lal wrote:
> >> However, IMHO we would still need a logic to prevent the devices from
> >> creating excess vectors.
> >
> > Managed interrupts are preventing exactly that by pinning the interrupts
> > and queues to one or a set of CPUs, which prevents vector exhaustion on
> > CPU hotplug.
> >
> > Non-managed, yes that is and always was a problem. One of the reasons
> > why managed interrupts exist.
> 
> But why is this only a problem for isolation? The very same problem
> exists vs. CPU hotplug and therefore hibernation.
> 
> On x86 we have at max. 204 vectors available for device interrupts per
> CPU. So assumed the only device interrupt in use is networking then any
> machine which has more than 204 network interrupts (queues, aux ...)
> active will prevent the machine from hibernation.
> 
> Aside of that it's silly to have multiple queues targeted at a single
> CPU in case of hotplug. And that's not a theoretical problem.  Some
> power management schemes shut down sockets when the utilization of a
> system is low enough, e.g. outside of working hours.

Exactly. It seems the proper way to do handle this is to disable
individual vectors rather than moving them. And that is needed for 
dynamic isolate / unisolate anyway...

> The whole point of multi-queue is to have locality so that traffic from
> a CPU goes through the CPU local queue. What's the point of having two
> or more queues on a CPU in case of hotplug?
> 
> The right answer to this is to utilize managed interrupts and have
> according logic in your network driver to handle CPU hotplug. When a CPU
> goes down, then the queue which is associated to that CPU is quiesced
> and the interrupt core shuts down the relevant interrupt instead of
> moving it to an online CPU (which causes the whole vector exhaustion
> problem on x86). When the CPU comes online again, then the interrupt is
> reenabled in the core and the driver reactivates the queue.

Aha... But it would be necessary to do that from userspace (for runtime
isolate/unisolate).

Nitesh Narayan Lal Oct. 22, 2020, 5:47 p.m. UTC | #20

On 10/20/20 10:39 AM, Nitesh Narayan Lal wrote:
> On 10/20/20 9:41 AM, Peter Zijlstra wrote:
>> On Tue, Oct 20, 2020 at 09:00:01AM -0400, Nitesh Narayan Lal wrote:
>>> On 10/20/20 3:30 AM, Peter Zijlstra wrote:
>>>> On Mon, Oct 19, 2020 at 11:00:05AM -0300, Marcelo Tosatti wrote:
>>>>>> So I think it is important to figure out what that driver really wants
>>>>>> in the nohz_full case. If it wants to retain N interrupts per CPU, and
>>>>>> only reduce the number of CPUs, the proposed interface is wrong.
>>>>> It wants N interrupts per non-isolated (AKA housekeeping) CPU.
>>>> Then the patch is wrong and the interface needs changing from @min_vecs,
>>>> @max_vecs to something that expresses the N*nr_cpus relation.
>>> Reading Marcelo's comment again I think what is really expected is 1
>>> interrupt per non-isolated (housekeeping) CPU (not N interrupts).
>> Then what is the point of them asking for N*nr_cpus when there is no
>> isolation?
>>
>> Either everybody wants 1 interrupts per CPU and we can do the clamp
>> unconditionally, in which case we should go fix this user, or they want
>> multiple per cpu and we should go fix the interface.
>>
>> It cannot be both.
> Based on my understanding I don't think this is consistent, the number
> of interrupts any driver can request varies to an extent that some
> consumer of this API even request just one interrupt for its use.
>
> This was one of the reasons why I thought of having a conditional
> restriction.
>
> But I agree there is a lack of consistency.
>

Hi Peter,

So based on the suggestions from you and Thomas, I think something like the
following should do the job within pci_alloc_irq_vectors_affinity():

+       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))
+               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"
and to an extent I agree that it does degrade the code clarity.

However, since there is a certain inconsistency in the number of vectors
that drivers request through this API IMHO we will need this, otherwise
we could cause an impact on the drivers even in setups that doesn't
have any isolated CPUs.

If you agree, I can send the next version of the patch-set.

Thomas Gleixner Oct. 22, 2020, 10:39 p.m. UTC | #21

On Thu, Oct 22 2020 at 09:28, Marcelo Tosatti wrote:
> On Wed, Oct 21, 2020 at 10:25:48PM +0200, Thomas Gleixner wrote:
>> The right answer to this is to utilize managed interrupts and have
>> according logic in your network driver to handle CPU hotplug. When a CPU
>> goes down, then the queue which is associated to that CPU is quiesced
>> and the interrupt core shuts down the relevant interrupt instead of
>> moving it to an online CPU (which causes the whole vector exhaustion
>> problem on x86). When the CPU comes online again, then the interrupt is
>> reenabled in the core and the driver reactivates the queue.
>
> Aha... But it would be necessary to do that from userspace (for runtime
> isolate/unisolate).

For anything which uses managed interrupts this is a non-problem and
userspace has absolutely no business with it.

Isolation does not shut down queues, at least not the block multi-queue
ones which are only active when I/O is issued from that isolated CPU.

So transitioning out of isolation requires no action at all.

Transitioning in or changing the housekeeping mask needs some trivial
tweak to handle the case where there is an overlap in the cpuset of a
queue (housekeeping and isolated). This is handled already for setup and
affinity changes, but of course not for runtime isolation mask changes,
but that's a trivial thing to do.

What's more interesting is how to deal with the network problem where
there is no guarantee that the "response" ends up on the same queue as
the "request" which is what the block people rely on. And that problem
is not really an interrupt affinity problem in the first place.

Thanks,

        tglx

Peter Zijlstra Oct. 23, 2020, 8:58 a.m. UTC | #22

On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:

> Hi Peter,

> 

> So based on the suggestions from you and Thomas, I think something like the

> following should do the job within pci_alloc_irq_vectors_affinity():

> 

> +       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))

> +               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);

> 

> I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"

> and to an extent I agree that it does degrade the code clarity.


It's not just code clarity; I simply don't understand it. It feels like
a band-aid that breaks thing.

At the very least it needs a ginormous (and coherent) comment that
explains:

 - the interface
 - the usage
 - this hack

> However, since there is a certain inconsistency in the number of vectors

> that drivers request through this API IMHO we will need this, otherwise

> we could cause an impact on the drivers even in setups that doesn't

> have any isolated CPUs.


So shouldn't we then fix the drivers / interface first, to get rid of
this inconsistency?

> If you agree, I can send the next version of the patch-set.


Well, it's not just me you have to convince.

Nitesh Narayan Lal Oct. 23, 2020, 1:10 p.m. UTC | #23

On 10/23/20 4:58 AM, Peter Zijlstra wrote:
> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>
>> Hi Peter,
>>
>> So based on the suggestions from you and Thomas, I think something like the
>> following should do the job within pci_alloc_irq_vectors_affinity():
>>
>> +       if (!pci_is_managed(dev) && (hk_cpus < num_online_cpus()))
>> +               max_vecs = clamp(hk_cpus, min_vecs, max_vecs);
>>
>> I do know that you didn't like the usage of "hk_cpus < num_online_cpus()"
>> and to an extent I agree that it does degrade the code clarity.
> It's not just code clarity; I simply don't understand it. It feels like
> a band-aid that breaks thing.
>
> At the very least it needs a ginormous (and coherent) comment that
> explains:
>
>  - the interface
>  - the usage
>  - this hack

That make sense.

>
>> However, since there is a certain inconsistency in the number of vectors
>> that drivers request through this API IMHO we will need this, otherwise
>> we could cause an impact on the drivers even in setups that doesn't
>> have any isolated CPUs.
> So shouldn't we then fix the drivers / interface first, to get rid of
> this inconsistency?
>

Considering we agree that excess vector is a problem that needs to be
solved across all the drivers and that you are comfortable with the other
three patches in the set. If I may suggest the following:

- We can pick those three patches for now, as that will atleast fix a
  driver that is currently impacting RT workloads. Is that a fair
  expectation?

- In the meanwhile, I will start looking into individual drivers that
  consume this API to find out if there is a co-relation that can be
  derived between the max_vecs and number of CPUs. If that exists then I
  can go ahead and tweak the API's max_vecs accordingly. However, if this
  is absolutely random then I can come up with a sane comment
  before this check that covers the list of items you suggested.

- I also want to explore the comments made by Thomas which may take
  some time.

Peter Zijlstra Oct. 23, 2020, 1:25 p.m. UTC | #24

On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
> Extend nohz_full feature set to include isolation from managed IRQS. This

So you say it's for managed-irqs, the feature is actually called
MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.

Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
fine and don't need help at at...

> is required specifically for setups that only uses nohz_full and still
> requires isolation for maintaining lower latency for the listed CPUs.
> 
> Suggested-by: Frederic Weisbecker <frederic@kernel.org>
> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> ---
>  kernel/sched/isolation.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> index 5a6ea03f9882..9df9598a9e39 100644
> --- a/kernel/sched/isolation.c
> +++ b/kernel/sched/isolation.c
> @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
>  	unsigned int flags;
>  
>  	flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
> -		HK_FLAG_MISC | HK_FLAG_KTHREAD;
> +		HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
>  
>  	return housekeeping_setup(str, flags);
>  }
> -- 
> 2.18.2
>

Frederic Weisbecker Oct. 23, 2020, 1:29 p.m. UTC | #25

On Fri, Oct 23, 2020 at 03:25:05PM +0200, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
> > Extend nohz_full feature set to include isolation from managed IRQS. This
> 
> So you say it's for managed-irqs, the feature is actually called
> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.
> 
> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
> fine and don't need help at at...
> 
> > is required specifically for setups that only uses nohz_full and still
> > requires isolation for maintaining lower latency for the listed CPUs.
> > 
> > Suggested-by: Frederic Weisbecker <frederic@kernel.org>

Ah and yes there is this tag :-p

So that's my bad, I really thought this thing was about managed IRQ.
The problem is that I can't find a single documentation about them so I'm
too clueless on that matter.

Thanks.

> > Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
> > ---
> >  kernel/sched/isolation.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
> > index 5a6ea03f9882..9df9598a9e39 100644
> > --- a/kernel/sched/isolation.c
> > +++ b/kernel/sched/isolation.c
> > @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
> >  	unsigned int flags;
> >  
> >  	flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
> > -		HK_FLAG_MISC | HK_FLAG_KTHREAD;
> > +		HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
> >  
> >  	return housekeeping_setup(str, flags);
> >  }
> > -- 
> > 2.18.2
> >

Nitesh Narayan Lal Oct. 23, 2020, 1:45 p.m. UTC | #26

On 10/23/20 9:25 AM, Peter Zijlstra wrote:
> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
>> Extend nohz_full feature set to include isolation from managed IRQS. This
> So you say it's for managed-irqs, the feature is actually called
> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.

Ah my bad! I should replace the managed IRQS with MANAGED_IRQ.
I can send another version with this fixed.

>
> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
> fine and don't need help at at...

Since the introduction of
"genirq, sched/isolation: Isolate from handling managed interrupts"

Within irq_do_set_affinity(), it is ensured that for managed intrrupts as
well, the isolated CPUs are removed from the affinity mask.

Hence, IMHO before this change managed interrupts were affecting the
isolated CPUs.

My intent of having this change is to basically allow isolation for
nohz_full CPUs even when we don't have something like isolcpus.
Does that make sense?


>
>> is required specifically for setups that only uses nohz_full and still
>> requires isolation for maintaining lower latency for the listed CPUs.
>>
>> Suggested-by: Frederic Weisbecker <frederic@kernel.org>
>> Signed-off-by: Nitesh Narayan Lal <nitesh@redhat.com>
>> ---
>>  kernel/sched/isolation.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
>> index 5a6ea03f9882..9df9598a9e39 100644
>> --- a/kernel/sched/isolation.c
>> +++ b/kernel/sched/isolation.c
>> @@ -141,7 +141,7 @@ static int __init housekeeping_nohz_full_setup(char *str)
>>  	unsigned int flags;
>>  
>>  	flags = HK_FLAG_TICK | HK_FLAG_WQ | HK_FLAG_TIMER | HK_FLAG_RCU |
>> -		HK_FLAG_MISC | HK_FLAG_KTHREAD;
>> +		HK_FLAG_MISC | HK_FLAG_KTHREAD | HK_FLAG_MANAGED_IRQ;
>>  
>>  	return housekeeping_setup(str, flags);
>>  }
>> -- 
>> 2.18.2
>>

Nitesh Narayan Lal Oct. 23, 2020, 1:57 p.m. UTC | #27

On 10/23/20 9:29 AM, Frederic Weisbecker wrote:
> On Fri, Oct 23, 2020 at 03:25:05PM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 28, 2020 at 02:35:27PM -0400, Nitesh Narayan Lal wrote:
>>> Extend nohz_full feature set to include isolation from managed IRQS. This
>> So you say it's for managed-irqs, the feature is actually called
>> MANAGED_IRQ, but, AFAICT, it does *NOT* in fact affect managed IRQs.
>>
>> Also, as per Thomas' earlier points, managed-irqs are in fact perfectly
>> fine and don't need help at at...
>>
>>> is required specifically for setups that only uses nohz_full and still
>>> requires isolation for maintaining lower latency for the listed CPUs.
>>>
>>> Suggested-by: Frederic Weisbecker <frederic@kernel.org>
> Ah and yes there is this tag :-p
>
> So that's my bad, I really thought this thing was about managed IRQ.
> The problem is that I can't find a single documentation about them so I'm
> too clueless on that matter.

I am also confused with this terminology.
So my bad for not taking care of this.

Thomas Gleixner Oct. 23, 2020, 9 p.m. UTC | #28

On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:
> On 10/23/20 4:58 AM, Peter Zijlstra wrote:
>> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:
>> So shouldn't we then fix the drivers / interface first, to get rid of
>> this inconsistency?
>>
> Considering we agree that excess vector is a problem that needs to be
> solved across all the drivers and that you are comfortable with the other
> three patches in the set. If I may suggest the following:
>
> - We can pick those three patches for now, as that will atleast fix a
>   driver that is currently impacting RT workloads. Is that a fair
>   expectation?

No. Blindly reducing the maximum vectors to the number of housekeeping
CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide
what the right number of interrupts for this situation is.

Many of these drivers need more than queue interrupts, admin, error
interrupt and some operate best with seperate RX/TX interrupts per
queue. They all can "work" with a single PCI interrupt of course, but
the price you pay is performance.

An isolated setup, which I'm familiar with, has two housekeeping
CPUs. So far I restricted the number of network queues with a module
argument to two, which allocates two management interrupts for the
device and two interrupts (RX/TX) per queue, i.e. a total of six.

Now I reduced the number of available interrupts to two according to
your hack, which makes it use one queue RX/TX combined and one
management interrupt. Guess what happens? Network performance tanks to
the points that it breaks a carefully crafted setup.

The same applies to a device which is application specific and wants one
channel including an interrupt per isolated application core. Today I
can isolate 8 out of 12 CPUs and let the device create 8 channels and
set one interrupt and channel affine to each isolated CPU. With your
hack, I get only 4 interrupts and channels. Fail!

You cannot declare that all this is perfectly fine, just because it does
not matter for your particular use case.

So without information from the driver which tells what the best number
of interrupts is with a reduced number of CPUs, this cutoff will cause
more problems than it solves. Regressions guaranteed.

Managed interrupts base their interrupt allocation and spreading on
information which is handed in by the individual driver and not on crude
assumptions. They are not imposing restrictions on the use case.

It's perfectly fine for isolated work to save a data set to disk after
computation has finished and that just works with the per-cpu I/O queue
which is otherwise completely silent. All isolated workers can do the
same in parallel without trampling on each other toes by competing for a
reduced number of queues which are affine to the housekeeper CPUs.

Unfortunately network multi-queue is substantially different from block
multi-queue (as I learned in this conversation), so the concept cannot
be applied one-to-one to networking as is. But there are certainly part
of it which can be reused.

This needs a lot more thought than just these crude hacks.

Especially under the aspect that there are talks about making isolation
runtime switchable. Are you going to rmmod/insmod the i40e network
driver to do so? That's going to work fine if you do that
reconfiguration over network...

Thanks,

        tglx

Nitesh Narayan Lal Oct. 26, 2020, 1:35 p.m. UTC | #29

On 10/23/20 5:00 PM, Thomas Gleixner wrote:
> On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:

>> On 10/23/20 4:58 AM, Peter Zijlstra wrote:

>>> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:

>>> So shouldn't we then fix the drivers / interface first, to get rid of

>>> this inconsistency?

>>>

>> Considering we agree that excess vector is a problem that needs to be

>> solved across all the drivers and that you are comfortable with the other

>> three patches in the set. If I may suggest the following:

>>

>> - We can pick those three patches for now, as that will atleast fix a

>>   driver that is currently impacting RT workloads. Is that a fair

>>   expectation?

> No. Blindly reducing the maximum vectors to the number of housekeeping

> CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide

> what the right number of interrupts for this situation is.

>

> Many of these drivers need more than queue interrupts, admin, error

> interrupt and some operate best with seperate RX/TX interrupts per

> queue. They all can "work" with a single PCI interrupt of course, but

> the price you pay is performance.

>

> An isolated setup, which I'm familiar with, has two housekeeping

> CPUs. So far I restricted the number of network queues with a module

> argument to two, which allocates two management interrupts for the

> device and two interrupts (RX/TX) per queue, i.e. a total of six.


Does it somehow take num_online_cpus() into consideration while deciding
the number of interrupts to create?

>

> Now I reduced the number of available interrupts to two according to

> your hack, which makes it use one queue RX/TX combined and one

> management interrupt. Guess what happens? Network performance tanks to

> the points that it breaks a carefully crafted setup.

>

> The same applies to a device which is application specific and wants one

> channel including an interrupt per isolated application core. Today I

> can isolate 8 out of 12 CPUs and let the device create 8 channels and

> set one interrupt and channel affine to each isolated CPU. With your

> hack, I get only 4 interrupts and channels. Fail!

>

> You cannot declare that all this is perfectly fine, just because it does

> not matter for your particular use case.


I agree that does sound wrong.

>

> So without information from the driver which tells what the best number

> of interrupts is with a reduced number of CPUs, this cutoff will cause

> more problems than it solves. Regressions guaranteed.


Indeed.
I think one commonality among the drivers at the moment is the usage of
num_online_cpus() to determine the vectors to create.

So, maybe instead of doing this kind of restrictions in a generic level
API, it will make more sense to do this on a per-device basis by replacing
the number of online CPUs with the housekeeping CPUs?

This is what I have done in the i40e patch.
But that still sounds hackish and will impact the performance.

>

> Managed interrupts base their interrupt allocation and spreading on

> information which is handed in by the individual driver and not on crude

> assumptions. They are not imposing restrictions on the use case.


Right, FWIU it is irq_do_set_affinity that prevents the spreading of
managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled,
isn't?

>

> It's perfectly fine for isolated work to save a data set to disk after

> computation has finished and that just works with the per-cpu I/O queue

> which is otherwise completely silent. All isolated workers can do the

> same in parallel without trampling on each other toes by competing for a

> reduced number of queues which are affine to the housekeeper CPUs.

>

> Unfortunately network multi-queue is substantially different from block

> multi-queue (as I learned in this conversation), so the concept cannot

> be applied one-to-one to networking as is. But there are certainly part

> of it which can be reused.


So this is one of the areas that I don't understand well yet and will
explore now.

>

> This needs a lot more thought than just these crude hacks.


Got it.
I am indeed not in the favor of pushing a solution that has a possibility
of regressing/breaking things afterward.

>

> Especially under the aspect that there are talks about making isolation

> runtime switchable. Are you going to rmmod/insmod the i40e network

> driver to do so? That's going to work fine if you do that

> reconfiguration over network...


That's an interesting point. However, for some of the drivers which uses
something like cpu_online/possible_mask while creating its affinity mask
reloading will still associate jobs to the isolated CPUs.


-- 
Thanks
Nitesh

Thomas Gleixner Oct. 26, 2020, 1:57 p.m. UTC | #30

On Mon, Oct 26 2020 at 09:35, Nitesh Narayan Lal wrote:
> On 10/23/20 5:00 PM, Thomas Gleixner wrote:

>> An isolated setup, which I'm familiar with, has two housekeeping

>> CPUs. So far I restricted the number of network queues with a module

>> argument to two, which allocates two management interrupts for the

>> device and two interrupts (RX/TX) per queue, i.e. a total of six.

>

> Does it somehow take num_online_cpus() into consideration while deciding

> the number of interrupts to create?


No, I just tell it to create two queues :)

>> So without information from the driver which tells what the best number

>> of interrupts is with a reduced number of CPUs, this cutoff will cause

>> more problems than it solves. Regressions guaranteed.

>

> Indeed.

> I think one commonality among the drivers at the moment is the usage of

> num_online_cpus() to determine the vectors to create.

>

> So, maybe instead of doing this kind of restrictions in a generic level

> API, it will make more sense to do this on a per-device basis by replacing

> the number of online CPUs with the housekeeping CPUs?

>

> This is what I have done in the i40e patch.

> But that still sounds hackish and will impact the performance.


You want an interface which allows the driver to say:

  I need N interrupts for general management and ideally M interrupts
  per queue.

This is similar to the way drivers tell the core code about their
requirements for managed interrupts for the spreading calculation.

>> Managed interrupts base their interrupt allocation and spreading on

>> information which is handed in by the individual driver and not on crude

>> assumptions. They are not imposing restrictions on the use case.

>

> Right, FWIU it is irq_do_set_affinity that prevents the spreading of

> managed interrupts to isolated CPUs if HK_FLAG_MANAGED_IRQ is enabled,

> isn't?


No. Spreading takes possible CPUs into account. HK_FLAG_MANAGED_IRQ does
not influence spreading at all.

It only handles the case that an interrupt is affine to more than one
CPUs and the resulting affinity mask spawns both housekeeping and
isolated CPUs. It then steers the interrupt to the housekeeping CPUs (as
long as there is one online).

Thanks,

        tglx

Marcelo Tosatti Oct. 26, 2020, 5:30 p.m. UTC | #31

On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
> On Fri, Oct 23 2020 at 09:10, Nitesh Narayan Lal wrote:

> > On 10/23/20 4:58 AM, Peter Zijlstra wrote:

> >> On Thu, Oct 22, 2020 at 01:47:14PM -0400, Nitesh Narayan Lal wrote:

> >> So shouldn't we then fix the drivers / interface first, to get rid of

> >> this inconsistency?

> >>

> > Considering we agree that excess vector is a problem that needs to be

> > solved across all the drivers and that you are comfortable with the other

> > three patches in the set. If I may suggest the following:

> >

> > - We can pick those three patches for now, as that will atleast fix a

> >   driver that is currently impacting RT workloads. Is that a fair

> >   expectation?

> 

> No. Blindly reducing the maximum vectors to the number of housekeeping

> CPUs is patently wrong. The PCI core _cannot_ just nilly willy decide

> what the right number of interrupts for this situation is.

> 

> Many of these drivers need more than queue interrupts, admin, error

> interrupt and some operate best with seperate RX/TX interrupts per

> queue. They all can "work" with a single PCI interrupt of course, but

> the price you pay is performance.

> 

> An isolated setup, which I'm familiar with, has two housekeeping

> CPUs. So far I restricted the number of network queues with a module

> argument to two, which allocates two management interrupts for the

> device and two interrupts (RX/TX) per queue, i.e. a total of six.

> 

> Now I reduced the number of available interrupts to two according to

> your hack, which makes it use one queue RX/TX combined and one

> management interrupt. Guess what happens? Network performance tanks to

> the points that it breaks a carefully crafted setup.

> 

> The same applies to a device which is application specific and wants one

> channel including an interrupt per isolated application core. Today I

> can isolate 8 out of 12 CPUs and let the device create 8 channels and

> set one interrupt and channel affine to each isolated CPU. With your

> hack, I get only 4 interrupts and channels. Fail!


Good point.

> You cannot declare that all this is perfectly fine, just because it does

> not matter for your particular use case.

> 

> So without information from the driver which tells what the best number

> of interrupts is with a reduced number of CPUs, this cutoff will cause

> more problems than it solves. Regressions guaranteed.


One might want to move from one interrupt per isolated app core
to zero, or vice versa. It seems that "best number of interrupts 
is with reduced number of CPUs" information, is therefore in userspace, 
not in driver...

No?

> Managed interrupts base their interrupt allocation and spreading on

> information which is handed in by the individual driver and not on crude

> assumptions. They are not imposing restrictions on the use case.

> 

> It's perfectly fine for isolated work to save a data set to disk after

> computation has finished and that just works with the per-cpu I/O queue

> which is otherwise completely silent. 


Userspace could only change the mask of interrupts which are not 
triggered by requests from the local CPU (admin, error, mgmt, etc),
to avoid the vector exhaustion problem.

However, there is no explicit way for userspace to know that, as far as
i know.

 130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1
 131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2
 132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3
 133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4
 134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5
 135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6
 136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7
 137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8


Can that be retrieved from PCI-MSI information, or drivers
have to inform this? 

> All isolated workers can do the

> same in parallel without trampling on each other toes by competing for a

> reduced number of queues which are affine to the housekeeper CPUs.

> 

> Unfortunately network multi-queue is substantially different from block

> multi-queue (as I learned in this conversation), so the concept cannot

> be applied one-to-one to networking as is. But there are certainly part

> of it which can be reused.

> 

> This needs a lot more thought than just these crude hacks.

> 

> Especially under the aspect that there are talks about making isolation

> runtime switchable. Are you going to rmmod/insmod the i40e network

> driver to do so? That's going to work fine if you do that

> reconfiguration over network...

> 

> Thanks,

> 

>         tglx

Thomas Gleixner Oct. 26, 2020, 7 p.m. UTC | #32

On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote:
> On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
>> So without information from the driver which tells what the best number
>> of interrupts is with a reduced number of CPUs, this cutoff will cause
>> more problems than it solves. Regressions guaranteed.
>
> One might want to move from one interrupt per isolated app core
> to zero, or vice versa. It seems that "best number of interrupts 
> is with reduced number of CPUs" information, is therefore in userspace, 
> not in driver...

How does userspace know about the driver internals? Number of management
interrupts, optimal number of interrupts per queue?

>> Managed interrupts base their interrupt allocation and spreading on
>> information which is handed in by the individual driver and not on crude
>> assumptions. They are not imposing restrictions on the use case.
>> 
>> It's perfectly fine for isolated work to save a data set to disk after
>> computation has finished and that just works with the per-cpu I/O queue
>> which is otherwise completely silent. 
>
> Userspace could only change the mask of interrupts which are not 
> triggered by requests from the local CPU (admin, error, mgmt, etc),
> to avoid the vector exhaustion problem.
>
> However, there is no explicit way for userspace to know that, as far as
> i know.
>
>  130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1
>  131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2
>  132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3
>  133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4
>  134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5
>  135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6
>  136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7
>  137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8
>
> Can that be retrieved from PCI-MSI information, or drivers
> have to inform this?

The driver should use a different name for the admin queues.

Thanks,

        tglx

Marcelo Tosatti Oct. 26, 2020, 7:11 p.m. UTC | #33

On Mon, Oct 26, 2020 at 08:00:39PM +0100, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote:

> > On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:

> >> So without information from the driver which tells what the best number

> >> of interrupts is with a reduced number of CPUs, this cutoff will cause

> >> more problems than it solves. Regressions guaranteed.

> >

> > One might want to move from one interrupt per isolated app core

> > to zero, or vice versa. It seems that "best number of interrupts 

> > is with reduced number of CPUs" information, is therefore in userspace, 

> > not in driver...

> 

> How does userspace know about the driver internals? Number of management

> interrupts, optimal number of interrupts per queue?

> 

> >> Managed interrupts base their interrupt allocation and spreading on

> >> information which is handed in by the individual driver and not on crude

> >> assumptions. They are not imposing restrictions on the use case.

> >> 

> >> It's perfectly fine for isolated work to save a data set to disk after

> >> computation has finished and that just works with the per-cpu I/O queue

> >> which is otherwise completely silent. 

> >

> > Userspace could only change the mask of interrupts which are not 

> > triggered by requests from the local CPU (admin, error, mgmt, etc),

> > to avoid the vector exhaustion problem.

> >

> > However, there is no explicit way for userspace to know that, as far as

> > i know.

> >

> >  130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1

> >  131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2

> >  132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3

> >  133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4

> >  134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5

> >  135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6

> >  136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7

> >  137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8

> >

> > Can that be retrieved from PCI-MSI information, or drivers

> > have to inform this?

> 

> The driver should use a different name for the admin queues.


Works for me.

Sounds more like a heuristic which can break, so documenting this 
as an "interface" seems appropriate.

Jacob Keller Oct. 26, 2020, 7:21 p.m. UTC | #34

On 10/26/2020 12:00 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:30, Marcelo Tosatti wrote:
>> On Fri, Oct 23, 2020 at 11:00:52PM +0200, Thomas Gleixner wrote:
>>> So without information from the driver which tells what the best number
>>> of interrupts is with a reduced number of CPUs, this cutoff will cause
>>> more problems than it solves. Regressions guaranteed.
>>
>> One might want to move from one interrupt per isolated app core
>> to zero, or vice versa. It seems that "best number of interrupts 
>> is with reduced number of CPUs" information, is therefore in userspace, 
>> not in driver...
> 
> How does userspace know about the driver internals? Number of management
> interrupts, optimal number of interrupts per queue?
> 

I guess this is the problem solved in part by the queue management work
that would make queues a thing that userspace is aware of.

Are there drivers which use more than one interrupt per queue? I know
drivers have multiple management interrupts.. and I guess some drivers
do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
have multiple queues for one interrupt .. I'm not sure how a single
queue with multiple interrupts would work though.

>>> Managed interrupts base their interrupt allocation and spreading on
>>> information which is handed in by the individual driver and not on crude
>>> assumptions. They are not imposing restrictions on the use case.
>>>
>>> It's perfectly fine for isolated work to save a data set to disk after
>>> computation has finished and that just works with the per-cpu I/O queue
>>> which is otherwise completely silent. 
>>
>> Userspace could only change the mask of interrupts which are not 
>> triggered by requests from the local CPU (admin, error, mgmt, etc),
>> to avoid the vector exhaustion problem.
>>
>> However, there is no explicit way for userspace to know that, as far as
>> i know.
>>
>>  130:      34845          0          0          0          0          0          0          0  IR-PCI-MSI 33554433-edge      nvme0q1
>>  131:          0      27062          0          0          0          0          0          0  IR-PCI-MSI 33554434-edge      nvme0q2
>>  132:          0          0      24393          0          0          0          0          0  IR-PCI-MSI 33554435-edge      nvme0q3
>>  133:          0          0          0      24313          0          0          0          0  IR-PCI-MSI 33554436-edge      nvme0q4
>>  134:          0          0          0          0      20608          0          0          0  IR-PCI-MSI 33554437-edge      nvme0q5
>>  135:          0          0          0          0          0      22163          0          0  IR-PCI-MSI 33554438-edge      nvme0q6
>>  136:          0          0          0          0          0          0      23020          0  IR-PCI-MSI 33554439-edge      nvme0q7
>>  137:          0          0          0          0          0          0          0      24285  IR-PCI-MSI 33554440-edge      nvme0q8
>>
>> Can that be retrieved from PCI-MSI information, or drivers
>> have to inform this?
> 
> The driver should use a different name for the admin queues.
> 
> Thanks,
> 
>         tglx
>

Thomas Gleixner Oct. 26, 2020, 8:11 p.m. UTC | #35

On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
> On 10/26/2020 12:00 PM, Thomas Gleixner wrote:
>> How does userspace know about the driver internals? Number of management
>> interrupts, optimal number of interrupts per queue?
>
> I guess this is the problem solved in part by the queue management work
> that would make queues a thing that userspace is aware of.
>
> Are there drivers which use more than one interrupt per queue? I know
> drivers have multiple management interrupts.. and I guess some drivers
> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
> have multiple queues for one interrupt .. I'm not sure how a single
> queue with multiple interrupts would work though.

For block there is always one interrupt per queue. Some Network drivers
seem to have seperate RX and TX interrupts per queue.

Thanks,

        tglx

Jacob Keller Oct. 26, 2020, 9:11 p.m. UTC | #36

On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
>> On 10/26/2020 12:00 PM, Thomas Gleixner wrote:
>>> How does userspace know about the driver internals? Number of management
>>> interrupts, optimal number of interrupts per queue?
>>
>> I guess this is the problem solved in part by the queue management work
>> that would make queues a thing that userspace is aware of.
>>
>> Are there drivers which use more than one interrupt per queue? I know
>> drivers have multiple management interrupts.. and I guess some drivers
>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>> have multiple queues for one interrupt .. I'm not sure how a single
>> queue with multiple interrupts would work though.
> 
> For block there is always one interrupt per queue. Some Network drivers
> seem to have seperate RX and TX interrupts per queue.
> 
> Thanks,
> 
>         tglx
> 

That's true when thinking of Tx and Rx as a single queue. Another way to
think about it is "one rx queue" and "one tx queue" each with their own
interrupt...

Even if there are devices which force there to be exactly queue pairs,
you could still think of them as separate entities?

Hmm.

Thomas Gleixner Oct. 26, 2020, 9:50 p.m. UTC | #37

On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
>>> Are there drivers which use more than one interrupt per queue? I know
>>> drivers have multiple management interrupts.. and I guess some drivers
>>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>>> have multiple queues for one interrupt .. I'm not sure how a single
>>> queue with multiple interrupts would work though.
>> 
>> For block there is always one interrupt per queue. Some Network drivers
>> seem to have seperate RX and TX interrupts per queue.
> That's true when thinking of Tx and Rx as a single queue. Another way to
> think about it is "one rx queue" and "one tx queue" each with their own
> interrupt...
>
> Even if there are devices which force there to be exactly queue pairs,
> you could still think of them as separate entities?

Interesting thought.

But as Jakub explained networking queues are fundamentally different
from block queues on the RX side. For block the request issued on queue
X will raise the complete interrupt on queue X.

For networking the TX side will raise the TX interrupt on the queue on
which the packet was queued obviously or should I say hopefully. :)

But incoming packets will be directed to some receive queue based on a
hash or whatever crystallball logic the firmware decided to implement.

Which makes this not really suitable for the managed interrupt and
spreading approach which is used by block-mq. Hrm...

But I still think that for curing that isolation stuff we want at least
some information from the driver. Alternative solution would be to grant
the allocation of interrupts and queues and have some sysfs knob to shut
down queues at runtime. If that shutdown results in releasing the queue
interrupt (via free_irq()) then the vector exhaustion problem goes away.

Needs more thought and information (for network oblivious folks like
me).

Thanks,

        tglx

Jakub Kicinski Oct. 26, 2020, 10:13 p.m. UTC | #38

On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:

> > On 10/26/2020 1:11 PM, Thomas Gleixner wrote:  

> >> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:  

> >>> Are there drivers which use more than one interrupt per queue? I know

> >>> drivers have multiple management interrupts.. and I guess some drivers

> >>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to

> >>> have multiple queues for one interrupt .. I'm not sure how a single

> >>> queue with multiple interrupts would work though.  

> >> 

> >> For block there is always one interrupt per queue. Some Network drivers

> >> seem to have seperate RX and TX interrupts per queue.  

> > That's true when thinking of Tx and Rx as a single queue. Another way to

> > think about it is "one rx queue" and "one tx queue" each with their own

> > interrupt...

> >

> > Even if there are devices which force there to be exactly queue pairs,

> > you could still think of them as separate entities?  

> 

> Interesting thought.

> 

> But as Jakub explained networking queues are fundamentally different

> from block queues on the RX side. For block the request issued on queue

> X will raise the complete interrupt on queue X.

> 

> For networking the TX side will raise the TX interrupt on the queue on

> which the packet was queued obviously or should I say hopefully. :)

> 

> But incoming packets will be directed to some receive queue based on a

> hash or whatever crystallball logic the firmware decided to implement.

> 

> Which makes this not really suitable for the managed interrupt and

> spreading approach which is used by block-mq. Hrm...

> 

> But I still think that for curing that isolation stuff we want at least

> some information from the driver. Alternative solution would be to grant

> the allocation of interrupts and queues and have some sysfs knob to shut

> down queues at runtime. If that shutdown results in releasing the queue

> interrupt (via free_irq()) then the vector exhaustion problem goes away.

> 

> Needs more thought and information (for network oblivious folks like

> me).

One piece of information that may be useful is that even tho the RX
packets may be spread semi-randomly the user space can still control
which queues are included in the mechanism. There is an indirection
table in the HW which allows to weigh queues differently, or exclude
selected queues from the spreading. Other mechanisms exist to filter
flows onto specific queues.

IOW just because a core has an queue/interrupt does not mean that
interrupt will ever fire, provided its excluded from RSS.

Another piece is that by default we suggest drivers allocate 8 RX
queues, and online_cpus TX queues. The number of queues can be
independently controlled via ethtool -L. Drivers which can't support
separate queues will default to online_cpus queue pairs, and let
ethtool -L only set the "combined" parameter.

There are drivers which always allocate online_cpus interrupts, 
and then some of them will go unused if #qs < #cpus.

My unpopular opinion is that for networking devices all the heuristics
we may come up with are going to be a dead end. We need an explicit API
to allow users placing queues on cores, and use managed IRQs for data
queues. (I'm assuming that managed IRQs will let us reliably map a MSI-X
vector to a core :))

Nitesh Narayan Lal Oct. 26, 2020, 10:22 p.m. UTC | #39

On 10/26/20 5:50 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
>> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:
>>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:
>>>> Are there drivers which use more than one interrupt per queue? I know
>>>> drivers have multiple management interrupts.. and I guess some drivers
>>>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>>>> have multiple queues for one interrupt .. I'm not sure how a single
>>>> queue with multiple interrupts would work though.
>>> For block there is always one interrupt per queue. Some Network drivers
>>> seem to have seperate RX and TX interrupts per queue.
>> That's true when thinking of Tx and Rx as a single queue. Another way to
>> think about it is "one rx queue" and "one tx queue" each with their own
>> interrupt...
>>
>> Even if there are devices which force there to be exactly queue pairs,
>> you could still think of them as separate entities?
> Interesting thought.
>
> But as Jakub explained networking queues are fundamentally different
> from block queues on the RX side. For block the request issued on queue
> X will raise the complete interrupt on queue X.
>
> For networking the TX side will raise the TX interrupt on the queue on
> which the packet was queued obviously or should I say hopefully. :)

This is my impression as well.

> But incoming packets will be directed to some receive queue based on a
> hash or whatever crystallball logic the firmware decided to implement.
>
> Which makes this not really suitable for the managed interrupt and
> spreading approach which is used by block-mq. Hrm...
>
> But I still think that for curing that isolation stuff we want at least
> some information from the driver. Alternative solution would be to grant
> the allocation of interrupts and queues and have some sysfs knob to shut
> down queues at runtime. If that shutdown results in releasing the queue
> interrupt (via free_irq()) then the vector exhaustion problem goes away.

I think this is close to what I and Marcelo were discussing earlier today
privately.

I don't think there is currently a way to control the enablement/disablement of
interrupts from the userspace.

I think in terms of the idea we need something similar to what i40e does,
that is shutdown all IRQs when CPU is suspended and restores the interrupt
schema when the CPU is back online.

The two key difference will be that this API needs to be generic and also
needs to be exposed to the userspace through something like sysfs as you
have mentioned.

Thomas Gleixner Oct. 26, 2020, 10:46 p.m. UTC | #40

On Mon, Oct 26 2020 at 15:13, Jakub Kicinski wrote:
> On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote:

>> But I still think that for curing that isolation stuff we want at least

>> some information from the driver. Alternative solution would be to grant

>> the allocation of interrupts and queues and have some sysfs knob to shut

>> down queues at runtime. If that shutdown results in releasing the queue

>> interrupt (via free_irq()) then the vector exhaustion problem goes away.

>> 

>> Needs more thought and information (for network oblivious folks like

>> me).

>

> One piece of information that may be useful is that even tho the RX

> packets may be spread semi-randomly the user space can still control

> which queues are included in the mechanism. There is an indirection

> table in the HW which allows to weigh queues differently, or exclude

> selected queues from the spreading. Other mechanisms exist to filter

> flows onto specific queues.

>

> IOW just because a core has an queue/interrupt does not mean that

> interrupt will ever fire, provided its excluded from RSS.

>

> Another piece is that by default we suggest drivers allocate 8 RX

> queues, and online_cpus TX queues. The number of queues can be

> independently controlled via ethtool -L. Drivers which can't support

> separate queues will default to online_cpus queue pairs, and let

> ethtool -L only set the "combined" parameter.

>

> There are drivers which always allocate online_cpus interrupts, 

> and then some of them will go unused if #qs < #cpus.


Thanks for the enlightment.

> My unpopular opinion is that for networking devices all the heuristics

> we may come up with are going to be a dead end.


I agree. Heuristics suck.

> We need an explicit API to allow users placing queues on cores, and

> use managed IRQs for data queues. (I'm assuming that managed IRQs will

> let us reliably map a MSI-X vector to a core :))


Yes, they allow you to do that. That will need some tweaks to theway
they work today (coming from the strict block mq semantics). You also
need to be aware that managed irqs have also strict semantics vs. CPU
hotplug. If the last CPU in the managed affinity set goes down then the
interrupt is shut down by the irq core which means that you need to
quiesce the associated queue before that happens. When the first CPU
comes online again the interrupt is reenabled, so the queue should be
able to handle it or has ensured that the device does not raise one
before it is able to do so.

Thanks,

        tglx

Thomas Gleixner Oct. 26, 2020, 10:49 p.m. UTC | #41

On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote:
> On 10/26/20 5:50 PM, Thomas Gleixner wrote:

>> But I still think that for curing that isolation stuff we want at least

>> some information from the driver. Alternative solution would be to grant

>> the allocation of interrupts and queues and have some sysfs knob to shut

>> down queues at runtime. If that shutdown results in releasing the queue

>> interrupt (via free_irq()) then the vector exhaustion problem goes away.

>

> I think this is close to what I and Marcelo were discussing earlier today

> privately.

>

> I don't think there is currently a way to control the enablement/disablement of

> interrupts from the userspace.


You cannot just disable the interrupt. You need to make sure that the
associated queue is shutdown or quiesced _before_ the interrupt is shut
down.

Thanks,

        tglx

Jacob Keller Oct. 26, 2020, 10:52 p.m. UTC | #42

On 10/26/2020 3:13 PM, Jakub Kicinski wrote:
> On Mon, 26 Oct 2020 22:50:45 +0100 Thomas Gleixner wrote:
>> On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:
>>> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:  
>>>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:  
>>>>> Are there drivers which use more than one interrupt per queue? I know
>>>>> drivers have multiple management interrupts.. and I guess some drivers
>>>>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to
>>>>> have multiple queues for one interrupt .. I'm not sure how a single
>>>>> queue with multiple interrupts would work though.  
>>>>
>>>> For block there is always one interrupt per queue. Some Network drivers
>>>> seem to have seperate RX and TX interrupts per queue.  
>>> That's true when thinking of Tx and Rx as a single queue. Another way to
>>> think about it is "one rx queue" and "one tx queue" each with their own
>>> interrupt...
>>>
>>> Even if there are devices which force there to be exactly queue pairs,
>>> you could still think of them as separate entities?  
>>
>> Interesting thought.
>>
>> But as Jakub explained networking queues are fundamentally different
>> from block queues on the RX side. For block the request issued on queue
>> X will raise the complete interrupt on queue X.
>>
>> For networking the TX side will raise the TX interrupt on the queue on
>> which the packet was queued obviously or should I say hopefully. :)
>>
>> But incoming packets will be directed to some receive queue based on a
>> hash or whatever crystallball logic the firmware decided to implement.
>>
>> Which makes this not really suitable for the managed interrupt and
>> spreading approach which is used by block-mq. Hrm...
>>
>> But I still think that for curing that isolation stuff we want at least
>> some information from the driver. Alternative solution would be to grant
>> the allocation of interrupts and queues and have some sysfs knob to shut
>> down queues at runtime. If that shutdown results in releasing the queue
>> interrupt (via free_irq()) then the vector exhaustion problem goes away.
>>
>> Needs more thought and information (for network oblivious folks like
>> me).
> 
> One piece of information that may be useful is that even tho the RX
> packets may be spread semi-randomly the user space can still control
> which queues are included in the mechanism. There is an indirection
> table in the HW which allows to weigh queues differently, or exclude
> selected queues from the spreading. Other mechanisms exist to filter
> flows onto specific queues.
> 
> IOW just because a core has an queue/interrupt does not mean that
> interrupt will ever fire, provided its excluded from RSS.
> 
> Another piece is that by default we suggest drivers allocate 8 RX
> queues, and online_cpus TX queues. The number of queues can be
> independently controlled via ethtool -L. Drivers which can't support
> separate queues will default to online_cpus queue pairs, and let
> ethtool -L only set the "combined" parameter.
> 

I know the Intel drivers usually have defaulted to trying to maintain
queue pairs. I do not believe this is technically a HW restriction, but
it is heavily built into the way the drivers work today.

> There are drivers which always allocate online_cpus interrupts, 
> and then some of them will go unused if #qs < #cpus.
> 
> 

Right.

> My unpopular opinion is that for networking devices all the heuristics
> we may come up with are going to be a dead end. We need an explicit API
> to allow users placing queues on cores, and use managed IRQs for data
> queues. (I'm assuming that managed IRQs will let us reliably map a MSI-X
> vector to a core :))
> 

I don't think it is that unpopular... This is the direction I'd like to
see us go as well.

Jacob Keller Oct. 26, 2020, 11:08 p.m. UTC | #43

On 10/26/2020 3:49 PM, Thomas Gleixner wrote:
> On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote:
>> On 10/26/20 5:50 PM, Thomas Gleixner wrote:
>>> But I still think that for curing that isolation stuff we want at least
>>> some information from the driver. Alternative solution would be to grant
>>> the allocation of interrupts and queues and have some sysfs knob to shut
>>> down queues at runtime. If that shutdown results in releasing the queue
>>> interrupt (via free_irq()) then the vector exhaustion problem goes away.
>>
>> I think this is close to what I and Marcelo were discussing earlier today
>> privately.
>>
>> I don't think there is currently a way to control the enablement/disablement of
>> interrupts from the userspace.
> 
> You cannot just disable the interrupt. You need to make sure that the
> associated queue is shutdown or quiesced _before_ the interrupt is shut
> down.
> 
> Thanks,
> 
>         tglx
> 

Could this be handled with a callback to the driver/hw? I know Intel HW
should support this type of quiesce/shutdown.

Thanks,
Jake

Marcelo Tosatti Oct. 27, 2020, 11:47 a.m. UTC | #44

On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote:
> 

> On 10/26/20 5:50 PM, Thomas Gleixner wrote:

> > On Mon, Oct 26 2020 at 14:11, Jacob Keller wrote:

> >> On 10/26/2020 1:11 PM, Thomas Gleixner wrote:

> >>> On Mon, Oct 26 2020 at 12:21, Jacob Keller wrote:

> >>>> Are there drivers which use more than one interrupt per queue? I know

> >>>> drivers have multiple management interrupts.. and I guess some drivers

> >>>> do combined 1 interrupt per pair of Tx/Rx..  It's also plausible to to

> >>>> have multiple queues for one interrupt .. I'm not sure how a single

> >>>> queue with multiple interrupts would work though.

> >>> For block there is always one interrupt per queue. Some Network drivers

> >>> seem to have seperate RX and TX interrupts per queue.

> >> That's true when thinking of Tx and Rx as a single queue. Another way to

> >> think about it is "one rx queue" and "one tx queue" each with their own

> >> interrupt...

> >>

> >> Even if there are devices which force there to be exactly queue pairs,

> >> you could still think of them as separate entities?

> > Interesting thought.

> >

> > But as Jakub explained networking queues are fundamentally different

> > from block queues on the RX side. For block the request issued on queue

> > X will raise the complete interrupt on queue X.

> >

> > For networking the TX side will raise the TX interrupt on the queue on

> > which the packet was queued obviously or should I say hopefully. :)

> 

> This is my impression as well.

> 

> > But incoming packets will be directed to some receive queue based on a

> > hash or whatever crystallball logic the firmware decided to implement.

> >

> > Which makes this not really suitable for the managed interrupt and

> > spreading approach which is used by block-mq. Hrm...

> >

> > But I still think that for curing that isolation stuff we want at least

> > some information from the driver. Alternative solution would be to grant

> > the allocation of interrupts and queues and have some sysfs knob to shut

> > down queues at runtime. If that shutdown results in releasing the queue

> > interrupt (via free_irq()) then the vector exhaustion problem goes away.

> 

> I think this is close to what I and Marcelo were discussing earlier today

> privately.

> 

> I don't think there is currently a way to control the enablement/disablement of

> interrupts from the userspace.

As long as the interrupt obeys the "trigger when request has been
performed by local CPU" rule (#1) (for MSI type interrupts, where driver allocates
one I/O interrupt per CPU), don't see a need for the interface.

For other types of interrupts, interrupt controller should be programmed
to not include the isolated CPU on its "destination CPU list".

About the block VS network discussion, what we are trying to do at skb
level (Paolo Abeni CC'ed, author of the suggestion) is to use RPS to
avoid skbs from being queued to a CPU (on RX), and to queue skbs
on housekeeping CPUs for processing (TX).

However, if per-CPU interrupts are not disabled, then the (for example)
network device is free to include the CPU in its list of destinations.
Which would require one to say, configure RPS (or whatever mechanism
is distributing interrupts).

Hum, it would feel safer (rather than trust the #1 rule to be valid
in all cases) to ask the driver to disable the interrupt (after shutting
down queue) for that particular CPU.

BTW, Thomas, software is free to configure a particular MSI-X interrupt
to point to any CPU:

10.11 MESSAGE SIGNALLED INTERRUPTS
The PCI Local Bus Specification, Rev 2.2 (www.pcisig.com) introduces the concept of message signalled interrupts.
As the specification indicates:
“Message signalled interrupts (MSI) is an optional feature that enables PCI devices to request
service by writing a system-specified message to a system-specified address (PCI DWORD memory
write transaction). The transaction address specifies the message destination while the transaction
data specifies the message. System software is expected to initialize the message destination and
message during device configuration, allocating one or more non-shared messages to each MSI
capable function.”

Fields in the Message Address Register are as follows:
1. Bits 31-20 — These bits contain a fixed value for interrupt messages (0FEEH). This value locates interrupts at
the 1-MByte area with a base address of 4G – 18M. All accesses to this region are directed as interrupt
messages. Care must to be taken to ensure that no other device claims the region as I/O space.
2. Destination ID — This field contains an 8-bit destination ID. It identifies the message’s target processor(s).
The destination ID corresponds to bits 63:56 of the I/O APIC Redirection Table Entry if the IOAPIC is used to
dispatch the interrupt to the processor(s).

---

So taking the example where computation happens while isolated and later
stored via block interface, aren't we restricting the usage scenarios
by enforcing the "per-CPU queue has interrupt pointing to owner CPU" rule?

> I think in terms of the idea we need something similar to what i40e does,

> that is shutdown all IRQs when CPU is suspended and restores the interrupt

> schema when the CPU is back online.

> 

> The two key difference will be that this API needs to be generic and also

> needs to be exposed to the userspace through something like sysfs as you

> have mentioned.

> 

> -- 

> Thanks

> Nitesh

>

Thomas Gleixner Oct. 27, 2020, 2:28 p.m. UTC | #45

On Mon, Oct 26 2020 at 16:08, Jacob Keller wrote:
> On 10/26/2020 3:49 PM, Thomas Gleixner wrote:
>> On Mon, Oct 26 2020 at 18:22, Nitesh Narayan Lal wrote:
>>> I don't think there is currently a way to control the enablement/disablement of
>>> interrupts from the userspace.
>> 
>> You cannot just disable the interrupt. You need to make sure that the
>> associated queue is shutdown or quiesced _before_ the interrupt is shut
>> down.
>
> Could this be handled with a callback to the driver/hw? I know Intel HW
> should support this type of quiesce/shutdown.

We can't have a callback from the interrupt shutdown code as you have to
wait for the queue to drain packets in flight. Something like this

     mark queue as going down (no more tx queueing)
     tell hardware not to route RX packets to it
     consume pending RX
     wait for already queued TX packets to be sent

Look what the block people did. They have a common multi-instance
hotplug state and they register each context (queue) as an instance. The
hotplug core invokes the corresponding callbacks when bringing a CPU up
or when shutting it down.

Thanks,

        tglx

Thomas Gleixner Oct. 27, 2020, 2:43 p.m. UTC | #46

On Tue, Oct 27 2020 at 08:47, Marcelo Tosatti wrote:
> On Mon, Oct 26, 2020 at 06:22:29PM -0400, Nitesh Narayan Lal wrote:
> However, if per-CPU interrupts are not disabled, then the (for example)
> network device is free to include the CPU in its list of destinations.
> Which would require one to say, configure RPS (or whatever mechanism
> is distributing interrupts).

And why is that a problem? If that's possible then you can prevent
getting RX interrupts already today.

> Hum, it would feel safer (rather than trust the #1 rule to be valid
> in all cases) to ask the driver to disable the interrupt (after shutting
> down queue) for that particular CPU.
>
> BTW, Thomas, software is free to configure a particular MSI-X interrupt
> to point to any CPU:
>
> 10.11 MESSAGE SIGNALLED INTERRUPTS

I know how MSI works :)

> So taking the example where computation happens while isolated and later
> stored via block interface, aren't we restricting the usage scenarios
> by enforcing the "per-CPU queue has interrupt pointing to owner CPU"
> rule?

No. For block this is the ideal configuration (think locality) and it
prevents vector exhaustion. If you make these interrupts freely routable
then you bring back the vector exhaustion problem right away.

Now we already established that networking has different requirements,
so you have to come up with a different solution for it which allows to
work for all use cases.

Thanks,

        tglx

[v4,0/4] isolation: limit msix vectors to housekeeping CPUs

Message

Comments