Message ID: 20231003044403.1974628-1-apatel@ventanamicro.com
Series: Linux RISC-V AIA Support
Hey, On Tue, Oct 03, 2023 at 10:13:55AM +0530, Anup Patel wrote: > We add DT bindings document for the RISC-V incoming MSI controller > (IMSIC) defined by the RISC-V advanced interrupt architecture (AIA) > specification. > > Signed-off-by: Anup Patel <apatel@ventanamicro.com> > Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Just FYI, since they'll reply to this themselves, but some of the Microchip folks have run into problems with sparse hart indexes while trying to use the imsic binding to describe some configurations they have. I think there were also so problems with how to describe to a linux guest which file to use, when the first hart available to the guest does not use the first file. They'll do a better job of describing their problems than I will, so I shall leave it to them! Cheers, Conor. > Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> > --- > .../interrupt-controller/riscv,imsics.yaml | 172 ++++++++++++++++++ > 1 file changed, 172 insertions(+) > create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml > > diff --git a/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml b/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml > new file mode 100644 > index 000000000000..84976f17a4a1 > --- /dev/null > +++ b/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml > @@ -0,0 +1,172 @@ > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) > +%YAML 1.2 > +--- > +$id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# > +$schema: http://devicetree.org/meta-schemas/core.yaml# > + > +title: RISC-V Incoming MSI Controller (IMSIC) > + > +maintainers: > + - Anup Patel <anup@brainfault.org> > + > +description: | > + The RISC-V advanced interrupt architecture (AIA) defines a per-CPU incoming > + MSI controller (IMSIC) for handling MSIs in a RISC-V platform. The RISC-V > + AIA specification can be found at https://github.com/riscv/riscv-aia. > + > + The IMSIC is a per-CPU (or per-HART) device with separate interrupt file > + for each privilege level (machine or supervisor). The configuration of > + a IMSIC interrupt file is done using AIA CSRs and it also has a 4KB MMIO > + space to receive MSIs from devices. Each IMSIC interrupt file supports a > + fixed number of interrupt identities (to distinguish MSIs from devices) > + which is same for given privilege level across CPUs (or HARTs). > + > + The device tree of a RISC-V platform will have one IMSIC device tree node > + for each privilege level (machine or supervisor) which collectively describe > + IMSIC interrupt files at that privilege level across CPUs (or HARTs). > + > + The arrangement of IMSIC interrupt files in MMIO space of a RISC-V platform > + follows a particular scheme defined by the RISC-V AIA specification. A IMSIC > + group is a set of IMSIC interrupt files co-located in MMIO space and we can > + have multiple IMSIC groups (i.e. clusters, sockets, chiplets, etc) in a > + RISC-V platform. The MSI target address of a IMSIC interrupt file at given > + privilege level (machine or supervisor) encodes group index, HART index, > + and guest index (shown below). 
> + > + XLEN-1 > (HART Index MSB) 12 0 > + | | | | > + ------------------------------------------------------------- > + |xxxxxx|Group Index|xxxxxxxxxxx|HART Index|Guest Index| 0 | > + ------------------------------------------------------------- > + > +allOf: > + - $ref: /schemas/interrupt-controller.yaml# > + - $ref: /schemas/interrupt-controller/msi-controller.yaml# > + > +properties: > + compatible: > + items: > + - enum: > + - qemu,imsics > + - const: riscv,imsics > + > + reg: > + minItems: 1 > + maxItems: 16384 > + description: > + Base address of each IMSIC group. > + > + interrupt-controller: true > + > + "#interrupt-cells": > + const: 0 > + > + msi-controller: true > + > + "#msi-cells": > + const: 0 > + > + interrupts-extended: > + minItems: 1 > + maxItems: 16384 > + description: > + This property represents the set of CPUs (or HARTs) for which given > + device tree node describes the IMSIC interrupt files. Each node pointed > + to should be a riscv,cpu-intc node, which has a CPU node (i.e. RISC-V > + HART) as parent. > + > + riscv,num-ids: > + $ref: /schemas/types.yaml#/definitions/uint32 > + minimum: 63 > + maximum: 2047 > + description: > + Number of interrupt identities supported by IMSIC interrupt file. > + > + riscv,num-guest-ids: > + $ref: /schemas/types.yaml#/definitions/uint32 > + minimum: 63 > + maximum: 2047 > + description: > + Number of interrupt identities are supported by IMSIC guest interrupt > + file. When not specified it is assumed to be same as specified by the > + riscv,num-ids property. > + > + riscv,guest-index-bits: > + minimum: 0 > + maximum: 7 > + default: 0 > + description: > + Number of guest index bits in the MSI target address. > + > + riscv,hart-index-bits: > + minimum: 0 > + maximum: 15 > + description: > + Number of HART index bits in the MSI target address. When not > + specified it is calculated based on the interrupts-extended property. > + > + riscv,group-index-bits: > + minimum: 0 > + maximum: 7 > + default: 0 > + description: > + Number of group index bits in the MSI target address. > + > + riscv,group-index-shift: > + $ref: /schemas/types.yaml#/definitions/uint32 > + minimum: 0 > + maximum: 55 > + default: 24 > + description: > + The least significant bit position of the group index bits in the > + MSI target address. > + > +required: > + - compatible > + - reg > + - interrupt-controller > + - msi-controller > + - "#msi-cells" > + - interrupts-extended > + - riscv,num-ids > + > +unevaluatedProperties: false > + > +examples: > + - | > + // Example 1 (Machine-level IMSIC files with just one group): > + > + interrupt-controller@24000000 { > + compatible = "qemu,imsics", "riscv,imsics"; > + interrupts-extended = <&cpu1_intc 11>, > + <&cpu2_intc 11>, > + <&cpu3_intc 11>, > + <&cpu4_intc 11>; > + reg = <0x28000000 0x4000>; > + interrupt-controller; > + #interrupt-cells = <0>; > + msi-controller; > + #msi-cells = <0>; > + riscv,num-ids = <127>; > + }; > + > + - | > + // Example 2 (Supervisor-level IMSIC files with two groups): > + > + interrupt-controller@28000000 { > + compatible = "qemu,imsics", "riscv,imsics"; > + interrupts-extended = <&cpu1_intc 9>, > + <&cpu2_intc 9>, > + <&cpu3_intc 9>, > + <&cpu4_intc 9>; > + reg = <0x28000000 0x2000>, /* Group0 IMSICs */ > + <0x29000000 0x2000>; /* Group1 IMSICs */ > + interrupt-controller; > + #interrupt-cells = <0>; > + msi-controller; > + #msi-cells = <0>; > + riscv,num-ids = <127>; > + riscv,group-index-bits = <1>; > + riscv,group-index-shift = <24>; > + }; > +... > -- > 2.34.1 >
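[Editorial note] For readers decoding the address layout in the binding above, the short C sketch below shows how the group, HART, and guest indices compose an interrupt-file address using the bit positions that the riscv,guest-index-bits and riscv,group-index-shift properties describe. It is only an illustration under those assumptions, not part of the patch or of any driver, and the struct/function names are invented for this example.

#include <stdint.h>
#include <stdio.h>

struct imsic_layout {
	uint64_t base;                  /* base address of the group region */
	unsigned int guest_index_bits;  /* riscv,guest-index-bits */
	unsigned int group_index_shift; /* riscv,group-index-shift */
};

static uint64_t imsic_file_addr(const struct imsic_layout *l,
				unsigned int group, unsigned int hart,
				unsigned int guest)
{
	uint64_t addr = l->base;

	/* Each interrupt file is a 4 KiB page, so the guest index starts
	 * at bit 12 and the HART index sits just above the guest bits. */
	addr += (uint64_t)group << l->group_index_shift;
	addr += (uint64_t)hart << (12 + l->guest_index_bits);
	addr += (uint64_t)guest << 12;
	return addr;
}

int main(void)
{
	/* Values loosely modelled on Example 2 of the binding. */
	struct imsic_layout l = {
		.base = 0x28000000,
		.guest_index_bits = 0,
		.group_index_shift = 24,
	};

	/* Prints 0x29001000: the second interrupt file inside the
	 * Group1 region (reg <0x29000000 0x2000>). */
	printf("group 1, hart 1, guest 0 -> 0x%llx\n",
	       (unsigned long long)imsic_file_addr(&l, 1, 1, 0));
	return 0;
}

As the follow-up below explains, this fixed composition is what an APLIC in MSI-mode relies on when generating target addresses, which is why the arrangement is mandated by the specification rather than left to the device tree.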
On Thu, Oct 12, 2023 at 10:05 PM Conor Dooley <conor@kernel.org> wrote: > > Hey, > > On Tue, Oct 03, 2023 at 10:13:55AM +0530, Anup Patel wrote: > > We add DT bindings document for the RISC-V incoming MSI controller > > (IMSIC) defined by the RISC-V advanced interrupt architecture (AIA) > > specification. > > > > Signed-off-by: Anup Patel <apatel@ventanamicro.com> > > Reviewed-by: Conor Dooley <conor.dooley@microchip.com> > > Just FYI, since they'll reply to this themselves, but some of the > Microchip folks have run into problems with sparse hart indexes while > trying to use the imsic binding to describe some configurations they > have. I think there were also so problems with how to describe to a > linux guest which file to use, when the first hart available to the > guest does not use the first file. They'll do a better job of describing > their problems than I will, so I shall leave it to them! Quoting AIA spec: "For the purpose of locating the memory pages of interrupt files in the address space, assume each hart (or each hart within a group) has a unique hart number that may or may not be related to the unique hart identifiers (“hart IDs”) that the RISC-V Privileged Architecture assigns to harts." It is very easy to get confused between the AIA "hart index" and "hart IDs" defined by the RISC-V Privileged specification but these are two very different things. The AIA "hart index" over here is the bits in the address of an IMSIC file. This DT binding follows the IMSIC file arrangement in the address space as defined by the section "3.6 Arrangement of the memory regions of multiple interrupt files" of the AIA specification. This arrangement is MANDATORY for platforms having both APLIC and IMSIC because in MSI-mode the APLIC generates target MSI address based the IMSIC file arrangement described in the section "3.6 Arrangement of the memory regions of multiple interrupt files". In fact, this also applies to virtual platforms created by hypervisors (KVM, Xen, ...) Regards, Anup > > Cheers, > Conor. > > > Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> > > --- > > .../interrupt-controller/riscv,imsics.yaml | 172 ++++++++++++++++++ > > 1 file changed, 172 insertions(+) > > create mode 100644 Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml > > > > diff --git a/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml b/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml > > new file mode 100644 > > index 000000000000..84976f17a4a1 > > --- /dev/null > > +++ b/Documentation/devicetree/bindings/interrupt-controller/riscv,imsics.yaml > > @@ -0,0 +1,172 @@ > > +# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) > > +%YAML 1.2 > > +--- > > +$id: http://devicetree.org/schemas/interrupt-controller/riscv,imsics.yaml# > > +$schema: http://devicetree.org/meta-schemas/core.yaml# > > + > > +title: RISC-V Incoming MSI Controller (IMSIC) > > + > > +maintainers: > > + - Anup Patel <anup@brainfault.org> > > + > > +description: | > > + The RISC-V advanced interrupt architecture (AIA) defines a per-CPU incoming > > + MSI controller (IMSIC) for handling MSIs in a RISC-V platform. The RISC-V > > + AIA specification can be found at https://github.com/riscv/riscv-aia. > > + > > + The IMSIC is a per-CPU (or per-HART) device with separate interrupt file > > + for each privilege level (machine or supervisor). 
The configuration of > > + a IMSIC interrupt file is done using AIA CSRs and it also has a 4KB MMIO > > + space to receive MSIs from devices. Each IMSIC interrupt file supports a > > + fixed number of interrupt identities (to distinguish MSIs from devices) > > + which is same for given privilege level across CPUs (or HARTs). > > + > > + The device tree of a RISC-V platform will have one IMSIC device tree node > > + for each privilege level (machine or supervisor) which collectively describe > > + IMSIC interrupt files at that privilege level across CPUs (or HARTs). > > + > > + The arrangement of IMSIC interrupt files in MMIO space of a RISC-V platform > > + follows a particular scheme defined by the RISC-V AIA specification. A IMSIC > > + group is a set of IMSIC interrupt files co-located in MMIO space and we can > > + have multiple IMSIC groups (i.e. clusters, sockets, chiplets, etc) in a > > + RISC-V platform. The MSI target address of a IMSIC interrupt file at given > > + privilege level (machine or supervisor) encodes group index, HART index, > > + and guest index (shown below). > > + > > + XLEN-1 > (HART Index MSB) 12 0 > > + | | | | > > + ------------------------------------------------------------- > > + |xxxxxx|Group Index|xxxxxxxxxxx|HART Index|Guest Index| 0 | > > + ------------------------------------------------------------- > > + > > +allOf: > > + - $ref: /schemas/interrupt-controller.yaml# > > + - $ref: /schemas/interrupt-controller/msi-controller.yaml# > > + > > +properties: > > + compatible: > > + items: > > + - enum: > > + - qemu,imsics > > + - const: riscv,imsics > > + > > + reg: > > + minItems: 1 > > + maxItems: 16384 > > + description: > > + Base address of each IMSIC group. > > + > > + interrupt-controller: true > > + > > + "#interrupt-cells": > > + const: 0 > > + > > + msi-controller: true > > + > > + "#msi-cells": > > + const: 0 > > + > > + interrupts-extended: > > + minItems: 1 > > + maxItems: 16384 > > + description: > > + This property represents the set of CPUs (or HARTs) for which given > > + device tree node describes the IMSIC interrupt files. Each node pointed > > + to should be a riscv,cpu-intc node, which has a CPU node (i.e. RISC-V > > + HART) as parent. > > + > > + riscv,num-ids: > > + $ref: /schemas/types.yaml#/definitions/uint32 > > + minimum: 63 > > + maximum: 2047 > > + description: > > + Number of interrupt identities supported by IMSIC interrupt file. > > + > > + riscv,num-guest-ids: > > + $ref: /schemas/types.yaml#/definitions/uint32 > > + minimum: 63 > > + maximum: 2047 > > + description: > > + Number of interrupt identities are supported by IMSIC guest interrupt > > + file. When not specified it is assumed to be same as specified by the > > + riscv,num-ids property. > > + > > + riscv,guest-index-bits: > > + minimum: 0 > > + maximum: 7 > > + default: 0 > > + description: > > + Number of guest index bits in the MSI target address. > > + > > + riscv,hart-index-bits: > > + minimum: 0 > > + maximum: 15 > > + description: > > + Number of HART index bits in the MSI target address. When not > > + specified it is calculated based on the interrupts-extended property. > > + > > + riscv,group-index-bits: > > + minimum: 0 > > + maximum: 7 > > + default: 0 > > + description: > > + Number of group index bits in the MSI target address. 
> > + > > + riscv,group-index-shift: > > + $ref: /schemas/types.yaml#/definitions/uint32 > > + minimum: 0 > > + maximum: 55 > > + default: 24 > > + description: > > + The least significant bit position of the group index bits in the > > + MSI target address. > > + > > +required: > > + - compatible > > + - reg > > + - interrupt-controller > > + - msi-controller > > + - "#msi-cells" > > + - interrupts-extended > > + - riscv,num-ids > > + > > +unevaluatedProperties: false > > + > > +examples: > > + - | > > + // Example 1 (Machine-level IMSIC files with just one group): > > + > > + interrupt-controller@24000000 { > > + compatible = "qemu,imsics", "riscv,imsics"; > > + interrupts-extended = <&cpu1_intc 11>, > > + <&cpu2_intc 11>, > > + <&cpu3_intc 11>, > > + <&cpu4_intc 11>; > > + reg = <0x28000000 0x4000>; > > + interrupt-controller; > > + #interrupt-cells = <0>; > > + msi-controller; > > + #msi-cells = <0>; > > + riscv,num-ids = <127>; > > + }; > > + > > + - | > > + // Example 2 (Supervisor-level IMSIC files with two groups): > > + > > + interrupt-controller@28000000 { > > + compatible = "qemu,imsics", "riscv,imsics"; > > + interrupts-extended = <&cpu1_intc 9>, > > + <&cpu2_intc 9>, > > + <&cpu3_intc 9>, > > + <&cpu4_intc 9>; > > + reg = <0x28000000 0x2000>, /* Group0 IMSICs */ > > + <0x29000000 0x2000>; /* Group1 IMSICs */ > > + interrupt-controller; > > + #interrupt-cells = <0>; > > + msi-controller; > > + #msi-cells = <0>; > > + riscv,num-ids = <127>; > > + riscv,group-index-bits = <1>; > > + riscv,group-index-shift = <24>; > > + }; > > +... > > -- > > 2.34.1 > >
Hey Anup, On Fri, Oct 13, 2023 at 12:16:45PM +0530, Anup Patel wrote: > On Thu, Oct 12, 2023 at 10:05 PM Conor Dooley <conor@kernel.org> wrote: > > On Tue, Oct 03, 2023 at 10:13:55AM +0530, Anup Patel wrote: > > > We add DT bindings document for the RISC-V incoming MSI controller > > > (IMSIC) defined by the RISC-V advanced interrupt architecture (AIA) > > > specification. > > > > > > Signed-off-by: Anup Patel <apatel@ventanamicro.com> > > > Reviewed-by: Conor Dooley <conor.dooley@microchip.com> > > > > Just FYI, since they'll reply to this themselves, but some of the > > Microchip folks have run into problems with sparse hart indexes while > > trying to use the imsic binding to describe some configurations they > > have. I think there were also so problems with how to describe to a > > linux guest which file to use, when the first hart available to the > > guest does not use the first file. They'll do a better job of describing > > their problems than I will, so I shall leave it to them! > > Quoting AIA spec: > "For the purpose of locating the memory pages of interrupt files in the > address space, assume each hart (or each hart within a group) has a > unique hart number that may or may not be related to the unique hart > identifiers (“hart IDs”) that the RISC-V Privileged Architecture assigns > to harts." > > It is very easy to get confused between the AIA "hart index" and > "hart IDs" defined by the RISC-V Privileged specification but these > are two very different things. The AIA "hart index" over here is the > bits in the address of an IMSIC file. > > This DT binding follows the IMSIC file arrangement in the address > space as defined by the section "3.6 Arrangement of the memory > regions of multiple interrupt files" of the AIA specification. This > arrangement is MANDATORY for platforms having both APLIC > and IMSIC because in MSI-mode the APLIC generates target > MSI address based the IMSIC file arrangement described in the > section "3.6 Arrangement of the memory regions of multiple > interrupt files". In fact, this also applies to virtual platforms > created by hypervisors (KVM, Xen, ...) Thanks for pointing this out - I'll pass it on and hopefully it is helpful to them. If not, I expect that you'll hear :) Cheers, Conor.
Anup Patel <apatel@ventanamicro.com> writes: > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> Thanks for the quick reply! >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> >> >> Hi Anup, >> >> >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international >> >> > process. The latest ratified AIA specifcation can be found at: >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf >> >> > >> >> > At a high-level, the AIA specification adds three things: >> >> > 1) AIA CSRs >> >> > - Improved local interrupt support >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC) >> >> > - Per-HART MSI controller >> >> > - Support MSI virtualization >> >> > - Support IPI along with virtualization >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC) >> >> > - Wired interrupt controller >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator) >> >> > - In Direct-mode, injects external interrupts directly into HARTs >> >> >> >> Thanks for working on the AIA support! I had a look at the series, and >> >> have some concerns about interrupt ID abstraction. >> >> >> >> A bit of background, for readers not familiar with the AIA details. >> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and >> >> each MSI is dedicated to a certain hart. The series takes the approach >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally. >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the >> >> slice only *one* msi-irq is acutally used. >> >> >> >> This scheme makes affinity changes more robust, because the interrupt >> >> sources on "other" harts are pre-allocated. On the other hand it >> >> requires to propagate irq masking to other harts via IPIs (this is >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs >> >> are hogged, and cannot be used. >> >> >> >> Contemporary storage/networking drivers usually uses queues per core >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs. >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device >> >> would like to use 5 queues (5 cores) on a 128 core system, the current >> >> scheme would consume 5 * 128 MSIs, instead of just 5. >> >> >> >> On the plus side: >> >> * Changing interrupts affinity will never fail, because the interrupts >> >> on each hart is pre-allocated. >> >> >> >> On the negative side: >> >> * Wasteful interrupt usage, and a system can potientially "run out" of >> >> interrupts. Especially for many core systems. >> >> * Interrupt masking need to proagate to harts via IPIs (there's no >> >> broadcast csr in IMSIC), and a more complex locking scheme IMSIC >> >> >> >> Summary: >> >> The current series caps the number of global interrupts to maximum >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be >> >> to expose 2047 * #harts unique MSIs. >> >> >> >> I think this could simplify/remove(?) the locking as well. >> > >> > Exposing 2047 * #harts unique MSIs has multiple issues: >> > 1) The irq_set_affinity() does not work for MSIs because each >> > IRQ is not tied to a particular HART. 
This means we can't >> > balance the IRQ processing load among HARTs. >> >> Yes, you can balance. In your code, each *active* MSI is still >> bound/active to a specific hard together with the affinity mask. In an >> 1-1 model you would still need to track the affinity mask, but the >> irq_set_affinity() would be different. It would try to allocate a new >> MSI from the target CPU, and then switch to having that MSI active. >> >> That's what x86 does AFAIU, which is also constrained by the # of >> available MSIs. >> >> The downside, as I pointed out, is that the set affinity action can >> fail for a certain target CPU. > > Yes, irq_set_affinity() can fail for the suggested approach plus for > RISC-V AIA, one HART does not have access to other HARTs > MSI enable/disable bits so the approach will also involve IPI. Correct, but the current series does a broadcast to all cores, where the 1-1 approach is at most an IPI to a single core. 128+c machines are getting more common, and you have devices that you bring up/down on a per-core basis. Broadcasting IPIs to all cores, when dealing with a per-core activity is a pretty noisy neighbor. This could be fixed in the existing 1-n approach, by not require to sync the cores that are not handling the MSI in question. "Lazy disable" >> > 2) All wired IRQs for APLIC MSI-mode will also target a >> > fixed HART hence irq_set_affinity() won't work for wired >> > IRQs as well. >> >> I'm not following here. Why would APLIC put a constraint here? I had a >> look at the specs, and I didn't see anything supporting the current >> scheme explicitly. > > Lets say the number of APLIC wired interrupts are greater than the > number of per-CPU IMSIC IDs. In this case, if all wired interrupts are > moved to a particular CPU then irq_set_affinity() will fail for some of > the wired interrupts. Right, it's the case of "full remote CPU" again. Thanks for clearing that up. >> > The idea of treating per-HART MSIs as separate IRQs has >> > been discussed in the past. >> >> Aha! I tried to look for it in lore, but didn't find any. Could you >> point me to those discussions? > > This was done 2 years back in the AIA TG meeting when we were > doing the PoC for AIA spec. Ah, too bad. Thanks regardless. >> My concern is interrupts become a scarce resource with this >> implementation, but maybe my view is incorrect. I've seen bare-metal >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe >> that is considered "a lot of interrupts". >> >> As long as we don't get into scenarios where we're running out of >> interrupts, due to the software design. >> > > The current approach is simpler and ensures irq_set_affinity > always works. The limit of max 2047 IDs is sufficient for many > systems (if not all). Let me give you another view. On a 128c system each core has ~16 unique interrupts for disposal. E.g. the Intel E800 NIC has more than 2048 network queue pairs for each PF. > When we encounter a system requiring a large number of MSIs, > we can either: > 1) Extend the AIA spec to support greater than 2047 IDs > 2) Re-think the approach in the IMSIC driver > > The choice between #1 and #2 above depends on the > guarantees we want for irq_set_affinity(). The irq_set_affinity() behavior is better with this series, but I think the other downsides: number of available interrupt sources, and IPI broadcast are worse. Björn
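[Editorial note] To make the consumption argument above concrete, here is a back-of-the-envelope model of the 5-queue / 128-hart example from the discussion. It is illustrative arithmetic only, not kernel code, and the variable names are invented for this sketch.

#include <stdio.h>

int main(void)
{
	const unsigned int harts = 128;         /* example core count */
	const unsigned int ids_per_file = 2047; /* max IDs per IMSIC file */
	const unsigned int device_irqs = 5;     /* e.g. 5 queues on 5 cores */

	/* 1-n (this series): each Linux IRQ reserves the same ID on every
	 * hart, so 5 IRQs hog 5 * 128 per-hart IDs and the whole system is
	 * capped at 2047 distinct Linux IRQs. */
	printf("1-n: per-hart IDs reserved = %u, system-wide IRQ cap = %u\n",
	       device_irqs * harts, ids_per_file);

	/* 1-1 (suggested alternative): each Linux IRQ maps to exactly one
	 * per-hart ID, so 5 IRQs cost 5 IDs and the cap scales with the
	 * number of harts. */
	printf("1-1: per-hart IDs reserved = %u, system-wide IRQ cap = %u\n",
	       device_irqs, ids_per_file * harts);
	return 0;
}

The trade-off, as the rest of the thread shows, is that the larger 1-1 cap comes at the price of a possibly failing irq_set_affinity() and targeted IPIs on each operation.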
On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote: > > Anup Patel <apatel@ventanamicro.com> writes: > > > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote: > >> > >> Thanks for the quick reply! > >> > >> Anup Patel <apatel@ventanamicro.com> writes: > >> > >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote: > >> >> > >> >> Hi Anup, > >> >> > >> >> Anup Patel <apatel@ventanamicro.com> writes: > >> >> > >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international > >> >> > process. The latest ratified AIA specifcation can be found at: > >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf > >> >> > > >> >> > At a high-level, the AIA specification adds three things: > >> >> > 1) AIA CSRs > >> >> > - Improved local interrupt support > >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC) > >> >> > - Per-HART MSI controller > >> >> > - Support MSI virtualization > >> >> > - Support IPI along with virtualization > >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC) > >> >> > - Wired interrupt controller > >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator) > >> >> > - In Direct-mode, injects external interrupts directly into HARTs > >> >> > >> >> Thanks for working on the AIA support! I had a look at the series, and > >> >> have some concerns about interrupt ID abstraction. > >> >> > >> >> A bit of background, for readers not familiar with the AIA details. > >> >> > >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and > >> >> each MSI is dedicated to a certain hart. The series takes the approach > >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally. > >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the > >> >> slice only *one* msi-irq is acutally used. > >> >> > >> >> This scheme makes affinity changes more robust, because the interrupt > >> >> sources on "other" harts are pre-allocated. On the other hand it > >> >> requires to propagate irq masking to other harts via IPIs (this is > >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs > >> >> are hogged, and cannot be used. > >> >> > >> >> Contemporary storage/networking drivers usually uses queues per core > >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs. > >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of > >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be > >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device > >> >> would like to use 5 queues (5 cores) on a 128 core system, the current > >> >> scheme would consume 5 * 128 MSIs, instead of just 5. > >> >> > >> >> On the plus side: > >> >> * Changing interrupts affinity will never fail, because the interrupts > >> >> on each hart is pre-allocated. > >> >> > >> >> On the negative side: > >> >> * Wasteful interrupt usage, and a system can potientially "run out" of > >> >> interrupts. Especially for many core systems. > >> >> * Interrupt masking need to proagate to harts via IPIs (there's no > >> >> broadcast csr in IMSIC), and a more complex locking scheme IMSIC > >> >> > >> >> Summary: > >> >> The current series caps the number of global interrupts to maximum > >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be > >> >> to expose 2047 * #harts unique MSIs. > >> >> > >> >> I think this could simplify/remove(?) 
the locking as well. > >> > > >> > Exposing 2047 * #harts unique MSIs has multiple issues: > >> > 1) The irq_set_affinity() does not work for MSIs because each > >> > IRQ is not tied to a particular HART. This means we can't > >> > balance the IRQ processing load among HARTs. > >> > >> Yes, you can balance. In your code, each *active* MSI is still > >> bound/active to a specific hard together with the affinity mask. In an > >> 1-1 model you would still need to track the affinity mask, but the > >> irq_set_affinity() would be different. It would try to allocate a new > >> MSI from the target CPU, and then switch to having that MSI active. > >> > >> That's what x86 does AFAIU, which is also constrained by the # of > >> available MSIs. > >> > >> The downside, as I pointed out, is that the set affinity action can > >> fail for a certain target CPU. > > > > Yes, irq_set_affinity() can fail for the suggested approach plus for > > RISC-V AIA, one HART does not have access to other HARTs > > MSI enable/disable bits so the approach will also involve IPI. > > Correct, but the current series does a broadcast to all cores, where the > 1-1 approach is at most an IPI to a single core. > > 128+c machines are getting more common, and you have devices that you > bring up/down on a per-core basis. Broadcasting IPIs to all cores, when > dealing with a per-core activity is a pretty noisy neighbor. Broadcast IPI in the current approach is only done upon MSI mask/unmask operation. It is not done upon set_affinity() of interrupt handling. > > This could be fixed in the existing 1-n approach, by not require to sync > the cores that are not handling the MSI in question. "Lazy disable" Incorrect. The approach you are suggesting involves an IPI upon every irq_set_affinity(). This is because a HART can only enable it's own MSI ID so when an IRQ is moved to from HART A to HART B with a different ID X on HART B then we will need an IPI in irq_set_affinit() to enable ID X on HART B. > > >> > 2) All wired IRQs for APLIC MSI-mode will also target a > >> > fixed HART hence irq_set_affinity() won't work for wired > >> > IRQs as well. > >> > >> I'm not following here. Why would APLIC put a constraint here? I had a > >> look at the specs, and I didn't see anything supporting the current > >> scheme explicitly. > > > > Lets say the number of APLIC wired interrupts are greater than the > > number of per-CPU IMSIC IDs. In this case, if all wired interrupts are > > moved to a particular CPU then irq_set_affinity() will fail for some of > > the wired interrupts. > > Right, it's the case of "full remote CPU" again. Thanks for clearing > that up. > > >> > The idea of treating per-HART MSIs as separate IRQs has > >> > been discussed in the past. > >> > >> Aha! I tried to look for it in lore, but didn't find any. Could you > >> point me to those discussions? > > > > This was done 2 years back in the AIA TG meeting when we were > > doing the PoC for AIA spec. > > Ah, too bad. Thanks regardless. > > >> My concern is interrupts become a scarce resource with this > >> implementation, but maybe my view is incorrect. I've seen bare-metal > >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe > >> that is considered "a lot of interrupts". > >> > >> As long as we don't get into scenarios where we're running out of > >> interrupts, due to the software design. > >> > > > > The current approach is simpler and ensures irq_set_affinity > > always works. The limit of max 2047 IDs is sufficient for many > > systems (if not all). 
> > Let me give you another view. On a 128c system each core has ~16 unique > interrupts for disposal. E.g. the Intel E800 NIC has more than 2048 > network queue pairs for each PF. Clearly, this example is a hypothetical and represents a poorly designed platform. Having just 16 IDs per-Core is a very poor design choice. In fact, the Server SoC spec mandates a minimum 255 IDs. Regarding NICs which support a large number of queues, the driver will typically enable only one queue per-core and set the affinity to separate cores. We have user-space data plane applications based on DPDK which are capable of using a large number of NIC queues but these applications are polling based and don't use MSIs. > > > When we encounter a system requiring a large number of MSIs, > > we can either: > > 1) Extend the AIA spec to support greater than 2047 IDs > > 2) Re-think the approach in the IMSIC driver > > > > The choice between #1 and #2 above depends on the > > guarantees we want for irq_set_affinity(). > > The irq_set_affinity() behavior is better with this series, but I think > the other downsides: number of available interrupt sources, and IPI > broadcast are worse. The IPI overhead in the approach you are suggesting will be even bad compared to the IPI overhead of the current approach because we will end-up doing IPI upon every irq_set_affinity() in the suggested approach compared to doing IPI upon every mask/unmask in the current approach. The biggest advantage of the current approach is a reliable irq_set_affinity() which is a very valuable thing to have. ARM systems easily support a large number of LPIs per-core. For example, GIC-700 supports 56000 LPIs per-core. (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features) In the RISC-V world, we can easily define a small fast track extension based on S*csrind extension which can allow a large number of IMSIC IDs per-core. Instead of addressing problems on a hypothetical system, I suggest we go ahead with the current approach and deal with a system having MSI over-subscription when such a system shows up. Regards, Anup
Anup Patel <apatel@ventanamicro.com> writes: > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> >> >> Thanks for the quick reply! >> >> >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> >> >> >> >> Hi Anup, >> >> >> >> >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> >> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international >> >> >> > process. The latest ratified AIA specifcation can be found at: >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf >> >> >> > >> >> >> > At a high-level, the AIA specification adds three things: >> >> >> > 1) AIA CSRs >> >> >> > - Improved local interrupt support >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC) >> >> >> > - Per-HART MSI controller >> >> >> > - Support MSI virtualization >> >> >> > - Support IPI along with virtualization >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC) >> >> >> > - Wired interrupt controller >> >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator) >> >> >> > - In Direct-mode, injects external interrupts directly into HARTs >> >> >> >> >> >> Thanks for working on the AIA support! I had a look at the series, and >> >> >> have some concerns about interrupt ID abstraction. >> >> >> >> >> >> A bit of background, for readers not familiar with the AIA details. >> >> >> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and >> >> >> each MSI is dedicated to a certain hart. The series takes the approach >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally. >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the >> >> >> slice only *one* msi-irq is acutally used. >> >> >> >> >> >> This scheme makes affinity changes more robust, because the interrupt >> >> >> sources on "other" harts are pre-allocated. On the other hand it >> >> >> requires to propagate irq masking to other harts via IPIs (this is >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs >> >> >> are hogged, and cannot be used. >> >> >> >> >> >> Contemporary storage/networking drivers usually uses queues per core >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs. >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5. >> >> >> >> >> >> On the plus side: >> >> >> * Changing interrupts affinity will never fail, because the interrupts >> >> >> on each hart is pre-allocated. >> >> >> >> >> >> On the negative side: >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of >> >> >> interrupts. Especially for many core systems. >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no >> >> >> broadcast csr in IMSIC), and a more complex locking scheme IMSIC >> >> >> >> >> >> Summary: >> >> >> The current series caps the number of global interrupts to maximum >> >> >> 2047 MSIs for all cores (whole system). 
A better scheme, IMO, would be >> >> >> to expose 2047 * #harts unique MSIs. >> >> >> >> >> >> I think this could simplify/remove(?) the locking as well. >> >> > >> >> > Exposing 2047 * #harts unique MSIs has multiple issues: >> >> > 1) The irq_set_affinity() does not work for MSIs because each >> >> > IRQ is not tied to a particular HART. This means we can't >> >> > balance the IRQ processing load among HARTs. >> >> >> >> Yes, you can balance. In your code, each *active* MSI is still >> >> bound/active to a specific hard together with the affinity mask. In an >> >> 1-1 model you would still need to track the affinity mask, but the >> >> irq_set_affinity() would be different. It would try to allocate a new >> >> MSI from the target CPU, and then switch to having that MSI active. >> >> >> >> That's what x86 does AFAIU, which is also constrained by the # of >> >> available MSIs. >> >> >> >> The downside, as I pointed out, is that the set affinity action can >> >> fail for a certain target CPU. >> > >> > Yes, irq_set_affinity() can fail for the suggested approach plus for >> > RISC-V AIA, one HART does not have access to other HARTs >> > MSI enable/disable bits so the approach will also involve IPI. >> >> Correct, but the current series does a broadcast to all cores, where the >> 1-1 approach is at most an IPI to a single core. >> >> 128+c machines are getting more common, and you have devices that you >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when >> dealing with a per-core activity is a pretty noisy neighbor. > > Broadcast IPI in the current approach is only done upon MSI mask/unmask > operation. It is not done upon set_affinity() of interrupt handling. I'm aware. We're on the same page here. >> >> This could be fixed in the existing 1-n approach, by not require to sync >> the cores that are not handling the MSI in question. "Lazy disable" > > Incorrect. The approach you are suggesting involves an IPI upon every > irq_set_affinity(). This is because a HART can only enable it's own > MSI ID so when an IRQ is moved to from HART A to HART B with > a different ID X on HART B then we will need an IPI in irq_set_affinit() > to enable ID X on HART B. Yes, the 1-1 approach will require an IPI to one target cpu on affinity changes, and similar on mask/unmask. The 1-n approach, require no-IPI on affinity changes (nice!), but IPI broadcast to all cores on mask/unmask (not so nice). >> >> My concern is interrupts become a scarce resource with this >> >> implementation, but maybe my view is incorrect. I've seen bare-metal >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe >> >> that is considered "a lot of interrupts". >> >> >> >> As long as we don't get into scenarios where we're running out of >> >> interrupts, due to the software design. >> >> >> > >> > The current approach is simpler and ensures irq_set_affinity >> > always works. The limit of max 2047 IDs is sufficient for many >> > systems (if not all). >> >> Let me give you another view. On a 128c system each core has ~16 unique >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048 >> network queue pairs for each PF. > > Clearly, this example is a hypothetical and represents a poorly > designed platform. > > Having just 16 IDs per-Core is a very poor design choice. In fact, the > Server SoC spec mandates a minimum 255 IDs. You are misreading. A 128c system with 2047 MSIs per-core, will only have 16 *per-core unique* (2047/128) interrupts with the current series. 
I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c system with the maximum amount of MSIs possible in the spec, you'll end up with 16 *unique* interrupts per core. > Regarding NICs which support a large number of queues, the driver > will typically enable only one queue per-core and set the affinity to > separate cores. We have user-space data plane applications based > on DPDK which are capable of using a large number of NIC queues > but these applications are polling based and don't use MSIs. That's one sample point, and clearly not the only one. There are *many* different usage models. Just because you *assign* MSI, doesn't mean they are firing all the time. I can show you a couple of networking setups where this is clearly not enough. Each core has a large number of QoS queues, and each queue would very much like to have a dedicated MSI. >> > When we encounter a system requiring a large number of MSIs, >> > we can either: >> > 1) Extend the AIA spec to support greater than 2047 IDs >> > 2) Re-think the approach in the IMSIC driver >> > >> > The choice between #1 and #2 above depends on the >> > guarantees we want for irq_set_affinity(). >> >> The irq_set_affinity() behavior is better with this series, but I think >> the other downsides: number of available interrupt sources, and IPI >> broadcast are worse. > > The IPI overhead in the approach you are suggesting will be > even bad compared to the IPI overhead of the current approach > because we will end-up doing IPI upon every irq_set_affinity() > in the suggested approach compared to doing IPI upon every > mask/unmask in the current approach. Again, very workload dependent. This series does IPI broadcast on masking/unmasking, which means that cores that don't care get interrupted because, say, a network queue-pair is setup on another core. Some workloads never change the irq affinity. I'm just pointing out that there are pro/cons with both variants. > The biggest advantage of the current approach is a reliable > irq_set_affinity() which is a very valuable thing to have. ...and I'm arguing that we're paying a big price for that. > ARM systems easily support a large number of LPIs per-core. > For example, GIC-700 supports 56000 LPIs per-core. > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features) Yeah, but this is not the GIC. This is something that looks more like the x86 world. We'll be stuck with a lot of implementations with AIA 1.0 spec, and many cores. > In the RISC-V world, we can easily define a small fast track > extension based on S*csrind extension which can allow a > large number of IMSIC IDs per-core. > > Instead of addressing problems on a hypothetical system, > I suggest we go ahead with the current approach and deal > with a system having MSI over-subscription when such a > system shows up. I've pointed out my concerns. We're not agreeing, but hey, I'm just one sample point here! I'll leave it here for others to chime in! Still much appreciate all the hard work on the series! Have a nice weekend, Björn
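[Editorial note] The IPI cost being argued over can be summarised in a toy model: the 1-n scheme never needs an IPI for irq_set_affinity() but broadcasts on mask/unmask, while the 1-1 scheme needs at most one targeted IPI per operation. The sketch below only counts worst-case IPIs per operation; it is a hypothetical model written for this discussion, not code from the series.

#include <stdio.h>

#define NR_CPUS 128 /* example core count */

/* 1-n scheme (this series): the ID is enabled on every hart, so
 * set_affinity() is a local re-route but mask/unmask reaches all harts. */
static unsigned int one_n_ipis(unsigned int set_affinity_ops,
			       unsigned int mask_unmask_ops)
{
	return set_affinity_ops * 0 + mask_unmask_ops * NR_CPUS;
}

/* 1-1 scheme (suggested alternative): the ID lives on one hart, so each
 * operation needs at most one IPI, to the owning or target hart. */
static unsigned int one_1_ipis(unsigned int set_affinity_ops,
			       unsigned int mask_unmask_ops)
{
	return set_affinity_ops * 1 + mask_unmask_ops * 1;
}

int main(void)
{
	/* A queue that is brought up/down but never migrated ... */
	printf("mask/unmask-heavy: 1-n = %u IPIs, 1-1 = %u IPIs\n",
	       one_n_ipis(0, 2), one_1_ipis(0, 2));
	/* ... versus an IRQ that is rebalanced often but rarely masked. */
	printf("affinity-heavy:    1-n = %u IPIs, 1-1 = %u IPIs\n",
	       one_n_ipis(10, 0), one_1_ipis(10, 0));
	return 0;
}

Which column dominates is exactly the workload-dependence point made above; neither scheme wins unconditionally.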
On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote: > > Anup Patel <apatel@ventanamicro.com> writes: > > > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote: > >> > >> Anup Patel <apatel@ventanamicro.com> writes: > >> > >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote: > >> >> > >> >> Thanks for the quick reply! > >> >> > >> >> Anup Patel <apatel@ventanamicro.com> writes: > >> >> > >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote: > >> >> >> > >> >> >> Hi Anup, > >> >> >> > >> >> >> Anup Patel <apatel@ventanamicro.com> writes: > >> >> >> > >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international > >> >> >> > process. The latest ratified AIA specifcation can be found at: > >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf > >> >> >> > > >> >> >> > At a high-level, the AIA specification adds three things: > >> >> >> > 1) AIA CSRs > >> >> >> > - Improved local interrupt support > >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC) > >> >> >> > - Per-HART MSI controller > >> >> >> > - Support MSI virtualization > >> >> >> > - Support IPI along with virtualization > >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC) > >> >> >> > - Wired interrupt controller > >> >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator) > >> >> >> > - In Direct-mode, injects external interrupts directly into HARTs > >> >> >> > >> >> >> Thanks for working on the AIA support! I had a look at the series, and > >> >> >> have some concerns about interrupt ID abstraction. > >> >> >> > >> >> >> A bit of background, for readers not familiar with the AIA details. > >> >> >> > >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and > >> >> >> each MSI is dedicated to a certain hart. The series takes the approach > >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally. > >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the > >> >> >> slice only *one* msi-irq is acutally used. > >> >> >> > >> >> >> This scheme makes affinity changes more robust, because the interrupt > >> >> >> sources on "other" harts are pre-allocated. On the other hand it > >> >> >> requires to propagate irq masking to other harts via IPIs (this is > >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs > >> >> >> are hogged, and cannot be used. > >> >> >> > >> >> >> Contemporary storage/networking drivers usually uses queues per core > >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs. > >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of > >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be > >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device > >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current > >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5. > >> >> >> > >> >> >> On the plus side: > >> >> >> * Changing interrupts affinity will never fail, because the interrupts > >> >> >> on each hart is pre-allocated. > >> >> >> > >> >> >> On the negative side: > >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of > >> >> >> interrupts. Especially for many core systems. 
> >> >> >> * Interrupt masking need to proagate to harts via IPIs (there's no > >> >> >> broadcast csr in IMSIC), and a more complex locking scheme IMSIC > >> >> >> > >> >> >> Summary: > >> >> >> The current series caps the number of global interrupts to maximum > >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be > >> >> >> to expose 2047 * #harts unique MSIs. > >> >> >> > >> >> >> I think this could simplify/remove(?) the locking as well. > >> >> > > >> >> > Exposing 2047 * #harts unique MSIs has multiple issues: > >> >> > 1) The irq_set_affinity() does not work for MSIs because each > >> >> > IRQ is not tied to a particular HART. This means we can't > >> >> > balance the IRQ processing load among HARTs. > >> >> > >> >> Yes, you can balance. In your code, each *active* MSI is still > >> >> bound/active to a specific hard together with the affinity mask. In an > >> >> 1-1 model you would still need to track the affinity mask, but the > >> >> irq_set_affinity() would be different. It would try to allocate a new > >> >> MSI from the target CPU, and then switch to having that MSI active. > >> >> > >> >> That's what x86 does AFAIU, which is also constrained by the # of > >> >> available MSIs. > >> >> > >> >> The downside, as I pointed out, is that the set affinity action can > >> >> fail for a certain target CPU. > >> > > >> > Yes, irq_set_affinity() can fail for the suggested approach plus for > >> > RISC-V AIA, one HART does not have access to other HARTs > >> > MSI enable/disable bits so the approach will also involve IPI. > >> > >> Correct, but the current series does a broadcast to all cores, where the > >> 1-1 approach is at most an IPI to a single core. > >> > >> 128+c machines are getting more common, and you have devices that you > >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when > >> dealing with a per-core activity is a pretty noisy neighbor. > > > > Broadcast IPI in the current approach is only done upon MSI mask/unmask > > operation. It is not done upon set_affinity() of interrupt handling. > > I'm aware. We're on the same page here. > > >> > >> This could be fixed in the existing 1-n approach, by not require to sync > >> the cores that are not handling the MSI in question. "Lazy disable" > > > > Incorrect. The approach you are suggesting involves an IPI upon every > > irq_set_affinity(). This is because a HART can only enable it's own > > MSI ID so when an IRQ is moved to from HART A to HART B with > > a different ID X on HART B then we will need an IPI in irq_set_affinit() > > to enable ID X on HART B. > > Yes, the 1-1 approach will require an IPI to one target cpu on affinity > changes, and similar on mask/unmask. > > The 1-n approach, require no-IPI on affinity changes (nice!), but IPI > broadcast to all cores on mask/unmask (not so nice). > > >> >> My concern is interrupts become a scarce resource with this > >> >> implementation, but maybe my view is incorrect. I've seen bare-metal > >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe > >> >> that is considered "a lot of interrupts". > >> >> > >> >> As long as we don't get into scenarios where we're running out of > >> >> interrupts, due to the software design. > >> >> > >> > > >> > The current approach is simpler and ensures irq_set_affinity > >> > always works. The limit of max 2047 IDs is sufficient for many > >> > systems (if not all). > >> > >> Let me give you another view. 
On a 128c system each core has ~16 unique > >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048 > >> network queue pairs for each PF. > > > > Clearly, this example is a hypothetical and represents a poorly > > designed platform. > > > > Having just 16 IDs per-Core is a very poor design choice. In fact, the > > Server SoC spec mandates a minimum 255 IDs. > > You are misreading. A 128c system with 2047 MSIs per-core, will only > have 16 *per-core unique* (2047/128) interrupts with the current series. > > I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c > system with the maximum amount of MSIs possible in the spec, you'll end > up with 16 *unique* interrupts per core. -ENOPARSE I don't see how this applies to the current approach because we treat MSI ID space as global across cores so if a system has 2047 MSIs per-core then we have 2047 MSIs across all cores. > > > Regarding NICs which support a large number of queues, the driver > > will typically enable only one queue per-core and set the affinity to > > separate cores. We have user-space data plane applications based > > on DPDK which are capable of using a large number of NIC queues > > but these applications are polling based and don't use MSIs. > > That's one sample point, and clearly not the only one. There are *many* > different usage models. Just because you *assign* MSI, doesn't mean they > are firing all the time. > > I can show you a couple of networking setups where this is clearly not > enough. Each core has a large number of QoS queues, and each queue would > very much like to have a dedicated MSI. > > >> > When we encounter a system requiring a large number of MSIs, > >> > we can either: > >> > 1) Extend the AIA spec to support greater than 2047 IDs > >> > 2) Re-think the approach in the IMSIC driver > >> > > >> > The choice between #1 and #2 above depends on the > >> > guarantees we want for irq_set_affinity(). > >> > >> The irq_set_affinity() behavior is better with this series, but I think > >> the other downsides: number of available interrupt sources, and IPI > >> broadcast are worse. > > > > The IPI overhead in the approach you are suggesting will be > > even bad compared to the IPI overhead of the current approach > > because we will end-up doing IPI upon every irq_set_affinity() > > in the suggested approach compared to doing IPI upon every > > mask/unmask in the current approach. > > Again, very workload dependent. > > This series does IPI broadcast on masking/unmasking, which means that > cores that don't care get interrupted because, say, a network queue-pair > is setup on another core. > > Some workloads never change the irq affinity. There are various events which irq affinity such as irq balance, CPU hotplug, system suspend, etc. Also, the 1-1 approach does IPI upon set_affinity, mask and unmask whereas the 1-n approach does IPI only upon mask and unmask. > > I'm just pointing out that there are pro/cons with both variants. > > > The biggest advantage of the current approach is a reliable > > irq_set_affinity() which is a very valuable thing to have. > > ...and I'm arguing that we're paying a big price for that. > > > ARM systems easily support a large number of LPIs per-core. > > For example, GIC-700 supports 56000 LPIs per-core. > > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features) > > Yeah, but this is not the GIC. This is something that looks more like > the x86 world. 
We'll be stuck with a lot of implementations with AIA 1.0 > spec, and many cores. Well, RISC-V AIA is neigher ARM GIG not x86 APIC. All I am saying is that there are systems with large number per-core interrupt IDs for handling MSIs. > > > In the RISC-V world, we can easily define a small fast track > > extension based on S*csrind extension which can allow a > > large number of IMSIC IDs per-core. > > > > Instead of addressing problems on a hypothetical system, > > I suggest we go ahead with the current approach and deal > > with a system having MSI over-subscription when such a > > system shows up. > > I've pointed out my concerns. We're not agreeing, but hey, I'm just one > sample point here! I'll leave it here for others to chime in! > > Still much appreciate all the hard work on the series! Thanks, we have disagreements on this topic but this is certainly a good discussion. > > > Have a nice weekend, Regards, Anup
Anup Patel <apatel@ventanamicro.com> writes: > On Fri, Oct 20, 2023 at 10:07 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> > On Fri, Oct 20, 2023 at 8:10 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> >> >> > On Fri, Oct 20, 2023 at 2:17 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> >> >> >> >> Thanks for the quick reply! >> >> >> >> >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> >> >> >> >> > On Thu, Oct 19, 2023 at 7:13 PM Björn Töpel <bjorn@kernel.org> wrote: >> >> >> >> >> >> >> >> Hi Anup, >> >> >> >> >> >> >> >> Anup Patel <apatel@ventanamicro.com> writes: >> >> >> >> >> >> >> >> > The RISC-V AIA specification is ratified as-per the RISC-V international >> >> >> >> > process. The latest ratified AIA specifcation can be found at: >> >> >> >> > https://github.com/riscv/riscv-aia/releases/download/1.0/riscv-interrupts-1.0.pdf >> >> >> >> > >> >> >> >> > At a high-level, the AIA specification adds three things: >> >> >> >> > 1) AIA CSRs >> >> >> >> > - Improved local interrupt support >> >> >> >> > 2) Incoming Message Signaled Interrupt Controller (IMSIC) >> >> >> >> > - Per-HART MSI controller >> >> >> >> > - Support MSI virtualization >> >> >> >> > - Support IPI along with virtualization >> >> >> >> > 3) Advanced Platform-Level Interrupt Controller (APLIC) >> >> >> >> > - Wired interrupt controller >> >> >> >> > - In MSI-mode, converts wired interrupt into MSIs (i.e. MSI generator) >> >> >> >> > - In Direct-mode, injects external interrupts directly into HARTs >> >> >> >> >> >> >> >> Thanks for working on the AIA support! I had a look at the series, and >> >> >> >> have some concerns about interrupt ID abstraction. >> >> >> >> >> >> >> >> A bit of background, for readers not familiar with the AIA details. >> >> >> >> >> >> >> >> IMSIC allows for 2047 unique MSI ("msi-irq") sources per hart, and >> >> >> >> each MSI is dedicated to a certain hart. The series takes the approach >> >> >> >> to say that there are, e.g., 2047 interrupts ("lnx-irq") globally. >> >> >> >> Each lnx-irq consists of #harts * msi-irq -- a slice -- and in the >> >> >> >> slice only *one* msi-irq is acutally used. >> >> >> >> >> >> >> >> This scheme makes affinity changes more robust, because the interrupt >> >> >> >> sources on "other" harts are pre-allocated. On the other hand it >> >> >> >> requires to propagate irq masking to other harts via IPIs (this is >> >> >> >> mostly done up setup/tear down). It's also wasteful, because msi-irqs >> >> >> >> are hogged, and cannot be used. >> >> >> >> >> >> >> >> Contemporary storage/networking drivers usually uses queues per core >> >> >> >> (or a sub-set of cores). The current scheme wastes a lot of msi-irqs. >> >> >> >> If we instead used a scheme where "msi-irq == lnx-irq", instead of >> >> >> >> "lnq-irq = {hart 0;msi-irq x , ... hart N;msi-irq x}", there would be >> >> >> >> a lot MSIs for other users. 1-1 vs 1-N. E.g., if a storage device >> >> >> >> would like to use 5 queues (5 cores) on a 128 core system, the current >> >> >> >> scheme would consume 5 * 128 MSIs, instead of just 5. >> >> >> >> >> >> >> >> On the plus side: >> >> >> >> * Changing interrupts affinity will never fail, because the interrupts >> >> >> >> on each hart is pre-allocated. >> >> >> >> >> >> >> >> On the negative side: >> >> >> >> * Wasteful interrupt usage, and a system can potientially "run out" of >> >> >> >> interrupts. Especially for many core systems. 
>> >> >> >> * Interrupt masking needs to propagate to harts via IPIs (there's no
>> >> >> >>   broadcast csr in IMSIC), and a more complex locking scheme in the
>> >> >> >>   IMSIC driver.
>> >> >> >>
>> >> >> >> Summary:
>> >> >> >> The current series caps the number of global interrupts to maximum
>> >> >> >> 2047 MSIs for all cores (whole system). A better scheme, IMO, would be
>> >> >> >> to expose 2047 * #harts unique MSIs.
>> >> >> >>
>> >> >> >> I think this could simplify/remove(?) the locking as well.
>> >> >> >
>> >> >> > Exposing 2047 * #harts unique MSIs has multiple issues:
>> >> >> > 1) The irq_set_affinity() does not work for MSIs because each
>> >> >> > IRQ is not tied to a particular HART. This means we can't
>> >> >> > balance the IRQ processing load among HARTs.
>> >> >>
>> >> >> Yes, you can balance. In your code, each *active* MSI is still
>> >> >> bound/active to a specific hart together with the affinity mask. In a
>> >> >> 1-1 model you would still need to track the affinity mask, but the
>> >> >> irq_set_affinity() would be different. It would try to allocate a new
>> >> >> MSI from the target CPU, and then switch to having that MSI active.
>> >> >>
>> >> >> That's what x86 does AFAIU, which is also constrained by the # of
>> >> >> available MSIs.
>> >> >>
>> >> >> The downside, as I pointed out, is that the set affinity action can
>> >> >> fail for a certain target CPU.
>> >> >
>> >> > Yes, irq_set_affinity() can fail for the suggested approach, plus for
>> >> > RISC-V AIA, one HART does not have access to other HARTs'
>> >> > MSI enable/disable bits so the approach will also involve IPI.
>> >>
>> >> Correct, but the current series does a broadcast to all cores, where the
>> >> 1-1 approach is at most an IPI to a single core.
>> >>
>> >> 128+c machines are getting more common, and you have devices that you
>> >> bring up/down on a per-core basis. Broadcasting IPIs to all cores, when
>> >> dealing with a per-core activity, is a pretty noisy neighbor.
>> >
>> > Broadcast IPI in the current approach is only done upon MSI mask/unmask
>> > operation. It is not done upon set_affinity() of interrupt handling.
>>
>> I'm aware. We're on the same page here.
>>
>> >>
>> >> This could be fixed in the existing 1-n approach, by not requiring to sync
>> >> the cores that are not handling the MSI in question. "Lazy disable"
>> >
>> > Incorrect. The approach you are suggesting involves an IPI upon every
>> > irq_set_affinity(). This is because a HART can only enable its own
>> > MSI ID so when an IRQ is moved from HART A to HART B with
>> > a different ID X on HART B then we will need an IPI in irq_set_affinity()
>> > to enable ID X on HART B.
>>
>> Yes, the 1-1 approach will require an IPI to one target cpu on affinity
>> changes, and similar on mask/unmask.
>>
>> The 1-n approach requires no IPI on affinity changes (nice!), but IPI
>> broadcast to all cores on mask/unmask (not so nice).
>>
>> >>
>> >> >> My concern is interrupts become a scarce resource with this
>> >> >> implementation, but maybe my view is incorrect. I've seen bare-metal
>> >> >> x86 systems (no VMs) with ~200 cores, and ~2000 interrupts, but maybe
>> >> >> that is considered "a lot of interrupts".
>> >> >>
>> >> >> As long as we don't get into scenarios where we're running out of
>> >> >> interrupts, due to the software design.
>> >> >>
>> >> >
>> >> > The current approach is simpler and ensures irq_set_affinity
>> >> > always works. The limit of max 2047 IDs is sufficient for many
>> >> > systems (if not all).
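The MSI accounting being argued over in this exchange is easy to make
concrete. Below is a minimal stand-alone C sketch (purely illustrative,
not code from the series): the 2047-ID limit comes from the AIA spec,
while the 128-hart and 5-queue figures are just the example numbers used
in the thread. It compares what the 1-n "slice" scheme and a 1-1 scheme
consume for the same driver request.

#include <stdio.h>

#define IMSIC_MAX_IDS	2047u	/* interrupt identities per IMSIC file */

int main(void)
{
	unsigned int harts = 128;	/* example core count from the thread */
	unsigned int vectors = 5;	/* e.g. a storage device with 5 queues */

	/* 1-n: each Linux IRQ reserves the same ID slot on every hart. */
	unsigned int slice_slots = vectors * harts;

	/* 1-1: each Linux IRQ maps to one ID on exactly one hart. */
	unsigned int one_to_one_ids = vectors;

	printf("1-n scheme: %u of %u global IDs used, %u per-hart ID slots reserved\n",
	       vectors, IMSIC_MAX_IDS, slice_slots);
	printf("1-1 scheme: %u IDs used out of %u available system-wide\n",
	       one_to_one_ids, IMSIC_MAX_IDS * harts);
	return 0;
}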
>> >>
>> >> Let me give you another view. On a 128c system each core has ~16 unique
>> >> interrupts for disposal. E.g. the Intel E800 NIC has more than 2048
>> >> network queue pairs for each PF.
>> >
>> > Clearly, this example is a hypothetical and represents a poorly
>> > designed platform.
>> >
>> > Having just 16 IDs per-core is a very poor design choice. In fact, the
>> > Server SoC spec mandates a minimum of 255 IDs.
>>
>> You are misreading. A 128c system with 2047 MSIs per-core will only
>> have 16 *per-core unique* (2047/128) interrupts with the current series.
>>
>> I'm not saying that each IMSIC has 16 IDs, I'm saying that in a 128c
>> system with the maximum amount of MSIs possible in the spec, you'll end
>> up with 16 *unique* interrupts per core.
>
> -ENOPARSE
>
> I don't see how this applies to the current approach because we treat
> MSI ID space as global across cores, so if a system has 2047 MSIs
> per-core then we have 2047 MSIs across all cores.

Ok, I'll try again! :-)

Let's assume that each core in the 128c system has some per-core
resources, say two NIC queue pairs and a storage queue pair. This will
consume, e.g., 2*2 + 2 (6) MSI sources from the global namespace. If
each core does this, it'll be 6*128 MSI sources of the global namespace.

The maximum number of "private" MSI sources a core can utilize is 16.
I'm trying (it doesn't seem to go that well ;-)) to point out that it's
only 16 unique sources per core. For, say, a 256 core system it would
be 8. 2047 MSI sources in a system is not much.

Say that I want to spin up 24 NIC queues with one MSI each on each core
on my 128c system. That's not possible with this series, while with a
1-1 system it wouldn't be an issue.

Clearer, or still weird?

>
>> > Regarding NICs which support a large number of queues, the driver
>> > will typically enable only one queue per-core and set the affinity to
>> > separate cores. We have user-space data plane applications based
>> > on DPDK which are capable of using a large number of NIC queues
>> > but these applications are polling based and don't use MSIs.
>>
>> That's one sample point, and clearly not the only one. There are *many*
>> different usage models. Just because you *assign* MSIs doesn't mean they
>> are firing all the time.
>>
>> I can show you a couple of networking setups where this is clearly not
>> enough. Each core has a large number of QoS queues, and each queue would
>> very much like to have a dedicated MSI.
>>
>> >> > When we encounter a system requiring a large number of MSIs,
>> >> > we can either:
>> >> > 1) Extend the AIA spec to support greater than 2047 IDs
>> >> > 2) Re-think the approach in the IMSIC driver
>> >> >
>> >> > The choice between #1 and #2 above depends on the
>> >> > guarantees we want for irq_set_affinity().
>> >>
>> >> The irq_set_affinity() behavior is better with this series, but I think
>> >> the other downsides: number of available interrupt sources, and IPI
>> >> broadcast, are worse.
>> >
>> > The IPI overhead in the approach you are suggesting will be
>> > even worse compared to the IPI overhead of the current approach
>> > because we will end up doing an IPI upon every irq_set_affinity()
>> > in the suggested approach compared to doing an IPI upon every
>> > mask/unmask in the current approach.
>>
>> Again, very workload dependent.
>>
>> This series does IPI broadcast on masking/unmasking, which means that
>> cores that don't care get interrupted because, say, a network queue-pair
>> is setup on another core.
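To put numbers on the "unique sources per core" point in the paragraphs
above, here is another small stand-alone sketch (again illustrative
only; the core counts and the 24-queue per-core request are the ones
quoted in the discussion, and the averaging is a simplification):

#include <stdio.h>
#include <stdbool.h>

#define IMSIC_MAX_IDS	2047u

/* Does a per-core request of 'vectors' MSIs fit in one global ID space? */
static bool fits(unsigned int vectors, unsigned int cores)
{
	return (unsigned long)vectors * cores <= IMSIC_MAX_IDS;
}

int main(void)
{
	unsigned int cores[] = { 128, 256 };

	for (unsigned int i = 0; i < sizeof(cores) / sizeof(cores[0]); i++) {
		printf("%3u cores: ~%u unique IDs per core on average; "
		       "24 queues per core %s\n",
		       cores[i], IMSIC_MAX_IDS / cores[i],
		       fits(24, cores[i]) ? "fit" : "do not fit");
	}
	return 0;
}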
>>
>> Some workloads never change the irq affinity.
>
> There are various events which change irq affinity, such as irq balance,
> CPU hotplug, system suspend, etc.
>
> Also, the 1-1 approach does IPI upon set_affinity, mask and
> unmask whereas the 1-n approach does IPI only upon mask
> and unmask.

An important distinction: when you say IPI on mask/unmask, it is a
broadcast IPI to *all* cores, which is pretty intrusive. The 1-1 variant
does an IPI to *one* target core.

>> I'm just pointing out that there are pros/cons with both variants.
>>
>> > The biggest advantage of the current approach is a reliable
>> > irq_set_affinity() which is a very valuable thing to have.
>>
>> ...and I'm arguing that we're paying a big price for that.
>>
>> > ARM systems easily support a large number of LPIs per-core.
>> > For example, GIC-700 supports 56000 LPIs per-core.
>> > (Refer, https://developer.arm.com/documentation/101516/0300/About-the-GIC-700/Features)
>>
>> Yeah, but this is not the GIC. This is something that looks more like
>> the x86 world. We'll be stuck with a lot of implementations with AIA 1.0
>> spec, and many cores.
>
> Well, RISC-V AIA is neither ARM GIC nor x86 APIC. All I am saying
> is that there are systems with a large number of per-core interrupt IDs
> for handling MSIs.

Yes, and while that is nice, it's not what IMSIC is.

Now, back to the weekend for real! ;-) (https://xkcd.com/386/)


Björn
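As an aside, the IPI trade-off that closes the thread can also be
sketched numerically. The model below is a deliberately crude assumption
(one broadcast reaching all other cores per mask/unmask in the 1-n
scheme, one targeted IPI per mask/unmask or affinity change in a 1-1
scheme; the event counts are made up), so it only illustrates the shape
of the argument, not a measurement:

#include <stdio.h>

int main(void)
{
	unsigned long cores = 128;
	unsigned long mask_unmask_events = 1000;	/* e.g. queue setup/teardown */
	unsigned long affinity_changes = 1000;		/* e.g. irqbalance, hotplug */

	/* 1-n: every mask/unmask is a broadcast IPI to all other cores. */
	unsigned long ipi_1n = mask_unmask_events * (cores - 1);

	/* 1-1: mask/unmask and affinity changes each hit one target core. */
	unsigned long ipi_11 = mask_unmask_events + affinity_changes;

	printf("1-n: ~%lu IPI deliveries\n", ipi_1n);
	printf("1-1: ~%lu IPI deliveries\n", ipi_11);
	return 0;
}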