Message ID | 20201016152702.1513592-1-Jonathan.Cameron@huawei.com
---|---
State | New
Series | [RFC] topology: Represent clusters of CPUs within a die.
On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > Both ACPI and DT provide the ability to describe additional layers of > topology between that of individual cores and higher level constructs > such as the level at which the last level cache is shared. > In ACPI this can be represented in PPTT as a Processor Hierarchy > Node Structure [1] that is the parent of the CPU cores and in turn > has a parent Processor Hierarchy Nodes Structure representing > a higher level of topology. > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > any cache resources, but the interconnect topology is such that > the cost to transfer ownership of a cacheline between CPUs within > a cluster is lower than between CPUs in different clusters on the same > die. Hence, it can make sense to deliberately schedule threads > sharing data to a single cluster. > > This patch simply exposes this information to userspace libraries > like hwloc by providing cluster_cpus and related sysfs attributes. > PoC of HWLOC support at [2]. > > Note this patch only handle the ACPI case. > > Special consideration is needed for SMT processors, where it is > necessary to move 2 levels up the hierarchy from the leaf nodes > (thus skipping the processor core level). > > Currently the ID provided is the offset of the Processor > Hierarchy Nodes Structure within PPTT. Whilst this is unique > it is not terribly elegant so alternative suggestions welcome. > > Note that arm64 / ACPI does not provide any means of identifying > a die level in the topology but that may be unrelate to the cluster > level. > > RFC questions: > 1) Naming > 2) Related to naming, do we want to represent all potential levels, > or this enough? On Kunpeng920, the next level up from cluster happens > to be covered by llc cache sharing, but in theory more than one > level of cluster description might be needed by some future system. > 3) Do we need DT code in place? I'm not sure any DT based ARM64 > systems would have enough complexity for this to be useful. > 4) Other architectures? Is this useful on x86 for example? > > [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node > structure (Type 0) > [2] https://github.com/hisilicon/hwloc/tree/linux-cluster > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > --- > > Documentation/admin-guide/cputopology.rst | 26 ++++++++-- You are adding new sysfs files here, but not adding Documentation/ABI/ entries as well? This cputopology document is nice, but no one knows to look there for sysfs stuff :) thanks, greg k-h
On Sat, 17 Oct 2020 08:44:25 +0200 Greg Kroah-Hartman <gregkh@linuxfoundation.org> wrote: > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > > Both ACPI and DT provide the ability to describe additional layers of > > topology between that of individual cores and higher level constructs > > such as the level at which the last level cache is shared. > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > Node Structure [1] that is the parent of the CPU cores and in turn > > has a parent Processor Hierarchy Nodes Structure representing > > a higher level of topology. > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > any cache resources, but the interconnect topology is such that > > the cost to transfer ownership of a cacheline between CPUs within > > a cluster is lower than between CPUs in different clusters on the same > > die. Hence, it can make sense to deliberately schedule threads > > sharing data to a single cluster. > > > > This patch simply exposes this information to userspace libraries > > like hwloc by providing cluster_cpus and related sysfs attributes. > > PoC of HWLOC support at [2]. > > > > Note this patch only handle the ACPI case. > > > > Special consideration is needed for SMT processors, where it is > > necessary to move 2 levels up the hierarchy from the leaf nodes > > (thus skipping the processor core level). > > > > Currently the ID provided is the offset of the Processor > > Hierarchy Nodes Structure within PPTT. Whilst this is unique > > it is not terribly elegant so alternative suggestions welcome. > > > > Note that arm64 / ACPI does not provide any means of identifying > > a die level in the topology but that may be unrelate to the cluster > > level. > > > > RFC questions: > > 1) Naming > > 2) Related to naming, do we want to represent all potential levels, > > or this enough? On Kunpeng920, the next level up from cluster happens > > to be covered by llc cache sharing, but in theory more than one > > level of cluster description might be needed by some future system. > > 3) Do we need DT code in place? I'm not sure any DT based ARM64 > > systems would have enough complexity for this to be useful. > > 4) Other architectures? Is this useful on x86 for example? > > > > [1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node > > structure (Type 0) > > [2] https://github.com/hisilicon/hwloc/tree/linux-cluster > > > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > > --- > > > > Documentation/admin-guide/cputopology.rst | 26 ++++++++-- > > You are adding new sysfs files here, but not adding Documentation/ABI/ > entries as well? This cputopology document is nice, but no one knows to > look there for sysfs stuff :) Hi Greg, Ah. I'd assumed there wasn't a current doc as the patch adding die description didn't touch it. Turns out it was just missing from that patch. (Documentation/ABI/testing/sysfs-devices-system-cpu) Seems those docs are missing quite a bit of more recent stuff such as die and more package related parts. I'll bring it up to date as a precursor to v2 of this series. Thanks, Jonathan > > thanks, > > greg k-h
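For illustration, a minimal C sketch of the kind of userspace consumer these new attributes (and their ABI documentation) are aimed at, reading the proposed per-CPU cluster mask. The cluster_cpus_list file name is assumed from the RFC's cluster_cpus naming and is not yet a documented kernel ABI, so the file being absent is the normal case today.

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      char path[128], buf[256];

      for (int cpu = 0; ; cpu++) {
          snprintf(path, sizeof(path), "/sys/devices/system/cpu/cpu%d", cpu);
          if (access(path, F_OK))
              break;                      /* no more CPUs */

          snprintf(path, sizeof(path),
                   "/sys/devices/system/cpu/cpu%d/topology/cluster_cpus_list",
                   cpu);
          FILE *f = fopen(path, "r");
          if (!f) {
              printf("cpu%d: no cluster information exposed\n", cpu);
              continue;                   /* offline CPU or attribute missing */
          }
          if (fgets(buf, sizeof(buf), f)) {
              buf[strcspn(buf, "\n")] = '\0';
              printf("cpu%d: cluster_cpus_list=%s\n", cpu, buf);
          }
          fclose(f);
      }
      return 0;
  }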
Le 16/10/2020 à 17:27, Jonathan Cameron a écrit : > Both ACPI and DT provide the ability to describe additional layers of > topology between that of individual cores and higher level constructs > such as the level at which the last level cache is shared. > In ACPI this can be represented in PPTT as a Processor Hierarchy > Node Structure [1] that is the parent of the CPU cores and in turn > has a parent Processor Hierarchy Nodes Structure representing > a higher level of topology. > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > any cache resources, but the interconnect topology is such that > the cost to transfer ownership of a cacheline between CPUs within > a cluster is lower than between CPUs in different clusters on the same > die. Hence, it can make sense to deliberately schedule threads > sharing data to a single cluster. > > This patch simply exposes this information to userspace libraries > like hwloc by providing cluster_cpus and related sysfs attributes. > PoC of HWLOC support at [2]. > > Note this patch only handle the ACPI case. > > Special consideration is needed for SMT processors, where it is > necessary to move 2 levels up the hierarchy from the leaf nodes > (thus skipping the processor core level). > > Currently the ID provided is the offset of the Processor > Hierarchy Nodes Structure within PPTT. Whilst this is unique > it is not terribly elegant so alternative suggestions welcome. > > Note that arm64 / ACPI does not provide any means of identifying > a die level in the topology but that may be unrelate to the cluster > level. > > RFC questions: > 1) Naming > 2) Related to naming, do we want to represent all potential levels, > or this enough? On Kunpeng920, the next level up from cluster happens > to be covered by llc cache sharing, but in theory more than one > level of cluster description might be needed by some future system. > 3) Do we need DT code in place? I'm not sure any DT based ARM64 > systems would have enough complexity for this to be useful. > 4) Other architectures? Is this useful on x86 for example? Hello Jonathan Intel has CPUID registers to describe "tiles" and "modules" too (not used yet as far as I know). The list of levels could become quite long if any processor ever exposes those. If having multiple cluster levels is possible, maybe it's time to think about introducing some sort of generic levels: cluster0_id = your cluster_id cluster0_cpus/cpulist = your cluster_cpus/cpulis cluster0_type = would optionally contain hardware-specific info such as "module" or "tile" on x86 cluster_levels = 1 hwloc already does something like this for some "rare" levels such as s390 book/drawers (by the way, thanks a lot for the hwloc PoC, very good job), we call them "Groups" instead of "cluster" above. However I don't know if the Linux scheduler would like that. Is it better to have 10+ levels with static names, or a dynamic number of levels? Brice
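As a sketch of how such extra levels already surface to applications through hwloc, the following uses the hwloc 2.x C API to list Group objects (the generic level hwloc uses for things like s390 books/drawers, and where a cluster level could also land). It assumes a topology that actually contains Group objects; on most machines the count will simply be zero. Build with: cc groups.c -lhwloc

  #include <stdio.h>
  #include <stdlib.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topo;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_GROUP);
      if (n < 0)          /* Groups at several depths; keep the sketch simple */
          n = 0;
      printf("%d Group object(s)\n", n);

      for (int i = 0; i < n; i++) {
          hwloc_obj_t obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_GROUP, i);
          char *cpus;

          hwloc_bitmap_asprintf(&cpus, obj->cpuset);
          printf("Group#%d (subtype %s): cpuset %s\n",
                 i, obj->subtype ? obj->subtype : "none", cpus);
          free(cpus);
      }
      hwloc_topology_destroy(topo);
      return 0;
  }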
+Morten On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > Both ACPI and DT provide the ability to describe additional layers of > topology between that of individual cores and higher level constructs > such as the level at which the last level cache is shared. > In ACPI this can be represented in PPTT as a Processor Hierarchy > Node Structure [1] that is the parent of the CPU cores and in turn > has a parent Processor Hierarchy Nodes Structure representing > a higher level of topology. > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > any cache resources, but the interconnect topology is such that > the cost to transfer ownership of a cacheline between CPUs within > a cluster is lower than between CPUs in different clusters on the same > die. Hence, it can make sense to deliberately schedule threads > sharing data to a single cluster. > This is very SoC specific and hard to generalise, hence LLC is chosen as one of the main factor to decide. Are there any scheduler topology related changes to achieve the same ? If so, we need their opinion on that. > This patch simply exposes this information to userspace libraries > like hwloc by providing cluster_cpus and related sysfs attributes. > PoC of HWLOC support at [2]. > OK, cluster is too Arm specific with no definition for it whatsoever. How do you plan to support clusters of clusters or higher level of hierarchy present in PPTT. > Note this patch only handle the ACPI case. > If we decide to add this to sysfs, I prefer to keep DT implementation also in sync. The bindings are in sync, just matter of implementation. > Special consideration is needed for SMT processors, where it is > necessary to move 2 levels up the hierarchy from the leaf nodes > (thus skipping the processor core level). > > Currently the ID provided is the offset of the Processor > Hierarchy Nodes Structure within PPTT. Whilst this is unique > it is not terribly elegant so alternative suggestions welcome. > That is firmware choice. May be your firmware just fills in mandatory fields and doesn't care about anything and everything optional. We do check for valid UID fields and if that is not set, offset is used as unique ID in kernel implementation. So if you enhance the firmware, the kernel sysfs will become elegant as you expect 😉 > Note that arm64 / ACPI does not provide any means of identifying > a die level in the topology but that may be unrelate to the cluster > level. > Indeed. It can be cluster of clusters on some platform. If we need that info, we should add it. My assumption was that generally each die forms a NUMA node and hence the information is available there. I may be wrong. > RFC questions: > 1) Naming > 2) Related to naming, do we want to represent all potential levels, > or this enough? On Kunpeng920, the next level up from cluster happens > to be covered by llc cache sharing, but in theory more than one > level of cluster description might be needed by some future system. That is my question above. Can't recall the terminology used in ACPI PPTT, but IIRC "cluster" is not used to keep it generic. May be we need to do the same here as the term "cluster" is ill-defined on Arm and I would avoid using it if possible. > 3) Do we need DT code in place? I'm not sure any DT based ARM64 > systems would have enough complexity for this to be useful. I prefer to keep the implementation in sync > 4) Other architectures? Is this useful on x86 for example? > AMD had multi-die within socket IIRC. 
IIUC, cpuid will provide the info and nothing more is needed from ACPI? I may be wrong, just guessing/asking. -- Regards, Sudeep
On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > Both ACPI and DT provide the ability to describe additional layers of > topology between that of individual cores and higher level constructs > such as the level at which the last level cache is shared. > In ACPI this can be represented in PPTT as a Processor Hierarchy > Node Structure [1] that is the parent of the CPU cores and in turn > has a parent Processor Hierarchy Nodes Structure representing > a higher level of topology. > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > any cache resources, but the interconnect topology is such that > the cost to transfer ownership of a cacheline between CPUs within > a cluster is lower than between CPUs in different clusters on the same > die. Hence, it can make sense to deliberately schedule threads > sharing data to a single cluster. > > This patch simply exposes this information to userspace libraries > like hwloc by providing cluster_cpus and related sysfs attributes. > PoC of HWLOC support at [2]. > > Note this patch only handle the ACPI case. > > Special consideration is needed for SMT processors, where it is > necessary to move 2 levels up the hierarchy from the leaf nodes > (thus skipping the processor core level). I'm confused by all of this. The core level is exactly what you seem to want.
On Mon, 19 Oct 2020 12:35:22 +0200 Peter Zijlstra <peterz@infradead.org> wrote: > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > > Both ACPI and DT provide the ability to describe additional layers of > > topology between that of individual cores and higher level constructs > > such as the level at which the last level cache is shared. > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > Node Structure [1] that is the parent of the CPU cores and in turn > > has a parent Processor Hierarchy Nodes Structure representing > > a higher level of topology. > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > any cache resources, but the interconnect topology is such that > > the cost to transfer ownership of a cacheline between CPUs within > > a cluster is lower than between CPUs in different clusters on the same > > die. Hence, it can make sense to deliberately schedule threads > > sharing data to a single cluster. > > > > This patch simply exposes this information to userspace libraries > > like hwloc by providing cluster_cpus and related sysfs attributes. > > PoC of HWLOC support at [2]. > > > > Note this patch only handle the ACPI case. > > > > Special consideration is needed for SMT processors, where it is > > necessary to move 2 levels up the hierarchy from the leaf nodes > > (thus skipping the processor core level). Hi Peter, > > I'm confused by all of this. The core level is exactly what you seem to > want. It's the level above the core, whether in an multi-threaded core or a single threaded core. This may correspond to the level at which caches are shared (typically L3). Cores are already well represented via thread_siblings and similar. Extra confusion is that the current core_siblings (deprecated) sysfs interface, actually reflects the package level and ignores anything in between core and package (such as die on x86) So in a typical system with a hierarchical interconnect you would have thread core cluster (possibly multiple layers as mentioned in Brice's reply). die package Unfortunately as pointed out in other branches of this thread, there is no consistent generic name. I'm open to suggestions! Both ACPI PPTT and DT provide generic structures to represent layers of topology. They don't name as such, but in ACPI there are flags to indicate package, core, thread. For example, in zen2 this would correspond to a 'core complex' consisting 4 CPU cores (each one 2 threads) sharing some local L3 cache. https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 In zen3 it looks like this level will be the same as that for the die. Given they used the name in knights landing (and as is pointed out in another branch of this thread, it's the CPUID description) I think Intel calls these 'tiles' (anyone confirm that?) A similar concept exists for some ARM processors. https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110 CCLs in the diagram on that page. Centriq 2400 had 2 core 'duplexes' which shared l2. https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3 From the info release at hotchips, it looks like the thunderx3 deploys a similar ring interconnect with groups of cores, each with 4 threads. Not sure what they plan to call them yet though or whether they will chose to represent that layer of the topology in their firmware tables. 
Arm's CMN600 interconnect also supports such 'clusters', though I have no idea if anyone has used it in this form yet. In that case, they are called "processor compute clusters" https://developer.arm.com/documentation/100180/0103/ Xuantie-910 is cluster-based as well (shares l2). So in many cases the cluster level corresponds to something we already have visibility of due to cache sharing etc, but that isn't true in Kunpeng 920. Thanks, Jonathan
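To make that last point concrete, here is a small sketch contrasting the grouping userspace can already infer from cache sharing with the cluster mask proposed in this RFC. It assumes the usual sysfs layout and that cache index3 is the L3; on a part whose clusters share an L3 the two lists should match, while on Kunpeng 920 only the cluster file would show the 4-CPU grouping.

  #include <stdio.h>
  #include <string.h>

  static void show(const char *path)
  {
      char buf[256];
      FILE *f = fopen(path, "r");

      if (!f) {
          printf("%s: <not present>\n", path);
          return;
      }
      if (fgets(buf, sizeof(buf), f)) {
          buf[strcspn(buf, "\n")] = '\0';
          printf("%s: %s\n", path, buf);
      }
      fclose(f);
  }

  int main(void)
  {
      /* grouping visible today via cache sharing */
      show("/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list");
      /* grouping proposed by this RFC */
      show("/sys/devices/system/cpu/cpu0/topology/cluster_cpus_list");
      return 0;
  }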
On Mon, 19 Oct 2020 12:00:15 +0200 Brice Goglin <Brice.Goglin@inria.fr> wrote: > Le 16/10/2020 à 17:27, Jonathan Cameron a écrit : > > Both ACPI and DT provide the ability to describe additional layers of > > topology between that of individual cores and higher level constructs > > such as the level at which the last level cache is shared. > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > Node Structure [1] that is the parent of the CPU cores and in turn > > has a parent Processor Hierarchy Nodes Structure representing > > a higher level of topology. > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > any cache resources, but the interconnect topology is such that > > the cost to transfer ownership of a cacheline between CPUs within > > a cluster is lower than between CPUs in different clusters on the same > > die. Hence, it can make sense to deliberately schedule threads > > sharing data to a single cluster. > > > > This patch simply exposes this information to userspace libraries > > like hwloc by providing cluster_cpus and related sysfs attributes. > > PoC of HWLOC support at [2]. > > > > Note this patch only handle the ACPI case. > > > > Special consideration is needed for SMT processors, where it is > > necessary to move 2 levels up the hierarchy from the leaf nodes > > (thus skipping the processor core level). > > > > Currently the ID provided is the offset of the Processor > > Hierarchy Nodes Structure within PPTT. Whilst this is unique > > it is not terribly elegant so alternative suggestions welcome. > > > > Note that arm64 / ACPI does not provide any means of identifying > > a die level in the topology but that may be unrelate to the cluster > > level. > > > > RFC questions: > > 1) Naming > > 2) Related to naming, do we want to represent all potential levels, > > or this enough? On Kunpeng920, the next level up from cluster happens > > to be covered by llc cache sharing, but in theory more than one > > level of cluster description might be needed by some future system. > > 3) Do we need DT code in place? I'm not sure any DT based ARM64 > > systems would have enough complexity for this to be useful. > > 4) Other architectures? Is this useful on x86 for example? > > > Hello Jonathan Hi Brice, > > Intel has CPUID registers to describe "tiles" and "modules" too (not > used yet as far as I know). The list of levels could become quite long > if any processor ever exposes those. If having multiple cluster levels > is possible, maybe it's time to think about introducing some sort of > generic levels: I've been wondering what tiles and modules are... Looking back and naming over time, I'm going to guess tiles are the closest to the particular case I was focusing on here. > > cluster0_id = your cluster_id > cluster0_cpus/cpulist = your cluster_cpus/cpulis > cluster0_type = would optionally contain hardware-specific info such as > "module" or "tile" on x86 > cluster_levels = 1 I wondered exactly the same question. At this point, perhaps we just statically introduce an 0 index, but with the assumption we would extend that as / when necessary in future. > > hwloc already does something like this for some "rare" levels such as > s390 book/drawers (by the way, thanks a lot for the hwloc PoC, very good > job), we call them "Groups" instead of "cluster" above. Given we definitely have a 'naming' issue here, perhaps group0 etc is a good generic choice? > > However I don't know if the Linux scheduler would like that. 
> Is it better to have 10+ levels with static names, or a dynamic number of levels? So far our 'general' experiments with adding clusters into the kernel scheduler have been a mixed bag. Hence the proposal to just expose the structure to userspace, where we should at least know what the workload is. Hopefully we'll gain more experience with using it and use that to drive possible kernel scheduler changes. > > Brice Thanks, Jonathan
On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote: > On Mon, 19 Oct 2020 12:35:22 +0200 > Peter Zijlstra <peterz@infradead.org> wrote: > > I'm confused by all of this. The core level is exactly what you seem to > > want. > > It's the level above the core, whether in an multi-threaded core > or a single threaded core. This may correspond to the level > at which caches are shared (typically L3). Cores are already well > represented via thread_siblings and similar. Extra confusion is that > the current core_siblings (deprecated) sysfs interface, actually reflects > the package level and ignores anything in between core and > package (such as die on x86) That seems wrong. core-mask should be whatever cores share L3. So on an Intel Core2-Quad (just to pick an example) you should have 4 CPUs in a package, but only 2 CPUs for the core-mask. It just so happens that L3 and package were the same for a long while in x86 land, although recent chips started breaking that trend. And I know nothing about the core-mask being deprecated; it's what the scheduler uses. It's not going anywhere. So if your 'cluster' is a group of single cores (possibly with SMT) that do not share cache but have a faster cache connection and you want them to behave as if they were a multi-core group that did share cache, then core-mask it is.
Hi Jonathan, On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote: > On Mon, 19 Oct 2020 12:35:22 +0200 > Peter Zijlstra <peterz@infradead.org> wrote: > > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > > > Both ACPI and DT provide the ability to describe additional layers of > > > topology between that of individual cores and higher level constructs > > > such as the level at which the last level cache is shared. > > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > > Node Structure [1] that is the parent of the CPU cores and in turn > > > has a parent Processor Hierarchy Nodes Structure representing > > > a higher level of topology. > > > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > > any cache resources, but the interconnect topology is such that > > > the cost to transfer ownership of a cacheline between CPUs within > > > a cluster is lower than between CPUs in different clusters on the same > > > die. Hence, it can make sense to deliberately schedule threads > > > sharing data to a single cluster. > > > > > > This patch simply exposes this information to userspace libraries > > > like hwloc by providing cluster_cpus and related sysfs attributes. > > > PoC of HWLOC support at [2]. > > > > > > Note this patch only handle the ACPI case. > > > > > > Special consideration is needed for SMT processors, where it is > > > necessary to move 2 levels up the hierarchy from the leaf nodes > > > (thus skipping the processor core level). > > Hi Peter, > > > > > I'm confused by all of this. The core level is exactly what you seem to > > want. > > It's the level above the core, whether in an multi-threaded core > or a single threaded core. This may correspond to the level > at which caches are shared (typically L3). Cores are already well > represented via thread_siblings and similar. Extra confusion is that > the current core_siblings (deprecated) sysfs interface, actually reflects > the package level and ignores anything in between core and > package (such as die on x86) > > So in a typical system with a hierarchical interconnect you would have > > thread > core > cluster (possibly multiple layers as mentioned in Brice's reply). > die > package > > Unfortunately as pointed out in other branches of this thread, there is > no consistent generic name. I'm open to suggestions! IIUC, you are actually proposing another "die" level? I'm not sure if we can actually come up with a generic name since interconnects are highly implementation dependent. How is you memory distributed? Do you already have NUMA nodes? If you want to keep tasks together, it might make sense to define the clusters (in your case) as NUMA nodes. > Both ACPI PPTT and DT provide generic structures to represent layers of > topology. They don't name as such, but in ACPI there are flags to indicate > package, core, thread. I think that is because those are the only ones that a fairly generic :-) It is also the only ones that scheduler cares about (plus NUMA). > > For example, in zen2 this would correspond to a 'core complex' consisting > 4 CPU cores (each one 2 threads) sharing some local L3 cache. > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 > In zen3 it looks like this level will be the same as that for the die. > > Given they used the name in knights landing (and as is pointed out in > another branch of this thread, it's the CPUID description) I think Intel > calls these 'tiles' (anyone confirm that?) 
> > A similar concept exists for some ARM processors. > https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110 > CCLs in the diagram on that page. > > Centriq 2400 had 2 core 'duplexes' which shared l2. > https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3 > > From the info release at hotchips, it looks like the thunderx3 deploys > a similar ring interconnect with groups of cores, each with 4 threads. > Not sure what they plan to call them yet though or whether they will chose > to represent that layer of the topology in their firmware tables. > > Arms CMN600 interconnect also support such 'clusters' though I have no idea > if anyone has used it in this form yet. In that case, they are called > "processor compute clusters" > https://developer.arm.com/documentation/100180/0103/ > > Xuantie-910 is cluster based as well (shares l2). > > So in many cases the cluster level corresponds to something we already have > visibility of due to cache sharing etc, but that isn't true in kunpeng 920. The problem I see is that the benefit of keeping tasks together due to the interconnect layout might vary significantly between systems. So if we introduce a new cpumask for cluster it has to have represent roughly the same system properties otherwise generic software consuming this information could be tricked. If there is a provable benefit of having interconnect grouping information, I think it would be better represented by a distance matrix like we have for NUMA. Morten
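For reference, the NUMA distance matrix Morten alludes to is already consumable from userspace; a short sketch via libnuma (assuming libnuma is installed, build with -lnuma) shows the kind of relative-distance interface a cluster/interconnect description could mirror instead of adding more cpumask levels.

  #include <stdio.h>
  #include <numa.h>

  int main(void)
  {
      if (numa_available() < 0) {
          fprintf(stderr, "no NUMA support\n");
          return 1;
      }

      int max = numa_max_node();
      for (int i = 0; i <= max; i++) {
          for (int j = 0; j <= max; j++)
              printf("%4d", numa_distance(i, j));   /* 10 == local, per SLIT */
          printf("\n");
      }
      return 0;
  }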
On 19/10/2020 at 14:50, Peter Zijlstra wrote: > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote: >> On Mon, 19 Oct 2020 12:35:22 +0200 >> Peter Zijlstra <peterz@infradead.org> wrote: >>> I'm confused by all of this. The core level is exactly what you seem to >>> want. >> It's the level above the core, whether in an multi-threaded core >> or a single threaded core. This may correspond to the level >> at which caches are shared (typically L3). Cores are already well >> represented via thread_siblings and similar. Extra confusion is that >> the current core_siblings (deprecated) sysfs interface, actually reflects >> the package level and ignores anything in between core and >> package (such as die on x86) > That seems wrong. core-mask should be whatever cores share L3. So on a > Intel Core2-Quad (just to pick an example) you should have 4 CPU in a > package, but only 2 CPUs for the core-mask. > > It just so happens that L3 and package were the same for a long while in > x86 land, although recent chips started breaking that trend. > > And I know nothing about the core-mask being depricated; it's what the > scheduler uses. It's not going anywhere. Only the sysfs filenames are deprecated: thread_siblings -> core_cpus, core_siblings -> package_cpus. The new names better reflect what has been implemented/documented in the past (core_siblings=package_cpus are processors with the same physical package id). And that's indeed different from the core-mask you are talking about above with Core2-Quad (that one has never been exposed anywhere in sysfs, except in the L3 cpumap). Brice
On Mon, Oct 19, 2020 at 02:50:53PM +0200, Peter Zijlstra wrote: > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote: > > On Mon, 19 Oct 2020 12:35:22 +0200 > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > I'm confused by all of this. The core level is exactly what you seem to > > > want. > > > > It's the level above the core, whether in an multi-threaded core > > or a single threaded core. This may correspond to the level > > at which caches are shared (typically L3). Cores are already well > > represented via thread_siblings and similar. Extra confusion is that > > the current core_siblings (deprecated) sysfs interface, actually reflects > > the package level and ignores anything in between core and > > package (such as die on x86) > > That seems wrong. core-mask should be whatever cores share L3. So on a > Intel Core2-Quad (just to pick an example) you should have 4 CPU in a > package, but only 2 CPUs for the core-mask. > > It just so happens that L3 and package were the same for a long while in > x86 land, although recent chips started breaking that trend. > > And I know nothing about the core-mask being depricated; it's what the > scheduler uses. It's not going anywhere. Don't get confused over the user-space topology and the scheduler topology, they are _not_ the same despite having similar names for some things :-) > So if your 'cluster' is a group of single cores (possibly with SMT) that > do not share cache but have a faster cache connection and you want them > to behave as-if they were a multi-core group that did share cache, then > core-mask it is. In the scheduler, yes. There is no core-mask exposed to user-space. We have to be clear about whether we discuss scheduler or user-space topology :-)
On Mon, 19 Oct 2020 11:01:00 +0100 Sudeep Holla <sudeep.holla@arm.com> wrote: > +Morten Hi Sudeep, > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > > Both ACPI and DT provide the ability to describe additional layers of > > topology between that of individual cores and higher level constructs > > such as the level at which the last level cache is shared. > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > Node Structure [1] that is the parent of the CPU cores and in turn > > has a parent Processor Hierarchy Nodes Structure representing > > a higher level of topology. > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > any cache resources, but the interconnect topology is such that > > the cost to transfer ownership of a cacheline between CPUs within > > a cluster is lower than between CPUs in different clusters on the same > > die. Hence, it can make sense to deliberately schedule threads > > sharing data to a single cluster. > > > > This is very SoC specific and hard to generalise, hence LLC is chosen > as one of the main factor to decide. I absolutely agree. It is very hard to modify the kernel scheduler to take into account this information. Chances are we'll help some workloads and hurt others. However, it is good to at least expose that it exists. > > Are there any scheduler topology related changes to achieve the same ? > If so, we need their opinion on that. So far no. We have done a few experiments with adding this knowledge into the scheduler but it is still very much a work in progress. However, we do have micro benchmarks showing clearly that can make a substantial difference. Our intent was to first expose this information to user space and get experience of using it in applications via HWLOC before we tried to propose any scheduler changes. > > > This patch simply exposes this information to userspace libraries > > like hwloc by providing cluster_cpus and related sysfs attributes. > > PoC of HWLOC support at [2]. > > > > OK, cluster is too Arm specific with no definition for it whatsoever. It does get used by others (ACPI for example), but I agree there may be a naming issue here. Brice observed that hwloc uses the term group to cover the somewhat obscure drawer and book levels that powerpc uses. Perhaps that is a better name? > How do you plan to support clusters of clusters or higher level of > hierarchy present in PPTT. Right now I don't. I was hoping that question would come up, but didn't want to complicate the first version of this proposal. Guess I got that wrong ;) As Brice suggested, probably adding an index makes sense. I'm not sure if we bother implementing more than one layer for now, but we definitely want a naming scheme that allows for it. If anyone has a CPU that already uses another layer let us know! On another note, we are thinking of an ACPI PPTT addition to add a flag for die, if anyone has another suggestions on how to match the x86 CPUID part for that, it would be great to align. At least die is fairly well defined... Tile and Module not so much... > > > Note this patch only handle the ACPI case. > > > > If we decide to add this to sysfs, I prefer to keep DT implementation > also in sync. The bindings are in sync, just matter of implementation. Ok, happy to look at DT that if we seem to be moving forwards with this in general. 
> > > Special consideration is needed for SMT processors, where it is > > necessary to move 2 levels up the hierarchy from the leaf nodes > > (thus skipping the processor core level). > > > > Currently the ID provided is the offset of the Processor > > Hierarchy Nodes Structure within PPTT. Whilst this is unique > > it is not terribly elegant so alternative suggestions welcome. > > > > That is firmware choice. May be your firmware just fills in mandatory > fields and doesn't care about anything and everything optional. We do > check for valid UID fields and if that is not set, offset is used as > unique ID in kernel implementation. So if you enhance the firmware, the > kernel sysfs will become elegant as you expect 😉 Fair enough that there is a way to improve it, but it would be nice if the default solution was also elegant! I'll add the processor container DSDT stuff to one or two of my test cases and make sure we can get that working. There are a lot of combinations that need testing and I get bored quickly writing firmware tables by hand. > > > Note that arm64 / ACPI does not provide any means of identifying > > a die level in the topology but that may be unrelate to the cluster > > level. > > > > Indeed. It can be cluster of clusters on some platform. If we need that > info, we should add it. My assumption was that generally each die forms > a NUMA node and hence the information is available there. I may be wrong. There are architectures that have exposed multiple NUMA nodes on one die. Given steady increase in number of cores, I'd be surprised if it didn't happen again. An example would be the Haswell cluster on die. This pages suggests they were exposed as NUMA nodes via SRAT proximity domains. https://www.dell.com/support/article/en-uk/sln315049/intel-cluster-on-die-cod-technology-on-vmware-esxi?lang=en So that isn't sufficient information. If we want to expose it it needs to be explicit. > > > RFC questions: > > 1) Naming > > 2) Related to naming, do we want to represent all potential levels, > > or this enough? On Kunpeng920, the next level up from cluster happens > > to be covered by llc cache sharing, but in theory more than one > > level of cluster description might be needed by some future system. > > That is my question above. Can't recall the terminology used in ACPI > PPTT, but IIRC "cluster" is not used to keep it generic. May be we need > to do the same here as the term "cluster" is ill-defined on Arm and I > would avoid using it if possible. In example text, ACPI uses the term cluster. Of course that may reflect that the author was an ARM person (I have no idea) E.g. The Cache Type Cluster Example Fig 5.13 Various other places in description of power management use cluster to mean similar things. > > > 3) Do we need DT code in place? I'm not sure any DT based ARM64 > > systems would have enough complexity for this to be useful. > > I prefer to keep the implementation in sync Fair enough. Once we are sure this is going somewhere I'll get that implemented (mostly writing test cases *sigh*) > > > 4) Other architectures? Is this useful on x86 for example? > > > > AMD had multi-die within socket IIRC. IIUC, cpuid will provide the info > and nothing more is needed from ACPI ? I may be wrong, just guessing/asking. I'm not sure if any x86 platforms use PPTT. However, the CPUID value, if provided would still need to be hooked up to the sysfs files etc. As mentioned in another branch in this thread my guess is the equivalent is 'tile'. Thanks, Jonathan > > -- > Regards, > Sudeep
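To show what the offset-as-ID behaviour looks like in practice, here is a hedged sketch that walks the processor hierarchy nodes of a PPTT dumped at /sys/firmware/acpi/tables/PPTT (needs root). The structure layout is written out locally from the ACPI 6.3 section 5.2.29.1 description rather than taken from kernel/ACPICA headers, so treat it as illustrative only.

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  struct pptt_node {                  /* type 0: processor hierarchy node */
      uint8_t  type;
      uint8_t  length;
      uint16_t reserved;
      uint32_t flags;
      uint32_t parent;                /* offset of the parent node, 0 at the top */
      uint32_t acpi_processor_id;
      uint32_t num_private_resources;
  } __attribute__((packed));

  #define PPTT_PHYSICAL_PACKAGE  (1u << 0)
  #define PPTT_ACPI_ID_VALID     (1u << 1)

  int main(void)
  {
      FILE *f = fopen("/sys/firmware/acpi/tables/PPTT", "rb");
      if (!f) { perror("PPTT"); return 1; }

      fseek(f, 0, SEEK_END);
      long size = ftell(f);
      rewind(f);

      uint8_t *tbl = malloc(size);
      if (!tbl || fread(tbl, 1, size, f) != (size_t)size)
          return 1;

      for (long off = 36 /* ACPI table header */; off + 2 <= size; ) {
          uint8_t type = tbl[off], len = tbl[off + 1];
          if (len == 0 || off + len > size)
              break;
          if (type == 0) {            /* processor hierarchy node */
              struct pptt_node *n = (struct pptt_node *)(tbl + off);
              printf("node@%#lx parent=%#x package=%u id=%s\n",
                     (unsigned long)off, n->parent,
                     n->flags & PPTT_PHYSICAL_PACKAGE,
                     (n->flags & PPTT_ACPI_ID_VALID) ?
                          "ACPI processor ID" : "offset (no valid UID)");
          }
          off += len;
      }
      free(tbl);
      fclose(f);
      return 0;
  }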
On Mon, 19 Oct 2020 15:10:52 +0200 Morten Rasmussen <morten.rasmussen@arm.com> wrote: > Hi Jonathan, > > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote: > > On Mon, 19 Oct 2020 12:35:22 +0200 > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > > > > Both ACPI and DT provide the ability to describe additional layers of > > > > topology between that of individual cores and higher level constructs > > > > such as the level at which the last level cache is shared. > > > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > > > Node Structure [1] that is the parent of the CPU cores and in turn > > > > has a parent Processor Hierarchy Nodes Structure representing > > > > a higher level of topology. > > > > > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > > > any cache resources, but the interconnect topology is such that > > > > the cost to transfer ownership of a cacheline between CPUs within > > > > a cluster is lower than between CPUs in different clusters on the same > > > > die. Hence, it can make sense to deliberately schedule threads > > > > sharing data to a single cluster. > > > > > > > > This patch simply exposes this information to userspace libraries > > > > like hwloc by providing cluster_cpus and related sysfs attributes. > > > > PoC of HWLOC support at [2]. > > > > > > > > Note this patch only handle the ACPI case. > > > > > > > > Special consideration is needed for SMT processors, where it is > > > > necessary to move 2 levels up the hierarchy from the leaf nodes > > > > (thus skipping the processor core level). > > > > Hi Peter, > > > > > > > > I'm confused by all of this. The core level is exactly what you seem to > > > want. > > > > It's the level above the core, whether in an multi-threaded core > > or a single threaded core. This may correspond to the level > > at which caches are shared (typically L3). Cores are already well > > represented via thread_siblings and similar. Extra confusion is that > > the current core_siblings (deprecated) sysfs interface, actually reflects > > the package level and ignores anything in between core and > > package (such as die on x86) > > > > So in a typical system with a hierarchical interconnect you would have > > > > thread > > core > > cluster (possibly multiple layers as mentioned in Brice's reply). > > die > > package > > > > Unfortunately as pointed out in other branches of this thread, there is > > no consistent generic name. I'm open to suggestions! > > IIUC, you are actually proposing another "die" level? I'm not sure if we > can actually come up with a generic name since interconnects are highly > implementation dependent. Brice mentioned hwloc is using 'group'. That seems generic enough perhaps. > > How is you memory distributed? Do you already have NUMA nodes? If you > want to keep tasks together, it might make sense to define the clusters > (in your case) as NUMA nodes. We already have all of the standard levels. We need at least one more. On a near future platform we'll have full set (kunpeng920 is single thread) So on kunpeng 920 we have cores (clusters) die / llc shared at this level package (multiple NUMA nodes in each package) System, multiple packages. > > > Both ACPI PPTT and DT provide generic structures to represent layers of > > topology. They don't name as such, but in ACPI there are flags to indicate > > package, core, thread. 
> > I think that is because those are the only ones that a fairly generic > :-) It is also the only ones that scheduler cares about (plus NUMA). Agreed, I'm not proposing we add these to the kernel scheduler (at least for now). Another layer just means another layer of complexity. > > > > > For example, in zen2 this would correspond to a 'core complex' consisting > > 4 CPU cores (each one 2 threads) sharing some local L3 cache. > > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 > > In zen3 it looks like this level will be the same as that for the die. > > > > Given they used the name in knights landing (and as is pointed out in > > another branch of this thread, it's the CPUID description) I think Intel > > calls these 'tiles' (anyone confirm that?) > > > > A similar concept exists for some ARM processors. > > https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110 > > CCLs in the diagram on that page. > > > > Centriq 2400 had 2 core 'duplexes' which shared l2. > > https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3 > > > > From the info release at hotchips, it looks like the thunderx3 deploys > > a similar ring interconnect with groups of cores, each with 4 threads. > > Not sure what they plan to call them yet though or whether they will chose > > to represent that layer of the topology in their firmware tables. > > > > Arms CMN600 interconnect also support such 'clusters' though I have no idea > > if anyone has used it in this form yet. In that case, they are called > > "processor compute clusters" > > https://developer.arm.com/documentation/100180/0103/ > > > > Xuantie-910 is cluster based as well (shares l2). > > > > So in many cases the cluster level corresponds to something we already have > > visibility of due to cache sharing etc, but that isn't true in kunpeng 920. > > The problem I see is that the benefit of keeping tasks together due to > the interconnect layout might vary significantly between systems. So if > we introduce a new cpumask for cluster it has to have represent roughly > the same system properties otherwise generic software consuming this > information could be tricked. Agreed. Any software would currently have to do it's own benchmarking to work out how to use the presented information. It would imply that you 'want to look at this group of CPUs' rather than providing any hard rules. The same is true of die, which we already have. What that means will vary enormously between different designs in a fashion that may or may not be independent of NUMA topology. Note, there are people who do extensive benchmarking of NUMA topology as the information provided is either inaccurate / missing, or not of sufficient detail to do their scheduling. It's not a big load to do that sort of stuff on startup of software on an HPC system. > > If there is a provable benefit of having interconnect grouping > information, I think it would be better represented by a distance matrix > like we have for NUMA. There have been some discussions in various forums about how to describe the complexity of interconnects well enough to actually be useful. Those have mostly floundered on the immense complexity of designing such a description in a fashion any normal software would actually use. +cc Jerome who raised some of this in the kernel a while back. 
Adding this cluster / group layer seems moderately safe as it just says 'something here you should consider', rather than making particular statements on expected performance etc. So I fully agree it would be good to have that info, if we can figure out how to do it! However, that is never going to be a short-term thing. Thanks, Jonathan > > Morten
+Cc Jeremy On 19/10/20 14:10, Morten Rasmussen wrote: > Hi Jonathan, > The problem I see is that the benefit of keeping tasks together due to > the interconnect layout might vary significantly between systems. So if > we introduce a new cpumask for cluster it has to have represent roughly > the same system properties otherwise generic software consuming this > information could be tricked. > > If there is a provable benefit of having interconnect grouping > information, I think it would be better represented by a distance matrix > like we have for NUMA. > > Morten That's my cue to paste some of that stuff I've been rambling on and off about! With regards to cache / interconnect layout, I do believe that if we want to support this in the scheduler itself then we should leverage some distance table rather than create X extra scheduler topology levels. I had a chat with Jeremy on the ACPI side of that some time ago. IIRC, given that SLIT gives us a distance value between any two PXMs, we could directly express core-to-core distance in that table. With that (and if that still lets us properly discover NUMA node spans), we could let the scheduler build dynamic NUMA-like topology levels representing the inner quirks of the cache / interconnect layout. It's mostly pipe dreams for now, but there seems to be more and more hardware where that would make sense; somewhat recently the PowerPC guys added something to their arch-specific code in that regard.
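For completeness, the SLIT table referred to above is just an NxN byte matrix behind a small header; a rough sketch of dumping it from /sys/firmware/acpi/tables/SLIT (needs root, layout per the ACPI spec, 10 meaning local, little-endian host assumed):

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      FILE *f = fopen("/sys/firmware/acpi/tables/SLIT", "rb");
      if (!f) { perror("SLIT"); return 1; }

      uint8_t hdr[36];                 /* standard ACPI table header */
      uint64_t n;                      /* number of system localities */
      if (fread(hdr, 1, sizeof(hdr), f) != sizeof(hdr) ||
          fread(&n, sizeof(n), 1, f) != 1)
          return 1;

      uint8_t *m = malloc(n * n);      /* NxN relative distance entries */
      if (!m || fread(m, 1, n * n, f) != n * n)
          return 1;

      for (uint64_t i = 0; i < n; i++) {
          for (uint64_t j = 0; j < n; j++)
              printf("%4d", m[i * n + j]);
          printf("\n");
      }
      free(m);
      fclose(f);
      return 0;
  }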
On Mon, Oct 19, 2020 at 02:41:57PM +0100, Jonathan Cameron wrote: > On Mon, 19 Oct 2020 15:10:52 +0200 > Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > > Hi Jonathan, > > > > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote: > > > On Mon, 19 Oct 2020 12:35:22 +0200 > > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > > > > > Both ACPI and DT provide the ability to describe additional layers of > > > > > topology between that of individual cores and higher level constructs > > > > > such as the level at which the last level cache is shared. > > > > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > > > > Node Structure [1] that is the parent of the CPU cores and in turn > > > > > has a parent Processor Hierarchy Nodes Structure representing > > > > > a higher level of topology. > > > > > > > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > > > > any cache resources, but the interconnect topology is such that > > > > > the cost to transfer ownership of a cacheline between CPUs within > > > > > a cluster is lower than between CPUs in different clusters on the same > > > > > die. Hence, it can make sense to deliberately schedule threads > > > > > sharing data to a single cluster. > > > > > > > > > > This patch simply exposes this information to userspace libraries > > > > > like hwloc by providing cluster_cpus and related sysfs attributes. > > > > > PoC of HWLOC support at [2]. > > > > > > > > > > Note this patch only handle the ACPI case. > > > > > > > > > > Special consideration is needed for SMT processors, where it is > > > > > necessary to move 2 levels up the hierarchy from the leaf nodes > > > > > (thus skipping the processor core level). > > > > > > Hi Peter, > > > > > > > > > > > I'm confused by all of this. The core level is exactly what you seem to > > > > want. > > > > > > It's the level above the core, whether in an multi-threaded core > > > or a single threaded core. This may correspond to the level > > > at which caches are shared (typically L3). Cores are already well > > > represented via thread_siblings and similar. Extra confusion is that > > > the current core_siblings (deprecated) sysfs interface, actually reflects > > > the package level and ignores anything in between core and > > > package (such as die on x86) > > > > > > So in a typical system with a hierarchical interconnect you would have > > > > > > thread > > > core > > > cluster (possibly multiple layers as mentioned in Brice's reply). > > > die > > > package > > > > > > Unfortunately as pointed out in other branches of this thread, there is > > > no consistent generic name. I'm open to suggestions! > > > > IIUC, you are actually proposing another "die" level? I'm not sure if we > > can actually come up with a generic name since interconnects are highly > > implementation dependent. > > Brice mentioned hwloc is using 'group'. That seems generic enough perhaps. > > > > > How is you memory distributed? Do you already have NUMA nodes? If you > > want to keep tasks together, it might make sense to define the clusters > > (in your case) as NUMA nodes. > > We already have all of the standard levels. We need at least one more. 
> On a near future platform we'll have full set (kunpeng920 is single thread) > > So on kunpeng 920 we have > cores > (clusters) > die / llc shared at this level IIRC, LLC sharing isn't tied to a specific level in the user-space topology description. On some Arm systems LLC is per cluster while the package has a single die with two clusters. I'm slightly confused about the cache sharing. You said above that your clusters don't share cache resources? This list says LLC is at die level, which is above cluster level? > package (multiple NUMA nodes in each package) What is your NUMA node span? Couldn't you just make it equivalent to your clusters? > > > For example, in zen2 this would correspond to a 'core complex' consisting > > > 4 CPU cores (each one 2 threads) sharing some local L3 cache. > > > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 > > > In zen3 it looks like this level will be the same as that for the die. > > > > > > Given they used the name in knights landing (and as is pointed out in > > > another branch of this thread, it's the CPUID description) I think Intel > > > calls these 'tiles' (anyone confirm that?) > > > > > > A similar concept exists for some ARM processors. > > > https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110 > > > CCLs in the diagram on that page. > > > > > > Centriq 2400 had 2 core 'duplexes' which shared l2. > > > https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3 > > > > > > From the info release at hotchips, it looks like the thunderx3 deploys > > > a similar ring interconnect with groups of cores, each with 4 threads. > > > Not sure what they plan to call them yet though or whether they will chose > > > to represent that layer of the topology in their firmware tables. > > > > > > Arms CMN600 interconnect also support such 'clusters' though I have no idea > > > if anyone has used it in this form yet. In that case, they are called > > > "processor compute clusters" > > > https://developer.arm.com/documentation/100180/0103/ > > > > > > Xuantie-910 is cluster based as well (shares l2). > > > > > > So in many cases the cluster level corresponds to something we already have > > > visibility of due to cache sharing etc, but that isn't true in kunpeng 920. > > > > The problem I see is that the benefit of keeping tasks together due to > > the interconnect layout might vary significantly between systems. So if > > we introduce a new cpumask for cluster it has to have represent roughly > > the same system properties otherwise generic software consuming this > > information could be tricked. > > Agreed. Any software would currently have to do it's own benchmarking > to work out how to use the presented information. It would imply that > you 'want to look at this group of CPUs' rather than providing any hard > rules. The same is true of die, which we already have. What > that means will vary enormously between different designs in a fashion > that may or may not be independent of NUMA topology. Note, > there are people who do extensive benchmarking of NUMA topology as > the information provided is either inaccurate / missing, or not of > sufficient detail to do their scheduling. It's not a big load to > do that sort of stuff on startup of software on an HPC system. With an undefined CPU grouping mask, you would still have to do some benchmarking to figure out if it is useful for managing your particular workload. 
It would certainly make the profiling simpler though. > > > > > If there is a provable benefit of having interconnect grouping > > information, I think it would be better represented by a distance matrix > > like we have for NUMA. > > There have been some discussions in various forums about how to > describe the complexity of interconnects well enough to actually be > useful. Those have mostly floundered on the immense complexity of > designing such a description in a fashion any normal software would actually > use. +cc Jerome who raised some of this in the kernel a while back. I agree that representing interconnect details is hard. I had hoped that a distance matrix would be better than nothing and more generic than inserting extra group masks. > > Add this cluster / group layer seems moderately safe as it just says > 'something here you should consider', rather than making particular > statements on expected performance etc. > > So I fully agree it would be good to have that info, if we can figure > out how to do it! However, that is never going to be a short term thing. A new sysfs entry could live for a long time :-) Morten
On Mon, 19 Oct 2020 14:48:02 +0100 Valentin Schneider <valentin.schneider@arm.com> wrote: > +Cc Jeremy > > On 19/10/20 14:10, Morten Rasmussen wrote: > > Hi Jonathan, > > The problem I see is that the benefit of keeping tasks together due to > > the interconnect layout might vary significantly between systems. So if > > we introduce a new cpumask for cluster it has to have represent roughly > > the same system properties otherwise generic software consuming this > > information could be tricked. > > > > If there is a provable benefit of having interconnect grouping > > information, I think it would be better represented by a distance matrix > > like we have for NUMA. > > > > Morten > > That's my queue to paste some of that stuff I've been rambling on and off > about! > > With regards to cache / interconnect layout, I do believe that if we > want to support in the scheduler itself then we should leverage some > distance table rather than to create X extra scheduler topology levels. > > I had a chat with Jeremy on the ACPI side of that sometime ago. IIRC given > that SLIT gives us a distance value between any two PXM, we could directly > express core-to-core distance in that table. With that (and if that still > lets us properly discover NUMA node spans), we could let the scheduler > build dynamic NUMA-like topology levels representing the inner quirks of > the cache / interconnect layout. You would rapidly run into the problem SLIT had for NUMA node description. There is no consistent description of distance, and except in the vaguest sense of 'nearer' it wasn't any use for anything. That is why HMAT came along. It's far from perfect but it is a step up. I can't see how you'd generalize those particular tables to do anything for intercore comms without breaking their use for NUMA, but something a bit similar might work. A lot of thought has gone in (and meeting time) to try and improve the situation for complex topology around NUMA. Whilst there are differences in representing the internal interconnects and caches, it seems like a somewhat similar problem. The issue there is that it is really, really hard to describe this stuff with enough detail to be useful, but simple enough to be usable. https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@redhat.com/ > > It's mostly pipe dreams for now, but there seems to be more and more > hardware where that would make sense; somewhat recently the PowerPC guys > added something to their arch-specific code in that regards. Pipe dream == something to work on ;) ACPI has a nice code-first model of updating the spec now, so we can discuss this one in public, and propose spec changes only once we have a proven implementation. Note I'm not proposing we put the cluster stuff in the scheduler, just provide it as a hint to userspace. Jonathan
On 19/10/2020 at 16:16, Morten Rasmussen wrote: > >>> If there is a provable benefit of having interconnect grouping >>> information, I think it would be better represented by a distance matrix >>> like we have for NUMA. >> There have been some discussions in various forums about how to >> describe the complexity of interconnects well enough to actually be >> useful. Those have mostly floundered on the immense complexity of >> designing such a description in a fashion any normal software would actually >> use. +cc Jerome who raised some of this in the kernel a while back. > I agree that representing interconnect details is hard. I had hoped that > a distance matrix would be better than nothing and more generic than > inserting extra group masks. > The distance matrix is indeed more precise, but would it scale to tens/hundreds of cores? When ACPI HMAT latency/bandwidth was added, there were concerns that exposing the full matrix would be an issue for the kernel (that's why only local latency/bandwidth is exposed in sysfs). This was only for NUMA nodes/targets/initiators; you would have significantly more cores than that. Brice
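A rough back-of-envelope on that scaling worry (one byte per matrix entry is assumed purely for illustration):

#include <stdio.h>

/*
 * Rough sizing of a full core-to-core distance matrix with one byte per
 * entry, versus the number of unique CPU pairs.  Illustrates why exposing
 * the whole matrix in sysfs gets awkward as core counts grow.
 */
int main(void)
{
	const int cpus[] = { 64, 128, 256, 512, 1024 };

	for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		long n = cpus[i];

		printf("%5ld CPUs: %8ld matrix bytes, %9ld unique pairs\n",
		       n, n * n, n * (n - 1) / 2);
	}
	return 0;
}

Even at a byte per entry, a 1024-CPU matrix is a megabyte of state before you get to the question of how to expose it (one big file, or a per-CPU vector like the node distances).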
On Mon, 19 Oct 2020 16:16:53 +0200 Morten Rasmussen <morten.rasmussen@arm.com> wrote: > On Mon, Oct 19, 2020 at 02:41:57PM +0100, Jonathan Cameron wrote: > > On Mon, 19 Oct 2020 15:10:52 +0200 > > Morten Rasmussen <morten.rasmussen@arm.com> wrote: > > > > > Hi Jonathan, > > > > > > On Mon, Oct 19, 2020 at 01:32:26PM +0100, Jonathan Cameron wrote: > > > > On Mon, 19 Oct 2020 12:35:22 +0200 > > > > Peter Zijlstra <peterz@infradead.org> wrote: > > > > > > > > > On Fri, Oct 16, 2020 at 11:27:02PM +0800, Jonathan Cameron wrote: > > > > > > Both ACPI and DT provide the ability to describe additional layers of > > > > > > topology between that of individual cores and higher level constructs > > > > > > such as the level at which the last level cache is shared. > > > > > > In ACPI this can be represented in PPTT as a Processor Hierarchy > > > > > > Node Structure [1] that is the parent of the CPU cores and in turn > > > > > > has a parent Processor Hierarchy Nodes Structure representing > > > > > > a higher level of topology. > > > > > > > > > > > > For example Kunpeng 920 has clusters of 4 CPUs. These do not share > > > > > > any cache resources, but the interconnect topology is such that > > > > > > the cost to transfer ownership of a cacheline between CPUs within > > > > > > a cluster is lower than between CPUs in different clusters on the same > > > > > > die. Hence, it can make sense to deliberately schedule threads > > > > > > sharing data to a single cluster. > > > > > > > > > > > > This patch simply exposes this information to userspace libraries > > > > > > like hwloc by providing cluster_cpus and related sysfs attributes. > > > > > > PoC of HWLOC support at [2]. > > > > > > > > > > > > Note this patch only handle the ACPI case. > > > > > > > > > > > > Special consideration is needed for SMT processors, where it is > > > > > > necessary to move 2 levels up the hierarchy from the leaf nodes > > > > > > (thus skipping the processor core level). > > > > > > > > Hi Peter, > > > > > > > > > > > > > > I'm confused by all of this. The core level is exactly what you seem to > > > > > want. > > > > > > > > It's the level above the core, whether in an multi-threaded core > > > > or a single threaded core. This may correspond to the level > > > > at which caches are shared (typically L3). Cores are already well > > > > represented via thread_siblings and similar. Extra confusion is that > > > > the current core_siblings (deprecated) sysfs interface, actually reflects > > > > the package level and ignores anything in between core and > > > > package (such as die on x86) > > > > > > > > So in a typical system with a hierarchical interconnect you would have > > > > > > > > thread > > > > core > > > > cluster (possibly multiple layers as mentioned in Brice's reply). > > > > die > > > > package > > > > > > > > Unfortunately as pointed out in other branches of this thread, there is > > > > no consistent generic name. I'm open to suggestions! > > > > > > IIUC, you are actually proposing another "die" level? I'm not sure if we > > > can actually come up with a generic name since interconnects are highly > > > implementation dependent. > > > > Brice mentioned hwloc is using 'group'. That seems generic enough perhaps. > > > > > > > > How is you memory distributed? Do you already have NUMA nodes? If you > > > want to keep tasks together, it might make sense to define the clusters > > > (in your case) as NUMA nodes. > > > > We already have all of the standard levels. We need at least one more. 
> > On a near future platform we'll have full set (kunpeng920 is single thread) > > > > So on kunpeng 920 we have > > cores > > (clusters) > > die / llc shared at this level > > IIRC, LLC sharing isn't tied to a specific level in the user-space > topology description. On some Arm systems LLC is per cluster while the > package has a single die with two clusters. That was just an example of this system where it happens to sit at that level. Nothing stops per core private LLC with the right coherence protocol. Basically LLC can sit at any level. It is indirectly exposed and used by userspace - though which level it attaches to or whether it defines it own varies. So in cpuX/topology there is no mention of caches, but if you are using hwloc it also uses cpuX/cache/indexX and from that it will happily add more layers of topology to reflect cache sharing. It will merge the cache levels with existing levels from cpuX/topology if they happen to match. So it may not have been the intent of the sysfs design, but a commonly used library is figuring it out anyway. > > I'm slightly confused about the cache sharing. You said above that your > clusters don't share cache resources? This list says LLC is at die > level, which is above cluster level? Yes. Cluster is below the level at which the l3 is shared. It's more complex than that in reality but at the coarse scale we can describe at all, the l3 is at the die level. From lstopo (heavily cropped obviously, it's a 128 core 2 socket system) Machine (1006GB total) Package L#0 L3 L#0 (32MB) NUMANode L#0 (P#0 252GB) Cluster L#0 L2 L#0 (512KB) + L1d L#0 (64KB) + L1i L#0 (64KB) + Core L#0 + PU L#0 (P#0) L2 L#1 (512KB) + L1d L#1 (64KB) + L1i L#1 (64KB) + Core L#1 + PU L#1 (P#1) L2 L#2 (512KB) + L1d L#2 (64KB) + L1i L#2 (64KB) + Core L#2 + PU L#2 (P#2) L2 L#3 (512KB) + L1d L#3 (64KB) + L1i L#3 (64KB) + Core L#3 + PU L#3 (P#3) Cluster L#1 L2 L#4 (512KB) + L1d L#4 (64KB) + L1i L#4 (64KB) + Core L#4 + PU L#4 (P#4) L2 L#5 (512KB) + L1d L#5 (64KB) + L1i L#5 (64KB) + Core L#5 + PU L#5 (P#5) L2 L#6 (512KB) + L1d L#6 (64KB) + L1i L#6 (64KB) + Core L#6 + PU L#6 (P#6) L2 L#7 (512KB) + L1d L#7 (64KB) + L1i L#7 (64KB) + Core L#7 + PU L#7 (P#7) ... L3 L#1 (32MB) NUMANode L#1 (P#1 252GB) Cluster L#8 L2 L#32 (512KB) + L1d L#32 (64KB) + L1i L#32 (64KB) + Core L#32 + PU L#32 (P#32) L2 L#33 (512KB) + L1d L#33 (64KB) + L1i L#33 (64KB) + Core L#33 + PU L#33 (P#33) L2 L#34 (512KB) + L1d L#34 (64KB) + L1i L#34 (64KB) + Core L#34 + PU L#34 (P#34) L2 L#35 (512KB) + L1d L#35 (64KB) + L1i L#35 (64KB) + Core L#35 + PU L#35 (P#35) ... Package L#1 L3 L#2 (32MB) NUMANode L#2 (P#2 252GB) Cluster L#16 L2 L#64 (512KB) + L1d L#64 (64KB) + L1i L#64 (64KB) + Core L#64 + PU L#64 (P#64) L2 L#65 (512KB) + L1d L#65 (64KB) + L1i L#65 (64KB) + Core L#65 + PU L#65 (P#65) L2 L#66 (512KB) + L1d L#66 (64KB) + L1i L#66 (64KB) + Core L#66 + PU L#66 (P#66) L2 L#67 (512KB) + L1d L#67 (64KB) + L1i L#67 (64KB) + Core L#67 + PU L#67 (P#67) etc > > package (multiple NUMA nodes in each package) > > What is your NUMA node span? Couldn't you just make it equivalent to > your clusters? Nope can't do that. System has 2 nodes per package, multiple packages per system and interconnect hangs off one end of the each package. It's one of the platforms that has been causing Valentin issues with span. (span 3 I think?) 2 socket system basically looks like [a]-[b]-----[c]-[d] where dash length reflects distance and there isn't a direct [a] to [d] path. 
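Coming back to the hwloc observation above, a minimal sketch of deriving a grouping purely from the existing cache sysfs files, with no reference to cpuX/topology at all (index3 is assumed to be the shared L3 on the machine in question, and error handling is trimmed):

#include <stdio.h>
#include <string.h>

/*
 * Print, for each CPU, the set of CPUs it shares its index3 (assumed LLC)
 * cache instance with, straight from sysfs.  A tool like hwloc can turn
 * matching shared_cpu_list values into an extra topology level even
 * though cpuX/topology/ says nothing about caches.
 */
static int read_line(const char *path, char *buf, int len)
{
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	buf[strcspn(buf, "\n")] = '\0';
	fclose(f);
	return 0;
}

int main(void)
{
	char path[128], cpus[256];

	for (int cpu = 0; ; cpu++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/cpu/cpu%d/cache/index3/shared_cpu_list",
			 cpu);
		if (read_line(path, cpus, sizeof(cpus)))
			break;	/* no such CPU or no index3 cache: stop */
		printf("cpu%d shares its index3 cache with: %s\n", cpu, cpus);
	}
	return 0;
}

CPUs whose shared_cpu_list matches end up in the same inferred group, which is exactly the extra level lstopo is showing in the output above.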
We have tried messing around with dropping the die level of NUMA description in favour of representing cluster in the past, but it worked worse than going for NUMA at the die level (which is where the memory controllers are anyway). The cluster affects are generally more minor than NUMA. Hence I'm not proposing putting them anywhere they'd current affect the kernel scheduler. > > > > > For example, in zen2 this would correspond to a 'core complex' consisting > > > > 4 CPU cores (each one 2 threads) sharing some local L3 cache. > > > > https://en.wikichip.org/wiki/amd/microarchitectures/zen_2 > > > > In zen3 it looks like this level will be the same as that for the die. > > > > > > > > Given they used the name in knights landing (and as is pointed out in > > > > another branch of this thread, it's the CPUID description) I think Intel > > > > calls these 'tiles' (anyone confirm that?) > > > > > > > > A similar concept exists for some ARM processors. > > > > https://en.wikichip.org/wiki/hisilicon/microarchitectures/taishan_v110 > > > > CCLs in the diagram on that page. > > > > > > > > Centriq 2400 had 2 core 'duplexes' which shared l2. > > > > https://www.anandtech.com/show/11737/analyzing-falkors-microarchitecture-a-deep-dive-into-qualcomms-centriq-2400-for-windows-server-and-linux/3 > > > > > > > > From the info release at hotchips, it looks like the thunderx3 deploys > > > > a similar ring interconnect with groups of cores, each with 4 threads. > > > > Not sure what they plan to call them yet though or whether they will chose > > > > to represent that layer of the topology in their firmware tables. > > > > > > > > Arms CMN600 interconnect also support such 'clusters' though I have no idea > > > > if anyone has used it in this form yet. In that case, they are called > > > > "processor compute clusters" > > > > https://developer.arm.com/documentation/100180/0103/ > > > > > > > > Xuantie-910 is cluster based as well (shares l2). > > > > > > > > So in many cases the cluster level corresponds to something we already have > > > > visibility of due to cache sharing etc, but that isn't true in kunpeng 920. > > > > > > The problem I see is that the benefit of keeping tasks together due to > > > the interconnect layout might vary significantly between systems. So if > > > we introduce a new cpumask for cluster it has to have represent roughly > > > the same system properties otherwise generic software consuming this > > > information could be tricked. > > > > Agreed. Any software would currently have to do it's own benchmarking > > to work out how to use the presented information. It would imply that > > you 'want to look at this group of CPUs' rather than providing any hard > > rules. The same is true of die, which we already have. What > > that means will vary enormously between different designs in a fashion > > that may or may not be independent of NUMA topology. Note, > > there are people who do extensive benchmarking of NUMA topology as > > the information provided is either inaccurate / missing, or not of > > sufficient detail to do their scheduling. It's not a big load to > > do that sort of stuff on startup of software on an HPC system. > > With an undefined CPU grouping mask, you would still have to do some > benchmarking to figure out if it is useful for managing your particular > workload. It would certainly make the profiling simpler though. Absolutely. Playing guess for the scale of clusters is not going to be pretty. 
I suppose you'd probably catch most systems if you brute-forced 2, 4 and 8 and hoped your CPUs were in sane order (optimistic). > > > > > > > > > If there is a provable benefit of having interconnect grouping > > > information, I think it would be better represented by a distance matrix > > > like we have for NUMA. > > > > There have been some discussions in various forums about how to > > describe the complexity of interconnects well enough to actually be > > useful. Those have mostly floundered on the immense complexity of > > designing such a description in a fashion any normal software would actually > > use. +cc Jerome who raised some of this in the kernel a while back. > > I agree that representing interconnect details is hard. I had hoped that > a distance matrix would be better than nothing and more generic than > inserting extra group masks. Short of effectively mapping it to groups (which is basically what SLIT ends up being used for) I'm not sure how on earth we'd define it. Distance for what operation? Shared access to a cacheline? Get exclusive? (with all the fun of chatting to directories and possible DCT) There are a whole set of possibilities and we might potentially want bandwidth. All of these come with a bunch of questions around access patterns etc (probably - I've not really thought about this that much for this case!) Basically we'd be trying to wrap up all the complex stuff those interconnect and cache specialists work on in a single number. It's going to just become an ordered list of who is nearer, i.e. another way of representing the groupings we can already describe in firmware. > > > > > Add this cluster / group layer seems moderately safe as it just says > > 'something here you should consider', rather than making particular > > statements on expected performance etc. > > > > So I fully agree it would be good to have that info, if we can figure > > out how to do it! However, that is never going to be a short term thing. > > A new sysfs entry could live for a long time :-) Agreed. But if it is generic and easy to support from existing description (which this is) then that shouldn't be a problem. Definitely good to get it as good as we can at the start though. If something better comes along that is more useful, people can transition to that, and it doesn't matter as long as the burden of maintaining the old route is trivial. I'm absolutely in favour of better description and happy to work with people on that, but it's a multi-year job at the very least. We all love standards bodies :) The biggest job here would be proving the info was useful and generic across many 'interesting' system architectures, many of which also have more than one mode in which you can run the caches. > > Morten Jonathan
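For what the do-it-yourself benchmarking mentioned above might look like, a minimal sketch: pin two threads to a pair of CPUs and bounce a single cache line between them, then compare pairs inside and across a suspected cluster (the default CPU numbers, iteration count and 64-byte line size are all assumptions):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Crude core-to-core "distance" probe: two threads pinned to two CPUs
 * take turns flipping a shared, cache-line-aligned counter.  The round
 * trip is dominated by the cost of moving the line between the CPUs, so
 * comparing pairs inside and across a suspected cluster gives a rough
 * empirical grouping.
 */
#define ITERS 1000000

static _Alignas(64) atomic_int flag;

struct arg { int cpu; int parity; };

static void *pingpong(void *p)
{
	struct arg *a = p;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(a->cpu, &set);
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (int i = 0; i < ITERS; i++) {
		/* Wait for our turn, then hand the line back. */
		while ((atomic_load_explicit(&flag, memory_order_acquire) & 1)
		       != a->parity)
			;
		atomic_fetch_add_explicit(&flag, 1, memory_order_release);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	struct arg a0 = { argc > 1 ? atoi(argv[1]) : 0, 0 };
	struct arg a1 = { argc > 2 ? atoi(argv[2]) : 1, 1 };
	struct timespec t0, t1;
	pthread_t th0, th1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	pthread_create(&th0, NULL, pingpong, &a0);
	pthread_create(&th1, NULL, pingpong, &a1);
	pthread_join(th0, NULL);
	pthread_join(th1, NULL);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("cpu%d <-> cpu%d: %.1f ns per round trip\n",
	       a0.cpu, a1.cpu, ns / ITERS);
	return 0;
}

Build with -pthread. On a system like the Kunpeng 920 described above, the expectation is that intra-cluster pairs report noticeably lower round-trip times than cross-cluster ones, which is exactly the property the proposed sysfs hint is meant to spare applications from having to measure.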
On 19/10/20 15:27, Jonathan Cameron wrote: > On Mon, 19 Oct 2020 14:48:02 +0100 > Valentin Schneider <valentin.schneider@arm.com> wrote: >> >> That's my queue to paste some of that stuff I've been rambling on and off >> about! >> >> With regards to cache / interconnect layout, I do believe that if we >> want to support in the scheduler itself then we should leverage some >> distance table rather than to create X extra scheduler topology levels. >> >> I had a chat with Jeremy on the ACPI side of that sometime ago. IIRC given >> that SLIT gives us a distance value between any two PXM, we could directly >> express core-to-core distance in that table. With that (and if that still >> lets us properly discover NUMA node spans), we could let the scheduler >> build dynamic NUMA-like topology levels representing the inner quirks of >> the cache / interconnect layout. > > You would rapidly run into the problem SLIT had for numa node description. > There is no consistent description of distance and except in the vaguest > sense or 'nearer' it wasn't any use for anything. That is why HMAT > came along. It's far from perfect but it is a step up. > I wasn't aware of HMAT; my feeble ACPI knowledge is limited to SRAT / SLIT / PPTT, so thanks for pointing this out. > I can't see how you'd generalize those particular tables to do anything > for intercore comms without breaking their use for NUMA, but something > a bit similar might work. > Right, there's the issue of still being able to determine NUMA node boundaries. > A lot of thought has gone in (and meeting time) to try an improve the > situation for complex topology around NUMA. Whilst there are differences > in representing the internal interconnects and caches it seems like a somewhat > similar problem. The issue there is it is really really hard to describe > this stuff with enough detail to be useful, but simple enough to be usable. > > https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@redhat.com/ > Thanks for the link! >> >> It's mostly pipe dreams for now, but there seems to be more and more >> hardware where that would make sense; somewhat recently the PowerPC guys >> added something to their arch-specific code in that regards. > > Pipe dream == something to work on ;) > > ACPI has a nice code first model of updating the spec now, so we can discuss > this one in public, and propose spec changes only once we have an implementation > proven. > FWIW I blabbered about a "generalization" of NUMA domains & distances within the scheduler at LPC19 (and have been pasting that occasionally, apologies for the broken record): https://linuxplumbersconf.org/event/4/contributions/484/ I've only pondered about the implementation, but if (big if; also I really despise advertising "the one solution that will solve all your issues" which this is starting to sound like) it would help I could cobble together an RFC leveraging a separate distance table. It doesn't solve the "funneling cache properties into a single number" issue, which as you just pointed out in a parallel email is a separate discussion altogether. > Note I'm not proposing we put the cluster stuff in the scheduler, just > provide it as a hint to userspace. > The goal being to tweak tasks' affinities, right? Other than CPU pinning and rare cases, IMO if the userspace has to mess around with affinities it is due to the failings of the underlying scheduler. 
Restricted CPU affinities are also something the load-balancer struggles with; I have been, and still am, fighting such issues where just a single per-CPU kworker waking up at the wrong time can mess up load-balance for quite some time. I tend to phrase it as: "if you're rude to the scheduler, it can and will respond in kind". Now yes, it's not the same timescale nor amount of work, but this is something the scheduler itself should leverage, not userspace. > Jonathan
On Mon, 19 Oct 2020 16:51:06 +0100 Valentin Schneider <valentin.schneider@arm.com> wrote: > On 19/10/20 15:27, Jonathan Cameron wrote: > > On Mon, 19 Oct 2020 14:48:02 +0100 > > Valentin Schneider <valentin.schneider@arm.com> wrote: > >> > >> That's my queue to paste some of that stuff I've been rambling on and off > >> about! > >> > >> With regards to cache / interconnect layout, I do believe that if we > >> want to support in the scheduler itself then we should leverage some > >> distance table rather than to create X extra scheduler topology levels. > >> > >> I had a chat with Jeremy on the ACPI side of that sometime ago. IIRC given > >> that SLIT gives us a distance value between any two PXM, we could directly > >> express core-to-core distance in that table. With that (and if that still > >> lets us properly discover NUMA node spans), we could let the scheduler > >> build dynamic NUMA-like topology levels representing the inner quirks of > >> the cache / interconnect layout. > > > > You would rapidly run into the problem SLIT had for numa node description. > > There is no consistent description of distance and except in the vaguest > > sense or 'nearer' it wasn't any use for anything. That is why HMAT > > came along. It's far from perfect but it is a step up. > > > > I wasn't aware of HMAT; my feeble ACPI knowledge is limited to SRAT / SLIT > / PPTT, so thanks for pointing this out. > > > I can't see how you'd generalize those particular tables to do anything > > for intercore comms without breaking their use for NUMA, but something > > a bit similar might work. > > > > Right, there's the issue of still being able to determine NUMA node > boundaries. Backwards compatibility will break you there. I'd definitely look at a separate table. Problem with SLIT etc is that, as static tables, we can't play games with OSC bits to negotiate what the OS and the firmware both understand. > > > A lot of thought has gone in (and meeting time) to try an improve the > > situation for complex topology around NUMA. Whilst there are differences > > in representing the internal interconnects and caches it seems like a somewhat > > similar problem. The issue there is it is really really hard to describe > > this stuff with enough detail to be useful, but simple enough to be usable. > > > > https://lore.kernel.org/linux-mm/20181203233509.20671-1-jglisse@redhat.com/ > > > > Thanks for the link! > > >> > >> It's mostly pipe dreams for now, but there seems to be more and more > >> hardware where that would make sense; somewhat recently the PowerPC guys > >> added something to their arch-specific code in that regards. > > > > Pipe dream == something to work on ;) > > > > ACPI has a nice code first model of updating the spec now, so we can discuss > > this one in public, and propose spec changes only once we have an implementation > > proven. > > > > FWIW I blabbered about a "generalization" of NUMA domains & distances > within the scheduler at LPC19 (and have been pasting that occasionally, > apologies for the broken record): > > https://linuxplumbersconf.org/event/4/contributions/484/ > > I've only pondered about the implementation, but if (big if; also I really > despise advertising "the one solution that will solve all your issues" > which this is starting to sound like) it would help I could cobble together > an RFC leveraging a separate distance table. It would certainly be interesting. 
> > It doesn't solve the "funneling cache properties into a single number" > issue, which as you just pointed out in a parallel email is a separate > discussion altogether. > > > Note I'm not proposing we put the cluster stuff in the scheduler, just > > provide it as a hint to userspace. > > > > The goal being to tweak tasks' affinities, right? Other than CPU pinning > and rare cases, IMO if the userspace has to mess around with affinities it > is due to the failings of the underlying scheduler. Restricted CPU > affinities is also something the load-balancer struggles with; I have and > have been fighting over such issues where just a single per-CPU kworker > waking up at the wrong time can mess up load-balance for quite some time. I > tend to phrase it as: "if you're rude to the scheduler, it can and will > respond in kind". > > Now yes, it's not the same timescale nor amount of work, but this is > something the scheduler itself should leverage, not userspace. Ideally I absolutely agree, but then we get into the games of trying to classify the types of workload which would benefit. Much like with NUMA spreading, it is going to be hard to come up with a one true solution (nice though that would be!) Not getting regressions with anything in this area is going to be really tricky. J > > > Jonathan
diff --git a/Documentation/admin-guide/cputopology.rst b/Documentation/admin-guide/cputopology.rst index b90dafcc8237..f9d374560047 100644 --- a/Documentation/admin-guide/cputopology.rst +++ b/Documentation/admin-guide/cputopology.rst @@ -24,6 +24,12 @@ core_id: identifier (rather than the kernel's). The actual value is architecture and platform dependent. +cluster_id: + + the Cluster ID of cpuX. Typically it is the hardware platform's + identifier (rather than the kernel's). The actual value is + architecture and platform dependent. + book_id: the book ID of cpuX. Typically it is the hardware platform's @@ -56,6 +62,14 @@ package_cpus_list: human-readable list of CPUs sharing the same physical_package_id. (deprecated name: "core_siblings_list") +cluster_cpus: + + internal kernel map of CPUs within the same cluster. + +cluster_cpus_list: + + human-readable list of CPUs within the same cluster. + die_cpus: internal kernel map of CPUs within the same die. @@ -96,11 +110,13 @@ these macros in include/asm-XXX/topology.h:: #define topology_physical_package_id(cpu) #define topology_die_id(cpu) + #define topology_cluster_id(cpu) #define topology_core_id(cpu) #define topology_book_id(cpu) #define topology_drawer_id(cpu) #define topology_sibling_cpumask(cpu) #define topology_core_cpumask(cpu) + #define topology_cluster_cpumask(cpu) #define topology_die_cpumask(cpu) #define topology_book_cpumask(cpu) #define topology_drawer_cpumask(cpu) @@ -116,10 +132,12 @@ not defined by include/asm-XXX/topology.h: 1) topology_physical_package_id: -1 2) topology_die_id: -1 -3) topology_core_id: 0 -4) topology_sibling_cpumask: just the given CPU -5) topology_core_cpumask: just the given CPU -6) topology_die_cpumask: just the given CPU +3) topology_cluster_id: -1 +4) topology_core_id: 0 +5) topology_sibling_cpumask: just the given CPU +6) topology_core_cpumask: just the given CPU +7) topology_cluster_cpumask: just the given CPU +8) topology_die_cpumask: just the given CPU For architectures that don't support books (CONFIG_SCHED_BOOK) there are no default definitions for topology_book_id() and topology_book_cpumask(). diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c index 0801a0f3c156..4c40240fdf6a 100644 --- a/arch/arm64/kernel/topology.c +++ b/arch/arm64/kernel/topology.c @@ -101,6 +101,8 @@ int __init parse_acpi_topology(void) cpu_topology[cpu].thread_id = -1; cpu_topology[cpu].core_id = topology_id; } + topology_id = find_acpi_cpu_topology_cluster(cpu); + cpu_topology[cpu].cluster_id = topology_id; topology_id = find_acpi_cpu_topology_package(cpu); cpu_topology[cpu].package_id = topology_id; diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c index 4ae93350b70d..8646a93e4348 100644 --- a/drivers/acpi/pptt.c +++ b/drivers/acpi/pptt.c @@ -736,6 +736,66 @@ int find_acpi_cpu_topology_package(unsigned int cpu) ACPI_PPTT_PHYSICAL_PACKAGE); } +/** + * find_acpi_cpu_topology_cluster() - Determine a unique CPU cluster value + * @cpu: Kernel logical CPU number + * + * Determine a topology unique cluster ID for the given CPU/thread. + * This ID can then be used to group peers, which will have matching ids. + * + * The cluster, if present is the level of topology above CPUs. In a + * multi-thread CPU, it will be the level above the CPU, not the thread. + * It may not exist in single CPU systems. In simple multi-CPU systems, + * it may be equal to the package topology level. + * + * Return: -ENOENT if the PPTT doesn't exist, the CPU cannot be found + * or there is no toplogy level above the CPU.. 
+ * Otherwise returns a value which represents the package for this CPU. + */ + +int find_acpi_cpu_topology_cluster(unsigned int cpu) +{ + struct acpi_table_header *table; + acpi_status status; + struct acpi_pptt_processor *cpu_node, *cluster_node; + int retval; + int is_thread; + + status = acpi_get_table(ACPI_SIG_PPTT, 0, &table); + if (ACPI_FAILURE(status)) { + acpi_pptt_warn_missing(); + return -ENOENT; + } + cpu_node = acpi_find_processor_node(table, cpu); + if (cpu_node == NULL || !cpu_node->parent) { + retval = -ENOENT; + goto put_table; + } + + is_thread = cpu_node->flags & ACPI_PPTT_ACPI_PROCESSOR_IS_THREAD; + cluster_node = fetch_pptt_node(table, cpu_node->parent); + if (cluster_node == NULL) { + retval = -ENOENT; + goto put_table; + } + if (is_thread) { + if (!cluster_node->parent) { + retval = -ENOENT; + goto put_table; + } + cluster_node = fetch_pptt_node(table, cluster_node->parent); + if (cluster_node == NULL) { + retval = -ENOENT; + goto put_table; + } + } + retval = ACPI_PTR_DIFF(cluster_node, table); +put_table: + acpi_put_table(table); + + return retval; +} + /** * find_acpi_cpu_topology_hetero_id() - Get a core architecture tag * @cpu: Kernel logical CPU number diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index 75f72d684294..e2ca8f39131e 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -497,6 +497,11 @@ const struct cpumask *cpu_coregroup_mask(int cpu) return core_mask; } +const struct cpumask *cpu_clustergroup_mask(int cpu) +{ + return &cpu_topology[cpu].cluster_sibling; +} + void update_siblings_masks(unsigned int cpuid) { struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; @@ -514,6 +519,11 @@ void update_siblings_masks(unsigned int cpuid) if (cpuid_topo->package_id != cpu_topo->package_id) continue; + if (cpuid_topo->cluster_id == cpu_topo->cluster_id) { + cpumask_set_cpu(cpu, &cpuid_topo->cluster_sibling); + cpumask_set_cpu(cpuid, &cpu_topo->cluster_sibling); + } + cpumask_set_cpu(cpuid, &cpu_topo->core_sibling); cpumask_set_cpu(cpu, &cpuid_topo->core_sibling); @@ -532,6 +542,9 @@ static void clear_cpu_topology(int cpu) cpumask_clear(&cpu_topo->llc_sibling); cpumask_set_cpu(cpu, &cpu_topo->llc_sibling); + cpumask_clear(&cpu_topo->cluster_sibling); + cpumask_set_cpu(cpu, &cpu_topo->cluster_sibling); + cpumask_clear(&cpu_topo->core_sibling); cpumask_set_cpu(cpu, &cpu_topo->core_sibling); cpumask_clear(&cpu_topo->thread_sibling); @@ -562,6 +575,7 @@ void remove_cpu_topology(unsigned int cpu) cpumask_clear_cpu(cpu, topology_core_cpumask(sibling)); for_each_cpu(sibling, topology_sibling_cpumask(cpu)) cpumask_clear_cpu(cpu, topology_sibling_cpumask(sibling)); + for_each_cpu(sibling, topology_llc_cpumask(cpu)) cpumask_clear_cpu(cpu, topology_llc_cpumask(sibling)); diff --git a/drivers/base/topology.c b/drivers/base/topology.c index ad8d33c6077b..f72ac9acc112 100644 --- a/drivers/base/topology.c +++ b/drivers/base/topology.c @@ -46,6 +46,9 @@ static DEVICE_ATTR_RO(physical_package_id); define_id_show_func(die_id); static DEVICE_ATTR_RO(die_id); +define_id_show_func(cluster_id); +static DEVICE_ATTR_RO(cluster_id); + define_id_show_func(core_id); static DEVICE_ATTR_RO(core_id); @@ -61,6 +64,10 @@ define_siblings_show_func(core_siblings, core_cpumask); static DEVICE_ATTR_RO(core_siblings); static DEVICE_ATTR_RO(core_siblings_list); +define_siblings_show_func(cluster_cpus, cluster_cpumask); +static DEVICE_ATTR_RO(cluster_cpus); +static DEVICE_ATTR_RO(cluster_cpus_list); + 
define_siblings_show_func(die_cpus, die_cpumask); static DEVICE_ATTR_RO(die_cpus); static DEVICE_ATTR_RO(die_cpus_list); @@ -88,6 +95,7 @@ static DEVICE_ATTR_RO(drawer_siblings_list); static struct attribute *default_attrs[] = { &dev_attr_physical_package_id.attr, &dev_attr_die_id.attr, + &dev_attr_cluster_id.attr, &dev_attr_core_id.attr, &dev_attr_thread_siblings.attr, &dev_attr_thread_siblings_list.attr, @@ -95,6 +103,8 @@ static struct attribute *default_attrs[] = { &dev_attr_core_cpus_list.attr, &dev_attr_core_siblings.attr, &dev_attr_core_siblings_list.attr, + &dev_attr_cluster_cpus.attr, + &dev_attr_cluster_cpus_list.attr, &dev_attr_die_cpus.attr, &dev_attr_die_cpus_list.attr, &dev_attr_package_cpus.attr, diff --git a/include/linux/acpi.h b/include/linux/acpi.h index 159a461d8524..05b1e30f43fc 100644 --- a/include/linux/acpi.h +++ b/include/linux/acpi.h @@ -1333,6 +1333,7 @@ static inline int lpit_read_residency_count_address(u64 *address) #ifdef CONFIG_ACPI_PPTT int acpi_pptt_cpu_is_thread(unsigned int cpu); int find_acpi_cpu_topology(unsigned int cpu, int level); +int find_acpi_cpu_topology_cluster(unsigned int cpu); int find_acpi_cpu_topology_package(unsigned int cpu); int find_acpi_cpu_topology_hetero_id(unsigned int cpu); int find_acpi_cpu_cache_topology(unsigned int cpu, int level); @@ -1345,6 +1346,10 @@ static inline int find_acpi_cpu_topology(unsigned int cpu, int level) { return -EINVAL; } +static inline int find_acpi_cpu_topology_cluster(unsigned int cpu) +{ + return -EINVAL; +} static inline int find_acpi_cpu_topology_package(unsigned int cpu) { return -EINVAL; diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h index 69b1dabe39dc..dba06864eab5 100644 --- a/include/linux/arch_topology.h +++ b/include/linux/arch_topology.h @@ -45,10 +45,12 @@ void topology_set_thermal_pressure(const struct cpumask *cpus, struct cpu_topology { int thread_id; int core_id; + int cluster_id; int package_id; int llc_id; cpumask_t thread_sibling; cpumask_t core_sibling; + cpumask_t cluster_sibling; cpumask_t llc_sibling; }; @@ -56,13 +58,16 @@ struct cpu_topology { extern struct cpu_topology cpu_topology[NR_CPUS]; #define topology_physical_package_id(cpu) (cpu_topology[cpu].package_id) +#define topology_cluster_id(cpu) (cpu_topology[cpu].cluster_id) #define topology_core_id(cpu) (cpu_topology[cpu].core_id) #define topology_core_cpumask(cpu) (&cpu_topology[cpu].core_sibling) #define topology_sibling_cpumask(cpu) (&cpu_topology[cpu].thread_sibling) +#define topology_cluster_cpumask(cpu) (&cpu_topology[cpu].cluster_sibling) #define topology_llc_cpumask(cpu) (&cpu_topology[cpu].llc_sibling) void init_cpu_topology(void); void store_cpu_topology(unsigned int cpuid); const struct cpumask *cpu_coregroup_mask(int cpu); +const struct cpumask *cpu_clustergroup_mask(int cpu); void update_siblings_masks(unsigned int cpu); void remove_cpu_topology(unsigned int cpuid); void reset_cpu_topology(void); diff --git a/include/linux/topology.h b/include/linux/topology.h index 608fa4aadf0e..5f666488f8f1 100644 --- a/include/linux/topology.h +++ b/include/linux/topology.h @@ -185,6 +185,9 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_die_id #define topology_die_id(cpu) ((void)(cpu), -1) #endif +#ifndef topology_cluster_id +#define topology_cluster_id(cpu) ((void)(cpu), -1) +#endif #ifndef topology_core_id #define topology_core_id(cpu) ((void)(cpu), 0) #endif @@ -194,6 +197,9 @@ static inline int cpu_to_mem(int cpu) #ifndef topology_core_cpumask #define topology_core_cpumask(cpu) 
cpumask_of(cpu) #endif +#ifndef topology_cluster_cpumask +#define topology_cluster_cpumask(cpu) cpumask_of(cpu) +#endif #ifndef topology_die_cpumask #define topology_die_cpumask(cpu) cpumask_of(cpu) #endif
Both ACPI and DT provide the ability to describe additional layers of topology between that of individual cores and higher level constructs such as the level at which the last level cache is shared. In ACPI this can be represented in PPTT as a Processor Hierarchy Node Structure [1] that is the parent of the CPU cores and in turn has a parent Processor Hierarchy Node Structure representing a higher level of topology.

For example Kunpeng 920 has clusters of 4 CPUs. These do not share any cache resources, but the interconnect topology is such that the cost to transfer ownership of a cacheline between CPUs within a cluster is lower than between CPUs in different clusters on the same die. Hence, it can make sense to deliberately schedule threads sharing data to a single cluster.

This patch simply exposes this information to userspace libraries like hwloc by providing cluster_cpus and related sysfs attributes. PoC of HWLOC support at [2]. Note this patch only handles the ACPI case.

Special consideration is needed for SMT processors, where it is necessary to move 2 levels up the hierarchy from the leaf nodes (thus skipping the processor core level).

Currently the ID provided is the offset of the Processor Hierarchy Node Structure within PPTT. Whilst this is unique, it is not terribly elegant, so alternative suggestions are welcome.

Note that arm64 / ACPI does not provide any means of identifying a die level in the topology, but that may be unrelated to the cluster level.

RFC questions:
1) Naming.
2) Related to naming, do we want to represent all potential levels, or is this enough? On Kunpeng 920, the next level up from cluster happens to be covered by LLC cache sharing, but in theory more than one level of cluster description might be needed by some future system.
3) Do we need DT code in place? I'm not sure any DT based ARM64 systems would have enough complexity for this to be useful.
4) Other architectures? Is this useful on x86 for example?

[1] ACPI Specification 6.3 - section 5.2.29.1 processor hierarchy node structure (Type 0)
[2] https://github.com/hisilicon/hwloc/tree/linux-cluster

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 Documentation/admin-guide/cputopology.rst | 26 ++++++++--
 arch/arm64/kernel/topology.c              |  2 +
 drivers/acpi/pptt.c                       | 60 +++++++++++++++++++++++
 drivers/base/arch_topology.c              | 14 ++++++
 drivers/base/topology.c                   | 10 ++++
 include/linux/acpi.h                      |  5 ++
 include/linux/arch_topology.h             |  5 ++
 include/linux/topology.h                  |  6 +++
 8 files changed, 124 insertions(+), 4 deletions(-)

--
2.19.1
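Assuming the patch as posted, consuming the new attributes from userspace is just more of the same file reading as for the existing topology entries; a minimal sketch (everything beyond the attribute names added by the patch is illustrative):

#include <stdio.h>
#include <string.h>

/*
 * Walk the CPUs and print the cluster_id / cluster_cpus_list attributes
 * added by this patch, alongside the existing package id, stopping at the
 * first CPU without a topology directory.
 */
static int read_attr(int cpu, const char *attr, char *buf, int len)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/topology/%s", cpu, attr);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, len, f))
		buf[0] = '\0';
	buf[strcspn(buf, "\n")] = '\0';
	fclose(f);
	return 0;
}

int main(void)
{
	char pkg[64], cl_id[64], cl_cpus[256];

	for (int cpu = 0; ; cpu++) {
		if (read_attr(cpu, "physical_package_id", pkg, sizeof(pkg)))
			break;
		if (read_attr(cpu, "cluster_id", cl_id, sizeof(cl_id)) ||
		    read_attr(cpu, "cluster_cpus_list", cl_cpus, sizeof(cl_cpus))) {
			printf("cpu%d: package %s, no cluster information\n",
			       cpu, pkg);
			continue;
		}
		printf("cpu%d: package %s, cluster %s, cluster_cpus %s\n",
		       cpu, pkg, cl_id, cl_cpus);
	}
	return 0;
}

Anything fancier, such as pinning a group of co-operating threads to the CPUs in one cluster_cpus_list, can be layered on top of the parsed cpulists by the consuming library.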