Message ID: 20180228220619.6992-1-jeremy.linton@arm.com
Series: Support PPTT for ARM64
On 03/01/2018 06:06 AM, Sudeep Holla wrote: > Hi Jeremy, > > On 28/02/18 22:06, Jeremy Linton wrote: >> ACPI 6.2 adds the Processor Properties Topology Table (PPTT), which is >> used to describe the processor and cache topology. Ideally it is >> used to extend/override information provided by the hardware, but >> right now ARM64 is entirely dependent on firmware provided tables. >> >> This patch parses the table for the cache topology and CPU topology. >> When we enable ACPI/PPTT for arm64 we map the physical_id to the >> PPTT node flagged as the physical package by the firmware. >> This results in topologies that match what the remainder of the >> system expects. To avoid inverted scheduler domains we then >> set the MC domain equal to the largest cache within the socket >> below the NUMA domain. >> > I remember reviewing and acknowledging most of the cacheinfo stuff with > couple of minor suggestions for v6. I don't see any Acked-by tags in > this series and don't know if I need to review/ack any more cacheinfo > related patches. Hi, Yes, I didn't put them in because I changed the functionality in 2/13 and there is a bug fix in 5/13. I thought you might want to do a quick diff of the git v6->v7 tree. Although given that most of the changes were in response to your comments in v6 I probably should have just put the tags in. Thanks, -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi Jeremy, On Wed, Feb 28, 2018 at 04:06:19PM -0600, Jeremy Linton wrote: > Now that we have an accurate view of the physical topology > we need to represent it correctly to the scheduler. In the > case of NUMA in socket, we need to assure that the sched domain > we build for the MC layer isn't larger than the DIE above it. MC shouldn't be larger than any of the NUMA domains either. > To do this correctly, we should really base that on the cache > topology immediately below the NUMA node (for NUMA in socket) > or below the physical package for normal NUMA configurations. That means we wouldn't support multi-die NUMA nodes? > This patch creates a set of early cache_siblings masks, then > when the scheduler requests the coregroup mask we pick the > smaller of the physical package siblings, or the numa siblings > and locate the largest cache which is an entire subset of > those siblings. If we are unable to find a proper subset of > cores then we retain the original behavior and return the > core_sibling list. IIUC, for numa-in-package it is a strict requirement that there is a cache that span the entire NUMA node? For example, having a NUMA node consisting of two clusters with per-cluster caches only wouldn't be supported? > > Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> > --- > arch/arm64/include/asm/topology.h | 5 +++ > arch/arm64/kernel/topology.c | 64 +++++++++++++++++++++++++++++++++++++++ > 2 files changed, 69 insertions(+) > > diff --git a/arch/arm64/include/asm/topology.h b/arch/arm64/include/asm/topology.h > index 6b10459e6905..08db3e4e44e1 100644 > --- a/arch/arm64/include/asm/topology.h > +++ b/arch/arm64/include/asm/topology.h > @@ -4,12 +4,17 @@ > > #include <linux/cpumask.h> > > +#define MAX_CACHE_CHECKS 4 > + > struct cpu_topology { > int thread_id; > int core_id; > int package_id; > + int cache_id[MAX_CACHE_CHECKS]; > cpumask_t thread_sibling; > cpumask_t core_sibling; > + cpumask_t cache_siblings[MAX_CACHE_CHECKS]; > + int cache_level; > }; > > extern struct cpu_topology cpu_topology[NR_CPUS]; > diff --git a/arch/arm64/kernel/topology.c b/arch/arm64/kernel/topology.c > index bd1aae438a31..1809dc9d347c 100644 > --- a/arch/arm64/kernel/topology.c > +++ b/arch/arm64/kernel/topology.c > @@ -212,8 +212,42 @@ static int __init parse_dt_topology(void) > struct cpu_topology cpu_topology[NR_CPUS]; > EXPORT_SYMBOL_GPL(cpu_topology); > > +static void find_llc_topology_for_cpu(int cpu) Isn't it more find core/node siblings? Or is it a requirement that the last level cache spans exactly one NUMA node? For example, a package level cache isn't allowed for numa-in-package? > +{ > + /* first determine if we are a NUMA in package */ > + const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu)); > + int indx; > + > + if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) { > + /* not numa in package, lets use the package siblings */ > + node_mask = &cpu_topology[cpu].core_sibling; > + } > + > + /* > + * node_mask should represent the smallest package/numa grouping > + * lets search for the largest cache smaller than the node_mask. > + */ > + for (indx = 0; indx < MAX_CACHE_CHECKS; indx++) { > + cpumask_t *cache_sibs = &cpu_topology[cpu].cache_siblings[indx]; > + > + if (cpu_topology[cpu].cache_id[indx] < 0) > + continue; > + > + if (cpumask_subset(cache_sibs, node_mask)) > + cpu_topology[cpu].cache_level = indx; I don't this guarantees that the cache level we found matches exactly the NUMA node. 
Taking the two cluster NUMA node example from above, we would set cache_level to point at the per-cluster cache as it is a subset of the NUMA node but it would only span half of the node. Or am I missing something? > + } > +} > + > const struct cpumask *cpu_coregroup_mask(int cpu) > { > + int *llc = &cpu_topology[cpu].cache_level; > + > + if (*llc == -1) > + find_llc_topology_for_cpu(cpu); > + > + if (*llc != -1) > + return &cpu_topology[cpu].cache_siblings[*llc]; > + > return &cpu_topology[cpu].core_sibling; > } > > @@ -221,6 +255,7 @@ static void update_siblings_masks(unsigned int cpuid) > { > struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; > int cpu; > + int idx; > > /* update core and thread sibling masks */ > for_each_possible_cpu(cpu) { > @@ -229,6 +264,16 @@ static void update_siblings_masks(unsigned int cpuid) > if (cpuid_topo->package_id != cpu_topo->package_id) > continue; > > + for (idx = 0; idx < MAX_CACHE_CHECKS; idx++) { > + cpumask_t *lsib; > + int cput_id = cpuid_topo->cache_id[idx]; > + > + if (cput_id == cpu_topo->cache_id[idx]) { > + lsib = &cpuid_topo->cache_siblings[idx]; > + cpumask_set_cpu(cpu, lsib); > + } Shouldn't the cache_id validity be checked here? I don't think it breaks anything though. Overall, I think this is more or less in line with the MC domain shrinking I just mentioned in the v6 discussion. It is mostly the corner cases and assumption about the system topology I'm not sure about. Morten -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
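To make the distinction being drawn above concrete: cpumask_subset() is also satisfied by a per-cluster cache that covers only part of the NUMA node, whereas an exact-span test would need cpumask_equal(). A minimal sketch of such a stricter check, purely for illustration (not something proposed in the thread; the helper name is invented):

#include <linux/cpumask.h>

static bool cache_level_spans_node(const cpumask_t *cache_sibs,
				   const cpumask_t *node_mask)
{
	/*
	 * cpumask_subset() would also accept e.g. a per-cluster cache that
	 * covers only half the node; requiring equality only accepts a
	 * cache whose siblings span the whole node.
	 */
	return cpumask_equal(cache_sibs, node_mask);
}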
On Wed, Feb 28, 2018 at 04:06:16PM -0600, Jeremy Linton wrote: > Lets match the name of the arm64 topology field > to the kernel macro that uses it. I called it cluster ID in the code because that's what the documentation for MPIDR called it IIRC. Googling around suggests that this naming may be being used by some of the DynamiQ stuff.
On Tue, Feb 27, 2018 at 02:18:47PM -0600, Jeremy Linton wrote: > Hi, > > > First, thanks for taking a look at this. > > On 03/01/2018 09:52 AM, Morten Rasmussen wrote: > >Hi Jeremy, > > > >On Wed, Feb 28, 2018 at 04:06:19PM -0600, Jeremy Linton wrote: > >>Now that we have an accurate view of the physical topology > >>we need to represent it correctly to the scheduler. In the > >>case of NUMA in socket, we need to assure that the sched domain > >>we build for the MC layer isn't larger than the DIE above it. > > > >MC shouldn't be larger than any of the NUMA domains either. > > Right, that is one of the things this patch is assuring.. > > > > >>To do this correctly, we should really base that on the cache > >>topology immediately below the NUMA node (for NUMA in socket) >> or below the physical package for normal NUMA configurations. > > > >That means we wouldn't support multi-die NUMA nodes? > > You mean a bottom level NUMA domain that crosses multiple sockets/dies? That > should work. This patch is picking the widest cache layer below the smallest > of the package or numa grouping. What actually happens depends on the > topology. Given a case where there are multiple dies in a socket, and the > numa domain is at the socket level the MC is going to reflect the caching > topology immediately below the socket. In the case of multiple dies, with a > cache that crosses them in socket, then the MC is basically going to be the > socket, otherwise if the widest cache is per die, or some narrower grouping > (cluster?) then that is what ends up in the MC. (this is easier with some > pictures) That is more or less what I meant. I think I got confused with the role of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level cpumask spans exactly the NUMA node, so IIUC we have three scenarios: 1. Multi-die/socket/physical package NUMA node Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level spans multiple multi-package nodes. The MC mask reflects the last-level cache within the NUMA node which is most likely per-die or per-cluster (inside each die). 2. physical package == NUMA node The top non-NUMA (DIE) mask is the same as the core sibling mask. If there is cache spanning the entire node, the scheduler topology will eliminate a layer (DIE?), so bottom NUMA level would be right on top of MC spanning multiple physical packages. If there is no node-wide last level cache, DIE is preserved and MC matches the span of the last level cache. 3. numa-in-package Top non-NUMA (DIE) mask is not reflecting the actual die, but is reflecting the NUMA node. MC has a span equal to the largest share cache span smaller than or equal to the the NUMA node. If it is equal, DIE level is eliminated, otherwise DIE is preserved, but doesn't really represent die. Bottom non-NUMA level spans multiple in-package NUMA nodes. As you said, multi-die nodes should work. However, I'm not sure if shrinking MC to match a cache could cause us trouble, or if it should just be shrunk to be the smaller of the node mask and core siblings. Unless you have a node-wide last level cache DIE level won't be eliminated in scenario 2 and 3, and could cause trouble. For numa-in-package, you can end up with a DIE level inside the node where the default flags don't favour aggressive spreading of tasks. The same could be the case for per-package nodes (scenario 2). Don't we end up redefining physical package to be last level cache instead of using the PPTT flag for scenario 2 and 3? 
I think DIE level should be eliminated for scenario 2 and 3 like it is for x86. [...] > >>This patch creates a set of early cache_siblings masks, then > >>when the scheduler requests the coregroup mask we pick the > >>smaller of the physical package siblings, or the numa siblings > >>and locate the largest cache which is an entire subset of > >>those siblings. If we are unable to find a proper subset of > >>cores then we retain the original behavior and return the > >>core_sibling list. > > > >IIUC, for numa-in-package it is a strict requirement that there is a > >cache that span the entire NUMA node? For example, having a NUMA node > >consisting of two clusters with per-cluster caches only wouldn't be > >supported? > > Everything is supported, the MC is reflecting the cache topology. We just > use the physical/numa topology to help us pick which layer of cache topology > lands in the MC. (unless of course we fail to find a PPTT/cache topology, in > which case we fallback to the old behavior of the core_siblings which can > reflect the MPIDR/etc). I see. For this example we would end up with a "DIE" level and two MC domains inside each node whether we have the PPTT table and cache topology or not. I'm just wondering if everyone would be happy with basing MC on last level cache instead of the smaller of physical package and NUMA node. > >>+{ > >>+ /* first determine if we are a NUMA in package */ > >>+ const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu)); > >>+ int indx; > >>+ > >>+ if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) { > >>+ /* not numa in package, lets use the package siblings */ > >>+ node_mask = &cpu_topology[cpu].core_sibling; > >>+ } > >>+ > >>+ /* > >>+ * node_mask should represent the smallest package/numa grouping > >>+ * lets search for the largest cache smaller than the node_mask. > >>+ */ > >>+ for (indx = 0; indx < MAX_CACHE_CHECKS; indx++) { > >>+ cpumask_t *cache_sibs = &cpu_topology[cpu].cache_siblings[indx]; > >>+ > >>+ if (cpu_topology[cpu].cache_id[indx] < 0) > >>+ continue; > >>+ > >>+ if (cpumask_subset(cache_sibs, node_mask)) > >>+ cpu_topology[cpu].cache_level = indx; > > > >I don't this guarantees that the cache level we found matches exactly > >the NUMA node. Taking the two cluster NUMA node example from above, we > >would set cache_level to point at the per-cluster cache as it is a > >subset of the NUMA node but it would only span half of the node. Or am I > >missing something? > > I think you got it. If the system is a traditional ARM system with shared > L2's at the cluster level and it doesn't have any L3's/etc and the NUMA node > crosses multiple clusters then you get the cluster L2 grouping in the MC. > > I think this is what we want. Particularly, since the newer/larger machines > do have L3+'s contained within their sockets or numa domains, so you end up > with that as the MC. Okay, thanks for confirming. > > > > > >>+ } > >>+} > >>+ > >> const struct cpumask *cpu_coregroup_mask(int cpu) > >> { > >>+ int *llc = &cpu_topology[cpu].cache_level; > >>+ > >>+ if (*llc == -1) > >>+ find_llc_topology_for_cpu(cpu); > >>+ > >>+ if (*llc != -1) > >>+ return &cpu_topology[cpu].cache_siblings[*llc]; > >>+ > >> return &cpu_topology[cpu].core_sibling; If we don't have any of the cache_sibling masks set up, i.e. we don't have the cache topology, we would keep looking for it every time cpu_coregroup_mask() is called. I'm not sure how extensively it is used, but it could have a performance impact? 
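One way to address the repeated-lookup concern raised just above would be to cache a "nothing found" result so the search runs at most once per CPU. A sketch only, not part of the posted series; the sentinel names are invented and it reuses the patch's cpu_topology fields:

#define CACHE_LEVEL_UNSET	-1	/* not searched yet */
#define CACHE_LEVEL_NONE	-2	/* searched, no suitable cache found */

const struct cpumask *cpu_coregroup_mask(int cpu)
{
	int *llc = &cpu_topology[cpu].cache_level;

	if (*llc == CACHE_LEVEL_UNSET)
		find_llc_topology_for_cpu(cpu);	/* would set CACHE_LEVEL_NONE on failure */

	if (*llc >= 0)
		return &cpu_topology[cpu].cache_siblings[*llc];

	return &cpu_topology[cpu].core_sibling;
}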
> >> } > >>@@ -221,6 +255,7 @@ static void update_siblings_masks(unsigned int cpuid) > >> { > >> struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; > >> int cpu; > >>+ int idx; > >> /* update core and thread sibling masks */ > >> for_each_possible_cpu(cpu) { > >>@@ -229,6 +264,16 @@ static void update_siblings_masks(unsigned int cpuid) > >> if (cpuid_topo->package_id != cpu_topo->package_id) > >> continue; > >>+ for (idx = 0; idx < MAX_CACHE_CHECKS; idx++) { > >>+ cpumask_t *lsib; > >>+ int cput_id = cpuid_topo->cache_id[idx]; > >>+ > >>+ if (cput_id == cpu_topo->cache_id[idx]) { > >>+ lsib = &cpuid_topo->cache_siblings[idx]; > >>+ cpumask_set_cpu(cpu, lsib); > >>+ } > > > >Shouldn't the cache_id validity be checked here? I don't think it breaks > >anything though. > > It could be, but since its explicitly looking for unified caches its likely > that some of the levels are invalid. Invalid levels get ignored later on so > we don't really care if they are valid here. > > > > >Overall, I think this is more or less in line with the MC domain > >shrinking I just mentioned in the v6 discussion. It is mostly the corner > >cases and assumption about the system topology I'm not sure about. > > I think its the corner cases i'm taking care of. The simple fix in v6 is to > take the smaller of core_siblings or node_siblings, but that ignores cases > with split L3s (or the L2 only example above). The idea here is to assure > that MC is following a cache topology. In my mind, it is more a question of > how that is picked. The other way I see to do this, is with a PX domain flag > in the PPTT. We could then pick the core grouping one below that flag. Doing > it that way affords the firmware vendors a lever they can pull to optimize a > given machine for the linux scheduler behavior. Okay. I think these assumptions/choices should be spelled out somewhere, either as comments or in the commit message. As said above, I'm not sure if the simple approach is better or not. Using the cache span to define the MC level with a numa-in-cluster switch like some Intel platform seems to have, you could two core being MC siblings with numa-in-package disabled and them not being siblings with numa-in-package enabled unless you reconfigure the span of the caches too and remember to update the ACPI cache topology. Regarding firmware levers, we don't want vendors to optimize for Linux scheduler behaviour, but a mechanism to detect how closely related cores are could make selecting the right mask for MC level easier. As I see it, we basically have to choose between MC being cache boundary based or physical package based. This patch implements the former, the simple solution (core_siblings mask or node_siblings mask) implements the latter. Morten -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 28/02/18 22:06, Jeremy Linton wrote: > In preparation for the next patch, and to aid in > review of that patch, lets move cache_setup_of_node > further down in the module without any changes. > I don't think this has changed since v6, so my ack stands. Acked-by: Sudeep Holla <sudeep.holla@arm.com> -- Regards, Sudeep
On 28/02/18 22:06, Jeremy Linton wrote: > Rename and change the type of of_node to indicate > it is a generic pointer which is generally only used > for comparison purposes. In a later patch we will put > an ACPI/PPTT token pointer in fw_token so that > the code which builds the shared cpu masks can be reused. > Thanks for renaming :) Acked-by: Sudeep Holla <sudeep.holla@arm.com> -- Regards, Sudeep
On 28/02/18 22:06, Jeremy Linton wrote: > Now that we have a PPTT parser, in preparation for its use > on arm64, lets build it. > Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> -- Regards, Sudeep
On 28/02/18 22:06, Jeremy Linton wrote: > Its helpful to be able to lookup the acpi_processor_id associated > with a logical cpu. Provide an arm64 helper to do this. > This patch on its own is good, but it's quite generic and made me look at it again. Sorry for missing this earlier. Can we use "per_cpu(processors, cpu)->acpi_id" at call sites instead? Or you can make that a generic helper using the above expression? -- Regards, Sudeep
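A generic helper along the lines Sudeep suggests might look roughly like the sketch below. The helper name is invented for illustration, and it assumes the per-cpu 'processors' pointer from the ACPI processor driver is already populated:

#include <acpi/processor.h>

static inline u32 acpi_processor_id_for_cpu(unsigned int cpu)
{
	struct acpi_processor *pr = per_cpu(processors, cpu);

	/* the per-cpu pointer can be NULL before the processor driver binds */
	return pr ? pr->acpi_id : 0;
}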
On 28/02/18 22:06, Jeremy Linton wrote: > The /sys cache entries should support ACPI/PPTT generated cache > topology information. Lets detect ACPI systems and call > an arch specific cache_setup_acpi() routine to update the hardware > probed cache topology. > > For arm64, if ACPI is enabled, determine the max number of cache > levels and populate them using the PPTT table if one is available. > I fail to understand how the kbuild failure report is associated with this patch (most likely the previous patch is the culprit; I am looking at it in detail now), but for this patch Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> -- Regards, Sudeep
On 28/02/18 22:06, Jeremy Linton wrote: > Call ACPI cache parsing routines from base cacheinfo code if ACPI > is enable. Also stub out cache_setup_acpi() so that individual > architectures can enable ACPI topology parsing. > > Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> > --- > drivers/acpi/pptt.c | 1 + > drivers/base/cacheinfo.c | 14 ++++++++++---- > include/linux/cacheinfo.h | 9 +++++++++ > 3 files changed, 20 insertions(+), 4 deletions(-) > > diff --git a/drivers/acpi/pptt.c b/drivers/acpi/pptt.c > index 883e4318c6cd..c98f94ebd272 100644 > --- a/drivers/acpi/pptt.c > +++ b/drivers/acpi/pptt.c > @@ -343,6 +343,7 @@ static void update_cache_properties(struct cacheinfo *this_leaf, > { > int valid_flags = 0; > > + this_leaf->fw_token = cpu_node; Any reason why this can't part of 05/13 ? > if (found_cache->flags & ACPI_PPTT_SIZE_PROPERTY_VALID) { > this_leaf->size = found_cache->size; > valid_flags++; > diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c > index 597aacb233fc..2880e2ab01f5 100644 > --- a/drivers/base/cacheinfo.c > +++ b/drivers/base/cacheinfo.c > @@ -206,7 +206,7 @@ static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf, > struct cacheinfo *sib_leaf) > { > /* > - * For non-DT systems, assume unique level 1 cache, system-wide > + * For non-DT/ACPI systems, assume unique level 1 caches, system-wide > * shared caches for all other levels. This will be used only if > * arch specific code has not populated shared_cpu_map > */ > @@ -214,6 +214,11 @@ static inline bool cache_leaves_are_shared(struct cacheinfo *this_leaf, > } > #endif > > +int __weak cache_setup_acpi(unsigned int cpu) > +{ > + return -ENOTSUPP; > +} > + > static int cache_shared_cpu_map_setup(unsigned int cpu) > { > struct cpu_cacheinfo *this_cpu_ci = get_cpu_cacheinfo(cpu); > @@ -227,8 +232,8 @@ static int cache_shared_cpu_map_setup(unsigned int cpu) > if (of_have_populated_dt()) > ret = cache_setup_of_node(cpu); > else if (!acpi_disabled) > - /* No cache property/hierarchy support yet in ACPI */ > - ret = -ENOTSUPP; > + ret = cache_setup_acpi(cpu); > + > if (ret) > return ret; > > @@ -279,7 +284,8 @@ static void cache_shared_cpu_map_remove(unsigned int cpu) > cpumask_clear_cpu(cpu, &sib_leaf->shared_cpu_map); > cpumask_clear_cpu(sibling, &this_leaf->shared_cpu_map); > } > - of_node_put(this_leaf->fw_token); > + if (of_have_populated_dt()) > + of_node_put(this_leaf->fw_token); > } > } > > diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h > index 0c6f658054d2..1446d3f053a2 100644 > --- a/include/linux/cacheinfo.h > +++ b/include/linux/cacheinfo.h > @@ -97,6 +97,15 @@ int func(unsigned int cpu) \ > struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu); > int init_cache_level(unsigned int cpu); > int populate_cache_leaves(unsigned int cpu); > +int cache_setup_acpi(unsigned int cpu); > +int acpi_find_last_cache_level(unsigned int cpu); > +#ifndef CONFIG_ACPI > +int acpi_find_last_cache_level(unsigned int cpu) The above 3 lines looks weird, can't it be: #ifdef CONFIG_ACPI int acpi_find_last_cache_level(unsigned int cpu); #else int acpi_find_last_cache_level(unsigned int cpu) { /* ACPI kernels should be built with PPTT support */ return 0; } Also I think it should be CONFIG_ACPI_PPTT, otherwise it might cause issue on platforms which define CONFIG_ACPI but CONFIG_ACPI_PPTT is not. I can only relate this to the s390 error reported by kbuild robot. 
-- Regards, Sudeep
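Restated for readability, the header layout Sudeep is suggesting above amounts to the following sketch (with static inline added for the stub, and using the CONFIG_ACPI_PPTT guard he proposes in his closing remark; whether that Kconfig symbol is the right one is exactly the open question):

/* include/linux/cacheinfo.h */
#ifdef CONFIG_ACPI_PPTT
int acpi_find_last_cache_level(unsigned int cpu);
#else
static inline int acpi_find_last_cache_level(unsigned int cpu)
{
	/* ACPI kernels should be built with PPTT support */
	return 0;
}
#endif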
Hi, On 03/06/2018 10:07 AM, Morten Rasmussen wrote: > On Tue, Feb 27, 2018 at 02:18:47PM -0600, Jeremy Linton wrote: >> Hi, >> >> >> First, thanks for taking a look at this. >> >> On 03/01/2018 09:52 AM, Morten Rasmussen wrote: >>> Hi Jeremy, >>> >>> On Wed, Feb 28, 2018 at 04:06:19PM -0600, Jeremy Linton wrote: >>>> Now that we have an accurate view of the physical topology >>>> we need to represent it correctly to the scheduler. In the >>>> case of NUMA in socket, we need to assure that the sched domain >>>> we build for the MC layer isn't larger than the DIE above it. >>> >>> MC shouldn't be larger than any of the NUMA domains either. >> >> Right, that is one of the things this patch is assuring.. >> >>> >>>> To do this correctly, we should really base that on the cache >>>> topology immediately below the NUMA node (for NUMA in socket) >> or below the physical package for normal NUMA configurations. >>> >>> That means we wouldn't support multi-die NUMA nodes? >> >> You mean a bottom level NUMA domain that crosses multiple sockets/dies? That >> should work. This patch is picking the widest cache layer below the smallest >> of the package or numa grouping. What actually happens depends on the >> topology. Given a case where there are multiple dies in a socket, and the >> numa domain is at the socket level the MC is going to reflect the caching >> topology immediately below the socket. In the case of multiple dies, with a >> cache that crosses them in socket, then the MC is basically going to be the >> socket, otherwise if the widest cache is per die, or some narrower grouping >> (cluster?) then that is what ends up in the MC. (this is easier with some >> pictures) > > That is more or less what I meant. I think I got confused with the role > of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level > cpumask spans exactly the NUMA node, so IIUC we have three scenarios: > > 1. Multi-die/socket/physical package NUMA node > Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level > spans multiple multi-package nodes. The MC mask reflects the last-level > cache within the NUMA node which is most likely per-die or per-cluster > (inside each die). > > 2. physical package == NUMA node > The top non-NUMA (DIE) mask is the same as the core sibling mask. > If there is cache spanning the entire node, the scheduler topology > will eliminate a layer (DIE?), so bottom NUMA level would be right on > top of MC spanning multiple physical packages. If there is no > node-wide last level cache, DIE is preserved and MC matches the span > of the last level cache. > > 3. numa-in-package > Top non-NUMA (DIE) mask is not reflecting the actual die, but is > reflecting the NUMA node. MC has a span equal to the largest share > cache span smaller than or equal to the the NUMA node. If it is > equal, DIE level is eliminated, otherwise DIE is preserved, but > doesn't really represent die. Bottom non-NUMA level spans multiple > in-package NUMA nodes. > > As you said, multi-die nodes should work. However, I'm not sure if > shrinking MC to match a cache could cause us trouble, or if it should > just be shrunk to be the smaller of the node mask and core siblings. Shrinking to the smaller of the numa or package is fairly trivial change, I'm good with that change too.. I discounted it because there might be an advantage in case 2 if the internal hardware is actually a case 3 (or just multiple rings/whatever each with a L3). 
In those cases the firmware vendor could play around with whatever representation serves them the best. > Unless you have a node-wide last level cache DIE level won't be > eliminated in scenario 2 and 3, and could cause trouble. For > numa-in-package, you can end up with a DIE level inside the node where > the default flags don't favour aggressive spreading of tasks. The same > could be the case for per-package nodes (scenario 2). > > Don't we end up redefining physical package to be last level cache > instead of using the PPTT flag for scenario 2 and 3? I'm not sure I understand, core_siblings isn't changing (its still per package). Only the MC mapping which normally is just core_siblings. For all intents right now this patch is the same as v6, except for the numa-in-package where the MC domain is being shrunk to the node siblings. I'm just trying to setup the code for potential future cases where the LLC isn't equal to the node or package. > > I think DIE level should be eliminated for scenario 2 and 3 like it is > for x86. Ok, that is based on the assumption that MC will always be equal to either the package or node? If that assumption isn't true, then would you keep it, or maybe it doesn't matter? > > [...] > >>>> This patch creates a set of early cache_siblings masks, then >>>> when the scheduler requests the coregroup mask we pick the >>>> smaller of the physical package siblings, or the numa siblings >>>> and locate the largest cache which is an entire subset of >>>> those siblings. If we are unable to find a proper subset of >>>> cores then we retain the original behavior and return the >>>> core_sibling list. >>> >>> IIUC, for numa-in-package it is a strict requirement that there is a >>> cache that span the entire NUMA node? For example, having a NUMA node >>> consisting of two clusters with per-cluster caches only wouldn't be >>> supported? >> >> Everything is supported, the MC is reflecting the cache topology. We just >> use the physical/numa topology to help us pick which layer of cache topology >> lands in the MC. (unless of course we fail to find a PPTT/cache topology, in >> which case we fallback to the old behavior of the core_siblings which can >> reflect the MPIDR/etc). > > I see. For this example we would end up with a "DIE" level and two MC > domains inside each node whether we have the PPTT table and cache > topology or not. I'm just wondering if everyone would be happy with > basing MC on last level cache instead of the smaller of physical package > and NUMA node. I can't judge that, my idea was simply to provide some flexibility to the firmware. I guess in theory someone who still wanted that split could push a numa domain down to whatever level they wanted to group. > >>>> +{ >>>> + /* first determine if we are a NUMA in package */ >>>> + const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu)); >>>> + int indx; >>>> + >>>> + if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) { >>>> + /* not numa in package, lets use the package siblings */ >>>> + node_mask = &cpu_topology[cpu].core_sibling; >>>> + } >>>> + >>>> + /* >>>> + * node_mask should represent the smallest package/numa grouping >>>> + * lets search for the largest cache smaller than the node_mask. 
>>>> + */ >>>> + for (indx = 0; indx < MAX_CACHE_CHECKS; indx++) { >>>> + cpumask_t *cache_sibs = &cpu_topology[cpu].cache_siblings[indx]; >>>> + >>>> + if (cpu_topology[cpu].cache_id[indx] < 0) >>>> + continue; >>>> + >>>> + if (cpumask_subset(cache_sibs, node_mask)) >>>> + cpu_topology[cpu].cache_level = indx; >>> >>> I don't this guarantees that the cache level we found matches exactly >>> the NUMA node. Taking the two cluster NUMA node example from above, we >>> would set cache_level to point at the per-cluster cache as it is a >>> subset of the NUMA node but it would only span half of the node. Or am I >>> missing something? >> >> I think you got it. If the system is a traditional ARM system with shared >> L2's at the cluster level and it doesn't have any L3's/etc and the NUMA node >> crosses multiple clusters then you get the cluster L2 grouping in the MC. >> >> I think this is what we want. Particularly, since the newer/larger machines >> do have L3+'s contained within their sockets or numa domains, so you end up >> with that as the MC. > > Okay, thanks for confirming. > >> >> >>> >>>> + } >>>> +} >>>> + >>>> const struct cpumask *cpu_coregroup_mask(int cpu) >>>> { >>>> + int *llc = &cpu_topology[cpu].cache_level; >>>> + >>>> + if (*llc == -1) >>>> + find_llc_topology_for_cpu(cpu); >>>> + >>>> + if (*llc != -1) >>>> + return &cpu_topology[cpu].cache_siblings[*llc]; >>>> + >>>> return &cpu_topology[cpu].core_sibling; > > If we don't have any of the cache_sibling masks set up, i.e. we don't > have the cache topology, we would keep looking for it every time > cpu_coregroup_mask() is called. I'm not sure how extensively it is used, > but it could have a performance impact? Its only called when cores come online/offline (AFAIK). > > >>>> } >>>> @@ -221,6 +255,7 @@ static void update_siblings_masks(unsigned int cpuid) >>>> { >>>> struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; >>>> int cpu; >>>> + int idx; >>>> /* update core and thread sibling masks */ >>>> for_each_possible_cpu(cpu) { >>>> @@ -229,6 +264,16 @@ static void update_siblings_masks(unsigned int cpuid) >>>> if (cpuid_topo->package_id != cpu_topo->package_id) >>>> continue; >>>> + for (idx = 0; idx < MAX_CACHE_CHECKS; idx++) { >>>> + cpumask_t *lsib; >>>> + int cput_id = cpuid_topo->cache_id[idx]; >>>> + >>>> + if (cput_id == cpu_topo->cache_id[idx]) { >>>> + lsib = &cpuid_topo->cache_siblings[idx]; >>>> + cpumask_set_cpu(cpu, lsib); >>>> + } >>> >>> Shouldn't the cache_id validity be checked here? I don't think it breaks >>> anything though. >> >> It could be, but since its explicitly looking for unified caches its likely >> that some of the levels are invalid. Invalid levels get ignored later on so >> we don't really care if they are valid here. >> >>> >>> Overall, I think this is more or less in line with the MC domain >>> shrinking I just mentioned in the v6 discussion. It is mostly the corner >>> cases and assumption about the system topology I'm not sure about. >> >> I think its the corner cases i'm taking care of. The simple fix in v6 is to >> take the smaller of core_siblings or node_siblings, but that ignores cases >> with split L3s (or the L2 only example above). The idea here is to assure >> that MC is following a cache topology. In my mind, it is more a question of >> how that is picked. The other way I see to do this, is with a PX domain flag >> in the PPTT. We could then pick the core grouping one below that flag. 
Doing >> it that way affords the firmware vendors a lever they can pull to optimize a >> given machine for the linux scheduler behavior. > > Okay. I think these assumptions/choices should be spelled out somewhere, > either as comments or in the commit message. As said above, I'm not sure > if the simple approach is better or not. > > Using the cache span to define the MC level with a numa-in-cluster > switch like some Intel platform seems to have, you could two core being > MC siblings with numa-in-package disabled and them not being siblings > with numa-in-package enabled unless you reconfigure the span of the > caches too and remember to update the ACPI cache topology. > > Regarding firmware levers, we don't want vendors to optimize for Linux > scheduler behaviour, but a mechanism to detect how closely related cores > are could make selecting the right mask for MC level easier. As I see > it, we basically have to choose between MC being cache boundary based or > physical package based. This patch implements the former, the simple > solution (core_siblings mask or node_siblings mask) implements the > latter. Basically, right now (AFAIK) the result is the same because the few machines I have access to have cache layers immediately below those boundaries which are the same size as the package/die. I'm ok with tossing this patch in favor of something like: const struct cpumask *cpu_coregroup_mask(int cpu) { const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu)); if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) { /* not numa in package, lets use the package siblings */ return &cpu_topology[cpu].core_sibling; } return node_mask; } Mostly, because I want to merge the PPTT parts, and I only added this to clear the NUMA in package borken.... -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
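Jeremy's proposed fallback, quoted just above, is easier to read laid out as it would appear in topology.c (same logic, no changes):

const struct cpumask *cpu_coregroup_mask(int cpu)
{
	const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu));

	if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) {
		/* not numa in package, lets use the package siblings */
		return &cpu_topology[cpu].core_sibling;
	}

	return node_mask;
}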
On Tue, Mar 06, 2018 at 04:22:18PM -0600, Jeremy Linton wrote: > >>>>To do this correctly, we should really base that on the cache > >>>>topology immediately below the NUMA node (for NUMA in socket) >> or below the physical package for normal NUMA configurations. > >>> > >>>That means we wouldn't support multi-die NUMA nodes? > >> > >>You mean a bottom level NUMA domain that crosses multiple sockets/dies? That > >>should work. This patch is picking the widest cache layer below the smallest > >>of the package or numa grouping. What actually happens depends on the > >>topology. Given a case where there are multiple dies in a socket, and the > >>numa domain is at the socket level the MC is going to reflect the caching > >>topology immediately below the socket. In the case of multiple dies, with a > >>cache that crosses them in socket, then the MC is basically going to be the > >>socket, otherwise if the widest cache is per die, or some narrower grouping > >>(cluster?) then that is what ends up in the MC. (this is easier with some > >>pictures) > > > >That is more or less what I meant. I think I got confused with the role > >of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level > >cpumask spans exactly the NUMA node, so IIUC we have three scenarios: > > > >1. Multi-die/socket/physical package NUMA node > > Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level > > spans multiple multi-package nodes. The MC mask reflects the last-level > > cache within the NUMA node which is most likely per-die or per-cluster > > (inside each die). > > > >2. physical package == NUMA node > > The top non-NUMA (DIE) mask is the same as the core sibling mask. > > If there is cache spanning the entire node, the scheduler topology > > will eliminate a layer (DIE?), so bottom NUMA level would be right on > > top of MC spanning multiple physical packages. If there is no > > node-wide last level cache, DIE is preserved and MC matches the span > > of the last level cache. > > > >3. numa-in-package > > Top non-NUMA (DIE) mask is not reflecting the actual die, but is > > reflecting the NUMA node. MC has a span equal to the largest share > > cache span smaller than or equal to the the NUMA node. If it is > > equal, DIE level is eliminated, otherwise DIE is preserved, but > > doesn't really represent die. Bottom non-NUMA level spans multiple > > in-package NUMA nodes. > > > >As you said, multi-die nodes should work. However, I'm not sure if > >shrinking MC to match a cache could cause us trouble, or if it should > >just be shrunk to be the smaller of the node mask and core siblings. > > Shrinking to the smaller of the numa or package is fairly trivial change, > I'm good with that change too.. I discounted it because there might be an > advantage in case 2 if the internal hardware is actually a case 3 (or just > multiple rings/whatever each with a L3). In those cases the firmware vendor > could play around with whatever representation serves them the best. Agreed. Distributed last level caches and interconnect speeds makes it virtually impossible to define MC in a way that works well for everyone based on the topology information we have at hand. > > >Unless you have a node-wide last level cache DIE level won't be > >eliminated in scenario 2 and 3, and could cause trouble. For > >numa-in-package, you can end up with a DIE level inside the node where > >the default flags don't favour aggressive spreading of tasks. The same > >could be the case for per-package nodes (scenario 2). 
> > > >Don't we end up redefining physical package to be last level cache > >instead of using the PPTT flag for scenario 2 and 3? > > I'm not sure I understand, core_siblings isn't changing (its still per > package). Only the MC mapping which normally is just core_siblings. For all > intents right now this patch is the same as v6, except for the > numa-in-package where the MC domain is being shrunk to the node siblings. > I'm just trying to setup the code for potential future cases where the LLC > isn't equal to the node or package. Right, core_siblings remains the same. The scheduler topology just looks a bit odd as we can have core_siblings spanning the full true physical package and have DIE level as a subset of that with an MC level where the MC siblings is a much smaller subset of cpus than core_siblings. IOW, it would lead to having one topology used by the scheduler and another used by the users of topology_core_cpumask() (which is not many I think). Is there a good reason for diverging instead of adjusting the core_sibling mask? On x86 the core_siblings mask is defined by the last level cache span so they don't have this issue. > >I think DIE level should be eliminated for scenario 2 and 3 like it is > >for x86. > > Ok, that is based on the assumption that MC will always be equal to either > the package or node? If that assumption isn't true, then would you keep it, > or maybe it doesn't matter? Yes. IIUC, MC is always equal to package or node on x86. They don't have DIE in their numa-in-package topology as MC is equal to the node. > >>>>+ } > >>>>+} > >>>>+ > >>>> const struct cpumask *cpu_coregroup_mask(int cpu) > >>>> { > >>>>+ int *llc = &cpu_topology[cpu].cache_level; > >>>>+ > >>>>+ if (*llc == -1) > >>>>+ find_llc_topology_for_cpu(cpu); > >>>>+ > >>>>+ if (*llc != -1) > >>>>+ return &cpu_topology[cpu].cache_siblings[*llc]; > >>>>+ > >>>> return &cpu_topology[cpu].core_sibling; > > > >If we don't have any of the cache_sibling masks set up, i.e. we don't > >have the cache topology, we would keep looking for it every time > >cpu_coregroup_mask() is called. I'm not sure how extensively it is used, > >but it could have a performance impact? > > Its only called when cores come online/offline (AFAIK). Yes, it seems to only be used for sched_domain building. That can happen as part of creating/modifying cpusets as well, but I guess the overhead is less critical for all these case. > > > > > > >>>> } > >>>>@@ -221,6 +255,7 @@ static void update_siblings_masks(unsigned int cpuid) > >>>> { > >>>> struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; > >>>> int cpu; > >>>>+ int idx; > >>>> /* update core and thread sibling masks */ > >>>> for_each_possible_cpu(cpu) { > >>>>@@ -229,6 +264,16 @@ static void update_siblings_masks(unsigned int cpuid) > >>>> if (cpuid_topo->package_id != cpu_topo->package_id) > >>>> continue; > >>>>+ for (idx = 0; idx < MAX_CACHE_CHECKS; idx++) { > >>>>+ cpumask_t *lsib; > >>>>+ int cput_id = cpuid_topo->cache_id[idx]; > >>>>+ > >>>>+ if (cput_id == cpu_topo->cache_id[idx]) { > >>>>+ lsib = &cpuid_topo->cache_siblings[idx]; > >>>>+ cpumask_set_cpu(cpu, lsib); > >>>>+ } > >>> > >>>Shouldn't the cache_id validity be checked here? I don't think it breaks > >>>anything though. > >> > >>It could be, but since its explicitly looking for unified caches its likely > >>that some of the levels are invalid. Invalid levels get ignored later on so > >>we don't really care if they are valid here. 
> >> > >>> > >>>Overall, I think this is more or less in line with the MC domain > >>>shrinking I just mentioned in the v6 discussion. It is mostly the corner > >>>cases and assumption about the system topology I'm not sure about. > >> > >>I think its the corner cases i'm taking care of. The simple fix in v6 is to > >>take the smaller of core_siblings or node_siblings, but that ignores cases > >>with split L3s (or the L2 only example above). The idea here is to assure > >>that MC is following a cache topology. In my mind, it is more a question of > >>how that is picked. The other way I see to do this, is with a PX domain flag > >>in the PPTT. We could then pick the core grouping one below that flag. Doing > >>it that way affords the firmware vendors a lever they can pull to optimize a > >>given machine for the linux scheduler behavior. > > > >Okay. I think these assumptions/choices should be spelled out somewhere, > >either as comments or in the commit message. As said above, I'm not sure > >if the simple approach is better or not. > > > >Using the cache span to define the MC level with a numa-in-cluster > >switch like some Intel platform seems to have, you could two core being > >MC siblings with numa-in-package disabled and them not being siblings > >with numa-in-package enabled unless you reconfigure the span of the > >caches too and remember to update the ACPI cache topology. > > > >Regarding firmware levers, we don't want vendors to optimize for Linux > >scheduler behaviour, but a mechanism to detect how closely related cores > >are could make selecting the right mask for MC level easier. As I see > >it, we basically have to choose between MC being cache boundary based or > >physical package based. This patch implements the former, the simple > >solution (core_siblings mask or node_siblings mask) implements the > >latter. > > Basically, right now (AFAIK) the result is the same because the few machines > I have access to have cache layers immediately below those boundaries which > are the same size as the package/die. Agreed, I'm more worried about what vendors will built in the future. > I'm ok with tossing this patch in favor of something like: > > const struct cpumask *cpu_coregroup_mask(int cpu) > { > const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu)); > if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) > { > /* not numa in package, lets use the package siblings */ > return &cpu_topology[cpu].core_sibling; > } > return node_mask; > } I would prefer this simpler solution as it should eliminate DIE level for all numa-in-package configurations. Although, I think we should consider just shrinking the core_sibling mask instead of having a difference MC mask (cpu_coregroup_mask). Do you see any problems in doing that? > Mostly, because I want to merge the PPTT parts, and I only added this to > clear the NUMA in package borken.... Understood. Whatever choice we make now to fix it will be with us potentially forever. So unless we have really good reason to do things differently, I would prefer to follow what other architectures do. I think the simple solution is closest to what x86 does. Morten -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On 03/07/2018 07:06 AM, Morten Rasmussen wrote: > On Tue, Mar 06, 2018 at 04:22:18PM -0600, Jeremy Linton wrote: >>>>>> To do this correctly, we should really base that on the cache >>>>>> topology immediately below the NUMA node (for NUMA in socket) >> or below the physical package for normal NUMA configurations. >>>>> >>>>> That means we wouldn't support multi-die NUMA nodes? >>>> >>>> You mean a bottom level NUMA domain that crosses multiple sockets/dies? That >>>> should work. This patch is picking the widest cache layer below the smallest >>>> of the package or numa grouping. What actually happens depends on the >>>> topology. Given a case where there are multiple dies in a socket, and the >>>> numa domain is at the socket level the MC is going to reflect the caching >>>> topology immediately below the socket. In the case of multiple dies, with a >>>> cache that crosses them in socket, then the MC is basically going to be the >>>> socket, otherwise if the widest cache is per die, or some narrower grouping >>>> (cluster?) then that is what ends up in the MC. (this is easier with some >>>> pictures) >>> >>> That is more or less what I meant. I think I got confused with the role >>> of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level >>> cpumask spans exactly the NUMA node, so IIUC we have three scenarios: >>> >>> 1. Multi-die/socket/physical package NUMA node >>> Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level >>> spans multiple multi-package nodes. The MC mask reflects the last-level >>> cache within the NUMA node which is most likely per-die or per-cluster >>> (inside each die). >>> >>> 2. physical package == NUMA node >>> The top non-NUMA (DIE) mask is the same as the core sibling mask. >>> If there is cache spanning the entire node, the scheduler topology >>> will eliminate a layer (DIE?), so bottom NUMA level would be right on >>> top of MC spanning multiple physical packages. If there is no >>> node-wide last level cache, DIE is preserved and MC matches the span >>> of the last level cache. >>> >>> 3. numa-in-package >>> Top non-NUMA (DIE) mask is not reflecting the actual die, but is >>> reflecting the NUMA node. MC has a span equal to the largest share >>> cache span smaller than or equal to the the NUMA node. If it is >>> equal, DIE level is eliminated, otherwise DIE is preserved, but >>> doesn't really represent die. Bottom non-NUMA level spans multiple >>> in-package NUMA nodes. >>> >>> As you said, multi-die nodes should work. However, I'm not sure if >>> shrinking MC to match a cache could cause us trouble, or if it should >>> just be shrunk to be the smaller of the node mask and core siblings. >> >> Shrinking to the smaller of the numa or package is fairly trivial change, >> I'm good with that change too.. I discounted it because there might be an >> advantage in case 2 if the internal hardware is actually a case 3 (or just >> multiple rings/whatever each with a L3). In those cases the firmware vendor >> could play around with whatever representation serves them the best. > > Agreed. Distributed last level caches and interconnect speeds makes it > virtually impossible to define MC in a way that works well for everyone > based on the topology information we have at hand. > >> >>> Unless you have a node-wide last level cache DIE level won't be >>> eliminated in scenario 2 and 3, and could cause trouble. 
For >>> numa-in-package, you can end up with a DIE level inside the node where >>> the default flags don't favour aggressive spreading of tasks. The same >>> could be the case for per-package nodes (scenario 2). >>> >>> Don't we end up redefining physical package to be last level cache >>> instead of using the PPTT flag for scenario 2 and 3? >> >> I'm not sure I understand, core_siblings isn't changing (its still per >> package). Only the MC mapping which normally is just core_siblings. For all >> intents right now this patch is the same as v6, except for the >> numa-in-package where the MC domain is being shrunk to the node siblings. >> I'm just trying to setup the code for potential future cases where the LLC >> isn't equal to the node or package. > > Right, core_siblings remains the same. The scheduler topology just looks > a bit odd as we can have core_siblings spanning the full true physical > package and have DIE level as a subset of that with an MC level where > the MC siblings is a much smaller subset of cpus than core_siblings. > > IOW, it would lead to having one topology used by the scheduler and > another used by the users of topology_core_cpumask() (which is not > many I think). > > Is there a good reason for diverging instead of adjusting the > core_sibling mask? On x86 the core_siblings mask is defined by the last > level cache span so they don't have this issue. I'm overwhelmingly convinced we are doing the right thing WRT the core siblings at the moment. Its exported to user space, and the general understanding is that its a socket. Even with numa in package/on die if you run lscpu, lstopo, etc... They all understand the system topology correctly doing it this way (AFAIK). > > Yes. IIUC, MC is always equal to package or node on x86. They don't have > DIE in their numa-in-package topology as MC is equal to the node. > >>>>>> + } >>>>>> +} >>>>>> + >>>>>> const struct cpumask *cpu_coregroup_mask(int cpu) >>>>>> { >>>>>> + int *llc = &cpu_topology[cpu].cache_level; >>>>>> + >>>>>> + if (*llc == -1) >>>>>> + find_llc_topology_for_cpu(cpu); >>>>>> + >>>>>> + if (*llc != -1) >>>>>> + return &cpu_topology[cpu].cache_siblings[*llc]; >>>>>> + >>>>>> return &cpu_topology[cpu].core_sibling; >>> >>> If we don't have any of the cache_sibling masks set up, i.e. we don't >>> have the cache topology, we would keep looking for it every time >>> cpu_coregroup_mask() is called. I'm not sure how extensively it is used, >>> but it could have a performance impact? >> >> Its only called when cores come online/offline (AFAIK). > > Yes, it seems to only be used for sched_domain building. That can happen > as part of creating/modifying cpusets as well, but I guess the overhead > is less critical for all these case. 
> >> >>> >>> >>>>>> } >>>>>> @@ -221,6 +255,7 @@ static void update_siblings_masks(unsigned int cpuid) >>>>>> { >>>>>> struct cpu_topology *cpu_topo, *cpuid_topo = &cpu_topology[cpuid]; >>>>>> int cpu; >>>>>> + int idx; >>>>>> /* update core and thread sibling masks */ >>>>>> for_each_possible_cpu(cpu) { >>>>>> @@ -229,6 +264,16 @@ static void update_siblings_masks(unsigned int cpuid) >>>>>> if (cpuid_topo->package_id != cpu_topo->package_id) >>>>>> continue; >>>>>> + for (idx = 0; idx < MAX_CACHE_CHECKS; idx++) { >>>>>> + cpumask_t *lsib; >>>>>> + int cput_id = cpuid_topo->cache_id[idx]; >>>>>> + >>>>>> + if (cput_id == cpu_topo->cache_id[idx]) { >>>>>> + lsib = &cpuid_topo->cache_siblings[idx]; >>>>>> + cpumask_set_cpu(cpu, lsib); >>>>>> + } >>>>> >>>>> Shouldn't the cache_id validity be checked here? I don't think it breaks >>>>> anything though. >>>> >>>> It could be, but since its explicitly looking for unified caches its likely >>>> that some of the levels are invalid. Invalid levels get ignored later on so >>>> we don't really care if they are valid here. >>>> >>>>> >>>>> Overall, I think this is more or less in line with the MC domain >>>>> shrinking I just mentioned in the v6 discussion. It is mostly the corner >>>>> cases and assumption about the system topology I'm not sure about. >>>> >>>> I think its the corner cases i'm taking care of. The simple fix in v6 is to >>>> take the smaller of core_siblings or node_siblings, but that ignores cases >>>> with split L3s (or the L2 only example above). The idea here is to assure >>>> that MC is following a cache topology. In my mind, it is more a question of >>>> how that is picked. The other way I see to do this, is with a PX domain flag >>>> in the PPTT. We could then pick the core grouping one below that flag. Doing >>>> it that way affords the firmware vendors a lever they can pull to optimize a >>>> given machine for the linux scheduler behavior. >>> >>> Okay. I think these assumptions/choices should be spelled out somewhere, >>> either as comments or in the commit message. As said above, I'm not sure >>> if the simple approach is better or not. >>> >>> Using the cache span to define the MC level with a numa-in-cluster >>> switch like some Intel platform seems to have, you could two core being >>> MC siblings with numa-in-package disabled and them not being siblings >>> with numa-in-package enabled unless you reconfigure the span of the >>> caches too and remember to update the ACPI cache topology. >>> >>> Regarding firmware levers, we don't want vendors to optimize for Linux >>> scheduler behaviour, but a mechanism to detect how closely related cores >>> are could make selecting the right mask for MC level easier. As I see >>> it, we basically have to choose between MC being cache boundary based or >>> physical package based. This patch implements the former, the simple >>> solution (core_siblings mask or node_siblings mask) implements the >>> latter. >> >> Basically, right now (AFAIK) the result is the same because the few machines >> I have access to have cache layers immediately below those boundaries which >> are the same size as the package/die. > > Agreed, I'm more worried about what vendors will built in the future. Yes, and that is why I posted this rather than the code below, because I was leaving vendors a way to compensate for less regular machines. 
> >> I'm ok with tossing this patch in favor of something like: >> >> const struct cpumask *cpu_coregroup_mask(int cpu) >> { >> const cpumask_t *node_mask = cpumask_of_node(cpu_to_node(cpu)); >> if (!cpumask_subset(node_mask, &cpu_topology[cpu].core_sibling)) >> { >> /* not numa in package, lets use the package siblings */ >> return &cpu_topology[cpu].core_sibling; >> } >> return node_mask; >> } > > I would prefer this simpler solution as it should eliminate DIE level > for all numa-in-package configurations. Although, I think we should consider > just shrinking the core_sibling mask instead of having a difference MC > mask (cpu_coregroup_mask). Do you see any problems in doing that?I'm My strongest opinion is leaning toward core_siblings being correct as it stands. How the scheduler deals with that is less clear. I will toss the above as a separate patch, and we can forget this one. I see dropping DIE as a larger patch set defining an arch specific scheduler topology and tweaking the individual scheduler level/flags/tuning. OTOH, unless there is something particularly creative there, I don't see how to avoid NUMA domains pushing deeper into the cache/system topology. Which means filling the MC layer (and possible others) similarly to the above snippit. > >> Mostly, because I want to merge the PPTT parts, and I only added this to >> clear the NUMA in package borken.... > > Understood. Whatever choice we make now to fix it will be with us > potentially forever. So unless we have really good reason to do things > differently, I would prefer to follow what other architectures do. I > think the simple solution is closest to what x86 does. Sure, that sounds familiar... ;) -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
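The "arch specific scheduler topology" Jeremy mentions would presumably be built on the existing set_sched_topology() hook. A minimal sketch, under the assumption that arm64 registers its own table (the levels and flags shown are purely illustrative, not a proposal from the thread):

#include <linux/topology.h>
#include <linux/sched/topology.h>

static struct sched_domain_topology_level arm64_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	{ NULL, },
};

static void __init arm64_sched_topology_init(void)
{
	/* replaces the default table; NUMA levels are still built on top */
	set_sched_topology(arm64_topology);
}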
On 27 February 2018 at 18:49, Jeremy Linton <jeremy.linton@arm.com> wrote: > On 03/01/2018 06:06 AM, Sudeep Holla wrote: >> >> Hi Jeremy, >> >> On 28/02/18 22:06, Jeremy Linton wrote: >>> >>> ACPI 6.2 adds the Processor Properties Topology Table (PPTT), which is >>> used to describe the processor and cache topology. Ideally it is >>> used to extend/override information provided by the hardware, but >>> right now ARM64 is entirely dependent on firmware provided tables. >>> >>> This patch parses the table for the cache topology and CPU topology. >>> When we enable ACPI/PPTT for arm64 we map the physical_id to the >>> PPTT node flagged as the physical package by the firmware. >>> This results in topologies that match what the remainder of the >>> system expects. To avoid inverted scheduler domains we then >>> set the MC domain equal to the largest cache within the socket >>> below the NUMA domain. >>> >> I remember reviewing and acknowledging most of the cacheinfo stuff with >> couple of minor suggestions for v6. I don't see any Acked-by tags in >> this series and don't know if I need to review/ack any more cacheinfo >> related patches. > > > Hi, > > Yes, I didn't put them in because I changed the functionality in 2/13 and > there is a bug fix in 5/13. I thought you might want to do a quick diff of > the git v6->v7 tree. > > Although given that most of the changes were in response to your comments in > v6 I probably should have just put the tags in. > I get sane output from lstopo when applying these patches and booting my Socionext SynQuacer in ACPI mode: $ lstopo-no-graphics Machine (31GB) Package L#0 + L3 L#0 (4096KB) L2 L#0 (256KB) L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0) L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1) L2 L#1 (256KB) L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2) L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3) L2 L#2 (256KB) L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4) L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5) L2 L#3 (256KB) L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6) L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7) L2 L#4 (256KB) L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8) L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9) L2 L#5 (256KB) L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10) L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11) L2 L#6 (256KB) L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12) L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13) L2 L#7 (256KB) L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14) L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15) L2 L#8 (256KB) L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16) L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17) L2 L#9 (256KB) L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18) L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19) L2 L#10 (256KB) L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20) L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21) L2 L#11 (256KB) L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22) L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23) HostBridge L#0 PCIBridge PCIBridge PCI 1b21:0612 Block(Disk) L#0 "sda" HostBridge L#3 PCI 10de:128b GPU L#1 "renderD128" GPU L#2 "card0" GPU L#3 "controlD64" So Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> *However*, while hacking on the firmware that 
exposes the table, I noticed that a malformed structure (incorrect size) can get the parser in an infinite loop, hanging the boot after [ 8.244281] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled [ 8.251780] Serial: AMBA driver [ 8.255042] msm_serial: driver initialized [ 8.259752] ACPI PPTT: Cache Setup ACPI cpu 0 [ 8.264121] ACPI PPTT: Looking for data cache [ 8.268484] ACPI PPTT: Looking for CPU 0's level 1 cache type 0 so I guess the parsing code could be made a bit more robust? -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Feb 28, 2018 at 04:06:13PM -0600, Jeremy Linton wrote: [...] > diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h > index 0c6f658054d2..1446d3f053a2 100644 > --- a/include/linux/cacheinfo.h > +++ b/include/linux/cacheinfo.h > @@ -97,6 +97,15 @@ int func(unsigned int cpu) \ > struct cpu_cacheinfo *get_cpu_cacheinfo(unsigned int cpu); > int init_cache_level(unsigned int cpu); > int populate_cache_leaves(unsigned int cpu); > +int cache_setup_acpi(unsigned int cpu); > +int acpi_find_last_cache_level(unsigned int cpu); > +#ifndef CONFIG_ACPI > +int acpi_find_last_cache_level(unsigned int cpu) This has got to be a static inline function declaration (see the kbuild bot report). Lorenzo > +{ > + /* ACPI kernels should be built with PPTT support */ > + return 0; > +} > +#endif > > const struct attribute_group *cache_get_priv_group(struct cacheinfo *this_leaf); > > -- > 2.13.6 > -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
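For reference, the fallback Lorenzo is pointing at would need to look roughly like the sketch below. The exact #ifdef structure in cacheinfo.h is an assumption here; the important part is the static inline on the !CONFIG_ACPI stub, so the header can be included from multiple files without generating duplicate definitions:

#ifdef CONFIG_ACPI
int acpi_find_last_cache_level(unsigned int cpu);
#else
static inline int acpi_find_last_cache_level(unsigned int cpu)
{
	/* ACPI kernels should be built with PPTT support */
	return 0;
}
#endif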
Hi, First thanks for testing this!! On 03/08/2018 09:59 AM, Ard Biesheuvel wrote: > On 27 February 2018 at 18:49, Jeremy Linton <jeremy.linton@arm.com> wrote: >> On 03/01/2018 06:06 AM, Sudeep Holla wrote: >>> >>> Hi Jeremy, >>> >>> On 28/02/18 22:06, Jeremy Linton wrote: >>>> >>>> ACPI 6.2 adds the Processor Properties Topology Table (PPTT), which is >>>> used to describe the processor and cache topology. Ideally it is >>>> used to extend/override information provided by the hardware, but >>>> right now ARM64 is entirely dependent on firmware provided tables. >>>> >>>> This patch parses the table for the cache topology and CPU topology. >>>> When we enable ACPI/PPTT for arm64 we map the physical_id to the >>>> PPTT node flagged as the physical package by the firmware. >>>> This results in topologies that match what the remainder of the >>>> system expects. To avoid inverted scheduler domains we then >>>> set the MC domain equal to the largest cache within the socket >>>> below the NUMA domain. >>>> >>> I remember reviewing and acknowledging most of the cacheinfo stuff with >>> couple of minor suggestions for v6. I don't see any Acked-by tags in >>> this series and don't know if I need to review/ack any more cacheinfo >>> related patches. >> >> >> Hi, >> >> Yes, I didn't put them in because I changed the functionality in 2/13 and >> there is a bug fix in 5/13. I thought you might want to do a quick diff of >> the git v6->v7 tree. >> >> Although given that most of the changes were in response to your comments in >> v6 I probably should have just put the tags in. >> > > I get sane output from lstopo when applying these patches and booting > my Socionext SynQuacer in ACPI mode: > > $ lstopo-no-graphics > Machine (31GB) > Package L#0 + L3 L#0 (4096KB) > L2 L#0 (256KB) > L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0) > L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1) > L2 L#1 (256KB) > L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2) > L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3) > L2 L#2 (256KB) > L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4) > L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5) > L2 L#3 (256KB) > L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6) > L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7) > L2 L#4 (256KB) > L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8) > L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9) > L2 L#5 (256KB) > L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10) > L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11) > L2 L#6 (256KB) > L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12) > L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13) > L2 L#7 (256KB) > L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14) > L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15) > L2 L#8 (256KB) > L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16) > L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17) > L2 L#9 (256KB) > L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18) > L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19) > L2 L#10 (256KB) > L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20) > L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21) > L2 L#11 (256KB) > L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22) > L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23) > HostBridge L#0 > PCIBridge > PCIBridge > PCI 1b21:0612 > 
Block(Disk) L#0 "sda" > HostBridge L#3 > PCI 10de:128b > GPU L#1 "renderD128" > GPU L#2 "card0" > GPU L#3 "controlD64" > > So > > Tested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> > > *However*, while hacking on the firmware that exposes the table, I > noticed that a malformed structure (incorrect size) can get the parser > in an infinite loop, hanging the boot after > > [ 8.244281] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled > [ 8.251780] Serial: AMBA driver > [ 8.255042] msm_serial: driver initialized > [ 8.259752] ACPI PPTT: Cache Setup ACPI cpu 0 > [ 8.264121] ACPI PPTT: Looking for data cache > [ 8.268484] ACPI PPTT: Looking for CPU 0's level 1 cache type 0 > > so I guess the parsing code could be made a bit more robust? > I've been wondering how long it would take for someone to complain about one of these cases. I added a check in find_processor_node back a few revisions ago to deal with zero-length entries causing infinite loops, but the leaf node check doesn't have one, and that is likely what you're hitting. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
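A zero-length guard similar to the one below would presumably cover the leaf-node walk as well. This is only a sketch; pptt_walk_subtables and its arguments are illustrative names, not the actual helpers in pptt.c:

/*
 * Sketch only. The point is the length check: without it a malformed
 * (zero-length) entry stops 'offset' from advancing and the walk never
 * terminates, which matches the hang described above.
 */
static struct acpi_subtable_header *pptt_walk_subtables(struct acpi_table_header *table,
							 unsigned long offset,
							 u8 wanted_type)
{
	struct acpi_subtable_header *entry;

	while (offset + sizeof(struct acpi_subtable_header) <= table->length) {
		entry = (struct acpi_subtable_header *)((u8 *)table + offset);

		if (entry->length == 0) {
			pr_warn("Invalid zero-length subtable\n");
			return NULL;
		}

		if (entry->type == wanted_type)
			return entry;

		offset += entry->length;
	}

	return NULL;
}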
Hi Jeremy, > -----Original Message----- > From: Jeremy Linton <jeremy.linton@arm.com> > Sent: Thursday, March 1, 2018 3:36 AM > To: linux-acpi@vger.kernel.org > Cc: linux-arm-kernel@lists.infradead.org; sudeep.holla@arm.com; > lorenzo.pieralisi@arm.com; hanjun.guo@linaro.org; rjw@rjwysocki.net; > will.deacon@arm.com; catalin.marinas@arm.com; > gregkh@linuxfoundation.org; mark.rutland@arm.com; linux- > kernel@vger.kernel.org; linux-riscv@lists.infradead.org; > wangxiongfeng2@huawei.com; vkilari@codeaurora.org; ahs3@redhat.com; > dietmar.eggemann@arm.com; morten.rasmussen@arm.com; > palmer@sifive.com; lenb@kernel.org; john.garry@huawei.com; > austinwc@codeaurora.org; tnowicki@caviumnetworks.com; Jeremy Linton > <jeremy.linton@arm.com> > Subject: [PATCH v7 00/13] Support PPTT for ARM64 > > ACPI 6.2 adds the Processor Properties Topology Table (PPTT), which is used to > describe the processor and cache topology. Ideally it is used to extend/override > information provided by the hardware, but right now ARM64 is entirely > dependent on firmware provided tables. > > This patch parses the table for the cache topology and CPU topology. > When we enable ACPI/PPTT for arm64 we map the physical_id to the PPTT > node flagged as the physical package by the firmware. > This results in topologies that match what the remainder of the system expects. > To avoid inverted scheduler domains we then set the MC domain equal to the > largest cache within the socket below the NUMA domain. > > For example on juno: > [root@mammon-juno-rh topology]# lstopo-no-graphics > Package L#0 > L2 L#0 (1024KB) > L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0) > L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1) > L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2) > L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3) > L2 L#1 (2048KB) > L1d L#4 (32KB) + L1i L#4 (48KB) + Core L#4 + PU L#4 (P#4) > L1d L#5 (32KB) + L1i L#5 (48KB) + Core L#5 + PU L#5 (P#5) > HostBridge L#0 > PCIBridge > PCIBridge > PCIBridge > PCI 1095:3132 > Block(Disk) L#0 "sda" > PCIBridge > PCI 1002:68f9 > GPU L#1 "renderD128" > GPU L#2 "card0" > GPU L#3 "controlD64" > PCIBridge > PCI 11ab:4380 > Net L#4 "enp8s0" > > Git tree at: > http://linux-arm.org/git?p=linux-jlinton.git > branch: pptt_v7 Tested this series and looks good Tested by: Vijaya Kumar K <vkilari@codeaurora.org> > > v6->v7: > Add additional patch to use the last cache level within the NUMA > or socket as the MC domain. This assures the MC domain is > equal or smaller than the DIE. > > Various formatting/etc review comments. > > Rebase to 4.16rc2 > > v5->v6: > Add additional patches which re-factor how the initial DT code sets > up the cacheinfo structure so that its not as dependent on the > of_node stored in that tree. Once that is done we rename it > for use with the ACPI code. > > Additionally there were a fair number of minor name/location/etc > tweaks scattered about made in response to review comments. > > v4->v5: > Update the cache type from NOCACHE to UNIFIED when all the cache > attributes we update are valid. This fixes a problem where caches > which are entirely created by the PPTT don't show up in lstopo. > > Give the PPTT its own firmware_node in the cache structure instead of > sharing it with the of_node. > > Move some pieces around between patches. 
> > (see previous cover letters for futher changes) > > Jeremy Linton (13): > drivers: base: cacheinfo: move cache_setup_of_node() > drivers: base: cacheinfo: setup DT cache properties early > cacheinfo: rename of_node to fw_token > arm64/acpi: Create arch specific cpu to acpi id helper > ACPI/PPTT: Add Processor Properties Topology Table parsing > ACPI: Enable PPTT support on ARM64 > drivers: base cacheinfo: Add support for ACPI based firmware tables > arm64: Add support for ACPI based firmware tables > ACPI/PPTT: Add topology parsing code > arm64: topology: rename cluster_id > arm64: topology: enable ACPI/PPTT based CPU topology > ACPI: Add PPTT to injectable table list > arm64: topology: divorce MC scheduling domain from core_siblings > > arch/arm64/Kconfig | 1 + > arch/arm64/include/asm/acpi.h | 4 + > arch/arm64/include/asm/topology.h | 9 +- > arch/arm64/kernel/cacheinfo.c | 15 +- > arch/arm64/kernel/topology.c | 132 +++++++- > arch/riscv/kernel/cacheinfo.c | 1 - > drivers/acpi/Kconfig | 3 + > drivers/acpi/Makefile | 1 + > drivers/acpi/pptt.c | 642 > ++++++++++++++++++++++++++++++++++++++ > drivers/acpi/tables.c | 2 +- > drivers/base/cacheinfo.c | 157 +++++----- > include/linux/acpi.h | 4 + > include/linux/cacheinfo.h | 17 +- > 13 files changed, 882 insertions(+), 106 deletions(-) create mode 100644 > drivers/acpi/pptt.c > > -- > 2.13.6 -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Mar 08, 2018 at 09:41:17PM +0100, Brice Goglin wrote: > > > Is there a good reason for diverging instead of adjusting the > > core_sibling mask? On x86 the core_siblings mask is defined by the last > > level cache span so they don't have this issue. > > No. core_siblings is defined as the list of cores that have the same > physical_package_id (see the doc of sysfs topology files), and LLC can > be smaller than that. > Example with E5v3 with cluster-on-die (two L3 per package, core_siblings > is twice larger than L3 cpumap): > https://www.open-mpi.org/projects/hwloc/lstopo/images/2XeonE5v3.v1.11.png > On AMD EPYC, you even have up to 8 LLC per package. Right, I missed the fact that x86 reports a different cpumask for topology_core_cpumask() which defines the core_siblings exported through sysfs than the mask used to define MC level in the scheduler topology. The sysfs core_siblings is defined by the package_id, while the MC level is defined by the LLC. Thanks for pointing this out. On arm64 MC level and sysfs core_siblings are currently defined using the same mask, but we can't break sysfs, so using different masks is the only option. Morten -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 07, 2018 at 10:19:50AM -0600, Jeremy Linton wrote: > Hi, > > On 03/07/2018 07:06 AM, Morten Rasmussen wrote: > >On Tue, Mar 06, 2018 at 04:22:18PM -0600, Jeremy Linton wrote: > >>>>>>To do this correctly, we should really base that on the cache > >>>>>>topology immediately below the NUMA node (for NUMA in socket) >> or below the physical package for normal NUMA configurations. > >>>>> > >>>>>That means we wouldn't support multi-die NUMA nodes? > >>>> > >>>>You mean a bottom level NUMA domain that crosses multiple sockets/dies? That > >>>>should work. This patch is picking the widest cache layer below the smallest > >>>>of the package or numa grouping. What actually happens depends on the > >>>>topology. Given a case where there are multiple dies in a socket, and the > >>>>numa domain is at the socket level, the MC is going to reflect the caching > >>>>topology immediately below the socket. In the case of multiple dies, with a > >>>>cache that crosses them in socket, then the MC is basically going to be the > >>>>socket, otherwise if the widest cache is per die, or some narrower grouping > >>>>(cluster?) then that is what ends up in the MC. (this is easier with some > >>>>pictures) > >>> > >>>That is more or less what I meant. I think I got confused with the role > >>>of "DIE" level, i.e. that top non-NUMA level, in this. The DIE level > >>>cpumask spans exactly the NUMA node, so IIUC we have three scenarios: > >>> > >>>1. Multi-die/socket/physical package NUMA node > >>> Top non-NUMA level (DIE) spans multiple packages. Bottom NUMA level > >>> spans multiple multi-package nodes. The MC mask reflects the last-level > >>> cache within the NUMA node which is most likely per-die or per-cluster > >>> (inside each die). > >>> > >>>2. physical package == NUMA node > >>> The top non-NUMA (DIE) mask is the same as the core sibling mask. > >>> If there is cache spanning the entire node, the scheduler topology > >>> will eliminate a layer (DIE?), so bottom NUMA level would be right on > >>> top of MC spanning multiple physical packages. If there is no > >>> node-wide last level cache, DIE is preserved and MC matches the span > >>> of the last level cache. > >>> > >>>3. numa-in-package > >>> Top non-NUMA (DIE) mask is not reflecting the actual die, but is > >>> reflecting the NUMA node. MC has a span equal to the largest shared > >>> cache span smaller than or equal to the NUMA node. If it is > >>> equal, DIE level is eliminated, otherwise DIE is preserved, but > >>> doesn't really represent a die. Bottom non-NUMA level spans multiple > >>> in-package NUMA nodes. > >>> > >>>As you said, multi-die nodes should work. However, I'm not sure if > >>>shrinking MC to match a cache could cause us trouble, or if it should > >>>just be shrunk to be the smaller of the node mask and core siblings. > >> > >>Shrinking to the smaller of the numa or package is a fairly trivial change, > >>and I'm good with that change too. I discounted it because there might be an > >>advantage in case 2 if the internal hardware is actually a case 3 (or just > >>multiple rings/whatever, each with an L3). In those cases the firmware vendor > >>could play around with whatever representation serves them the best. > > > >Agreed. Distributed last level caches and interconnect speeds make it > >virtually impossible to define MC in a way that works well for everyone > >based on the topology information we have at hand. 
> > > >> > >>>Unless you have a node-wide last level cache, DIE level won't be > >>>eliminated in scenarios 2 and 3, and could cause trouble. For > >>>numa-in-package, you can end up with a DIE level inside the node where > >>>the default flags don't favour aggressive spreading of tasks. The same > >>>could be the case for per-package nodes (scenario 2). > >>> > >>>Don't we end up redefining the physical package to be the last level cache > >>>instead of using the PPTT flag for scenarios 2 and 3? > >> > >>I'm not sure I understand; core_siblings isn't changing (it's still per > >>package). Only the MC mapping, which normally is just core_siblings. For all > >>intents and purposes, right now this patch is the same as v6, except for the > >>numa-in-package case, where the MC domain is being shrunk to the node siblings. > >>I'm just trying to set up the code for potential future cases where the LLC > >>isn't equal to the node or package. > > > >Right, core_siblings remains the same. The scheduler topology just looks > >a bit odd as we can have core_siblings spanning the full true physical > >package and have DIE level as a subset of that, with an MC level where > >the MC siblings are a much smaller subset of cpus than core_siblings. > > > >IOW, it would lead to having one topology used by the scheduler and > >another used by the users of topology_core_cpumask() (which is not > >many I think). > > > >Is there a good reason for diverging instead of adjusting the > >core_sibling mask? On x86 the core_siblings mask is defined by the last > >level cache span so they don't have this issue. > > I'm overwhelmingly convinced we are doing the right thing WRT the core > siblings at the moment. It's exported to user space, and the general > understanding is that it's a socket. Even with numa in package/on die, if you > run lscpu, lstopo, etc., they all understand the system topology correctly > doing it this way (AFAIK). Right. As said in my reply to Brice, I thought MC level and sysfs were aligned, but they clearly aren't. I agree that treating them differently is the right thing to do. > >I would prefer this simpler solution as it should eliminate the DIE level > >for all numa-in-package configurations. Although, I think we should consider > >just shrinking the core_sibling mask instead of having a different MC > >mask (cpu_coregroup_mask). Do you see any problems in doing that? > My strongest opinion is leaning toward core_siblings being correct as it > stands. How the scheduler deals with that is less clear. I will toss the > above as a separate patch, and we can forget this one. I see dropping DIE as > a larger patch set defining an arch-specific scheduler topology and tweaking > the individual scheduler level/flags/tuning. OTOH, unless there is something > particularly creative there, I don't see how to avoid NUMA domains pushing > deeper into the cache/system topology, which means filling the MC layer (and > possibly others) similarly to the above snippet. Agreed that core_siblings is correct. With the simple solution, DIE shouldn't show up for any numa-in-package configurations, allowing NUMA to sit directly on top of MC, which should mean that the flags should be roughly okay. -- To unsubscribe from this list: send the line "unsubscribe linux-acpi" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html