[RFC,tip/core/rcu,6/6] rcu: Reduce cache-miss initialization latencies for large systems

From: Peter Zijlstra <a.p.zijlstra@chello.nl>

On Thu, 2012-04-26 at 09:15 -0700, Paul E. McKenney wrote:
> On Thu, Apr 26, 2012 at 05:28:57PM +0200, Peter Zijlstra wrote:
> > On Thu, 2012-04-26 at 07:12 -0700, Paul E. McKenney wrote:
> > > On Thu, Apr 26, 2012 at 02:51:47PM +0200, Peter Zijlstra wrote:
> > 
> > > > Wouldn't it be much better to match the rcu fanout tree to the physical
> > > > topology of the machine?
> > > 
> > > From what I am hearing, doing so requires me to morph the rcu_node tree
> > > at run time.  I might eventually become courageous/inspired/senile
> > > enough to try this, but not yet.  ;-)
> > 
> > Yes, boot time with possibly some hotplug hooks.
> 
> Has anyone actually measured any slowdown due to the rcu_node structure
> not matching the topology?  (But see also below.)

Nope, I'm just whinging ;-)

> > > Actually, some of this topology shifting seems to me like a firmware
> > > bug.  Why not arrange the Linux-visible numbering in a way to promote
> > > locality for code sequencing through the CPUs?
> > 
> > I'm not sure.. but it seems well established on x86 to first enumerate
> > the cores (thread 0) and then the sibling threads (thread 1) -- one
> > 'advantage' is that if you boot with maxcpus=$half you get all cores
> > instead of half the cores.
> > 
> > OTOH it does make linear iteration of the cpus 'funny' :-)
> 
> Like I said, firmware bug.  Seems like the fix should be there as well.
> Perhaps there needs to be two CPU numberings, one for people wanting
> whole cores and another for people who want cache locality.  Yes, this
> could be confusing, but keep in mind that you are asking every kernel
> subsystem to keep its own version of the cache-locality numbering,
> and that will be even more confusing.

I really don't see why it would matter; as far as I care they're
completely randomized on boot, it's all done using cpu-bitmasks anyway.

Suppose the linear thing would have threads/cores/cache continuity like
you want, that still leaves the node interconnects, and we all know
people love to be creative with those, no way you're going to fold that
into a linear scheme :-)

Anyway, as it currently stands I can offer you: cpus_share_cache(),
which will return true/false depending on whether the two cpus do
indeed share a cache.

On top of that there's node_distance(), which will (hopefully) reflect
the interconnect topology between nodes.

Using these you can construct enough of the topology layout to be
useful. NOTE: node topologies don't need to be symmetric!
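
As a rough illustration of the above, here is a user-space sketch (not
kernel code) of grouping nodes by distance. The dist[][] table is made
up for the example and stands in for node_distance(); the helper names
table_is_symmetric() and nodes_within() are hypothetical. The table is
deliberately asymmetric to show why both directions need checking.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4

/* Made-up distances standing in for node_distance(); dist[0][3] != dist[3][0]. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 22 },
	{ 16, 10, 22, 16 },
	{ 16, 22, 10, 16 },
	{ 16, 16, 16, 10 },
};

/* Return true only if every pair of nodes has the same distance both ways. */
static bool table_is_symmetric(void)
{
	for (int i = 0; i < NR_NODES; i++)
		for (int j = i + 1; j < NR_NODES; j++)
			if (dist[i][j] != dist[j][i])
				return false;
	return true;
}

/* Bitmask of the nodes reachable from 'node' within 'max_dist'. */
static unsigned int nodes_within(int node, int max_dist)
{
	unsigned int mask = 0;

	assert(node >= 0 && node < NR_NODES);

	for (int j = 0; j < NR_NODES; j++)
		if (dist[node][j] <= max_dist)
			mask |= 1u << j;
	return mask;
}
```

With this table, node 0 sees nodes 0-2 within distance 16 while node 3
sees all four nodes, so the per-node masks have to be built from each
node's own row rather than assumed to mirror each other.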

> > Also, a fanout of 16 is nice when your machine doesn't have HT and has a
> > 2^n core count, but some popular machines these days have 6/10 cores per
> > socket, resulting in your fanout splitting caches.
> 
> That is easy.  Such systems can set CONFIG_RCU_FANOUT to 6, 12, 10,
> or 20, depending on preference.  With a patch intended for 3.6, they
> could set the smallest reasonable value at build time and adjust to
> the hardware using the boot parameter.
> 
> http://www.gossamer-threads.com/lists/linux/kernel/1524864
> 
> I expect to make other similar changes over time, but will be proceeding
> cautiously.

I can very easily give you the size of a node (the nr of cpus in it);
still, as long as you iterate the cpu space linearly that's not going to
be much help.
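
To make the fanout arithmetic in the quoted exchange concrete, here is
a small user-space model (not kernel code). It assumes CPUs are
numbered linearly with each socket owning a consecutive range, which
ignores the cores-first/siblings-later enumeration discussed earlier;
count_split_sockets() is a hypothetical helper name.

```c
#include <assert.h>

/*
 * Hypothetical model: socket s owns CPUs
 * [s * cpus_per_socket, (s + 1) * cpus_per_socket) and leaf group g
 * covers CPUs [g * fanout, (g + 1) * fanout).
 * Returns the number of sockets whose CPUs end up in more than one leaf.
 */
static int count_split_sockets(int nr_cpus, int cpus_per_socket, int fanout)
{
	int split = 0;

	assert(nr_cpus > 0 && cpus_per_socket > 0 && fanout > 0);

	for (int first = 0; first < nr_cpus; first += cpus_per_socket) {
		int last = first + cpus_per_socket - 1;

		if (last >= nr_cpus)
			last = nr_cpus - 1;
		/* Split when the first and last CPU land in different leaves. */
		if (first / fanout != last / fanout)
			split++;
	}
	return split;
}
```

Under these assumptions, 24 CPUs at 6 per socket with a fanout of 16
splits one socket across two leaves, while a fanout of 6 or 12 splits
none; 40 CPUs at 10 per socket with a fanout of 16 splits two.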

I can also offer you access to the scheduler topology if you want.. I've
got the below patch pending which should (hopefully) improve the
scheduler's node topology -- it implements that detection based on
node_distance().

I've tested it using some fake-numa hacks to feed it the distance table
of an AMD quad-socket Magny-Cours:

 "numa=fake=8:10,16,16,22,16,22,16,22,
              16,10,22,16,22,16,22,16,
              16,22,10,16,16,22,16,22,
              22,16,16,10,22,16,22,16,
              16,22,16,22,10,16,16,22,
              22,16,22,16,16,10,22,16,
              16,22,16,22,16,22,10,16,
              22,16,22,16,22,16,16,10"
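
For reference, a user-space sketch (not taken from the patch) of one
plausible first step when deriving levels from such a table:
collecting the distinct distance values in ascending order. The
dist[][] array copies the fake-numa table above; collect_levels() and
MAX_LEVELS are hypothetical names introduced for the example.

```c
#include <assert.h>

#define NR_NODES   8
#define MAX_LEVELS (NR_NODES * NR_NODES)

/* Distance table copied from the numa=fake string above. */
static const int dist[NR_NODES][NR_NODES] = {
	{ 10, 16, 16, 22, 16, 22, 16, 22 },
	{ 16, 10, 22, 16, 22, 16, 22, 16 },
	{ 16, 22, 10, 16, 16, 22, 16, 22 },
	{ 22, 16, 16, 10, 22, 16, 22, 16 },
	{ 16, 22, 16, 22, 10, 16, 16, 22 },
	{ 22, 16, 22, 16, 16, 10, 22, 16 },
	{ 16, 22, 16, 22, 16, 22, 10, 16 },
	{ 22, 16, 22, 16, 22, 16, 16, 10 },
};

/* Fill levels[] with the distinct distances, ascending; return how many. */
static int collect_levels(int levels[MAX_LEVELS])
{
	int nr = 0;

	for (int i = 0; i < NR_NODES; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			int d = dist[i][j];
			int pos = 0;

			/* Find where d belongs in the sorted array. */
			while (pos < nr && levels[pos] < d)
				pos++;
			if (pos < nr && levels[pos] == d)
				continue;	/* already recorded */

			assert(nr < MAX_LEVELS);
			/* Shift larger entries up to make room for d. */
			for (int k = nr; k > pos; k--)
				levels[k] = levels[k - 1];
			levels[pos] = d;
			nr++;
		}
	}
	return nr;
}
```

For this table the distinct distances are 10, 16 and 22, i.e. the
local node plus two remote levels.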

---
Subject: sched: Rewrite the CONFIG_NUMA sched domain support.
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Tue Apr 17 15:49:36 CEST 2012

The current code groups up to 16 nodes in a level and then puts an
ALLNODES domain spanning the entire tree on top of that. This doesn't
reflect the numa topology, and especially for the smaller
not-fully-connected machines out there today this might make a
difference.

Therefore, build a proper numa topology based on node_distance().

TODO: figure out a way to set SD_flags based on distance such that
      we disable various expensive load-balancing features at some
      point and increase the balance interval in proportion to the
      distance.

Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Chris Metcalf <cmetcalf@tilera.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: linux-alpha@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-sh@vger.kernel.org
Cc: Matt Turner <mattst88@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Mundt <lethal@linux-sh.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Richard Henderson <rth@twiddle.net>
Cc: sparclinux@vger.kernel.org
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tony Luck <tony.luck@intel.com>
Cc: x86@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-fgj6245hxj61qe8vy7c6cmjj@git.kernel.org
---
 arch/powerpc/include/asm/topology.h |    6 
 arch/x86/include/asm/topology.h     |   38 -----
 include/linux/topology.h            |   36 -----
 kernel/sched/core.c                 |  253 ++++++++++++++++++++++--------------
 4 files changed, 158 insertions(+), 175 deletions(-)
