Message ID | 1335199347-13926-6-git-send-email-paulmck@linux.vnet.ibm.com |
---|---|
State | Accepted |
Commit | 8932a63d5edb02f714d50c26583152fe0a97a69c |
Headers | show |
On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper > limit of 16 on the leaf-level fanout for the rcu_node tree. This was > needed to reduce lock contention that was induced by the synchronization > of scheduling-clock interrupts, which was in turn needed to improve > energy efficiency for moderate-sized lightly loaded servers. > > However, reducing the leaf-level fanout means that there are more > leaf-level rcu_node structures in the tree, which in turn means that > RCU's grace-period initialization incurs more cache misses. This is > not a problem on moderate-sized servers with only a few tens of CPUs, > but becomes a major source of real-time latency spikes on systems with > many hundreds of CPUs. In addition, the workloads running on these large > systems tend to be CPU-bound, which eliminates the energy-efficiency > advantages of synchronizing scheduling-clock interrupts. Therefore, > these systems need maximal values for the rcu_node leaf-level fanout. > > This commit addresses this problem by introducing a new kernel parameter > named RCU_FANOUT_LEAF that directly controls the leaf-level fanout. > This parameter defaults to 16 to handle the common case of a moderate > sized lightly loaded servers, but may be set higher on larger systems. Wouldn't it be much better to match the rcu fanout tree to the physical topology of the machine?
On Thu, Apr 26, 2012 at 02:51:47PM +0200, Peter Zijlstra wrote: > On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > > > > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper > > limit of 16 on the leaf-level fanout for the rcu_node tree. This was > > needed to reduce lock contention that was induced by the synchronization > > of scheduling-clock interrupts, which was in turn needed to improve > > energy efficiency for moderate-sized lightly loaded servers. > > > > However, reducing the leaf-level fanout means that there are more > > leaf-level rcu_node structures in the tree, which in turn means that > > RCU's grace-period initialization incurs more cache misses. This is > > not a problem on moderate-sized servers with only a few tens of CPUs, > > but becomes a major source of real-time latency spikes on systems with > > many hundreds of CPUs. In addition, the workloads running on these large > > systems tend to be CPU-bound, which eliminates the energy-efficiency > > advantages of synchronizing scheduling-clock interrupts. Therefore, > > these systems need maximal values for the rcu_node leaf-level fanout. > > > > This commit addresses this problem by introducing a new kernel parameter > > named RCU_FANOUT_LEAF that directly controls the leaf-level fanout. > > This parameter defaults to 16 to handle the common case of a moderate > > sized lightly loaded servers, but may be set higher on larger systems. > > Wouldn't it be much better to match the rcu fanout tree to the physical > topology of the machine? From what I am hearing, doing so requires me to morph the rcu_node tree at run time. I might eventually become courageous/inspired/senile enough to try this, but not yet. ;-) Actually, some of this topology shifting seems to me like a firmware bug. Why not arrange the Linux-visible numbering in a way to promote locality for code sequencing through the CPUs? Thanx, Paul
On Thu, 2012-04-26 at 07:12 -0700, Paul E. McKenney wrote: > On Thu, Apr 26, 2012 at 02:51:47PM +0200, Peter Zijlstra wrote: > > Wouldn't it be much better to match the rcu fanout tree to the physical > > topology of the machine? > > From what I am hearing, doing so requires me to morph the rcu_node tree > at run time. I might eventually become courageous/inspired/senile > enough to try this, but not yet. ;-) Yes, boot time with possibly some hotplug hooks. > Actually, some of this topology shifting seems to me like a firmware > bug. Why not arrange the Linux-visible numbering in a way to promote > locality for code sequencing through the CPUs? I'm not sure.. but it seems well established on x86 to first enumerate the cores (thread 0) and then the sibling threads (thread 1) -- one 'advantage' is that if you boot with max_cpus=$half you get all cores instead of half the cores. OTOH it does make linear iteration of the cpus 'funny' :-) Also, a fanout of 16 is nice when your machine doesn't have HT and has a 2^n core count, but some popular machines these days have 6/10 cores per socket, resulting in your fanout splitting caches.
On Thu, Apr 26, 2012 at 05:28:57PM +0200, Peter Zijlstra wrote: > On Thu, 2012-04-26 at 07:12 -0700, Paul E. McKenney wrote: > > On Thu, Apr 26, 2012 at 02:51:47PM +0200, Peter Zijlstra wrote: > > > > Wouldn't it be much better to match the rcu fanout tree to the physical > > > topology of the machine? > > > > From what I am hearing, doing so requires me to morph the rcu_node tree > > at run time. I might eventually become courageous/inspired/senile > > enough to try this, but not yet. ;-) > > Yes, boot time with possibly some hotplug hooks. Has anyone actually measured any slowdown due to the rcu_node structure not matching the topology? (But see also below.) > > Actually, some of this topology shifting seems to me like a firmware > > bug. Why not arrange the Linux-visible numbering in a way to promote > > locality for code sequencing through the CPUs? > > I'm not sure.. but it seems well established on x86 to first enumerate > the cores (thread 0) and then the sibling threads (thread 1) -- one > 'advantage' is that if you boot with max_cpus=$half you get all cores > instead of half the cores. > > OTOH it does make linear iteration of the cpus 'funny' :-) Like I said, firmware bug. Seems like the fix should be there as well. Perhaps there needs to be two CPU numberings, one for people wanting whole cores and another for people who want cache locality. Yes, this could be confusing, but keep in mind that you are asking every kernel subsystem to keep its own version of the cache-locality numbering, and that will be even more confusing. > Also, a fanout of 16 is nice when your machine doesn't have HT and has a > 2^n core count, but some popular machines these days have 6/10 cores per > socket, resulting in your fanout splitting caches. That is easy. Such systems can set CONFIG_RCU_FANOUT to 6, 12, 10, or 20, depending on preference. With a patch intended for 3.6, they could set the smallest reasonable value at build time and adjust to the hardware using the boot parameter. http://www.gossamer-threads.com/lists/linux/kernel/1524864 I expect to make other similar changes over time, but will be proceeding cautiously. Thanx, Paul
On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper > limit of 16 on the leaf-level fanout for the rcu_node tree. This was > needed to reduce lock contention that was induced by the synchronization > of scheduling-clock interrupts, which was in turn needed to improve > energy efficiency for moderate-sized lightly loaded servers. > > However, reducing the leaf-level fanout means that there are more > leaf-level rcu_node structures in the tree, which in turn means that > RCU's grace-period initialization incurs more cache misses. This is > not a problem on moderate-sized servers with only a few tens of CPUs, With a distro config (4096 CPUs) interrupt latency is bad even on a quad. Traversing empty nodes taking locks and cache misses hurts. -Mike
On Fri, Apr 27, 2012 at 06:36:11AM +0200, Mike Galbraith wrote: > On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > > > > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper > > limit of 16 on the leaf-level fanout for the rcu_node tree. This was > > needed to reduce lock contention that was induced by the synchronization > > of scheduling-clock interrupts, which was in turn needed to improve > > energy efficiency for moderate-sized lightly loaded servers. > > > > However, reducing the leaf-level fanout means that there are more > > leaf-level rcu_node structures in the tree, which in turn means that > > RCU's grace-period initialization incurs more cache misses. This is > > not a problem on moderate-sized servers with only a few tens of CPUs, > > With a distro config (4096 CPUs) interrupt latency is bad even on a > quad. Traversing empty nodes taking locks and cache misses hurts. Agreed -- and I will be working on an additional patch that makes RCU avoid initializing its data structures for CPUs that don't exist. That said, increasing the leaf-level fanout from 16 to 64 should reduce the latency pain by a factor of four. In addition, I would expect that real-time builds of the kernel would set NR_CPUS to some value much smaller than 4096. ;-) Thanx, Paul
On Fri, 2012-04-27 at 08:15 -0700, Paul E. McKenney wrote: > On Fri, Apr 27, 2012 at 06:36:11AM +0200, Mike Galbraith wrote: > > On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > > > > > > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper > > > limit of 16 on the leaf-level fanout for the rcu_node tree. This was > > > needed to reduce lock contention that was induced by the synchronization > > > of scheduling-clock interrupts, which was in turn needed to improve > > > energy efficiency for moderate-sized lightly loaded servers. > > > > > > However, reducing the leaf-level fanout means that there are more > > > leaf-level rcu_node structures in the tree, which in turn means that > > > RCU's grace-period initialization incurs more cache misses. This is > > > not a problem on moderate-sized servers with only a few tens of CPUs, > > > > With a distro config (4096 CPUs) interrupt latency is bad even on a > > quad. Traversing empty nodes taking locks and cache misses hurts. > > Agreed -- and I will be working on an additional patch that makes RCU > avoid initializing its data structures for CPUs that don't exist. That's still on my todo list too, your initial patch (and my butchery thereof to skip taking lock) showed this helps a heap. > That said, increasing the leaf-level fanout from 16 to 64 should reduce > the latency pain by a factor of four. In addition, I would expect that > real-time builds of the kernel would set NR_CPUS to some value much > smaller than 4096. ;-) Yup, else you would have heard whimpering months ago ;-) -Mike
On Sat, Apr 28, 2012 at 06:42:22AM +0200, Mike Galbraith wrote: > On Fri, 2012-04-27 at 08:15 -0700, Paul E. McKenney wrote: > > On Fri, Apr 27, 2012 at 06:36:11AM +0200, Mike Galbraith wrote: > > > On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: > > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > > > > > > > > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper > > > > limit of 16 on the leaf-level fanout for the rcu_node tree. This was > > > > needed to reduce lock contention that was induced by the synchronization > > > > of scheduling-clock interrupts, which was in turn needed to improve > > > > energy efficiency for moderate-sized lightly loaded servers. > > > > > > > > However, reducing the leaf-level fanout means that there are more > > > > leaf-level rcu_node structures in the tree, which in turn means that > > > > RCU's grace-period initialization incurs more cache misses. This is > > > > not a problem on moderate-sized servers with only a few tens of CPUs, > > > > > > With a distro config (4096 CPUs) interrupt latency is bad even on a > > > quad. Traversing empty nodes taking locks and cache misses hurts. > > > > Agreed -- and I will be working on an additional patch that makes RCU > > avoid initializing its data structures for CPUs that don't exist. > > That's still on my todo list too, your initial patch (and my butchery > thereof to skip taking lock) showed this helps a heap. Indeed, I am a bit worried about the safety of skipping the lock in that case. Either way, if you would rather keep working on this, I have no problem pushing it further down my todo list. > > That said, increasing the leaf-level fanout from 16 to 64 should reduce > > the latency pain by a factor of four. In addition, I would expect that > > real-time builds of the kernel would set NR_CPUS to some value much > > smaller than 4096. ;-) > > Yup, else you would have heard whimpering months ago ;-) I am not sure that it would have been exactly whimpering, but yes. ;-) Thanx, Paul
On Sat, 2012-04-28 at 10:21 -0700, Paul E. McKenney wrote: > On Sat, Apr 28, 2012 at 06:42:22AM +0200, Mike Galbraith wrote: > > On Fri, 2012-04-27 at 08:15 -0700, Paul E. McKenney wrote: > > > On Fri, Apr 27, 2012 at 06:36:11AM +0200, Mike Galbraith wrote: > > > > On Mon, 2012-04-23 at 09:42 -0700, Paul E. McKenney wrote: > > > > > From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> > > > > > > > > > > Commit #0209f649 (rcu: limit rcu_node leaf-level fanout) set an upper > > > > > limit of 16 on the leaf-level fanout for the rcu_node tree. This was > > > > > needed to reduce lock contention that was induced by the synchronization > > > > > of scheduling-clock interrupts, which was in turn needed to improve > > > > > energy efficiency for moderate-sized lightly loaded servers. > > > > > > > > > > However, reducing the leaf-level fanout means that there are more > > > > > leaf-level rcu_node structures in the tree, which in turn means that > > > > > RCU's grace-period initialization incurs more cache misses. This is > > > > > not a problem on moderate-sized servers with only a few tens of CPUs, > > > > > > > > With a distro config (4096 CPUs) interrupt latency is bad even on a > > > > quad. Traversing empty nodes taking locks and cache misses hurts. > > > > > > Agreed -- and I will be working on an additional patch that makes RCU > > > avoid initializing its data structures for CPUs that don't exist. > > > > That's still on my todo list too, your initial patch (and my butchery > > thereof to skip taking lock) showed this helps a heap. > > Indeed, I am a bit worried about the safety of skipping the lock in > that case. Either way, if you would rather keep working on this, I > have no problem pushing it further down my todo list. It's on my todo because Paul is a busy fellow. I'd much prefer that RCU tweaking is done by somebody who fully understands RCU's gizzard (!me). IOW, I'll fiddle with RCU as time permits/gripes require, but waiting to test your implementation is the more prudent thing to do. Heck, as yet, I haven't even scraped together the time to test your other patches :-/ For the nonce, I've done the safe thing, reverted 0209f649 and enabled tick skew. Per my measurements, the regression is thus dead, so it's not as pressing an issue, though some users do still want and need more. -Mike
diff --git a/init/Kconfig b/init/Kconfig index 85c6870..6d18ef8 100644 --- a/init/Kconfig +++ b/init/Kconfig @@ -458,6 +458,33 @@ config RCU_FANOUT Select a specific number if testing RCU itself. Take the default if unsure. +config RCU_FANOUT_LEAF + int "Tree-based hierarchical RCU leaf-level fanout value" + range 2 RCU_FANOUT if 64BIT + range 2 RCU_FANOUT if !64BIT + depends on TREE_RCU || TREE_PREEMPT_RCU + default 16 + help + This option controls the leaf-level fanout of hierarchical + implementations of RCU, and allows trading off cache misses + against lock contention. Systems that synchronize their + scheduling-clock interrupts for energy-efficiency reasons will + want the default because the smaller leaf-level fanout keeps + lock contention levels acceptably low. Very large systems + (hundreds or thousands of CPUs) will instead want to set this + value to the maximum value possible in order to reduce the + number of cache misses incurred during RCU's grace-period + initialization. These systems tend to run CPU-bound, and thus + are not helped by synchronized interrupts, and thus tend to + skew them, which reduces lock contention enough that large + leaf-level fanouts work well. + + Select a specific number if testing RCU itself. + + Select the maximum permissible value for large systems. + + Take the default if unsure. + config RCU_FANOUT_EXACT bool "Disable tree-based hierarchical RCU auto-balancing" depends on TREE_RCU || TREE_PREEMPT_RCU diff --git a/kernel/rcutree.c b/kernel/rcutree.c index 1050d6d..780acf8 100644 --- a/kernel/rcutree.c +++ b/kernel/rcutree.c @@ -2418,7 +2418,7 @@ static void __init rcu_init_levelspread(struct rcu_state *rsp) for (i = NUM_RCU_LVLS - 1; i > 0; i--) rsp->levelspread[i] = CONFIG_RCU_FANOUT; - rsp->levelspread[0] = RCU_FANOUT_LEAF; + rsp->levelspread[0] = CONFIG_RCU_FANOUT_LEAF; } #else /* #ifdef CONFIG_RCU_FANOUT_EXACT */ static void __init rcu_init_levelspread(struct rcu_state *rsp) diff --git a/kernel/rcutree.h b/kernel/rcutree.h index cdd1be0..a905c20 100644 --- a/kernel/rcutree.h +++ b/kernel/rcutree.h @@ -29,18 +29,14 @@ #include <linux/seqlock.h> /* - * Define shape of hierarchy based on NR_CPUS and CONFIG_RCU_FANOUT. + * Define shape of hierarchy based on NR_CPUS, CONFIG_RCU_FANOUT, and + * CONFIG_RCU_FANOUT_LEAF. * In theory, it should be possible to add more levels straightforwardly. * In practice, this did work well going from three levels to four. * Of course, your mileage may vary. */ #define MAX_RCU_LVLS 4 -#if CONFIG_RCU_FANOUT > 16 -#define RCU_FANOUT_LEAF 16 -#else /* #if CONFIG_RCU_FANOUT > 16 */ -#define RCU_FANOUT_LEAF (CONFIG_RCU_FANOUT) -#endif /* #else #if CONFIG_RCU_FANOUT > 16 */ -#define RCU_FANOUT_1 (RCU_FANOUT_LEAF) +#define RCU_FANOUT_1 (CONFIG_RCU_FANOUT_LEAF) #define RCU_FANOUT_2 (RCU_FANOUT_1 * CONFIG_RCU_FANOUT) #define RCU_FANOUT_3 (RCU_FANOUT_2 * CONFIG_RCU_FANOUT) #define RCU_FANOUT_4 (RCU_FANOUT_3 * CONFIG_RCU_FANOUT)