Message ID | 20180220215954.4092811-1-arnd@arndb.de |
---|---|
Headers | show |
Series | ARM: hacks for link-time optimization | expand |
On Tue, Feb 20, 2018 at 10:59:47PM +0100, Arnd Bergmann wrote: > Hi Nico, all, > > I was playing with ARM link-time optimization handling earlier this > month, and eventually got it to build cleanly with randconfig kernels, > but ended up with a lot of ugly hacks to actually pull it off. How are we dealing with the fact that LTO can break RCU in very subtle and scary ways? Do we have a compiler guy on board that has given us a compiler switch that kills that optimization (and thereby guarantees that behaviour for future compilers etc..) ? Also see the thread here: https://lkml.kernel.org/r/20171116115810.GH9361@arm.com (and yes, this is a fine example of how lore sucks for reading email)
On Mon, Dec 17, 2018 at 11:50:20PM +0100, Peter Zijlstra wrote: > On Tue, Feb 20, 2018 at 10:59:47PM +0100, Arnd Bergmann wrote: > > Hi Nico, all, > > > > I was playing with ARM link-time optimization handling earlier this > > month, and eventually got it to build cleanly with randconfig kernels, > > but ended up with a lot of ugly hacks to actually pull it off. > > How are we dealing with the fact that LTO can break RCU in very subtle > and scary ways? > > Do we have a compiler guy on board that has given us a compiler switch > that kills that optimization (and thereby guarantees that behaviour for > future compilers etc..) ? Can you actually define what optimization you are worred about? If there are optimizations that cause problems they likely happen even without LTO inside files. The only difference with LTO is that it does them between files too. -Andi
On Mon, Dec 17, 2018 at 04:08:00PM -0800, Andi Kleen wrote: > On Mon, Dec 17, 2018 at 11:50:20PM +0100, Peter Zijlstra wrote: > > On Tue, Feb 20, 2018 at 10:59:47PM +0100, Arnd Bergmann wrote: > > > Hi Nico, all, > > > > > > I was playing with ARM link-time optimization handling earlier this > > > month, and eventually got it to build cleanly with randconfig kernels, > > > but ended up with a lot of ugly hacks to actually pull it off. > > > > How are we dealing with the fact that LTO can break RCU in very subtle > > and scary ways? > > > > Do we have a compiler guy on board that has given us a compiler switch > > that kills that optimization (and thereby guarantees that behaviour for > > future compilers etc..) ? > > Can you actually define what optimization you are worred about? > > If there are optimizations that cause problems they likely happen > even without LTO inside files. The only difference with LTO is that it > does them between files too. In particular turning an address-dependency into a control-dependency, which is something allowed by the C language, since it doesn't recognise these concepts as such. The 'optimization' is allowed currently, but LTO will make it much more likely since it will have a much wider view of things. Esp. when combined with PGO. Specifically; if you have something like: int idx; struct object objs[2]; the statement: val = objs[idx & 1].ponies; which you 'need' to be translated like: struct object *obj = objs; obj += (idx & 1); val = obj->ponies; Such that the load of obj->ponies depends on the load of idx. However our dear compiler is allowed to make it: if (idx & 1) obj = &objs[1]; else obj = &objs[0]; val = obj->ponies; Because C doesn't recognise this as being different. However this is utterly broken, because in this translation we can speculate the load of obj->ponies such that it no longer depends on the load of idx, which breaks RCU. Note that further 'optimization' is possible and the compiler could even make it: if (idx & 1) val = objs[1].ponies; else val = objs[0].ponies; Now, granted, this is a fairly artificial example, but it does illustrate the exact problem. The more the compiler can see of the complete program, the more likely it can make inferrences like this, esp. when coupled with PGO. Now, we're (usually) very careful to wrap things in READ_ONCE() and rcu_dereference() and the like, which makes it harder on the compiler (because 'volatile' is special), but nothing really stops it from doing this. Paul has been trying to beat clue into the language people, but given he's been at it for 10 years now, and there's no resolution, I figure we ought to get compiler implementations to give us a knob.
On Tue, Dec 18, 2018 at 10:18:24AM +0100, Peter Zijlstra wrote: > In particular turning an address-dependency into a control-dependency, > which is something allowed by the C language, since it doesn't recognise > these concepts as such. > > The 'optimization' is allowed currently, but LTO will make it much more > likely since it will have a much wider view of things. Esp. when combined > with PGO. > > Specifically; if you have something like: > > int idx; > struct object objs[2]; > > the statement: > > val = objs[idx & 1].ponies; > > which you 'need' to be translated like: > > struct object *obj = objs; > obj += (idx & 1); > val = obj->ponies; > > Such that the load of obj->ponies depends on the load of idx. However > our dear compiler is allowed to make it: > > if (idx & 1) > obj = &objs[1]; > else > obj = &objs[0]; > > val = obj->ponies; > > Because C doesn't recognise this as being different. However this is > utterly broken, because in this translation we can speculate the load > of obj->ponies such that it no longer depends on the load of idx, which > breaks RCU. > > Note that further 'optimization' is possible and the compiler could even > make it: > > if (idx & 1) > val = objs[1].ponies; > else > val = objs[0].ponies; A variant that is actually broken on x86 too (due to issuing the loads in the 'wrong' order): val = objs[0].ponies; if (idx & 1) val = objs[1].ponies; Which is a translation that makes sense if we either marked unlikely(idx & 1) or if PGO found the same. > Now, granted, this is a fairly artificial example, but it does > illustrate the exact problem. > > The more the compiler can see of the complete program, the more likely > it can make inferrences like this, esp. when coupled with PGO. > > Now, we're (usually) very careful to wrap things in READ_ONCE() and > rcu_dereference() and the like, which makes it harder on the compiler > (because 'volatile' is special), but nothing really stops it from doing > this. > > Paul has been trying to beat clue into the language people, but given > he's been at it for 10 years now, and there's no resolution, I figure we > ought to get compiler implementations to give us a knob.
On Tue, Dec 18, 2018 at 11:00:14AM +0100, Peter Zijlstra wrote: > On Tue, Dec 18, 2018 at 10:18:24AM +0100, Peter Zijlstra wrote: > > In particular turning an address-dependency into a control-dependency, > > which is something allowed by the C language, since it doesn't recognise > > these concepts as such. > > > > The 'optimization' is allowed currently, but LTO will make it much more > > likely since it will have a much wider view of things. Esp. when combined > > with PGO. > > > > Specifically; if you have something like: > > > > int idx; > > struct object objs[2]; > > > > the statement: > > > > val = objs[idx & 1].ponies; > > > > which you 'need' to be translated like: > > > > struct object *obj = objs; > > obj += (idx & 1); > > val = obj->ponies; > > > > Such that the load of obj->ponies depends on the load of idx. However > > our dear compiler is allowed to make it: > > > > if (idx & 1) > > obj = &objs[1]; > > else > > obj = &objs[0]; > > > > val = obj->ponies; > > > > Because C doesn't recognise this as being different. However this is > > utterly broken, because in this translation we can speculate the load > > of obj->ponies such that it no longer depends on the load of idx, which > > breaks RCU. Hence the following in Documentation/RCU/rcu_dereference.txt: You are only permitted to use rcu_dereference on pointer values. The compiler simply knows too much about integral values to trust it to carry dependencies through integer operations. I got rid of the carrying of dependencies via non-pointers in 2014. You are telling me that they have crept back? Sigh!!! :-/ Thanx, Paul > > Note that further 'optimization' is possible and the compiler could even > > make it: > > > > if (idx & 1) > > val = objs[1].ponies; > > else > > val = objs[0].ponies; > > A variant that is actually broken on x86 too (due to issuing the loads > in the 'wrong' order): > > val = objs[0].ponies; > if (idx & 1) > val = objs[1].ponies; > > Which is a translation that makes sense if we either marked > unlikely(idx & 1) or if PGO found the same. > > > Now, granted, this is a fairly artificial example, but it does > > illustrate the exact problem. > > > > The more the compiler can see of the complete program, the more likely > > it can make inferrences like this, esp. when coupled with PGO. > > > > Now, we're (usually) very careful to wrap things in READ_ONCE() and > > rcu_dereference() and the like, which makes it harder on the compiler > > (because 'volatile' is special), but nothing really stops it from doing > > this. > > > > Paul has been trying to beat clue into the language people, but given > > he's been at it for 10 years now, and there's no resolution, I figure we > > ought to get compiler implementations to give us a knob. >
> In particular turning an address-dependency into a control-dependency, > which is something allowed by the C language, since it doesn't recognise > these concepts as such. > > The 'optimization' is allowed currently, but LTO will make it much more > likely since it will have a much wider view of things. Esp. when combined > with PGO. > > Specifically; if you have something like: > > int idx; > struct object objs[2]; > > the statement: > > val = objs[idx & 1].ponies; > > which you 'need' to be translated like: > > struct object *obj = objs; > obj += (idx & 1); > val = obj->ponies; > > Such that the load of obj->ponies depends on the load of idx. However > our dear compiler is allowed to make it: > > if (idx & 1) > obj = &objs[1]; > else > obj = &objs[0]; > > val = obj->ponies; I don't see why a compiler would do such an optimization. Clearly the second variant is worse than the first, bigger and needs branch prediction resources. In fact compilers usually try hard to go into the other direction and apply if conversion. Has anyone seen real world examples of such changes being done, or is this all language lawyering theory? -Andi > > Because C doesn't recognise this as being different. However this is > utterly broken, because in this translation we can speculate the load > of obj->ponies such that it no longer depends on the load of idx, which > breaks RCU. > > Note that further 'optimization' is possible and the compiler could even > make it: > > if (idx & 1) > val = objs[1].ponies; > else > val = objs[0].ponies; > > Now, granted, this is a fairly artificial example, but it does > illustrate the exact problem. > > The more the compiler can see of the complete program, the more likely > it can make inferrences like this, esp. when coupled with PGO. > > Now, we're (usually) very careful to wrap things in READ_ONCE() and > rcu_dereference() and the like, which makes it harder on the compiler > (because 'volatile' is special), but nothing really stops it from doing > this. > > Paul has been trying to beat clue into the language people, but given > he's been at it for 10 years now, and there's no resolution, I figure we > ought to get compiler implementations to give us a knob.
On Fri, Dec 21, 2018 at 09:20:44AM -0800, Andi Kleen wrote: > > In particular turning an address-dependency into a control-dependency, > > which is something allowed by the C language, since it doesn't recognise > > these concepts as such. > > > > The 'optimization' is allowed currently, but LTO will make it much more > > likely since it will have a much wider view of things. Esp. when combined > > with PGO. > > > > Specifically; if you have something like: > > > > int idx; > > struct object objs[2]; > > > > the statement: > > > > val = objs[idx & 1].ponies; > > > > which you 'need' to be translated like: > > > > struct object *obj = objs; > > obj += (idx & 1); > > val = obj->ponies; > > > > Such that the load of obj->ponies depends on the load of idx. However > > our dear compiler is allowed to make it: > > > > if (idx & 1) > > obj = &objs[1]; > > else > > obj = &objs[0]; > > > > val = obj->ponies; > > I don't see why a compiler would do such an optimization. Clearly > the second variant is worse than the first, bigger and needs > branch prediction resources. > > In fact compilers usually try hard to go into the other direction > and apply if conversion. > > Has anyone seen real world examples of such changes being done, or is this > all language lawyering theory? I have not seen it myself, but I have heard others claim to. For example, if "idx & 1" had to be computed for some other reason, especially if there was a pre-exiting "if" statement with this as its condition. Or if you have hardware that has a conditional-move instruction. And so on. Do you have a way to guarantee that it never happens? Thanx, Paul > -Andi > > > > > Because C doesn't recognise this as being different. However this is > > utterly broken, because in this translation we can speculate the load > > of obj->ponies such that it no longer depends on the load of idx, which > > breaks RCU. > > > > Note that further 'optimization' is possible and the compiler could even > > make it: > > > > if (idx & 1) > > val = objs[1].ponies; > > else > > val = objs[0].ponies; > > > > Now, granted, this is a fairly artificial example, but it does > > illustrate the exact problem. > > > > The more the compiler can see of the complete program, the more likely > > it can make inferrences like this, esp. when coupled with PGO. > > > > Now, we're (usually) very careful to wrap things in READ_ONCE() and > > rcu_dereference() and the like, which makes it harder on the compiler > > (because 'volatile' is special), but nothing really stops it from doing > > this. > > > > Paul has been trying to beat clue into the language people, but given > > he's been at it for 10 years now, and there's no resolution, I figure we > > ought to get compiler implementations to give us a knob. >