Message ID | 20240728203001.2551083-1-xur@google.com |
---|---|
Headers | show |
Series | Add AutoFDO and Propeller support for Clang build | expand |
On Sun, Jul 28, 2024 at 01:29:53PM -0700, Rong Xu wrote: > Hi, > > This patch series is to integrate AutoFDO and Propeller support into > the Linux kernel. AutoFDO is a profile-guided optimization technique > that leverages hardware sampling to enhance binary performance. > Unlike Instrumentation-based FDO (iFDO), AutoFDO offers a user-friendly > and straightforward application process. While iFDO generally yields > superior profile quality and performance, our findings reveal that > AutoFDO achieves remarkable effectiveness, bringing performance close > to iFDO for benchmark applications. Similar to AutoFDO, Propeller too > utilizes hardware sampling to collect profiles and apply post-link > optimizations to improve the benchmark’s performance over and above > AutoFDO. > > Our empirical data demonstrates significant performance improvements > with AutoFDO and Propeller, up to 10% on microbenchmarks and up to 5% > on large warehouse-scale benchmarks. This makes a strong case for their > inclusion as supported features in the upstream kernel. > > Background > > A significant fraction of fleet processing cycles (excluding idle time) > from data center workloads are attributable to the kernel. Ware-house > scale workloads maximize performance by optimizing the production kernel > using iFDO (a.k.a instrumented PGO, Profile Guided Optimization). > > iFDO can significantly enhance application performance but its use > within the kernel has raised concerns. AutoFDO is a variant of FDO that > uses the hardware’s Performance Monitoring Unit (PMU) to collect > profiling data. While AutoFDO typically yields smaller performance > gains than iFDO, it presents unique benefits for optimizing kernels. > > AutoFDO eliminates the need for instrumented kernels, allowing a single > optimized kernel to serve both execution and profile collection. It also > minimizes slowdown during profile collection, potentially yielding > higher-fidelity profiling, especially for time-sensitive code, compared > to iFDO. Additionally, AutoFDO profiles can be obtained from production > environments via the hardware’s PMU whereas iFDO profiles require > carefully curated load tests that are representative of real-world > traffic. > > AutoFDO facilitates profile collection across diverse targets. > Preliminary studies indicate significant variation in kernel hot spots > within Google’s infrastructure, suggesting potential performance gains > through target-specific kernel customization. > > Furthermore, other advanced compiler optimization techniques, including > ThinLTO and Propeller can be stacked on top of AutoFDO, similar to iFDO. > ThinLTO achieves better runtime performance through whole-program > analysis and cross module optimizations. The main difference between > traditional LTO and ThinLTO is that the latter is scalable in time and > memory. This, > Propeller is a profile-guided, post-link optimizer that improves > the performance of large-scale applications compiled with LLVM. It > operates by relinking the binary based on an additional round of runtime > profiles, enabling precise optimizations that are not possible at > compile time. should be on top somewhere, not hidden away inside a giant wall of text somewhere at the end.
On Sun, Jul 28, 2024 at 01:29:56PM -0700, Rong Xu wrote: > When the -ffunction-sections compiler option is enabled, each function > is placed in a separate section named .text.function_name rather than > putting all functions in a single .text section. > > However, using -function-sections can cause problems with the > linker script. The comments included in include/asm-generic/vmlinux.lds.h > note these issues.: > “TEXT_MAIN here will match .text.fixup and .text.unlikely if dead > code elimination is enabled, so these sections should be converted > to use ".." first.” > > It is unclear whether there is a straightforward method for converting > a suffix to "..". This patch modifies the order of subsections within the > text output section when the -ffunction-sections flag is enabled. > Specifically, it repositions sections with certain fixed patterns (for > example .text.unlikely) before TEXT_MAIN, ensuring that they are grouped > and matched together. > > Note that the limitation arises because the linker script employs glob > patterns instead of regular expressions for string matching. While there > is a method to maintain the current order using complex patterns, this > significantly complicates the pattern and increases the likelihood of > errors. > > Co-developed-by: Han Shen <shenhan@google.com> > Signed-off-by: Han Shen <shenhan@google.com> > Signed-off-by: Rong Xu <xur@google.com> > Suggested-by: Sriraman Tallam <tmsriram@google.com> > Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> > --- > include/asm-generic/vmlinux.lds.h | 19 ++++++++++++++++--- > 1 file changed, 16 insertions(+), 3 deletions(-) > > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h > index 5703526d6ebf..f3de66bda293 100644 > --- a/include/asm-generic/vmlinux.lds.h > +++ b/include/asm-generic/vmlinux.lds.h > @@ -582,9 +582,21 @@ > * during second ld run in second ld pass when generating System.map > * > * TEXT_MAIN here will match .text.fixup and .text.unlikely if dead > - * code elimination is enabled, so these sections should be converted > - * to use ".." first. > + * code elimination or function-section is enabled. Match these symbols > + * first when in these builds. > */ > +#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG) > +#define TEXT_TEXT \ > + *(.text.asan.* .text.tsan.*) \ > + *(.text.unknown .text.unknown.*) \ > + *(.text.unlikely .text.unlikely.*) \ > + ALIGN_FUNCTION(); \ Why leave the above text sections unaligned? > + *(.text.hot .text.hot.*) \ > + *(TEXT_MAIN .text.fixup) \ > + NOINSTR_TEXT \ > + *(.ref.text) \ > + MEM_KEEP(init.text*) > +#else > #define TEXT_TEXT \ > ALIGN_FUNCTION(); \ > *(.text.hot .text.hot.*) \ > @@ -594,7 +606,8 @@ > NOINSTR_TEXT \ > *(.ref.text) \ > *(.text.asan.* .text.tsan.*) \ > - MEM_KEEP(init.text*) \ > + MEM_KEEP(init.text*) > +#endif > > > /* sched.text is aling to function alignment to secure we have same > -- > 2.46.0.rc1.232.g9752f9e123-goog >
On Mon, Jul 29, 2024 at 1:51 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Sun, Jul 28, 2024 at 01:29:53PM -0700, Rong Xu wrote: > > Hi, > > > > This patch series is to integrate AutoFDO and Propeller support into > > the Linux kernel. AutoFDO is a profile-guided optimization technique > > that leverages hardware sampling to enhance binary performance. > > Unlike Instrumentation-based FDO (iFDO), AutoFDO offers a user-friendly > > and straightforward application process. While iFDO generally yields > > superior profile quality and performance, our findings reveal that > > AutoFDO achieves remarkable effectiveness, bringing performance close > > to iFDO for benchmark applications. Similar to AutoFDO, Propeller too > > utilizes hardware sampling to collect profiles and apply post-link > > optimizations to improve the benchmark’s performance over and above > > AutoFDO. > > > > Our empirical data demonstrates significant performance improvements > > with AutoFDO and Propeller, up to 10% on microbenchmarks and up to 5% > > on large warehouse-scale benchmarks. This makes a strong case for their > > inclusion as supported features in the upstream kernel. > > > > Background > > > > A significant fraction of fleet processing cycles (excluding idle time) > > from data center workloads are attributable to the kernel. Ware-house > > scale workloads maximize performance by optimizing the production kernel > > using iFDO (a.k.a instrumented PGO, Profile Guided Optimization). > > > > iFDO can significantly enhance application performance but its use > > within the kernel has raised concerns. AutoFDO is a variant of FDO that > > uses the hardware’s Performance Monitoring Unit (PMU) to collect > > profiling data. While AutoFDO typically yields smaller performance > > gains than iFDO, it presents unique benefits for optimizing kernels. > > > > AutoFDO eliminates the need for instrumented kernels, allowing a single > > optimized kernel to serve both execution and profile collection. It also > > minimizes slowdown during profile collection, potentially yielding > > higher-fidelity profiling, especially for time-sensitive code, compared > > to iFDO. Additionally, AutoFDO profiles can be obtained from production > > environments via the hardware’s PMU whereas iFDO profiles require > > carefully curated load tests that are representative of real-world > > traffic. > > > > AutoFDO facilitates profile collection across diverse targets. > > Preliminary studies indicate significant variation in kernel hot spots > > within Google’s infrastructure, suggesting potential performance gains > > through target-specific kernel customization. > > > > Furthermore, other advanced compiler optimization techniques, including > > ThinLTO and Propeller can be stacked on top of AutoFDO, similar to iFDO. > > ThinLTO achieves better runtime performance through whole-program > > analysis and cross module optimizations. The main difference between > > traditional LTO and ThinLTO is that the latter is scalable in time and > > memory. > > This, > > > Propeller is a profile-guided, post-link optimizer that improves > > the performance of large-scale applications compiled with LLVM. It > > operates by relinking the binary based on an additional round of runtime > > profiles, enabling precise optimizations that are not possible at > > compile time. > > should be on top somewhere, not hidden away inside a giant wall of text > somewhere at the end. Thanks for the suggestion. I'll move it up. Maybe after the first paragraph in Background. Sorry if you received a duplicated message -- I'm resending this in plain text mode. -Rong
On Mon, Jul 29, 2024 at 2:34 AM Peter Zijlstra <peterz@infradead.org> wrote: > > On Sun, Jul 28, 2024 at 01:29:56PM -0700, Rong Xu wrote: > > When the -ffunction-sections compiler option is enabled, each function > > is placed in a separate section named .text.function_name rather than > > putting all functions in a single .text section. > > > > However, using -function-sections can cause problems with the > > linker script. The comments included in include/asm-generic/vmlinux.lds.h > > note these issues.: > > “TEXT_MAIN here will match .text.fixup and .text.unlikely if dead > > code elimination is enabled, so these sections should be converted > > to use ".." first.” > > > > It is unclear whether there is a straightforward method for converting > > a suffix to "..". This patch modifies the order of subsections within the > > text output section when the -ffunction-sections flag is enabled. > > Specifically, it repositions sections with certain fixed patterns (for > > example .text.unlikely) before TEXT_MAIN, ensuring that they are grouped > > and matched together. > > > > Note that the limitation arises because the linker script employs glob > > patterns instead of regular expressions for string matching. While there > > is a method to maintain the current order using complex patterns, this > > significantly complicates the pattern and increases the likelihood of > > errors. > > > > Co-developed-by: Han Shen <shenhan@google.com> > > Signed-off-by: Han Shen <shenhan@google.com> > > Signed-off-by: Rong Xu <xur@google.com> > > Suggested-by: Sriraman Tallam <tmsriram@google.com> > > Suggested-by: Krzysztof Pszeniczny <kpszeniczny@google.com> > > --- > > include/asm-generic/vmlinux.lds.h | 19 ++++++++++++++++--- > > 1 file changed, 16 insertions(+), 3 deletions(-) > > > > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h > > index 5703526d6ebf..f3de66bda293 100644 > > --- a/include/asm-generic/vmlinux.lds.h > > +++ b/include/asm-generic/vmlinux.lds.h > > @@ -582,9 +582,21 @@ > > * during second ld run in second ld pass when generating System.map > > * > > * TEXT_MAIN here will match .text.fixup and .text.unlikely if dead > > - * code elimination is enabled, so these sections should be converted > > - * to use ".." first. > > + * code elimination or function-section is enabled. Match these symbols > > + * first when in these builds. > > */ > > +#if defined(CONFIG_LD_DEAD_CODE_DATA_ELIMINATION) || defined(CONFIG_LTO_CLANG) > > +#define TEXT_TEXT \ > > + *(.text.asan.* .text.tsan.*) \ > > + *(.text.unknown .text.unknown.*) \ > > + *(.text.unlikely .text.unlikely.*) \ > > + ALIGN_FUNCTION(); \ > > Why leave the above text sections unaligned? They are considered cold text. They are not aligned before the change. But I have no objections to making it aligned. (Sorry if you receive a duplicated message. I'm resending this in plain text mode.) > > > + *(.text.hot .text.hot.*) \ > > + *(TEXT_MAIN .text.fixup) \ > > + NOINSTR_TEXT \ > > + *(.ref.text) \ > > + MEM_KEEP(init.text*) > > +#else > > #define TEXT_TEXT \ > > ALIGN_FUNCTION(); \ > > *(.text.hot .text.hot.*) \ > > @@ -594,7 +606,8 @@ > > NOINSTR_TEXT \ > > *(.ref.text) \ > > *(.text.asan.* .text.tsan.*) \ > > - MEM_KEEP(init.text*) \ > > + MEM_KEEP(init.text*) > > +#endif > > > > > > /* sched.text is aling to function alignment to secure we have same > > -- > > 2.46.0.rc1.232.g9752f9e123-goog > >
On Mon, Jul 29, 2024 at 11:48:54AM -0700, Rong Xu wrote: > > defined(CONFIG_LTO_CLANG) > > > +#define TEXT_TEXT \ > > > + *(.text.asan.* .text.tsan.*) \ > > > + *(.text.unknown .text.unknown.*) \ > > > + *(.text.unlikely .text.unlikely.*) \ > > > + ALIGN_FUNCTION(); \ > > > > Why leave the above text sections unaligned? > > > > They are considered cold text. They are not aligned before the change. But > I have no objections to making it aligned. At least x86 has hard assumptions about function alignment always being respected -- see the most horrible games we play with CONFIG_CALL_THUNKS. Or is this only text parts and not actual functions in these sections? In which case we can probably get away with not respecting the function call alignment, although we should probably still respect the branch alignment -- but I forgot if we made use of that :/ > > > > > + *(.text.hot .text.hot.*) \ > > > + *(TEXT_MAIN .text.fixup) \ > > > + NOINSTR_TEXT \ > > > + *(.ref.text) \ > > > + MEM_KEEP(init.text*) > > > +#else > > > #define TEXT_TEXT \ > > > ALIGN_FUNCTION(); \ > > > *(.text.hot .text.hot.*) \ > > > @@ -594,7 +606,8 @@ > > > NOINSTR_TEXT \ > > > *(.ref.text) \ > > > *(.text.asan.* .text.tsan.*) \ > > > - MEM_KEEP(init.text*) \ > > > + MEM_KEEP(init.text*) > > > +#endif > > > > > > > > > /* sched.text is aling to function alignment to secure we have same > > > -- > > > 2.46.0.rc1.232.g9752f9e123-goog > > > > >
On July 30, 2024 1:40:22 AM PDT, Peter Zijlstra <peterz@infradead.org> wrote: >On Mon, Jul 29, 2024 at 11:48:54AM -0700, Rong Xu wrote: > >> > defined(CONFIG_LTO_CLANG) >> > > +#define TEXT_TEXT \ >> > > + *(.text.asan.* .text.tsan.*) \ >> > > + *(.text.unknown .text.unknown.*) \ >> > > + *(.text.unlikely .text.unlikely.*) \ >> > > + ALIGN_FUNCTION(); \ >> > >> > Why leave the above text sections unaligned? >> > >> >> They are considered cold text. They are not aligned before the change. But >> I have no objections to making it aligned. > >At least x86 has hard assumptions about function alignment always being >respected -- see the most horrible games we play with >CONFIG_CALL_THUNKS. > >Or is this only text parts and not actual functions in these sections? >In which case we can probably get away with not respecting the function >call alignment, although we should probably still respect the branch >alignment -- but I forgot if we made use of that :/ > > >> > >> > > + *(.text.hot .text.hot.*) \ >> > > + *(TEXT_MAIN .text.fixup) \ >> > > + NOINSTR_TEXT \ >> > > + *(.ref.text) \ >> > > + MEM_KEEP(init.text*) >> > > +#else >> > > #define TEXT_TEXT \ >> > > ALIGN_FUNCTION(); \ >> > > *(.text.hot .text.hot.*) \ >> > > @@ -594,7 +606,8 @@ >> > > NOINSTR_TEXT \ >> > > *(.ref.text) \ >> > > *(.text.asan.* .text.tsan.*) \ >> > > - MEM_KEEP(init.text*) \ >> > > + MEM_KEEP(init.text*) >> > > +#endif >> > > >> > > >> > > /* sched.text is aling to function alignment to secure we have same >> > > -- >> > > 2.46.0.rc1.232.g9752f9e123-goog >> > > >> > The linker should always enforce the alignment of any input section. If we don't have proper alignment either the linker is broken or we don't have the correct .balign directives in the code – which is the right way to fix this.