Message ID: 20220607120530.2447112-1-tarumizu.kohei@fujitsu.com
Series: Add hardware prefetch control driver for A64FX and x86
On Tue, Jun 07, 2022 at 09:05:24PM +0900, Kohei Tarumizu wrote:
> This patch series add sysfs interface to control CPU's hardware
> prefetch behavior for performance tuning from userspace for the
> processor A64FX and x86 (on supported CPU).

Why does userspace want to even do this?

How will they do this?

What programs will do this?

And why isn't it automatic, and why does this hardware require manual
intervention to work properly?

thanks,

greg k-h
On Tue, Jun 7, 2022 at 2:07 PM Kohei Tarumizu <tarumizu.kohei@fujitsu.com> wrote:

> This patch series add sysfs interface to control CPU's hardware
> prefetch behavior for performance tuning from userspace for the
> processor A64FX and x86 (on supported CPU).

OK

> A64FX and some Intel processors have implementation-dependent register
> for controlling CPU's hardware prefetch behavior. A64FX has
> IMP_PF_STREAM_DETECT_CTRL_EL0[1], and Intel processors have MSR 0x1a4
> (MSR_MISC_FEATURE_CONTROL)[2].

Hardware prefetch (I guess of memory contents) is a memory hierarchy
feature.

Linux has a memory hierarchy manager, conveniently named "mm",
developed by some of the smartest people I know. The main problem it
addresses is paging, but prefetching into the CPU from the next lowest
level in the memory hierarchy is just another memory hierarchy hardware
feature, alongside hard disks, primary RAM etc.

> These registers cannot be accessed from userspace.

Good. The kernel manages the hardware. If the memory hierarchy people
now have userspace doing stuff behind their back, through some special
interface, that makes their world more complicated.

This looks like it needs information from the generic memory manager,
from the scheduler, and possibly all the way down from the block layer
to do the right thing, so it has no business in userspace.

Have you seen mm/damon, for example? Access to statistics for memory
access patterns seems really useful for tuning the behaviour of this
hardware. Just my €0.01.

If it does interact with userspace, I suppose it should be using
control groups, like everything else of this type (see e.g.
mm/memcontrol.c), not custom sysfs files.

Just an example from one of the patches:

+ - "* Adjacent Cache Line Prefetcher Disable (R/W)"
+   corresponds to the "adjacent_cache_line_prefetcher_enable"

I might only be at the "a little knowledge is dangerous" level on
memory manager topics, but I know for sure that they at times adjust
the members of structs to fit nicely into cache lines. And now this? It
looks really useful for kernel machinery that knows very well what
needs to go into the cache line next and when.

Talk to the people on linux-mm and memory maintainer Andrew Morton
about how to do this right; it's a really interesting feature! Also,
given that people say the memory hierarchy is an important part of the
performance of the Apple M1 (M2) silicon, I expect those machines to
have this too?

Yours,
Linus Walleij
Thanks for the comment.

> Why does userspace want to even do this?

This is because the optimal settings may differ from application to
application. Examples of performance improvements for applications with
simple memory access characteristics are described in the [merit]
section. However, some applications have complex characteristics, so it
is difficult to predict whether an application will improve without
actually trying it out.

This is not necessary for all applications. However, I want to provide
a minimal interface that can be used by those who want to improve their
application even a little.

> How will they do this?

I assume it will be used to tune a specific core and then execute an
application on that core. An example is as follows.

1) The user tunes the parameters of a specific core before executing
   the program.

```
# echo 1024 > /sys/devices/system/cpu/cpu12/cache/index0/prefetch_control/stream_detect_prefetcher_dist
# echo 1024 > /sys/devices/system/cpu/cpu12/cache/index2/prefetch_control/stream_detect_prefetcher_dist
# echo 1024 > /sys/devices/system/cpu/cpu13/cache/index0/prefetch_control/stream_detect_prefetcher_dist
# echo 1024 > /sys/devices/system/cpu/cpu13/cache/index2/prefetch_control/stream_detect_prefetcher_dist
```

2) Execute the program bound to the target cores.

```
# taskset -c 12-13 a.out
```

If the interface is exposed, the user can develop a library to execute
operations 1) and 2) instead.

> What programs will do this?

It is assumed to be used by programs that perform many contiguous
memory accesses. It may be useful for other applications as well, but I
can't explain them in detail right away.

> And why isn't just automatic and why does this hardware require manual
> intervention to work properly?

It is difficult for the hardware to determine the optimal parameters in
advance. I think that is why the register is provided to change the
behavior of the hardware.
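The two steps above could be wrapped in a small helper so that tuning and pinning happen together. A minimal sketch (the sysfs paths are the ones proposed by this series; the helper names and the `PREFETCH_DRY_RUN` flag are illustrative additions for this sketch, not part of the patches):

```shell
# Sketch of a wrapper for steps 1) and 2): write the stream-detect
# prefetcher distance for every core in a range, then run the program
# pinned to those cores. PREFETCH_DRY_RUN=1 only prints the writes, so
# the logic can be exercised without the proposed sysfs files present.

set_dist() {
    # $1 = cpu number, $2 = distance in bytes
    for index in index0 index2; do
        f="/sys/devices/system/cpu/cpu$1/cache/$index/prefetch_control/stream_detect_prefetcher_dist"
        if [ "${PREFETCH_DRY_RUN:-0}" = 1 ]; then
            echo "would write $2 to $f"
        else
            echo "$2" > "$f"
        fi
    done
}

prefetch_run() {
    # $1 = distance, $2 = first cpu, $3 = last cpu, remainder = command
    dist="$1"; first="$2"; last="$3"; shift 3
    cpu="$first"
    while [ "$cpu" -le "$last" ]; do
        set_dist "$cpu" "$dist"
        cpu=$((cpu + 1))
    done
    if [ "${PREFETCH_DRY_RUN:-0}" != 1 ]; then
        taskset -c "$first-$last" "$@"
    fi
}

# Equivalent of the example in the mail, without touching sysfs:
PREFETCH_DRY_RUN=1 prefetch_run 1024 12 13 ./a.out
```

A library, as suggested in the mail, would essentially be this plus error handling and a check that the prefetch_control directory actually exists on the running hardware.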
On Tue, Jun 14, 2022 at 11:55:39AM +0000, tarumizu.kohei@fujitsu.com wrote:
> Thanks for the comment.
>
> > Why does userspace want to even do this?
>
> This is because the optimal settings may differ from application to
> application.

That's not ok. Linux is a "general purpose" operating system and needs
to work well for all applications. Doing application-specific tuning
based on the specific hardware like this is a nightmare for users, and
will be for you, as you will now have to support this specific model to
work correctly on all future kernel releases for the next 20+ years.
Are you willing to do that?

> Examples of performance improvements for applications with simple
> memory access characteristics are described in [merit] section.
> However, some applications have complex characteristics, so it is
> difficult to predict if an application will improve without actually
> trying it out.

Then perhaps it isn't anything that they should try out :)

Shouldn't the kernel know how the application works (based on the
resources it asks for) and tune itself based on that automatically?

If not, how is a user supposed to know how to do this?

> This is not necessary for all applications. However, I want to provide
> as a minimal interface that can be used by those who want to improve
> their application even a little.
>
> > How will they do this?
>
> I assume to be used to tune a specific core and execute an application
> on that core. The execution example is as follows.
>
> 1) The user tunes the parameters of a specific core before executing
> the program.
>
> ```
> # echo 1024 > /sys/devices/system/cpu/cpu12/cache/index0/prefetch_control/stream_detect_prefetcher_dist
> # echo 1024 > /sys/devices/system/cpu/cpu12/cache/index2/prefetch_control/stream_detect_prefetcher_dist
> # echo 1024 > /sys/devices/system/cpu/cpu13/cache/index0/prefetch_control/stream_detect_prefetcher_dist
> # echo 1024 > /sys/devices/system/cpu/cpu13/cache/index2/prefetch_control/stream_detect_prefetcher_dist
> ```

What is "1024" here? Where is any of this documented? And why these
specific sysfs files and not others?

> 2) Execute the program bound to the target core.
>
> ```
> # taskset -c 12-13 a.out
> ```
>
> If the interface is exposed, the user can develop a library to execute
> 1) and 2) operation instead.

If you have no such user today, nor a library, how do you know any of
this works well?

And again, how will you support this going forward? Or is this specific
api only going to be for one specific piece of hardware and never any
future ones?

> > What programs will do this?
>
> It is assumed to be used by programs that execute many continuous
> memory access. It may be useful for other applications, but I can't
> explain them in detail right away.

So you haven't tested this on any real applications? We need real users
before being able to add new apis. Otherwise we can just remove the
apis :)

> > And why isn't just automatic and why does this hardware require manual
> > intervention to work properly?
>
> It is difficult for the hardware to determine the optimal parameters
> in advance. Therefore, I think that the register is provided to change
> the behavior of the hardware.

Kernel programming for a general purpose operating system is hard, but
it is possible :)

good luck!

greg k-h
Hi Linus,

Thanks for the comment.

> OK
>
> > A64FX and some Intel processors have implementation-dependent register
> > for controlling CPU's hardware prefetch behavior. A64FX has
> > IMP_PF_STREAM_DETECT_CTRL_EL0[1], and Intel processors have MSR 0x1a4
> > (MSR_MISC_FEATURE_CONTROL)[2].
>
> Hardware prefetch (I guess of memory contents) is a memory hierarchy
> feature.
>
> Linux has a memory hierarchy manager, conveniently named "mm",
> developed by some of the smartest people I know. The main problem
> addressed by that is paging, but prefetching into the CPU from the
> next lowest level in the memory hierarchy is just another memory
> hierarchy hardware feature, such as hard disks, primary RAM etc.
>
> > These registers cannot be accessed from userspace.
>
> Good. The kernel managed hardware. If the memory hierarchy people have
> userspace now doing stuff behind their back, through some special
> interface, that makes their world more complicated.
>
> This looks like it needs information from the generic memory manager,
> from the scheduler, and possibly all the way down from the block layer
> to do the right thing, so it has no business in userspace.
>
> Have you seen mm/damon for example? Access to statistics for memory
> access patterns seems really useful for tuning the behaviour of this
> hardware. Just my €0.01.

Thank you for the information. I will see if mm/damon statistics can be
used for tuning.

> If it does interact with userspace I suppose it should be using
> control groups, like everything else of this type, see e.g.
> mm/memcontrol.c, not custom sysfs files.

Hardware prefetch registers exist for each core, and the settings are
independent for each cache. That is why I currently create the files
under /sys/devices/system/cpu/cpu*/cache/index*. However, when users
actually configure this for an application, they may want to set it on
a per-process basis. Considering that, I think control groups are
suitable for this usage.

For example, is your idea of the interface something like the
following?

```
/sys/fs/cgroup/memory/memory.hardware_prefetcher.enable
```

The cpuset controller has information about which CPUs a process
belonging to a group is bound to, so maybe the cpuset controller is
more appropriate.

Control groups have a hierarchical structure, so it is necessary to
consider whether they can map hardware prefetch behavior well.
Currently I have two concerns.

First, an upper hierarchy may contain the same CPU as a lower
hierarchy. In this case, it may not be possible to configure
independent settings in each hierarchy.

Next, context switch considerations. This function rewrites the value
of a register that exists per core. Therefore, the register value must
be changed at the time of a context switch to a process belonging to a
different group.

> Just an example from one of the patches:
>
> + - "* Adjacent Cache Line Prefetcher Disable (R/W)"
> +   corresponds to the "adjacent_cache_line_prefetcher_enable"
>
> I might only be on "a little knowledge is dangerous" on the memory
> manager topics, but I know for sure that they at times adjust the
> members of structs to fit nicely on cache lines. And now this? It
> looks really useful for kernel machinery that know very well what
> needs to go into the cache line next and when.
>
> Talk to the people on linux-mm and memory maintainer Andrew Morton on
> how to do this right, it's a really interesting feature! Also given
> that people say that the memory hierarchy is an important part in the
> performance of the Apple M1 (M2) silicon, I expect that machine to
> have this too?

I think this proposal will be useful for users, so I will proceed with
concrete studies and talk to the MM people.
Hi Greg,

> That's not ok. Linux is a "general purpose" operating system and
> needs to work well for all applications. Doing
> application-specific-tuning based on the specific hardware like this
> is a nightmare for users,

Hardware prefetch is enabled by default on x86 and A64FX, and many
applications perform well without changing the register settings. This
feature is for those applications whose performance users want to
improve further.

In particular, A64FX's hardware prefetch control is used for HPC
applications. Users running HPC applications need to improve
performance as much as possible, and this feature is useful for them.

For that reason, some A64FX machines have their own driver that
controls hardware prefetch behavior. That driver is built into the
software products for A64FX and cannot be used without a purchase. I
want to make this feature available to people who want to improve
performance without purchasing the product.

Its use is limited and depends on the characteristics of the
application. Doesn't this still fit "general purpose"?

> and will be for you as you will now have to support this specific
> model to work correctly on all future kernel releases for the next
> 20+ years.  Are you willing to do that?

Rather than having this API rely on a specific model, I want to make it
generally available. It may not be there yet, but I am willing to
support this if I can make it a community-approved interface.

> Then perhaps it isn't anything that they should try out :)
>
> Shouldn't the kernel know how the application works (based on the
> resources it asks for) and tune itself based on that automatically?
>
> If not, how is a user supposed to know how to do this?

It would be useful for users if this could be done automatically by the
kernel. I will consider whether there is anything I can do using
statistical information.

> What is "1024" here? Where is any of this documented?

This parameter is the distance, in bytes, between the memory address
the program is currently accessing and the memory address fetched by
the hardware prefetcher. My documentation in sysfs-devices-system-cpu
does not specify what the distance means, so I will add it.

For reference, the hardware prefetch details are described in
A64FX_Microarchitecture_Manual_en_1.7.pdf at
"https://github.com/fujitsu/A64FX/tree/master/doc".

> And why these specific sysfs files and not others?

I wanted to show an example of changing only the hardware prefetch
distance. There is no special reason not to use the other sysfs files.

> If you have no such user today, nor a library, how do you know any of
> this works well?

The prefetch control function included in the software product for
A64FX performs a similar operation, and it works well.

> And again, how will you support this going forward?  Or is this
> specific api only going to be for one specific piece of hardware and
> never any future ones?

In order to make the interface widely usable in the future, I will
consider a different specification from the current one. For example,
the control groups approach that Linus proposed is one option.

> So you haven't tested this on any real applications? We need real
> users before being able to add new apis. Otherwise we can just remove
> the apis :)

At least some A64FX users use this behavior. However, I currently don't
know which applications benefit or by how much. I will try to get an
application that is actually used and confirm that the feature is
effective.

> Kernel programming for a general purpose operating system is hard,
> but it is possible :)

I will try to do kernel programming for a general purpose operating
system.
On Fri, Jun 17, 2022 at 11:21 AM tarumizu.kohei@fujitsu.com
<tarumizu.kohei@fujitsu.com> wrote:

Jumping in here.

> Hi Greg,
>
> > That's not ok. Linux is a "general purpose" operating system and
> > needs to work well for all applications. Doing
> > application-specific-tuning based on the specific hardware like
> > this is a nightmare for users,
>
> Hardware prefetch behavior is enabled by default in x86 and A64FX.
> Many applications can perform well without changing the register
> setting. Use this feature for some applications that want to be
> improved performance.

The right way to solve this is to make the Linux kernel contain the
necessary heuristics to identify which tasks and thus cores need this
to improve efficiency, and then apply it automatically.

Putting it in userspace is making a human do a machine's job, which
isn't sustainable.

By putting the heuristics in kernelspace, Linux will improve
performance also on workloads the human operator didn't think of, as
the machine will detect them from statistical or other behaviour
patterns.

Yours,
Linus Walleij
Hi Linus,

> The right way to solve this is to make the Linux kernel contain the
> necessary heuristics to identify which tasks and thus cores need this
> to improve efficiency and then apply it automatically.
>
> Putting it in userspace is making a human do a machine's job which
> isn't sustainable.
>
> By putting the heuristics in kernelspace Linux will improve
> performance also on workloads the human operator didn't think of as
> the machine will detect them from statistical or other behaviour
> patterns.

In order to put the heuristics into the kernel, I think it is necessary
to consider the following two points.

1) Which cores are tied to the process?
This is different from the cores on which the process is allowed to
run. It probably needs to be combined with some CPU resource limit to
avoid affecting non-target processes.

2) How do we derive the value to set in the register?
It is necessary to verify whether an appropriate setting can be derived
from statistical information, etc. In addition, the cost of automatic
derivation must not exceed the improvement it provides.

I don't have a prospect for resolving these issues yet. I will continue
to consider them.
On 6/27/22 02:36, Linus Walleij wrote:
> The right way to solve this is to make the Linux kernel contain the
> necessary heuristics to identify which tasks and thus cores need this
> to improve efficiency and then apply it automatically.

I agree in theory. But, I also want a pony in theory.

Any suggestions for how to do this in the real world?

Otherwise, I'm inclined to say that this series incrementally makes
things better in the real world by at least moving folks away from
wrmsr(1).
>> The right way to solve this is to make the Linux kernel contain the
>> necessary heuristics to identify which tasks and thus cores need this
>> to improve efficiency and then apply it automatically.
>>
>> Putting it in userspace is making a human do a machines job which
>> isn't sustainable.
>>
>> By putting the heuristics in kernelspace Linux will improve
>> performance also on workloads the human operator didn't think of as
>> the machine will detect them from statictical or other behaviour
>> patterns.
>
> In order to put the heuristics into kernelspace Linux, I think it
> necessary to consider the following two points.
>
> 1) Which cores are tied with the process?
> This is different from the core on which the process can run. It
> probably need to combine some CPU resource limit to avoid affecting
> non-target processes.
>
> 2) How to derive the value to set in the register?
> It is necessary to verify whether an appropriate set value can be
> derived using statistical information, etc. In addition, to prevent
> the cost of automatic derivation from exceeding the value that would
> be improved by it.
>
> I don't have a prospect for resolving these issues yet. I will
> continue these considerations.

Another approach would be to make the set of prefetch settings a task
attribute, and set them in the context switch code when the process is
about to run on a CPU.

But that assumes you can cheaply change the attributes. If doing so
requires multiple MSR writes (on x86), it might be a non-starter.

-Tony
On Tue, Jun 28, 2022 at 5:47 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 6/27/22 02:36, Linus Walleij wrote:
> > The right way to solve this is to make the Linux kernel contain the
> > necessary heuristics to identify which tasks and thus cores need this
> > to improve efficiency and then apply it automatically.
>
> I agree in theory. But, I also want a pony in theory.
>
> Any suggestions for how to do this in the real world?

Well, if the knobs are exposed to userspace, how do people using these
knobs know when to turn them? A profiler? perf? All that data is
available to the kernel too. The memory access pattern statistics from
mm/damon were what I suggested as a starting point.

We have pretty elaborate heuristics in the kernel to identify the
behaviour of processes; one example is the BFQ block scheduler, which
determines I/O priority weights of processes based on how interactive
they are.

If we can determine things like that, I am pretty sure we can determine
how compute-intensive a task is, for example by using memory access
statistics and scheduler information: if the process is constantly
READY to run over a few context switches and the PC also stays in a
certain range of memory, like two adjacent pages, then it is probably
running a tight compute kernel, if that is what we need to know here.
It doesn't seem too far-fetched?

We have the performance counters as well. It should be possible to
utilize those to get even more precise heuristics? Maybe that is what
userspace is using to determine this already.

I'm not saying there has to be a simple solution, but maybe there is
something like a really complicated solution? We have academic
researchers who like to look at things like this.

> Otherwise, I'm inclined to say that this series incrementally makes
> things better in the real world by at least moving folks away from
> wrmsr(1).

I don't know if yet another ABI that needs to be maintained helps the
situation much; it's just a contract that we will have to maintain for
no gain. However, if userspace is messing with that register behind our
back and we know better, we can just overwrite it with the policy we
determine is better in the kernel.

Yours,
Linus Walleij
On 6/28/22 13:20, Linus Walleij wrote:
> On Tue, Jun 28, 2022 at 5:47 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> On 6/27/22 02:36, Linus Walleij wrote:
>>> The right way to solve this is to make the Linux kernel contain the
>>> necessary heuristics to identify which tasks and thus cores need
>>> this to improve efficiency and then apply it automatically.
>>
>> I agree in theory. But, I also want a pony in theory.
>>
>> Any suggestions for how to do this in the real world?
>
> Well if the knobs are exposed to userspace, how do people using
> these knobs know when to turn them? A profiler? perf? All that
> data is available to the kernel too.

They run their fortran app. Change the MSRs. Run it again. See if it
simulated the nuclear weapon blast any faster or slower. Rinse.
Repeat.

One thing that is missing from the changelog and cover letter here: on
x86, there's a 'wrmsr(1)' tool. That tool pokes at Model Specific
Registers (MSRs) via the /dev/cpu/X/msr interface, which is a very,
very thinly-veiled wrapper around the WRMSR (WRite MSR) instruction.

In other words, on x86, our current interface allows userspace programs
to arbitrarily poke at our most sensitive hardware configuration
registers. One of the most common reasons users have reported doing
this (we have pr_warn()ings about it) is controlling the prefetch
hardware.

This interface would take a good chunk of the x86 wrmsr(1) audience and
convert them over to a less dangerous interface. That's a win on x86.
We don't even *remotely* have line-of-sight for a generic solution for
the kernel to figure out a single "best" value for these registers.
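Since the wrmsr(1) audience mentioned above is poking MSR 0x1a4 by hand, it may help to spell out what they are toggling. A small decode sketch (bit meanings follow Intel's published layout of MSR_MISC_FEATURE_CONTROL; the shell helper itself is illustrative, not an existing tool):

```shell
# Decode the four prefetcher-disable bits of MSR 0x1a4
# (MSR_MISC_FEATURE_CONTROL) from a raw value, e.g. one read with
# "rdmsr 0x1a4". A set bit means the corresponding prefetcher is
# DISABLED.
decode_misc_feature_control() {
    val=$(($1))
    for entry in \
        "0:L2 hardware prefetcher" \
        "1:L2 adjacent cache line prefetcher" \
        "2:DCU (L1 data) prefetcher" \
        "3:DCU IP prefetcher"
    do
        bit=${entry%%:*}
        name=${entry#*:}
        if [ $(( (val >> bit) & 1 )) -eq 1 ]; then
            echo "$name: disabled"
        else
            echo "$name: enabled"
        fi
    done
}

decode_misc_feature_control 0x5   # bits 0 and 2 set
```

The proposed sysfs files expose the same four knobs per core, with names instead of raw bit positions.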
On Tue, Jun 28, 2022 at 11:02 PM Dave Hansen <dave.hansen@intel.com> wrote:
> On 6/28/22 13:20, Linus Walleij wrote:
> >
> > Well if the knobs are exposed to userspace, how do people using
> > these knobs know when to turn them? A profiler? perf? All that
> > data is available to the kernel too.
>
> They run their fortran app. Change the MSRs. Run it again. See if it
> simulated the nuclear weapon blast any faster or slower. Rinse.
> Repeat.

That sounds like a schoolbook definition of the trial-and-error method:
https://en.wikipedia.org/wiki/Trial_and_error

That's fair. But these people really need a better hammer.

> This interface would take a good chunk of the x86 wrmsr(1) audience
> and convert them over to a less dangerous interface. That's a win on
> x86.  We don't even *remotely* have line-of-sight for a generic
> solution for the kernel to figure out a single "best" value for these
> registers.

Maybe less dangerous for them, but maybe more dangerous for the kernel
community, which signs up to maintain the behaviour of that interface
perpetually.

Yours,
Linus Walleij
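The rinse-and-repeat loop Dave describes can at least be mechanized. A sketch of such a trial-and-error sweep; `run_with_dist` is a stub standing in for "write the distance to the sysfs files, then time the pinned workload", faking 1024 as the sweet spot so the sweep logic can be run anywhere:

```shell
# Sweep candidate prefetch distances, time the workload for each,
# and keep the best. Only the driver loop is real; the measurement
# is stubbed.
run_with_dist() {
    # Real use would be roughly:
    #   echo "$1" into the stream_detect_prefetcher_dist files, then
    #   /usr/bin/time -f %e taskset -c 12-13 ./a.out
    case "$1" in
        1024) echo 10.0 ;;
        *)    echo 12.5 ;;
    esac
}

best_dist=""
best_time=""
for dist in 256 512 1024 2048; do
    t=$(run_with_dist "$dist")
    # floating-point compare via awk, since sh arithmetic is integer-only
    if [ -z "$best_time" ] || awk "BEGIN{exit !($t < $best_time)}"; then
        best_time="$t"
        best_dist="$dist"
    fi
done
echo "best distance: $best_dist (${best_time}s)"
```

This is exactly the kind of per-application search the thread argues over: trivial for a user to script against the proposed sysfs files, but hard for the kernel to do blindly for every workload.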
Hi Tony,

Thanks for the comment.

> Another approach would be to make the set of prefetch settings a task
> attribute.  Then set them in the context switch code when the process
> is about to run on a CPU.
>
> But that assumes you can cheaply change the attributes. If doing so
> requires multiple MSR writes (on x86) it might be a non-starter.

On x86 and A64FX, all the parameters controlling hardware prefetch are
contained in a single register. The current specification makes each
parameter a separate attribute, so we need as many writes as there are
parameters to change. However, by merging multiple parameter changes
before the context switch, the attributes can be applied with one MSR
write per core.
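The single-write idea above can be sketched by folding the per-parameter settings into one register value before writing. A minimal sketch for the x86 case (bit positions follow Intel's documented MSR_MISC_FEATURE_CONTROL layout; the helper name is illustrative):

```shell
# Compose a single MSR_MISC_FEATURE_CONTROL (0x1a4) value from four
# enable flags, so all parameters can be applied with one MSR write at
# context-switch time. The hardware bits are *disable* bits, hence the
# inversion.
compose_misc_feature_control() {
    # $1..$4: enable flags (1 = keep prefetcher enabled) for:
    # L2 HW prefetcher, L2 adjacent line, DCU, DCU IP
    val=0
    bit=0
    for en in "$1" "$2" "$3" "$4"; do
        if [ "$en" -eq 0 ]; then
            val=$((val | (1 << bit)))   # disabled -> set the disable bit
        fi
        bit=$((bit + 1))
    done
    printf '0x%x\n' "$val"
}

compose_misc_feature_control 1 1 1 1   # everything enabled  -> 0x0
compose_misc_feature_control 0 0 0 0   # everything disabled -> 0xf
```

The kernel-side equivalent would cache this composed value in the task attribute and issue one wrmsr when the task is scheduled in, as Tony suggests.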
Hi Dave,

> They run their fortran app. Change the MSRs. Run it again. See if it
> simulated the nuclear weapon blast any faster or slower. Rinse.
> Repeat.
>
> One thing that is missing from the changelog and cover letter here:
> On x86, there's a 'wrmsr(1)' tool. That took pokes at Model Specific
> Registers (MSRs) via the /dev/cpu/X/msr interface. That interface is
> a very, very thinly-veiled wrapper around the WRMSR (WRite MSR)
> instruction.
>
> In other words, on x86, our current interface allows userspace
> programs to arbitrarily poke at our most sensitive hardware
> configuration registers. One of the most common reasons users have
> reported doing this (we have pr_warn()ings about it) is controlling
> the prefetch hardware.
>
> This interface would take a good chunk of the x86 wrmsr(1) audience
> and convert them over to a less dangerous interface. That's a win on
> x86.  We don't even *remotely* have line-of-sight for a generic
> solution for the kernel to figure out a single "best" value for these
> registers.

Thank you for mentioning the wrmsr tool. This is one of the reasons why
I want to add the sysfs interface. I will add a description to the
cover letter that this interface can be used instead of the wrmsr tool
(or the MSR driver) for hardware prefetch control.

I also read below that we should not access any MSR directly from
userspace without restriction:
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about/