Message ID: 20241121185315.3416855-1-mizhang@google.com
Series: KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs
On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> (250 Hz by default) to measure their effective CPU frequency. To avoid
> the overhead of intercepting these frequent MSR reads, allow the guest
> to read them directly by loading guest values into the hardware MSRs.
>
> These MSRs are continuously running counters whose values must be
> carefully tracked during all vCPU state transitions:
> - Guest IA32_APERF advances only during guest execution

That's not what this series does though. Guest APERF advances while the vCPU is loaded by KVM_RUN, which is *very* different than letting APERF run freely only while the vCPU is actively executing in the guest.

E.g. a vCPU that is memory oversubscribed via zswap will account a significant amount of CPU time in APERF when faulting in swapped memory, whereas traditional file-backed swap will not due to the task being scheduled out while waiting on I/O.

In general, the "why" of this series is missing. What are the use cases you are targeting? What are the exact semantics you want to define? *Why* did you propose those exact semantics?

E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that requires userspace exits will not. It's not necessarily wrong for heavy userspace I/O to cause observed frequency to drop, but it's not obviously correct either.

The use cases matter a lot for APERF/MPERF, because trying to reason about what's desirable for an oversubscribed setup requires a lot more work than defining semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or less just partitioned. Not to mention the complexity for trying to support all potential use cases is likely quite a bit higher.

And if the use case is specifically for slice-of-hardware, hard pinned/partitioned VMs, does it matter if the host's view of APERF/MPERF is not accurately captured at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only workloads running on CPUs should be vCPUs. It's not clear to me that observing the guest utilization is outright wrong in that case.

One idea for supporting APERF/MPERF in KVM would be to add a kernel param to disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough APERF/MPERF if and only if the feature is supported in hardware, but hidden from the kernel. I.e. let the system admin gift APERF/MPERF to KVM.

> - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
>   in C0 state, even when not actively running
> - Host kernel access is redirected through get_host_[am]perf() which
>   adds per-CPU offsets to the hardware MSR values
> - Remote MSR reads through /dev/cpu/*/msr also account for these
>   offsets
>
> Guest values persist in hardware while the vCPU is loaded and
> running. Host MSR values are restored on vcpu_put (either at KVM_RUN
> completion or when preempted) and when transitioning to halt state.
>
> Note that guest TSC scaling via KVM_SET_TSC_KHZ is not supported, as
> it would require either intercepting MPERF reads on Intel (where MPERF
> ticks at host rate regardless of guest TSC scaling) or significantly
> complicating the cycle accounting on AMD.
>
> The host must have both CONSTANT_TSC and NONSTOP_TSC capabilities
> since these ensure stable TSC frequency across C-states and P-states,
> which is required for accurate background MPERF accounting.

...
>  arch/x86/include/asm/kvm_host.h  |  11 ++
>  arch/x86/include/asm/topology.h  |  10 ++
>  arch/x86/kernel/cpu/aperfmperf.c |  65 +++++++++++-
>  arch/x86/kvm/cpuid.c             |  12 ++-
>  arch/x86/kvm/governed_features.h |   1 +
>  arch/x86/kvm/lapic.c             |   5 +-
>  arch/x86/kvm/reverse_cpuid.h     |   6 ++
>  arch/x86/kvm/svm/nested.c        |   2 +-
>  arch/x86/kvm/svm/svm.c           |   7 ++
>  arch/x86/kvm/svm/svm.h           |   2 +-
>  arch/x86/kvm/vmx/nested.c        |   2 +-
>  arch/x86/kvm/vmx/vmx.c           |   7 ++
>  arch/x86/kvm/vmx/vmx.h           |   2 +-
>  arch/x86/kvm/x86.c               | 171 ++++++++++++++++++++++++++++---
>  arch/x86/lib/msr-smp.c           |  11 ++
>  drivers/cpufreq/amd-pstate.c     |   4 +-
>  drivers/cpufreq/intel_pstate.c   |   5 +-
>  17 files changed, 295 insertions(+), 28 deletions(-)
>
>
> base-commit: 0a9b9d17f3a781dea03baca01c835deaa07f7cc3
> --
> 2.47.0.371.ga323438b13-goog
>
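[For orientation, here is a minimal sketch of the host-side redirection the cover letter describes: host readers of IA32_APERF/IA32_MPERF pick up a per-CPU offset maintained by KVM, so the hardware MSRs can hold guest values while a vCPU is loaded. The variable names and exact shape are illustrative assumptions, not the code from this series.]

	/*
	 * Illustrative sketch only -- not the patch itself.  Host reads go
	 * through accessors that add a per-CPU offset maintained by KVM, so
	 * the host's view keeps advancing while the hardware MSRs hold guest
	 * values.  In this sketch the offsets are non-zero only while guest
	 * values are live in hardware.
	 */
	#include <linux/percpu.h>
	#include <asm/msr.h>

	static DEFINE_PER_CPU(u64, host_aperf_offset);	/* hypothetical names */
	static DEFINE_PER_CPU(u64, host_mperf_offset);

	static inline u64 get_host_aperf(void)
	{
		u64 aperf;

		rdmsrl(MSR_IA32_APERF, aperf);		/* raw hardware count */
		return aperf + this_cpu_read(host_aperf_offset);
	}

	static inline u64 get_host_mperf(void)
	{
		u64 mperf;

		rdmsrl(MSR_IA32_MPERF, mperf);
		return mperf + this_cpu_read(host_mperf_offset);
	}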
On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote: > > On Thu, Nov 21, 2024, Mingwei Zhang wrote: > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick > > (250 Hz by default) to measure their effective CPU frequency. To avoid > > the overhead of intercepting these frequent MSR reads, allow the guest > > to read them directly by loading guest values into the hardware MSRs. > > > > These MSRs are continuously running counters whose values must be > > carefully tracked during all vCPU state transitions: > > - Guest IA32_APERF advances only during guest execution > > That's not what this series does though. Guest APERF advances while the vCPU is > loaded by KVM_RUN, which is *very* different than letting APERF run freely only > while the vCPU is actively executing in the guest. > > E.g. a vCPU that is memory oversubscribed via zswap will account a significant > amount of CPU time in APERF when faulting in swapped memory, whereas traditional > file-backed swap will not due to the task being scheduled out while waiting on I/O. Are you saying that APERF should stop completely outside of VMX non-root operation / guest mode? While that is possible, the overhead would be significantly higher...probably high enough to make it impractical. > In general, the "why" of this series is missing. What are the use cases you are > targeting? What are the exact semantics you want to define? *Why* did are you > proposed those exact semantics? I get the impression that the questions above are largely rhetorical, and that you would not be happy with the answers anyway, but if you really are inviting a version 2, I will gladly expound upon the why. > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that > requires userspace exits will not. It's not necessarily wrong for heavy userspace > I/O to cause observed frequency to drop, but it's not obviously correct either. > > The use cases matter a lot for APERF/MPERF, because trying to reason about what's > desirable for an oversubscribed setup requires a lot more work than defining > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or > less just partitioned. Not to mention the complexity for trying to support all > potential use cases is likely quite a bit higher. > > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured > at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only > workloads running on CPUs should be vCPUs. It's not clear to me that observing > the guest utilization is outright wrong in that case. My understanding is that Google Cloud customers have been asking for this feature for all manner of VM families for years, and most of those VM families are not slice-of-hardware, since we just launched our first such offering a few months ago. > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough > APERF/MPERF if and only if the feature is supported in hardware, but hidden from > the kernel. I.e. let the system admin gift APERF/MPERF to KVM. Part of our goal has been to enable guest APERF/MPERF without impacting the use of host APERF/MPERF, since one of the first things our support teams look at in response to a performance complaint is the effective frequencies of the CPUs as reported on the host. 
I can explain all of this in excruciating detail, but I'm not really motivated by your initial response, which honestly seems a bit hostile. At least you looked at the code, which is a far warmer reception than I usually get.
On Tue, Dec 03, 2024, Jim Mattson wrote: > On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote: > > > > On Thu, Nov 21, 2024, Mingwei Zhang wrote: > > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick > > > (250 Hz by default) to measure their effective CPU frequency. To avoid > > > the overhead of intercepting these frequent MSR reads, allow the guest > > > to read them directly by loading guest values into the hardware MSRs. > > > > > > These MSRs are continuously running counters whose values must be > > > carefully tracked during all vCPU state transitions: > > > - Guest IA32_APERF advances only during guest execution > > > > That's not what this series does though. Guest APERF advances while the vCPU is > > loaded by KVM_RUN, which is *very* different than letting APERF run freely only > > while the vCPU is actively executing in the guest. > > > > E.g. a vCPU that is memory oversubscribed via zswap will account a significant > > amount of CPU time in APERF when faulting in swapped memory, whereas traditional > > file-backed swap will not due to the task being scheduled out while waiting on I/O. > > Are you saying that APERF should stop completely outside of VMX > non-root operation / guest mode? > While that is possible, the overhead would be significantly > higher...probably high enough to make it impractical. No, I'm simply pointing out that the cover letter is misleading/inaccurate. > > In general, the "why" of this series is missing. What are the use cases you are > > targeting? What are the exact semantics you want to define? *Why* did are you > > proposed those exact semantics? > > I get the impression that the questions above are largely rhetorical, and Nope, not rhetorical, I genuinely want to know. I can't tell if ya'll thought about the side effects of things like swap and emulated I/O, and if you did, what made you come to the conclusion that the "best" boundary is on sched_out() and return to userspace. > that you would not be happy with the answers anyway, but if you really are > inviting a version 2, I will gladly expound upon the why. No need for a new version at this time, just give me the details. > > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that > > requires userspace exits will not. It's not necessarily wrong for heavy userspace > > I/O to cause observed frequency to drop, but it's not obviously correct either. > > > > The use cases matter a lot for APERF/MPERF, because trying to reason about what's > > desirable for an oversubscribed setup requires a lot more work than defining > > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or > > less just partitioned. Not to mention the complexity for trying to support all > > potential use cases is likely quite a bit higher. > > > > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned > > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured > > at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only > > workloads running on CPUs should be vCPUs. It's not clear to me that observing > > the guest utilization is outright wrong in that case. > > My understanding is that Google Cloud customers have been asking for this > feature for all manner of VM families for years, and most of those VM > families are not slice-of-hardware, since we just launched our first such > offering a few months ago. 
But do you actually want to expose APERF/MPERF to those VMs? With my upstream hat on, what someone's customers are asking for isn't relevant. What's relevant is what that someone wants to deliver/enable. > > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to > > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough > > APERF/MPERF if and only if the feature is supported in hardware, but hidden from > > the kernel. I.e. let the system admin gift APERF/MPERF to KVM. > > Part of our goal has been to enable guest APERF/MPERF without impacting the > use of host APERF/MPERF, since one of the first things our support teams look > at in response to a performance complaint is the effective frequencies of the > CPUs as reported on the host. But is looking at the host's view even useful if (a) the only thing running on those CPUs is a single vCPU, and (b) host userspace only sees the effective frequencies when _host_ code is running? Getting the effective frequency for when the userspace VMM is processing emulated I/O probably isn't going to be all that helpful. And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs, e.g. via turbostat. It just means the kernel won't use APERF/MPERF for scheduling decisions or any other behaviors that rely on an accurate host view. > I can explain all of this in excruciating detail, but I'm not really > motivated by your initial response, which honestly seems a bit hostile. Probably because this series made me a bit grumpy :-) As presented, this feels way, way too much like KVM's existing PMU "virtualization". Mostly works if you stare at it just so, but devoid of details on why X was done instead of Y, and seemingly ignores multiple edge cases. I'm not saying you and Mingwei haven't thought about edge cases and design tradeoffs, but nothing in the cover letter, changelogs, comments (none), or testcases (also none) communicates those thoughts to others. > At least you looked at the code, which is a far warmer reception than I > usually get.
On Tue, Dec 3, 2024 at 5:59 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Dec 03, 2024, Jim Mattson wrote: > > On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Thu, Nov 21, 2024, Mingwei Zhang wrote: > > > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick > > > > (250 Hz by default) to measure their effective CPU frequency. To avoid > > > > the overhead of intercepting these frequent MSR reads, allow the guest > > > > to read them directly by loading guest values into the hardware MSRs. > > > > > > > > These MSRs are continuously running counters whose values must be > > > > carefully tracked during all vCPU state transitions: > > > > - Guest IA32_APERF advances only during guest execution > > > > > > That's not what this series does though. Guest APERF advances while the vCPU is > > > loaded by KVM_RUN, which is *very* different than letting APERF run freely only > > > while the vCPU is actively executing in the guest. > > > > > > E.g. a vCPU that is memory oversubscribed via zswap will account a significant > > > amount of CPU time in APERF when faulting in swapped memory, whereas traditional > > > file-backed swap will not due to the task being scheduled out while waiting on I/O. > > > > Are you saying that APERF should stop completely outside of VMX > > non-root operation / guest mode? > > While that is possible, the overhead would be significantly > > higher...probably high enough to make it impractical. > > No, I'm simply pointing out that the cover letter is misleading/inaccurate. > > > > In general, the "why" of this series is missing. What are the use cases you are > > > targeting? What are the exact semantics you want to define? *Why* did are you > > > proposed those exact semantics? > > > > I get the impression that the questions above are largely rhetorical, and > > Nope, not rhetorical, I genuinely want to know. I can't tell if ya'll thought > about the side effects of things like swap and emulated I/O, and if you did, what > made you come to the conclusion that the "best" boundary is on sched_out() and > return to userspace. > > > that you would not be happy with the answers anyway, but if you really are > > inviting a version 2, I will gladly expound upon the why. > > No need for a new version at this time, just give me the details. > > > > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that > > > requires userspace exits will not. It's not necessarily wrong for heavy userspace > > > I/O to cause observed frequency to drop, but it's not obviously correct either. > > > > > > The use cases matter a lot for APERF/MPERF, because trying to reason about what's > > > desirable for an oversubscribed setup requires a lot more work than defining > > > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or > > > less just partitioned. Not to mention the complexity for trying to support all > > > potential use cases is likely quite a bit higher. > > > > > > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned > > > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured > > > at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only > > > workloads running on CPUs should be vCPUs. It's not clear to me that observing > > > the guest utilization is outright wrong in that case. 
> > > > My understanding is that Google Cloud customers have been asking for this > > feature for all manner of VM families for years, and most of those VM > > families are not slice-of-hardware, since we just launched our first such > > offering a few months ago. > > But do you actually want to expose APERF/MPERF to those VMs? With my upstream > hat on, what someone's customers are asking for isn't relevant. What's relevant > is what that someone wants to deliver/enable. > > > > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to > > > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough > > > APERF/MPERF if and only if the feature is supported in hardware, but hidden from > > > the kernel. I.e. let the system admin gift APERF/MPERF to KVM. > > > > Part of our goal has been to enable guest APERF/MPERF without impacting the > > use of host APERF/MPERF, since one of the first things our support teams look > > at in response to a performance complaint is the effective frequencies of the > > CPUs as reported on the host. > > But is looking at the host's view even useful if (a) the only thing running on > those CPUs is a single vCPU, and (b) host userspace only sees the effective > frequencies when _host_ code is running? Getting the effective frequency for > when the userspace VMM is processing emulated I/O probably isn't going to be all > that helpful. (a) is your constraint, not mine, and (b) certainly sounds pointless, but that isn't the behavior of this patch set, so I'm not sure why you're even going there. With this patch set, host APERF/MPERF still reports all cycles accumulated on the logical processor, regardless of whether in the host or the guest. There will be a small loss every time the MSRs are written, but that loss is minimized by writing the MSRs as infrequently as possible. I get ahead of myself, but I just couldn't let this mischaracterization stand uncorrected while I get the rest of my response together. > And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs, > e.g. via turbostat. It just means the kernel won't use APERF/MPERF for scheduling > decisions or any other behaviors that rely on an accurate host view. > > > I can explain all of this in excruciating detail, but I'm not really > > motivated by your initial response, which honestly seems a bit hostile. > > Probably because this series made me a bit grumpy :-) As presented, this feels > way, way too much like KVM's existing PMU "virtualization". Mostly works if you > stare at it just so, but devoid of details on why X was done instead of Y, and > seemingly ignores multiple edge cases. > > I'm not saying you and Mingwei haven't thought about edge cases and design > tradeoffs, but nothing in the cover letter, changelogs, comments (none), or > testcases (also none) communicates those thoughts to others. > > > At least you looked at the code, which is a far warmer reception than I > > usually get.
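[To illustrate the accounting Jim describes above (the host view keeps reporting all cycles on the logical processor, with only a small loss on each write), here is a rough reconstruction of what the swap at vcpu_load()/vcpu_put() could look like, reusing the hypothetical per-CPU host_aperf_offset from the earlier sketch. This is an illustration of the idea, not the series' actual code, and it shows APERF only.]

	/* Guest APERF state KVM would keep per vCPU (illustrative). */
	struct guest_aperf_state {
		u64 aperf;	/* guest APERF value while the vCPU is not loaded */
	};

	static void aperf_vcpu_load(struct guest_aperf_state *g)
	{
		u64 host_aperf;

		rdmsrl(MSR_IA32_APERF, host_aperf);	/* current host count */

		/*
		 * While the guest value is live in hardware, a host read sees
		 * hw + offset = (guest value now) + (host_aperf - guest value
		 * at load), i.e. the host view keeps accumulating every cycle
		 * the logical processor executes, host or guest.
		 */
		this_cpu_write(host_aperf_offset, host_aperf - g->aperf);

		/* Cycles elapsed between the rdmsr above and this wrmsr are
		 * the "small loss" incurred on each write. */
		wrmsrl(MSR_IA32_APERF, g->aperf);
	}

	static void aperf_vcpu_put(struct guest_aperf_state *g)
	{
		u64 hw;

		rdmsrl(MSR_IA32_APERF, hw);
		g->aperf = hw;				/* snapshot guest value */

		/* Restore the host view: hardware value plus pending offset. */
		wrmsrl(MSR_IA32_APERF, hw + this_cpu_read(host_aperf_offset));
		this_cpu_write(host_aperf_offset, 0);
	}

[MPERF would be handled similarly, except that on vcpu_load the guest value would also be advanced by the TSC cycles that elapsed while the vCPU was scheduled out but not halted, per the cover letter's stated semantics.]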
Sorry for the duplicate message... On Tue, Dec 3, 2024 at 5:59 PM Sean Christopherson <seanjc@google.com> wrote: > > On Tue, Dec 03, 2024, Jim Mattson wrote: > > On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote: > > > > > > On Thu, Nov 21, 2024, Mingwei Zhang wrote: > > > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick > > > > (250 Hz by default) to measure their effective CPU frequency. To avoid > > > > the overhead of intercepting these frequent MSR reads, allow the guest > > > > to read them directly by loading guest values into the hardware MSRs. > > > > > > > > These MSRs are continuously running counters whose values must be > > > > carefully tracked during all vCPU state transitions: > > > > - Guest IA32_APERF advances only during guest execution > > > > > > That's not what this series does though. Guest APERF advances while the vCPU is > > > loaded by KVM_RUN, which is *very* different than letting APERF run freely only > > > while the vCPU is actively executing in the guest. > > > > > > E.g. a vCPU that is memory oversubscribed via zswap will account a significant > > > amount of CPU time in APERF when faulting in swapped memory, whereas traditional > > > file-backed swap will not due to the task being scheduled out while waiting on I/O. > > > > Are you saying that APERF should stop completely outside of VMX > > non-root operation / guest mode? > > While that is possible, the overhead would be significantly > > higher...probably high enough to make it impractical. > > No, I'm simply pointing out that the cover letter is misleading/inaccurate. > > > > In general, the "why" of this series is missing. What are the use cases you are > > > targeting? What are the exact semantics you want to define? *Why* did are you > > > proposed those exact semantics? > > > > I get the impression that the questions above are largely rhetorical, and > > Nope, not rhetorical, I genuinely want to know. I can't tell if ya'll thought > about the side effects of things like swap and emulated I/O, and if you did, what > made you come to the conclusion that the "best" boundary is on sched_out() and > return to userspace. Even for the slice of hardware case, KVM still needs to maintain the guest aperfmperf context and do the context switch. Even if vcpu is pinned, the host system design always has corner cases. For instance, the host may want to move a bunch of vCPUs from one chunk to another, say from 1 CCX to another CCX in AMD. Or maybe in some cases, balancing the memory usage by moving VMs from one (v)NUMA to another. Those should be corner cases and thus rare, but could happen in reality. > > > that you would not be happy with the answers anyway, but if you really are > > inviting a version 2, I will gladly expound upon the why. > > No need for a new version at this time, just give me the details. > > > > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that > > > requires userspace exits will not.
It's not necessarily wrong for heavy userspace > > > I/O to cause observed frequency to drop, but it's not obviously correct either. > > > > > > The use cases matter a lot for APERF/MPERF, because trying to reason about what's > > > desirable for an oversubscribed setup requires a lot more work than defining > > > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or > > > less just partitioned. Not to mention the complexity for trying to support all > > > potential use cases is likely quite a bit higher. > > > > > > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned > > > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured > > > at all times? Outside of maybe a few CPUs running bookkeeping tasks, the only > > > workloads running on CPUs should be vCPUs. It's not clear to me that observing > > > the guest utilization is outright wrong in that case. > > > > My understanding is that Google Cloud customers have been asking for this > > feature for all manner of VM families for years, and most of those VM > > families are not slice-of-hardware, since we just launched our first such > > offering a few months ago. > > But do you actually want to expose APERF/MPERF to those VMs? With my upstream > hat on, what someone's customers are asking for isn't relevant. What's relevant > is what that someone wants to deliver/enable. > > > > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to > > > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough > > > APERF/MPERF if and only if the feature is supported in hardware, but hidden from > > > the kernel. I.e. let the system admin gift APERF/MPERF to KVM. > > > > Part of our goal has been to enable guest APERF/MPERF without impacting the > > use of host APERF/MPERF, since one of the first things our support teams look > > at in response to a performance complaint is the effective frequencies of the > > CPUs as reported on the host. > > But is looking at the host's view even useful if (a) the only thing running on > those CPUs is a single vCPU, and (b) host userspace only sees the effective > frequencies when _host_ code is running? Getting the effective frequency for > when the userspace VMM is processing emulated I/O probably isn't going to be all > that helpful. > > And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs, > e.g. via turbostat. It just means the kernel won't use APERF/MPERF for scheduling > decisions or any other behaviors that rely on an accurate host view. > > > I can explain all of this in excruciating detail, but I'm not really > > motivated by your initial response, which honestly seems a bit hostile. > > Probably because this series made me a bit grumpy :-) As presented, this feels > way, way too much like KVM's existing PMU "virtualization". Mostly works if you > stare at it just so, but devoid of details on why X was done instead of Y, and > seemingly ignores multiple edge cases. ah, I can understand your feelings :) In the existing implementation of vPMU, the guest counter value is really really hard to fetch because part of it is always located in the perf subsystem. But in the case of aperfmperf, the guest value is always in one place when code is within the KVM loop. We pass through rdmsr to aperfmperf. Writes need some adjustment, but it is on the host-level offset, not the guest value. The offset we maintain is quite simple math. 
> > I'm not saying you and Mingwei haven't thought about edge cases and design > tradeoffs, but nothing in the cover letter, changelogs, comments (none), or > testcases (also none) communicates those thoughts to others. > > > At least you looked at the code, which is a far warmer reception than I > > usually get.
Here is the sordid history behind the proposed APERFMPERF implementation.

First of all, I have never considered this feature for only "slice-of-hardware" VMs, for two reasons:
(1) The original feature request was first opened in 2015, six months before I joined Google, and long before Google Cloud had anything remotely resembling a slice-of-hardware offering.
(2) One could argue that Google Cloud still has no slice-of-hardware offering today.

Hence, an implementation that only works for "slice-of-hardware" is essentially a science fair project. We might learn a lot, but there is no ROI.

I dragged my feet on this for a long time, because
(1) Without actual guest C-state control, it seems essentially pointless (though I probably didn't give sufficient weight to the warm fuzzy feeling it might give customers).
(2) It's one of those things that's impossible to virtualize with precision, and I can be a real pedant sometimes.
(3) I didn't want to expose a power side-channel that could be used to spy on other tenants.

In 2019, Google Cloud launched the C2 VM family, with MWAIT-exiting disabled for whole socket shapes. Though MWAIT hints aren't full C-state control, and we still had the 1 MHz host timer tick that would probably keep the guest out of deep C-states, my first objection started to collapse.

As I softened in my old age, the second objection seemed indefensible, especially after I finally caved on nested posted interrupt processing, which truly is unvirtualizable.

But, especially after the whole Meltdown/Spectre debacle, I was holding firm to my third objection, despite counter-arguments that the same information could be obtained without APERFMPERF. I guess I'm still smarting from being proven completely wrong about RowHammer.

Finally, in December 2021, I thought I had a reasonable solution. We could implement APERFMPERF in userspace, and the low fidelity would make me feel comfortable about my third objection.

"How would userspace get this information," you may ask. Well, Google Cloud has been carrying local patches to log guest {APERF, MPERF, TSC} deltas since Ben Serebrin added it in 2017. Though the design document only stipulated that the MSRs should be sampled at VMRUN entry and exit, the original code actually sampled at VM-entry and VM-exit, with a limitation of sampling at those additional points only if 500 microseconds had elapsed since the last samples were taken. Ben calculated the effective frequency at each sample point to populate a histogram, but that's not really relevant to APERFMPERF virtualization. I just mention it to explain why those VM-entry/VM-exit sample points were there. This code accounted for everything between vcpu_load() and vcpu_put() when accumulating "guest" APERF and MPERF, and this data eventually formed the basis of our userspace implementation of APERFMPERF virtualization.

In 2022, Riley Gamson implemented APERFMPERF virtualization in userspace, using KVM_X86_SET_MSR_FILTER to intercept guest accesses to the MSRs, and using Ben's "turbostat" data to derive the values to be returned. The APERF delta could be used as-is, but I was insistent that MPERF needed to track guest TSC cycles while the vCPU was not halted.

My reasoning was this:
(1) The specification says so. Okay; it actually says that MPERF "[i]ncrements at fixed interval (relative to TSC freq.) when the logical processor is in C0," but even turbostat makes the architecturally prohibited assumption that MPERF and TSC tick at the same frequency.
(2) It would be disingenuous to claim the effective frequency *while running* for a duty-cycle limited f1-micro or g2-small VM, or for overcommitted VMs that are forced to timeshare with other tenants.

APERF is admittedly tricky to virtualize. For instance, how many virtual "core clock counts at the coordinated clock frequency" should we count while KVM is emulating CPUID? That's unclear. We're certainly not trying to *emulate* APERF, so the number of cycles the physical CPU takes to execute the instruction isn't relevant. The virtual CPU just happens to take a few thousand cycles to execute CPUID. Consider it a quirk.

Similarly, in the case of zswap, some memory accesses take a *really* long time. Or consider KVM time as the equivalent of SMM time on physical hardware. SMM cycles are accumulated by APERF. It may seem like a memory access just took 60 *milliseconds*, but most of that time was spent in SMM. (That's taken from a recent real-world example.) As much as I hate SMM, it provides a convenient rug to sweep virtualization holes under.

At this point, I should note that Aaron Lewis eliminated the rate-limited "turbostat" sampling at VM-entry/VM-exit early this year, because it was too expensive. Admittedly, most of the cost was attributed to reading MSR_C6_CORE_RESIDENCY, which Drew Schmitt added to Ben's sampling in 2018. But this did factor into my thinking regarding cost.

The target intercept was the C3 VM family, which is not "slice-of-hardware," and, ironically, does not disable MWAIT-exiting even for full socket shapes (because we realized after launching C2 that that was a huge mistake). However, the userspace approach was abandoned before C3 launch, because of performance issues.

You may laugh, but when we advertised APERFMPERF to Linux guests, we were surprised to find that every vCPU started sampling these MSRs every timer tick. I still haven't looked into why. I'm assuming it has something to do with a perception of "fairness" in scheduling, and I just hope that it doesn't give power-hungry instruction mixes like AVX-512 and AMX an even greater fraction of CPU time because their effective frequency is low.

In any case, we were seeing 10% to 16% performance degradations when APERFMPERF was advertised to Linux guests, and that was a non-starter. If that seems excessive, it is. A large part of this is due to contention for an unfortunate exclusive lock on the legacy PIC that our userspace grabs and releases for each KVM_RUN ioctl. That could be fixed with a reader/writer lock, but the point is that we were seeing KVM exits at a much higher rate than ever before. I accept full responsibility for this debacle. I thought maybe these MSRs would get sampled once a second while running turbostat. I had no idea that the Linux kernel was so enamored of these MSRs.

Just doing a back-of-the-envelope calculation based on a 250 Hz guest tick and 50000 cycles for a KVM exit, this implementation was going to cost 1% or more in guest performance. We certainly couldn't enable it by default, but maybe we could enable it for the specific customers who had been clamoring for the feature. However, when I asked Product Management how much performance customers were willing to trade for this feature, the answer was "none."

Okay. How do we achieve that? The obvious approach is to intercept reads of these MSRs and do some math in KVM. I found that really unpalatable, though. For most of our VM families, the dominant source of consistent background VM-exits is the host timer tick.
The second highest source is the guest timer tick. With the host timer tick finally going away on the C4 VM family, the guest timer tick now dominates. On Intel parts, where we take advantage of hardware EOI virtualization, we now have two VM-exits per guest timer tick (one for writing the x2APIC initial count MSR, and one for the VMX-preemption timer). I couldn't defend doubling that with intercepted reads of APERF and MPERF.

Well, what about the simple hack of passing through the host values? I had considered this, despite the fact that it would only work for slice-of-hardware. I even coerced Josh Don into "fixing" our scheduler so that it wouldn't allow two vCPU threads (a virtual hyperthread pair) to flip-flop between hyperthreads on their assigned physical core. However, I eventually dismissed this as
(1) too much of a hack
(2) broken with live migration
(3) disingenuous when multiple tasks are running on the logical processor.

Yes, (3) does happen, even with our C4 VM family. During copyless migration, two vCPU threads share a logical processor. During live migration, I believe the live migration threads compete with vCPU threads. And there is still at least one kworker thread competing for cycles.

Actually writing the guest values into the MSRs was initially abhorrent to me, because of the inherent lossage on every write. But, I eventually got over it, and partially assuaged my revulsion by writing the MSRs infrequently. I would much have preferred APERF and MPERF equivalents of IA32_TSC_ADJUST, but I don't have the patience to wait for the CPU vendors. BTW, as an aside, just how is AMD's scaling of MPERF by the TSC_RATIO MSR even remotely useful without an offset?

One requirement I refuse to budge on is that host usage of APERFMPERF must continue to work exactly as before, modulo some very small loss of precision. Most of the damage could be contained within KVM, if you're willing to accept the constraint that these MSRs cannot be accessed within an NMI handler (on Intel CPUs), but then you have to swap the guest and host values every VM-entry/VM-exit. This approach increases both the performance overhead (for which our budget is "none") and the loss of precision over the current approach. Given the amount of whining on this list over writing just one MSR on every VM-entry/VM-exit (yes, IA32_SPEC_CTRL, I'm looking at you), I didn't think it would be very popular to add two. And, to be honest, I remembered that rate-limited *reads* of the "turbostat" MSRs were too expensive, but I had forgotten that the real culprit there was the egregiously slow MSR_C6_CORE_RESIDENCY.

I do recall the hullabaloo regarding KVM's usurpation of IA32_TSC_AUX, an affront that went unnoticed until the advent of RDPID. Honestly, that's where I expected pushback on this series. Initially, I tried to keep the changes outwith KVM to the minimum possible, replacing only explicit reads of APERF or MPERF with the new accessor functions. I wasn't going to touch the /dev/cpu/*/msr/* interface. After all, none of the other KVM userspace return MSRs do anything special there. But, then I discovered that turbostat on the host uses that interface, and I really wanted that tool to continue to work as expected. So, the rdmsr crosscalls picked up an ugly wart. Frankly, that was the specific patch that I expected to unleash vitriol. As an aside, why does turbostat have to use that interface for its own independent reads of these MSRs, when the kernel is already reading them every scheduler tick?!?
Sure, for tickless kernels, maybe, but I digress.

Wherever the context-switching happens, I contend that there is no "clean" virtualization of APERF. If it comes down to just a question of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some performance numbers and try to come to a consensus, but if you're fundamentally opposed to virtualizing APERF, because it's messy, then I don't see any point in pursuing this further.

Thanks,

--jim
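[For reference, Jim's back-of-the-envelope intercept cost above works out roughly as follows, assuming (my assumptions, purely for illustration) that the guest reads both APERF and MPERF on each 250 Hz tick and that the vCPU runs at about 2.5 GHz:

	2 MSR reads/tick x 250 ticks/sec x 50,000 cycles/exit = 25,000,000 cycles/sec
	25,000,000 / 2,500,000,000 cycles/sec ~= 1% of guest CPU time

which matches the "1% or more" figure quoted in the message.]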
On 11/22/2024 12:22 AM, Mingwei Zhang wrote:
> Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> (250 Hz by default) to measure their effective CPU frequency. To avoid
> the overhead of intercepting these frequent MSR reads, allow the guest
> to read them directly by loading guest values into the hardware MSRs.
>
> These MSRs are continuously running counters whose values must be
> carefully tracked during all vCPU state transitions:
> - Guest IA32_APERF advances only during guest execution
> - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
>   in C0 state, even when not actively running

Any particular reason to treat APERF and MPERF differently?

AFAIU, APERF and MPERF architecturally will count when the CPU is in C0 state, with MPERF counting at a constant frequency and APERF counting at a variable frequency. Shouldn't we treat APERF and MPERF equally and keep counting in C0 state, even when "not actively running"?

Can you clarify what you mean by "not actively running"?

Regards
Nikunj
On Thu, Dec 5, 2024 at 1:00 AM Nikunj A Dadhania <nikunj@amd.com> wrote: > > On 11/22/2024 12:22 AM, Mingwei Zhang wrote: > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick > > (250 Hz by default) to measure their effective CPU frequency. To avoid > > the overhead of intercepting these frequent MSR reads, allow the guest > > to read them directly by loading guest values into the hardware MSRs. > > > > These MSRs are continuously running counters whose values must be > > carefully tracked during all vCPU state transitions: > > - Guest IA32_APERF advances only during guest execution > > - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is > > in C0 state, even when not actively running > > Any particular reason to treat APERF and MPERF differently? Core cycles accumulated by the logical processor that do not contribute to the execution of the virtual processor should not be counted. For example, consider Google Cloud's e2-small VM type, which is capped at a 25% duty cycle. Even if the logical processor is humming along at an effective frequency of 3.6 GHz, an e2-small vCPU task is only resident 25% of the time, so its effective frequency is more like 0.9 GHz (over a sufficiently large period of time). Similarly, if a logical processor running at 3.6 GHz is shared 50/50 by two vCPUs, the effective frequency of each is about 1.8 GHz (again, over a sufficiently large period of time). Over smaller time periods, the effective frequencies in these examples would look like square waves, alternating between 3.6 GHz and 0, much like thermal throttling. And, much like thermal throttling, MPERF reference cycles continue to tick on at the fixed reference frequency, even when APERF cycles drop to 0. > AFAIU, APERF and MPERF architecturally will count when the CPU is in C0 state. > MPERF counting at constant frequency and the APERF counting at a variable > frequency. Shouldn't we treat APERF and MPERF equal and keep on counting in C0 > state and even when "not actively running" ? > > Can you clarify what do you mean by "not actively running"? The current implementation considers the vCPU to be actively running if the task is in the KVM_RUN ioctl, between vcpu_load() and vcpu_put(). This also implies that the task itself is currently running on a logical processor, since there is a vcpu_put() on sched_out and a vcpu_load() on sched_in. As Sean points out, this is only an approximation, since (a) such things as I/O completion in userspace are not counted, and (b) such things as uncompressing a zswapped page that happen in the vCPU task are counted. > Regards > Nikunj >
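[A small worked example of the arithmetic behind Jim's e2-small illustration above. The 3.6 GHz and 25% residency figures come from the text; the one-second window and the use of the standard APERF/MPERF ratio are my assumptions for illustration.]

	/*
	 * Effective frequency over a sampling window, as a guest would derive
	 * it from the virtualized counters.  With the proposed semantics, a
	 * vCPU resident 25% of the time on a 3.6 GHz core accrues only 25% of
	 * the reference cycles in APERF while MPERF keeps ticking, so the
	 * observed effective frequency is ~0.9 GHz.
	 */
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		const double tsc_hz = 3.6e9;			/* MPERF/TSC reference rate */
		const uint64_t delta_mperf = 3600000000ULL;	/* one second in C0 */
		const uint64_t delta_aperf =  900000000ULL;	/* 25% residency at 3.6 GHz */

		double effective_hz = tsc_hz * (double)delta_aperf / (double)delta_mperf;

		printf("effective frequency: %.2f GHz\n", effective_hz / 1e9);	/* 0.90 */
		return 0;
	}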
On Wed, Dec 04, 2024, Jim Mattson wrote:
> Wherever the context-switching happens, I contend that there is no
> "clean" virtualization of APERF. If it comes down to just a question
> of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some
> performance numbers and try to come to a consensus, but if you're
> fundamentally opposed to virtualizing APERF, because it's messy, then
> I don't see any point in pursuing this further.

I'm not fundamentally opposed to virtualizing the feature. My complaint with the series is that it doesn't provide sufficient information to make it feasible for reviewers to provide useful feedback. The history you provided is a great start, but that's still largely just background information.

For a feature as messy and subjective as APERF/MPERF, I think we need at least the following:

1. What use cases are being targeted (e.g. because targeting only SoH would allow for a different implementation).
2. The exact requirements, especially with respect to host usage, and the motivation behind those requirements.
3. The high level design choices, and what, if any, alternatives were considered.
4. Basic rules of thumb for what is/isn't accounted in APERF/MPERF, so that it's feasible to actually maintain support long-term.

E.g. does the host need to retain access to APERF/MPERF at all times? If so, why? Do we care about host kernel accesses, e.g. in the scheduler, or just userspace accesses, e.g. turbostat? What information is the host intended to see?

E.g. should APERF and MPERF stop when transitioning to the guest? If not, what are the intended semantics for the host's view when running VMs with HLT-exiting disabled? If the host's view of APERF and MPERF accounts guest time, how does that mesh with the upcoming mediated PMU, where the host is disallowed from observing the guest?

Is there a plan for supporting VMs with a different TSC frequency than the host? How will live migration work, without generating too much slop/skew between MPERF and GUEST_TSC?

I don't expect the series to answer every possible question upfront, but the RFC provided _nothing_, just a "here's what we implemented, please review".