[RFC,00/22] KVM: x86: Virtualize IA32_APERF and IA32_MPERF MSRs

Message ID 20241121185315.3416855-1-mizhang@google.com

Message

Mingwei Zhang Nov. 21, 2024, 6:52 p.m. UTC
Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
(250 Hz by default) to measure their effective CPU frequency. To avoid
the overhead of intercepting these frequent MSR reads, allow the guest
to read them directly by loading guest values into the hardware MSRs.
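For readers unfamiliar with the measurement, the effective frequency is conventionally derived from the ratio of the APERF and MPERF deltas between two samples, scaled by the base frequency. The following is a minimal sketch of that arithmetic only; the function names and the 64-bit wraparound handling are illustrative assumptions, not code from this series or from the guest scheduler.

```python
MASK64 = (1 << 64) - 1  # assumption: both counters are modeled as 64-bit values


def counter_delta(before: int, after: int) -> int:
    """Difference between two counter samples, tolerating 64-bit wraparound."""
    return (after - before) & MASK64


def effective_khz(base_khz, aperf0, mperf0, aperf1, mperf1):
    """Estimate effective frequency as base_khz * dAPERF / dMPERF.

    Returns None when MPERF did not advance between the two samples,
    since the ratio is undefined in that case.
    """
    d_aperf = counter_delta(aperf0, aperf1)
    d_mperf = counter_delta(mperf0, mperf1)
    if d_mperf == 0:
        return None
    return base_khz * d_aperf // d_mperf
```

With a 2 GHz base frequency, an interval in which APERF advances by 3,000,000 while MPERF advances by 2,000,000 yields 3 GHz, i.e. the CPU ran above base frequency during that interval.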

These MSRs are continuously running counters whose values must be
carefully tracked during all vCPU state transitions:
- Guest IA32_APERF advances only during guest execution
- Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
  in C0 state, even when not actively running
- Host kernel access is redirected through get_host_[am]perf() which
  adds per-CPU offsets to the hardware MSR values
- Remote MSR reads through /dev/cpu/*/msr also account for these
  offsets

Guest values persist in hardware while the vCPU is loaded and
running. Host MSR values are restored on vcpu_put (either at KVM_RUN
completion or when preempted) and when transitioning to halt state.
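The offset bookkeeping described above can be sketched as a small simulation of one counter on one logical CPU. This is a hedged model, not the series' implementation: the method names echo the patch titles (get_host, set_guest, restore_host), but the bodies, the `LogicalCpu` class, and `tick()` are hypothetical and exist only to show the arithmetic.

```python
MASK64 = (1 << 64) - 1  # assumption: the counter is modeled as a 64-bit value


class LogicalCpu:
    """Toy model of one hardware counter plus a per-CPU host offset."""

    def __init__(self, hw_value: int):
        self.hw = hw_value & MASK64   # value currently in the hardware MSR
        self.host_offset = 0          # host value minus hardware value

    def tick(self, cycles: int) -> None:
        """The hardware counter advances regardless of whose value is loaded."""
        self.hw = (self.hw + cycles) & MASK64

    def get_host(self) -> int:
        """Host view: hardware value plus the per-CPU offset."""
        return (self.hw + self.host_offset) & MASK64

    def set_guest(self, guest_value: int) -> None:
        """Load a guest value into hardware, remembering how to recover the host value."""
        host_value = self.get_host()
        self.hw = guest_value & MASK64
        self.host_offset = (host_value - self.hw) & MASK64

    def restore_host(self) -> int:
        """Write the host value back to hardware and return the guest checkpoint."""
        guest_checkpoint = self.hw
        self.hw = self.get_host()
        self.host_offset = 0
        return guest_checkpoint
```

In this model the host view keeps counting every cycle while the guest value is loaded, and the guest checkpoint only reflects cycles that elapsed while it was in hardware. The sketch ignores the cycles lost between reading and writing the MSR, and it omits the extra background accounting that guest MPERF needs while the vCPU is in C0 but not loaded.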

Note that guest TSC scaling via KVM_SET_TSC_KHZ is not supported, as
it would require either intercepting MPERF reads on Intel (where MPERF
ticks at host rate regardless of guest TSC scaling) or significantly
complicating the cycle accounting on AMD.

The host must have both CONSTANT_TSC and NONSTOP_TSC capabilities
since these ensure stable TSC frequency across C-states and P-states,
which is required for accurate background MPERF accounting.

Jim Mattson (14):
  x86/aperfmperf: Introduce get_host_[am]perf()
  x86/aperfmperf: Introduce set_guest_[am]perf()
  x86/aperfmperf: Introduce restore_host_[am]perf()
  x86/msr: Adjust remote reads of IA32_[AM]PERF by the per-cpu host
    offset
  KVM: x86: Introduce kvm_vcpu_make_runnable()
  KVM: x86: INIT may transition from HALTED to RUNNABLE
  KVM: nSVM: Nested #VMEXIT may transition from HALTED to RUNNABLE
  KVM: nVMX: Nested VM-exit may transition from HALTED to RUNNABLE
  KVM: x86: Make APERFMPERF a governed feature
  KVM: x86: Initialize guest [am]perf at vcpu power-on
  KVM: x86: Load guest [am]perf when leaving halt state
  KVM: x86: Introduce kvm_user_return_notifier_register()
  KVM: x86: Restore host IA32_[AM]PERF on userspace return
  KVM: x86: Update aperfmperf on host-initiated MP_STATE transitions

Mingwei Zhang (8):
  KVM: x86: Introduce KVM_X86_FEATURE_APERFMPERF
  KVM: x86: Load guest [am]perf into hardware MSRs at vcpu_load()
  KVM: x86: Save guest [am]perf checkpoint on HLT
  KVM: x86: Save guest [am]perf checkpoint on vcpu_put()
  KVM: x86: Allow host and guest access to IA32_[AM]PERF
  KVM: VMX: Pass through guest reads of IA32_[AM]PERF
  KVM: SVM: Pass through guest reads of IA32_[AM]PERF
  KVM: x86: Enable guest usage of X86_FEATURE_APERFMPERF

 arch/x86/include/asm/kvm_host.h  |  11 ++
 arch/x86/include/asm/topology.h  |  10 ++
 arch/x86/kernel/cpu/aperfmperf.c |  65 +++++++++++-
 arch/x86/kvm/cpuid.c             |  12 ++-
 arch/x86/kvm/governed_features.h |   1 +
 arch/x86/kvm/lapic.c             |   5 +-
 arch/x86/kvm/reverse_cpuid.h     |   6 ++
 arch/x86/kvm/svm/nested.c        |   2 +-
 arch/x86/kvm/svm/svm.c           |   7 ++
 arch/x86/kvm/svm/svm.h           |   2 +-
 arch/x86/kvm/vmx/nested.c        |   2 +-
 arch/x86/kvm/vmx/vmx.c           |   7 ++
 arch/x86/kvm/vmx/vmx.h           |   2 +-
 arch/x86/kvm/x86.c               | 171 ++++++++++++++++++++++++++++---
 arch/x86/lib/msr-smp.c           |  11 ++
 drivers/cpufreq/amd-pstate.c     |   4 +-
 drivers/cpufreq/intel_pstate.c   |   5 +-
 17 files changed, 295 insertions(+), 28 deletions(-)


base-commit: 0a9b9d17f3a781dea03baca01c835deaa07f7cc3

Comments

Sean Christopherson Dec. 3, 2024, 11:19 p.m. UTC | #1
On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> (250 Hz by default) to measure their effective CPU frequency. To avoid
> the overhead of intercepting these frequent MSR reads, allow the guest
> to read them directly by loading guest values into the hardware MSRs.
> 
> These MSRs are continuously running counters whose values must be
> carefully tracked during all vCPU state transitions:
> - Guest IA32_APERF advances only during guest execution

That's not what this series does though.  Guest APERF advances while the vCPU is
loaded by KVM_RUN, which is *very* different than letting APERF run freely only
while the vCPU is actively executing in the guest.

E.g. a vCPU that is memory oversubscribed via zswap will account a significant
amount of CPU time in APERF when faulting in swapped memory, whereas traditional
file-backed swap will not due to the task being scheduled out while waiting on I/O.

In general, the "why" of this series is missing.  What are the use cases you are
targeting?  What are the exact semantics you want to define?  *Why* are you
proposing those exact semantics?

E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
requires userspace exits will not.  It's not necessarily wrong for heavy userspace
I/O to cause observed frequency to drop, but it's not obviously correct either.
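To make the difference between the two accounting boundaries concrete, here is a small worked model. It is a simplification under stated assumptions: the core is assumed to run at a constant frequency so cycles are proportional to time, MPERF is assumed to advance for the whole interval (the vCPU stays in C0), and the function name and boundary labels are hypothetical.

```python
from fractions import Fraction


def observed_ratio(guest: int, in_kernel: int, userspace: int, boundary: str) -> Fraction:
    """Return dAPERF/dMPERF as seen by the guest for one interval.

    guest      cycles spent executing guest code
    in_kernel  cycles spent in KVM with the vCPU loaded (e.g. in-kernel emulation)
    userspace  cycles spent after an exit to the userspace VMM
    boundary   "vcpu_loaded": APERF runs while the vCPU is loaded by KVM_RUN
               "guest_only":  APERF runs only while executing in the guest
    """
    if boundary == "vcpu_loaded":
        aperf = guest + in_kernel
    elif boundary == "guest_only":
        aperf = guest
    else:
        raise ValueError(f"unknown boundary: {boundary}")
    mperf = guest + in_kernel + userspace  # assumption: vCPU in C0 for the whole interval
    return Fraction(aperf, mperf)
```

For an interval split 600/300/100 between guest, in-kernel, and userspace cycles, the "vcpu_loaded" boundary reports a ratio of 9/10 while "guest_only" reports 3/5. Moving the same 300 cycles of I/O handling from the kernel to userspace would drop the "vcpu_loaded" ratio to 3/5 as well, which is the asymmetry noted above.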

The use cases matter a lot for APERF/MPERF, because trying to reason about what's
desirable for an oversubscribed setup requires a lot more work than defining
semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
less just partitioned.  Not to mention the complexity for trying to support all
potential use cases is likely quite a bit higher.

And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
at all times?  Outside of maybe a few CPUs running bookkeeping tasks, the only
workloads running on CPUs should be vCPUs.  It's not clear to me that observing
the guest utilization is outright wrong in that case.

One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
APERF/MPERF if and only if the feature is supported in hardware, but hidden from
the kernel.  I.e. let the system admin gift APERF/MPERF to KVM.

> - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
>   in C0 state, even when not actively running
> - Host kernel access is redirected through get_host_[am]perf() which
>   adds per-CPU offsets to the hardware MSR values
> - Remote MSR reads through /dev/cpu/*/msr also account for these
>   offsets
> 
> Guest values persist in hardware while the vCPU is loaded and
> running. Host MSR values are restored on vcpu_put (either at KVM_RUN
> completion or when preempted) and when transitioning to halt state.
> 
> Note that guest TSC scaling via KVM_SET_TSC_KHZ is not supported, as
> it would require either intercepting MPERF reads on Intel (where MPERF
> ticks at host rate regardless of guest TSC scaling) or significantly
> complicating the cycle accounting on AMD.
> 
> The host must have both CONSTANT_TSC and NONSTOP_TSC capabilities
> since these ensure stable TSC frequency across C-states and P-states,
> which is required for accurate background MPERF accounting.

...

>  arch/x86/include/asm/kvm_host.h  |  11 ++
>  arch/x86/include/asm/topology.h  |  10 ++
>  arch/x86/kernel/cpu/aperfmperf.c |  65 +++++++++++-
>  arch/x86/kvm/cpuid.c             |  12 ++-
>  arch/x86/kvm/governed_features.h |   1 +
>  arch/x86/kvm/lapic.c             |   5 +-
>  arch/x86/kvm/reverse_cpuid.h     |   6 ++
>  arch/x86/kvm/svm/nested.c        |   2 +-
>  arch/x86/kvm/svm/svm.c           |   7 ++
>  arch/x86/kvm/svm/svm.h           |   2 +-
>  arch/x86/kvm/vmx/nested.c        |   2 +-
>  arch/x86/kvm/vmx/vmx.c           |   7 ++
>  arch/x86/kvm/vmx/vmx.h           |   2 +-
>  arch/x86/kvm/x86.c               | 171 ++++++++++++++++++++++++++++---
>  arch/x86/lib/msr-smp.c           |  11 ++
>  drivers/cpufreq/amd-pstate.c     |   4 +-
>  drivers/cpufreq/intel_pstate.c   |   5 +-
>  17 files changed, 295 insertions(+), 28 deletions(-)
> 
> 
> base-commit: 0a9b9d17f3a781dea03baca01c835deaa07f7cc3
> -- 
> 2.47.0.371.ga323438b13-goog
>
Jim Mattson Dec. 4, 2024, 1:13 a.m. UTC | #2
On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > the overhead of intercepting these frequent MSR reads, allow the guest
> > to read them directly by loading guest values into the hardware MSRs.
> >
> > These MSRs are continuously running counters whose values must be
> > carefully tracked during all vCPU state transitions:
> > - Guest IA32_APERF advances only during guest execution
>
> That's not what this series does though.  Guest APERF advances while the vCPU is
> loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> while the vCPU is actively executing in the guest.
>
> E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> file-backed swap will not due to the task being scheduled out while waiting on I/O.

Are you saying that APERF should stop completely outside of VMX
non-root operation / guest mode?
While that is possible, the overhead would be significantly
higher...probably high enough to make it impractical.

> In general, the "why" of this series is missing.  What are the use cases you are
> targeting?  What are the exact semantics you want to define?  *Why* are you
> proposing those exact semantics?

I get the impression that the questions above are largely rhetorical,
and that you would not be happy with the answers anyway, but if you
really are inviting a version 2, I will gladly expound upon the why.

> E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> requires userspace exits will not.  It's not necessarily wrong for heavy userspace
> I/O to cause observed frequency to drop, but it's not obviously correct either.
>
> The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> desirable for an oversubscribed setup requires a lot more work than defining
> semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> less just partitioned.  Not to mention the complexity for trying to support all
> potential use cases is likely quite a bit higher.
>
> And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> at all times?  Outside of maybe a few CPUs running bookkeeping tasks, the only
> workloads running on CPUs should be vCPUs.  It's not clear to me that observing
> the guest utilization is outright wrong in that case.

My understanding is that Google Cloud customers have been asking for
this feature for all manner of VM families for years, and most of
those VM families are not slice-of-hardware, since we just launched
our first such offering a few months ago.

> One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> the kernel.  I.e. let the system admin gift APERF/MPERF to KVM.

Part of our goal has been to enable guest APERF/MPERF without
impacting the use of host APERF/MPERF, since one of the first things
our support teams look at in response to a performance complaint is
the effective frequencies of the CPUs as reported on the host.

I can explain all of this in excruciating detail, but I'm not really
motivated by your initial response, which honestly seems a bit
hostile. At least you looked at the code, which is a far warmer
reception than I usually get.
Sean Christopherson Dec. 4, 2024, 1:59 a.m. UTC | #3
On Tue, Dec 03, 2024, Jim Mattson wrote:
> On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > > the overhead of intercepting these frequent MSR reads, allow the guest
> > > to read them directly by loading guest values into the hardware MSRs.
> > >
> > > These MSRs are continuously running counters whose values must be
> > > carefully tracked during all vCPU state transitions:
> > > - Guest IA32_APERF advances only during guest execution
> >
> > That's not what this series does though.  Guest APERF advances while the vCPU is
> > loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> > while the vCPU is actively executing in the guest.
> >
> > E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> > amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> > file-backed swap will not due to the task being scheduled out while waiting on I/O.
> 
> Are you saying that APERF should stop completely outside of VMX
> non-root operation / guest mode?
> While that is possible, the overhead would be significantly
> higher...probably high enough to make it impractical.

No, I'm simply pointing out that the cover letter is misleading/inaccurate.

> > In general, the "why" of this series is missing.  What are the use cases you are
> > targeting?  What are the exact semantics you want to define?  *Why* are you
> > proposing those exact semantics?
> 
> I get the impression that the questions above are largely rhetorical, and

Nope, not rhetorical, I genuinely want to know.  I can't tell if y'all thought
about the side effects of things like swap and emulated I/O, and if you did, what
made you come to the conclusion that the "best" boundary is on sched_out() and
return to userspace.

> that you would not be happy with the answers anyway, but if you really are
> inviting a version 2, I will gladly expound upon the why.

No need for a new version at this time, just give me the details.

> > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> > requires userspace exits will not.  It's not necessarily wrong for heavy userspace
> > I/O to cause observed frequency to drop, but it's not obviously correct either.
> >
> > The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> > desirable for an oversubscribed setup requires a lot more work than defining
> > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> > less just partitioned.  Not to mention the complexity for trying to support all
> > potential use cases is likely quite a bit higher.
> >
> > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> > at all times?  Outside of maybe a few CPUs running bookkeeping tasks, the only
> > workloads running on CPUs should be vCPUs.  It's not clear to me that observing
> > the guest utilization is outright wrong in that case.
> 
> My understanding is that Google Cloud customers have been asking for this
> feature for all manner of VM families for years, and most of those VM
> families are not slice-of-hardware, since we just launched our first such
> offering a few months ago.

But do you actually want to expose APERF/MPERF to those VMs?  With my upstream
hat on, what someone's customers are asking for isn't relevant.  What's relevant
is what that someone wants to deliver/enable.

> > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> > APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> > the kernel.  I.e. let the system admin gift APERF/MPERF to KVM.
> 
> Part of our goal has been to enable guest APERF/MPERF without impacting the
> use of host APERF/MPERF, since one of the first things our support teams look
> at in response to a performance complaint is the effective frequencies of the
> CPUs as reported on the host.

But is looking at the host's view even useful if (a) the only thing running on
those CPUs is a single vCPU, and (b) host userspace only sees the effective
frequencies when _host_ code is running?  Getting the effective frequency for
when the userspace VMM is processing emulated I/O probably isn't going to be all
that helpful.

And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs,
e.g. via turbostat.  It just means the kernel won't use APERF/MPERF for scheduling
decisions or any other behaviors that rely on an accurate host view.

> I can explain all of this in excruciating detail, but I'm not really
> motivated by your initial response, which honestly seems a bit hostile.

Probably because this series made me a bit grumpy :-)  As presented, this feels
way, way too much like KVM's existing PMU "virtualization".  Mostly works if you
stare at it just so, but devoid of details on why X was done instead of Y, and
seemingly ignores multiple edge cases.

I'm not saying you and Mingwei haven't thought about edge cases and design
tradeoffs, but nothing in the cover letter, changelogs, comments (none), or
testcases (also none) communicates those thoughts to others.

> At least you looked at the code, which is a far warmer reception than I
> usually get.
Jim Mattson Dec. 4, 2024, 4 a.m. UTC | #4
On Tue, Dec 3, 2024 at 5:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 03, 2024, Jim Mattson wrote:
> > On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > > > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > > > the overhead of intercepting these frequent MSR reads, allow the guest
> > > > to read them directly by loading guest values into the hardware MSRs.
> > > >
> > > > These MSRs are continuously running counters whose values must be
> > > > carefully tracked during all vCPU state transitions:
> > > > - Guest IA32_APERF advances only during guest execution
> > >
> > > That's not what this series does though.  Guest APERF advances while the vCPU is
> > > loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> > > while the vCPU is actively executing in the guest.
> > >
> > > E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> > > amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> > > file-backed swap will not due to the task being scheduled out while waiting on I/O.
> >
> > Are you saying that APERF should stop completely outside of VMX
> > non-root operation / guest mode?
> > While that is possible, the overhead would be significantly
> > higher...probably high enough to make it impractical.
>
> No, I'm simply pointing out that the cover letter is misleading/inaccurate.
>
> > > In general, the "why" of this series is missing.  What are the use cases you are
> > > targeting?  What are the exact semantics you want to define?  *Why* are you
> > > proposing those exact semantics?
> >
> > I get the impression that the questions above are largely rhetorical, and
>
> Nope, not rhetorical, I genuinely want to know.  I can't tell if y'all thought
> about the side effects of things like swap and emulated I/O, and if you did, what
> made you come to the conclusion that the "best" boundary is on sched_out() and
> return to userspace.
>
> > that you would not be happy with the answers anyway, but if you really are
> > inviting a version 2, I will gladly expound upon the why.
>
> No need for a new version at this time, just give me the details.
>
> > > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> > > requires userspace exits will not.  It's not necessarily wrong for heavy userspace
> > > I/O to cause observed frequency to drop, but it's not obviously correct either.
> > >
> > > The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> > > desirable for an oversubscribed setup requires a lot more work than defining
> > > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> > > less just partitioned.  Not to mention the complexity for trying to support all
> > > potential use cases is likely quite a bit higher.
> > >
> > > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> > > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> > > at all times?  Outside of maybe a few CPUs running bookkeeping tasks, the only
> > > workloads running on CPUs should be vCPUs.  It's not clear to me that observing
> > > the guest utilization is outright wrong in that case.
> >
> > My understanding is that Google Cloud customers have been asking for this
> > feature for all manner of VM families for years, and most of those VM
> > families are not slice-of-hardware, since we just launched our first such
> > offering a few months ago.
>
> But do you actually want to expose APERF/MPERF to those VMs?  With my upstream
> hat on, what someone's customers are asking for isn't relevant.  What's relevant
> is what that someone wants to deliver/enable.
>
> > > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> > > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> > > APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> > > the kernel.  I.e. let the system admin gift APERF/MPERF to KVM.
> >
> > Part of our goal has been to enable guest APERF/MPERF without impacting the
> > use of host APERF/MPERF, since one of the first things our support teams look
> > at in response to a performance complaint is the effective frequencies of the
> > CPUs as reported on the host.
>
> But is looking at the host's view even useful if (a) the only thing running on
> those CPUs is a single vCPU, and (b) host userspace only sees the effective
> frequencies when _host_ code is running?  Getting the effective frequency for
> when the userspace VMM is processing emulated I/O probably isn't going to be all
> that helpful.

(a) is your constraint, not mine, and (b) certainly sounds pointless,
but that isn't the behavior of this patch set, so I'm not sure why
you're even going there.

With this patch set, host APERF/MPERF still reports all cycles
accumulated on the logical processor, regardless of whether in the
host or the guest. There will be a small loss every time the MSRs are
written, but that loss is minimized by writing the MSRs as
infrequently as possible.

I'm getting ahead of myself, but I just couldn't let this
mischaracterization stand uncorrected while I get the rest of my
response together.

> And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs,
> e.g. via turbostat.  It just means the kernel won't use APERF/MPERF for scheduling
> decisions or any other behaviors that rely on an accurate host view.
>
> > I can explain all of this in excruciating detail, but I'm not really
> > motivated by your initial response, which honestly seems a bit hostile.
>
> Probably because this series made me a bit grumpy :-)  As presented, this feels
> way, way too much like KVM's existing PMU "virtualization".  Mostly works if you
> stare at it just so, but devoid of details on why X was done instead of Y, and
> seemingly ignores multiple edge cases.
>
> I'm not saying you and Mingwei haven't thought about edge cases and design
> tradeoffs, but nothing in the cover letter, changelogs, comments (none), or
> testcases (also none) communicates those thoughts to others.
>
> > At least you looked at the code, which is a far warmer reception than I
> > usually get.
Mingwei Zhang Dec. 4, 2024, 5:11 a.m. UTC | #5
Sorry for the duplicate message...


On Tue, Dec 3, 2024 at 5:59 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Tue, Dec 03, 2024, Jim Mattson wrote:
> > On Tue, Dec 3, 2024 at 3:19 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > On Thu, Nov 21, 2024, Mingwei Zhang wrote:
> > > > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > > > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > > > the overhead of intercepting these frequent MSR reads, allow the guest
> > > > to read them directly by loading guest values into the hardware MSRs.
> > > >
> > > > These MSRs are continuously running counters whose values must be
> > > > carefully tracked during all vCPU state transitions:
> > > > - Guest IA32_APERF advances only during guest execution
> > >
> > > That's not what this series does though.  Guest APERF advances while the vCPU is
> > > loaded by KVM_RUN, which is *very* different than letting APERF run freely only
> > > while the vCPU is actively executing in the guest.
> > >
> > > E.g. a vCPU that is memory oversubscribed via zswap will account a significant
> > > amount of CPU time in APERF when faulting in swapped memory, whereas traditional
> > > file-backed swap will not due to the task being scheduled out while waiting on I/O.
> >
> > Are you saying that APERF should stop completely outside of VMX
> > non-root operation / guest mode?
> > While that is possible, the overhead would be significantly
> > higher...probably high enough to make it impractical.
>
> No, I'm simply pointing out that the cover letter is misleading/inaccurate.
>
> > > In general, the "why" of this series is missing.  What are the use cases you are
> > > targeting?  What are the exact semantics you want to define?  *Why* are you
> > > proposing those exact semantics?
> >
> > I get the impression that the questions above are largely rhetorical, and
>
> Nope, not rhetorical, I genuinely want to know.  I can't tell if y'all thought
> about the side effects of things like swap and emulated I/O, and if you did, what
> made you come to the conclusion that the "best" boundary is on sched_out() and
> return to userspace.

Even for the slice-of-hardware case, KVM still needs to maintain the
guest aperfmperf context and context switch it. Even if a vCPU is
pinned, the host system design always has corner cases. For instance,
the host may want to move a bunch of vCPUs from one chunk to another,
say from one CCX to another on AMD, or balance memory usage by moving
VMs from one (v)NUMA node to another. Those should be corner cases and
thus rare, but they can happen in reality.

>
> > that you would not be happy with the answers anyway, but if you really are
> > inviting a version 2, I will gladly expound upon the why.
>
> No need for a new version at this time, just give me the details.
>
> > > E.g. emulated I/O that is handled in KVM will be accounted to APERF, but I/O that
> > > requires userspace exits will not.  It's not necessarily wrong for heavy userspace
> > > I/O to cause observed frequency to drop, but it's not obviously correct either.
> > >
> > > The use cases matter a lot for APERF/MPERF, because trying to reason about what's
> > > desirable for an oversubscribed setup requires a lot more work than defining
> > > semantics for setups where all vCPUs are hard pinned 1:1 and memory is more or
> > > less just partitioned.  Not to mention the complexity for trying to support all
> > > potential use cases is likely quite a bit higher.
> > >
> > > And if the use case is specifically for slice-of-hardware, hard pinned/partitioned
> > > VMs, does it matter if the host's view of APERF/MPERF is not accurately captured
> > > at all times?  Outside of maybe a few CPUs running bookkeeping tasks, the only
> > > workloads running on CPUs should be vCPUs.  It's not clear to me that observing
> > > the guest utilization is outright wrong in that case.
> >
> > My understanding is that Google Cloud customers have been asking for this
> > feature for all manner of VM families for years, and most of those VM
> > families are not slice-of-hardware, since we just launched our first such
> > offering a few months ago.
>
> But do you actually want to expose APERF/MPERF to those VMs?  With my upstream
> hat on, what someone's customers are asking for isn't relevant.  What's relevant
> is what that someone wants to deliver/enable.
>
> > > One idea for supporting APERF/MPERF in KVM would be to add a kernel param to
> > > disable/hide APERF/MPERF from the host, and then let KVM virtualize/passthrough
> > > APERF/MPERF if and only if the feature is supported in hardware, but hidden from
> > > the kernel.  I.e. let the system admin gift APERF/MPERF to KVM.
> >
> > Part of our goal has been to enable guest APERF/MPERF without impacting the
> > use of host APERF/MPERF, since one of the first things our support teams look
> > at in response to a performance complaint is the effective frequencies of the
> > CPUs as reported on the host.
>
> But is looking at the host's view even useful if (a) the only thing running on
> those CPUs is a single vCPU, and (b) host userspace only sees the effective
> frequencies when _host_ code is running?  Getting the effective frequency for
> when the userspace VMM is processing emulated I/O probably isn't going to be all
> that helpful.
>
> And gifting APERF/MPERF to VMs doesn't have to mean the host can't read the MSRs,
> e.g. via turbostat.  It just means the kernel won't use APERF/MPERF for scheduling
> decisions or any other behaviors that rely on an accurate host view.
>
> > I can explain all of this in excruciating detail, but I'm not really
> > motivated by your initial response, which honestly seems a bit hostile.
>
> Probably because this series made me a bit grumpy :-)  As presented, this feels
> way, way too much like KVM's existing PMU "virtualization".  Mostly works if you
> stare at it just so, but devoid of details on why X was done instead of Y, and
> seemingly ignores multiple edge cases.

Ah, I can understand your feelings :) In the existing vPMU
implementation, the guest counter value is really hard to fetch
because part of it always lives in the perf subsystem. With
aperfmperf, by contrast, the guest value is always in one place while
code is within the KVM loop. We pass guest RDMSR of APERF/MPERF
through to hardware. Writes need some adjustment, but the adjustment
is to the host-level offset, not the guest value, and maintaining that
offset is quite simple math.

>
> I'm not saying you and Mingwei haven't thought about edge cases and design
> tradeoffs, but nothing in the cover letter, changelogs, comments (none), or
> testcases (also none) communicates those thoughts to others.
>
> > At least you looked at the code, which is a far warmer reception than I
> > usually get.
Jim Mattson Dec. 4, 2024, 12:30 p.m. UTC | #6
Here is the sordid history behind the proposed APERFMPERF implementation.

First of all, I have never considered this feature for only
"slice-of-hardware" VMs, for two reasons:
(1) The original feature request was first opened in 2015, six months
before I joined Google, and long before Google Cloud had anything
remotely resembling a slice-of-hardware offering.
(2) One could argue that Google Cloud still has no slice-of-hardware
offering today.
Hence, an implementation that only works for "slice-of-hardware" is
essentially a science fair project. We might learn a lot, but there is
no ROI.

I dragged my feet on this for a long time, because
(1) Without actual guest C-state control, it seems essentially
pointless (though I probably didn't give sufficient weight to the warm
fuzzy feeling it might give customers).
(2) It's one of those things that's impossible to virtualize with
precision, and I can be a real pedant sometimes.
(3) I didn't want to expose a power side-channel that could be used to
spy on other tenants.

In 2019, Google Cloud launched the C2 VM family, with MWAIT-exiting
disabled for whole socket shapes. Though MWAIT hints aren't full
C-state control, and we still had the 1 kHz host timer tick that would
probably keep the guest out of deep C-states, my first objection
started to collapse. As I softened in my old age, the second objection
seemed indefensible, especially after I finally caved on nested posted
interrupt processing, which truly is unvirtualizable. But, especially
after the whole Meltdown/Spectre debacle, I was holding firm to my
third objection, despite counter-arguments that the same information
could be obtained without APERFMPERF. I guess I'm still smarting from
being proven completely wrong about RowHammer.

Finally, in December 2021, I thought I had a reasonable solution. We
could implement APERFMPERF in userspace, and the low fidelity would
make me feel comfortable about my third objection. "How would
userspace get this information," you may ask. Well, Google Cloud has
been carrying local patches to log guest {APERF, MPERF, TSC} deltas
since Ben Serebrin added it in 2017. Though the design document only
stipulated that the MSRs should be sampled at VMRUN entry and exit,
the original code actually sampled at VM-entry and VM-exit, with a
limitation of sampling at those additional points only if 500
microseconds had elapsed since the last samples were taken. Ben
calculated the effective frequency at each sample point to populate a
histogram, but that's not really relevant to APERFMPERF
virtualization. I just mention it to explain why those
VM-entry/VM-exit sample points were there. This code accounted for
everything between vcpu_load() and vcpu_put() when accumulating
"guest" APERF and MPERF, and this data eventually formed the basis of
our userspace implementation of APERFMPERF virtualization.

In 2022, Riley Gamson implemented APERFMPERF virtualization in
userspace, using KVM_X86_SET_MSR_FILTER to intercept guest accesses to
the MSRs, and using Ben's "turbostat" data to derive the values to be
returned. The APERF delta could be used as-is, but I was insistent
that MPERF needed to track guest TSC cycles while the vCPU was not
halted. My reasoning was this:
(1) The specification says so. Okay; it actually says that MPERF
"[i]ncrements at fixed interval (relative to TSC freq.) when the
logical processor is in C0," but even turbostat makes the
architecturally prohibited assumption that MPERF and TSC tick at the
same frequency.
(2) It would be disingenuous to claim the effective frequency *while
running* for a duty-cycle limited f1-micro or g1-small VM, or for
overcommitted VMs that are forced to timeshare with other tenants.

APERF is admittedly tricky to virtualize. For instance, how many
virtual "core clock counts at the coordinated clock frequency" should
we count while KVM is emulating CPUID? That's unclear. We're certainly
not trying to *emulate* APERF, so the number of cycles the physical
CPU takes to execute the instruction isn't relevant. The virtual CPU
just happens to take a few thousand cycles to execute CPUID. Consider
it a quirk. Similarly, in the case of zswap, some memory accesses take
a *really* long time. Or consider KVM time as the equivalent of SMM
time on physical hardware. SMM cycles are accumulated by APERF. It may
seem like a memory access just took 60 *milliseconds*, but most of
that time was spent in SMM. (That's taken from a recent real-world
example.) As much as I hate SMM, it provides a convenient rug to sweep
virtualization holes under.

At this point, I should note that Aaron Lewis eliminated the
rate-limited "turbostat" sampling at VM-entry/VM-exit early this year,
because it was too expensive. Admittedly, most of the cost was
attributed to reading MSR_C6_CORE_RESIDENCY, which Drew Schmitt added
to Ben's sampling in 2018. But this did factor into my thinking
regarding cost.

The target intercept was the C3 VM family, which is not
"slice-of-hardware," and, ironically, does not disable MWAIT-exiting
even for full socket shapes (because we realized after launching C2
that that was a huge mistake). However, the userspace approach was
abandoned before C3 launch, because of performance issues. You may
laugh, but when we advertised APERFMPERF to Linux guests, we were
surprised to find that every vCPU started sampling these MSRs every
timer tick. I still haven't looked into why. I'm assuming it has
something to do with a perception of "fairness" in scheduling, and I
just hope that it doesn't give power-hungry instruction mixes like
AVX-512 and AMX an even greater fraction of CPU time because their
effective frequency is low. In any case, we were seeing 10% to 16%
performance degradations when APERFMPERF was advertised to Linux
guests, and that was a non-starter. If that seems excessive, it is. A
large part of this is due to contention for an unfortunate exclusive
lock on the legacy PIC that our userspace grabs and releases for each
KVM_RUN ioctl. That could be fixed with a reader/writer lock, but the
point is that we were seeing KVM exits at a much higher rate than ever
before. I accept full responsibility for this debacle. I thought maybe
these MSRs would get sampled once a second while running turbostat. I
had no idea that the Linux kernel was so enamored of these MSRs.

Just doing a back-of-the-envelope calculation based on a 250 Hz guest
tick and 50000 cycles for a KVM exit, this implementation was going to
cost 1% or more in guest performance. We certainly couldn't enable it
by default, but maybe we could enable it for the specific customers
who had been clamoring for the feature. However, when I asked Product
Management how much performance customers were willing to trade for
this feature, the answer was "none."
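
Spelled out, the estimate looks roughly like this. The 250 Hz tick and
50000 cycles per exit are from the paragraph above; two intercepted reads
per tick (one each for APERF and MPERF) and the 2.5 GHz clock are
assumptions added for illustration.

```python
def exit_overhead_fraction(tick_hz, reads_per_tick, cycles_per_exit, cpu_hz):
    """Fraction of CPU cycles spent servicing intercepted MSR reads."""
    exits_per_second = tick_hz * reads_per_tick
    return exits_per_second * cycles_per_exit / cpu_hz


overhead = exit_overhead_fraction(
    tick_hz=250,             # guest scheduler tick
    reads_per_tick=2,        # assumed: one APERF read and one MPERF read
    cycles_per_exit=50_000,  # exit to userspace and back
    cpu_hz=2.5e9,            # assumed clock rate
)
print(f"{overhead:.2%}")
```

A slower clock or a higher guest tick rate pushes the figure above 1%.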

Okay. How do we achieve that? The obvious approach is to intercept
reads of these MSRs and do some math in KVM. I found that really
unpalatable, though. For most of our VM families, the dominant source
of consistent background VM-exits is the host timer tick. The second
highest source is the guest timer tick. With the host timer tick
finally going away on the C4 VM family, the guest timer tick now
dominates. On Intel parts, where we take advantage of hardware EOI
virtualization, we now have two VM-exits per guest timer tick (one for
writing the x2APIC initial count MSR, and one for the VMX-preemption
timer). I couldn't defend doubling that with intercepted reads of
APERF and MPERF.

Well, what about the simple hack of passing through the host values? I
had considered this, despite the fact that it would only work for
slice-of-hardware. I even coerced Josh Don into "fixing" our scheduler
so that it wouldn't allow two vCPU threads (a virtual hyperthread
pair) to flip-flop between hyperthreads on their assigned physical
core. However, I eventually dismissed this as
(1) too much of a hack
(2) broken with live migration
(3) disingenuous when multiple tasks are running on the logical processor.

Yes, (3) does happen, even with our C4 VM family. During copyless
migration, two vCPU threads share a logical processor. During live
migration, I believe the live migration threads compete with vCPU
threads. And there is still at least one kworker thread competing for
cycles.

Actually writing the guest values into the MSRs was initially
abhorrent to me, because of the inherent lossage on every write. But,
I eventually got over it, and partially assuaged my revulsion by
writing the MSRs infrequently. I would much have preferred APERF and
MPERF equivalents of IA32_TSC_ADJUST, but I don't have the patience to
wait for the CPU vendors. BTW, as an aside, just how is AMD's scaling
of MPERF by the TSC_RATIO MSR even remotely useful without an offset?

One requirement I refuse to budge on is that host usage of APERFMPERF
must continue to work exactly as before, modulo some very small loss
of precision. Most of the damage could be contained within KVM, if
you're willing to accept the constraint that these MSRs cannot be
accessed within an NMI handler (on Intel CPUs), but then you have to
swap the guest and host values every VM-entry/VM-exit. This approach
increases both the performance overhead (for which our budget is
"none") and the loss of precision over the current approach. Given the
amount of whining on this list over writing just one MSR on every
VM-entry/VM-exit (yes, IA32_SPEC_CTRL, I'm looking at you), I didn't
think it would be very popular to add two. And, to be honest, I
remembered that rate-limited *reads* of the "turbostat" MSRs were too
expensive, but I had forgotten that the real culprit there was the
egregiously slow MSR_C6_CORE_RESIDENCY.

I do recall the hullabaloo regarding KVM's usurpation of IA32_TSC_AUX,
an affront that went unnoticed until the advent of RDPID. Honestly,
that's where I expected pushback on this series. Initially, I tried to
keep the changes outwith KVM to the minimum possible, replacing only
explicit reads of APERF or MPERF with the new accessor functions. I
wasn't going to touch the /dev/cpu/*/msr/* interface. After all, none
of the other KVM userspace return MSRs do anything special there. But,
then I discovered that turbostat on the host uses that interface, and
I really wanted that tool to continue to work as expected. So, the
rdmsr crosscalls picked up an ugly wart. Frankly, that was the
specific patch that I expected to unleash vitriol. As an aside, why
does turbostat have to use that interface for its own independent
reads of these MSRs, when the kernel is already reading them every
scheduler tick?!? Sure, for tickless kernels, maybe, but I digress.

Wherever the context-switching happens, I contend that there is no
"clean" virtualization of APERF. If it comes down to just a question
of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some
performance numbers and try to come to a consensus, but if you're
fundamentally opposed to virtualizing APERF, because it's messy, then
I don't see any point in pursuing this further.

Thanks,

--jim
Nikunj A Dadhania Dec. 5, 2024, 8:59 a.m. UTC | #7
On 11/22/2024 12:22 AM, Mingwei Zhang wrote:
> Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> (250 Hz by default) to measure their effective CPU frequency. To avoid
> the overhead of intercepting these frequent MSR reads, allow the guest
> to read them directly by loading guest values into the hardware MSRs.
> 
> These MSRs are continuously running counters whose values must be
> carefully tracked during all vCPU state transitions:
> - Guest IA32_APERF advances only during guest execution
> - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
>   in C0 state, even when not actively running

Any particular reason to treat APERF and MPERF differently?

AFAIU, APERF and MPERF architecturally count whenever the CPU is in C0 state,
MPERF at a constant frequency and APERF at a variable frequency. Shouldn't we
treat APERF and MPERF equally and keep counting in C0 state, even when "not
actively running"?

Can you clarify what you mean by "not actively running"?

Regards
Nikunj
Jim Mattson Dec. 5, 2024, 1:48 p.m. UTC | #8
On Thu, Dec 5, 2024 at 1:00 AM Nikunj A Dadhania <nikunj@amd.com> wrote:
>
> On 11/22/2024 12:22 AM, Mingwei Zhang wrote:
> > Linux guests read IA32_APERF and IA32_MPERF on every scheduler tick
> > (250 Hz by default) to measure their effective CPU frequency. To avoid
> > the overhead of intercepting these frequent MSR reads, allow the guest
> > to read them directly by loading guest values into the hardware MSRs.
> >
> > These MSRs are continuously running counters whose values must be
> > carefully tracked during all vCPU state transitions:
> > - Guest IA32_APERF advances only during guest execution
> > - Guest IA32_MPERF advances at the TSC frequency whenever the vCPU is
> >   in C0 state, even when not actively running
>
> Any particular reason to treat APERF and MPERF differently?

Core cycles accumulated by the logical processor that do not
contribute to the execution of the virtual processor should not be
counted. For example, consider Google Cloud's e2-small VM type, which
is capped at a 25% duty cycle. Even if the logical processor is
humming along at an effective frequency of 3.6 GHz, an e2-small vCPU
task is only resident 25% of the time, so its effective frequency is
more like 0.9 GHz (over a sufficiently large period of time).
Similarly, if a logical processor running at 3.6 GHz is shared 50/50
by two vCPUs, the effective frequency of each is about 1.8 GHz (again,
over a sufficiently large period of time). Over smaller time periods,
the effective frequencies in these examples would look like square
waves, alternating between 3.6 GHz and 0, much like thermal
throttling. And, much like thermal throttling, MPERF reference cycles
continue to tick on at the fixed reference frequency, even when APERF
cycles drop to 0.
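
To make the arithmetic concrete: effective frequency is conventionally
derived as the reference frequency times delta-APERF over delta-MPERF.
The sketch below assumes, for simplicity, that the reference (MPERF)
frequency equals the 3.6 GHz the core runs at while the vCPU is resident.
The function name and the one-second window are illustrative only.

```python
def effective_hz(ref_hz, core_hz, resident_fraction, window_s=1.0):
    """Effective frequency seen by a vCPU over a window.

    MPERF ticks at ref_hz for the whole window (the vCPU stays in C0);
    APERF ticks at core_hz only while the vCPU is resident.
    """
    delta_mperf = ref_hz * window_s
    delta_aperf = core_hz * window_s * resident_fraction
    return ref_hz * delta_aperf / delta_mperf


GHZ = 1e9
capped = effective_hz(3.6 * GHZ, 3.6 * GHZ, resident_fraction=0.25)
shared = effective_hz(3.6 * GHZ, 3.6 * GHZ, resident_fraction=0.50)
print(capped / GHZ, shared / GHZ)
```

The two results correspond to the 25% duty-cycle case and the 50/50
sharing case described above.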

> AFAIU, APERF and MPERF architecturally count whenever the CPU is in C0 state,
> MPERF at a constant frequency and APERF at a variable frequency. Shouldn't we
> treat APERF and MPERF equally and keep counting in C0 state, even when "not
> actively running"?
>
> Can you clarify what you mean by "not actively running"?

The current implementation considers the vCPU to be actively running
if the task is in the KVM_RUN ioctl, between vcpu_load() and
vcpu_put(). This also implies that the task itself is currently
running on a logical processor, since there is a vcpu_put() on
sched_out and a vcpu_load() on sched_in. As Sean points out, this is
only an approximation, since (a) such things as I/O completion in
userspace are not counted, and (b) such things as uncompressing a
zswapped page that happen in the vCPU task are counted.

> Regards
> Nikunj
>
Sean Christopherson Dec. 6, 2024, 4:34 p.m. UTC | #9
On Wed, Dec 04, 2024, Jim Mattson wrote:
> Wherever the context-switching happens, I contend that there is no
> "clean" virtualization of APERF. If it comes down to just a question
> of VM-entry/VM-exit or vcpu_load()/vcpu_put(), we can collect some
> performance numbers and try to come to a consensus, but if you're
> fundamentally opposed to virtualizing APERF, because it's messy, then
> I don't see any point in pursuing this further.

I'm not fundamentally opposed to virtualizing the feature.  My complaints with
the series are that it doesn't provide sufficient information to make it feasible
for reviewers to provide useful feedback.  The history you provided is a great
start, but that's still largely just background information.  For a feature as
messy and subjective as APERF/MPERF, I think we need at least the following:

  1. What use cases are being targeted (e.g. because targeting only SoH would
     allow for a different implementation).
  2. The exact requirements, especially with respect to host usage.  And the
     motivation behind those requirements.
  3. The high level design choices, and what, if any, alternatives were considered.
  4. Basic rules of thumb for what is/isn't accounted in APERF/MPERF, so that it's
     feasible to actually maintain support long-term.

E.g. does the host need to retain access to APERF/MPERF at all times?  If so, why?
Do we care about host kernel accesses, e.g. in the scheduler, or just userspace
accesses, e.g. turbostat?

What information is the host intended to see?  E.g. should APERF and MPERF stop
when transitioning to the guest?  If not, what are the intended semantics for the
host's view when running VMs with HLT-exiting disabled?  If the host's view of
APERF and MPERF account guest time, how does that mesh with upcoming mediated PMU,
where the host is disallowed from observing the guest?

Is there a plan for supporting VMs with a different TSC frequency than the host?
How will live migration work, without generating too much slop/skew between MPERF
and GUEST_TSC?

I don't expect the series to answer every possible question upfront, but the RFC
provided _nothing_, just a "here's what we implemented, please review".
Sean Christopherson Jan. 13, 2025, 7:15 p.m. UTC | #10
On Wed, Dec 18, 2024, Jim Mattson wrote:
> On Fri, Dec 6, 2024 at 8:34 AM Sean Christopherson <seanjc@google.com> wrote:
> As we discussed off-list, it appears that the primary motivation for
> this change was to minimize the crosscalls executed when examining
> /proc/cpuinfo. I don't really think that use case justifies reading
> these MSRs *every scheduler tick*, but I'm admittedly biased.

Heh, yeah, we missed that boat by ~2 years.  Or maybe KVM's "slow" emulation
would only have further angered the x86 maintainers :-)

> 1. Guest Requirements
> 
> Unlike vPMU, which is primarily a development tool, our customers want
> APERFMPERF enabled on their production VMs, and they are unwilling to
> trade any amount of performance for the feature. They don't want
> frequency-invariant scheduling; they just want to observe the
> effective frequency (possibly via /proc/cpuinfo).
> 
> These requests are not limited to slice-of-hardware VMs. No one can
> tell me what customers expect with respect to KVM "steal time," but it
> seems to me that it would be disingenuous to ignore "steal time." By
> analogy with HDC, the effective frequency should drop to zero when the
> vCPU is "forced idle."
> 
> 2. Host Requirements
> 
> The host needs reliable APERF/MPERF access for:
> - Frequency-invariant scheduling
> - Monitoring through /proc/cpuinfo
> - Turbostat, maybe?
> 
> Our goal was for host APERFMPERF to work as it always has, counting
> both host cycles and guest cycles. We lose cycles on every WRMSR, but
> most of the time, the loss should be very small relative to the
> measurement.
> 
> To be honest, we had not even considered breaking APERF/MPERF on the
> host. We didn't think such an approach would have any chance of
> upstream acceptance.

FWIW, my stance on gifting features to KVM guests is that it's a-ok so long as it
requires an explicit opt-in from the system admin, and that it's decoupled from
KVM.  E.g. add a flag (or KConfig) to disable APERF/MPERF usage, at which point
there's no good reason to prevent KVM from virtualizing the feature.

Unfortunately, my idea of hiding a feature from the kernel has never panned out,
because apparently there's no feature that Linux can't squeeze some amount of
usefulness out of.  :-)

> 3. Design Choices
> 
> We evaluated three approaches:
> 
> a) Userspace virtualization via MSR filtering
> 
>    This approach was implemented before we knew about
>    frequency-invariant scheduling. Because of the frequent guest
>    reads, we observed a 10-16% performance hit, depending on vCPU
>    count. The performance impact was exacerbated by contention for a
>    legacy PIC mutex on KVM_RUN, but even if the mutex were replaced
>    with a reader/writer lock, the performance impact would be too
>    high. Hence, we abandoned this approach.
> 
> b) KVM intercepts RDMSR of APERF/MPERF
> 
>    This approach was ruled out by back-of-the-envelope
>    calculation. We're not going to be able to provide this feature for
>    free, but we could argue that 0.01% overhead is negligible. On a 2
>    GHz processor that gives us a budget of 200,000 cycles per
>    second. With a 250 Hz guest tick generating 500 RDMSR intercepts
>    per second, we have a budget of just 400 cycles per
>    intercept. That's likely to be insufficient for most platforms. A
>    guest with CONFIG_HZ_1000 would drop the budget to just 100 cycles
>    per intercept. That's unachievable.

I think we'd actually have a bit more headroom.  The overhead would be relative
to bare metal, not absolute.  RDMSR is typically ~80 cycles, so even if we are
super duper strict in how that 0.01% overhead is accounted, KVM would have more
like 150+ cycles?  But I'm mostly just being pedantic, I'm pretty sure AMD CPUs
can't achieve 400 cycle roundtrips, i.e. hardware alone would exhaust the budget.
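
For reference, the budget arithmetic in (b) and the bare-metal adjustment
above can be reproduced as follows. The figures (0.01%, 2 GHz, 250 Hz and
1000 Hz ticks, ~80-cycle RDMSR) come from the thread; two RDMSRs per tick
and the function name are assumptions.

```python
def cycles_per_intercept(cpu_hz, overhead_fraction, tick_hz, reads_per_tick=2):
    """Cycle budget per intercepted RDMSR for a given overhead target."""
    budget_per_second = cpu_hz * overhead_fraction
    intercepts_per_second = tick_hz * reads_per_tick
    return budget_per_second / intercepts_per_second


CPU_HZ = 2_000_000_000
TARGET = 0.0001            # 0.01%
NATIVE_RDMSR_CYCLES = 80   # rough bare-metal cost quoted above

for tick_hz in (250, 1000):
    budget = cycles_per_intercept(CPU_HZ, TARGET, tick_hz)
    # If overhead is measured relative to bare metal, the native RDMSR
    # cost does not count against the budget.
    print(tick_hz, round(budget), round(budget + NATIVE_RDMSR_CYCLES))
```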

>    We should have a discussion about just how much overhead is
>    negligible, and that may open the door to other implementation
>    options.
> 
> c) Current RDMSR pass-through approach
> 
>    The biggest downside is the loss of cycles on every WRMSR. An NMI
>    or SMI in the critical region could result in millions of lost
>    cycles. However, the damage only persists until all in-progress
>    measurements are completed.

FWIW, the NMI problem is solvable, e.g. by bumping a sequence counter if the CPU
takes an NMI in the critical section, and then retrying until there are no NMIs
(or maybe retry a very limited number of times to avoid creating a set of problems
that could be worse than the loss in accuracy).
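
A minimal sketch of that retry idea, written as a Python simulation
rather than kernel code. Everything here is hypothetical: `nmi_seq`
stands in for a per-CPU counter that an NMI handler would bump,
`critical_section` for the read-modify-write of the MSR, and the retry
limit of 3 is arbitrary.

```python
class Cpu:
    def __init__(self):
        self.nmi_seq = 0  # bumped by the (simulated) NMI handler

    def nmi(self):
        self.nmi_seq += 1


def run_without_nmi(cpu, critical_section, max_retries=3):
    """Run critical_section, retrying if an NMI landed inside it.

    Returns True if some attempt completed with no NMI observed, False if
    the retry limit was hit and the caller must accept the lost precision.
    """
    for _ in range(max_retries):
        seq = cpu.nmi_seq
        critical_section()
        if cpu.nmi_seq == seq:
            return True
    return False


cpu = Cpu()
attempts = []


def swap_msrs():
    # Simulate an NMI arriving during the first attempt only.
    attempts.append(len(attempts))
    if len(attempts) == 1:
        cpu.nmi()


clean = run_without_nmi(cpu, swap_msrs)
```

For this to work, the critical section has to be safe to repeat, e.g. by
re-reading the counter and rewriting it on each attempt.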

>    We had considered context-switching host and guest values on
>    VM-entry and VM-exit. This would have kept everything within KVM,
>    as long as the host doesn't access the MSRs during an NMI or
>    SMI. However, 4 additional RDMSRs and 4 additional WRMSRs on a
>    VM-enter/VM-exit round-trip would have blown the budget. Even
>    without APERFMPERF, an active guest vCPU takes a minimum of two
>    VM-exits per timer tick, so we have even less budget per
>    VM-enter/VM-exit round-trip than we had per RDMSR intercept in (b).
> 
>    Internally, we have already moved the mediated vPMU context-switch
>    from VM-entry/VM-exit to the KVM_RUN loop boundaries, so it seemed
>    natural to do the same for APERFMPERF. I don't have a
>    back-of-the-envelope calculation for this overhead, but I have run
>    Virtuozzo's cpuid_rate benchmark in a guest with and without
>    APERFMPERF, 100 times for each configuration, and a Student's
>    t-test showed that there is no statistically significant difference
>    between the means of the two datasets.
> 
> 4. APERF/MPERF Accounting
> 
>    Virtual MPERF cycles are easy to define. They accumulate at the
>    virtual TSC frequency as long as the vCPU is in C0. There are only
>    a few ways the vCPU can leave C0. If HLT or MWAIT exiting is
>    disabled, then the vCPU can leave C0 in VMX non-root operation (or
>    AMD guest mode). If HLT exiting is not disabled, then the vCPU will
>    leave C0 when a HLT instruction is intercepted, and it will reenter
>    C0 when it receives an interrupt (or a PV kick) and starts running
>    again.
> 
>    Virtual APERF cycles are more ambiguous, especially in VMX root
>    operation (or AMD host mode). I think we can all agree that they
>    should accumulate at some non-zero rate as long as the code being
>    executed on the logical processor contributes in some way to guest
>    vCPU progress, but should the virtual APERF accumulate cycles at
>    the same frequency as the physical APERF? Probably not. Ultimately,
>    the decision was pragmatic. Virtual APERF accumulates at the same
>    rate as physical APERF while the guest context is live in the
>    MSR. Doing anything else would have been too expensive.
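
The accounting rules in this section can be summarized with a small toy
model. This is a sketch of the described semantics, not of the
implementation: the three states, their names, and the frequencies below
are assumptions chosen for illustration.

```python
from enum import Enum


class VcpuState(Enum):
    LOADED = "loaded"        # guest context live in the MSRs
    NOT_LOADED = "unloaded"  # in C0 but preempted or in userspace
    HALTED = "halted"        # vCPU has left C0


def account(timeline, tsc_hz, core_hz):
    """Accumulate virtual (APERF, MPERF) over (state, seconds) segments."""
    aperf = mperf = 0
    for state, seconds in timeline:
        if state is not VcpuState.HALTED:
            mperf += int(tsc_hz * seconds)   # MPERF: any time in C0
        if state is VcpuState.LOADED:
            aperf += int(core_hz * seconds)  # APERF: only while loaded
    return aperf, mperf


timeline = [
    (VcpuState.LOADED, 0.25),
    (VcpuState.NOT_LOADED, 0.25),
    (VcpuState.HALTED, 0.50),
]
aperf, mperf = account(timeline, tsc_hz=2_000_000_000, core_hz=3_000_000_000)
```

With this timeline, MPERF covers the half second spent in C0 while APERF
covers only the quarter second in which the guest context was loaded.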

Hmm, I'm ok stopping virtual APERF while the vCPU task is in userspace, and the
more I poke at it, the more I agree it's the only sane approach.  However, I most
definitely want to document the various gotchas with the alternative.

At first glance, keeping KVM's preempt notifier registered on exits to userspace
would be very doable, but there are lurking complexities that make it very
unpalatable when digging deeper.  E.g. handling the case where userspace
invokes KVM_RUN on a different task+CPU would likely require a per-CPU spinlock,
which is all kinds of gross.  And userspace would need a way to disassociate a
task from a vCPU.

Maybe this would be a good candidate for Paolo's idea of using the merge commit
to capture information that doesn't belong in Documentation, but that is too
specific/detailed for a single commit's changelog.

> 5. Live Migration
> 
>    The IA32_MPERF MSR is serialized independently of the
>    IA32_TIME_STAMP_COUNTER MSR. Yes, this means that the two MSRs do
>    not advance in lock step across live migration, but this is no
>    different from a general purpose vPMU counter programmed to count
>    "unhalted reference cycles." In general, our implementation of
>    guest IA32_MPERF is far superior to the vPMU implementation of
>    "unhalted reference cycles."

Aha!  The SDM gives us an out:

  Only the IA32_APERF/IA32_MPERF ratio is architecturally defined; software should
  not attach meaning to the content of the individual of IA32_APERF or IA32_MPERF
  MSRs.

While the SDM kinda sorta implies that MPERF and TSC will operate in lock-step,
the above gives me confidence that some amount of drift is tolerable.

Off-list you floated the idea of tying save/restore to TSC as an offset, but I
think that's unnecessary complexity on two fronts.  First, the writes to TSC and
MPERF must happen separately, so even if KVM does back-to-back WRMSRs, some amount
of drift is inevitable.  Second, because virtual TSC doesn't stop on vcpu_{load,put},
there will be non-trivial drift irrespective of migration (and it might even be
worse?).

> 6. Guest TSC Scaling
> 
>    It is not possible to support TSC scaling with IA32_MPERF
>    RDMSR-passthrough on Intel CPUs, because reads of IA32_MPERF in VMX
>    non-root operation are not scaled by the hardware. It is possible
>    to support TSC scaling with IA32_MPERF RDMSR-passthrough on AMD
>    CPUs, but the implementation is left as an exercise for the reader.

So, what's the proposed solution?  Either the limitation needs to be documented
as a KVM erratum, or KVM needs to actively prevent APERF/MPERF virtualization if
TSC scaling is in effect.  I can't think of a third option off the top of my
head.

I'm not sure how I feel about taking an erratum for this one.  The SDM explicitly
states, in multiple places, that MPERF counts at a fixed frequency, e.g.

  IA32_MPERF MSR (E7H) increments in proportion to a fixed frequency, which is
  configured when the processor is booted.

Drift between TSC and MPERF is one thing, having MPERF suddenly count at a
different frequency is problematic on a different level.