[1/4] KVM: VMX: read the PML log in the same order as it was written

Message ID 20241211193706.469817-2-mlevitsk@redhat.com
State New
Series KVM: selftests: dirty_log_test: fixes for running the test nested

Commit Message

Maxim Levitsky Dec. 11, 2024, 7:37 p.m. UTC
X86 spec specifies that the CPU writes to the PML log 'backwards'
or in other words, it first writes entry 511, then entry 510 and so on,
until it writes entry 0, after which the 'PML log full' VM exit happens.

I also confirmed on the bare metal that the CPU indeed writes the entries
in this order.

KVM on the other hand, reads the entries in the opposite order, from the
last written entry and towards entry 511 and dumps them in this order to
the dirty ring.
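
A minimal userspace model of the two orders described above, as a sketch only.
The names (PML_NR_ENTRIES, pml_cpu_log, pml_read_newest_first,
pml_read_oldest_first) are illustrative and are not KVM's, and the model
assumes the log is not full, i.e. the index is still within 0..511:

```c
#include <assert.h>
#include <stdint.h>

#define PML_NR_ENTRIES	512			/* illustrative name */
#define PML_HEAD_INDEX	(PML_NR_ENTRIES - 1)	/* first entry the CPU fills */

/* CPU side: store the GPA at the current index, then decrement the index. */
void pml_cpu_log(uint64_t *buf, uint16_t *idx, uint64_t gpa)
{
	assert(*idx < PML_NR_ENTRIES);
	buf[*idx] = gpa;
	(*idx)--;
}

/* Pre-patch order: start at the newest entry and walk up to entry 511. */
int pml_read_newest_first(const uint64_t *buf, uint16_t idx, uint64_t *out)
{
	int n = 0;

	for (int i = idx + 1; i <= PML_HEAD_INDEX; i++)
		out[n++] = buf[i];
	return n;
}

/* Post-patch order: start at entry 511 and walk down to the newest entry. */
int pml_read_oldest_first(const uint64_t *buf, uint16_t idx, uint64_t *out)
{
	int n = 0;

	for (int i = PML_HEAD_INDEX; i > idx; i--)
		out[n++] = buf[i];
	return n;
}
```

After three logged writes the index is 508; the first reader returns the
entries newest first, the second returns them in the order they were written.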

Usually this doesn't matter, except for one complex nesting case:

KVM retries the instructions that cause MMU faults.
This might cause an emulated PML log entry to be visible to L1's hypervisor
before the actual memory write was committed.

This happens when the L0 MMU fault is followed directly by the VM exit to
L1, for example due to a pending L1 interrupt or due to the L1's
'PML log full' event.

This problem doesn't have a noticeable real-world impact because this
write retry is not much different from the guest writing to the same page
multiple times, which is also not reflected in the dirty log. The users of
the dirty logging only rely on correct reporting of the clean pages, or
in other words they assume that if a page is clean, then no writes were
committed to it since the moment it was marked clean.

However KVM has a kvm_dirty_log_test selftest, a test that tests both
the clean and the dirty pages vs the memory contents, and can fail if it
detects a dirty page which has an old value at the offset 0 which the test
writes.

To avoid failure, the test has a workaround for this specific problem:

The test skips checking memory that belongs to the last dirty ring entry,
which it has seen, relying on the fact that as long as memory writes are
committed in-order, only the last entry can belong to a not yet committed
memory write.

However, since L1's KVM reads the PML log in the opposite direction to the one
in which L0 wrote it, the last dirty ring entry often will not be the last
entry written by L0.

To fix this, switch the order in which KVM reads the PML log.
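
To illustrate why the order matters for the test's workaround, here is a
hedged sketch of a "skip the last ring entry" check. ring_verify_skip_last()
and the page_committed[] array are hypothetical stand-ins, not the selftest's
actual code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the check: every page named in the dirty ring must
 * already hold the new value, except the page named by the last ring entry,
 * whose write may not have been committed yet.
 */
bool ring_verify_skip_last(const uint64_t *ring, size_t nr,
			   const bool *page_committed)
{
	assert(nr > 0);
	for (size_t i = 0; i + 1 < nr; i++) {
		if (!page_committed[ring[i]])
			return false;	/* dirty page still has the old value */
	}
	return true;
}
```

With pages 0, 1, 2 written in that order and only the write to page 2 still
pending, a ring of {0, 1, 2} passes, while a reversed ring of {2, 1, 0} makes
the check look at page 2 and fail.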

Note that this issue is not present on the bare metal, because on the
bare metal, an update of the A/D bits of a present entry, PML logging and
the actual memory write are all done by the CPU without any hypervisor
intervention and pending interrupt evaluation, thus once a PML log and/or
vCPU kick happens, all memory writes that are in the PML log are
committed to memory.

The only exception to this rule is when the guest hits a not present EPT
entry, in which case KVM first reads (backward) the PML log, dumps it to
the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
this to the dirty ring, thus making the entry be the last one in the
dirty ring.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
 arch/x86/kvm/vmx/vmx.h |  1 +
 2 files changed, 22 insertions(+), 11 deletions(-)

Comments

Sean Christopherson Dec. 12, 2024, 12:44 a.m. UTC | #1
On Wed, Dec 11, 2024, Maxim Levitsky wrote:
> X86 spec specifies that the CPU writes to the PML log 'backwards'

SDM, because this is Intel specific.

> or in other words, it first writes entry 511, then entry 510 and so on,
> until it writes entry 0, after which the 'PML log full' VM exit happens.
> 
> I also confirmed on the bare metal that the CPU indeed writes the entries
> in this order.
> 
> KVM on the other hand, reads the entries in the opposite order, from the
> last written entry and towards entry 511 and dumps them in this order to
> the dirty ring.
> 
> Usually this doesn't matter, except for one complex nesting case:
> 
> KVM retries the instructions that cause MMU faults.
> This might cause an emulated PML log entry to be visible to L1's hypervisor
> before the actual memory write was committed.
> 
> This happens when the L0 MMU fault is followed directly by the VM exit to
> L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full'
> event.

Hmm, this is an L0 bug.  Exiting to L1 to deliver a pending IRQ in the middle of an
instruction is a blatant architectural violation.  As discussed in the RSM =>
SHUTDOWN thread[*], fixing this would require adding a flag to note that the vCPU
needs to enter the guest before generating an exit to L1.

Oof.  It's probably worse than that.  For this case, KVM would need to ensure the
original instruction *completed*.  That would get really, really ugly.  And for
something like VSCATTER, where each write can be completed independently, trying
to do the right thing for PML would be absurdly complex.

I'm not opposed to fudging around processing the PML log in the "correct" order,
because irrespective of this bug, populating the dirty ring using the order in which
accesses occurred is probably a good idea.

But, I can't help but wonder why KVM bothers emulating PML.  I can appreciate
that avoiding exits to L1 would be beneficial, but what use case actually cares
about dirty logging performance in L1?

[*] https://lore.kernel.org/all/ZcY_GbqcFXH2pR5E@google.com

> This problem doesn't have a noticeable real-world impact because this
> write retry is not much different from the guest writing to the same page
> multiple times, which is also not reflected in the dirty log. The users of
> the dirty logging only rely on correct reporting of the clean pages, or
> in other words they assume that if a page is clean, then no writes were
> committed to it since the moment it was marked clean.
> 
> However KVM has a kvm_dirty_log_test selftest, a test that tests both
> the clean and the dirty pages vs the memory contents, and can fail if it
> detects a dirty page which has an old value at the offset 0 which the test
> writes.
> 
> To avoid failure, the test has a workaround for this specific problem:
> 
> The test skips checking memory that belongs to the last dirty ring entry,
> which it has seen, relying on the fact that as long as memory writes are
> committed in-order, only the last entry can belong to a not yet committed
> memory write.
> 
> However, since L1's KVM is reading the PML log in the opposite direction
> that L0 wrote it, the last dirty ring entry often will be not the last
> entry written by the L0.
> 
> To fix this, switch the order in which KVM reads the PML log.
> 
> Note that this issue is not present on the bare metal, because on the
> bare metal, an update of the A/D bits of a present entry, PML logging and
> the actual memory write are all done by the CPU without any hypervisor
> intervention and pending interrupt evaluation, thus once a PML log and/or
> vCPU kick happens, all memory writes that are in the PML log are
> committed to memory.
> 
> The only exception to this rule is when the guest hits a not present EPT
> entry, in which case KVM first reads (backward) the PML log, dumps it to
> the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
> this to the dirty ring, thus making the entry be the last one in the
> dirty ring.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
>  arch/x86/kvm/vmx/vmx.h |  1 +
>  2 files changed, 22 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0f008f5ef6f0..6fb946b58a75 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	u64 *pml_buf;
> -	u16 pml_idx;
> +	u16 pml_idx, pml_last_written_entry;
> +	int i;
>  
>  	pml_idx = vmcs_read16(GUEST_PML_INDEX);
>  
>  	/* Do nothing if PML buffer is empty */
> -	if (pml_idx == (PML_ENTITY_NUM - 1))
> +	if (pml_idx == PML_LAST_ENTRY)

Heh, this is mildly confusing, in that the first entry filled is actually called
the "last" entry by KVM.  And then below, pml_last_written_entry could point at
the first entry.

The best idea I can come up with is PML_HEAD_INDEX and then pml_last_written_entry
becomes pml_tail_index.  It's not a circular buffer, but I think/hope head/tail
terminology would be intuitive for most readers.

E.g. the for-loop becomes:

	for (i = PML_HEAD_INDEX; i >= pml_tail_index; i--) {
		u64 gpa;

		gpa = pml_buf[i];
		WARN_ON(gpa & (PAGE_SIZE - 1));
		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
	}

>  		return;
> +	/*
> +	 * PML index always points to the next available PML buffer entity
> +	 * unless PML log has just overflowed, in which case PML index will be

If you don't have a strong preference, I vote to do s/entity/entry and then rename
PML_ENTITY_NUM => NR_PML_ENTRIES (or maybe PML_LOG_NR_ENTRIES?).  I find the
existing "entity" terminology weird and unhelpful, and arguably wrong.

  entity - a thing with distinct and independent existence.

The things being consumed are entries in a buffer.

> +	 * 0xFFFF.
> +	 */
> +	pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
>  
> -	/* PML index always points to next available PML buffer entity */
> -	if (pml_idx >= PML_ENTITY_NUM)
> -		pml_idx = 0;
> -	else
> -		pml_idx++;
> -
> +	/*
> +	 * PML log is written backwards: the CPU first writes the entity 511
> +	 * then the entity 510, and so on, until it writes the entity 0 at which
> +	 * point the PML log full VM exit happens and the logging stops.

This is technically wrong.  The PML Full exit only occurs on the next write.
E.g. KVM could observe GUEST_PML_INDEX == -1 without ever seeing a PML Full exit.

  If the PML index is not in the range 0–511, there is a page-modification log-full
  event and a VM exit occurs. In this case, the accessed or dirty flag is not set,
  and the guest-physical access that triggered the event does not occur.
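
The index semantics in the SDM text quoted above can be modelled roughly as
follows. This is a sketch under assumptions: pml_try_log(), pml_is_empty() and
pml_tail_index() are made-up names, and the tail computation mirrors the
expression in the patch rather than any final code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PML_NR_ENTRIES	512			/* illustrative name */
#define PML_HEAD_INDEX	(PML_NR_ENTRIES - 1)

/*
 * Model of one logged write.  Returns false, without logging anything, when
 * the index is outside 0..511; that is where the log-full event would occur.
 */
bool pml_try_log(uint64_t *buf, uint16_t *idx, uint64_t gpa)
{
	if (*idx >= PML_NR_ENTRIES)
		return false;
	buf[*idx] = gpa;
	(*idx)--;		/* 0 wraps to 0xFFFF */
	return true;
}

/* The log is empty while the index still points at entry 511. */
bool pml_is_empty(uint16_t idx)
{
	return idx == PML_HEAD_INDEX;
}

/* Index of the most recently written entry; only valid for a non-empty log. */
uint16_t pml_tail_index(uint16_t idx)
{
	assert(!pml_is_empty(idx));
	return idx >= PML_NR_ENTRIES ? 0 : idx + 1;
}
```

After 512 successful writes the index reads 0xFFFF and the tail is entry 0,
yet no log-full event has happened; only the 513th attempt is refused.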
Maxim Levitsky Dec. 12, 2024, 9:37 p.m. UTC | #2
On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> On Wed, Dec 11, 2024, Maxim Levitsky wrote:
> > X86 spec specifies that the CPU writes to the PML log 'backwards'
> 
> SDM, because this is Intel specific.
True.
> 
> > or in other words, it first writes entry 511, then entry 510 and so on,
> > until it writes entry 0, after which the 'PML log full' VM exit happens.
> > 
> > I also confirmed on the bare metal that the CPU indeed writes the entries
> > in this order.
> > 
> > KVM on the other hand, reads the entries in the opposite order, from the
> > last written entry and towards entry 511 and dumps them in this order to
> > the dirty ring.
> > 
> > Usually this doesn't matter, except for one complex nesting case:
> > 
> > KVM retries the instructions that cause MMU faults.
> > This might cause an emulated PML log entry to be visible to L1's hypervisor
> > before the actual memory write was committed.
> > 
> > This happens when the L0 MMU fault is followed directly by the VM exit to
> > L1, for example due to a pending L1 interrupt or due to the L1's 'PML log full'
> > event.
> 
> Hmm, this is an L0 bug.  Exiting to L1 to deliver a pending IRQ in the middle of an
> instruction is a blatant architectural violation.  As discussed in the RSM =>
> SHUTDOWN thread[*], fixing this would require adding a flag to note that the vCPU
> needs to enter the guest before generating an exit to L1.

Agreed. Note that this is to some extent visible in the non-nested case as well; for
example, the workaround that the test has is for the non-nested case.

One can argue that the dirty ring is not an x86 feature, but I am sure that there are
other, more complicated cases in which a retried write can be observed by this or
other vCPUs in violation of the x86 spec.

> 
> Oof.  It's probably worse than that.  For this case, KVM would need to ensure the
> original instruction *completed*.  That would get really, really ugly.  And for
> something like VSCATTER, where each write can be completed independently, trying
> to do the right thing for PML would be absurdly complex.

I also agree. Instruction retry is much simpler and safer than emulating it, KVM
really can't stop doing this.


> I'm not opposed to fudging around processing the PML log in the "correct" order,
> because irrespective of this bug, populating the dirty ring using order in which
> accesses occurred is probably a good idea.
> 
> But, I can't help but wonder why KVM bothers emulating PML.  I can appreciate
> that avoiding exits to L1 would be beneficial, but what use case actually cares
> about dirty logging performance in L1?

It does help with performance by a lot and the implementation is emulated and simple.

For example this test without pml collects about 500 pages on each iteration
with default parameters, and about 2400 pages per iteration with pml 
(after the caches warm up).


> 
> [*] https://lore.kernel.org/all/ZcY_GbqcFXH2pR5E@google.com
> 
> > This problem doesn't have a noticeable real-world impact because this
> > write retry is not much different from the guest writing to the same page
> > multiple times, which is also not reflected in the dirty log. The users of
> > the dirty logging only rely on correct reporting of the clean pages, or
> > in other words they assume that if a page is clean, then no writes were
> > committed to it since the moment it was marked clean.
> > 
> > However KVM has a kvm_dirty_log_test selftest, a test that tests both
> > the clean and the dirty pages vs the memory contents, and can fail if it
> > detects a dirty page which has an old value at the offset 0 which the test
> > writes.
> > 
> > To avoid failure, the test has a workaround for this specific problem:
> > 
> > The test skips checking memory that belongs to the last dirty ring entry,
> > which it has seen, relying on the fact that as long as memory writes are
> > committed in-order, only the last entry can belong to a not yet committed
> > memory write.
> > 
> > However, since L1's KVM is reading the PML log in the opposite direction
> > that L0 wrote it, the last dirty ring entry often will be not the last
> > entry written by the L0.
> > 
> > To fix this, switch the order in which KVM reads the PML log.
> > 
> > Note that this issue is not present on the bare metal, because on the
> > bare metal, an update of the A/D bits of a present entry, PML logging and
> > the actual memory write are all done by the CPU without any hypervisor
> > intervention and pending interrupt evaluation, thus once a PML log and/or
> > vCPU kick happens, all memory writes that are in the PML log are
> > committed to memory.
> > 
> > The only exception to this rule is when the guest hits a not present EPT
> > entry, in which case KVM first reads (backward) the PML log, dumps it to
> > the dirty ring, and *then* sets up a SPTE entry with A/D bits set, and logs
> > this to the dirty ring, thus making the entry be the last one in the
> > dirty ring.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/kvm/vmx/vmx.c | 32 +++++++++++++++++++++-----------
> >  arch/x86/kvm/vmx/vmx.h |  1 +
> >  2 files changed, 22 insertions(+), 11 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 0f008f5ef6f0..6fb946b58a75 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -6211,31 +6211,41 @@ static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
> >  {
> >  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >  	u64 *pml_buf;
> > -	u16 pml_idx;
> > +	u16 pml_idx, pml_last_written_entry;
> > +	int i;
> >  
> >  	pml_idx = vmcs_read16(GUEST_PML_INDEX);
> >  
> >  	/* Do nothing if PML buffer is empty */
> > -	if (pml_idx == (PML_ENTITY_NUM - 1))
> > +	if (pml_idx == PML_LAST_ENTRY)
> 
> Heh, this is mildly confusing, in that the first entry filled is actually called
> the "last" entry by KVM.  And then below, pml_last_written_entry could point at
> the first entry.
> 
> The best idea I can come up with is PML_HEAD_INDEX and then pml_last_written_entry
> becomes pml_tail_index.  It's not a circular buffer, but I think/hope head/tail
> terminology would be intuitive for most readers.

I agree here. Your proposal does seem better to me, so I'll adopt it in v2.
> 
> E.g. the for-loop becomes:
> 
> 	for (i = PML_HEAD_INDEX; i >= pml_tail_index; i--) {
> 		u64 gpa;
> 
> 		gpa = pml_buf[i];
> 		WARN_ON(gpa & (PAGE_SIZE - 1));
> 		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
> 	}
> 
> >  		return;
> > +	/*
> > +	 * PML index always points to the next available PML buffer entity
> > +	 * unless PML log has just overflowed, in which case PML index will be
> 



> If you don't have a strong preference, I vote to do s/entity/entry and then rename
> PML_ENTITY_NUM => NR_PML_ENTRIES (or maybe PML_LOG_NR_ENTRIES?).  I find the
> existing "entity" terminology weird and unhelpful, and arguably wrong.

I don't mind renaming this.

> 
>   entity - a thing with distinct and independent existence.
> 
> The things being consumed are entries in a buffer.
> 
> > +	 * 0xFFFF.
> > +	 */
> > +	pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
> >  
> > -	/* PML index always points to next available PML buffer entity */
> > -	if (pml_idx >= PML_ENTITY_NUM)
> > -		pml_idx = 0;
> > -	else
> > -		pml_idx++;
> > -
> > +	/*
> > +	 * PML log is written backwards: the CPU first writes the entity 511
> > +	 * then the entity 510, and so on, until it writes the entity 0 at which
> > +	 * point the PML log full VM exit happens and the logging stops.
> 
> This is technically wrong.  The PML Full exit only occurs on the next write.
> E.g. KVM could observe GUEST_PML_INDEX == -1 without ever seeing a PML Full exit.

I skipped over this part of the PRM when I was reading up on how PML works.
I will drop this sentence in the next version.

> 
>   If the PML index is not in the range 0–511, there is a page-modification log-full
>   event and a VM exit occurs. In this case, the accessed or dirty flag is not set,
>   and the guest-physical access that triggered the event does not occur.
> 

Do you have any comments for the rest of the patch series? If not then I'll send
v2 of the patch series.


Best regards,
	Maxim Levitsky
Sean Christopherson Dec. 13, 2024, 6:19 a.m. UTC | #3
On Thu, Dec 12, 2024, Maxim Levitsky wrote:
> On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> > But, I can't help but wonder why KVM bothers emulating PML.  I can appreciate
> > that avoiding exits to L1 would be beneficial, but what use case actually cares
> > about dirty logging performance in L1?
> 
> It does help with performance by a lot and the implementation is emulated and simple.

Yeah, it's not a lot of complexity, but it's architecturally flawed.  And I get
that it helps with performance, I'm just stumped as to the use case for dirty
logging in a nested VM in the first place.

> Do you have any comments for the rest of the patch series? If not then I'll send
> v2 of the patch series.

*sigh*

I do.  Through no fault of your own.  I was trying to figure out a way to ensure
the vCPU made meaningful progress, versus just guaranteeing at least one write,
and stumbled onto a plethora of flaws and unnecessary complexity in the test.

Can you post this patch as a standalone v2?  I'd like to do a more aggressive
cleanup of the selftest, but I don't want to hold this up, and there's no hard
dependency.

As for the issues I encountered with the selftest:

 1. Tracking how many pages have been written for the current iteration with a
    guest side counter doesn't work without more fixes, because the test doesn't
    collect all dirty entries for the current iteration.  For the dirty ring,
    this results in a vCPU *starting* an iteration with a full dirty ring, and
    the test hangs because the guest can't make forward progress until
    log_mode_collect_dirty_pages() is called.

 2. The test presumably doesn't collect all dirty entries because of the weird
    and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the
    synchronization that comes with it.  The kick is "justified" with a comment
    saying "This makes sure that hardware PML cache flushed", but there's no
    reason to do that *if* the test collects dirty pages *after* stopping
    the vCPU.  Which is easy to do while also collecting while the vCPU is
    running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted
    wound of sorts).

 3. dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so
    every other iteration runs until the ring is full.  Testing the "soft full"
    logic is interesting, but not _that_ interesting, and filling the dirty ring
    basically ignores the "interval".  Fixing this reduces the runtime by a
    significant amount, especially on nested, at the cost of providing less
    coverage for the dirty ring with default interval in a nested VM (but if
    someone cares about testing the dirty ring soft full in a nested VM, they
    can darn well bump the interval).

 4. Fixing the test to collect all dirty entries for the current iteration
    exposes another flaw.  The bitmaps (not dirty ring) start with all bits
    set.  And so the first iteration can see "dirty" pages that have never
    been written, but only when applying your fix to limit the hack to s390.

 5. "iteration" is synched to the guest *after* the vCPU is restarted, i.e. the
    guest could see a stale iteration if the main thread is delayed.

 6. host_bmap_track and all of the weird exemptions for writes from previous
    iterations go away if all entries are collected for the current iteration
    (though a second bitmap is needed to handle the second collection; KVM's
    "get" of the bitmap clobbers the previous value).
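
For point 6, a rough sketch of what handling a second collection with a second
bitmap could look like. dirty_bitmap_accumulate() is a hypothetical helper, and
"scratch" stands in for the buffer that each "get" of the dirty log overwrites;
this is not the code from the pending series:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: fold the result of one collection into the bitmap
 * that tracks the whole iteration, so that a later collection into
 * "scratch" does not lose bits gathered by an earlier one.
 */
void dirty_bitmap_accumulate(uint64_t *iter_bmap, const uint64_t *scratch,
			     size_t nr_words)
{
	assert(iter_bmap && scratch);
	for (size_t i = 0; i < nr_words; i++)
		iter_bmap[i] |= scratch[i];
}
```

Calling the helper once per collection leaves iter_bmap holding the union of
every page reported during the iteration.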

I have everything more or less coded up, but I need to split it into patches,
write changelogs, and interleave it with your fixes.  Hopefully I'll get to that
tomorrow.
Maxim Levitsky Dec. 13, 2024, 7:56 p.m. UTC | #4
On Thu, 2024-12-12 at 22:19 -0800, Sean Christopherson wrote:
> On Thu, Dec 12, 2024, Maxim Levitsky wrote:
> > On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> > > But, I can't help but wonder why KVM bothers emulating PML.  I can appreciate
> > > that avoiding exits to L1 would be beneficial, but what use case actually cares
> > > about dirty logging performance in L1?
> > 
> > It does help with performance by a lot and the implementation is emulated and simple.
> 
> Yeah, it's not a lot of complexity, but it's architecturally flawed.  And I get
> that it helps with performance, I'm just stumped as to the use case for dirty
> logging in a nested VM in the first place.
> 
> > Do you have any comments for the rest of the patch series? If not then I'll send
> > v2 of the patch series.
> 
> *sigh*
> 
> I do.  Through no fault of your own.  I was trying to figure out a way to ensure
> the vCPU made meaningful progress, versus just guaranteeing at least one write,
> and stumbled onto a plethora of flaws and unnecessary complexity in the test.
> 
> Can you post this patch as a standalone v2?  I'd like to do a more aggressive
> cleanup of the selftest, but I don't want to hold this up, and there's no hard
> dependency.
> 
> As for the issues I encountered with the selftest:
> 
>  1. Tracing how many pages have been written for the current iteration with a
>     guest side counter doesn't work without more fixes, because the test doesn't
>     collect all dirty entries for the current iterations.  For the dirty ring,
>     this results in a vCPU *starting* an iteration with a full dirty ring, and
>     the test hangs because the guest can't make forward progress until
>     log_mode_collect_dirty_pages() is called.
> 
>  2. The test presumably doesn't collect all dirty entries because of the weird
>     and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the
>     synchronization that comes with it.  The kick is "justified" with a comment
>     saying "This makes sure that hardware PML cache flushed", but there's no
>     reason to do that *if* the test collects dirty pages *after* stopping
>     the vCPU.  Which is easy to do while also collecting while the vCPU is
>     running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted
>     wound of sorts).
> 
>  3. dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so
>     every other iteration runs until the ring is full.  Testing the "soft full"
>     logic is interesting, but not _that_ interesting, and filling the dirty ring
>     basically ignores the "interval".  Fixing this reduces the runtime by a
>     significant amount, especially on nested, at the cost of providing less
>     coverage for the dirty ring with default interval in a nested VM (but if
>     someone cares about testing the dirty ring soft full in a nested VM, they
>     can darn well bump the interval).
> 
>  4. Fixing the test to collect all dirty entries for the current iteration
>     exposes another flaw.  The bitmaps (not dirty ring) start with all bits
>     set.  And so the first iteration can see "dirty" pages that have never
>     been written, but only when applying your fix to limit the hack to s390.
> 
>  5. "iteration" is synched to the guest *after* the vCPU is restarted, i.e. the
>     guest could see a stale iteration if the main thread is delayed.
> 
>  6. host_bmap_track and all of the weird exemptions for writes from previous
>     iterations goes away if all entries are collected for the current iteration
>     (though a second bitmap is needed to handle the second collection; KVM's
>     "get" of the bitmap clobbers the previous value).
> 
> I have everything more or less coded up, but I need to split it into patches,
> write changelogs, and interleave it with your fixes.  Hopefully I'll get to that
> tomorrow.
> 

Hi!

I will take a look at your patch series once you post it.
I also think that the logic in the test is somewhat broken, but then this also 
serves as a way to cause as much havoc as possible.

The fact that not all dirty pages are collected is because the ring harvest happens
at the same time the guest continues dirtying the pages, adding more entries to the
ring, simulating what would happen during real-life migration.

Kicking the guest just before ring harvest is also IMHO a good thing as it also
simulates the IRQ load that would happen.

We can avoid kicking the guest if it is already stopped due to dirty ring; in fact,
the fact that we still kick it delays the kick to the point where we resume the guest
and wait for it to stop again before we do the verify step, which makes it often
exit not due to a log full event.

I did this but this makes the test be way less random, and the whole point of this
test is to cause as much havoc as possible.


I do think that we don't need to stop the guest during verify for the dirty-ring case,
this is probably code that only the dirty bitmap part of the test needs.

I added Peter Xu to CC to hear his opinion about this as well.


Best regards,
	Maxim Levitsky
Sean Christopherson Dec. 13, 2024, 8:31 p.m. UTC | #5
On Fri, Dec 13, 2024, Maxim Levitsky wrote:
> On Thu, 2024-12-12 at 22:19 -0800, Sean Christopherson wrote:
> > On Thu, Dec 12, 2024, Maxim Levitsky wrote:
> > > On Wed, 2024-12-11 at 16:44 -0800, Sean Christopherson wrote:
> > > > But, I can't help but wonder why KVM bothers emulating PML.  I can appreciate
> > > > that avoiding exits to L1 would be beneficial, but what use case actually cares
> > > > about dirty logging performance in L1?
> > > 
> > > It does help with performance by a lot and the implementation is emulated and simple.
> > 
> > Yeah, it's not a lot of complexity, but it's architecturally flawed.  And I get
> > that it helps with performance, I'm just stumped as to the use case for dirty
> > logging in a nested VM in the first place.
> > 
> > > Do you have any comments for the rest of the patch series? If not then I'll send
> > > v2 of the patch series.
> > 
> > *sigh*
> > 
> > I do.  Through no fault of your own.  I was trying to figure out a way to ensure
> > the vCPU made meaningful progress, versus just guaranteeing at least one write,
> > and stumbled onto a plethora of flaws and unnecessary complexity in the test.
> > 
> > Can you post this patch as a standalone v2?  I'd like to do a more aggressive
> > cleanup of the selftest, but I don't want to hold this up, and there's no hard
> > dependency.
> > 
> > As for the issues I encountered with the selftest:
> > 
> >  1. Tracing how many pages have been written for the current iteration with a
> >     guest side counter doesn't work without more fixes, because the test doesn't
> >     collect all dirty entries for the current iterations.  For the dirty ring,
> >     this results in a vCPU *starting* an iteration with a full dirty ring, and
> >     the test hangs because the guest can't make forward progress until
> >     log_mode_collect_dirty_pages() is called.
> > 
> >  2. The test presumably doesn't collect all dirty entries because of the weird
> >     and unnecessary kick in dirty_ring_collect_dirty_pages(), and all the
> >     synchronization that comes with it.  The kick is "justified" with a comment
> >     saying "This makes sure that hardware PML cache flushed", but there's no
> >     reason to do that *if* the test collects dirty pages *after* stopping
> >     the vCPU.  Which is easy to do while also collecting while the vCPU is
> >     running, if the kick+synchronization is eliminated (i.e. it's a self-inflicted
> >     wound of sorts).
> > 
> >  3. dirty_ring_after_vcpu_run() doesn't honor vcpu_sync_stop_requested, and so
> >     every other iteration runs until the ring is full.  Testing the "soft full"
> >     logic is interesting, but not _that_ interesting, and filling the dirty ring
> >     basically ignores the "interval".  Fixing this reduces the runtime by a
> >     significant amount, especially on nested, at the cost of providing less
> >     coverage for the dirty ring with default interval in a nested VM (but if
> >     someone cares about testing the dirty ring soft full in a nested VM, they
> >     can darn well bump the interval).
> > 
> >  4. Fixing the test to collect all dirty entries for the current iteration
> >     exposes another flaw.  The bitmaps (not dirty ring) start with all bits
> >     set.  And so the first iteration can see "dirty" pages that have never
> >     been written, but only when applying your fix to limit the hack to s390.
> > 
> >  5. "iteration" is synched to the guest *after* the vCPU is restarted, i.e. the
> >     guest could see a stale iteration if the main thread is delayed.
> > 
> >  6. host_bmap_track and all of the weird exemptions for writes from previous
> >     iterations goes away if all entries are collected for the current iteration
> >     (though a second bitmap is needed to handle the second collection; KVM's
> >     "get" of the bitmap clobbers the previous value).
> > 
> > I have everything more or less coded up, but I need to split it into patches,
> > write changelogs, and interleave it with your fixes.  Hopefully I'll get to that
> > tomorrow.
> > 
> 
> Hi!
> 
> I will take a look at your patch series once you post it.
> I also think that the logic in the test is somewhat broken, but then this also 
> serves as a way to cause as much havoc as possible.
> 
> The fact that not all dirty pages are collected is because the ring harvest happens
> at the same time the guest continues dirtying the pages, adding more entries to the
> ring, simulating what would happen during real-life migration.

But as above, that behavior is trivially easy to mimic even when collecting all
entries simply by playing nice with multiple collections per iteration.  

> kicking the guest just before ring harvest is also IMHO a good thing as it also
> simulates the IRQ load that would happen.

I am not at all convinced that's interesting.  And *if* it's really truly all
that interesting, then the kick should be done for all flavors.

Unless the host is tickless, the vCPU will get interrupts from time to time, at
least for any decently large interval.  The kick from the test itself adds an
absurd amount of complexity for no meaningful test coverage.

> we can avoid kicking the guest if it is already stopped due to dirty ring, in fact,
> the fact that we still kick it, delays the kick to the point where we resume the guest
> and wait for it to stop again before we do the verify step, which makes it often
> exit not due to log full event.
> 
> I did this but this makes the test be way less random, and the whole point of this
> test is to cause as much havoc as possible.

I agree randomness is a good thing for testing, but this is more noise than
random/controlled chaos.

E.g. we can do _far_ better for large interval numbers.  As is, collecting _once_
per iteration means the vCPU is all but guaranteed to stall on a dirty ring for
any decently large interval.

And conversely, not honoring the "stop" request means every other iteration is all
but guaranteed to fill the dirty ring, even for small intervals.

> I do think that we don't need to stop the guest during verify for the
> dirty-ring case, this is probably a code that only dirty bitmap part of the
> test needs.

Not stopping the vCPU would reduce test coverage (which is one of my complaints
with not fully harvesting the dirty entries).  If KVM misses a dirty log event
on iteration X, and the vCPU also writes the same page in iteration X+1, then the
test will get a false pass because iteration X+1 will see the page as dirty and
think all is well.
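
A minimal sketch of that false pass, as a toy userspace model rather than the
selftest's actual code.  struct page_state, guest_write() and verify() are
hypothetical names, and the "page" simply holds the iteration number of the
last write:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one guest page as the test would see it. */
struct page_state {
	uint64_t value;	/* iteration stamped by the last guest write */
	bool dirty;	/* what the (possibly buggy) dirty log reports */
};

/* Guest write; logged == false models KVM missing the dirty log event. */
static void guest_write(struct page_state *p, uint64_t iteration, bool logged)
{
	/* Iterations only move forward in this model. */
	assert(iteration >= p->value);

	p->value = iteration;
	if (logged)
		p->dirty = true;
}

/* A clean page must not hold a value written in the current iteration. */
static bool verify(const struct page_state *p, uint64_t iteration)
{
	return p->dirty || p->value != iteration;
}
```

With the vCPU stopped, a missed write in iteration 1 followed by verify(&page, 1)
returns false, i.e. the bug is caught.  If the vCPU keeps running and issues a
logged write in iteration 2 before the check, verify(&page, 2) returns true and
the miss from iteration 1 goes unnoticed.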

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0f008f5ef6f0..6fb946b58a75 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6211,31 +6211,41 @@  static void vmx_flush_pml_buffer(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u64 *pml_buf;
-	u16 pml_idx;
+	u16 pml_idx, pml_last_written_entry;
+	int i;
 
 	pml_idx = vmcs_read16(GUEST_PML_INDEX);
 
 	/* Do nothing if PML buffer is empty */
-	if (pml_idx == (PML_ENTITY_NUM - 1))
+	if (pml_idx == PML_LAST_ENTRY)
 		return;
+	/*
+	 * PML index always points to the next available PML buffer entity
+	 * unless PML log has just overflowed, in which case PML index will be
+	 * 0xFFFF.
+	 */
+	pml_last_written_entry = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;
 
-	/* PML index always points to next available PML buffer entity */
-	if (pml_idx >= PML_ENTITY_NUM)
-		pml_idx = 0;
-	else
-		pml_idx++;
-
+	/*
+	 * PML log is written backwards: the CPU first writes the entity 511
+	 * then the entity 510, and so on, until it writes the entity 0 at which
+	 * point the PML log full VM exit happens and the logging stops.
+	 *
+	 * Read the entries in the same order they were written, to ensure that
+	 * the dirty ring is filled in the same order the CPU wrote them.
+	 */
 	pml_buf = page_address(vmx->pml_pg);
-	for (; pml_idx < PML_ENTITY_NUM; pml_idx++) {
+
+	for (i = PML_LAST_ENTRY; i >= pml_last_written_entry; i--) {
 		u64 gpa;
 
-		gpa = pml_buf[pml_idx];
+		gpa = pml_buf[i];
 		WARN_ON(gpa & (PAGE_SIZE - 1));
 		kvm_vcpu_mark_page_dirty(vcpu, gpa >> PAGE_SHIFT);
 	}
 
 	/* reset PML index */
-	vmcs_write16(GUEST_PML_INDEX, PML_ENTITY_NUM - 1);
+	vmcs_write16(GUEST_PML_INDEX, PML_LAST_ENTRY);
 }
 
 static void vmx_dump_sel(char *name, uint32_t sel)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 43f573f6ca46..d14401c8e499 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -331,6 +331,7 @@  struct vcpu_vmx {
 
 	/* Support for PML */
 #define PML_ENTITY_NUM		512
+#define PML_LAST_ENTRY		(PML_ENTITY_NUM - 1)
 	struct page *pml_pg;
 
 	/* apic deadline value in host tsc */
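
For illustration only, the read order introduced by the patch can be modeled
in userspace.  This is a sketch under stated assumptions, not kernel code:
pml_log() and pml_drain() are hypothetical helpers, the PML page is a plain
array, and the VMCS index is a plain u16.  PML_ENTITY_NUM and PML_LAST_ENTRY
mirror the definitions in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PML_ENTITY_NUM	512
#define PML_LAST_ENTRY	(PML_ENTITY_NUM - 1)

/*
 * Hypothetical model of the CPU logging one GPA: write at the current
 * index, then decrement.  After entry 0 is written the u16 index wraps
 * to 0xFFFF, which is the "log full" state.
 */
static void pml_log(uint64_t *pml_buf, uint16_t *pml_idx, uint64_t gpa)
{
	assert(*pml_idx < PML_ENTITY_NUM);	/* log must not be full */
	pml_buf[*pml_idx] = gpa;
	(*pml_idx)--;
}

/*
 * Model of the flush loop: copy the logged GPAs into 'out' in the order
 * they were written (entry 511 first) and return how many were copied.
 */
static size_t pml_drain(const uint64_t *pml_buf, uint16_t pml_idx,
			uint64_t *out)
{
	uint16_t last_written;
	size_t n = 0;
	int i;

	/* Nothing was logged. */
	if (pml_idx == PML_LAST_ENTRY)
		return 0;

	/* The index points at the next free entry, or is 0xFFFF when full. */
	last_written = (pml_idx >= PML_ENTITY_NUM) ? 0 : pml_idx + 1;

	for (i = PML_LAST_ENTRY; i >= last_written; i--)
		out[n++] = pml_buf[i];

	return n;
}
```

Logging 0x1000, 0x2000 and 0x3000 into an empty buffer leaves the index at 508,
and draining then yields the three GPAs in exactly that order.  The loop counter
is a signed int so that the i >= 0 comparison terminates when the log is full.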