
[v1,1/5] KVM: arm64: Enable ring-based dirty memory tracking

Message ID 20220819005601.198436-2-gshan@redhat.com
State New
Series KVM: arm64: Enable ring-based dirty memory tracking

Commit Message

Gavin Shan Aug. 19, 2022, 12:55 a.m. UTC
The ring-based dirty memory tracking has been available and enabled
on x86 for a while. The feature is beneficial when the number of
dirty pages is small in a checkpointing system or live migration
scenario. More details can be found in commit fb04a1eddb1a ("KVM: X86:
Implement ring-based dirty memory tracking").

This enables the ring-based dirty memory tracking on ARM64. It's
notable that no extra reserved ring entries are needed on ARM64
because the huge pages are always split into base pages when page
dirty tracking is enabled.

Signed-off-by: Gavin Shan <gshan@redhat.com>
---
 Documentation/virt/kvm/api.rst    | 2 +-
 arch/arm64/include/uapi/asm/kvm.h | 1 +
 arch/arm64/kvm/Kconfig            | 1 +
 arch/arm64/kvm/arm.c              | 8 ++++++++
 4 files changed, 11 insertions(+), 1 deletion(-)

Comments

Gavin Shan Aug. 22, 2022, 1:58 a.m. UTC | #1
Hi Marc,

On 8/19/22 6:00 PM, Marc Zyngier wrote:
> On Fri, 19 Aug 2022 01:55:57 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>>
>> The ring-based dirty memory tracking has been available and enabled
>> on x86 for a while. The feature is beneficial when the number of
>> dirty pages is small in a checkpointing system or live migration
>> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
>> Implement ring-based dirty memory tracking").
>>
>> This enables the ring-based dirty memory tracking on ARM64. It's
>> notable that no extra reserved ring entries are needed on ARM64
>> because the huge pages are always split into base pages when page
>> dirty tracking is enabled.
> 
> Can you please elaborate on this? Adding a per-CPU ring of course
> results in extra memory allocation, so there must be a subtle
> x86-specific detail that I'm not aware of...
> 

Sure. I guess it's helpful to explain how it works in the next revision.
Something like below:

This enables the ring-based dirty memory tracking on ARM64. The feature
is built when CONFIG_HAVE_KVM_DIRTY_RING is enabled, and is detected and
enabled by userspace through the KVM_CAP_DIRTY_LOG_RING capability. A ring
buffer is created on every vcpu and each entry is described by 'struct
kvm_dirty_gfn'. Entries are pushed into the ring buffer by the host when a
page becomes dirty and are pulled out of it by userspace. A vcpu exit is
forced when the ring buffer becomes full. The ring buffers on all vcpus can
be reset by the KVM_RESET_DIRTY_RINGS ioctl.
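
For illustration, a rough sketch of the userspace side is as below. The
struct and flag names come from the existing x86 UAPI in
include/uapi/linux/kvm.h, while the loop and collect_dirty_page() are only
placeholders (not the QEMU implementation), and the memory barriers are
omitted for brevity:

   struct kvm_dirty_gfn {
        __u32 flags;          /* KVM_DIRTY_GFN_F_DIRTY / _RESET  */
        __u32 slot;           /* address space id | memslot id   */
        __u64 offset;         /* page offset within the memslot  */
   };

   /*
    * 'gfns' is the per-vcpu ring mmap()ed from the vcpu fd, 'size' is
    * the number of entries in it, 'fetch' is a private userspace cursor.
    */
   while (gfns[fetch % size].flags & KVM_DIRTY_GFN_F_DIRTY) {
        collect_dirty_page(gfns[fetch % size].slot,
                           gfns[fetch % size].offset);
        gfns[fetch % size].flags = KVM_DIRTY_GFN_F_RESET;
        fetch++;
   }

   /*
    * Afterwards, ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0) lets the kernel
    * recycle the harvested entries and write-protect those pages again.
    */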

Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
However, it avoids synchronization among multiple vcpus when dirty pages
are generated on multiple vcpus. More discussion can be found in [1]

[1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
(comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)


>>
>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>> ---
>>   Documentation/virt/kvm/api.rst    | 2 +-
>>   arch/arm64/include/uapi/asm/kvm.h | 1 +
>>   arch/arm64/kvm/Kconfig            | 1 +
>>   arch/arm64/kvm/arm.c              | 8 ++++++++
>>   4 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index abd7c32126ce..19fa1ac017ed 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
>>   8.29 KVM_CAP_DIRTY_LOG_RING
>>   ---------------------------
>>   
>> -:Architectures: x86
>> +:Architectures: x86, arm64
>>   :Parameters: args[0] - size of the dirty log ring
>>   
>>   KVM is capable of tracking dirty memory using ring buffers that are
>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>> index 3bb134355874..7e04b0b8d2b2 100644
>> --- a/arch/arm64/include/uapi/asm/kvm.h
>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>> @@ -43,6 +43,7 @@
>>   #define __KVM_HAVE_VCPU_EVENTS
>>   
>>   #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
> 
> For context, the documentation says:
> 
> <quote>
> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
>    KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> </quote>
> 
> What is the reason for picking this particular value?
> 

It's inherited from x86. I don't think it has to be this particular value.
The value is used to distinguish the different vcpu regions, like kvm_run,
KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.

How about having 2 for KVM_DIRTY_LOG_PAGE_OFFSET in the next revision?
The virtual area is cheap, though, so it's also fine to follow x86's
pattern and keep 64 for KVM_DIRTY_LOG_PAGE_OFFSET.

     #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
     #define KVM_DIRTY_LOG_PAGE_OFFSET        2
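
Either way, the offset only tells userspace where to mmap() the ring within
the vcpu fd. A rough sketch of the layout (the mmap() calls are illustrative
only; ring_bytes is whatever size was negotiated through
KVM_CAP_DIRTY_LOG_RING):

     long psz = sysconf(_SC_PAGESIZE);

     /* offset 0: the shared kvm_run structure */
     struct kvm_run *run = mmap(NULL, psz, PROT_READ | PROT_WRITE,
                                MAP_SHARED, vcpu_fd, 0);

     /* the dirty ring, possibly spanning several pages */
     struct kvm_dirty_gfn *gfns =
             mmap(NULL, ring_bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
                  vcpu_fd, KVM_DIRTY_LOG_PAGE_OFFSET * psz);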

>>   
>>   #define KVM_REG_SIZE(id)						\
>>   	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>> index 815cc118c675..0309b2d0f2da 100644
>> --- a/arch/arm64/kvm/Kconfig
>> +++ b/arch/arm64/kvm/Kconfig
>> @@ -32,6 +32,7 @@ menuconfig KVM
>>   	select KVM_VFIO
>>   	select HAVE_KVM_EVENTFD
>>   	select HAVE_KVM_IRQFD
>> +	select HAVE_KVM_DIRTY_RING
>>   	select HAVE_KVM_MSI
>>   	select HAVE_KVM_IRQCHIP
>>   	select HAVE_KVM_IRQ_ROUTING
>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>> index 986cee6fbc7f..3de6b9b39db7 100644
>> --- a/arch/arm64/kvm/arm.c
>> +++ b/arch/arm64/kvm/arm.c
>> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>   		if (!ret)
>>   			ret = 1;
>>   
>> +		/* Force vcpu exit if its dirty ring is soft-full */
>> +		if (unlikely(vcpu->kvm->dirty_ring_size &&
>> +			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
>> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>> +			trace_kvm_dirty_ring_exit(vcpu);
>> +			ret = 0;
>> +		}
>> +
> 
> Why can't this be moved to kvm_vcpu_exit_request() instead? I would
> also very much like the check to be made a common helper with x86.
> 
> A seemingly approach would be to make this a request on dirty log
> insertion, and avoid the whole "check the log size" on every run,
> which adds pointless overhead to unsuspecting users (aka everyone).
> 

I thought of having the check in kvm_vcpu_exit_request(). The various
exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the
highest priority and ARM64 just follows. I don't think it really
matters. I will improve it accordingly in the next revision:

- Change kvm_dirty_ring_soft_full() to something like below in dirty_ring.c

   bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
   {
        struct kvm *kvm = vcpu->kvm;
        struct kvm_dirty_ring *ring = &vcpu->dirty_ring;

        if (unlikely(kvm->dirty_ring_size &&
                     kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
            vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
            trace_kvm_dirty_ring_exit(vcpu);
            return true;
        }

        return false;
   }

- Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().

Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
Otherwise, the dirty log in the ring buffer will be overwritten. I'm not
sure if there is anything else I missed?
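
For instance, the VMM's vcpu thread would handle the exit roughly as below
(a sketch only; harvest_all_dirty_rings() stands in for walking every
vcpu's ring as described earlier):

   struct kvm_run *run = ...;    /* mmap()ed at offset 0 of the vcpu fd */

   for (;;) {
        ioctl(vcpu_fd, KVM_RUN, 0);

        switch (run->exit_reason) {
        case KVM_EXIT_DIRTY_RING_FULL:
             /* Harvest and recycle so the vcpu can make progress again */
             harvest_all_dirty_rings();
             ioctl(vm_fd, KVM_RESET_DIRTY_RINGS, 0);
             break;
        /* ... other exit reasons ... */
        }
   }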

Thanks,
Gavin
Peter Xu Aug. 22, 2022, 6:55 p.m. UTC | #2
Hi, Gavin,

On Mon, Aug 22, 2022 at 11:58:20AM +1000, Gavin Shan wrote:
> > For context, the documentation says:
> > 
> > <quote>
> > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> >    KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> > </quote>
> > 
> > What is the reason for picking this particular value?
> > 
> 
> It's inherited from x86. I don't think it has to be this particular value.
> The value is used to distinguish the region's owners like kvm_run, KVM_PIO_PAGE_OFFSET,
> KVM_COALESCED_MMIO_PAGE_OFFSET, and KVM_DIRTY_LOG_PAGE_OFFSET.
> 
> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
> The virtual area is cheap, I guess it's also nice to use x86's
> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
> 
>     #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
>     #define KVM_DIRTY_LOG_PAGE_OFFSET        2

It was chosen not to be contiguous with the previously used offsets because
it'll be the first vcpu region that can cover multiple pages, and a dynamic
number of them at that. I wanted to leave offsets 3-63 (x86 already uses
offset 2) for small fields so they can stay contiguous, which looks a little
bit cleaner.

Currently, how many pages it'll use depends on the size set by the user,
though the maximum is bounded by KVM_DIRTY_RING_MAX_ENTRIES, which
corresponds to at most 1MB of memory.

So I think setting it to 2 is okay, as long as we keep the rest of that 1MB
of address space for the per-vcpu ring structure, so that any new vcpu
fields (even if only one page is needed) are placed after the maximum size
of the ring.
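
(For reference, the 1MB figure follows from the current limits, assuming
KVM_DIRTY_RING_MAX_ENTRIES stays at 65536:

     sizeof(struct kvm_dirty_gfn) = 16 bytes
     65536 entries * 16 bytes     = 1MB = 256 pages with 4KB pages,
                                    i.e. page offsets 64..319 on x86.)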

Thanks,
Marc Zyngier Aug. 22, 2022, 9:42 p.m. UTC | #3
Hi Gavin,

On Mon, 22 Aug 2022 02:58:20 +0100,
Gavin Shan <gshan@redhat.com> wrote:
> 
> Hi Marc,
> 
> On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > On Fri, 19 Aug 2022 01:55:57 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> >> 
> >> The ring-based dirty memory tracking has been available and enabled
> >> on x86 for a while. The feature is beneficial when the number of
> >> dirty pages is small in a checkpointing system or live migration
> >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> >> Implement ring-based dirty memory tracking").
> >> 
> >> This enables the ring-based dirty memory tracking on ARM64. It's
> >> notable that no extra reserved ring entries are needed on ARM64
> >> because the huge pages are always split into base pages when page
> >> dirty tracking is enabled.
> > 
> > Can you please elaborate on this? Adding a per-CPU ring of course
> > results in extra memory allocation, so there must be a subtle
> > x86-specific detail that I'm not aware of...
> > 
> 
> Sure. I guess it's helpful to explain how it works in next revision.
> Something like below:
> 
> This enables the ring-based dirty memory tracking on ARM64. The feature
> is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
> CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
> each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> pushed by host when page becomes dirty and pulled by userspace. A vcpu
> exit is forced when the ring buffer becomes full. The ring buffers on
> all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
> 
> Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> However, it's avoiding synchronization among multiple vcpus when dirty
> pages happen on multiple vcpus. More discussion can be found from [1]

Oh, I totally buy the relaxation of the synchronisation (though I
doubt this will have any visible effect until we have something like
Oliver's patches to allow parallel faulting).

But it is the "no extra reserved ring entries are needed on ARM64"
argument that I don't get yet.


>
> [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
> (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
> 
> 
> >> 
> >> Signed-off-by: Gavin Shan <gshan@redhat.com>
> >> ---
> >>   Documentation/virt/kvm/api.rst    | 2 +-
> >>   arch/arm64/include/uapi/asm/kvm.h | 1 +
> >>   arch/arm64/kvm/Kconfig            | 1 +
> >>   arch/arm64/kvm/arm.c              | 8 ++++++++
> >>   4 files changed, 11 insertions(+), 1 deletion(-)
> >> 
> >> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> >> index abd7c32126ce..19fa1ac017ed 100644
> >> --- a/Documentation/virt/kvm/api.rst
> >> +++ b/Documentation/virt/kvm/api.rst
> >> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
> >>   8.29 KVM_CAP_DIRTY_LOG_RING
> >>   ---------------------------
> >>   -:Architectures: x86
> >> +:Architectures: x86, arm64
> >>   :Parameters: args[0] - size of the dirty log ring
> >>     KVM is capable of tracking dirty memory using ring buffers that
> >> are
> >> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
> >> index 3bb134355874..7e04b0b8d2b2 100644
> >> --- a/arch/arm64/include/uapi/asm/kvm.h
> >> +++ b/arch/arm64/include/uapi/asm/kvm.h
> >> @@ -43,6 +43,7 @@
> >>   #define __KVM_HAVE_VCPU_EVENTS
> >>     #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> >> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
> > 
> > For context, the documentation says:
> > 
> > <quote>
> > - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
> >    KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
> > </quote>
> > 
> > What is the reason for picking this particular value?
> > 
> 
> It's inherited from x86. I don't think it has to be this particular
> value.  The value is used to distinguish the region's owners like
> kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and
> KVM_DIRTY_LOG_PAGE_OFFSET.
> 
> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
> The virtual area is cheap, I guess it's also nice to use x86's
> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
> 
>     #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
>     #define KVM_DIRTY_LOG_PAGE_OFFSET        2

Given that this is just an offset in the vcpu "file", I don't think it
matters that much. 64 definitely allows for some struct vcpu growth,
and it doesn't hurt to be compatible with x86 (for once...).

> 
> >>     #define KVM_REG_SIZE(id)
> >> \
> >>   	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
> >> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
> >> index 815cc118c675..0309b2d0f2da 100644
> >> --- a/arch/arm64/kvm/Kconfig
> >> +++ b/arch/arm64/kvm/Kconfig
> >> @@ -32,6 +32,7 @@ menuconfig KVM
> >>   	select KVM_VFIO
> >>   	select HAVE_KVM_EVENTFD
> >>   	select HAVE_KVM_IRQFD
> >> +	select HAVE_KVM_DIRTY_RING
> >>   	select HAVE_KVM_MSI
> >>   	select HAVE_KVM_IRQCHIP
> >>   	select HAVE_KVM_IRQ_ROUTING
> >> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> >> index 986cee6fbc7f..3de6b9b39db7 100644
> >> --- a/arch/arm64/kvm/arm.c
> >> +++ b/arch/arm64/kvm/arm.c
> >> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >>   		if (!ret)
> >>   			ret = 1;
> >>   +		/* Force vcpu exit if its dirty ring is soft-full */
> >> +		if (unlikely(vcpu->kvm->dirty_ring_size &&
> >> +			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
> >> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> >> +			trace_kvm_dirty_ring_exit(vcpu);
> >> +			ret = 0;
> >> +		}
> >> +
> > 
> > Why can't this be moved to kvm_vcpu_exit_request() instead? I would
> > also very much like the check to be made a common helper with x86.
> > 
> > A seemingly approach would be to make this a request on dirty log
> > insertion, and avoid the whole "check the log size" on every run,
> > which adds pointless overhead to unsuspecting users (aka everyone).
> > 
> 
> I though of having the check in kvm_vcpu_exit_request(). The various
> exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the
> highest priority and ARM64 is just to follow. I don't think it really
> matters. I will improve it accordingly in next revision:
> 
> - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
> 
>   bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
>   {
>        struct kvm *kvm = vcpu->vcpu;
>        struct kvm_dirty_ring *ring = &vcpu->dirty_ring;
> 
>        if (unlikely(kvm->dirty_ring_size &&
>                     kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
>            vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>            trace_kvm_dirty_ring_exit(vcpu);
>            return true;
>        }
> 
>        return false;
>   }
> 
> - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
> 
> Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
> Otherwise, the dirty log in the ring buffer will be overwritten. I'm not
> sure if anything else I missed?

I'm fine with the above, but what I really wanted is a request made from
the dirty-ring insertion path, instead of a check in kvm_vcpu_exit_request().
Something like this (which obviously doesn't compile, but you'll get
the idea):

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..0b41feb6fb7d 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
 
 		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
 			return kvm_vcpu_suspend(vcpu);
+
+		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			return 0;
+		}
 	}
 
 	return 1;
diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
index f4c2a6eb1666..08b2f01164fa 100644
--- a/virt/kvm/dirty_ring.c
+++ b/virt/kvm/dirty_ring.c
@@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
 
 void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 {
+	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
 	struct kvm_dirty_gfn *entry;
 
 	/* It should never get full */
@@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
 	kvm_dirty_gfn_set_dirtied(entry);
 	ring->dirty_index++;
 	trace_kvm_dirty_ring_push(ring, slot, offset);
+
+	if (kvm_dirty_ring_soft_full(vcpu))
+		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
 }
 
 struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)

Thanks,

	M.
Gavin Shan Aug. 23, 2022, 5:22 a.m. UTC | #4
Hi Marc,

On 8/23/22 7:42 AM, Marc Zyngier wrote:
> On Mon, 22 Aug 2022 02:58:20 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
>> On 8/19/22 6:00 PM, Marc Zyngier wrote:
>>> On Fri, 19 Aug 2022 01:55:57 +0100,
>>> Gavin Shan <gshan@redhat.com> wrote:
>>>>
>>>> The ring-based dirty memory tracking has been available and enabled
>>>> on x86 for a while. The feature is beneficial when the number of
>>>> dirty pages is small in a checkpointing system or live migration
>>>> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
>>>> Implement ring-based dirty memory tracking").
>>>>
>>>> This enables the ring-based dirty memory tracking on ARM64. It's
>>>> notable that no extra reserved ring entries are needed on ARM64
>>>> because the huge pages are always split into base pages when page
>>>> dirty tracking is enabled.
>>>
>>> Can you please elaborate on this? Adding a per-CPU ring of course
>>> results in extra memory allocation, so there must be a subtle
>>> x86-specific detail that I'm not aware of...
>>>
>>
>> Sure. I guess it's helpful to explain how it works in next revision.
>> Something like below:
>>
>> This enables the ring-based dirty memory tracking on ARM64. The feature
>> is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
>> CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
>> each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
>> pushed by host when page becomes dirty and pulled by userspace. A vcpu
>> exit is forced when the ring buffer becomes full. The ring buffers on
>> all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
>>
>> Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
>> However, it's avoiding synchronization among multiple vcpus when dirty
>> pages happen on multiple vcpus. More discussion can be found from [1]
> 
> Oh, I totally buy the relaxation of the synchronisation (though I
> doubt this will have any visible effect until we have something like
> Oliver's patches to allow parallel faulting).
> 
> But it is the "no extra reserved ring entries are needed on ARM64"
> argument that I don't get yet.
> 

Ok. The extra reserved ring entries are x86 specific. When x86's PML
(Page Modification Logging) hardware capability is enabled, the vcpu
exits when the PML buffer, which holds 512 entries, becomes full. All the
information in the PML buffer is pushed to the dirty ring buffer in one
shot. To avoid overrunning the dirty ring buffer, 512 extra entries are
reserved.

   === include/linux/kvm_host.h

   #define KVM_DIRTY_RING_RSVD_ENTRIES    64     // fixed and reserved ring entries

   === virt/kvm/dirty_ring.c

   int __weak kvm_cpu_dirty_log_size(void)
   {
         return 0;
   }

   u32 kvm_dirty_ring_get_rsvd_entries(void)
   {
         return KVM_DIRTY_RING_RSVD_ENTRIES + kvm_cpu_dirty_log_size();
   }

   === arch/x86/kvm/mmu/mmu.c

   int kvm_cpu_dirty_log_size(void)
   {
         return kvm_x86_ops.cpu_dirty_log_size;    // Set to 512 when PML is enabled
   }


kvm_cpu_dirty_log_size() isn't overridden on ARM64, meaning it returns
zero there. On x86, it returns 512 when PML is enabled.
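
The reserved entries then feed into the per-ring soft limit when the ring
is allocated, roughly as below (paraphrasing virt/kvm/dirty_ring.c, not a
verbatim quote):

   === virt/kvm/dirty_ring.c

   int kvm_dirty_ring_alloc(struct kvm_dirty_ring *ring, int index, u32 size)
   {
         ring->dirty_gfns = vzalloc(size);
         if (!ring->dirty_gfns)
               return -ENOMEM;

         ring->size = size / sizeof(struct kvm_dirty_gfn);
         ring->soft_limit = ring->size - kvm_dirty_ring_get_rsvd_entries();
         ring->dirty_index = 0;
         ring->reset_index = 0;
         ring->index = index;

         return 0;
   }

So with PML enabled, x86's soft-full exit fires 64 + 512 entries before the
ring is actually full, while on ARM64 only the fixed 64 entries are held
back.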

>>
>> [1] https://patchwork.kernel.org/project/kvm/patch/BL2PR08MB4812F929A2760BC40EA757CF0630@BL2PR08MB481.namprd08.prod.outlook.com/
>> (comment#8 from Radim Krčmář on May 3, 2016, 2:11 p.m. UTC)
>>
>>
>>>>
>>>> Signed-off-by: Gavin Shan <gshan@redhat.com>
>>>> ---
>>>>    Documentation/virt/kvm/api.rst    | 2 +-
>>>>    arch/arm64/include/uapi/asm/kvm.h | 1 +
>>>>    arch/arm64/kvm/Kconfig            | 1 +
>>>>    arch/arm64/kvm/arm.c              | 8 ++++++++
>>>>    4 files changed, 11 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>>>> index abd7c32126ce..19fa1ac017ed 100644
>>>> --- a/Documentation/virt/kvm/api.rst
>>>> +++ b/Documentation/virt/kvm/api.rst
>>>> @@ -8022,7 +8022,7 @@ regardless of what has actually been exposed through the CPUID leaf.
>>>>    8.29 KVM_CAP_DIRTY_LOG_RING
>>>>    ---------------------------
>>>>    -:Architectures: x86
>>>> +:Architectures: x86, arm64
>>>>    :Parameters: args[0] - size of the dirty log ring
>>>>      KVM is capable of tracking dirty memory using ring buffers that
>>>> are
>>>> diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
>>>> index 3bb134355874..7e04b0b8d2b2 100644
>>>> --- a/arch/arm64/include/uapi/asm/kvm.h
>>>> +++ b/arch/arm64/include/uapi/asm/kvm.h
>>>> @@ -43,6 +43,7 @@
>>>>    #define __KVM_HAVE_VCPU_EVENTS
>>>>      #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
>>>> +#define KVM_DIRTY_LOG_PAGE_OFFSET 64
>>>
>>> For context, the documentation says:
>>>
>>> <quote>
>>> - if KVM_CAP_DIRTY_LOG_RING is available, a number of pages at
>>>     KVM_DIRTY_LOG_PAGE_OFFSET * PAGE_SIZE. [...]
>>> </quote>
>>>
>>> What is the reason for picking this particular value?
>>>
>>
>> It's inherited from x86. I don't think it has to be this particular
>> value.  The value is used to distinguish the region's owners like
>> kvm_run, KVM_PIO_PAGE_OFFSET, KVM_COALESCED_MMIO_PAGE_OFFSET, and
>> KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>> How about to have 2 for KVM_DIRTY_LOG_PAGE_OFFSET in next revision?
>> The virtual area is cheap, I guess it's also nice to use x86's
>> pattern to have 64 for KVM_DIRTY_LOG_PAGE_OFFSET.
>>
>>      #define KVM_COALESCED_MMIO_PAGE_OFFSET   1
>>      #define KVM_DIRTY_LOG_PAGE_OFFSET        2
> 
> Given that this is just an offset in the vcpu "file", I don't think it
> matters that much. 64 definitely allows for some struct vcpu growth,
> and it doesn't hurt to be compatible with x86 (for once...).
> 

Sure, thanks. I think it'd be better to keep the same pattern as x86 then.

>>
>>>>      #define KVM_REG_SIZE(id)
>>>> \
>>>>    	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
>>>> diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
>>>> index 815cc118c675..0309b2d0f2da 100644
>>>> --- a/arch/arm64/kvm/Kconfig
>>>> +++ b/arch/arm64/kvm/Kconfig
>>>> @@ -32,6 +32,7 @@ menuconfig KVM
>>>>    	select KVM_VFIO
>>>>    	select HAVE_KVM_EVENTFD
>>>>    	select HAVE_KVM_IRQFD
>>>> +	select HAVE_KVM_DIRTY_RING
>>>>    	select HAVE_KVM_MSI
>>>>    	select HAVE_KVM_IRQCHIP
>>>>    	select HAVE_KVM_IRQ_ROUTING
>>>> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
>>>> index 986cee6fbc7f..3de6b9b39db7 100644
>>>> --- a/arch/arm64/kvm/arm.c
>>>> +++ b/arch/arm64/kvm/arm.c
>>>> @@ -866,6 +866,14 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>>>    		if (!ret)
>>>>    			ret = 1;
>>>>    +		/* Force vcpu exit if its dirty ring is soft-full */
>>>> +		if (unlikely(vcpu->kvm->dirty_ring_size &&
>>>> +			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
>>>> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>>>> +			trace_kvm_dirty_ring_exit(vcpu);
>>>> +			ret = 0;
>>>> +		}
>>>> +
>>>
>>> Why can't this be moved to kvm_vcpu_exit_request() instead? I would
>>> also very much like the check to be made a common helper with x86.
>>>
>>> A seemingly approach would be to make this a request on dirty log
>>> insertion, and avoid the whole "check the log size" on every run,
>>> which adds pointless overhead to unsuspecting users (aka everyone).
>>>
>>
>> I though of having the check in kvm_vcpu_exit_request(). The various
>> exit reasons are prioritized. x86 gives KVM_EXIT_DIRTY_RING_FULL the
>> highest priority and ARM64 is just to follow. I don't think it really
>> matters. I will improve it accordingly in next revision:
>>
>> - Change kvm_dirty_ring_soft_full() to something as below in dirty_ring.c
>>
>>    bool kvm_dirty_ring_soft_full(struct kvm_vcpu *vcpu)
>>    {
>>         struct kvm *kvm = vcpu->vcpu;
>>         struct kvm_dirty_ring *ring = &vcpu->dirty_ring;
>>
>>         if (unlikely(kvm->dirty_ring_size &&
>>                      kvm_dirty_ring_used(ring) >= ring->soft_limit)) {
>>             vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
>>             trace_kvm_dirty_ring_exit(vcpu);
>>             return true;
>>         }
>>
>>         return false;
>>    }
>>
>> - Use the modified kvm_dirty_ring_soft_full() in kvm_vcpu_exit_request().
>>
>> Userspace needs KVM_EXIT_DIRTY_RING_FULL to collect the dirty log in time.
>> Otherwise, the dirty log in the ring buffer will be overwritten. I'm not
>> sure if anything else I missed?
> 
> I'm fine with the above, but what I really wanted is a request from
> the dirty-ring insertion, instead of a check in kvm_vpcu_exit_request.
> Something like this (which obviously doesn't compile, but you'll get
> the idea):
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 986cee6fbc7f..0b41feb6fb7d 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>   
>   		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
>   			return kvm_vcpu_suspend(vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> +			trace_kvm_dirty_ring_exit(vcpu);
> +			return 0;
> +		}
>   	}
>   
>   	return 1;
> diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> index f4c2a6eb1666..08b2f01164fa 100644
> --- a/virt/kvm/dirty_ring.c
> +++ b/virt/kvm/dirty_ring.c
> @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
>   
>   void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
>   {
> +	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
>   	struct kvm_dirty_gfn *entry;
>   
>   	/* It should never get full */
> @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
>   	kvm_dirty_gfn_set_dirtied(entry);
>   	ring->dirty_index++;
>   	trace_kvm_dirty_ring_push(ring, slot, offset);
> +
> +	if (kvm_dirty_ring_soft_full(vcpu))
> +		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
>   }
>   
>   struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> 

Ok, thanks for the details, Marc. I will adopt your code in the next revision :)

Thanks,
Gavin
Peter Xu Aug. 23, 2022, 1:58 p.m. UTC | #5
On Tue, Aug 23, 2022 at 03:22:17PM +1000, Gavin Shan wrote:
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index 986cee6fbc7f..0b41feb6fb7d 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -747,6 +747,12 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
> >   		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
> >   			return kvm_vcpu_suspend(vcpu);
> > +
> > +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu)) {
> > +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> > +			trace_kvm_dirty_ring_exit(vcpu);
> > +			return 0;
> > +		}
> >   	}
> >   	return 1;
> > diff --git a/virt/kvm/dirty_ring.c b/virt/kvm/dirty_ring.c
> > index f4c2a6eb1666..08b2f01164fa 100644
> > --- a/virt/kvm/dirty_ring.c
> > +++ b/virt/kvm/dirty_ring.c
> > @@ -149,6 +149,7 @@ int kvm_dirty_ring_reset(struct kvm *kvm, struct kvm_dirty_ring *ring)
> >   void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> >   {
> > +	struct kvm_vcpu *vcpu = container_of(ring, struct kvm_vcpu, dirty_ring);
> >   	struct kvm_dirty_gfn *entry;
> >   	/* It should never get full */
> > @@ -166,6 +167,9 @@ void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
> >   	kvm_dirty_gfn_set_dirtied(entry);
> >   	ring->dirty_index++;
> >   	trace_kvm_dirty_ring_push(ring, slot, offset);
> > +
> > +	if (kvm_dirty_ring_soft_full(vcpu))
> > +		kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> >   }
> >   struct page *kvm_dirty_ring_get_page(struct kvm_dirty_ring *ring, u32 offset)
> > 
> 
> Ok, thanks for the details, Marc. I will adopt your code in next revision :)

Note that there can be a slight difference between the old and new code, in
that an (especially malicious) user app can logically ignore the
DIRTY_RING_FULL vmexit and keep kicking KVM_RUN with the new code.

Unlike with the old code, the 2nd/3rd/... KVM_RUN will still run with the
new code until the next dirty pfn is pushed to the ring, at which point
it'll request a ring-full exit again.

Each time it exits, the ring grows by one entry.

Eventually, IIUC, it can easily hit ring-full and trigger the warning at
the entry of kvm_dirty_ring_push():

	/* It should never get full */
	WARN_ON_ONCE(kvm_dirty_ring_full(ring));

We did that because kvm_dirty_ring_push() was designed not to be able to
fail at all (e.g., in the old bitmap world we could never fail either). It
can't fail because we can't lose any dirty page, or migration could silently
fail (consider the case where we do a user exit due to ring-full and
migration has just completed; there could be unsynced pages on src/dst).

So even though the old approach needs to read kvm->dirty_ring_size on every
entrance, which is a pity, it avoids the issue above.

Side note: for x86, the dirty ring check was put at the entrance not
because it needs to be the highest priority - it should really be the same
as checking kvm requests. It's just that it's the fastest way to fail
properly, if needed, before loading the MMU, disabling preemption/irqs, etc.

Thanks,
Oliver Upton Aug. 23, 2022, 2:44 p.m. UTC | #6
On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote:
> Hi Gavin,
> 
> On Mon, 22 Aug 2022 02:58:20 +0100,
> Gavin Shan <gshan@redhat.com> wrote:
> > 
> > Hi Marc,
> > 
> > On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > > On Fri, 19 Aug 2022 01:55:57 +0100,
> > > Gavin Shan <gshan@redhat.com> wrote:
> > >> 
> > >> The ring-based dirty memory tracking has been available and enabled
> > >> on x86 for a while. The feature is beneficial when the number of
> > >> dirty pages is small in a checkpointing system or live migration
> > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> > >> Implement ring-based dirty memory tracking").
> > >> 
> > >> This enables the ring-based dirty memory tracking on ARM64. It's
> > >> notable that no extra reserved ring entries are needed on ARM64
> > >> because the huge pages are always split into base pages when page
> > >> dirty tracking is enabled.
> > > 
> > > Can you please elaborate on this? Adding a per-CPU ring of course
> > > results in extra memory allocation, so there must be a subtle
> > > x86-specific detail that I'm not aware of...
> > > 
> > 
> > Sure. I guess it's helpful to explain how it works in next revision.
> > Something like below:
> > 
> > This enables the ring-based dirty memory tracking on ARM64. The feature
> > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
> > CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
> > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> > pushed by host when page becomes dirty and pulled by userspace. A vcpu
> > exit is forced when the ring buffer becomes full. The ring buffers on
> > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
> > 
> > Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> > However, it's avoiding synchronization among multiple vcpus when dirty
> > pages happen on multiple vcpus. More discussion can be found from [1]
> 
> Oh, I totally buy the relaxation of the synchronisation (though I
> doubt this will have any visible effect until we have something like
> Oliver's patches to allow parallel faulting).
> 

Heh, yeah I need to get that out the door. I'll also note that Gavin's
changes are still relevant without that series, as we do write unprotect
in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
Add fast path to handle permission relaxation during dirty logging").

--
Thanks,
Oliver
Marc Zyngier Aug. 23, 2022, 8:35 p.m. UTC | #7
On Tue, 23 Aug 2022 15:44:42 +0100,
Oliver Upton <oliver.upton@linux.dev> wrote:
> 
> On Mon, Aug 22, 2022 at 10:42:15PM +0100, Marc Zyngier wrote:
> > Hi Gavin,
> > 
> > On Mon, 22 Aug 2022 02:58:20 +0100,
> > Gavin Shan <gshan@redhat.com> wrote:
> > > 
> > > Hi Marc,
> > > 
> > > On 8/19/22 6:00 PM, Marc Zyngier wrote:
> > > > On Fri, 19 Aug 2022 01:55:57 +0100,
> > > > Gavin Shan <gshan@redhat.com> wrote:
> > > >> 
> > > >> The ring-based dirty memory tracking has been available and enabled
> > > >> on x86 for a while. The feature is beneficial when the number of
> > > >> dirty pages is small in a checkpointing system or live migration
> > > >> scenario. More details can be found from fb04a1eddb1a ("KVM: X86:
> > > >> Implement ring-based dirty memory tracking").
> > > >> 
> > > >> This enables the ring-based dirty memory tracking on ARM64. It's
> > > >> notable that no extra reserved ring entries are needed on ARM64
> > > >> because the huge pages are always split into base pages when page
> > > >> dirty tracking is enabled.
> > > > 
> > > > Can you please elaborate on this? Adding a per-CPU ring of course
> > > > results in extra memory allocation, so there must be a subtle
> > > > x86-specific detail that I'm not aware of...
> > > > 
> > > 
> > > Sure. I guess it's helpful to explain how it works in next revision.
> > > Something like below:
> > > 
> > > This enables the ring-based dirty memory tracking on ARM64. The feature
> > > is enabled by CONFIG_HAVE_KVM_DIRTY_RING, detected and enabled by
> > > CONFIG_HAVE_KVM_DIRTY_RING. A ring buffer is created on every vcpu and
> > > each entry is described by 'struct kvm_dirty_gfn'. The ring buffer is
> > > pushed by host when page becomes dirty and pulled by userspace. A vcpu
> > > exit is forced when the ring buffer becomes full. The ring buffers on
> > > all vcpus can be reset by ioctl command KVM_RESET_DIRTY_RINGS.
> > > 
> > > Yes, I think so. Adding a per-CPU ring results in extra memory allocation.
> > > However, it's avoiding synchronization among multiple vcpus when dirty
> > > pages happen on multiple vcpus. More discussion can be found from [1]
> > 
> > Oh, I totally buy the relaxation of the synchronisation (though I
> > doubt this will have any visible effect until we have something like
> > Oliver's patches to allow parallel faulting).
> > 
> 
> Heh, yeah I need to get that out the door. I'll also note that Gavin's
> changes are still relevant without that series, as we do write unprotect
> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
> Add fast path to handle permission relaxation during dirty logging").

Ah, true. Now if only someone could explain how the whole
producer-consumer thing works without a trace of a barrier, that'd be
great...

Thanks,

	M.
Peter Xu Aug. 23, 2022, 9:20 p.m. UTC | #8
On Tue, Aug 23, 2022 at 08:17:03PM +0100, Marc Zyngier wrote:
> I don't think we really need this check on the hot path. All we need
> is to make the request sticky until userspace gets their act together
> and consumes elements in the ring. Something like:
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index 986cee6fbc7f..e8ed5e1af159 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -747,6 +747,14 @@ static int check_vcpu_requests(struct kvm_vcpu *vcpu)
>  
>  		if (kvm_check_request(KVM_REQ_SUSPEND, vcpu))
>  			return kvm_vcpu_suspend(vcpu);
> +
> +		if (kvm_check_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
> +		    kvm_dirty_ring_soft_full(vcpu)) {
> +			kvm_make_request(KVM_REQ_RING_SOFT_FULL, vcpu);
> +			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
> +			trace_kvm_dirty_ring_exit(vcpu);
> +			return 0;
> +		}
>  	}
>  
>  	return 1;

Right, this seems to work. We can also use kvm_test_request() here.
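
Something like below, I suppose (untested; kvm_test_request() only reads
the bit without consuming it, so there's no need to re-raise the request):

		if (kvm_test_request(KVM_REQ_RING_SOFT_FULL, vcpu) &&
		    kvm_dirty_ring_soft_full(vcpu)) {
			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
			trace_kvm_dirty_ring_exit(vcpu);
			return 0;
		}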

> 
> 
> However, I'm a bit concerned by the reset side of things. It iterates
> over the vcpus and expects the view of each ring to be consistent,
> even if userspace is hacking at it from another CPU. For example, I
> can't see what guarantees that the kernel observes the writes from
> userspace in the order they are being performed (the documentation
> provides no requirements other than "it must collect the dirty GFNs in
> sequence", which doesn't mean much from an ordering perspective).
> 
> I can see that working on a strongly ordered architecture, but on
> something as relaxed as ARM, the CPUs may^Wwill aggressively reorder
> stuff that isn't explicitly ordered. I have the feeling that a CAS
> operation on both sides would be enough, but someone who actually
> understands how this works should have a look...

I definitely don't think I understand 100% of the ordering details since
they're complicated.. but my understanding is that the reset procedure
didn't need a memory barrier (unlike pushing, where we have an explicit
wmb), because we assumed the user app is not hostile, so logically it should
only modify the flags, which is a 32-bit field with atomicity guaranteed.

IIRC we used to discuss similar questions along the lines of "what if the
user is hostile and wants to hack the process by messing with the ring",
and our conclusion was that as long as the process can't mess up anything
outside itself it should be okay. E.g. it should not be able to cause the
host to malfunction, or to trigger kernel warnings in dmesg, etc.

Thanks,
Marc Zyngier Aug. 24, 2022, 8:57 p.m. UTC | #9
On Wed, 24 Aug 2022 17:21:50 +0100,
Peter Xu <peterx@redhat.com> wrote:
> 
> On Wed, Aug 24, 2022 at 03:45:11PM +0100, Marc Zyngier wrote:
> > On Wed, 24 Aug 2022 00:19:04 +0100,
> > Peter Xu <peterx@redhat.com> wrote:
> > > 
> > > On Tue, Aug 23, 2022 at 11:47:03PM +0100, Marc Zyngier wrote:
> > > > Atomicity doesn't guarantee ordering, unfortunately.
> > > 
> > > Right, sorry to be misleading.  The "atomicity" part I was trying to say
> > > the kernel will always see consistent update on the fields.
> > >
> > > The ordering should also be guaranteed, because things must happen with
> > > below sequence:
> > > 
> > >   (1) kernel publish dirty GFN data (slot, offset)
> > >   (2) kernel publish dirty GFN flag (set to DIRTY)
> > >   (3) user sees DIRTY, collects (slots, offset)
> > >   (4) user sets it to RESET
> > >   (5) kernel reads RESET
> > 
> > Maybe. Maybe not. The reset could well be sitting in the CPU write
> > buffer for as long as it wants and not be seen by the kernel if the
> > read occurs on another CPU. And that's the crucial bit: single-CPU is
> > fine, but cross CPU isn't. Unfortunately, the userspace API is per-CPU
> > on collection, and global on reset (this seems like a bad decision,
> > but it is too late to fix this).
> 
> Regarding the last statement, that's something I had questions about too and
> discussed with Paolo, even though at that time it wasn't an outcome of
> discussing memory ordering issues.
> 
> IIUC the initial design was trying to avoid tlb flush flood when vcpu
> number is large (each RESET per ring even for one page will need all vcpus
> to flush, so O(N^2) flushing needed). With global RESET it's O(N).  So
> it's kind of a trade-off, and indeed until now I'm not sure which one is
> better.  E.g., with per-ring reset, we can have locality too in userspace,
> e.g. the vcpu thread might be able to recycle without holding global locks.

I don't get that. On x86, each CPU must perform the TLB invalidation
(there is an IPI for that). So whether you do a per-CPU scan of the
ring or a global scan is irrelevant: each entry you find in any of the
rings must result in a global invalidation, since you've updated the
PTE to make the page RO.

The same is true on ARM, except that the broadcast is done in HW
instead of being tracked in SW.

But anyway, this is all moot. The API is what it is, and it isn't
going to change any time soon. All we can do is add some
clarifications to the API for the more relaxed architectures, and make
sure the kernel behaves accordingly.

[...]

> > It may be safe, but it isn't what the userspace API promises.
> 
> The document says:
> 
>   After processing one or more entries in the ring buffer, userspace calls
>   the VM ioctl KVM_RESET_DIRTY_RINGS to notify the kernel about it, so that
>   the kernel will reprotect those collected GFNs.  Therefore, the ioctl
>   must be called *before* reading the content of the dirty pages.
> 
> I'd say it's not an explicit promise, but I think I agree with you that at
> least it's unclear on the behavior.

This is the least problematic part of the documentation. The bit I
literally choke on is this:

<quote>
It's not necessary for userspace to harvest the all dirty GFNs at once.
However it must collect the dirty GFNs in sequence, i.e., the userspace
program cannot skip one dirty GFN to collect the one next to it.
</quote>

This is the core of the issue. Without ordering rules, the consumer on
the other side cannot observe the updates correctly, even if userspace
follows the above to the letter. Of course, the kernel itself must do
the right thing (I guess it amounts to the kernel doing a
load-acquire, and userspace doing a store-release -- effectively
emulating x86...).
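
Concretely, I'd expect something like this pairing on the flags field
(illustrative only, not a patch; the userspace half uses the GCC builtin
rather than whatever wrapper the VMM actually provides):

	/* kernel side, when recycling entries on KVM_RESET_DIRTY_RINGS */
	static inline bool kvm_dirty_gfn_harvested(struct kvm_dirty_gfn *gfn)
	{
		return smp_load_acquire(&gfn->flags) & KVM_DIRTY_GFN_F_RESET;
	}

	/* userspace side, when handing an entry back after collecting it */
	__atomic_store_n(&gfn->flags, KVM_DIRTY_GFN_F_RESET, __ATOMIC_RELEASE);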

> Since we have a global recycle mechanism, most likely the app (e.g. current
> qemu impl) will use the same thread to collect/reset dirty GFNs, and
> trigger the RESET ioctl().  In that case it's safe, IIUC, because no
> cross-core ops.
> 
> QEMU even guarantees this by checking it (kvm_dirty_ring_reap_locked):
> 
>     if (total) {
>         ret = kvm_vm_ioctl(s, KVM_RESET_DIRTY_RINGS);
>         assert(ret == total);
>     }
> 
> I think the assert() should never trigger as mentioned above.  But ideally
> maybe it should just be a loop until cleared gfns match total.

Right. If userspace calls the ioctl on every vcpu, then things should
work correctly. It is only that the overhead is higher than what it
should be if multiple vcpus perform a reset at the same time.

>
> > In other words, without further straightening of the API, this doesn't
> > work as expected on relaxed memory architectures. So before this gets
> > enabled on arm64, this whole ordering issue must be addressed.
> 
> How about adding some more documentation for KVM_RESET_DIRTY_RINGS on the
> possibility of recycling only part of the pages, especially when collection
> and the ioctl() aren't done from the same thread?

I'd rather tell people about the ordering rules. That will come at
zero cost on x86.

> Any suggestions will be greatly welcomed.

I'll write a couple of patches when I get the time, most likely next
week. Gavin will hopefully be able to take them as part of his series.

	M.
Paolo Bonzini Aug. 26, 2022, 10:58 a.m. UTC | #10
On 8/23/22 22:35, Marc Zyngier wrote:
>> Heh, yeah I need to get that out the door. I'll also note that Gavin's
>> changes are still relevant without that series, as we do write unprotect
>> in parallel at PTE granularity after commit f783ef1c0e82 ("KVM: arm64:
>> Add fast path to handle permission relaxation during dirty logging").
>
> Ah, true. Now if only someone could explain how the whole
> producer-consumer thing works without a trace of a barrier, that'd be
> great...

Do you mean this?

void kvm_dirty_ring_push(struct kvm_dirty_ring *ring, u32 slot, u64 offset)
{
         struct kvm_dirty_gfn *entry;

         /* It should never get full */
         WARN_ON_ONCE(kvm_dirty_ring_full(ring));

         entry = &ring->dirty_gfns[ring->dirty_index & (ring->size - 1)];

         entry->slot = slot;
         entry->offset = offset;
         /*
          * Make sure the data is filled in before we publish this to
          * the userspace program.  There's no paired kernel-side reader.
          */
         smp_wmb();
         kvm_dirty_gfn_set_dirtied(entry);
         ring->dirty_index++;
         trace_kvm_dirty_ring_push(ring, slot, offset);
}

The matching smp_rmb() is in userspace.
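
That is, the consumer side does roughly the following (a sketch in
kernel-style notation; collect_dirty_page() is just a placeholder):

	if (READ_ONCE(gfn->flags) & KVM_DIRTY_GFN_F_DIRTY) {
		smp_rmb();	/* pairs with the smp_wmb() above */
		/* only now is it safe to read the payload */
		collect_dirty_page(gfn->slot, gfn->offset);
	}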

Paolo

Patch

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index abd7c32126ce..19fa1ac017ed 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8022,7 +8022,7 @@  regardless of what has actually been exposed through the CPUID leaf.
 8.29 KVM_CAP_DIRTY_LOG_RING
 ---------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Parameters: args[0] - size of the dirty log ring
 
 KVM is capable of tracking dirty memory using ring buffers that are
diff --git a/arch/arm64/include/uapi/asm/kvm.h b/arch/arm64/include/uapi/asm/kvm.h
index 3bb134355874..7e04b0b8d2b2 100644
--- a/arch/arm64/include/uapi/asm/kvm.h
+++ b/arch/arm64/include/uapi/asm/kvm.h
@@ -43,6 +43,7 @@ 
 #define __KVM_HAVE_VCPU_EVENTS
 
 #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
+#define KVM_DIRTY_LOG_PAGE_OFFSET 64
 
 #define KVM_REG_SIZE(id)						\
 	(1U << (((id) & KVM_REG_SIZE_MASK) >> KVM_REG_SIZE_SHIFT))
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 815cc118c675..0309b2d0f2da 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -32,6 +32,7 @@  menuconfig KVM
 	select KVM_VFIO
 	select HAVE_KVM_EVENTFD
 	select HAVE_KVM_IRQFD
+	select HAVE_KVM_DIRTY_RING
 	select HAVE_KVM_MSI
 	select HAVE_KVM_IRQCHIP
 	select HAVE_KVM_IRQ_ROUTING
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index 986cee6fbc7f..3de6b9b39db7 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -866,6 +866,14 @@  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		if (!ret)
 			ret = 1;
 
+		/* Force vcpu exit if its dirty ring is soft-full */
+		if (unlikely(vcpu->kvm->dirty_ring_size &&
+			     kvm_dirty_ring_soft_full(&vcpu->dirty_ring))) {
+			vcpu->run->exit_reason = KVM_EXIT_DIRTY_RING_FULL;
+			trace_kvm_dirty_ring_exit(vcpu);
+			ret = 0;
+		}
+
 		if (ret > 0)
 			ret = check_vcpu_requests(vcpu);