
[v2,3/6] KVM: VMX: Handle vectoring error in check_emulate_instruction

Message ID 20241111102749.82761-4-iorlov@amazon.com
State: New
Series: Enhance event delivery error handling

Commit Message

Ivan Orlov Nov. 11, 2024, 10:27 a.m. UTC
Move detection of the unhandleable vmexit due to MMIO during vectoring
into check_emulate_instruction. Implement a function which checks
whether emul_type indicates MMIO so it can be used for both VMX and SVM.

Fix the comment about EMULTYPE_PF, as this flag doesn't necessarily
mean MMIO anymore: it can also be set due to a write protection
violation.

Signed-off-by: Ivan Orlov <iorlov@amazon.com>
---
V1 -> V2:
- Detect the unhandleable vectoring error in vmx_check_emulate_instruction
instead of handling it in the common MMU code (which is specific to
cached MMIO)

 arch/x86/include/asm/kvm_host.h | 10 ++++++++--
 arch/x86/kvm/vmx/vmx.c          | 25 ++++++++++++-------------
 2 files changed, 20 insertions(+), 15 deletions(-)

Comments

Sean Christopherson Dec. 11, 2024, 6:15 p.m. UTC | #1
On Mon, Nov 11, 2024, Ivan Orlov wrote:
> Move unhandleable vmexit due to MMIO during vectoring error detection
> into check_emulate_instruction. Implement a function which checks if
> emul_type indicates MMIO so it can be used for both VMX and SVM.
> 
> Fix the comment about EMULTYPE_PF as this flag doesn't necessarily
> mean MMIO anymore: it can also be set due to the write protection
> violation.
> 
> Signed-off-by: Ivan Orlov <iorlov@amazon.com>
> ---
> V1 -> V2:
> - Detect the unhandleable vectoring error in vmx_check_emulate_instruction
> instead of handling it in the common MMU code (which is specific for
> cached MMIO)
> 
>  arch/x86/include/asm/kvm_host.h | 10 ++++++++--
>  arch/x86/kvm/vmx/vmx.c          | 25 ++++++++++++-------------
>  2 files changed, 20 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index eb413079b7c6..3de9702a9135 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -2017,8 +2017,8 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>   *			VMware backdoor emulation handles select instructions
>   *			and reinjects the #GP for all other cases.
>   *
> - * EMULTYPE_PF - Set when emulating MMIO by way of an intercepted #PF, in which
> - *		 case the CR2/GPA value pass on the stack is valid.
> + * EMULTYPE_PF - Set when an intercepted #PF triggers the emulation, in which case
> + *		 the CR2/GPA value pass on the stack is valid.
>   *
>   * EMULTYPE_COMPLETE_USER_EXIT - Set when the emulator should update interruptibility
>   *				 state and inject single-step #DBs after skipping
> @@ -2053,6 +2053,12 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>  #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
>  #define EMULTYPE_WRITE_PF_TO_SP	    (1 << 8)
>  
> +static inline bool kvm_is_emul_type_mmio(int emul_type)

Hmm, this should probably be "pf_mmio", not just "mmio".  E.g. if KVM is emulating
large swaths of guest code because unrestricted guest is disabled, then it can end
up emulating an MMIO access for "normal" emulation.

Hmm, actually, what if we go with this?

  static inline bool kvm_can_emulate_event_vectoring(int emul_type)
  {
	return !(emul_type & EMULTYPE_PF) ||
	       (emul_type & EMULTYPE_WRITE_PF_TO_SP);
  }

The MMIO aspect isn't unique to VMX or SVM, and the above would allow handling
other incompatible emulation types, should they ever arise (unlikely).

More importantly, it provides a single location to document (via comment) exactly
why KVM can't deal with MMIO emulation during event vectoring (and why KVM allows
emulation of write-protect page faults).

It would require using X86EMUL_UNHANDLEABLE_VECTORING instead of
X86EMUL_UNHANDLEABLE_VECTORING_IO, but I don't think that's necessarily a bad
thing.
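As a sketch of what that single documented location could look like, here is the suggested helper with a comment paraphrasing the reasoning above. The flag values are stand-ins for illustration (EMULTYPE_WRITE_PF_TO_SP's value, 1 << 8, appears in the patch context; EMULTYPE_PF's is assumed):

```c
#include <stdbool.h>

/* Stand-in values for illustration; the real definitions live in
 * arch/x86/include/asm/kvm_host.h. */
#define EMULTYPE_PF		(1 << 6)
#define EMULTYPE_WRITE_PF_TO_SP	(1 << 8)

/*
 * Emulating an instruction while an event is being vectored is unsafe
 * for MMIO-style #PF emulation: the exit can recur after resuming the
 * guest, and a "successfully" emulated instruction can raise a second
 * contributory exception on top of the re-injected one.  Write-protect
 * faults are tolerable because KVM can unprotect the page and retry
 * instead of emulating.
 */
static inline bool kvm_can_emulate_event_vectoring(int emul_type)
{
	return !(emul_type & EMULTYPE_PF) ||
	       (emul_type & EMULTYPE_WRITE_PF_TO_SP);
}
```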

> +{
> +	return (emul_type & EMULTYPE_PF) &&
> +		!(emul_type & EMULTYPE_WRITE_PF_TO_SP);
> +}
> +
>  int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
>  int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
>  					void *insn, int insn_len);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f92740e7e107..a10f35d9704b 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1693,6 +1693,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
>  int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
>  				  void *insn, int insn_len)
>  {
> +	bool is_vect;
> +
>  	/*
>  	 * Emulation of instructions in SGX enclaves is impossible as RIP does
>  	 * not point at the failing instruction, and even if it did, the code
> @@ -1704,6 +1706,13 @@ int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
>  		kvm_queue_exception(vcpu, UD_VECTOR);
>  		return X86EMUL_PROPAGATE_FAULT;
>  	}
> +
> +	is_vect = to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK;
> +
> +	/* Emulation is not possible when MMIO happens during event vectoring. */
> +	if (kvm_is_emul_type_mmio(emul_type) && is_vect)

I definitely prefer to omit the local variable.  

	if ((to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
	    !kvm_can_emulate_event_vectoring(emul_type))
		return X86EMUL_UNHANDLEABLE_VECTORING;
Ivan Orlov Dec. 11, 2024, 10:05 p.m. UTC | #2
On 12/11/24 18:15, Sean Christopherson wrote:
> On Mon, Nov 11, 2024, Ivan Orlov wrote:
>> Move unhandleable vmexit due to MMIO during vectoring error detection
>> into check_emulate_instruction. Implement a function which checks if
>> emul_type indicates MMIO so it can be used for both VMX and SVM.
>>
>> Fix the comment about EMULTYPE_PF as this flag doesn't necessarily
>> mean MMIO anymore: it can also be set due to the write protection
>> violation.
>>
>> Signed-off-by: Ivan Orlov <iorlov@amazon.com>
>> ---
>> V1 -> V2:
>> - Detect the unhandleable vectoring error in vmx_check_emulate_instruction
>> instead of handling it in the common MMU code (which is specific for
>> cached MMIO)
>>
>>   arch/x86/include/asm/kvm_host.h | 10 ++++++++--
>>   arch/x86/kvm/vmx/vmx.c          | 25 ++++++++++++-------------
>>   2 files changed, 20 insertions(+), 15 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index eb413079b7c6..3de9702a9135 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -2017,8 +2017,8 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>>    *			VMware backdoor emulation handles select instructions
>>    *			and reinjects the #GP for all other cases.
>>    *
>> - * EMULTYPE_PF - Set when emulating MMIO by way of an intercepted #PF, in which
>> - *		 case the CR2/GPA value pass on the stack is valid.
>> + * EMULTYPE_PF - Set when an intercepted #PF triggers the emulation, in which case
>> + *		 the CR2/GPA value pass on the stack is valid.
>>    *
>>    * EMULTYPE_COMPLETE_USER_EXIT - Set when the emulator should update interruptibility
>>    *				 state and inject single-step #DBs after skipping
>> @@ -2053,6 +2053,12 @@ u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
>>   #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
>>   #define EMULTYPE_WRITE_PF_TO_SP	    (1 << 8)
>>   
>> +static inline bool kvm_is_emul_type_mmio(int emul_type)
> 
> Hmm, this should probably be "pf_mmio", not just "mmio".  E.g. if KVM is emulating
> large swaths of guest code because unrestricted guest is disabled, then it can end
> up emulating an MMIO access for "normal" emulation.
> 
> Hmm, actually, what if we go with this?
> 
>    static inline bool kvm_can_emulate_event_vectoring(int emul_type)
>    {
> 	return !(emul_type & EMULTYPE_PF) ||
> 	       (emul_type & EMULTYPE_WRITE_PF_TO_SP);
>    }
> 

I don't mind using either option here; in fact, both of them would
require an update if there is a new emulated page fault type. Let's use
the more generic one (which is kvm_can_emulate_event_vectoring) :)

I'm thinking about a static assert we could add to it, to be sure the 
condition gets updated if such an EMUL_TYPE is introduced, but I can't 
think of something neat... Anyway, it can be done in a separate patch I 
guess (if we really need it).
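One possible shape for such a guard — purely a sketch, with assumed flag values and a hypothetical EMULTYPE_PF_MASK that the kernel does not define:

```c
#include <assert.h>

/* Assumed demo values for the EMULTYPE_* bits discussed in the thread. */
#define EMULTYPE_ALLOW_RETRY_PF	(1 << 3)
#define EMULTYPE_PF		(1 << 6)
#define EMULTYPE_WRITE_PF_TO_SP	(1 << 8)

/*
 * Hypothetical guard: keep an explicit mask of every #PF-related flag
 * next to the predicate, and assert it matches the bits the predicate
 * was written against.  Adding a new PF flag then breaks the build
 * until kvm_can_emulate_event_vectoring() is revisited.  (In kernel
 * code this would be static_assert() or BUILD_BUG_ON().)
 */
#define EMULTYPE_PF_MASK \
	(EMULTYPE_PF | EMULTYPE_ALLOW_RETRY_PF | EMULTYPE_WRITE_PF_TO_SP)

static_assert(EMULTYPE_PF_MASK == ((1 << 3) | (1 << 6) | (1 << 8)),
	      "audit kvm_can_emulate_event_vectoring() for new PF flags");
```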

--
Kind regards,
Ivan Orlov
Ivan Orlov Dec. 11, 2024, 11:12 p.m. UTC | #3
On 12/11/24 18:15, Sean Christopherson wrote:
> Hmm, this should probably be "pf_mmio", not just "mmio".  E.g. if KVM is emulating
> large swaths of guest code because unrestricted guest is disabled, then it can end
> up emulating an MMIO access for "normal" emulation.
> 
> Hmm, actually, what if we go with this?
> 
>    static inline bool kvm_can_emulate_event_vectoring(int emul_type)
>    {
> 	return !(emul_type & EMULTYPE_PF) ||
> 	       (emul_type & EMULTYPE_WRITE_PF_TO_SP);
>    }
> 

Hm, by the way, what is the desired behaviour if EMULTYPE_ALLOW_RETRY_PF
is set? Is it correct that we return an internal error if it is set
during vectoring? Or could KVM try to unprotect the page and re-execute?

If so, we may need something like

static inline bool kvm_can_emulate_event_vectoring(int emul_type)
{
	return !(emul_type & EMULTYPE_PF) ||
	       (emul_type & ~(EMULTYPE_PF));
}

So it returns true if EMULTYPE_PF is not set or if it's not the only set 
bit.
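A quick side-by-side of the two proposed predicates, using assumed stand-in values for the EMULTYPE_* bits (not the kernel's real values), shows where they diverge:

```c
#include <stdbool.h>

/* Assumed demo values for the EMULTYPE_* bits (not the kernel's). */
#define EMULTYPE_ALLOW_RETRY_PF	(1 << 3)
#define EMULTYPE_PF		(1 << 6)
#define EMULTYPE_WRITE_PF_TO_SP	(1 << 8)

/* Sean's proposal: only the write-protect (WRITE_PF_TO_SP) case passes. */
static bool can_emulate_v1(int emul_type)
{
	return !(emul_type & EMULTYPE_PF) ||
	       (emul_type & EMULTYPE_WRITE_PF_TO_SP);
}

/* The alternative above: any flag alongside EMULTYPE_PF passes. */
static bool can_emulate_v2(int emul_type)
{
	return !(emul_type & EMULTYPE_PF) ||
	       (emul_type & ~EMULTYPE_PF);
}
```

Under these definitions the variants disagree exactly when EMULTYPE_PF is combined with a flag other than EMULTYPE_WRITE_PF_TO_SP: v1 rejects EMULTYPE_PF | EMULTYPE_ALLOW_RETRY_PF while v2 accepts it, which is the EMULTYPE_ALLOW_RETRY_PF question raised above.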
Sean Christopherson Dec. 12, 2024, 1:01 a.m. UTC | #4
On Wed, Dec 11, 2024, Ivan Orlov wrote:
> On 12/11/24 18:15, Sean Christopherson wrote:
> > Hmm, this should probably be "pf_mmio", not just "mmio".  E.g. if KVM is emulating
> > large swaths of guest code because unrestricted guest is disabled, then it can end
> > up emulating an MMIO access for "normal" emulation.
> > 
> > Hmm, actually, what if we go with this?
> > 
> >    static inline bool kvm_can_emulate_event_vectoring(int emul_type)
> >    {
> > 	return !(emul_type & EMULTYPE_PF) ||
> > 	       (emul_type & EMULTYPE_WRITE_PF_TO_SP);
> >    }
> > 
> 
> Hm, by the way, what is the desired behaviour if EMULTYPE_ALLOW_RETRY_PF is
> set? Is it correct that we return an internal error if it is set during
> vectoring? Or KVM may try to unprotect the page and re-execute?

Heh, it's sneaky, but EMULTYPE_ALLOW_RETRY_PF can be set if and only if
RET_PF_WRITE_PROTECTED is set.  Hmm, that makes me think we should do the below
(EMULTYPE_WRITE_PF_TO_SP was a recent addition).

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e713480933a..de5f6985d123 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9077,7 +9077,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
        if ((emulation_type & EMULTYPE_ALLOW_RETRY_PF) &&
            (WARN_ON_ONCE(is_guest_mode(vcpu)) ||
-            WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF))))
+            WARN_ON_ONCE(!(emulation_type & EMULTYPE_WRITE_PF_TO_SP))))
                emulation_type &= ~EMULTYPE_ALLOW_RETRY_PF;
 
        r = kvm_check_emulate_insn(vcpu, emulation_type, insn, insn_len);

That said, let me get back to you on this when my brain is less tired.  I'm not
sure emulating when an exit occurred during event delivery is _ever_ correct.

> If so, we may need something like
> 
> static inline bool kvm_can_emulate_event_vectoring(int emul_type)
> {
> 	return !(emul_type & EMULTYPE_PF) ||
> 	       (emul_type & ~(EMULTYPE_PF));
> }
> 
> So it returns true if EMULTYPE_PF is not set or if it's not the only set
> bit.
Ivan Orlov Dec. 12, 2024, 4:41 p.m. UTC | #5
On Wed, Dec 11, 2024 at 05:01:07PM -0800, Sean Christopherson wrote:
> > Hm, by the way, what is the desired behaviour if EMULTYPE_ALLOW_RETRY_PF is
> > set? Is it correct that we return an internal error if it is set during
> > vectoring? Or KVM may try to unprotect the page and re-execute?
> 
> Heh, it's sneaky, but EMULTYPE_ALLOW_RETRY_PF can be set if and only if
> RET_PF_WRITE_PROTECTED is set.  Hmm, that makes me think we should do the below
> (EMULTYPE_WRITE_PF_TO_SP was a recent addition).
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2e713480933a..de5f6985d123 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9077,7 +9077,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> 
>         if ((emulation_type & EMULTYPE_ALLOW_RETRY_PF) &&
>             (WARN_ON_ONCE(is_guest_mode(vcpu)) ||
> -            WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF))))
> +            WARN_ON_ONCE(!(emulation_type & EMULTYPE_WRITE_PF_TO_SP))))
>                 emulation_type &= ~EMULTYPE_ALLOW_RETRY_PF;
> 

What if we are handling a write to a write-protected page, but not a write to
the page table? We can still retry after unprotecting the page, so
EMULTYPE_ALLOW_RETRY_PF should be enabled, right?

>         r = kvm_check_emulate_insn(vcpu, emulation_type, insn, insn_len);
> 
> That said, let me get back to you on this when my brain is less tired.  I'm not
> sure emulating when an exit occurred during event delivery is _ever_ correct.
> 

I believe we can re-execute the instruction if the exit happened during
vectoring due to an exception (and if the address is not MMIO, e.g. when
it's a write to a write-protected page, for instance when the stack
points to it).

KVM unprotects the page, executes the instruction one more time and
(probably) gets this exception once again (but the page is already
unprotected, so vectoring succeeds without a vmexit). If not
(e.g. the exception conditions are not met anymore), the guest shouldn't
really care and can continue execution.

However, I'm not sure what happens if vectoring is caused by an external
interrupt: if we unprotect the page and re-execute the instruction,
will the IRQ be delivered nonetheless, or will it be lost, as the IRQ is
already in the ISR? Do we need to re-inject it in such a case?

--
Kind regards,
Ivan Orlov
Sean Christopherson Dec. 12, 2024, 7:42 p.m. UTC | #6
On Thu, Dec 12, 2024, Ivan Orlov wrote:
> On Wed, Dec 11, 2024 at 05:01:07PM -0800, Sean Christopherson wrote:
> > > Hm, by the way, what is the desired behaviour if EMULTYPE_ALLOW_RETRY_PF is
> > > set? Is it correct that we return an internal error if it is set during
> > > vectoring? Or KVM may try to unprotect the page and re-execute?
> > 
> > Heh, it's sneaky, but EMULTYPE_ALLOW_RETRY_PF can be set if and only if
> > RET_PF_WRITE_PROTECTED is set.  Hmm, that makes me think we should do the below
> > (EMULTYPE_WRITE_PF_TO_SP was a recent addition).
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 2e713480933a..de5f6985d123 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9077,7 +9077,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
> > 
> >         if ((emulation_type & EMULTYPE_ALLOW_RETRY_PF) &&
> >             (WARN_ON_ONCE(is_guest_mode(vcpu)) ||
> > -            WARN_ON_ONCE(!(emulation_type & EMULTYPE_PF))))
> > +            WARN_ON_ONCE(!(emulation_type & EMULTYPE_WRITE_PF_TO_SP))))
> >                 emulation_type &= ~EMULTYPE_ALLOW_RETRY_PF;
> > 
> 
> What if we are handling a write to write-protected page, but not a write to
> the page table? We still can retry after unprotecting the page, so
> EMULTYPE_ALLOW_RETRY_PF should be enabled, right?

Gah, I got my enums mixed up.  I conflated RET_PF_WRITE_PROTECTED with
EMULTYPE_WRITE_PF_TO_SP.  Ignore the above.

FWIW, KVM _can't_ unprotect and retry in the EMULTYPE_WRITE_PF_TO_SP case.  From
kvm_unprotect_and_retry_on_failure():

	/*
	 * If the failed instruction faulted on an access to page tables that
	 * are used to translate any part of the instruction, KVM can't resolve
	 * the issue by unprotecting the gfn, as zapping the shadow page will
	 * result in the instruction taking a !PRESENT page fault and thus put
	 * the vCPU into an infinite loop of page faults.  E.g. KVM will create
	 * a SPTE and write-protect the gfn to resolve the !PRESENT fault, and
	 * then zap the SPTE to unprotect the gfn, and then do it all over
	 * again.  Report the error to userspace.
	 */
	if (emulation_type & EMULTYPE_WRITE_PF_TO_SP)
		return false;


> >         r = kvm_check_emulate_insn(vcpu, emulation_type, insn, insn_len);
> > 
> > That said, let me get back to you on this when my brain is less tired.  I'm not
> > sure emulating when an exit occurred during event delivery is _ever_ correct.
> > 
> 
> I believe we can re-execute the instruction if exit happened during vectoring
> due to exception (and if the address is not MMIO, e.g. when it's a write to
> write-protected page, for instance when stack points to it).

Unprotect and re-execute is fine, what I'm worried about is *successfully*
emulating the instruction.  E.g. 

  1. CPU executes instruction X and hits a #GP.
  2. While vectoring the #GP, a shadow #PF is taken.
  3. On VM-Exit, KVM re-injects the #GP (see __vmx_complete_interrupts()).
  4. KVM emulates because of the write-protected page.
  5. KVM "successfully" emulates and also detects the #GP
  6. KVM synthesizes a #GP, and because the vCPU already has injected #GP,
     incorrectly escalates to a #DF.

The above is a bit contrived, but I think it could happen if the guest reused a
page that _was_ a page table, for a vCPU's kernel stack.
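The escalation in step 6 follows the x86 double-fault classification rules. Below is a simplified sketch of that classification, not KVM's actual code (the real logic lives in kvm_multiple_exception() and covers more cases):

```c
/* Exception vectors involved in the #DF classification (per the Intel
 * SDM's double-fault conditions), simplified for illustration. */
enum vec { DE_VEC = 0, BP_VEC = 3, DF_VEC = 8, TS_VEC = 10,
	   NP_VEC = 11, SS_VEC = 12, GP_VEC = 13, PF_VEC = 14 };

static int is_contributory(int vec)
{
	return vec == DE_VEC || vec == TS_VEC || vec == NP_VEC ||
	       vec == SS_VEC || vec == GP_VEC;
}

/* Given an already-injected exception and a newly raised one, return
 * the vector that should ultimately be delivered.  A contributory
 * exception on top of a contributory one (or any fault on top of a
 * #PF) escalates to #DF -- which is why synthesizing a #GP in step 6,
 * with a #GP already injected, yields a #DF the guest never earned. */
static int merge_exception(int injected, int new_vec)
{
	if ((is_contributory(injected) && is_contributory(new_vec)) ||
	    (injected == PF_VEC &&
	     (new_vec == PF_VEC || is_contributory(new_vec))))
		return DF_VEC;
	return new_vec;
}
```

Note that the classification is asymmetric: a #GP followed by a #PF is delivered serially, while a #PF followed by a #GP escalates to #DF.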

> KVM unprotects the page, executes the instruction one more time and
> (probably) gets this exception once again (but the page is already
> unprotected, so vectoring succeeds without vmexit). If not
> (e.g. exception conditions are not met anymore), guest shouldn't really
> care and it can continue execution.
> 
> However, I'm not sure what happens if vectoring is caused by external
> interrupt: if we unprotect the page and re-execute the instruction,
> will IRQ be delivered nonetheless, or it will be lost as irq is
> already in ISR? Do we need to re-inject it in such a case?

In all cases, the event that was being vectored is re-injected.  Restarting from
scratch would be a bug.  E.g. if the cause of the initial exception was "fixed", say
because the initial exception was #BP, and the guest finished patching out the INT3,
then restarting would execute the _new_ instruction, and the INT3 would be lost.

In most cases, the guest would never notice, but it's still undesirable for KVM
to effectively rewrite history.

As far as unprotect+retry being viable, I think we're on the same page.  What I'm
getting at is that I think KVM should never allow emulating on #PF when the #PF
occurred while vectoring.  E.g. this:

  static inline bool kvm_can_emulate_event_vectoring(int emul_type)
  {
	return !(emul_type & EMULTYPE_PF);
  }

and then I believe this?  Where this diff can be a separate prep patch (though I'm
pretty sure it's technically pointless without the vectoring angle, because shadow
#PF can't coincide with any of the failure paths for kvm_check_emulate_insn()).

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 07c6f1d5323d..63361b2da450 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9107,6 +9107,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                if (r == X86EMUL_RETRY_INSTR || r == X86EMUL_PROPAGATE_FAULT)
                        return 1;
 
+               if (kvm_unprotect_and_retry_on_failure(vcpu, cr2_or_gpa,
+                                                      emulation_type))
+                       return 1;
+
                if (r == X86EMUL_UNHANDLEABLE_VECTORING_IO) {
                        kvm_prepare_event_vectoring_exit(vcpu, cr2_or_gpa);
                        return 0;
Ivan Orlov Dec. 13, 2024, 5:38 p.m. UTC | #7
On Thu, Dec 12, 2024 at 11:42:37AM -0800, Sean Christopherson wrote:
> Gah, I got my enums mixed up.  I conflated RET_PF_WRITE_PROTECTED with
> EMULTYPE_WRITE_PF_TO_SP.  Ignore the above.
> 
> FWIW, KVM _can't_ unprotect and retry in the EMULTYPE_WRITE_PF_TO_SP case.  From
> kvm_unprotect_and_retry_on_failure():
> 

Ah, alright, I guess I get it now :) So, EMULTYPE_ALLOW_RETRY_PF is set
whenever there is a #PF when accessing a write-protected page, and
EMULTYPE_WRITE_PF_TO_SP is set when that access touches the SPTE used to
translate the write itself. If both are set, we can't unprotect and
retry, and the latter can be set only if the former is set.

> Unprotect and re-execute is fine, what I'm worried about is *successfully*
> emulating the instruction.  E.g.
> 
>   1. CPU executes instruction X and hits a #GP.
>   2. While vectoring the #GP, a shadow #PF is taken.
>   3. On VM-Exit, KVM re-injects the #GP (see __vmx_complete_interrupts()).
>   4. KVM emulates because of the write-protected page.
>   5. KVM "successfully" emulates and also detects the #GP
>   6. KVM synthesizes a #GP, and because the vCPU already has injected #GP,
>      incorrectly escalates to a #DF.
> 
> The above is a bit contrived, but I think it could happen if the guest reused a
> page that _was_ a page table, for a vCPU's kernel stack.
> 

Does it work like that only for contributory exceptions / page faults?
If it's not a #GP but (for instance) a #UD, then (as far as I understand)
KVM will queue only one of them without causing a #DF, so it would be
valid?

> > However, I'm not sure what happens if vectoring is caused by external
> > interrupt: if we unprotect the page and re-execute the instruction,
> > will IRQ be delivered nonetheless, or it will be lost as irq is
> > already in ISR? Do we need to re-inject it in such a case?
> 
> In all cases, the event that was being vectored is re-injected.  Restarting from
> scratch would be a bug.  E.g. if the cause of initial exception was "fixed", say
> because the initial exception was #BP, and the guest finished patching out the INT3,
> then restarting would execute the _new_ instruction, and the INT3 would be lost.
> 

Cool, that is what I was concerned about, glad that it is already
implemented :)

> 
> As far as unprotect+retry being viable, I think we're on the same page.  What I'm
> getting at is that I think KVM should never allow emulating on #PF when the #PF
> occurred while vectoring.  E.g. this:
> 
>   static inline bool kvm_can_emulate_event_vectoring(int emul_type)
>   {
>         return !(emul_type & EMULTYPE_PF);
>   }
> 

Yeah, I agree. I'll post a V3 with suggested fixes (after running all of the
selftests to be sure that it doesn't break anything).

> and then I believe this?  Where this diff can be a separate prep patch (though I'm
> pretty sure it's technically pointless without the vectoring angle, because shadow
> #PF can't coincide with any of the failure paths for kvm_check_emulate_insn()).
> 

Looks good. If you don't mind, I could add this patch to the series with a
`Suggested-by` tag, since it's necessary to allow unprotect+retry in the case
of a shadow #PF during vectoring.

--
Kind regards,
Ivan Orlov

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 07c6f1d5323d..63361b2da450 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9107,6 +9107,10 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>                 if (r == X86EMUL_RETRY_INSTR || r == X86EMUL_PROPAGATE_FAULT)
>                         return 1;
> 
> +               if (kvm_unprotect_and_retry_on_failure(vcpu, cr2_or_gpa,
> +                                                      emulation_type))
> +                       return 1;
> +
>                 if (r == X86EMUL_UNHANDLEABLE_VECTORING_IO) {
>                         kvm_prepare_event_vectoring_exit(vcpu, cr2_or_gpa);
>                         return 0;
>
Sean Christopherson Dec. 13, 2024, 8:09 p.m. UTC | #8
On Fri, Dec 13, 2024, Ivan Orlov wrote:
> On Thu, Dec 12, 2024 at 11:42:37AM -0800, Sean Christopherson wrote:
> > Unprotect and re-execute is fine, what I'm worried about is *successfully*
> > emulating the instruction.  E.g.
> > 
> >   1. CPU executes instruction X and hits a #GP.
> >   2. While vectoring the #GP, a shadow #PF is taken.
> >   3. On VM-Exit, KVM re-injects the #GP (see __vmx_complete_interrupts()).
> >   4. KVM emulates because of the write-protected page.
> >   5. KVM "successfully" emulates and also detects the #GP
> >   6. KVM synthesizes a #GP, and because the vCPU already has injected #GP,
> >      incorrectly escalates to a #DF.
> > 
> > The above is a bit contrived, but I think it could happen if the guest reused a
> > page that _was_ a page table, for a vCPU's kernel stack.
> > 
> 
> Does it work like that only for contributory exceptions / page faults?

The #DF case, yes.

> In case if it's not #GP but (for instance) #UD, (as far as I understand)
> KVM will queue only one of them without causing #DF so it's gonna be
> valid?

No, it can still be invalid.  E.g. the vCPU initially hit a #BP, the guest
replaced it with a #UD, but there may be guest-visible side effects from the
original #BP.

> > > However, I'm not sure what happens if vectoring is caused by external
> > > interrupt: if we unprotect the page and re-execute the instruction,
> > > will IRQ be delivered nonetheless, or it will be lost as irq is
> > > already in ISR? Do we need to re-inject it in such a case?
> > 
> > In all cases, the event that was being vectored is re-injected.  Restarting from
> > scratch would be a bug.  E.g. if the cause of initial exception was "fixed", say
> > because the initial exception was #BP, and the guest finished patching out the INT3,
> > then restarting would execute the _new_ instruction, and the INT3 would be lost.
> > 
> 
> Cool, that is what I was concerned about, glad that it is already
> implemented :)
> 
> > 
> > As far as unprotect+retry being viable, I think we're on the same page.  What I'm
> > getting at is that I think KVM should never allow emulating on #PF when the #PF
> > occurred while vectoring.  E.g. this:
> > 
> >   static inline bool kvm_can_emulate_event_vectoring(int emul_type)
> >   {
> >         return !(emul_type & EMULTYPE_PF);
> >   }
> > 
> 
> Yeah, I agree. I'll post a V3 with suggested fixes (after running all of the
> selftests to be sure that it doesn't break anything).
> 
> > and then I believe this?  Where this diff can be a separate prep patch (though I'm
> > pretty sure it's technically pointless without the vectoring angle, because shadow
> > #PF can't coincide with any of the failure paths for kvm_check_emulate_insn()).
> > 
> 
> Looks good. If you don't mind, I could add this patch to the series with `Suggested-by`
> tag since it's neccessary to allow unprotect+retry in case of shadow #PF during
> vectoring.

Ya, go for it.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index eb413079b7c6..3de9702a9135 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2017,8 +2017,8 @@  u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
  *			VMware backdoor emulation handles select instructions
  *			and reinjects the #GP for all other cases.
  *
- * EMULTYPE_PF - Set when emulating MMIO by way of an intercepted #PF, in which
- *		 case the CR2/GPA value pass on the stack is valid.
+ * EMULTYPE_PF - Set when an intercepted #PF triggers the emulation, in which case
+ *		 the CR2/GPA value pass on the stack is valid.
  *
  * EMULTYPE_COMPLETE_USER_EXIT - Set when the emulator should update interruptibility
  *				 state and inject single-step #DBs after skipping
@@ -2053,6 +2053,12 @@  u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 #define EMULTYPE_COMPLETE_USER_EXIT (1 << 7)
 #define EMULTYPE_WRITE_PF_TO_SP	    (1 << 8)
 
+static inline bool kvm_is_emul_type_mmio(int emul_type)
+{
+	return (emul_type & EMULTYPE_PF) &&
+		!(emul_type & EMULTYPE_WRITE_PF_TO_SP);
+}
+
 int kvm_emulate_instruction(struct kvm_vcpu *vcpu, int emulation_type);
 int kvm_emulate_instruction_from_buffer(struct kvm_vcpu *vcpu,
 					void *insn, int insn_len);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f92740e7e107..a10f35d9704b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1693,6 +1693,8 @@  static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
 int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
 				  void *insn, int insn_len)
 {
+	bool is_vect;
+
 	/*
 	 * Emulation of instructions in SGX enclaves is impossible as RIP does
 	 * not point at the failing instruction, and even if it did, the code
@@ -1704,6 +1706,13 @@  int vmx_check_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
 		kvm_queue_exception(vcpu, UD_VECTOR);
 		return X86EMUL_PROPAGATE_FAULT;
 	}
+
+	is_vect = to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK;
+
+	/* Emulation is not possible when MMIO happens during event vectoring. */
+	if (kvm_is_emul_type_mmio(emul_type) && is_vect)
+		return X86EMUL_UNHANDLEABLE_VECTORING_IO;
+
 	return X86EMUL_CONTINUE;
 }
 
@@ -6452,7 +6461,6 @@  static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	union vmx_exit_reason exit_reason = vmx->exit_reason;
 	u32 vectoring_info = vmx->idt_vectoring_info;
 	u16 exit_handler_index;
-	gpa_t gpa;
 
 	/*
 	 * Flush logged GPAs PML buffer, this will make dirty_bitmap more
@@ -6537,24 +6545,15 @@  static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 		return 0;
 	}
 
-	/*
-	 * Note:
-	 * Do not try to fix EXIT_REASON_EPT_MISCONFIG if it caused by
-	 * delivery event since it indicates guest is accessing MMIO.
-	 * The vm-exit can be triggered again after return to guest that
-	 * will cause infinite loop.
-	 */
 	if ((vectoring_info & VECTORING_INFO_VALID_MASK) &&
 	    (exit_reason.basic != EXIT_REASON_EXCEPTION_NMI &&
 	     exit_reason.basic != EXIT_REASON_EPT_VIOLATION &&
 	     exit_reason.basic != EXIT_REASON_PML_FULL &&
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
 	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
-	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
-		gpa = exit_reason.basic == EXIT_REASON_EPT_MISCONFIG
-		      ? vmcs_read64(GUEST_PHYSICAL_ADDRESS) : INVALID_GPA;
-
-		kvm_prepare_event_vectoring_exit(vcpu, gpa);
+	     exit_reason.basic != EXIT_REASON_NOTIFY &&
+	     exit_reason.basic != EXIT_REASON_EPT_MISCONFIG)) {
+		kvm_prepare_event_vectoring_exit(vcpu, INVALID_GPA);
 		return 0;
 	}