diff mbox series

[Part2,RFC,v4,27/40] KVM: X86: Add kvm_x86_ops to get the max page level for the TDP

Message ID 20210707183616.5620-28-brijesh.singh@amd.com
State New
Headers show
Series [Part2,RFC,v4,01/40] KVM: SVM: Add support to handle AP reset MSR protocol | expand

Commit Message

Brijesh Singh July 7, 2021, 6:36 p.m. UTC
When running an SEV-SNP VM, the sPA used to index the RMP entry is
obtained through the TDP translation (gva->gpa->spa). The TDP page
level is checked against the page level programmed in the RMP entry.
If the page level does not match, then it will cause a nested page
fault with the RMP bit set to indicate the RMP violation.

To keep the TDP and RMP page level's in sync, the KVM fault handle
kvm_handle_page_fault() will call get_tdp_max_page_level() to get
the maximum allowed page level so that it can limit the TDP level.

In the case of SEV-SNP guest, the get_tdp_max_page_level() will consult
the RMP table to compute the maximum allowed page level for a given
GPA.

Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu/mmu.c          |  6 ++++--
 arch/x86/kvm/svm/sev.c          | 20 ++++++++++++++++++++
 arch/x86/kvm/svm/svm.c          |  1 +
 arch/x86/kvm/svm/svm.h          |  1 +
 arch/x86/kvm/vmx/vmx.c          |  8 ++++++++
 6 files changed, 35 insertions(+), 2 deletions(-)

Comments

Sean Christopherson July 16, 2021, 7:19 p.m. UTC | #1
On Wed, Jul 07, 2021, Brijesh Singh wrote:
> When running an SEV-SNP VM, the sPA used to index the RMP entry is
> obtained through the TDP translation (gva->gpa->spa). The TDP page
> level is checked against the page level programmed in the RMP entry.
> If the page level does not match, then it will cause a nested page
> fault with the RMP bit set to indicate the RMP violation.
> 
> To keep the TDP and RMP page level's in sync, the KVM fault handle
> kvm_handle_page_fault() will call get_tdp_max_page_level() to get
> the maximum allowed page level so that it can limit the TDP level.
> 
> In the case of SEV-SNP guest, the get_tdp_max_page_level() will consult
> the RMP table to compute the maximum allowed page level for a given
> GPA.
> 
> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu/mmu.c          |  6 ++++--
>  arch/x86/kvm/svm/sev.c          | 20 ++++++++++++++++++++
>  arch/x86/kvm/svm/svm.c          |  1 +
>  arch/x86/kvm/svm/svm.h          |  1 +
>  arch/x86/kvm/vmx/vmx.c          |  8 ++++++++
>  6 files changed, 35 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 188110ab2c02..cd2e19e1d323 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1384,6 +1384,7 @@ struct kvm_x86_ops {
>  
>  	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
>  	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
> +	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);

This is a poor name.  The constraint comes from the RMP, not TDP, and technically
speaking applies to all forms of paging.  It just happens to be relevant only to
TDP because NPT is required for SNP.  And KVM already incorporates the max TDP
level in kvm_configure_mmu().

Regarding the params, I'd much prefer to have this take "struct kvm *kvm" instead
of the vCPU.  It obviously doesn't change the functionality in any way, but I'd
like it to be clear to readers that the adjustment is tied to the VM, not the vCPU.

I think I'd also vote to drop @max_level and make this a pure constraint input as
opposed to an adjuster.

Another option would be to drop the kvm_x86_ops hooks entirely and call
snp_lookup_page_in_rmptable() directly from MMU code.  That would require tracking
that a VM is SNP-enabled in arch code, but I'm pretty sure info has already bled
into common KVM in one form or another.

>  };
>  
>  struct kvm_x86_nested_ops {
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 0144c40d09c7..7991ffae7b31 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3781,11 +3781,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
>  				u32 error_code, bool prefault)
>  {
> +	int max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, PG_LEVEL_2M);

This is completely bogus, nonpaging_page_fault() is used iff TDP is disabled.

> +
>  	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
>  
>  	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
>  	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
> -				 PG_LEVEL_2M, false);
> +				 max_level, false);
>  }
>  
>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
> @@ -3826,7 +3828,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>  {
>  	int max_level;
>  
> -	for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
> +	for (max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, KVM_MAX_HUGEPAGE_LEVEL);

This is unnecessary.  The max mapping level is computed by factoring in all
constraints, of which there are many.  In this case, KVM is consulting the guest's
MTRR configuration to avoid creating a page that spans different memtypes (because
the guest MTRRs are effectively represented in the TDP PTE).  SNP's RMP constraints
have no relevance to the MTRR constraint, or any other constraint for that matter.

TL;DR: the RMP constraint belong in kvm_mmu_max_mapping_level() and nowhere else.
I would go so far as to argue it belong in host_pfn_mapping_level(), after the
call to lookup_address_in_mm().

>  	     max_level > PG_LEVEL_4K;
>  	     max_level--) {
>  		int page_num = KVM_PAGES_PER_HPAGE(max_level);
> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
> index 3f8824c9a5dc..fd2d00ad80b7 100644
> --- a/arch/x86/kvm/svm/sev.c
> +++ b/arch/x86/kvm/svm/sev.c
> @@ -3206,3 +3206,23 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
>  
>  	return pfn_to_page(pfn);
>  }
> +
> +int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
> +{
> +	struct rmpentry *e;
> +	kvm_pfn_t pfn;
> +	int level;
> +
> +	if (!sev_snp_guest(vcpu->kvm))

I can't tell if this check is correct.  Per the APM:

  When SEV-SNP is enabled globally, the processor places restrictions on all memory
  accesses based on the contents of the RMP, whether the accesses are performed by
  the hypervisor, a legacy guest VM, a non-SNP guest VM or an SNP-active guest VM.
  The processor may perform one or more of the following checks depending on the
  context of the access:

  ...

  Page-Size: Checks that the following conditions are met:
    - If the nested page table indicates a 2MB or 1GB page size, the Page_Size field
      of the RMP entry of the target page is 1.
    - If the nested page table indicates a 4KB page size, the Page_Size field of the
      RMP entry of the target page is 0.

The Page-Size bullet does not have any qualifiers about the NPT checks applying
only to SNP guests.  The Hypervisor-Owned bullet implies that unassigned pages
do not need to have identical sizes, but it's not clear whether or not so called
"Hypervisor-Owned" pages override the nested page tables.

Table 15.36 is similarly vague:

  Assigned Flag indicating that the system physical page is assigned to a guest
  or to the AMD-SP.
    0: Owned by the hypervisor
    1: Owned by a guest or the AMD-SP

My assumption is that all of the "guest owned" stuff really means "SNP guest owned",
e.g. section 15.36.5 says "The hypervisor manages the SEV-SNP security attributes of
pages assigned to SNP-active guests by altering the RMP entries of those pages", but
that's not at all clear throughout most of the RMP documentation.

Regardless of the actual behavior, the APM needs serious cleanup on the aforementioned
sections.  E.g. as written, the "processor may perform one or more of the following
checks depending on the context of the access" verbiage basically gives the CPU carte
blanche to do whatever the hell it wants.

> +		return max_level;
> +
> +	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
> +	if (is_error_noslot_pfn(pfn))
> +		return max_level;
> +
> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);

Assuming pfn is backed by struct page is broken, at least given the existing
call sites..  It might hold true that only struct page pfns are covered by the
RMP, but assuming pfn_to_page() will return a valid pointer here is completely
wrong.  Unless I'm missing something, taking a struct page anywhere in the RMP
helpers is at best sketchy and at worst broken in and of itself.  IMO, the RMP
code should always take a raw PFN and do the necessary checks before assuming
anything about the PFN.  At a glance, the only case that needs additional checks
is the page_to_virt() logic in rmpupdate().

> +	if (unlikely(!e))
> +		return max_level;
> +
> +	return min_t(uint32_t, level, max_level);

As the APM is currently worded, this is wrong, and the whole "tdp_max_page_level"
name is wrong.  As noted above, the Page-Size bullet points states that 2mb/1gb
pages in the NPT _must_ have RMP.page_size=1, and 4kb pages in the NPT _must_
have RMP.page_size=0.  That means that the RMP adjustment is not a constraint,
it's an exact requirement.  Specifically, if the RMP is a 2mb page then KVM must
install a 2mb (or 1gb) page.  Maybe it works because KVM will PSMASH the RMP
after installing a bogus 4kb NPT and taking an RMP violation, but that's a very
convoluted and sub-optimal solution.

That other obvious bug is that this doesn't play nice with 1gb pages.  A 2mb RMP
entry should _not_ force KVM to use a 2mb page instead of a 1gb page.

> +}
Brijesh Singh July 16, 2021, 8:41 p.m. UTC | #2
On 7/16/21 2:19 PM, Sean Christopherson wrote:
> On Wed, Jul 07, 2021, Brijesh Singh wrote:
>> When running an SEV-SNP VM, the sPA used to index the RMP entry is
>> obtained through the TDP translation (gva->gpa->spa). The TDP page
>> level is checked against the page level programmed in the RMP entry.
>> If the page level does not match, then it will cause a nested page
>> fault with the RMP bit set to indicate the RMP violation.
>>
>> To keep the TDP and RMP page level's in sync, the KVM fault handle
>> kvm_handle_page_fault() will call get_tdp_max_page_level() to get
>> the maximum allowed page level so that it can limit the TDP level.
>>
>> In the case of SEV-SNP guest, the get_tdp_max_page_level() will consult
>> the RMP table to compute the maximum allowed page level for a given
>> GPA.
>>
>> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |  1 +
>>  arch/x86/kvm/mmu/mmu.c          |  6 ++++--
>>  arch/x86/kvm/svm/sev.c          | 20 ++++++++++++++++++++
>>  arch/x86/kvm/svm/svm.c          |  1 +
>>  arch/x86/kvm/svm/svm.h          |  1 +
>>  arch/x86/kvm/vmx/vmx.c          |  8 ++++++++
>>  6 files changed, 35 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 188110ab2c02..cd2e19e1d323 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1384,6 +1384,7 @@ struct kvm_x86_ops {
>>  
>>  	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
>>  	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
>> +	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
> This is a poor name.  The constraint comes from the RMP, not TDP, and technically
> speaking applies to all forms of paging.  It just happens to be relevant only to
> TDP because NPT is required for SNP.  And KVM already incorporates the max TDP
> level in kvm_configure_mmu().

Noted.


>
> Regarding the params, I'd much prefer to have this take "struct kvm *kvm" instead
> of the vCPU.  It obviously doesn't change the functionality in any way, but I'd
> like it to be clear to readers that the adjustment is tied to the VM, not the vCPU.

Noted.


> I think I'd also vote to drop @max_level and make this a pure constraint input as
> opposed to an adjuster.


Noted.

> Another option would be to drop the kvm_x86_ops hooks entirely and call
> snp_lookup_page_in_rmptable() directly from MMU code.  That would require tracking
> that a VM is SNP-enabled in arch code, but I'm pretty sure info has already bled
> into common KVM in one form or another.

I would prefer this as it eliminates some of the other unnecessary call
sites. Unfortunately, currently there is no generic way to know if its
an SEV guest (outside the svm/*).  So far there was no need as such but
with SNP having such information would help. Should we extend the
'struct kvm' to include a new field that can be used to determine the
guest type. Something like

enum {

   GUEST_TYPE_SEV,

   GUEST_TYPE_SEV_ES,

   GUEST_TYPE_SEV_SNP,

};

struct kvm {

   ...

  u64 enc_type;

};

bool kvm_guest_enc_type(struct kvm *kvm, enum type); {

    return !!kvm->enc_type & type;

}

The mmu.c can then call kvm_guest_enc_type() to check if its SNP guest
and use the SNP lookup directly to determine the pagesize.


>
>>  };
>>  
>>  struct kvm_x86_nested_ops {
>> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>> index 0144c40d09c7..7991ffae7b31 100644
>> --- a/arch/x86/kvm/mmu/mmu.c
>> +++ b/arch/x86/kvm/mmu/mmu.c
>> @@ -3781,11 +3781,13 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
>>  				u32 error_code, bool prefault)
>>  {
>> +	int max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, PG_LEVEL_2M);
> This is completely bogus, nonpaging_page_fault() is used iff TDP is disabled.

Ah, I totally missed it.


>> +
>>  	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
>>  
>>  	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
>>  	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
>> -				 PG_LEVEL_2M, false);
>> +				 max_level, false);
>>  }
>>  
>>  int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
>> @@ -3826,7 +3828,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>>  {
>>  	int max_level;
>>  
>> -	for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
>> +	for (max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, KVM_MAX_HUGEPAGE_LEVEL);
> This is unnecessary.  The max mapping level is computed by factoring in all
> constraints, of which there are many.  In this case, KVM is consulting the guest's
> MTRR configuration to avoid creating a page that spans different memtypes (because
> the guest MTRRs are effectively represented in the TDP PTE).  SNP's RMP constraints
> have no relevance to the MTRR constraint, or any other constraint for that matter.
>
> TL;DR: the RMP constraint belong in kvm_mmu_max_mapping_level() and nowhere else.
> I would go so far as to argue it belong in host_pfn_mapping_level(), after the
> call to lookup_address_in_mm().


I agree with you; One of the case which I was trying to cover is what if
we do a pre-fault and while generating the prefault we can tell the
handler our max page level; The example is: "Guest issues a page state
transition request to add the page as 2mb". We execute the below steps
to fulfill the request

* create a prefault with a max_level set to 2mb.

* the fault handler may find that it cannot use the large page in the
npt, and it may default to 4k

* read the page-size from the npt;  use the npt pagesize in the rmptable
instead of the guest requested page-size.

We keep the NPT and RMP in sync after the page state change is completed
and avoid any extra RMP fault due to the size mismatch etc.


>>  	     max_level > PG_LEVEL_4K;
>>  	     max_level--) {
>>  		int page_num = KVM_PAGES_PER_HPAGE(max_level);
>> diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
>> index 3f8824c9a5dc..fd2d00ad80b7 100644
>> --- a/arch/x86/kvm/svm/sev.c
>> +++ b/arch/x86/kvm/svm/sev.c
>> @@ -3206,3 +3206,23 @@ struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
>>  
>>  	return pfn_to_page(pfn);
>>  }
>> +
>> +int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
>> +{
>> +	struct rmpentry *e;
>> +	kvm_pfn_t pfn;
>> +	int level;
>> +
>> +	if (!sev_snp_guest(vcpu->kvm))
> I can't tell if this check is correct.  Per the APM:
>
>   When SEV-SNP is enabled globally, the processor places restrictions on all memory
>   accesses based on the contents of the RMP, whether the accesses are performed by
>   the hypervisor, a legacy guest VM, a non-SNP guest VM or an SNP-active guest VM.
>   The processor may perform one or more of the following checks depending on the
>   context of the access:
>
>   ...
>
>   Page-Size: Checks that the following conditions are met:
>     - If the nested page table indicates a 2MB or 1GB page size, the Page_Size field
>       of the RMP entry of the target page is 1.
>     - If the nested page table indicates a 4KB page size, the Page_Size field of the
>       RMP entry of the target page is 0.
>
> The Page-Size bullet does not have any qualifiers about the NPT checks applying
> only to SNP guests.  The Hypervisor-Owned bullet implies that unassigned pages
> do not need to have identical sizes, but it's not clear whether or not so called
> "Hypervisor-Owned" pages override the nested page tables.
>
> Table 15.36 is similarly vague:
>
>   Assigned Flag indicating that the system physical page is assigned to a guest
>   or to the AMD-SP.
>     0: Owned by the hypervisor
>     1: Owned by a guest or the AMD-SP
>
> My assumption is that all of the "guest owned" stuff really means "SNP guest owned",
> e.g. section 15.36.5 says "The hypervisor manages the SEV-SNP security attributes of
> pages assigned to SNP-active guests by altering the RMP entries of those pages", but
> that's not at all clear throughout most of the RMP documentation.
>
> Regardless of the actual behavior, the APM needs serious cleanup on the aforementioned
> sections.  E.g. as written, the "processor may perform one or more of the following
> checks depending on the context of the access" verbiage basically gives the CPU carte
> blanche to do whatever the hell it wants.

I'll raise your concern to the documentation folks so that they clarify
that the page-size check is applicable to the SNP active guests only.


>> +		return max_level;
>> +
>> +	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
>> +	if (is_error_noslot_pfn(pfn))
>> +		return max_level;
>> +
>> +	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
> Assuming pfn is backed by struct page is broken, at least given the existing
> call sites..  It might hold true that only struct page pfns are covered by the
> RMP, but assuming pfn_to_page() will return a valid pointer here is completely
> wrong.  Unless I'm missing something, taking a struct page anywhere in the RMP
> helpers is at best sketchy and at worst broken in and of itself.  IMO, the RMP
> code should always take a raw PFN and do the necessary checks before assuming
> anything about the PFN.  At a glance, the only case that needs additional checks
> is the page_to_virt() logic in rmpupdate().

I agree. Dave also hinted the similar feedback. In next version of the
patch I will stick to use the pfn and then SNP lookup with do the
required checking.


>> +	if (unlikely(!e))
>> +		return max_level;
>> +
>> +	return min_t(uint32_t, level, max_level);
> As the APM is currently worded, this is wrong, and the whole "tdp_max_page_level"
> name is wrong.  As noted above, the Page-Size bullet points states that 2mb/1gb
> pages in the NPT _must_ have RMP.page_size=1, and 4kb pages in the NPT _must_
> have RMP.page_size=0.  That means that the RMP adjustment is not a constraint,
> it's an exact requirement.  Specifically, if the RMP is a 2mb page then KVM must
> install a 2mb (or 1gb) page.  Maybe it works because KVM will PSMASH the RMP
> after installing a bogus 4kb NPT and taking an RMP violation, but that's a very
> convoluted and sub-optimal solution.

This is why I was passing the preferred max_level in the pre-fault
handle then later query the npt level; use the npt level in the RMP to
make sure they are in sync.

There is yet another reason why we can't avoid the PSMASH after doing
everything to ensure that NPT and RMP are in sync. e.g if NPT and RMP
are programmed with 2mb size but the guest tries to PVALIDATE the page
as a 4k. In that case, we will see #NPF with page size mismatch and have
to perform psmash.


>
> That other obvious bug is that this doesn't play nice with 1gb pages.  A 2mb RMP
> entry should _not_ force KVM to use a 2mb page instead of a 1gb page.
>
>> +}
Sean Christopherson July 20, 2021, 7:38 p.m. UTC | #3
On Fri, Jul 16, 2021, Brijesh Singh wrote:
> On 7/16/21 2:19 PM, Sean Christopherson wrote:
> > On Wed, Jul 07, 2021, Brijesh Singh wrote:
> > Another option would be to drop the kvm_x86_ops hooks entirely and call
> > snp_lookup_page_in_rmptable() directly from MMU code.  That would require tracking
> > that a VM is SNP-enabled in arch code, but I'm pretty sure info has already bled
> > into common KVM in one form or another.
> 
> I would prefer this as it eliminates some of the other unnecessary call
> sites. Unfortunately, currently there is no generic way to know if its
> an SEV guest (outside the svm/*).  So far there was no need as such but
> with SNP having such information would help. Should we extend the
> 'struct kvm' to include a new field that can be used to determine the
> guest type. Something like
> 
> enum {
> 
>    GUEST_TYPE_SEV,
> 
>    GUEST_TYPE_SEV_ES,
> 
>    GUEST_TYPE_SEV_SNP,
> 
> };
> 
> struct kvm {
> 
>    ...
> 
>   u64 enc_type;
> 
> };
> 
> bool kvm_guest_enc_type(struct kvm *kvm, enum type); {
> 
>     return !!kvm->enc_type & type;
> 
> }
> 
> The mmu.c can then call kvm_guest_enc_type() to check if its SNP guest
> and use the SNP lookup directly to determine the pagesize.

The other option is to use vm_type, which TDX is already planning on leveraging.
Paolo raised the question of whether or not the TDX type could be reused for SNP.
We should definitely sort that out before merging either series.  I'm personally
in favor of separating TDX and SNP, it seems inevitable that common code will
want to differentiate between the two.

https://lkml.kernel.org/r/8eb87cd52a89d957af03f93a9ece5634426a7757.1625186503.git.isaku.yamahata@intel.com

> > As the APM is currently worded, this is wrong, and the whole "tdp_max_page_level"
> > name is wrong.  As noted above, the Page-Size bullet points states that 2mb/1gb
> > pages in the NPT _must_ have RMP.page_size=1, and 4kb pages in the NPT _must_
> > have RMP.page_size=0.  That means that the RMP adjustment is not a constraint,
> > it's an exact requirement.  Specifically, if the RMP is a 2mb page then KVM must
> > install a 2mb (or 1gb) page.  Maybe it works because KVM will PSMASH the RMP
> > after installing a bogus 4kb NPT and taking an RMP violation, but that's a very
> > convoluted and sub-optimal solution.
> 
> This is why I was passing the preferred max_level in the pre-fault
> handle then later query the npt level; use the npt level in the RMP to
> make sure they are in sync.
> 
> There is yet another reason why we can't avoid the PSMASH after doing
> everything to ensure that NPT and RMP are in sync. e.g if NPT and RMP
> are programmed with 2mb size but the guest tries to PVALIDATE the page
> as a 4k. In that case, we will see #NPF with page size mismatch and have
> to perform psmash.

Boo, there's no way to communicate to the guest that it's doing PVALIDATE wrong
is there?
Brijesh Singh July 20, 2021, 8:06 p.m. UTC | #4
On 7/20/21 2:38 PM, Sean Christopherson wrote:
...

> 

> The other option is to use vm_type, which TDX is already planning on leveraging.

> Paolo raised the question of whether or not the TDX type could be reused for SNP.

> We should definitely sort that out before merging either series.  I'm personally

> in favor of separating TDX and SNP, it seems inevitable that common code will

> want to differentiate between the two.


Yes, I did saw that and it seems much better.

> 

> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F8eb87cd52a89d957af03f93a9ece5634426a7757.1625186503.git.isaku.yamahata%40intel.com&amp;data=04%7C01%7Cbrijesh.singh%40amd.com%7Cb658fcf339234fd9030d08d94bb5edf1%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637624067039647374%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=tdVALNsQGer0Z69%2FyYaRqWYZvH27k%2BmgHdslQlJ7qlU%3D&amp;reserved=0

> 


...

>>

>> There is yet another reason why we can't avoid the PSMASH after doing

>> everything to ensure that NPT and RMP are in sync. e.g if NPT and RMP

>> are programmed with 2mb size but the guest tries to PVALIDATE the page

>> as a 4k. In that case, we will see #NPF with page size mismatch and have

>> to perform psmash.

> 

> Boo, there's no way to communicate to the guest that it's doing PVALIDATE wrong

> is there?

> 


if the guest chooses smaller page-size then we don't have any means to 
notify the guest; the hardware will cause an #NPF and its up to the 
hypervisor to resolve the fault.

However, if the guest attempts to validate with the larger page-level 
(e.g guest using 2mb and RMP entry was 4k) then PVALIDATE will return 
SIZEMISMATCH error to the guest.

thanks
diff mbox series

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 188110ab2c02..cd2e19e1d323 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1384,6 +1384,7 @@  struct kvm_x86_ops {
 
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
 	void *(*alloc_apic_backing_page)(struct kvm_vcpu *vcpu);
+	int (*get_tdp_max_page_level)(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0144c40d09c7..7991ffae7b31 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3781,11 +3781,13 @@  static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 static int nonpaging_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa,
 				u32 error_code, bool prefault)
 {
+	int max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, PG_LEVEL_2M);
+
 	pgprintk("%s: gva %lx error %x\n", __func__, gpa, error_code);
 
 	/* This path builds a PAE pagetable, we can map 2mb pages at maximum. */
 	return direct_page_fault(vcpu, gpa & PAGE_MASK, error_code, prefault,
-				 PG_LEVEL_2M, false);
+				 max_level, false);
 }
 
 int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
@@ -3826,7 +3828,7 @@  int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 {
 	int max_level;
 
-	for (max_level = KVM_MAX_HUGEPAGE_LEVEL;
+	for (max_level = kvm_x86_ops.get_tdp_max_page_level(vcpu, gpa, KVM_MAX_HUGEPAGE_LEVEL);
 	     max_level > PG_LEVEL_4K;
 	     max_level--) {
 		int page_num = KVM_PAGES_PER_HPAGE(max_level);
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c
index 3f8824c9a5dc..fd2d00ad80b7 100644
--- a/arch/x86/kvm/svm/sev.c
+++ b/arch/x86/kvm/svm/sev.c
@@ -3206,3 +3206,23 @@  struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu)
 
 	return pfn_to_page(pfn);
 }
+
+int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
+{
+	struct rmpentry *e;
+	kvm_pfn_t pfn;
+	int level;
+
+	if (!sev_snp_guest(vcpu->kvm))
+		return max_level;
+
+	pfn = gfn_to_pfn(vcpu->kvm, gpa_to_gfn(gpa));
+	if (is_error_noslot_pfn(pfn))
+		return max_level;
+
+	e = snp_lookup_page_in_rmptable(pfn_to_page(pfn), &level);
+	if (unlikely(!e))
+		return max_level;
+
+	return min_t(uint32_t, level, max_level);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a7adf6ca1713..2632eae52aa3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4576,6 +4576,7 @@  static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
 
 	.alloc_apic_backing_page = svm_alloc_apic_backing_page,
+	.get_tdp_max_page_level = sev_get_tdp_max_page_level,
 };
 
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index bc5582b44356..32abcbd774d0 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -568,6 +568,7 @@  void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
 void sev_es_prepare_guest_switch(struct vcpu_svm *svm, unsigned int cpu);
 void sev_es_unmap_ghcb(struct vcpu_svm *svm);
 struct page *snp_safe_alloc_page(struct kvm_vcpu *vcpu);
+int sev_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level);
 
 /* vmenter.S */
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4bceb5ca3a89..fbc9034edf16 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7612,6 +7612,12 @@  static bool vmx_check_apicv_inhibit_reasons(ulong bit)
 	return supported & BIT(bit);
 }
 
+
+static int vmx_get_tdp_max_page_level(struct kvm_vcpu *vcpu, gpa_t gpa, int max_level)
+{
+	return max_level;
+}
+
 static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.hardware_unsetup = hardware_unsetup,
 
@@ -7742,6 +7748,8 @@  static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+	.get_tdp_max_page_level = vmx_get_tdp_max_page_level,
 };
 
 static __init void vmx_setup_user_return_msrs(void)