[v3,01/26] kvm: Merge kvm_check_extension() and kvm_vm_check_extension()

Message ID 20241125195626.856992-3-jean-philippe@linaro.org
State New
Series arm: Run Arm CCA VMs with KVM

Commit Message

Jean-Philippe Brucker Nov. 25, 2024, 7:56 p.m. UTC
The KVM_CHECK_EXTENSION ioctl can be issued either on the global fd
(/dev/kvm), or on the VM fd obtained with KVM_CREATE_VM. For most
extensions, KVM returns the same value with either method, but for some
of them it can refine the returned value depending on the VM type. The
KVM documentation [1] advises using the VM fd:

  Based on their initialization different VMs may have different
  capabilities. It is thus encouraged to use the vm ioctl to query for
  capabilities (available with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)

Ongoing work on Arm confidential VMs confirms this, as some capabilities
become unavailable to confidential VMs, requiring changes in QEMU to use
kvm_vm_check_extension() instead of kvm_check_extension() [2]. Rather
than changing each check one by one, change kvm_check_extension() to
always issue the ioctl on the VM fd when available, and remove
kvm_vm_check_extension().

Fall back to the global fd when the VM check is unavailable:

* Ancient kernels do not support KVM_CHECK_EXTENSION on the VM fd, since
  it was added by commit 92b591a4c46b ("KVM: Allow KVM_CHECK_EXTENSION
  on the vm fd") in Linux 3.17 [3]. Support for Linux 3.16 ended in June
  2020, but there may still be old images around.

* A couple of calls must be issued before the VM fd is available, since
  they determine the VM type: KVM_CAP_MIPS_VZ and KVM_CAP_ARM_VM_IPA_SIZE

Does any user actually depend on the check being done on the global fd
instead of the VM fd?  I surveyed all cases where KVM presently returns
different values depending on the query method. Luckily QEMU already
calls kvm_vm_check_extension() for most of those. Only three of them are
ambiguous, because they are currently done on the global fd:

* KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_VCPU_ID on Arm change value if the
  user requests a vGIC different from the default. But QEMU queries them
  before vGIC configuration, so the reported value will be the same.

* KVM_CAP_SW_TLB on PPC. When issued on the global fd, returns false if
  the kvm-hv module is loaded; when issued on the VM fd, returns false
  only if the VM type is HV instead of PR. If this returns false, then
  QEMU will fail to initialize a BOOKE206 MMU model.

  So this patch should improve things, as it allows running this
  type of vCPU even when both KVM modules are loaded.

* KVM_CAP_PPC_SECURE_GUEST. Similarly, doing this check on a VM fd
  refines the returned value, and ensures that SVM is actually
  supported. Since QEMU follows the check with kvm_vm_enable_cap(), this
  patch should only provide better error reporting.
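
The KVM_CAP_SW_TLB case can be illustrated with a toy model of the two
answers described above. The enum and function names are hypothetical
and only encode the behaviour stated in this commit message, not the
kernel's actual implementation.

```c
#include <stdbool.h>

enum mock_ppc_vm_type { MOCK_PPC_VM_PR, MOCK_PPC_VM_HV };

/* Global fd: reported as unavailable whenever the kvm-hv module is loaded. */
static bool mock_sw_tlb_global(bool kvm_hv_loaded)
{
    return !kvm_hv_loaded;
}

/* VM fd: reported as unavailable only when this VM is of type HV. */
static bool mock_sw_tlb_vm(enum mock_ppc_vm_type type)
{
    return type != MOCK_PPC_VM_HV;
}
```

With kvm-hv loaded and a PR VM, the global query in this model answers
false while the VM query answers true, which is the difference the
BOOKE206 case depends on.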

[1] https://www.kernel.org/doc/html/latest/virt/kvm/api.html#kvm-check-extension
[2] https://lore.kernel.org/kvm/875ybi0ytc.fsf@redhat.com/
[3] https://github.com/torvalds/linux/commit/92b591a4c46b

Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Daniel Henrique Barboza <danielhb413@gmail.com>
Cc: qemu-ppc@nongnu.org
Suggested-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
---
 include/sysemu/kvm.h     |  2 --
 include/sysemu/kvm_int.h |  1 +
 accel/kvm/kvm-all.c      | 41 +++++++++++++++++++---------------------
 target/arm/kvm.c         |  2 +-
 target/i386/kvm/kvm.c    |  6 +++---
 target/ppc/kvm.c         | 36 +++++++++++++++++------------------
 6 files changed, 42 insertions(+), 46 deletions(-)

Comments

Daniel P. Berrangé Nov. 26, 2024, 12:29 p.m. UTC | #1
On Mon, Nov 25, 2024 at 07:56:00PM +0000, Jean-Philippe Brucker wrote:
> The KVM_CHECK_EXTENSION ioctl can be issued either on the global fd
> (/dev/kvm), or on the VM fd obtained with KVM_CREATE_VM. For most
> extensions, KVM returns the same value with either method, but for some
> of them it can refine the returned value depending on the VM type. The
> KVM documentation [1] advises using the VM fd:
> 
>   Based on their initialization different VMs may have different
>   capabilities. It is thus encouraged to use the vm ioctl to query for
>   capabilities (available with KVM_CAP_CHECK_EXTENSION_VM on the vm fd)
> 
> Ongoing work on Arm confidential VMs confirms this, as some capabilities
> become unavailable to confidential VMs, requiring changes in QEMU to use
> kvm_vm_check_extension() instead of kvm_check_extension() [2]. Rather
> than changing each check one by one, change kvm_check_extension() to
> always issue the ioctl on the VM fd when available, and remove
> kvm_vm_check_extension().

The downside I see of this approach is that it can potentially
mask mistakes / unexpected behaviour.

E.g., consider you are in a code path where you /think/ the VM fd
is available, but for some unexpected reason it is NOT in fact
available. The code silently falls back to the global fd, thus
giving a potentially incorrect extension check answer.

Having separate check methods with no fallback ensures that we
are checking exactly what we /intend/ to be checking, or will
see an error.
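
To make the concern concrete, here is a toy sketch contrasting a silent
fallback with a strict VM-only check. All names are hypothetical and the
answers are simulated; this is not QEMU code, and the choice of -EBADF
as the error is an assumption made for illustration.

```c
#include <errno.h>

struct mock_state {
    int vmfd;           /* -1 while no VM fd is available */
    int global_answer;  /* simulated answer on the global fd */
    int vm_answer;      /* simulated answer on the VM fd */
};

/* Merged helper: quietly uses the global fd when the VM fd is missing. */
static int mock_check_with_fallback(const struct mock_state *s)
{
    return s->vmfd >= 0 ? s->vm_answer : s->global_answer;
}

/* Strict helper: a missing VM fd is an error the caller has to notice. */
static int mock_vm_check_strict(const struct mock_state *s)
{
    if (s->vmfd < 0) {
        return -EBADF;
    }
    return s->vm_answer;
}
```

When vmfd is unexpectedly -1, the first helper returns the global answer
as if nothing were wrong, while the second returns a negative errno.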

With regards,
Daniel

Jean-Philippe Brucker Dec. 4, 2024, 7:07 p.m. UTC | #2
On Tue, Nov 26, 2024 at 12:29:35PM +0000, Daniel P. Berrangé wrote:
> The downside I see of this approach is that it can potentially
> mask mistakes / unexpected behaviour.
> 
> eg, consider you are in a code path where you /think/ the VM fd
> is available, but for some unexpected reason it is NOT in fact
> available. The code silently falls back to the global FD, thus
> giving a potentially incorrect extension check answer.
> 
> Having separate check methods with no fallback ensures that we
> are checking exactly what we /intend/ to be checking, or will
> see an error

Yes, I see your point, and I'm happy to drop this patch since I'm less
familiar with the other archs.

The alternative is replacing kvm_check_extension() with
kvm_vm_check_extension() wherever the Arm ioctl handler behaves
differently depending on the VM type. Simple enough, though it does affect
kvm-all.c too:

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..a56b943f31 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2410,13 +2410,13 @@ static int kvm_recommended_vcpus(KVMState *s)
 
 static int kvm_max_vcpus(KVMState *s)
 {
-    int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPUS);
+    int ret = kvm_vm_check_extension(s, KVM_CAP_MAX_VCPUS);
     return (ret) ? ret : kvm_recommended_vcpus(s);
 }
 
 static int kvm_max_vcpu_id(KVMState *s)
 {
-    int ret = kvm_check_extension(s, KVM_CAP_MAX_VCPU_ID);
+    int ret = kvm_vm_check_extension(s, KVM_CAP_MAX_VCPU_ID);
     return (ret) ? ret : kvm_max_vcpus(s);
 }
 
@@ -2693,7 +2693,7 @@ static int kvm_init(MachineState *ms)
 
 #ifdef TARGET_KVM_HAVE_GUEST_DEBUG
     kvm_has_guest_debug =
-        (kvm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG) > 0);
+        (kvm_vm_check_extension(s, KVM_CAP_SET_GUEST_DEBUG) > 0);
 #endif
 
     kvm_sstep_flags = 0;
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 7b6812c0de..609c6d4e7a 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -618,11 +618,11 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
-    max_hw_wps = kvm_check_extension(s, KVM_CAP_GUEST_DEBUG_HW_WPS);
+    max_hw_wps = kvm_vm_check_extension(s, KVM_CAP_GUEST_DEBUG_HW_WPS);
     hw_watchpoints = g_array_sized_new(true, true,
                                        sizeof(HWWatchpoint), max_hw_wps);
 
-    max_hw_bps = kvm_check_extension(s, KVM_CAP_GUEST_DEBUG_HW_BPS);
+    max_hw_bps = kvm_vm_check_extension(s, KVM_CAP_GUEST_DEBUG_HW_BPS);
     hw_breakpoints = g_array_sized_new(true, true,
                                        sizeof(HWBreakpoint), max_hw_bps);
 
@@ -1764,7 +1764,7 @@ void kvm_arm_pvtime_init(ARMCPU *cpu, uint64_t ipa)
 
 void kvm_arm_steal_time_finalize(ARMCPU *cpu, Error **errp)
 {
-    bool has_steal_time = kvm_check_extension(kvm_state, KVM_CAP_STEAL_TIME);
+    bool has_steal_time = kvm_vm_check_extension(kvm_state, KVM_CAP_STEAL_TIME);
 
     if (cpu->kvm_steal_time == ON_OFF_AUTO_AUTO) {
         if (!has_steal_time || !arm_feature(&cpu->env, ARM_FEATURE_AARCH64)) {
@@ -1799,7 +1799,7 @@ bool kvm_arm_aarch32_supported(void)
 
 bool kvm_arm_sve_supported(void)
 {
-    return kvm_check_extension(kvm_state, KVM_CAP_ARM_SVE);
+    return kvm_vm_check_extension(kvm_state, KVM_CAP_ARM_SVE);
 }
 
 bool kvm_arm_mte_supported(void)
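
For reference, the existing kvm_vm_check_extension() that this
alternative relies on tries the VM fd first and uses the global query
only when the VM ioctl returns an error. A toy model with simulated
return values (struct mock_results and both function names are
hypothetical):

```c
/* Hypothetical container for simulated ioctl results. */
struct mock_results {
    int vm_ret;      /* KVM_CHECK_EXTENSION on the VM fd, <0 on error */
    int global_ret;  /* KVM_CHECK_EXTENSION on the global fd, <0 on error */
};

/* Models kvm_check_extension(): global fd, errors reported as 0. */
static int mock_global_check(const struct mock_results *r)
{
    return r->global_ret < 0 ? 0 : r->global_ret;
}

/* Models kvm_vm_check_extension(): VM fd first, global fd only on error. */
static int mock_vm_check(const struct mock_results *r)
{
    int ret = r->vm_ret;

    if (ret < 0) {
        /* VM-wide query not implemented: use the global one instead */
        ret = mock_global_check(r);
    }
    return ret;
}
```

A VM answer of 0 is returned as-is in this model; only a negative
return triggers the global query.
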
Patch

diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index c3a60b2890..63c96d0096 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -437,8 +437,6 @@  bool kvm_arch_stop_on_emulation_error(CPUState *cpu);
 
 int kvm_check_extension(KVMState *s, unsigned int extension);
 
-int kvm_vm_check_extension(KVMState *s, unsigned int extension);
-
 #define kvm_vm_enable_cap(s, capability, cap_flags, ...)             \
     ({                                                               \
         struct kvm_enable_cap cap = {                                \
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index a1e72763da..cb38085d54 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -166,6 +166,7 @@  struct KVMState
     uint16_t xen_gnttab_max_frames;
     uint16_t xen_evtchn_max_pirq;
     char *device;
+    bool check_extension_vm;
 };
 
 void kvm_memory_listener_register(KVMState *s, KVMMemoryListener *kml,
diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 801cff16a5..7ea016d598 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1238,7 +1238,11 @@  int kvm_check_extension(KVMState *s, unsigned int extension)
 {
     int ret;
 
-    ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+    if (!s->check_extension_vm) {
+        ret = kvm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+    } else {
+        ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
+    }
     if (ret < 0) {
         ret = 0;
     }
@@ -1246,19 +1250,6 @@  int kvm_check_extension(KVMState *s, unsigned int extension)
     return ret;
 }
 
-int kvm_vm_check_extension(KVMState *s, unsigned int extension)
-{
-    int ret;
-
-    ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, extension);
-    if (ret < 0) {
-        /* VM wide version not implemented, use global one instead */
-        ret = kvm_check_extension(s, extension);
-    }
-
-    return ret;
-}
-
 /*
  * We track the poisoned pages to be able to:
  * - replace them on VM reset
@@ -1622,10 +1613,10 @@  static int kvm_dirty_ring_init(KVMState *s)
      * Read the max supported pages. Fall back to dirty logging mode
      * if the dirty ring isn't supported.
      */
-    ret = kvm_vm_check_extension(s, capability);
+    ret = kvm_check_extension(s, capability);
     if (ret <= 0) {
         capability = KVM_CAP_DIRTY_LOG_RING_ACQ_REL;
-        ret = kvm_vm_check_extension(s, capability);
+        ret = kvm_check_extension(s, capability);
     }
 
     if (ret <= 0) {
@@ -1648,7 +1639,7 @@  static int kvm_dirty_ring_init(KVMState *s)
     }
 
     /* Enable the backup bitmap if it is supported */
-    ret = kvm_vm_check_extension(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP);
+    ret = kvm_check_extension(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP);
     if (ret > 0) {
         ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP, 0);
         if (ret) {
@@ -2404,7 +2395,7 @@  static void kvm_irqchip_create(KVMState *s)
  */
 static int kvm_recommended_vcpus(KVMState *s)
 {
-    int ret = kvm_vm_check_extension(s, KVM_CAP_NR_VCPUS);
+    int ret = kvm_check_extension(s, KVM_CAP_NR_VCPUS);
     return (ret) ? ret : 4;
 }
 
@@ -2625,7 +2616,12 @@  static int kvm_init(MachineState *ms)
 
     s->vmfd = ret;
 
-    s->nr_as = kvm_vm_check_extension(s, KVM_CAP_MULTI_ADDRESS_SPACE);
+    ret = kvm_vm_ioctl(s, KVM_CHECK_EXTENSION, KVM_CAP_CHECK_EXTENSION_VM);
+    if (ret > 0) {
+        s->check_extension_vm = true;
+    }
+
+    s->nr_as = kvm_check_extension(s, KVM_CAP_MULTI_ADDRESS_SPACE);
     if (s->nr_as <= 1) {
         s->nr_as = 1;
     }
@@ -2683,7 +2679,7 @@  static int kvm_init(MachineState *ms)
     }
 
     kvm_readonly_mem_allowed =
-        (kvm_vm_check_extension(s, KVM_CAP_READONLY_MEM) > 0);
+        (kvm_check_extension(s, KVM_CAP_READONLY_MEM) > 0);
 
     kvm_resamplefds_allowed =
         (kvm_check_extension(s, KVM_CAP_IRQFD_RESAMPLE) > 0);
@@ -2717,7 +2713,8 @@  static int kvm_init(MachineState *ms)
         goto err;
     }
 
-    kvm_supported_memory_attributes = kvm_vm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
+    kvm_supported_memory_attributes =
+        kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
     kvm_guest_memfd_supported =
         kvm_check_extension(s, KVM_CAP_GUEST_MEMFD) &&
         kvm_check_extension(s, KVM_CAP_USER_MEMORY2) &&
@@ -2743,7 +2740,7 @@  static int kvm_init(MachineState *ms)
     memory_listener_register(&kvm_io_listener,
                              &address_space_io);
 
-    s->sync_mmu = !!kvm_vm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
+    s->sync_mmu = !!kvm_check_extension(kvm_state, KVM_CAP_SYNC_MMU);
     if (!s->sync_mmu) {
         ret = ram_block_discard_disable(true);
         assert(!ret);
diff --git a/target/arm/kvm.c b/target/arm/kvm.c
index 7b6812c0de..8bdf4abeb6 100644
--- a/target/arm/kvm.c
+++ b/target/arm/kvm.c
@@ -601,7 +601,7 @@  int kvm_arch_init(MachineState *ms, KVMState *s)
     if (s->kvm_eager_split_size) {
         uint32_t sizes;
 
-        sizes = kvm_vm_check_extension(s, KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES);
+        sizes = kvm_check_extension(s, KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES);
         if (!sizes) {
             s->kvm_eager_split_size = 0;
             warn_report("Eager Page Split support not available");
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 8e17942c3b..2f35e7468c 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -244,7 +244,7 @@  bool kvm_enable_hypercall(uint64_t enable_mask)
 
 bool kvm_has_smm(void)
 {
-    return kvm_vm_check_extension(kvm_state, KVM_CAP_X86_SMM);
+    return kvm_check_extension(kvm_state, KVM_CAP_X86_SMM);
 }
 
 bool kvm_has_adjust_clock_stable(void)
@@ -3320,7 +3320,7 @@  int kvm_arch_init(MachineState *ms, KVMState *s)
         }
     }
 
-    if (kvm_vm_check_extension(s, KVM_CAP_X86_USER_SPACE_MSR)) {
+    if (kvm_check_extension(s, KVM_CAP_X86_USER_SPACE_MSR)) {
         ret = kvm_vm_enable_userspace_msr(s);
         if (ret < 0) {
             return ret;
@@ -5936,7 +5936,7 @@  static bool __kvm_enable_sgx_provisioning(KVMState *s)
 {
     int fd, ret;
 
-    if (!kvm_vm_check_extension(s, KVM_CAP_SGX_ATTRIBUTE)) {
+    if (!kvm_check_extension(s, KVM_CAP_SGX_ATTRIBUTE)) {
         return false;
     }
 
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index 3efc28f18b..8bcb0368ce 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -110,7 +110,7 @@  static uint32_t debug_inst_opcode;
 static bool kvmppc_is_pr(KVMState *ks)
 {
     /* Assume KVM-PR if the GET_PVINFO capability is available */
-    return kvm_vm_check_extension(ks, KVM_CAP_PPC_GET_PVINFO) != 0;
+    return kvm_check_extension(ks, KVM_CAP_PPC_GET_PVINFO) != 0;
 }
 
 static int kvm_ppc_register_host_cpu_type(void);
@@ -127,11 +127,11 @@  int kvm_arch_init(MachineState *ms, KVMState *s)
     cap_interrupt_unset = kvm_check_extension(s, KVM_CAP_PPC_UNSET_IRQ);
     cap_segstate = kvm_check_extension(s, KVM_CAP_PPC_SEGSTATE);
     cap_booke_sregs = kvm_check_extension(s, KVM_CAP_PPC_BOOKE_SREGS);
-    cap_ppc_smt_possible = kvm_vm_check_extension(s, KVM_CAP_PPC_SMT_POSSIBLE);
+    cap_ppc_smt_possible = kvm_check_extension(s, KVM_CAP_PPC_SMT_POSSIBLE);
     cap_spapr_tce = kvm_check_extension(s, KVM_CAP_SPAPR_TCE);
     cap_spapr_tce_64 = kvm_check_extension(s, KVM_CAP_SPAPR_TCE_64);
     cap_spapr_multitce = kvm_check_extension(s, KVM_CAP_SPAPR_MULTITCE);
-    cap_spapr_vfio = kvm_vm_check_extension(s, KVM_CAP_SPAPR_TCE_VFIO);
+    cap_spapr_vfio = kvm_check_extension(s, KVM_CAP_SPAPR_TCE_VFIO);
     cap_one_reg = kvm_check_extension(s, KVM_CAP_ONE_REG);
     cap_hior = kvm_check_extension(s, KVM_CAP_PPC_HIOR);
     cap_epr = kvm_check_extension(s, KVM_CAP_PPC_EPR);
@@ -140,23 +140,23 @@  int kvm_arch_init(MachineState *ms, KVMState *s)
      * Note: we don't set cap_papr here, because this capability is
      * only activated after this by kvmppc_set_papr()
      */
-    cap_htab_fd = kvm_vm_check_extension(s, KVM_CAP_PPC_HTAB_FD);
+    cap_htab_fd = kvm_check_extension(s, KVM_CAP_PPC_HTAB_FD);
     cap_fixup_hcalls = kvm_check_extension(s, KVM_CAP_PPC_FIXUP_HCALL);
-    cap_ppc_smt = kvm_vm_check_extension(s, KVM_CAP_PPC_SMT);
-    cap_htm = kvm_vm_check_extension(s, KVM_CAP_PPC_HTM);
-    cap_mmu_radix = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_RADIX);
-    cap_mmu_hash_v3 = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_HASH_V3);
-    cap_xive = kvm_vm_check_extension(s, KVM_CAP_PPC_IRQ_XIVE);
-    cap_resize_hpt = kvm_vm_check_extension(s, KVM_CAP_SPAPR_RESIZE_HPT);
+    cap_ppc_smt = kvm_check_extension(s, KVM_CAP_PPC_SMT);
+    cap_htm = kvm_check_extension(s, KVM_CAP_PPC_HTM);
+    cap_mmu_radix = kvm_check_extension(s, KVM_CAP_PPC_MMU_RADIX);
+    cap_mmu_hash_v3 = kvm_check_extension(s, KVM_CAP_PPC_MMU_HASH_V3);
+    cap_xive = kvm_check_extension(s, KVM_CAP_PPC_IRQ_XIVE);
+    cap_resize_hpt = kvm_check_extension(s, KVM_CAP_SPAPR_RESIZE_HPT);
     kvmppc_get_cpu_characteristics(s);
-    cap_ppc_nested_kvm_hv = kvm_vm_check_extension(s, KVM_CAP_PPC_NESTED_HV);
+    cap_ppc_nested_kvm_hv = kvm_check_extension(s, KVM_CAP_PPC_NESTED_HV);
     cap_large_decr = kvmppc_get_dec_bits();
-    cap_fwnmi = kvm_vm_check_extension(s, KVM_CAP_PPC_FWNMI);
+    cap_fwnmi = kvm_check_extension(s, KVM_CAP_PPC_FWNMI);
     /*
      * Note: setting it to false because there is not such capability
      * in KVM at this moment.
      *
-     * TODO: call kvm_vm_check_extension() with the right capability
+     * TODO: call kvm_check_extension() with the right capability
      * after the kernel starts implementing it.
      */
     cap_ppc_pvr_compat = false;
@@ -166,8 +166,8 @@  int kvm_arch_init(MachineState *ms, KVMState *s)
         exit(1);
     }
 
-    cap_rpt_invalidate = kvm_vm_check_extension(s, KVM_CAP_PPC_RPT_INVALIDATE);
-    cap_ail_mode_3 = kvm_vm_check_extension(s, KVM_CAP_PPC_AIL_MODE_3);
+    cap_rpt_invalidate = kvm_check_extension(s, KVM_CAP_PPC_RPT_INVALIDATE);
+    cap_ail_mode_3 = kvm_check_extension(s, KVM_CAP_PPC_AIL_MODE_3);
     kvm_ppc_register_host_cpu_type();
 
     return 0;
@@ -1976,7 +1976,7 @@  static int kvmppc_get_pvinfo(CPUPPCState *env, struct kvm_ppc_pvinfo *pvinfo)
 {
     CPUState *cs = env_cpu(env);
 
-    if (kvm_vm_check_extension(cs->kvm_state, KVM_CAP_PPC_GET_PVINFO) &&
+    if (kvm_check_extension(cs->kvm_state, KVM_CAP_PPC_GET_PVINFO) &&
         !kvm_vm_ioctl(cs->kvm_state, KVM_PPC_GET_PVINFO, pvinfo)) {
         return 0;
     }
@@ -2298,7 +2298,7 @@  int kvmppc_reset_htab(int shift_hint)
         /* Full emulation, tell caller to allocate htab itself */
         return 0;
     }
-    if (kvm_vm_check_extension(kvm_state, KVM_CAP_PPC_ALLOC_HTAB)) {
+    if (kvm_check_extension(kvm_state, KVM_CAP_PPC_ALLOC_HTAB)) {
         int ret;
         ret = kvm_vm_ioctl(kvm_state, KVM_PPC_ALLOCATE_HTAB, &shift);
         if (ret == -ENOTTY) {
@@ -2507,7 +2507,7 @@  static void kvmppc_get_cpu_characteristics(KVMState *s)
     cap_ppc_safe_bounds_check = 0;
     cap_ppc_safe_indirect_branch = 0;
 
-    ret = kvm_vm_check_extension(s, KVM_CAP_PPC_GET_CPU_CHAR);
+    ret = kvm_check_extension(s, KVM_CAP_PPC_GET_CPU_CHAR);
     if (!ret) {
         return;
     }