Message ID | 20231016132819.1002933-49-michael.roth@amd.com |
---|---|
State | New |
Headers | show |
Series | Add AMD Secure Nested Paging (SEV-SNP) Hypervisor Support | expand |
> + > + /* > + * If a VMM-specific certificate blob hasn't been provided, grab the > + * host-wide one. > + */ > + snp_certs = sev_snp_certs_get(sev->snp_certs); > + if (!snp_certs) > + snp_certs = sev_snp_global_certs_get(); > + This is where the generation I suggested adding would get checked. If the instance certs' generation is not the global generation, then I think we need a way to return to the VMM to make that right before continuing to provide outdated certificates. This might be an unreasonable request, but the fact that the certs and reported_tcb can be set while a VM is running makes this an issue.
On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote: > > + > > + /* > > + * If a VMM-specific certificate blob hasn't been provided, grab the > > + * host-wide one. > > + */ > > + snp_certs = sev_snp_certs_get(sev->snp_certs); > > + if (!snp_certs) > > + snp_certs = sev_snp_global_certs_get(); > > + > > This is where the generation I suggested adding would get checked. If > the instance certs' generation is not the global generation, then I > think we need a way to return to the VMM to make that right before > continuing to provide outdated certificates. > This might be an unreasonable request, but the fact that the certs and > reported_tcb can be set while a VM is running makes this an issue. Before we get that far, the changelogs need to explain why the kernel is storing userspace blobs in the first place. The whole thing is a bit of a mess. sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs while sev_snp_global_certs_get() is running. If the compiler reloads snp_certs between bumping the refcount and grabbing the pointer, KVM will end up leaking a refcount and consuming a pointer without a refcount. if (!kref_get_unless_zero(&certs->kref)) return NULL; return certs; If allocating memory for the certs fails, the kernel will have set the config but not store the corresponding certs. ret = __sev_do_cmd_locked(SEV_CMD_SNP_CONFIG, &config, &argp->error); if (ret) goto e_free; memcpy(&sev->snp_config, &config, sizeof(config)); } /* * If the new certs are passed then cache it else free the old certs. */ if (input.certs_len) { snp_certs = sev_snp_certs_new(certs, input.certs_len); if (!snp_certs) { ret = -ENOMEM; goto e_free; } } Reasoning about ordering is also difficult, e.g. what is KVM's contract with userspace in terms of recognizing new global certs? I don't understand why the kernel needs to manage the certs. AFAICT the so called global certs aren't an input to SEV_CMD_SNP_CONFIG, i.e. SNP_SET_EXT_CONFIG is purely a software defined thing. The easiest solution I can think of is to have KVM provide a chunk of memory in kvm_sev_info for SNP guests that userspace can mmap(), a la vcpu->run. struct sev_snp_certs { u8 data[KVM_MAX_SEV_SNP_CERT_SIZE]; u32 size; u8 pad[<size to make the struct page aligned>]; }; When the guest requests the certs, KVM does something like: certs_size = READ_ONCE(sev->snp_certs->size); if (certs_size > sizeof(sev->snp_certs->data) || !IS_ALIGNED(certs_size, PAGE_SIZE)) certs_size = 0; if (certs_size && (data_npages << PAGE_SHIFT) < certs_size) { vcpu->arch.regs[VCPU_REGS_RBX] = certs_size >> PAGE_SHIFT; exitcode = SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN); goto cleanup; } ... if (certs_size && kvm_write_guest(kvm, data_gpa, sev->snp_certs->data, certs_size)) exitcode = SEV_RET_INVALID_ADDRESS; If userspace wants to provide garbage to the guest, so be it, not KVM's problem. That way, whether the VM gets the global cert or a per-VM cert is purely a userspace concern. If userspace needs to *stall* cert requests, e.g. while the certs are being updated, then that's a different issue entirely. If the GHCB allows telling the guest to retry the request, then it should be trivially easy to solve, e.g. add a flag in sev_snp_certs. If KVM must "immediately" handle the request, then we'll need more elaborate uAPI.
On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote: > > On 18/10/23 03:27, Sean Christopherson wrote: > > On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote: > > > > + > > > > + /* > > > > + * If a VMM-specific certificate blob hasn't been provided, grab the > > > > + * host-wide one. > > > > + */ > > > > + snp_certs = sev_snp_certs_get(sev->snp_certs); > > > > + if (!snp_certs) > > > > + snp_certs = sev_snp_global_certs_get(); > > > > + > > > > > > This is where the generation I suggested adding would get checked. If > > > the instance certs' generation is not the global generation, then I > > > think we need a way to return to the VMM to make that right before > > > continuing to provide outdated certificates. > > > This might be an unreasonable request, but the fact that the certs and > > > reported_tcb can be set while a VM is running makes this an issue. > > > > Before we get that far, the changelogs need to explain why the kernel is storing > > userspace blobs in the first place. The whole thing is a bit of a mess. > > > > sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU > > bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs > > while sev_snp_global_certs_get() is running. If the compiler reloads snp_certs > > between bumping the refcount and grabbing the pointer, KVM will end up leaking a > > refcount and consuming a pointer without a refcount. > > > > if (!kref_get_unless_zero(&certs->kref)) > > return NULL; > > > > return certs; > > I'm missing something here. The @certs pointer is on the stack, No, nothing guarantees that @certs is on the stack and will never be reloaded. sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so it's entirely possible that it can be inlined. Then you end up with: struct sev_device *sev; if (!psp_master || !psp_master->sev_data) return NULL; sev = psp_master->sev_data; if (!sev->snp_initialized) return NULL; if (!sev->snp_certs) return NULL; if (!kref_get_unless_zero(&sev->snp_certs->kref)) return NULL; return sev->snp_certs; At which point the compiler could choose to omit a local variable entirely, it could store @certs in a register and reload after kref_get_unless_zero(), etc. If psp_master->sev_data->snp_certs is changed at any point, odd thing can happen. That atomic operation in kref_get_unless_zero() might prevent a reload between getting the kref and the return, but it wouldn't prevent a reload between the !NULL check and kref_get_unless_zero(). > > If userspace wants to provide garbage to the guest, so be it, not KVM's problem. > > That way, whether the VM gets the global cert or a per-VM cert is purely a userspace > > concern. > > The global cert lives in CCP (/dev/sev), the per VM cert lives in kvmvm_fd. > "A la vcpu->run" is fine for the latter but for the former we need something > else. Why? The cert ultimately comes from userspace, no? Make userspace deal with it. > And there is scenario when one global certs blob is what is needed and > copying it over multiple VMs seems suboptimal. That's a solvable problem. I'm not sure I like the most obvious solution, but it is a solution: let userspace define a KVM-wide blob pointer, either via .mmap() or via an ioctl(). FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the userspace pointer would suffice. The benefit of a kernel controlled pointer is that it doesn't require copying to a kernel buffer (or special code to copy from userspace into guest). Actually, looking at the flow again, AFAICT there's nothing special about the target DATA_PAGE. It must be SHARED *before* SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e. KVM doesn't need to do conversions, there's no kernel priveleges required, etc. And the GHCB doesn't dictate ordering between storing the certificates and doing the request. That means the certificate stuff can be punted entirely to usersepace. Heh, typing up the below, there's another bug: KVM will incorrectly "return" '0' for non-SNP guests: unsigned long exitcode = 0; u64 data_gpa; int err, rc; if (!sev_snp_guest(vcpu->kvm)) { rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode" goto e_fail; } e_fail: ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode); Which really highlights that we need to get test infrastructure up and running for SEV-ES, SNP, and TDX. Anyways, back to punting to userspace. Here's a rough sketch. The only new uAPI is the definition of KVM_HC_SNP_GET_CERTS and its arguments. static void snp_handle_guest_request(struct vcpu_svm *svm) { struct vmcb_control_area *control = &svm->vmcb->control; struct sev_data_snp_guest_request data = {0}; struct kvm_vcpu *vcpu = &svm->vcpu; struct kvm *kvm = vcpu->kvm; struct kvm_sev_info *sev; gpa_t req_gpa = control->exit_info_1; gpa_t resp_gpa = control->exit_info_2; unsigned long rc; int err; if (!sev_snp_guest(vcpu->kvm)) { rc = SEV_RET_INVALID_GUEST; goto e_fail; } sev = &to_kvm_svm(kvm)->sev_info; mutex_lock(&sev->guest_req_lock); rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa); if (rc) goto unlock; rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err); if (rc) /* Ensure an error value is returned to guest. */ rc = err ? err : SEV_RET_INVALID_ADDRESS; snp_cleanup_guest_buf(&data, &rc); unlock: mutex_unlock(&sev->guest_req_lock); e_fail: ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc); } static int snp_complete_ext_guest_request(struct kvm_vcpu *vcpu) { u64 certs_exitcode = vcpu->run->hypercall.args[2]; struct vcpu_svm *svm = to_svm(vcpu); if (certs_exitcode) ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, certs_exitcode); else snp_handle_guest_request(svm); return 1; } static int snp_handle_ext_guest_request(struct vcpu_svm *svm) { struct kvm_vcpu *vcpu = &svm->vcpu; struct kvm *kvm = vcpu->kvm; struct kvm_sev_info *sev; unsigned long exitcode; u64 data_gpa; if (!sev_snp_guest(vcpu->kvm)) { ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST); return 1; } data_gpa = vcpu->arch.regs[VCPU_REGS_RAX]; if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) { ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS); return 1; } vcpu->run->hypercall.nr = KVM_HC_SNP_GET_CERTS; vcpu->run->hypercall.args[0] = data_gpa; vcpu->run->hypercall.args[1] = vcpu->arch.regs[VCPU_REGS_RBX]; vcpu->run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE; vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request; return 0; }
On 10/18/2023 8:48 AM, Sean Christopherson wrote: > On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote: >> >> On 18/10/23 03:27, Sean Christopherson wrote: >>> On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote: >>>>> + >>>>> + /* >>>>> + * If a VMM-specific certificate blob hasn't been provided, grab the >>>>> + * host-wide one. >>>>> + */ >>>>> + snp_certs = sev_snp_certs_get(sev->snp_certs); >>>>> + if (!snp_certs) >>>>> + snp_certs = sev_snp_global_certs_get(); >>>>> + >>>> >>>> This is where the generation I suggested adding would get checked. If >>>> the instance certs' generation is not the global generation, then I >>>> think we need a way to return to the VMM to make that right before >>>> continuing to provide outdated certificates. >>>> This might be an unreasonable request, but the fact that the certs and >>>> reported_tcb can be set while a VM is running makes this an issue. >>> >>> Before we get that far, the changelogs need to explain why the kernel is storing >>> userspace blobs in the first place. The whole thing is a bit of a mess. >>> >>> sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU >>> bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs >>> while sev_snp_global_certs_get() is running. If the compiler reloads snp_certs >>> between bumping the refcount and grabbing the pointer, KVM will end up leaking a >>> refcount and consuming a pointer without a refcount. >>> >>> if (!kref_get_unless_zero(&certs->kref)) >>> return NULL; >>> >>> return certs; >> >> I'm missing something here. The @certs pointer is on the stack, > > No, nothing guarantees that @certs is on the stack and will never be reloaded. > sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so it's entirely > possible that it can be inlined. Then you end up with: > > struct sev_device *sev; > > if (!psp_master || !psp_master->sev_data) > return NULL; > > sev = psp_master->sev_data; > if (!sev->snp_initialized) > return NULL; > > if (!sev->snp_certs) > return NULL; > > if (!kref_get_unless_zero(&sev->snp_certs->kref)) > return NULL; > > return sev->snp_certs; > > At which point the compiler could choose to omit a local variable entirely, it > could store @certs in a register and reload after kref_get_unless_zero(), etc. > If psp_master->sev_data->snp_certs is changed at any point, odd thing can happen. > > That atomic operation in kref_get_unless_zero() might prevent a reload between > getting the kref and the return, but it wouldn't prevent a reload between the > !NULL check and kref_get_unless_zero(). > >>> If userspace wants to provide garbage to the guest, so be it, not KVM's problem. >>> That way, whether the VM gets the global cert or a per-VM cert is purely a userspace >>> concern. >> >> The global cert lives in CCP (/dev/sev), the per VM cert lives in kvmvm_fd. >> "A la vcpu->run" is fine for the latter but for the former we need something >> else. > > Why? The cert ultimately comes from userspace, no? Make userspace deal with it. > >> And there is scenario when one global certs blob is what is needed and >> copying it over multiple VMs seems suboptimal. > > That's a solvable problem. I'm not sure I like the most obvious solution, but it > is a solution: let userspace define a KVM-wide blob pointer, either via .mmap() > or via an ioctl(). > > FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the > userspace pointer would suffice. The benefit of a kernel controlled pointer is > that it doesn't require copying to a kernel buffer (or special code to copy from > userspace into guest). > > Actually, looking at the flow again, AFAICT there's nothing special about the > target DATA_PAGE. It must be SHARED *before* SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e. > KVM doesn't need to do conversions, there's no kernel priveleges required, etc. > And the GHCB doesn't dictate ordering between storing the certificates and doing > the request. That's true. >That means the certificate stuff can be punted entirely to usersepace. > > Heh, typing up the below, there's another bug: KVM will incorrectly "return" '0' > for non-SNP guests: > > unsigned long exitcode = 0; > u64 data_gpa; > int err, rc; > > if (!sev_snp_guest(vcpu->kvm)) { > rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode" > goto e_fail; > } > > e_fail: > ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode); > > Which really highlights that we need to get test infrastructure up and running > for SEV-ES, SNP, and TDX. > > Anyways, back to punting to userspace. Here's a rough sketch. The only new uAPI > is the definition of KVM_HC_SNP_GET_CERTS and its arguments. > > static void snp_handle_guest_request(struct vcpu_svm *svm) > { > struct vmcb_control_area *control = &svm->vmcb->control; > struct sev_data_snp_guest_request data = {0}; > struct kvm_vcpu *vcpu = &svm->vcpu; > struct kvm *kvm = vcpu->kvm; > struct kvm_sev_info *sev; > gpa_t req_gpa = control->exit_info_1; > gpa_t resp_gpa = control->exit_info_2; > unsigned long rc; > int err; > > if (!sev_snp_guest(vcpu->kvm)) { > rc = SEV_RET_INVALID_GUEST; > goto e_fail; > } > > sev = &to_kvm_svm(kvm)->sev_info; > > mutex_lock(&sev->guest_req_lock); > > rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa); > if (rc) > goto unlock; > > rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err); > if (rc) > /* Ensure an error value is returned to guest. */ > rc = err ? err : SEV_RET_INVALID_ADDRESS; > > snp_cleanup_guest_buf(&data, &rc); > > unlock: > mutex_unlock(&sev->guest_req_lock); > > e_fail: > ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc); > } > > static int snp_complete_ext_guest_request(struct kvm_vcpu *vcpu) > { > u64 certs_exitcode = vcpu->run->hypercall.args[2]; > struct vcpu_svm *svm = to_svm(vcpu); > > if (certs_exitcode) > ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, certs_exitcode); > else > snp_handle_guest_request(svm); > return 1; > } > > static int snp_handle_ext_guest_request(struct vcpu_svm *svm) > { > struct kvm_vcpu *vcpu = &svm->vcpu; > struct kvm *kvm = vcpu->kvm; > struct kvm_sev_info *sev; > unsigned long exitcode; > u64 data_gpa; > > if (!sev_snp_guest(vcpu->kvm)) { > ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST); > return 1; > } > > data_gpa = vcpu->arch.regs[VCPU_REGS_RAX]; > if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) { > ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS); > return 1; > } > > vcpu->run->hypercall.nr = KVM_HC_SNP_GET_CERTS; > vcpu->run->hypercall.args[0] = data_gpa; > vcpu->run->hypercall.args[1] = vcpu->arch.regs[VCPU_REGS_RBX]; > vcpu->run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE; > vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request; > return 0; > } > IIRC, the important consideration here is to ensure that getting the attestation report and retrieving the certificates appears atomic to the guest. When SNP live migration is supported we don't want a case where the guest could have migrated between the call to obtain the certificates and obtaining the attestation report, which can potentially cause failure of validation of the attestation report. Thanks, Ashish
On 10/18/2023 3:38 PM, Sean Christopherson wrote: > On Wed, Oct 18, 2023, Ashish Kalra wrote: >>> static int snp_handle_ext_guest_request(struct vcpu_svm *svm) >>> { >>> struct kvm_vcpu *vcpu = &svm->vcpu; >>> struct kvm *kvm = vcpu->kvm; >>> struct kvm_sev_info *sev; >>> unsigned long exitcode; >>> u64 data_gpa; >>> >>> if (!sev_snp_guest(vcpu->kvm)) { >>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST); >>> return 1; >>> } >>> >>> data_gpa = vcpu->arch.regs[VCPU_REGS_RAX]; >>> if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) { >>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS); >>> return 1; >>> } >>> >>> vcpu->run->hypercall.nr = KVM_HC_SNP_GET_CERTS; >>> vcpu->run->hypercall.args[0] = data_gpa; >>> vcpu->run->hypercall.args[1] = vcpu->arch.regs[VCPU_REGS_RBX]; >>> vcpu->run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE; >>> vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request; >>> return 0; >>> } >>> >> >> IIRC, the important consideration here is to ensure that getting the >> attestation report and retrieving the certificates appears atomic to the >> guest. When SNP live migration is supported we don't want a case where the >> guest could have migrated between the call to obtain the certificates and >> obtaining the attestation report, which can potentially cause failure of >> validation of the attestation report. > > Where does "obtaining the attestation report" happen? I see the guest request > and the certificate stuff, I don't see anything about attestation reports (though > I'm not looking very closely). > The guest requests that the firmware construct an attestation report via the SNP_GUEST_REQUEST command. The certificates are piggy-backed to the guest along with the attestation report (retrieved from the FW via the SNP_GUEST_REQUEST command) as part of the SNP Extended Guest Request NAE handling. Thanks, Ashish
On 20/10/23 01:57, Sean Christopherson wrote: > On Thu, Oct 19, 2023, Alexey Kardashevskiy wrote: >> >> On 19/10/23 00:48, Sean Christopherson wrote: >>> static int snp_handle_ext_guest_request(struct vcpu_svm *svm) >>> { >>> struct kvm_vcpu *vcpu = &svm->vcpu; >>> struct kvm *kvm = vcpu->kvm; >>> struct kvm_sev_info *sev; >>> unsigned long exitcode; >>> u64 data_gpa; >>> >>> if (!sev_snp_guest(vcpu->kvm)) { >>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST); >>> return 1; >>> } >>> >>> data_gpa = vcpu->arch.regs[VCPU_REGS_RAX]; >>> if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) { >>> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS); >>> return 1; >>> } >>> >>> vcpu->run->hypercall.nr = KVM_HC_SNP_GET_CERTS; >>> vcpu->run->hypercall.args[0] = data_gpa; >>> vcpu->run->hypercall.args[1] = vcpu->arch.regs[VCPU_REGS_RBX]; >>> vcpu->run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE; >> >> btw why is it _LONG_MODE and not just _64? :) > > I'm pretty sure it got copied from Xen when KVM started adding supporting for > emulating Xen's hypercalls. I assume Xen PV actually has a need for identifying > long mode as opposed to just 64-bit mode, but KVM, not so much. > >>> vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request; >>> return 0; >>> } >> >> This should work the KVM stored certs nicely but not for the global certs. >> Although I am not all convinced that global certs is all that valuable but I >> do not know the history of that, happened before I joined so I let others to >> comment on that. Thanks, > > Aren't the global certs provided by userspace too though? If all certs are > ultimately controlled by userspace, I don't see any reason to make the kernel a > middle-man. The max blob size is 32KB or so and for 200 VMs it is: - 6.5MB, all in the userspace so swappable vs - 32KB but in the kernel so not swappable. Sure, a box capable of running 200 VMs must have plenty of RAM but still :) Plus, GHCB now has to go via the userspace before talking to the PSP which was not the case so far (though I cannot think of immediate implication right now).
On 20/10/23 11:13, Sean Christopherson wrote: > On Fri, Oct 20, 2023, Alexey Kardashevskiy wrote: >> >> On 20/10/23 01:57, Sean Christopherson wrote: >>> On Thu, Oct 19, 2023, Alexey Kardashevskiy wrote: >>>>> vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request; >>>>> return 0; >>>>> } >>>> >>>> This should work the KVM stored certs nicely but not for the global certs. >>>> Although I am not all convinced that global certs is all that valuable but I >>>> do not know the history of that, happened before I joined so I let others to >>>> comment on that. Thanks, >>> >>> Aren't the global certs provided by userspace too though? If all certs are >>> ultimately controlled by userspace, I don't see any reason to make the kernel a >>> middle-man. >> >> The max blob size is 32KB or so and for 200 VMs it is: > > Not according to include/linux/psp-sev.h: > > #define SEV_FW_BLOB_MAX_SIZE 0x4000 /* 16KB */ > > Ugh, and I see in another patch: > > Also increase the SEV_FW_BLOB_MAX_SIZE another 4K page to allow space > for an extra certificate. > > -#define SEV_FW_BLOB_MAX_SIZE 0x4000 /* 16KB */ > +#define SEV_FW_BLOB_MAX_SIZE 0x5000 /* 20KB */ > > That's gross and just asking for ABI problems, because then there's this: > > +:: > + > + struct kvm_sev_snp_set_certs { > + __u64 certs_uaddr; > + __u64 certs_len > + }; > + > +The certs_len field may not exceed SEV_FW_BLOB_MAX_SIZE. > >> - 6.5MB, all in the userspace so swappable vs >> - 32KB but in the kernel so not swappable. >> Sure, a box capable of running 200 VMs must have plenty of RAM but still :) > > That's making quite a few assumptions. > > 1) That the global cert will be 32KiB (which clearly isn't the case today). > 2) That every VM will want the global cert. > 3) That userspace can't figure out a way to share the global cert. > > Even in that absolutely worst case scenario, I am not remotely convinced that it > justifies taking on the necessary complexity to manage certs in-kernel. > >> Plus, GHCB now has to go via the userspace before talking to the PSP which >> was not the case so far (though I cannot think of immediate implication >> right now). > > Any argument along the lines of "because that's how we've always done it" is going > to fall on deaf ears. If there's a real performance bottleneck with kicking out > to userspace, then I'll happily work to figure out a solution. If. No, not performance, I was trying to imagine what can go wrong if multiple vcpus are making this call, all exiting to QEMU, in a loop, racing, something like this.
On Fri, Oct 20, 2023, Alexey Kardashevskiy wrote: > > On 20/10/23 11:13, Sean Christopherson wrote: > > On Fri, Oct 20, 2023, Alexey Kardashevskiy wrote: > > > Plus, GHCB now has to go via the userspace before talking to the PSP which > > > was not the case so far (though I cannot think of immediate implication > > > right now). > > > > Any argument along the lines of "because that's how we've always done it" is going > > to fall on deaf ears. If there's a real performance bottleneck with kicking out > > to userspace, then I'll happily work to figure out a solution. If. > > No, not performance, I was trying to imagine what can go wrong if multiple > vcpus are making this call, all exiting to QEMU, in a loop, racing, > something like this. I am not at all concerned about userspace being able to handle parallel requests to get a certificate. Per-vCPU exits that access global/shared resources might not be super common, but they're certainly not rare. E.g. a guest access to an option ROM can trigger memslot updates in QEMU, which requires at least taking a mutex to guard KVM_SET_USER_MEMORY_REGION, and IIRC QEMU also uses RCU to protect QEMU accesses to address spaces. Given that we know there will be scenarios where certificates are changed/updated, I wouldn't be at all surprised if handling this in userspace is actually easier as it will give userspace more control and options, and make it easier to reason about the resulting behavior. E.g. userspace could choose between a lockless scheme and a r/w lock if there's a need to ensure per-VM and global certs are updated atomically from the guest's perspective.
On 10/18/23 21:48, Alexey Kardashevskiy wrote: > > On 19/10/23 00:48, Sean Christopherson wrote: >> On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote: >>> >>> On 18/10/23 03:27, Sean Christopherson wrote: >>>> On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote: >>>>>> + >>>>>> + /* >>>>>> + * If a VMM-specific certificate blob hasn't been provided, >>>>>> grab the >>>>>> + * host-wide one. >>>>>> + */ >>>>>> + snp_certs = sev_snp_certs_get(sev->snp_certs); >>>>>> + if (!snp_certs) >>>>>> + snp_certs = sev_snp_global_certs_get(); >>>>>> + >>>>> >>>>> This is where the generation I suggested adding would get checked. If >>>>> the instance certs' generation is not the global generation, then I >>>>> think we need a way to return to the VMM to make that right before >>>>> continuing to provide outdated certificates. >>>>> This might be an unreasonable request, but the fact that the certs and >>>>> reported_tcb can be set while a VM is running makes this an issue. >>>> >>>> Before we get that far, the changelogs need to explain why the kernel >>>> is storing >>>> userspace blobs in the first place. The whole thing is a bit of a mess. >>>> >>>> sev_snp_global_certs_get() has data races that could lead to >>>> variations of TOCTOU >>>> bugs: sev_ioctl_snp_set_config() can overwrite >>>> psp_master->sev_data->snp_certs >>>> while sev_snp_global_certs_get() is running. If the compiler reloads >>>> snp_certs >>>> between bumping the refcount and grabbing the pointer, KVM will end up >>>> leaking a >>>> refcount and consuming a pointer without a refcount. >>>> >>>> if (!kref_get_unless_zero(&certs->kref)) >>>> return NULL; >>>> >>>> return certs; >>> >>> I'm missing something here. The @certs pointer is on the stack, >> >> No, nothing guarantees that @certs is on the stack and will never be >> reloaded. >> sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so >> it's entirely >> possible that it can be inlined. Then you end up with: >> >> struct sev_device *sev; >> >> if (!psp_master || !psp_master->sev_data) >> return NULL; >> >> sev = psp_master->sev_data; >> if (!sev->snp_initialized) >> return NULL; >> >> if (!sev->snp_certs) >> return NULL; >> >> if (!kref_get_unless_zero(&sev->snp_certs->kref)) >> return NULL; >> >> return sev->snp_certs; >> >> At which point the compiler could choose to omit a local variable >> entirely, it >> could store @certs in a register and reload after >> kref_get_unless_zero(), etc. >> If psp_master->sev_data->snp_certs is changed at any point, odd thing >> can happen. >> >> That atomic operation in kref_get_unless_zero() might prevent a reload >> between >> getting the kref and the return, but it wouldn't prevent a reload >> between the >> !NULL check and kref_get_unless_zero(). > > Oh. The function is exported so I thought gcc would not go that far but > yeah it is possible. So this needs an explicit READ_ONCE barrier. > > >>>> If userspace wants to provide garbage to the guest, so be it, not >>>> KVM's problem. >>>> That way, whether the VM gets the global cert or a per-VM cert is >>>> purely a userspace >>>> concern. >>> >>> The global cert lives in CCP (/dev/sev), the per VM cert lives in >>> kvmvm_fd. >>> "A la vcpu->run" is fine for the latter but for the former we need >>> something >>> else. >> >> Why? The cert ultimately comes from userspace, no? Make userspace deal >> with it. >> >>> And there is scenario when one global certs blob is what is needed and >>> copying it over multiple VMs seems suboptimal. >> >> That's a solvable problem. I'm not sure I like the most obvious >> solution, but it >> is a solution: let userspace define a KVM-wide blob pointer, either via >> .mmap() >> or via an ioctl(). >> >> FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the >> userspace pointer would suffice. The benefit of a kernel controlled >> pointer is >> that it doesn't require copying to a kernel buffer (or special code to >> copy from >> userspace into guest). > > Just to clarify - like, a small userspace non-qemu program which just > holds a pointer with the certs blob, or embed it into libvirt or systemd? > > >> Actually, looking at the flow again, AFAICT there's nothing special >> about the >> target DATA_PAGE. It must be SHARED *before* >> SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e. >> KVM doesn't need to do conversions, there's no kernel priveleges >> required, etc. >> And the GHCB doesn't dictate ordering between storing the certificates >> and doing >> the request. That means the certificate stuff can be punted entirely to >> usersepace. > > All true. > >> Heh, typing up the below, there's another bug: KVM will incorrectly >> "return" '0' >> for non-SNP guests: >> >> unsigned long exitcode = 0; >> u64 data_gpa; >> int err, rc; >> >> if (!sev_snp_guest(vcpu->kvm)) { >> rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode" >> goto e_fail; >> } >> >> e_fail: >> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode); >> >> Which really highlights that we need to get test infrastructure up and >> running >> for SEV-ES, SNP, and TDX. >> >> Anyways, back to punting to userspace. Here's a rough sketch. The only >> new uAPI >> is the definition of KVM_HC_SNP_GET_CERTS and its arguments. >> >> static void snp_handle_guest_request(struct vcpu_svm *svm) >> { >> struct vmcb_control_area *control = &svm->vmcb->control; >> struct sev_data_snp_guest_request data = {0}; >> struct kvm_vcpu *vcpu = &svm->vcpu; >> struct kvm *kvm = vcpu->kvm; >> struct kvm_sev_info *sev; >> gpa_t req_gpa = control->exit_info_1; >> gpa_t resp_gpa = control->exit_info_2; >> unsigned long rc; >> int err; >> >> if (!sev_snp_guest(vcpu->kvm)) { >> rc = SEV_RET_INVALID_GUEST; >> goto e_fail; >> } >> >> sev = &to_kvm_svm(kvm)->sev_info; >> >> mutex_lock(&sev->guest_req_lock); >> >> rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa); >> if (rc) >> goto unlock; >> >> rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err); >> if (rc) >> /* Ensure an error value is returned to guest. */ >> rc = err ? err : SEV_RET_INVALID_ADDRESS; >> >> snp_cleanup_guest_buf(&data, &rc); >> >> unlock: >> mutex_unlock(&sev->guest_req_lock); >> >> e_fail: >> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc); >> } >> >> static int snp_complete_ext_guest_request(struct kvm_vcpu *vcpu) >> { >> u64 certs_exitcode = vcpu->run->hypercall.args[2]; >> struct vcpu_svm *svm = to_svm(vcpu); >> >> if (certs_exitcode) >> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, certs_exitcode); >> else >> snp_handle_guest_request(svm); >> return 1; >> } >> >> static int snp_handle_ext_guest_request(struct vcpu_svm *svm) >> { >> struct kvm_vcpu *vcpu = &svm->vcpu; >> struct kvm *kvm = vcpu->kvm; >> struct kvm_sev_info *sev; >> unsigned long exitcode; >> u64 data_gpa; >> >> if (!sev_snp_guest(vcpu->kvm)) { >> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_GUEST); >> return 1; >> } >> >> data_gpa = vcpu->arch.regs[VCPU_REGS_RAX]; >> if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) { >> ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, SEV_RET_INVALID_ADDRESS); >> return 1; >> } >> >> vcpu->run->hypercall.nr = KVM_HC_SNP_GET_CERTS; >> vcpu->run->hypercall.args[0] = data_gpa; >> vcpu->run->hypercall.args[1] = vcpu->arch.regs[VCPU_REGS_RBX]; >> vcpu->run->hypercall.flags = KVM_EXIT_HYPERCALL_LONG_MODE; > > btw why is it _LONG_MODE and not just _64? :) > >> vcpu->arch.complete_userspace_io = snp_complete_ext_guest_request; >> return 0; >> } > > This should work the KVM stored certs nicely but not for the global certs. > Although I am not all convinced that global certs is all that valuable but > I do not know the history of that, happened before I joined so I let > others to comment on that. Thanks, Global certs was the original implementation because it was intended to provide the VCEK, ASK, and ARK. These will be the same for all SNP guests that are launched. The original intention was also to not make the kernel have to manage multiple certificates and instead just treat the data as a blob provided from user-space. The per-VM change was added to allow a per-VM certificates. If a provider has no need to use this, then only the global certs blob is needed which reduces the amount of memory needed for the VM. Thanks, Tom > >
On Wed, Oct 18, 2023 at 06:48:59AM -0700, Sean Christopherson wrote: > On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote: > > > > On 18/10/23 03:27, Sean Christopherson wrote: > > > On Mon, Oct 16, 2023, Dionna Amalie Glaze wrote: > > > > > + > > > > > + /* > > > > > + * If a VMM-specific certificate blob hasn't been provided, grab the > > > > > + * host-wide one. > > > > > + */ > > > > > + snp_certs = sev_snp_certs_get(sev->snp_certs); > > > > > + if (!snp_certs) > > > > > + snp_certs = sev_snp_global_certs_get(); > > > > > + > > > > > > > > This is where the generation I suggested adding would get checked. If > > > > the instance certs' generation is not the global generation, then I > > > > think we need a way to return to the VMM to make that right before > > > > continuing to provide outdated certificates. > > > > This might be an unreasonable request, but the fact that the certs and > > > > reported_tcb can be set while a VM is running makes this an issue. > > > > > > Before we get that far, the changelogs need to explain why the kernel is storing > > > userspace blobs in the first place. The whole thing is a bit of a mess. > > > > > > sev_snp_global_certs_get() has data races that could lead to variations of TOCTOU > > > bugs: sev_ioctl_snp_set_config() can overwrite psp_master->sev_data->snp_certs > > > while sev_snp_global_certs_get() is running. If the compiler reloads snp_certs > > > between bumping the refcount and grabbing the pointer, KVM will end up leaking a > > > refcount and consuming a pointer without a refcount. > > > > > > if (!kref_get_unless_zero(&certs->kref)) > > > return NULL; > > > > > > return certs; > > > > I'm missing something here. The @certs pointer is on the stack, > > No, nothing guarantees that @certs is on the stack and will never be reloaded. > sev_snp_certs_get() is in full view of sev_snp_global_certs_get(), so it's entirely > possible that it can be inlined. Then you end up with: > > struct sev_device *sev; > > if (!psp_master || !psp_master->sev_data) > return NULL; > > sev = psp_master->sev_data; > if (!sev->snp_initialized) > return NULL; > > if (!sev->snp_certs) > return NULL; > > if (!kref_get_unless_zero(&sev->snp_certs->kref)) > return NULL; > > return sev->snp_certs; > > At which point the compiler could choose to omit a local variable entirely, it > could store @certs in a register and reload after kref_get_unless_zero(), etc. > If psp_master->sev_data->snp_certs is changed at any point, odd thing can happen. > > That atomic operation in kref_get_unless_zero() might prevent a reload between > getting the kref and the return, but it wouldn't prevent a reload between the > !NULL check and kref_get_unless_zero(). > > > > If userspace wants to provide garbage to the guest, so be it, not KVM's problem. > > > That way, whether the VM gets the global cert or a per-VM cert is purely a userspace > > > concern. > > > > The global cert lives in CCP (/dev/sev), the per VM cert lives in kvmvm_fd. > > "A la vcpu->run" is fine for the latter but for the former we need something > > else. > > Why? The cert ultimately comes from userspace, no? Make userspace deal with it. > > > And there is scenario when one global certs blob is what is needed and > > copying it over multiple VMs seems suboptimal. > > That's a solvable problem. I'm not sure I like the most obvious solution, but it > is a solution: let userspace define a KVM-wide blob pointer, either via .mmap() > or via an ioctl(). > > FWIW, there's no need to do .mmap() shenanigans, e.g. an ioctl() to set the > userspace pointer would suffice. The benefit of a kernel controlled pointer is > that it doesn't require copying to a kernel buffer (or special code to copy from > userspace into guest). > > Actually, looking at the flow again, AFAICT there's nothing special about the > target DATA_PAGE. It must be SHARED *before* SVM_VMGEXIT_EXT_GUEST_REQUEST, i.e. > KVM doesn't need to do conversions, there's no kernel priveleges required, etc. > And the GHCB doesn't dictate ordering between storing the certificates and doing > the request. That means the certificate stuff can be punted entirely to usersepace. > > Heh, typing up the below, there's another bug: KVM will incorrectly "return" '0' > for non-SNP guests: Thanks for the catch, will fix that up. > > unsigned long exitcode = 0; > u64 data_gpa; > int err, rc; > > if (!sev_snp_guest(vcpu->kvm)) { > rc = SEV_RET_INVALID_GUEST; <= sets "rc", not "exitcode" > goto e_fail; > } > > e_fail: > ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode); > > Which really highlights that we need to get test infrastructure up and running > for SEV-ES, SNP, and TDX. > > Anyways, back to punting to userspace. Here's a rough sketch. The only new uAPI > is the definition of KVM_HC_SNP_GET_CERTS and its arguments. This sketch seems like a good, flexible way to handle per-VM certs, but it does complicate things from a userspace perspective. As a basic requirement, all userspaces will need to provide a way to specify the initial blob (either a very verbose base64-encoded userspace cmdline param, or a filepatch that needs additional management to store and handle permissions/etc.), and also a means to update it (e.g. a HMP/QMP command for QEMU, some libvirt wrappers, etc.). That's all well and good if you want to make use of per-VM certs, but we don't necessarily expect that most deployments will necessarily want to deal with per-VM certs, and would be happy with a system-wide one where they could simply issue the /dev/sev ioctl to inject one automatically for all guests. So we're sort of complicating the more common case to support a more niche one (as far as userspace is concerned anyway; as far as kernel goes, your approach is certainly simplest :)). Instead, maybe a compromise is warranted so the requirements on userspace side are less complicated for a more basic deployment: 1) If /dev/sev is used to set a global certificate, then that will be used unconditionally by KVM, protected by simple dumb mutex during usage/update. 2) If /dev/sev is not used to set the global certificate is the value is NULL, we assume userspace wants full responsibility for managing certificates and exit to userspace to request the certs in the manner you suggested. Sean, Dionna, would this cover your concerns and address the certificate update use-case? -Mike
On Fri, Nov 10, 2023, Michael Roth wrote: > On Wed, Oct 18, 2023 at 06:48:59AM -0700, Sean Christopherson wrote: > > On Wed, Oct 18, 2023, Alexey Kardashevskiy wrote: > > Anyways, back to punting to userspace. Here's a rough sketch. The only new uAPI > > is the definition of KVM_HC_SNP_GET_CERTS and its arguments. > > This sketch seems like a good, flexible way to handle per-VM certs, but > it does complicate things from a userspace perspective. As a basic > requirement, all userspaces will need to provide a way to specify the > initial blob (either a very verbose base64-encoded userspace cmdline param, > or a filepatch that needs additional management to store and handle > permissions/etc.), and also a means to update it (e.g. a HMP/QMP command > for QEMU, some libvirt wrappers, etc.). > > That's all well and good if you want to make use of per-VM certs, but we > don't necessarily expect that most deployments will necessarily want to deal > with per-VM certs, and would be happy with a system-wide one where they could > simply issue the /dev/sev ioctl to inject one automatically for all guests. > > So we're sort of complicating the more common case to support a more niche > one (as far as userspace is concerned anyway; as far as kernel goes, your > approach is certainly simplest :)). > > Instead, maybe a compromise is warranted so the requirements on userspace > side are less complicated for a more basic deployment: > > 1) If /dev/sev is used to set a global certificate, then that will be > used unconditionally by KVM, protected by simple dumb mutex during > usage/update. > 2) If /dev/sev is not used to set the global certificate is the value > is NULL, we assume userspace wants full responsibility for managing > certificates and exit to userspace to request the certs in the manner > you suggested. > > Sean, Dionna, would this cover your concerns and address the certificate > update use-case? Honestly, no. I see zero reason for the kernel to be involved. IIUC, there's no privileged operations that require kernel intervention, which means that shoving a global cert into /dev/sev is using the CCP driver as middleman. Just use a userspace daemon. I have a very hard time believing that passing around large-ish blobs of data in userspace isn't already a solved problem.
> > So we're sort of complicating the more common case to support a more niche > > one (as far as userspace is concerned anyway; as far as kernel goes, your > > approach is certainly simplest :)). > > > > Instead, maybe a compromise is warranted so the requirements on userspace > > side are less complicated for a more basic deployment: > > > > 1) If /dev/sev is used to set a global certificate, then that will be > > used unconditionally by KVM, protected by simple dumb mutex during > > usage/update. > > 2) If /dev/sev is not used to set the global certificate is the value > > is NULL, we assume userspace wants full responsibility for managing > > certificates and exit to userspace to request the certs in the manner > > you suggested. > > > > Sean, Dionna, would this cover your concerns and address the certificate > > update use-case? > > Honestly, no. I see zero reason for the kernel to be involved. IIUC, there's no > privileged operations that require kernel intervention, which means that shoving > a global cert into /dev/sev is using the CCP driver as middleman. Just use a > userspace daemon. I have a very hard time believing that passing around large-ish > blobs of data in userspace isn't already a solved problem. ping sathyanarayanan.kuppuswamy@linux.intel.com and +Dan Williams I think for a uniform experience for all coco technologies, we need someone from Intel to weigh in on supporting auxblob through a similar vmexit. Whereas the quoting enclave gets its PCK cert installed by the host, something like the firmware's SBOM [1] could be delivered in auxblob. The proposal to embed the compressed SBOM binary in a coff section of the UEFI doesn't get it communicated to user space, so this is a good place to get that info about the expected TDMR in. The SBOM proposal itself would need additional modeling in the coRIM profile to have extra coco-specific measurements or we need to find some other method of getting this info bundled with the attestation report. My own plan for SEV-SNP was to have a bespoke signed measurement of the UEFI in the GUID table, but that doesn't extend to TDX. If we're looking more at an industry alignment on coRIM for SBOM formats (yes please), then it'd be great to start getting that kind of info plumbed to the user in a uniform way that doesn't have to rely on servers providing the endorsements. [1] https://uefi.org/blog/firmware-sbom-proposal
[ add Dan Middleton for his awareness ] Dionna Amalie Glaze wrote: > > > So we're sort of complicating the more common case to support a more niche > > > one (as far as userspace is concerned anyway; as far as kernel goes, your > > > approach is certainly simplest :)). > > > > > > Instead, maybe a compromise is warranted so the requirements on userspace > > > side are less complicated for a more basic deployment: > > > > > > 1) If /dev/sev is used to set a global certificate, then that will be > > > used unconditionally by KVM, protected by simple dumb mutex during > > > usage/update. > > > 2) If /dev/sev is not used to set the global certificate is the value > > > is NULL, we assume userspace wants full responsibility for managing > > > certificates and exit to userspace to request the certs in the manner > > > you suggested. > > > > > > Sean, Dionna, would this cover your concerns and address the certificate > > > update use-case? > > > > Honestly, no. I see zero reason for the kernel to be involved. IIUC, there's no > > privileged operations that require kernel intervention, which means that shoving > > a global cert into /dev/sev is using the CCP driver as middleman. Just use a > > userspace daemon. I have a very hard time believing that passing around large-ish > > blobs of data in userspace isn't already a solved problem. > > ping sathyanarayanan.kuppuswamy@linux.intel.com and +Dan Williams Apologies Dionna, I missed this earlier. > > I think for a uniform experience for all coco technologies, we need > someone from Intel to weigh in on supporting auxblob through a similar > vmexit. Whereas the quoting enclave gets its PCK cert installed by the > host, something like the firmware's SBOM [1] could be delivered in > auxblob. The proposal to embed the compressed SBOM binary in a coff > section of the UEFI doesn't get it communicated to user space, so this > is a good place to get that info about the expected TDMR in. The SBOM > proposal itself would need additional modeling in the coRIM profile to > have extra coco-specific measurements or we need to find some other > method of getting this info bundled with the attestation report. SBOM looks different than the SEV use case of @auxblob to convey a certificate chain. Are you asking for @auxblob to be SBOM on TDX and a certchain on SEV, or unifying the @auxblob format on SBOM? > My own plan for SEV-SNP was to have a bespoke signed measurement of > the UEFI in the GUID table, but that doesn't extend to TDX. If we're > looking more at an industry alignment on coRIM for SBOM formats (yes > please), then it'd be great to start getting that kind of info plumbed > to the user in a uniform way that doesn't have to rely on servers > providing the endorsements. > > [1] https://uefi.org/blog/firmware-sbom-proposal Honestly my first reaction for this ABI would be for a new file under /sys/firmware/efi/efivars or similar.
On Mon, Dec 4, 2023 at 4:30 PM Dan Williams <dan.j.williams@intel.com> wrote: > > [ add Dan Middleton for his awareness ] > > Dionna Amalie Glaze wrote: > > > > So we're sort of complicating the more common case to support a more niche > > > > one (as far as userspace is concerned anyway; as far as kernel goes, your > > > > approach is certainly simplest :)). > > > > > > > > Instead, maybe a compromise is warranted so the requirements on userspace > > > > side are less complicated for a more basic deployment: > > > > > > > > 1) If /dev/sev is used to set a global certificate, then that will be > > > > used unconditionally by KVM, protected by simple dumb mutex during > > > > usage/update. > > > > 2) If /dev/sev is not used to set the global certificate is the value > > > > is NULL, we assume userspace wants full responsibility for managing > > > > certificates and exit to userspace to request the certs in the manner > > > > you suggested. > > > > > > > > Sean, Dionna, would this cover your concerns and address the certificate > > > > update use-case? > > > > > > Honestly, no. I see zero reason for the kernel to be involved. IIUC, there's no > > > privileged operations that require kernel intervention, which means that shoving > > > a global cert into /dev/sev is using the CCP driver as middleman. Just use a > > > userspace daemon. I have a very hard time believing that passing around large-ish > > > blobs of data in userspace isn't already a solved problem. > > > > ping sathyanarayanan.kuppuswamy@linux.intel.com and +Dan Williams > > Apologies Dionna, I missed this earlier. > No worries, I've been sick anyway. > > > > I think for a uniform experience for all coco technologies, we need > > someone from Intel to weigh in on supporting auxblob through a similar > > vmexit. Whereas the quoting enclave gets its PCK cert installed by the > > host, something like the firmware's SBOM [1] could be delivered in > > auxblob. The proposal to embed the compressed SBOM binary in a coff > > section of the UEFI doesn't get it communicated to user space, so this > > is a good place to get that info about the expected TDMR in. The SBOM > > proposal itself would need additional modeling in the coRIM profile to > > have extra coco-specific measurements or we need to find some other > > method of getting this info bundled with the attestation report. > > SBOM looks different than the SEV use case of @auxblob to convey a > certificate chain. The SEV use case has a GUID table in which we're allowed to provide more than just the VCEK certificate chain. I'm using it to deliver a UEFI endorsement document as well. > > Are you asking for @auxblob to be SBOM on TDX and a certchain on SEV, or > unifying the @auxblob format on SBOM? Given SEV is both certchain and SBOM and TDX doesn't need a certchain in auxblob, I'd just be looking at delivering SBOM in auxblob for TDX. It's probably better to have something extensible though, like SEV's GUID table format. We may want to provide cached TDI RIMs too, for example. > > > My own plan for SEV-SNP was to have a bespoke signed measurement of > > the UEFI in the GUID table, but that doesn't extend to TDX. If we're > > looking more at an industry alignment on coRIM for SBOM formats (yes > > please), then it'd be great to start getting that kind of info plumbed > > to the user in a uniform way that doesn't have to rely on servers > > providing the endorsements. > > > > [1] https://uefi.org/blog/firmware-sbom-proposal > > Honestly my first reaction for this ABI would be for a new file under > /sys/firmware/efi/efivars or similar. For UEFI specifically that could make sense, yes. Not everyone has been mounting efivars, so it's been a bit of an uphill battle for that one. Still there's the matter of cached TDI RIMs. NVIDIA would have everyone send attestation requests to their servers every quote request in the NRAS architecture, but we're looking at other ways to provide reliable attestation without a third party service, albeit with slightly different security properties.
[ add Ard for the SBOM sysfs ABI commentary ] Dionna Amalie Glaze wrote: [..] > > > My own plan for SEV-SNP was to have a bespoke signed measurement of > > > the UEFI in the GUID table, but that doesn't extend to TDX. If we're > > > looking more at an industry alignment on coRIM for SBOM formats (yes > > > please), then it'd be great to start getting that kind of info plumbed > > > to the user in a uniform way that doesn't have to rely on servers > > > providing the endorsements. > > > > > > [1] https://uefi.org/blog/firmware-sbom-proposal > > > > Honestly my first reaction for this ABI would be for a new file under > > /sys/firmware/efi/efivars or similar. > > For UEFI specifically that could make sense, yes. Not everyone has > been mounting efivars, so it's been a bit of an uphill battle for that > one. I wonder what the concern is with mounting efivarfs vs configfs? In any event this seems distinct enough to be its own /sys/firmware/efi/sbom file. I would defer to Ard, but I think SBOM is a generally useful concept that would be out of place as a blob returned from configfs-tsm. > Still there's the matter of cached TDI RIMs. NVIDIA would have I am not immediatly sure what a "TDI RIM" is? > everyone send attestation requests to their servers every quote > request in the NRAS architecture, but we're looking at other ways to "NRAS" does not parse for me either. > provide reliable attestation without a third party service, albeit > with slightly different security properties. Setting the above confusion aside, I would just say that in general yes, the kernel needs to understand its role in an end-to-end attestation architecture that is not beholden to a single vendor, but also allows the kernel to enforce ABI stability / mitigate regressions based on binary format changes.
On Tue, Dec 5, 2023 at 12:06 PM Dan Williams <dan.j.williams@intel.com> wrote: > > [ add Ard for the SBOM sysfs ABI commentary ] > > Dionna Amalie Glaze wrote: > [..] > > > > My own plan for SEV-SNP was to have a bespoke signed measurement of > > > > the UEFI in the GUID table, but that doesn't extend to TDX. If we're > > > > looking more at an industry alignment on coRIM for SBOM formats (yes > > > > please), then it'd be great to start getting that kind of info plumbed > > > > to the user in a uniform way that doesn't have to rely on servers > > > > providing the endorsements. > > > > > > > > [1] https://uefi.org/blog/firmware-sbom-proposal > > > > > > Honestly my first reaction for this ABI would be for a new file under > > > /sys/firmware/efi/efivars or similar. > > > > For UEFI specifically that could make sense, yes. Not everyone has > > been mounting efivars, so it's been a bit of an uphill battle for that > > one. > > I wonder what the concern is with mounting efivarfs vs configfs? In any > event this seems distinct enough to be its own /sys/firmware/efi/sbom > file. I would defer to Ard, but I think SBOM is a generally useful > concept that would be out of place as a blob returned from configfs-tsm. > > > Still there's the matter of cached TDI RIMs. NVIDIA would have > > I am not immediatly sure what a "TDI RIM" is? > I might just be making up terms. Any trusted hardware device that has its own attestation will (hopefully) have signed reference measurements, or a Reference Integrity Manifest as TCG calls them. > > everyone send attestation requests to their servers every quote > > request in the NRAS architecture, but we're looking at other ways to > > "NRAS" does not parse for me either. > That would be this https://docs.attestation.nvidia.com/api-docs/nras.html > > provide reliable attestation without a third party service, albeit > > with slightly different security properties. > > Setting the above confusion aside, I would just say that in general yes, > the kernel needs to understand its role in an end-to-end attestation > architecture that is not beholden to a single vendor, but also allows > the kernel to enforce ABI stability / mitigate regressions based on > binary format changes. > I'm mainly holding on to hope that I don't have to introduce a new runtime dependency on a service that gives a source of truth about the software that's running in the VM. If we can have a GUID table with a flexible size that the host can request of the guest, then we can version ABI changes with new GUID entries. It's a big enough value space without vanity naming opportunities that we can pretty easily make changes without incurring any guest kernel changes.
Dionna Amalie Glaze wrote: > On Tue, Dec 5, 2023 at 12:06 PM Dan Williams <dan.j.williams@intel.com> wrote: > > > > [ add Ard for the SBOM sysfs ABI commentary ] > > > > Dionna Amalie Glaze wrote: > > [..] > > > > > My own plan for SEV-SNP was to have a bespoke signed measurement of > > > > > the UEFI in the GUID table, but that doesn't extend to TDX. If we're > > > > > looking more at an industry alignment on coRIM for SBOM formats (yes > > > > > please), then it'd be great to start getting that kind of info plumbed > > > > > to the user in a uniform way that doesn't have to rely on servers > > > > > providing the endorsements. > > > > > > > > > > [1] https://uefi.org/blog/firmware-sbom-proposal > > > > > > > > Honestly my first reaction for this ABI would be for a new file under > > > > /sys/firmware/efi/efivars or similar. > > > > > > For UEFI specifically that could make sense, yes. Not everyone has > > > been mounting efivars, so it's been a bit of an uphill battle for that > > > one. > > > > I wonder what the concern is with mounting efivarfs vs configfs? In any > > event this seems distinct enough to be its own /sys/firmware/efi/sbom > > file. I would defer to Ard, but I think SBOM is a generally useful > > concept that would be out of place as a blob returned from configfs-tsm. > > > > > Still there's the matter of cached TDI RIMs. NVIDIA would have > > > > I am not immediatly sure what a "TDI RIM" is? > > > > I might just be making up terms. Any trusted hardware device that has > its own attestation will (hopefully) have signed reference > measurements, or a Reference Integrity Manifest as TCG calls them. Ah, ok. > > > > everyone send attestation requests to their servers every quote > > > request in the NRAS architecture, but we're looking at other ways to > > > > "NRAS" does not parse for me either. > > > > That would be this https://docs.attestation.nvidia.com/api-docs/nras.html Thanks! > > > provide reliable attestation without a third party service, albeit > > > with slightly different security properties. > > > > Setting the above confusion aside, I would just say that in general yes, > > the kernel needs to understand its role in an end-to-end attestation > > architecture that is not beholden to a single vendor, but also allows > > the kernel to enforce ABI stability / mitigate regressions based on > > binary format changes. > > > > I'm mainly holding on to hope that I don't have to introduce a new > runtime dependency on a service that gives a source of truth about the > software that's running in the VM. > If we can have a GUID table with a flexible size that the host can > request of the guest, then we can version ABI changes with new GUID > entries. > It's a big enough value space without vanity naming opportunities that > we can pretty easily make changes without incurring any guest kernel > changes. So it's not only SBOM that you are concerned about, but instead want to have a one stop shop for auxiliary evidence and get the vendors agree on following the same GUID+blob precedent that is already there for the AMD cert chain? That sounds reasonable, but I still feel it should be limited to things that do not fit into an existing ABI namespace. ...unless its evidence / material that only a TVM would ever need.
> So it's not only SBOM that you are concerned about, but instead want to > have a one stop shop for auxiliary evidence and get the vendors agree on > following the same GUID+blob precedent that is already there for the AMD > cert chain? That sounds reasonable, but I still feel it should be > limited to things that do not fit into an existing ABI namespace. > The tl;dr is I want something simple for a hard problem and I'll probably lose this argument. The unfortunate state of affairs here is that it is "vendor dependent" whatever pathway they choose to provide reference material for attestations. Even with the RATS framework, "reference value provider" is just a bubble in a diagram and not a federated service protocol that we've all agreed on. TCG's recommendation for delivering the RIM in the efi volume doesn't quite work in the cloud setting, since images own that and not us as the firmware provider. There's no standard for how to inform a user where to get a RIM other than that :( > ...unless its evidence / material that only a TVM would ever need. There's the rub. Evidence may be provided to the TVM to forward to verifiers, but really it's the verifiers that are tasked with gathering all this attestation collateral. The TVM doesn't have to do anything, even provide machine-specific certificates. That's pretty dreadful system design though, since you end up with global services in your hot path rather than getting cached data from the machine you're getting an attestation from. My first stab at it is to just have a storage bucket with objects named based on expected measurement values, so you just construct a URL and download the endorsement if you need it. I'd really rather just make that part of what the guest can get from auxblob since they pay the bandwidth of forwarding it to verifiers rather than my cost center paying for the storage bucket bandwidth (is this the real reason I'm pushing on this? I'm unsure). If you're instead asking if this information would need to get piped to a non-TVM (say, a non-confidential VM with a virtual TPM), then the answer is maa~aaa~aaybe but probably not. PCR0 in the cloud really needs its own profile, since the TCG platform firmware profile (PFP) is unsuitable. There's not a whole lot of point delivering a signed endorsement of a firmware measurement that we don't include in the event log anyway for stability reasons, even if that's against the PFP spec. So probably not. I think we're pretty clear that host-cached data would only need to be for TVMs. If you ask the folks I've been talking to at Intel or NVIDIA, they'd probably say to put a service in your dependencies and be done with it; it's the vendor's responsibility to provide an available enough service to provide all the evidence that an attestation verifier may want. That's quite unfriendly to the smaller players in the field, but maybe it's easy to integrate with something like Veraison's endorsement provisioning API [1] or Intel's Trust Authority (née Project Amber). I haven't done it. [1] https://github.com/veraison/docs/tree/main/api/endorsement-provisioning
diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 602aaf82eef3..d71ec257debb 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -19,6 +19,7 @@ #include <linux/misc_cgroup.h> #include <linux/processor.h> #include <linux/trace_events.h> +#include <uapi/linux/sev-guest.h> #include <asm/pkru.h> #include <asm/trapnr.h> @@ -339,6 +340,8 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) ret = verify_snp_init_flags(kvm, argp); if (ret) goto e_free; + + mutex_init(&sev->guest_req_lock); } ret = sev_platform_init(&argp->error); @@ -2345,8 +2348,10 @@ static int snp_get_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp) static void snp_replace_certs(struct kvm_sev_info *sev, struct sev_snp_certs *snp_certs) { + mutex_lock(&sev->guest_req_lock); sev_snp_certs_put(sev->snp_certs); sev->snp_certs = snp_certs; + mutex_unlock(&sev->guest_req_lock); } static int snp_set_instance_certs(struct kvm *kvm, struct kvm_sev_cmd *argp) @@ -3218,6 +3223,8 @@ static int sev_es_validate_vmgexit(struct vcpu_svm *svm) case SVM_VMGEXIT_HV_FEATURES: case SVM_VMGEXIT_PSC: case SVM_VMGEXIT_TERM_REQUEST: + case SVM_VMGEXIT_GUEST_REQUEST: + case SVM_VMGEXIT_EXT_GUEST_REQUEST: break; default: reason = GHCB_ERR_INVALID_EVENT; @@ -3627,6 +3634,163 @@ static int sev_snp_ap_creation(struct vcpu_svm *svm) return ret; } +static unsigned long snp_setup_guest_buf(struct vcpu_svm *svm, + struct sev_data_snp_guest_request *data, + gpa_t req_gpa, gpa_t resp_gpa) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + struct kvm *kvm = vcpu->kvm; + kvm_pfn_t req_pfn, resp_pfn; + struct kvm_sev_info *sev; + + sev = &to_kvm_svm(kvm)->sev_info; + + if (!IS_ALIGNED(req_gpa, PAGE_SIZE) || !IS_ALIGNED(resp_gpa, PAGE_SIZE)) + return SEV_RET_INVALID_PARAM; + + req_pfn = gfn_to_pfn(kvm, gpa_to_gfn(req_gpa)); + if (is_error_noslot_pfn(req_pfn)) + return SEV_RET_INVALID_ADDRESS; + + resp_pfn = gfn_to_pfn(kvm, gpa_to_gfn(resp_gpa)); + if (is_error_noslot_pfn(resp_pfn)) + return SEV_RET_INVALID_ADDRESS; + + if (rmp_make_private(resp_pfn, 0, PG_LEVEL_4K, 0, true)) + return SEV_RET_INVALID_ADDRESS; + + data->gctx_paddr = __psp_pa(sev->snp_context); + data->req_paddr = __sme_set(req_pfn << PAGE_SHIFT); + data->res_paddr = __sme_set(resp_pfn << PAGE_SHIFT); + + return 0; +} + +static void snp_cleanup_guest_buf(struct sev_data_snp_guest_request *data, unsigned long *rc) +{ + u64 pfn = __sme_clr(data->res_paddr) >> PAGE_SHIFT; + int ret; + + ret = snp_page_reclaim(pfn); + if (ret) + *rc = SEV_RET_INVALID_ADDRESS; + + ret = rmp_make_shared(pfn, PG_LEVEL_4K); + if (ret) + *rc = SEV_RET_INVALID_ADDRESS; +} + +static void snp_handle_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa) +{ + struct sev_data_snp_guest_request data = {0}; + struct kvm_vcpu *vcpu = &svm->vcpu; + struct kvm *kvm = vcpu->kvm; + struct kvm_sev_info *sev; + unsigned long rc; + int err; + + if (!sev_snp_guest(vcpu->kvm)) { + rc = SEV_RET_INVALID_GUEST; + goto e_fail; + } + + sev = &to_kvm_svm(kvm)->sev_info; + + mutex_lock(&sev->guest_req_lock); + + rc = snp_setup_guest_buf(svm, &data, req_gpa, resp_gpa); + if (rc) + goto unlock; + + rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &data, &err); + if (rc) + /* Ensure an error value is returned to guest. */ + rc = err ? err : SEV_RET_INVALID_ADDRESS; + + snp_cleanup_guest_buf(&data, &rc); + +unlock: + mutex_unlock(&sev->guest_req_lock); + +e_fail: + ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, rc); +} + +static void snp_handle_ext_guest_request(struct vcpu_svm *svm, gpa_t req_gpa, gpa_t resp_gpa) +{ + struct sev_data_snp_guest_request req = {0}; + struct sev_snp_certs *snp_certs = NULL; + struct kvm_vcpu *vcpu = &svm->vcpu; + struct kvm *kvm = vcpu->kvm; + unsigned long data_npages; + struct kvm_sev_info *sev; + unsigned long exitcode = 0; + u64 data_gpa; + int err, rc; + + if (!sev_snp_guest(vcpu->kvm)) { + rc = SEV_RET_INVALID_GUEST; + goto e_fail; + } + + sev = &to_kvm_svm(kvm)->sev_info; + + data_gpa = vcpu->arch.regs[VCPU_REGS_RAX]; + data_npages = vcpu->arch.regs[VCPU_REGS_RBX]; + + if (!IS_ALIGNED(data_gpa, PAGE_SIZE)) { + exitcode = SEV_RET_INVALID_ADDRESS; + goto e_fail; + } + + mutex_lock(&sev->guest_req_lock); + + rc = snp_setup_guest_buf(svm, &req, req_gpa, resp_gpa); + if (rc) + goto unlock; + + /* + * If a VMM-specific certificate blob hasn't been provided, grab the + * host-wide one. + */ + snp_certs = sev_snp_certs_get(sev->snp_certs); + if (!snp_certs) + snp_certs = sev_snp_global_certs_get(); + + /* + * If there is a host-wide or VMM-specific certificate blob available, + * make sure the guest has allocated enough space to store it. + * Otherwise, inform the guest how much space is needed. + */ + if (snp_certs && (data_npages << PAGE_SHIFT) < snp_certs->len) { + vcpu->arch.regs[VCPU_REGS_RBX] = snp_certs->len >> PAGE_SHIFT; + exitcode = SNP_GUEST_VMM_ERR(SNP_GUEST_VMM_ERR_INVALID_LEN); + goto cleanup; + } + + rc = sev_issue_cmd(kvm, SEV_CMD_SNP_GUEST_REQUEST, &req, &err); + if (rc) { + /* pass the firmware error code */ + exitcode = err; + goto cleanup; + } + + /* Copy the certificate blob in the guest memory */ + if (snp_certs && + kvm_write_guest(kvm, data_gpa, snp_certs->data, snp_certs->len)) + exitcode = SEV_RET_INVALID_ADDRESS; + +cleanup: + sev_snp_certs_put(snp_certs); + snp_cleanup_guest_buf(&req, &exitcode); + +unlock: + mutex_unlock(&sev->guest_req_lock); + +e_fail: + ghcb_set_sw_exit_info_2(svm->sev_es.ghcb, exitcode); +} + static int sev_handle_vmgexit_msr_protocol(struct vcpu_svm *svm) { struct vmcb_control_area *control = &svm->vmcb->control; @@ -3894,6 +4058,18 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu) vcpu->run->system_event.ndata = 1; vcpu->run->system_event.data[0] = control->ghcb_gpa; break; + case SVM_VMGEXIT_GUEST_REQUEST: + snp_handle_guest_request(svm, control->exit_info_1, control->exit_info_2); + + ret = 1; + break; + case SVM_VMGEXIT_EXT_GUEST_REQUEST: + snp_handle_ext_guest_request(svm, + control->exit_info_1, + control->exit_info_2); + + ret = 1; + break; case SVM_VMGEXIT_UNSUPPORTED_EVENT: vcpu_unimpl(vcpu, "vmgexit: unsupported event - exit_info_1=%#llx, exit_info_2=%#llx\n", diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index bdf792ba06e1..3673a6e4e22e 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -98,6 +98,7 @@ struct kvm_sev_info { void *snp_context; /* SNP guest context page */ u64 sev_features; /* Features set at VMSA creation */ struct sev_snp_certs *snp_certs; + struct mutex guest_req_lock; /* Lock for guest request handling */ }; struct kvm_svm { diff --git a/drivers/crypto/ccp/sev-dev.c b/drivers/crypto/ccp/sev-dev.c index 4807ddd6ec52..f9c75c561c4e 100644 --- a/drivers/crypto/ccp/sev-dev.c +++ b/drivers/crypto/ccp/sev-dev.c @@ -2109,6 +2109,21 @@ void sev_snp_certs_put(struct sev_snp_certs *certs) } EXPORT_SYMBOL_GPL(sev_snp_certs_put); +struct sev_snp_certs *sev_snp_global_certs_get(void) +{ + struct sev_device *sev; + + if (!psp_master || !psp_master->sev_data) + return NULL; + + sev = psp_master->sev_data; + if (!sev->snp_initialized) + return NULL; + + return sev_snp_certs_get(sev->snp_certs); +} +EXPORT_SYMBOL_GPL(sev_snp_global_certs_get); + static void sev_exit(struct kref *ref) { misc_deregister(&misc_dev->misc); diff --git a/include/linux/psp-sev.h b/include/linux/psp-sev.h index 722e26d28d2f..3b294ccbbec9 100644 --- a/include/linux/psp-sev.h +++ b/include/linux/psp-sev.h @@ -25,6 +25,7 @@ struct sev_snp_certs { struct sev_snp_certs *sev_snp_certs_new(void *data, u32 len); struct sev_snp_certs *sev_snp_certs_get(struct sev_snp_certs *certs); void sev_snp_certs_put(struct sev_snp_certs *certs); +struct sev_snp_certs *sev_snp_global_certs_get(void); /** * SEV platform state