Message ID: 20241030134912.515725-1-roypat@amazon.co.uk
Series: Direct Map Removal for guest_memfd
On 30.10.24 14:49, Patrick Roy wrote: > Unmapping virtual machine guest memory from the host kernel's direct map > is a successful mitigation against Spectre-style transient execution > issues: If the kernel page tables do not contain entries pointing to > guest memory, then any attempted speculative read through the direct map > will necessarily be blocked by the MMU before any observable > microarchitectural side-effects happen. This means that Spectre-gadgets > and similar cannot be used to target virtual machine memory. Roughly 60% > of speculative execution issues fall into this category [1, Table 1]. > > This patch series extends guest_memfd with the ability to remove its > memory from the host kernel's direct map, to be able to attain the above > protection for KVM guests running inside guest_memfd. > > === Changes to v2 === > > - Handle direct map removal for physically contiguous pages in arch code > (Mike R.) > - Track the direct map state in guest_memfd itself instead of at the > folio level, to prepare for huge pages support (Sean C.) > - Allow configuring direct map state of not-yet faulted in memory > (Vishal A.) > - Pay attention to alignment in ftrace structs (Steven R.) > > Most significantly, I've reduced the patch series to focus only on > direct map removal for guest_memfd for now, leaving the whole "how to do > non-CoCo VMs in guest_memfd" for later. If this separation is > acceptable, then I think I can drop the RFC tag in the next revision > (I've mainly kept it here because I'm not entirely sure what to do with > patches 3 and 4). Hi, keeping upcoming "shared and private memory in guest_memfd" in mind, I assume the focus would be to only remove the direct map for private memory? So in the current upstream state, you would only be removing the direct map for private memory, currently translating to "encrypted"/"protected" memory that is inaccessible either way already. Correct?
On 30.10.24 14:49, Patrick Roy wrote: > From: "Mike Rapoport (Microsoft)" <rppt@kernel.org> > > From: Mike Rapoport (Microsoft) <rppt@kernel.org> > > Add an API that will allow updates of the direct/linear map for a set of > physically contiguous pages. > > It will be used in the following patches. > > Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> > Signed-off-by: Patrick Roy <roypat@amazon.co.uk> [...] > #ifdef CONFIG_DEBUG_PAGEALLOC > void __kernel_map_pages(struct page *page, int numpages, int enable) > { > diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h > index e7aec20fb44f1..3030d9245f5ac 100644 > --- a/include/linux/set_memory.h > +++ b/include/linux/set_memory.h > @@ -34,6 +34,12 @@ static inline int set_direct_map_default_noflush(struct page *page) > return 0; > } > > +static inline int set_direct_map_valid_noflush(struct page *page, > + unsigned nr, bool valid) I recall that "unsigned" is frowned upon; "unsigned int". > +{ > + return 0; > +} Can we add some kernel doc for this? In particular (a) What does it mean when we return 0? That it worked? Then, this dummy function looks wrong. Or does it return the number of processed entries? Then we'd have a possible "int" vs. "unsigned int" inconsistency. (b) What are the semantics when we fail halfway through the operation when processing nr > 1? Is it "all or nothing"?
On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote: > On 30.10.24 14:49, Patrick Roy wrote: >> Unmapping virtual machine guest memory from the host kernel's direct map >> is a successful mitigation against Spectre-style transient execution >> issues: If the kernel page tables do not contain entries pointing to >> guest memory, then any attempted speculative read through the direct map >> will necessarily be blocked by the MMU before any observable >> microarchitectural side-effects happen. This means that Spectre-gadgets >> and similar cannot be used to target virtual machine memory. Roughly 60% >> of speculative execution issues fall into this category [1, Table 1]. >> >> This patch series extends guest_memfd with the ability to remove its >> memory from the host kernel's direct map, to be able to attain the above >> protection for KVM guests running inside guest_memfd. >> >> === Changes to v2 === >> >> - Handle direct map removal for physically contiguous pages in arch code >> (Mike R.) >> - Track the direct map state in guest_memfd itself instead of at the >> folio level, to prepare for huge pages support (Sean C.) >> - Allow configuring direct map state of not-yet faulted in memory >> (Vishal A.) >> - Pay attention to alignment in ftrace structs (Steven R.) >> >> Most significantly, I've reduced the patch series to focus only on >> direct map removal for guest_memfd for now, leaving the whole "how to do >> non-CoCo VMs in guest_memfd" for later. If this separation is >> acceptable, then I think I can drop the RFC tag in the next revision >> (I've mainly kept it here because I'm not entirely sure what to do with >> patches 3 and 4). > > Hi, > > keeping upcoming "shared and private memory in guest_memfd" in mind, I > assume the focus would be to only remove the direct map for private memory? 
> > So in the current upstream state, you would only be removing the direct > map for private memory, currently translating to "encrypted"/"protected" > memory that is inaccessible either way already. > > Correct? Yea, with the upcoming "shared and private" stuff, I would expect that the shared<->private conversions would call the routines from patch 3 to restore direct map entries on private->shared, and zap them on shared->private. But as you said, the current upstream state has no notion of "shared" memory in guest_memfd, so everything is private and thus everything is direct map removed (although it is indeed already inaccessible anyway for TDX and friends. That's what makes this patch series a bit awkward :( ) > -- > Cheers, > > David / dhildenb > Best, Patrick
On 10/30/24 08:49, Patrick Roy wrote: > Implement (yet unused) routines for manipulating guest_memfd direct map > state. This is largely for illustration purposes. > > kvm_gmem_set_direct_map allows manipulating arbitrary pgoff_t > ranges, even if the covered memory has not yet been faulted in (in which > case the requested direct map state is recorded in the xarray and will > be applied by kvm_gmem_folio_configure_direct_map after the folio is > faulted in and prepared/populated). This can be used to realize > private/shared conversions on not-yet-faulted in memory, as discussed in > the guest_memfd upstream call [1]. > > kvm_gmem_folio_set_direct_map allows manipulating the direct map entries > for a gmem folio that the caller already holds a reference for (whereas > kvm_gmem_set_direct_map needs to look up all folios intersecting the > given pgoff range in the filemap first). > > The xa lock serializes calls to kvm_gmem_folio_set_direct_map and > kvm_gmem_set_direct_map, while the read side > (kvm_gmem_folio_configure_direct_map) is protected by RCU. This is > sufficient to ensure consistency between the xarray and the folio's > actual direct map state, as kvm_gmem_folio_configure_direct_map is > called only for freshly allocated folios, and before the folio lock is > dropped for the first time, meaning kvm_gmem_folio_configure_direct_map > always does its set_direct_map calls before either of > kvm_gmem_[folio_]set_direct_map get a chance. Even if a concurrent call > to kvm_gmem_[folio_]set_direct_map happens, this ensures a sort of > "eventual consistency" between xarray and actual direct map > configuration by the time kvm_gmem_[folio_]set_direct_map exits. 
> > [1]: https://lore.kernel.org/kvm/4b49248b-1cf1-44dc-9b50-ee551e1671ac@redhat.com/ > > Signed-off-by: Patrick Roy <roypat@amazon.co.uk> > --- > virt/kvm/guest_memfd.c | 125 +++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 125 insertions(+) > > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c > index 50ffc2ad73eda..54387828dcc6a 100644 > --- a/virt/kvm/guest_memfd.c > +++ b/virt/kvm/guest_memfd.c > @@ -96,6 +96,131 @@ static int kvm_gmem_folio_configure_direct_map(struct folio *folio) > return r; > } > > +/* > + * Updates the range [@start, @end] in @gmem_priv's direct map state xarray to be @state, > + * e.g. erasing entries in this range if @state is the default state, and creating > + * entries otherwise. > + * > + * Assumes the xa_lock is held. > + */ > +static int __kvm_gmem_update_xarray(struct kvm_gmem_inode_private *gmem_priv, pgoff_t start, > + pgoff_t end, bool state) > +{ > + struct xarray *xa = &gmem_priv->direct_map_state; > + int r = 0; > + > + /* > + * Cannot use xa_store_range, as multi-indexes cannot easily > + * be partially updated. 
> + */ > + for (pgoff_t index = start; index < end; ++index) { > + if (state == gmem_priv->default_direct_map_state) > + __xa_erase(xa, index); > + else > + /* don't care _what_ we store in the xarray, only care about presence */ > + __xa_store(xa, index, gmem_priv, GFP_KERNEL); > + > + r = xa_err(xa); > + if (r) > + goto out; > + } > + > +out: > + return r; > +} > + > +static int __kvm_gmem_folio_set_direct_map(struct folio *folio, pgoff_t start, pgoff_t end, > + bool state) > +{ > + unsigned long npages = end - start + 1; > + struct page *first_page = folio_file_page(folio, start); > + > + int r = set_direct_map_valid_noflush(first_page, npages, state); > + > + flush_tlb_kernel_range((unsigned long)page_address(first_page), > + (unsigned long)page_address(first_page) + > + npages * PAGE_SIZE); > + return r; > +} > + > +/* > + * Updates the direct map status for the given range from @start to @end (inclusive), returning > + * -EINVAL if this range is not completely contained within @folio. Also updates the > + * xarray stored in the private data of the inode @folio is attached to. > + * > + * Takes and drops the folio lock. > + */ > +static __always_unused int kvm_gmem_folio_set_direct_map(struct folio *folio, pgoff_t start, > + pgoff_t end, bool state) > +{ > + struct inode *inode = folio_inode(folio); > + struct kvm_gmem_inode_private *gmem_priv = inode->i_private; > + int r = -EINVAL; > + > + if (!folio_contains(folio, start) || !folio_contains(folio, end)) > + goto out; > + > + xa_lock(&gmem_priv->direct_map_state); > + r = __kvm_gmem_update_xarray(gmem_priv, start, end, state); > + if (r) > + goto unlock_xa; > + > + folio_lock(folio); > + r = __kvm_gmem_folio_set_direct_map(folio, start, end, state); > + folio_unlock(folio); > + > +unlock_xa: > + xa_unlock(&gmem_priv->direct_map_state); > +out: > + return r; > +} > + > +/* > + * Updates the direct map status for the given range from @start to @end (inclusive) > + * of @inode. 
Folios in this range have their direct map entries reconfigured, > + * and the xarray in the @inode's private data is updated. > + */ > +static __always_unused int kvm_gmem_set_direct_map(struct inode *inode, pgoff_t start, > + pgoff_t end, bool state) > +{ > + struct kvm_gmem_inode_private *gmem_priv = inode->i_private; > + struct folio_batch fbatch; > + pgoff_t index = start; > + unsigned int count, i; > + int r = 0; > + > + xa_lock(&gmem_priv->direct_map_state); > + > + r = __kvm_gmem_update_xarray(gmem_priv, start, end, state); > + if (r) > + goto out; > + if (r) { xa_unlock(&gmem_priv->direct_map_state); goto out; } thanks, Mike > + folio_batch_init(&fbatch); > + while (!filemap_get_folios(inode->i_mapping, &index, end, &fbatch) && !r) { > + count = folio_batch_count(&fbatch); > + for (i = 0; i < count; i++) { > + struct folio *folio = fbatch.folios[i]; > + pgoff_t folio_start = max(folio_index(folio), start); > + pgoff_t folio_end = > + min(folio_index(folio) + folio_nr_pages(folio), > + end); > + > + folio_lock(folio); > + r = __kvm_gmem_folio_set_direct_map(folio, folio_start, > + folio_end, state); > + folio_unlock(folio); > + > + if (r) > + break; > + } > + folio_batch_release(&fbatch); > + } > + > + xa_unlock(&gmem_priv->direct_map_state); > +out: > + return r; > +} > + > /** > * folio_file_pfn - like folio_file_page, but return a pfn. > * @folio: The folio which contains this index.
On 31.10.24 11:42, Patrick Roy wrote: > On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote: >> On 30.10.24 14:49, Patrick Roy wrote: >>> Unmapping virtual machine guest memory from the host kernel's direct map >>> is a successful mitigation against Spectre-style transient execution >>> issues: If the kernel page tables do not contain entries pointing to >>> guest memory, then any attempted speculative read through the direct map >>> will necessarily be blocked by the MMU before any observable >>> microarchitectural side-effects happen. This means that Spectre-gadgets >>> and similar cannot be used to target virtual machine memory. Roughly 60% >>> of speculative execution issues fall into this category [1, Table 1]. >>> >>> This patch series extends guest_memfd with the ability to remove its >>> memory from the host kernel's direct map, to be able to attain the above >>> protection for KVM guests running inside guest_memfd. >>> >>> === Changes to v2 === >>> >>> - Handle direct map removal for physically contiguous pages in arch code >>> (Mike R.) >>> - Track the direct map state in guest_memfd itself instead of at the >>> folio level, to prepare for huge pages support (Sean C.) >>> - Allow configuring direct map state of not-yet faulted in memory >>> (Vishal A.) >>> - Pay attention to alignment in ftrace structs (Steven R.) >>> >>> Most significantly, I've reduced the patch series to focus only on >>> direct map removal for guest_memfd for now, leaving the whole "how to do >>> non-CoCo VMs in guest_memfd" for later. If this separation is >>> acceptable, then I think I can drop the RFC tag in the next revision >>> (I've mainly kept it here because I'm not entirely sure what to do with >>> patches 3 and 4). >> >> Hi, >> >> keeping upcoming "shared and private memory in guest_memfd" in mind, I >> assume the focus would be to only remove the direct map for private memory? 
>> >> So in the current upstream state, you would only be removing the direct >> map for private memory, currently translating to "encrypted"/"protected" >> memory that is inaccessible either way already. >> >> Correct? > > Yea, with the upcomming "shared and private" stuff, I would expect the > the shared<->private conversions would call the routines from patch 3 to > restore direct map entries on private->shared, and zap them on > shared->private. I wanted to follow up on the discussion we had in the bi-weekly call. We talked about shared (faultable) vs. private (unfaultable), and how it would interact with the directmap patches here. As discussed, having private (unfaultable) memory with the direct-map removed and shared (faultable) memory with the direct-mapping can make sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, the discussion here seems to indicate that it might currently not be required. So one thing we could do is that shared (faultable) will have a direct mapping and be gup-able and private (unfaultable) memory will not have a direct mapping and is, by design, not gup-able. Maybe it could make sense to not have a direct map for all guest_memfd memory, making it behave like secretmem (and it would be easy to implement)? But I'm not sure if that is really desirable in VM context. Having a mixture of "has directmap" and "has no directmap" for shared (faultable) memory should not be done. Similarly, private memory really should stay "unfaultable". I think one of the points raised during the bi-weekly call was that using a viommu/swiotlb might be the right call, such that all memory can be considered private (unfaultable) that is not explicitly shared/expected to be modified by the hypervisor (-> faultable, -> GUP-able). 
Further, I think Sean had some good points why we should explore that direction, but I recall that there were some issues to be sorted out (interpreted instructions requiring direct map when accessing "private" memory?), not sure if that is already working/can be made working in KVM. What's your opinion after the call and the next step for use cases like you have in mind (IIRC firecracker, which wants to not have the direct-map for guest memory where it can be avoided)?
Hi David, On 11/4/24 12:18, David Hildenbrand wrote: > On 31.10.24 11:42, Patrick Roy wrote: >> On Thu, 2024-10-31 at 09:50 +0000, David Hildenbrand wrote: >>> On 30.10.24 14:49, Patrick Roy wrote: >>>> Unmapping virtual machine guest memory from the host kernel's direct map >>>> is a successful mitigation against Spectre-style transient execution >>>> issues: If the kernel page tables do not contain entries pointing to >>>> guest memory, then any attempted speculative read through the direct map >>>> will necessarily be blocked by the MMU before any observable >>>> microarchitectural side-effects happen. This means that Spectre-gadgets >>>> and similar cannot be used to target virtual machine memory. Roughly 60% >>>> of speculative execution issues fall into this category [1, Table 1]. >>>> >>>> This patch series extends guest_memfd with the ability to remove its >>>> memory from the host kernel's direct map, to be able to attain the above >>>> protection for KVM guests running inside guest_memfd. >>>> >>>> === Changes to v2 === >>>> >>>> - Handle direct map removal for physically contiguous pages in arch code >>>> (Mike R.) >>>> - Track the direct map state in guest_memfd itself instead of at the >>>> folio level, to prepare for huge pages support (Sean C.) >>>> - Allow configuring direct map state of not-yet faulted in memory >>>> (Vishal A.) >>>> - Pay attention to alignment in ftrace structs (Steven R.) >>>> >>>> Most significantly, I've reduced the patch series to focus only on >>>> direct map removal for guest_memfd for now, leaving the whole "how to do >>>> non-CoCo VMs in guest_memfd" for later. If this separation is >>>> acceptable, then I think I can drop the RFC tag in the next revision >>>> (I've mainly kept it here because I'm not entirely sure what to do with >>>> patches 3 and 4). >>> >>> Hi, >>> >>> keeping upcoming "shared and private memory in guest_memfd" in mind, I >>> assume the focus would be to only remove the direct map for private memory? 
>>> >>> So in the current upstream state, you would only be removing the direct >>> map for private memory, currently translating to "encrypted"/"protected" >>> memory that is inaccessible either way already. >>> >>> Correct? >> >> Yea, with the upcomming "shared and private" stuff, I would expect the >> the shared<->private conversions would call the routines from patch 3 to >> restore direct map entries on private->shared, and zap them on >> shared->private. > > I wanted to follow-up to the discussion we had in the bi-weekly call. Thanks for summarizing! > We talked about shared (faultable) vs. private (unfaultable), and how it > would interact with the directmap patches here. > > As discussed, having private (unfaultable) memory with the direct-map > removed and shared (faultable) memory with the direct-mapping can make > sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, > the discussion here seems to indicate that it might currently not be > required. > > So one thing we could do is that shared (faultable) will have a direct > mapping and be gup-able and private (unfaultable) memory will not have a > direct mapping and is, by design, not gup-able.> > Maybe it could make sense to not have a direct map for all guest_memfd > memory, making it behave like secretmem (and it would be easy to > implement)? But I'm not sure if that is really desirable in VM context. This would work for us (in this scenario, the swiotlb areas would be "traditional" memory, e.g. set to shared via mem attributes instead of "shared" inside KVM), it's kinda what I had prototyped in my v1 of this series (well, we'd need to figure out how to get the mappings of gmem back into KVM, since in this setup, short-circuiting it into userspace_addr wouldn't work, unless we banish swiotlb into a different memslot altogether somehow). But I don't think it'd work for pKVM, iirc they need GUP on gmem, and also want direct map removal (... 
but maybe, the gmem VMA for non-CoCo usecase and the gmem VMA for pKVM could behave differently? non-CoCo gets essentially memfd_secret, pKVM gets GUP+no faults of private mem). > Having a mixture of "has directmap" and "has no directmap" for shared > (faultable) memory should not be done. Similarly, private memory really > should stay "unfaultable". You've convinced me that having both GUP-able and non GUP-able memory in the same VMA will be tricky. However, I'm less convinced on why private memory should stay unfaultable; only that it shouldn't be faultable into a VMA that also allows GUP. Can we have two VMAs? One that disallows GUP, but allows userspace access to shared and private, and one that allows GUP, but disallows accessing private memory? Maybe via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly different spin of the above idea. > I think one of the points raised during the bi-weekly call was that > using a viommu/swiotlb might be the right call, such that all memory can > be considered private (unfaultable) that is not explicitly > shared/expected to be modified by the hypervisor (-> faultable, -> > GUP-able). > > Further, I think Sean had some good points why we should explore that > direction, but I recall that there were some issue to be sorted out > (interpreted instructions requiring direct map when accessing "private" > memory?), not sure if that is already working/can be made working in KVM. Yeah, the big one is MMIO instruction emulation on x86, which does guest page table walks and instruction fetch (and particularly the latter cannot be known ahead-of-time by the guest, aka cannot be explicitly "shared"). That's what the majority of my v2 series was about. For traditional memslots, KVM handles these via get_user and friends, but if we don't have a VMA that allows faulting all of gmem, then that's impossible, and we're in "temporarily restore direct map" land. 
Which comes with significant performance penalties due to TLB flushes. > What's your opinion after the call and the next step for use cases like > you have in mind (IIRC firecracker, which wants to not have the > direct-map for guest memory where it can be avoided)? Yea, the usecase is for Firecracker to not have direct map entries for guest memory, unless needed for I/O (-> swiotlb). As for next steps, let's determine once and for all if we can do the KVM-internal guest memory accesses for MMIO emulation through userspace mappings (although if we can't I'll have some serious soul-searching to do, because all other solutions we talked about so far also have fairly big drawbacks; on-demand direct map reinsertion has terrible performance, protection keys would limit us to 15 VMs on the host, and the page table swapping runs into problems with NMIs if I understood Sean correctly last Thursday :( ). > -- > Cheers, > > David / dhildenb Best, Patrick
>> We talked about shared (faultable) vs. private (unfaultable), and how it >> would interact with the directmap patches here. >> >> As discussed, having private (unfaultable) memory with the direct-map >> removed and shared (faultable) memory with the direct-mapping can make >> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo, >> the discussion here seems to indicate that it might currently not be >> required. >> >> So one thing we could do is that shared (faultable) will have a direct >> mapping and be gup-able and private (unfaultable) memory will not have a >> direct mapping and is, by design, not gup-able.> >> Maybe it could make sense to not have a direct map for all guest_memfd >> memory, making it behave like secretmem (and it would be easy to >> implement)? But I'm not sure if that is really desirable in VM context. > > This would work for us (in this scenario, the swiotlb areas would be > "traditional" memory, e.g. set to shared via mem attributes instead of > "shared" inside KVM), it's kinda what I had prototyped in my v1 of this > series (well, we'd need to figure out how to get the mappings of gmem > back into KVM, since in this setup, short-circuiting it into > userspace_addr wouldn't work, unless we banish swiotlb into a different > memslot altogether somehow). Right. > But I don't think it'd work for pKVM, iirc > they need GUP on gmem, and also want direct map removal (... but maybe, > the gmem VMA for non-CoCo usecase and the gmem VMA for pKVM could be > behave differently? non-CoCo gets essentially memfd_secret, pKVM gets > GUP+no faults of private mem). Good question. So far my perception was that the directmap removal on "private/unfaultable" would be sufficient. > >> Having a mixture of "has directmap" and "has no directmap" for shared >> (faultable) memory should not be done. Similarly, private memory really >> should stay "unfaultable". 
> > You've convinced me that having both GUP-able and non GUP-able > memory in the same VMA will be tricky. However, I'm less convinced on > why private memory should stay unfaultable; only that it shouldn't be > faultable into a VMA that also allows GUP. Can we have two VMAs? One > that disallows GUP, but allows userspace access to shared and private, > and one that allows GUP, but disallows accessing private memory? Maybe > via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly > different spin of the above idea. What we are trying to achieve is making guest_memfd not behave completely different on that level for different "types" of VMs. So one of the goals should be to try to unify it as much as possible. shared -> faultable: GUP-able private -> unfaultable: unGUP-able And it makes sense, because a lot of future work will rely on some important properties: for example, if private memory cannot be faulted in + GUPed, core-MM will never have obtained valid references to such a page. There is no need to split large folios into smaller ones for tracking purposes; there is no need to maintain per-page refcounts and pincounts ... It doesn't mean that we cannot consider it if really required, but there really has to be a strong case for it, because it will all get really messy. For example, one issue is that a folio only has a single mapping (folio->mapping), and that is used in the GUP-fast path (no VMA) to determine whether GUP-fast is allowed or not. So you'd have to force everything through GUP-slow, where you could consider VMA properties :( It sounds quite suboptimal. I don't think multiple VMAs are what we really want. See below. > >> I think one of the points raised during the bi-weekly call was that >> using a viommu/swiotlb might be the right call, such that all memory can >> be considered private (unfaultable) that is not explicitly >> shared/expected to be modified by the hypervisor (-> faultable, -> >> GUP-able). 
>> >> Further, I think Sean had some good points why we should explore that >> direction, but I recall that there were some issue to be sorted out >> (interpreted instructions requiring direct map when accessing "private" >> memory?), not sure if that is already working/can be made working in KVM. > > Yeah, the big one is MMIO instruction emulation on x86, which does guest > page table walks and instruction fetch (and particularly the latter > cannot be known ahead-of-time by the guest, aka cannot be explicitly > "shared"). That's what the majority of my v2 series was about. For > traditional memslots, KVM handles these via get_user and friends, but if > we don't have a VMA that allows faulting all of gmem, then that's > impossible, and we're in "temporarily restore direct map" land. Which > comes with significantly performance penalties due to TLB flushes. Agreed. > >> What's your opinion after the call and the next step for use cases like >> you have in mind (IIRC firecracker, which wants to not have the >> direct-map for guest memory where it can be avoided)? > > Yea, the usecase is for Firecracker to not have direct map entries for > guest memory, unless needed for I/O (-> swiotlb). > > As for next steps, let's determine once and for all if we can do the > KVM-internal guest memory accesses for MMIO emulation through userspace > mappings (although if we can't I'll have some serious soul-searching to > do, because all other solutions we talked about so far also have fairly > big drawbacks; on-demand direct map reinsertion has terrible > performance So IIUC, KVM would have to access "unfaultable" guest_memfd memory using fd+offset, and that's problematic because "no-directmap". So you'd have to map+unmap the directmap repeatedly, and still expose it temporarily in the direct map to others. I see how that is undesirable, even when trying to cache hotspots (partly destroying the purpose of the directmap removal). 
Would a per-MM kernel mapping of these pages work, so KVM can access them? It sounds a bit like what is required for clean per-MM allocations [1]: establish a per-MM kernel mapping of (selected?) pages. Not necessarily all of them. Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything involved with ordinary user mappings for these private/unfaultable thingies. Just like as discussed in, and similar to [1]. Just throwing it out there, maybe we really want to avoid the directmap (keep it unmapped) and maintain a per-mm mapping for a bunch of folios that can be easily removed when required by guest_memfd (ftruncate, conversion private->shared) on request. [1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u
On 10/31/24 10:57, David Hildenbrand wrote: > On 30.10.24 14:49, Patrick Roy wrote: >> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org> >> >> From: Mike Rapoport (Microsoft) <rppt@kernel.org> >> >> Add an API that will allow updates of the direct/linear map for a set of >> physically contiguous pages. >> >> It will be used in the following patches. >> >> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> >> Signed-off-by: Patrick Roy <roypat@amazon.co.uk> > > > [...] > >> #ifdef CONFIG_DEBUG_PAGEALLOC >> void __kernel_map_pages(struct page *page, int numpages, int enable) >> { >> diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h >> index e7aec20fb44f1..3030d9245f5ac 100644 >> --- a/include/linux/set_memory.h >> +++ b/include/linux/set_memory.h >> @@ -34,6 +34,12 @@ static inline int set_direct_map_default_noflush(struct page *page) >> return 0; >> } >> >> +static inline int set_direct_map_valid_noflush(struct page *page, >> + unsigned nr, bool valid) > > I recall that "unsigned" is frowned upon; "unsigned int". > >> +{ >> + return 0; >> +} > > Can we add some kernel doc for this? > > In particular > > (a) What does it mean when we return 0? That it worked? Then, this Seems so. > dummy function looks wrong. Or this it return the That's !CONFIG_ARCH_HAS_SET_DIRECT_MAP and other functions around do it the same way. Looks like the current callers can only exist with the CONFIG_ enabled in the first place. > number of processed entries? Then we'd have a possible "int" vs. > "unsigned int" inconsistency. > > (b) What are the semantics when we fail halfway through the operation > when processing nr > 1? Is it "all or nothing"? Looking at x86 implementation it seems like it can just bail out in the middle, but then I'm not sure if it can really fail in the middle, hmm...
Hi David,

sorry for the late response, I ended up catching the flu last week and
was out of commission for a while :(

On Mon, 2024-11-04 at 21:30 +0000, David Hildenbrand wrote:
>>> We talked about shared (faultable) vs. private (unfaultable), and how it
>>> would interact with the directmap patches here.
>>>
>>> As discussed, having private (unfaultable) memory with the direct-map
>>> removed and shared (faultable) memory with the direct-mapping can make
>>> sense for non-TDX/AMD-SEV/... non-CoCo use cases. Not sure about CoCo,
>>> the discussion here seems to indicate that it might currently not be
>>> required.
>>>
>>> So one thing we could do is that shared (faultable) will have a direct
>>> mapping and be gup-able and private (unfaultable) memory will not have a
>>> direct mapping and is, by design, not gup-able.
>>>
>>> Maybe it could make sense to not have a direct map for all guest_memfd
>>> memory, making it behave like secretmem (and it would be easy to
>>> implement)? But I'm not sure if that is really desirable in VM context.
>>
>> This would work for us (in this scenario, the swiotlb areas would be
>> "traditional" memory, e.g. set to shared via mem attributes instead of
>> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
>> series (well, we'd need to figure out how to get the mappings of gmem
>> back into KVM, since in this setup, short-circuiting it into
>> userspace_addr wouldn't work, unless we banish swiotlb into a different
>> memslot altogether somehow).
>
> Right.

"right" as in, "yes we could do that"? :p

>> But I don't think it'd work for pKVM, iirc
>> they need GUP on gmem, and also want direct map removal (... but maybe,
>> the gmem VMA for the non-CoCo usecase and the gmem VMA for pKVM could
>> behave differently? non-CoCo gets essentially memfd_secret, pKVM gets
>> GUP+no faults of private mem).
>
> Good question. So far my perception was that the directmap removal on
> "private/unfaultable" would be sufficient.
>
>>
>>> Having a mixture of "has directmap" and "has no directmap" for shared
>>> (faultable) memory should not be done. Similarly, private memory really
>>> should stay "unfaultable".
>>
>> You've convinced me that having both GUP-able and non GUP-able
>> memory in the same VMA will be tricky. However, I'm less convinced on
>> why private memory should stay unfaultable; only that it shouldn't be
>> faultable into a VMA that also allows GUP. Can we have two VMAs? One
>> that disallows GUP, but allows userspace access to shared and private,
>> and one that allows GUP, but disallows accessing private memory? Maybe
>> via some `PROT_NOGUP` flag to `mmap`? I guess this is a slightly
>> different spin of the above idea.
>
> What we are trying to achieve is making guest_memfd not behave
> completely different on that level for different "types" of VMs. So one
> of the goals should be to try to unify it as much as possible.
>
> shared -> faultable: GUP-able
> private -> unfaultable: unGUP-able
>
> And it makes sense, because a lot of future work will rely on some
> important properties: for example, if private memory cannot be faulted
> in + GUPed, core-MM will never have obtained valid references to such a
> page. There is no need to split large folios into smaller ones for
> tracking purposes; there is no need to maintain per-page refcounts and
> pincounts ...
>
> It doesn't mean that we cannot consider it if really required, but there
> really has to be a strong case for it, because it will all get really messy.
>
> For example, one issue is that a folio only has a single mapping
> (folio->mapping), and that is used in the GUP-fast path (no VMA) to
> determine whether GUP-fast is allowed or not.
>
> So you'd have to force everything through GUP-slow, where you could
> consider VMA properties :( It sounds quite suboptimal.
>
> I don't think multiple VMAs are what we really want. See below.

Ah, okay, I see. Thanks for explaining, this all makes a lot of sense to
me now!

>
>>> I think one of the points raised during the bi-weekly call was that
>>> using a viommu/swiotlb might be the right call, such that all memory can
>>> be considered private (unfaultable) that is not explicitly
>>> shared/expected to be modified by the hypervisor (-> faultable, ->
>>> GUP-able).
>>>
>>> Further, I think Sean had some good points why we should explore that
>>> direction, but I recall that there were some issues to be sorted out
>>> (interpreted instructions requiring direct map when accessing "private"
>>> memory?), not sure if that is already working/can be made working in KVM.
>>
>> Yeah, the big one is MMIO instruction emulation on x86, which does guest
>> page table walks and instruction fetch (and particularly the latter
>> cannot be known ahead-of-time by the guest, aka cannot be explicitly
>> "shared"). That's what the majority of my v2 series was about. For
>> traditional memslots, KVM handles these via get_user and friends, but if
>> we don't have a VMA that allows faulting all of gmem, then that's
>> impossible, and we're in "temporarily restore direct map" land. Which
>> comes with significant performance penalties due to TLB flushes.
>
> Agreed.
>
>>> What's your opinion after the call and the next step for use cases like
>>> you have in mind (IIRC firecracker, which wants to not have the
>>> direct-map for guest memory where it can be avoided)?
>>
>> Yea, the usecase is for Firecracker to not have direct map entries for
>> guest memory, unless needed for I/O (-> swiotlb).
>>
>> As for next steps, let's determine once and for all if we can do the
>> KVM-internal guest memory accesses for MMIO emulation through userspace
>> mappings (although if we can't I'll have some serious soul-searching to
>> do, because all other solutions we talked about so far also have fairly
>> big drawbacks; on-demand direct map reinsertion has terrible
>> performance
>
> So IIUC, KVM would have to access "unfaultable" guest_memfd memory using
> fd+offset, and that's problematic because "no-directmap".
>
> So you'd have to map+unmap the directmap repeatedly, and still expose it
> temporarily in the direct map to others. I see how that is undesirable,
> even when trying to cache hotspots (partly destroying the purpose of the
> directmap removal).
>
> Would a per-MM kernel mapping of these pages work, so KVM can access them?
>
> It sounds a bit like what is required for clean per-MM allocations [1]:
> establish a per-MM kernel mapping of (selected?) pages. Not necessarily
> all of them.
>
> Yes, we'd be avoiding VMAs, GUP, mapcounts, pincounts and everything
> involved with ordinary user mappings for these private/unfaultable
> thingies. Just like as discussed in, and similar to, [1].
>
> Just throwing it out there, maybe we really want to avoid the directmap
> (keep it unmapped) and maintain a per-mm mapping for a bunch of folios
> that can be easily removed when required by guest_memfd (ftruncate,
> conversion private->shared) on request.

I remember talking to someone at some point about whether we could reuse
the proc-local stuff for guest memory, but I cannot remember the outcome
of that discussion... (or maybe I just wanted to have a discussion about
it, but forgot to follow up on that thought?). I guess we wouldn't use
proc-local _allocations_, but rather just set up proc-local mappings of
the gmem allocations that have been removed from the direct map.

I'm wondering, where exactly would be the differences to Sean's idea
about messing with the CR3 register inside KVM to temporarily install
page tables that contain all the gmem stuff, conceptually? Wouldn't we
run into the same interrupt problems that Sean foresaw for the CR3
stuff? (which, admittedly, I still don't quite follow what these are :( ).

(I've cc'd Fares Mehanna as well)

> [1] https://lore.kernel.org/all/20240911143421.85612-1-faresx@amazon.de/T/#u

Best,
Patrick
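For concreteness, the "temporarily restore direct map" approach whose TLB-flush cost the thread keeps returning to would look roughly like the following (pseudocode only; `gmem_read_without_directmap` is a made-up name, arguments and error handling are elided, and the cost comments summarize the discussion above rather than measured numbers):

```
/* Pseudocode: read direct-map-removed gmem for MMIO emulation. Every
 * access pays two permission changes and two global TLB shootdowns. */
int gmem_read_without_directmap(struct page *page, void *dst,
                                size_t offset, size_t len)
{
        int r;

        /* put the page back into the direct map ... */
        r = set_direct_map_valid_noflush(page, 1, true);
        if (r)
                return r;
        flush_tlb_kernel_range(...);   /* expensive: IPIs to all CPUs */

        memcpy(dst, page_address(page) + offset, len);

        /* ... and remove it again so it stays speculation-safe */
        r = set_direct_map_valid_noflush(page, 1, false);
        flush_tlb_kernel_range(...);   /* expensive again */
        return r;
}
```

This is exactly the "map+unmap the directmap repeatedly" pattern David describes, and why a persistent per-MM mapping looks attractive by comparison.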
On Mon, 2024-11-11 at 12:12 +0000, Vlastimil Babka wrote:
> On 10/31/24 10:57, David Hildenbrand wrote:
>> On 30.10.24 14:49, Patrick Roy wrote:
>>> From: "Mike Rapoport (Microsoft)" <rppt@kernel.org>
>>>
>>> Add an API that will allow updates of the direct/linear map for a set of
>>> physically contiguous pages.
>>>
>>> It will be used in the following patches.
>>>
>>> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
>>> Signed-off-by: Patrick Roy <roypat@amazon.co.uk>
>>
>> [...]
>>
>>> +static inline int set_direct_map_valid_noflush(struct page *page,
>>> + unsigned nr, bool valid)
>>
>> I recall that "unsigned" is frowned upon; "unsigned int".
>>
>>> +{
>>> + return 0;
>>> +}
>>
>> Can we add some kernel doc for this?
>>
>> In particular
>>
>> (a) What does it mean when we return 0? That it worked? Then, this
>
> Seems so.
>
>> dummy function looks wrong. Or does it return the
>
> That's !CONFIG_ARCH_HAS_SET_DIRECT_MAP and other functions around do it the
> same way. Looks like the current callers can only exist with the CONFIG_
> enabled in the first place.

Yeah, it looks a bit weird, but these functions seem to generally return
0 if the operation is not supported. ARM specifically has

	if (!can_set_direct_map())
		return 0;

inside `set_direct_map_{invalid,default}_noflush`. Documenting this
definitely cannot hurt, I'll keep it on my todo list for the next
iteration :)

>> number of processed entries? Then we'd have a possible "int" vs.
>> "unsigned int" inconsistency.
>>
>> (b) What are the semantics when we fail halfway through the operation
>> when processing nr > 1? Is it "all or nothing"?
>
> Looking at the x86 implementation it seems like it can just bail out in the
> middle, but then I'm not sure if it can really fail in the middle, hmm...

If I understood Mike correctly when talking about this at LPC, then it
can only fail if, during break-up of huge mappings, it fails to allocate
page tables to hold the lower-granularity mappings (which happens before
any present bits are modified).

Best,
Patrick
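To make the kernel-doc discussion concrete, one possible shape for the documented stub is sketched below. This is illustrative only: the doc-comment wording is mine, not from the posted patch, and `struct page` is stubbed out so the snippet compiles outside the kernel. It applies the two review points (use `unsigned int`, and spell out the return-value and partial-failure semantics discussed above).

```c
#include <stdbool.h>

/* Stand-in so the snippet is self-contained; the real definition
 * lives in <linux/mm_types.h>. */
struct page { int dummy; };

/**
 * set_direct_map_valid_noflush - mark a physically contiguous range of
 *	pages in the direct map as (in)valid, without flushing the TLB
 * @page: first page of the range
 * @nr: number of pages in the range
 * @valid: make the entries valid (true) or invalid (false)
 *
 * Returns 0 on success. 0 is also returned when the architecture does
 * not implement direct-map manipulation, in which case the call is a
 * no-op. Real implementations are not "all or nothing": they may fail
 * part-way through the range, e.g. when splitting a huge mapping
 * requires a page-table allocation that fails.
 */
static inline int set_direct_map_valid_noflush(struct page *page,
					       unsigned int nr, bool valid)
{
	return 0;
}
```

The stub's unconditional 0 then reads as "operation not supported, nothing to do", matching the other `set_direct_map_*` dummies.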
On 12.11.24 15:40, Patrick Roy wrote:
> Hi David,
>
> sorry for the late response, I ended up catching the flu last week and
> was out of commission for a while :(
>
> On Mon, 2024-11-04 at 21:30 +0000, David Hildenbrand wrote:

[...]

>>> This would work for us (in this scenario, the swiotlb areas would be
>>> "traditional" memory, e.g. set to shared via mem attributes instead of
>>> "shared" inside KVM), it's kinda what I had prototyped in my v1 of this
>>> series (well, we'd need to figure out how to get the mappings of gmem
>>> back into KVM, since in this setup, short-circuiting it into
>>> userspace_addr wouldn't work, unless we banish swiotlb into a different
>>> memslot altogether somehow).
>>
>> Right.
>
> "right" as in, "yes we could do that"? :p

"right" as in "I see how that could work" :)

[...]

> I remember talking to someone at some point about whether we could reuse
> the proc-local stuff for guest memory, but I cannot remember the outcome
> of that discussion... (or maybe I just wanted to have a discussion about
> it, but forgot to follow up on that thought?). I guess we wouldn't use
> proc-local _allocations_, but rather just set up proc-local mappings of
> the gmem allocations that have been removed from the direct map.

Yes. And likely only for memory we really access / try access, if possible.

> I'm wondering, where exactly would be the differences to Sean's idea
> about messing with the CR3 register inside KVM to temporarily install
> page tables that contain all the gmem stuff, conceptually? Wouldn't we
> run into the same interrupt problems that Sean foresaw for the CR3
> stuff? (which, admittedly, I still don't quite follow what these are :(
> ).

I'd need some more details on that. If anything would rely on the direct
mapping (from IRQ context?) then ... we obviously cannot remove the
direct mapping :)
On Tue, 2024-11-12 at 14:52 +0000, David Hildenbrand wrote:
> On 12.11.24 15:40, Patrick Roy wrote:
>> I remember talking to someone at some point about whether we could reuse
>> the proc-local stuff for guest memory, but I cannot remember the outcome
>> of that discussion... (or maybe I just wanted to have a discussion about
>> it, but forgot to follow up on that thought?). I guess we wouldn't use
>> proc-local _allocations_, but rather just set up proc-local mappings of
>> the gmem allocations that have been removed from the direct map.
>
> Yes. And likely only for memory we really access / try access, if possible.

Well, if we start on-demand mm-local mapping the things we want to
access, we're back in TLB flush hell, no? And we can't know
ahead-of-time what needs to be mapped, so everything would need to be
mapped (unless we do something like mm-local mapping a page on first
access, and then just never unmapping it again, under the assumption
that establishing the mapping won't be expensive)

>>
>> I'm wondering, where exactly would be the differences to Sean's idea
>> about messing with the CR3 register inside KVM to temporarily install
>> page tables that contain all the gmem stuff, conceptually? Wouldn't we
>> run into the same interrupt problems that Sean foresaw for the CR3
>> stuff? (which, admittedly, I still don't quite follow what these are :(
>> ).
>
> I'd need some more details on that. If anything would rely on the direct
> mapping (from IRQ context?) then ... we obviously cannot remove the
> direct mapping :)

I've talked to Fares internally, and it seems that generally doing
mm-local mappings of guest memory would work for us. We also figured out
what the "interrupt problem" is, namely that if we receive an interrupt
while executing in a context that has mm-local mappings available, those
mappings will continue to be available while the interrupt is being
handled.

I'm talking to my security folks to see how much of a concern this is
for the speculation hardening we're trying to achieve. Will keep you in
the loop there :)

Best,
Patrick
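The "map on first access, never unmap" policy Patrick floats can be illustrated with a toy userspace model (everything here is invented for illustration, including the fake address scheme; no real kernel APIs are involved): the first touch of a page pays the one-time cost of establishing the mm-local mapping, later touches are cache hits, and because entries are never removed, no TLB shootdown is ever required.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

static bool mapped[NPAGES]; /* which gmem pages have an mm-local mapping */
static int map_ops;         /* counts expensive "establish mapping" steps */

/*
 * Toy model: return the (fake) mm-local address of a gmem page,
 * establishing the mapping on first access and caching it forever.
 * Adding a page-table entry needs no TLB flush; only removing one would.
 */
static void *gmem_access(size_t pfn)
{
	if (!mapped[pfn]) {
		mapped[pfn] = true; /* first touch: pay the mapping cost once */
		map_ops++;
	}
	return (void *)(0x1000 * (pfn + 1)); /* invented address scheme */
}
```

Repeated accesses to the same page leave `map_ops` unchanged; the trade-off, as the interrupt discussion above notes, is that the accumulated mappings stay visible to any context running in that mm, interrupts included.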
On 15.11.24 17:59, Patrick Roy wrote:
> On Tue, 2024-11-12 at 14:52 +0000, David Hildenbrand wrote:
>> On 12.11.24 15:40, Patrick Roy wrote:
>>> I remember talking to someone at some point about whether we could reuse
>>> the proc-local stuff for guest memory, but I cannot remember the outcome
>>> of that discussion... (or maybe I just wanted to have a discussion about
>>> it, but forgot to follow up on that thought?). I guess we wouldn't use
>>> proc-local _allocations_, but rather just set up proc-local mappings of
>>> the gmem allocations that have been removed from the direct map.
>>
>> Yes. And likely only for memory we really access / try access, if possible.
>
> Well, if we start on-demand mm-local mapping the things we want to
> access, we're back in TLB flush hell, no?

At least the on-demand mapping shouldn't require a TLB flush? Only
"unmapping" would, if we want to keep a "mapped pool" of restricted size.

Anyhow, this would be a pure optimization, to avoid the expense of
mapping everything, when in practice you'd likely not access most of it.

(my theory, happy to be told I'm wrong :) )

> And we can't know
> ahead-of-time what needs to be mapped, so everything would need to be
> mapped (unless we do something like mm-local mapping a page on first
> access, and then just never unmapping it again, under the assumption
> that establishing the mapping won't be expensive)

Right, the whole problem is that we don't know that upfront.

>>>
>>> I'm wondering, where exactly would be the differences to Sean's idea
>>> about messing with the CR3 register inside KVM to temporarily install
>>> page tables that contain all the gmem stuff, conceptually? Wouldn't we
>>> run into the same interrupt problems that Sean foresaw for the CR3
>>> stuff? (which, admittedly, I still don't quite follow what these are :(
>>> ).
>>
>> I'd need some more details on that. If anything would rely on the direct
>> mapping (from IRQ context?) then ... we obviously cannot remove the
>> direct mapping :)
>
> I've talked to Fares internally, and it seems that generally doing
> mm-local mappings of guest memory would work for us. We also figured out
> what the "interrupt problem" is, namely that if we receive an interrupt
> while executing in a context that has mm-local mappings available, those
> mappings will continue to be available while the interrupt is being
> handled.

Isn't that likely also the case with secretmem where we removed the
directmap, but have an effective per-mm mapping in the (user-space
portion) of the page table?

> I'm talking to my security folks to see how much of a concern
> this is for the speculation hardening we're trying to achieve. Will keep
> you in the loop there :)

Thanks!
On Fri, 2024-11-15 at 17:10 +0000, David Hildenbrand wrote:
> [...]
>
>> I've talked to Fares internally, and it seems that generally doing
>> mm-local mappings of guest memory would work for us. We also figured out
>> what the "interrupt problem" is, namely that if we receive an interrupt
>> while executing in a context that has mm-local mappings available, those
>> mappings will continue to be available while the interrupt is being
>> handled.
>
> Isn't that likely also the case with secretmem where we removed the
> directmap, but have an effective per-mm mapping in the (user-space
> portion) of the page table?

Mh, that's an excellent point, I never thought of that. But with
secretmem, the memory would still be protected by SMAP (admittedly, I
have no idea how much this is worth in the face of all these speculative
issues), right?

>> I'm talking to my security folks to see how much of a concern
>> this is for the speculation hardening we're trying to achieve. Will keep
>> you in the loop there :)
>
> Thanks!

Best,
Patrick