
[RFC,26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private

Message ID bd163de3118b626d1005aa88e71ef2fb72f0be0f.1726009989.git.ackerleytng@google.com
State New
Series 1G page support for guest_memfd

Commit Message

Ackerley Tng Sept. 10, 2024, 11:43 p.m. UTC
The faultability xarray is stored on the inode since faultability is a
property of the guest_memfd's memory contents.

In this RFC, presence of an entry in the xarray indicates faultable,
but this could be flipped so that presence indicates unfaultable. For
flexibility, a special value "FAULT" is used instead of a simple
boolean.

However, at some stages of a VM's lifecycle there could be more
private pages, and at other stages there could be more shared pages.

This is likely to be replaced by a better data structure in a future
revision to better support ranges.

Also store struct kvm_gmem_hugetlb as a pointer within the new struct
kvm_gmem_private, which is stored in inode->i_mapping->i_private_data.

Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>

---
 virt/kvm/guest_memfd.c | 105 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 94 insertions(+), 11 deletions(-)

Comments

Peter Xu Oct. 10, 2024, 4:06 p.m. UTC | #1
On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
> The faultability xarray is stored on the inode since faultability is a
> property of the guest_memfd's memory contents.
> 
> In this RFC, presence of an entry in the xarray indicates faultable,
> but this could be flipped so that presence indicates unfaultable. For
> flexibility, a special value "FAULT" is used instead of a simple
> boolean.
> 
> However, at some stages of a VM's lifecycle there could be more
> private pages, and at other stages there could be more shared pages.
> 
> This is likely to be replaced by a better data structure in a future
> revision to better support ranges.
> 
> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
> pointer. inode->i_mapping->i_private_data.

Could you help explain the difference between faultability and the
existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
confused, otherwise might be good to enrich the commit message.

The latter is per-slot, so one level higher, however I don't think it's a
common use case for mapping the same gmemfd in multiple slots anyway for
KVM (besides corner cases like live upgrade).  So perhaps this is not about
layering but something else?  For example, any use case where PRIVATE and
FAULTABLE can be reported with different values.

Another higher level question is, is there any plan to support non-CoCo
context for 1G?

I saw that you also mentioned you have working QEMU prototypes ready in
another email.  It'll be great if you can push your kernel/QEMU's latest
tree (including all dependency patches) somewhere so anyone can have a
closer look, or play with it.

Thanks,
Ackerley Tng Oct. 11, 2024, 11:32 p.m. UTC | #2
Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>> The faultability xarray is stored on the inode since faultability is a
>> property of the guest_memfd's memory contents.
>> 
>> In this RFC, presence of an entry in the xarray indicates faultable,
>> but this could be flipped so that presence indicates unfaultable. For
>> flexibility, a special value "FAULT" is used instead of a simple
>> boolean.
>> 
>> However, at some stages of a VM's lifecycle there could be more
>> private pages, and at other stages there could be more shared pages.
>> 
>> This is likely to be replaced by a better data structure in a future
>> revision to better support ranges.
>> 
>> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
>> pointer. inode->i_mapping->i_private_data.
>
> Could you help explain the difference between faultability v.s. the
> existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
> confused, otherwise might be good to enrich the commit message.

Thank you for this question, I'll add this to the commit message in the
next revision if Fuad's patch set [1] doesn't land first.

Reason (a): To elaborate on the explanation in [1],
KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
private or shared, and faultability is whether the page is allowed to be
faulted in by userspace.

These two are similar but may not be the same thing. In pKVM, the
hypervisor cannot trust userspace's configuration of private/shared, so
other information goes into determining the private/shared setting
tracked in faultability.

Perhaps Fuad can elaborate more here.

Reason (b): In this patch series (mostly focus on x86 first), we're
using faultability to prevent any future faults before checking that
there are no mappings.

Having a different xarray from mem_attr_array allows us to disable
faulting before committing to changing mem_attr_array. Please see
`kvm_gmem_should_set_attributes_private()` in this patch [2].

We're not completely sure about the effectiveness of using faultability
to block off future faults here, in future revisions we may be using a
different approach. The folio_lock() is probably important if we need to
check mapcount. Please let me know if you have any ideas!

The starting point of having a different xarray was pKVM's requirement
of having separate xarrays, and we later realized that the xarray could
be used for reason (b). For x86 we could perhaps eventually remove the
second xarray? Not sure as of now.

>
> The latter is per-slot, so one level higher, however I don't think it's a
> common use case for mapping the same gmemfd in multiple slots anyway for
> KVM (besides corner cases like live upgrade).  So perhaps this is not about
> layering but something else?  For example, any use case where PRIVATE and
> FAULTABLE can be reported with different values.
>
> Another higher level question is, is there any plan to support non-CoCo
> context for 1G?

I believe guest_memfd users are generally in favor of eventually using
guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
in the case of CoCo) page support.

However, core-mm's fault path does not support mapping at anything
higher than the PMD level (other than hugetlb_fault(), which the
community wants to move away from), so core-mm wouldn't be able to map
1G pages taken from HugeTLB.

In this patch series, we always split pages before mapping them to
userspace and that's how this series still works with core-mm.

Having 1G page support for shared memory or for non-CoCo use cases would
probably depend on better HugeTLB integration with core-mm, which you'd
be most familiar with.

Thank you for looking through our patches, we need your experience and
help! I've also just sent out the first 3 patches separately, which I
think is useful in improving understandability of the
resv_map/subpool/hstate reservation system in HugeTLB and can be
considered separately. Hope you can also review/comment on [4].

> I saw that you also mentioned you have working QEMU prototypes ready in
> another email.  It'll be great if you can push your kernel/QEMU's latest
> tree (including all dependency patches) somewhere so anyone can have a
> closer look, or play with it.

Vishal's reply [3] might have been a bit confusing. To clarify, my team
doesn't work with QEMU at all (we use a custom userspace VMM internally),
so the patches in this series are tested purely with selftests.

The selftests have fewer dependencies than full QEMU and I'd be happy to
help with running them or explain anything that I might have missed out.

We don't have any QEMU prototypes and are not likely to be building any
in the foreseeable future.

>
> Thanks,
>
> -- 
> Peter Xu

[1] https://lore.kernel.org/all/20241010085930.1546800-3-tabba@google.com/
[2] https://lore.kernel.org/all/f4ca1711a477a3b56406c05d125dce3d7403b936.1726009989.git.ackerleytng@google.com/
[3] https://lore.kernel.org/all/CAGtprH-GczOb64XrLpdW4ObRG7Gsv8tHWNhiW7=2dE=OAF7-Rw@mail.gmail.com/
[4] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/
Peter Xu Oct. 15, 2024, 9:34 p.m. UTC | #3
On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
> >> The faultability xarray is stored on the inode since faultability is a
> >> property of the guest_memfd's memory contents.
> >> 
> >> In this RFC, presence of an entry in the xarray indicates faultable,
> >> but this could be flipped so that presence indicates unfaultable. For
> >> flexibility, a special value "FAULT" is used instead of a simple
> >> boolean.
> >> 
> >> However, at some stages of a VM's lifecycle there could be more
> >> private pages, and at other stages there could be more shared pages.
> >> 
> >> This is likely to be replaced by a better data structure in a future
> >> revision to better support ranges.
> >> 
> >> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
> >> pointer. inode->i_mapping->i_private_data.
> >
> > Could you help explain the difference between faultability v.s. the
> > existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
> > confused, otherwise might be good to enrich the commit message.
> 
> Thank you for this question, I'll add this to the commit message to the
> next revision if Fuad's patch set [1] doesn't make it first.
> 
> Reason (a): To elaborate on the explanation in [1],
> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
> private or shared, and faultability is whether the page is allowed to be
> faulted in by userspace.
> 
> These two are similar but may not be the same thing. In pKVM, pKVM
> cannot trust userspace's configuration of private/shared, and other
> information will go into determining the private/shared setting in
> faultability.

It makes sense to me that the kernel has the right to decide which page is
shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
case is most / all pages are private, until some requests to share them for
special purposes (like DMA).  But that'll need to be initiated as a request
from the guest not the userspace hypervisor.

I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
planned to be used in the future. Currently it's always set at least in
QEMU if gmemfd is enabled, so it doesn't yet tell me anything..

If it's driven by the userspace side of the hypervisor, I wonder when
the user app should request a value different from what it already is, if
the kernel already has an answer in this case.  It made me even more confused,
as we have this in the API doc:

        Note, there is no "get" API.  Userspace is responsible for
        explicitly tracking the state of a gfn/page as needed.

And I do wonder whether we will still need some API just to query whether
the kernel allows the page to be mapped or not (aka, the "real" shared /
private status of a guest page).  I guess that's not directly relevant to
the faultability to be introduced here, but if you or anyone know please
kindly share, I'd love to learn about it.

> 
> Perhaps Fuad can elaborate more here.
> 
> Reason (b): In this patch series (mostly focus on x86 first), we're
> using faultability to prevent any future faults before checking that
> there are no mappings.
> 
> Having a different xarray from mem_attr_array allows us to disable
> faulting before committing to changing mem_attr_array. Please see
> `kvm_gmem_should_set_attributes_private()` in this patch [2].
> 
> We're not completely sure about the effectiveness of using faultability
> to block off future faults here, in future revisions we may be using a
> different approach. The folio_lock() is probably important if we need to
> check mapcount. Please let me know if you have any ideas!
> 
> The starting point of having a different xarray was pKVM's requirement
> of having separate xarrays, and we later realized that the xarray could
> be used for reason (b). For x86 we could perhaps eventually remove the
> second xarray? Not sure as of now.

Just had a quick look at patch 27:

https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/

I'm not yet sure what's protecting from faultability being modified against
a concurrent fault().

I wonder whether one can use the folio lock to serialize that, so that one
needs to take the folio lock to modify/lookup the folio's faultability,
then it may naturally match with the fault() handler design, where
kvm_gmem_get_folio() needs to lock the page first.

But then kvm_gmem_is_faultable() will need to also be called only after the
folio is locked to avoid races.

> 
> >
> > The latter is per-slot, so one level higher, however I don't think it's a
> > common use case for mapping the same gmemfd in multiple slots anyway for
> > KVM (besides corner cases like live upgrade).  So perhaps this is not about
> > layering but something else?  For example, any use case where PRIVATE and
> > FAULTABLE can be reported with different values.
> >
> > Another higher level question is, is there any plan to support non-CoCo
> > context for 1G?
> 
> I believe guest_memfd users are generally in favor of eventually using
> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
> in the case of CoCo) page support.
> 
> However, core-mm's fault path does not support mapping at anything
> higher than the PMD level (other than hugetlb_fault(), which the
> community wants to move away from), so core-mm wouldn't be able to map
> 1G pages taken from HugeTLB.

Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
referring to some other challenges?

> 
> In this patch series, we always split pages before mapping them to
> userspace and that's how this series still works with core-mm.
> 
> Having 1G page support for shared memory or for non-CoCo use cases would
> probably depend on better HugeTLB integration with core-mm, which you'd
> be most familiar with.

My understanding is the mm community wants to avoid adding major new things
on top of current hugetlbfs alone, I'm not sure whether this will also be
accounted as part of that.  IMHO it could depend on how much this series
will reuse hugetlbfs.  If it's only about allocations it might be ok,
however I still feel risky having the name "hugetlbfs" here, the allocator
(if refactored out of hugetlb, but to contain more functions than CMA)
could be named in a more generic way.  No rush on changing anything, you
may always want to talk with more mm people on this I guess.

I also don't know how you treat things like folio_test_hugetlb() on
possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
didn't yet check the rest of the patchset yet - reading a large series
without a git tree is sometimes challenging to me.

> 
> Thank you for looking through our patches, we need your experience and
> help! I've also just sent out the first 3 patches separately, which I
> think is useful in improving understandability of the
> resv_map/subpool/hstate reservation system in HugeTLB and can be
> considered separately. Hope you can also review/comment on [4].

I'll read and think about it.  Before that, I'll probably need to read more
backgrounds you need from hugetlb allocators (e.g. I remember you mentioned
pool management somewhere).  I tried to watch your LPC talk but the
recording has some issue on audio so I can mostly hear nothing in most of
the discussions..  I'll try to join the bi-weekly meeting two days later,
though.

> 
> > I saw that you also mentioned you have working QEMU prototypes ready in
> > another email.  It'll be great if you can push your kernel/QEMU's latest
> > tree (including all dependency patches) somewhere so anyone can have a
> > closer look, or play with it.
> 
> Vishal's reply [3] might have been a bit confusing. To clarify, my team
> doesn't work with Qemu at all (we use a custom userspace VMM internally)
> so the patches in this series are tested purely with selftests.
> 
> The selftests have fewer dependencies than full Qemu and I'd be happy to
> help with running them or explain anything that I might have missed out.
> 
> We don't have any Qemu prototypes and are not likely to be building any
> prototypes in the foreseeable future.

I see, that's totally not a problem.  If there can be, especially !CoCo
support at some point, we're happy to test it on QEMU side.  I'll see what
I can do to help !CoCo kernel side getting there.

Besides, it'll still be great if you can push a latest kernel tree
somewhere (or provide the base commit ID, but that needs to be on a public
tree I can fetch).

Thanks,

> 
> >
> > Thanks,
> >
> > -- 
> > Peter Xu
> 
> [1] https://lore.kernel.org/all/20241010085930.1546800-3-tabba@google.com/
> [2] https://lore.kernel.org/all/f4ca1711a477a3b56406c05d125dce3d7403b936.1726009989.git.ackerleytng@google.com/
> [3] https://lore.kernel.org/all/CAGtprH-GczOb64XrLpdW4ObRG7Gsv8tHWNhiW7=2dE=OAF7-Rw@mail.gmail.com/
> [4] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/
>
Ackerley Tng Oct. 15, 2024, 11:42 p.m. UTC | #4
Peter Xu <peterx@redhat.com> writes:

> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>> >> The faultability xarray is stored on the inode since faultability is a
>> >> property of the guest_memfd's memory contents.
>> >> 
>> >> In this RFC, presence of an entry in the xarray indicates faultable,
>> >> but this could be flipped so that presence indicates unfaultable. For
>> >> flexibility, a special value "FAULT" is used instead of a simple
>> >> boolean.
>> >> 
>> >> However, at some stages of a VM's lifecycle there could be more
>> >> private pages, and at other stages there could be more shared pages.
>> >> 
>> >> This is likely to be replaced by a better data structure in a future
>> >> revision to better support ranges.
>> >> 
>> >> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
>> >> pointer. inode->i_mapping->i_private_data.
>> >
>> > Could you help explain the difference between faultability v.s. the
>> > existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
>> > confused, otherwise might be good to enrich the commit message.
>> 
>> Thank you for this question, I'll add this to the commit message to the
>> next revision if Fuad's patch set [1] doesn't make it first.
>> 
>> Reason (a): To elaborate on the explanation in [1],
>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>> private or shared, and faultability is whether the page is allowed to be
>> faulted in by userspace.
>> 
>> These two are similar but may not be the same thing. In pKVM, pKVM
>> cannot trust userspace's configuration of private/shared, and other
>> information will go into determining the private/shared setting in
>> faultability.
>
> It makes sense to me that the kernel has the right to decide which page is
> shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
> case is most / all pages are private, until some requests to share them for
> special purposes (like DMA).  But that'll need to be initiated as a request
> from the guest not the userspace hypervisor.

For TDX, the plan is that the guest will request the page to be remapped
as shared or private, and the handler for that request will exit to
the userspace VMM.

The userspace VMM will then do any necessary coordination (e.g. for a
shared to private conversion it may need to unpin pages from DMA), and
then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
the guest's requested conversion. This is where
KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.

Patch 38 [1] updates
tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
demonstrate the usage flow for x86.

Fuad will be in a better position to explain the flow for pKVM. 

> I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
> planned to be used in the future. Currently it's always set at least in
> QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
>
> If it's driven by the userspace side of the hypervisor, I wonder when
> should the user app request some different value it already was, if the
> kernel already has an answer in this case.  It made me even more confused,
> as we have this in the API doc:
>
>         Note, there is no "get" API.  Userspace is responsible for
>         explicitly tracking the state of a gfn/page as needed.
>
> And I do wonder whether we will still need some API just to query whether
> the kernel allows the page to be mapped or not (aka, the "real" shared /
> private status of a guest page).  I guess that's not directly relevant to
> the faultability to be introduced here, but if you or anyone know please
> kindly share, I'd love to learn about it.

The userspace VMM will track the initial shared/private state, in the
sense that when the VM is created, the mem_attr_array is initialized
such that the guest pages are all shared.

Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
it should record all changes so it knows what the state is in the
kernel.

Even if userspace VMM doesn't record the state properly, if the
KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
(e.g. setting an already private page to private), it will just be a
no-op in the kernel.

>> 
>> Perhaps Fuad can elaborate more here.
>> 
>> Reason (b): In this patch series (mostly focus on x86 first), we're
>> using faultability to prevent any future faults before checking that
>> there are no mappings.
>> 
>> Having a different xarray from mem_attr_array allows us to disable
>> faulting before committing to changing mem_attr_array. Please see
>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>> 
>> We're not completely sure about the effectiveness of using faultability
>> to block off future faults here, in future revisions we may be using a
>> different approach. The folio_lock() is probably important if we need to
>> check mapcount. Please let me know if you have any ideas!
>> 
>> The starting point of having a different xarray was pKVM's requirement
>> of having separate xarrays, and we later realized that the xarray could
>> be used for reason (b). For x86 we could perhaps eventually remove the
>> second xarray? Not sure as of now.
>
> Just had a quick look at patch 27:
>
> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>
> I'm not yet sure what's protecting from faultability being modified against
> a concurrent fault().
>
> I wonder whether one can use the folio lock to serialize that, so that one
> needs to take the folio lock to modify/lookup the folio's faultability,
> then it may naturally match with the fault() handler design, where
> kvm_gmem_get_folio() needs to lock the page first.
>
> But then kvm_gmem_is_faultable() will need to also be called only after the
> folio is locked to avoid races.

My bad. In our rush to get this series out before LPC, the patch series
was not organized very well. Patch 39 [2] adds the
lock. filemap_invalidate_lock_shared() should make sure that faulting
doesn't race with faultability updates.

>> > The latter is per-slot, so one level higher, however I don't think it's a
>> > common use case for mapping the same gmemfd in multiple slots anyway for
>> > KVM (besides corner cases like live upgrade).  So perhaps this is not about
>> > layering but something else?  For example, any use case where PRIVATE and
>> > FAULTABLE can be reported with different values.
>> >
>> > Another higher level question is, is there any plan to support non-CoCo
>> > context for 1G?
>> 
>> I believe guest_memfd users are generally in favor of eventually using
>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>> in the case of CoCo) page support.
>> 
>> However, core-mm's fault path does not support mapping at anything
>> higher than the PMD level (other than hugetlb_fault(), which the
>> community wants to move away from), so core-mm wouldn't be able to map
>> 1G pages taken from HugeTLB.
>
> Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
> referring to some other challenges?
>

IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
core-mm using vm_operations_struct.huge_fault().

>> 
>> In this patch series, we always split pages before mapping them to
>> userspace and that's how this series still works with core-mm.
>> 
>> Having 1G page support for shared memory or for non-CoCo use cases would
>> probably depend on better HugeTLB integration with core-mm, which you'd
>> be most familiar with.
>
> My understanding is the mm community wants to avoid adding major new things
> on top of current hugetlbfs alone, I'm not sure whether this will also be
> accounted as part of that.  IMHO it could depend on how much this series
> will reuse hugetlbfs.  If it's only about allocations it might be ok,
> however I still feel risky having the name "hugetlbfs" here, the allocator
> (if refactored out of hugetlb, but to contain more functions than CMA)
> could be named in a more generic way.  No rush on changing anything, you
> may always want to talk with more mm people on this I guess.
>

Thanks for your feedback! We do intend to only use the allocator part of
HugeTLB for guest_memfd, which will need some refactoring on the HugeTLB
side. The refactoring is not expected to require any functional changes.

What do you think of refactoring out the allocator part of HugeTLB in
terms of whether it helps with HugeTLB unification?

If the refactoring out of the allocator part of HugeTLB needs a name
change, that could work too.

> I also don't know how you treat things like folio_test_hugetlb() on
> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
> didn't yet check the rest of the patchset yet - reading a large series
> without a git tree is sometimes challenging to me.
>

I'm thinking to basically never involve folio_test_hugetlb(), and the
VMAs used by guest_memfd will also never be a HugeTLB VMA. That's
because only the HugeTLB allocator is used, but by the time the folio is
mapped to userspace, it would already have been split. After the
page is split, the folio loses its HugeTLB status. guest_memfd folios
will never be mapped to userspace while they still have a HugeTLB
status.

(When 1G pages can be mapped to userspace, we will have to rethink the
above. But possibly by then HugeTLB would have been more unified with
core-mm and hence perhaps things will fall in place?)

The current uses of folio_test_hugetlb() in this patch series are

1. In alloc_migration_target_by_mpol(), which is okay because that's
   during allocation of the HugeTLB folio, before it gets split up and
   loses its status. When the folio is freed, before it is returned to
   HugeTLB, the HugeTLB status will be reinstated.

2. In kvm_gmem_prepare_folio(). If the folio hasn't been split yet, then
   we use folio_zero_user() to zero the folio, and if it has been split,
   then we use a more primitive loop to zero the folio. These two
   methods of zeroing are actually kind of the same thing and can be
   combined. This correctly uses folio_test_hugetlb().

3. In kvm_gmem_fault(), I check if folio_test_hugetlb() when doing the
   same zeroing described in (2), but this is not actually necessary and
   will be removed in a future revision, since HugeTLB folios should
   never get faulted to userspace.

>> 
>> Thank you for looking through our patches, we need your experience and
>> help! I've also just sent out the first 3 patches separately, which I
>> think is useful in improving understandability of the
>> resv_map/subpool/hstate reservation system in HugeTLB and can be
>> considered separately. Hope you can also review/comment on [4].
>
> I'll read and think about it.  Before that, I'll probably need to read more
> backgrounds you need from hugetlb allocators (e.g. I remember you mentioned
> pool management somewhere).  I tried to watch your LPC talk but the
> recording has some issue on audio so I can mostly hear nothing in most of
> the discussions..  I'll try to join the bi-weekly meeting two days later,
> though.
>

Thank you!

>> 
>> > I saw that you also mentioned you have working QEMU prototypes ready in
>> > another email.  It'll be great if you can push your kernel/QEMU's latest
>> > tree (including all dependency patches) somewhere so anyone can have a
>> > closer look, or play with it.
>> 
>> Vishal's reply [3] might have been a bit confusing. To clarify, my team
>> doesn't work with Qemu at all (we use a custom userspace VMM internally)
>> so the patches in this series are tested purely with selftests.
>> 
>> The selftests have fewer dependencies than full Qemu and I'd be happy to
>> help with running them or explain anything that I might have missed out.
>> 
>> We don't have any Qemu prototypes and are not likely to be building any
>> prototypes in the foreseeable future.
>
> I see, that's totally not a problem.  If there can be, especially !CoCo
> support at some point, we're happy to test it on QEMU side.  I'll see what
> I can do to help !CoCo kernel side getting there.
>
> Besides, it'll still be great if you can push a latest kernel tree
> somewhere (or provide the base commit ID, but that needs to be on a public
> tree I can fetch).

I should have added the base commit ID.

The base commit hash for this series is
1c4246294c9841c50805cec0627030c083e019c6.

>
> Thanks,
>
>> 
>> >
>> > Thanks,
>> >
>> > -- 
>> > Peter Xu
>> 
>> [1] https://lore.kernel.org/all/20241010085930.1546800-3-tabba@google.com/
>> [2] https://lore.kernel.org/all/f4ca1711a477a3b56406c05d125dce3d7403b936.1726009989.git.ackerleytng@google.com/
>> [3] https://lore.kernel.org/all/CAGtprH-GczOb64XrLpdW4ObRG7Gsv8tHWNhiW7=2dE=OAF7-Rw@mail.gmail.com/
>> [4] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/
>> 
>
> -- 
> Peter Xu

[1] https://lore.kernel.org/all/3ef4b32d32dca6e1b506e967c950dc2d4c3bc7ae.1726009989.git.ackerleytng@google.com/
[2] https://lore.kernel.org/all/38723c5d5e9b530e52f28b9f9f4a6d862ed69bcd.1726009989.git.ackerleytng@google.com/
David Hildenbrand Oct. 16, 2024, 8:45 a.m. UTC | #5
On 16.10.24 01:42, Ackerley Tng wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>>> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>>>>> The faultability xarray is stored on the inode since faultability is a
>>>>> property of the guest_memfd's memory contents.
>>>>>
>>>>> In this RFC, presence of an entry in the xarray indicates faultable,
>>>>> but this could be flipped so that presence indicates unfaultable. For
>>>>> flexibility, a special value "FAULT" is used instead of a simple
>>>>> boolean.
>>>>>
>>>>> However, at some stages of a VM's lifecycle there could be more
>>>>> private pages, and at other stages there could be more shared pages.
>>>>>
>>>>> This is likely to be replaced by a better data structure in a future
>>>>> revision to better support ranges.
>>>>>
>>>>> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
>>>>> pointer. inode->i_mapping->i_private_data.
>>>>
>>>> Could you help explain the difference between faultability v.s. the
>>>> existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
>>>> confused, otherwise might be good to enrich the commit message.
>>>
>>> Thank you for this question, I'll add this to the commit message in the
>>> next revision if Fuad's patch set [1] doesn't make it first.
>>>
>>> Reason (a): To elaborate on the explanation in [1],
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>>> private or shared, and faultability is whether the page is allowed to be
>>> faulted in by userspace.
>>>
>>> These two are similar but may not be the same thing. In pKVM, pKVM
>>> cannot trust userspace's configuration of private/shared, and other
>>> information will go into determining the private/shared setting in
>>> faultability.
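[Editor's note: the scheme described in the commit message, presence of an entry in the xarray means faultable, marked with a special value rather than a plain boolean, can be illustrated with a minimal userspace sketch. The names and the fixed-size stand-in array below are made up for illustration and are not the kernel data structure:]

```c
#include <assert.h>
#include <stddef.h>

#define NR_SLOTS 64

/*
 * Stand-in for the kernel xarray: GMEM_NO_ENTRY means "no entry"
 * (unfaultable); a special FAULT value, rather than a plain boolean,
 * marks an entry as faultable, leaving room for additional states later.
 */
enum { GMEM_NO_ENTRY = 0, GMEM_FAULT = 0xFA };

static int faultability[NR_SLOTS];

static void gmem_set_faultable(size_t index, int faultable)
{
	faultability[index] = faultable ? GMEM_FAULT : GMEM_NO_ENTRY;
}

static int gmem_is_faultable(size_t index)
{
	/* Presence of the marker indicates the page may be faulted in. */
	return faultability[index] == GMEM_FAULT;
}
```

Flipping the scheme so that presence indicates unfaultable would only change these two helpers, which is part of the flexibility argument.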
>>
>> It makes sense to me that the kernel has the right to decide which page is
>> shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
>> case is most / all pages are private, until some requests to share them for
>> special purposes (like DMA).  But that'll need to be initiated as a request
>> from the guest not the userspace hypervisor.
> 
> For TDX, the plan is that the guest will request the page to be remapped
> as shared or private, and the handler for that request will exit to
> the userspace VMM.
> 
> The userspace VMM will then do any necessary coordination (e.g. for a
> shared to private conversion it may need to unpin pages from DMA), and
> then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
> the guest's requested conversion. This is where
> KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
> 
> Patch 38 [1] updates
> tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
> demonstrate the usage flow for x86.
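[Editor's note: the userspace half of that handshake is a single ioctl. A minimal sketch of how a VMM might build the argument is below; the struct mirrors the KVM UAPI layout and KVM_MEMORY_ATTRIBUTE_PRIVATE is the UAPI bit, while the helper name is made up and the actual ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) call and error handling are omitted:]

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors struct kvm_memory_attributes from the KVM UAPI headers. */
struct kvm_memory_attributes {
	uint64_t address;
	uint64_t size;
	uint64_t attributes;
	uint64_t flags;
};

#define KVM_MEMORY_ATTRIBUTE_PRIVATE (1ULL << 3)

/*
 * Build the ioctl argument for converting [gpa, gpa + size) to private;
 * attributes == 0 would request shared instead. The VMM issues this
 * after agreeing with the guest's requested conversion.
 */
static struct kvm_memory_attributes build_private_conversion(uint64_t gpa,
							     uint64_t size)
{
	struct kvm_memory_attributes attrs = {
		.address = gpa,
		.size = size,
		.attributes = KVM_MEMORY_ATTRIBUTE_PRIVATE,
		.flags = 0,
	};

	return attrs;
}
```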
> 
> Fuad will be in a better position to explain the flow for pKVM.
> 
>> I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
>> planned to be used in the future. Currently it's always set at least in
>> QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
>>
>> If it's driven by the userspace side of the hypervisor, I wonder when
>> should the user app request some different value it already was, if the
>> kernel already has an answer in this case.  It made me even more confused,
>> as we have this in the API doc:
>>
>>          Note, there is no "get" API.  Userspace is responsible for
>>          explicitly tracking the state of a gfn/page as needed.
>>
>> And I do wonder whether we will still need some API just to query whether
>> the kernel allows the page to be mapped or not (aka, the "real" shared /
>> private status of a guest page).  I guess that's not directly relevant to
>> the faultability to be introduced here, but if you or anyone know please
>> kindly share, I'd love to learn about it.
> 
> The userspace VMM will track the initial shared/private state, in the
> sense that when the VM is created, the mem_attr_array is initialized
> such that the guest pages are all shared.
> 
> Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
> it should record all changes so it knows what the state is in the
> kernel.
> 
> Even if userspace VMM doesn't record the state properly, if the
> KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
> (e.g. setting an already private page to private), it will just be a
> no-op in the kernel.
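[Editor's note: the tracking and no-op behavior described above can be sketched in a few lines; the bitmap and `vmm_set_private()` are hypothetical userspace-VMM state, not kernel code:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-VM tracking in the userspace VMM: one bit per gfn. */
#define TRACKED_GFNS 128

static bool gfn_private[TRACKED_GFNS]; /* all-shared at VM creation */

/*
 * Record the outcome of a KVM_SET_MEMORY_ATTRIBUTES call. Returns true
 * if the state actually changed; requesting the current state again is
 * a no-op, matching the kernel's behavior.
 */
static bool vmm_set_private(size_t gfn, bool private)
{
	bool changed = gfn_private[gfn] != private;

	gfn_private[gfn] = private;
	return changed;
}
```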
> 
>>>
>>> Perhaps Fuad can elaborate more here.
>>>
>>> Reason (b): In this patch series (mostly focus on x86 first), we're
>>> using faultability to prevent any future faults before checking that
>>> there are no mappings.
>>>
>>> Having a different xarray from mem_attr_array allows us to disable
>>> faulting before committing to changing mem_attr_array. Please see
>>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>>>
>>> We're not completely sure about the effectiveness of using faultability
>>> to block off future faults here; in future revisions we may be using a
>>> different approach. The folio_lock() is probably important if we need to
>>> check mapcount. Please let me know if you have any ideas!
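[Editor's note: the ordering described above, disable faulting first, then check that no mappings remain, and only then commit to mem_attr_array, amounts to a small two-phase protocol. A single-page userspace sketch, with all names made up for illustration:]

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state for a single page. */
static bool page_faultable = true; /* analogous to the faultability xarray */
static bool page_private;          /* analogous to mem_attr_array */
static int page_mapcount;          /* userspace mappings of the page */

/*
 * Convert to private: first block future faults, then verify nothing is
 * still mapped, and only then commit. On failure, roll faultability back.
 */
static bool convert_to_private(void)
{
	page_faultable = false;         /* step 1: no new userspace faults */

	if (page_mapcount > 0) {        /* step 2: mappings still exist */
		page_faultable = true;  /* roll back, conversion fails */
		return false;
	}

	page_private = true;            /* step 3: commit the attribute */
	return true;
}
```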
>>>
>>> The starting point of having a different xarray was pKVM's requirement
>>> of having separate xarrays, and we later realized that the xarray could
>>> be used for reason (b). For x86 we could perhaps eventually remove the
>>> second xarray? Not sure as of now.
>>
>> Just had a quick look at patch 27:
>>
>> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>>
>> I'm not yet sure what's protecting from faultability being modified against
>> a concurrent fault().
>>
>> I wonder whether one can use the folio lock to serialize that, so that one
>> needs to take the folio lock to modify/lookup the folio's faultability,
>> then it may naturally match with the fault() handler design, where
>> kvm_gmem_get_folio() needs to lock the page first.
>>
>> But then kvm_gmem_is_faultable() will need to also be called only after the
>> folio is locked to avoid races.
> 
> My bad. In our rush to get this series out before LPC, the patch series
> was not organized very well. Patch 39 [2] adds the
> lock. filemap_invalidate_lock_shared() should make sure that faulting
> doesn't race with faultability updates.
> 
>>>> The latter is per-slot, so one level higher, however I don't think it's a
>>>> common use case for mapping the same gmemfd in multiple slots anyway for
>>>> KVM (besides corner cases like live upgrade).  So perhaps this is not about
>>>> layering but something else?  For example, any use case where PRIVATE and
>>>> FAULTABLE can be reported with different values.
>>>>
>>>> Another higher level question is, is there any plan to support non-CoCo
>>>> context for 1G?
>>>
>>> I believe guest_memfd users are generally in favor of eventually using
>>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>>> in the case of CoCo) page support.
>>>
>>> However, core-mm's fault path does not support mapping at anything
>>> higher than the PMD level (other than hugetlb_fault(), which the
>>> community wants to move away from), so core-mm wouldn't be able to map
>>> 1G pages taken from HugeTLB.
>>
>> Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
>> referring to some other challenges?
>>
> 
> IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
> PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
> core-mm using vm_operations_struct.huge_fault().


Just to clarify a bit for Peter: as has been discussed previously, there 
are rather big differences between CoCo and non-CoCo VMs.

In CoCo VMs, the primary portion of all pages are private, and they are 
not mapped into user space. Only a handful of pages are commonly shared 
and mapped into user space.

In non-CoCo VMs, all pages are shared and (for the time being) all pages 
are mapped into user space from where KVM will consume them.


Installing pmd/pud mappings into user space (recall: shared memory only) 
is currently not really a requirement for CoCo VMs, and therefore not 
the focus of this work.

Further, it's currently considered to be incompatible with getting 
in-place private<->shared conversion at *page* granularity right, as we 
will be exposing huge/gigantic folios via individual small folios to 
core-MM. Mapping a PMD/PUD that is composed of multiple folios into 
core-mm is not going to fly, unless we use a PFNMAP, which has been 
briefly discussed as well, but disregarded so far (no page pinning support).

So in the context of this work here, huge faults and PUD/PMD *user space 
page tables* do not apply.

For non-CoCo VMs there is no in-place conversion problem. One could use 
the same CoCo implementation, but without user space pud/pmd mappings. 
KVM and VFIO would have to consume this memory via the guest_memfd in 
memslots instead of via the user space mappings to more easily get 
PMD/PUD mappings into the secondary MMU. And the downsides would be 
sacrificing the vmemmap optimization and PMD/PUD user space mappings, 
while at the same time benefiting from being able to easily map only 
parts of a huge/gigantic page into user space.


So I consider pmd/pud user space mappings for non-CoCo an independent 
work item, not something that is part of the current effort of 
huge/gigantic pages with in-place conversion at page granularity for 
CoCo VMs.


More information is available in the bi-weekly upstream MM meeting (that 
was recorded) and the LPC talks, where most of that has been discussed.
David Hildenbrand Oct. 16, 2024, 8:50 a.m. UTC | #6
>> I also don't know how you treat things like folio_test_hugetlb() on
>> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
>> didn't yet check the rest of the patchset yet - reading a large series
>> without a git tree is sometimes challenging to me.
>>
> 
> I'm thinking to basically never involve folio_test_hugetlb(), and the
> VMAs used by guest_memfd will also never be a HugeTLB VMA. That's
> because only the HugeTLB allocator is used, but by the time the folio is
> mapped to userspace, it would already have been split. After the
> page is split, the folio loses its HugeTLB status. guest_memfd folios
> will never be mapped to userspace while they still have a HugeTLB
> status.

We absolutely must convert these hugetlb folios to non-hugetlb folios.

That is one of the reasons why I raised at LPC that we should focus on 
leaving hugetlb out of the picture and rather have a global pool, and 
the option to move folios from the global pool back and forth to hugetlb 
or to guest_memfd.

How exactly that would look like is TBD.

For the time being, I think we could add a "hack" to take hugetlb folios 
from hugetlb for our purposes, but we would absolutely have to convert 
them to non-hugetlb folios, especially when we split them to small 
folios and start using the mapcount. But it doesn't feel quite clean.

Simply starting with a separate global pool (e.g., a boot-time allocation 
similar to what hugetlb does, or CMA) might be cleaner, and a lot of 
stuff could be factored out from hugetlb code to achieve that.
Vishal Annapurve Oct. 16, 2024, 10:48 a.m. UTC | #7
On Wed, Oct 16, 2024 at 2:20 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> I also don't know how you treat things like folio_test_hugetlb() on
> >> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
> >> didn't yet check the rest of the patchset yet - reading a large series
> >> without a git tree is sometimes challenging to me.
> >>
> >
> > I'm thinking to basically never involve folio_test_hugetlb(), and the
> > VMAs used by guest_memfd will also never be a HugeTLB VMA. That's
> > because only the HugeTLB allocator is used, but by the time the folio is
> > mapped to userspace, it would have already have been split. After the
> > page is split, the folio loses its HugeTLB status. guest_memfd folios
> > will never be mapped to userspace while they still have a HugeTLB
> > status.
>
> We absolutely must convert these hugetlb folios to non-hugetlb folios.
>
> That is one of the reasons why I raised at LPC that we should focus on
> leaving hugetlb out of the picture and rather have a global pool, and
> the option to move folios from the global pool back and forth to hugetlb
> or to guest_memfd.
>
> How exactly that would look like is TBD.
>
> For the time being, I think we could add a "hack" to take hugetlb folios
> from hugetlb for our purposes, but we would absolutely have to convert
> them to non-hugetlb folios, especially when we split them to small
> folios and start using the mapcount. But it doesn't feel quite clean.

As hugepage folios need to be split up in order to support backing
CoCo VMs with hugepages, I would assume any folio-based hugepage
memory allocation will need to go through split/merge cycles throughout
the guest_memfd lifetime.

The plan for the next RFC series is to abstract out the hugetlb folio
management within guest_memfd so that any hugetlb-specific logic is
cleanly separated out, allowing guest_memfd to allocate memory from
other hugepage allocators in the future.

>
> Simply starting with a separate global pool (e.g., boot-time allocation
> similar to as done by hugetlb, or CMA) might be cleaner, and a lot of
> stuff could be factored out from hugetlb code to achieve that.

I am not sure if a separate global pool necessarily solves all the
issues here unless we come up with more concrete implementation
details. One of the concerns was the ability to implement/retain
HVO while transferring memory between the separate global pool and the
hugetlb pool, i.e. whether it can seamlessly serve all hugepage users
on the host. Another question could be whether the separate
pool/allocator simplifies the split/merge operations at runtime.

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Oct. 16, 2024, 11:54 a.m. UTC | #8
On 16.10.24 12:48, Vishal Annapurve wrote:
> On Wed, Oct 16, 2024 at 2:20 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> I also don't know how you treat things like folio_test_hugetlb() on
>>>> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
>>>> didn't yet check the rest of the patchset yet - reading a large series
>>>> without a git tree is sometimes challenging to me.
>>>>
>>>
>>> I'm thinking to basically never involve folio_test_hugetlb(), and the
>>> VMAs used by guest_memfd will also never be a HugeTLB VMA. That's
>>> because only the HugeTLB allocator is used, but by the time the folio is
>>> mapped to userspace, it would have already have been split. After the
>>> page is split, the folio loses its HugeTLB status. guest_memfd folios
>>> will never be mapped to userspace while they still have a HugeTLB
>>> status.
>>
>> We absolutely must convert these hugetlb folios to non-hugetlb folios.
>>
>> That is one of the reasons why I raised at LPC that we should focus on
>> leaving hugetlb out of the picture and rather have a global pool, and
>> the option to move folios from the global pool back and forth to hugetlb
>> or to guest_memfd.
>>
>> How exactly that would look like is TBD.
>>
>> For the time being, I think we could add a "hack" to take hugetlb folios
>> from hugetlb for our purposes, but we would absolutely have to convert
>> them to non-hugetlb folios, especially when we split them to small
>> folios and start using the mapcount. But it doesn't feel quite clean.
> 
> As hugepage folios need to be split up in order to support backing
> CoCo VMs with hugepages, I would assume any folio based hugepage
> memory allocation will need to go through split/merge cycles through
> the guest memfd lifetime.

Yes, that's my understanding as well.

> 
> Plan through next RFC series is to abstract out the hugetlb folio
> management within guest_memfd so that any hugetlb specific logic is
> cleanly separated out and allows guest memfd to allocate memory from
> other hugepage allocators in the future.

Yes, that must happen. As soon as a hugetlb folio transitions to 
guest_memfd, it must no longer be a hugetlb folio.

> 
>>
>> Simply starting with a separate global pool (e.g., boot-time allocation
>> similar to as done by hugetlb, or CMA) might be cleaner, and a lot of
>> stuff could be factored out from hugetlb code to achieve that.
> 
> I am not sure if a separate global pool necessarily solves all the
> issues here unless we come up with more concrete implementation
> details. One of the concerns was the ability of implementing/retaining
> HVO while transferring memory between the separate global pool and
> hugetlb pool i.e. whether it can seamlessly serve all hugepage users
> on the host.

Likely should be doable. All we need is the generalized concept of a 
folio with HVO, and a way to move these folios between owners (e.g., 
global<->hugetlb, global<->guest_memfd).

Factoring the HVO optimization out shouldn't be too crazy I believe. 
Famous last words :)

> Another question could be whether the separate
> pool/allocator simplifies the split/merge operations at runtime.

The less hugetlb hacks we have to add, the better :)
Jason Gunthorpe Oct. 16, 2024, 11:57 a.m. UTC | #9
On Wed, Oct 16, 2024 at 01:54:32PM +0200, David Hildenbrand wrote:

> Likely should be doable. All we need is the generalized concept of a folio
> with HVO, and a way to move these folios between owners (e.g.,
> global<->hugetlb, global<->guest_memfd).

+1

HVO seems to have become a sticking point in these discussions; having a
way to make any big folio HVO-optimized (and undo it), then putting
hugetlb on top of that, would be a nice refactoring.

Jason
Peter Xu Oct. 16, 2024, 8:16 p.m. UTC | #10
On Wed, Oct 16, 2024 at 10:45:43AM +0200, David Hildenbrand wrote:
> On 16.10.24 01:42, Ackerley Tng wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
> > > > Peter Xu <peterx@redhat.com> writes:
> > > > 
> > > > > On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
> > > > > > The faultability xarray is stored on the inode since faultability is a
> > > > > > property of the guest_memfd's memory contents.
> > > > > > 
> > > > > > In this RFC, presence of an entry in the xarray indicates faultable,
> > > > > > but this could be flipped so that presence indicates unfaultable. For
> > > > > > flexibility, a special value "FAULT" is used instead of a simple
> > > > > > boolean.
> > > > > > 
> > > > > > However, at some stages of a VM's lifecycle there could be more
> > > > > > private pages, and at other stages there could be more shared pages.
> > > > > > 
> > > > > > This is likely to be replaced by a better data structure in a future
> > > > > > revision to better support ranges.
> > > > > > 
> > > > > > Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
> > > > > > pointer. inode->i_mapping->i_private_data.
> > > > > 
> > > > > Could you help explain the difference between faultability v.s. the
> > > > > existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
> > > > > confused, otherwise might be good to enrich the commit message.
> > > > 
> > > > Thank you for this question, I'll add this to the commit message to the
> > > > next revision if Fuad's patch set [1] doesn't make it first.
> > > > 
> > > > Reason (a): To elaborate on the explanation in [1],
> > > > KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
> > > > private or shared, and faultability is whether the page is allowed to be
> > > > faulted in by userspace.
> > > > 
> > > > These two are similar but may not be the same thing. In pKVM, pKVM
> > > > cannot trust userspace's configuration of private/shared, and other
> > > > information will go into determining the private/shared setting in
> > > > faultability.
> > > 
> > > It makes sense to me that the kernel has the right to decide which page is
> > > shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
> > > case is most / all pages are private, until some requests to share them for
> > > special purposes (like DMA).  But that'll need to be initiated as a request
> > > from the guest not the userspace hypervisor.
> > 
> > For TDX, the plan is that the guest will request the page to be remapped
> > as shared or private, and the handler for that request will exit to
> > the userspace VMM.
> > 
> > The userspace VMM will then do any necessary coordination (e.g. for a
> > shared to private conversion it may need to unpin pages from DMA), and
> > then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
> > the guest's requested conversion. This is where
> > KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
> > 
> > Patch 38 [1] updates
> > tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
> > demonstrate the usage flow for x86.
> > 
> > Fuad will be in a better position to explain the flow for pKVM.
> > 
> > > I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
> > > planned to be used in the future. Currently it's always set at least in
> > > QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
> > > 
> > > If it's driven by the userspace side of the hypervisor, I wonder when
> > > should the user app request some different value it already was, if the
> > > kernel already has an answer in this case.  It made me even more confused,
> > > as we have this in the API doc:
> > > 
> > >          Note, there is no "get" API.  Userspace is responsible for
> > >          explicitly tracking the state of a gfn/page as needed.
> > > 
> > > And I do wonder whether we will still need some API just to query whether
> > > the kernel allows the page to be mapped or not (aka, the "real" shared /
> > > private status of a guest page).  I guess that's not directly relevant to
> > > the faultability to be introduced here, but if you or anyone know please
> > > kindly share, I'd love to learn about it.
> > 
> > The userspace VMM will track the initial shared/private state, in the
> > sense that when the VM is created, the mem_attr_array is initialized
> > such that the guest pages are all shared.
> > 
> > Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
> > it should record all changes so it knows what the state is in the
> > kernel.
> > 
> > Even if userspace VMM doesn't record the state properly, if the
> > KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
> > (e.g. setting an already private page to private), it will just be a
> > no-op in the kernel.
> > 
> > > > 
> > > > Perhaps Fuad can elaborate more here.
> > > > 
> > > > Reason (b): In this patch series (mostly focus on x86 first), we're
> > > > using faultability to prevent any future faults before checking that
> > > > there are no mappings.
> > > > 
> > > > Having a different xarray from mem_attr_array allows us to disable
> > > > faulting before committing to changing mem_attr_array. Please see
> > > > `kvm_gmem_should_set_attributes_private()` in this patch [2].
> > > > 
> > > > We're not completely sure about the effectiveness of using faultability
> > > > to block off future faults here, in future revisions we may be using a
> > > > different approach. The folio_lock() is probably important if we need to
> > > > check mapcount. Please let me know if you have any ideas!
> > > > 
> > > > The starting point of having a different xarray was pKVM's requirement
> > > > of having separate xarrays, and we later realized that the xarray could
> > > > be used for reason (b). For x86 we could perhaps eventually remove the
> > > > second xarray? Not sure as of now.
> > > 
> > > Just had a quick look at patch 27:
> > > 
> > > https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
> > > 
> > > I'm not yet sure what's protecting from faultability being modified against
> > > a concurrent fault().
> > > 
> > > I wonder whether one can use the folio lock to serialize that, so that one
> > > needs to take the folio lock to modify/lookup the folio's faultability,
> > > then it may naturally match with the fault() handler design, where
> > > kvm_gmem_get_folio() needs to lock the page first.
> > > 
> > > But then kvm_gmem_is_faultable() will need to also be called only after the
> > > folio is locked to avoid races.
> > 
> > My bad. In our rush to get this series out before LPC, the patch series
> > was not organized very well. Patch 39 [2] adds the
> > lock. filemap_invalidate_lock_shared() should make sure that faulting
> > doesn't race with faultability updates.
> > 
> > > > > The latter is per-slot, so one level higher, however I don't think it's a
> > > > > common use case for mapping the same gmemfd in multiple slots anyway for
> > > > > KVM (besides corner cases like live upgrade).  So perhaps this is not about
> > > > > layering but something else?  For example, any use case where PRIVATE and
> > > > > FAULTABLE can be reported with different values.
> > > > > 
> > > > > Another higher level question is, is there any plan to support non-CoCo
> > > > > context for 1G?
> > > > 
> > > > I believe guest_memfd users are generally in favor of eventually using
> > > > guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
> > > > in the case of CoCo) page support.
> > > > 
> > > > However, core-mm's fault path does not support mapping at anything
> > > > higher than the PMD level (other than hugetlb_fault(), which the
> > > > community wants to move away from), so core-mm wouldn't be able to map
> > > > 1G pages taken from HugeTLB.
> > > 
> > > Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
> > > referring to some other challenges?
> > > 
> > 
> > IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
> > PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
> > core-mm using vm_operations_struct.huge_fault().
> 
> 
> Just to clarify a bit for Peter: as has been discussed previously, there are
> rather big difference between CoCo and non-CoCo VMs.
> 
> In CoCo VMs, the primary portion of all pages are private, and they are not
> mapped into user space. Only a handful of pages are commonly shared and
> mapped into user space.
> 
> In non-CoCo VMs, all pages are shared and (for the time being) all pages are
> mapped into user space from where KVM will consume them.
> 
> 
> Installing pmd/pud mappings into user space (recall: shared memory only) is
> currently not really a requirement for CoCo VMs, and therefore not the focus
> of this work.
> 
> Further, it's currently considered to be incompatible with getting in-place
> private<->share conversion on *page* granularity right, as we will be
> exposing huge/gigantic folios via individual small folios to core-MM.
> Mapping a PMD/PUD into core-mm, that is composed of multiple folios is not
> going to fly, unless using a PFNMAP, which has been briefly discussed as
> well, bu disregarded so far (no page pinning support).
> 
> So in the context of this work here, huge faults and PUD/PMD *user space
> page tables* do not apply.
> 
> For non-CoCo VMs there is no in-place conversion problem. One could use the
> same CoCo implementation, but without user space pud/pmd mappings. KVM and
> VFIO would have to consume this memory via the guest_memfd in memslots
> instead of via the user space mappings to more easily get PMD/PUD mappings
> into the secondary MMU. And the downsides would be sacrificing the vmemmap

Is there a chance that, when !CoCo is supported, external modules
(e.g. VFIO) can reuse the old user mappings, just like before gmemfd?

To support CoCo, I understand gmem+offset is required all over the place.
However, in a non-CoCo context, I wonder whether the other modules are
required to stick with gmem+offset, or whether they can reuse the old VA
ways, because how it works can fundamentally be the same as before, except
that the folios now will be managed by gmemfd.

I think the good thing with such an approach is that when developing CoCo
support for all these modules, there are fewer constraints / concerns about
staying compatible with the non-CoCo use case. It'll also make it even
easier to use in production before all CoCo facilities are ready, as most
of the infrastructure has already been around and in use for years,
provided the VA can be mapped and GUPed like before.

Thanks,

> optimization and PMD/PUD user space mappings, while at the same time
> benefiting from being able to easily map only parts of a huge/gigantic page
> into user space.
> 
> 
> So I consider pmd/pud user space mappings for non-CoCo an independent work
> item, not something that is part of the current effort of huge/gigantic
> pages with in-place conversion at page granularity for CoCo VMs.
> 
> 
> More information is available in the bi-weekly upstream MM meeting (that was
> recorded) and the LPC talks, where most of that has been discussed.
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
Jason Gunthorpe Oct. 16, 2024, 10:51 p.m. UTC | #11
On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> 
> Is there chance that when !CoCo will be supported, then external modules
> (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> 
> To support CoCo, I understand gmem+offset is required all over the places.
> However in a non-CoCo context, I wonder whether the other modules are
> required to stick with gmem+offset, or they can reuse the old VA ways,
> because how it works can fundamentally be the same as before, except that
> the folios now will be managed by gmemfd.

My intention with iommufd was to see fd + offset as the "new" way
to refer to all guest memory and discourage people from using VMA
handles.

Jason
Peter Xu Oct. 16, 2024, 11:49 p.m. UTC | #12
On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> > 
> > Is there chance that when !CoCo will be supported, then external modules
> > (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> > 
> > To support CoCo, I understand gmem+offset is required all over the places.
> > However in a non-CoCo context, I wonder whether the other modules are
> > required to stick with gmem+offset, or they can reuse the old VA ways,
> > because how it works can fundamentally be the same as before, except that
> > the folios now will be managed by gmemfd.
> 
> My intention with iommufd was to see fd + offest as the "new" way
> to refer to all guest memory and discourage people from using VMA
> handles.

Does it mean anonymous memory guests will not be supported at all for
iommufd?

Indeed it's very rare now and loses quite some flexibility (vs. fd-based),
and I can't think of much besides some default configs or KSM users (which
I would expect to be rare), but still I wonder whether there are other use
cases where people would need to stick with anon, hence no fd is around.

Thanks,
Jason Gunthorpe Oct. 16, 2024, 11:54 p.m. UTC | #13
On Wed, Oct 16, 2024 at 07:49:31PM -0400, Peter Xu wrote:
> On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
> > On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> > > 
> > > Is there chance that when !CoCo will be supported, then external modules
> > > (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> > > 
> > > To support CoCo, I understand gmem+offset is required all over the places.
> > > However in a non-CoCo context, I wonder whether the other modules are
> > > required to stick with gmem+offset, or they can reuse the old VA ways,
> > > because how it works can fundamentally be the same as before, except that
> > > the folios now will be managed by gmemfd.
> > 
> > My intention with iommufd was to see fd + offest as the "new" way
> > to refer to all guest memory and discourage people from using VMA
> > handles.
> 
> Does it mean anonymous memory guests will not be supported at all for
> iommufd?

No, they can use the "old" way with normal VMA's still, or they can
use an anonymous memfd with the new way..

I just don't expect to have new complex stuff built on the VMA
interface - I don't expect guestmemfd VMAs to work.

Jason
David Hildenbrand Oct. 17, 2024, 2:56 p.m. UTC | #14
On 17.10.24 01:49, Peter Xu wrote:
> On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
>> On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
>>>
>>> Is there chance that when !CoCo will be supported, then external modules
>>> (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
>>>
>>> To support CoCo, I understand gmem+offset is required all over the places.
>>> However in a non-CoCo context, I wonder whether the other modules are
>>> required to stick with gmem+offset, or they can reuse the old VA ways,
>>> because how it works can fundamentally be the same as before, except that
>>> the folios now will be managed by gmemfd.
>>
>> My intention with iommufd was to see fd + offest as the "new" way
>> to refer to all guest memory and discourage people from using VMA
>> handles.
> 
> Does it mean anonymous memory guests will not be supported at all for
> iommufd?
> 
> Indeed it's very rare now, lose quite some flexibility (v.s. fd based), and
> I can't think of a lot besides some default configs or KSM users (which I
> would expect rare), but still I wonder there're other use cases that people
> would still need to stick with anon, hence fd isn't around.

Not sure I completely understand the question, but for most VMs out 
there I expect anonymous memory to remain the default memory backing.

Regarding users of iommufd, I have absolutely no clue :)
Peter Xu Oct. 17, 2024, 2:58 p.m. UTC | #15
On Wed, Oct 16, 2024 at 08:54:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 16, 2024 at 07:49:31PM -0400, Peter Xu wrote:
> > On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> > > > 
> > > > Is there chance that when !CoCo will be supported, then external modules
> > > > (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> > > > 
> > > > To support CoCo, I understand gmem+offset is required all over the places.
> > > > However in a non-CoCo context, I wonder whether the other modules are
> > > > required to stick with gmem+offset, or they can reuse the old VA ways,
> > > > because how it works can fundamentally be the same as before, except that
> > > > the folios now will be managed by gmemfd.
> > > 
> > > My intention with iommufd was to see fd + offest as the "new" way
> > > to refer to all guest memory and discourage people from using VMA
> > > handles.
> > 
> > Does it mean anonymous memory guests will not be supported at all for
> > iommufd?
> 
> No, they can use the "old" way with normal VMA's still, or they can
> use an anonymous memfd with the new way..
> 
> I just don't expect to have new complex stuff built on the VMA
> interface - I don't expect guestmemfd VMAs to work.

Yes, with guestmemfd already there we probably don't need to bother with
the VA interface.

Similarly, since guestmemfd already supports KVM_SET_USER_MEMORY_REGION2,
it's not a problem at all to use fd+offset for this KVM API.
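
For reference, a rough userspace-side sketch of what the fd+offset form of
that API looks like. The struct layout below is copied from my reading of
recent linux/kvm.h uapi headers, so treat the exact fields and flag values
as illustrative rather than authoritative:

```c
#include <stdint.h>
#include <string.h>

/*
 * Illustrative copy of the KVM uapi struct (linux/kvm.h, recent
 * kernels); shown here for discussion only -- use the real header in
 * actual code.
 */
struct kvm_userspace_memory_region2 {
	uint32_t slot;
	uint32_t flags;
	uint64_t guest_phys_addr;
	uint64_t memory_size;
	uint64_t userspace_addr;
	uint64_t guest_memfd_offset;
	uint32_t guest_memfd;
	uint32_t pad1;
	uint64_t pad2[14];
};

#define KVM_MEM_GUEST_MEMFD (1UL << 2)

/* Describe guest memory by (guest_memfd, offset) instead of only a VA. */
static struct kvm_userspace_memory_region2
gmem_slot(uint32_t slot, uint64_t gpa, uint64_t size, int gmem_fd,
	  uint64_t gmem_offset, uint64_t shared_va)
{
	struct kvm_userspace_memory_region2 r;

	memset(&r, 0, sizeof(r));
	r.slot = slot;
	r.flags = KVM_MEM_GUEST_MEMFD;
	r.guest_phys_addr = gpa;
	r.memory_size = size;
	/* userspace_addr still names the shared (faultable) mapping. */
	r.userspace_addr = shared_va;
	r.guest_memfd = (uint32_t)gmem_fd;
	r.guest_memfd_offset = gmem_offset;
	return r;
}
```

The filled struct would then be passed to ioctl(vm_fd,
KVM_SET_USER_MEMORY_REGION2, &r).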

My question was more towards whether gmemfd could still expose the
possibility of being used in VA form by other modules that may not support
fd+offsets yet.  And I assume your use of the word "VMA" means "VA
ranges", while a "gmemfd VMA" on its own is probably OK?  That is what is
proposed in this series with the fault handler.

It may not be a problem for many cloud providers, but if QEMU is involved,
it's still pretty flexible, and QEMU will need to add fd+offset support for
many of the existing interfaces that are mostly based on VA or VA ranges.
I believe that includes QEMU itself, aka, the user hypervisor (which is
about how the user app should access shared pages that KVM allows to be
faulted), vhost-kernel (more GUP-oriented), vhost-user (similar to the
user-app side), etc.

I think as long as we can provide gmemfd VMAs like what this series
provides, it sounds possible to reuse the old VA interfaces before the CoCo
interfaces are ready, so that people can already start leveraging gmemfd
backing pages.

The idea is in general nice to me - QEMU used to have a requirement where
we wanted strict vIOMMU semantics between QEMU and another process that
runs the device emulation (aka, vhost-user).  We didn't want to map all
guest RAM all the time, because even today an OVS bug can corrupt QEMU
memory even when a vIOMMU is present (which should be able to prevent
this, but only logically..).  We used to have the idea of sending one fd
to the vhost-user process while keeping control of what is mapped and
what can be zapped.

This gmemfd case is mostly what we used to pursue before, in
that:

  - It allows mmap() of a guest memory region (without yet the capability
    to access all of it... otherwise protection could be bypassed, no
    matter whether it's for CoCo or a vIOMMU in this case)

  - It allows the main process (in this case, it can be QEMU/KVM or
    anything/KVM) to control how to fault in the pages, in this case gmemfd
    lazily faults in the pages only if they're faultable / shared

  - It allows remote teardown of pages that are not faultable / shared
    anymore, which guarantees that the other process cannot access any
    page it was not authorized to access

I wonder if it's good enough even for CoCo's use case, where if anyone
wants to illegally access some page, it'll simply crash.
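
To make the properties above concrete, here is a toy userspace model of
that behavior (purely illustrative -- none of these names exist in the
kernel): faults succeed only on faultable/shared offsets, anything else
behaves like a SIGBUS, and a shared->private conversion first zaps the
user mapping and then forbids future faults.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: GMEM_NPAGES etc. are made-up names for illustration. */
#define GMEM_NPAGES 16

enum fault_result { FAULT_OK, FAULT_SIGBUS };

struct gmem_model {
	bool faultable[GMEM_NPAGES];	/* stand-in for the faultability xarray */
	bool mapped[GMEM_NPAGES];	/* is a userspace PTE present? */
};

/* fault(): only faultable (shared) offsets may be mapped to userspace. */
static enum fault_result gmem_fault(struct gmem_model *g, size_t pgoff)
{
	if (pgoff >= GMEM_NPAGES || !g->faultable[pgoff])
		return FAULT_SIGBUS;	/* private page: never user-mapped */
	g->mapped[pgoff] = true;
	return FAULT_OK;
}

/* shared -> private: zap the user mapping, then disallow future faults. */
static void gmem_make_private(struct gmem_model *g, size_t pgoff)
{
	if (pgoff >= GMEM_NPAGES)
		return;
	g->mapped[pgoff] = false;	/* the "remote teardown" */
	g->faultable[pgoff] = false;
}
```

So any illegal access from the other process simply faults and crashes it,
rather than exposing an unauthorized page.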

Besides that, we can definitely also make good use of non-CoCo 1G pages,
either for a postcopy solution (what James used to work on for HGM), or
for hwpoisoning (where at least the latter is, I believe, still a common
issue for all of us: making hwpoison work for hugetlbfs at PAGE_SIZE
granule [1]).  The former will still be required at least for QEMU to
leverage the split-ability of gmemfd huge folios.

Then even if both KVM ioctls + iommufd ioctls will only support fd+offsets,
as long as the shared portion of the gmemfd folios is allowed to be faulted
and GUPed, gmemfd can start to be considered as a replacement for hugetlb
to overcome those difficulties, even before CoCo is supported all over the
places.  There's also a question of whether all the known modules will
eventually support fd+offsets, which I'm not sure about.  If some module
won't support it, maybe it can still work with gmemfd in VA ranges so that
it can still benefit from what gmemfd can provide.

So in short, I'm not sure if the use case can use a combination of (fd,
offset) interfacing for some modules like KVM/iommufd, with VA ranges like
before for some others.

Thanks,

[1] https://lore.kernel.org/all/20240924043924.3562257-1-jiaqiyan@google.com/
David Hildenbrand Oct. 17, 2024, 3:02 p.m. UTC | #16
On 16.10.24 22:16, Peter Xu wrote:
> On Wed, Oct 16, 2024 at 10:45:43AM +0200, David Hildenbrand wrote:
>> On 16.10.24 01:42, Ackerley Tng wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>>> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>>>>> Peter Xu <peterx@redhat.com> writes:
>>>>>
>>>>>> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>>>>>>> The faultability xarray is stored on the inode since faultability is a
>>>>>>> property of the guest_memfd's memory contents.
>>>>>>>
>>>>>>> In this RFC, presence of an entry in the xarray indicates faultable,
>>>>>>> but this could be flipped so that presence indicates unfaultable. For
>>>>>>> flexibility, a special value "FAULT" is used instead of a simple
>>>>>>> boolean.
>>>>>>>
>>>>>>> However, at some stages of a VM's lifecycle there could be more
>>>>>>> private pages, and at other stages there could be more shared pages.
>>>>>>>
>>>>>>> This is likely to be replaced by a better data structure in a future
>>>>>>> revision to better support ranges.
>>>>>>>
>>>>>>> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
>>>>>>> pointer. inode->i_mapping->i_private_data.
>>>>>>
>>>>>> Could you help explain the difference between faultability v.s. the
>>>>>> existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
>>>>>> confused, otherwise might be good to enrich the commit message.
>>>>>
>>>>> Thank you for this question, I'll add this to the commit message to the
>>>>> next revision if Fuad's patch set [1] doesn't make it first.
>>>>>
>>>>> Reason (a): To elaborate on the explanation in [1],
>>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>>>>> private or shared, and faultability is whether the page is allowed to be
>>>>> faulted in by userspace.
>>>>>
>>>>> These two are similar but may not be the same thing. In pKVM, pKVM
>>>>> cannot trust userspace's configuration of private/shared, and other
>>>>> information will go into determining the private/shared setting in
>>>>> faultability.
>>>>
>>>> It makes sense to me that the kernel has the right to decide which page is
>>>> shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
>>>> case is most / all pages are private, until some requests to share them for
>>>> special purposes (like DMA).  But that'll need to be initiated as a request
>>>> from the guest not the userspace hypervisor.
>>>
>>> For TDX, the plan is that the guest will request the page to be remapped
>>> as shared or private, and the handler for that request will exit to
>>> the userspace VMM.
>>>
>>> The userspace VMM will then do any necessary coordination (e.g. for a
>>> shared to private conversion it may need to unpin pages from DMA), and
>>> then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
>>> the guest's requested conversion. This is where
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
>>>
>>> Patch 38 [1] updates
>>> tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
>>> demonstrate the usage flow for x86.
>>>
>>> Fuad will be in a better position to explain the flow for pKVM.
>>>
>>>> I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
>>>> planned to be used in the future. Currently it's always set at least in
>>>> QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
>>>>
>>>> If it's driven by the userspace side of the hypervisor, I wonder when
>>>> should the user app request some different value it already was, if the
>>>> kernel already has an answer in this case.  It made me even more confused,
>>>> as we have this in the API doc:
>>>>
>>>>           Note, there is no "get" API.  Userspace is responsible for
>>>>           explicitly tracking the state of a gfn/page as needed.
>>>>
>>>> And I do wonder whether we will still need some API just to query whether
>>>> the kernel allows the page to be mapped or not (aka, the "real" shared /
>>>> private status of a guest page).  I guess that's not directly relevant to
>>>> the faultability to be introduced here, but if you or anyone know please
>>>> kindly share, I'd love to learn about it.
>>>
>>> The userspace VMM will track the initial shared/private state, in the
>>> sense that when the VM is created, the mem_attr_array is initialized
>>> such that the guest pages are all shared.
>>>
>>> Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
>>> it should record all changes so it knows what the state is in the
>>> kernel.
>>>
>>> Even if userspace VMM doesn't record the state properly, if the
>>> KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
>>> (e.g. setting an already private page to private), it will just be a
>>> no-op in the kernel.
>>>
>>>>>
>>>>> Perhaps Fuad can elaborate more here.
>>>>>
>>>>> Reason (b): In this patch series (mostly focus on x86 first), we're
>>>>> using faultability to prevent any future faults before checking that
>>>>> there are no mappings.
>>>>>
>>>>> Having a different xarray from mem_attr_array allows us to disable
>>>>> faulting before committing to changing mem_attr_array. Please see
>>>>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>>>>>
>>>>> We're not completely sure about the effectiveness of using faultability
>>>>> to block off future faults here, in future revisions we may be using a
>>>>> different approach. The folio_lock() is probably important if we need to
>>>>> check mapcount. Please let me know if you have any ideas!
>>>>>
>>>>> The starting point of having a different xarray was pKVM's requirement
>>>>> of having separate xarrays, and we later realized that the xarray could
>>>>> be used for reason (b). For x86 we could perhaps eventually remove the
>>>>> second xarray? Not sure as of now.
>>>>
>>>> Just had a quick look at patch 27:
>>>>
>>>> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>>>>
>>>> I'm not yet sure what's protecting from faultability being modified against
>>>> a concurrent fault().
>>>>
>>>> I wonder whether one can use the folio lock to serialize that, so that one
>>>> needs to take the folio lock to modify/lookup the folio's faultability,
>>>> then it may naturally match with the fault() handler design, where
>>>> kvm_gmem_get_folio() needs to lock the page first.
>>>>
>>>> But then kvm_gmem_is_faultable() will need to also be called only after the
>>>> folio is locked to avoid races.
>>>
>>> My bad. In our rush to get this series out before LPC, the patch series
>>> was not organized very well. Patch 39 [2] adds the
>>> lock. filemap_invalidate_lock_shared() should make sure that faulting
>>> doesn't race with faultability updates.
>>>
>>>>>> The latter is per-slot, so one level higher, however I don't think it's a
>>>>>> common use case for mapping the same gmemfd in multiple slots anyway for
>>>>>> KVM (besides corner cases like live upgrade).  So perhaps this is not about
>>>>>> layering but something else?  For example, any use case where PRIVATE and
>>>>>> FAULTABLE can be reported with different values.
>>>>>>
>>>>>> Another higher level question is, is there any plan to support non-CoCo
>>>>>> context for 1G?
>>>>>
>>>>> I believe guest_memfd users are generally in favor of eventually using
>>>>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>>>>> in the case of CoCo) page support.
>>>>>
>>>>> However, core-mm's fault path does not support mapping at anything
>>>>> higher than the PMD level (other than hugetlb_fault(), which the
>>>>> community wants to move away from), so core-mm wouldn't be able to map
>>>>> 1G pages taken from HugeTLB.
>>>>
>>>> Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
>>>> referring to some other challenges?
>>>>
>>>
>>> IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
>>> PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
>>> core-mm using vm_operations_struct.huge_fault().
>>
>>
>> Just to clarify a bit for Peter: as has been discussed previously, there are
>> rather big difference between CoCo and non-CoCo VMs.
>>
>> In CoCo VMs, the primary portion of all pages are private, and they are not
>> mapped into user space. Only a handful of pages are commonly shared and
>> mapped into user space.
>>
>> In non-CoCo VMs, all pages are shared and (for the time being) all pages are
>> mapped into user space from where KVM will consume them.
>>
>>
>> Installing pmd/pud mappings into user space (recall: shared memory only) is
>> currently not really a requirement for CoCo VMs, and therefore not the focus
>> of this work.
>>
>> Further, it's currently considered to be incompatible with getting in-place
>> private<->share conversion on *page* granularity right, as we will be
>> exposing huge/gigantic folios via individual small folios to core-MM.
>> Mapping a PMD/PUD into core-mm, that is composed of multiple folios is not
>> going to fly, unless using a PFNMAP, which has been briefly discussed as
>> well, bu disregarded so far (no page pinning support).
>>
>> So in the context of this work here, huge faults and PUD/PMD *user space
>> page tables* do not apply.
>>
>> For non-CoCo VMs there is no in-place conversion problem. One could use the
>> same CoCo implementation, but without user space pud/pmd mappings. KVM and
>> VFIO would have to consume this memory via the guest_memfd in memslots
>> instead of via the user space mappings to more easily get PMD/PUD mappings
>> into the secondary MMU. And the downsides would be sacrificing the vmemmap
> 
> Is there chance that when !CoCo will be supported, then external modules
> (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?

I expect this at least initially to be the case. At some point, we might 
see a transition to fd+offset for some interfaces.

I recall that there was a similar discussion when specifying "shared" 
memory in a KVM memory slot that will be backed by a guest_memfd: 
initially, this would be via VA and not via guest_memfd+offset. I recall 
Sean and James want it to stay that way (sorry if I am wrong!), and 
James might require that to get the fancy uffd mechanism flying.

> 
> To support CoCo, I understand gmem+offset is required all over the places.
> However in a non-CoCo context, I wonder whether the other modules are
> required to stick with gmem+offset, or they can reuse the old VA ways,
> because how it works can fundamentally be the same as before, except that
> the folios now will be managed by gmemfd.
> 
> I think the good thing with such approach is when developing CoCo support
> for all these modules, there's less constraints / concerns to be compatible
> with non-CoCo use case, also it'll make it even easier to be used in
> production before all CoCo facilities ready, as most infrastructures are
> already around and being used for years if VA can be mapped and GUPed like
> before.

Right, but even if most interfaces support guest_memfd+offset, things 
like direct I/O (O_DIRECT) to shared guest memory will require VA+GUP 
(someone brought that up at LPC).
Jason Gunthorpe Oct. 17, 2024, 4:47 p.m. UTC | #17
On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:

> My question was more torwards whether gmemfd could still expose the
> possibility to be used in VA forms to other modules that may not support
> fd+offsets yet.

I keep hearing they don't want to support page pinning on a guestmemfd
mapping, so VA based paths could not work.

> I think as long as we can provide gmemfd VMAs like what this series
> provides, it sounds possible to reuse the old VA interfaces before the CoCo
> interfaces are ready, so that people can already start leveraging gmemfd
> backing pages.

And you definitely can't get the private pages out of the VA interface
because all the VMA PTEs of private pages are non-present by definition.

Hence, you must use the FD for a lot of use cases here.

Jason
Peter Xu Oct. 17, 2024, 5:05 p.m. UTC | #18
On Thu, Oct 17, 2024 at 01:47:13PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> 
> > My question was more torwards whether gmemfd could still expose the
> > possibility to be used in VA forms to other modules that may not support
> > fd+offsets yet.
> 
> I keep hearing they don't want to support page pinning on a guestmemfd
> mapping, so VA based paths could not work.

Do you remember the reasoning behind it?  Is it because CoCo still needs to
have a bounded time window to convert from shared back to private?  If so,
maybe that's a non-issue for non-CoCo, where the VM object / gmemfd object
(when created) can have a flag marking that it's always shared and can
never be converted to private for any page within.

So how would VFIO's DMA work even with iommufd if pages cannot be pinned?
Is some form of bounce buffering required, then?

It sounds like if so there'll be a lot of use cases that won't work with
current infrastructure..

> 
> > I think as long as we can provide gmemfd VMAs like what this series
> > provides, it sounds possible to reuse the old VA interfaces before the CoCo
> > interfaces are ready, so that people can already start leveraging gmemfd
> > backing pages.
> 
> And you definitely can't get the private pages out of the VA interface
> because all the VMA PTEs of private pages are non-present by definition.

It's the same as "not present" if fault() always gets a SIGBUS for
private pages, IIUC.

My prior references to "VA ranges" are mostly only for shared / faultable
pages. And they'll get zapped too when requested to be converted from
shared -> private, aka, always not present for private.

> 
> Hence, you must use the FD for a lot of use cases here.

Thanks,
Jason Gunthorpe Oct. 17, 2024, 5:10 p.m. UTC | #19
On Thu, Oct 17, 2024 at 01:05:34PM -0400, Peter Xu wrote:
> On Thu, Oct 17, 2024 at 01:47:13PM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> > 
> > > My question was more torwards whether gmemfd could still expose the
> > > possibility to be used in VA forms to other modules that may not support
> > > fd+offsets yet.
> > 
> > I keep hearing they don't want to support page pinning on a guestmemfd
> > mapping, so VA based paths could not work.
> 
> Do you remember the reasoning of it?  Is it because CoCo still needs to
> have a bounded time window to convert from shared back to private?  

I think so

> If so, maybe that's a non-issue for non-CoCo, where the VM object /
> gmemfd object (when created) can have a flag marking that it's
> always shared and can never be converted to private for any page
> within.

What is non-CoCo? Does it include the private/shared concept?

> So how would VFIO's DMA work even with iommufd if pages cannot be pinned?
> Is some form of bounce buffering required, then?

We can do some kind of atomic replace during a private/shared
exchange. In some HW cases the iommu table doesn't even need an
update.

It will be tricky stuff.

Jason
David Hildenbrand Oct. 17, 2024, 5:11 p.m. UTC | #20
On 17.10.24 18:47, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> 
>> My question was more torwards whether gmemfd could still expose the
>> possibility to be used in VA forms to other modules that may not support
>> fd+offsets yet.
> 
> I keep hearing they don't want to support page pinning on a guestmemfd
> mapping, so VA based paths could not work.

For shared pages it absolutely must work. That's what I keep hearing :)

> 
>> I think as long as we can provide gmemfd VMAs like what this series
>> provides, it sounds possible to reuse the old VA interfaces before the CoCo
>> interfaces are ready, so that people can already start leveraging gmemfd
>> backing pages.
> 
> And you definitely can't get the private pages out of the VA interface
> because all the VMA PTEs of private pages are non-present by definition.

Agreed.
Jason Gunthorpe Oct. 17, 2024, 5:16 p.m. UTC | #21
On Thu, Oct 17, 2024 at 07:11:46PM +0200, David Hildenbrand wrote:
> On 17.10.24 18:47, Jason Gunthorpe wrote:
> > On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> > 
> > > My question was more torwards whether gmemfd could still expose the
> > > possibility to be used in VA forms to other modules that may not support
> > > fd+offsets yet.
> > 
> > I keep hearing they don't want to support page pinning on a guestmemfd
> > mapping, so VA based paths could not work.
> 
> For shared pages it absolutely must work. That's what I keep hearing :)

Oh that's confusing. I assume no longterm pins are desired on shared
pages though??

Jason
David Hildenbrand Oct. 17, 2024, 5:55 p.m. UTC | #22
On 17.10.24 19:16, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 07:11:46PM +0200, David Hildenbrand wrote:
>> On 17.10.24 18:47, Jason Gunthorpe wrote:
>>> On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
>>>
>>>> My question was more torwards whether gmemfd could still expose the
>>>> possibility to be used in VA forms to other modules that may not support
>>>> fd+offsets yet.
>>>
>>> I keep hearing they don't want to support page pinning on a guestmemfd
>>> mapping, so VA based paths could not work.
>>
>> For shared pages it absolutely must work. That's what I keep hearing :)
> 
> Oh that's confusing. I assume non longterm pins desired on shared
> pages though??

For user space to drive I/O to shared pages, GUP is often required 
(e.g., O_DIRECT), as was raised at LPC in a session IIRC (someone 
brought up a use case that involved vhost-user and friends).

Of course, for the guest_memfd use cases where we want to also remove 
shared pages from the directmap, it's not possible, but let's put that 
aside (I recall there was a brief discussion at LPC about that: it's 
tricky for shared memory for exactly this reason -- I/O).

longterm pins would have to be used with care, and it's under user-space 
control, and user-space must be aware of the implications: for example, 
registering shared pages as fixed buffers for liburing is possible, but 
when a conversion to private is requested it must unregister these buffers.

(in VFIO terms, a prior unmap operation would be required)

Of course, a conversion to private will not work as long as the pages 
are pinned, and this is under user space control.

If the guest attempts to perform such a conversion while pages are 
pinned, there will likely be a notification to user space (we touched on 
that today in the upstream call) that something is blocking the 
conversion of that page, and user space has to fix that up and retry.

It's not expected to matter much in practice, but it can be triggered 
and there must be a way to handle it: if a guest triggers a 
shared->private conversion while there is still I/O going on the page, 
something is messed up, and the conversion will be delayed until the I/O 
is done and the page can be converted.
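
A toy model of that retry flow (illustrative only; the names and the
choice of error code are stand-ins, not the real kernel interface):
conversion is refused while extra references are held, and userspace is
expected to drop them and retry.

```c
#include <errno.h>

/*
 * Toy model of the conversion-vs-pin interaction described above (the
 * names are made up; this is not the kernel implementation): a
 * shared->private conversion fails with -EAGAIN while userspace still
 * holds pins (io_uring fixed buffers, VFIO DMA mappings, ...), and is
 * retried once those are dropped.
 */
struct gpage_model {
	int pins;		/* outstanding longterm pins */
	int is_private;		/* 1 once converted */
};

static int convert_to_private(struct gpage_model *p)
{
	if (p->pins > 0)
		return -EAGAIN;	/* userspace must unpin and retry */
	p->is_private = 1;
	return 0;
}
```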

There are still quite some things to be clarified, but this is my 
understanding so far.
Vishal Annapurve Oct. 17, 2024, 6:26 p.m. UTC | #23
On Thu, Oct 17, 2024 at 10:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 17, 2024 at 07:11:46PM +0200, David Hildenbrand wrote:
> > On 17.10.24 18:47, Jason Gunthorpe wrote:
> > > On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> > >
> > > > My question was more torwards whether gmemfd could still expose the
> > > > possibility to be used in VA forms to other modules that may not support
> > > > fd+offsets yet.
> > >
> > > I keep hearing they don't want to support page pinning on a guestmemfd
> > > mapping, so VA based paths could not work.
> >
> > For shared pages it absolutely must work. That's what I keep hearing :)
>
> Oh that's confusing. I assume non longterm pins desired on shared
> pages though??
>
> Jason

For hugepage support to work, longterm pins on guest private pages
need to be avoided [1] - in case that was the cause of any confusion
here.

[1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
(slide 12)
Peter Xu Oct. 17, 2024, 7:11 p.m. UTC | #24
On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
> > If so, maybe that's a non-issue for non-CoCo, where the VM object /
> > gmemfd object (when created) can have a flag marking that it's
> > always shared and can never be converted to private for any page
> > within.
> 
> What is non-CoCo? Does it include the private/shared concept?

I used that to represent the possible gmemfd use cases outside confidential
computing.

So the private/shared things should still be around as a fundamental
property of gmemfd, but it would be always shared, with no conversion
needed for the whole lifecycle of the gmemfd when marked !CoCo.

Basically, that's the KVM-only hugetlbfs v2.. especially if this series
will move on with hugetlb allocators, that's even closer.. which makes
some sense to me, at least for now, to avoid reinventing the wheel all
over the places for cgroup/pool/meminfo/etc.
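
The always-shared idea could be modeled like this (GMEM_ALWAYS_SHARED is
a hypothetical creation flag I'm making up for illustration, nothing of
the sort exists in the uapi today): conversion to private is simply
refused over the whole lifecycle.

```c
#include <errno.h>

/*
 * Toy model of an always-shared gmemfd: GMEM_ALWAYS_SHARED is a
 * hypothetical creation flag (not a real uapi flag); with it set,
 * shared->private conversion is refused for the fd's whole lifecycle.
 */
#define GMEM_ALWAYS_SHARED 0x1u

struct gmemfd_model {
	unsigned int flags;
	int nr_private;		/* pages currently private */
};

static int gmem_convert_to_private(struct gmemfd_model *g, int npages)
{
	if (g->flags & GMEM_ALWAYS_SHARED)
		return -EINVAL;	/* !CoCo: everything stays shared */
	g->nr_private += npages;
	return 0;
}
```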
Jason Gunthorpe Oct. 17, 2024, 7:18 p.m. UTC | #25
On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
> > > If so, maybe that's a non-issue for non-CoCo, where the VM object /
> > > gmemfd object (when created) can have a flag marking that it's
> > > always shared and can never be converted to private for any page
> > > within.
> > 
> > What is non-CoCo? Does it include the private/shared concept?
> 
> I used that to represent the possible gmemfd use cases outside confidential
> computing.
> 
> So the private/shared things should still be around as fundamental property
> of gmemfd, but it should be always shared and no convertion needed for the
> whole lifecycle of the gmemfd when marked !CoCo.

But what does private mean in this context?

Is it just like a bit of additional hypervisor security that the page
is not mapped anyplace except the KVM stage 2 and the hypervisor can
cause it to become mapped/shared at any time? But the guest has no
idea about this?

Jason
David Hildenbrand Oct. 17, 2024, 7:29 p.m. UTC | #26
On 17.10.24 21:18, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>> gmemfd object (when created) can have a flag marking that it's
>>>> always shared and can never be converted to private for any page
>>>> within.
>>>
>>> What is non-CoCo? Does it include the private/shared concept?
>>
>> I used that to represent the possible gmemfd use cases outside confidential
>> computing.
>>
>> So the private/shared things should still be around as fundamental property
>> of gmemfd, but it should be always shared and no convertion needed for the
>> whole lifecycle of the gmemfd when marked !CoCo.
> 
> But what does private mean in this context?
> 
> Is it just like a bit of additional hypervisor security that the page
> is not mapped anyplace except the KVM stage 2 and the hypervisor can
> cause it to become mapped/shared at any time? But the guest has no
> idea about this?

I think what Peter is trying to say is that it would all be shared. 
Private conversion is never triggered by the host or the guest.

No special security, nothing. Just like using hugetlb, but without the 
hugetlb.
Patrick Roy Oct. 18, 2024, 7:15 a.m. UTC | #27
On Thu, 2024-10-17 at 20:18 +0100, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>> gmemfd object (when created) can have a flag marking that it's
>>>> always shared and can never be converted to private for any page
>>>> within.
>>>
>>> What is non-CoCo? Does it include the private/shared concept?
>>
>> I used that to represent the possible gmemfd use cases outside confidential
>> computing.
>>
>> So the private/shared things should still be around as fundamental property
>> of gmemfd, but it should be always shared and no convertion needed for the
>> whole lifecycle of the gmemfd when marked !CoCo.
> 
> But what does private mean in this context?
> 
> Is it just like a bit of additional hypervisor security that the page
> is not mapped anyplace except the KVM stage 2 and the hypervisor can
> cause it to become mapped/shared at any time? But the guest has no
> idea about this?
> 
> Jason

Yes, this is pretty much exactly what I'm after when I say "non-CoCo".
No direct map entries to provide defense-in-depth for guests against
various speculative execution issues, but not a full confidential
computing setup (e.g. the guest should be completely oblivious to this,
and not require any modifications).
David Hildenbrand Oct. 18, 2024, 7:50 a.m. UTC | #28
On 18.10.24 09:15, Patrick Roy wrote:
> 
> 
> On Thu, 2024-10-17 at 20:18 +0100, Jason Gunthorpe wrote:
>> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>>> gmemfd object (when created) can have a flag marking that it's
>>>>> always shared and can never be converted to private for any page
>>>>> within.
>>>>
>>>> What is non-CoCo? Does it include the private/shared concept?
>>>
>>> I used that to represent the possible gmemfd use cases outside confidential
>>> computing.
>>>
>>> So the private/shared things should still be around as fundamental property
>>> of gmemfd, but it should be always shared and no convertion needed for the
>>> whole lifecycle of the gmemfd when marked !CoCo.
>>
>> But what does private mean in this context?
>>
>> Is it just like a bit of additional hypervisor security that the page
>> is not mapped anyplace except the KVM stage 2 and the hypervisor can
>> cause it to become mapped/shared at any time? But the guest has no
>> idea about this?
>>
>> Jason
> 
> Yes, this is pretty much exactly what I'm after when I say "non-CoCo".

It's likely not what Peter meant, though.

I think there are three scenarios:

(a) Secure CoCo VMs: private is protected by HW
(b) Semi-secured non-CoCo VMs: private is removed from the directmap
(c) Non-CoCo VMs: only shared memory

Does that match what you have in mind? Are there other cases?
Patrick Roy Oct. 18, 2024, 9:34 a.m. UTC | #29
On Fri, 2024-10-18 at 08:50 +0100, David Hildenbrand wrote:
> On 18.10.24 09:15, Patrick Roy wrote:
>>
>>
>> On Thu, 2024-10-17 at 20:18 +0100, Jason Gunthorpe wrote:
>>> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>>>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>>>> gmemfd object (when created) can have a flag marking that it's
>>>>>> always shared and can never be converted to private for any page
>>>>>> within.
>>>>>
>>>>> What is non-CoCo? Does it include the private/shared concept?
>>>>
>>>> I used that to represent the possible gmemfd use cases outside confidential
>>>> computing.
>>>>
>>>> So the private/shared things should still be around as fundamental property
>>>> of gmemfd, but it should be always shared and no conversion needed for the
>>>> whole lifecycle of the gmemfd when marked !CoCo.
>>>
>>> But what does private mean in this context?
>>>
>>> Is it just like a bit of additional hypervisor security that the page
>>> is not mapped anyplace except the KVM stage 2 and the hypervisor can
>>> cause it to become mapped/shared at any time? But the guest has no
>>> idea about this?
>>>
>>> Jason
>>
>> Yes, this is pretty much exactly what I'm after when I say "non-CoCo".
> 
> It's likely not what Peter meant, though.
> 
> I think there are three scenarios:
> 
> (a) Secure CoCo VMs: private is protected by HW
> (b) Semi-secured non-CoCo VMs: private is removed from the directmap
> (c) Non-CoCo VMs: only shared memory
> 
> Does that match what you have in mind? Are there other cases?

Yeah, I'm after your case (b). I suppose I will not call it just
"non-CoCo" anymore then :)

> -- 
> Cheers,
> 
> David / dhildenb
>

Patch

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8151df2c03e5..b603518f7b62 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -26,11 +26,21 @@  struct kvm_gmem_hugetlb {
 	struct hugepage_subpool *spool;
 };
 
-static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
+struct kvm_gmem_inode_private {
+	struct xarray faultability;
+	struct kvm_gmem_hugetlb *hgmem;
+};
+
+static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
 {
 	return inode->i_mapping->i_private_data;
 }
 
+static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
+{
+	return kvm_gmem_private(inode)->hgmem;
+}
+
 static bool is_kvm_gmem_hugetlb(struct inode *inode)
 {
 	u64 flags = (u64)inode->i_private;
@@ -38,6 +48,57 @@  static bool is_kvm_gmem_hugetlb(struct inode *inode)
 	return flags & KVM_GUEST_MEMFD_HUGETLB;
 }
 
+#define KVM_GMEM_FAULTABILITY_VALUE 0x4641554c54  /* FAULT */
+
+/**
+ * Set faultability of given range of inode indices [@start, @end) to
+ * @faultable. Return 0 if attributes were successfully updated or negative
+ * errno on error.
+ */
+static int kvm_gmem_set_faultable(struct inode *inode, pgoff_t start, pgoff_t end,
+				  bool faultable)
+{
+	struct xarray *faultability;
+	void *val;
+	pgoff_t i;
+
+	/*
+	 * The expectation is that fewer pages are faultable, so to save
+	 * memory, entries are created for faultable pages rather than for
+	 * non-faultable pages.
+	 */
+	val = faultable ? xa_mk_value(KVM_GMEM_FAULTABILITY_VALUE) : NULL;
+	faultability = &kvm_gmem_private(inode)->faultability;
+
+	/*
+	 * TODO replace this with something else (maybe interval
+	 * tree?). store_range doesn't quite do what we expect if overlapping
+	 * ranges are specified: if we store_range(5, 10, val) and then
+	 * store_range(7, 12, NULL), the entire range [5, 12] will be NULL.  For
+	 * now, use the slower xa_store() to store individual entries on indices
+	 * to avoid this.
+	 */
+	for (i = start; i < end; i++) {
+		int r;
+
+		r = xa_err(xa_store(faultability, i, val, GFP_KERNEL_ACCOUNT));
+		if (r)
+			return r;
+	}
+
+	return 0;
+}
+
+/**
+ * Return true if the page at @index is allowed to be faulted in.
+ */
+static bool kvm_gmem_is_faultable(struct inode *inode, pgoff_t index)
+{
+	struct xarray *faultability = &kvm_gmem_private(inode)->faultability;
+
+	return xa_to_value(xa_load(faultability, index)) == KVM_GMEM_FAULTABILITY_VALUE;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -895,11 +956,21 @@  static void kvm_gmem_hugetlb_teardown(struct inode *inode)
 
 static void kvm_gmem_evict_inode(struct inode *inode)
 {
+	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+	/*
+	 * .evict_inode can be called before faultability is set up if there are
+	 * issues during inode creation.
+	 */
+	if (private)
+		xa_destroy(&private->faultability);
+
 	if (is_kvm_gmem_hugetlb(inode))
 		kvm_gmem_hugetlb_teardown(inode);
 	else
 		truncate_inode_pages_final(inode->i_mapping);
 
+	kfree(private);
 	clear_inode(inode);
 }
 
@@ -1028,7 +1099,9 @@  static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
-static int kvm_gmem_hugetlb_setup(struct inode *inode, loff_t size, u64 flags)
+static int kvm_gmem_hugetlb_setup(struct inode *inode,
+				  struct kvm_gmem_inode_private *private,
+				  loff_t size, u64 flags)
 {
 	struct kvm_gmem_hugetlb *hgmem;
 	struct hugepage_subpool *spool;
@@ -1036,6 +1109,10 @@  static int kvm_gmem_hugetlb_setup(struct inode *inode, loff_t size, u64 flags)
 	struct hstate *h;
 	long hpages;
 
+	hgmem = kzalloc(sizeof(*hgmem), GFP_KERNEL);
+	if (!hgmem)
+		return -ENOMEM;
+
 	page_size_log = (flags >> KVM_GUEST_MEMFD_HUGE_SHIFT) & KVM_GUEST_MEMFD_HUGE_MASK;
 	h = hstate_sizelog(page_size_log);
 
@@ -1046,21 +1123,16 @@  static int kvm_gmem_hugetlb_setup(struct inode *inode, loff_t size, u64 flags)
 	if (!spool)
 		goto err;
 
-	hgmem = kzalloc(sizeof(*hgmem), GFP_KERNEL);
-	if (!hgmem)
-		goto err_subpool;
-
 	inode->i_blkbits = huge_page_shift(h);
 
 	hgmem->h = h;
 	hgmem->spool = spool;
-	inode->i_mapping->i_private_data = hgmem;
 
+	private->hgmem = hgmem;
 	return 0;
 
-err_subpool:
-	kfree(spool);
 err:
+	kfree(hgmem);
 	return -ENOMEM;
 }
 
@@ -1068,6 +1140,7 @@  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 						      loff_t size, u64 flags)
 {
 	const struct qstr qname = QSTR_INIT(name, strlen(name));
+	struct kvm_gmem_inode_private *private;
 	struct inode *inode;
 	int err;
 
@@ -1079,12 +1152,20 @@  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	if (err)
 		goto out;
 
+	err = -ENOMEM;
+	private = kzalloc(sizeof(*private), GFP_KERNEL);
+	if (!private)
+		goto out;
+
 	if (flags & KVM_GUEST_MEMFD_HUGETLB) {
-		err = kvm_gmem_hugetlb_setup(inode, size, flags);
+		err = kvm_gmem_hugetlb_setup(inode, private, size, flags);
 		if (err)
-			goto out;
+			goto free_private;
 	}
 
+	xa_init(&private->faultability);
+	inode->i_mapping->i_private_data = private;
+
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
 	inode->i_mapping->a_ops = &kvm_gmem_aops;
@@ -1097,6 +1178,8 @@  static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 
 	return inode;
 
+free_private:
+	kfree(private);
 out:
 	iput(inode);