Message ID | 20220706082016.2603916-2-chao.p.peng@linux.intel.com |
---|---|
State | New |
Headers | show |
Series | KVM: mm: fd-based approach for supporting KVM guest private memory | expand |
On 06.07.22 10:20, Chao Peng wrote: > Normally, a write to unallocated space of a file or the hole of a sparse > file automatically causes space allocation, for memfd, this equals to > memory allocation. This new seal prevents such automatically allocating, > either this is from a direct write() or a write on the previously > mmap-ed area. The seal does not prevent fallocate() so an explicit > fallocate() can still cause allocating and can be used to reserve > memory. > > This is used to prevent unintentional allocation from userspace on a > stray or careless write and any intentional allocation should use an > explicit fallocate(). One of the main usecases is to avoid memory double > allocation for confidential computing usage where we use two memfds to > back guest memory and at a single point only one memfd is alive and we > want to prevent memory allocation for the other memfd which may have > been mmap-ed previously. More discussion can be found at: > > https://lkml.org/lkml/2022/6/14/1255 > > Suggested-by: Sean Christopherson <seanjc@google.com> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > --- > include/uapi/linux/fcntl.h | 1 + > mm/memfd.c | 3 ++- > mm/shmem.c | 16 ++++++++++++++-- > 3 files changed, 17 insertions(+), 3 deletions(-) > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index 2f86b2ad6d7e..98bdabc8e309 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -43,6 +43,7 @@ > #define F_SEAL_GROW 0x0004 /* prevent file from growing */ > #define F_SEAL_WRITE 0x0008 /* prevent writes */ > #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ > +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ Why only "on writes" and not "on reads". IIRC, shmem doesn't support the shared zeropage, so you'll simply allocate a new page via read() or on read faults. Also, I *think* you can place pages via userfaultfd into shmem. Not sure if that would count "auto alloc", but it would certainly bypass fallocate().
On 21.07.22 11:44, David Hildenbrand wrote: > On 06.07.22 10:20, Chao Peng wrote: >> Normally, a write to unallocated space of a file or the hole of a sparse >> file automatically causes space allocation, for memfd, this equals to >> memory allocation. This new seal prevents such automatically allocating, >> either this is from a direct write() or a write on the previously >> mmap-ed area. The seal does not prevent fallocate() so an explicit >> fallocate() can still cause allocating and can be used to reserve >> memory. >> >> This is used to prevent unintentional allocation from userspace on a >> stray or careless write and any intentional allocation should use an >> explicit fallocate(). One of the main usecases is to avoid memory double >> allocation for confidential computing usage where we use two memfds to >> back guest memory and at a single point only one memfd is alive and we >> want to prevent memory allocation for the other memfd which may have >> been mmap-ed previously. More discussion can be found at: >> >> https://lkml.org/lkml/2022/6/14/1255 >> >> Suggested-by: Sean Christopherson <seanjc@google.com> >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> >> --- >> include/uapi/linux/fcntl.h | 1 + >> mm/memfd.c | 3 ++- >> mm/shmem.c | 16 ++++++++++++++-- >> 3 files changed, 17 insertions(+), 3 deletions(-) >> >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h >> index 2f86b2ad6d7e..98bdabc8e309 100644 >> --- a/include/uapi/linux/fcntl.h >> +++ b/include/uapi/linux/fcntl.h >> @@ -43,6 +43,7 @@ >> #define F_SEAL_GROW 0x0004 /* prevent file from growing */ >> #define F_SEAL_WRITE 0x0008 /* prevent writes */ >> #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ >> +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ > > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the > shared zeropage, so you'll simply allocate a new page via read() or on > read faults. Correction: on read() we don't allocate a fresh page. But on read faults we would. So this comment here needs clarification. > > > Also, I *think* you can place pages via userfaultfd into shmem. Not sure > if that would count "auto alloc", but it would certainly bypass fallocate(). >
>> Normally, a write to unallocated space of a file or the hole of a sparse >> file automatically causes space allocation, for memfd, this equals to >> memory allocation. This new seal prevents such automatically allocating, >> either this is from a direct write() or a write on the previously >> mmap-ed area. The seal does not prevent fallocate() so an explicit >> fallocate() can still cause allocating and can be used to reserve >> memory. >> >> This is used to prevent unintentional allocation from userspace on a >> stray or careless write and any intentional allocation should use an >> explicit fallocate(). One of the main usecases is to avoid memory double >> allocation for confidential computing usage where we use two memfds to >> back guest memory and at a single point only one memfd is alive and we >> want to prevent memory allocation for the other memfd which may have >> been mmap-ed previously. More discussion can be found at: >> >> https://lkml.org/lkml/2022/6/14/1255 >> >> Suggested-by: Sean Christopherson <seanjc@google.com> >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> >> --- >> include/uapi/linux/fcntl.h | 1 + >> mm/memfd.c | 3 ++- >> mm/shmem.c | 16 ++++++++++++++-- >> 3 files changed, 17 insertions(+), 3 deletions(-) >> >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h >> index 2f86b2ad6d7e..98bdabc8e309 100644 >> --- a/include/uapi/linux/fcntl.h >> +++ b/include/uapi/linux/fcntl.h >> @@ -43,6 +43,7 @@ >> #define F_SEAL_GROW 0x0004 /* prevent file from growing */ >> #define F_SEAL_WRITE 0x0008 /* prevent writes */ >> #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ >> +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ > > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the > shared zeropage, so you'll simply allocate a new page via read() or on > read faults. > > > Also, I *think* you can place pages via userfaultfd into shmem. Not sure > if that would count "auto alloc", but it would certainly bypass fallocate(). I was also thinking this at the same time, but for different reason: "Want to populate private preboot memory with firmware payload", so was thinking userfaulftd could be an option as direct writes are restricted? Thanks, Pankaj
On Thu, Jul 21, 2022, David Hildenbrand wrote: > On 21.07.22 11:44, David Hildenbrand wrote: > > On 06.07.22 10:20, Chao Peng wrote: > >> Normally, a write to unallocated space of a file or the hole of a sparse > >> file automatically causes space allocation, for memfd, this equals to > >> memory allocation. This new seal prevents such automatically allocating, > >> either this is from a direct write() or a write on the previously > >> mmap-ed area. The seal does not prevent fallocate() so an explicit > >> fallocate() can still cause allocating and can be used to reserve > >> memory. > >> > >> This is used to prevent unintentional allocation from userspace on a > >> stray or careless write and any intentional allocation should use an > >> explicit fallocate(). One of the main usecases is to avoid memory double > >> allocation for confidential computing usage where we use two memfds to > >> back guest memory and at a single point only one memfd is alive and we > >> want to prevent memory allocation for the other memfd which may have > >> been mmap-ed previously. More discussion can be found at: > >> > >> https://lkml.org/lkml/2022/6/14/1255 > >> > >> Suggested-by: Sean Christopherson <seanjc@google.com> > >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > >> --- > >> include/uapi/linux/fcntl.h | 1 + > >> mm/memfd.c | 3 ++- > >> mm/shmem.c | 16 ++++++++++++++-- > >> 3 files changed, 17 insertions(+), 3 deletions(-) > >> > >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > >> index 2f86b2ad6d7e..98bdabc8e309 100644 > >> --- a/include/uapi/linux/fcntl.h > >> +++ b/include/uapi/linux/fcntl.h > >> @@ -43,6 +43,7 @@ > >> #define F_SEAL_GROW 0x0004 /* prevent file from growing */ > >> #define F_SEAL_WRITE 0x0008 /* prevent writes */ > >> #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ > >> +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ > > > > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the > > shared zeropage, so you'll simply allocate a new page via read() or on > > read faults. > > Correction: on read() we don't allocate a fresh page. But on read faults > we would. So this comment here needs clarification. Not just the comment, the code too. The intent of F_SEAL_AUTO_ALLOCATE is very much to block _all_ implicit allocations (or maybe just fault-based allocations if "implicit" is too broad of a description).
On Thu, Jul 21, 2022 at 11:44:11AM +0200, David Hildenbrand wrote: > On 06.07.22 10:20, Chao Peng wrote: > > Normally, a write to unallocated space of a file or the hole of a sparse > > file automatically causes space allocation, for memfd, this equals to > > memory allocation. This new seal prevents such automatically allocating, > > either this is from a direct write() or a write on the previously > > mmap-ed area. The seal does not prevent fallocate() so an explicit > > fallocate() can still cause allocating and can be used to reserve > > memory. > > > > This is used to prevent unintentional allocation from userspace on a > > stray or careless write and any intentional allocation should use an > > explicit fallocate(). One of the main usecases is to avoid memory double > > allocation for confidential computing usage where we use two memfds to > > back guest memory and at a single point only one memfd is alive and we > > want to prevent memory allocation for the other memfd which may have > > been mmap-ed previously. More discussion can be found at: > > > > https://lkml.org/lkml/2022/6/14/1255 > > > > Suggested-by: Sean Christopherson <seanjc@google.com> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > > --- > > include/uapi/linux/fcntl.h | 1 + > > mm/memfd.c | 3 ++- > > mm/shmem.c | 16 ++++++++++++++-- > > 3 files changed, 17 insertions(+), 3 deletions(-) > > > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > > index 2f86b2ad6d7e..98bdabc8e309 100644 > > --- a/include/uapi/linux/fcntl.h > > +++ b/include/uapi/linux/fcntl.h > > @@ -43,6 +43,7 @@ > > #define F_SEAL_GROW 0x0004 /* prevent file from growing */ > > #define F_SEAL_WRITE 0x0008 /* prevent writes */ > > #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ > > +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ > > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the > shared zeropage, so you'll simply allocate a new page via read() or on > read faults. Right, it also prevents read faults. > > > Also, I *think* you can place pages via userfaultfd into shmem. Not sure > if that would count "auto alloc", but it would certainly bypass fallocate(). Userfaultfd sounds interesting, will further investigate it. But a rough look sounds it only faults to usrspace for write/read fault, not write()? Also sounds it operates on vma and userfaultfd_register() takes mmap_lock which is what we want to avoid for frequent register/unregister during private/shared memory conversion. Chao > > -- > Thanks, > > David / dhildenb
On Thu, Jul 21, 2022 at 03:05:09PM +0000, Sean Christopherson wrote: > On Thu, Jul 21, 2022, David Hildenbrand wrote: > > On 21.07.22 11:44, David Hildenbrand wrote: > > > On 06.07.22 10:20, Chao Peng wrote: > > >> Normally, a write to unallocated space of a file or the hole of a sparse > > >> file automatically causes space allocation, for memfd, this equals to > > >> memory allocation. This new seal prevents such automatically allocating, > > >> either this is from a direct write() or a write on the previously > > >> mmap-ed area. The seal does not prevent fallocate() so an explicit > > >> fallocate() can still cause allocating and can be used to reserve > > >> memory. > > >> > > >> This is used to prevent unintentional allocation from userspace on a > > >> stray or careless write and any intentional allocation should use an > > >> explicit fallocate(). One of the main usecases is to avoid memory double > > >> allocation for confidential computing usage where we use two memfds to > > >> back guest memory and at a single point only one memfd is alive and we > > >> want to prevent memory allocation for the other memfd which may have > > >> been mmap-ed previously. More discussion can be found at: > > >> > > >> https://lkml.org/lkml/2022/6/14/1255 > > >> > > >> Suggested-by: Sean Christopherson <seanjc@google.com> > > >> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > > >> --- > > >> include/uapi/linux/fcntl.h | 1 + > > >> mm/memfd.c | 3 ++- > > >> mm/shmem.c | 16 ++++++++++++++-- > > >> 3 files changed, 17 insertions(+), 3 deletions(-) > > >> > > >> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > > >> index 2f86b2ad6d7e..98bdabc8e309 100644 > > >> --- a/include/uapi/linux/fcntl.h > > >> +++ b/include/uapi/linux/fcntl.h > > >> @@ -43,6 +43,7 @@ > > >> #define F_SEAL_GROW 0x0004 /* prevent file from growing */ > > >> #define F_SEAL_WRITE 0x0008 /* prevent writes */ > > >> #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ > > >> +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ > > > > > > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the > > > shared zeropage, so you'll simply allocate a new page via read() or on > > > read faults. > > > > Correction: on read() we don't allocate a fresh page. But on read faults > > we would. So this comment here needs clarification. > > Not just the comment, the code too. The intent of F_SEAL_AUTO_ALLOCATE is very > much to block _all_ implicit allocations (or maybe just fault-based allocations > if "implicit" is too broad of a description). So maybe still your initial suggestion F_SEAL_FAULT_ALLOCATIONS? One reason I don't like it is the write() ioctl also cause allocation and we want to prevent it. Chao
On Thu, Jul 21, 2022 at 12:27:03PM +0200, Gupta, Pankaj wrote: > > > > Normally, a write to unallocated space of a file or the hole of a sparse > > > file automatically causes space allocation, for memfd, this equals to > > > memory allocation. This new seal prevents such automatically allocating, > > > either this is from a direct write() or a write on the previously > > > mmap-ed area. The seal does not prevent fallocate() so an explicit > > > fallocate() can still cause allocating and can be used to reserve > > > memory. > > > > > > This is used to prevent unintentional allocation from userspace on a > > > stray or careless write and any intentional allocation should use an > > > explicit fallocate(). One of the main usecases is to avoid memory double > > > allocation for confidential computing usage where we use two memfds to > > > back guest memory and at a single point only one memfd is alive and we > > > want to prevent memory allocation for the other memfd which may have > > > been mmap-ed previously. More discussion can be found at: > > > > > > https://lkml.org/lkml/2022/6/14/1255 > > > > > > Suggested-by: Sean Christopherson <seanjc@google.com> > > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > > > --- > > > include/uapi/linux/fcntl.h | 1 + > > > mm/memfd.c | 3 ++- > > > mm/shmem.c | 16 ++++++++++++++-- > > > 3 files changed, 17 insertions(+), 3 deletions(-) > > > > > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > > > index 2f86b2ad6d7e..98bdabc8e309 100644 > > > --- a/include/uapi/linux/fcntl.h > > > +++ b/include/uapi/linux/fcntl.h > > > @@ -43,6 +43,7 @@ > > > #define F_SEAL_GROW 0x0004 /* prevent file from growing */ > > > #define F_SEAL_WRITE 0x0008 /* prevent writes */ > > > #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ > > > +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ > > > > Why only "on writes" and not "on reads". IIRC, shmem doesn't support the > > shared zeropage, so you'll simply allocate a new page via read() or on > > read faults. > > > > > > Also, I *think* you can place pages via userfaultfd into shmem. Not sure > > if that would count "auto alloc", but it would certainly bypass fallocate(). > > I was also thinking this at the same time, but for different reason: > > "Want to populate private preboot memory with firmware payload", so was > thinking userfaulftd could be an option as direct writes are restricted? If that can be a side effect, I definitely glad to see it, though I'm still not clear how userfaultfd can be particularly helpful for that. Chao > > Thanks, > Pankaj > > > >
>>>> Normally, a write to unallocated space of a file or the hole of a sparse >>>> file automatically causes space allocation, for memfd, this equals to >>>> memory allocation. This new seal prevents such automatically allocating, >>>> either this is from a direct write() or a write on the previously >>>> mmap-ed area. The seal does not prevent fallocate() so an explicit >>>> fallocate() can still cause allocating and can be used to reserve >>>> memory. >>>> >>>> This is used to prevent unintentional allocation from userspace on a >>>> stray or careless write and any intentional allocation should use an >>>> explicit fallocate(). One of the main usecases is to avoid memory double >>>> allocation for confidential computing usage where we use two memfds to >>>> back guest memory and at a single point only one memfd is alive and we >>>> want to prevent memory allocation for the other memfd which may have >>>> been mmap-ed previously. More discussion can be found at: >>>> >>>> https://lkml.org/lkml/2022/6/14/1255 >>>> >>>> Suggested-by: Sean Christopherson <seanjc@google.com> >>>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> >>>> --- >>>> include/uapi/linux/fcntl.h | 1 + >>>> mm/memfd.c | 3 ++- >>>> mm/shmem.c | 16 ++++++++++++++-- >>>> 3 files changed, 17 insertions(+), 3 deletions(-) >>>> >>>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h >>>> index 2f86b2ad6d7e..98bdabc8e309 100644 >>>> --- a/include/uapi/linux/fcntl.h >>>> +++ b/include/uapi/linux/fcntl.h >>>> @@ -43,6 +43,7 @@ >>>> #define F_SEAL_GROW 0x0004 /* prevent file from growing */ >>>> #define F_SEAL_WRITE 0x0008 /* prevent writes */ >>>> #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ >>>> +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ >>> >>> Why only "on writes" and not "on reads". IIRC, shmem doesn't support the >>> shared zeropage, so you'll simply allocate a new page via read() or on >>> read faults. >>> >>> >>> Also, I *think* you can place pages via userfaultfd into shmem. Not sure >>> if that would count "auto alloc", but it would certainly bypass fallocate(). >> >> I was also thinking this at the same time, but for different reason: >> >> "Want to populate private preboot memory with firmware payload", so was >> thinking userfaulftd could be an option as direct writes are restricted? > > If that can be a side effect, I definitely glad to see it, though I'm > still not clear how userfaultfd can be particularly helpful for that. Was thinking if we can use userfaultfd to monitor the pagefault on virtual firmware memory range and use to populate the private memory. Not sure if it is a side effect. Was just theoretically thinking (for now kept the idea aside as these enhancements can be worked later). Thanks, Pankaj
On 7/21/22 11:44, David Hildenbrand wrote: > > Also, I*think* you can place pages via userfaultfd into shmem. Not > sure if that would count "auto alloc", but it would certainly bypass > fallocate(). Yeah, userfaultfd_register would probably have to forbid this for F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then userfault_register would do something like memfile_node_get_flags(vma->vm_file) and check the result. This means moving this patch later, after "mm: Introduce memfile_notifier". Thanks, Paolo
On 05.08.22 19:55, Paolo Bonzini wrote: > On 7/21/22 11:44, David Hildenbrand wrote: >> >> Also, I*think* you can place pages via userfaultfd into shmem. Not >> sure if that would count "auto alloc", but it would certainly bypass >> fallocate(). > > Yeah, userfaultfd_register would probably have to forbid this for > F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for > this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then > userfault_register would do something like > memfile_node_get_flags(vma->vm_file) and check the result. An alternative is to simply have the shmem allocation fail in a similar way. Maybe it does already, I haven't checked (don't think so).
On Fri, Aug 05, 2022 at 08:06:03PM +0200, David Hildenbrand wrote: > On 05.08.22 19:55, Paolo Bonzini wrote: > > On 7/21/22 11:44, David Hildenbrand wrote: > >> > >> Also, I*think* you can place pages via userfaultfd into shmem. Not > >> sure if that would count "auto alloc", but it would certainly bypass > >> fallocate(). > > > > Yeah, userfaultfd_register would probably have to forbid this for > > F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for > > this, adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then > > userfault_register would do something like > > memfile_node_get_flags(vma->vm_file) and check the result. > > An alternative is to simply have the shmem allocation fail in a similar > way. Maybe it does already, I haven't checked (don't think so). This sounds a better option. We don't need uAPI changes for userfault_register uAPI but I guess we will still need a KVM uAPI, either on the memslot or on the whole VM since Roth said this feature should be optional because some usages may want to disable it for performance reason. For details please see discussion: https://lkml.org/lkml/2022/6/23/1905 Chao > > > -- > Thanks, > > David / dhildenb
On 8/18/22 01:41, Kirill A. Shutemov wrote: > Note, that userfaultfd is only relevant for shared memory as it requires > VMA which we don't have for MFD_INACCESSIBLE. Oh, you're right! So yeah, looks like userfaultfd is not a problem. Paolo
On 18.08.22 01:41, Kirill A. Shutemov wrote: > On Fri, Aug 05, 2022 at 07:55:38PM +0200, Paolo Bonzini wrote: >> On 7/21/22 11:44, David Hildenbrand wrote: >>> >>> Also, I*think* you can place pages via userfaultfd into shmem. Not >>> sure if that would count "auto alloc", but it would certainly bypass >>> fallocate(). >> >> Yeah, userfaultfd_register would probably have to forbid this for >> F_SEAL_AUTO_ALLOCATE vmas. Maybe the memfile_node can be reused for this, >> adding a new MEMFILE_F_NO_AUTO_ALLOCATE flags? Then userfault_register >> would do something like memfile_node_get_flags(vma->vm_file) and check the >> result. > > I donno, memory allocation with userfaultfd looks pretty intentional to > me. Why would F_SEAL_AUTO_ALLOCATE prevent it? > Can't we say the same about a write()? > Maybe we would need it in the future for post-copy migration or something? > > Or existing practises around userfaultfd touch memory randomly and > therefore incompatible with F_SEAL_AUTO_ALLOCATE intent? > > Note, that userfaultfd is only relevant for shared memory as it requires > VMA which we don't have for MFD_INACCESSIBLE. This feature (F_SEAL_AUTO_ALLOCATE) is independent of all the lovely encrypted VM stuff, so it doesn't matter how it relates to MFD_INACCESSIBLE.
Hi Chao, On Wed, Jul 6, 2022 at 9:25 AM Chao Peng <chao.p.peng@linux.intel.com> wrote: > > Normally, a write to unallocated space of a file or the hole of a sparse > file automatically causes space allocation, for memfd, this equals to > memory allocation. This new seal prevents such automatically allocating, > either this is from a direct write() or a write on the previously > mmap-ed area. The seal does not prevent fallocate() so an explicit > fallocate() can still cause allocating and can be used to reserve > memory. > > This is used to prevent unintentional allocation from userspace on a > stray or careless write and any intentional allocation should use an > explicit fallocate(). One of the main usecases is to avoid memory double > allocation for confidential computing usage where we use two memfds to > back guest memory and at a single point only one memfd is alive and we > want to prevent memory allocation for the other memfd which may have > been mmap-ed previously. More discussion can be found at: > > https://lkml.org/lkml/2022/6/14/1255 > > Suggested-by: Sean Christopherson <seanjc@google.com> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > --- > include/uapi/linux/fcntl.h | 1 + > mm/memfd.c | 3 ++- > mm/shmem.c | 16 ++++++++++++++-- > 3 files changed, 17 insertions(+), 3 deletions(-) > > diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h > index 2f86b2ad6d7e..98bdabc8e309 100644 > --- a/include/uapi/linux/fcntl.h > +++ b/include/uapi/linux/fcntl.h > @@ -43,6 +43,7 @@ > #define F_SEAL_GROW 0x0004 /* prevent file from growing */ > #define F_SEAL_WRITE 0x0008 /* prevent writes */ > #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ > +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ I think this should also be added to tools/include/uapi/linux/fcntl.h Cheers, /fuad > /* (1U << 31) is reserved for signed error codes */ > > /* > diff --git a/mm/memfd.c b/mm/memfd.c > index 08f5f8304746..2afd898798e4 100644 > --- a/mm/memfd.c > +++ b/mm/memfd.c > @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) > F_SEAL_SHRINK | \ > F_SEAL_GROW | \ > F_SEAL_WRITE | \ > - F_SEAL_FUTURE_WRITE) > + F_SEAL_FUTURE_WRITE | \ > + F_SEAL_AUTO_ALLOCATE) > > static int memfd_add_seals(struct file *file, unsigned int seals) > { > diff --git a/mm/shmem.c b/mm/shmem.c > index a6f565308133..6c8aef15a17d 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) > struct vm_area_struct *vma = vmf->vma; > struct inode *inode = file_inode(vma->vm_file); > gfp_t gfp = mapping_gfp_mask(inode->i_mapping); > + struct shmem_inode_info *info = SHMEM_I(inode); > + enum sgp_type sgp; > int err; > vm_fault_t ret = VM_FAULT_LOCKED; > > @@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) > spin_unlock(&inode->i_lock); > } > > - err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE, > + if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE)) > + sgp = SGP_NOALLOC; > + else > + sgp = SGP_CACHE; > + > + err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp, > gfp, vma, vmf, &ret); > if (err) > return vmf_error(err); > @@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping, > struct inode *inode = mapping->host; > struct shmem_inode_info *info = SHMEM_I(inode); > pgoff_t index = pos >> PAGE_SHIFT; > + enum sgp_type sgp; > int ret = 0; > > /* i_rwsem is held by caller */ > @@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping, > return -EPERM; > } > > - ret = shmem_getpage(inode, index, pagep, SGP_WRITE); > + if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE)) > + sgp = SGP_NOALLOC; > + else > + sgp = SGP_WRITE; > + ret = shmem_getpage(inode, index, pagep, sgp); > > if (ret) > return ret; > -- > 2.25.1 >
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h index 2f86b2ad6d7e..98bdabc8e309 100644 --- a/include/uapi/linux/fcntl.h +++ b/include/uapi/linux/fcntl.h @@ -43,6 +43,7 @@ #define F_SEAL_GROW 0x0004 /* prevent file from growing */ #define F_SEAL_WRITE 0x0008 /* prevent writes */ #define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */ +#define F_SEAL_AUTO_ALLOCATE 0x0020 /* prevent allocation for writes */ /* (1U << 31) is reserved for signed error codes */ /* diff --git a/mm/memfd.c b/mm/memfd.c index 08f5f8304746..2afd898798e4 100644 --- a/mm/memfd.c +++ b/mm/memfd.c @@ -150,7 +150,8 @@ static unsigned int *memfd_file_seals_ptr(struct file *file) F_SEAL_SHRINK | \ F_SEAL_GROW | \ F_SEAL_WRITE | \ - F_SEAL_FUTURE_WRITE) + F_SEAL_FUTURE_WRITE | \ + F_SEAL_AUTO_ALLOCATE) static int memfd_add_seals(struct file *file, unsigned int seals) { diff --git a/mm/shmem.c b/mm/shmem.c index a6f565308133..6c8aef15a17d 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2051,6 +2051,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) struct vm_area_struct *vma = vmf->vma; struct inode *inode = file_inode(vma->vm_file); gfp_t gfp = mapping_gfp_mask(inode->i_mapping); + struct shmem_inode_info *info = SHMEM_I(inode); + enum sgp_type sgp; int err; vm_fault_t ret = VM_FAULT_LOCKED; @@ -2113,7 +2115,12 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf) spin_unlock(&inode->i_lock); } - err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, SGP_CACHE, + if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE)) + sgp = SGP_NOALLOC; + else + sgp = SGP_CACHE; + + err = shmem_getpage_gfp(inode, vmf->pgoff, &vmf->page, sgp, gfp, vma, vmf, &ret); if (err) return vmf_error(err); @@ -2459,6 +2466,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping, struct inode *inode = mapping->host; struct shmem_inode_info *info = SHMEM_I(inode); pgoff_t index = pos >> PAGE_SHIFT; + enum sgp_type sgp; int ret = 0; /* i_rwsem is held by caller */ @@ -2470,7 +2478,11 @@ shmem_write_begin(struct file *file, struct address_space *mapping, return -EPERM; } - ret = shmem_getpage(inode, index, pagep, SGP_WRITE); + if (unlikely(info->seals & F_SEAL_AUTO_ALLOCATE)) + sgp = SGP_NOALLOC; + else + sgp = SGP_WRITE; + ret = shmem_getpage(inode, index, pagep, sgp); if (ret) return ret;
Normally, a write to unallocated space of a file or the hole of a sparse file automatically causes space allocation, for memfd, this equals to memory allocation. This new seal prevents such automatically allocating, either this is from a direct write() or a write on the previously mmap-ed area. The seal does not prevent fallocate() so an explicit fallocate() can still cause allocating and can be used to reserve memory. This is used to prevent unintentional allocation from userspace on a stray or careless write and any intentional allocation should use an explicit fallocate(). One of the main usecases is to avoid memory double allocation for confidential computing usage where we use two memfds to back guest memory and at a single point only one memfd is alive and we want to prevent memory allocation for the other memfd which may have been mmap-ed previously. More discussion can be found at: https://lkml.org/lkml/2022/6/14/1255 Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> --- include/uapi/linux/fcntl.h | 1 + mm/memfd.c | 3 ++- mm/shmem.c | 16 ++++++++++++++-- 3 files changed, 17 insertions(+), 3 deletions(-)