diff mbox series

userfaultfd: mark uffd_wp regardless of VM_WRITE flag

Message ID 20220217211602.2769-1-namit@vmware.com
State New
Headers show
Series userfaultfd: mark uffd_wp regardless of VM_WRITE flag | expand

Commit Message

Nadav Amit Feb. 17, 2022, 9:16 p.m. UTC
From: Nadav Amit <namit@vmware.com>

When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is
currently only marked as write-protected if the VMA has VM_WRITE flag
set. This seems incorrect or at least would be unexpected by the users.

Consider the following sequence of operations that are being performed
on a certain page:

	mprotect(PROT_READ)
	UFFDIO_COPY(UFFDIO_COPY_MODE_WP)
	mprotect(PROT_READ|PROT_WRITE)

At this point the user would expect to still get UFFD notification when
the page is accessed for write, but the user would not get one, since
the PTE was not marked as UFFD_WP during UFFDIO_COPY.

Fix it by always marking PTEs as UFFD_WP regardless on the
write-permission in the VMA flags.

Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Nadav Amit <namit@vmware.com>
---
 mm/userfaultfd.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

Comments

Peter Xu Feb. 21, 2022, 6:23 a.m. UTC | #1
On Thu, Feb 17, 2022 at 08:00:12PM -0800, Nadav Amit wrote:
> 
> > On Feb 17, 2022, at 6:23 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
> > 
> >> PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
> >> 
> >>       if (dirty_accountable && pte_dirty(ptent) &&
> >>                       (pte_soft_dirty(ptent) ||
> >>                       (vma->vm_flags & VM_SOFTDIRTY))) {
> >>               ptent = pte_mkwrite(ptent);
> >>       }
> 
> I know it is off-topic (not directly related to my patch), but
> I tried to understand the logic - both of the existing code and of
> your suggested change - and I failed.
> 
> IIUC dirty_accountable (whose value is taken from
> vma_wants_writenotify()) means that the writes *should* be tracked,
> and therefore the page should remain read-only.

Right.

> 
> So basically the condition should have been based on
> !dirty_accountable, i.e. the inverted value of dirty_accountable.
> 
> The problem is that dirty_accountable also reflects VM_SOFTDIRTY
> considerations, so looking on the PTE does not tell you whether
> the PTE should remain write-protected even if it is dirty.

My understanding is that the dirty bits (especially if both set) means
we've tracked dirty on this pte already so we don't need to, hence we can
set the dirty bit here.  E.g., continuous mprotect(RO), mprotect(RW) upon a
full dirty pte.

When something wants to enable tracking again, it needs to clear the dirty
bit, either the real one or soft-dirty one.  So it's a pure performance
enhancement to conditionally set write bit here, when we're sure we won't
need any further tracking on this pte.

One thing to mention is that this path only applies to VM_SHARED|VM_WRITE,
because that's what checked the first in vma_wants_writenotify():

	/* If it was private or non-writable, the write bit is already clear */
	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
		return 0;

IOW private mappings are not optimized in current tree yet.

Peter Collingbourne proposed a patch some time ago to optimize it but it
didn't get merged somehow.  Meanwhile even with his latest version it
should still miss the thp case, so if to reference the private optimization
Andrea's tree would be the best:

https://github.com/aagit/aa/commit/fadb5e04d94472614c76819acd979b2f60e4eff6

Hope it clarifies things a bit.  Thanks,
Nadav Amit Feb. 28, 2022, 6:31 p.m. UTC | #2
> On Feb 20, 2022, at 10:23 PM, Peter Xu <peterx@redhat.com> wrote:
> 
> On Thu, Feb 17, 2022 at 08:00:12PM -0800, Nadav Amit wrote:
>> 
>>> On Feb 17, 2022, at 6:23 PM, Nadav Amit <nadav.amit@gmail.com> wrote:
>>> 
>>>> PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be:
>>>> 
>>>>      if (dirty_accountable && pte_dirty(ptent) &&
>>>>                      (pte_soft_dirty(ptent) ||
>>>>                      (vma->vm_flags & VM_SOFTDIRTY))) {
>>>>              ptent = pte_mkwrite(ptent);
>>>>      }
>> 
>> I know it is off-topic (not directly related to my patch), but
>> I tried to understand the logic - both of the existing code and of
>> your suggested change - and I failed.
>> 
>> IIUC dirty_accountable (whose value is taken from
>> vma_wants_writenotify()) means that the writes *should* be tracked,
>> and therefore the page should remain read-only.
> 
> Right.
> 
>> 
>> So basically the condition should have been based on
>> !dirty_accountable, i.e. the inverted value of dirty_accountable.
>> 
>> The problem is that dirty_accountable also reflects VM_SOFTDIRTY
>> considerations, so looking on the PTE does not tell you whether
>> the PTE should remain write-protected even if it is dirty.
> 
> My understanding is that the dirty bits (especially if both set) means
> we've tracked dirty on this pte already so we don't need to, hence we can
> set the dirty bit here.  E.g., continuous mprotect(RO), mprotect(RW) upon a
> full dirty pte.
> 
> When something wants to enable tracking again, it needs to clear the dirty
> bit, either the real one or soft-dirty one.  So it's a pure performance
> enhancement to conditionally set write bit here, when we're sure we won't
> need any further tracking on this pte.
> 
> One thing to mention is that this path only applies to VM_SHARED|VM_WRITE,
> because that's what checked the first in vma_wants_writenotify():
> 
> 	/* If it was private or non-writable, the write bit is already clear */
> 	if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED)))
> 		return 0;
> 
> IOW private mappings are not optimized in current tree yet.
> 
> Peter Collingbourne proposed a patch some time ago to optimize it but it
> didn't get merged somehow.  Meanwhile even with his latest version it
> should still miss the thp case, so if to reference the private optimization
> Andrea's tree would be the best:
> 
> https://github.com/aagit/aa/commit/fadb5e04d94472614c76819acd979b2f60e4eff6
> 
> Hope it clarifies things a bit.  Thanks,

Thanks for the clarification. That’s what I suspected - I did not encounter
it since I only used private anonymous mappings. I will try to create a
test-case and send an additional fix for this issue.

Regards,
Nadav
Andrew Morton March 16, 2022, 10:05 p.m. UTC | #3
As I understand it, this patch (below) is still good to merge upstream,
although Peter hasn't acked it (please).

And a whole bunch of followup patches are being thought about, but have
yet to eventuate.

Do we feel that this patch warrants the cc:stable?  I'm suspecting
"no", as it isn't clear that the use-case is really legitimate at this
time?

Thanks.


From: Nadav Amit <namit@vmware.com>
Subject: userfaultfd: mark uffd_wp regardless of VM_WRITE flag

When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is
currently only marked as write-protected if the VMA has VM_WRITE flag set.
This seems incorrect or at least would be unexpected by the users.

Consider the following sequence of operations that are being performed on
a certain page:

	mprotect(PROT_READ)
	UFFDIO_COPY(UFFDIO_COPY_MODE_WP)
	mprotect(PROT_READ|PROT_WRITE)

At this point the user would expect to still get UFFD notification when
the page is accessed for write, but the user would not get one, since the
PTE was not marked as UFFD_WP during UFFDIO_COPY.

Fix it by always marking PTEs as UFFD_WP regardless on the
write-permission in the VMA flags.

Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com
Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit")
Signed-off-by: Nadav Amit <namit@vmware.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/userfaultfd.c |   15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

--- a/mm/userfaultfd.c~userfaultfd-mark-uffd_wp-regardless-of-vm_write-flag
+++ a/mm/userfaultfd.c
@@ -72,12 +72,15 @@ int mfill_atomic_install_pte(struct mm_s
 	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
-	if (writable) {
-		if (wp_copy)
-			_dst_pte = pte_mkuffd_wp(_dst_pte);
-		else
-			_dst_pte = pte_mkwrite(_dst_pte);
-	}
+
+	/*
+	 * Always mark a PTE as write-protected when needed, regardless of
+	 * VM_WRITE, which the user might change.
+	 */
+	if (wp_copy)
+		_dst_pte = pte_mkuffd_wp(_dst_pte);
+	else if (writable)
+		_dst_pte = pte_mkwrite(_dst_pte);
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
Peter Xu March 17, 2022, 12:11 a.m. UTC | #4
Hi, Andrew,

On Wed, Mar 16, 2022 at 03:05:53PM -0700, Andrew Morton wrote:
> 
> As I understand it, this patch (below) is still good to merge upstream,
> although Peter hasn't acked it (please).

Thanks for asking.  I didn't ack because I saw that it's queued a long time
ago into -mm, and also it's in -next for a long time too (my new uffd-wp
patchset is rebased to this one already).

I normally assume that means you read and ack that patch already, so if I
didn't see anything obviously wrong I'll just keep silent. Please let me
know if that's not the expected behavior..

So here it is..

Acked-by: Peter Xu <peterx@redhat.com>

> 
> And a whole bunch of followup patches are being thought about, but have
> yet to eventuate.

Is there a pointer/subject?

> 
> Do we feel that this patch warrants the cc:stable?  I'm suspecting
> "no", as it isn't clear that the use-case is really legitimate at this
> time?

Right. Uffd-wp+mprotect usage is IMHO not a major use case for uffd-wp per
my knowledge. At least I didn't even expect both work together, not until
Nadav started working on some of the problems..

Said that, for this specific case it should be literally only changing the
behavior of anonymous UFFD-WP && !WRITE case but nothing else (please
correct me otherwise..), then IMHO it's pretty safe to copy stable too
especially for the cleanly applicable branches.

Thanks,
diff mbox series

Patch

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 0780c2a57ff1..885e5adb0168 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -72,12 +72,15 @@  int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd,
 	_dst_pte = pte_mkdirty(_dst_pte);
 	if (page_in_cache && !vm_shared)
 		writable = false;
-	if (writable) {
-		if (wp_copy)
-			_dst_pte = pte_mkuffd_wp(_dst_pte);
-		else
-			_dst_pte = pte_mkwrite(_dst_pte);
-	}
+
+	/*
+	 * Always mark a PTE as write-protected when needed, regardless of
+	 * VM_WRITE, which the user might change.
+	 */
+	if (wp_copy)
+		_dst_pte = pte_mkuffd_wp(_dst_pte);
+	else if (writable)
+		_dst_pte = pte_mkwrite(_dst_pte);
 
 	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);