Message ID | 20220217211602.2769-1-namit@vmware.com |
---|---|
State | New |
Headers | show |
Series | userfaultfd: mark uffd_wp regardless of VM_WRITE flag | expand |
On Thu, Feb 17, 2022 at 08:00:12PM -0800, Nadav Amit wrote: > > > On Feb 17, 2022, at 6:23 PM, Nadav Amit <nadav.amit@gmail.com> wrote: > > > >> PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be: > >> > >> if (dirty_accountable && pte_dirty(ptent) && > >> (pte_soft_dirty(ptent) || > >> (vma->vm_flags & VM_SOFTDIRTY))) { > >> ptent = pte_mkwrite(ptent); > >> } > > I know it is off-topic (not directly related to my patch), but > I tried to understand the logic - both of the existing code and of > your suggested change - and I failed. > > IIUC dirty_accountable (whose value is taken from > vma_wants_writenotify()) means that the writes *should* be tracked, > and therefore the page should remain read-only. Right. > > So basically the condition should have been based on > !dirty_accountable, i.e. the inverted value of dirty_accountable. > > The problem is that dirty_accountable also reflects VM_SOFTDIRTY > considerations, so looking on the PTE does not tell you whether > the PTE should remain write-protected even if it is dirty. My understanding is that the dirty bits (especially if both set) means we've tracked dirty on this pte already so we don't need to, hence we can set the dirty bit here. E.g., continuous mprotect(RO), mprotect(RW) upon a full dirty pte. When something wants to enable tracking again, it needs to clear the dirty bit, either the real one or soft-dirty one. So it's a pure performance enhancement to conditionally set write bit here, when we're sure we won't need any further tracking on this pte. One thing to mention is that this path only applies to VM_SHARED|VM_WRITE, because that's what checked the first in vma_wants_writenotify(): /* If it was private or non-writable, the write bit is already clear */ if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) return 0; IOW private mappings are not optimized in current tree yet. Peter Collingbourne proposed a patch some time ago to optimize it but it didn't get merged somehow. Meanwhile even with his latest version it should still miss the thp case, so if to reference the private optimization Andrea's tree would be the best: https://github.com/aagit/aa/commit/fadb5e04d94472614c76819acd979b2f60e4eff6 Hope it clarifies things a bit. Thanks,
> On Feb 20, 2022, at 10:23 PM, Peter Xu <peterx@redhat.com> wrote: > > On Thu, Feb 17, 2022 at 08:00:12PM -0800, Nadav Amit wrote: >> >>> On Feb 17, 2022, at 6:23 PM, Nadav Amit <nadav.amit@gmail.com> wrote: >>> >>>> PS: I always think here the VM_SOFTDIRTY check is wrong, IMHO it should be: >>>> >>>> if (dirty_accountable && pte_dirty(ptent) && >>>> (pte_soft_dirty(ptent) || >>>> (vma->vm_flags & VM_SOFTDIRTY))) { >>>> ptent = pte_mkwrite(ptent); >>>> } >> >> I know it is off-topic (not directly related to my patch), but >> I tried to understand the logic - both of the existing code and of >> your suggested change - and I failed. >> >> IIUC dirty_accountable (whose value is taken from >> vma_wants_writenotify()) means that the writes *should* be tracked, >> and therefore the page should remain read-only. > > Right. > >> >> So basically the condition should have been based on >> !dirty_accountable, i.e. the inverted value of dirty_accountable. >> >> The problem is that dirty_accountable also reflects VM_SOFTDIRTY >> considerations, so looking on the PTE does not tell you whether >> the PTE should remain write-protected even if it is dirty. > > My understanding is that the dirty bits (especially if both set) means > we've tracked dirty on this pte already so we don't need to, hence we can > set the dirty bit here. E.g., continuous mprotect(RO), mprotect(RW) upon a > full dirty pte. > > When something wants to enable tracking again, it needs to clear the dirty > bit, either the real one or soft-dirty one. So it's a pure performance > enhancement to conditionally set write bit here, when we're sure we won't > need any further tracking on this pte. > > One thing to mention is that this path only applies to VM_SHARED|VM_WRITE, > because that's what checked the first in vma_wants_writenotify(): > > /* If it was private or non-writable, the write bit is already clear */ > if ((vm_flags & (VM_WRITE|VM_SHARED)) != ((VM_WRITE|VM_SHARED))) > return 0; > > IOW private mappings are not optimized in current tree yet. > > Peter Collingbourne proposed a patch some time ago to optimize it but it > didn't get merged somehow. Meanwhile even with his latest version it > should still miss the thp case, so if to reference the private optimization > Andrea's tree would be the best: > > https://github.com/aagit/aa/commit/fadb5e04d94472614c76819acd979b2f60e4eff6 > > Hope it clarifies things a bit. Thanks, Thanks for the clarification. That’s what I suspected - I did not encounter it since I only used private anonymous mappings. I will try to create a test-case and send an additional fix for this issue. Regards, Nadav
As I understand it, this patch (below) is still good to merge upstream, although Peter hasn't acked it (please). And a whole bunch of followup patches are being thought about, but have yet to eventuate. Do we feel that this patch warrants the cc:stable? I'm suspecting "no", as it isn't clear that the use-case is really legitimate at this time? Thanks. From: Nadav Amit <namit@vmware.com> Subject: userfaultfd: mark uffd_wp regardless of VM_WRITE flag When a PTE is set by UFFD operations such as UFFDIO_COPY, the PTE is currently only marked as write-protected if the VMA has VM_WRITE flag set. This seems incorrect or at least would be unexpected by the users. Consider the following sequence of operations that are being performed on a certain page: mprotect(PROT_READ) UFFDIO_COPY(UFFDIO_COPY_MODE_WP) mprotect(PROT_READ|PROT_WRITE) At this point the user would expect to still get UFFD notification when the page is accessed for write, but the user would not get one, since the PTE was not marked as UFFD_WP during UFFDIO_COPY. Fix it by always marking PTEs as UFFD_WP regardless on the write-permission in the VMA flags. Link: https://lkml.kernel.org/r/20220217211602.2769-1-namit@vmware.com Fixes: 292924b26024 ("userfaultfd: wp: apply _PAGE_UFFD_WP bit") Signed-off-by: Nadav Amit <namit@vmware.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- mm/userfaultfd.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) --- a/mm/userfaultfd.c~userfaultfd-mark-uffd_wp-regardless-of-vm_write-flag +++ a/mm/userfaultfd.c @@ -72,12 +72,15 @@ int mfill_atomic_install_pte(struct mm_s _dst_pte = pte_mkdirty(_dst_pte); if (page_in_cache && !vm_shared) writable = false; - if (writable) { - if (wp_copy) - _dst_pte = pte_mkuffd_wp(_dst_pte); - else - _dst_pte = pte_mkwrite(_dst_pte); - } + + /* + * Always mark a PTE as write-protected when needed, regardless of + * VM_WRITE, which the user might change. + */ + if (wp_copy) + _dst_pte = pte_mkuffd_wp(_dst_pte); + else if (writable) + _dst_pte = pte_mkwrite(_dst_pte); dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
Hi, Andrew, On Wed, Mar 16, 2022 at 03:05:53PM -0700, Andrew Morton wrote: > > As I understand it, this patch (below) is still good to merge upstream, > although Peter hasn't acked it (please). Thanks for asking. I didn't ack because I saw that it's queued a long time ago into -mm, and also it's in -next for a long time too (my new uffd-wp patchset is rebased to this one already). I normally assume that means you read and ack that patch already, so if I didn't see anything obviously wrong I'll just keep silent. Please let me know if that's not the expected behavior.. So here it is.. Acked-by: Peter Xu <peterx@redhat.com> > > And a whole bunch of followup patches are being thought about, but have > yet to eventuate. Is there a pointer/subject? > > Do we feel that this patch warrants the cc:stable? I'm suspecting > "no", as it isn't clear that the use-case is really legitimate at this > time? Right. Uffd-wp+mprotect usage is IMHO not a major use case for uffd-wp per my knowledge. At least I didn't even expect both work together, not until Nadav started working on some of the problems.. Said that, for this specific case it should be literally only changing the behavior of anonymous UFFD-WP && !WRITE case but nothing else (please correct me otherwise..), then IMHO it's pretty safe to copy stable too especially for the cleanly applicable branches. Thanks,
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 0780c2a57ff1..885e5adb0168 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -72,12 +72,15 @@ int mfill_atomic_install_pte(struct mm_struct *dst_mm, pmd_t *dst_pmd, _dst_pte = pte_mkdirty(_dst_pte); if (page_in_cache && !vm_shared) writable = false; - if (writable) { - if (wp_copy) - _dst_pte = pte_mkuffd_wp(_dst_pte); - else - _dst_pte = pte_mkwrite(_dst_pte); - } + + /* + * Always mark a PTE as write-protected when needed, regardless of + * VM_WRITE, which the user might change. + */ + if (wp_copy) + _dst_pte = pte_mkuffd_wp(_dst_pte); + else if (writable) + _dst_pte = pte_mkwrite(_dst_pte); dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);