[v11,0/7] Implement IOCTL to get and optionally clear info about PTEs

Message ID 20230309135718.1490461-1-usama.anjum@collabora.com


Muhammad Usama Anjum March 9, 2023, 1:57 p.m. UTC
These patches are based on next-20230307 and UFFD_FEATURE_WP_UNPOPULATED
patches from Peter.

*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED (https://lore.kernel.org/all/20230306213925.617814-1-peterx@redhat.com)
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with UFFDIO_WRITEPROTECT

*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
  async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation

*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code

*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation

*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
  flags

Hello,

Note:
Soft-dirty pages and pages which have been written to are synonyms here.
As the kernel already has a soft-dirty feature, which we have given up on
using, we use the written-to terminology while using UFFD async WP under
the hood.

This IOCTL, PAGEMAP_SCAN, on the pagemap file can be used to get and/or
clear information about page table entries. The following operations are
supported by this ioctl:
- Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
  file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
  (PAGE_IS_SWAPPED).
- Write-protect the pages (PAGEMAP_WP_ENGAGE) to start finding which
  pages have been written-to.
- Find pages which have been written-to and write protect the pages
  (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)

It is possible to find and clear soft-dirty pages entirely in userspace,
but it isn't efficient:
- mprotect plus a SIGSEGV handler for bookkeeping
- Synchronous userfaultfd wp with a fault handler for bookkeeping

Some benchmarks can be seen here[1]. This series adds features that weren't
present earlier:
- There was no way to atomically get the soft-dirty/written-to status and
  clear it in the kernel.
- Pages which have been written to could not be found accurately. (The
  kernel's soft-dirty PTE bit + soft-dirty VMA bit report more soft-dirty
  pages than there actually are.)

Historically, soft-dirty PTE bit tracking has been used in the CRIU
project. The procfs interface is enough for finding the soft-dirty bit
status and clearing the soft-dirty bit of all the pages of a process.
We have a use case where we need to track the soft-dirty PTE bit of only
specific pages on demand. We need this tracking and clearing mechanism
for a region of memory while the process is running, to emulate the
getWriteWatch() syscall of Windows.

*(Moved to using UFFD instead of the soft-dirty feature to find pages
which have been written to, as of the v7 patch series)*:
Stop using the soft-dirty flags for finding which pages have been
written to. They are too delicate and wrong, as they show more soft-dirty
pages than there actually are. There is no interest in correcting
this [2][3], as this is how the feature was written years ago and its
behaviour shouldn't be changed now. Peter Xu has suggested using the
async version of the UFFD WP [4] as it is based inherently on the PTEs.

So in this patch series, I've added a new mode to UFFD which is an
asynchronous version of write protect. When this variant of UFFD WP is
used, the page faults are resolved automatically by the kernel. The
pages which have been written to can be found by reading the pagemap
file (!PM_UFFD_WP). This feature can be used successfully to find which
pages have been written to from the time the pages were write-protected.
This works just like the soft-dirty flag, without showing any extra
pages which aren't soft-dirty in reality.

The information about whether a page is file-mapped, present, or swapped
is required for the CRIU project [5][6]. The addition of the required
mask, any-of mask, excluded mask, and return mask is also required for
the CRIU project [5].

The IOCTL returns the addresses of the pages which match the specified
masks. The page addresses are returned in struct page_region in a
compact form. max_pages is needed to support a use case where the user
only wants a specific number of pages: there is no need to find all the
pages of interest in the range when max_pages is specified, as the IOCTL
returns once the maximum number of pages has been found. max_pages is
optional. If max_pages is specified, it must be greater than or equal to
the vec_size. This restriction is needed to handle the worst case, when
one page_region only contains info about one page and cannot be
compacted. This is needed to emulate the Windows getWriteWatch()
syscall.

The patch series includes a detailed selftest which can be used as an
example for the uffd async wp test and PAGEMAP_IOCTL. It shows the
interface usage as well.

[1] https://lore.kernel.org/lkml/54d4c322-cd6e-eefd-b161-2af2b56aae24@collabora.com/
[2] https://lore.kernel.org/all/20221220162606.1595355-1-usama.anjum@collabora.com
[3] https://lore.kernel.org/all/20221122115007.2787017-1-usama.anjum@collabora.com
[4] https://lore.kernel.org/all/Y6Hc2d+7eTKs7AiH@x1n
[5] https://lore.kernel.org/all/YyiDg79flhWoMDZB@gmail.com/
[6] https://lore.kernel.org/all/20221014134802.1361436-1-mdanylo@google.com/

Regards,
Muhammad Usama Anjum

Muhammad Usama Anjum (7):
  userfaultfd: Add UFFD WP Async support
  userfaultfd: Define dummy uffd_wp_range()
  userfaultfd: update documentation to describe UFFD_FEATURE_WP_ASYNC
  fs/proc/task_mmu: Implement IOCTL to get and optionally clear info
    about PTEs
  tools headers UAPI: Update linux/fs.h with the kernel sources
  mm/pagemap: add documentation of PAGEMAP_SCAN IOCTL
  selftests: mm: add pagemap ioctl tests

 Documentation/admin-guide/mm/pagemap.rst     |  56 ++
 Documentation/admin-guide/mm/userfaultfd.rst |  21 +
 fs/proc/task_mmu.c                           | 366 ++++++++
 fs/userfaultfd.c                             |  25 +-
 include/linux/userfaultfd_k.h                |  14 +
 include/uapi/linux/fs.h                      |  53 ++
 include/uapi/linux/userfaultfd.h             |  11 +-
 mm/memory.c                                  |  27 +-
 tools/include/uapi/linux/fs.h                |  53 ++
 tools/testing/selftests/mm/.gitignore        |   1 +
 tools/testing/selftests/mm/Makefile          |   4 +-
 tools/testing/selftests/mm/config            |   1 +
 tools/testing/selftests/mm/pagemap_ioctl.c   | 920 +++++++++++++++++++
 tools/testing/selftests/mm/run_vmtests.sh    |   4 +
 14 files changed, 1549 insertions(+), 7 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pagemap_ioctl.c
 mode change 100644 => 100755 tools/testing/selftests/mm/run_vmtests.sh

Comments

Michał Mirosław March 13, 2023, 4:02 p.m. UTC | #1
On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
> the info about page table entries. The following operations are supported
> in this ioctl:
> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>   (PAGE_IS_SWAPPED).
> - Find pages which have been written-to and write protect the pages
>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
[...]
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -19,6 +19,7 @@
>  #include <linux/shmem_fs.h>
>  #include <linux/uaccess.h>
>  #include <linux/pkeys.h>
> +#include <linux/minmax.h>
>
>  #include <asm/elf.h>
>  #include <asm/tlb.h>
> @@ -1132,6 +1133,18 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>  }
>  #endif
>
> +static inline bool is_pte_uffd_wp(pte_t pte)
> +{
> +       return ((pte_present(pte) && pte_uffd_wp(pte)) ||
> +               (pte_swp_uffd_wp_any(pte)));

Parentheses around pte_swp_uffd_wp_any() are redundant. Please remove
here and in all following if()s. (Nit: those extra parentheses are
used inconsistently in the patch anyway.)

[...]
> +static inline bool pagemap_scan_is_wt_required(struct pagemap_scan_private *p)

This seems to check if the PAGE_IS_WRITTEN flag is tested, so
"pagemap_scan_needs_wp_checks()"? Or maybe document/expand the "wt"
acronym, as it seems to also be used in the following code.

> +{
> +       return  ((p->required_mask & PAGE_IS_WRITTEN) ||
> +                (p->anyof_mask & PAGE_IS_WRITTEN) ||
> +                (p->excluded_mask & PAGE_IS_WRITTEN));

Nit: It looks like it should answer "do any of the masks contain
PAGE_IS_WRITTEN?" so maybe:

return (p->required_mask | p->anyof_mask | p->excluded_mask) & PAGE_IS_WRITTEN;

[...]

> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> +                              struct pagemap_scan_private *p,
> +                              unsigned long addr, unsigned int n_pages)
> +{
> +       unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
> +       struct page_region *cur = &p->cur;
> +       bool cpy = true;
> +
> +       if (p->max_pages && (p->found_pages == p->max_pages))
> +               return -ENOSPC;
> +
> +       if (!n_pages)
> +               return -EINVAL;
> +
> +       if (p->required_mask)
> +               cpy = ((p->required_mask & bitmap) == p->required_mask);
> +       if (cpy && p->anyof_mask)
> +               cpy = (p->anyof_mask & bitmap);
> +       if (cpy && p->excluded_mask)
> +               cpy = !(p->excluded_mask & bitmap);

Since the rest of the code is executed only when `cpy` is true, this
could just return early for easier understanding.

BTW, some of the tests are redundant. Eg: if required_mask == 0, then
`required_mask & x == required_mask` will always hold. Same for
`excluded_mask & x == 0`.

> +
> +       bitmap = bitmap & p->return_mask;

Nit: bitmap &= p->return_mask;

> +       if (cpy && bitmap) {

Assuming early returns on `!cpy` are done earlier:

if (!bitmap)
  return 0;

> +               if ((cur->len) && (cur->bitmap == bitmap) &&
> +                   (cur->start + cur->len * PAGE_SIZE == addr)) {

I'd recommend removing the extra parentheses as they make the code
less readable for me (too many parentheses to match visually).
The `cur->len` test seems redundant: is it possible to have
`cur->start == addr` in that case (I guess it would have to get
`n_pages == 0` in an earlier invocation)?

> +
> +                       cur->len += n_pages;
> +                       p->found_pages += n_pages;

Please add an early return so that 'else' chaining won't be necessary.

> +               } else if ((!p->vec_index) ||
> +                          ((p->vec_index + 1) < p->vec_len)) {

Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:

if (vec_index >= p->vec_len)
    return -ENOSPC;

> +                       if (cur->len) {
> +                               memcpy(&p->vec[p->vec_index], cur,
> +                                      sizeof(struct page_region));
> +                               p->vec_index++;
> +                       }
> +
> +                       cur->start = addr;
> +                       cur->len = n_pages;
> +                       cur->bitmap = bitmap;
> +                       p->found_pages += n_pages;
> +               } else {
> +                       return -ENOSPC;
> +               }
> +       }
> +
> +       return 0;
> +}
[...]

> +static int pagemap_scan_deposit(struct pagemap_scan_private *p,
> +                               struct page_region __user *vec,
> +                               unsigned long *vec_index)
> +{
> +       struct page_region *cur = &p->cur;
> +
> +       if (cur->len) {

if (!cur->len)
  return 0;

> +               if (copy_to_user(&vec[*vec_index], cur,
> +                                sizeof(struct page_region)))
> +                       return -EFAULT;
> +
> +               p->vec_index++;
> +               (*vec_index)++;
> +       }
> +
> +       return 0;
> +}

> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> +                                 unsigned long end, struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       bool is_writ, is_file, is_pres, is_swap;
> +       unsigned long addr = end;
> +       spinlock_t *ptl;
> +       int ret = 0;
> +       pte_t *pte;
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE

Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
NULL if transparent hugepages are not compiled in. OTOH I see
BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
case), so that would need to be worked around first.

> +       ptl = pmd_trans_huge_lock(pmd, vma);
> +       if (ptl) {
> +               unsigned long n_pages;
> +
> +               is_writ = !is_pmd_uffd_wp(*pmd);

`is_written`?

> +               /*
> +                * Break huge page into small pages if operation needs to be
> +                * performed is on a portion of the huge page.
> +                */
> +               if (is_writ && PM_SCAN_OP_IS_WP(p) &&
> +                   (end - start < HPAGE_SIZE)) {
> +                       spin_unlock(ptl);
> +
> +                       split_huge_pmd(vma, pmd, start);
> +                       goto process_smaller_pages;
> +               }
> +
> +               n_pages = (end - start)/PAGE_SIZE;
> +               if (p->max_pages &&
> +                   p->found_pages + n_pages >= p->max_pages)

Nit: greater-than is also correct and avoids no-op assignment.

> +                       n_pages = p->max_pages - p->found_pages;
> +
> +               ret = pagemap_scan_output(is_writ, vma->vm_file,
> +                                         pmd_present(*pmd), is_swap_pmd(*pmd),
> +                                         p, start, n_pages);
> +               spin_unlock(ptl);

if (ret || !is_written)
  return ret;

This will avoid those tests in the following if().

> +
> +               if (!ret && is_writ && PM_SCAN_OP_IS_WP(p) &&
> +                   uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true) < 0)
> +                       ret = -EINVAL;
> +
> +               return ret;

After above early returns, this will be always `return 0;`.

> +       }
> +process_smaller_pages:
> +       if (pmd_trans_unstable(pmd))
> +               return 0;
> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
> +
> +       for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {

The `!ret` can be removed if the EINVAL case was to `break` by itself.

> +               pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> +               is_writ = !is_pte_uffd_wp(*pte);
> +               is_file = vma->vm_file;
> +               is_pres = pte_present(*pte);
> +               is_swap = is_swap_pte(*pte);
> +
> +               pte_unmap_unlock(pte, ptl);
> +
> +               ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> +                                         p, addr, 1);
> +               if (ret)
> +                       break;
> +
> +               if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> +                   uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> +                       ret = -EINVAL;
> +       }
> +
> +       cond_resched();
> +       return ret;
> +}
> +
> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
> +                                int depth, struct mm_walk *walk)
> +{
> +       struct pagemap_scan_private *p = walk->private;
> +       struct vm_area_struct *vma = walk->vma;
> +       unsigned long n_pages;
> +       int ret = 0;
> +
> +       if (vma) {

if (!vma) return 0;

> +               n_pages = (end - addr)/PAGE_SIZE;
> +               if (p->max_pages &&
> +                   p->found_pages + n_pages >= p->max_pages)
> +                       n_pages = p->max_pages - p->found_pages;
> +
> +               ret = pagemap_scan_output(false, vma->vm_file, false, false, p,
> +                                         addr, n_pages);
> +       }
> +
> +       return ret;
> +}


> +/* No hugetlb support is present. */

"FIXME: hugetlb support is not implemented."? (There seems to be no
#ifdef CONFIG_HUGETLB or similar, so I guess the comment is about the
current implementation.)

> +static const struct mm_walk_ops pagemap_scan_ops = {
> +       .test_walk = pagemap_scan_test_walk,
> +       .pmd_entry = pagemap_scan_pmd_entry,
> +       .pte_hole = pagemap_scan_pte_hole,
> +};
> +
> +static bool pagemap_scan_args_valid(struct pm_scan_arg *arg,
> +                                   struct page_region __user *vec,
> +                                   unsigned long start)
> +{
> +       /* Detect illegal size, flags and masks */
> +       if (arg->size != sizeof(struct pm_scan_arg))
> +               return false;
> +       if (arg->flags & ~PM_SCAN_OPS)
> +               return false;
> +       if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
> +            arg->return_mask) & ~PM_SCAN_BITS_ALL)
> +               return false;

> +       if (!arg->required_mask && !arg->anyof_mask &&
> +           !arg->excluded_mask)
> +               return false;

Is there an assumption in the code that those checks are needed? I'd
expect that no selection criteria makes a valid page set?

> +       if (!arg->return_mask)
> +               return false;
> +
> +       /* Validate memory ranges */
> +       if (!(arg->flags & PM_SCAN_OP_GET))
> +               return false;
> +       if (!arg->vec)
> +               return false;
> +       if (arg->vec_len == 0)
> +               return false;

> +       if (!access_ok((void __user *)vec,
> +                      arg->vec_len * sizeof(struct page_region)))
> +               return false;

Is there a provision that userspace threads are all blocked from
manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
and the writes should be checked each time as another userspace thread
could remap the memory while the ioctl() is working. Anyway, the
return should be EFAULT for this case.

> +       if (!IS_ALIGNED(start, PAGE_SIZE))
> +               return false;
> +       if (!access_ok((void __user *)start, arg->len))
> +               return false;

This I guess wants to check if the range to be scanned is mapped -
but isn't this what the ioctl() should do during the scan? (But, also
see above.)

> +       if (PM_SCAN_OP_IS_WP(arg)) {

if (!...IS_WP) return true;

> +               if (arg->required_mask & PM_SCAN_NON_WT_BITS)
> +                       return false;
> +               if (arg->anyof_mask & PM_SCAN_NON_WT_BITS)
> +                       return false;
> +               if (arg->excluded_mask & PM_SCAN_NON_WT_BITS)
> +                       return false;

Please see: pagemap_scan_is_wt_required comment. Also, it seems this
constant is used only here, so ~PAGE_IS_WRITTEN might be enough?

[...]
> +static long do_pagemap_cmd(struct mm_struct *mm, struct pm_scan_arg *arg)
> +{
> +       unsigned long start, end, walk_start, walk_end;
> +       unsigned long empty_slots, vec_index = 0;
> +       struct page_region __user *vec;
> +       struct pagemap_scan_private p;
> +       int ret = 0;
> +
> +       start = (unsigned long)untagged_addr(arg->start);
> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
> +
> +       if (!pagemap_scan_args_valid(arg, vec, start))
> +               return -EINVAL;
> +
> +       end = start + arg->len;
> +       p.max_pages = arg->max_pages;
> +       p.found_pages = 0;
> +       p.flags = arg->flags;
> +       p.required_mask = arg->required_mask;
> +       p.anyof_mask = arg->anyof_mask;
> +       p.excluded_mask = arg->excluded_mask;
> +       p.return_mask = arg->return_mask;
> +       p.cur.len = 0;
> +       p.vec = NULL;
> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
> +
> +       /*
> +        * Allocate smaller buffer to get output from inside the page walk
> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
> +        * we want to return output to user in compact form where no two
> +        * consecutive regions should be continuous and have the same flags.
> +        * So store the latest element in p.cur between different walks and
> +        * store the p.cur at the end of the walk to the user buffer.
> +        */
> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
> +                             GFP_KERNEL);
> +       if (!p.vec)
> +               return -ENOMEM;
> +
> +       walk_start = walk_end = start;
> +       while (walk_end < end) {
> +               p.vec_index = 0;
> +
> +               empty_slots = arg->vec_len - vec_index;
> +               p.vec_len = min(p.vec_len, empty_slots);
> +
> +               walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
> +               if (walk_end > end)
> +                       walk_end = end;
> +
> +               mmap_read_lock(mm);
> +               ret = walk_page_range(mm, walk_start, walk_end,
> +                                     &pagemap_scan_ops, &p);
> +               mmap_read_unlock(mm);
> +
> +               if (!(!ret || ret == -ENOSPC))

if (ret && ret != -ENOSPC)

> +                       goto free_data;
> +
> +               walk_start = walk_end;
> +               if (p.vec_index) {
> +                       if (copy_to_user(&vec[vec_index], p.vec,
> +                                        p.vec_index *
> +                                        sizeof(struct page_region))) {
> +                               ret = -EFAULT;
> +                               goto free_data;
> +                       }
> +                       vec_index += p.vec_index;
> +               }
> +       }
> +       ret = pagemap_scan_deposit(&p, vec, &vec_index);
> +       if (!ret)
> +               ret = vec_index;
> +free_data:
> +       kfree(p.vec);
> +
> +       return ret;
> +}
> +
> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
> +                              unsigned long arg)
> +{
> +       struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
> +       struct mm_struct *mm = file->private_data;
> +       struct pm_scan_arg argument;
> +
> +       if (cmd == PAGEMAP_SCAN) {

switch() for easier expansion later?

> +               if (copy_from_user(&argument, uarg,
> +                                  sizeof(struct pm_scan_arg)))

sizeof(argument);

Could you push this to do_pagemap_cmd()? In case this file gets more
ioctl() commands there won't be need to add more command-specific
structures in this function.

> +                       return -EFAULT;
> +               return do_pagemap_cmd(mm, &argument);
> +       }
> +
> +       return -EINVAL;
> +}
> +
>  const struct file_operations proc_pagemap_operations = {
>         .llseek         = mem_lseek, /* borrow this */
>         .read           = pagemap_read,
>         .open           = pagemap_open,
>         .release        = pagemap_release,
> +       .unlocked_ioctl = pagemap_scan_ioctl,
> +       .compat_ioctl   = pagemap_scan_ioctl,

Is this correct? Would the code need a different userspace pointer
handling for 32-bit userspace on 64-bit kernel?

>  };
>  #endif /* CONFIG_PROC_PAGE_MONITOR */
Peter Xu March 15, 2023, 3:55 p.m. UTC | #2
On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> +
> +		is_writ = !is_pte_uffd_wp(*pte);
> +		is_file = vma->vm_file;
> +		is_pres = pte_present(*pte);
> +		is_swap = is_swap_pte(*pte);
> +
> +		pte_unmap_unlock(pte, ptl);
> +
> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> +					  p, addr, 1);
> +		if (ret)
> +			break;
> +
> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> +			ret = -EINVAL;
> +	}

This is not real atomic..

Taking the spinlock for each pte is not only overkill but wrong in
atomicity because the pte can change right after spinlock unlocked.

Unfortunately you also cannot reuse uffd_wp_range() because that's not
atomic either, my fault here.  Probably I was thinking mostly from
soft-dirty pov on batching the collect+reset.

You need to take the spin lock, collect whatever bits, set/clear whatever
bits, only until then release the spin lock.

"Not atomic" means you can have some page got dirtied but you could miss
it.  Depending on how strict you want, I think it'll break apps like CRIU
if strict atomicity needed for migrating a process.  If we want to have a
new interface anyway, IMHO we'd better do that in the strict way.

Same comment applies to the THP handling (where I cut from the context).
Muhammad Usama Anjum March 15, 2023, 4:54 p.m. UTC | #3
On 3/15/23 8:55 PM, Peter Xu wrote:
> On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
>> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
>> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> +
>> +		is_writ = !is_pte_uffd_wp(*pte);
>> +		is_file = vma->vm_file;
>> +		is_pres = pte_present(*pte);
>> +		is_swap = is_swap_pte(*pte);
>> +
>> +		pte_unmap_unlock(pte, ptl);
>> +
>> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>> +					  p, addr, 1);
>> +		if (ret)
>> +			break;
>> +
>> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>> +			ret = -EINVAL;
>> +	}
> 
> This is not real atomic..
> 
> Taking the spinlock for each pte is not only overkill but wrong in
> atomicity because the pte can change right after spinlock unlocked.
Let me explain. It seems wrong, but it isn't. In my rigorous testing,
it didn't show any side-effect.  Here we are finding out if a page has
been written to. Only if the page has been written to do we clear it.
Let's look at the different possibilities here:
- If a page isn't written to, we don't clear it.
- If a page is written to and there isn't any race, we clear the
written-to flag by write-protecting it.
- If a page is written to, but data is written to it again before we
clear it, the page remains written to and we clear it.
- If a page is written to, but it gets write-protected before we clear
it, we still write-protect it. There is double write protection here,
but no side-effect.

Let's turn this into a truth table for easier understanding. Here the
first column and third column represent the above code. The 2nd column
represents any other thread interacting with the page.

If page is written/dirty	some other task interacts	wp_page
no				does nothing			no
no				writes to page			no
no				wp the page			no
yes				does nothing			yes
yes				write to page			yes
yes				wp the page			yes

As you can see, there isn't any side-effect happening. We aren't
overdoing or underdoing the write protection.

Even if we were doing something wrong here and I brought the lock over
all of this, the pages could become written to or wp'ed just after
unlocking. That is expected. The current implementation doesn't seem to
break this.

Is my understanding wrong somewhere here? Can you point out?

Designs previous to this current locking design were either buggy or
slower when multiple threads were working on the same pages. The current
implementation removes those limitations:
- The memcpy inside pagemap_scan_output is happening with pte unlocked.
- We are only wp a page if we have noted this page to be dirty
- No mm write lock is required. Only read lock works fine just like
userfaultfd_writeprotect() takes only read lock.

There is only one con here that we are locking and unlocking the pte lock
again and again.

Please have a look at my explanation and let me know what do you think.

> 
> Unfortunately you also cannot reuse uffd_wp_range() because that's not
> atomic either, my fault here.  Probably I was thinking mostly from
> soft-dirty pov on batching the collect+reset.
> 
> You need to take the spin lock, collect whatever bits, set/clear whatever
> bits, only until then release the spin lock.
> 
> "Not atomic" means you can have some page got dirtied but you could miss
> it.  Depending on how strict you want, I think it'll break apps like CRIU
> if strict atomicity needed for migrating a process.  If we want to have a
> new interface anyway, IMHO we'd better do that in the strict way.
In my rigorous multi-threaded testing, where lots of threads are working
on the same set of pages, we aren't losing even a single update. I can
share the test if you want.

> 
> Same comment applies to the THP handling (where I cut from the context).
>
Peter Xu March 15, 2023, 7:53 p.m. UTC | #4
On Wed, Mar 15, 2023 at 09:54:40PM +0500, Muhammad Usama Anjum wrote:
> On 3/15/23 8:55 PM, Peter Xu wrote:
> > On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
> >> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
> >> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> >> +
> >> +		is_writ = !is_pte_uffd_wp(*pte);
> >> +		is_file = vma->vm_file;
> >> +		is_pres = pte_present(*pte);
> >> +		is_swap = is_swap_pte(*pte);
> >> +
> >> +		pte_unmap_unlock(pte, ptl);
> >> +
> >> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
> >> +					  p, addr, 1);
> >> +		if (ret)
> >> +			break;
> >> +
> >> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
> >> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
> >> +			ret = -EINVAL;
> >> +	}
> > 
> > This is not real atomic..
> > 
> > Taking the spinlock for each pte is not only overkill but wrong in
> > atomicity because the pte can change right after spinlock unlocked.
> Let me explain. It seems like wrong, but it isn't. In my rigorous testing,
> it didn't show any side-effect.  Here we are finding out if a page is
> written. If page is written, only then we clear it. Lets look at the
> different possibilities here:
> - If a page isn't written, we'll not clear it.
> - If a page is written and there isn't any race, we'll clear written-to
> flag by write protecting it.
> - If a page is written but before clearing it, data is written again to the
> page. The page would remain written and we'll clear it.
> - If a page is written but before clearing it, it gets write protected,
> we'll still write protected it. There is double right protection here, but
> no side-effect.
> 
> Lets turn this into a truth table for easier understanding. Here first
> column and third column represents this above code. 2nd column represents
> any other thread interacting with the page.
> 
> If page is written/dirty	some other task interacts	wp_page
> no				does nothing			no
> no				writes to page			no
> no				wp the page			no
> yes				does nothing			yes
> yes				write to page			yes
> yes				wp the page			yes
> 
> As you can see there isn't any side-effect happening. We aren't over doing
> the wp or under-doing the write-protect.
> 
> Even if we were doing something wrong here and I bring the lock over all of
> this, the pages get become written or wp just after unlocking. It is
> expected. This current implementation doesn't seem to be breaking this.
> 
> Is my understanding wrong somewhere here? Can you point out?

Yes you're right.  With is_writ check it looks all fine.

> 
> Previous to this current locking design were either buggy or slower when
> multiple threads were working on same pages. Current implementation removes
> the limitations:
> - The memcpy inside pagemap_scan_output is happening with pte unlocked.

Why is this anything to worry about?  Doesn't that memcpy only apply to a
page_region struct?

> - We only wp a page if we have noted this page to be dirty
> - No mm write lock is required. Only the read lock is needed, just like
> userfaultfd_writeprotect() takes only the read lock.

I didn't even notice you used to use the write lock.  Yes, I think the read
lock is sufficient here.

> 
> There is only one con here: we are locking and unlocking the pte lock
> again and again.
> 
> Please have a look at my explanation and let me know what you think.

I think this is fine as long as the semantics are correct, which I believe
is the case.  The spinlock usage can be optimized, but that can be done on
top if it needs more involved changes.

> 
> > 
> > Unfortunately you also cannot reuse uffd_wp_range() because that's not
> > atomic either, my fault here.  Probably I was thinking mostly from
> > soft-dirty pov on batching the collect+reset.
> > 
> > You need to take the spin lock, collect whatever bits, set/clear whatever
> > bits, only until then release the spin lock.
> > 
> > "Not atomic" means some page can get dirtied but you could miss it.
> > Depending on how strict you want to be, I think it'll break apps like
> > CRIU if strict atomicity is needed for migrating a process.  If we want
> > to have a new interface anyway, IMHO we'd better do it in the strict way.
> In my rigorous multi-threaded testing where lots of threads are working
> on the same set of pages, we aren't losing even a single update. I can
> share the test if you want.

Good to have tests covering that.  I'd say you can add the test into
selftests along with the series when you repost if it's convenient.  It can
be part of an existing test or it can be a new one under mm/.

Thanks,
Muhammad Usama Anjum March 16, 2023, 5:17 a.m. UTC | #5
On 3/16/23 12:53 AM, Peter Xu wrote:
> On Wed, Mar 15, 2023 at 09:54:40PM +0500, Muhammad Usama Anjum wrote:
>> On 3/15/23 8:55 PM, Peter Xu wrote:
>>> On Thu, Mar 09, 2023 at 06:57:15PM +0500, Muhammad Usama Anjum wrote:
>>>> +	for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
>>>> +		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>>>> +
>>>> +		is_writ = !is_pte_uffd_wp(*pte);
>>>> +		is_file = vma->vm_file;
>>>> +		is_pres = pte_present(*pte);
>>>> +		is_swap = is_swap_pte(*pte);
>>>> +
>>>> +		pte_unmap_unlock(pte, ptl);
>>>> +
>>>> +		ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>>>> +					  p, addr, 1);
>>>> +		if (ret)
>>>> +			break;
>>>> +
>>>> +		if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>>>> +		    uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>>>> +			ret = -EINVAL;
>>>> +	}
>>>
>>> This is not real atomic..
>>>
>>> Taking the spinlock for each pte is not only overkill but wrong in
>>> atomicity, because the pte can change right after the spinlock is unlocked.
>> Let me explain. It seems wrong, but it isn't. In my rigorous testing,
>> it didn't show any side-effect.  Here we are finding out if a page has
>> been written to. Only if a page has been written do we clear the flag.
>> Let's look at the different possibilities here:
>> - If a page isn't written, we'll not clear it.
>> - If a page is written and there isn't any race, we'll clear the written-to
>> flag by write-protecting it.
>> - If a page is written but, before we clear it, data is written again to
>> the page, the page remains written and we'll clear it.
>> - If a page is written but, before we clear it, it gets write-protected,
>> we'll still write-protect it. There is double write protection here, but
>> no side-effect.
>>
>> Let's turn this into a truth table for easier understanding. Here the
>> first and third columns represent the above code. The 2nd column represents
>> any other thread interacting with the page.
>>
>> If page is written/dirty	some other task interacts	wp_page
>> no				does nothing			no
>> no				writes to page			no
>> no				wp the page			no
>> yes				does nothing			yes
>> yes				write to page			yes
>> yes				wp the page			yes
>>
>> As you can see, there isn't any side-effect happening. We aren't over-doing
>> or under-doing the write protection.
>>
>> Even if we were doing something wrong here and I extended the lock over all
>> of this, the pages could still become written or write-protected just after
>> unlocking. That is expected. The current implementation doesn't break this.
>>
>> Is my understanding wrong somewhere here? Can you point out?
> 
> Yes, you're right.  With the is_writ check it looks all fine.
> 
>>
>> The designs previous to this locking design were either buggy or slower
>> when multiple threads were working on the same pages. The current
>> implementation removes those limitations:
>> - The memcpy inside pagemap_scan_output is happening with pte unlocked.
> 
> Why is this anything to worry about?  Doesn't that memcpy only apply to a
> page_region struct?
Yeah, correct. I'm just saying that a memcpy without the pte lock held is
better than a memcpy with the pte locked. :)

> 
>> - We only wp a page if we have noted this page to be dirty
>> - No mm write lock is required. Only the read lock is needed, just like
>> userfaultfd_writeprotect() takes only the read lock.
> 
> I didn't even notice you used to use the write lock.  Yes, I think the read
> lock is sufficient here.
> 
>>
>> There is only one con here: we are locking and unlocking the pte lock
>> again and again.
>>
>> Please have a look at my explanation and let me know what you think.
> 
> I think this is fine as long as the semantics are correct, which I believe
> is the case.  The spinlock usage can be optimized, but that can be done on
> top if it needs more involved changes.
> 
>>
>>>
>>> Unfortunately you also cannot reuse uffd_wp_range() because that's not
>>> atomic either, my fault here.  Probably I was thinking mostly from
>>> soft-dirty pov on batching the collect+reset.
>>>
>>> You need to take the spin lock, collect whatever bits, set/clear whatever
>>> bits, only until then release the spin lock.
>>>
>>> "Not atomic" means some page can get dirtied but you could miss it.
>>> Depending on how strict you want to be, I think it'll break apps like
>>> CRIU if strict atomicity is needed for migrating a process.  If we want
>>> to have a new interface anyway, IMHO we'd better do it in the strict way.
>> In my rigorous multi-threaded testing where lots of threads are working
>> on the same set of pages, we aren't losing even a single update. I can
>> share the test if you want.
> 
> Good to have tests covering that.  I'd say you can add the test into
> selftests along with the series when you repost if it's convenient.  It can
> be part of an existing test or it can be a new one under mm/.
Sure, I'll add it to the selftests.

Thank you for reviewing and asking the questions.

> 
> Thanks,
>
Muhammad Usama Anjum March 16, 2023, 5:53 p.m. UTC | #6
Hi,

Thank you so much for reviewing.

On 3/13/23 9:02 PM, Michał Mirosław wrote:
> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> This IOCTL, PAGEMAP_SCAN on pagemap file can be used to get and/or clear
>> the info about page table entries. The following operations are supported
>> in this ioctl:
>> - Get the information if the pages have been written-to (PAGE_IS_WRITTEN),
>>   file mapped (PAGE_IS_FILE), present (PAGE_IS_PRESENT) or swapped
>>   (PAGE_IS_SWAPPED).
>> - Find pages which have been written-to and write protect the pages
>>   (atomic PAGE_IS_WRITTEN + PAGEMAP_WP_ENGAGE)
> [...]
>> --- a/fs/proc/task_mmu.c
>> +++ b/fs/proc/task_mmu.c
>> @@ -19,6 +19,7 @@
>>  #include <linux/shmem_fs.h>
>>  #include <linux/uaccess.h>
>>  #include <linux/pkeys.h>
>> +#include <linux/minmax.h>
>>
>>  #include <asm/elf.h>
>>  #include <asm/tlb.h>
>> @@ -1132,6 +1133,18 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
>>  }
>>  #endif
>>
>> +static inline bool is_pte_uffd_wp(pte_t pte)
>> +{
>> +       return ((pte_present(pte) && pte_uffd_wp(pte)) ||
>> +               (pte_swp_uffd_wp_any(pte)));
> 
> Parentheses around pte_swp_uffd_wp_any() are redundant. Please remove
> here and in all following if()s. (Nit: those extra parentheses are
> used inconsistently in the patch anyway.)
I'll remove these in next version.

> 
> [...]
>> +static inline bool pagemap_scan_is_wt_required(struct pagemap_scan_private *p)
> 
> This seems to check if the PAGE_IS_WRITTEN flag is tested, so
> "pagemap_scan_needs_wp_checks()"? Or maybe document/expand the "wt"
> acronym as it seems used also on following code.
I'll expand wt.

> 
>> +{
>> +       return  ((p->required_mask & PAGE_IS_WRITTEN) ||
>> +                (p->anyof_mask & PAGE_IS_WRITTEN) ||
>> +                (p->excluded_mask & PAGE_IS_WRITTEN));
> 
> Nit: It looks like it should answer "do any of the masks contain
> PAGE_IS_WRITTEN?" so maybe:
> 
> return (p->required_mask | p->anyof_mask | p->excluded_mask) & PAGE_IS_WRITTEN;
I'll update.

> 
> [...]
> 
>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>> +                              struct pagemap_scan_private *p,
>> +                              unsigned long addr, unsigned int n_pages)
>> +{
>> +       unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
>> +       struct page_region *cur = &p->cur;
>> +       bool cpy = true;
>> +
>> +       if (p->max_pages && (p->found_pages == p->max_pages))
>> +               return -ENOSPC;
>> +
>> +       if (!n_pages)
>> +               return -EINVAL;
>> +
>> +       if (p->required_mask)
>> +               cpy = ((p->required_mask & bitmap) == p->required_mask);
>> +       if (cpy && p->anyof_mask)
>> +               cpy = (p->anyof_mask & bitmap);
>> +       if (cpy && p->excluded_mask)
>> +               cpy = !(p->excluded_mask & bitmap);
> 
> Since the rest of the code is executed only when `cpy` is true, this
> could just return early for easier understanding.
Hmm... I'll do the following:
	if (!cpy || !bitmap)
		return 0;

> 
> BTW, some of the tests are redundant. Eg: if required_mask == 0, then
> `required_mask & x == required_mask` will always hold. Same for
> `excluded_mask & x == 0`.
Correct. This is why I'm checking whether required_mask is set before
comparing bitmap with it. required_mask may be 0 if not set; the if skips
the subsequent comparison in that case.

	if (p->required_mask)
		cpy = ((p->required_mask & bitmap) == p->required_mask);

I don't see any redundancy here. Please let me know if I'm missing something.

> 
>> +
>> +       bitmap = bitmap & p->return_mask;
> 
> Nit: bitmap &= p->return_mask;
Sure. Will do.

Just for my knowledge, what does "Nit" signify if a comment is marked with it?

> 
>> +       if (cpy && bitmap) {
> 
> Assuming early returns on `!cpy` are done earlier:
> 
> if (!bitmap)
>   return 0;
I've posted condition above which would better suit here.

> 
>> +               if ((cur->len) && (cur->bitmap == bitmap) &&
>> +                   (cur->start + cur->len * PAGE_SIZE == addr)) {
> 
> I'd recommend removing the extra parentheses as they make the code
> less readable for me (too many parentheses to match visually).
I'll remove the parentheses.

> The `cur->len` test seems redundant: is it possible to have
> `cur->start == addr` in that case (I guess it would have to get
> `n_pages == 0` in an earlier invocation)?
No, that wouldn't work. cur->len == 0 means cur holds only garbage, so it is
essential to check cur->len for validity before performing the other checks.
Also, cur->start can never be equal to addr as we are walking over page
addresses serially. What we check here is whether the current address
continues the previously stored data, by computing the ending address of the
last stored region (cur->start + cur->len * PAGE_SIZE).

> 
>> +
>> +                       cur->len += n_pages;
>> +                       p->found_pages += n_pages;
> 
> Please add an early return so that 'else' chaining won't be necessary.
I'll do it.

> 
>> +               } else if ((!p->vec_index) ||
>> +                          ((p->vec_index + 1) < p->vec_len)) {
> 
> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
> 
> if (vec_index >= p->vec_len)
>     return -ENOSPC;

No, that won't work. Let's leave it as it is. :)

It has gotten somewhat complex, but I don't have another way to make it
simpler that still works. First note the following points:
1) We walk over 512 pages or 1 THP at a time so as not to over-allocate
memory in the kernel (p->vec).
2) We also want to merge consecutive pages with the same flags into one
struct page_region. The p->vec of the current walk may merge with the next
walk, so we cannot write to user memory until we have the results of the
next walk.

So the most recent data is put into p->cur. When data that cannot be merged
is found, we move p->cur to p->vec[p->vec_index] inside the page walk. After
the page walk, p->vec[0 to p->vec_index] is moved to arg->vec. After all
the walks are over, we move p->cur to arg->vec. This completes the data
transfer to the user buffer.

	--------------
	|   p->cur   |
	--------------
	      |
	      |
	      V
	--------------
	|	     |
	|	     |
	|   p->vec   |
	|	     |
	|	     |
	--------------
	      |
	      |
	      V
	--------------
	|	     |
	|	     |
	|	     |
	|  arg->vec  |
	|	     |
	|	     |
	|	     |
	--------------


I'm sorry that it has gotten this complex. It was way simpler when we
walked over all the memory in one go, but then we needed an unbounded
amount of memory in the kernel, which we don't want.

> 
>> +                       if (cur->len) {
>> +                               memcpy(&p->vec[p->vec_index], cur,
>> +                                      sizeof(struct page_region));
>> +                               p->vec_index++;
>> +                       }
>> +
>> +                       cur->start = addr;
>> +                       cur->len = n_pages;
>> +                       cur->bitmap = bitmap;
>> +                       p->found_pages += n_pages;
>> +               } else {
>> +                       return -ENOSPC;
>> +               }
>> +       }
>> +
>> +       return 0;
>> +}
> [...]
> 
>> +static int pagemap_scan_deposit(struct pagemap_scan_private *p,
>> +                               struct page_region __user *vec,
>> +                               unsigned long *vec_index)
>> +{
>> +       struct page_region *cur = &p->cur;
>> +
>> +       if (cur->len) {
> 
> if (!cur->len)
>   return 0;
Sure.

> 
>> +               if (copy_to_user(&vec[*vec_index], cur,
>> +                                sizeof(struct page_region)))
>> +                       return -EFAULT;
>> +
>> +               p->vec_index++;
>> +               (*vec_index)++;
>> +       }
>> +
>> +       return 0;
>> +}
> 
>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>> +                                 unsigned long end, struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       bool is_writ, is_file, is_pres, is_swap;
>> +       unsigned long addr = end;
>> +       spinlock_t *ptl;
>> +       int ret = 0;
>> +       pte_t *pte;
>> +
>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> 
> Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
> NULL if transparent hugepages are not compiled in. OTOH I see
> BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
> case), so that would need to be worked around first.
I'd gotten the build error [1], so I'd added these. I've tested again with
the same config; we don't need these #ifdefs now, so I'll remove them.

[1] https://lore.kernel.org/all/202211120107.cYLiq2cH-lkp@intel.com

> 
>> +       ptl = pmd_trans_huge_lock(pmd, vma);
>> +       if (ptl) {
>> +               unsigned long n_pages;
>> +
>> +               is_writ = !is_pmd_uffd_wp(*pmd);
> 
> `is_written`?
I'd kept it as is_writ to match the pattern of is_file, is_pres and is_swap.
I'll update it to is_written, and is_pres to is_present.

> 
>> +               /*
>> +                * Break huge page into small pages if operation needs to be
>> +                * performed is on a portion of the huge page.
>> +                */
>> +               if (is_writ && PM_SCAN_OP_IS_WP(p) &&
>> +                   (end - start < HPAGE_SIZE)) {
>> +                       spin_unlock(ptl);
>> +
>> +                       split_huge_pmd(vma, pmd, start);
>> +                       goto process_smaller_pages;
>> +               }
>> +
>> +               n_pages = (end - start)/PAGE_SIZE;
>> +               if (p->max_pages &&
>> +                   p->found_pages + n_pages >= p->max_pages)
> 
> Nit: greater-than is also correct and avoids no-op assignment.
Ohh... I'll update.

> 
>> +                       n_pages = p->max_pages - p->found_pages;
>> +
>> +               ret = pagemap_scan_output(is_writ, vma->vm_file,
>> +                                         pmd_present(*pmd), is_swap_pmd(*pmd),
>> +                                         p, start, n_pages);
>> +               spin_unlock(ptl);
> 
> if (ret || !is_written)
>   return ret;
> 
> This will avoid those tests in the following if().
Done.

> 
>> +
>> +               if (!ret && is_writ && PM_SCAN_OP_IS_WP(p) &&
>> +                   uffd_wp_range(walk->mm, vma, start, HPAGE_SIZE, true) < 0)
>> +                       ret = -EINVAL;
>> +
>> +               return ret;
> 
> After above early returns, this will be always `return 0;`.
Sure.

> 
>> +       }
>> +process_smaller_pages:
>> +       if (pmd_trans_unstable(pmd))
>> +               return 0;
>> +#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>> +
>> +       for (addr = start; !ret && addr < end; pte++, addr += PAGE_SIZE) {
> 
> The `!ret` can be removed if the EINVAL case was to `break` by itself.
Sure. Will do.

> 
>> +               pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
>> +
>> +               is_writ = !is_pte_uffd_wp(*pte);
>> +               is_file = vma->vm_file;
>> +               is_pres = pte_present(*pte);
>> +               is_swap = is_swap_pte(*pte);
>> +
>> +               pte_unmap_unlock(pte, ptl);
>> +
>> +               ret = pagemap_scan_output(is_writ, is_file, is_pres, is_swap,
>> +                                         p, addr, 1);
>> +               if (ret)
>> +                       break;
>> +
>> +               if (PM_SCAN_OP_IS_WP(p) && is_writ &&
>> +                   uffd_wp_range(walk->mm, vma, addr, PAGE_SIZE, true) < 0)
>> +                       ret = -EINVAL;
>> +       }
>> +
>> +       cond_resched();
>> +       return ret;
>> +}
>> +
>> +static int pagemap_scan_pte_hole(unsigned long addr, unsigned long end,
>> +                                int depth, struct mm_walk *walk)
>> +{
>> +       struct pagemap_scan_private *p = walk->private;
>> +       struct vm_area_struct *vma = walk->vma;
>> +       unsigned long n_pages;
>> +       int ret = 0;
>> +
>> +       if (vma) {
> 
> if (!vma) return 0;
Will do.

> 
>> +               n_pages = (end - addr)/PAGE_SIZE;
>> +               if (p->max_pages &&
>> +                   p->found_pages + n_pages >= p->max_pages)
>> +                       n_pages = p->max_pages - p->found_pages;
>> +
>> +               ret = pagemap_scan_output(false, vma->vm_file, false, false, p,
>> +                                         addr, n_pages);
>> +       }
>> +
>> +       return ret;
>> +}
> 
> 
>> +/* No hugetlb support is present. */
> 
> "FIXME: hugetlb support is not implemented."? (There seems to be no
> #ifdef CONFIG_HUGETLB or similar, so I guess the comment is about the
> current implementation.)
I'm working on adding hugetlb support. I'll remove this comment.

> 
>> +static const struct mm_walk_ops pagemap_scan_ops = {
>> +       .test_walk = pagemap_scan_test_walk,
>> +       .pmd_entry = pagemap_scan_pmd_entry,
>> +       .pte_hole = pagemap_scan_pte_hole,
>> +};
>> +
>> +static bool pagemap_scan_args_valid(struct pm_scan_arg *arg,
>> +                                   struct page_region __user *vec,
>> +                                   unsigned long start)
>> +{
>> +       /* Detect illegal size, flags and masks */
>> +       if (arg->size != sizeof(struct pm_scan_arg))
>> +               return false;
>> +       if (arg->flags & ~PM_SCAN_OPS)
>> +               return false;
>> +       if ((arg->required_mask | arg->anyof_mask | arg->excluded_mask |
>> +            arg->return_mask) & ~PM_SCAN_BITS_ALL)
>> +               return false;
> 
>> +       if (!arg->required_mask && !arg->anyof_mask &&
>> +           !arg->excluded_mask)
>> +               return false;
> 
> Is there an assumption in the code that those checks are needed? I'd
> expect that no selection criteria makes a valid page set?
In my view, a selection criterion must be specified for the ioctl to work.
If there is no criterion, the user should just read the pagemap file
directly. So the assumption is that at least one selection criterion must
be specified.

> 
>> +       if (!arg->return_mask)
>> +               return false;
>> +
>> +       /* Validate memory ranges */
>> +       if (!(arg->flags & PM_SCAN_OP_GET))
>> +               return false;
>> +       if (!arg->vec)
>> +               return false;
>> +       if (arg->vec_len == 0)
>> +               return false;
> 
>> +       if (!access_ok((void __user *)vec,
>> +                      arg->vec_len * sizeof(struct page_region)))
>> +               return false;
> 
> Is there a provision that userspace threads are all blocked from
> manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
> and the writes should be checked each time as another userspace thread
> could remap the memory while the ioctl() is working.
The mincore() syscall does it the same way: it checks validity only at the
start. What provision should I add? Isn't it obvious that the user should
not remap such memory?

> Anyway, the
> return should be EFAULT for this case.
I'll update.

> 
>> +       if (!IS_ALIGNED(start, PAGE_SIZE))
>> +               return false;
>> +       if (!access_ok((void __user *)start, arg->len))
>> +               return false;
> 
> This I guess want's to check if the range to be scanned is mapped -
> but isn't this what the ioctl() should do during the scan? (But, also
> see above.)
No, start represents the memory which the user wants to watch. The user must
allocate this memory first and then pass its address to this ioctl to find
out the per-page flags.

> 
>> +       if (PM_SCAN_OP_IS_WP(arg)) {
> 
> if (!...IS_WP) return true;
I liked it this way, but anyway, I'll update.

> 
>> +               if (arg->required_mask & PM_SCAN_NON_WT_BITS)
>> +                       return false;
>> +               if (arg->anyof_mask & PM_SCAN_NON_WT_BITS)
>> +                       return false;
>> +               if (arg->excluded_mask & PM_SCAN_NON_WT_BITS)
>> +                       return false;
> 
> Please see: pagemap_scan_is_wt_required comment. Also, it seems this
> constant is used only here, so ~PAGE_IS_WRITTEN might be enough?
Yup, I'll update.

> 
> [...]
>> +static long do_pagemap_cmd(struct mm_struct *mm, struct pm_scan_arg *arg)
>> +{
>> +       unsigned long start, end, walk_start, walk_end;
>> +       unsigned long empty_slots, vec_index = 0;
>> +       struct page_region __user *vec;
>> +       struct pagemap_scan_private p;
>> +       int ret = 0;
>> +
>> +       start = (unsigned long)untagged_addr(arg->start);
>> +       vec = (struct page_region *)(unsigned long)untagged_addr(arg->vec);
>> +
>> +       if (!pagemap_scan_args_valid(arg, vec, start))
>> +               return -EINVAL;
>> +
>> +       end = start + arg->len;
>> +       p.max_pages = arg->max_pages;
>> +       p.found_pages = 0;
>> +       p.flags = arg->flags;
>> +       p.required_mask = arg->required_mask;
>> +       p.anyof_mask = arg->anyof_mask;
>> +       p.excluded_mask = arg->excluded_mask;
>> +       p.return_mask = arg->return_mask;
>> +       p.cur.len = 0;
>> +       p.vec = NULL;
>> +       p.vec_len = (PAGEMAP_WALK_SIZE >> PAGE_SHIFT);
>> +
>> +       /*
>> +        * Allocate smaller buffer to get output from inside the page walk
>> +        * functions and walk page range in PAGEMAP_WALK_SIZE size chunks. As
>> +        * we want to return output to user in compact form where no two
>> +        * consecutive regions should be continuous and have the same flags.
>> +        * So store the latest element in p.cur between different walks and
>> +        * store the p.cur at the end of the walk to the user buffer.
>> +        */
>> +       p.vec = kmalloc_array(p.vec_len, sizeof(struct page_region),
>> +                             GFP_KERNEL);
>> +       if (!p.vec)
>> +               return -ENOMEM;
>> +
>> +       walk_start = walk_end = start;
>> +       while (walk_end < end) {
>> +               p.vec_index = 0;
>> +
>> +               empty_slots = arg->vec_len - vec_index;
>> +               p.vec_len = min(p.vec_len, empty_slots);
>> +
>> +               walk_end = (walk_start + PAGEMAP_WALK_SIZE) & PAGEMAP_WALK_MASK;
>> +               if (walk_end > end)
>> +                       walk_end = end;
>> +
>> +               mmap_read_lock(mm);
>> +               ret = walk_page_range(mm, walk_start, walk_end,
>> +                                     &pagemap_scan_ops, &p);
>> +               mmap_read_unlock(mm);
>> +
>> +               if (!(!ret || ret == -ENOSPC))
> 
> if (ret && ret != -ENOSPC)
Sorry, I should have thought of this one. Thanks.

> 
>> +                       goto free_data;
>> +
>> +               walk_start = walk_end;
>> +               if (p.vec_index) {
>> +                       if (copy_to_user(&vec[vec_index], p.vec,
>> +                                        p.vec_index *
>> +                                        sizeof(struct page_region))) {
>> +                               ret = -EFAULT;
>> +                               goto free_data;
>> +                       }
>> +                       vec_index += p.vec_index;
>> +               }
>> +       }
>> +       ret = pagemap_scan_deposit(&p, vec, &vec_index);
>> +       if (!ret)
>> +               ret = vec_index;
>> +free_data:
>> +       kfree(p.vec);
>> +
>> +       return ret;
>> +}
>> +
>> +static long pagemap_scan_ioctl(struct file *file, unsigned int cmd,
>> +                              unsigned long arg)
>> +{
>> +       struct pm_scan_arg __user *uarg = (struct pm_scan_arg __user *)arg;
>> +       struct mm_struct *mm = file->private_data;
>> +       struct pm_scan_arg argument;
>> +
>> +       if (cmd == PAGEMAP_SCAN) {
> 
> switch() for easier expansion later?
I had a switch here once. I'll add it again.

> 
>> +               if (copy_from_user(&argument, uarg,
>> +                                  sizeof(struct pm_scan_arg)))
> 
> sizeof(*argument);
> 
> Could you push this to do_pagemap_cmd()? In case this file gets more
> ioctl() commands there won't be need to add more command-specific
> structures in this function.
Sure, I'll update.

> 
>> +                       return -EFAULT;
>> +               return do_pagemap_cmd(mm, &argument);
>> +       }
>> +
>> +       return -EINVAL;
>> +}
>> +
>>  const struct file_operations proc_pagemap_operations = {
>>         .llseek         = mem_lseek, /* borrow this */
>>         .read           = pagemap_read,
>>         .open           = pagemap_open,
>>         .release        = pagemap_release,
>> +       .unlocked_ioctl = pagemap_scan_ioctl,
>> +       .compat_ioctl   = pagemap_scan_ioctl,
> 
> Is this correct? Would the code need a different userspace pointer
> handling for 32-bit userspace on 64-bit kernel?
Yeah, it is needed for 32-bit applications to run on a 64-bit kernel.

> 
>>  };
>>  #endif /* CONFIG_PROC_PAGE_MONITOR */
Peter Xu March 16, 2023, 7:20 p.m. UTC | #7
Hello, Muhammad,

On Thu, Mar 09, 2023 at 06:57:12PM +0500, Muhammad Usama Anjum wrote:
> Add a new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
> faults on its own. It can be used to track which pages have been
> written to from the time the pages were write-protected. It is a very
> efficient way to track changes, as uffd is by nature pte/pmd based.
> 
> UFFD synchronous WP sends the page faults to userspace, where the pages
> which have been written to can be tracked. But it is not efficient.
> This is why this asynchronous version is being added. After setting
> WP Async, the pages which have been written to can be found in the pagemap
> file, or the information can be obtained from the PAGEMAP_IOCTL.
> 
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>

Here's the patch that can enable WP_ASYNC for all kinds of memory (as I
promised..).  Currently I have only tested btrfs (besides the common three),
which is the major fs I use locally, but I guess it'll also enable the rest
no matter what's underneath, just like soft-dirty.

As I mentioned, I just feel it very unfortunate to have a lot of suffixes
for the UFFD_FEATURE_* on types of memory, and I hope we get rid of it for
this WP_ASYNC from the start because the workflow should really be similar
to anon/shmem handling for most of the rest, just a few tweaks here and
there.

I had a feeling that some type of special VMA will work weirdly, but let's
see.. so far I don't come up with any.

If the patch looks fine to you, please consider replacing this patch with
patch 1 of mine, which I attached.  Then patch 1 can be reviewed alongside
your series.

Logically patch 1 can be reviewed separately too, because afaiu it works
perfectly even without the atomic version of pagemap.  But on my side I
don't think that justifies anything that really matters, so unless someone
thinks it a good idea to post / review / merge it separately, you can keep
it with your new pagemap ioctl.

Patch 2 is only for your reference.  It's not of merging quality, so please
don't put it into your series.  I do plan to clean up the userfaultfd
selftests in the near future (writing this made me even more eager to do
so..).  I also think your final pagemap test cases can cover quite a bit.

Thanks,
Michał Mirosław March 16, 2023, 9:28 p.m. UTC | #8
On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
>
> Hi,
>
> Thank you so much for reviewing.
>
> On 3/13/23 9:02 PM, Michał Mirosław wrote:
> > On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
[...]
> >> --- a/fs/proc/task_mmu.c
> >> +++ b/fs/proc/task_mmu.c
[...]
> >> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> >> +                              struct pagemap_scan_private *p,
> >> +                              unsigned long addr, unsigned int n_pages)
> >> +{
> >> +       unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
> >> +       struct page_region *cur = &p->cur;
> >> +       bool cpy = true;
> >> +
> >> +       if (p->max_pages && (p->found_pages == p->max_pages))
> >> +               return -ENOSPC;
> >> +
> >> +       if (!n_pages)
> >> +               return -EINVAL;
> >> +
> >> +       if (p->required_mask)
> >> +               cpy = ((p->required_mask & bitmap) == p->required_mask);
> >> +       if (cpy && p->anyof_mask)
> >> +               cpy = (p->anyof_mask & bitmap);
> >> +       if (cpy && p->excluded_mask)
> >> +               cpy = !(p->excluded_mask & bitmap);
> >
> > Since the rest of the code is executed only when `cpy` is true, this
> > could just return early for easier understanding.
> Hmm... I'll do the following:
>         if (!cpy || !bitmap)
>                 return 0;
> > BTW, some of the tests are redundant. Eg: if required_mask == 0, then
> > `required_mask & x == required_mask` will always hold. Same for
> > `excluded_mask & x == 0`.
> Correct. This is why I'm checking whether required_mask is set before
> comparing bitmap with it. required_mask may be 0 if not set; the if skips
> the subsequent comparison in that case.
>
>         if (p->required_mask)
>                 cpy = ((p->required_mask & bitmap) == p->required_mask);
>
> I don't see any redundancy here. Please let me know otherwise?
[...]
> >> +       if (cpy && bitmap) {
> >
> > Assuming early returns on `!cpy` are done earlier:
> >
> > if (!bitmap)
> >   return 0;
> I've posted a condition above which would suit better here.
[...]

Since the `cpy` condition is updated and passed to each new branch
(IOW: after setting cpy = 0 for whatever reason all the further code
is skipped) you can drop the variable and do early returns everywhere.
E.g.:

if ((bitmap & p->required_mask) != p->required_mask)
  return 0;
if (p->anyof_mask && !(bitmap & p->anyof_mask))
  return 0;
if (bitmap & p->excluded_mask)
  return 0;
if (!bitmap)
  return 0;

Also, you can rely on the fact that masking with zero always yields
zero (and, in C, false) to avoid testing for an empty mask separately
in most cases.

> Just for my knowledge, what does "Nit" signify if a comment is marked with it?

A low priority / cosmetic item that you might consider ignoring if a
fix is too expensive or controversial.

> >> +               if ((cur->len) && (cur->bitmap == bitmap) &&
> >> +                   (cur->start + cur->len * PAGE_SIZE == addr)) {
> >
> > I'd recommend removing the extra parentheses as they make the code
> > less readable for me (too many parentheses to match visually).
> I'll remove the parentheses.

[...]
>> The `cur->len` test seems redundant: is it possible to have
>> `cur->start == addr` in that case (I guess it would have to get
>> `n_pages == 0` in an earlier invocation)?
> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
> essential to check the validity from cur->len before performing other
> checks. Also cur->start can never be equal to addr as we are walking over
> page addressing in serial manner. We want to see here if the current
> address matches the previous data by finding the ending address of last
> stored data (cur->start + cur->len * PAGE_SIZE).

If cur->len == 0, then it doesn't matter if it gets merged or not - it
can be filtered out during the flush (see below).

[...]
> >> +               } else if ((!p->vec_index) ||
> >> +                          ((p->vec_index + 1) < p->vec_len)) {
> >
> > Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
> >
> > if (vec_index >= p->vec_len)
> >     return -ENOSPC;
>
> No, it'll not work. Let's leave it as it is. :)
>
> It has gotten somewhat complex, but I don't have any other way to make it
> simpler which works. First note the following points:
> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
> kernel (p->vec).
> 2) We also want to merge the consecutive pages with the same flags into one
> struct page_region. p->vec of current walk may merge with next walk. So we
> cannot write to user memory until we find the results of the next walk.
>
> So most recent data is put into p->cur. When non-intersecting or mergeable
> data is found, we move p->cur to p->vec[p->index] inside the page walk.
> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
> the walks are over. We move the p->cur to arg->vec. It completes the data
> transfer to user buffer.
[...]
> I'm so sorry that it has gotten this much complex. It was way simpler when
> we were walking over all the memory in one go. But then we needed an
> unbounded memory from the kernel which we don't want.
[...]

I've gone through and hopefully understood the code. I'm not sure this
needs to be so complicated: when traversing a single PMD you can
always copy p->cur to p->vec[p->vec_index++] because you can have at
most pages_per_PMD non-merges (in the worst case the last page always
is left in p->cur and whole p->vec is used). After each PMD p->vec
needs a flush if p->vec_index > 0, skipping the dummy entry at front
(len == 0; if present). (This is mostly how it is implemented now, but
I propose to remove the "overflow" check and do the starting guard
removal only every PMD.)
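
A minimal user-space model of the merge-and-flush scheme discussed above
(the struct and helper names are stand-ins for the patch's p->cur /
p->vec handling, not the actual kernel code):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Simplified stand-in for the uapi struct page_region. */
struct page_region {
	unsigned long start;
	unsigned long len;    /* number of pages; 0 marks the dummy guard */
	unsigned long bitmap;
};

/* Record one run of n_pages at addr with identical flags, merging it
 * into the pending entry `cur` when contiguous and equally flagged.
 * Returns the new count of retired entries in vec[]. Guard entries
 * (len == 0) are filtered out here at flush time, as suggested, rather
 * than being special-cased by every caller. */
static size_t output_range(struct page_region *cur,
			   struct page_region *vec, size_t vec_index,
			   unsigned long addr, unsigned long n_pages,
			   unsigned long bitmap)
{
	if (cur->len && cur->bitmap == bitmap &&
	    cur->start + cur->len * PAGE_SIZE == addr) {
		cur->len += n_pages;	/* extend the pending run */
		return vec_index;
	}
	/* not mergeable: retire the pending entry unless it is the guard */
	if (cur->len)
		vec[vec_index++] = *cur;
	cur->start = addr;
	cur->len = n_pages;
	cur->bitmap = bitmap;
	return vec_index;
}
```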

BTW, the pagemap_scan_deposit() got me a bit confused: it seems that
it is just a copy of the p->vec flush to userspace. Please either use
it for both p->vec and p->cur flushing, or inline it.

BTW#2, I think the ENOSPC return in pagemap_scan_output() should
happen later - only if the pages would match and that caused the count
to exceed the limit. For THP n_pages should be truncated to the limit
(and ENOSPC returned right away) only after the pages were verified to
match.

[...]
> >> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
> >> +                                 unsigned long end, struct mm_walk *walk)
> >> +{
> >> +       struct pagemap_scan_private *p = walk->private;
> >> +       struct vm_area_struct *vma = walk->vma;
> >> +       bool is_writ, is_file, is_pres, is_swap;
> >> +       unsigned long addr = end;
> >> +       spinlock_t *ptl;
> >> +       int ret = 0;
> >> +       pte_t *pte;
> >> +
> >> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >
> > Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
> > NULL if transparent hugepages are not compiled in. OTOH I see
> > BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
> > case), so that would need to be worked around first.
> I'd got the build error [1]. So I'd added these. I've tested it again with
> the same config. We don't need these #ifdef now. I'll remove these.

I mean that there are cases like [1] that actually need the #ifdef at
least to wrap HPAGE_SIZE usage. But maybe just this constant can be
wrapped so that we keep the code always compile-tested?

[1] https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/mips/include/asm/page.h#L66

[...]
> >> +       if (!arg->required_mask && !arg->anyof_mask &&
> >> +           !arg->excluded_mask)
> >> +               return false;
> >
> > Is there an assumption in the code that those checks are needed? I'd
> > expect that no selection criteria makes a valid page set?
> In my view, a selection criterion must be specified for the ioctl to work.
> If there is no criterion, the user should go and read the pagemap file
> directly. So the assumption is that at least one selection criterion must
> be specified.

Yes. I'm not sure we need to prevent multiple ways of doing the same
thing. But doesn't pagemap reading lack the range aggregation feature?

[...]
> >> +       if (!access_ok((void __user *)vec,
> >> +                      arg->vec_len * sizeof(struct page_region)))
> >> +               return false;
> >
> > Is there a provision that userspace threads are all blocked from
> > manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
> > and the writes should be checked each time as another userspace thread
> > could remap the memory while the ioctl() is working.
> mincore() syscall is doing in the same way. It checks the validity in the
> start only. What provision should I add? Isn't it obvious that the user
> should not remap such memory?

On the second look, I think the code already checks that while doing
copy_to_user(), so this check is redundant and can be removed.

> >
> >> +       if (!IS_ALIGNED(start, PAGE_SIZE))
> >> +               return false;
> >> +       if (!access_ok((void __user *)start, arg->len))
> >> +               return false;
> >
> > This I guess wants to check if the range to be scanned is mapped -
> > but isn't this what the ioctl() should do during the scan? (But, also
> > see above.)
> No, start represents the memory which the user wants to watch. User must
> allocate this memory first and then pass the address to this ioctl to find
> out the flags per page.

From:
+ * struct pm_scan_arg - Pagemap ioctl argument
+ * @size:              Size of the structure
+ * @flags:             Flags for the IOCTL
+ * @start:             Starting address of the region
+ * @len:               Length of the region (All the pages in this
length are included)
...

I'd expect the `start` field to just be a virtual address to start
scanning from. Does it need to be mapped? For CRIU usecase I'd start
with "start = 0" to find out all mappings, but 0 is (always) not
mapped. Is this supposed to only work on already discovered page
ranges? Anyway, I'd expect the code should be tolerant of another
thread changing the mappings while this ioctl() is walking the page
tables - is it so? If yes, then this check serves at most as an
optimization used only for an invalid call.

> >>  const struct file_operations proc_pagemap_operations = {
> >>         .llseek         = mem_lseek, /* borrow this */
> >>         .read           = pagemap_read,
> >>         .open           = pagemap_open,
> >>         .release        = pagemap_release,
> >> +       .unlocked_ioctl = pagemap_scan_ioctl,
> >> +       .compat_ioctl   = pagemap_scan_ioctl,
> >
> > Is this correct? Would the code need a different userspace pointer
> > handling for 32-bit userspace on 64-bit kernel?
> Yeah, it is needed for 32-bit application to run on 64-bit kernel.

I mean, is using the same function for both entry points correct? Don't
the pointers to userspace memory (e.g. arg->vec) need to be converted
for a 32-bit process?

Best Regards

Michał Mirosław
Muhammad Usama Anjum March 17, 2023, 12:43 p.m. UTC | #9
On 3/17/23 2:28 AM, Michał Mirosław wrote:
> On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>>
>> Hi,
>>
>> Thank you so much for reviewing.
>>
>> On 3/13/23 9:02 PM, Michał Mirosław wrote:
>>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
> [...]
>>>> --- a/fs/proc/task_mmu.c
>>>> +++ b/fs/proc/task_mmu.c
> [...]
>>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>>>> +                              struct pagemap_scan_private *p,
>>>> +                              unsigned long addr, unsigned int n_pages)
>>>> +{
>>>> +       unsigned long bitmap = PM_SCAN_BITMAP(wt, file, pres, swap);
>>>> +       struct page_region *cur = &p->cur;
>>>> +       bool cpy = true;
>>>> +
>>>> +       if (p->max_pages && (p->found_pages == p->max_pages))
>>>> +               return -ENOSPC;
>>>> +
>>>> +       if (!n_pages)
>>>> +               return -EINVAL;
>>>> +
>>>> +       if (p->required_mask)
>>>> +               cpy = ((p->required_mask & bitmap) == p->required_mask);
>>>> +       if (cpy && p->anyof_mask)
>>>> +               cpy = (p->anyof_mask & bitmap);
>>>> +       if (cpy && p->excluded_mask)
>>>> +               cpy = !(p->excluded_mask & bitmap);
>>>
>>> Since the rest of the code is executed only when `cpy` is true, this
>>> could just return early for easier understanding.
>> Hmm... I'll do the following:
>>         if (!cpy || !bitmap)
>>                 return 0;
>>> BTW, some of the tests are redundant. Eg: if required_mask == 0, then
>>> `required_mask & x == required_mask` will always hold. Same for
>>> `excluded_mask & x == 0`.
>> Correct. This is why I'm checking if required_mask is set and then
>> comparing bitmap with it. required_mask may be 0 if not set. This if will
>> ignore the subsequent check.
>>
>>         if (p->required_mask)
>>                 cpy = ((p->required_mask & bitmap) == p->required_mask);
>>
>> I don't see any redundancy here. Please let me know otherwise?
> [...]
>>>> +       if (cpy && bitmap) {
>>>
>>> Assuming early returns on `!cpy` are done earlier:
>>>
>>> if (!bitmap)
>>>   return 0;
>> I've posted condition above which would better suit here.
> [...]
> 
> Since the `cpy` condition is updated and passed to each new branch
> (IOW: after setting cpy = 0 for whatever reason all the further code
> is skipped) you can drop the variable and do early returns everywhere.
> E.g.:
> 
> if ((bitmap & p->required_mask) != p->required_mask)
>   return 0;
> if (p->anyof_mask && !(bitmap & p->anyof_mask))
>   return 0;
> if (bitmap & p->excluded_mask)
>   return 0;
> if (!bitmap)
>   return 0;
Clever. Will do.

> 
> Also you can take the "special" effect of masking with zero to be
> always zero (and in C - false) to avoid testing for an empty mask
> separately in most cases.
Done.

> 
>> Just for my knowledge, what does "Nit" signify if a comment is marked with it?
> 
> A low priority / cosmetic item that you might consider ignoring if a
> fix is too expensive or controversial.
> 
>>>> +               if ((cur->len) && (cur->bitmap == bitmap) &&
>>>> +                   (cur->start + cur->len * PAGE_SIZE == addr)) {
>>>
>>> I'd recommend removing the extra parentheses as they make the code
>>> less readable for me (too many parentheses to match visually).
>> I'll remove the parentheses.
> 
> [...]
>>> The `cur->len` test seems redundant: is it possible to have
>>> `cur->start == addr` in that case (I guess it would have to get
>>> `n_pages == 0` in an earlier invocation)?
>> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
>> essential to check the validity from cur->len before performing other
>> checks. Also cur->start can never be equal to addr as we are walking over
>> page addressing in serial manner. We want to see here if the current
>> address matches the previous data by finding the ending address of last
>> stored data (cur->start + cur->len * PAGE_SIZE).
> 
> If cur->len == 0, then it doesn't matter if it gets merged or not - it
> can be filtered out during the flush (see below).

> 
> [...]
>>>> +               } else if ((!p->vec_index) ||
>>>> +                          ((p->vec_index + 1) < p->vec_len)) {
>>>
>>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
>>>
>>> if (vec_index >= p->vec_len)
>>>     return -ENOSPC;
>>
>> No, it'll not work. Lets leave it as it is. :)
>>
>> It has gotten somewhat complex, but I don't have any other way to make it
>> simpler which works. First note the following points:
>> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
>> kernel (p->vec).
>> 2) We also want to merge the consecutive pages with the same flags into one
>> struct page_region. p->vec of current walk may merge with next walk. So we
>> cannot write to user memory until we find the results of the next walk.
>>
>> So most recent data is put into p->cur. When non-intersecting or mergeable
>> data is found, we move p->cur to p->vec[p->index] inside the page walk.
>> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
>> the walks are over. We move the p->cur to arg->vec. It completes the data
>> transfer to user buffer.
> [...]
>> I'm so sorry that it has gotten this much complex. It was way simpler when
>> we were walking over all the memory in one go. But then we needed an
>> unbounded memory from the kernel which we don't want.
> [...]
> 
> I've gone through and hopefully understood the code. I'm not sure this
> needs to be so complicated: when traversing a single PMD you can
> always copy p->cur to p->vec[p->vec_index++] because you can have at
> most pages_per_PMD non-merges (in the worst case the last page always
> is left in p->cur and whole p->vec is used). After each PMD p->vec
> needs a flush if p->vec_index > 0, skipping the dummy entry at front
> (len == 0; if present). (This is mostly how it is implemented now, but
> I propose to remove the "overflow" check and do the starting guard
> removal only every PMD.)
Sorry, I'm unable to understand: where should the guard be removed?

> 
> BTW, the pagemap_scan_deposit() got me a bit confused: it seems that
> it is just a copy of the p->vec flush to userspace. Please either use
> it for both p->vec and p->cur flushing or inline.
I can inline this function if you say so, now that you understand all the
logic. I don't see what else can be done here.

> 
> BTW#2, I think the ENOSPC return in pagemap_scan_output() should
> happen later - only if the pages would match and that caused the count
> to exceed the limit. For THP n_pages should be truncated to the limit
> (and ENOSPC returned right away) only after the pages were verified to
> match.
We have 2 counters here:
* p->max_pages can optionally be set to find only N pages of interest;
p->found_pages counts them. We need to return early once the page limit
is reached.
* p->vec_index keeps track of the output buffer array size.

> 
> [...]
>>>> +static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
>>>> +                                 unsigned long end, struct mm_walk *walk)
>>>> +{
>>>> +       struct pagemap_scan_private *p = walk->private;
>>>> +       struct vm_area_struct *vma = walk->vma;
>>>> +       bool is_writ, is_file, is_pres, is_swap;
>>>> +       unsigned long addr = end;
>>>> +       spinlock_t *ptl;
>>>> +       int ret = 0;
>>>> +       pte_t *pte;
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>
>>> Is the `#ifdef` needed? `pmd_trans_huge_lock()` will always return
>>> NULL if transparent hugepages are not compiled in. OTOH I see
>>> BUILD_BUG() is possible in HPAGE_SIZE definition (irrelevant in this
>>> case), so that would need to be worked around first.
>> I'd got the build error [1]. So I'd added these. I've tested it again with
>> the same config. We don't need these #ifdef now. I'll remove these.
> 
> I mean that there are cases like [1] that actually need the #ifdef at
> least to wrap HPAGE_SIZE usage. But maybe just this constant can be
> wrapped so that we keep the code always compile-tested?
Some architectures define HPAGE_SIZE even when the huge page config isn't
enabled, and some don't. Let's just keep the #ifdef
CONFIG_TRANSPARENT_HUGEPAGE, since similar code in this same file uses
this same #ifdef.

> 
> [1] https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/mips/include/asm/page.h#L66
> 
> [...]
>>>> +       if (!arg->required_mask && !arg->anyof_mask &&
>>>> +           !arg->excluded_mask)
>>>> +               return false;
>>>
>>> Is there an assumption in the code that those checks are needed? I'd
>>> expect that no selection criteria makes a valid page set?
>> In my view, selection criterion must be specified for the ioctl to work. If
>> there is no criterio, user should go and read pagemap file directly. So the
>> assumption is that at least one selection criterion must be specified.
> 
> Yes. I'm not sure we need to prevent multiple ways of doing the same
> thing. But doesn't pagemap reading lack the range aggregation feature?
Yeah, correct. But note that we support only 4 selected flags in this
ioctl, not all pagemap flags. So it is useful only for users who depend
on these 4 flags. Our pagemap_ioctl interface is not generic enough to
cater to everyone. Its interface is specific, and we are adding only
those cases which are of interest to us. So if someone wants range
aggregation from pagemap_ioctl, they'll need to add that flag to the
IOCTL first. Once IOCTL support is added, they can specify the selection
criterion etc.

> 
> [...]
>>>> +       if (!access_ok((void __user *)vec,
>>>> +                      arg->vec_len * sizeof(struct page_region)))
>>>> +               return false;
>>>
>>> Is there a provision that userspace threads are all blocked from
>>> manipulating mmaps during this ioctl()? If not, this is a TOCTOU bug
>>> and the writes should be checked each time as another userspace thread
>>> could remap the memory while the ioctl() is working.
>> mincore() syscall is doing in the same way. It checks the validity in the
>> start only. What provision should I add? Isn't it obvious that the user
>> should not remap such memory?
> 
> On the second look, I think the code already checks that while doing
> copy_to_user(), so this check is redundant and can be removed.
I'll remove.

> 
>>>
>>>> +       if (!IS_ALIGNED(start, PAGE_SIZE))
>>>> +               return false;
>>>> +       if (!access_ok((void __user *)start, arg->len))
>>>> +               return false;
>>>
>>> This I guess wants to check if the range to be scanned is mapped -
>>> but isn't this what the ioctl() should do during the scan? (But, also
>>> see above.)
>> No, start represents the memory which the user wants to watch. User must
>> allocate this memory first and then pass the address to this ioctl to find
>> out the flags per page.
> 
> From:
> + * struct pm_scan_arg - Pagemap ioctl argument
> + * @size:              Size of the structure
> + * @flags:             Flags for the IOCTL
> + * @start:             Starting address of the region
> + * @len:               Length of the region (All the pages in this
> length are included)
> ...
> 
> I'd expect the `start` field to just be a virtual address to start
> scanning from. Does it need to be mapped? For CRIU usecase I'd start
> with "start = 0" to find out all mappings, but 0 is (always) not
> mapped. Is this supposed to only work on already discovered page
> ranges? Anyway, I'd expect the code should be tolerant of another
> thread changing the mappings while this ioctl() is walking the page
> tables - is it so? If yes, then this check serves at most as an
> optimization used only for an invalid call.
Ohh... Ignore my previous comment. Yeah, any valid memory range can be
passed to view the page flags. This check just verifies that the memory
range is valid.

> 
>>>>  const struct file_operations proc_pagemap_operations = {
>>>>         .llseek         = mem_lseek, /* borrow this */
>>>>         .read           = pagemap_read,
>>>>         .open           = pagemap_open,
>>>>         .release        = pagemap_release,
>>>> +       .unlocked_ioctl = pagemap_scan_ioctl,
>>>> +       .compat_ioctl   = pagemap_scan_ioctl,
>>>
>>> Is this correct? Would the code need a different userspace pointer
>>> handling for 32-bit userspace on 64-bit kernel?
>> Yeah, it is needed for 32-bit application to run on 64-bit kernel.
> 
> I mean is using the same function for both entry points correct? Don't
> the pointers to userspace memory (e.g. arg->vec) need to be mapped for
> 32-bit process?
No, every member in our argument structure is 64-bit, which keeps the
memory layout the same. So we don't need any specific conversion here.
(Even if we had a 32-bit member, we would just need to make sure that
the layout remains the same in memory.)
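
For illustration, a user-space mirror of such an all-64-bit layout (the
field list here is inferred from the fields discussed in this thread and
may not match the patch exactly):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Approximate model of struct pm_scan_arg: using only 64-bit fields
 * (with user pointers carried as u64 rather than as native pointers)
 * gives identical size and field offsets for 32-bit and 64-bit
 * userspace, so one ioctl handler can serve both .unlocked_ioctl and
 * .compat_ioctl without any compat translation. */
struct pm_scan_arg_model {
	uint64_t size;
	uint64_t flags;
	uint64_t start;
	uint64_t len;
	uint64_t vec;		/* user pointer carried as u64 */
	uint64_t vec_len;
	uint64_t max_pages;
	uint64_t required_mask;
	uint64_t anyof_mask;
	uint64_t excluded_mask;
	uint64_t return_mask;
};

_Static_assert(sizeof(struct pm_scan_arg_model) == 11 * sizeof(uint64_t),
	       "no padding: layout is identical across ABIs");
```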

Thanks,
Usama

> 
> Best Regards
> 
> Michał Mirosław
Michał Mirosław March 17, 2023, 2:15 p.m. UTC | #10
On Fri, 17 Mar 2023 at 13:44, Muhammad Usama Anjum
<usama.anjum@collabora.com> wrote:
> On 3/17/23 2:28 AM, Michał Mirosław wrote:
> > On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
> > <usama.anjum@collabora.com> wrote:
> >> On 3/13/23 9:02 PM, Michał Mirosław wrote:
> >>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
> >>> <usama.anjum@collabora.com> wrote:
> > [...]
> >>>> --- a/fs/proc/task_mmu.c
> >>>> +++ b/fs/proc/task_mmu.c
> > [...]
> >>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
> > [...]
> >>> The `cur->len` test seems redundant: is it possible to have
> >>> `cur->start == addr` in that case (I guess it would have to get
> >>> `n_pages == 0` in an earlier invocation)?
> >> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
> >> essential to check the validity from cur->len before performing other
> >> checks. Also cur->start can never be equal to addr as we are walking over
> >> page addressing in serial manner. We want to see here if the current
> >> address matches the previous data by finding the ending address of last
> >> stored data (cur->start + cur->len * PAGE_SIZE).
> >
> > If cur->len == 0, then it doesn't matter if it gets merged or not - it
> > can be filtered out during the flush (see below).
> > [...]
> >>>> +               } else if ((!p->vec_index) ||
> >>>> +                          ((p->vec_index + 1) < p->vec_len)) {
> >>>
> >>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
> >>>
> >>> if (vec_index >= p->vec_len)
> >>>     return -ENOSPC;
> >>
> >> No, it'll not work. Lets leave it as it is. :)
> >>
> >> It has gotten somewhat complex, but I don't have any other way to make it
> >> simpler which works. First note the following points:
> >> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
> >> kernel (p->vec).
> >> 2) We also want to merge the consecutive pages with the same flags into one
> >> struct page_region. p->vec of current walk may merge with next walk. So we
> >> cannot write to user memory until we find the results of the next walk.
> >>
> >> So most recent data is put into p->cur. When non-intersecting or mergeable
> >> data is found, we move p->cur to p->vec[p->index] inside the page walk.
> >> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
> >> the walks are over. We move the p->cur to arg->vec. It completes the data
> >> transfer to user buffer.
> > [...]
> >> I'm so sorry that it has gotten this much complex. It was way simpler when
> >> we were walking over all the memory in one go. But then we needed an
> >> unbounded memory from the kernel which we don't want.
> > [...]
> >
> > I've gone through and hopefully understood the code. I'm not sure this
> > needs to be so complicated: when traversing a single PMD you can
> > always copy p->cur to p->vec[p->vec_index++] because you can have at
> > most pages_per_PMD non-merges (in the worst case the last page always
> > is left in p->cur and whole p->vec is used). After each PMD p->vec
> > needs a flush if p->vec_index > 0, skipping the dummy entry at front
> > (len == 0; if present). (This is mostly how it is implemented now, but
> > I propose to remove the "overflow" check and do the starting guard
> > removal only every PMD.)
> Sorry, unable to understand where to remove the guard?

Instead of checking for it in pagemap_scan_output() for each page you
can skip it in do_pagemap_cmd() when doing the flush.

> > BTW#2, I think the ENOSPC return in pagemap_scan_output() should
> > happen later - only if the pages would match and that caused the count
> > to exceed the limit. For THP n_pages should be truncated to the limit
> > (and ENOSPC returned right away) only after the pages were verified to
> > match.
> We have 2 counters here:
> * the p->max_pages optionally can be set to find out only N pages of
> interest. So p->found_pages is counting this. We need to return early if
> the page limit is complete.
> * the p->vec_index keeps track of output buffer array size

I think I get how the limits are supposed to work, but I also think
the implementation is not optimal. An example (assuming max_pages = 1
and vec_len = 1):
 - a matching page has been found
 - a second, non-matching, page is tried but results in an immediate -ENOSPC.
-> In this case I'd expect the early return to happen just after the
first page is found, so that the non-matching page is never examined.
A similar problem occurs for hugepages: when the limit is hit (we found
>= max_pages, n_pages is possibly truncated), the scan still continues
until the next page / PMD.
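
A sketch of the ordering being suggested: only pages that actually match
count against max_pages, and a (THP) run is truncated to the remaining
budget, with the "stop" condition reported only after the matching pages
were accounted. All names here are illustrative, not the patch's:

```c
#include <assert.h>

#define SCAN_OK      0
#define SCAN_ENOSPC  (-1)

struct scan_limit {
	unsigned long max_pages;	/* 0 means unlimited */
	unsigned long found_pages;
};

/* Account a run of matching pages. Truncates *n_pages to the remaining
 * budget and returns SCAN_ENOSPC when the limit is reached, so the
 * caller can still output the truncated run and then abort the walk. */
static int account_matching(struct scan_limit *s, unsigned long *n_pages)
{
	if (!s->max_pages) {
		s->found_pages += *n_pages;
		return SCAN_OK;
	}
	unsigned long left = s->max_pages - s->found_pages;
	if (*n_pages >= left) {
		*n_pages = left;		/* truncate a THP run */
		s->found_pages = s->max_pages;
		return SCAN_ENOSPC;		/* stop right after output */
	}
	s->found_pages += *n_pages;
	return SCAN_OK;
}
```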

[...]
> >>>> +       if (!arg->required_mask && !arg->anyof_mask &&
> >>>> +           !arg->excluded_mask)
> >>>> +               return false;
> >>>
> >>> Is there an assumption in the code that those checks are needed? I'd
> >>> expect that no selection criteria makes a valid page set?
> >> In my view, selection criterion must be specified for the ioctl to work. If
> >> there is no criterio, user should go and read pagemap file directly. So the
> >> assumption is that at least one selection criterion must be specified.
> >
> > Yes. I'm not sure we need to prevent multiple ways of doing the same
> > thing. But doesn't pagemap reading lack the range aggregation feature?
> Yeah, correct. But note that we are supporting only selective 4 flags in
> this ioctl, not all pagemap flags. So it is useful for only those users who
> depend only on these 4 flags. Out pagemap_ioctl interface is not so much
> generic that we can cater anyone. Its interface is specific and we are
> adding only those cases which are of our interest. So if someone wants
> range aggregation from pagemap_ioctl, he'll need to add that flag in the
> IOCTL first. When IOCTL support is added, he can specify the selection
> criterion etc.

The available flag set is not a problem. An example usecase: dumping
the memory state for debugging: ioctl(return_mask=ALL) returns a
conveniently compact vector of ranges of pages that are actually used
by the process (not only having reserved the virtual space). This
actually helps when dumping processes that use tools like
AddressSanitizer, which create huge sparse mappings.

Best Regards
Michał Mirosław
Muhammad Usama Anjum March 20, 2023, 6:08 a.m. UTC | #11
On 3/17/23 7:15 PM, Michał Mirosław wrote:
> On Fri, 17 Mar 2023 at 13:44, Muhammad Usama Anjum
> <usama.anjum@collabora.com> wrote:
>> On 3/17/23 2:28 AM, Michał Mirosław wrote:
>>> On Thu, 16 Mar 2023 at 18:53, Muhammad Usama Anjum
>>> <usama.anjum@collabora.com> wrote:
>>>> On 3/13/23 9:02 PM, Michał Mirosław wrote:
>>>>> On Thu, 9 Mar 2023 at 14:58, Muhammad Usama Anjum
>>>>> <usama.anjum@collabora.com> wrote:
>>> [...]
>>>>>> --- a/fs/proc/task_mmu.c
>>>>>> +++ b/fs/proc/task_mmu.c
>>> [...]
>>>>>> +static int pagemap_scan_output(bool wt, bool file, bool pres, bool swap,
>>> [...]
>>>>> The `cur->len` test seems redundant: is it possible to have
>>>>> `cur->start == addr` in that case (I guess it would have to get
>>>>> `n_pages == 0` in an earlier invocation)?
>>>> No, both wouldn't work. cur->len == 0 means that it has only garbage. It is
>>>> essential to check the validity from cur->len before performing other
>>>> checks. Also cur->start can never be equal to addr as we are walking over
>>>> page addressing in serial manner. We want to see here if the current
>>>> address matches the previous data by finding the ending address of last
>>>> stored data (cur->start + cur->len * PAGE_SIZE).
>>>
>>> If cur->len == 0, then it doesn't matter if it gets merged or not - it
>>> can be filtered out during the flush (see below).
>>> [...]
>>>>>> +               } else if ((!p->vec_index) ||
>>>>>> +                          ((p->vec_index + 1) < p->vec_len)) {
>>>>>
>>>>> Can you explain this test? Why not just `p->vec_index < p->vec_len`? Or better:
>>>>>
>>>>> if (vec_index >= p->vec_len)
>>>>>     return -ENOSPC;
>>>>
>>>> No, it'll not work. Lets leave it as it is. :)
>>>>
>>>> It has gotten somewhat complex, but I don't have any other way to make it
>>>> simpler which works. First note the following points:
>>>> 1) We walk over 512 page or 1 thp at a time to not over allocate memory in
>>>> kernel (p->vec).
>>>> 2) We also want to merge the consecutive pages with the same flags into one
>>>> struct page_region. p->vec of current walk may merge with next walk. So we
>>>> cannot write to user memory until we find the results of the next walk.
>>>>
>>>> So most recent data is put into p->cur. When non-intersecting or mergeable
>>>> data is found, we move p->cur to p->vec[p->index] inside the page walk.
>>>> After the page walk, p->vec[0 to p->index] is moved to arg->vec. After all
>>>> the walks are over. We move the p->cur to arg->vec. It completes the data
>>>> transfer to user buffer.
>>> [...]
>>>> I'm so sorry that it has gotten this much complex. It was way simpler when
>>>> we were walking over all the memory in one go. But then we needed an
>>>> unbounded memory from the kernel which we don't want.
>>> [...]
>>>
>>> I've gone through and hopefully understood the code. I'm not sure this
>>> needs to be so complicated: when traversing a single PMD you can
>>> always copy p->cur to p->vec[p->vec_index++] because you can have at
>>> most pages_per_PMD non-merges (in the worst case the last page always
>>> is left in p->cur and whole p->vec is used). After each PMD p->vec
>>> needs a flush if p->vec_index > 0, skipping the dummy entry at front
>>> (len == 0; if present). (This is mostly how it is implemented now, but
>>> I propose to remove the "overflow" check and do the starting guard
>>> removal only every PMD.)
>> Sorry, unable to understand where to remove the guard?
> 
> Instead of checking for it in pagemap_scan_output() for each page you
> can skip it in do_pagemap_cmd() when doing the flush.
No, this cannot be done, because in do_pagemap_cmd() we don't know whether
we have space for new pages in the output buffer: the next page may still
be aggregated into the already-present data.

> 
>>> BTW#2, I think the ENOSPC return in pagemap_scan_output() should
>>> happen later - only if the pages would match and that caused the count
>>> to exceed the limit. For THP n_pages should be truncated to the limit
>>> (and ENOSPC returned right away) only after the pages were verified to
>>> match.
>> We have 2 counters here:
>> * p->max_pages can optionally be set to find only the first N pages of
>> interest; p->found_pages counts these. We need to return early once
>> this page limit is reached.
>> * p->vec_index keeps track of the output buffer array size
> 
> I think I get how the limits are supposed to work, but I also think
> the implementation is not optimal. An example (assuming max_pages = 1
> and vec_len = 1):
>  - a matching page has been found
>  - a second - non-matching - is tried but results in immediate -ENOSPC.
> -> In this case I'd expect the early return to happen just after the
> first page is found so that non
> A similar problem occurs for hugepage: when the limit is hit (found
> >= max_pages, so n_pages is possibly truncated), the scan still
> continues until the next page / PMD.
I'll check whether I can optimize it. It seems I should be able to update
this fairly easily by returning a negative status/error from the walk which
signifies that we have found max_pages, and then abort in a sane way.

> 
> [...]
>>>>>> +       if (!arg->required_mask && !arg->anyof_mask &&
>>>>>> +           !arg->excluded_mask)
>>>>>> +               return false;
>>>>>
>>>>> Is there an assumption in the code that those checks are needed? I'd
>>>>> expect that no selection criteria makes a valid page set?
>>>> In my view, a selection criterion must be specified for the ioctl to
>>>> work. If there is no criterion, the user should read the pagemap file
>>>> directly. So the assumption is that at least one selection criterion
>>>> must be specified.
>>>
>>> Yes. I'm not sure we need to prevent multiple ways of doing the same
>>> thing. But doesn't pagemap reading lack the range aggregation feature?
>> Yeah, correct. But note that we support only 4 selected flags in this
>> ioctl, not all pagemap flags. So it is useful only for users who depend
>> solely on these 4 flags. Our pagemap_ioctl interface is not generic
>> enough to cater to everyone. Its interface is specific and we are adding
>> only the cases which are of interest to us. So if someone wants range
>> aggregation from pagemap_ioctl, he'll need to add that flag to the
>> IOCTL first. Once IOCTL support is added, he can specify the selection
>> criterion etc.
> 
> The available flag set is not a problem. An example usecase: dumping
> the memory state for debugging: ioctl(return_mask=ALL) returns a
> conveniently compact vector of ranges of pages that are actually used
> by the process (not only having reserved the virtual space). This is
> actually something that helps dumping processes with using tools like
> AddressSanitizer that create huge sparse mappings.
I don't know; we are adding more and more use cases as people notice this
interface. I've not thought about this use case, so I need to understand
it better:
How should I identify "which pages are used"? Does "used" mean both
present and swapped? If we want to find present or swapped pages (in other
words, !pte_none pages) and return them in compact form, that can already
be done by ioctl(anyof_mask=PRESENT | SWAPPED, return_mask=ALL).

> 
> Best Regards
> Michał Mirosław
Muhammad Usama Anjum March 21, 2023, 12:21 p.m. UTC | #12
On 3/17/23 12:20 AM, Peter Xu wrote:
> Hello, Muhammad,
> 
> On Thu, Mar 09, 2023 at 06:57:12PM +0500, Muhammad Usama Anjum wrote:
>> Add new WP Async mode (UFFD_FEATURE_WP_ASYNC) which resolves the page
>> faults on its own. It can be used to track which pages have been
>> written-to since the pages were write-protected. It is a very efficient
>> way to track changes, as uffd is by nature pte/pmd based.
>>
>> UFFD synchronous WP sends the page faults to userspace, where the pages
>> which have been written-to can be tracked. But it is not efficient.
>> This is why this asynchronous version is being added. After setting
>> WP Async, the pages which have been written-to can be found in the
>> pagemap file, or the information can be obtained via the PAGEMAP_IOCTL.
>>
>> Suggested-by: Peter Xu <peterx@redhat.com>
>> Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com>
> 
> Here's the patch that can enable WP_ASYNC for all kinds of memory (as I
> promised..).  Currently I have only tested btrfs (besides the common
> three), which is the major fs I use locally, but I guess it'll also
> enable the rest no matter what's underneath, just like soft-dirty.
> 
> As I mentioned, I find it very unfortunate to have a lot of
> per-memory-type suffixes for UFFD_FEATURE_*, and I hope we avoid that
> for WP_ASYNC from the start, because the workflow should really be
> similar to anon/shmem handling for most of the rest, with just a few
> tweaks here and there.
> 
> I had a feeling that some types of special VMA would work weirdly, but
> let's see.. so far I haven't come up with any.
> 
> If the patch looks fine to you, please consider replacing this patch
> with patch 1 of mine, which I attached.  Then patch 1 can be reviewed
> alongside your series.
> 
> Logically patch 1 can be reviewed separately too, because afaiu it works
> perfectly without the atomic version of pagemap.  But I don't think that
> really matters, so unless someone thinks it a good idea to post / review
> / merge it separately, you can keep it with your new pagemap ioctl.
> 
> Patch 2 is only for your reference.  It's not of merging quality, so
> please don't put it into your series.  I do plan to clean up the
> userfaultfd selftests in the near future (when I wrote this I was more
> eager to do so..).  I also think your final pagemap test cases can cover
> quite a bit.
> 
> Thanks,
Thank you so much for the patch. I've tested hugetlb memory. This patch
works fine for hugetlb shmem:
*shmid = shmget(2, size, SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W);
mem = shmat(*shmid, 0, 0);

I've found a slight issue with hugetlb memory which has been mmapped:
mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
	   MAP_ANONYMOUS | MAP_HUGETLB | MAP_PRIVATE, -1, 0);
The issue is that even after writing to this memory, the wp flag is still
present and the memory doesn't appear to be dirty when it should have
been. The temporary fix is to write to the memory and write-protect it
one extra time.

Here is how I'm checking if WP flag is set or not:
static inline bool is_huge_pte_uffd_wp(pte_t pte)
{
	return ((pte_present(pte) && huge_pte_uffd_wp(pte)) ||
		pte_swp_uffd_wp_any(pte));
}

I've isolated the reproducer inside kselftests by commenting out the
unrelated code. Please have a look at the attached kselftest and follow
from main() or search for `//#define tmpfix` in the code.