
[v1,06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb)

Message ID 20211217113049.23850-7-david@redhat.com
State New
Series mm: COW fixes part 1: fix the COW security issue for THP and hugetlb

Commit Message

David Hildenbrand Dec. 17, 2021, 11:30 a.m. UTC
FAULT_FLAG_UNSHARE is a new type of page fault applicable to COW-able
anonymous memory (including hugetlb but excluding KSM) and its purpose is
to allow for unsharing of shared anonymous pages on selected GUP *read*
access, in comparison to the traditional COW on *write* access.

In contrast to a COW, GUP-triggered unsharing will still maintain the write
protection. It will be triggered by GUP to properly prevent a child process
from finding ways via GUP to observe memory modifications of anonymous
memory of the parent process after fork().

Rename the relevant functions to make it clear whether we're dealing
with unsharing, COW, or both.

The hugetlb part will be added separately.

This commit is based on prototype patches by Andrea.

Co-developed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/mm.h |   4 ++
 mm/memory.c        | 136 ++++++++++++++++++++++++++++++++++++++-------
 2 files changed, 119 insertions(+), 21 deletions(-)
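
A rough sketch of the unshare path, pieced together from the hunks quoted
in the discussion below (not the literal patch; reusing wp_page_copy() as
the copy routine is an assumption here):

    if (vmf->flags & FAULT_FLAG_UNSHARE) {
            vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
            if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
                page_mapcount(vmf->page) > 1) {
                    /* Shared anon page: replace it with an exclusive copy,
                     * but keep the PTE write-protected (unlike COW). */
                    return wp_page_copy(vmf);
            }
            /* Already exclusive (or not anon): nothing to unshare. */
            return 0;
    }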

Comments

Linus Torvalds Dec. 17, 2021, 7:04 p.m. UTC | #1
On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>
> + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
> + * set) and then unmaps the target page, we have:
> + *
> + * * page has mapcount == 1 and refcount > 1

All these games with mapcount make me think this is still broken.

mapcount has been a horribly broken thing in the past, and I'm not
convinced it's not a broken thing now.

> +       vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
> +       if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
> +           page_mapcount(vmf->page) > 1) {

What keeps the mapcount stable in here?

And I still believe that the whole notion that "COW should use
mapcount" is pure and utter garbage.

If we are doing a COW, we need an *exclusive* access to the page. That
is not mapcount, that is the page ref.

mapcount is insane, and I think this is making this worse again.

                Linus
Linus Torvalds Dec. 17, 2021, 7:22 p.m. UTC | #2
On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> If we are doing a COW, we need an *exclusive* access to the page. That
> is not mapcount, that is the page ref.
>
> mapcount is insane, and I think this is making this worse again.

Maybe I'm misreading this, but afaik

 - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.

   This just increments the page count, because mapcount == 1.

 - fork()

 - unmap in the original

 - child now has "mapcount == 1" on a page again, but refcount is
elevated, and child HAS TO COW before writing.

Notice? "mapcount" is complete BS. The number of times a page is
mapped is irrelevant for COW. All that matters is that we get an
exclusive access to the page before we can write to it.

Anybody who takes mapcount into account at COW time is broken, and it
worries me how this is all mixing up with the COW logic.

Now, maybe this "unshare" case is sufficiently different from COW that
it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it
doesn't happen for a real COW.

But honestly, for "unshare", I still don't see that the mapcount
matters. What does "mapcount == 1" mean? Why is it meaningful?

Because if COW does things right, and always breaks a COW based on
refcount, then what's the problem with taking a read-only ref to the
page whether it is mapped multiple times or mapped just once? Anybody
who already had write access to the page can write to it regardless,
and any new writers go through COW and get a new page.

I must be missing something really fundamental here, but to me it
really reads like "mapcount can fundamentally never be relevant for
COW, and if it's not relevant for COW, how can it be relevant for a
read-only copy?"

             Linus
David Hildenbrand Dec. 17, 2021, 8:17 p.m. UTC | #3
On 17.12.21 20:22, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> If we are doing a COW, we need an *exclusive* access to the page. That
>> is not mapcount, that is the page ref.
>>
>> mapcount is insane, and I think this is making this worse again.
> 
> Maybe I'm misreading this, but afaik
> 
>  - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
> 
>    This just increments the page count, because mapcount == 1.
> 
>  - fork()
> 
>  - unmap in the original
> 
>  - child now has "mapcount == 1" on a page again, but refcount is
> elevated, and child HAS TO COW before writing.

Hi Linus,

This is just GUP before fork(), which is in general
problematic/incompatible with sharing. What we're concerned about in the
context of this series (see the security issue) is GUP after fork(). And
we're not changing GUP before fork() or even the COW logic in the
context of this series.

I agree that GUP before fork() has to be handled differently during
fork(): don't share something that cannot possibly be shared in a safe
way. Don't allow COW semantics for something that is just broken with COW.

> 
> Notice? "mapcount" is complete BS. The number of times a page is
> mapped is irrelevant for COW. All that matters is that we get an
> exclusive access to the page before we can write to it.

We have to be very careful about the two sides of the story: GUP before
fork and GUP after fork.

> 
> Anybody who takes mapcount into account at COW time is broken, and it
> worries me how this is all mixing up with the COW logic.
> 
> Now, maybe this "unshare" case is sufficiently different from COW that
> it's ok to look at mapcount for FAULT_FLAG_UNSHARE, as long as it
> doesn't happen for a real COW.
> 
> But honestly, for "unshare", I still don't see that the mapcount
> matters. What does "mapcount == 1" mean? Why is it meaningful?

I'll reply to your first mail in a sec.

GUP is the problem with COW, not ordinary processes mapping a page
(mapcount), where you will only get new sharers during fork() -- in a
very controlled way. So GUP has to take care to unshare *before* taking
a reference, such that we can never reach the point of missed COW. GUP
really is the problematic bit with it all.

Without GUP, we'd be living in a wonderful world in regards to COW.

> 
> Because if COW does things right, and always breaks a COW based on
> refcount, then what's the problem with taking a read-only ref to the
> page whether it is mapped multiple times or mapped just once? Anybody
> who already had write access to the page can write to it regardless,
> and any new writers go through COW and get a new page.

Let's just take a look at what refcount does *wrong*. Let's use an
adjusted version of your example above, because it's a perfect fit:

1. mem = mmap(pagesize, MAP_PRIVATE)
-> refcount == 1

2. memset(mem, 0, pagesize); /* Page is mapped R/W */

3. fork() /* Page gets mapped R/O */
-> refcount > 1

4. child quits
-> refcount == 1

5. Take a R/O pin (RDMA, VFIO, ...)
-> refcount > 1

6. memset(mem, 0xff, pagesize);
-> Write fault -> COW

And GUP sees something different from our MM -- and this is perfectly
valid, the R/O pin is just reading page content we might be modifying
afterwards. Take out 3. and 4. and it works as expected. This wouldn't
happen when relying on the mapcount.

And 3+4 can really be anything that results in a R/O mapping of an
anonymous page, even if it's just swapout followed by a read fault that
maps the page R/O.
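
A minimal userspace sketch of steps 1-6 above, using vmsplice() as a
stand-in for the R/O pin (an approximation: a real RDMA/VFIO pin goes
through pin_user_pages(), but the refcount effect at step 5 is the same):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
        size_t pagesize = getpagesize();
        char *mem = mmap(NULL, pagesize, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); /* 1. refcount == 1 */
        struct iovec iov = { .iov_base = mem, .iov_len = pagesize };
        unsigned char buf;
        int fds[2];

        memset(mem, 0, pagesize);        /* 2. page is mapped R/W */
        if (fork() == 0)                 /* 3. page gets mapped R/O */
                _exit(0);
        wait(NULL);                      /* 4. refcount drops back to 1 */

        pipe(fds);
        vmsplice(fds[1], &iov, 1, 0);    /* 5. R/O reference: refcount > 1 */
        memset(mem, 0xff, pagesize);     /* 6. write fault */

        /* With refcount-based reuse, step 6 COWs and the pipe still reads
         * the old zeroes; with mapcount-based reuse it would read 0xff. */
        read(fds[0], &buf, 1);
        printf("pipe sees %#x, memory has %#x\n", buf, (unsigned char)mem[0]);
        return 0;
}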

> 
> I must be missing something really fundamental here, but to me it
> really reads like "mapcount can fundamentally never be relevant for
> COW, and if it's not relevant for COW, how can it be relevant for a
> read-only copy?"

It really is the right value to use. Only GUP is the problematic bit
that has to trigger unsharing to not mess up COW logic later. Take GUP
out of the equation and COW just works as expected with the mapcount --
 as long as we can read an atomic value and synchronize against fork.
(again, still composing the other mail :) )
Linus Torvalds Dec. 17, 2021, 8:36 p.m. UTC | #4
On Fri, Dec 17, 2021 at 12:18 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 17.12.21 20:22, Linus Torvalds wrote:
> > On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
> > <torvalds@linux-foundation.org> wrote:
> >>
> >  - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
> >
> >    This just increments the page count, because mapcount == 1.
> >
> >  - fork()
> >
> >  - unmap in the original
> >
> >  - child now has "mapcount == 1" on a page again, but refcount is
> > elevated, and child HAS TO COW before writing.
>
> Hi Linus,
>
> This is just GUP before fork(), which is in general
> problematic/incompatible with sharing.

Note that my example was not meant to be an example of a problem per
se, but purely as an example of how meaningless 'mapcount' is, and how
'mapcount==1' isn't really a very meaningful test.

So it wasn't meant to show "look, GUP before fork is problematic". We
have that problem already solved at least for regular pages.

It was purely meant to show how "mapcount==1" isn't a meaningful thing
to test, and my worry about how you're adding that nonsensical test to
the new code.

> Let's just take a look at what refcount does *wrong*. Let's use an
> adjusted version of your example above, because it's a perfect fit:
>
> 1. mem = mmap(pagesize, MAP_PRIVATE)
> -> refcount == 1
>
> 2. memset(mem, 0, pagesize); /* Page is mapped R/W */
>
> 3. fork() /* Page gets mapped R/O */
> -> refcount > 1
>
> 4. child quits
> -> refcount == 1
>
> 5. Take a R/O pin (RDMA, VFIO, ...)
> -> refcount > 1
>
> 6. memset(mem, 0xff, pagesize);
> -> Write fault -> COW

I do not believe this is actually a bug.

You asked for a R/O pin, and you got one.

Then somebody else modified that page, and you got exactly what you
asked for - a COW event. The original R/O pin has the original page
that it asked for, and can read it just fine.

So what is your argument?

              Linus
Linus Torvalds Dec. 17, 2021, 8:39 p.m. UTC | #5
On Fri, Dec 17, 2021 at 12:36 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> >
> > 5. Take a R/O pin (RDMA, VFIO, ...)
> > -> refcount > 1
> >
> > 6. memset(mem, 0xff, pagesize);
> > -> Write fault -> COW
>
> I do not believe this is actually a bug.
>
> You asked for a R/O pin, and you got one.

If you want a shared pin that actually follows the changes of your
process around, then that is what you should have asked for.

At the time of such a shared pin, you can do what we already do:
re-use the page if it has a refcount of 1. Or do an early COW event
(feel free to avoid the "mark it writable and dirty").

But note: *refcount* of 1. Not "mapcount". Because mapcount would be
broken garbage.

                Linus
David Hildenbrand Dec. 17, 2021, 8:42 p.m. UTC | #6
On 17.12.21 21:36, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:18 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 17.12.21 20:22, Linus Torvalds wrote:
>>> On Fri, Dec 17, 2021 at 11:04 AM Linus Torvalds
>>> <torvalds@linux-foundation.org> wrote:
>>>>
>>>  - get a "readonly" copy of a local private page using FAULT_FLAG_UNSHARE.
>>>
>>>    This just increments the page count, because mapcount == 1.
>>>
>>>  - fork()
>>>
>>>  - unmap in the original
>>>
>>>  - child now has "mapcount == 1" on a page again, but refcount is
>>> elevated, and child HAS TO COW before writing.
>>
>> Hi Linus,
>>
>> This is just GUP before fork(), which is in general
>> problematic/incompatible with sharing.
> 
> Note that my example was not meant to be an example of a problem per
> se, but purely as an example of how meaningless 'mapcount' is, and how
> 'mapcount==1' isn't really a very meaningful test.
> 
> So it wasn't meant to show "look, GUP before fork is problematic". We
> have that problem already solved at least for regular pages.
> 
> It was purely meant to show how "mapcount==1" isn't a meaningful thing
> to test, and my worry about how you're adding that nonsensical test to
> the new code.
> 
>> Let's just take a look at what refcount does *wrong*. Let's use an
>> adjusted version of your example above, because it's a perfect fit:
>>
>> 1. mem = mmap(pagesize, MAP_PRIVATE)
>> -> refcount == 1
>>
>> 2. memset(mem, 0, pagesize); /* Page is mapped R/W */
>>
>> 3. fork() /* Page gets mapped R/O */
>> -> refcount > 1
>>
>> 4. child quits
>> -> refcount == 1
>>
>> 5. Take a R/O pin (RDMA, VFIO, ...)
>> -> refcount > 1
>>
>> 6. memset(mem, 0xff, pagesize);
>> -> Write fault -> COW
> 
> I do not believe this is actually a bug.

It's debatable if it's a BUG or not (I think it is one). It's for sure
inconsistent.

> 
> You asked for a R/O pin, and you got one.
> 
> Then somebody else modified that page, and you got exactly what you
> asked for - a COW event. The original R/O pin has the original page
> that it asked for, and can read it just fine.

Where in the code did I ask for a COW event? I asked for a R/O pin, not
any kind of memory protection.
Linus Torvalds Dec. 17, 2021, 8:43 p.m. UTC | #7
On Fri, Dec 17, 2021 at 12:39 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> At the time of such a shared pin, you can do what we already do:
> re-use the page if it has a refcount of 1. Or do an early COW event
> (feel free to avoid the "mark it writable and dirty").

Note that this also depends on fork() doing the right thing, marking
things for "a fork() can not share this page any more".

Which it does for regular pages, and is exactly what that
page_needs_cow_for_dma() logic is all about (and the special
write_protect_seq around gup/fork).

I do believe that huge-pages don't do it right. But I think that as
you try to fix hugepages, you are now breaking the normal case.

If all your logic was only about hugepages, I wouldn't care so much.
But you are playing questionable games with code that I think is
correct.

Please explain why.

              Linus
David Hildenbrand Dec. 17, 2021, 8:45 p.m. UTC | #8
On 17.12.21 20:04, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 3:34 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> + * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
>> + * set) and then unmaps the target page, we have:
>> + *
>> + * * page has mapcount == 1 and refcount > 1
> 

Hi Linus,

> All these games with mapcount makes me think this is still broken.
> 
> mapcount has been a horribly broken thing in the past, and I'm not
> convinced it's not a broken thing now.

It all started when Jann detected the security issue in GUP -- and this
patch set is fixing the problem exactly there, in GUP itself. Are you
aware of COW issues regarding the mapcount if we would remove GUP from
the equation? My point is that COW without GUP works just perfectly
fine, but I'll be happy to learn about other cases I was ignoring so far.

Unfortunately, page_count() is even more unreliable, and the issues
we're just detecting (see the link in the cover letter: memory
corruptions inside user space -- e.g., lost DMA writes) are even worse
than what we had before -- IMHO of course.

> 
>> +       vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
>> +       if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
>> +           page_mapcount(vmf->page) > 1) {
> 
> What keeps the mapcount stable in here?

So, we're reading an atomic value here. It's read via atomic_read for
regular pages, and the THP mapcount case has also been made atomic (as
lockless as page_count) in patch #5.

If a page is mapped exactly once, page_mapcount(page) == 1 and there is
nothing to do.

If the page is mapped more than once, page_mapcount(page) > 1 and we
would have to trigger unsharing. And it's true that the value is
unstable in this case, but we really only care about
page_mapcount(page) > 1 vs. page_mapcount(page) == 1. In this respect,
there is no difference from the instability of the page_count and the
mapcount -- we still only care if it's > 1 or == 1.

So the only case we could care about is concurrent additional mappings
that can increment the mapcount -- which can only happen due to
concurrent fork. So if we're reading page_mapcount(page) == 1 the only
way we can get page_mapcount(page) > 1 is due to fork(). But we're
holding the mmap_lock in read mode during faults and fork requires the
mmap_lock in write mode.
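
For reference, the read in question (include/linux/mm.h, roughly as of
this series; the compound case is what patch #5 makes lockless):

static inline int page_mapcount(struct page *page)
{
        if (unlikely(PageCompound(page)))
                return __page_mapcount(page);   /* THP: reworked in patch #5 */
        return atomic_read(&page->_mapcount) + 1;
}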


> 
> And I still believe that the whole notion that "COW should use
> mapcount" is pure and utter garbage.
> 
> If we are doing a COW, we need an *exclusive* access to the page. That
> is not mapcount, that is the page ref.

I thought about this a lot, because initially I had the same opinion.

But really, we don't care about any speculative references (pagecache,
migration, daemon, pagevec, ...) or any short-term "I just want to grab
this reference real quick such that the page can't get freed" references.

All we care about are GUP references, and we attack that problem at the
root by triggering unsharing exactly at the point where GUP comes into play.

So IMHO GUP is the problem and needs unsharing either:
* On write access to a shared anonymous page, which is just COW as we
know it.
* On read access to a shared anonymous page, which is what we're
proposing in this patch set.

So as soon as GUP comes into play, even if only pinning R/O, we have to
trigger unsharing. Doing so enforces the invariant that it is
impossible to take a GUP pin on an anonymous page with a mapcount > 1.
In turn, the COW code no longer needs to worry about the GUP-after-fork()
security issue and can focus on handling COW faults optimally, as if GUP
just didn't exist.
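
Condensed, that decision looks roughly like this (a sketch; the helper in
the actual series takes additional arguments, as the gup_pte_range() hunk
quoted later in this thread shows):

static bool gup_must_unshare(unsigned int flags, struct page *page)
{
        if (flags & FOLL_WRITE)
                return false;   /* write access: ordinary COW applies */
        if (!page || !PageAnon(page) || PageKsm(page))
                return false;   /* only COW-able anon memory is affected */
        /* Shared anon page: trigger FAULT_FLAG_UNSHARE before pinning. */
        return page_mapcount(page) > 1;
}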
Linus Torvalds Dec. 17, 2021, 8:45 p.m. UTC | #9
On Fri, Dec 17, 2021 at 12:42 PM David Hildenbrand <david@redhat.com> wrote:
>
> > Then somebody else modified that page, and you got exactly what you
> > asked for - a COW event. The original R/O pin has the original page
> > that it asked for, and can read it just fine.
>
> Where in the code did I ask for a COW event? I asked for a R/O pin, not
> any kind of memory protection.

Why didn't you ask for a shared pin, if that is what you want?

We already support that.

If you don't like the read-only pins, don't use them. It's that simple.

              Linus
Jason Gunthorpe Dec. 17, 2021, 8:47 p.m. UTC | #10
On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:

> > 5. Take a R/O pin (RDMA, VFIO, ...)
> > -> refcount > 1
> >
> > 6. memset(mem, 0xff, pagesize);
> > -> Write fault -> COW
> 
> I do not believe this is actually a bug.
> 
> You asked for a R/O pin, and you got one.
> 
> Then somebody else modified that page, and you got exactly what you
> asked for - a COW event. The original R/O pin has the original page
> that it asked for, and can read it just fine.

To remind all, the GUP users, like RDMA, VFIO use
FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
COW breaking the coherence. In these cases 'r/o pin' does not mean
"snapshot the data", but it's only a promise not to write to the pages
and still desires coherence with the memory map.

Eg in RDMA we know of apps asking for a R/O pin of something in .bss
then filling that something with data finally doing the actual
DMA. Breaking COW after pin breaks those apps.

The above #5 can occur for O_DIRECT read and in that case the
'snapshot the data' is perfectly fine as racing the COW with the
O_DIRECT read just resolves the race toward the read() direction.

IIRC there is some other scenario that motivated this patch?

Jason
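
Concretely, the coherent 'r/o pin' Jason describes boils down to a call
shape like this (illustrative fragment; error handling omitted):

        /*
         * FOLL_WRITE breaks any COW up front, FOLL_FORCE permits that even
         * on VMAs the caller mapped read-only; the driver then simply
         * promises never to write through the pin.
         */
        ret = pin_user_pages_fast(addr, npages,
                                  FOLL_WRITE | FOLL_FORCE | FOLL_LONGTERM,
                                  pages);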
Linus Torvalds Dec. 17, 2021, 8:51 p.m. UTC | #11
On Fri, Dec 17, 2021 at 12:45 PM David Hildenbrand <david@redhat.com> wrote:
>
> If a page is mapped exactly once, page_mapcount(page) == 1 and there is
> nothing to do.

Why?

You state that, but you stating that doesn't magically make it so.

What makes "mapcount==1" stable and special? Your "it's an
atomic_read()" argument is nonsense - it implies that the count can be
changing, but you will get _one_ answer.

What makes that one answer of a changing count special?

What if there are other references to that same page, gotten with
vmsplice(), and just about to be mapped into another address space?

This is the meat of my argument. You claim that "mapcount==1" is
special. I claim that you haven't explained why it would be. And I do
not believe it is.

                 Linus
David Hildenbrand Dec. 17, 2021, 8:55 p.m. UTC | #12
On 17.12.21 21:51, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:45 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> If a page is mapped exactly once, page_mapcount(page) == 1 and there is
>> nothing to do.
> 
> Why?
> 
> You state that, but you stating that doesn't magically make it so.
> 
> What makes "mapcount==1" stable and special? Your "it's an
> atomic_read()" argument is nonsense - it implies that the count can be
> changing, but you will get _one_ answer.

And I explained how it cannot increment: the only way is via
fork(), which cannot run concurrently.

> 
> What makes that one answer of a changing count special?
> 
> What if there are other references to that same page, gotten with
> vmsplice(), and just about to be mapped into another address space?

If we have a shared anonymous page we cannot have GUP references, not
even R/O ones. Because GUP would have unshared and copied the page,
resulting in a R/O mapped anonymous page.

What am I missing?
Linus Torvalds Dec. 17, 2021, 8:56 p.m. UTC | #13
On Fri, Dec 17, 2021 at 12:47 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
> COW breaking the coherence. In these cases 'r/o pin' does not mean
> "snapshot the data", but it's only a promise not to write to the pages
> and still desires coherence with the memory map.
>
> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
> then filling that something with data finally doing the actual
> DMA. Breaking COW after pin breaks those apps.

I agree.

And my argument is that those kinds of things that ask for a R/O pin
are broken, and should just make sure to use the shared pins.

If the page was already writable, you can just re-use the page
directly (marking it pinned, so that any subsequent fork() does the
right pre-cow thing).

And if the page was *NOT* already writable, you do a COW - which might
be sharing the page directly too, if it has no other references.

What's the downside of just doing this properly?

Again: if a DMA user wants coherent memory, then it should use the
coherent pinning. Not some kind of "read-only sharing that looks at
crazy mapcounts that have absolutely zero relevance to whether the
page is coherent or not".

                   Linus
David Hildenbrand Dec. 17, 2021, 9:04 p.m. UTC | #14
On 17.12.21 21:47, Jason Gunthorpe wrote:
> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
> 
>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>> -> refcount > 1
>>>
>>> 6. memset(mem, 0xff, pagesize);
>>> -> Write fault -> COW
>>
>> I do not believe this is actually a bug.
>>
>> You asked for a R/O pin, and you got one.
>>
>> Then somebody else modified that page, and you got exactly what you
>> asked for - a COW event. The original R/O pin has the original page
>> that it asked for, and can read it just fine.
> 

Hi Jason

> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the

I heard that statement often. Can you point me at the code?

VFIO: drivers/vfio/vfio_iommu_type1.c

vaddr_get_pfns() will end up doing a pin_user_pages_remote(FOLL_LONGTERM)
without FOLL_FORCE|FOLL_WRITE.

Is that added automatically internally?

Note the comment in the next patch

+ *
+ * TODO: although the security issue described no longer applies in any case,
+ * the full consistency between the pinned pages and the pages mapped into the
+ * page tables of the MM applies to short-term pinnings only. For
+ * FOLL_LONGTERM, FOLL_WRITE|FOLL_FORCE is required for now, which can be
+ * inefficient and still result in some consistency issues. Extend this
+ * mechanism to also provide full synchronicity to FOLL_LONGTERM, avoiding
+ * FOLL_WRITE|FOLL_FORCE.

> 
> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
> then filling that something with data finally doing the actual
> DMA. Breaking COW after pin breaks those apps.
> 
> The above #5 can occur for O_DIRECT read and in that case the
> 'snapshot the data' is perfectly fine as racing the COW with the
> O_DIRECT read just resolves the race toward the read() direction.
> 
> IIRC there is some other scenario that motivated this patch?

1. I want to fix the COW security issue as documented.
   Reproducers in patch #11

2. I want to fix all of the other issues as documented and linked
   in the cover letter that result from the imprecise page_count
   check in COW code. Especially the ones where we have memory
   corruptions, because this is just not acceptable. There are
   reproducers as well for everybody that doesn't believe me.

But this series really just wants to fix the security issue as "part 1",
without introducing any more breakages.

I'm sorry, but it's all described in the cover letter. Maybe it was TL;DR.
Nadav Amit Dec. 17, 2021, 9:15 p.m. UTC | #15
> On Dec 17, 2021, at 12:47 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
> 
>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>> -> refcount > 1
>>> 
>>> 6. memset(mem, 0xff, pagesize);
>>> -> Write fault -> COW
>> 
>> I do not believe this is actually a bug.
>> 
>> You asked for a R/O pin, and you got one.
>> 
>> Then somebody else modified that page, and you got exactly what you
>> asked for - a COW event. The original R/O pin has the original page
>> that it asked for, and can read it just fine.
> 
> To remind all, the GUP users, like RDMA, VFIO use
> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
> COW breaking the coherence. In these cases 'r/o pin' does not mean
> "snapshot the data", but it's only a promise not to write to the pages
> and still desires coherence with the memory map.
> 
> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
> then filling that something with data finally doing the actual
> DMA. Breaking COW after pin breaks those apps.
> 
> The above #5 can occur for O_DIRECT read and in that case the
> 'snapshot the data' is perfectly fine as racing the COW with the
> O_DIRECT read just resolves the race toward the read() direction.
> 
> IIRC there is some other scenario that motivated this patch?

I think that there is an assumption that once a page is COW-broken,
it would never have another write-fault that might lead to COW
breaking later.

AFAIK at least after userfaultfd-WP followed by
userfaultfd-write-unprotect a page might be write-protected and
go through do_wp_page() a second time to be COW-broken again. In
such case, I think the FOLL_FORCE|FOLL_WRITE would not help.

I suspect (not sure) that this might even happen with mprotect()
since I do not see all code-paths preserving whether the page
was writable.
David Hildenbrand Dec. 17, 2021, 9:17 p.m. UTC | #16
On 17.12.21 21:56, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:47 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> To remind all, the GUP users, like RDMA, VFIO use
>> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
>> COW breaking the coherence. In these cases 'r/o pin' does not mean
>> "snapshot the data", but it's only a promise not to write to the pages
>> and still desires coherence with the memory map.
>>
>> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
>> then filling that something with data finally doing the actual
>> DMA. Breaking COW after pin breaks those apps.
> 
> I agree.
> 

I agree that breaking COW after a pin should never be done. Therefore,
break the COW before the pin -> unsharing as implemented here.

> And my argument is that those kinds of things that ask for a R/O pin
> are broken, and should just make sure to use the shared pins.

And trigger a write fault although they are just reading. To me this is
just a band aid instead of eventually ...

...
> What's the downside of just doing this properly?

Doing it properly by fixing GUP and not the COW logic. GUP and COW are
just incompatible and we should unshare early.

Honestly, the memory corruptions we can trigger in user space due to the
current COW logic *especially* with FOLL_WRITE users such as O_DIRECT,
io_uring or vfio are not a joke anymore. (again, link in the cover letter)
David Hildenbrand Dec. 17, 2021, 9:20 p.m. UTC | #17
On 17.12.21 22:15, Nadav Amit wrote:
> 
> 
>> On Dec 17, 2021, at 12:47 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
>>
>> On Fri, Dec 17, 2021 at 12:36:43PM -0800, Linus Torvalds wrote:
>>
>>>> 5. Take a R/O pin (RDMA, VFIO, ...)
>>>> -> refcount > 1
>>>>
>>>> 6. memset(mem, 0xff, pagesize);
>>>> -> Write fault -> COW
>>>
>>> I do not believe this is actually a bug.
>>>
>>> You asked for a R/O pin, and you got one.
>>>
>>> Then somebody else modified that page, and you got exactly what you
>>> asked for - a COW event. The original R/O pin has the original page
>>> that it asked for, and can read it just fine.
>>
>> To remind all, the GUP users, like RDMA, VFIO use
>> FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
>> COW breaking the coherence. In these cases 'r/o pin' does not mean
>> "snapshot the data", but it's only a promise not to write to the pages
>> and still desires coherence with the memory map.
>>
>> Eg in RDMA we know of apps asking for a R/O pin of something in .bss
>> then filling that something with data finally doing the actual
>> DMA. Breaking COW after pin breaks those apps.
>>
>> The above #5 can occur for O_DIRECT read and in that case the
>> 'snapshot the data' is perfectly fine as racing the COW with the
>> O_DIRECT read just resolves the race toward the read() direction.
>>
>> IIRC there is some other scenario that motivated this patch?
> 
> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.
> 
> AFAIK at least after userfaultfd-WP followed by
> userfaultfd-write-unprotect a page might be write-protected and
> go through do_wp_page() a second time to be COW-broken again. In
> such case, I think the FOLL_FORCE|FOLL_WRITE would not help.
> 
> I suspect (not sure) that this might even happen with mprotect()
> since I do not see all code-paths preserving whether the page
> was writable.
> 

uffd-wp and mprotect() are broken as well, yes. But the easiest example
is just swap + read fault.

Section 2 and 3 in [1], along with reproducers.

Note that I didn't mention uffd-wp and mprotect(), because these require
"manual intervention". With swap, it's not your application doing
something "special".

[1]
https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
Linus Torvalds Dec. 17, 2021, 9:36 p.m. UTC | #18
On Fri, Dec 17, 2021 at 12:55 PM David Hildenbrand <david@redhat.com> wrote:
>
> If we have a shared anonymous page we cannot have GUP references, not
> even R/O ones. Because GUP would have unshared and copied the page,
> resulting in a R/O mapped anonymous page.

Doing a GUP on an actual shared page is wrong to begin with.

You even know that; you try to use "page_mapcount() > 1" to disallow it.

My point is that it's wrong regardless, and that "mapcount" is
dubious, and that COW cannot - and must not - use mapcount, and that I
think your shared case should strive to avoid it for the exact same
reason.

So, what I think should happen is:

 (a) GUP makes sure that it only ever looks up pages that can be
shared with this VM. This may involve breaking COW early with any
past fork().

 (b) it marks such pages so that any future work will not cause them
to COW either

Note that (a) is not necessarily "always COW and have to allocate and
copy new page". In particular, if the page is already writable, you
know you already have exclusive access to it and don't need to COW.

And if it isn't writable, then the other common case is "the cow has
only one user, and it's us" - that's the "refcount == 1" case.

And (b) is what we do with that page_maybe_dma_pinned() logic for
fork(), but also for things like swap cache creation (eg see commit
feb889fb40fa: "mm: don't put pinned pages into the swap cache").

Note that this code all already exists, and already works - even
without getting the (very expensive) mmap_sem. So it works with
fast-GUP and it can race with concurrent forking by another thread,
which is why we also have that seqcount thing.

As far as I can tell, your "mapcount" logic fundamentally requires
mmap_sem for the fork() race avoidance, for example.

So this is why I don't like the mapcount games - I think they are very
fragile, and not at all as logical as the two simple rules a/b above.

I believe you can make mapcount games _work_ - we used to have
something like that. It was incredibly fragile, and it had its own set
of bugs, but with enough care it's doable.

But my argument really is that I think it's the wrong approach, and
that we should simply strive to follow the two simple conceptual rules
above.

            Linus
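
For reference, the rule-(b) machinery mentioned above, roughly as it
stands in include/linux/mm.h at the time of this thread:

/*
 * fork() consults this and copies up front instead of write-protecting,
 * so a pinned page is never COW-shared with a child.
 */
static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
                                          struct page *page)
{
        if (!is_cow_mapping(vma->vm_flags))
                return false;
        if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
                return false;
        return page_maybe_dma_pinned(page);
}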
David Hildenbrand Dec. 17, 2021, 9:47 p.m. UTC | #19
On 17.12.21 22:36, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:55 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> If we have a shared anonymous page we cannot have GUP references, not
>> even R/O ones. Because GUP would have unshared and copied the page,
>> resulting in a R/O mapped anonymous page.
> 
> Doing a GUP on an actual shared page is wrong to begin with.
> 
> You even know that, you try to use "page_mapcount() > 1" to disallow it.

GUP is incompatible with shared anonymous pages, therefore it has to
trigger unsharing, correct.

> 
> My point is that it's wrong regardless, and that "mapcount" is
> dubious, and that COW cannot - and must not - use mapcount, and that I
> think your shared case should strive to avoid it for the exact same
> reason.

For now I have not heard a compelling argument why the mapcount is
dubious, I repeat:

* mapcount can only increase due to fork()
* mapcount can decrease due to unmap / zap

We can protect from the transition == 1 -> >1 using the mmap_lock.

For COW the mapcount is the only thing that matters *if we take GUP* out
of the equation. And that's exactly what we propose.

OTOH, take a look at which issues resulted from the page_count changes.
That's what I call dubious, sorry to say.

> 
> So, what I think should happen is:
> 
>  (a) GUP makes sure that it only ever looks up pages that can be
> shared with this VM. This may in involve breaking COW early with any
> past fork().

Is that unsharing as we propose it?

> 
>  (b) it marks such pages so that any future work will not cause them
> to COW either

Right, exactly. GUP before fork does not result in a page getting shared
again.

> 
> Note that (a) is not necessarily "always COW and have to allocate and
> copy new page". In particular, if the page is already writable, you
> know you already have exclusive access to it and don't need to COW.
> 
> And if it isn't writable, then the other common case is "the cow has
> only one user, and it's us" - that's the "refcount == 1" case.
> 
> And (b) is what we do with that page_maybe_dma_pinned() logic for
> fork(), but also for things like swap cache creation (eg see commit
> feb889fb40fa: "mm: don't put pinned pages into the swap cache").

I fully agree with b). GUP before fork is a totally different set of
problems than GUP after fork.

> 
> Note that this code all already exists, and already works - even
> without getting the (very expensive) mmap_sem. So it works with
> fast-GUP and it can race with concurrent forking by another thread,
> which is why we also have that seqcount thing.

I know, I studied it intensively :)

> 
> As far as I can tell, your "mapcount" logic fundamentally requires
> mmap_sem for the fork() race avoidance, for example.

Yes. Or any other more lightweight synchronization in the future. For
now this is just perfect.

> 
> So this is why I don't like the mapcount games - I think they are very
> fragile, and not at all as logical as the two simple rules a/b above.

I don't really see anything fragile. I'm happy to learn, as always.

> 
> I believe you can make mapcount games _work_ - we used to have
> something like that. It was incredibly fragile, and it had its own set
> of bugs, but with enough care it's doable.

We made it work, and it was comparatively simple.
Linus Torvalds Dec. 17, 2021, 9:50 p.m. UTC | #20
On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <david@redhat.com> wrote:
>
> For now I have not heard a compelling argument why the mapcount is
> dubious, I repeat:
>
> * mapcount can only increase due to fork()
> * mapcount can decrease due to unmap / zap
>
> We can protect from the transition == 1 -> >1 using the mmap_lock.
>
> For COW the mapcount is the only thing that matters *if we take GUP* out
> of the equation. And that's exactly what we propose.

What do you have against just doing what we already do in other parts,
that a/b thing?

Which avoids the whole mmap_sem issue. That was a big issue for the
rdma people, afaik.

                Linus
Linus Torvalds Dec. 17, 2021, 10:18 p.m. UTC | #21
On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <david@redhat.com> wrote:
>
> For now I have not heard a compelling argument why the mapcount is
> dubious, I repeat:
>
> * mapcount can only increase due to fork()
> * mapcount can decrease due to unmap / zap

And to answer the "why is this dubious", let's just look at your
actual code that I reacted to:

+       vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
+       if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
+           page_mapcount(vmf->page) > 1) {

Note how you don't just check page_mapcount(). Why not? Because
mapcount is completely immaterial if it's not a PageAnon page, so you
test for that.

So even when you do the mapcount read as one atomic thing, it's one
atomic thing that depends on _other_ things, and all these checks are
not atomic.

But a PageAnon() page can actually become a swap-backed page, and as
far as I can tell, your code doesn't have any locking to protect
against that.

So now you need not only the mmap_sem (to protect against fork), you
also need the page lock (to protect against rmap changing the type of
page).

I don't see you taking the page lock anywhere. Maybe the page table
lock ends up serializing sufficiently with the rmap code that it ends
up working.

In the do_wp_page() path, we currently do those kinds of racy checks
too, but then we do a trylock_page, and re-do them. And any time
there is any question about things, we fall back to copying - because
a copy is always safe.

Well, it's always safe if we have the rule that "once we've pinned
things, we don't cause them to be COW again".

But that "it's safe if" was exactly my (b) case.

That's why I much prefer the model I'm trying to push - it's
conceptually quite simple. I can literally explain mine at a
conceptual level with that "break pre-existing COW, make sure no
future COW" model.

In contrast, I look at your page_mapcount() code, and I go "there is
no conceptual rules here, and the actual implementation details look
dodgy".

I personally like having clear conceptual rules - as opposed to random
implementation details.

             Linus
David Hildenbrand Dec. 17, 2021, 10:29 p.m. UTC | #22
On 17.12.21 22:50, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> For now I have not heard a compelling argument why the mapcount is
>> dubious, I repeat:
>>
>> * mapcount can only increase due to fork()
>> * mapcount can decrease due to unmap / zap
>>
>> We can protect from the transition == 1 -> >1 using the mmap_lock.
>>
>> For COW the mapcount is the only thing that matters *if we take GUP* out
>> of the equation. And that's exactly what we propose.
> 
> What do you have against just doing what we already do in other parts,
> that a/b thing?

Let me put it that way: I just want to get all of this fixed for good. I
don't particularly care how. *But* I will fight for something that is
superior, logically makes sense (at least to me :) ) and is not super
complicated. And I also call avoiding unnecessary COW "superior".

I do know that what this series proposes fixes the CVE: GUP after fork.
I do know that the part 2 we'll be sending out next year will fix
everything else we discovered so far, and it will rely on this as a
basis, to not reintroduce any other COW issues we've seen so far.


If someone can propose something comparable that makes all discovered
problems go away I'll be *extremely* happy. We have reproducers for all
issues, so it's easy to verify, and I'm planning on extending the
selftests to cover even more corner cases.


So far, I am not convinced that using the mapcount is dubious or
problematic; I just don't see how. COW is about sharing pages between
processes, with each sharer counted in the mapcount. It's a pure
optimization for exactly that purpose.

GUP is the problem, not COW, not the mapcount. To me the mapcount is the
only thing that makes sense in COW+unsharing logic, and GUP has to be
taught to identify it and resolve it -> unshare when it detects a shared
anonymous page.


> 
> Which avoids the whole mmap_sem issue. That was a big issue for the
> rdma people, afaik.

While I do care about future use cases, I cannot possibly see fork() not
requiring the mmap_lock in the foreseeable future. Just so much depends
on it as of now.

And after all, fixing everything what we discovered so far is more
important to me than something like that for the future. We have other
problems to solve in that regard.

----------------------------------------------------------------------

I didn't want to talk about hugetlb here but I will, just because it's a
good example why using the refcount is just wrong -- because unnecessary
COWs are just absolutely problematic.

Assume you have a R/O mapped huge page that is only mapped into your
process and you get a write fault. What should your COW logic do?

a) Rely on the mapcount? Yes, iff GUP has been taught to unshare
   properly, because then it expresses exactly what we want to know.
   mapcount == 1 -> reuse. mapcount > 1 -> COW.

b) Rely on the refcount? If we have a speculative reference on the page
   we would COW. As huge pages are a scarce resource we can easily just
   not have a free huge page anymore and crash the application. The app
   didn't do anything wrong.

So teaching the hugetlb COW code to rely on the refcount would just be
highly fragile.
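
For reference, the existing hugetlb reuse test already consults the
mapcount (mm/hugetlb.c, hugetlb_cow(), abridged):

        /* Reuse the page only if we are the sole mapper: */
        if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
                page_move_anon_rmap(old_page, vma);
                set_huge_ptep_writable(vma, haddr, ptep);
                return 0;
        }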
David Hildenbrand Dec. 17, 2021, 10:43 p.m. UTC | #23
On 17.12.21 23:18, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 1:47 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> For now I have not heard a compelling argument why the mapcount is
>> dubious, I repeat:
>>
>> * mapcount can only increase due to fork()
>> * mapcount can decrease due to unmap / zap
> 
> And to answer the "why is this dubious", let' sjust look at your
> actual code that I reacted to:
> 
> +       vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
> +       if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
> +           page_mapcount(vmf->page) > 1) {
> 
> Note how you don't just check page_mapcount(). Why not? Because
> mapcount is completely immaterial if it's not a PageAnon page, so you
> test for that.
> 
> So even when you do the mapcount read as one atomic thing, it's one
> atomic thing that depends on _other_ things, and all these checks are
> not atomic.
> 
> But a PageAnon() page can actually become a swap-backed page, and as
> far as I can tell, your code doesn't have any locking to protect
> against that.

The pages stay PageAnon(); swap-backed pages simply set a bit IIRC. The
mapcount still applies.

> 
> So now you need not only the mmap_sem (to protect against fork), you
> also need the page lock (to protect against rmap changing the type of
> page).

No, I don't think so. But I'm happy to be proven wrong because I might
just be missing something important.

> 
> I don't see you taking the page lock anywhere. Maybe the page table
> lock ends up serializing sufficiently with the rmap code that it ends
> up working
> 
> In the do_wp_page() path, we currently do those kinds of racy checks
> too, but then we do a trylock_page, and re-do them. And at any time
> there is any question about things, we fall back to copying - because
> a copy is always safe.

Yes, I studied that code in detail as well.

> 
> Well, it's always safe if we have the rule that "once we've pinned
> things, we don't cause them to be COW again".

We should also be handling FOLL_GET, but that's a completely different
discussion.

> 
> But that "it's safe if" was exactly my (b) case.
> 
> That's why I much prefer the model I'm trying to push - it's
> conceptually quite simple. I can literally explain mine at a
> conceptual level with that "break pre-existing COW, make sure no
> future COW" model.

:)

We really might be talking about the same thing; my point is just that
the mapcount is the right thing to use for making the decision whether
to break COW -> trigger unsharing.

> 
> In contrast, I look at your page_mapcount() code, and I go "there is
> no conceptual rules here, and the actual implementation details look
> dodgy".
> 
> I personally like having clear conceptual rules - as opposed to random
> implementation details.

Oh, don't get me wrong, me too. But for me it all just makes perfect sense.

What we document is:

"The fault is an unsharing request to unshare a shared anonymous page
(-> mapped R/O). Does not apply to KSM."

And the code checks for exactly that. And in that context the mapcount
just expresses exactly what we want. Again, unless I am missing
something important that you raise above.


Anyhow, it's late in Germany. Thanks for the discussion, Linus!
Linus Torvalds Dec. 17, 2021, 10:58 p.m. UTC | #24
On Fri, Dec 17, 2021 at 2:29 PM David Hildenbrand <david@redhat.com> wrote:
>
> While I do care about future use cases, I cannot possibly see fork() not
> requiring the mmap_lock in the foreseeable future. Just so much depends
> on it as of now.

It's not that *fork()* depends on it.

Of course fork() takes the mmap_sem.

It's that fast-gup really really doesn't want it, and can't take it.

So any fast-gup user fundamentally cannot look at mapcount(), because
that would be fundamentally wrong and racy, and could race with fork.

And yet, as far as I can tell, that's *exactly* what your gup patches
do, with gup_pte_range() adding

+               if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
+                       put_compound_head(head, 1, flags);
+                       goto pte_unmap;
+               }

which looks at the page mapcount without holding the mmap sem at all.

And see my other email - I think there are other examples of your
patches looking at data that isn't stable because you don't hold the
right locks.

And you can't even do the optimistic case without taking the lock,
because in your world, a COW that optimistically copies in the case of
a race condition is fundamentally *wrong* and buggy. Because in your
world-view, GUP and COW are very different and have different rules,
but you need things to be *exact*, and they aren't.

And none of this is anything at least I can think about, because I
don't see what the "design" is.

I really have a hard time following what the rules actually are. You
seem to think that "page_mapcount()" is a really simple rule, and I
fundamentally disagree. It's a _very_ complicated thing indeed, with
locking issues, AND YOU ACTIVELY VIOLATE THE LOCKING RULES!

See why I'm so unhappy?

We *did* do the page_mapcount() thing. It was bad. It forced COW to
always take the page lock. There's a very real reason why I'm pushing
my "let's have a _design_ here", instead of your "let's look at
page_mapcount without even doing the locking".

And yes, I *know* that fork() takes the mmap_sem, and likely always
will. That really isn't the problem here. The problem is that your
page_mapcount() paths DO NOT take that lock.

Btw, maybe I'm misreading things. I looked at the individual patches,
I didn't apply them, maybe I missed something. But I don't think I am.

             Linus
Linus Torvalds Dec. 17, 2021, 11:20 p.m. UTC | #25
On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand <david@redhat.com> wrote:
>
> The pages stay PageAnon(). swap-backed pages simply set a bit IIRC.
> mapcount still applies.

Our code-base is too large for me to remember all the details, but if
we still end up having PageAnon for swap-backed pages, then mapcount
can increase from another process faulting in a pte with that swap
entry.

And mmap_sem doesn't protect against that. Again, page_lock() does.

And taking the page lock was a big performance issue.

One of the reasons the new COW handling is so nice is that you can do
things like

                if (!trylock_page(page))
                        goto copy;

exactly because in the a/b world order, the copy case is always safe.

In your model, as far as I can tell, you leave the page read-only and
a subsequent COW fault _can_ happen, which means that now the
subsequent COW needs to be very, very careful, because if it ever
copies a page that was GUP'ed, you just broke the rules.

So COWing too much is a bug (because it breaks the page from the GUP),
but COWing too little is an even worse problem (because it means that
now the GUP user can see data it shouldn't have seen).

Our old code literally COWed too little. It's why all those changes
happened in the first place.

This is why I'm pushing that whole story line of

 (1) COW is based purely on refcounting, because that's the only thing
that obviously can never COW too little.

 (2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then
makes sure to not mark pinned pages COW again (that "(b)" rule).

and here "don't use page_mapcount()" really is about that (1).

You do seem to have kept (1) in that your COW rules don't seem to
change (but maybe I missed it), but because your GUP-vs-COW semantics
are very different indeed, I'm not at all convinced about (2).

            Linus
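
The pattern described above, abridged from do_wp_page() after the
page_count() rework (anything unclear resolves toward copying):

        if (PageAnon(page)) {
                /* Racy check first; any doubt falls back to copy. */
                if (PageKsm(page) || page_count(page) != 1)
                        goto copy;
                if (!trylock_page(page))
                        goto copy;
                /* Re-check under the page lock. */
                if (PageKsm(page) || page_mapcount(page) != 1 ||
                    page_count(page) != 1) {
                        unlock_page(page);
                        goto copy;
                }
                /* Exclusive owner: reuse without copying. */
                unlock_page(page);
                wp_page_reuse(vmf);
                return VM_FAULT_WRITE;
        }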
David Hildenbrand Dec. 17, 2021, 11:29 p.m. UTC | #26
On 17.12.21 23:58, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 2:29 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> While I do care about future use cases, I cannot possibly see fork() not
>> requiring the mmap_lock in the foreseeable future. Just so much depends
>> on it as of now.
> 
> It's not that *fork()* depends on it.
> 
> Of course fork() takes the mmap_sem.
> 
> It's that fast-gup really really doesn't want it, and can't take it.

Oh, sorry, I was misreading your mail.

> 
> So any fast-gup user fundamentally cannot look at mapcount(), because
> that would be fundamentally wrong and racy, and could race with fork.
> 

So we're concerned about fork() racing with gup-fast-only. gup-fast-only
runs essentially lockless.

As we read an atomic mapcount, the relevant thing to happen would be
that we read a mapcount of 1 and decide to share the page, but there is
a concurrent fork() such that the mapcount is increased.

So the parent process has to be the only process owning that page for
this to trigger (mapcount == 1). In that situation, we would pin the
page in gup-fast-only.

BUT this is just like GUP before fork() and is caught using the
mm->write_protect_seq, so we'd immediately unpin it and not actually
return it from get-user-pages-fast. No harm done AFAIKS.


> And yet, as far as I can tell, that's *exactly* what your gup patches
> do, with gup_pte_range() adding
> 
> +               if (!pte_write(pte) && gup_must_unshare(flags, page, false)) {
> +                       put_compound_head(head, 1, flags);
> +                       goto pte_unmap;
> +               }
> 
> which looks at the page mapcount without holding the mmap sem at all.
> 
> And see my other email - I think there are other examples of your
> patches looking at data that isn't stable because you don't hold the
> right locks.

We rely on PageAnon(), PageKsm() and the mapcount. To my understanding,
they are stable for our use in page fault handling code under the
mmap_lock and in gup-fast because of the above reasoning.

> 
> And you can't even do the optimistic case without taking the lock,
> because in your world, a COW that optimistically copies in the case of
> a race condition is fundamentally *wrong* and buggy. Because in your
> world-view, GUP and COW are very different and have different rules,
> but you need things to be *exact*, and they aren't.
> 
> And none of this is anything at least I can think about, because I
> don't see what the "design" is.
> 
> I really have a hard time following what the rules actually are. You
> seem to think that "page_mapcount()" is a really simple rule, and I
> fundamentally disagree. It's a _very_ complicated thing indeed, with
> locking issues, AND YOU ACTIVELY VIOLATE THE LOCKING RULES!
> 
> See why I'm so unhappy?

I see why you're unhappy, and I appreciate the productive discussion :)
But I think we just have to complete the big picture of what we're
proposing and how the mapcount is safe to be used for this purpose.

I mean, I'm happy if you actually find a flaw in the current design
proposal.

> 
> We *did* do the page_mapcount() thing. It was bad. It forced COW to
> always take the page lock. There's a very real reason why I'm pushing
> my "let's have a _design_ here", instead of your "let's look at
> page_mapcount without even doing the locking".

The locking semantics just have to be clarified and written in stone --
if we don't find any flaws.

But this will be my last mail for today. Have a nice weekend, Linus!
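
The fork()-vs-fast-GUP interlock referred to above, abridged from
mm/gup.c (lockless_pages_from_mm()) at the time:

        if (gup_flags & FOLL_PIN) {
                seq = raw_read_seqcount(&current->mm->write_protect_seq);
                if (seq & 1)    /* fork() in progress: take the slow path */
                        return 0;
        }

        /* ... lockless page-table walk, pinning pages ... */

        if (gup_flags & FOLL_PIN) {
                if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
                        /* Raced with fork(): drop the pins rather than
                         * return a page that may now be COW-shared. */
                        unpin_user_pages(pages, nr_pinned);
                        return 0;
                }
        }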
Nadav Amit Dec. 17, 2021, 11:53 p.m. UTC | #27
> On Dec 17, 2021, at 3:29 PM, David Hildenbrand <david@redhat.com> wrote:
> 
> On 17.12.21 23:58, Linus Torvalds wrote:
>> 
>> And you can't even do the optimistic case without taking the lock,
>> because in your world, a COW that optimistically copies in the case of
>> a race condition is fundamentally *wrong* and buggy. Because in your
>> world-view, GUP and COW are very different and have different rules,
>> but you need things to be *exact*, and they aren't.

I understand the discussion mainly revolves around correctness, which is
obviously the most important property, but I would like to mention
that having transient get_page() calls causing unnecessary COWs can
cause hard-to-analyze and hard-to-avoid performance degradation. COW
means a page copy, a TLB flush and potentially a TLB shootdown, which
is the most painful, specifically on VMs.

So I think that any solution should be able to keep the number of
unnecessary COW operations to a minimum.
Jason Gunthorpe Dec. 18, 2021, 12:50 a.m. UTC | #28
On Fri, Dec 17, 2021 at 10:04:11PM +0100, David Hildenbrand wrote:

> > To remind all, the GUP users, like RDMA, VFIO use
> > FOLL_FORCE|FOLL_WRITE to get a 'r/o pin' specifically because of the
> 
> I heard that statement often. Can you point me at the code?
> 
> VFIO: drivers/vfio/vfio_iommu_type1.c
> 
> vaddr_get_pfns() will end up doing a 
> pin_user_pages_remote(FOLL_LONGTERM) without 
> FOLL_FORCE|FOLL_WRITE.

> Is that added automatically internally?

No, it is just that VFIO is broken in this regard. AFAIK VFIO users
rarely use the read-only mode and haven't noticed this bug yet.

You can search for FOLL_FORCE and see the drivers that do it this way:

drivers/misc/habanalabs/common/memory.c:                                 FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,
drivers/infiniband/core/umem.c:         gup_flags |= FOLL_FORCE;
drivers/media/v4l2-core/videobuf-dma-sg.c:      unsigned int flags = FOLL_FORCE;
drivers/media/common/videobuf2/frame_vector.c:                            FOLL_FORCE | FOLL_WRITE | FOLL_LONGTERM,

[etc]

No doubt there are others that do it right and wrong; this is badly
documented and misunderstood.

> But this series really just wants to fix the security issue as "part 1".
> Without any more breakages.

Sure, I'm just trying to understand why this example was brought up.

Jason
Jason Gunthorpe Dec. 18, 2021, 12:50 a.m. UTC | #29
On Fri, Dec 17, 2021 at 09:15:45PM +0000, Nadav Amit wrote:

> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.

Yes, that is what Linus has been explaining, AFAICT

> AFAIK at least after userfaultfd-WP followed by
> userfaultfd-write-unprotect a page might be write-protected and
> go through do_wp_page() a second time to be COW-broken again. In
> such case, I think the FOLL_FORCE|FOLL_WRITE would not help.

Right, and this is a good reason why refcount runs into trouble: it
COWs too much in cases like that, because userfaultfd-WP doesn't
align with the first assumption.

Jason
Linus Torvalds Dec. 18, 2021, 1:53 a.m. UTC | #30
[ Going back in the thread to this one ]

On Fri, Dec 17, 2021 at 1:15 PM Nadav Amit <namit@vmware.com> wrote:
>
> I think that there is an assumption that once a page is COW-broken,
> it would never have another write-fault that might lead to COW
> breaking later.

Right. I do think there are problems in the current code, I just think
that the patches are a step back.

The problems with the current code are of two kinds:

 - I think the largepage code (either THP or explicit hugetlb) doesn't
do as good a job of this whole COW handling as the regular pages do

 - some of the "you can make pages read-only again explicitly" kinds of loads.

But honestly, at least for the second case, if somebody does a GUP,
and then starts playing mprotect games on the same virtual memory area
that they did a GUP on, and are surprised when they get another COW
fault that breaks their own connection with a page they did a GUP on
earlier, that's their own fault.

So I think there's some of "If you broke it, you get to keep both
pieces". Literally, in this case. You have your GUP page that you
looked up, and you have your virtual address page that you caused COW
on with mprotect() by making it read-only and then read-write again,
then you have two different pages, and at some point it really is just
"Well, don't do that then".

But yes, there's also some of "some code probably didn't get fully
converted to the new world order".  So if VFIO only uses
FOLL_LONGTERM, and didn't ask for the COW breaking, then yes, VFIO
will see page incoherencies. But that should be an issue of "VFIO
should do the right thing".

So part of it is a combination of "if you do crazy things, you'll get
crazy results". And some of it is some kernel pinning code that
doesn't do the right thing to actually make sure it gets a shared page
to be pinned.

And then there's THP and HUGETLB, that I do think needs fixing and
aren't about those two kinds of cases.

I think we never got around to just doing the same thing we did for
regular pages. I think the hugepage code simply doesn't follow that
"COW on GUP, mark to not COW later" pattern.

           Linus
Linus Torvalds Dec. 18, 2021, 2:17 a.m. UTC | #31
On Fri, Dec 17, 2021 at 5:53 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
>
> And then there's THP and HUGETLB, that I do think needs fixing and
> aren't about those two kinds of cases.
>
> I think we never got around to just doing the same thing we did for
> regular pages. I think the hugepage code simply doesn't follow that
> "COW on GUP, mark to not COW later" pattern.

In particular, do_huge_pmd_wp_page() has this pattern:

        /*
         * We can only reuse the page if nobody else maps the huge page or it's
         * part.
         */
        if (reuse_swap_page(page, NULL)) {
                ... mark it writable ...

and that never got converted to "only mark it writable if we actually
have exclusive access to this huge page".

So the problem is literally that reuse_swap_page() uses that
"page_mapcount()" logic, and doesn't take into account that the page
is actually used by a GUP reference.

Which is exactly why David then sees that "oh, we got a GUP reference
to it, and now we're seeing the writes come through". Because that
code looks at mapcount, and it shouldn't.

I think the hugepage code should use the exact same logic that the
regular wp fault code does.
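
Something along these lines, conceptually (an untested sketch, not an
actual patch - names as in do_huge_pmd_wp_page()):

        /* holding the page table lock: */
        if (page_count(page) == 1) {
                /* our reference is the only one: safe to just reuse */
                pmd_t entry = pmd_mkyoung(orig_pmd);

                entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
                if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
                        update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
                spin_unlock(vmf->ptl);
                return VM_FAULT_WRITE;
        }
        /* anything else: fall back to copying, which is always safe */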

                 Linus
Linus Torvalds Dec. 18, 2021, 2:42 a.m. UTC | #32
On Fri, Dec 17, 2021 at 6:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I think the hugepage code should use the exact same logic that the
> regular wp fault code does.

IOW, I think that this stupid (AND UNTESTED) patch should likely just
fix David's test-case with the hugepage and splice thing..

Or at least be somewhat close.  But it should be paired with the GUP
side doing the right thing too, of course. Maybe it already does,
maybe it doesn't, I didn't check...

And maybe there's something subtle about the page_count() of a THP
entry. Again, I can't really claim to have tested this all, but I'm
hoping this makes somebody go "Ahh, now I see what Linus means"

            Linus
Jason Gunthorpe Dec. 18, 2021, 3:05 a.m. UTC | #33
On Fri, Dec 17, 2021 at 05:53:45PM -0800, Linus Torvalds wrote:

> But honestly, at least for the second case, if somebody does a GUP,
> and then starts playing mprotect games on the same virtual memory area
> that they did a GUP on, and are surprised when they get another COW
> fault that breaks their own connection with a page they did a GUP on
> earlier, that's their own fault.

I've been told there are real workloads that do this.

Something like qemu will use GUP with VFIO to insert PCI devices into
the guest and GUP with RDMA to do fast network copy of VM memory
during VM migration. 

qemu also uses the WP games to implement dirty tracking of VM memory
during migration (and more? I'm not sure). It expects that during all
of this nothing will COW the pages, as the two kinds of DMA must
always go to the pages mapped to KVM.

The big trouble here is this all worked before, so it is a userspace
visible regression.

Can this be made to work at all? I wonder: if qemu uses MAP_SHARED, eg
via a memfd or something, does the COW then go away naturally?

Jason
Nadav Amit Dec. 18, 2021, 3:30 a.m. UTC | #34
> On Dec 17, 2021, at 7:05 PM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Fri, Dec 17, 2021 at 05:53:45PM -0800, Linus Torvalds wrote:
> 
>> But honestly, at least for the second case, if somebody does a GUP,
>> and then starts playing mprotect games on the same virtual memory area
>> that they did a GUP on, and are surprised when they get another COW
>> fault that breaks their own connection with a page they did a GUP on
>> earlier, that's their own fault.
> 
> I've been told there are real workloads that do this.
> 
> Something like qemu will use GUP with VFIO to insert PCI devices into
> the guest and GUP with RDMA to do fast network copy of VM memory
> during VM migration.
> 
> qemu also uses the WP games to implement dirty tracking of VM memory
> during migration (and more? I'm not sure). It expects that during all
> of this nothing will COW the pages, as the two kinds of DMA must
> always go to the pages mapped to KVM.
> 
> The big trouble here is this all worked before, so it is a userspace
> visible regression.

In such a case, I do think it makes sense to fail uffd-wp (when
page_count() > 1), and in a prototype I am working on I do something
like that. Otherwise, if the page is written and you use uffd for
dirty tracking, what do you actually achieve?

You can return EAGAIN (which is documented and actually returned
while “mmap_changing”) in such a case. This would not break userspace,
but would still likely cause a performance regression.
Linus Torvalds Dec. 18, 2021, 3:36 a.m. UTC | #35
On Fri, Dec 17, 2021 at 6:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> IOW, I think that this stupid (AND UNTESTED) patch should likely just
> fix David's test-case with the hugepage and splice thing..

Looking at that patch, the page lock is entirely pointless.

It existed because of that broken reuse_swap_page() that tried to
count page mappings etc, but once you get rid of that - like we got
rid of it for the regular pages - it's not even needed.

So as we hold the page table lock, and see a page_count() of 1, we
could be done without any page lock at all. So that whole
trylock/unlock is actually unnecessary.

That said, it's possibly worth re-using any swap cache pages at this
point, and that would want the page lock. So some complexity in this
area is likely worth it. Similar to how we did it in commit
f4c4a3f48480 ("mm: free idle swap cache page after COW") for regular
pages.

So that patch is not great, but I think it works as a guiding one.

And notice how *simple* it is. It doesn't require careful counting of
swap entries that depend on page locking.

                Linus
Linus Torvalds Dec. 18, 2021, 3:38 a.m. UTC | #36
On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <namit@vmware.com> wrote:
>
> In such a case, I do think it makes sense to fail uffd-wp (when
> page_count() > 1), and in a prototype I am working on I do something
> like that.

Ack. If uffd-wp finds a page that is pinned, just skip it as not
write-protectable.

Because some of the pinners might be writing to it, of course - just
not through the page tables.
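
In the uffd-wp range walk that would be just a couple of lines (a
sketch, assuming the usual helpers are usable there):

        page = vm_normal_page(vma, addr, ptent);
        if (page && page_maybe_dma_pinned(page))
                continue;       /* leave it writable, don't wp it */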

So that sounds like the right thing to do. I _think_ we discussed this
the last time this came up. I have some dim memory of that. Jason,
ring a bell?

             Linus
Linus Torvalds Dec. 18, 2021, 4:02 a.m. UTC | #37
On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <namit@vmware.com> wrote:
>
> I understand the discussion mainly revolves around correctness, which is
> obviously the most important property, but I would like to mention
> that having transient get_page() calls causing unnecessary COWs can
> cause hard-to-analyze and hard-to-avoid performance degradation.

Note that the COW itself is pretty cheap. Yes, there's the page
allocation and copy, but it's mostly a local thing.

So that falls under the "good to avoid" heading, but in the end it's
not an immense deal.

In contrast, the page lock has been an actual big user-visible latency
issue, to the point of correctness.

A couple of years ago, we literally had NMI watchdog timeouts due to
the page wait-queues growing basically boundlessly. This was some
customer internal benchmark code that I never saw, so it wasn't
*quite* clear exactly what was going on, but we ended up having to
split up the page wait list traversal using bookmark entries, because
it was such a huge latency issue.

That was mostly NUMA balancing faults, I think, but the point I'm
making is that avoiding the page lock can be a *much* bigger deal than
avoiding some local allocation and copying of a page of data. There
are real loads where the page-lock gets insanely bad, and I think it's
because we use it much too much.

See commit 2554db916586 ("sched/wait: Break up long wake list walk")
for some of that saga.

So I really think that having to serialize with the page lock in order
to do some "exact page use counting" is a false economy. Yes, maybe
you'd be able to avoid a COW or two, but at what locking cost?

                Linus
Nadav Amit Dec. 18, 2021, 4:52 a.m. UTC | #38
> On Dec 17, 2021, at 8:02 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Fri, Dec 17, 2021 at 3:53 PM Nadav Amit <namit@vmware.com> wrote:
>> 
>> I understand the discussion mainly revolves around correctness, which is
>> obviously the most important property, but I would like to mention
>> that having transient get_page() calls causing unnecessary COWs can
>> cause hard-to-analyze and hard-to-avoid performance degradation.
> 
> Note that the COW itself is pretty cheap. Yes, there's the page
> allocation and copy, but it's mostly a local thing.

I don’t know about the page-lock overhead, but I understand your argument.

Having said that, I do know a bit about TLB flushes, which you did not
mention as overheads of COW. Such flushes can be quite expensive on
multithreaded workloads (specifically on VMs, but lets put those aside).

Take for instance memcached and assume you overcommit memory with a very fast
swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached
often accesses a page first for read and shortly after for write. I
encountered, in a similar scenario, a case where the page reference that
lru_cache_add() takes during the first faultin event (for read) causes a COW
on a write page-fault that happens shortly after [1]. So on memcached I
assume this would also trigger frequent unnecessary COWs.

Besides page allocation and copy, COW would then require a TLB flush, which,
when performed locally, might not be too bad (~200 cycles). But if memcached
has many threads, as it usually does, then you need a TLB shootdown and this
one can be expensive (microseconds). If you start getting a TLB shootdown
storm, you may avoid some IPIs since you see that other CPUs already queued
IPIs for the target CPU. But then the kernel would flush the entire TLB on
the the target CPU, as it realizes that multiple TLB flushes were queued,
and as it assumes that a full TLB flush would be cheaper.

[ I can try to run a benchmark during the weekend to measure the impact, as I
  did not really measure the impact on memcached before/after 5.8. ]

So I am in no position to prioritize one overhead over the other, but I do
not think that COW can be characterized as mostly-local and cheap in the
case of multithreaded workloads.


[1] https://lore.kernel.org/linux-mm/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com/
Nadav Amit Dec. 18, 2021, 5:23 a.m. UTC | #39
> On Dec 17, 2021, at 9:03 PM, Matthew Wilcox <willy@infradead.org> wrote:
> 
> On Sat, Dec 18, 2021 at 04:52:13AM +0000, Nadav Amit wrote:
>> Take for instance memcached and assume you overcommit memory with a very fast
>> swap (e.g., pmem, zram, perhaps even slower). Now, it turns out memcached
>> often accesses a page first for read and shortly after for write. I
>> encountered, in a similar scenario, that the page reference that
>> lru_cache_add() takes during the first faultin event (for read), causes a COW
>> on a write page-fault that happens shortly after [1]. So on memcached I
>> assume this would also trigger frequent unnecessary COWs.
> 
> Why are we comparing page_count() against 1 and not 1 + PageLRU(page)?
> Having a reference from the LRU should be expected.  Is it because of
> some race that we'd need to take the page lock to protect against?
> 

IIUC, the reference that is taken on the page is taken before SetPageLRU()
is called and the reference is later dropped:

lru_add_drain()
 lru_add_drain_cpu()
  __pagevec_lru_add()
   __pagevec_lru_add_fn()
    SetPageLRU()		<- sets the LRU
  release_pages()		<- drops the reference

It is one scenario I encountered. There might be others that take transient
references on pages that cause unnecessary COWs. I think David and Andrea
had few in mind. To trigger a COW bug I once used mlock()/munlock() that
take such transient reference. But who knows how many other cases exist
(KSM? vmscan?)
David Hildenbrand Dec. 18, 2021, 9:57 a.m. UTC | #40
On 18.12.21 00:20, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> The pages stay PageAnon(). swap-backed pages simply set a bit IIRC.
>> mapcount still applies.
> 
> Our code-base is too large for me to remember all the details, but if
> we still end up having PageAnon for swapbacked pages, then mapcount
> can increase from another process faulting in an pte with that swap
> entry.

"Our code-base is too large for me to remember all the details". I
second that.

You might have a valid point with the mapcount regarding concurrent
swapin in the current code. I'll have to think further about whether it
could be a problem and whether it can be handled without heavy
synchronization (I think the concern is that GUP unsharing could miss
doing an unshare because, when seeing mapcount == 1, it doesn't detect
that there are other page sharers expressed not in the mapcount but via
the swap code).

Do you have any other concerns regarding the semantics/stability of the
following points (as discussed, fork() is not the issue because it can
be handled via write_protect_seq or something comparable; handling
per-process things is not the problem):

a) Using PageAnon(): It cannot possibly change in the pagefault path or
   in the gup-fast-only path (otherwise there would be use-after-free
   already).
b) Using PageKsm(): It cannot possibly change in the pagefault path or
   in the gup-fast path (otherwise there would be use-after-free
   already).
c) Using mapcount: It cannot change in a way we care about but cannot
   detect (mapcount going from == 1 to > 1 concurrently) in the
   pagefault path or in the gup-fast path due to fork().

Your point for c) is that we might currently not handle swap
correctly. Any other concerns, especially regarding the mapcount, or is
that it?


IIUC, any GUP approach to detect necessary unsharing would at least
require a check for a) and b). What we're arguing about is c).

> 
> And mmap_sem doesn't protect against that. Again, page_lock() does.
> 
> And taking the page lock was a big performance issue.
> 
> One of the reasons that new COW handling is so nice is that you can do
> things like
> 
>                 if (!trylock_page(page))
>                         goto copy;
> 
> exactly because in the a/b world order, the copy case is always safe.
> 
> In your model, as far as I can tell, you leave the page read-only and
> a subsequent COW fault _can_ happen, which means that now the
> subsequent COW needs to be very careful, because if it ever
> copies a page that was GUP'ed, you just broke the rules.
> 
> So COWing too much is a bug (because it breaks the page from the GUP),
> but COWing too little is an even worse problem (because it means that
> now the GUP user can see data it shouldn't have seen).

Good summary, I'll extend below.

> 
> Our old code literally COWed too  little. It's why all those changes
> happened in the first place.

Let's see if we can agree on some things to get a common understanding.


What can happen with COW is:

1) Missed COW

We miss a COW, therefore someone has access to a wrong page.

This is the security issue as in patch #11. The security issue
documented in [1].

2) Unnecessary COW

We do a COW, but there are no other valid users, so it's just overhead +
noise.

The performance issue documented in section 5 in [1].

3) Wrong COW

We do a COW but there are other valid users (-> GUP).

The memory corruption issue documented in section 2 and 3 in [1].

Most notably, the io_uring reproducer which races with the
page_maybe_dma_pinned() check in current code can trigger this easily,
and exactly this issue is what gives me nightmares. [2]


Does that make sense? If we agree on the above, then here is how the
currently discussed approaches differ:

page_count != 1:
* 1) cannot happen
* 2) can happen easily (speculative references due to pagecache,
     migration, daemon, pagevec, ...)
* 3) can happen in the current code

mapcount > 1:
* 1) your concern is that this can happen due to concurrent swapin
* 2) cannot happen.
* 3) your concern is that this can happen due to concurrent swapin


If we can agree on that, I can see why you dislike mapcount, can you see
why I dislike page_count?

Ideally we'd really have a fast and reliable check for "is this page
shared and could get used by multiple processes -- either multiple
processes are already mapping it R/O or could map it via the swap R/O
later".


> This is why I'm pushing that whole story line of
> 
>  (1) COW is based purely on refcounting, because that's the only thing
> that obviously can never COW too little.

I am completely missing how 2) or 3) could *ever* be handled properly
for page_count != 1. 3) is obviously more important and gives me nightmares.


And that's what I've been trying to communicate the whole time: page_count is
absolutely fragile, because anything that results in a page getting
mapped R/O into a page table can trigger 3). And as [2] proves that can
even happen with *swap*.

(see how we're running into the same swap issues with both approaches?
Stupid swap :) )

> 
>  (2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then
> makes sure to not mark pinned pages COW again (that "(b)" rule).
> 
> and here "don't use page_mapcount()" really is about that (1).
> 
> You do seem to have kept (1) in that your COW rules don't seem to
> change (but maybe I missed it), but because your GUP-vs-COW semantics
> are very different indeed, I'm not at all convinced about (2).

Oh yes, sorry, not in the context of this series. The point is that the
current page_count != 1 covers mapcount > 1, so we can adjust that
separately later.


You mentioned "design", so let's assume we have a nice function:

/*
 * Check if an anon page is shared or exclusively used by a single
 * process: if shared, the page is shared by multiple processes either
 * mapping the page R/O ("active sharing") or having swap entries that
 * could result in the page getting mapped R/O ("inactive sharing").
 *
 * This function is safe to be called under mmap_lock in read/write mode
 * because it prevents concurrent fork() sharing the page.
 * This function is safe to be called from gup-fast-only in IRQ context,
 * as it detects concurrent fork() sharing the page
 */
bool page_anon_shared();
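
A first shot at a body could look like this (hypothetical, and it
deliberately hand-waves the swapin part we're arguing about):

bool page_anon_shared(struct page *page)
{
	VM_BUG_ON_PAGE(!PageAnon(page) || PageKsm(page), page);

	/* "active" sharers: other page tables mapping the page */
	if (page_mapcount(page) > 1)
		return true;
	/*
	 * "inactive" sharers: swap entries that could map it R/O
	 * later. Making this reliable is exactly the open problem.
	 */
	return PageSwapCache(page);
}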


Can we agree that that would be a suitable function for (1) and (2)
instead of using either the page_count or the mapcount directly? (yes,
how to actually make it reliable due to swapin is to be discussed, but
it might be a problem worth solving if that's the way to go)

For hugetlb, this would really have to use the mapcount as explained
(after all, fortunately there is no swap ...).



[1]
https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/

[2]
https://gitlab.com/aarcange/kernel-testcases-for-v5.11/-/blob/main/io_uring_swap.c
David Hildenbrand Dec. 18, 2021, 10:06 a.m. UTC | #41
On 18.12.21 03:42, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 6:17 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> I think the hugepage code should use the exact same logic that the
>> regular wp fault code does.
> 
> IOW, I think that this stupid (AND UNTESTED) patch should likely just
> fix David's test-case with the hugepage and splice thing..
> 
> Or at least be somewhat close.  But it should be paired with the GUP
> side doing the right thing too, of course. Maybe it already does,
> maybe it doesn't, I didn't check...
> 
> And maybe there's something subtle about the page_count() of a THP
> entry. Again, I can't really claim to have tested this all, but I'm
> hoping this makes somebody go "Ahh, now I see what Linus means"

Not the reaction you are hoping for: "Gah, Linus still doesn't see why
the page_count is just wrong". :)

See the mail I just wrote, let's get a common understanding of how our
check should actually look like.
Jason Gunthorpe Dec. 18, 2021, 6:42 p.m. UTC | #42
On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <namit@vmware.com> wrote:
> >
> > In such a case, I do think it makes sense to fail uffd-wp (when
> > page_count() > 1), and in a prototype I am working on I do something
> > like that.
> 
> Ack. If uffd-wp finds a page that is pinned, just skip it as not
> write-protectable.
>
> Because some of the pinners might be writing to it, of course - just
> not through the page tables.

That doesn't address the qemu use case though. The RDMA pin is the
'coherent r/o pin' we discussed before, which requires that the pages
remain un-write-protected and the HW DMA is read only.

The VFIO pin will enable dirty page tracking in the system IOMMU so it
gets the same effect from qemu's perspective as the CPU WP is doing.

In these operations every single page of the guest will be pinned, so
skipping it just means userfaultfd wp doesn't work at all.

Qemu needs some solution to be able to dirty track the CPU memory for
migration..

> So that sounds like the right thing to do. I _think_ we discussed this
> the last time this came up. I have some dim memory of that. Jason,
> ring a bell?

We talked about clear_refs a lot, but it was never really clear what
the use case was, I think. Plus that discussion never finalized into
anything.

David's latest summary seems accurate. If I paraphrase at a high
level: Linus's approach always does enough COWs but might do extra, and
David's approach tries to do exactly the right number of COWs.

It looks like, to have the same functionality with Linus's approach, we
need a way for userspace to opt out of COW and work in an
entirely deterministic non-COW world. WP & GUP can never work together
otherwise, which leaves qemu stranded.

Or, we follow David's approach, make COW precise, and accept the
complexity.

Jason
Linus Torvalds Dec. 18, 2021, 7:21 p.m. UTC | #43
[ Cutting down ruthlessly to the core of the issue ]

On Sat, Dec 18, 2021 at 1:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> 1) Missed COW
>
> 2) Unnecessary COW
>
> 3) Wrong COW

> Does that make sense? If we agree on the above, then here is how the
> currently discussed approaches differ:
>
> page_count != 1:
> * 1) cannot happen
> * 2) can happen easily (speculative references due to pagecache,
>      migration, daemon, pagevec, ...)
> * 3) can happen in the current code

I claim that (1) "cannot happen" is a huge mother of a deal. It's
*LITERALLY* the bug you are chasing, and it's the security issue, so
on a bug scale, it's about the worst there is.

I further then claim that (2) "happen easily" is you just making
things up. Yes, it can happen. But no, it's not actually that common,
and since (2) is harmless from a correctness standpoint, it is purely
about performance.

And as mentioned, not using the mapcount actually makes *common*
operations much simpler and faster. You don't need the page lock to
serialize the mapcount.

So (2) is a performance argument, and you haven't actually shown it to
be a problem.

Which really only leaves (3). Which I've already explained what the
fix is: don't ever mark pages that shouldn't be COW'ed as being COW
pages.

(3) is really that simple, although it ended up depending on Jason and
John Hubbard and others doing that FOLL_PIN logic to distinguish "I
just want to see a random page, and I don't care about COW" from "I
want to get a page, and that page needs to be coherent with this VM
and not be COW'ed away"

So I'm not claiming (3) is "trivial", but at the same time it's
certainly not some fundamentally complicated thing, and it's easy to
explain what is going on.

> mapcount > 1:
> * 1) your concern is that this can happen due to concurrent swapin
> * 2) cannot happen.
> * 3) your concern is that this can happen due to concurrent swapin

No, my concern about (1) is that IT IS WRONG.

"mapcount" means nothing for COW. I even gave you an example of
exactly where it means nothing. It's crazy. It's illogical. And it's
complicated as hell.

The fact that only one user maps a page is simply not meaningful. That
page can have other users that you don't know anything about, and that
don't show up in the mapcount.

That page can be swapcached, in which case mapcount can change
radically in ways that you earlier indicated cannot happen. You were
wrong.

But even if you fix that - by taking the page lock in every single
place - there are still *other* users that for all you know may want
the old contents. You don't know.

The only thing that says "no other users" is the page count. Not the mapcount.

In other words, I claim that

 (a) mapcount is fundamentally the wrong thing to test. You can be the
only mapper, without being the "owner" of the page.

 (b) it's *LITERALLY* the direct and present source of that bug in the
testcase you added, where a page with a mapcount of 1 has other
concurrent users and needs to be COW'ed but isn't.

 (c) it's complicated and expensive to calculate (where one big part
of the expense is the page lock synchronization requirements, but
there are others)

And this all happens for that "case (1)", which is the worst and
scariest of them all.

In contrast to that, your argument that "(2) cannot happen" is a total
non-argument. (2) isn't the problem.

And I claim that (3) can happen because you're testing the wrong
counter, so who knows if the COW is wrong or not?

> I am completely missing how 2) or 3) could *ever* be handled properly
> for page_count != 1. 3) is obviously more important and gives me nightmares.

Ok, so if I tell you how (2) and (3) are handled properly, you will
just admit you were wrong?

Here's how they are handled properly with page counts. I have told you
this before, but I'll summarize:

 (2) is handled semantically properly by definition - it may be
"unnecessary", but it has no semantic meaning

This is an IMPORTANT thing to realize. The fact is, (2) is not in the
same class as (1) or (3).

And honestly - we've been doing this for all the common cases already
since at least 5.9, and your performance argument simply has not
really reared its head.  Which makes the whole argument moot. I claim
that it simplifies lots of common operations and avoids having to
serialize on a lock that has been a real and major problem. You claim
it's extra overhead and can cause extra COW events. Neither of us has any
numbers worth anything, but at least I can point to the fact that all
the *normal* VM paths have been doing the thing I advocate for many
releases now, and the sky most definitely is NOT falling.

So that only leaves (3).

Handling (3) really is so conceptually simple that I feel silly for
repeating it: if you don't want a COW to happen, then you mark the
page as being not-COW.

That sounds so simple as to be stupid. But it really is the solution.
It's what that pinning logic does, and keeps that "page may be pinned"
state around, and then operations like fork() that would otherwise
create a COW mapping of it will just not do it.

So that incredibly simple approach does require actual code: it
requires that explicit "fork() needs to copy instead of COW" code, it
requires that "if it's pinned, we don't make a new swapcache entry out
of it". So it's real code, and it's a real issue, but it's
conceptually absolutely trivial, and the code is usually really simple
to understand too.
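
For the fork() side, the heart of it is a check like this (roughly
paraphrasing what mm/memory.c does today, details omitted):

        /* copy_present_pte(): */
        if (unlikely(page_needs_cow_for_dma(src_vma, page)))
                /* pinned: give the child its own copy right now */
                return copy_present_page(...);

        /* otherwise write-protect both sides and COW lazily later */
        ptep_set_wrprotect(src_mm, addr, src_pte);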

So you have a *trivial* concept, and you have simple code that could
be described to a slightly developmentally challenged waterfowl.  If
you're one of the programmers doing the "explain your code to a rubber
ducky", you can look at code like this:

                /*
                 * Anonymous process memory has backing store?
                 * Try to allocate it some swap space here.
                 * Lazyfree page could be freed directly
                 */
                if (PageAnon(page) && PageSwapBacked(page)) {
                        if (!PageSwapCache(page)) {
                                if (!(sc->gfp_mask & __GFP_IO))
                                        goto keep_locked;
                                if (page_maybe_dma_pinned(page))
                                        goto keep_locked;

and you can explain that page_maybe_dma_pinned() test to your rubber
ducky, and that rubber ducky will literally nod its head. It gets it.

To recap:
 (1) is important, and page_count() is the only thing that guarantees
"you get full access to a page only when it's *obviously* exclusively
yours".
 (2) is NOT important, but could be a performance issue, but we have
real data from the past year that it isn't.
 (3) is important, and has a really spectacularly simple conceptual
fix with quite simple code too.

In contrast, with the "mapcount" games you can't even explain why they
should work, and the patches I see are actively buggy because
everything is so subtle.

                  Linus
Linus Torvalds Dec. 18, 2021, 7:52 p.m. UTC | #44
On Sat, Dec 18, 2021 at 11:21 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> To recap:
>  (1) is important, and page_count() is the only thing that guarantees
> "you get full access to a page only when it's *obviously* exclusively
> yours".
>  (2) is NOT important, but could be a performance issue, but we have
> real data from the past year that it isn't.
>  (3) is important, and has a really spectacularly simple conceptual
> fix with quite simple code too.
>
> In contrast, with the "mapcount" games you can't even explain why they
> should work, and the patches I see are actively buggy because
> everything is so subtle.

So to challenge you, please explain exactly how mapcount works to
solve (1) and (3), and how it incidentally guarantees that (2) doesn't
happen.

And that really involves explaining the actual code too. I can explain
the high-level concepts in literally a couple of sentences.

For (1), "the page_count()==1 guarantees you are the only owner, so a
COW event can re-use the page" really explains it. And the code is
pretty simple too. There's nothing subtle about "goto copy" when
pagecount is not 1. And even the locking is simple: "we hold the page
table lock, we found a page, it has only one ref to it, we own it"

Our VM is *incredibly* complicated. There really are serious
advantages to having simple rules in place.

And for (2), the simple rule is "yeah, we can cause spurious cow
events". That's not only simple to explain, it's simple to code for.
Suddenly you don't need to worry. "Copying the page is always safe".
That's a really really powerful statement.

Now, admittedly (3) is the one that ends up being more complicated,
but the *concept* sure is simple. "If you don't want to COW this page,
then don't mark it for COW".

The *code* for (3) is admittedly a bit more complicated. The "don't
mark it for COW" is simple to say, but we do have that fairly odd
locking thing with fork() doing a seqcount_write_begin/end, and then
GIP does the read-seqcount thing with retry. So it's a bit unusual,
and I don't think we have that particular pattern anywhere else, but
it's one well-defined lock and while unusual it's not *complicated* as
far as kernel locking rules go. It's unusual and perhaps not trivial,
but in the end those seqcount code sequences are maybe 10 lines total,
and they don't interact with anything else.
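
Roughly (simplified from the current mm/memory.c and mm/gup.c):

        /* fork() side, with mmap_lock held for writing: */
        raw_write_seqcount_begin(&src_mm->write_protect_seq);
        ... copy page tables, write-protecting the ptes ...
        raw_write_seqcount_end(&src_mm->write_protect_seq);

        /* gup-fast side: */
        seq = raw_read_seqcount(&current->mm->write_protect_seq);
        if (seq & 1)
                return 0;       /* fork() in flight, take the slow path */
        ... lockless walk, pin pages ...
        if (read_seqcount_retry(&current->mm->write_protect_seq, seq))
                ... unpin everything and fall back to the slow path ...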

And yes, the "don't mark it for COW" means that write-protecting
something is special, mainly because we sadly do not have extra bits
in the page tables. It would be *really* easy if we could just hide
this "don't COW this page" in the page table. Truly trivial. We don't,
because of portability across different architectures ;(

So I'll freely give you that my (3) is somewhat painful, but it's
painful with a really simple concept.

And the places that get (3) wrong are generally places that nobody has
been able to care about. I didn't realize the problem with creating a
swap page after the fact for a while, so that commit feb889fb40fa
("mm: don't put pinned pages into the swap cache") came later, but
it's literally a very simple two-liner.

The commit message for commit feb889fb40fa may be worth reading. It
very much explains the spirit of the thing, and is much longer than
the trivial patch itself.

Simple and clear concepts matter. Code gets complicated even then, but
complex code with complex concepts is a bad combination.

              Linus
Nadav Amit Dec. 18, 2021, 9:48 p.m. UTC | #45
> On Dec 18, 2021, at 10:42 AM, Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> On Fri, Dec 17, 2021 at 07:38:39PM -0800, Linus Torvalds wrote:
>> On Fri, Dec 17, 2021 at 7:30 PM Nadav Amit <namit@vmware.com> wrote:
>>> 
>>> In such a case, I do think it makes sense to fail uffd-wp (when
>>> page_count() > 1), and in a prototype I am working on I do something
>>> like that.
>> 
>> Ack. If uffd-wp finds a page that is pinned, just skip it as not
>> write-protectable.
>> 
>> Because some of the pinners might be writing to it, of course - just
>> not through the page tables.
> 
> That doesn't address the qemu use case though. The RDMA pin is the
> 'coherent r/o pin' we discussed before, which requires that the pages
> remain un-write-protected and the HW DMA is read only.
> 
> The VFIO pin will enable dirty page tracking in the system IOMMU so it
> gets the same effect from qemu's perspective as the CPU WP is doing.
> 
> In these operations every single page of the guest will be pinned, so
> skip it just means userfault fd wp doesn't work at all.
> 
> Qemu needs some solution to be able to dirty track the CPU memory for
> migration..

My bad. I misunderstood the scenario.

Yes, I guess that you pin the pages early for RDMA registration, which
is also something you may do for IO-uring buffers. This would render
userfaultfd unusable.

I do not see how it can be solved without custom, potentially
complicated logic, which the page_count() approach wants to avoid.

The only thing I can think of is requiring the pinned regions to be
first madvise’d with MADV_DONTFORK and not COW’ing in such case.
But this would break existing code though.
Kirill A. Shutemov Dec. 18, 2021, 10:52 p.m. UTC | #46
On Fri, Dec 17, 2021 at 12:45:45PM -0800, Linus Torvalds wrote:
> On Fri, Dec 17, 2021 at 12:42 PM David Hildenbrand <david@redhat.com> wrote:
> >
> > > Then somebody else modified that page, and you got exactly what you
> > > asked for - a COW event. The original R/O pin has the original page
> > > that it asked for, and can read it just fine.
> >
> > Where in the code did I ask for a COW event? I asked for a R/O pin, not
> > any kind of memory protection.
> 
> Why didn't you ask for a shared pin, if that is what you want?
> 
> We already support that.
> 
> If you don't like the read-only pins, don't use them. It's that simple.

So you are saying that if a GUP user wants to see changes made by
userspace to the page after the GUP, it must ask for FOLL_WRITE, even if
it doesn't intend to write to the page?

That's news to me.

Or did I misunderstand you?
Linus Torvalds Dec. 18, 2021, 10:53 p.m. UTC | #47
On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit <namit@vmware.com> wrote:
>
> Yes, I guess that you pin the pages early for RDMA registration, which
> is also something you may do for IO-uring buffers. This would render
> userfaultfd unusable.

I think this is all on userfaultfd.

That code literally stole two of the bits from the page table layout -
bits that we could have used for better things.

And guess what? Because it required those two bits in the page tables,
and because that's non-portable, it turns out that UFFD_WP can only be
enabled and only works on x86-64 in the first place.

So UFFD_WP is fundamentally non-portable. Don't use it.

Anyway, the good news is that I think that exactly because uffd_wp
stole two bits from the page table layout, it already has all the
knowledge it needs to handle this entirely on its own. It's just too
lazy to do so now.

In particular, it has that special UFFD_WP bit that basically says
"this page is actually writable, but I've made it read-only just to
get the fault for soft-dirty".

And the hint here is that if the page truly *was* writable, then COW
just shouldn't happen, and all that the page fault code should do is
set soft-dirty and return with the page set writable.

And if the page was *not* writable, then UFFD_WP wasn't actually
needed in the first place, but the uffd code just sets it blindly.

Notice? It _should_ be just an operation based purely on the page
table contents, never even looking at the page AT ALL. Not even the
page count, much less some mapcount thing.

Done right, that soft-dirty thing could work even with no page backing
at all, I think.
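
Purely in page table terms that would be something like (an untested
sketch of the wp fault path):

        if (userfaultfd_pte_wp(vma, orig_pte)) {
                /* ...after the uffd notification has been handled: */
                entry = pte_clear_uffd_wp(orig_pte);
                entry = pte_mkwrite(pte_mkdirty(entry));
                entry = pte_mksoft_dirty(entry);
                set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
                return 0;       /* no page lookup, no COW */
        }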

But as far as I know, we've actually never seen a workload that does
all this, so.. Does anybody even have a test-case?

Because I do think that UFFD_WP really should never really look at the
page, and this issue is actually independent of the "page_count() vs
page_mapcount()" discussion.

(Somewhat related aside: Looking up the page is actually one of the
more expensive operations of a page fault and a lot of other page
table manipulation functions - it's where most of the cache misses
happen. That's true on the page fault side, but it's also true for
things like copy_page_range() etc)

                 Linus
Linus Torvalds Dec. 18, 2021, 11:05 p.m. UTC | #48
On Sat, Dec 18, 2021 at 2:52 PM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> So you are saying that if a GUP user wants to see changes made by
> userspace to the page after the GUP it must ask for FOLL_WRITE, even if it
> doesn't have intend to write to the page?

Yup. Put the onus very clearly on GUP.

It's a very special operation, and it's one of the operations that
cause a lot of problems for the VM code. It's by no means the _only_
one - we've got a lot of other things that cause issues - but I think
it's very clear that GUP is very very special, and nobody should say
"I want GUP to do whatever".

There are two cases for GUP:

 - look up the page as it is *NOW*

 - look up the page in order to see any future state on it (and
possibly modify it)

that "any future state" is a fundamentally much heavier operation.
It's the one that really *has* to get rid of any sharing, and it has
to make sure no sharing happens in the future either.

So if it is an anonymous page, it basically needs to act like a write.
Even if that page is then used only for reading.

Note that here, "if it's anonymous" is kind of a big deal. If it's
a shared file-mapped page, at that point it's "just another
reference". It's potentially problematic even in that case (think of
"truncate()" that tries to force-unmap all pages from VM's), but for
the shared case the answer is "if you truncate it and disassociate the
page from the mapping, it's _your_ problem".

And once it acts as a write, and once it does that COW and we have
exclusive access to it, it might as well be just writable and dirty.
You've done the expensive part already. You've forced it to be private
to that VM.

And this was all triggered by the user doing something very special,
so no amount of "but POSIX" or whatever matters.

GUP is not great. If you use GUP, you get to deal with the downsides.

               Linus
Nadav Amit Dec. 19, 2021, 12:19 a.m. UTC | #49
> On Dec 18, 2021, at 2:53 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Sat, Dec 18, 2021 at 1:49 PM Nadav Amit <namit@vmware.com> wrote:
>> 
>> Yes, I guess that you pin the pages early for RDMA registration, which
>> is also something you may do for IO-uring buffers. This would render
>> userfaultfd unusable.
> 
> I think this is all on usefaultfd.
> 
> That code literally stole two of the bits from the page table layout -
> bits that we could have used for better things.
> 
> And guess what? Because it required those two bits in the page tables,
> and because that's non-portable, it turns out that UFFD_WP can only be
> enabled and only works on x86-64 in the first place.
> 
> So UFFS_WP is fundamentally non-portable. Don't use it.

I have always felt that the PTE software-bits limit is very artificial.
We can just allocate two adjacent pages when needed, one for PTEs and
one for extra software bits. A software bit in the PTE can indicate
“extra software bits” are relevant (to save cache-misses), and a bit
in the PTEs' page-struct indicates whether there is an adjacent “extra
software bits” page.

I’ve done something similar once in a research project. It is
rather similar to what is done for PTI at the PGD level.


> 
> Anyway, the good news is that I think that exactly because uffd_wp
> stole two bits from the page table layout, it already has all the
> knowledge it needs to handle this entirely on its own. It's just too
> lazy to do so now.
> 
> In particular, it has that special UFFD_WP bit that basically says
> "this page is actually writable, but I've made it read-only just to
> get the fault for soft-dirty".
> 
> And the hint here is that if the page truly *was* writable, then COW
> just shouldn't happen, and all that the page fault code should do is
> set soft-dirty and return with the page set writable.
> 
> And if the page was *not* writable, then UFFD_WP wasn't actually
> needed in the first place, but the uffd code just sets it blindly.

I don’t think that I am following. The write-protection of UFFD means
that the userspace wants to intervene before anything else (including
COW).

UFFD_WP indications are recorded per PTE (i.e., not VMA). So if
userspace wants to intervene on write, it must use UFFD_WP even if
the page is write-protected. The kernel then has to keep the UFFD_WP
indication to call userspace upon a write.


> 
> Notice? It _should_ be just an operation based purely on the page
> table contents, never even looking at the page AT ALL. Not even the
> page count, much less some mapcount thing.
> 
> Done right, that soft-dirty thing could work even with no page backing
> at all, I think.
> 
> But as far as I know, we've actually never seen a workload that does
> all this, so.. Does anybody even have a test-case?
> 
> Because I do think that UFFD_WP really should never really look at the
> page, and this issue is actually independent of the "page_count() vs
> page_mapcount()" discussion.

I can think of two examples for reasonable flows of UFFD:

[ M = Monitor thread;  F = Faulting thread ]

(A) Userspace page-fault tracking (e.g., for memory migration):

1. M: WP memory.
2. F: WP page-fault: provide UFFD notification.
3. M: Unprotect the page.
4. M: Wake the faulting thread (usually as part of the unprotect)
5. F: Retry the page-fault (and handle COW).

(B) Userspace memory snapshots:

1. M: Write-protect memory.
2. M: Copy the memory to disk.
3. M: Write-unprotect memory  (e.g., speculatively as you expect a page
   to be written to and do not want to pay the #PF price).

   [ notice that the un-protection is logical, not really in the PTEs]

4. F: Get a page-fault (later) and handle it (because it might or
      might not need COW)

There may be “crazier” flows (e.g., wake the faulting thread and
emulate the instruction that triggered the write with ptrace), but
let’s put those aside.

IIUC, the solution you propose tries to address flows such as (A).

I am not sure whether the proposal is to change the write-protection
API to only provide notifications (i.e., not block until after the page-fault
as done today), but I do not see how it addresses (B).

I am not saying it is impossible, but I think that the solution would
complicate the code by making UFFD a special case.
Linus Torvalds Dec. 19, 2021, 12:35 a.m. UTC | #50
On Sat, Dec 18, 2021 at 4:19 PM Nadav Amit <namit@vmware.com> wrote:
>
> I have always felt that the PTE software-bits limit is very artificial.
> We can just allocate two adjacent pages when needed, one for PTEs and
> one for extra software bits. A software bit in the PTE can indicate
> “extra software bits” are relevant (to save cache-misses), and a bit
> in the PTEs' page-struct indicate whether there is adjacent “extra
> software bits” page.

Hmm. That doesn't sound very bad, no. And it would be nice to have
more software bits (and have them portably).

> I don’t think that I am following. The write-protection of UFFD means
> that the userspace wants to intervene before anything else (including
> COW).

The point I was making (badly) is that UFFD_WP is only needed for
the case where the pte isn't already non-writable for other reasons.

> UFFD_WP indications are recorded per PTE (i.e., not VMA).

The changing of those bits is basically a bastardized 'mprotect()',
and does already require the vma to be marked VM_UFFD_WP.

And the way you set (or clear) the bits is with a range operation. It
really could have been done with mprotect(), and with actual explicit
vma bits.

The fact that it now uses the page table bit is rather random. I think
it would actually be cleaner to make that userfaultfd_writeprotect
truly *be* a vma range.

Right now it's kind of "half this, half that".

Of course, it's possible that because of this situation, some users do
a lot of fine-grained VM_UFFD_WP setting, and they kind of expect to
not have issues with lots of vma fragments. So practical concerns may
have made the implementation set in stone.

(I have only ever seen the kernel side of uffd, not the actual user
side, so I'm not sure about the use patterns).

That said, your suggestion of a shadow sw page table bit thing would
also work. And it would solve some problems we have in core areas
(notably "page_special()" which right now has that
ARCH_HAS_PTE_SPECIAL thing).

It would make it really easy to have that "this page table entry is
pinned" flag too.

              Linus
Nadav Amit Dec. 19, 2021, 6:02 a.m. UTC | #51
> On Dec 18, 2021, at 4:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> (I have only ever seen the kernel side of uffd, not the actual user
> side, so I'm not sure about the use patterns).

I use it at a very fine granularity, and I suspect QEMU and CRIU do so
too.

> 
> That said, your suggestion of a shadow sw page table bit thing would
> also work. And it would solve some problems we have in core areas
> (notably "page_special()" which right now has that
> ARCH_HAS_PTE_SPECIAL thing).
> 
> It would make it really easy to have that "this page table entry is
> pinned" flag too.

I found my old messy code for the software-PTE thing.

I see that eventually I decided to hold a pointer to the “extra PTEs”
of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
pages approach but this code is long gone. ]

My rationale was that:

1. It does not bind you to having the same size for PTE and “extra-PTE”
2. The PMD-page struct is anyhow hot (since you acquired the PTL)
3. Allocating “extra-PTE” dynamically does not require rewiring the
   page-tables, which would require a TLB flush.

I think there is a place to hold a pointer in the PMD-page-struct
(_pt_pad_1, we just need to keep the lowest bit clear so the kernel
won’t mistake it for a compound page).
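
Something like this for the lookup (a hypothetical sketch; allocation,
field reuse and locking are all hand-waved):

static unsigned long *extra_pte_bits(pmd_t *pmd, unsigned long addr)
{
	struct page *table = pmd_page(*pmd);	/* the PTE-table page */
	unsigned long *extra = (unsigned long *)table->_pt_pad_1;

	if (!extra)	/* no “extra software bits” page allocated */
		return NULL;
	return &extra[pte_index(addr)];
}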

I still don’t know what exactly you have in mind for making use
of it for the COW issue. Keeping a pin-count (which requires
internal API changes for unpin_user_page() and friends?) or having
a “was ever pinned” sticky bit? And then changing
page_needs_cow_for_dma() to look at the PTE so copy_present_pte()
would break the COW eagerly?

Anyhow, I can clean it up and send (although it is rather simple
and I ignored many things, such as THP, remap, etc), but I am not
sure I have the time now to fully address the COW problem. I will
wait for Monday for David’s response.
John Hubbard Dec. 19, 2021, 8:01 a.m. UTC | #52
On 12/18/21 22:02, Nadav Amit wrote:
> 
>> On Dec 18, 2021, at 4:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>>
>> (I have only ever seen the kernel side of uffd, not the actual user
>> side, so I'm not sure about the use patterns).
> 
> I use it at a very fine granularity, and I suspect QEMU and CRIU do so
> too.
> 
>>
>> That said, your suggestion of a shadow sw page table bit thing would
>> also work. And it would solve some problems we have in core areas
>> (notably "page_special()" which right now has that
>> ARCH_HAS_PTE_SPECIAL thing).
>>
>> It would make it really easy to have that "this page table entry is
>> pinned" flag too.
> 
> I found my old messy code for the software-PTE thing.
> 
> I see that eventually I decided to hold a pointer to the “extra PTEs”
> of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> pages approach but this code is long gone. ]
> 
> My rationale was that:
> 
> 1. It does not bind you to having the same size for PTE and “extra-PTE”
> 2. The PMD-page struct is anyhow hot (since you acquired the PTL)
> 3. Allocating “extra-PTE” dynamically does not require rewiring the
>     page-tables, which would require a TLB flush.
> 
> I think there is a place to hold a pointer in the PMD-page-struct
> (_pt_pad_1, we just need to keep the lowest bit clear so the kernel
> won’t mistake it for a compound page).
> 
> I still don’t know what exactly you have in mind for making use
> of it for the COW issue. Keeping a pin-count (which requires
> internal API changes for unpin_user_page() and friends?) or having
> a “was ever pinned” sticky bit? And then changing
> page_needs_cow_for_dma() to look at the PTE so copy_present_pte()
> would break the COW eagerly?
> 
> Anyhow, I can clean it up and send (although it is rather simple
> and I ignored many things, such as THP, remap, etc), but I am not
> sure I have the time now to fully address the COW problem. I will
> wait for Monday for David’s response.
> 

Hi Nadav,

A couple of thoughts about this part of the design:

a) The PMD-page-struct approach won't help as much, because (assuming
that we're using it in an attempt to get a true, perfect pin count) you
are combining the pin counts of a PMD's worth of pages. OTOH...maybe
that actually *is* OK, assuming you don't overflow--except that you can
only answer the "is it dma-pinned?" question at a PMD level. That's a
contradiction of your stated desire above to have very granular control.

Also, because of not having bit 0 available in page._pt_pad_1, I think
the count would have to be implemented as adding and subtracting 2,
instead of 1 (in order to keep the value even), further reducing the
counter range.

b) If, instead, you used your older 2-adjacent pages approach, then
Linus' comment makes more sense here: we could use the additional struct
page to hold an exact pin count, per page. That way, you can actually
build a wrapper function such as:

     page_really_is_dma_pinned()

...and/or simply get a stronger "maybe" for page_maybe_dma_pinned().

Furthermore, this direction is extensible and supports solving other "I
am out of space in struct page" problems, at the cost of more pages, of
course.

As an aside, I find it instructive that we're talking about this
approach, instead of extending struct page. The lesson I'm taking away
is: allocating more space for some cases (2x pages) is better than
having *all* struct pages be larger than they are now.

Anyway, the pin count implementation would look somewhat like the
existing hpage_pincount, which similarly has ample space for a separate,
exact pin count. In other words, this sort of thing (mostly-pseudo
code):


diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..646761388025 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -938,6 +938,16 @@ static inline bool hpage_pincount_available(struct page *page)
  	return PageCompound(page) && compound_order(page) > 1;
  }

+static inline bool shadow_page_pincount_available(struct page *page)
+{
+	/*
+	 * TODO: Nadav: connect this up with the shadow page table
+	 * implementation, and return an appropriate answer.
+	 */
+
+	return false; // hardcoded for now, for compile testing
+}
+
  static inline int head_compound_pincount(struct page *head)
  {
  	return atomic_read(compound_pincount_ptr(head));
@@ -950,6 +960,13 @@ static inline int compound_pincount(struct page *page)
  	return head_compound_pincount(page);
  }

+static inline int shadow_page_pincount(struct page *page)
+{
+	VM_BUG_ON_PAGE(!shadow_page_pincount_available(page), page);
+
+	return atomic_read(shadow_page_pincount_ptr(page));
+}
+
  static inline void set_compound_order(struct page *page, unsigned int order)
  {
  	page[1].compound_order = order;
@@ -1326,6 +1343,9 @@ static inline bool page_maybe_dma_pinned(struct page *page)
  	if (hpage_pincount_available(page))
  		return compound_pincount(page) > 0;

+	if (shadow_page_pincount_available(page))
+		return shadow_page_pincount(page) > 0;
+
  	/*
  	 * page_ref_count() is signed. If that refcount overflows, then
  	 * page_ref_count() returns a negative value, and callers will avoid

c) The "was it ever pinned" sticky bit is not a workable concept, at the
struct page level. A counter is required, in order to allow pages to
live out their normal lives to their fullest potential. The only time we
even temporarily got away with this kind of stickiness was at a higher
level, and only per-process, not per-physical-page. Processes come and
go, but the struct pages are more or less forever, so once you mark one
sticky like this, it's out of play.

thanks,
David Hildenbrand Dec. 19, 2021, 8:43 a.m. UTC | #53
On 18.12.21 20:52, Linus Torvalds wrote:
> On Sat, Dec 18, 2021 at 11:21 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> To recap:
>>  (1) is important, and page_count() is the only thing that guarantees
>> "you get full access to a page only when it's *obviously* exclusively
>> yours".
>>  (2) is NOT important, but could be a performance issue, but we have
>> real data from the past year that it isn't.
>>  (3) is important, and has a really spectacularly simple conceptual
>> fix with quite simple code too.
>>
>> In contrast, with the "mapcount" games you can't even explain why they
>> should work, and the patches I see are actively buggy because
>> everything is so subtle.
> 
> So to challenge you, please explain exactly how mapcount works to
> solve (1) and (3), and how it incidentally guarantees that (2) doesn't
> happen.

Oh, there is no need for additional challenges, I've been challenged
with this problem for way too long already ;)

And again, I appreciate this discussion and your feedback. I want to get
all of this fixed ASAP, even if it's not going to be the way I proposed.
Any input is appreciated (as long as people don't scream at
me).


To get to your point: I thought about your remarks regarding the swapcount
and they make sense. The mapcount itself is not expressive enough to
distinguish mapcount == 1 from mapcount > 1.

What *would* work for precise COW, I think, is having both an active and
an inactive count instead of just the active (mapcount) part, whereby:

active: page tables mapping this page -> mapcount
inactive: page tables referencing this page via swap/migration entry

An alternative would be to just know that there are inactive users. We'd
have to read both values atomically in one shot.
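
A minimal sketch of reading both counts in one shot -- purely
hypothetical, with a made-up packed counter, not actual kernel code:

/*
 * Hypothetical: low 16 bits count "active" users (page tables mapping
 * the page), high 16 bits count "inactive" users (swap/migration
 * entries still referencing it). Unlike the real _mapcount, the counts
 * here are stored directly, without the -1 bias.
 */
#define MAP_INACTIVE_SHIFT	16
#define MAP_ACTIVE_MASK		0xffff

static inline bool page_precisely_exclusive(atomic_t *combined)
{
	unsigned int val = atomic_read(combined);	/* one atomic read */

	return (val & MAP_ACTIVE_MASK) == 1 &&
	       (val >> MAP_INACTIVE_SHIFT) == 0;
}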

There would be ways to store that information in the _mapcount but it
certainly adds a lot of complexity, and ...

> 
> And that really involves explaining the actual code too. I can explain
> the high-level concepts in literally a couple of sentences.
> 
> For (1), "the page_count()==1 guarantees you are the only owner, so a
> COW event can re-use the page" really explains it. And the code is
> pretty simple too. There's nothing subtle about "goto copy" when
> pagecount is not 1. And even the locking is simple: "we hold the page
> table lock, we found a page, it has only one ref to it, we own it"
> 
> Our VM is *incredibly* complicated. There really are serious
> advantages to having simple rules in place.

... you have a point here.

That said, I hope we can agree that "page_count" is not the
perfect solution. I hope we can at least tweak it for now to get rid of
3) Wrong COW.

> 
> And for (2), the simple rule is "yeah, we can cause spurious cow
> events". That's not only simple to explain, it's simple to code for.
> Suddenly you don't need to worry. "Copying the page is always safe".
> That's a really really powerful statement.
> 
> Now, admittedly (3) is the one that ends up being more complicated,
> but the *concept* sure is simple. "If you don't want to COW this page,
> then don't mark it for COW".
> 
> The *code* for (3) is admittedly a bit more complicated. The "don't
> mark it for COW" is simple to say, but we do have that fairly odd
> locking thing with fork() doing a seqcount_write_begin/end, and then
> GIP does the read-seqcount thing with retry. So it's a bit unusual,
> and I don't think we have that particular pattern anywhere else, but
> it's one well-defined lock and while unusual it's not *complicated* as
> far as kernel locking rules go. It's unusual and perhaps not trivial,
> but in the end those seqcount code sequences are maybe 10 lines total,
> and they don't interact with anything else.
> 
> And yes, the "don't mark it for COW" means that write-protecting
> something is special, mainly because we sadly do not have extra bits
> in the page tables. It would be *really* easy if we could just hide
> this "don't COW this page" in the page table. Truly trivial. We don't,
> because of portability across different architectures ;(
> 
> So I'll freely give you that my (3) is somewhat painful, but it's
> painful with a really simple concept.

Thanks for admitting that!

I might have had an idea yesterday on how to fix most of the issues
without relying on the mapcount, doing it similar (but slightly
different) as you propose here. Let's call it a mixture of the unsharing
approach and your approach. I cannot promise anything, so ...

... I'll go playing with it and share some details ASAP. At least it
sounds comparatively simple in my head.

> 
> And the places that get (3) wrong are generally places that nobody has
> been able to care about. I didn't realize the problem with creating a
> swap page after the fact for a while, so that commit feb889fb40fa
> ("mm: don't put pinned pages into the swap cache") came later, but
> it's literally a very simple two-liner.
> 

Just to give you my perspective:


Personally I don't care too much about 2). The only reasons why I somewhat
care about "Unnecessary COW" are:
* It's challenging for hugetlb use, as I explained. We might still want
  to use the mapcount there.
* It's mostly a symptom of our ultimately too-simple COW logic that
  effectively leads to 3).

While I do care about 1) (Missed COW) for our customers, I *especially*
care about 3) (Wrong COW) simply because silent memory corruption in
user space is not acceptable.

As you say, fixing 1) the "page_count" way might be easy, at least for THP.

Simple example: Have swapping enabled and register a fixed io_uring
buffer at the wrong time. Fixed io_uring buffers are now a commodity
feature for unprivileged user space ...

So that's why I so deeply care about all of this.
Matthew Wilcox Dec. 19, 2021, 11:30 a.m. UTC | #54
On Sun, Dec 19, 2021 at 12:01:59AM -0800, John Hubbard wrote:
> On 12/18/21 22:02, Nadav Amit wrote:
> > I found my old messy code for the software-PTE thing.
> > 
> > I see that eventually I decided to hold a pointer to the “extra PTEs”
> > of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> > pages approach but this code is long gone. ]
>
> a) The PMD-page-struct approach won't help as much, because (assuming
> that we're using it in an attempt to get a true, perfect pin count), you
> are combining the pin counts of a PMD's worth of pages. OTOH...maybe
> that actually *is* OK, assuming you don't overflow--except that you can
> only answer the "is it dma-pinned?" question at a PMD level. That's a
> contradiction of your stated desire above to have very granular control.
> 
> Also, because of not having bit 0 available in page._pt_pad_1, I think
> the count would have to be implemented as adding and subtracting 2,
> instead of 1 (in order to keep the value even), further reducing the
> counter range.

I think you misunderstood Nadav's approach.  He's talking about making
an extra side-allocation per PMD if you're using uffd, and storing
extra information in it.  I think it's a worthwhile approach.
Linus Torvalds Dec. 19, 2021, 5:27 p.m. UTC | #55
On Sat, Dec 18, 2021 at 10:02 PM Nadav Amit <namit@vmware.com> wrote:
>
> I found my old messy code for the software-PTE thing.
>
> I see that eventually I decided to hold a pointer to the “extra PTEs”
> of each page in the PMD-page-struct. [ I also implemented the 2-adjacent
> pages approach but this code is long gone. ]

Ok, I understand why that ends up being the choice, but it makes the
lookup too ugly and messy to be worth it, I think.

> I still don’t know what exactly you have in mind for making use
> out of it for the COW issue.

So the truly fundamental question for COW (and for a long-term GUP) is
fairly simple:

 - Is the page I have truly owned exclusively by this VM?

If that _isn't_ the case, you absolutely have to COW.

If that _is_ the case, you can re-use the page.

That is really it, boiled down to the pure basics.

And if you aren't sure whether you are the ultimate and only authority
over the page, then COW is the "safer" option, in that breaking
sharing is fundamentally better than over-sharing.

Now, the reason I like "page_count()==1" is that it is a 100% certain
way to know that you own the page absolutely and clearly.

There is no question what-so-ever about it.
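
Condensed to its core, the decision is just this (a simplified sketch
of the do_wp_page() logic, not the verbatim kernel code):

	/* holding the page table lock, having found an anon page */
	if (PageAnon(page) && !PageKsm(page) && page_count(page) == 1) {
		/* provably the only reference: reuse the page in place */
		wp_page_reuse(vmf);
	} else {
		/* anything else might be shared: copying is always safe */
		goto copy;
	}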

And the reason I hate "page_mapcount()==1" with a passion is that it
is NOTHING OF THE KIND. It is an entirely meaningless number. It
doesn't mean anything at all.

Even if the page mapcount is exactly right, it could easily and
trivially be a result of "fork, then unmap in either parent or child".

Now that page_mapcount() is unquestionably 1, but despite that, at
some point the page was shared by another VM, and you can not know
whether you really have exclusive access.

And that "even if page mapcount is exactly right" is a big issue in
itself, as I hope I've explained.

It requires page locking, it requires that you take swapcache users
into account, it is just a truly messy and messed up thing.

There really is absolutely no reason for page_mapcount to exist. It's
a mistake. We have it for completely broken historical reasons.

It's WRONG.

Now, if "page_count()==1" is so great, what is the issue? Problem solved.

No, while page_count()==1 is one really fundamental marker (unlike the
mapcount), it does have problems too.

Because yes, "page_count()==1" does mean that you have truly exclusive
ownership of the page, but the reverse is not true.

The way the current regular VM code handles that "the reverse is not
true" is by making "the page is writable" be the second way you can
say "you clearly have full ownership of the page".

So that's why you then have the "maybe_pinned()" thing in fork() and
in swap cache creation that keeps such a page writable, and doesn't do
the virtual copy and make it read-only again.

But that's also why it has problems with write-protect (whether
mprotect or uffd_wp).

Anyway, that was a long explanation to make the thinking clear, and
finally come to the actual answer to your question:

Adding another bit in the page tables - *purely* to say "this VM owns
the page outright" - would be fairly powerful. And fairly simple.

Then any COW event will set that bit - because when you actually COW,
the page you install is *yours*. No questions asked.

And fork() would simply clear that bit (unless the page was one of the
pinned pages that we simply copy).

See how simple that kind of concept is.
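
As a rough sketch of those two rules (hypothetical helpers, borrowing
the PageAnonExclusive() naming that shows up later in this thread):

	/* COW fault: the page we just copied is ours, no questions asked */
	SetPageAnonExclusive(new_page);

	/* fork(): making a virtual copy means the page is no longer
	 * exclusive (unless it was pinned, in which case we copy it) */
	get_page(page);
	ClearPageAnonExclusive(page);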

And please, see how INCREDIBLY BROKEN page_mapcount() is. It really
fundamentally is pure and utter garbage.  It in no way says "I have
exclusive ownership of this page", because even if the mapcount is 1
*now*, it could have been something else earlier, and some other VM
could have gotten a reference to it before the current VM did so.

This is why I will categorically NAK any stupid attempt to re-introduce
page_mapcount() for COW or GUP handling. It's unacceptably
fundamentally broken.

Btw, the extra bit doesn't really have to be in the page tables. It
could be a bit in the page itself. We could add another page bit that
we just clear when we do the "add ref to page as you make a virtual
copy during fork() etc".

And no, we can't use "pincount" either, because it's not exact. The
fact that the page count is so elevated that we think it's pinned is a
_heuristic_, and that's ok when you have the opposite problem, and ask
"*might* this page be pinned". You want to never get a false negative,
but it can get a false positive.

                 Linus
David Hildenbrand Dec. 19, 2021, 5:44 p.m. UTC | #56
> 
> Btw, the extra bit doesn't really have to be in the page tables. It
> could be a bit in the page itself. We could add another page bit that
> we just clear when we do the "add ref to page as you make a virtual
> copy during fork() etc".

^ I'm playing with the idea of using a page bit to express: "This page
is exclusive". On a CoW fault, if that bit is set, I can simply reuse
the page.

The semantics under which to set the bit are slightly
different from what you describe, and I'm playing with additional
unsharing (on GUP R/O) that avoids mapping the copied page similarly R/O
and simply sets the bit.

But the general idea could fly I think; the devil's in the details ...
Linus Torvalds Dec. 19, 2021, 5:44 p.m. UTC | #57
David, you said that you were working on some alternative model. Is it
perhaps along these same lines below?

I was thinking that a bit in the page tables to say "this page is
exclusive to this VM" would be a really simple thing to deal with for
fork() and swapout and friends.

But we don't have such a bit in general, since many architectures have
very limited sets of SW bits, and even when they exist we've spent
them on things like UFFD_WP.

But the more I think about the "bit doesn't even have to be in the
page tables", the more I think maybe that's the solution.

A bit in the 'struct page' itself.

For hugepages, you'd have to distribute said bit when you split the hugepage.

But other than that it looks quite simple: anybody who does a virtual
copy will inevitably be messing with the page refcount, so clearing
the "exclusive ownership" bit wouldn't be costly: the 'struct page'
cacheline is already getting dirtied.

Or what was the model you were implying in
your other email? You said

  "I might have had an idea yesterday on how to fix most of the issues
   without relying on the mapcount, doing it similar [..]"

but I didn't then reply to that email because I had just written this
other long email to Nadav.

           Linus

On Sun, Dec 19, 2021 at 9:27 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Adding another bit in the page tables - *purely* to say "this VM owns
> the page outright" - would be fairly powerful. And fairly simple.
>
> Then any COW event will set that bit - because when you actually COW,
> the page you install is *yours*. No questions asked.
>
 [ snip snip ]
>
> Btw, the extra bit doesn't really have to be in the page tables. It
> could be a bit in the page itself. We could add another page bit that
> we just clear when we do the "add ref to page as you make a virtual
> copy during fork() etc".
>
> And no, we can't use "pincount" either, because it's not exact. The
> fact that the page count is so elevated that we think it's pinned is a
> _heuristic_, and that's ok when you have the opposite problem, and ask
> "*might* this page be pinned". You want to never get a false negative,
> but it can get a false positive.
>
>                  Linus
David Hildenbrand Dec. 19, 2021, 5:59 p.m. UTC | #58
On 19.12.21 18:44, Linus Torvalds wrote:
> David, you said that you were working on some alternative model. Is it
> perhaps along these same lines below?
> 
> I was thinking that a bit in the page tables to say "this page is
> exclusive to this VM" would be a really simple thing to deal with for
> fork() and swapout and friends.
> 
> But we don't have such a bit in general, since many architectures have
> very limited sets of SW bits, and even when they exist we've spent
> them on things like UFFD_WP.
> 
> But the more I think about the "bit doesn't even have to be in the
> page tables", the more I think maybe that's the solution.
> 
> A bit in the 'struct page' itself.
> 

Exactly what I am prototyping right now.

> For hugepages, you'd have to distribute said bit when you split the hugepage.

Yes, that's one tricky part ...

> 
> But other than that it looks quite simple: anybody who does a virtual
> copy will inevitably be messing with the page refcount, so clearing
> the "exclusive ownership" bit wouldn't be costly: the 'struct page'
> cacheline is already getting dirtied.
> 
> Or what was the model you were implying in
> your other email? You said

I'm playing with the idea of not always setting the bit during COW but
only when GUP requests that the bit be set (either manually if possible
or via FOLL_UNSHARE). That's a bit trickier but allows for decoupling
that approach completely from the pin counter.

fork() is allowed to clear the bit if page_count() == 1 and share the
page. So no GUP -> no fork() performance changes (!). Otherwise the bit
can only vanish if we swapout/migrate the page: in which case there are
no additional GUP/references on the page that rely on it!

The bit can be set directly if we have to copy the page in the fault
handler (COW or unshare). Outside of COW/Unshare code, the bit can only
be set if page_count() == 1 and we sync against fork(). (and that's the
problem for gup-fast-only that I'm investigating right now, because it
would then always have to fall back to the slow variant if the bit isn't
already set)

So the bit can "vanish" whenever there is no additional reference on the
page. GUP syncs against fork() and can thereby set the bit/request to
set the bit.

I'm trying to decouple it completely from the pin counter to also
be able to handle FOLL_GET (O_DIRECT reproducers, unfortunately) correctly.

Not set in stone, just an idea I'm playing with right now ... and I
have to triple-check if
* page is PTE mapped in the page table I'm walking
* page_count() == 1
really means "this is the only reference". I do strongly believe
so .. :)
Matthew Wilcox Dec. 19, 2021, 9:12 p.m. UTC | #59
On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
> On 19.12.21 18:44, Linus Torvalds wrote:
> > David, you said that you were working on some alternative model. Is it
> > perhaps along these same lines below?
> > 
> > I was thinking that a bit in the page tables to say "this page is
> > exclusive to this VM" would be a really simple thing to deal with for
> > fork() and swapout and friends.
> > 
> > But we don't have such a bit in general, since many architectures have
> > very limited sets of SW bits, and even when they exist we've spent
> > them on things like UFFD_WP.
> > 
> > But the more I think about the "bit doesn't even have to be in the
> > page tables", the more I think maybe that's the solution.
> > 
> > A bit in the 'struct page' itself.
> > 
> 
> Exactly what I am prototyping right now.
> 
> > For hugepages, you'd have to distribute said bit when you split the hugepage.
> 
> Yes, that's one tricky part ...

That part shouldn't be that tricky ...

Can we get rid of ->mapcount altogether?  Three states:
 - Not mapped
 - Mapped exactly once
 - Possibly mapped more than once

I appreciate "Not mapped" is not a state that anon pages can
meaningfully have (maybe when they go into the swap cache?)

And this information would only be present on the head page (ie stored
per folio).  If one VMA has multiple PTEs that map the same folio,
then hopefully that only counts as mapped once.

I must admit about half this conversation is going over my head.  I need
more time to understand all the constraints than there is between emails
:-)
Linus Torvalds Dec. 19, 2021, 9:27 p.m. UTC | #60
On Sun, Dec 19, 2021 at 1:12 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Can we get rid of ->mapcount altogether?  Three states:
>  - Not mapped
>  - Mapped exactly once
>  - Possibly mapped more than once

I don't think even that is useful. We should get rid of mapcount entirely.

It doesn't actually help to know "mapped exactly once", exactly
because even when that's true, there may be non-mapped references to
the page.

Mapped references really aren't that special in general.

One case where it *can* be special is on virtually indexed cache
architectures, where "is this mapped anywhere else" can be an issue
for cache flushing.

There the page_mapcount() can actually really matter, but it's such an
odd case that I'm not convinced it should be something the kernel VM
code should bend over backwards for.

And the count could be useful for 'rmap' operations, where you can
stop walking the rmap once you've found all mapped cases (page
migration being one case of this). But again, I'm not convinced the
serialization is worth it, or that it's a noticeable win.

However, I'm not 100% convinced it's worth it even there, and I'm not
sure we necessarily use it there.

So in general, I think page_mapcount() can be useful as a count for
those things that are _literally_ about "where is this page mapped".
Page migration, virtual cache coherency, things like that can
literally be about "how many different virtual mappings can we find".

It's just that pages can have a number of non-mapped users too, so
mapcount isn't all that meaningful in general.

And you can look it up with rmap too, and so I do think it would be
worth discussing whether we really should strive to maintain
'mapcount' at all.

> I appreciate "Not mapped" is not a state that anon pages can
> meaningfully have (maybe when they go into the swap cache?)

Absolutely. And we can keep references around to an anonymous page
even without it having any mapping or swap cache at all (ie "gup +
unmap").

So "Not mapped at all" is a possible case, without the page being free'd.

                Linus
Matthew Wilcox Dec. 19, 2021, 9:47 p.m. UTC | #61
On Sun, Dec 19, 2021 at 01:27:07PM -0800, Linus Torvalds wrote:
> On Sun, Dec 19, 2021 at 1:12 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Can we get rid of ->mapcount altogether?  Three states:
> >  - Not mapped
> >  - Mapped exactly once
> >  - Possibly mapped more than once
> 
> I don't think even that is useful. We should get rid of mapcount entirely.
> 
> It doesn't actually help to know "mapped exactly once", exactly
> because even when that's true, there may be non-mapped references to
> the page.
> 
> Mapped references really aren't that special in general.
> 
> One case where it *can* be special is on virtually indexed cache
> architectures, where "is this mapped anywhere else" can be an issue
> for cache flushing.
> 
> There the page_mapcount() can actually really matter, but it's such an
> odd case that I'm not convinced it should be something the kernel VM
> code should bend over backwards for.
> 
> And the count could be useful for 'rmap' operations, where you can
> stop walking the rmap once you've found all mapped cases (page
> migration being one case of this). But again, I'm not convinced the
> serialization is worth it, or that it's a noticeable win.
> 
> However, I'm not 100% convinced it's worth it even there, and I'm not
> sure we necessarily use it there.
> 
> So in general, I think page_mapcount() can be useful as a count for
> those things that are _literally_ about "where is this page mapped".
> Page migration, virtual cache coherency, things like that can
> literally be about "how many different virtual mappings can we find".
> 
> It's just that pages can have a number of non-mapped users too, so
> mapcount isn't all that meaningful in general.
> 
> And you can look it up with rmap too, and so I do think it would be
> worth discussing whether we really should strive to maintain
> 'mapcount' at all.

Yes, agreed, I was thinking that we could use "not mapped at all"
as an optimisation to avoid doing rmap walks.  eg __unmap_and_move().

Perhaps more interestingly in truncate_cleanup_page():
        if (page_mapped(page))
                unmap_mapping_page(page);
where we can skip the i_mmap rbtree walk if we know the page isn't
mapped.  I'd be willing to give up that optimisation if we had "this
page was never mapped" (ie if page_mapped() was allowed to return
false positives).
Linus Torvalds Dec. 19, 2021, 9:53 p.m. UTC | #62
On Sun, Dec 19, 2021 at 1:48 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Yes, agreed, I was thinking that we could use "not mapped at all"
> as an optimisation to avoid doing rmap walks.  eg __unmap_and_move().

So the thing is, it's a very dodgy optimization for a rather simple
reason: what if somebody pages the page in?

So even "not mapped at all" is questionable.

You have to check that it's also not a swapcache page, and hold the
page lock for that check, at the very least.

And by then, you're really in a very unusual situation - and my gut
feel says not one worth optimizing for (because anon pages are
_usually_ mapped at least once).

But I dunno - it might depend on your load. Maybe you have some very
special load that happens to trigger this case a lot?

              Linus
Matthew Wilcox Dec. 19, 2021, 10:02 p.m. UTC | #63
On Sun, Dec 19, 2021 at 01:53:36PM -0800, Linus Torvalds wrote:
> On Sun, Dec 19, 2021 at 1:48 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Yes, agreed, I was thinking that we could use "not mapped at all"
> > as an optimisation to avoid doing rmap walks.  eg __unmap_and_move().
> 
> So the thing is, it's a very dodgy optimization for a rather simple
> reason: what if somebody pages the page in?
> 
> So even "not mapped at all" is questionable.
> 
> You have to check that it's also not a swapcache page, and hold the
> page lock for that check, at the very least.
> 
> And by then, you're really in a very unusual situation - and my gut
> feel says not one worth optimizing for (because anon pages are
> _usually_ mapped at least once).

I'd like to get rid of ->mapcount for file pages too.  And those are
definitely never mapped in the majority of cases.
Linus Torvalds Dec. 19, 2021, 10:12 p.m. UTC | #64
On Sun, Dec 19, 2021 at 2:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> I'd like to get rid of ->mapcount for file pages too.  And those are
> definitely never mapped in the majority of cases.

Fair enough.

You'd probably be better off checking "is this mapping mapped" though.
Because otherwise you have to get the page lock to serialize each
page.

               Linus
Matthew Wilcox Dec. 19, 2021, 10:26 p.m. UTC | #65
On Sun, Dec 19, 2021 at 02:12:04PM -0800, Linus Torvalds wrote:
> On Sun, Dec 19, 2021 at 2:02 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > I'd like to get rid of ->mapcount for file pages too.  And those are
> > definitely never mapped in the majority of cases.
> 
> Fair enough.
> 
> You'd probably be better off checking "is this mapping mapped" though.
> Because otherwise you have to get the page lock to serialize each
> page.

Truncate already has the page locked, eg
truncate_inode_pages_range()
  find_lock_entries()
  truncate_cleanup_page()
    if (page_mapped(page))
      unmap_mapping_page(page)

I think anyone calling unmap_mapping_page() really ought to have the
page lock.  Oh, we actually have an assert already to that effect ;-)
        VM_BUG_ON(!PageLocked(page));
Matthew Wilcox Dec. 20, 2021, 6:37 p.m. UTC | #66
On Sun, Dec 19, 2021 at 09:12:01PM +0000, Matthew Wilcox wrote:
> Can we get rid of ->mapcount altogether?  Three states:

This might be a step in the right direction?
Matthew Wilcox Dec. 20, 2021, 6:52 p.m. UTC | #67
On Mon, Dec 20, 2021 at 06:37:30PM +0000, Matthew Wilcox wrote:
> +++ b/mm/memory.c
> @@ -3626,7 +3626,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>  	inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
>  	dec_mm_counter_fast(vma->vm_mm, MM_SWAPENTS);
>  	pte = mk_pte(page, vma->vm_page_prot);
> -	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page, NULL)) {
> +	if ((vmf->flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
>  		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
>  		vmf->flags &= ~FAULT_FLAG_WRITE;
>  		ret |= VM_FAULT_WRITE;
[...]
> @@ -1673,17 +1665,14 @@ static int page_trans_huge_map_swapcount(struct page *page, int *total_mapcount,
>   * reuse_swap_page() returns false, but it may be always overwritten
>   * (see the other implementation for CONFIG_SWAP=n).
>   */
> -bool reuse_swap_page(struct page *page, int *total_map_swapcount)
> +bool reuse_swap_page(struct page *page)
>  {
> -	int count, total_mapcount, total_swapcount;
> +	int count, total_swapcount;
>  
>  	VM_BUG_ON_PAGE(!PageLocked(page), page);
>  	if (unlikely(PageKsm(page)))
>  		return false;
> -	count = page_trans_huge_map_swapcount(page, &total_mapcount,
> -					      &total_swapcount);
> -	if (total_map_swapcount)
> -		*total_map_swapcount = total_mapcount + total_swapcount;
> +	count = page_trans_huge_map_swapcount(page, &total_swapcount);
>  	if (count == 1 && PageSwapCache(page) &&
>  	    (likely(!PageTransCompound(page)) ||
>  	     /* The remaining swap count will be freed soon */

It makes me wonder if reuse_swap_page() can also be based on refcount
instead of mapcount?
Linus Torvalds Dec. 20, 2021, 7:15 p.m. UTC | #68
On Mon, Dec 20, 2021 at 10:37 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> This might be a step in the right direction?
>
> Subject: [PATCH] mm: reuse_swap_page() no longer needs to return map_swapcount

Well, that patch seems to be a no-op removal of dead code, so absolutely yes.

That said, I think it would be good to split it up. I looked at that
patch and went "is that really a no-op" to the point of recreating it.

I think it would be good to make it multiple patches that are each
individually trivial. IOW, start with

 (1) remove second argument to reuse_swap_page() that is always NULL,
without making any other changes

 (2) that now made 'total_mapcount' unused in reuse_swap_page(),
remove it as an argument from page_trans_huge_map_swapcount()

 (3) that now made 'total_mapcount' unused in
page_trans_huge_mapcount(), remove it as an argument there too.

because as it stands, that patch of yours looks like it is changing a
lot of things, and I think it would be clearer to remove one thing at
a time as it becomes obviously not used.

Hmm?

           Linus
Linus Torvalds Dec. 20, 2021, 7:38 p.m. UTC | #69
On Mon, Dec 20, 2021 at 10:53 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> It makes me wonder if reuse_swap_page() can also be based on refcount
> instead of mapcount?

I suspect it doesn't even need refcount.

For regular pages, after we've copied the page, all we do right now is

                if (page_copied)
                        free_swap_cache(old_page);

which is basically just an optimistic trylock_page() followed by
try_to_free_swap().
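
(For reference, free_swap_cache() is roughly this, simplified:)

static void free_swap_cache(struct page *page)
{
	if (PageSwapCache(page) && !page_mapped(page) &&
	    trylock_page(page)) {
		try_to_free_swap(page);
		unlock_page(page);
	}
}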

And that then pretty much simply checks "are there any swap users
left" and deletes it from the swap cache if not.

The "free_swap_cache()" thing is actually just an optimization to
avoid having memory pressure do it later.  So it doesn't have to be
exact.

In fact, I thought that swap is so unusual that it's not even needed
at all, but I was wrong. See how this was re-introduced in commit
f4c4a3f48480 ("mm: free idle swap cache page after COW") because yes,
some loads still have swap space allocated.

In theory, it would probably be a good idea at COW time to see if the
page ref is 2, and if it's a swap cache page, and try to do that swap
cache removal even earlier, so that the page actually gets re-used
(instead of copied and then the swap entry removed).
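
(A sketch of that idea -- hypothetical placement in the COW path, not
actual code:)

	/* page ref 2 == "us + the swap cache": try dropping the cache */
	if (PageSwapCache(page) && page_count(page) == 2 &&
	    trylock_page(page)) {
		try_to_free_swap(page);
		unlock_page(page);
	}
	if (page_count(page) == 1) {
		/* now provably exclusive: reuse instead of copying */
		wp_page_reuse(vmf);
	}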

But swap is such a non-issue these days that I doubt it matters, and
it's probably better to keep the swap handling in the unusual path.

So mapcount and refcount aren't what matters for the swap cache.

The swap count obviously *does* matter - because it means that some
mapping has a reference to this swap entry (not as a page, but as an
actual swap pointer).

But the mapcount is irrelevant - any users that have the swap page
actually mapped don't actually need it to be a swapcache page.

Even the refcount doesn't really matter, afaik. The only "refcount" we
care about is that swapcount - that's what actually reference counts
the swap entries.

try_to_free_swap() does check for one particular kind of reference: it
does a check for PageWriteback(). We don't want to remove the thing
from the swap cache if it's under active IO.

(This codepath does need the page lock, though, thus all those
"page_trylock()" things).

                   Linus
Matthew Wilcox Dec. 20, 2021, 9:02 p.m. UTC | #70
On Mon, Dec 20, 2021 at 11:15:14AM -0800, Linus Torvalds wrote:
> Well, that patch seems to be a no-op removal of dead code, so absolutely yes.
> 
> That said, I think it would be good to split it up. I looked at that
> patch and went "is that really a no-op" to the point of recreating it.
> 
> I think it would be good to make it multiple patches that are each
> individally trivial. IOW, start with
> 
>  (1) remove second argument to reuse_swap_page() that is always NULL,
> without making any other changes
> 
>  (2) that now made 'total_mapcount' unused in reuse_swap_page(),
> remove it as an argument from page_trans_huge_map_swapcount()
> 
>  (3) that now made 'total_mapcount' unused in
> page_trans_huge_mapcount(), remove it as an argument there too.

Hah, that was actually how I did it originally (without actually
committing at each step, and with a few "Oh, hang on, now we can avoid
calculating this too" stops and restarts along the way), but I thought
it all hung together logically as a single change.  It's hard to see
things from the other person's perspective at times.
Linus Torvalds Dec. 20, 2021, 9:27 p.m. UTC | #71
On Mon, Dec 20, 2021 at 1:02 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> Hah, that was actually how I did it originally (without actually
> committing at each step, and with a few "Oh, hang on, now we can avoid
> calculating this too" stops and restarts along the way), but I thought
> it all hung together logically as a single change.  It's hard to see
> things from the other person's perspective at times.

In just about any other area, I wouldn't mind one bigger patch that
just removes code that isn't used.

But when it's in the vm code, and it's pretty grotty, I do prefer
seeing three patches where it's individually much easier to see that
"yeah, this doesn't actually change anything at all".

The combined patch may be exactly the same thing, it's just much
harder to see that "oh, now it's not used any more".

That was perhaps especially true since a number of the changes also
ended up doing statement simplification when the old layout made no
sense any more with part of the results not used.

So your 3-patch series was much easier to look at and go "Yeah, I
believe each of these patches is a no-op".

So ACK on all those patches.

          Linus
Jason Gunthorpe Dec. 21, 2021, 1:03 a.m. UTC | #72
On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:

> handler (COW or unshare). Outside of COW/Unshare code, the bit can only
> be set if page_count() == 1 and we sync against fork(). (and that's the
> problem for gup-fast-only that I'm investigating right now, because it
> would then always have to fall back to the slow variant if the bit isn't
> already set)

I'm having a hard time imagining how gup_fast can maintain any sort of
bit - it lacks all forms of locks so how can we do an atomic test and
set between two pieces of data?

I think the point of Linus's plan really is that the new bit is
derived from page_count, we get the set the new bit when we observe
page_count==1 in various situations and we clear the new bit whenever
we write protect with the intent to copy.

GUP doesn't interact with this bit. A writable page would still be the
second way you can say "you clearly have full ownership of the page",
so GUP just checks writability and bumps the refcount. The challenge
of GUP remains to sanely sequence it with things that are doing WP.

The elevated GUP refcount prevents the page from getting the bit set
again, regardless of what happens to it.

Then on the WP sides.. Obviously we clear the bit when applying a WP
for copy. So all the bad GUP cases are halted now, as with a cleared
bit and a != 1 refcount COW must happen.

Then, it is also the case that most often a read-only page will have
this bit cleared, UFFD WP being the exception.

UFFD WP works fine as it will have the bit set in the cases we care
about and COW will not happen.

If the bit is not set then everything works as it does today and you get
extra COWs. We still have to fix the things that are missing the extra
COWs so that they check the page ref.

It seems this new bit is acting as a 'COW disable', so, the accuracy
of COW vs GUP&speculative pagerefs now relies on setting the bit as
aggressively as possible when it is safe and cheap to do so.

If I got it right this is why it is not just mapcount reduced to 1
bit. It is quite different, even though "this VM owns the page
outright" sure sounds like "mapcount == 1"..

It seems like an interesting direction - the security properties seem
good as we only have to take care of sites applying WP to decide what
to do with the extra bit, and all places that set the bit to 1 do so
after testing refcount under various locks preventing PTE WP.

That just leaves the THP splitting.. I suppose we get the PTL, then
compute the current value of the new bit based on refcount and diffuse
it to all tail pages, then update the PMD and release the PTL. Safe
against concurrent WP - don't need DoubleMap horrors because it isn't
a counter.
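
Roughly, as a sketch (hypothetical helper names):

	/* at split time, under the PTL: compute once, diffuse to tails */
	bool exclusive = PageAnonExclusive(head);
	int i;

	for (i = 1; i < HPAGE_PMD_NR; i++)
		if (exclusive)
			SetPageAnonExclusive(head + i);
	/* ... then rewrite the PMD as PTEs and release the PTL */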

> Not set in stone, just an idea I'm playing with right now ... and I
> have to triple-check if
> * page is PTE mapped in the page table I'm walking
> * page_count() == 1
> really means "this is the only reference". I do strongly believe
> so .. :)

AFAIK the only places that can break this are places putting struct
page memory into special PTEs. Which is horrific and is just bugs, but
I think I've seen it from time to time :(

ZONE_DEVICE is also messed up, of course, but that is just more
reasons ZONE_DEVICE refcounting needs fixing and you should ignore it.

Jason
Matthew Wilcox Dec. 21, 2021, 3:29 a.m. UTC | #73
On Mon, Dec 20, 2021 at 09:03:12PM -0400, Jason Gunthorpe wrote:
> That just leaves the THP splitting.. I suppose we get the PTL, then
> compute the current value of the new bit based on refcount and diffuse
> it to all tail pages, then update the PMD and release the PTL. Safe
> against concurrent WP - don't need DoubleMap horrors because it isn't
> a counter.

One of the things I've been trying to figure out is how we do
can_split_huge_page().  Maybe an rmap walk to figure out how many
refcounts we would subtract if we did unmap it from everywhere it's
currently mapped?  (just to be clear, we call unmap_page() as the
next thing, so I don't mind warming up the rbtree cachelines
if it's mapped anywhere)
David Hildenbrand Dec. 21, 2021, 8:58 a.m. UTC | #74
On 21.12.21 02:03, Jason Gunthorpe wrote:
> On Sun, Dec 19, 2021 at 06:59:51PM +0100, David Hildenbrand wrote:
> 
>> handler (COW or unshare). Outside of COW/Unshare code, the bit can only
>> be set if page_count() == 1 and we sync against fork(). (and that's the
>> problem for gup-fast-only that I'm investigating right now, because it
>> would then always have to fall back to the slow variant if the bit isn't
>> already set)
> 

[in the meantime I figured out which pageflag we can reuse for anon
pages, which is at least one step in the right direction]

> I'm having a hard time imagining how gup_fast can maintain any sort of
> bit - it lacks all forms of locks so how can we do an atomic test and
> set between two pieces of data?

And exactly that is to be figured out.

Note that I am also trying to make any kind of R/O pin on an anonymous
page work as expected, to fix any kind of GUP after fork() and
GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
similarly has to make sure that the page is exclusive -- even if it's
mapped R/O (!).

In the pagefault handler we can then always reuse a PageAnonExclusive()
page, because we know it's exclusive and it will stay exclusive because
concurrent fork() isn't possible.

> 
> I think the point of Linus's plan really is that the new bit is
> derived from page_count: we get to set the new bit when we observe
> page_count==1 in various situations and we clear the new bit whenever
> we write protect with the intent to copy.

Here is one problem I'm fighting with:


Assume we set the bit whenever we create a new anon page (either due to
COW, ordinary fault, unsharing request, ..., even if it's mapped R/O
first). We know the page is exclusive at that point because we created
it and fork() could not have happened yet.

fork() is the only code that can share the page between processes and
turn it non-exclusive.

We can only clear the bit during fork() -- to turn the exclusive page
into a shared one and map it R/O into both processes -- when we are sure
that *nobody* concurrently takes a reference on the page that would be
problematic (-> GUP).

So to clear the bit during fork, we have to
(1) Check against page_count == 1
(2) Synchronize against GUP

(2) is easy using the mmap_lock and the mm->write_protect_seq
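
(Roughly the same pattern fork() and gup-fast already use today for the
COW-vs-pinning race:)

	/* fork() side (cf. copy_page_range()), mmap_lock held for write: */
	mmap_assert_write_locked(src_mm);
	raw_write_seqcount_begin(&src_mm->write_protect_seq);
	/* ... write-protect the PTEs / clear the bit ... */
	raw_write_seqcount_end(&src_mm->write_protect_seq);

	/* gup-fast side (cf. internal_get_user_pages_fast()): */
	seq = raw_read_seqcount(&current->mm->write_protect_seq);
	if (seq & 1)	/* fork() in progress: fall back to slow GUP */
		return -EAGAIN;
	/* ... lockless page table walk ... */
	if (read_seqcount_retry(&current->mm->write_protect_seq, seq)) {
		/* raced with fork(): unpin and fall back to slow GUP */
		unpin_user_pages(pages, nr_pinned);
		return -EAGAIN;
	}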


BUT, it would mean that whenever we fork() and there is one additional
reference on a page (even if it's from the swapcache), we would slow
down fork() even if there was never any GUP. This would apply to any
process out there that does a fork() ...


So the idea is to mark a page only exclusive as soon as someone needs
the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN
or selected FOLL_GET like O_DIRECT). This can happen in my current
approach using two ways:

(1) Set the bit when we know we are the only users

We can set PageAnonExclusive() in case *we sync against fork* and the
page cannot get unmapped (pt lock) when:
* The page is mapped writable
* The page is mapped readable and page_count == 1

This should work during ordinary GUP in many cases.
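
(As a sketch, with hypothetical naming -- called with the PT lock held,
while synchronized against fork():)

static bool page_try_mark_anon_exclusive(struct page *page, pte_t pte)
{
	if (PageAnonExclusive(page))
		return true;
	if (pte_write(pte) || page_count(page) == 1) {
		/* nobody else can hold or grab a problematic reference */
		SetPageAnonExclusive(page);
		return true;
	}
	return false;	/* caller has to trigger an unshare fault */
}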

If we cannot set the page exclusive, we have to trigger a page fault.

(2) During pagefaults when FOLL_FAULT_UNSHARE is set.

GUP will set FOLL_FAULT_UNSHARE for a pagefault when required (again,
e.g., FOLL_PIN or selected FOLL_GET users) and when manually setting the
bit failed. The page fault will then try once again to set the bit if
there is a page mapped, and if that fails, do the COW/unshare and set
the bit.


The above should work fairly reliably with GUP. But indeed,
gup-fast-only is the problem. I'm still investigating what kind of
lightweight synchronization we could do against fork() such that we
wouldn't try setting a page PageAnonExclusive() while fork()
concurrently shares the page.

We could eventually use the page lock and do a try_lock(), both in
fork() and in gup-fast-only. fork() would only clear the bit if the
try_lock() succeeded. gup-fast-only would only be able to set the bit
and not fall back to the slow path if try_lock() succeeded.


But I'm still investigating if there are better alternatives ...

> 
> GUP doesn't interact with this bit. A writable page would still be the
> second way you can say "you clearly have full ownership of the page",
> so GUP just checks writability and bumps the refcount. The challenge
> of GUP remains to sanely sequence it with things that are doing WP.
> 
> The elevated GUP refcount prevents the page from getting the bit set
> again, regardless of what happens to it.
> 
> Then on the WP sides.. Obviously we clear the bit when applying a WP
> for copy. So all the bad GUP cases are halted now, as with a cleared
> bit and a != 1 refcount COW must happen.
> 
> Then, it is also the case that most often a read-only page will have
> this bit cleared, UFFD WP being the exception.
> 
> UFFD WP works fine as it will have the bit set in the cases we care
> about and COW will not happen.
> 
> If the bit is not set then everything works as it does today and you get
> extra COWs. We still have to fix the things that are missing the extra
> COWs so that they check the page ref.
> 
> It seems this new bit is acting as a 'COW disable', so, the accuracy
> of COW vs GUP&speculative pagerefs now relies on setting the bit as
> aggressively as possible when it is safe and cheap to do so.

But we really want to avoid degrading fork() for everybody that doesn't
do heavy GUP ...

> 
> If I got it right this is why it is not just mapcount reduced to 1
> bit. It is quite different, even though "this VM owns the page
> outright" sure sounds like "mapcount == 1"..
> 
> It seems like an interesting direction - the security properties seem
> good as we only have to take care of sites applying WP to decide what
> to do with the extra bit, and all places that set the bit to 1 do so
> after testing refcount under various locks preventing PTE WP.
> 
> That just leave the THP splitting.. I suppose we get the PTL, then
> compute the current value of the new bit based on refcount and diffuse
> it to all tail pages, then update the PMD and release the PTL. Safe
> against concurrent WP - don't need DoubleMap horrors because it isn't
> a counter.
> 
>> Not set in stone, just an idea I'm playing with right now ... and I
>> have to triple-check if
>> * page is PTE mapped in the page table I'm walking
>> * page_count() == 1
>> really means "this is the only reference". I do strongly believe
>> so .. :)
> 
> AFAIK the only places that can break this are places putting struct
> page memory into special PTEs. Which is horrific and is just bugs, but
> I think I've seen it from time to time :(

As we only care about anon pages, I think that doesn't apply. At least
that's what I hope.
Jason Gunthorpe Dec. 21, 2021, 2:28 p.m. UTC | #75
On Tue, Dec 21, 2021 at 09:58:32AM +0100, David Hildenbrand wrote:
> > I'm having a hard time imagining how gup_fast can maintain any sort of
> > bit - it lacks all forms of locks so how can we do an atomic test and
> > set between two pieces of data?
> 
> And exactly that is to be figured out.
> 
> Note that I am also trying to make any kind of R/O pin on an anonymous
> page work as expected, to fix any kind of GUP after fork() and
> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
> similarly has to make sure that the page is exclusive -- even if it's
> mapped R/O (!).

Why? AFAIK we don't have bugs here. If the page is RO and has an
elevated refcount it cannot be 'PageAnonExclusive' and so any place
that wants to drop the WP just cannot. What is the issue?

> BUT, it would mean that whenever we fork() and there is one additional
> reference on a page (even if it's from the swapcache), we would slow
> down fork() even if there was never any GUP. This would apply to any
> process out there that does a fork() ...

You mean because we'd copy?

Is this common? Linus' prior email was talking as though swap is so
rare we shouldn't optimize for it?
 
> So the idea is to mark a page only exclusive as soon as someone needs
> the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN
> or selected FOLL_GET like O_DIRECT). This can happen in my current
> approach using two ways:
> 
> (1) Set the bit when we know we are the only users
> 
> We can set PageAnonExclusive() in case *we sync against fork* and the
> page cannot get unmapped (pt lock) when:
> * The page is mapped writable
> * The page is mapped readable and page_count == 1

I'm still not sure I see that all this complexity is netting a gain?
 
> If we cannot set the page exclusive, we have to trigger a page fault.
> 
> (2) During pagefaults when FOLL_FAULT_UNSHARE is set.

Why do we need FOLL_FAULT_UNSHARE? AFAICT that was part of this
series because of mapcount; once the hugetlb COW is fixed to use
refcount properly, as Linus showed, the bugs this was trying to fix go
away.

And as discussed before it is OK if READ gup becomes incoherent, that
is its defined semantic.

> The above should work fairly reliably with GUP. But indeed,
> gup-fast-only is the problem. I'm still investigating what kind of
> lightweight synchronization we could do against fork() such that we
> wouldn't try setting a page PageAnonExclusive() while fork()
> concurrently shares the page.
> 
> We could eventually use the page lock and do a try_lock(), both in
> fork() and in gup-fast-only. fork() would only clear the bit if the
> try_lock() succeeded. gup-fast-only would only be able to set the bit
> and not fall back to the slow path if try_lock() succeeded.

I suspect that is worse than just having fork clear the bit and leave
GUP as-is. trylock is an atomic; clearing PageAnonExclusive does not
need to be atomic, as it is protected by the PTL.
 
> > Then on the WP sides.. Obviously we clear the bit when applying a WP
> > for copy. So all the bad GUP cases are halted now, as with a cleared
> > bit and a != 1 refcount COW must happen.

> But we really want to avoid degrading fork() for everybody that doesn't
> do heavy GUP ...

fork() already has to dirty the struct page cache line for refcount,
setting a flag seems minor at that point? At least we shouldn't
discard this nice understandable approach without a measurement....

Remember fork is already incrementing mapcount, so if we kill mapcount it is
a win for fork to replace the mapcount atomic with a non-atomic flag.

> > AFAIK the only places that can break this are places putting struct
> > page memory into special PTEs. Which is horrific and is just bugs, but
> > I think I've seen it from time to time :(
> 
> As we only care about anon pages, I think that doesn't apply. At least
> that's what I hope.

You are optimistic :)

Jason
David Hildenbrand Dec. 21, 2021, 3:19 p.m. UTC | #76
On 21.12.21 15:28, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 09:58:32AM +0100, David Hildenbrand wrote:
>>> I'm having a hard time imagining how gup_fast can maintain any sort of
>>> bit - it lacks all forms of locks so how can we do an atomic test and
>>> set between two pieces of data?
>>
>> And exactly that is to be figured out.
>>
>> Note that I am also trying to make any kind of R/O pin on an anonymous
>> page work as expected, to fix any kind of GUP after fork() and
>> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
>> similarly has to make sure that the page is exclusive -- even if it's
>> mapped R/O (!).
> 
> Why? AFAIK we don't have bugs here. If the page is RO and has an
> elevated refcount it cannot be 'PageAnonExclusive' and so any place
> that wants to drop the WP just cannot. What is the issue?

Sure it can.

1. Map page R/W
2. Pin it R/W
3. Swapout
4. Read access

Page is now mapped R/O and *has to be* marked PageAnonExclusive(), to
properly skip the COW fault. That's literally 60% of the reproducers we
have that need fixing.


But what I think you actually mean is whether we want to get R/O pins right.
> 
>> BUT, it would mean that whenever we fork() and there is one additional
>> reference on a page (even if it's from the swapcache), we would slow
>> down fork() even if there was never any GUP. This would apply to any
>> process out there that does a fork() ...
> 
> You mean because we'd copy?

Yes.

> 
> Is this common? Linus' prior email was talking as though swap is so
> rare we shouldn't optimize for it?

At least in the enterprise segment, having swap enabled is mostly a hard,
documented requirement. On customer installations swap is still common,
and is even getting replaced by zswap, which is enabled automatically in
many installations ...

So in the world I live and work in, swap is used frequently.

>  
>> So the idea is to mark a page only exclusive as soon as someone needs
>> the page to be exclusive and stay exclusive (-> e.g., GUP with FOLL_PIN
>> or selected FOLL_GET like O_DIRECT). This can happen in my current
>> approach using two ways:
>>
>> (1) Set the bit when we know we are the only users
>>
>> We can set PageAnonExclusive() in case *we sync against fork* and the
>> page cannot get unmapped (pt lock) when:
>> * The page is mapped writable
>> * The page is mapped readable and page_count == 1
> 
> I'm still not sure I see that all this complexity is netting a gain?

Avoid copy on fork().

>  
>> If we cannot set the page exclusive, we have to trigger a page fault.
>>
>> (2) During pagefaults when FOLL_FAULT_UNSHARE is set.
> 
> Why do we need FOLL_FAULT_UNSHARE? AFAICT that was part of this
> series because of mapcount; once the hugetlb COW is fixed to use
> refcount properly, as Linus showed, the bugs this was trying to fix go
> away.

The purpose of FOLL_FAULT_UNSHARE in the !mapcount version is to cleanly
support R/O pins without the need for FOLL_WRITE.

And it's comparatively easy to add on top. This is not the core of the
complexity, really.

> 
> And as discussed before it is OK if READ gup becomes incoherent, that
> is its defined semantic.

And that's where I still disagree.

But anyhow, this is really more about FOLL_FAULT_UNSHARE, which is
pretty easy and natural to add on top and just gets this right.

> 
>> The above should work fairly reliably with GUP. But indeed,
>> gup-fast-only is the problem. I'm still investigating what kind of
>> lightweight synchronization we could do against fork() such that we
>> wouldn't try setting a page PageAnonExclusive() while fork()
>> concurrently shares the page.
>>
>> We could eventually use the page lock and do a try_lock(), both in
>> fork() and in gup-fast-only. fork() would only clear the bit if the
>> try_lock() succeeded. gup-fast-only would only be able to set the bit
>> and not fall back to the slow path if try_lock() succeeded.
> 
> I suspect that is worse than just having fork clear the bit and leave
> GUP as-is. trylock is an atomic; clearing PageAnonExclusive does not
> need to be atomic, as it is protected by the PTL.

There are 2 models, leaving FOLL_FAULT_UNSHARE out of the picture for now:

1) Whenever mapping an anonymous page R/W (after COW, during ordinary
fault, on swapin), we mark the page exclusive. We must never lose the
PageAnonExclusive bit, not during migration, not during swapout.

fork() will process the bit for each and every process, even if there
was no GUP, and will copy if there are additional references.

2) Whenever GUP wants to pin/ref a page, we try marking it exclusive. We
can lose the PageAnonExclusive bit during migration and swapout, because
that can only happen when there are no additional references.

fork() will process the bit only if there was GUP. Ordinary fork() is
left unchanged.


Getting R/O pins supported in the same way just means that we have to
check on a R/O pin whether the page is PageAnonExclusive, and if that's
not the case, trigger a FOLL_FAULT_UNSHARE fault. That's really the only
"complexity" on top, which is really easy without the mapcount.
Linus Torvalds Dec. 21, 2021, 5:05 p.m. UTC | #77
On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 21.12.21 02:03, Jason Gunthorpe wrote:
>
> > I'm having a hard time imagining how gup_fast can maintain any sort of
> > bit - it lacks all forms of locks so how can we do an atomic test and
> > set between two pieces of data?
>
> And exactly that is to be figured out.

So my preference would be to just always maintain the "exclusive to
this VM" bit in the 'struct page', because that makes things easier to
think about.

[ Of course - the bit could be reversed, and be a 'not exclusive to
this VM' bit, semantically the set-or-cleared issue doesn't matter.
Also, when I talk about some "exclusive to this VM" bit, I'm purely
talking about pages that are marked PageAnon(), so the bit may or may
not even exist for other page types ]

And then all GUP-fast would need to do is to refuse to look up a page
that isn't exclusive to that VM. We already have the situation that
GUP-fast can fail for non-writable pages etc, so it's just another
test.
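
(A sketch of what that extra test could look like in gup_pte_range() --
hypothetical, with the PageAnonExclusive() naming used elsewhere in the
thread:)

	/* after grabbing a reference to the page */
	if (PageAnon(page) && !PageAnonExclusive(page)) {
		/* not exclusive to this VM: fall back to slow GUP */
		put_page(page);
		goto pte_unmap;
	}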

> Note that I am also trying to make any kind of R/O pin on an anonymous
> page work as expected, to fix any kind of GUP after fork() and
> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
> similarly has to make sure that the page is exclusive -- even if it's
> mapped R/O (!).

I do think the existing "maybe_pinned()" logic is fine for that. The
"exclusive to this VM" bit can be used to *help* that decision -
because only an exclusive page can be pinned - but I don't think it
should _replace_ that logic.

There's a quite fundamental difference between

 (a) COW and GUP: these two operations _have_ to know that they get an
exclusive page in order to re-use or look up the page respectively

 (b) the pre-cow logic in fork() or the "add this to the swap cache"
logic in vmscan that decides whether a page can be turned into a COW
page by adding a reference count to it (whether due to fork or swap
cache doesn't matter - the end result is the same).

The difference is that in (a) the thing we *have* to get right is
whether a page is exclusively owned by that VM or not. We can COW too
much, but we can never share a page unless it's exclusive. That's true
whether it's pinned or not.

In (b), the "have to get right" is different. In (b), it's perfectly
ok to COW an exclusive page and turn it non-exclusive. But we must
never COW a pinned page.

So (a) and (b) are very different situations, and have different logic.

If we always maintain an exclusive bit for AnonPage pages, then both
(a) and (b) can use that bit, but they'll use it very differently. In
(a) we'll refuse to look it up and will force a 'handle_mm_fault()' to
get an exclusive copy. And in (b), we just use it as a "we know only
exclusive pages can be pinned", so it's just another check for
page_needs_cow_for_dma(), the same way we currently check
"MMF_HAS_PINNED" to narrow down the whole "page count indicates this
may be a pinned page" question.
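
(Roughly, as a sketch: the current helper with the exclusive bit as one
more narrowing test -- hypothetical, not an actual implementation:)

static inline bool page_needs_cow_for_dma(struct vm_area_struct *vma,
					  struct page *page)
{
	if (!is_cow_mapping(vma->vm_flags))
		return false;
	if (!test_bit(MMF_HAS_PINNED, &vma->vm_mm->flags))
		return false;
	/* only a page exclusive to this VM can have been pinned by it */
	if (!PageAnonExclusive(page))
		return false;
	return page_maybe_dma_pinned(page);
}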

And the "page is exclusive" would actually be the *common* case for
almost all pages. Any time you've written to a page and you haven't
forked after the write (and it hasn't been turned into a swap page),
that page would be exclusive to that VM.

Doesn't this seem like really straightforward semantics to maintain
(and think about)?

I'd like the exclusive page bit to *not* be directly about "has this
page been pinned" exactly because we already have too many special
cases for GUP. It would be nicer to have a page bit that has very
clear semantics even in the absence of GUP.

             Linus
David Hildenbrand Dec. 21, 2021, 5:40 p.m. UTC | #78
On 21.12.21 18:05, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 21.12.21 02:03, Jason Gunthorpe wrote:
>>
>>> I'm having a hard time imagining how gup_fast can maintain any sort of
>>> bit - it lacks all forms of locks so how can we do an atomic test and
>>> set between two pieces of data?
>>
>> And exactly that is to be figured out.
> 
> So my preference would be to just always maintain the "exclusive to
> this VM" bit in the 'struct page', because that makes things easier to
> think about.
> 
> [ Of course - the bit could be reversed, and be a 'not exclusive to
> this VM' bit, semantically the set-or-cleared issue doesn't matter.
> Also, when I talk about some "exclusive to this VM" bit, I'm purely
> talking about pages that are marked PageAnon(), so the bit may or may
> not even exist for other pager types ]

Yes, whatever I say applies to PageAnon() only -- including the
(overloaded) bit, now called PG_anon_exclusive.

> 
> And then all GUP-fast would need to do is to refuse to look up a page
> that isn't exclusive to that VM. We already have the situation that
> GUP-fast can fail for non-writable pages etc, so it's just another
> test.

Right, the simplest way is simply failing GUP fast if the bit isn't set,
forcing it into the slow path. If that primarily happens for R/O
pins after fork(), fine with me.

> 
>> Note that I am trying to make also any kind of R/O pins on an anonymous
>> page work as expected as well, to fix any kind of GUP after fork() and
>> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
>> similarly has to make sure that the page is exclusive -- even if it's
>> mapped R/O (!).
> 
> I do think the existing "maybe_pinned()" logic is fine for that. The
> "exclusive to this VM" bit can be used to *help* that decision -
> because only an exclusive page can be pinned - but I don't think it
> should _replace_ that logic.

The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to
FOLL_PIN unfortunately. So I'm *trying* to make it more generic such
that such corner cases can be handled as well correctly. But yeah, I'll
see where this goes ... O_DIRECT has to be fixed one way or the other.

John H. mentioned that he wants to look into converting that to
FOLL_PIN. So maybe that will work eventually.

> 
> There's a quite fundamental difference between
> 
>  (a) COW and GUP: these two operations _have_ to know that they get an
> exclusive page in order to re-use or look up the page respectively
> 
>  (b) the pre-cow logic in fork() or the "add this to the swap cache"
> logic in vmscan that decides whether a page can be turned into a COW
> page by adding a reference count to it (whether due to fork or swap
> cache doesn't matter - the end result is the same).
> 
> The difference is that in (a) the thing we *have* to get right is
> whether a page is exclusively owned by that VM or not. We can COW too
> much, but we can never share a page unless it's exclusive. That's true
> whether it's pinned or not.

Exactly. Once a page is "exclusive" it must not get shared *unless* we
can turn it into a "shared" page during fork().

There are some ugly corner cases that will require some thought.

> 
> In (b), the "have to get right" is different. In (b), it's perfectly
> ok to COW an exclusive page and turn it non-exclusive. But we must
> never COW a pinned page.
> 
> So (a) and (b) are very different situations, and have different logic.
> 
> If we always maintain an exclusive bit for AnonPage pages, then both
> (a) and (b) can use that bit, but they'll use it very differently. In
> (a) we'll refuse to look it up and will force a 'handle_mm_fault()' to
> get an exclusive copy. And in (b), we just use it as a "we know only
> exclusive pages can be pinned", so it's just another check for
> page_needs_cow_for_dma(), the same way we currently check
> "MMF_HAS_PINNED" to narrow down the whole "page count indicates this
> may be a pinned page" question.

If we use page_needs_cow_for_dma() for that purpose we can still have
other references from our process referencing the page, including right
now O_DIRECT ones. So the safest thing to do would be relying on the
same logic as we do in the COW path regarding the pagecount ... but that
might result in unnecessary copies as I mentioned.

It would be perfect if just anything that modifies page content would be
using FOLL_PIN, unfortunately that's not reality ...


> 
> And the "page is exclusive" would actually be the *common* case for
> almost all pages. Any time you've written to a page and you haven't
> forked after the write (and it hasn't been turned into a swap page),
> that page would be exclusive to that VM.

Yes. Essentially every time we create a new anonymous page it would end
up as exclusive. Or if we're in a fault and can convert a "shared"
page into an "exclusive" page (essentially the COW reuse logic).

> 
> Doesn't this seem like really straightforward semantics to maintain
> (and think about)?
> 
> I'd like the exclusive page bit to *not* be directly about "has this
> page been pinned" exactly because we already have too many special
> cases for GUP. It would be nicer to have a page bit that has very
> clear semantics even in the absence of GUP.

What adds complexity to correctly maintain the "exclusive" state are at
least:
* KSM (might be harder, have to think about it)
* migration (might be easy to just copy the bit)
* fork() with migration/swap entries that reference a page that is
  "exclusive". I'll have to think about that more.

So I have plenty of stuff to look into.


Just so we're on the same page what I'd like to achieve with anonymous
pages:

1) If we take a R/W pin on an anonymous page, we will always pin an
"exclusive page".

2) If we take a R/O pin on an anonymous page, we will always pin an
"exclusive page", even if the page is mapped R/O.

3) "Exclusive" pages cannot be turned "shared" during fork (and ksm? :/
) if pinned.

4) "Exclusive" pages can be turned "shared" during fork if not pinned.

5) "Exclusive" pages will never be COWed but remain there for all
eternity, until unmapped ... well or until converted into "shared" again
if possible


Ideally we'd handle O_DIRECT ... :(


2) is certainly the cherry on top. But it just means that R/O pins don't
have to be the weird kid. And yes, achieving 2) would require
FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
what existing COW logic does, just bypass the "map writable" and
"trigger write fault" semantics.

I hope we agree that R/O pins don't have to be the weird kid if we can
"get it right" with the same approach.
Linus Torvalds Dec. 21, 2021, 6 p.m. UTC | #79
On Tue, Dec 21, 2021 at 9:40 AM David Hildenbrand <david@redhat.com> wrote:
>
> > I do think the existing "maybe_pinned()" logic is fine for that. The
> > "exclusive to this VM" bit can be used to *help* that decision -
>>> because only an exclusive page can be pinned - but I don't think it
> > should _replace_ that logic.
>
> The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to
> FOLL_PIN unfortunately. So I'm *trying* to make it more generic such
> that such corner cases can be handled as well correctly. But yeah, I'll
> see where this goes ... O_DIRECT has to be fixed one way or the other.
>
> John H. mentioned that he wants to look into converting that to
> FOLL_PIN. So maybe that will work eventually.

I'd really prefer that as the plan.

What exactly is the issue with O_DIRECT? Is it purely that it uses
"put_page()" instead of "unpin", or what?

I really think that if people look up pages and expect those pages to
stay coherent with the VM they looked it up for, they _have_ to
actively tell the VM layer - which means using FOLL_PIN.

Note that this is in absolutely no way a "new" issue. It has *always*
been true. If some O_DIRECT path depends on pinning behavior, it has
never worked correctly, and it is entirely on O_DIRECT, and not at all
a VM issue. We've had people doing GUP games forever, and being burnt
by those games not working reliably.

GUP (before we even had the notion of pinning) would always just take
a reference to the page, but it would not guarantee that that exact
page then kept an association with the VM.

Now, in *practice* this all works if:

 (a) the GUP user had always written to the page since the fork
(either explicitly, or with FOLL_WRITE obviously acting as such)

 (b) the GUP user never forks afterwards until the IO is done

 (c) the GUP user plays no other VM games on that address

and it's also very possible that it has worked by pure luck (ie we've
had a lot of random code that actively mis-used things and it would
work in practice just because COW would happen to cut the right
direction etc).

Is there some particular GUP user you happen to care about more than
others? I think it's a valid option to try to fix things up one by
one, even if you don't perhaps fix _all_ cases.

              Linus
Jan Kara Dec. 21, 2021, 6:07 p.m. UTC | #80
On Tue 21-12-21 18:40:30, David Hildenbrand wrote:
> On 21.12.21 18:05, Linus Torvalds wrote:
> > On Tue, Dec 21, 2021 at 12:58 AM David Hildenbrand <david@redhat.com> wrote:
> >> Note that I am trying to make also any kind of R/O pins on an anonymous
> >> page work as expected as well, to fix any kind of GUP after fork() and
> >> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
> >> similarly has to make sure that the page is exclusive -- even if it's
> >> mapped R/O (!).
> > 
> > I do think the existing "maybe_pinned()" logic is fine for that. The
> > "exclusive to this VM" bit can be used to *help* that decision -
> > because only an exclusive page can be pinned - but I don't think it
> > should _replace_ that logic.
> 
> The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to
> FOLL_PIN unfortunately. So I'm *trying* to make it more generic such
> that such corner cases can be handled as well correctly. But yeah, I'll
> see where this goes ... O_DIRECT has to be fixed one way or the other.
> 
> John H. mentioned that he wants to look into converting that to
> FOLL_PIN. So maybe that will work eventually.

For record we always intended (and still intend) to make O_DIRECT use
FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP
and pages acquired through get_page() in a single bio (such as zero page)
and thus it is non-trivial to do the right thing on IO completion (unpin or
just put_page).

								Honza
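
For illustration only: if a bio could record how its pages were
acquired (a hypothetical BIO_PAGE_PINNED flag below), completion would
be uniform -- but a single bio today can mix both styles, which is why
this is non-trivial:

static void bio_release_pages_sketch(struct bio *bio)
{
        struct bvec_iter_all iter_all;
        struct bio_vec *bvec;

        bio_for_each_segment_all(bvec, bio, iter_all) {
                if (bio_flagged(bio, BIO_PAGE_PINNED))
                        unpin_user_page(bvec->bv_page);
                else
                        put_page(bvec->bv_page);
        }
}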
David Hildenbrand Dec. 21, 2021, 6:28 p.m. UTC | #81
On 21.12.21 19:00, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 9:40 AM David Hildenbrand <david@redhat.com> wrote:
>>
>>> I do think the existing "maybe_pinned()" logic is fine for that. The
>>> "exclusive to this VM" bit can be used to *help* that decision -
>>> because only an exclusive page can be pinned - but I don't think it
>>> should _replace_ that logic.
>>
>> The issue is that O_DIRECT uses FOLL_GET and cannot easily be changed to
>> FOLL_PIN unfortunately. So I'm *trying* to make it more generic such
>> that such corner cases can be handled as well correctly. But yeah, I'll
>> see where this goes ... O_DIRECT has to be fixed one way or the other.
>>
>> John H. mentioned that he wants to look into converting that to
>> FOLL_PIN. So maybe that will work eventually.
> 
> I'd really prefer that as the plan.
> 
> What exactly is the issue with O_DIRECT? Is it purely that it uses
> "put_page()" instead of "unpin", or what?
> 
> I really think that if people look up pages and expect those pages to
> stay coherent with the VM they looked it up for, they _have_ to
> actively tell the VM layer - which means using FOLL_PIN.
> 
> Note that this is in absolutely no way a "new" issue. It has *always*
> been true. If some O_DIRECT path depends on pinning behavior, it has
> never worked correctly, and it is entirely on O_DIRECT, and not at all
> a VM issue. We've had people doing GUP games forever, and being burnt
> by those games not working reliably.
> 
> GUP (before we even had the notion of pinning) would always just take
> a reference to the page, but it would not guarantee that that exact
> page then kept an association with the VM.
> 
> Now, in *practice* this all works if:
> 
>  (a) the GUP user had always written to the page since the fork
> (either explicitly, or with FOLL_WRITE obviously acting as such)
> 
>  (b) the GUP user never forks afterwards until the IO is done
> 
>  (c) the GUP user plays no other VM games on that address
> 
> and it's also very possible that it has worked by pure luck (ie we've
> had a lot of random code that actively mis-used things and it would
> work in practice just because COW would happen to cut the right
> direction etc).
> 
> Is there some particular GUP user you happen to care about more than
> others? I think it's a valid option to try to fix things up one by
> one, even if you don't perhaps fix _all_ cases.

Yes, of course. The important part for me is to have a rough idea of how
to tackle all pieces and have a reliable design/approach. Besides the
security issue, highest priority is getting R/W pins (FOLL_WRITE) right,
including O_DIRECT, because that can silently break existing use cases.

Lower priority is getting R/O pins on anonymous memory right, because
that never worked reliably. Lowest priority is getting R/O pins on
MAP_PRIVATE file memory right.

I'd appreciate if someone could work on the O_DIRECT FOLL_PIN conversion
while I struggle with PageAnonExclusive() and R/W pins :)

[noting that I'll not get too much done within the next 2 weeks]
Linus Torvalds Dec. 21, 2021, 6:30 p.m. UTC | #82
On Tue, Dec 21, 2021 at 10:07 AM Jan Kara <jack@suse.cz> wrote:
>
> For record we always intended (and still intend) to make O_DIRECT use
> FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP
> and pages acquired through get_page() in a single bio (such as zero page)
> and thus it is non-trivial to do the right thing on IO completion (unpin or
> just put_page).

Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but
it might make it much easier to debug and catch.

If we only set the exclusive VM bit on pages that get mapped into user
space, and we guarantee that GUP only looks up such pages, then we can
also add a debug test to the "unpin" case that the bit is still set.

And that would catch anybody who ends up using other pages for
unpin(), and you could have a WARN_ON() for it (obviously also trigger
on the page count being too small to unpin).

That way, at least from a kernel debugging and development standpoint
it would make it easy to see "ok, this unpinning got a page that
wasn't pinned", and it would help find these cases where some
situation had used just a get_page() rather than a pin to get a page
pointer.

No?

                  Linus
David Hildenbrand Dec. 21, 2021, 6:51 p.m. UTC | #83
On 21.12.21 19:30, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 10:07 AM Jan Kara <jack@suse.cz> wrote:
>>
>> For record we always intended (and still intend) to make O_DIRECT use
>> FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP
>> and pages acquired through get_page() in a single bio (such as zero page)
>> and thus it is non-trivial to do the right thing on IO completion (unpin or
>> just put_page).
> 
> Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but
> it might make it much easier to debug and catch.
> 
> If we only set the exclusive VM bit on pages that get mapped into user
> space, and we guarantee that GUP only looks up such pages, then we can
> also add a debug test to the "unpin" case that the bit is still set.
> 
> And that would catch anybody who ends up using other pages for
> unpin(), and you could have a WARN_ON() for it (obviously also trigger
> on the page count being too small to unpin).

It would also catch someone wrongly dropping the exclusive flag
although there are users (pins) relying on the page staying exclusive.

> 
> That way, at least from a kernel debugging and development standpoint
> it would make it easy to see "ok, this unpinning got a page that
> wasn't pinned"

For that purpose the pincount would already kind-of work. Not precise,
but at least something ("this page cannot possibly have been pinned").
Linus Torvalds Dec. 21, 2021, 6:58 p.m. UTC | #84
On Tue, Dec 21, 2021 at 10:51 AM David Hildenbrand <david@redhat.com> wrote:
>
> For that purpose the pincount would already kind-off work. Not precise,
> but at least something ("this page cannot possibly have been pinned").

That part actually exists already, ie put_page_refs() has this:

  #ifdef CONFIG_DEBUG_VM
        if (VM_WARN_ON_ONCE_PAGE(page_ref_count(page) < refs, page))
                return;
  #endif

And yeah, it shouldn't have that '#ifdef CONFIG_DEBUG_VM' there, but I
think it's because the non-CONFIG_DEBUG_VM #define for
VM_WARN_ON_ONCE_PAGE() is broken, and doesn't return 0.
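
A sketch of a fixed fallback, assuming the stub is currently plain
BUILD_BUG_ON_INVALID(cond): keep evaluating and returning the
condition, just skip the warning:

  #ifndef CONFIG_DEBUG_VM
  #define VM_WARN_ON_ONCE_PAGE(cond, page) ({     \
          BUILD_BUG_ON_INVALID(page);             \
          unlikely(!!(cond));                     \
  })
  #endif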

                   Linus
Jason Gunthorpe Dec. 21, 2021, 7:07 p.m. UTC | #85
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:

> 2) is certainly the cherry on top. But it just means that R/O pins don't
> have to be the weird kid. And yes, achieving 2) would require
> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
> what existing COW logic does, just bypass the "map writable" and
> "trigger write fault" semantics.

I still don't agree with this - when you come to patches can you have
this work at the end and under a good cover letter? Maybe it will make
more sense then.

Thanks,
Jason
John Hubbard Dec. 21, 2021, 9:11 p.m. UTC | #86
On 12/21/21 10:28, David Hildenbrand wrote:
...
> I'd appreciate if someone could work on the O_DIRECT FOLL_PIN conversion
> while I struggle with PageAnonExclusive() and R/W pins :)
> 

Yes, I'll sign up for that (unless someone else who is faster is already
working on it). I've tried a couple times in the past, but without the
proper level of determination to see it through. So this time will be
different. :)

> [noting that I'll not get too much done within the next 2 weeks]
> 

Likewise. Starting in early January.


thanks,
John Hubbard Dec. 21, 2021, 9:16 p.m. UTC | #87
On 12/21/21 10:30, Linus Torvalds wrote:
> On Tue, Dec 21, 2021 at 10:07 AM Jan Kara <jack@suse.cz> wrote:
>>
>> For record we always intended (and still intend) to make O_DIRECT use
>> FOLL_PIN. Just it is tricky because some users mix pages pinned with GUP
>> and pages acquired through get_page() in a single bio (such as zero page)
>> and thus it is non-trivial to do the right thing on IO completion (unpin or
>> just put_page).
> 
> Side note: the new "exclusive VM" bit wouldn't _solve_ this issue, but
> it might make it much easier to debug and catch.
> 
> If we only set the exclusive VM bit on pages that get mapped into user
> space, and we guarantee that GUP only looks up such pages, then we can
> also add a debug test to the "unpin" case that the bit is still set.
> 
> And that would catch anybody who ends up using other pages for
> unpin(), and you could have a WARN_ON() for it (obviously also trigger
> on the page count being too small to unpin).
> 
> That way, at least from a kernel debugging and development standpoint
> it would make it easy to see "ok, this unpinning got a page that
> wasn't pinned", and it would help find these cases where some
> situation had used just a get_page() rather than a pin to get a page
> pointer.
> 
> No?
> 
>                    Linus

Yes, this is especially welcome, because it means that after enough time
sitting in the -mm tree, we can reasonably expect to catch the most important
cases, if any were missed. That makes it a whole other level of useful, as
compared to local testing hacks.


thanks,
Jason Gunthorpe Dec. 21, 2021, 11:54 p.m. UTC | #88
On Tue, Dec 21, 2021 at 04:19:33PM +0100, David Hildenbrand wrote:

> >> Note that I am trying to make also any kind of R/O pins on an anonymous
> >> page work as expected as well, to fix any kind of GUP after fork() and
> >> GUP before fork(). So taking a R/O pin on an !PageAnonExclusive() page
> >> similarly has to make sure that the page is exclusive -- even if it's
> >> mapped R/O (!).
> > 
> > Why? AFAIK we don't have bugs here. If the page is RO and has an
> > elevated refcount it cannot be 'PageAnonExclusive' and so any place
> > that wants to drop the WP just cannot. What is the issue?

> But what I think you actually mean is if we want to get R/O pins
> right.

What I meant was a page that is GUP'd RO, is not PageAnonExclusive and
has an elevated refcount. Those cannot be transformed to
PageAnonExclusive, or re-used during COW, but they also don't have
problems today. Such places either are like O_DIRECT read, and are
tolerant of a false COW, or they are broken like VFIO and should be
using FOLL_FORCE|FOLL_WRITE, which turns them into a WRITE and then we
know they get PageAnonExclusive.

So, the swap issue is fixed directly with PageAnonExclusive and no
change to READ GUP is required, at least in your #1 scenario, AFAICT..

> There are 2 models, leaving FAULT_FLAG_UNSHARE out of the picture for now:
> 
> 1) Whenever mapping an anonymous page R/W (after COW, during ordinary
> fault, on swapin), we mark the page exclusive. We must never lose the
> PageAnonExclusive bit, not during migration, not during swapout.

I prefer this one as well.

It allows us to keep Linus's simple logic that refcount == 1 means
always safe to re-use, no matter what.

And refcount != 1 goes on to consider the additional bit to decide
what to do. The simple bit really means 'we know this page has one PTE
so ignore the refcount for COW reuse decisions'.
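
In other words, the anon reuse decision would boil down to something
like this sketch (the helper name is made up):

static bool wp_can_reuse_anon(struct page *page)
{
        if (page_count(page) == 1)
                return true;            /* sole reference: always safe */
        /*
         * Elevated refcount: only safe if the extra references are
         * known to be pins taken while the page was exclusive to us.
         */
        return PageAnonExclusive(page);
}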

> fork() will process the bit for each and every process, even if there
> was no GUP, and will copy if there are additional references.

Yes, just like it does today already for mapcount.

> 2) Whenever GUP wants to pin/ref a page, we try marking it exclusive. We
> can lose the PageAnonExclusive bit during migration and swapout, because
> that can only happen when there are no additional references.

I haven't thought of a way this is achievable.

At least not without destroying GUP fast..

Idea #2 is really a "this page is GUP'd" flag with some sneaky logic
to clear it.  That comes along with all the races too because as an
idea it is fundamentally about GUP which runs without locks.

Jason
Jason Gunthorpe Dec. 21, 2021, 11:59 p.m. UTC | #89
On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:

> What adds complexity to correctly maintain the "exclusive" state are at
> least:
> * KSM (might be harder, have to think about it)

I know little about it, but isn't KSM like fork where it is trying to
WP pages with the intention of copying them? Shouldn't KSM completely
reject WP'ing a page that is under any kind of writable GUP?

Jason
David Hildenbrand Dec. 22, 2021, 8:30 a.m. UTC | #90
On 22.12.21 00:59, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
> 
>> What adds complexity to correctly maintain the "exclusive" state are at
>> least:
>> * KSM (might be harder, have to think about it)
> 
> I know little about it, but isn't KSM like fork where it is trying to
> WP pages with the intention of copying them? Shouldn't KSM completely
> reject WP'ing a page that is under any kind of writable GUP?

I think KSM will, similar to fork(), always have to try removing
PageAnonExclusive() while synchronizing against concurrent GUP pins. If
that fails, the page cannot be converted to KSM and consequently not be
shared.

That will need some rework of KSM, AFAIU, but shouldn't be impossible to do.
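
Roughly, as a sketch -- try_share_anon_page() is a made-up name for the
"clear the bit unless the page may be pinned" operation:

        /* in KSM's try_to_merge_one_page(), before write-protecting: */
        if (PageAnonExclusive(page) && !try_share_anon_page(page))
                goto out_unlock;        /* may be pinned: don't merge */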
David Hildenbrand Dec. 22, 2021, 8:51 a.m. UTC | #91
On 21.12.21 20:07, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
> 
>> 2) is certainly the cherry on top. But it just means that R/O pins don't
>> have to be the weird kid. And yes, achieving 2) would require
>> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
>> what existing COW logic does, just bypass the "map writable" and
>> "trigger write fault" semantics.
> 
> I still don't agree with this - when you come to patches can you have
> this work at the end and under a good cover letter? Maybe it will make
> more sense then.

Yes. But really, I think it's the logical consequence of what Linus said
[1]:

  "And then all GUP-fast would need to do is to refuse to look up a page
   that isn't exclusive to that VM. We already have the situation that
   GUP-fast can fail for non-writable pages etc, so it's just another
   test."

We must not FOLL_PIN a page that is not exclusive (not only on gup-fast,
but really, on any gup). If we special case R/O FOLL_PIN, we cannot
enable the sanity check on unpin as suggested by Linus [2]:

  "If we only set the exclusive VM bit on pages that get mapped into
   user space, and we guarantee that GUP only looks up such pages, then
   we can also add a debug test to the "unpin" case that the bit is
   still set."

There are really only two feasible options I see when we want to take a
R/O FOLL_PIN on a !PageAnonExclusive() anon page

(1) Fail the pinning completely. This implies that we'll have to fail
    O_DIRECT once converted to FOLL_PIN.
(2) Request to mark the page PageAnonExclusive() via a
    FAULT_FLAG_UNSHARE and let it succeed.


Anything else would require additional accounting that we already
discussed in the past is hard -- for example, to differentiate R/O from
R/W pins requiring two pin counters.

The only impact would be that FOLL_PIN after fork() has to go via a
FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this
is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would
be nice to optimize this, to *not* do that, but again ... this would
require even more counters I think, for example, to differentiate
between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".

So unless we discover a way to do additional accounting for ordinary 4k
pages, I think we really can only do (1) or (2) to make sure we never
ever pin a !PageAnonExclusive() page.


[1]
https://lkml.kernel.org/r/CAHk-=wgQq3H6wfkW7+MmduVgBOqHeiXQN97yCMd+m1mM-1xCLQ@mail.gmail.com
[2]
https://lkml.kernel.org/r/CAHk-=wiyxQ==vnHFHW99S_OPwA=u1Qrfg2OGr_6zPcBAuhQY2w@mail.gmail.com
David Hildenbrand Dec. 22, 2021, 9:58 a.m. UTC | #92
On 22.12.21 09:51, David Hildenbrand wrote:
> On 21.12.21 20:07, Jason Gunthorpe wrote:
>> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
>>
>>> 2) is certainly the cherry on top. But it just means that R/O pins don't
>>> have to be the weird kid. And yes, achieving 2) would require
>>> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
>>> what existing COW logic does, just bypass the "map writable" and
>>> "trigger write fault" semantics.
>>
>> I still don't agree with this - when you come to patches can you have
>> this work at the end and under a good cover letter? Maybe it will make
>> more sense then.
> 
> Yes. But really, I think it's the logical consequence of what Linus said
> [1]:
> 
>   "And then all GUP-fast would need to do is to refuse to look up a page
>    that isn't exclusive to that VM. We already have the situation that
>    GUP-fast can fail for non-writable pages etc, so it's just another
>    test."
> 
> We must not FOLL_PIN a page that is not exclusive (not only on gup-fast,
> but really, on any gup). If we special case R/O FOLL_PIN, we cannot
> enable the sanity check on unpin as suggested by Linus [2]:
> 
>   "If we only set the exclusive VM bit on pages that get mapped into
>    user space, and we guarantee that GUP only looks up such pages, then
>    we can also add a debug test to the "unpin" case that the bit is
>    still set."
> 
> There are really only two feasible options I see when we want to take a
> R/O FOLL_PIN on a !PageAnonExclusive() anon page
> 
> (1) Fail the pinning completely. This implies that we'll have to fail
>     O_DIRECT once converted to FOLL_PIN.
> (2) Request to mark the page PageAnonExclusive() via a
>     FAULT_FLAG_UNSHARE and let it succeed.
> 
> 
> Anything else would require additional accounting that we already
> discussed in the past is hard -- for example, to differentiate R/O from
> R/W pins requiring two pin counters.
> 
> The only impact would be that FOLL_PIN after fork() has to go via a
> FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this
> is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would
> be nice to optimize this, to *not* do that, but again ... this would
> require even more counters I think, for example, to differentiate
> between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".
> 
> So unless we discover a way to do additional accounting for ordinary 4k
> pages, I think we really can only do (1) or (2) to make sure we never
> ever pin a !PageAnonExclusive() page.

BTW, I just wondered if the optimization should be that R/O
short-term FOLL_PIN users should actually be using FOLL_GET instead. So
O_DIRECT with R/O would already be doing the right thing.

And it somewhat aligns with what we found: only R/W short-term FOLL_GET
is problematic, where we can lose writes to the page from the device via
O_DIRECT.

IIUC, our COW logic makes sure that a shared anonymous page that might
still be used by a R/O FOLL_GET cannot be modified, because any attempt
to modify it would result in a copy.

But I might be missing something, just an idea.
Jan Kara Dec. 22, 2021, 12:41 p.m. UTC | #93
On Wed 22-12-21 10:58:36, David Hildenbrand wrote:
> On 22.12.21 09:51, David Hildenbrand wrote:
> > On 21.12.21 20:07, Jason Gunthorpe wrote:
> >> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
> >>
> >>> 2) is certainly the cherry on top. But it just means that R/O pins don't
> >>> have to be the weird kid. And yes, achieving 2) would require
> >>> FAULT_FLAG_EXCLUSIVE / FAULT_FLAG_UNSHARED, but it would really 99% do
> >>> what existing COW logic does, just bypass the "map writable" and
> >>> "trigger write fault" semantics.
> >>
> >> I still don't agree with this - when you come to patches can you have
> >> this work at the end and under a good cover letter? Maybe it will make
> >> more sense then.
> > 
> > Yes. But really, I think it's the logical consequence of what Linus said
> > [1]:
> > 
> >   "And then all GUP-fast would need to do is to refuse to look up a page
> >    that isn't exclusive to that VM. We already have the situation that
> >    GUP-fast can fail for non-writable pages etc, so it's just another
> >    test."
> > 
> > We must not FOLL_PIN a page that is not exclusive (not only on gup-fast,
> > but really, on any gup). If we special case R/O FOLL_PIN, we cannot
> > enable the sanity check on unpin as suggested by Linus [2]:
> > 
> >   "If we only set the exclusive VM bit on pages that get mapped into
> >    user space, and we guarantee that GUP only looks up such pages, then
> >    we can also add a debug test to the "unpin" case that the bit is
> >    still set."
> > 
> > There are really only two feasible options I see when we want to take a
> > R/O FOLL_PIN on a !PageAnonExclusive() anon page
> > 
> > (1) Fail the pinning completely. This implies that we'll have to fail
> >     O_DIRECT once converted to FOLL_PIN.
> > (2) Request to mark the page PageAnonExclusive() via a
> >     FAULT_FLAG_UNSHARE and let it succeed.
> > 
> > 
> > Anything else would require additional accounting that we already
> > discussed in the past is hard -- for example, to differentiate R/O from
> > R/W pins requiring two pin counters.
> > 
> > The only impact would be that FOLL_PIN after fork() has to go via a
> > FAULT_FLAG_UNSHARE once, to turn the page PageAnonExclusive. IMHO this
> > is the right thing to do for FOLL_LONGTERM. For !FOLL_LONGTERM it would
> > be nice to optimize this, to *not* do that, but again ... this would
> > require even more counters I think, for example, to differentiate
> > between "R/W short/long-term or R/O long-term pin" and "R/O short-term pin".
> > 
> > So unless we discover a way to do additional accounting for ordinary 4k
> > pages, I think we really can only do (1) or (2) to make sure we never
> > ever pin a !PageAnonExclusive() page.
> 
> BTW, I just wondered if the optimization should actually be that R/O
> short-term FOLL_PIN users should actually be using FOLL_GET instead. So
> O_DIRECT with R/O would already be doing the right thing.
> 
> And it somewhat aligns with what we found: only R/W short-term FOLL_GET
> is problematic, where we can lose writes to the page from the device via
> O_DIRECT.
> 
> IIUC, our COW logic makes sure that a shared anonymous page that might
> still be used by a R/O FOLL_GET cannot be modified, because any attempt
> to modify it would result in a copy.

Well, we defined FOLL_PIN to mean the intent that the caller wants to access
not only page state (for which is enough FOLL_GET and there are some users
- mostly inside mm - who need this) but also page data. Eventually, we even
wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.

For file pages we need this data vs no-data access distinction so that
filesystems can detect when someone can be accessing page data although the
page is unmapped.  Practically, filesystems care most about when someone
can be *modifying* page data (we need to make sure data is stable e.g. when
writing back data to disk or doing data checksumming or other operations)
so using FOLL_GET when wanting to only read page data should be OK for
filesystems but honestly I would be reluctant to break the rule of "use
FOLL_PIN when wanting to access page data" to keep things simple and
reasonably easy to understand for parties such as filesystem developers or
driver developers who all need to interact with pinned pages...

								Honza
Jan Kara Dec. 22, 2021, 12:44 p.m. UTC | #94
On Tue 21-12-21 19:59:16, Jason Gunthorpe wrote:
> On Tue, Dec 21, 2021 at 06:40:30PM +0100, David Hildenbrand wrote:
> 
> > What adds complexity to correctly maintain the "exclusive" state are at
> > least:
> > * KSM (might be harder, have to think about it)
> 
> I know little about it, but isn't KSM like fork where it is trying to
> WP pages with the intention of copying them? Shouldn't KSM completely
> reject WP'ing a page that is under any kind of writable GUP?

I know little about KSM as well but I think fundamentally it has similar
requirements for anon pages as filesystems have for page cache pages e.g.
when doing block deduplication or data checksumming... I.e., it needs to
make sure data in the page is stable and nobody can modify it.

								Honza
David Hildenbrand Dec. 22, 2021, 1:09 p.m. UTC | #95
>> IIUC, our COW logic makes sure that a shared anonymous page that might
>> still be used by a R/O FOLL_GET cannot be modified, because any attempt
>> to modify it would result in a copy.
> 
> Well, we defined FOLL_PIN to mean the intent that the caller wants to access
> not only page state (for which is enough FOLL_GET and there are some users
> - mostly inside mm - who need this) but also page data. Eventually, we even
> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
> 
> For file pages we need this data vs no-data access distinction so that
> filesystems can detect when someone can be accessing page data although the
> page is unmapped.  Practically, filesystems care most about when someone
> can be *modifying* page data (we need to make sure data is stable e.g. when
> writing back data to disk or doing data checksumming or other operations)
> so using FOLL_GET when wanting to only read page data should be OK for
> filesystems but honestly I would be reluctant to break the rule of "use
> FOLL_PIN when wanting to access page data" to keep things simple and
> reasonably easy to understand for parties such as filesystem developers or
> driver developers who all need to interact with pinned pages...

Right, from an API perspective we really want people to use FOLL_PIN.

To optimize this case in particular it would help if we would have the
FOLL flags on the unpin path. Then we could just decide internally
"well, short-term R/O FOLL_PIN can be really lightweight, we can treat
this like a FOLL_GET instead". And we would need that as well if we were
to keep different counters for R/O vs. R/W pinned.
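
Such an unpin-with-flags entry point might look like this -- a sketch
with a made-up name; note that gup_put_folio() further down this thread
already takes the flags:

void unpin_user_page_flags(struct page *page, unsigned int flags)
{
        if (!(flags & FOLL_WRITE) && !(flags & FOLL_LONGTERM))
                put_page(page);         /* R/O short-term: taken like FOLL_GET */
        else
                unpin_user_page(page);  /* a real FOLL_PIN */
}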
Jan Kara Dec. 22, 2021, 2:42 p.m. UTC | #96
On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
> >> IIUC, our COW logic makes sure that a shared anonymous page that might
> >> still be used by a R/O FOLL_GET cannot be modified, because any attempt
> >> to modify it would result in a copy.
> > 
> > Well, we defined FOLL_PIN to mean the intent that the caller wants to access
> > not only page state (for which is enough FOLL_GET and there are some users
> > - mostly inside mm - who need this) but also page data. Eventually, we even
> > wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
> > internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
> > 
> > For file pages we need this data vs no-data access distinction so that
> > filesystems can detect when someone can be accessing page data although the
> > page is unmapped.  Practically, filesystems care most about when someone
> > can be *modifying* page data (we need to make sure data is stable e.g. when
> > writing back data to disk or doing data checksumming or other operations)
> > so using FOLL_GET when wanting to only read page data should be OK for
> > filesystems but honestly I would be reluctant to break the rule of "use
> > FOLL_PIN when wanting to access page data" to keep things simple and
> > reasonably easy to understand for parties such as filesystem developers or
> > driver developers who all need to interact with pinned pages...
> 
> Right, from an API perspective we really want people to use FOLL_PIN.
> 
> To optimize this case in particular it would help if we would have the
> FOLL flags on the unpin path. Then we could just decide internally
> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> this like a FOLL_GET instead". And we would need that as well if we were
> to keep different counters for R/O vs. R/W pinned.

Well, I guess the question here is: Which GUP user needs only R/O access to
page data and is so performance critical that it would be worth it to
sacrifice API clarity for speed? I'm not aware of any but I was not looking
really hard...

								Honza
David Hildenbrand Dec. 22, 2021, 2:48 p.m. UTC | #97
On 22.12.21 15:42, Jan Kara wrote:
> On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
>>>> IIUC, our COW logic makes sure that a shared anonymous page that might
>>>> still be used by a R/O FOLL_GET cannot be modified, because any attempt
>>>> to modify it would result in a copy.
>>>
>>> Well, we defined FOLL_PIN to mean the intent that the caller wants to access
>>> not only page state (for which is enough FOLL_GET and there are some users
>>> - mostly inside mm - who need this) but also page data. Eventually, we even
>>> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
>>> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
>>>
>>> For file pages we need this data vs no-data access distinction so that
>>> filesystems can detect when someone can be accessing page data although the
>>> page is unmapped.  Practically, filesystems care most about when someone
>>> can be *modifying* page data (we need to make sure data is stable e.g. when
>>> writing back data to disk or doing data checksumming or other operations)
>>> so using FOLL_GET when wanting to only read page data should be OK for
>>> filesystems but honestly I would be reluctant to break the rule of "use
>>> FOLL_PIN when wanting to access page data" to keep things simple and
>>> reasonably easy to understand for parties such as filesystem developers or
>>> driver developers who all need to interact with pinned pages...
>>
>> Right, from an API perspective we really want people to use FOLL_PIN.
>>
>> To optimize this case in particular it would help if we would have the
>> FOLL flags on the unpin path. Then we could just decide internally
>> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
>> this like a FOLL_GET instead". And we would need that as well if we were
>> to keep different counters for R/O vs. R/W pinned.
> 
> Well, I guess the question here is: Which GUP user needs only R/O access to
> page data and is so performance critical that it would be worth it to
> sacrifice API clarity for speed? I'm not aware of any but I was not looking
> really hard...

I'd be interested in examples as well. Maybe databases that use O_DIRECT
after fork()?
Jan Kara Dec. 22, 2021, 4:08 p.m. UTC | #98
On Wed 22-12-21 15:48:34, David Hildenbrand wrote:
> On 22.12.21 15:42, Jan Kara wrote:
> > On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
> >>>> IIUC, our COW logic makes sure that a shared anonymous page that might
> >>>> still be used by a R/O FOLL_GET cannot be modified, because any attempt
> >>>> to modify it would result in a copy.
> >>>
> >>> Well, we defined FOLL_PIN to mean the intent that the caller wants to access
> >>> not only page state (for which is enough FOLL_GET and there are some users
> >>> - mostly inside mm - who need this) but also page data. Eventually, we even
> >>> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
> >>> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
> >>>
> >>> For file pages we need this data vs no-data access distinction so that
> >>> filesystems can detect when someone can be accessing page data although the
> >>> page is unmapped.  Practically, filesystems care most about when someone
> >>> can be *modifying* page data (we need to make sure data is stable e.g. when
> >>> writing back data to disk or doing data checksumming or other operations)
> >>> so using FOLL_GET when wanting to only read page data should be OK for
> >>> filesystems but honestly I would be reluctant to break the rule of "use
> >>> FOLL_PIN when wanting to access page data" to keep things simple and
> >>> reasonably easy to understand for parties such as filesystem developers or
> >>> driver developers who all need to interact with pinned pages...
> >>
> >> Right, from an API perspective we really want people to use FOLL_PIN.
> >>
> >> To optimize this case in particular it would help if we would have the
> >> FOLL flags on the unpin path. Then we could just decide internally
> >> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> >> this like a FOLL_GET instead". And we would need that as well if we were
> >> to keep different counters for R/O vs. R/W pinned.
> > 
> > Well, I guess the question here is: Which GUP user needs only R/O access to
> > page data and is so performance critical that it would be worth it to
> > sacrifice API clarity for speed? I'm not aware of any but I was not looking
> > really hard...
> 
> I'd be interested in examples as well. Maybe databases that use O_DIRECT
> after fork()?

Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify
page data (and so we need to detect them both for COW and filesystem needs).
O_DIRECT writes could use FOLL_GET but at this point I'm not convinced it
is worth it.

								Honza
Matthew Wilcox Dec. 22, 2021, 4:44 p.m. UTC | #99
On Wed, Dec 22, 2021 at 05:08:46PM +0100, Jan Kara wrote:
> On Wed 22-12-21 15:48:34, David Hildenbrand wrote:
> > On 22.12.21 15:42, Jan Kara wrote:
> > > On Wed 22-12-21 14:09:41, David Hildenbrand wrote:
> > >>>> IIUC, our COW logic makes sure that a shared anonymous page that might
> > >>>> still be used by a R/O FOLL_GET cannot be modified, because any attempt
> > >>>> to modify it would result in a copy.
> > >>>
> > >>> Well, we defined FOLL_PIN to mean the intent that the caller wants to access
> > >>> not only page state (for which is enough FOLL_GET and there are some users
> > >>> - mostly inside mm - who need this) but also page data. Eventually, we even
> > >>> wanted to make FOLL_GET unavailable to broad areas of kernel (and keep it
> > >>> internal to only MM for its dirty deeds ;)) to reduce the misuse of GUP.
> > >>>
> > >>> For file pages we need this data vs no-data access distinction so that
> > >>> filesystems can detect when someone can be accessing page data although the
> > >>> page is unmapped.  Practically, filesystems care most about when someone
> > >>> can be *modifying* page data (we need to make sure data is stable e.g. when
> > >>> writing back data to disk or doing data checksumming or other operations)
> > >>> so using FOLL_GET when wanting to only read page data should be OK for
> > >>> filesystems but honestly I would be reluctant to break the rule of "use
> > >>> FOLL_PIN when wanting to access page data" to keep things simple and
> > >>> reasonably easy to understand for parties such as filesystem developers or
> > >>> driver developers who all need to interact with pinned pages...
> > >>
> > >> Right, from an API perspective we really want people to use FOLL_PIN.
> > >>
> > >> To optimize this case in particular it would help if we would have the
> > >> FOLL flags on the unpin path. Then we could just decide internally
> > >> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > >> this like a FOLL_GET instead". And we would need that as well if we were
> > >> to keep different counters for R/O vs. R/W pinned.
> > > 
> > > Well, I guess the question here is: Which GUP user needs only R/O access to
> > > page data and is so performance critical that it would be worth it to
> > > sacrifice API clarity for speed? I'm not aware of any but I was not looking
> > > really hard...
> > 
> > I'd be interested in examples as well. Maybe databases that use O_DIRECT
> > after fork()?
> 
> Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify
> page data (and so we need to detect them both for COW and filesystem needs).
> O_DIRECT writes could use FOLL_GET but at this point I'm not convinced it
> is worth it.

Wow, I didn't realise the plan was to make FOLL_PIN the "default".
I hoped it was weird crap that was going away soon.  Looks like we'd
better fix all the bugs in it then ...
Linus Torvalds Dec. 22, 2021, 6:40 p.m. UTC | #100
On Wed, Dec 22, 2021 at 8:08 AM Jan Kara <jack@suse.cz> wrote:
>
> Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify
> page data (and so we need to detect them both for COW and filesystem needs).

Well, O_DIRECT reads do, but not necessarily writes.

And hey, even reads have been dodgy in the past when we didn't really
have the pinning logic - there's been a lot of users that just wanted
it to work for their particular use-case rather than in general and in
all situations..

             Linus
Matthew Wilcox Dec. 23, 2021, 12:21 a.m. UTC | #101
On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> Right, from an API perspective we really want people to use FOLL_PIN.
> 
> To optimize this case in particular it would help if we would have the
> FOLL flags on the unpin path. Then we could just decide internally
> "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> this like a FOLL_GET instead". And we would need that as well if we were
> to keep different counters for R/O vs. R/W pinned.

FYI, in my current tree, there's a gup_put_folio() which replaces
put_compound_head:

static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
{
        if (flags & FOLL_PIN) {
                node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
                if (hpage_pincount_available(&folio->page))
                        hpage_pincount_sub(&folio->page, refs);
                else
                        refs *= GUP_PIN_COUNTING_BIAS;
        }

        folio_put_refs(folio, refs);
}

That can become non-static if it's needed.  I'm still working on that
series, because I'd like to get it to a point where we return one
folio pointer instead of N page pointers.  Not quite there yet.
Jan Kara Dec. 23, 2021, 12:54 p.m. UTC | #102
On Wed 22-12-21 10:40:18, Linus Torvalds wrote:
> On Wed, Dec 22, 2021 at 8:08 AM Jan Kara <jack@suse.cz> wrote:
> >
> > Well, but O_DIRECT reads must use FOLL_PIN in any case because they modify
> > page data (and so we need to detect them both for COW and filesystem needs).
> 
> Well, O_DIRECT reads do, but not necessarily writes.

I agree.

> And hey, even reads have been dodgy in the past when we didn't really
> have the pinning logic - there's been a lot of users that just wanted
> it to work for their particular use-case rather than in general and in
> all situations..

Yes, but currently a malicious user can take the system down (BUG_ON) or
cause DIF/DIX failures if he is nasty and tries hard enough with O_DIRECT
reads (practically, the window is small so I haven't really seen a report
that I could trace to O_DIRECT reads but in principle the problem is the
same as with pinning & dirtying done e.g. by video capture drivers and
there we've seen these problems happen). So forcing pinning for O_DIRECT
reads is IMO mandatory.

								Honza
Linus Torvalds Dec. 23, 2021, 5:18 p.m. UTC | #103
On Thu, Dec 23, 2021 at 4:54 AM Jan Kara <jack@suse.cz> wrote:
>
> So forcing pinning for O_DIRECT reads is IMO mandatory.

I don't disagree.

And I do think the eventual aim should be to do it for writes too even
if they don't necessarily require it (since they write _from_ the VM
data, not _to_ the VM data - the "read-vs-write" direction has always
been confusing when it comes to GUP).

Partly just for consistency in the IO paths - I think people want to
share as much as possible in there - but also just to make sure that
we're all done with the "wrong-way-cow" kind of issues for good.

If we get to the point where the legacy GUP is used only for very
special things (looking up physical pages for debug and trace purposes
etc), I think that would be lovely.

That may be a pretty long-range goal, though.

              Linus
Jason Gunthorpe Dec. 24, 2021, 2:53 a.m. UTC | #104
On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
> On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> > Right, from an API perspective we really want people to use FOLL_PIN.
> > 
> > To optimize this case in particular it would help if we would have the
> > FOLL flags on the unpin path. Then we could just decide internally
> > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > this like a FOLL_GET instead". And we would need that as well if we were
> > to keep different counters for R/O vs. R/W pinned.
> 
> FYI, in my current tree, there's a gup_put_folio() which replaces
> put_compound_head:
> 
> static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> {
>         if (flags & FOLL_PIN) {
>                 node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
>                 if (hpage_pincount_available(&folio->page))
>                         hpage_pincount_sub(&folio->page, refs);
>                 else
>                         refs *= GUP_PIN_COUNTING_BIAS;
>         }
> 
>         folio_put_refs(folio, refs);
> }
> 
> That can become non-static if it's needed.  I'm still working on that
> series, because I'd like to get it to a point where we return one
> folio pointer instead of N page pointers.  Not quite there yet.

I'm keen to see what that looks like, every driver I'm working on that
calls PUP goes through gyrations to recover contiguous pages, so this
is most welcomed!

Jason
Matthew Wilcox Dec. 24, 2021, 4:53 a.m. UTC | #105
On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote:
> On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
> > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> > > Right, from an API perspective we really want people to use FOLL_PIN.
> > > 
> > > To optimize this case in particular it would help if we would have the
> > > FOLL flags on the unpin path. Then we could just decide internally
> > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > > this like a FOLL_GET instead". And we would need that as well if we were
> > > to keep different counters for R/O vs. R/W pinned.
> > 
> > FYI, in my current tree, there's a gup_put_folio() which replaces
> > put_compound_head:
> > 
> > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> > {
> >         if (flags & FOLL_PIN) {
> >                 node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
> >                 if (hpage_pincount_available(&folio->page))
> >                         hpage_pincount_sub(&folio->page, refs);
> >                 else
> >                         refs *= GUP_PIN_COUNTING_BIAS;
> >         }
> > 
> >         folio_put_refs(folio, refs);
> > }
> > 
> > That can become non-static if it's needed.  I'm still working on that
> > series, because I'd like to get it to a point where we return one
> > folio pointer instead of N page pointers.  Not quite there yet.
> 
> I'm keen to see what that looks like, every driver I'm working on that
> calls PUP goes through gyrations to recover contiguous pages, so this
> is most welcomed!

I'm about to take some time off, so alas, you won't see it any time
soon.  It'd be good to talk with some of the interested users because
it's actually a pretty tricky problem.  We can't just return an array
of the struct folios because the actual memory you want to access
might be anywhere in that folio, and you don't want to have to redo
the lookup just to find out which subpages of the folio are meant.

So I'm currently thinking about returning a bio_vec:

struct bio_vec {
        struct page     *bv_page;
        unsigned int    bv_len;
        unsigned int    bv_offset;
};

In the iomap patchset which should go upstream in the next merge window,
you can iterate over a bio like this:

        struct folio_iter fi;

        bio_for_each_folio_all(fi, bio)
                iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);

There aren't any equivalent helpers for a bvec yet, but obviously we can
add them so that you can iterate over each folio in a contiguous range.

But now that each component in it is variable length, the caller can't
know how large an array of bio_vecs to allocate.

1. The callee can allocate the array and let the caller free it when it's
   finished
2. The caller passes in a (small, fixed-size, on-stack) array of bio_vecs
   over (potentially) multiple calls.
3. The caller can overallocate and ignore that most of the array isn't
   used.

Any preferences?  I don't like #3.
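
For illustration, option 2 on the caller side could look like this
sketch -- pin_user_bvecs() is a made-up interface that fills at most
'nr' bio_vecs and returns how many it filled:

        struct bio_vec bv[16];          /* small, fixed-size, on-stack */
        unsigned long addr = start;
        int i, n;

        while (addr < end) {
                n = pin_user_bvecs(addr, end - addr, FOLL_WRITE,
                                   bv, ARRAY_SIZE(bv));
                if (n <= 0)
                        break;          /* error handling elided */
                for (i = 0; i < n; i++) {
                        /* consume bv[i].bv_page / bv_offset / bv_len */
                        addr += bv[i].bv_len;
                }
        }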
Jason Gunthorpe Jan. 4, 2022, 12:33 a.m. UTC | #106
On Fri, Dec 24, 2021 at 04:53:38AM +0000, Matthew Wilcox wrote:
> On Thu, Dec 23, 2021 at 10:53:09PM -0400, Jason Gunthorpe wrote:
> > On Thu, Dec 23, 2021 at 12:21:06AM +0000, Matthew Wilcox wrote:
> > > On Wed, Dec 22, 2021 at 02:09:41PM +0100, David Hildenbrand wrote:
> > > > Right, from an API perspective we really want people to use FOLL_PIN.
> > > > 
> > > > To optimize this case in particular it would help if we would have the
> > > > FOLL flags on the unpin path. Then we could just decide internally
> > > > "well, short-term R/O FOLL_PIN can be really lightweight, we can treat
> > > > this like a FOLL_GET instead". And we would need that as well if we were
> > > > to keep different counters for R/O vs. R/W pinned.
> > > 
> > > FYI, in my current tree, there's a gup_put_folio() which replaces
> > > put_compound_head:
> > > 
> > > static void gup_put_folio(struct folio *folio, int refs, unsigned int flags)
> > > {
> > >         if (flags & FOLL_PIN) {
> > >                 node_stat_mod_folio(folio, NR_FOLL_PIN_RELEASED, refs);
> > >                 if (hpage_pincount_available(&folio->page))
> > >                         hpage_pincount_sub(&folio->page, refs);
> > >                 else
> > >                         refs *= GUP_PIN_COUNTING_BIAS;
> > >         }
> > > 
> > >         folio_put_refs(folio, refs);
> > > }
> > > 
> > > That can become non-static if it's needed.  I'm still working on that
> > > series, because I'd like to get it to a point where we return one
> > > folio pointer instead of N page pointers.  Not quite there yet.
> > 
> > I'm keen to see what that looks like, every driver I'm working on that
> > calls PUP goes through gyrations to recover contiguous pages, so this
> > is most welcomed!
> 
> I'm about to take some time off, so alas, you won't see it any time
> soon.  It'd be good to talk with some of the interested users because
> it's actually a pretty tricky problem.

Sure, it is a good idea.

> We can't just return an array of struct folios, because the actual
> memory you want to access might be anywhere in a folio, and you don't
> want to have to redo the lookup just to find out which subpages of
> each folio are meant.

Yep

> So I'm currently thinking about returning a bio_vec:
> 
> struct bio_vec {
>         struct page     *bv_page;
>         unsigned int    bv_len;
>         unsigned int    bv_offset;
> };

The cases I'm looking at basically want an efficient list of physical
addresses + lengths. They don't care about pages or folios, e.g. often
the next step is to build an SGL and DMA map it, which largely ignores
all of that.

As the memory used to hold the output of pin_user_pages() is all
temporary, there is a sensitivity to allocate the memory quickly, but
also to have enough of it so that we don't have to do redundant work
in pin_user_pages() - e.g. traversing to the same PMD table again and
again.

> But now that each component in it is variable length, the caller can't
> know how large an array of bio_vecs to allocate.

And the array entry is now 2x the size and there is no way to scatter
the array to 4k segments?

> 1. The callee can allocate the array and let the caller free it when it's
>    finished

It is not bad, but a bit tricky: a lot of the GUP code executes in an
IRQ-disabled state, so it has to use a pre-allocating scheme. We also
can't scan everything twice and hope it didn't change, so exact
preallocation doesn't seem likely either.

> 2. The caller passes in a (small, fixed-size, on-stack) array of bio_vecs
>    over (potentially) multiple calls.

It is slow, because we do redundant work traversing the same locks and
page tables again and again.

> 3. The caller can overallocate and ignore that most of the array isn't
>    used.
> 
> Any preferences?  I don't like #3.

#3 is OK for my applications because we immediately turn around and
copy the output to something else and free the memory anyhow...

However, being an array means we can't reliably allocate more than 4k,
and at 16 bytes per entry that isn't even enough to store a full PTE
table (512 entries would already need 8k).

What would be nice for these cases is if the caller could supply an
array of 4k pages and GUP would fill them in. In many cases we'd
probably pass in up to 2M worth of pages or something.

There is a batching balance to strike here: we want to minimize the
temporary memory consumed by GUP's output (and the time to allocate
it!) but also minimize the redundant work inside GUP of repeatedly
walking the same tables and locks.

E.g. ideally GUP would stop at some natural alignment boundary if it
could tell it can't fully fill the buffer. Then the next iteration
would not redo the same locks.

I was once thinking about something like storing an array of PFNs and
using the high bits to encode that the PFN covers more than 4k. It
would allow efficient packing of the common fragmented cases. To make
it work you'd need to have each 4k page grow the PFN list up from the
start and the PFN sizes down from the end. A size entry is only
consumed if the PFN bits can't encode the size directly, so the
packing can be a perfect 8 bytes per PFN for the common 4k and 2M
aligned cases.
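
Roughly, just to sketch the encoding (names and the bit layout are
made up here):

/*
 * Top two bits of each 8-byte entry encode the chunk size. The
 * common sizes encode inline; anything else consumes one of the
 * size slots growing down from the end of the 4k page.
 */
#define PFN_CHUNK_4K	(0ULL << 62)
#define PFN_CHUNK_2M	(1ULL << 62)
#define PFN_CHUNK_1G	(2ULL << 62)
#define PFN_CHUNK_EXT	(3ULL << 62)	/* length held in a size entry */
#define PFN_CHUNK_MASK	(3ULL << 62)

static inline u64 pfn_chunk_encode(u64 pfn, u64 type)
{
	return pfn | type;
}

static inline u64 pfn_chunk_pfn(u64 entry)
{
	return entry & ~PFN_CHUNK_MASK;
}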

Jason
diff mbox series

Patch

diff --git a/include/linux/mm.h b/include/linux/mm.h
index a7e4a9e7d807..37d1fb2f865e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -436,6 +436,9 @@  extern pgprot_t protection_map[16];
  * @FAULT_FLAG_REMOTE: The fault is not for current task/mm.
  * @FAULT_FLAG_INSTRUCTION: The fault was during an instruction fetch.
  * @FAULT_FLAG_INTERRUPTIBLE: The fault can be interrupted by non-fatal signals.
+ * @FAULT_FLAG_UNSHARE: The fault is an unsharing request to unshare a
+ *                      shared anonymous page (-> mapped R/O). Does not apply
+ *                      to KSM.
  *
  * About @FAULT_FLAG_ALLOW_RETRY and @FAULT_FLAG_TRIED: we can specify
  * whether we would allow page faults to retry by specifying these two
@@ -467,6 +470,7 @@  enum fault_flag {
 	FAULT_FLAG_REMOTE =		1 << 7,
 	FAULT_FLAG_INSTRUCTION =	1 << 8,
 	FAULT_FLAG_INTERRUPTIBLE =	1 << 9,
+	FAULT_FLAG_UNSHARE =		1 << 10,
 };
 
 /*
diff --git a/mm/memory.c b/mm/memory.c
index 8f1de811a1dc..7253a2ad4320 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2707,8 +2707,9 @@  EXPORT_SYMBOL_GPL(apply_to_existing_page_range);
  * read non-atomically.  Before making any commitment, on those architectures
  * or configurations (e.g. i386 with PAE) which might give a mix of unmatched
  * parts, do_swap_page must check under lock before unmapping the pte and
- * proceeding (but do_wp_page is only called after already making such a check;
- * and do_anonymous_page can safely check later on).
+ * proceeding (but do_wp_page_cow/do_wp_page_unshare are only called after
+ * already making such a check; and do_anonymous_page can safely check later
+ * on).
  */
 static inline int pte_unmap_same(struct vm_fault *vmf)
 {
@@ -2726,8 +2727,8 @@  static inline int pte_unmap_same(struct vm_fault *vmf)
 	return same;
 }
 
-static inline bool cow_user_page(struct page *dst, struct page *src,
-				 struct vm_fault *vmf)
+static inline bool __wp_page_copy_user(struct page *dst, struct page *src,
+				       struct vm_fault *vmf)
 {
 	bool ret;
 	void *kaddr;
@@ -2952,7 +2953,8 @@  static inline void wp_page_reuse(struct vm_fault *vmf)
 }
 
 /*
- * Handle the case of a page which we actually need to copy to a new page.
+ * Handle the case of a page which we actually need to copy to a new page,
+ * either due to COW or unsharing.
  *
  * Called with mmap_lock locked and the old page referenced, but
  * without the ptl held.
@@ -2967,7 +2969,7 @@  static inline void wp_page_reuse(struct vm_fault *vmf)
  *   held to the old page, as well as updating the rmap.
  * - In any case, unlock the PTL and drop the reference we took to the old page.
  */
-static vm_fault_t wp_page_copy(struct vm_fault *vmf)
+static vm_fault_t wp_page_copy(struct vm_fault *vmf, bool unshare)
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct mm_struct *mm = vma->vm_mm;
@@ -2991,7 +2993,7 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		if (!new_page)
 			goto oom;
 
-		if (!cow_user_page(new_page, old_page, vmf)) {
+		if (!__wp_page_copy_user(new_page, old_page, vmf)) {
 			/*
 			 * COW failed, if the fault was solved by other,
 			 * it's fine. If not, userspace would re-fault on
@@ -3033,7 +3035,14 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
 		entry = mk_pte(new_page, vma->vm_page_prot);
 		entry = pte_sw_mkyoung(entry);
-		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (unlikely(unshare)) {
+			if (pte_soft_dirty(vmf->orig_pte))
+				entry = pte_mksoft_dirty(entry);
+			if (pte_uffd_wp(vmf->orig_pte))
+				entry = pte_mkuffd_wp(entry);
+		} else {
+			entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		}
 
 		/*
 		 * Clear the pte entry and flush it first, before updating the
@@ -3050,6 +3059,7 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 		 * mmu page tables (such as kvm shadow page tables), we want the
 		 * new page to be mapped directly into the secondary page table.
 		 */
+		BUG_ON(unshare && pte_write(entry));
 		set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
 		update_mmu_cache(vma, vmf->address, vmf->pte);
 		if (old_page) {
@@ -3109,6 +3119,8 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 			free_swap_cache(old_page);
 		put_page(old_page);
 	}
+	if (unlikely(unshare))
+		return 0;
 	return page_copied ? VM_FAULT_WRITE : 0;
 oom_free_new:
 	put_page(new_page);
@@ -3118,6 +3130,70 @@  static vm_fault_t wp_page_copy(struct vm_fault *vmf)
 	return VM_FAULT_OOM;
 }
 
+static __always_inline vm_fault_t wp_page_cow(struct vm_fault *vmf)
+{
+	return wp_page_copy(vmf, false);
+}
+
+static __always_inline vm_fault_t wp_page_unshare(struct vm_fault *vmf)
+{
+	return wp_page_copy(vmf, true);
+}
+
+/*
+ * This routine handles present pages, when GUP tries to take a read-only
+ * pin on a shared anonymous page. It's similar to do_wp_page_cow(), except that
+ * it keeps the pages mapped read-only and doesn't apply to KSM pages.
+ *
+ * If a parent process forks a child process, we share anonymous pages between
+ * both processes with COW semantics. Both processes will map these now shared
+ * anonymous pages read-only, and any write access triggers unsharing via COW.
+ *
+ * If the child takes a read-only pin on such a page (i.e., FOLL_WRITE is not
+ * set) and then unmaps the target page, we have:
+ *
+ * * page has mapcount == 1 and refcount > 1
+ * * page is mapped read-only into the parent
+ * * page is pinned by the child and can still be read
+ *
+ * For now, we rely on refcount > 1 to perform the COW and trigger unsharing.
+ * However, that leads to other hard-to-fix issues.
+ *
+ * GUP-triggered unsharing provides a parallel approach to trigger unsharing
+ * early, still allowing for relying on mapcount > 1 in COW code instead of on
+ * imprecise refcount > 1. Note that when we don't actually take a reference
+ * on the target page but instead use mmu notifiers to synchronize with changes
+ * in the process page tables, unsharing is not required.
+ *
+ * Note that in the above scenario, it's impossible to distinguish during the
+ * write fault between:
+ *
+ * a) The parent process performed the pin and the child no longer has access
+ *    to the page.
+ *
+ * b) The child process performed the pin and the child still has access to the
+ *    page.
+ *
+ * In case of a), if we're dealing with a long-term read-only pin, the COW
+ * in the parent will result in the pinned page differing from the page actually
+ * mapped into the process page tables in the parent: loss of synchronicity.
+ * Therefore, we really want to perform the copy when the read-only pin happens.
+ */
+static vm_fault_t do_wp_page_unshare(struct vm_fault *vmf)
+	__releases(vmf->ptl)
+{
+	vmf->page = vm_normal_page(vmf->vma, vmf->address, vmf->orig_pte);
+	if (vmf->page && PageAnon(vmf->page) && !PageKsm(vmf->page) &&
+	    page_mapcount(vmf->page) > 1) {
+		get_page(vmf->page);
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return wp_page_unshare(vmf);
+	}
+	vmf->page = NULL;
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	return 0;
+}
+
 /**
  * finish_mkwrite_fault - finish page fault for a shared mapping, making PTE
  *			  writeable once the page is prepared
@@ -3226,7 +3302,7 @@  static vm_fault_t wp_page_shared(struct vm_fault *vmf)
  * but allow concurrent faults), with pte both mapped and locked.
  * We return with mmap_lock still held, but pte unmapped and unlocked.
  */
-static vm_fault_t do_wp_page(struct vm_fault *vmf)
+static vm_fault_t do_wp_page_cow(struct vm_fault *vmf)
 	__releases(vmf->ptl)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -3258,7 +3334,7 @@  static vm_fault_t do_wp_page(struct vm_fault *vmf)
 			return wp_pfn_shared(vmf);
 
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
-		return wp_page_copy(vmf);
+		return wp_page_cow(vmf);
 	}
 
 	/*
@@ -3296,7 +3372,7 @@  static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	get_page(vmf->page);
 
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
-	return wp_page_copy(vmf);
+	return wp_page_cow(vmf);
 }
 
 static void unmap_mapping_range_vma(struct vm_area_struct *vma,
@@ -3670,7 +3746,7 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 	}
 
 	if (vmf->flags & FAULT_FLAG_WRITE) {
-		ret |= do_wp_page(vmf);
+		ret |= do_wp_page_cow(vmf);
 		if (ret & VM_FAULT_ERROR)
 			ret &= VM_FAULT_ERROR;
 		goto out;
@@ -4428,6 +4504,16 @@  static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
 /* `inline' is required to avoid gcc 4.1.2 build error */
 static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 {
+	if (vmf->flags & FAULT_FLAG_UNSHARE) {
+		/*
+		 * We'll simply split the THP and handle unsharing on the
+		 * PTE level. Unsharing only applies to anon THPs and we
+		 * shouldn't ever find them inside shared mappings.
+		 */
+		if (WARN_ON_ONCE(vmf->vma->vm_flags & VM_SHARED))
+			return 0;
+		goto split_fallback;
+	}
 	if (vma_is_anonymous(vmf->vma)) {
 		if (userfaultfd_huge_pmd_wp(vmf->vma, vmf->orig_pmd))
 			return handle_userfault(vmf, VM_UFFD_WP);
@@ -4440,7 +4526,8 @@  static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf)
 			return ret;
 	}
 
-	/* COW or write-notify handled on pte level: split pmd. */
+split_fallback:
+	/* COW, unsharing or write-notify handled on pte level: split pmd. */
 	__split_huge_pmd(vmf->vma, vmf->pmd, vmf->address, false, NULL);
 
 	return VM_FAULT_FALLBACK;
@@ -4551,8 +4638,11 @@  static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 			return do_fault(vmf);
 	}
 
-	if (!pte_present(vmf->orig_pte))
-		return do_swap_page(vmf);
+	if (!pte_present(vmf->orig_pte)) {
+		if (likely(!(vmf->flags & FAULT_FLAG_UNSHARE)))
+			return do_swap_page(vmf);
+		return 0;
+	}
 
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);
@@ -4564,9 +4654,13 @@  static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 		update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
 		goto unlock;
 	}
-	if (vmf->flags & FAULT_FLAG_WRITE) {
-		if (!pte_write(entry))
-			return do_wp_page(vmf);
+	if (vmf->flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) {
+		if (!pte_write(entry)) {
+			if (vmf->flags & FAULT_FLAG_WRITE)
+				return do_wp_page_cow(vmf);
+			else
+				return do_wp_page_unshare(vmf);
+		}
 		entry = pte_mkdirty(entry);
 	}
 	entry = pte_mkyoung(entry);
@@ -4607,7 +4701,6 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 		.pgoff = linear_page_index(vma, address),
 		.gfp_mask = __get_fault_gfp_mask(vma),
 	};
-	unsigned int dirty = flags & FAULT_FLAG_WRITE;
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -4634,7 +4727,7 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 
 			/* NUMA case for anonymous PUDs would go here */
 
-			if (dirty && !pud_write(orig_pud)) {
+			if ((flags & FAULT_FLAG_WRITE) && !pud_write(orig_pud)) {
 				ret = wp_huge_pud(&vmf, orig_pud);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;
@@ -4672,7 +4765,8 @@  static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
 			if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma))
 				return do_huge_pmd_numa_page(&vmf);
 
-			if (dirty && !pmd_write(vmf.orig_pmd)) {
+			if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) &&
+			    !pmd_write(vmf.orig_pmd)) {
 				ret = wp_huge_pmd(&vmf);
 				if (!(ret & VM_FAULT_FALLBACK))
 					return ret;