[v3,8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU

Message ID	20250418174959.1431962-9-surenb@google.com
State	New
Headers	show Received: from mail-pl1-f201.google.com (mail-pl1-f201.google.com [209.85.214.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 6BC8B22578A for <linux-kselftest@vger.kernel.org>; Fri, 18 Apr 2025 17:50:21 +0000 (UTC) Date: Fri, 18 Apr 2025 10:49:59 -0700 In-Reply-To: <20250418174959.1431962-1-surenb@google.com> Precedence: bulk Mime-Version: 1.0 References: <20250418174959.1431962-1-surenb@google.com> Message-ID: <20250418174959.1431962-9-surenb@google.com> Subject: [PATCH v3 8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU From: Suren Baghdasaryan <surenb@google.com> To: akpm@linux-foundation.org Cc: Liam.Howlett@oracle.com, lorenzo.stoakes@oracle.com, david@redhat.com, vbabka@suse.cz, peterx@redhat.com, jannh@google.com, hannes@cmpxchg.org, mhocko@kernel.org, paulmck@kernel.org, shuah@kernel.org, adobriyan@gmail.com, brauner@kernel.org, josef@toxicpanda.com, yebin10@huawei.com, linux@weissschuh.net, willy@infradead.org, osalvador@suse.de, andrii@kernel.org, ryan.roberts@arm.com, christophe.leroy@csgroup.eu, linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, surenb@google.com Content-Type: text/plain; charset="UTF-8"
Series	perform /proc/pid/maps read and PROCMAP_QUERY under RCU \| expand [v3,0/8] perform /proc/pid/maps read and PROCMAP_QUERY under RCU [v3,2/8] selftests/proc: extend /proc/pid/maps tearing test to include vma resizing [v3,4/8] selftests/proc: test PROCMAP_QUERY ioctl while vma is concurrently modified [v3,6/8] mm: make vm_area_struct anon_name field RCU-safe [v3,8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU

Suren Baghdasaryan April 18, 2025, 5:49 p.m. UTC

Utilize speculative vma lookup to find and snapshot a vma without
taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
address space modifications are detected and the lookup is retried.
While we take the mmap_lock for reading during such contention, we
do that momentarily only to record new mm_wr_seq counter.
This change is designed to reduce mmap_lock contention and prevent
PROCMAP_QUERY ioctl calls (often a low priority task, such as
monitoring/data collection services) from blocking address space
updates.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 55 insertions(+), 8 deletions(-)

Andrii Nakryiko April 22, 2025, 10:54 p.m. UTC | #1

On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
>
> Utilize speculative vma lookup to find and snapshot a vma without
> taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> address space modifications are detected and the lookup is retried.
> While we take the mmap_lock for reading during such contention, we
> do that momentarily only to record new mm_wr_seq counter.

PROCMAP_QUERY is an even more obvious candidate for fully lockless
speculation, IMO (because it's more obvious that vma's use is
localized to do_procmap_query(), instead of being spread across
m_start/m_next and m_show as with seq_file approach). We do
rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
mmap_read_lock), use that VMA to produce (speculative) output, and
then validate that VMA or mm_struct didn't change with
mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
No need for vma_copy and any gets/puts, no?

> This change is designed to reduce mmap_lock contention and prevent
> PROCMAP_QUERY ioctl calls (often a low priority task, such as
> monitoring/data collection services) from blocking address space
> updates.
>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 55 insertions(+), 8 deletions(-)
>

[...]

Suren Baghdasaryan April 23, 2025, 9:53 p.m. UTC | #2

On Tue, Apr 22, 2025 at 3:54 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > Utilize speculative vma lookup to find and snapshot a vma without
> > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > address space modifications are detected and the lookup is retried.
> > While we take the mmap_lock for reading during such contention, we
> > do that momentarily only to record new mm_wr_seq counter.
>
> PROCMAP_QUERY is an even more obvious candidate for fully lockless
> speculation, IMO (because it's more obvious that vma's use is
> localized to do_procmap_query(), instead of being spread across
> m_start/m_next and m_show as with seq_file approach). We do
> rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> mmap_read_lock), use that VMA to produce (speculative) output, and
> then validate that VMA or mm_struct didn't change with
> mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> No need for vma_copy and any gets/puts, no?

Yeah, since we can simply retry, this should indeed work without
trying to stabilize the vma. I'll update the code to simplify this.
Thanks!

>
> > This change is designed to reduce mmap_lock contention and prevent
> > PROCMAP_QUERY ioctl calls (often a low priority task, such as
> > monitoring/data collection services) from blocking address space
> > updates.
> >
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  fs/proc/task_mmu.c | 63 ++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 55 insertions(+), 8 deletions(-)
> >
>
> [...]

Jann Horn April 29, 2025, 3:55 p.m. UTC | #3

On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > Utilize speculative vma lookup to find and snapshot a vma without
> > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > address space modifications are detected and the lookup is retried.
> > While we take the mmap_lock for reading during such contention, we
> > do that momentarily only to record new mm_wr_seq counter.
>
> PROCMAP_QUERY is an even more obvious candidate for fully lockless
> speculation, IMO (because it's more obvious that vma's use is
> localized to do_procmap_query(), instead of being spread across
> m_start/m_next and m_show as with seq_file approach). We do
> rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> mmap_read_lock), use that VMA to produce (speculative) output, and
> then validate that VMA or mm_struct didn't change with
> mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> No need for vma_copy and any gets/puts, no?

I really strongly dislike this "fully lockless" approach because it
means we get data races all over the place, and it gets hard to reason
about what happens especially if we do anything other than reading
plain data from the VMA. When reading the implementation of
do_procmap_query(), at basically every memory read you'd have to think
twice as hard to figure out which fields can be concurrently updated
elsewhere and whether the subsequent sequence count recheck can
recover from the resulting badness.

Just as one example, I think get_vma_name() could (depending on
compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
between "if (vma->vm_ops && vma->vm_ops->name)" and
"vma->vm_ops->name(vma)". And I think this illustrates how the "fully
lockless" approach creates more implicit assumptions about the
behavior of core MM code, which could be broken by future changes to
MM code.

Suren Baghdasaryan April 29, 2025, 5:14 p.m. UTC | #4

On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
>
> On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > Utilize speculative vma lookup to find and snapshot a vma without
> > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > address space modifications are detected and the lookup is retried.
> > > While we take the mmap_lock for reading during such contention, we
> > > do that momentarily only to record new mm_wr_seq counter.
> >
> > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > speculation, IMO (because it's more obvious that vma's use is
> > localized to do_procmap_query(), instead of being spread across
> > m_start/m_next and m_show as with seq_file approach). We do
> > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > mmap_read_lock), use that VMA to produce (speculative) output, and
> > then validate that VMA or mm_struct didn't change with
> > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > No need for vma_copy and any gets/puts, no?
>
> I really strongly dislike this "fully lockless" approach because it
> means we get data races all over the place, and it gets hard to reason
> about what happens especially if we do anything other than reading
> plain data from the VMA. When reading the implementation of
> do_procmap_query(), at basically every memory read you'd have to think
> twice as hard to figure out which fields can be concurrently updated
> elsewhere and whether the subsequent sequence count recheck can
> recover from the resulting badness.
>
> Just as one example, I think get_vma_name() could (depending on
> compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> between "if (vma->vm_ops && vma->vm_ops->name)" and
> "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> lockless" approach creates more implicit assumptions about the
> behavior of core MM code, which could be broken by future changes to
> MM code.

Yeah, I'll need to re-evaluate such an approach after your review. I
like having get_stable_vma() to obtain a completely stable version of
the vma in a localized place and then stop worrying about possible
races. If implemented correctly, would that be enough to address your
concern, Jann?

Jann Horn April 29, 2025, 5:24 p.m. UTC | #5

On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@google.com> wrote:
> On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
> > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > <andrii.nakryiko@gmail.com> wrote:
> > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > address space modifications are detected and the lookup is retried.
> > > > While we take the mmap_lock for reading during such contention, we
> > > > do that momentarily only to record new mm_wr_seq counter.
> > >
> > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > speculation, IMO (because it's more obvious that vma's use is
> > > localized to do_procmap_query(), instead of being spread across
> > > m_start/m_next and m_show as with seq_file approach). We do
> > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > then validate that VMA or mm_struct didn't change with
> > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > No need for vma_copy and any gets/puts, no?
> >
> > I really strongly dislike this "fully lockless" approach because it
> > means we get data races all over the place, and it gets hard to reason
> > about what happens especially if we do anything other than reading
> > plain data from the VMA. When reading the implementation of
> > do_procmap_query(), at basically every memory read you'd have to think
> > twice as hard to figure out which fields can be concurrently updated
> > elsewhere and whether the subsequent sequence count recheck can
> > recover from the resulting badness.
> >
> > Just as one example, I think get_vma_name() could (depending on
> > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > lockless" approach creates more implicit assumptions about the
> > behavior of core MM code, which could be broken by future changes to
> > MM code.
>
> Yeah, I'll need to re-evaluate such an approach after your review. I
> like having get_stable_vma() to obtain a completely stable version of
> the vma in a localized place and then stop worrying about possible
> races. If implemented correctly, would that be enough to address your
> concern, Jann?

Yes, I think a stable local snapshot of the VMA (where tricky data
races are limited to the VMA snapshotting code) is a good tradeoff.

Andrii Nakryiko May 1, 2025, 10:10 p.m. UTC | #6

On Tue, Apr 29, 2025 at 10:25 AM Jann Horn <jannh@google.com> wrote:
>
> On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
> > > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > > <andrii.nakryiko@gmail.com> wrote:
> > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > > address space modifications are detected and the lookup is retried.
> > > > > While we take the mmap_lock for reading during such contention, we
> > > > > do that momentarily only to record new mm_wr_seq counter.
> > > >
> > > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > > speculation, IMO (because it's more obvious that vma's use is
> > > > localized to do_procmap_query(), instead of being spread across
> > > > m_start/m_next and m_show as with seq_file approach). We do
> > > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > > then validate that VMA or mm_struct didn't change with
> > > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > > No need for vma_copy and any gets/puts, no?
> > >
> > > I really strongly dislike this "fully lockless" approach because it
> > > means we get data races all over the place, and it gets hard to reason
> > > about what happens especially if we do anything other than reading
> > > plain data from the VMA. When reading the implementation of
> > > do_procmap_query(), at basically every memory read you'd have to think
> > > twice as hard to figure out which fields can be concurrently updated
> > > elsewhere and whether the subsequent sequence count recheck can
> > > recover from the resulting badness.
> > >
> > > Just as one example, I think get_vma_name() could (depending on
> > > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > > lockless" approach creates more implicit assumptions about the
> > > behavior of core MM code, which could be broken by future changes to
> > > MM code.
> >
> > Yeah, I'll need to re-evaluate such an approach after your review. I
> > like having get_stable_vma() to obtain a completely stable version of
> > the vma in a localized place and then stop worrying about possible
> > races. If implemented correctly, would that be enough to address your
> > concern, Jann?
>
> Yes, I think a stable local snapshot of the VMA (where tricky data
> races are limited to the VMA snapshotting code) is a good tradeoff.

I'm not sure I agree with VMA snapshot being better either, tbh. It is
error-prone to have a byte-by-byte local copy of VMA (which isn't
really that VMA anymore), and passing it into ops callbacks (which
expect "real" VMA)... Who guarantees that this won't backfire,
depending on vm_ops implementations? And constantly copying 176+ bytes
just to access a few fields out of it is a bit unfortunate...

Also taking mmap_read_lock() sort of defeats the point of "RCU-only
access". It's still locking/unlocking and bouncing cache lines between
writer and reader frequently. How slow is per-VMA formatting? If we
take mmap_read_lock, format VMA information into a buffer under this
lock, and drop the mmap_read_lock, would it really be that much slower
compared to what Suren is doing in this patch set? And if no, that
would be so much simpler compared to this semi-locked/semi-RCU way
that is added in this patch set, no?

But I do agree that vma->vm_ops->name access is hard to do in a
completely lockless way reliably. But also how frequently VMAs have
custom names/anon_vma_name? What if we detect that VMA has some
"fancy" functionality (like this custom name thing), and just fallback
to mmap_read_lock-protected logic, which needs to be supported as a
fallback even for lockless approach?

This way we can process most (typical) VMAs completely locklessly,
while not adding any extra assumptions for all the potentially
complicated data pieces. WDYT?

Jann Horn May 2, 2025, 3:11 p.m. UTC | #7

On Fri, May 2, 2025 at 12:10 AM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
> On Tue, Apr 29, 2025 at 10:25 AM Jann Horn <jannh@google.com> wrote:
> > On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
> > > > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > > > address space modifications are detected and the lookup is retried.
> > > > > > While we take the mmap_lock for reading during such contention, we
> > > > > > do that momentarily only to record new mm_wr_seq counter.
> > > > >
> > > > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > > > speculation, IMO (because it's more obvious that vma's use is
> > > > > localized to do_procmap_query(), instead of being spread across
> > > > > m_start/m_next and m_show as with seq_file approach). We do
> > > > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > > > then validate that VMA or mm_struct didn't change with
> > > > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > > > No need for vma_copy and any gets/puts, no?
> > > >
> > > > I really strongly dislike this "fully lockless" approach because it
> > > > means we get data races all over the place, and it gets hard to reason
> > > > about what happens especially if we do anything other than reading
> > > > plain data from the VMA. When reading the implementation of
> > > > do_procmap_query(), at basically every memory read you'd have to think
> > > > twice as hard to figure out which fields can be concurrently updated
> > > > elsewhere and whether the subsequent sequence count recheck can
> > > > recover from the resulting badness.
> > > >
> > > > Just as one example, I think get_vma_name() could (depending on
> > > > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > > > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > > > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > > > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > > > lockless" approach creates more implicit assumptions about the
> > > > behavior of core MM code, which could be broken by future changes to
> > > > MM code.
> > >
> > > Yeah, I'll need to re-evaluate such an approach after your review. I
> > > like having get_stable_vma() to obtain a completely stable version of
> > > the vma in a localized place and then stop worrying about possible
> > > races. If implemented correctly, would that be enough to address your
> > > concern, Jann?
> >
> > Yes, I think a stable local snapshot of the VMA (where tricky data
> > races are limited to the VMA snapshotting code) is a good tradeoff.
>
> I'm not sure I agree with VMA snapshot being better either, tbh. It is
> error-prone to have a byte-by-byte local copy of VMA (which isn't
> really that VMA anymore), and passing it into ops callbacks (which
> expect "real" VMA)... Who guarantees that this won't backfire,
> depending on vm_ops implementations? And constantly copying 176+ bytes
> just to access a few fields out of it is a bit unfortunate...

Yeah, we shouldn't be passing VMA snapshots into ops callbacks, I
agree that we need to fall back to using proper locking for that.

> Also taking mmap_read_lock() sort of defeats the point of "RCU-only
> access". It's still locking/unlocking and bouncing cache lines between
> writer and reader frequently. How slow is per-VMA formatting?

I think this mainly does two things?

1. It shifts the latency burden of concurrent access toward the reader
a bit, essentially allowing writers to preempt this type of reader to
some extent.
2. It avoids bouncing cache lines between this type of reader and
other *readers*.

> If we
> take mmap_read_lock, format VMA information into a buffer under this
> lock, and drop the mmap_read_lock, would it really be that much slower
> compared to what Suren is doing in this patch set? And if no, that
> would be so much simpler compared to this semi-locked/semi-RCU way
> that is added in this patch set, no?

> But I do agree that vma->vm_ops->name access is hard to do in a
> completely lockless way reliably. But also how frequently VMAs have
> custom names/anon_vma_name?

I think there are typically two VMAs with vm_ops->name per MM, vvar
and vdso. (Since you also asked about anon_vma_name: I think
anon_vma_name is more frequent than that on Android, there seem to be
58 of those VMAs even in a simple "cat" process.)

> What if we detect that VMA has some
> "fancy" functionality (like this custom name thing), and just fallback
> to mmap_read_lock-protected logic, which needs to be supported as a
> fallback even for lockless approach?
>
> This way we can process most (typical) VMAs completely locklessly,
> while not adding any extra assumptions for all the potentially
> complicated data pieces. WDYT?

And then we'd also use the fallback path if karg.build_id_size is set?
And I guess we also need it if the VMA is hugetlb, because of
vma_kernel_pagesize()? And use READ_ONCE() in places like
vma_is_initial_heap()/vma_is_initial_stack()/arch_vma_name() for
accessing both the VMA and the MM?

And on top of that, we'd have to open-code/change anything that
currently uses the ->vm_ops (such as vma_kernel_pagesize()), because
between us checking the type of the VMA and later accessing ->vm_ops,
the VMA object could have been reallocated with different ->vm_ops?

I still don't like the idea of pushing the complexity of "the VMA
contents are unstable, and every read from the VMA object may return
data about a logically different VMA" down into these various helpers.
In my mind, making the API contract "The VMA contents can be an
internally consistent snapshot" to the API contract for these helpers
constrains the weirdness a bit more - though I guess the helpers will
still need READ_ONCE() for accessing properties of the MM either
way...

Suren Baghdasaryan May 2, 2025, 4:16 p.m. UTC | #8

On Fri, May 2, 2025 at 3:11 PM Jann Horn <jannh@google.com> wrote:
>
> On Fri, May 2, 2025 at 12:10 AM Andrii Nakryiko
> <andrii.nakryiko@gmail.com> wrote:
> > On Tue, Apr 29, 2025 at 10:25 AM Jann Horn <jannh@google.com> wrote:
> > > On Tue, Apr 29, 2025 at 7:15 PM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > On Tue, Apr 29, 2025 at 8:56 AM Jann Horn <jannh@google.com> wrote:
> > > > > On Wed, Apr 23, 2025 at 12:54 AM Andrii Nakryiko
> > > > > <andrii.nakryiko@gmail.com> wrote:
> > > > > > On Fri, Apr 18, 2025 at 10:50 AM Suren Baghdasaryan <surenb@google.com> wrote:
> > > > > > > Utilize speculative vma lookup to find and snapshot a vma without
> > > > > > > taking mmap_lock during PROCMAP_QUERY ioctl execution. Concurrent
> > > > > > > address space modifications are detected and the lookup is retried.
> > > > > > > While we take the mmap_lock for reading during such contention, we
> > > > > > > do that momentarily only to record new mm_wr_seq counter.
> > > > > >
> > > > > > PROCMAP_QUERY is an even more obvious candidate for fully lockless
> > > > > > speculation, IMO (because it's more obvious that vma's use is
> > > > > > localized to do_procmap_query(), instead of being spread across
> > > > > > m_start/m_next and m_show as with seq_file approach). We do
> > > > > > rcu_read_lock(), mmap_lock_speculate_try_begin(), query for VMA (no
> > > > > > mmap_read_lock), use that VMA to produce (speculative) output, and
> > > > > > then validate that VMA or mm_struct didn't change with
> > > > > > mmap_lock_speculate_retry(). If it did - retry, if not - we are done.
> > > > > > No need for vma_copy and any gets/puts, no?
> > > > >
> > > > > I really strongly dislike this "fully lockless" approach because it
> > > > > means we get data races all over the place, and it gets hard to reason
> > > > > about what happens especially if we do anything other than reading
> > > > > plain data from the VMA. When reading the implementation of
> > > > > do_procmap_query(), at basically every memory read you'd have to think
> > > > > twice as hard to figure out which fields can be concurrently updated
> > > > > elsewhere and whether the subsequent sequence count recheck can
> > > > > recover from the resulting badness.
> > > > >
> > > > > Just as one example, I think get_vma_name() could (depending on
> > > > > compiler optimizations) crash with a NULL deref if the VMA's ->vm_ops
> > > > > pointer is concurrently changed to &vma_dummy_vm_ops by vma_close()
> > > > > between "if (vma->vm_ops && vma->vm_ops->name)" and
> > > > > "vma->vm_ops->name(vma)". And I think this illustrates how the "fully
> > > > > lockless" approach creates more implicit assumptions about the
> > > > > behavior of core MM code, which could be broken by future changes to
> > > > > MM code.
> > > >
> > > > Yeah, I'll need to re-evaluate such an approach after your review. I
> > > > like having get_stable_vma() to obtain a completely stable version of
> > > > the vma in a localized place and then stop worrying about possible
> > > > races. If implemented correctly, would that be enough to address your
> > > > concern, Jann?
> > >
> > > Yes, I think a stable local snapshot of the VMA (where tricky data
> > > races are limited to the VMA snapshotting code) is a good tradeoff.
> >
> > I'm not sure I agree with VMA snapshot being better either, tbh. It is
> > error-prone to have a byte-by-byte local copy of VMA (which isn't
> > really that VMA anymore), and passing it into ops callbacks (which
> > expect "real" VMA)... Who guarantees that this won't backfire,
> > depending on vm_ops implementations? And constantly copying 176+ bytes
> > just to access a few fields out of it is a bit unfortunate...
>
> Yeah, we shouldn't be passing VMA snapshots into ops callbacks, I
> agree that we need to fall back to using proper locking for that.

Yes, I'm exploring the option of falling back to per-vma locking to
stabilize the VMA when its reference is used in vm_ops.

>
> > Also taking mmap_read_lock() sort of defeats the point of "RCU-only
> > access". It's still locking/unlocking and bouncing cache lines between
> > writer and reader frequently. How slow is per-VMA formatting?
>
> I think this mainly does two things?
>
> 1. It shifts the latency burden of concurrent access toward the reader
> a bit, essentially allowing writers to preempt this type of reader to
> some extent.
> 2. It avoids bouncing cache lines between this type of reader and
> other *readers*.
>
> > If we
> > take mmap_read_lock, format VMA information into a buffer under this
> > lock, and drop the mmap_read_lock, would it really be that much slower
> > compared to what Suren is doing in this patch set? And if no, that
> > would be so much simpler compared to this semi-locked/semi-RCU way
> > that is added in this patch set, no?

The problem this patch is trying to address is low priority readers
blocking a high priority writer. Taking mmap_read_lock for each VMA
will not help solve this problem. If we have to use locking then
taking per-vma lock would at least narrow down the contention scope
from the entire address space to individual VMAs.

>
> > But I do agree that vma->vm_ops->name access is hard to do in a
> > completely lockless way reliably. But also how frequently VMAs have
> > custom names/anon_vma_name?
>
> I think there are typically two VMAs with vm_ops->name per MM, vvar
> and vdso. (Since you also asked about anon_vma_name: I think
> anon_vma_name is more frequent than that on Android, there seem to be
> 58 of those VMAs even in a simple "cat" process.)
>
> > What if we detect that VMA has some
> > "fancy" functionality (like this custom name thing), and just fallback
> > to mmap_read_lock-protected logic, which needs to be supported as a
> > fallback even for lockless approach?
> >
> > This way we can process most (typical) VMAs completely locklessly,
> > while not adding any extra assumptions for all the potentially
> > complicated data pieces. WDYT?

The option I'm currently contemplating is using per-vma locks to deal
with "fancy" cases and to do snapshotting otherwise. We have several
options with different levels of complexity vs performance tradeoffs
and finding the right balance will require some experimentation. I'll
likely need Paul's help soon to run his testcase with different
versions.

>
> And then we'd also use the fallback path if karg.build_id_size is set?
> And I guess we also need it if the VMA is hugetlb, because of
> vma_kernel_pagesize()? And use READ_ONCE() in places like
> vma_is_initial_heap()/vma_is_initial_stack()/arch_vma_name() for
> accessing both the VMA and the MM?
>
> And on top of that, we'd have to open-code/change anything that
> currently uses the ->vm_ops (such as vma_kernel_pagesize()), because
> between us checking the type of the VMA and later accessing ->vm_ops,
> the VMA object could have been reallocated with different ->vm_ops?
>
> I still don't like the idea of pushing the complexity of "the VMA
> contents are unstable, and every read from the VMA object may return
> data about a logically different VMA" down into these various helpers.
> In my mind, making the API contract "The VMA contents can be an
> internally consistent snapshot" to the API contract for these helpers
> constrains the weirdness a bit more - though I guess the helpers will
> still need READ_ONCE() for accessing properties of the MM either
> way...

[v3,8/8] mm/maps: execute PROCMAP_QUERY ioctl under RCU

Commit Message

Comments

Patch