| Message ID | 20250604231151.799834-7-surenb@google.com |
|---|---|
| State | New |
| Series | use per-vma locks for /proc/pid/maps reads and PROCMAP_QUERY |
Hi Suren, Forgive me but I am going to ask a lot of questions here :p just want to make sure I'm getting everything right here. On Wed, Jun 04, 2025 at 04:11:50PM -0700, Suren Baghdasaryan wrote: > With maple_tree supporting vma tree traversal under RCU and per-vma > locks, /proc/pid/maps can be read while holding individual vma locks > instead of locking the entire address space. Nice :) > Completely lockless approach would be quite complex with the main issue > being get_vma_name() using callbacks which might not work correctly with > a stable vma copy, requiring original (unstable) vma. Hmmm can you expand on what a 'completely lockless' design might comprise of? It's super un-greppable and I've not got clangd set up with an allmod kernel to triple-check but I'm seeing at least 2 (are there more?): gate_vma_name() which is: return "[vsyscall]"; special_mapping_name() which is: return ((struct vm_special_mapping *)vma->vm_private_data)->name; Which I'm guessing is the issue because it's a double pointer deref... Seems such a silly issue to get stuck on, I wonder if we can't just change this to function correctly? > When per-vma lock acquisition fails, we take the mmap_lock for reading, > lock the vma, release the mmap_lock and continue. This guarantees the > reader to make forward progress even during lock contention. This will Ah that fabled constant forward progress ;) > interfere with the writer but for a very short time while we are > acquiring the per-vma lock and only when there was contention on the > vma reader is interested in. > One case requiring special handling is when vma changes between the > time it was found and the time it got locked. A problematic case would > be if vma got shrunk so that it's start moved higher in the address > space and a new vma was installed at the beginning: > > reader found: |--------VMA A--------| > VMA is modified: |-VMA B-|----VMA A----| > reader locks modified VMA A > reader reports VMA A: | gap |----VMA A----| > > This would result in reporting a gap in the address space that does not > exist. To prevent this we retry the lookup after locking the vma, however > we do that only when we identify a gap and detect that the address space > was changed after we found the vma. OK so in this case we have 1. Find VMA A - nothing is locked yet, but presumably we are under RCU so are... safe? From unmaps? Or are we? I guess actually the detach mechanism sorts this out for us perhaps? 2. We got unlucky and did this immediately prior to VMA A having its vma->vm_start, vm_end updated to reflect the split. 3. We lock VMA A, now position with an apparent gap after the prior VMA which, in practice does not exist. So I am guessing that by observing sequence numbers you are able to detect that a change has occurred and thus retry the operation in this situation? I know we previously discussed the possibility of this retry mechanism going on forever, I guess I will see the resolution to this in the code :) > This change is designed to reduce mmap_lock contention and prevent a > process reading /proc/pid/maps files (often a low priority task, such > as monitoring/data collection services) from blocking address space > updates. Note that this change has a userspace visible disadvantage: > it allows for sub-page data tearing as opposed to the previous mechanism > where data tearing could happen only between pages of generated output > data. 
Since current userspace considers data tearing between pages to be > acceptable, we assume it will be able to handle sub-page data tearing > as well. By tearing do you mean for instance seeing a VMA more than once due to e.g. a VMA expanding in a racy way? Pedantic I know, but it might be worth going through all the merge, split and remap scenarios and explaining what might happen in each one (or perhaps do that as some form of documentation?) I can try to put together a list of all of the possibilities if that would be helpful. > > Signed-off-by: Suren Baghdasaryan <surenb@google.com> > --- > fs/proc/internal.h | 6 ++ > fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++-- > 2 files changed, 175 insertions(+), 8 deletions(-) I really hate having all this logic in the proc/task_mmu.c file. This is really delicate stuff and I'd really like it to live in mm if possible. I realise this might be a total pain, but I'm quite worried about us putting super-delicate, carefully written VMA handling code in different places. Also having stuff in mm/vma.c opens the door to userland testing which, when I finally have time to really expand that, would allow for some really nice stress testing here. > > diff --git a/fs/proc/internal.h b/fs/proc/internal.h > index 96122e91c645..3728c9012687 100644 > --- a/fs/proc/internal.h > +++ b/fs/proc/internal.h > @@ -379,6 +379,12 @@ struct proc_maps_private { > struct task_struct *task; > struct mm_struct *mm; > struct vma_iterator iter; > + loff_t last_pos; > +#ifdef CONFIG_PER_VMA_LOCK > + bool mmap_locked; > + unsigned int mm_wr_seq; Is this the _last_ sequence number observed in the mm_struct? or rather, previous? Nitty but maybe worth renaming accordingly. > + struct vm_area_struct *locked_vma; > +#endif > #ifdef CONFIG_NUMA > struct mempolicy *task_mempolicy; > #endif > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c > index 27972c0749e7..36d883c4f394 100644 > --- a/fs/proc/task_mmu.c > +++ b/fs/proc/task_mmu.c > @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv) > } > #endif > > -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv, > - loff_t *ppos) > +#ifdef CONFIG_PER_VMA_LOCK > + > +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv, > + struct vm_area_struct *vma, > + unsigned long last_pos, > + bool mm_unstable) This whole function is a bit weird tbh, you handle both the mm_unstable=true and mm_unstable=false cases, in the latter we don't try to lock at all... Nitty (sorry I know this is mildly irritating review) but maybe needs to be renamed, or split up somehow? This is only trylocking in the mm_unstable case... > +{ > + vma = vma_start_read(priv->mm, vma); Do we want to do this with mm_unstable == false? I know (from my own documentation :)) taking a VMA read lock while holding an mmap read lock is fine (the reverse isn't) but maybe it's suboptimal? > + if (IS_ERR_OR_NULL(vma)) > + return NULL; Hmm IS_ERR_OR_NULL() is generally a code smell (I learned this some years ago from people moaning at me on code review :) Sorry I know that's annoying but perhaps it's indicative of an issue in the interface? That's possibly out of scope here however. Why are we ignoring errors here though? I guess because we don't care if the VMA got detached from under us, we don't bother retrying like we do in lock_vma_under_rcu()? Should we just abstract that part of lock_vma_under_rcu() and use it? > + > + /* Check if the vma we locked is the right one. */
Well it might not be the right one :) but might still belong to the right mm, so maybe better to refer to the right virtual address space. > + if (unlikely(vma->vm_mm != priv->mm)) > + goto err; > + > + /* vma should not be ahead of the last search position. */ You mean behind the last search position? Surely a VMA being _ahead_ of it is fine? > + if (unlikely(last_pos >= vma->vm_end)) Should that be >=? Wouldn't an == just be an adjacent VMA? Why is that problematic? Or is last_pos inclusive? > + goto err; Am I correct in thinking this is what is being checked?

  last_pos
     |
     v
  |---------|
  |         |
  |---------|
    vm_end   <--- vma 'next'???

How did we go backwards? When last_pos gets updated, is it possible for a shrink to race to cause this somehow? Do we treat this as an entirely unexpected error condition? In which case is a WARN_ON_ONCE() warranted? > + > + /* > + * vma ahead of last search position is possible but we need to > + * verify that it was not shrunk after we found it, and another > + * vma has not been installed ahead of it. Otherwise we might > + * observe a gap that should not be there. > + */ OK so this is the juicy bit. > + if (mm_unstable && last_pos < vma->vm_start) { > + /* Verify only if the address space changed since vma lookup. */ > + if ((priv->mm_wr_seq & 1) || Can we wrap this into a helper? This is a 'you just have to know that odd seq number means a write operation is in effect'. I know you have a comment here, but I think something like: if (has_mm_been_modified(priv) || Would be a lot clearer (rough sketch a bit further down). Again this speaks to the usefulness of abstracting all this logic from the proc code, we are putting super delicate VMA stuff here and it's just not the right place. As an aside, I don't see coverage in the process_addrs documentation on sequence number odd/even or speculation? I think we probably need to cover this to maintain an up-to-date description of how the VMA locking mechanism works and is used? > + mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) { Nit, again unrelated to this series, but would be useful to add a comment to mmap_lock_speculate_retry() to indicate that a true return value indicates a retry is needed, or renaming it. Maybe mmap_lock_speculate_needs_retry()? Also I think that function needs a comment. Naming is hard :P Anyway the totality of this expression is 'something changed' or 'read section retry required'. Under what circumstances would this happen? OK so we're into the 'retry' logic here: > + vma_iter_init(&priv->iter, priv->mm, last_pos); I'd definitely want Liam to confirm this is all above board and correct, as these operations are pretty sensitive. But assuming this is safe, we reset the iterator to the last position... > + if (vma != vma_next(&priv->iter)) Then assert the following VMA is the one we seek. > + goto err; Might this ever be the case in the course of ordinary operation? Is this really an error? > + } > + } > + > + priv->locked_vma = vma; > + > + return vma; > +err: As queried above, is this really an error path or something we might expect to happen that could simply result in an expected fallback to mmap lock?
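To be concrete about the helper I floated above - a rough, untested sketch only, and has_mm_been_modified() is just my name for it, nothing the patch actually defines:

	/*
	 * True if the sequence count sampled in get_next_vma() was odd,
	 * i.e. an address space modification was already in progress
	 * when we looked the vma up.
	 */
	static inline bool has_mm_been_modified(struct proc_maps_private *priv)
	{
		return priv->mm_wr_seq & 1;
	}

Then the condition reads as if (has_mm_been_modified(priv) || mmap_lock_speculate_retry(...)) which is rather more self-documenting.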
> + vma_end_read(vma); > + return NULL; > +} > + > + > +static void unlock_vma(struct proc_maps_private *priv) > +{ > + if (priv->locked_vma) { > + vma_end_read(priv->locked_vma); > + priv->locked_vma = NULL; > + } > +} > + > +static const struct seq_operations proc_pid_maps_op; > + > +static inline bool lock_content(struct seq_file *m, > + struct proc_maps_private *priv) Pedantic I know but isn't 'lock_content' a bit generic? He says, not being able to think of a great alternative... OK maybe fine... :) > +{ > + /* > + * smaps and numa_maps perform page table walk, therefore require > + * mmap_lock but maps can be read with locked vma only. > + */ > + if (m->op != &proc_pid_maps_op) { Nit but is there a neater way of checking this? Actually I imagine not... But maybe worth, instead of forward-declaring proc_pid_maps_op, forward declare e.g. static inline bool is_maps_op(struct seq_file *m); And check e.g. if (is_maps_op(m)) { ... in the above. Yeah this is nitty not a massive del :) > + if (mmap_read_lock_killable(priv->mm)) > + return false; > + > + priv->mmap_locked = true; > + } else { > + rcu_read_lock(); > + priv->locked_vma = NULL; > + priv->mmap_locked = false; > + } > + > + return true; > +} > + > +static inline void unlock_content(struct proc_maps_private *priv) > +{ > + if (priv->mmap_locked) { > + mmap_read_unlock(priv->mm); > + } else { > + unlock_vma(priv); > + rcu_read_unlock(); Does this always get called even in error cases? > + } > +} > + > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv, > + loff_t last_pos) We really need a generalised RCU multi-VMA locking mechanism (we're looking into madvise VMA locking atm with a conservative single VMA lock at the moment, but in future we probably want to be able to span multiple for instance) and this really really feels like it doesn't belong in this proc code. > { > - struct vm_area_struct *vma = vma_next(&priv->iter); > + struct vm_area_struct *vma; > + int ret; > + > + if (priv->mmap_locked) > + return vma_next(&priv->iter); > + > + unlock_vma(priv); > + /* > + * Record sequence number ahead of vma lookup. > + * Odd seqcount means address space modification is in progress. > + */ > + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq); Hmm we're discarding the return value I guess we don't really care about that at this stage? Or do we? Do we want to assert the read critical section state here? I guess since we have the mm_rq_seq which we use later it's the same thing and doesn't matter. ~~(off topic a bit)~~ OK so off-topic again afaict we're doing something pretty horribly gross here. We pass &priv->mm_rw_seq as 'unsigned int *seq' field to mmap_lock_speculate_try_begin(), which in turn calls: return raw_seqcount_try_begin(&mm->mm_lock_seq, *seq); And this is defined as a macro of: #define raw_seqcount_try_begin(s, start) \ ({ \ start = raw_read_seqcount(s); \ !(start & 1); \ }) So surely this expands to: *seq = raw_read_seqcount(&mm->mm_lock_seq); !(*seq & 1) // return true if even, false if odd So we're basically ostensibly passing an unsigned int, but because we're calling a macro it's actually just 'text' and we're instead able to then reassign the underlying unsigned int * ptr and... ugh. ~~(/off topic a bit)~~ > + vma = vma_next(&priv->iter); > + if (!vma) > + return NULL; > + > + vma = trylock_vma(priv, vma, last_pos, true); > + if (vma) > + return vma; > + Really feels like this should be a boolean... I guess neat to reset vma if not locked though. 
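Circling back to the is_maps_op() idea above, to be explicit this is all I mean - rough sketch, name entirely up for grabs:

	static const struct seq_operations proc_pid_maps_op;

	/* Plain /proc/pid/maps read - no page table walk, so vma locks suffice. */
	static inline bool is_maps_op(struct seq_file *m)
	{
		return m->op == &proc_pid_maps_op;
	}

Then lock_content() can put the vma-lock path first: if (is_maps_op(m)) { rcu_read_lock(); ... } else { take the mmap read lock ... }.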
> + /* Address space got modified, vma might be stale. Re-lock and retry */ > + rcu_read_unlock(); Might we see a VMA possibly actually legit unmapped in a race here? Do we need to update last_pos/ppos to account for this? Otherwise we might just fail on the last_pos >= vma->vm_end check in trylock_vma() no? > + ret = mmap_read_lock_killable(priv->mm); Shouldn't we set priv->mmap_locked here? I guess not as we are simply holding the mmap lock to definitely get the next VMA. > + rcu_read_lock(); > + if (ret) > + return ERR_PTR(ret); > + > + /* Lookup the vma at the last position again under mmap_read_lock */ > + vma_iter_init(&priv->iter, priv->mm, last_pos); > + vma = vma_next(&priv->iter); > + if (vma) { > + vma = trylock_vma(priv, vma, last_pos, false); Be good to use Liam's convention of /* mm_unstable = */ false to make this clear. Find it kinda weird again we're 'trylocking' something we already have locked via the mmap lock but I already mentioned this... :) > + WARN_ON(!vma); /* mm is stable, has to succeed */ I wonder if this is really useful, at any rate seems like there'd be a flood here so WARN_ON_ONCE()? Perhaps VM_WARN_ON_ONCE() given this really really ought not happen? > + } > + mmap_read_unlock(priv->mm); > + > + return vma; > +} > + > +#else /* CONFIG_PER_VMA_LOCK */ > > +static inline bool lock_content(struct seq_file *m, > + struct proc_maps_private *priv) > +{ > + return mmap_read_lock_killable(priv->mm) == 0; > +} > + > +static inline void unlock_content(struct proc_maps_private *priv) > +{ > + mmap_read_unlock(priv->mm); > +} > + > +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv, > + loff_t last_pos) > +{ > + return vma_next(&priv->iter); > +} > + > +#endif /* CONFIG_PER_VMA_LOCK */ > + > +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos) > +{ > + struct proc_maps_private *priv = m->private; > + struct vm_area_struct *vma; > + > + vma = get_next_vma(priv, *ppos); > + if (IS_ERR(vma)) > + return vma; > + > + /* Store previous position to be able to restart if needed */ > + priv->last_pos = *ppos; > if (vma) { > - *ppos = vma->vm_start; > + /* > + * Track the end of the reported vma to ensure position changes > + * even if previous vma was merged with the next vma and we > + * found the extended vma with the same vm_start. > + */ Right, so observing repetitions is acceptable in such circumstances? I mean I agree. > + *ppos = vma->vm_end; If we store the end, does the last_pos logic which resets the VMA iterator later work correctly in all cases? > } else { > *ppos = -2UL; > vma = get_gate_vma(priv->mm); Is it always the case that !vma here implies a gate VMA (yuck yuck)? I see this was the original logic, but maybe put a comment about this as it's weird and confusing? (and not your fault obviously :P) Also, are all locks and state correctly handled in this case? Seems like one of those nasty edge case situations that could have jagged edges... > @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos) > return NULL; > } > > - if (mmap_read_lock_killable(mm)) { > + if (!lock_content(m, priv)) { Nice that this just slots in like this! :) > mmput(mm); > put_task_struct(priv->task); > priv->task = NULL; > return ERR_PTR(-EINTR); > } > > + if (last_addr > 0) last_addr is an unsigned long, this will always be true. You probably want to put an explicit check for -1UL, -2UL here or? (rough sketch in the P.S. below) God I hate this mechanism for indicating gate VMA...
yuck yuck (again, this bit not your fault :P) > + *ppos = last_addr = priv->last_pos; > vma_iter_init(&priv->iter, mm, last_addr); > hold_task_mempolicy(priv); > if (last_addr == -2UL) > return get_gate_vma(mm); > > - return proc_get_vma(priv, ppos); > + return proc_get_vma(m, ppos); > } > > static void *m_next(struct seq_file *m, void *v, loff_t *ppos) > @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos) > *ppos = -1UL; > return NULL; > } > - return proc_get_vma(m->private, ppos); > + return proc_get_vma(m, ppos); > } > > static void m_stop(struct seq_file *m, void *v) > @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v) > return; > > release_task_mempolicy(priv); > - mmap_read_unlock(mm); > + unlock_content(priv); > mmput(mm); > put_task_struct(priv->task); > priv->task = NULL; > -- > 2.49.0.1266.g31b7d2e469-goog > Sorry to add to workload by digging into so many details here, but we really need to make sure all the i's are dotted and t's are crossed given how fiddly and fragile this stuff is :) Very much appreciate the work, this is a significant improvement and will have a great deal of real world impact! Cheers, Lorenzo
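P.S. To be explicit about the last_addr sentinel check I mentioned above, I'm thinking of something along these lines (rough, untested sketch):

	/* Only restore the saved position for a real address, not for the
	 * -1UL (done) / -2UL (gate vma) sentinels.
	 */
	if (last_addr > 0 && last_addr != -1UL && last_addr != -2UL)
		*ppos = last_addr = priv->last_pos;

Or a small helper that makes the sentinel handling obvious - anything that avoids relying on an unsigned 'last_addr > 0' doing what it looks like it does.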
diff --git a/fs/proc/internal.h b/fs/proc/internal.h index 96122e91c645..3728c9012687 100644 --- a/fs/proc/internal.h +++ b/fs/proc/internal.h @@ -379,6 +379,12 @@ struct proc_maps_private { struct task_struct *task; struct mm_struct *mm; struct vma_iterator iter; + loff_t last_pos; +#ifdef CONFIG_PER_VMA_LOCK + bool mmap_locked; + unsigned int mm_wr_seq; + struct vm_area_struct *locked_vma; +#endif #ifdef CONFIG_NUMA struct mempolicy *task_mempolicy; #endif diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 27972c0749e7..36d883c4f394 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -127,13 +127,172 @@ static void release_task_mempolicy(struct proc_maps_private *priv) } #endif -static struct vm_area_struct *proc_get_vma(struct proc_maps_private *priv, - loff_t *ppos) +#ifdef CONFIG_PER_VMA_LOCK + +static struct vm_area_struct *trylock_vma(struct proc_maps_private *priv, + struct vm_area_struct *vma, + unsigned long last_pos, + bool mm_unstable) +{ + vma = vma_start_read(priv->mm, vma); + if (IS_ERR_OR_NULL(vma)) + return NULL; + + /* Check if the vma we locked is the right one. */ + if (unlikely(vma->vm_mm != priv->mm)) + goto err; + + /* vma should not be ahead of the last search position. */ + if (unlikely(last_pos >= vma->vm_end)) + goto err; + + /* + * vma ahead of last search position is possible but we need to + * verify that it was not shrunk after we found it, and another + * vma has not been installed ahead of it. Otherwise we might + * observe a gap that should not be there. + */ + if (mm_unstable && last_pos < vma->vm_start) { + /* Verify only if the address space changed since vma lookup. */ + if ((priv->mm_wr_seq & 1) || + mmap_lock_speculate_retry(priv->mm, priv->mm_wr_seq)) { + vma_iter_init(&priv->iter, priv->mm, last_pos); + if (vma != vma_next(&priv->iter)) + goto err; + } + } + + priv->locked_vma = vma; + + return vma; +err: + vma_end_read(vma); + return NULL; +} + + +static void unlock_vma(struct proc_maps_private *priv) +{ + if (priv->locked_vma) { + vma_end_read(priv->locked_vma); + priv->locked_vma = NULL; + } +} + +static const struct seq_operations proc_pid_maps_op; + +static inline bool lock_content(struct seq_file *m, + struct proc_maps_private *priv) +{ + /* + * smaps and numa_maps perform page table walk, therefore require + * mmap_lock but maps can be read with locked vma only. + */ + if (m->op != &proc_pid_maps_op) { + if (mmap_read_lock_killable(priv->mm)) + return false; + + priv->mmap_locked = true; + } else { + rcu_read_lock(); + priv->locked_vma = NULL; + priv->mmap_locked = false; + } + + return true; +} + +static inline void unlock_content(struct proc_maps_private *priv) +{ + if (priv->mmap_locked) { + mmap_read_unlock(priv->mm); + } else { + unlock_vma(priv); + rcu_read_unlock(); + } +} + +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv, + loff_t last_pos) { - struct vm_area_struct *vma = vma_next(&priv->iter); + struct vm_area_struct *vma; + int ret; + + if (priv->mmap_locked) + return vma_next(&priv->iter); + + unlock_vma(priv); + /* + * Record sequence number ahead of vma lookup. + * Odd seqcount means address space modification is in progress. + */ + mmap_lock_speculate_try_begin(priv->mm, &priv->mm_wr_seq); + vma = vma_next(&priv->iter); + if (!vma) + return NULL; + + vma = trylock_vma(priv, vma, last_pos, true); + if (vma) + return vma; + + /* Address space got modified, vma might be stale. 
Re-lock and retry */ + rcu_read_unlock(); + ret = mmap_read_lock_killable(priv->mm); + rcu_read_lock(); + if (ret) + return ERR_PTR(ret); + + /* Lookup the vma at the last position again under mmap_read_lock */ + vma_iter_init(&priv->iter, priv->mm, last_pos); + vma = vma_next(&priv->iter); + if (vma) { + vma = trylock_vma(priv, vma, last_pos, false); + WARN_ON(!vma); /* mm is stable, has to succeed */ + } + mmap_read_unlock(priv->mm); + + return vma; +} + +#else /* CONFIG_PER_VMA_LOCK */ +static inline bool lock_content(struct seq_file *m, + struct proc_maps_private *priv) +{ + return mmap_read_lock_killable(priv->mm) == 0; +} + +static inline void unlock_content(struct proc_maps_private *priv) +{ + mmap_read_unlock(priv->mm); +} + +static struct vm_area_struct *get_next_vma(struct proc_maps_private *priv, + loff_t last_pos) +{ + return vma_next(&priv->iter); +} + +#endif /* CONFIG_PER_VMA_LOCK */ + +static struct vm_area_struct *proc_get_vma(struct seq_file *m, loff_t *ppos) +{ + struct proc_maps_private *priv = m->private; + struct vm_area_struct *vma; + + vma = get_next_vma(priv, *ppos); + if (IS_ERR(vma)) + return vma; + + /* Store previous position to be able to restart if needed */ + priv->last_pos = *ppos; if (vma) { - *ppos = vma->vm_start; + /* + * Track the end of the reported vma to ensure position changes + * even if previous vma was merged with the next vma and we + * found the extended vma with the same vm_start. + */ + *ppos = vma->vm_end; } else { *ppos = -2UL; vma = get_gate_vma(priv->mm); @@ -163,19 +322,21 @@ static void *m_start(struct seq_file *m, loff_t *ppos) return NULL; } - if (mmap_read_lock_killable(mm)) { + if (!lock_content(m, priv)) { mmput(mm); put_task_struct(priv->task); priv->task = NULL; return ERR_PTR(-EINTR); } + if (last_addr > 0) + *ppos = last_addr = priv->last_pos; vma_iter_init(&priv->iter, mm, last_addr); hold_task_mempolicy(priv); if (last_addr == -2UL) return get_gate_vma(mm); - return proc_get_vma(priv, ppos); + return proc_get_vma(m, ppos); } static void *m_next(struct seq_file *m, void *v, loff_t *ppos) @@ -184,7 +345,7 @@ static void *m_next(struct seq_file *m, void *v, loff_t *ppos) *ppos = -1UL; return NULL; } - return proc_get_vma(m->private, ppos); + return proc_get_vma(m, ppos); } static void m_stop(struct seq_file *m, void *v) @@ -196,7 +357,7 @@ static void m_stop(struct seq_file *m, void *v) return; release_task_mempolicy(priv); - mmap_read_unlock(mm); + unlock_content(priv); mmput(mm); put_task_struct(priv->task); priv->task = NULL;
With maple_tree supporting vma tree traversal under RCU and per-vma locks, /proc/pid/maps can be read while holding individual vma locks instead of locking the entire address space.

A completely lockless approach would be quite complex with the main issue being get_vma_name() using callbacks which might not work correctly with a stable vma copy, requiring the original (unstable) vma.

When per-vma lock acquisition fails, we take the mmap_lock for reading, lock the vma, release the mmap_lock and continue. This guarantees that the reader makes forward progress even during lock contention. This will interfere with the writer but for a very short time while we are acquiring the per-vma lock and only when there was contention on the vma the reader is interested in.

One case requiring special handling is when a vma changes between the time it was found and the time it got locked. A problematic case would be if the vma got shrunk so that its start moved higher in the address space and a new vma was installed at the beginning:

    reader found:                |--------VMA A--------|
    VMA is modified:             |-VMA B-|----VMA A----|
    reader locks modified VMA A
    reader reports VMA A:        | gap   |----VMA A----|

This would result in reporting a gap in the address space that does not exist. To prevent this we retry the lookup after locking the vma, however we do that only when we identify a gap and detect that the address space was changed after we found the vma.

This change is designed to reduce mmap_lock contention and prevent a process reading /proc/pid/maps files (often a low priority task, such as monitoring/data collection services) from blocking address space updates. Note that this change has a userspace visible disadvantage: it allows for sub-page data tearing as opposed to the previous mechanism where data tearing could happen only between pages of generated output data. Since current userspace considers data tearing between pages to be acceptable, we assume it will be able to handle sub-page data tearing as well.

Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 fs/proc/internal.h |   6 ++
 fs/proc/task_mmu.c | 177 +++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 175 insertions(+), 8 deletions(-)