Message ID | 20241116175922.3265872-1-pasha.tatashin@soleen.com |
---|---|
Headers | show |
Series | Page Detective | expand |
On Mon, Nov 18, 2024 at 7:54 AM Jann Horn <jannh@google.com> wrote: > > On Mon, Nov 18, 2024 at 12:17 PM Lorenzo Stoakes > <lorenzo.stoakes@oracle.com> wrote: > > On Sat, Nov 16, 2024 at 05:59:16PM +0000, Pasha Tatashin wrote: > > > It operates through the Linux debugfs interface, with two files: "virt" > > > and "phys". > > > > > > The "virt" file takes a virtual address and PID and outputs information > > > about the corresponding page. > > > > > > The "phys" file takes a physical address and outputs information about > > > that page. > > > > > > The output is presented via kernel log messages (can be accessed with > > > dmesg), and includes information such as the page's reference count, > > > mapping, flags, and memory cgroup. It also shows whether the page is > > > mapped in the kernel page table, and if so, how many times. > > > > I mean, even though I'm not a huge fan of kernel pointer hashing etc. this > > is obviously leaking as much information as you might want about kernel > > internal state to the point of maybe making the whole kernel pointer > > hashing thing moot. > > > > I know this requires CAP_SYS_ADMIN, but there are things that also require > > that which _still_ obscure kernel pointers. > > > > And you're outputting it all to dmesg. > > > > So yeah, a security person (Jann?) would be better placed to comment on > > this than me, but are we sure we want to do this when not in a > > CONFIG_DEBUG_VM* kernel? > > I guess there are two parts to this - what root is allowed to do, and > what information we're fine with exposing to dmesg. > > If the lockdown LSM is not set to LOCKDOWN_CONFIDENTIALITY_MAX, the > kernel allows root to read kernel memory through some interfaces - in > particular, BPF allows reading arbitrary kernel memory, and perf > allows reading at least some stuff (like kernel register states). With > lockdown in the most restrictive mode, the kernel tries to prevent > root from reading arbitrary kernel memory, but we don't really change > how much information goes into dmesg. (And I imagine you could > probably still get kernel pointers out of BPF somehow even in the most > restrictive lockdown mode, but that's probably not relevant.) > > The main issue with dmesg is that some systems make its contents > available to code that is not running with root privileges; and I > think it is also sometimes stored persistently in unencrypted form > (like in EFI pstore) even when everything else on the system is > encrypted. > So on one hand, we definitely shouldn't print the contents of random > chunks of memory into dmesg without a good reason; on the other hand, > for example we do already print kernel register state on WARN() (which > often includes kernel pointers and could theoretically include more > sensitive data too). > > So I think showing page metadata to root when requested is probably > okay as a tradeoff? And dumping that data into dmesg is maybe not > great, but acceptable as long as only root can actually trigger this? > > I don't really have a strong opinion on this... > > > To me, a bigger issue is that dump_page() looks like it might be racy, > which is maybe not terrible in debugging code that only runs when > something has already gone wrong, but bad if it is in code that root > can trigger on demand? Hi Jann, thank you for reviewing this proposal. Presumably, the interface should be used only when something has gone wrong but has not been noticed by the kernel. That something is usually checksums failures that are outside of the kernel: i.e. during live migration, snapshotting, filesystem journaling, etc. We already have interfaces that provide data from the live kernel that could be racy, i.e. crash utility. > __dump_page() copies the given page with > memcpy(), which I don't think guarantees enough atomicity with > concurrent updates of page->mapping or such, so dump_mapping() could > probably run on a bogus pointer. Even without torn pointers, I think > there could be a UAF if the page's mapping is destroyed while we're > going through dump_page(), since the page might not be locked. And in > dump_mapping(), the strncpy_from_kernel_nofault() also doesn't guard > against concurrent renaming of the dentry, which I think again would > probably result in UAF. Since we are holding a reference on the page at the time of dump_page(), the identity of the page should not really change, but dentry can be renamed. > So I think dump_page() in its current form is not something we should > expose to a userspace-reachable API. We use dump_page() all over WARN_ONs in MM code where pages might not be locked, but this is a good point, that while even the existing usage might be racy, providing a user-reachable API potentially makes it worse. I will see if I could add some locking before dump_page(), or make a dump_page variant that does not do dump_mapping().
On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > Additionally, using crash/drgn is not feasible for us at this time, it > requires keeping external tools on our hosts, also it requires > approval and a security review for each script before deployment in > our fleet. So it's ok to add a totally insecure kernel feature to your fleet instead? You might want to reconsider that policy decision :) good luck! greg k-h
On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > Can you point me to where a refcounted reference to the page comes > > from when page_detective_metadata() calls dump_page_lvl()? > > I am sorry, I remembered incorrectly, we are getting reference right > after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I > will move the folio_try_get() to before dump_page_lvl(). > > > > > So I think dump_page() in its current form is not something we should > > > > expose to a userspace-reachable API. > > > > > > We use dump_page() all over WARN_ONs in MM code where pages might not > > > be locked, but this is a good point, that while even the existing > > > usage might be racy, providing a user-reachable API potentially makes > > > it worse. I will see if I could add some locking before dump_page(), > > > or make a dump_page variant that does not do dump_mapping(). > > > > To be clear, I am not that strongly opposed to racily reading data > > such that the data may not be internally consistent or such; but this > > is a case of racy use-after-free reads that might end up dumping > > entirely unrelated memory contents into dmesg. I think we should > > properly protect against that in an API that userspace can invoke. > > Otherwise, if we race, we might end up writing random memory contents > > into dmesg; and if we are particularly unlucky, those random memory > > contents could be PII or authentication tokens or such. > > > > I'm not entirely sure what the right approach is here; I guess it > > makes sense that when the kernel internally detects corruption, > > dump_page doesn't take references on pages it accesses to avoid > > corrupting things further. If you are looking at a page based on a > > userspace request, I guess you could access the page with the > > necessary locking to access its properties under the normal locking > > rules? > > I will take reference, as we already do that for memcg purpose, but > have not included dump_page(). Note that taking a reference on the page does not make all of dump_page() fine; in particular, my understanding is that folio_mapping() requires that the page is locked in order to return a stable pointer, and some of the code in dump_mapping() would probably also require some other locks - probably at least on the inode and maybe also on the dentry, I think? Otherwise the inode's dentry list can probably change concurrently, and the dentry's name pointer can change too.
On Tue, Nov 19, 2024 at 7:52 AM Jann Horn <jannh@google.com> wrote: > > On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin > <pasha.tatashin@soleen.com> wrote: > > > Can you point me to where a refcounted reference to the page comes > > > from when page_detective_metadata() calls dump_page_lvl()? > > > > I am sorry, I remembered incorrectly, we are getting reference right > > after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I > > will move the folio_try_get() to before dump_page_lvl(). > > > > > > > So I think dump_page() in its current form is not something we should > > > > > expose to a userspace-reachable API. > > > > > > > > We use dump_page() all over WARN_ONs in MM code where pages might not > > > > be locked, but this is a good point, that while even the existing > > > > usage might be racy, providing a user-reachable API potentially makes > > > > it worse. I will see if I could add some locking before dump_page(), > > > > or make a dump_page variant that does not do dump_mapping(). > > > > > > To be clear, I am not that strongly opposed to racily reading data > > > such that the data may not be internally consistent or such; but this > > > is a case of racy use-after-free reads that might end up dumping > > > entirely unrelated memory contents into dmesg. I think we should > > > properly protect against that in an API that userspace can invoke. > > > Otherwise, if we race, we might end up writing random memory contents > > > into dmesg; and if we are particularly unlucky, those random memory > > > contents could be PII or authentication tokens or such. > > > > > > I'm not entirely sure what the right approach is here; I guess it > > > makes sense that when the kernel internally detects corruption, > > > dump_page doesn't take references on pages it accesses to avoid > > > corrupting things further. If you are looking at a page based on a > > > userspace request, I guess you could access the page with the > > > necessary locking to access its properties under the normal locking > > > rules? > > > > I will take reference, as we already do that for memcg purpose, but > > have not included dump_page(). > > Note that taking a reference on the page does not make all of > dump_page() fine; in particular, my understanding is that > folio_mapping() requires that the page is locked in order to return a > stable pointer, and some of the code in dump_mapping() would probably > also require some other locks - probably at least on the inode and > maybe also on the dentry, I think? Otherwise the inode's dentry list > can probably change concurrently, and the dentry's name pointer can > change too. Agreed, once reference is taken, the page identity cannot change (i.e. if it is a named page it will stay a named page), but dentry can be renamed. I will look into what can be done to guarantee consistency in the next version. There is also a fallback if locking cannot be reliably resolved (i.e. for performance reasons) where we can make dump_mapping() optionally disabled from dump_page_lvl() with a new argument flag. Thank you, Pasha
On Tue, Nov 19, 2024 at 4:14 PM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > On Tue, Nov 19, 2024 at 7:52 AM Jann Horn <jannh@google.com> wrote: > > On Tue, Nov 19, 2024 at 2:30 AM Pasha Tatashin > > <pasha.tatashin@soleen.com> wrote: > > > > Can you point me to where a refcounted reference to the page comes > > > > from when page_detective_metadata() calls dump_page_lvl()? > > > > > > I am sorry, I remembered incorrectly, we are getting reference right > > > after dump_page_lvl() in page_detective_memcg() -> folio_try_get(); I > > > will move the folio_try_get() to before dump_page_lvl(). > > > > > > > > > So I think dump_page() in its current form is not something we should > > > > > > expose to a userspace-reachable API. > > > > > > > > > > We use dump_page() all over WARN_ONs in MM code where pages might not > > > > > be locked, but this is a good point, that while even the existing > > > > > usage might be racy, providing a user-reachable API potentially makes > > > > > it worse. I will see if I could add some locking before dump_page(), > > > > > or make a dump_page variant that does not do dump_mapping(). > > > > > > > > To be clear, I am not that strongly opposed to racily reading data > > > > such that the data may not be internally consistent or such; but this > > > > is a case of racy use-after-free reads that might end up dumping > > > > entirely unrelated memory contents into dmesg. I think we should > > > > properly protect against that in an API that userspace can invoke. > > > > Otherwise, if we race, we might end up writing random memory contents > > > > into dmesg; and if we are particularly unlucky, those random memory > > > > contents could be PII or authentication tokens or such. > > > > > > > > I'm not entirely sure what the right approach is here; I guess it > > > > makes sense that when the kernel internally detects corruption, > > > > dump_page doesn't take references on pages it accesses to avoid > > > > corrupting things further. If you are looking at a page based on a > > > > userspace request, I guess you could access the page with the > > > > necessary locking to access its properties under the normal locking > > > > rules? > > > > > > I will take reference, as we already do that for memcg purpose, but > > > have not included dump_page(). > > > > Note that taking a reference on the page does not make all of > > dump_page() fine; in particular, my understanding is that > > folio_mapping() requires that the page is locked in order to return a > > stable pointer, and some of the code in dump_mapping() would probably > > also require some other locks - probably at least on the inode and > > maybe also on the dentry, I think? Otherwise the inode's dentry list > > can probably change concurrently, and the dentry's name pointer can > > change too. > > Agreed, once reference is taken, the page identity cannot change (i.e. > if it is a named page it will stay a named page), but dentry can be > renamed. I will look into what can be done to guarantee consistency in > the next version. There is also a fallback if locking cannot be > reliably resolved (i.e. for performance reasons) where we can make > dump_mapping() optionally disabled from dump_page_lvl() with a new > argument flag. Yeah, I think if you don't need the details that dump_mapping() shows, skipping that for user-requested dumps might be a reasonable option.
On Tue, Nov 19, 2024 at 01:52:00PM +0100, Jann Horn wrote: > > I will take reference, as we already do that for memcg purpose, but > > have not included dump_page(). > > Note that taking a reference on the page does not make all of > dump_page() fine; in particular, my understanding is that > folio_mapping() requires that the page is locked in order to return a > stable pointer, and some of the code in dump_mapping() would probably > also require some other locks - probably at least on the inode and > maybe also on the dentry, I think? Otherwise the inode's dentry list > can probably change concurrently, and the dentry's name pointer can > change too. First important thing is that we snapshot the page. So while we may have a torn snapshot of the page, it can't change under us any more, so we don't have to worry about it being swizzled one way and then swizzled back. Second thing is that I think using folio_mapping() is actually wrong. We don't want the swap mapping if it's an anon page that's in the swapcache. We'd be fine just doing mapping = folio->mapping (we'd need to add a check for movable, but I think that's fine). Anyway, we know the folio isn't ksm or anon at the point that we call dump_mapping() because there's a chain of "else" statements. So I think we're fine because we can't switch between anon & file while holding a refcount. Having a refcount on the folio will prevent the folio from being allocated to anything else again. It will not protect the mapping from being torn down (the folio can be truncated from the mapping, then the mapping can be freed, and the memory reused). As you say, the dentry can be renamed as well. This patch series makes me nervous. I'd rather see it done as a bpf script or drgn script, but if it is going to be done in C, I'd really like to see more auditing of the safety here. It feels like the kind of hack that one deploys internally to debug a hard-to-hit condition, rather than the kind of code that we like to ship upstream.
On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > > > requires keeping external tools on our hosts, also it requires > > > > > approval and a security review for each script before deployment in > > > > > our fleet. > > > > > > > > So it's ok to add a totally insecure kernel feature to your fleet > > > > instead? You might want to reconsider that policy decision :) > > > > > > Hi Greg, > > > > > > While some risk is inherent, we believe the potential for abuse here > > > is limited, especially given the existing CAP_SYS_ADMIN requirement. > > > But, even with root access compromised, this tool presents a smaller > > > attack surface than alternatives like crash/drgn. It exposes less > > > sensitive information, unlike crash/drgn, which could potentially > > > allow reading all of kernel memory. > > > > The problem here is with using dmesg for output. No security-sensitive > > information should go there. Even exposing raw kernel pointers is not > > considered safe. > > I am OK in writing the output to a debugfs file in the next version, > the only concern I have is that implies that dump_page() would need to > be basically duplicated, as it now outputs everything via printk's. Perhaps you can refactor the code in dump_page() to use a seq_buf, then have dump_page() printk that seq_buf using seq_buf_do_printk(), and have page detective output that seq_buf to the debugfs file? We do something very similar with memory_stat_format(). We use the same function to generate the memcg stats in a seq_buf, then we use that seq_buf to output the stats to memory.stat as well as the OOM log.
On Tue, Nov 19, 2024 at 11:35:47AM -0800, Yosry Ahmed wrote: > On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin > <pasha.tatashin@soleen.com> wrote: > > > > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > > > > requires keeping external tools on our hosts, also it requires > > > > > > approval and a security review for each script before deployment in > > > > > > our fleet. > > > > > > > > > > So it's ok to add a totally insecure kernel feature to your fleet > > > > > instead? You might want to reconsider that policy decision :) > > > > > > > > Hi Greg, > > > > > > > > While some risk is inherent, we believe the potential for abuse here > > > > is limited, especially given the existing CAP_SYS_ADMIN requirement. > > > > But, even with root access compromised, this tool presents a smaller > > > > attack surface than alternatives like crash/drgn. It exposes less > > > > sensitive information, unlike crash/drgn, which could potentially > > > > allow reading all of kernel memory. > > > > > > The problem here is with using dmesg for output. No security-sensitive > > > information should go there. Even exposing raw kernel pointers is not > > > considered safe. > > > > I am OK in writing the output to a debugfs file in the next version, > > the only concern I have is that implies that dump_page() would need to > > be basically duplicated, as it now outputs everything via printk's. > > Perhaps you can refactor the code in dump_page() to use a seq_buf, > then have dump_page() printk that seq_buf using seq_buf_do_printk(), > and have page detective output that seq_buf to the debugfs file? > > We do something very similar with memory_stat_format(). We use the > same function to generate the memcg stats in a seq_buf, then we use > that seq_buf to output the stats to memory.stat as well as the OOM > log. +1 Thanks!
Pasha Tatashin <pasha.tatashin@soleen.com> writes: > Page Detective is a new kernel debugging tool that provides detailed > information about the usage and mapping of physical memory pages. > > It is often known that a particular page is corrupted, but it is hard to > extract more information about such a page from live system. Examples > are: > > - Checksum failure during live migration > - Filesystem journal failure > - dump_page warnings on the console log > - Unexcpected segfaults > > Page Detective helps to extract more information from the kernel, so it > can be used by developers to root cause the associated problem. > > It operates through the Linux debugfs interface, with two files: "virt" > and "phys". > > The "virt" file takes a virtual address and PID and outputs information > about the corresponding page. > > The "phys" file takes a physical address and outputs information about > that page. > > The output is presented via kernel log messages (can be accessed with > dmesg), and includes information such as the page's reference count, > mapping, flags, and memory cgroup. It also shows whether the page is > mapped in the kernel page table, and if so, how many times. A lot of all that is already covered in /proc/kpage{flags,cgroup,count) Also we already have /proc/pid/pagemap to resolve virtual addresses. At a minimum you need to discuss why these existing mechanisms are not suitable for you and how your new one is better. If something particular is missing perhaps the existing mechanisms can be extended? Outputting in the dmesg seems rather clumpsy for a production mechanism. I personally would just use live crash or live gdb on /proc/kcore to get extra information, although I can see that might have races. -Andi
On Wed, Nov 20, 2024 at 8:14 AM Pasha Tatashin <pasha.tatashin@soleen.com> wrote: > > On Tue, Nov 19, 2024 at 2:36 PM Yosry Ahmed <yosryahmed@google.com> wrote: > > > > On Tue, Nov 19, 2024 at 11:30 AM Pasha Tatashin > > <pasha.tatashin@soleen.com> wrote: > > > > > > On Tue, Nov 19, 2024 at 1:23 PM Roman Gushchin <roman.gushchin@linux.dev> wrote: > > > > > > > > On Tue, Nov 19, 2024 at 10:08:36AM -0500, Pasha Tatashin wrote: > > > > > On Mon, Nov 18, 2024 at 8:09 PM Greg KH <gregkh@linuxfoundation.org> wrote: > > > > > > > > > > > > On Mon, Nov 18, 2024 at 05:08:42PM -0500, Pasha Tatashin wrote: > > > > > > > Additionally, using crash/drgn is not feasible for us at this time, it > > > > > > > requires keeping external tools on our hosts, also it requires > > > > > > > approval and a security review for each script before deployment in > > > > > > > our fleet. > > > > > > > > > > > > So it's ok to add a totally insecure kernel feature to your fleet > > > > > > instead? You might want to reconsider that policy decision :) > > > > > > > > > > Hi Greg, > > > > > > > > > > While some risk is inherent, we believe the potential for abuse here > > > > > is limited, especially given the existing CAP_SYS_ADMIN requirement. > > > > > But, even with root access compromised, this tool presents a smaller > > > > > attack surface than alternatives like crash/drgn. It exposes less > > > > > sensitive information, unlike crash/drgn, which could potentially > > > > > allow reading all of kernel memory. > > > > > > > > The problem here is with using dmesg for output. No security-sensitive > > > > information should go there. Even exposing raw kernel pointers is not > > > > considered safe. > > > > > > I am OK in writing the output to a debugfs file in the next version, > > > the only concern I have is that implies that dump_page() would need to > > > be basically duplicated, as it now outputs everything via printk's. > > > > Perhaps you can refactor the code in dump_page() to use a seq_buf, > > then have dump_page() printk that seq_buf using seq_buf_do_printk(), > > and have page detective output that seq_buf to the debugfs file? > > Good idea, I will look into modifying it this way. > > > We do something very similar with memory_stat_format(). We use the > > void mem_cgroup_print_oom_meminfo(struct mem_cgroup *memcg) > { > /* Use static buffer, for the caller is holding oom_lock. */ > static char buf[PAGE_SIZE]; > .... > seq_buf_init(&s, buf, sizeof(buf)); > memory_stat_format(memcg, &s); > seq_buf_do_printk(&s, KERN_INFO); > } > > This is a callosal stack allocation, given that our fleet only has 8K > stacks. :-) That's a static allocation though :)
> - Quickly identify all user processes mapping a given page. Can be done with /proc/*/pagemap today. Maybe it's not "quick" because it won't use the rmap chains, but is that a serious issue? > - Determine if and where the kernel maps the page, which is also > important given the opportunity to remove guest memory from the kernel > direct map (as discussed at LPC'24). At least x86 already has a kernel page table dumper in debugfs that can be used for this. The value of a second redundant one seems low. > We also plan to extend this functionality to include KVM and IOMMU > page tables in the future. Yes dumpers for those would likely be useful. (at least for the case when one hand is tied behind your back by security policies forbidding /proc/kcore access) > <pagemap> provides an interface to traversing through user page > tables, but the other information cannot be extracted using the > existing interfaces. Like what? You mean the reference counts? /proc/k* doesn't have any reference counts, and no space for full counts, but I suspect usually all you need to know is a few states like (>1, 1, 0, maybe negative) which could be mapped to a few spare kpageflags bits. That said I thought Willy wanted to move a lot of these elsewhere anyways with the folio revolution, so it might be a short lived interface anyways. -Andi