mbox series

[v2,0/2] procfs: Add file path and size to /proc/<pid>/fdinfo

Message ID 20220623220613.3014268-1-kaleshsingh@google.com
Headers show
Series procfs: Add file path and size to /proc/<pid>/fdinfo | expand

Message

Kalesh Singh June 23, 2022, 10:06 p.m. UTC
Hi all,

This is v2 of the fdinfo patches. The main update is adding path
field only for files with anon inodes. Rebased on 5.19-rc3.

The previous cover letter is copied below for convenience.

Thanks,
Kalesh

-----------

Processes can pin shared memory by keeping a handle to it through a
file descriptor; for instance dmabufs, memfd, and ashmem (in Android).

In the case of a memory leak, to identify the process pinning the
memory, userspace needs to:
  - Iterate the /proc/<pid>/fd/* for each process
  - Do a readlink on each entry to identify the type of memory from
    the file path.
  - stat() each entry to get the size of the memory.

The file permissions on /proc/<pid>/fd/* only allows for the owner
or root to perform the operations above; and so is not suitable for
capturing the system-wide state in a production environment.

This issue was addressed for dmabufs by making /proc/*/fdinfo/*
accessible to a process with PTRACE_MODE_READ_FSCREDS credentials[1]
To allow the same kind of tracking for other types of shared memory,
add the following fields to /proc/<pid>/fdinfo/<fd>:

path - This allows identifying the type of memory based on common
       prefixes: e.g. "/memfd...", "/dmabuf...", "/dev/ashmem..."

       This was not an issued when dmabuf tracking was introduced
       because the exp_name field of dmabuf fdinfo could be used
       to distinguish dmabuf fds from other types.

size - To track the amount of memory that is being pinned.

       dmabufs expose size as an additional field in fdinfo. Remove
       this and make it a common field for all fds.

Access to /proc/<pid>/fdinfo is governed by PTRACE_MODE_READ_FSCREDS
-- the same as for /proc/<pid>/maps which also exposes the path and
size for mapped memory regions.

This allows for a system process with PTRACE_MODE_READ_FSCREDS to
account the pinned per-process memory via fdinfo.


Kalesh Singh (2):
  procfs: Add 'size' to /proc/<pid>/fdinfo/
  procfs: Add 'path' to /proc/<pid>/fdinfo/

 Documentation/filesystems/proc.rst | 22 ++++++++++++++++++++--
 drivers/dma-buf/dma-buf.c          |  1 -
 fs/libfs.c                         |  9 +++++++++
 fs/proc/fd.c                       | 18 ++++++++++++++----
 include/linux/fs.h                 |  1 +
 5 files changed, 44 insertions(+), 7 deletions(-)


base-commit: a111daf0c53ae91e71fd2bfe7497862d14132e3e

Comments

Kalesh Singh June 28, 2022, 10:38 p.m. UTC | #1
On Tue, Jun 28, 2022 at 4:54 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Jun 23, 2022 at 03:06:06PM -0700, Kalesh Singh wrote:
> > To be able to account the amount of memory a process is keeping pinned
> > by open file descriptors add a 'size' field to fdinfo output.
> >
> > dmabufs fds already expose a 'size' field for this reason, remove this
> > and make it a common field for all fds. This allows tracking of
> > other types of memory (e.g. memfd and ashmem in Android).
> >
> > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > Reviewed-by: Christian König <christian.koenig@amd.com>
> > ---
> >
> > Changes in v2:
> >   - Add Christian's Reviewed-by
> >
> > Changes from rfc:
> >   - Split adding 'size' and 'path' into a separate patches, per Christian
> >   - Split fdinfo seq_printf into separate lines, per Christian
> >   - Fix indentation (use tabs) in documentaion, per Randy
> >
> >  Documentation/filesystems/proc.rst | 12 ++++++++++--
> >  drivers/dma-buf/dma-buf.c          |  1 -
> >  fs/proc/fd.c                       |  9 +++++----
> >  3 files changed, 15 insertions(+), 7 deletions(-)
> >
> ...
> > diff --git a/fs/proc/fd.c b/fs/proc/fd.c
> > index 913bef0d2a36..464bc3f55759 100644
> > --- a/fs/proc/fd.c
> > +++ b/fs/proc/fd.c
> > @@ -54,10 +54,11 @@ static int seq_show(struct seq_file *m, void *v)
> >       if (ret)
> >               return ret;
> >
> > -     seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\nino:\t%lu\n",
> > -                (long long)file->f_pos, f_flags,
> > -                real_mount(file->f_path.mnt)->mnt_id,
> > -                file_inode(file)->i_ino);
> > +     seq_printf(m, "pos:\t%lli\n", (long long)file->f_pos);
> > +     seq_printf(m, "flags:\t0%o\n", f_flags);
> > +     seq_printf(m, "mnt_id:\t%i\n", real_mount(file->f_path.mnt)->mnt_id);
> > +     seq_printf(m, "ino:\t%lu\n", file_inode(file)->i_ino);
> > +     seq_printf(m, "size:\t%lli\n", (long long)file_inode(file)->i_size);
>
> Hi Kalesh,
>
> Any reason not to use i_size_read() here?

Hi Brian.

Thanks for pointing this out. You are right, we should use
i_size_read() here. I'll update in the next version.

>
> Also not sure if it matters that much for your use case, but something
> worth noting at least with shmem is that one can do something like:
>
> # cat /proc/meminfo | grep Shmem:
> Shmem:               764 kB
> # xfs_io -fc "falloc -k 0 10m" ./file
> # ls -alh file
> -rw-------. 1 root root 0 Jun 28 07:22 file
> # stat file
>   File: file
>   Size: 0               Blocks: 20480      IO Block: 4096   regular empty file
> # cat /proc/meminfo | grep Shmem:
> Shmem:             11004 kB
>
> ... where the resulting memory usage isn't reflected in i_size (but is
> is in i_blocks/bytes).

I tried a similar experiment a few times, but I don't see the same
results. In my case, there is not any change in shmem. IIUC the
fallocate is allocating the disk space not shared memory.

cat /proc/meminfo > meminfo.start
xfs_io -fc "falloc -k 0 50m" ./xfs_file
cat /proc/meminfo > meminfo.stop
tail -n +1 meminfo.st* | grep -i '==\|Shmem:'

==> meminfo.start <==
Shmem:               484 kB
==> meminfo.stop <==
Shmem:               484 kB

ls -lh xfs_file
-rw------- 1 root root 0 Jun 28 15:12 xfs_file

stat xfs_file
  File: xfs_file
  Size: 0               Blocks: 102400     IO Block: 4096   regular empty file

Thanks,
Kalesh

>
> Brian
>
> >
> >       /* show_fd_locks() never deferences files so a stale value is safe */
> >       show_fd_locks(m, file, files);
> > --
> > 2.37.0.rc0.161.g10f37bed90-goog
> >
>
Kalesh Singh June 29, 2022, 8:43 p.m. UTC | #2
On Wed, Jun 29, 2022 at 5:23 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Tue, Jun 28, 2022 at 03:38:02PM -0700, Kalesh Singh wrote:
> > On Tue, Jun 28, 2022 at 4:54 AM Brian Foster <bfoster@redhat.com> wrote:
> > >
> > > On Thu, Jun 23, 2022 at 03:06:06PM -0700, Kalesh Singh wrote:
> > > > To be able to account the amount of memory a process is keeping pinned
> > > > by open file descriptors add a 'size' field to fdinfo output.
> > > >
> > > > dmabufs fds already expose a 'size' field for this reason, remove this
> > > > and make it a common field for all fds. This allows tracking of
> > > > other types of memory (e.g. memfd and ashmem in Android).
> > > >
> > > > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > > > Reviewed-by: Christian König <christian.koenig@amd.com>
> > > > ---
> > > >
> > > > Changes in v2:
> > > >   - Add Christian's Reviewed-by
> > > >
> > > > Changes from rfc:
> > > >   - Split adding 'size' and 'path' into a separate patches, per Christian
> > > >   - Split fdinfo seq_printf into separate lines, per Christian
> > > >   - Fix indentation (use tabs) in documentaion, per Randy
> > > >
> > > >  Documentation/filesystems/proc.rst | 12 ++++++++++--
> > > >  drivers/dma-buf/dma-buf.c          |  1 -
> > > >  fs/proc/fd.c                       |  9 +++++----
> > > >  3 files changed, 15 insertions(+), 7 deletions(-)
> > > >
> ...
> > >
> > > Also not sure if it matters that much for your use case, but something
> > > worth noting at least with shmem is that one can do something like:
> > >
> > > # cat /proc/meminfo | grep Shmem:
> > > Shmem:               764 kB
> > > # xfs_io -fc "falloc -k 0 10m" ./file
> > > # ls -alh file
> > > -rw-------. 1 root root 0 Jun 28 07:22 file
> > > # stat file
> > >   File: file
> > >   Size: 0               Blocks: 20480      IO Block: 4096   regular empty file
> > > # cat /proc/meminfo | grep Shmem:
> > > Shmem:             11004 kB
> > >
> > > ... where the resulting memory usage isn't reflected in i_size (but is
> > > is in i_blocks/bytes).
> >
> > I tried a similar experiment a few times, but I don't see the same
> > results. In my case, there is not any change in shmem. IIUC the
> > fallocate is allocating the disk space not shared memory.
> >
>
> Sorry, it was implied in my previous test was that I was running against
> tmpfs. So regardless of fs, the fallocate keep_size semantics shown in
> both cases is as expected: the underlying blocks are allocated and the
> inode size is unchanged.
>
> What wasn't totally clear to me when I read this patch was 1. whether
> tmpfs refers to Shmem and 2. whether tmpfs allowed this sort of
> operation. The test above seems to confirm both, however, right? E.g., a
> more detailed example:
>
> # mount | grep /tmp
> tmpfs on /tmp type tmpfs (rw,nosuid,nodev,seclabel,nr_inodes=1048576,inode64)
> # cat /proc/meminfo | grep Shmem:
> Shmem:              5300 kB
> # xfs_io -fc "falloc -k 0 1g" /tmp/file
> # stat /tmp/file
>   File: /tmp/file
>   Size: 0               Blocks: 2097152    IO Block: 4096   regular empty file
> Device: 22h/34d Inode: 45          Links: 1
> Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
> Context: unconfined_u:object_r:user_tmp_t:s0
> Access: 2022-06-29 08:04:01.301307154 -0400
> Modify: 2022-06-29 08:04:01.301307154 -0400
> Change: 2022-06-29 08:04:01.451312834 -0400
>  Birth: 2022-06-29 08:04:01.301307154 -0400
> # cat /proc/meminfo | grep Shmem:
> Shmem:           1053876 kB
> # rm -f /tmp/file
> # cat /proc/meminfo | grep Shmem:
> Shmem:              5300 kB
>
> So clearly this impacts Shmem.. was your test run against tmpfs or some
> other (disk based) fs?

Hi Brian,

Thanks for clarifying. My issue was tmpfs not mounted at /tmp in my system:

==> meminfo.start <==
Shmem:               572 kB
==> meminfo.stop <==
Shmem:             51688 kB

>
> FWIW, I don't have any objection to exposing inode size if it's commonly
> useful information. My feedback was more just an fyi that i_size doesn't
> necessarily reflect underlying space consumption (whether it's memory or
> disk space) in more generic cases, because it sounds like that is really
> what you're after here. The opposite example to the above would be
> something like an 'xfs_io -fc "truncate 1t" /tmp/file', which shows a
> 1TB inode size with zero additional shmem usage.
Kalesh Singh June 30, 2022, 9:30 p.m. UTC | #3
On Thu, Jun 30, 2022 at 5:03 AM Brian Foster <bfoster@redhat.com> wrote:
>
> On Thu, Jun 30, 2022 at 07:48:46AM -0400, Brian Foster wrote:
> > On Wed, Jun 29, 2022 at 01:43:11PM -0700, Kalesh Singh wrote:
> > > On Wed, Jun 29, 2022 at 5:23 AM Brian Foster <bfoster@redhat.com> wrote:
> > > >
> > > > On Tue, Jun 28, 2022 at 03:38:02PM -0700, Kalesh Singh wrote:
> > > > > On Tue, Jun 28, 2022 at 4:54 AM Brian Foster <bfoster@redhat.com> wrote:
> > > > > >
> > > > > > On Thu, Jun 23, 2022 at 03:06:06PM -0700, Kalesh Singh wrote:
> > > > > > > To be able to account the amount of memory a process is keeping pinned
> > > > > > > by open file descriptors add a 'size' field to fdinfo output.
> > > > > > >
> > > > > > > dmabufs fds already expose a 'size' field for this reason, remove this
> > > > > > > and make it a common field for all fds. This allows tracking of
> > > > > > > other types of memory (e.g. memfd and ashmem in Android).
> > > > > > >
> > > > > > > Signed-off-by: Kalesh Singh <kaleshsingh@google.com>
> > > > > > > Reviewed-by: Christian König <christian.koenig@amd.com>
> > > > > > > ---
> > > > > > >
> > > > > > > Changes in v2:
> > > > > > >   - Add Christian's Reviewed-by
> > > > > > >
> > > > > > > Changes from rfc:
> > > > > > >   - Split adding 'size' and 'path' into a separate patches, per Christian
> > > > > > >   - Split fdinfo seq_printf into separate lines, per Christian
> > > > > > >   - Fix indentation (use tabs) in documentaion, per Randy
> > > > > > >
> > > > > > >  Documentation/filesystems/proc.rst | 12 ++++++++++--
> > > > > > >  drivers/dma-buf/dma-buf.c          |  1 -
> > > > > > >  fs/proc/fd.c                       |  9 +++++----
> > > > > > >  3 files changed, 15 insertions(+), 7 deletions(-)
> > > > > > >
> > > > ...
> > > > > >
> > > > > > Also not sure if it matters that much for your use case, but something
> > > > > > worth noting at least with shmem is that one can do something like:
> > > > > >
> > > > > > # cat /proc/meminfo | grep Shmem:
> > > > > > Shmem:               764 kB
> > > > > > # xfs_io -fc "falloc -k 0 10m" ./file
> > > > > > # ls -alh file
> > > > > > -rw-------. 1 root root 0 Jun 28 07:22 file
> > > > > > # stat file
> > > > > >   File: file
> > > > > >   Size: 0               Blocks: 20480      IO Block: 4096   regular empty file
> > > > > > # cat /proc/meminfo | grep Shmem:
> > > > > > Shmem:             11004 kB
> > > > > >
> > > > > > ... where the resulting memory usage isn't reflected in i_size (but is
> > > > > > is in i_blocks/bytes).
> > > > >
> > > > > I tried a similar experiment a few times, but I don't see the same
> > > > > results. In my case, there is not any change in shmem. IIUC the
> > > > > fallocate is allocating the disk space not shared memory.
> > > > >
> > > >
> > > > Sorry, it was implied in my previous test was that I was running against
> > > > tmpfs. So regardless of fs, the fallocate keep_size semantics shown in
> > > > both cases is as expected: the underlying blocks are allocated and the
> > > > inode size is unchanged.
> > > >
> > > > What wasn't totally clear to me when I read this patch was 1. whether
> > > > tmpfs refers to Shmem and 2. whether tmpfs allowed this sort of
> > > > operation. The test above seems to confirm both, however, right? E.g., a
> > > > more detailed example:
> > > >
> > > > # mount | grep /tmp
> > > > tmpfs on /tmp type tmpfs (rw,nosuid,nodev,seclabel,nr_inodes=1048576,inode64)
> > > > # cat /proc/meminfo | grep Shmem:
> > > > Shmem:              5300 kB
> > > > # xfs_io -fc "falloc -k 0 1g" /tmp/file
> > > > # stat /tmp/file
> > > >   File: /tmp/file
> > > >   Size: 0               Blocks: 2097152    IO Block: 4096   regular empty file
> > > > Device: 22h/34d Inode: 45          Links: 1
> > > > Access: (0600/-rw-------)  Uid: (    0/    root)   Gid: (    0/    root)
> > > > Context: unconfined_u:object_r:user_tmp_t:s0
> > > > Access: 2022-06-29 08:04:01.301307154 -0400
> > > > Modify: 2022-06-29 08:04:01.301307154 -0400
> > > > Change: 2022-06-29 08:04:01.451312834 -0400
> > > >  Birth: 2022-06-29 08:04:01.301307154 -0400
> > > > # cat /proc/meminfo | grep Shmem:
> > > > Shmem:           1053876 kB
> > > > # rm -f /tmp/file
> > > > # cat /proc/meminfo | grep Shmem:
> > > > Shmem:              5300 kB
> > > >
> > > > So clearly this impacts Shmem.. was your test run against tmpfs or some
> > > > other (disk based) fs?
> > >
> > > Hi Brian,
> > >
> > > Thanks for clarifying. My issue was tmpfs not mounted at /tmp in my system:
> > >
> > > ==> meminfo.start <==
> > > Shmem:               572 kB
> > > ==> meminfo.stop <==
> > > Shmem:             51688 kB
> > >
> >
> > Ok, makes sense.
> >
> > > >
> > > > FWIW, I don't have any objection to exposing inode size if it's commonly
> > > > useful information. My feedback was more just an fyi that i_size doesn't
> > > > necessarily reflect underlying space consumption (whether it's memory or
> > > > disk space) in more generic cases, because it sounds like that is really
> > > > what you're after here. The opposite example to the above would be
> > > > something like an 'xfs_io -fc "truncate 1t" /tmp/file', which shows a
> > > > 1TB inode size with zero additional shmem usage.
> > >
> > > From these cases, it seems the more generic way to do this is by
> > > calculating the actual size consumed using the blocks. (i_blocks *
> > > 512). So in the latter example  'xfs_io -fc "truncate 1t" /tmp/file'
> > > the size consumed would be zero. Let me know if it sounds ok to you
> > > and I can repost the updated version.
> > >
> >
> > That sounds a bit more useful to me if you're interested in space usage,
> > or at least I don't have a better idea for you. ;)
> >
> > One thing to note is that I'm not sure whether all fs' use i_blocks
> > reliably. E.g., XFS populates stat->blocks via a separate block counter
> > in the XFS specific inode structure (see xfs_vn_getattr()). A bunch of
> > other fs' seem to touch it so perhaps that is just an outlier. You could
> > consider fixing that up, perhaps make a ->getattr() call to avoid it, or
> > just use the field directly if it's useful enough as is and there are no
> > other objections. Something to think about anyways..
> >

Hi Brian,

Thanks for pointing it out. Let me take a look into the xfs case.

>
> Oh, I wonder if you're looking for similar "file rss" information this
> series wants to collect/expose..?
>
> https://lore.kernel.org/linux-fsdevel/20220624080444.7619-1-christian.koenig@amd.com/#r

Christian's series seems to have some overlap with what we want to
achieve here. IIUC it exposes the information on the per process
granularity. Perhaps if that approach is agreed on, I think we can use
the file_rss() f_op to expose the  per file size in the fdinfo for the
cases where the i_blocks are unreliable.

Thanks,
Kalesh

>
> Brian
>
> > Brian
> >
> > > Thanks,
> > > Kalesh
> > >
> > > >
> > > > Brian
> > > >
> > > > > cat /proc/meminfo > meminfo.start
> > > > > xfs_io -fc "falloc -k 0 50m" ./xfs_file
> > > > > cat /proc/meminfo > meminfo.stop
> > > > > tail -n +1 meminfo.st* | grep -i '==\|Shmem:'
> > > > >
> > > > > ==> meminfo.start <==
> > > > > Shmem:               484 kB
> > > > > ==> meminfo.stop <==
> > > > > Shmem:               484 kB
> > > > >
> > > > > ls -lh xfs_file
> > > > > -rw------- 1 root root 0 Jun 28 15:12 xfs_file
> > > > >
> > > > > stat xfs_file
> > > > >   File: xfs_file
> > > > >   Size: 0               Blocks: 102400     IO Block: 4096   regular empty file
> > > > >
> > > > > Thanks,
> > > > > Kalesh
> > > > >
> > > > > >
> > > > > > Brian
> > > > > >
> > > > > > >
> > > > > > >       /* show_fd_locks() never deferences files so a stale value is safe */
> > > > > > >       show_fd_locks(m, file, files);
> > > > > > > --
> > > > > > > 2.37.0.rc0.161.g10f37bed90-goog
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > > --
> > > > To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@android.com.
> > > >
> > >
>