mbox series

[0/2] fcntl: add fcntl(F_CHECK_ORIGINAL_MEMFD)

Message ID 20230831203647.558079-1-mclapinski@google.com
Headers show
Series fcntl: add fcntl(F_CHECK_ORIGINAL_MEMFD) | expand

Message

Michal Clapinski Aug. 31, 2023, 8:36 p.m. UTC
This change introduces a new fcntl to check if an fd points to a memfd's
original open fd (the one created by memfd_create).

We encountered an issue with migrating memfds in CRIU (checkpoint
restore in userspace - it migrates running processes between
machines). Imagine a scenario:
1. Create a memfd. By default it's open with O_RDWR and yet one can
exec() to it (unlike with regular files, where one would get ETXTBSY).
2. Reopen that memfd with O_RDWR via /proc/self/fd/<fd>.

Now those 2 fds are indistinguishable from userspace. You can't exec()
to either of them (since the reopen incremented inode->i_writecount)
and their /proc/self/fdinfo/ are exactly the same. Unfortunately they
are not the same. If you close the second one, the first one becomes
exec()able again. If you close the first one, the other doesn't become
exec()able. Therefore during migration it does matter which is recreated
first and which is reopened but there is no way for CRIU to tell which
was first.

Michal Clapinski (2):
  fcntl: add fcntl(F_CHECK_ORIGINAL_MEMFD)
  selftests: test fcntl(F_CHECK_ORIGINAL_MEMFD)

 fs/fcntl.c                                 |  3 ++
 include/uapi/linux/fcntl.h                 |  9 ++++++
 tools/testing/selftests/memfd/memfd_test.c | 32 ++++++++++++++++++++++
 3 files changed, 44 insertions(+)

Comments

Christian Brauner Sept. 1, 2023, 12:56 p.m. UTC | #1
On Thu, Aug 31, 2023 at 10:36:46PM +0200, Michal Clapinski wrote:
> Add a way to check if an fd points to the memfd's original open fd
> (the one created by memfd_create).
> Useful because only the original open fd can be both writable and
> executable.
> 
> Signed-off-by: Michal Clapinski <mclapinski@google.com>
> ---
>  fs/fcntl.c                 | 3 +++
>  include/uapi/linux/fcntl.h | 9 +++++++++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index e871009f6c88..301527e07a4d 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -419,6 +419,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  	case F_SET_RW_HINT:
>  		err = fcntl_rw_hint(filp, cmd, arg);
>  		break;
> +	case F_CHECK_ORIGINAL_MEMFD:
> +		err = !(filp->f_mode & FMODE_WRITER);
> +		break;

Honestly, make this an ioctl on memfds. This is so specific that it
really doesn't belong into fcntl().
Michal Clapinski Sept. 1, 2023, 2:50 p.m. UTC | #2
On Fri, Sep 1, 2023 at 2:56 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Aug 31, 2023 at 10:36:46PM +0200, Michal Clapinski wrote:
> > Add a way to check if an fd points to the memfd's original open fd
> > (the one created by memfd_create).
> > Useful because only the original open fd can be both writable and
> > executable.
> >
> > Signed-off-by: Michal Clapinski <mclapinski@google.com>
> > ---
> >  fs/fcntl.c                 | 3 +++
> >  include/uapi/linux/fcntl.h | 9 +++++++++
> >  2 files changed, 12 insertions(+)
> >
> > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > index e871009f6c88..301527e07a4d 100644
> > --- a/fs/fcntl.c
> > +++ b/fs/fcntl.c
> > @@ -419,6 +419,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> >       case F_SET_RW_HINT:
> >               err = fcntl_rw_hint(filp, cmd, arg);
> >               break;
> > +     case F_CHECK_ORIGINAL_MEMFD:
> > +             err = !(filp->f_mode & FMODE_WRITER);
> > +             break;
>
> Honestly, make this an ioctl on memfds. This is so specific that it
> really doesn't belong into fcntl().

I've never touched ioctls but if I'm correct, I can't just add it to
memfd. I would have to add it to the underlying fs, so hugetlbfs and
shmem (which I think can be defined as ramfs so also there). File
sealing fcntl is already memfd specific. Are you sure ioctl will be a
better idea?
Kees Cook Sept. 1, 2023, 6:34 p.m. UTC | #3
On Fri, Sep 01, 2023 at 04:50:53PM +0200, Michał Cłapiński wrote:
> On Fri, Sep 1, 2023 at 2:56 PM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Thu, Aug 31, 2023 at 10:36:46PM +0200, Michal Clapinski wrote:
> > > Add a way to check if an fd points to the memfd's original open fd
> > > (the one created by memfd_create).
> > > Useful because only the original open fd can be both writable and
> > > executable.
> > >
> > > Signed-off-by: Michal Clapinski <mclapinski@google.com>
> > > ---
> > >  fs/fcntl.c                 | 3 +++
> > >  include/uapi/linux/fcntl.h | 9 +++++++++
> > >  2 files changed, 12 insertions(+)
> > >
> > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > index e871009f6c88..301527e07a4d 100644
> > > --- a/fs/fcntl.c
> > > +++ b/fs/fcntl.c
> > > @@ -419,6 +419,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > >       case F_SET_RW_HINT:
> > >               err = fcntl_rw_hint(filp, cmd, arg);
> > >               break;
> > > +     case F_CHECK_ORIGINAL_MEMFD:
> > > +             err = !(filp->f_mode & FMODE_WRITER);
> > > +             break;
> >
> > Honestly, make this an ioctl on memfds. This is so specific that it
> > really doesn't belong into fcntl().
> 
> I've never touched ioctls but if I'm correct, I can't just add it to
> memfd. I would have to add it to the underlying fs, so hugetlbfs and
> shmem (which I think can be defined as ramfs so also there). File
> sealing fcntl is already memfd specific. Are you sure ioctl will be a
> better idea?

Does this check "mean" anything for other files? Because if it's
generically useful (and got renamed) it maybe would be right for
fcntl...
Christian Brauner Sept. 4, 2023, 7:29 a.m. UTC | #4
On Fri, Sep 01, 2023 at 11:34:32AM -0700, Kees Cook wrote:
> On Fri, Sep 01, 2023 at 04:50:53PM +0200, Michał Cłapiński wrote:
> > On Fri, Sep 1, 2023 at 2:56 PM Christian Brauner <brauner@kernel.org> wrote:
> > >
> > > On Thu, Aug 31, 2023 at 10:36:46PM +0200, Michal Clapinski wrote:
> > > > Add a way to check if an fd points to the memfd's original open fd
> > > > (the one created by memfd_create).
> > > > Useful because only the original open fd can be both writable and
> > > > executable.
> > > >
> > > > Signed-off-by: Michal Clapinski <mclapinski@google.com>
> > > > ---
> > > >  fs/fcntl.c                 | 3 +++
> > > >  include/uapi/linux/fcntl.h | 9 +++++++++
> > > >  2 files changed, 12 insertions(+)
> > > >
> > > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > > index e871009f6c88..301527e07a4d 100644
> > > > --- a/fs/fcntl.c
> > > > +++ b/fs/fcntl.c
> > > > @@ -419,6 +419,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > > >       case F_SET_RW_HINT:
> > > >               err = fcntl_rw_hint(filp, cmd, arg);
> > > >               break;
> > > > +     case F_CHECK_ORIGINAL_MEMFD:
> > > > +             err = !(filp->f_mode & FMODE_WRITER);
> > > > +             break;
> > >
> > > Honestly, make this an ioctl on memfds. This is so specific that it
> > > really doesn't belong into fcntl().
> > 
> > I've never touched ioctls but if I'm correct, I can't just add it to
> > memfd. I would have to add it to the underlying fs, so hugetlbfs and
> > shmem (which I think can be defined as ramfs so also there). File
> > sealing fcntl is already memfd specific. Are you sure ioctl will be a
> > better idea?

fcntl() should be generic. Frankly, the sealing stuff should've gone
into an ioctl as well and only upgraded to a fcntl() once multiple fd
types support it.

> 
> Does this check "mean" anything for other files? Because if it's
> generically useful (and got renamed) it maybe would be right for
> fcntl...

For regular files it just means that the file has gotten write access to
the underlying fs and we use this flag to release the necessary
refcounts/protections once the file is closed.

If this check has any meaning beyond that than it only has meaning for
memfd. I'm also not sure why this checks FMODE_WRITER and not
FMODE_WRITE as FMODE_WRITER is almost an entirely internal thing that
only very specific codepaths need to know about.
Michal Clapinski Sept. 4, 2023, 5:57 p.m. UTC | #5
On Mon, Sep 4, 2023 at 9:29 AM Christian Brauner <brauner@kernel.org> wrote:
>
> On Fri, Sep 01, 2023 at 11:34:32AM -0700, Kees Cook wrote:
> > On Fri, Sep 01, 2023 at 04:50:53PM +0200, Michał Cłapiński wrote:
> > > On Fri, Sep 1, 2023 at 2:56 PM Christian Brauner <brauner@kernel.org> wrote:
> > > >
> > > > On Thu, Aug 31, 2023 at 10:36:46PM +0200, Michal Clapinski wrote:
> > > > > Add a way to check if an fd points to the memfd's original open fd
> > > > > (the one created by memfd_create).
> > > > > Useful because only the original open fd can be both writable and
> > > > > executable.
> > > > >
> > > > > Signed-off-by: Michal Clapinski <mclapinski@google.com>
> > > > > ---
> > > > >  fs/fcntl.c                 | 3 +++
> > > > >  include/uapi/linux/fcntl.h | 9 +++++++++
> > > > >  2 files changed, 12 insertions(+)
> > > > >
> > > > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > > > index e871009f6c88..301527e07a4d 100644
> > > > > --- a/fs/fcntl.c
> > > > > +++ b/fs/fcntl.c
> > > > > @@ -419,6 +419,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > > > >       case F_SET_RW_HINT:
> > > > >               err = fcntl_rw_hint(filp, cmd, arg);
> > > > >               break;
> > > > > +     case F_CHECK_ORIGINAL_MEMFD:
> > > > > +             err = !(filp->f_mode & FMODE_WRITER);
> > > > > +             break;
> > > >
> > > > Honestly, make this an ioctl on memfds. This is so specific that it
> > > > really doesn't belong into fcntl().
> > >
> > > I've never touched ioctls but if I'm correct, I can't just add it to
> > > memfd. I would have to add it to the underlying fs, so hugetlbfs and
> > > shmem (which I think can be defined as ramfs so also there). File
> > > sealing fcntl is already memfd specific. Are you sure ioctl will be a
> > > better idea?
>
> fcntl() should be generic. Frankly, the sealing stuff should've gone
> into an ioctl as well and only upgraded to a fcntl() once multiple fd
> types support it.
>

But ioctl is good for stuff related to the underlying fs, which this
isn't. I'm worried if I rewrite it as an ioctl and put it in 3
different places, the maintainers of shmem, hugetlbfs and ramfs will
tell me to rewrite it as an fcntl. If a new filesystem pops up that
can be used as the backend for memfd, the ioctl will also have to be
added there.

> >
> > Does this check "mean" anything for other files? Because if it's
> > generically useful (and got renamed) it maybe would be right for
> > fcntl...
>
> For regular files it just means that the file has gotten write access to
> the underlying fs and we use this flag to release the necessary
> refcounts/protections once the file is closed.
>
> If this check has any meaning beyond that than it only has meaning for
> memfd. I'm also not sure why this checks FMODE_WRITER and not
> FMODE_WRITE as FMODE_WRITER is almost an entirely internal thing that
> only very specific codepaths need to know about.

If you reopen the memfd via /proc/<pid>/fd/ with O_RDWR, both file
objects (the original and the reopened one) have FMODE_WRITE, so
knowing if the flag is set gives me nothing. FMODE_WRITER is the only
difference between the original fd and the reopened one. This flag
also dictates whether `inode->i_writecount` will be decremented on
close (in `put_file_access`) which influences exec()ability of the
other fd. It surprised me too that this flag theoretically means
"write access to underlying fs" but it's used to determine whether to
decrement i_writecount.
Christian Brauner Sept. 5, 2023, 8:37 a.m. UTC | #6
On Mon, Sep 04, 2023 at 07:57:03PM +0200, Michał Cłapiński wrote:
> On Mon, Sep 4, 2023 at 9:29 AM Christian Brauner <brauner@kernel.org> wrote:
> >
> > On Fri, Sep 01, 2023 at 11:34:32AM -0700, Kees Cook wrote:
> > > On Fri, Sep 01, 2023 at 04:50:53PM +0200, Michał Cłapiński wrote:
> > > > On Fri, Sep 1, 2023 at 2:56 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > >
> > > > > On Thu, Aug 31, 2023 at 10:36:46PM +0200, Michal Clapinski wrote:
> > > > > > Add a way to check if an fd points to the memfd's original open fd
> > > > > > (the one created by memfd_create).
> > > > > > Useful because only the original open fd can be both writable and
> > > > > > executable.
> > > > > >
> > > > > > Signed-off-by: Michal Clapinski <mclapinski@google.com>
> > > > > > ---
> > > > > >  fs/fcntl.c                 | 3 +++
> > > > > >  include/uapi/linux/fcntl.h | 9 +++++++++
> > > > > >  2 files changed, 12 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/fcntl.c b/fs/fcntl.c
> > > > > > index e871009f6c88..301527e07a4d 100644
> > > > > > --- a/fs/fcntl.c
> > > > > > +++ b/fs/fcntl.c
> > > > > > @@ -419,6 +419,9 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
> > > > > >       case F_SET_RW_HINT:
> > > > > >               err = fcntl_rw_hint(filp, cmd, arg);
> > > > > >               break;
> > > > > > +     case F_CHECK_ORIGINAL_MEMFD:
> > > > > > +             err = !(filp->f_mode & FMODE_WRITER);

That by the way is broken or at least misleading.

Looking at the sealing stuff, it's only available for memfds, i.e.,
memfd_create(). Any non-memfds have F_SEAL_SEAL set meaning that setting
seals on any non-memfds fails.

But your check is available on any tmpfs or hugetlbfs file in addition
to memfds. In fact, it available for any file/fd...

But it has unclear meaning for anything that isn't a memfd_create() memfd.

> > > > > > +             break;
> > > > >
> > > > > Honestly, make this an ioctl on memfds. This is so specific that it
> > > > > really doesn't belong into fcntl().
> > > >
> > > > I've never touched ioctls but if I'm correct, I can't just add it to
> > > > memfd. I would have to add it to the underlying fs, so hugetlbfs and
> > > > shmem (which I think can be defined as ramfs so also there). File
> > > > sealing fcntl is already memfd specific. Are you sure ioctl will be a
> > > > better idea?
> >
> > fcntl() should be generic. Frankly, the sealing stuff should've gone
> > into an ioctl as well and only upgraded to a fcntl() once multiple fd
> > types support it.
> >
> 
> But ioctl is good for stuff related to the underlying fs, which this
> isn't. I'm worried if I rewrite it as an ioctl and put it in 3
> different places, the maintainers of shmem, hugetlbfs and ramfs will
> tell me to rewrite it as an fcntl. If a new filesystem pops up that
> can be used as the backend for memfd, the ioctl will also have to be
> added there.

I see your concern but first, the fact that memfd_create() was
implemented as a multiplexer over different filesystem types was a
deliberate choice. But that's not an argument to make us absorb an
ill-defined fcntl() addition. We're just up for a repeat if you decide
to add yet more fcntl()s later.

Second, looking at:

memfd_fcntl()
-> memfd_get_seals()
   -> memfd_file_seals_ptr()
      {
              if (shmem_file(file))
                      return &SHMEM_I(file_inode(file))->seals;
      
      #ifdef CONFIG_HUGETLBFS
              if (is_file_hugepages(file))
                      return &HUGETLBFS_I(file_inode(file))->seals;
      #endif
      
              return NULL;
      }

If you have a third backend then you need to handle that anyway. You
even need to add a new seals inode member to that new backend. So you
already fiddle with each backend to accomodate memfds.

Whether you do that to as a wrapper located in mm/memfd.c that's
callable from the fses ioctl implementation or through memfd_fcntl()
really isn't any more complicated.

(UNTESTED, FAILS TO COMPILE ON PURPOSE)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 316c4cebd3f3..bbdddcdc8936 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1324,6 +1324,12 @@ static void init_once(void *foo)
 	inode_init_once(&ei->vfs_inode);
 }
 
+static int hugetlbfs_file_ioctl(struct file *file, unsigned int cmd, unsigned int arg)
+{
+	return memfd_ioctl(file, cmd, arg);
+}
+
 const struct file_operations hugetlbfs_file_operations = {
 	.read_iter		= hugetlbfs_read_iter,
 	.mmap			= hugetlbfs_file_mmap,
@@ -1331,6 +1337,8 @@ const struct file_operations hugetlbfs_file_operations = {
 	.get_unmapped_area	= hugetlb_get_unmapped_area,
 	.llseek			= default_llseek,
 	.fallocate		= hugetlbfs_fallocate,
+	.unlocked_ioctl		= hugetlbfs_file_ioctl,
+	.compat_ioctl		= hugetlbfs_file_ioctl,
 };
 
 static const struct inode_operations hugetlbfs_dir_inode_operations = {
diff --git a/include/linux/memfd.h b/include/linux/memfd.h
index e7abf6fa4c52..7e3c97ad842f 100644
--- a/include/linux/memfd.h
+++ b/include/linux/memfd.h
@@ -6,11 +6,16 @@
 
 #ifdef CONFIG_MEMFD_CREATE
 extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg);
+long memfd_ioctl(struct file *file, unsigned int cmd, unsigned int arg);
 #else
 static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned int a)
 {
 	return -EINVAL;
 }
+static inline long memfd_ioctl(struct file *file, unsigned int cmd, unsigned int arg)
+{

	PROPERLY DETECT A memfd_create() MEMFD INSTEAD OF F_SEAL_SEAL

+	return -EINVAL;
+}
 #endif
 
 #endif /* __LINUX_MEMFD_H */
diff --git a/mm/memfd.c b/mm/memfd.c
index 1cad1904fc26..4a26df1f4a91 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -243,6 +243,7 @@ static int memfd_get_seals(struct file *file)
 	return seals ? *seals : -EINVAL;
 }
 
+/* new memfd specific functionality goes into memfd_ioctl() */
 long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 {
 	long error;
@@ -254,6 +255,12 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 	case F_GET_SEALS:
 		error = memfd_get_seals(file);
 		break;
+	case F_SOMETHING_1:
+		error = memfd_something_1(file);
+		break;
+	case F_SOMETHING_1:
+		error = memfd_something_2(file);
+		break;
 	default:
 		error = -EINVAL;
 		break;
@@ -262,6 +269,22 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
 	return error;
 }
 
+long memfd_ioctl(struct file *file, unsigned int cmd, unsigned int arg)
+{
+	long error;
+
+	switch (cmd) {
+	case F_SOMETHING_1:
+		error = memfd_something_1(file);
+		break;
+	case F_SOMETHING_1:
+		error = memfd_something_2(file);
+		break;
+	}
+
+	return memfd_fcntl(file, cmd, arg);
+}
+
 #define MFD_NAME_PREFIX "memfd:"
 #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
 #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
diff --git a/mm/shmem.c b/mm/shmem.c
index 02e62fccc80d..6f580b2bf004 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -4445,6 +4445,11 @@ static int shmem_error_remove_page(struct address_space *mapping,
 	return 0;
 }
 
+static inline shmem_file_ioctl(struct file *file, unsigned int cmd, unsigned int arg)
+{
+	return memfd_ioctl(file, cmd, arg);
+}
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.dirty_folio	= noop_dirty_folio,
@@ -4471,6 +4476,8 @@ static const struct file_operations shmem_file_operations = {
 	.splice_read	= shmem_file_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= shmem_fallocate,
+	.unlocked_ioctl	= shmem_file_ioctl,
+	.compat_ioctl	= shmem_file_ioctl,
 #endif
 };
 
> 
> > >
> > > Does this check "mean" anything for other files? Because if it's
> > > generically useful (and got renamed) it maybe would be right for
> > > fcntl...
> >
> > For regular files it just means that the file has gotten write access to
> > the underlying fs and we use this flag to release the necessary
> > refcounts/protections once the file is closed.
> >
> > If this check has any meaning beyond that than it only has meaning for
> > memfd. I'm also not sure why this checks FMODE_WRITER and not
> > FMODE_WRITE as FMODE_WRITER is almost an entirely internal thing that
> > only very specific codepaths need to know about.
> 
> If you reopen the memfd via /proc/<pid>/fd/ with O_RDWR, both file
> objects (the original and the reopened one) have FMODE_WRITE, so
> knowing if the flag is set gives me nothing. FMODE_WRITER is the only
> difference between the original fd and the reopened one. This flag
> also dictates whether `inode->i_writecount` will be decremented on
> close (in `put_file_access`) which influences exec()ability of the
> other fd. It surprised me too that this flag theoretically means
> "write access to underlying fs" but it's used to determine whether to
> decrement i_writecount.

It's not surprising at all. It's how read-write fds are created.
memfd_create() is the oddball because it uses an internal tmpfs
kernmount that's never remountable read-only afaict otherwise this would
be pretty broken as memfd_create() doesn't do the usual mark sb writable
and get write access to the mount dance and so the mount or sb could go
ro behind it's back.

And this has zero meaning for non-memfd. So NAK on fcntl(). However, as
an ioctl() on memfd backends we won't care.

And I would suggest to figure out a nicer way to differentiate between
memfd_create() and regular tmpfs/hugetlbfs files. So you can explicitly
detect them instead of this implict F_SEAL_SEALS dance. For example,
abuse a higher flag bit in unsigned int seals that both tmpfs and
hugetlbfs already have and set F_MEMFD or whatever. Or some other
mechanism. Just a thought.

Really, fcntl() is wrong here.