mbox series

[v8,RESEND,0/8] fs: clean up internal i_version handling

Message ID 20230124193025.185781-1-jlayton@kernel.org
Headers show
Series fs: clean up internal i_version handling | expand

Message

Jeff Layton Jan. 24, 2023, 7:30 p.m. UTC
I had inteded to send a PR for this for v6.2, but I got sidetracked
with different issues, and didn't get it together in time. This set has
been sitting in linux-next since October, and it seems to be behaving,
so I intend to send a PR when the v6.3 merge window opens.

Though nothing has really changed since last year, I'm resending now
in the hopes I can collect a few more Reviewed-bys (ones from Al, Trond
and Anna would be particularly welcome).

The main consumer of i_version field (knfsd) has to jump through a
number of hoops to fetch it, depending on what sort of inode it is.
Rather than do this, we want to offload the responsibility for
presenting this field to the filesystem's ->getattr operation, which is
a more natural way to deal with a field that may be implemented
differently.

The focus of this patchset is to clean up these internal interfaces.
This should also make it simple to present this attribute to userland in
the future, which should be possible once the semantics are a bit more
consistent across different backing filesystems.

Thanks!

Jeff Layton (8):
  fs: uninline inode_query_iversion
  fs: clarify when the i_version counter must be updated
  vfs: plumb i_version handling into struct kstat
  nfs: report the inode version in getattr if requested
  ceph: report the inode version in getattr if requested
  nfsd: move nfsd4_change_attribute to nfsfh.c
  nfsd: use the getattr operation to fetch i_version
  nfsd: remove fetch_iversion export operation

 fs/ceph/inode.c          | 16 +++++++----
 fs/libfs.c               | 36 ++++++++++++++++++++++++
 fs/nfs/export.c          |  7 -----
 fs/nfs/inode.c           | 16 ++++++++---
 fs/nfsd/nfs4xdr.c        |  4 ++-
 fs/nfsd/nfsfh.c          | 42 ++++++++++++++++++++++++++++
 fs/nfsd/nfsfh.h          | 29 +-------------------
 fs/nfsd/vfs.h            |  7 ++++-
 fs/stat.c                | 17 ++++++++++--
 include/linux/exportfs.h |  1 -
 include/linux/iversion.h | 59 ++++++++++++++--------------------------
 include/linux/stat.h     |  9 ++++++
 12 files changed, 156 insertions(+), 87 deletions(-)

Comments

Jan Kara Jan. 25, 2023, 1:44 p.m. UTC | #1
On Tue 24-01-23 14:30:18, Jeff Layton wrote:
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Reasonable. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/libfs.c               | 36 ++++++++++++++++++++++++++++++++++++
>  include/linux/iversion.h | 38 ++------------------------------------
>  2 files changed, 38 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/libfs.c b/fs/libfs.c
> index aada4e7c8713..17ecc47696e1 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -1582,3 +1582,39 @@ bool inode_maybe_inc_iversion(struct inode *inode, bool force)
>  	return true;
>  }
>  EXPORT_SYMBOL(inode_maybe_inc_iversion);
> +
> +/**
> + * inode_query_iversion - read i_version for later use
> + * @inode: inode from which i_version should be read
> + *
> + * Read the inode i_version counter. This should be used by callers that wish
> + * to store the returned i_version for later comparison. This will guarantee
> + * that a later query of the i_version will result in a different value if
> + * anything has changed.
> + *
> + * In this implementation, we fetch the current value, set the QUERIED flag and
> + * then try to swap it into place with a cmpxchg, if it wasn't already set. If
> + * that fails, we try again with the newly fetched value from the cmpxchg.
> + */
> +u64 inode_query_iversion(struct inode *inode)
> +{
> +	u64 cur, new;
> +
> +	cur = inode_peek_iversion_raw(inode);
> +	do {
> +		/* If flag is already set, then no need to swap */
> +		if (cur & I_VERSION_QUERIED) {
> +			/*
> +			 * This barrier (and the implicit barrier in the
> +			 * cmpxchg below) pairs with the barrier in
> +			 * inode_maybe_inc_iversion().
> +			 */
> +			smp_mb();
> +			break;
> +		}
> +
> +		new = cur | I_VERSION_QUERIED;
> +	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> +	return cur >> I_VERSION_QUERIED_SHIFT;
> +}
> +EXPORT_SYMBOL(inode_query_iversion);
> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index e27bd4f55d84..6755d8b4f20b 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -234,42 +234,6 @@ inode_peek_iversion(const struct inode *inode)
>  	return inode_peek_iversion_raw(inode) >> I_VERSION_QUERIED_SHIFT;
>  }
>  
> -/**
> - * inode_query_iversion - read i_version for later use
> - * @inode: inode from which i_version should be read
> - *
> - * Read the inode i_version counter. This should be used by callers that wish
> - * to store the returned i_version for later comparison. This will guarantee
> - * that a later query of the i_version will result in a different value if
> - * anything has changed.
> - *
> - * In this implementation, we fetch the current value, set the QUERIED flag and
> - * then try to swap it into place with a cmpxchg, if it wasn't already set. If
> - * that fails, we try again with the newly fetched value from the cmpxchg.
> - */
> -static inline u64
> -inode_query_iversion(struct inode *inode)
> -{
> -	u64 cur, new;
> -
> -	cur = inode_peek_iversion_raw(inode);
> -	do {
> -		/* If flag is already set, then no need to swap */
> -		if (cur & I_VERSION_QUERIED) {
> -			/*
> -			 * This barrier (and the implicit barrier in the
> -			 * cmpxchg below) pairs with the barrier in
> -			 * inode_maybe_inc_iversion().
> -			 */
> -			smp_mb();
> -			break;
> -		}
> -
> -		new = cur | I_VERSION_QUERIED;
> -	} while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));
> -	return cur >> I_VERSION_QUERIED_SHIFT;
> -}
> -
>  /*
>   * For filesystems without any sort of change attribute, the best we can
>   * do is fake one up from the ctime:
> @@ -283,6 +247,8 @@ static inline u64 time_to_chattr(struct timespec64 *t)
>  	return chattr;
>  }
>  
> +u64 inode_query_iversion(struct inode *inode);
> +
>  /**
>   * inode_eq_iversion_raw - check whether the raw i_version counter has changed
>   * @inode: inode to check
> -- 
> 2.39.1
>
Jan Kara Jan. 25, 2023, 4:06 p.m. UTC | #2
On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> The i_version field in the kernel has had different semantics over
> the decades, but NFSv4 has certain expectations. Update the comments
> in iversion.h to describe when the i_version must change.
> 
> Cc: Colin Walters <walters@verbum.org>
> Cc: NeilBrown <neilb@suse.de>
> Cc: Trond Myklebust <trondmy@hammerspace.com>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>

Looks good to me. But one note below:

> diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> index 6755d8b4f20b..fced8115a5f4 100644
> --- a/include/linux/iversion.h
> +++ b/include/linux/iversion.h
> @@ -9,8 +9,25 @@
>   * ---------------------------
>   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
>   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> - * appear different to observers if there was a change to the inode's data or
> - * metadata since it was last queried.
> + * appear larger to observers if there was an explicit change to the inode's
> + * data or metadata since it was last queried.
> + *
> + * An explicit change is one that would ordinarily result in a change to the
> + * inode status change time (aka ctime). i_version must appear to change, even
> + * if the ctime does not (since the whole point is to avoid missing updates due
> + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> + * ctime must change due to an operation, then the i_version counter must be
> + * incremented as well.
> + *
> + * Making the i_version update completely atomic with the operation itself would
> + * be prohibitively expensive. Traditionally the kernel has updated the times on
> + * directories after an operation that changes its contents. For regular files,
> + * the ctime is usually updated before the data is copied into the cache for a
> + * write. This means that there is a window of time when an observer can
> + * associate a new timestamp with old file contents. Since the purpose of the
> + * i_version is to allow for better cache coherency, the i_version must always
> + * be updated after the results of the operation are visible. Updating it before
> + * and after a change is also permitted.

This sounds good but it is not the case for any of the current filesystems, is
it? Perhaps the documentation should mention this so that people are not
confused?

								Honza
Christian Brauner Jan. 26, 2023, 9:02 a.m. UTC | #3
On Tue, Jan 24, 2023 at 02:30:18PM -0500, Jeff Layton wrote:
> Reviewed-by: NeilBrown <neilb@suse.de>
> Signed-off-by: Jeff Layton <jlayton@kernel.org>
> ---

Looks good to me,
Reviewed-by: Christian Brauner <brauner@kernel.org>
Jeff Layton Jan. 26, 2023, 10:54 a.m. UTC | #4
On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > The i_version field in the kernel has had different semantics over
> > the decades, but NFSv4 has certain expectations. Update the comments
> > in iversion.h to describe when the i_version must change.
> > 
> > Cc: Colin Walters <walters@verbum.org>
> > Cc: NeilBrown <neilb@suse.de>
> > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > Cc: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> 
> Looks good to me. But one note below:
> 
> > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > index 6755d8b4f20b..fced8115a5f4 100644
> > --- a/include/linux/iversion.h
> > +++ b/include/linux/iversion.h
> > @@ -9,8 +9,25 @@
> >   * ---------------------------
> >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > - * appear different to observers if there was a change to the inode's data or
> > - * metadata since it was last queried.
> > + * appear larger to observers if there was an explicit change to the inode's
> > + * data or metadata since it was last queried.
> > + *
> > + * An explicit change is one that would ordinarily result in a change to the
> > + * inode status change time (aka ctime). i_version must appear to change, even
> > + * if the ctime does not (since the whole point is to avoid missing updates due
> > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > + * ctime must change due to an operation, then the i_version counter must be
> > + * incremented as well.
> > + *
> > + * Making the i_version update completely atomic with the operation itself would
> > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > + * directories after an operation that changes its contents. For regular files,
> > + * the ctime is usually updated before the data is copied into the cache for a
> > + * write. This means that there is a window of time when an observer can
> > + * associate a new timestamp with old file contents. Since the purpose of the
> > + * i_version is to allow for better cache coherency, the i_version must always
> > + * be updated after the results of the operation are visible. Updating it before
> > + * and after a change is also permitted.
> 
> This sounds good but it is not the case for any of the current filesystems, is
> it? Perhaps the documentation should mention this so that people are not
> confused?
> 
> 								Honza

Correct. Currently, all filesystems change the times and version before
a write instead of after. I'm hoping that situation will change soon
though, as I've been working on a patchset to fix this for tmpfs, ext4
and btrfs.

If you still want to see something for this though, what would you
suggest for verbiage?

Thanks,
Jan Kara Jan. 26, 2023, 11:36 a.m. UTC | #5
On Thu 26-01-23 05:54:16, Jeff Layton wrote:
> On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> > On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > > The i_version field in the kernel has had different semantics over
> > > the decades, but NFSv4 has certain expectations. Update the comments
> > > in iversion.h to describe when the i_version must change.
> > > 
> > > Cc: Colin Walters <walters@verbum.org>
> > > Cc: NeilBrown <neilb@suse.de>
> > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > Cc: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > 
> > Looks good to me. But one note below:
> > 
> > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > index 6755d8b4f20b..fced8115a5f4 100644
> > > --- a/include/linux/iversion.h
> > > +++ b/include/linux/iversion.h
> > > @@ -9,8 +9,25 @@
> > >   * ---------------------------
> > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > - * appear different to observers if there was a change to the inode's data or
> > > - * metadata since it was last queried.
> > > + * appear larger to observers if there was an explicit change to the inode's
> > > + * data or metadata since it was last queried.
> > > + *
> > > + * An explicit change is one that would ordinarily result in a change to the
> > > + * inode status change time (aka ctime). i_version must appear to change, even
> > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > > + * ctime must change due to an operation, then the i_version counter must be
> > > + * incremented as well.
> > > + *
> > > + * Making the i_version update completely atomic with the operation itself would
> > > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > > + * directories after an operation that changes its contents. For regular files,
> > > + * the ctime is usually updated before the data is copied into the cache for a
> > > + * write. This means that there is a window of time when an observer can
> > > + * associate a new timestamp with old file contents. Since the purpose of the
> > > + * i_version is to allow for better cache coherency, the i_version must always
> > > + * be updated after the results of the operation are visible. Updating it before
> > > + * and after a change is also permitted.
> > 
> > This sounds good but it is not the case for any of the current filesystems, is
> > it? Perhaps the documentation should mention this so that people are not
> > confused?
> 
> Correct. Currently, all filesystems change the times and version before
> a write instead of after. I'm hoping that situation will change soon
> though, as I've been working on a patchset to fix this for tmpfs, ext4
> and btrfs.

That is good but we'll see how long it takes to get merged. AFAIR it is not
a complete nobrainer ;)

> If you still want to see something for this though, what would you
> suggest for verbiage?

Sure:

... the i_version must a be updated after the results of the operation are
visible (note that none of the filesystems currently do this, it is a work
in progress to fix this).

And once your patches are merged, you can also delete this note :).

								Honza
Jeff Layton Jan. 26, 2023, 12:02 p.m. UTC | #6
On Thu, 2023-01-26 at 12:36 +0100, Jan Kara wrote:
> On Thu 26-01-23 05:54:16, Jeff Layton wrote:
> > On Wed, 2023-01-25 at 17:06 +0100, Jan Kara wrote:
> > > On Tue 24-01-23 14:30:19, Jeff Layton wrote:
> > > > The i_version field in the kernel has had different semantics over
> > > > the decades, but NFSv4 has certain expectations. Update the comments
> > > > in iversion.h to describe when the i_version must change.
> > > > 
> > > > Cc: Colin Walters <walters@verbum.org>
> > > > Cc: NeilBrown <neilb@suse.de>
> > > > Cc: Trond Myklebust <trondmy@hammerspace.com>
> > > > Cc: Dave Chinner <david@fromorbit.com>
> > > > Signed-off-by: Jeff Layton <jlayton@kernel.org>
> > > 
> > > Looks good to me. But one note below:
> > > 
> > > > diff --git a/include/linux/iversion.h b/include/linux/iversion.h
> > > > index 6755d8b4f20b..fced8115a5f4 100644
> > > > --- a/include/linux/iversion.h
> > > > +++ b/include/linux/iversion.h
> > > > @@ -9,8 +9,25 @@
> > > >   * ---------------------------
> > > >   * The change attribute (i_version) is mandated by NFSv4 and is mostly for
> > > >   * knfsd, but is also used for other purposes (e.g. IMA). The i_version must
> > > > - * appear different to observers if there was a change to the inode's data or
> > > > - * metadata since it was last queried.
> > > > + * appear larger to observers if there was an explicit change to the inode's
> > > > + * data or metadata since it was last queried.
> > > > + *
> > > > + * An explicit change is one that would ordinarily result in a change to the
> > > > + * inode status change time (aka ctime). i_version must appear to change, even
> > > > + * if the ctime does not (since the whole point is to avoid missing updates due
> > > > + * to timestamp granularity). If POSIX or other relevant spec mandates that the
> > > > + * ctime must change due to an operation, then the i_version counter must be
> > > > + * incremented as well.
> > > > + *
> > > > + * Making the i_version update completely atomic with the operation itself would
> > > > + * be prohibitively expensive. Traditionally the kernel has updated the times on
> > > > + * directories after an operation that changes its contents. For regular files,
> > > > + * the ctime is usually updated before the data is copied into the cache for a
> > > > + * write. This means that there is a window of time when an observer can
> > > > + * associate a new timestamp with old file contents. Since the purpose of the
> > > > + * i_version is to allow for better cache coherency, the i_version must always
> > > > + * be updated after the results of the operation are visible. Updating it before
> > > > + * and after a change is also permitted.
> > > 
> > > This sounds good but it is not the case for any of the current filesystems, is
> > > it? Perhaps the documentation should mention this so that people are not
> > > confused?
> > 
> > Correct. Currently, all filesystems change the times and version before
> > a write instead of after. I'm hoping that situation will change soon
> > though, as I've been working on a patchset to fix this for tmpfs, ext4
> > and btrfs.
> 
> That is good but we'll see how long it takes to get merged. AFAIR it is not
> a complete nobrainer ;)
> 
> > If you still want to see something for this though, what would you
> > suggest for verbiage?
> 
> Sure:
> 
> ... the i_version must a be updated after the results of the operation are
> visible (note that none of the filesystems currently do this, it is a work
> in progress to fix this).
> 
> And once your patches are merged, you can also delete this note :).
> 
> 								Honza

Sounds good, I folded something similar to that into the patch and
pushed it into the branch I'm feeding into linux-next. I won't bother
re-posting for just that though:

diff --git a/include/linux/iversion.h b/include/linux/iversion.h
index fced8115a5f4..f174ff1b59ee 100644
--- a/include/linux/iversion.h
+++ b/include/linux/iversion.h
@@ -27,7 +27,8 @@
  * associate a new timestamp with old file contents. Since the purpose of the
  * i_version is to allow for better cache coherency, the i_version must always
  * be updated after the results of the operation are visible. Updating it before
- * and after a change is also permitted.
+ * and after a change is also permitted. (Note that no filesystems currently do
+ * this. Fixing that is a work-in-progress).
  *
  * Observers see the i_version as a 64-bit number that never decreases. If it
  * remains the same since it was last checked, then nothing has changed in the

Thanks!