[man-pages,RFC,v4] statx, inode: document the new STATX_INO_VERSION field

Message ID 20220907111606.18831-1-jlayton@kernel.org
State New
Series [man-pages,RFC,v4] statx, inode: document the new STATX_INO_VERSION field

Commit Message

Jeff Layton Sept. 7, 2022, 11:16 a.m. UTC
I'm proposing to expose the inode change attribute via statx [1]. Document
what this value means and what an observer can infer from it changing.

Signed-off-by: Jeff Layton <jlayton@kernel.org>

[1]: https://lore.kernel.org/linux-nfs/20220826214703.134870-1-jlayton@kernel.org/T/#t
---
 man2/statx.2 |  8 ++++++++
 man7/inode.7 | 39 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

v4: add paragraph pointing out the lack of atomicity wrt other changes

I think these patches are racing with another change to add DIO
alignment info to statx. I imagine this will go in after that, so this
will probably need to be respun to account for contextual differences.

What I'm mostly interested in here is getting the semantics and
description of the i_version counter nailed down.
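
For context, here is a minimal sketch of how userspace would query the
proposed field. STATX_INO_VERSION and stx_ino_version are only the names
proposed in this series; they are not in any released kernel or glibc
headers, so the snippet is illustrative and will not build without
patched headers:

    /* Sketch only: assumes headers carrying the proposed
     * STATX_INO_VERSION mask bit and struct statx::stx_ino_version. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
            struct statx stx;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <path>\n", argv[0]);
                    return 1;
            }
            if (statx(AT_FDCWD, argv[1], 0, STATX_INO_VERSION, &stx) < 0) {
                    perror("statx");
                    return 1;
            }
            if (stx.stx_mask & STATX_INO_VERSION)
                    printf("i_version: %llu\n",
                           (unsigned long long)stx.stx_ino_version);
            else
                    printf("i_version not reported for this file\n");
            return 0;
    }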

Comments

J. Bruce Fields Sept. 7, 2022, 12:20 p.m. UTC | #1
On Wed, Sep 07, 2022 at 09:37:33PM +1000, NeilBrown wrote:
> On Wed, 07 Sep 2022, Jeff Layton wrote:
> > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the
> > +other changes in the inode. On a write, for instance, the i_version is usually
> > +incremented before the data is copied into the pagecache. Therefore it is
> > +possible to see a new i_version value while a read still shows the old data.
> 
> Doesn't that make the value useless?  Surely the change number must
> change no sooner than the change itself is visible, otherwise stale data
> could be cached indefinitely.

For the purposes of NFS close-to-open, I guess all we need is for the
change attribute increment to happen sometime between the open and the
close.

But, yes, it'd seem a lot more useful if it was guaranteed to happen
after.  (Or before and after both--extraneous increments aren't a big
problem here.)

--b.

> 
> If current implementations behave this way, surely they are broken.
> 
> NeilBrown
J. Bruce Fields Sept. 7, 2022, 12:52 p.m. UTC | #2
On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the
> > > +other changes in the inode. On a write, for instance, the i_version is usually
> > > +incremented before the data is copied into the pagecache. Therefore it is
> > > +possible to see a new i_version value while a read still shows the old data.
> > 
> > Doesn't that make the value useless?
> > 
> 
> No, I don't think so. It's only really useful for comparing to an older
> sample anyway. If you do "statx; read; statx" and the value hasn't
> changed, then you know that things are stable. 

I don't see how that helps.  It's still possible to get:

		reader		writer
		------		------
				i_version++
		statx
		read
		statx
				update page cache

right?

--b.

> 
> > Surely the change number must
> > change no sooner than the change itself is visible, otherwise stale data
> > could be cached indefinitely.
> > 
> > If current implementations behave this way, surely they are broken.
> 
> It's certainly not ideal but we've never been able to offer truly atomic
> behavior here given that Linux is a general-purpose OS. The behavior is
> a little inconsistent too:
> 
> The c/mtime update and i_version bump on directories (mostly) occur
> after the operation. c/mtime updates for files however are mostly driven
> by calls to file_update_time, which happens before data is copied to the
> pagecache.
> 
> It's not clear to me why it's done this way. Maybe to ensure that the
> metadata is up to date in the event that a statx comes in? Improving
> this would be nice, but I don't see a way to do that without regressing
> performance.
> -- 
> Jeff Layton <jlayton@kernel.org>
Jeff Layton Sept. 7, 2022, 1:12 p.m. UTC | #3
On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote:
> On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the
> > > > +other changes in the inode. On a write, for instance, the i_version is usually
> > > > +incremented before the data is copied into the pagecache. Therefore it is
> > > > +possible to see a new i_version value while a read still shows the old data.
> > > 
> > > Doesn't that make the value useless?
> > > 
> > 
> > No, I don't think so. It's only really useful for comparing to an older
> > sample anyway. If you do "statx; read; statx" and the value hasn't
> > changed, then you know that things are stable. 
> 
> I don't see how that helps.  It's still possible to get:
> 
> 		reader		writer
> 		------		------
> 				i_version++
> 		statx
> 		read
> 		statx
> 				update page cache
> 
> right?
> 

Yeah, I suppose so -- the statx wouldn't necessitate any locking. In
that case, maybe this is useless then other than for testing purposes
and userland NFS servers.

Would it be better to not consume a statx field with this if so? What
could we use as an alternate interface? ioctl? Some sort of global
virtual xattr? It does need to be something per-inode.

> > 
> > > Surely the change number must
> > > change no sooner than the change itself is visible, otherwise stale data
> > > could be cached indefinitely.
> > > 
> > > If current implementations behave this way, surely they are broken.
> > 
> > It's certainly not ideal but we've never been able to offer truly atomic
> > behavior here given that Linux is a general-purpose OS. The behavior is
> > a little inconsistent too:
> > 
> > The c/mtime update and i_version bump on directories (mostly) occur
> > after the operation. c/mtime updates for files however are mostly driven
> > by calls to file_update_time, which happens before data is copied to the
> > pagecache.
> > 
> > It's not clear to me why it's done this way. Maybe to ensure that the
> > metadata is up to date in the event that a statx comes in? Improving
> > this would be nice, but I don't see a way to do that without regressing
> > performance.
> > -- 
> > Jeff Layton <jlayton@kernel.org>
Jan Kara Sept. 7, 2022, 1:51 p.m. UTC | #4
On Wed 07-09-22 09:12:34, Jeff Layton wrote:
> On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote:
> > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > > > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the
> > > > > +other changes in the inode. On a write, for instance, the i_version is usually
> > > > > +incremented before the data is copied into the pagecache. Therefore it is
> > > > > +possible to see a new i_version value while a read still shows the old data.
> > > > 
> > > > Doesn't that make the value useless?
> > > > 
> > > 
> > > No, I don't think so. It's only really useful for comparing to an older
> > > sample anyway. If you do "statx; read; statx" and the value hasn't
> > > changed, then you know that things are stable. 
> > 
> > I don't see how that helps.  It's still possible to get:
> > 
> > 		reader		writer
> > 		------		------
> > 				i_version++
> > 		statx
> > 		read
> > 		statx
> > 				update page cache
> > 
> > right?
> > 
> 
> Yeah, I suppose so -- the statx wouldn't necessitate any locking. In
> that case, maybe this is useless then other than for testing purposes
> and userland NFS servers.
> 
> Would it be better to not consume a statx field with this if so? What
> could we use as an alternate interface? ioctl? Some sort of global
> virtual xattr? It does need to be something per-inode.

I was thinking how hard would it be to increment i_version after updating
data but it will be rather hairy. In particular because of stuff like
IOCB_NOWAIT support which needs to bail if i_version update is needed. So
yeah, I don't think there's an easy way how to provide useful i_version for
general purpose use.

								Honza
Jeff Layton Sept. 7, 2022, 2:43 p.m. UTC | #5
On Wed, 2022-09-07 at 15:51 +0200, Jan Kara wrote:
> On Wed 07-09-22 09:12:34, Jeff Layton wrote:
> > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote:
> > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > > > > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the
> > > > > > +other changes in the inode. On a write, for instance, the i_version is usually
> > > > > > +incremented before the data is copied into the pagecache. Therefore it is
> > > > > > +possible to see a new i_version value while a read still shows the old data.
> > > > > 
> > > > > Doesn't that make the value useless?
> > > > > 
> > > > 
> > > > No, I don't think so. It's only really useful for comparing to an older
> > > > sample anyway. If you do "statx; read; statx" and the value hasn't
> > > > changed, then you know that things are stable. 
> > > 
> > > I don't see how that helps.  It's still possible to get:
> > > 
> > > 		reader		writer
> > > 		------		------
> > > 				i_version++
> > > 		statx
> > > 		read
> > > 		statx
> > > 				update page cache
> > > 
> > > right?
> > > 
> > 
> > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In
> > that case, maybe this is useless then other than for testing purposes
> > and userland NFS servers.
> > 
> > Would it be better to not consume a statx field with this if so? What
> > could we use as an alternate interface? ioctl? Some sort of global
> > virtual xattr? It does need to be something per-inode.
> 
> I was thinking how hard would it be to increment i_version after updating
> data but it will be rather hairy. In particular because of stuff like
> IOCB_NOWAIT support which needs to bail if i_version update is needed. So
> yeah, I don't think there's an easy way how to provide useful i_version for
> general purpose use.
> 

Yeah, it does look ugly.

Another idea might be to just take the i_rwsem for read in the statx
codepath when STATX_INO_VERSION has been requested. xfs, ext4 and btrfs
hold the i_rwsem exclusively over their buffered write ops. Doing that
should be enough to prevent the race above, I think. The ext4 DAX path
also looks ok there.

The ext4 DIO write implementation seems to take the i_rwsem for read
though unless the size is changing or the write is unaligned. So an
i_rwsem read lock would probably not be enough to guard against changes
there. Maybe we can just say if you're doing DIO, then don't expect real
atomicity wrt i_version?

knfsd seems to already hold i_rwsem when doing directory morphing
operations (where it fetches the pre and post attrs), but it doesn't
take it when calling nfsd4_encode_fattr (which is used to fill out
GETATTR and READDIR replies, etc.). We'd probably have to start taking
it in those codepaths too.

We should also bear in mind that from userland, doing a read of a normal
file and fetching the i_version takes two different syscalls. I'm not
sure we need things to be truly "atomic", per se. Whether and how we can
exploit that fact, I'm not sure.
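
For illustration, the i_rwsem idea above might look roughly like the
following in the getattr path. This is a sketch, not kernel code that
exists anywhere; it also glosses over callers of vfs_getattr_nosec that
may already hold i_rwsem, and over DIO as noted above:

    /* Sketch only: sample i_version under the shared i_rwsem so that a
     * buffered write (which holds i_rwsem exclusive on xfs/ext4/btrfs)
     * cannot be half-visible while the counter is read. */
    static u64 statx_sample_ino_version(struct inode *inode)
    {
            u64 version;

            inode_lock_shared(inode);               /* read side of i_rwsem */
            version = inode_query_iversion(inode);  /* also marks it QUERIED */
            inode_unlock_shared(inode);

            return version;
    }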
Jeff Layton Sept. 7, 2022, 3:11 p.m. UTC | #6
On Wed, 2022-09-07 at 15:04 +0000, Trond Myklebust wrote:
> On Wed, 2022-09-07 at 10:05 -0400, Jeff Layton wrote:
> > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote:
> > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote:
> > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote:
> > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic
> > > > > > > > with
> > > > > > > > respect to the
> > > > > > > > +other changes in the inode. On a write, for instance,
> > > > > > > > the
> > > > > > > > i_version is usually
> > > > > > > > +incremented before the data is copied into the
> > > > > > > > pagecache.
> > > > > > > > Therefore it is
> > > > > > > > +possible to see a new i_version value while a read still
> > > > > > > > shows the old data.
> > > > > > > 
> > > > > > > Doesn't that make the value useless?
> > > > > > > 
> > > > > > 
> > > > > > No, I don't think so. It's only really useful for comparing
> > > > > > to an
> > > > > > older
> > > > > > sample anyway. If you do "statx; read; statx" and the value
> > > > > > hasn't
> > > > > > changed, then you know that things are stable. 
> > > > > 
> > > > > I don't see how that helps.  It's still possible to get:
> > > > > 
> > > > >                 reader          writer
> > > > >                 ------          ------
> > > > >                                 i_version++
> > > > >                 statx
> > > > >                 read
> > > > >                 statx
> > > > >                                 update page cache
> > > > > 
> > > > > right?
> > > > > 
> > > > 
> > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking.
> > > > In
> > > > that case, maybe this is useless then other than for testing
> > > > purposes
> > > > and userland NFS servers.
> > > > 
> > > > Would it be better to not consume a statx field with this if so?
> > > > What
> > > > could we use as an alternate interface? ioctl? Some sort of
> > > > global
> > > > virtual xattr? It does need to be something per-inode.
> > > 
> > > I don't see how a non-atomic change attribute is remotely useful
> > > even
> > > for NFS.
> > > 
> > > The main problem is not so much the above (although NFS clients are
> > > vulnerable to that too) but the behaviour w.r.t. directory changes.
> > > 
> > > If the server can't guarantee that file/directory/... creation and
> > > unlink are atomically recorded with change attribute updates, then
> > > the
> > > client has to always assume that the server is lying, and that it
> > > has
> > > to revalidate all its caches anyway. Cue endless
> > > readdir/lookup/getattr
> > > requests after each and every directory modification in order to
> > > check
> > > that some other client didn't also sneak in a change of their own.
> > > 
> > 
> > We generally hold the parent dir's inode->i_rwsem exclusively over
> > most
> > important directory changes, and the times/i_version are also updated
> > while holding it. What we don't do is serialize reads of this value
> > vs.
> > the i_rwsem, so you could see new directory contents alongside an old
> > i_version. Maybe we should be taking it for read when we query it on
> > a
> > directory?
> 
> Serialising reads is not the problem. The problem is ensuring that
> knfsd is able to provide an atomic change_info4 structure when the
> client modifies the directory.
> i.e. the requirement is that if the directory changed, then that
> modification is atomically accompanied by an update of the change
> attribute that can be retrieved by knfsd and placed in the reply to the
> client.
> 

I think we already do that for directories today via the i_rwsem. We
hold that exclusively over directory-morphing operations, and the
i_version is updated while holding that lock.

> > Achieving atomicity with file writes though is another matter
> > entirely.
> > I'm not sure that's even doable or how to approach it if so.
> > Suggestions?
> 
> The problem outlined by Bruce above isn't a big deal. Just check the
> I_VERSION_QUERIED flag after the 'update_page_cache' bit, and bump the
> i_version if that's the case. The real problem is what happens if you
> then crash during writeback...
> 

It's uglier than it looks at first glance. As Jan pointed out, it's
possible for the initial file_modified call to succeed and then a second
one to fail. If the time got an initial update and then the data was
copied in, should we fail the write at that point?

We may be better served by trying to also do this with the i_rwsem. I'm
looking at that now, though it's a bit hairy given that
vfs_getattr_nosec can be called either with or without it held.
Jan Kara Sept. 8, 2022, 8:33 a.m. UTC | #7
On Thu 08-09-22 10:44:22, NeilBrown wrote:
> On Wed, 07 Sep 2022, Jan Kara wrote:
> > On Wed 07-09-22 09:12:34, Jeff Layton wrote:
> > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote:
> > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the
> > > > > > > +other changes in the inode. On a write, for instance, the i_version is usually
> > > > > > > +incremented before the data is copied into the pagecache. Therefore it is
> > > > > > > +possible to see a new i_version value while a read still shows the old data.
> > > > > > 
> > > > > > Doesn't that make the value useless?
> > > > > > 
> > > > > 
> > > > > No, I don't think so. It's only really useful for comparing to an older
> > > > > sample anyway. If you do "statx; read; statx" and the value hasn't
> > > > > changed, then you know that things are stable. 
> > > > 
> > > > I don't see how that helps.  It's still possible to get:
> > > > 
> > > > 		reader		writer
> > > > 		------		------
> > > > 				i_version++
> > > > 		statx
> > > > 		read
> > > > 		statx
> > > > 				update page cache
> > > > 
> > > > right?
> > > > 
> > > 
> > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In
> > > that case, maybe this is useless then other than for testing purposes
> > > and userland NFS servers.
> > > 
> > > Would it be better to not consume a statx field with this if so? What
> > > could we use as an alternate interface? ioctl? Some sort of global
> > > virtual xattr? It does need to be something per-inode.
> > 
> > I was thinking how hard would it be to increment i_version after updating
> > data but it will be rather hairy. In particular because of stuff like
> > IOCB_NOWAIT support which needs to bail if i_version update is needed. So
> > yeah, I don't think there's an easy way how to provide useful i_version for
> > general purpose use.
> > 
> 
> Why cannot IOCB_NOWAIT update i_version?  Do we not want to wait on the
> cmp_xchg loop in inode_maybe_inc_iversion(), or do we not want to
> trigger an inode update?
> 
> The first seems unlikely, but the second seems unreasonable.  We already
> acknowledge that after a crash iversion might go backwards and/or miss
> changes.

It boils down to the fact that we don't want to call mark_inode_dirty()
from IOCB_NOWAIT path because for lots of filesystems that means journal
operation and there are high chances that may block.

Presumably we could treat inode dirtying after i_version change similarly
to how we handle timestamp updates with lazytime mount option (i.e., not
dirty the inode immediately but only with a delay) but then the time window
for i_version inconsistencies due to a crash would be much larger.

								Honza
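
For reference, the increment helper being discussed is roughly the
following (paraphrased from include/linux/iversion.h; details vary by
kernel version). The cmpxchg loop itself is lockless and cheap; what
forces the IOCB_NOWAIT bail-out is the mark_inode_dirty()/journalling a
filesystem has to do afterwards so the new value reaches disk:

    /* Paraphrase of inode_maybe_inc_iversion(): the low bit of i_version
     * is the I_VERSION_QUERIED flag, so increments go in steps of 2 and
     * are skipped entirely if nobody queried the value since the last bump. */
    static bool maybe_inc_iversion(struct inode *inode, bool force)
    {
            u64 cur = inode_peek_iversion_raw(inode);
            u64 new;

            do {
                    if (!force && !(cur & I_VERSION_QUERIED))
                            return false;   /* nobody looked; no bump needed */
                    new = (cur & ~I_VERSION_QUERIED) + I_VERSION_INCREMENT;
            } while (!atomic64_try_cmpxchg(&inode->i_version, &cur, new));

            return true;
    }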
Jeff Layton Sept. 8, 2022, 11:37 a.m. UTC | #8
On Thu, 2022-09-08 at 00:41 +0000, Trond Myklebust wrote:
> On Thu, 2022-09-08 at 10:31 +1000, NeilBrown wrote:
> > On Wed, 07 Sep 2022, Trond Myklebust wrote:
> > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote:
> > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote:
> > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic
> > > > > > > > with
> > > > > > > > respect to the
> > > > > > > > +other changes in the inode. On a write, for instance,
> > > > > > > > the
> > > > > > > > i_version is usually
> > > > > > > > +incremented before the data is copied into the
> > > > > > > > pagecache.
> > > > > > > > Therefore it is
> > > > > > > > +possible to see a new i_version value while a read still
> > > > > > > > shows the old data.
> > > > > > > 
> > > > > > > Doesn't that make the value useless?
> > > > > > > 
> > > > > > 
> > > > > > No, I don't think so. It's only really useful for comparing
> > > > > > to an
> > > > > > older
> > > > > > sample anyway. If you do "statx; read; statx" and the value
> > > > > > hasn't
> > > > > > changed, then you know that things are stable. 
> > > > > 
> > > > > I don't see how that helps.  It's still possible to get:
> > > > > 
> > > > >                 reader          writer
> > > > >                 ------          ------
> > > > >                                 i_version++
> > > > >                 statx
> > > > >                 read
> > > > >                 statx
> > > > >                                 update page cache
> > > > > 
> > > > > right?
> > > > > 
> > > > 
> > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking.
> > > > In
> > > > that case, maybe this is useless then other than for testing
> > > > purposes
> > > > and userland NFS servers.
> > > > 
> > > > Would it be better to not consume a statx field with this if so?
> > > > What
> > > > could we use as an alternate interface? ioctl? Some sort of
> > > > global
> > > > virtual xattr? It does need to be something per-inode.
> > > 
> > > I don't see how a non-atomic change attribute is remotely useful
> > > even
> > > for NFS.
> > > 
> > > The main problem is not so much the above (although NFS clients are
> > > vulnerable to that too) but the behaviour w.r.t. directory changes.
> > > 
> > > If the server can't guarantee that file/directory/... creation and
> > > unlink are atomically recorded with change attribute updates, then
> > > the
> > > client has to always assume that the server is lying, and that it
> > > has
> > > to revalidate all its caches anyway. Cue endless
> > > readdir/lookup/getattr
> > > requests after each and every directory modification in order to
> > > check
> > > that some other client didn't also sneak in a change of their own.
> > 
> > NFS re-export doesn't support atomic change attributes on
> > directories.
> > Do we see the endless revalidate requests after directory
> > modification
> > in that situation?  Just curious.
> 
> Why wouldn't NFS re-export be capable of supporting atomic change
> attributes in those cases, provided that the server does? It seems to
> me that is just a question of providing the correct information w.r.t.
> atomicity to knfsd.
> 
> ...but yes, a quick glance at nfs4_update_changeattr_locked(), and what
> happens when !cinfo->atomic should tell you all you need to know.

The main reason we disabled atomic change attribute updates was that
getattr calls on NFS can be pretty expensive. By setting the NOWCC flag,
we can avoid those for WCC info, but at the expense of the client having
to do more revalidation on its own.
Theodore Ts'o Sept. 8, 2022, 3:21 p.m. UTC | #9
On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote:
> It boils down to the fact that we don't want to call mark_inode_dirty()
> from IOCB_NOWAIT path because for lots of filesystems that means journal
> operation and there are high chances that may block.
> 
> Presumably we could treat inode dirtying after i_version change similarly
> to how we handle timestamp updates with lazytime mount option (i.e., not
> dirty the inode immediately but only with a delay) but then the time window
> for i_version inconsistencies due to a crash would be much larger.

Perhaps this is a radical suggestion, but there seems to be a lot of
the problems which are due to the concern "what if the file system
crashes" (and so we need to worry about making sure that any
increments to i_version MUST be persisted after it is incremented).

Well, if we assume that unclean shutdowns are rare, then perhaps we
shouldn't be optimizing for that case.  So.... what if a file system
had a counter which got incremented each time its journal is replayed
representing an unclean shutdown.  That shouldn't happen often, but if
it does, there might be any number of i_version updates that may have
gotten lost.  So in that case, the NFS client should invalidate all of
its caches.

If the i_version field was large enough, we could just prefix the
"unclean shutdown counter" with the existing i_version number when it
is sent over the NFS protocol to the client.  But if that field is too
small, and if (as I understand things) NFS just needs to know when
i_version is different, we could just simply hash the "unclean
shutdown counter" with the inode's "i_version counter", and let that
be the version which is sent from the NFS server to the client.

If we could do that, then it doesn't become critical that every single
i_version bump has to be persisted to disk, and we could treat it like
a lazytime update; it's guaranteed to be updated when we do a clean
unmount of the file system (and when the file system is frozen), but
on a crash, there is no guarantee that all i_version bumps will be
persisted, but we do have this "unclean shutdown" counter to deal with
that case.

Would this make life easier for folks?

						- Ted
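
One hypothetical way to wire that up (field and helper names invented
for illustration; nothing like this exists today) would be a
per-superblock counter, bumped whenever the journal is replayed, that
gets folded into the externally visible value, e.g. shifted into the
high bits:

    /* Hypothetical sketch only: s_crash_counter is an invented field and
     * the 48/16 split is arbitrary. The on-disk i_version is untouched;
     * only the value handed out over the wire changes after an unclean
     * shutdown, so every post-crash value differs from every pre-crash one. */
    static u64 change_attr_with_crash_counter(struct inode *inode)
    {
            u64 crash = inode->i_sb->s_crash_counter;       /* invented */
            u64 ver   = inode_query_iversion(inode);

            return (crash << 48) | (ver & ((1ULL << 48) - 1));
    }

Whether the mixing happens at query time like this, or when the value is
stored, changes how much a client ends up invalidating after a crash, as
discussed below.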
Jeff Layton Sept. 8, 2022, 3:44 p.m. UTC | #10
On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote:
> On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote:
> > It boils down to the fact that we don't want to call mark_inode_dirty()
> > from IOCB_NOWAIT path because for lots of filesystems that means journal
> > operation and there are high chances that may block.
> > 
> > Presumably we could treat inode dirtying after i_version change similarly
> > to how we handle timestamp updates with lazytime mount option (i.e., not
> > dirty the inode immediately but only with a delay) but then the time window
> > for i_version inconsistencies due to a crash would be much larger.
> 
> Perhaps this is a radical suggestion, but there seems to be a lot of
> the problems which are due to the concern "what if the file system
> crashes" (and so we need to worry about making sure that any
> increments to i_version MUST be persisted after it is incremented).
> 
> Well, if we assume that unclean shutdowns are rare, then perhaps we
> shouldn't be optimizing for that case.  So.... what if a file system
> had a counter which got incremented each time its journal is replayed
> representing an unclean shutdown.  That shouldn't happen often, but if
> it does, there might be any number of i_version updates that may have
> gotten lost.  So in that case, the NFS client should invalidate all of
> its caches.
> 
> If the i_version field was large enough, we could just prefix the
> "unclean shutdown counter" with the existing i_version number when it
> is sent over the NFS protocol to the client.  But if that field is too
> small, and if (as I understand things) NFS just needs to know when
> i_version is different, we could just simply hash the "unclean
> shutdown counter" with the inode's "i_version counter", and let that
> be the version which is sent from the NFS server to the client.
> 
> If we could do that, then it doesn't become critical that every single
> i_version bump has to be persisted to disk, and we could treat it like
> a lazytime update; it's guaranteed to be updated when we do a clean
> unmount of the file system (and when the file system is frozen), but
> on a crash, there is no guarantee that all i_version bumps will be
> persisted, but we do have this "unclean shutdown" counter to deal with
> that case.
> 
> Would this make life easier for folks?
> 
> 						- Ted

Thanks for chiming in, Ted. That's part of the problem, but we're
actually not too worried about that case:

nfsd mixes the ctime in with i_version, so you'd have to crash+clock
jump backward by juuuust enough to allow you to get the i_version and
ctime into a state it was before the crash, but with different data.
We're assuming that that is difficult to achieve in practice.

The issue with a reboot counter (or similar) is that on an unclean crash
the NFS client would end up invalidating every inode in the cache, as
all of the i_versions would change. That's probably excessive.

The bigger issue (at the moment) is atomicity: when we fetch an
i_version, the natural inclination is to associate that with the state
of the inode at some point in time, so we need this to be updated
atomically with certain other attributes of the inode. That's the part
I'm trying to sort through at the moment.
Chuck Lever Sept. 8, 2022, 4:15 p.m. UTC | #11
> On Sep 8, 2022, at 11:56 AM, J. Bruce Fields <bfields@fieldses.org> wrote:
> 
> On Thu, Sep 08, 2022 at 11:44:33AM -0400, Jeff Layton wrote:
>> On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote:
>>> On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote:
>>>> It boils down to the fact that we don't want to call mark_inode_dirty()
>>>> from IOCB_NOWAIT path because for lots of filesystems that means journal
>>>> operation and there are high chances that may block.
>>>> 
>>>> Presumably we could treat inode dirtying after i_version change similarly
>>>> to how we handle timestamp updates with lazytime mount option (i.e., not
>>>> dirty the inode immediately but only with a delay) but then the time window
>>>> for i_version inconsistencies due to a crash would be much larger.
>>> 
>>> Perhaps this is a radical suggestion, but there seems to be a lot of
>>> the problems which are due to the concern "what if the file system
>>> crashes" (and so we need to worry about making sure that any
>>> increments to i_version MUST be persisted after it is incremented).
>>> 
>>> Well, if we assume that unclean shutdowns are rare, then perhaps we
>>> shouldn't be optimizing for that case.  So.... what if a file system
>>> had a counter which got incremented each time its journal is replayed
>>> representing an unclean shutdown.  That shouldn't happen often, but if
>>> it does, there might be any number of i_version updates that may have
>>> gotten lost.  So in that case, the NFS client should invalidate all of
>>> its caches.
>>> 
>>> If the i_version field was large enough, we could just prefix the
>>> "unclean shutdown counter" with the existing i_version number when it
>>> is sent over the NFS protocol to the client.  But if that field is too
>>> small, and if (as I understand things) NFS just needs to know when
>>> i_version is different, we could just simply hash the "unclean
>>> shutdown counter" with the inode's "i_version counter", and let that
>>> be the version which is sent from the NFS server to the client.
>>> 
>>> If we could do that, then it doesn't become critical that every single
>>> i_version bump has to be persisted to disk, and we could treat it like
>>> a lazytime update; it's guaranteed to be updated when we do a clean
>>> unmount of the file system (and when the file system is frozen), but
>>> on a crash, there is no guarantee that all i_version bumps will be
>>> persisted, but we do have this "unclean shutdown" counter to deal with
>>> that case.
>>> 
>>> Would this make life easier for folks?
>>> 
>>> 						- Ted
>> 
>> Thanks for chiming in, Ted. That's part of the problem, but we're
>> actually not too worried about that case:
>> 
>> nfsd mixes the ctime in with i_version, so you'd have to crash+clock
>> jump backward by juuuust enough to allow you to get the i_version and
>> ctime into a state it was before the crash, but with different data.
>> We're assuming that that is difficult to achieve in practice.
> 
> But a change in the clock could still cause our returned change
> attribute to go backwards (even without a crash).  Not sure how to
> evaluate the risk, but it was enough that Trond hasn't been comfortable
> with nfsd advertising NFS4_CHANGE_TYPE_IS_MONOTONIC.
> 
> Ted's idea would be sufficient to allow us to turn that flag on, which I
> think allows some client-side optimizations.
> 
>> The issue with a reboot counter (or similar) is that on an unclean crash
>> the NFS client would end up invalidating every inode in the cache, as
>> all of the i_versions would change. That's probably excessive.
> 
> But if we use the crash counter on write instead of read, we don't
> invalidate caches unnecessarily.  And I think the monotonicity would
> still be close enough for our purposes?
> 
>> The bigger issue (at the moment) is atomicity: when we fetch an
>> i_version, the natural inclination is to associate that with the state
>> of the inode at some point in time, so we need this to be updated
>> atomically with certain other attributes of the inode. That's the part
>> I'm trying to sort through at the moment.
> 
> That may be, but I still suspect the crash counter would help.

Fwiw, I like the crash counter idea too.

--
Chuck Lever
Jeff Layton Sept. 8, 2022, 5:40 p.m. UTC | #12
On Thu, 2022-09-08 at 11:56 -0400, J. Bruce Fields wrote:
> On Thu, Sep 08, 2022 at 11:44:33AM -0400, Jeff Layton wrote:
> > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote:
> > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote:
> > > > It boils down to the fact that we don't want to call mark_inode_dirty()
> > > > from IOCB_NOWAIT path because for lots of filesystems that means journal
> > > > operation and there are high chances that may block.
> > > > 
> > > > Presumably we could treat inode dirtying after i_version change similarly
> > > > to how we handle timestamp updates with lazytime mount option (i.e., not
> > > > dirty the inode immediately but only with a delay) but then the time window
> > > > for i_version inconsistencies due to a crash would be much larger.
> > > 
> > > Perhaps this is a radical suggestion, but there seems to be a lot of
> > > the problems which are due to the concern "what if the file system
> > > crashes" (and so we need to worry about making sure that any
> > > increments to i_version MUST be persisted after it is incremented).
> > > 
> > > Well, if we assume that unclean shutdowns are rare, then perhaps we
> > > shouldn't be optimizing for that case.  So.... what if a file system
> > > had a counter which got incremented each time its journal is replayed
> > > representing an unclean shutdown.  That shouldn't happen often, but if
> > > it does, there might be any number of i_version updates that may have
> > > gotten lost.  So in that case, the NFS client should invalidate all of
> > > its caches.
> > > 
> > > If the i_version field was large enough, we could just prefix the
> > > "unclean shutdown counter" with the existing i_version number when it
> > > is sent over the NFS protocol to the client.  But if that field is too
> > > small, and if (as I understand things) NFS just needs to know when
> > > i_version is different, we could just simply hash the "unclean
> > > shutdown counter" with the inode's "i_version counter", and let that
> > > be the version which is sent from the NFS server to the client.
> > > 
> > > If we could do that, then it doesn't become critical that every single
> > > i_version bump has to be persisted to disk, and we could treat it like
> > > a lazytime update; it's guaranteed to be updated when we do a clean
> > > unmount of the file system (and when the file system is frozen), but
> > > on a crash, there is no guarantee that all i_version bumps will be
> > > persisted, but we do have this "unclean shutdown" counter to deal with
> > > that case.
> > > 
> > > Would this make life easier for folks?
> > > 
> > > 						- Ted
> > 
> > Thanks for chiming in, Ted. That's part of the problem, but we're
> > actually not too worried about that case:
> > 
> > nfsd mixes the ctime in with i_version, so you'd have to crash+clock
> > jump backward by juuuust enough to allow you to get the i_version and
> > ctime into a state it was before the crash, but with different data.
> > We're assuming that that is difficult to achieve in practice.
> 
> But a change in the clock could still cause our returned change
> attribute to go backwards (even without a crash).  Not sure how to
> evaluate the risk, but it was enough that Trond hasn't been comfortable
> with nfsd advertising NFS4_CHANGE_TYPE_IS_MONOTONIC.
> 
> Ted's idea would be sufficient to allow us to turn that flag on, which I
> think allows some client-side optimizations.
> 

Good point.

> > The issue with a reboot counter (or similar) is that on an unclean crash
> > the NFS client would end up invalidating every inode in the cache, as
> > all of the i_versions would change. That's probably excessive.
> 
> But if we use the crash counter on write instead of read, we don't
> invalidate caches unnecessarily.  And I think the monotonicity would
> still be close enough for our purposes?
> 
> > The bigger issue (at the moment) is atomicity: when we fetch an
> > i_version, the natural inclination is to associate that with the state
> > of the inode at some point in time, so we need this to be updated
> > atomically with certain other attributes of the inode. That's the part
> > I'm trying to sort through at the moment.
> 
> That may be, but I still suspect the crash counter would help.
> 

Yeah, ok. That does make some sense. So we would mix this into the
i_version instead of the ctime when it was available. Preferably, we'd
mix that in when we store the i_version rather than adding it afterward.

Ted, how would we access this? Maybe we could just add a new (generic)
super_block field for this that ext4 (and other filesystems) could
populate at mount time?
J. Bruce Fields Sept. 8, 2022, 6:22 p.m. UTC | #13
On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> Yeah, ok. That does make some sense. So we would mix this into the
> i_version instead of the ctime when it was available. Preferably, we'd
> mix that in when we store the i_version rather than adding it afterward.
> 
> Ted, how would we access this? Maybe we could just add a new (generic)
> super_block field for this that ext4 (and other filesystems) could
> populate at mount time?

Couldn't the filesystem just return an ino_version that already includes
it?

--b.
NeilBrown Sept. 8, 2022, 10:55 p.m. UTC | #14
On Fri, 09 Sep 2022, Jeff Layton wrote:
> On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote:
> > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote:
> > > It boils down to the fact that we don't want to call mark_inode_dirty()
> > > from IOCB_NOWAIT path because for lots of filesystems that means journal
> > > operation and there are high chances that may block.
> > > 
> > > Presumably we could treat inode dirtying after i_version change similarly
> > > to how we handle timestamp updates with lazytime mount option (i.e., not
> > > dirty the inode immediately but only with a delay) but then the time window
> > > for i_version inconsistencies due to a crash would be much larger.
> > 
> > Perhaps this is a radical suggestion, but there seems to be a lot of
> > the problems which are due to the concern "what if the file system
> > crashes" (and so we need to worry about making sure that any
> > increments to i_version MUST be persisted after it is incremented).
> > 
> > Well, if we assume that unclean shutdowns are rare, then perhaps we
> > shouldn't be optimizing for that case.  So.... what if a file system
> > had a counter which got incremented each time its journal is replayed
> > representing an unclean shutdown.  That shouldn't happen often, but if
> > it does, there might be any number of i_version updates that may have
> > gotten lost.  So in that case, the NFS client should invalidate all of
> > its caches.
> > 
> > If the i_version field was large enough, we could just prefix the
> > "unclean shutdown counter" with the existing i_version number when it
> > is sent over the NFS protocol to the client.  But if that field is too
> > small, and if (as I understand things) NFS just needs to know when
> > i_version is different, we could just simply hash the "unclean
> > shutdown counter" with the inode's "i_version counter", and let that
> > be the version which is sent from the NFS server to the client.
> > 
> > If we could do that, then it doesn't become critical that every single
> > i_version bump has to be persisted to disk, and we could treat it like
> > a lazytime update; it's guaranteed to be updated when we do a clean
> > unmount of the file system (and when the file system is frozen), but
> > on a crash, there is no guarantee that all i_version bumps will be
> > persisted, but we do have this "unclean shutdown" counter to deal with
> > that case.
> > 
> > Would this make life easier for folks?
> > 
> > 						- Ted
> 
> Thanks for chiming in, Ted. That's part of the problem, but we're
> actually not too worried about that case:
> 
> nfsd mixes the ctime in with i_version, so you'd have to crash+clock
> jump backward by juuuust enough to allow you to get the i_version and
> ctime into a state it was before the crash, but with different data.
> We're assuming that that is difficult to achieve in practice.
> 
> The issue with a reboot counter (or similar) is that on an unclean crash
> the NFS client would end up invalidating every inode in the cache, as
> all of the i_versions would change. That's probably excessive.
> 
> The bigger issue (at the moment) is atomicity: when we fetch an
> i_version, the natural inclination is to associate that with the state
> of the inode at some point in time, so we need this to be updated
> atomically with certain other attributes of the inode. That's the part
> I'm trying to sort through at the moment.

I don't think atomicity matters nearly as much as ordering.
The i_version must not be visible before the change that it reflects.
It is OK for it to be after.  Even seconds after without great cost.  It
is bad for it to be earlier.  Any unlocked gap after the i_version
update and before the change is visible can result in a race and
incorrect caching.

Even for directory updates where NFSv4 wants atomic before/after version
numbers, they don't need to be atomic w.r.t. the change being visible.

If three concurrent file creates cause the version number to go from 4
to 7, then it is important that one op sees "4,5", one sees "5,6" and
one sees "6,7", but it doesn't matter if concurrent lookups only see
version 4 even while they can see the newly created names.

A longer gap increases the risk of an unnecessary cache flush, but it
doesn't lead to incorrectness.

So I think we should put the version update *after* the change is
visible, and not require locking (beyond a memory barrier) when reading
the version. It should be as soon after as practical, but no sooner.

NeilBrown
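
A sketch of that ordering (illustration only; apart from the iversion
helpers and barriers, the helper names here are invented):

    /* Writer: make the change visible first, then bump the version. */
    static void change_then_bump(struct inode *inode)
    {
            make_change_visible(inode);     /* invented stand-in, e.g. pagecache copy-in */
            smp_wmb();                      /* change visible before the counter moves */
            inode_maybe_inc_iversion(inode, true);
    }

    /* Reader: version, data, version again, with matching read barriers. */
    static bool data_was_stable(struct inode *inode)
    {
            u64 v1, v2;

            v1 = inode_query_iversion(inode);
            smp_rmb();
            read_the_data(inode);           /* invented stand-in */
            smp_rmb();
            v2 = inode_query_iversion(inode);

            /* v1 == v2 means the data read is no older than the state v1
             * reflects. It may be newer (a bump may still be pending), which
             * risks an unnecessary invalidation later, but stale data is
             * never cached indefinitely. */
            return v1 == v2;
    }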
Jeff Layton Sept. 8, 2022, 11:23 p.m. UTC | #15
On Fri, 2022-09-09 at 09:01 +1000, NeilBrown wrote:
> On Fri, 09 Sep 2022, Jeff Layton wrote:
> > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote:
> > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> > > > Yeah, ok. That does make some sense. So we would mix this into the
> > > > i_version instead of the ctime when it was available. Preferably, we'd
> > > > mix that in when we store the i_version rather than adding it afterward.
> > > > 
> > > > Ted, how would we access this? Maybe we could just add a new (generic)
> > > > super_block field for this that ext4 (and other filesystems) could
> > > > populate at mount time?
> > > 
> > > Couldn't the filesystem just return an ino_version that already includes
> > > it?
> > > 
> > 
> > Yes. That's simple if we want to just fold it in during getattr. If we
> > want to fold that into the values stored on disk, then I'm a little less
> > clear on how that will work.
> > 
> > Maybe I need a concrete example of how that will work:
> > 
> > Suppose we have an i_version value X with the previous crash counter
> > already factored in that makes it to disk. We hand out a newer version
> > X+1 to a client, but that value never makes it to disk.
> 
> As I understand it, the crash counter would NEVER appear in the on-disk
> i_version.
> The crash counter is stable while a filesystem is mounted so is the same
> when loading an inode from disk and when writing it back.
> 
> When loading, add crash counter to on-disk i_version to provide
> in-memory i_version.
> when storing, subtract crash counter from in-memory i_version to provide
> on-disk i_version.
> 
> "add" and "subtract" could be any reversible hash, and its inverse.  I
> would probably shift the crash counter up 16 and add/subtract.
> 
> 

If you store the value with the crash counter already factored-in, then
not every inode would end up being invalidated after a crash. If we try
to mix it in later, the client will end up invalidating the cache even
for inodes that had no changes.

> > 
> > The machine crashes and comes back up, and we get a query for i_version
> > and it comes back as X. Fine, it's an old version. Now there is a write.
> > What do we do to ensure that the new value doesn't collide with X+1? 
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> >
Trond Myklebust Sept. 8, 2022, 11:59 p.m. UTC | #16
On Fri, 2022-09-09 at 08:55 +1000, NeilBrown wrote:
> On Fri, 09 Sep 2022, Jeff Layton wrote:
> > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote:
> > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote:
> > > > It boils down to the fact that we don't want to call
> > > > mark_inode_dirty()
> > > > from IOCB_NOWAIT path because for lots of filesystems that
> > > > means journal
> > > > operation and there are high chances that may block.
> > > > 
> > > > Presumably we could treat inode dirtying after i_version change
> > > > similarly
> > > > to how we handle timestamp updates with lazytime mount option
> > > > (i.e., not
> > > > dirty the inode immediately but only with a delay) but then the
> > > > time window
> > > > for i_version inconsistencies due to a crash would be much
> > > > larger.
> > > 
> > > Perhaps this is a radical suggestion, but there seems to be a lot
> > > of
> > > the problems which are due to the concern "what if the file
> > > system
> > > crashes" (and so we need to worry about making sure that any
> > > increments to i_version MUST be persisted after it is
> > > incremented).
> > > 
> > > Well, if we assume that unclean shutdowns are rare, then perhaps
> > > we
> > > shouldn't be optimizing for that case.  So.... what if a file
> > > system
> > > had a counter which got incremented each time its journal is
> > > replayed
> > > representing an unclean shutdown.  That shouldn't happen often,
> > > but if
> > > it does, there might be any number of i_version updates that may
> > > have
> > > gotten lost.  So in that case, the NFS client should invalidate
> > > all of
> > > its caches.
> > > 
> > > If the i_version field was large enough, we could just prefix the
> > > "unclean shutdown counter" with the existing i_version number
> > > when it
> > > is sent over the NFS protocol to the client.  But if that field
> > > is too
> > > small, and if (as I understand things) NFS just needs to know
> > > when
> > > i_version is different, we could just simply hash the "unclean
> > > shtudown counter" with the inode's "i_version counter", and let
> > > that
> > > be the version which is sent from the NFS client to the server.
> > > 
> > > If we could do that, then it doesn't become critical that every
> > > single
> > > i_version bump has to be persisted to disk, and we could treat it
> > > like
> > > a lazytime update; it's guaranteed to updated when we do an clean
> > > unmount of the file system (and when the file system is frozen),
> > > but
> > > on a crash, there is no guaranteee that all i_version bumps will
> > > be
> > > persisted, but we do have this "unclean shutdown" counter to deal
> > > with
> > > that case.
> > > 
> > > Would this make life easier for folks?
> > > 
> > >                                                 - Ted
> > 
> > Thanks for chiming in, Ted. That's part of the problem, but we're
> > actually not too worried about that case:
> > 
> > nfsd mixes the ctime in with i_version, so you'd have to
> > crash+clock
> > jump backward by juuuust enough to allow you to get the i_version
> > and
> > ctime into a state it was before the crash, but with different
> > data.
> > We're assuming that that is difficult to achieve in practice.
> > 
> > The issue with a reboot counter (or similar) is that on an unclean
> > crash
> > the NFS client would end up invalidating every inode in the cache,
> > as
> > all of the i_versions would change. That's probably excessive.
> > 
> > The bigger issue (at the moment) is atomicity: when we fetch an
> > i_version, the natural inclination is to associate that with the
> > state
> > of the inode at some point in time, so we need this to be updated
> > atomically with certain other attributes of the inode. That's the
> > part
> > I'm trying to sort through at the moment.
> 
> I don't think atomicity matters nearly as much as ordering.
>
> The i_version must not be visible before the change that it reflects.
> It is OK for it to be after.  Even seconds after without great cost. 
> It
> is bad for it to be earlier.  Any unlocked gap after the i_version
> update and before the change is visible can result in a race and
> incorrect caching.
> 
> Even for directory updates where NFSv4 wants atomic before/after
> version
> numbers, they don't need to be atomic w.r.t. the change being
> visible.
> 
> If three concurrent file creates cause the version number to go from
> 4
> to 7, then it is important that one op sees "4,5", one sees "5,6" and
> one sees "6,7", but it doesn't matter if concurrent lookups only see
> version 4 even while they can see the newly created names.
> 
> A longer gap increases the risk of an unnecessary cache flush, but it
> doesn't lead to incorrectness.
> 

I'm not really sure what you mean when you say that a 'longer gap
increases the risk of an unnecessary cache flush'. Either the change
attribute update is atomic with the operation it is recording, or it is
not. If that update is recorded in the NFS reply as not being atomic,
then the client will evict all cached data that is associated with that
change attribute at some point.

> So I think we should put the version update *after* the change is
> visible, and not require locking (beyond a memory barrier) when
> reading
> the version. It should be as soon after as practical, bit no sooner.
> 

Ordering is not a sufficient condition. The guarantee needs to be that
any application that reads the change attribute, then reads file data
and then reads the change attribute again will see the 2 change
attribute values as being the same *if and only if* there were no
changes to the file data made after the read and before the read of the
change attribute.
That includes the case where data was written after the read, and a
crash occurred after it was committed to stable storage. If you only
update the version after the written data is visible, then there is a
possibility that the crash could occur before any change attribute
update is committed to disk.

IOW: the minimal condition needs to be that for all cases below, the
application reads 'state B' as having occurred if any data was
committed to disk before the crash.

Application				Filesystem
===========				==========
read change attr <- 'state A'
read data <- 'state A'
					write data -> 'state B'
					<crash>+<reboot>
read change attr <- 'state B'
Trond Myklebust Sept. 9, 2022, 1:05 a.m. UTC | #17
On Fri, 2022-09-09 at 10:51 +1000, NeilBrown wrote:
> On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > On Fri, 2022-09-09 at 08:55 +1000, NeilBrown wrote:
> > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > On Thu, 2022-09-08 at 11:21 -0400, Theodore Ts'o wrote:
> > > > > On Thu, Sep 08, 2022 at 10:33:26AM +0200, Jan Kara wrote:
> > > > > > It boils down to the fact that we don't want to call
> > > > > > mark_inode_dirty()
> > > > > > from IOCB_NOWAIT path because for lots of filesystems that
> > > > > > means journal
> > > > > > operation and there are high chances that may block.
> > > > > > 
> > > > > > Presumably we could treat inode dirtying after i_version
> > > > > > change
> > > > > > similarly
> > > > > > to how we handle timestamp updates with lazytime mount
> > > > > > option
> > > > > > (i.e., not
> > > > > > dirty the inode immediately but only with a delay) but then
> > > > > > the
> > > > > > time window
> > > > > > for i_version inconsistencies due to a crash would be much
> > > > > > larger.
> > > > > 
> > > > > Perhaps this is a radical suggestion, but there seems to be a
> > > > > lot
> > > > > of
> > > > > the problems which are due to the concern "what if the file
> > > > > system
> > > > > crashes" (and so we need to worry about making sure that any
> > > > > increments to i_version MUST be persisted after it is
> > > > > incremented).
> > > > > 
> > > > > Well, if we assume that unclean shutdowns are rare, then
> > > > > perhaps
> > > > > we
> > > > > shouldn't be optimizing for that case.  So.... what if a file
> > > > > system
> > > > > had a counter which got incremented each time its journal is
> > > > > replayed
> > > > > representing an unclean shutdown.  That shouldn't happen
> > > > > often,
> > > > > but if
> > > > > it does, there might be any number of i_version updates that
> > > > > may
> > > > > have
> > > > > gotten lost.  So in that case, the NFS client should
> > > > > invalidate
> > > > > all of
> > > > > its caches.
> > > > > 
> > > > > If the i_version field was large enough, we could just prefix
> > > > > the
> > > > > "unclean shutdown counter" with the existing i_version number
> > > > > when it
> > > > > is sent over the NFS protocol to the client.  But if that
> > > > > field
> > > > > is too
> > > > > small, and if (as I understand things) NFS just needs to know
> > > > > when
> > > > > i_version is different, we could just simply hash the
> > > > > "unclean
> > > > > shtudown counter" with the inode's "i_version counter", and
> > > > > let
> > > > > that
> > > > > be the version which is sent from the NFS client to the
> > > > > server.
> > > > > 
> > > > > If we could do that, then it doesn't become critical that
> > > > > every
> > > > > single
> > > > > i_version bump has to be persisted to disk, and we could
> > > > > treat it
> > > > > like
> > > > > a lazytime update; it's guaranteed to updated when we do an
> > > > > clean
> > > > > unmount of the file system (and when the file system is
> > > > > frozen),
> > > > > but
> > > > > on a crash, there is no guaranteee that all i_version bumps
> > > > > will
> > > > > be
> > > > > persisted, but we do have this "unclean shutdown" counter to
> > > > > deal
> > > > > with
> > > > > that case.
> > > > > 
> > > > > Would this make life easier for folks?
> > > > > 
> > > > >                                                 - Ted
> > > > 
> > > > Thanks for chiming in, Ted. That's part of the problem, but
> > > > we're
> > > > actually not too worried about that case:
> > > > 
> > > > nfsd mixes the ctime in with i_version, so you'd have to
> > > > crash+clock
> > > > jump backward by juuuust enough to allow you to get the
> > > > i_version
> > > > and
> > > > ctime into a state it was before the crash, but with different
> > > > data.
> > > > We're assuming that that is difficult to achieve in practice.
> > > > 
> > > > The issue with a reboot counter (or similar) is that on an
> > > > unclean
> > > > crash
> > > > the NFS client would end up invalidating every inode in the
> > > > cache,
> > > > as
> > > > all of the i_versions would change. That's probably excessive.
> > > > 
> > > > The bigger issue (at the moment) is atomicity: when we fetch an
> > > > i_version, the natural inclination is to associate that with
> > > > the
> > > > state
> > > > of the inode at some point in time, so we need this to be
> > > > updated
> > > > atomically with certain other attributes of the inode. That's
> > > > the
> > > > part
> > > > I'm trying to sort through at the moment.
> > > 
> > > I don't think atomicity matters nearly as much as ordering.
> > > 
> > > The i_version must not be visible before the change that it
> > > reflects.
> > > It is OK for it to be after.  Even seconds after without great
> > > cost. 
> > > It
> > > is bad for it to be earlier.  Any unlocked gap after the
> > > i_version
> > > update and before the change is visible can result in a race and
> > > incorrect caching.
> > > 
> > > Even for directory updates where NFSv4 wants atomic before/after
> > > version
> > > numbers, they don't need to be atomic w.r.t. the change being
> > > visible.
> > > 
> > > If three concurrent file creates cause the version number to go
> > > from
> > > 4
> > > to 7, then it is important that one op sees "4,5", one sees "5,6"
> > > and
> > > one sees "6,7", but it doesn't matter if concurrent lookups only
> > > see
> > > version 4 even while they can see the newly created names.
> > > 
> > > A longer gap increases the risk of an unnecessary cache flush,
> > > but it
> > > doesn't lead to incorrectness.
> > > 
> > 
> > I'm not really sure what you mean when you say that a 'longer gap
> > increases the risk of an unnecessary cache flush'. Either the
> > change
> > attribute update is atomic with the operation it is recording, or
> > it is
> > not. If that update is recorded in the NFS reply as not being
> > atomic,
> > then the client will evict all cached data that is associated with
> > that
> > change attribute at some point.
> > 
> > > So I think we should put the version update *after* the change is
> > > visible, and not require locking (beyond a memory barrier) when
> > > reading
> > > the version. It should be as soon after as practical, bit no
> > > sooner.
> > > 
> > 
> > Ordering is not a sufficient condition. The guarantee needs to be
> > that
> > any application that reads the change attribute, then reads file
> > data
> > and then reads the change attribute again will see the 2 change
> > attribute values as being the same *if and only if* there were no
> > changes to the file data made after the read and before the read of
> > the
> > change attribute.
> 
> I'm saying that only the "only if" is mandatory - getting that wrong has
> a
> correctness cost.
> BUT the "if" is less critical.  Getting that wrong has a performance
> cost.  We want to get it wrong as rarely as possible, but there is a
> performance cost to the underlying filesystem in providing
> perfection,
> and that must be balanced with the performance cost to NFS of
> providing
> imperfect results.
> 
I strongly disagree.

If the 2 change attribute values are different, then it is OK for the
file data to be the same, but if the file data has changed, then the
change attributes MUST differ.

Conversely, if the 2 change attributes are the same then it MUST be the
case that the file data did not change.

So it really needs to be an 'if and only if' case.

> For NFSv4, this is of limited interest for files.
> If the client has a delegation, then it is certain that no other
> client
> or server-side application will change the file, so it doesn't need
> to
> pay much attention to change ids.
> If the client doesn't have a delegation, then if there is any change
> to
> the changeid, the client cannot be certain that the change wasn't due
> to
> some other client, so it must purge its cache on close or lock.  So
> fine
> details of the changeid aren't interesting (as long as we have the
> "only
> if"). 
> 
> For directories, NFSv4 does want precise changeids, but directory ops
> needs to be sync for NFS anyway, so the extra burden on the fs is
> small.
> 
> 
> > That includes the case where data was written after the read, and a
> > crash occurred after it was committed to stable storage. If you
> > only
> > update the version after the written data is visible, then there is
> > a
> > possibility that the crash could occur before any change attribute
> > update is committed to disk.
> 
> I think we all agree that handling a crash is hard.  I think that
> should be a separate consideration to how i_version is handled during
> normal running.
> 
> > 
> > IOW: the minimal condition needs to be that for all cases below,
> > the
> > application reads 'state B' as having occurred if any data was
> > committed to disk before the crash.
> > 
> > Application                             Filesystem
> > ===========                             ==========
> > read change attr <- 'state A'
> > read data <- 'state A'
> >                                         write data -> 'state B'
> >                                         <crash>+<reboot>
> > read change attr <- 'state B'
> 
> The important thing here is to not see 'state A'.  Seeing 'state C'
> should be acceptable.  Worst case we could merge in wall-clock time
> of
> system boot, but the filesystem should be able to be more helpful
> than
> that.
> 
Agreed.
Trond Myklebust Sept. 9, 2022, 1:10 a.m. UTC | #18
On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote:
> On Fri, 09 Sep 2022, NeilBrown wrote:
> > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > 
> > > 
> > > IOW: the minimal condition needs to be that for all cases below,
> > > the
> > > application reads 'state B' as having occurred if any data was
> > > committed to disk before the crash.
> > > 
> > > Application                             Filesystem
> > > ===========                             =========
> > > read change attr <- 'state A'
> > > read data <- 'state A'
> > >                                         write data -> 'state B'
> > >                                         <crash>+<reboot>
> > > read change attr <- 'state B'
> > 
> > The important thing here is to not see 'state A'.  Seeing 'state C'
> > should be acceptable.  Worst case we could merge in wall-clock time
> > of
> > system boot, but the filesystem should be able to be more helpful
> > than
> > that.
> > 
> 
> Actually, without the crash+reboot it would still be acceptable to
> see
> "state A" at the end there - but preferably not for long.
> From the NFS perspective, the changeid needs to update by the time of
> a
> close or unlock (so it is visible to open or lock), but before that
> it
> is just best-effort.

Nope. That will inevitably lead to data corruption, since the
application might decide to use the data from state A instead of
revalidating it.
Trond Myklebust Sept. 9, 2022, 2:14 a.m. UTC | #19
On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote:
> On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote:
> > On Fri, 09 Sep 2022, NeilBrown wrote:
> > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > 
> > > > 
> > > > IOW: the minimal condition needs to be that for all cases
> > > > below,
> > > > the
> > > > application reads 'state B' as having occurred if any data was
> > > > committed to disk before the crash.
> > > > 
> > > > Application                             Filesystem
> > > > ===========                             =========
> > > > read change attr <- 'state A'
> > > > read data <- 'state A'
> > > >                                         write data -> 'state B'
> > > >                                         <crash>+<reboot>
> > > > read change attr <- 'state B'
> > > 
> > > The important thing here is to not see 'state A'.  Seeing 'state
> > > C'
> > > should be acceptable.  Worst case we could merge in wall-clock
> > > time
> > > of
> > > system boot, but the filesystem should be able to be more helpful
> > > than
> > > that.
> > > 
> > 
> > Actually, without the crash+reboot it would still be acceptable to
> > see
> > "state A" at the end there - but preferably not for long.
> > From the NFS perspective, the changeid needs to update by the time
> > of
> > a
> > close or unlock (so it is visible to open or lock), but before that
> > it
> > is just best-effort.
> 
> Nope. That will inevitably lead to data corruption, since the
> application might decide to use the data from state A instead of
> revalidating it.
> 

The point is, NFS is not the only potential use case for change
attributes. We wouldn't be bothering to discuss statx() if it was.

I could be using O_DIRECT, and all the tricks in order to ensure that
my stock broker application (to choose one example) has access to the
absolute very latest prices when I'm trying to execute a trade.
When the filesystem then says 'the prices haven't changed since your
last read because the change attribute on the database file is the
same' in response to a statx() request with the AT_STATX_FORCE_SYNC
flag set, then why shouldn't my application be able to assume it can
serve those prices right out of memory instead of having to go to disk?
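
Concretely, I'm thinking of a check along these lines in the
application (just a sketch: STATX_INO_VERSION and stx_ino_version are
the names proposed in this patch, not anything in released headers):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/stat.h>

	/* return 1 if the cached copy of fd's data is still trustworthy */
	static int cache_still_valid(int fd, unsigned long long cached_version)
	{
		struct statx sx;

		/* ask for attributes that are in sync with the server */
		if (statx(fd, "", AT_EMPTY_PATH | AT_STATX_FORCE_SYNC,
			  STATX_INO_VERSION, &sx) < 0)
			return 0;	/* be conservative on error */

		return (sx.stx_mask & STATX_INO_VERSION) &&
		       sx.stx_ino_version == cached_version;
	}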
NeilBrown Sept. 9, 2022, 6:41 a.m. UTC | #20
On Fri, 09 Sep 2022, Trond Myklebust wrote:
> On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote:
> > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote:
> > > On Fri, 09 Sep 2022, NeilBrown wrote:
> > > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > 
> > > > > 
> > > > > IOW: the minimal condition needs to be that for all cases
> > > > > below,
> > > > > the
> > > > > application reads 'state B' as having occurred if any data was
> > > > > committed to disk before the crash.
> > > > > 
> > > > > Application                             Filesystem
> > > > > ===========                             =========
> > > > > read change attr <- 'state A'
> > > > > read data <- 'state A'
> > > > >                                         write data -> 'state B'
> > > > >                                         <crash>+<reboot>
> > > > > read change attr <- 'state B'
> > > > 
> > > > The important thing here is to not see 'state A'.  Seeing 'state
> > > > C'
> > > > should be acceptable.  Worst case we could merge in wall-clock
> > > > time
> > > > of
> > > > system boot, but the filesystem should be able to be more helpful
> > > > than
> > > > that.
> > > > 
> > > 
> > > Actually, without the crash+reboot it would still be acceptable to
> > > see
> > > "state A" at the end there - but preferably not for long.
> > > From the NFS perspective, the changeid needs to update by the time
> > > of
> > > a
> > > close or unlock (so it is visible to open or lock), but before that
> > > it
> > > is just best-effort.
> > 
> > Nope. That will inevitably lead to data corruption, since the
> > application might decide to use the data from state A instead of
> > revalidating it.
> > 
> 
> The point is, NFS is not the only potential use case for change
> attributes. We wouldn't be bothering to discuss statx() if it was.

My understanding is that it was primarily a desire to add fstests to
exercise the i_version which motivated the statx extension.
Obviously we should prepare for other uses though.

> 
> I could be using O_DIRECT, and all the tricks in order to ensure that
> my stock broker application (to choose one example) has access to the
> absolute very latest prices when I'm trying to execute a trade.
> When the filesystem then says 'the prices haven't changed since your
> last read because the change attribute on the database file is the
> same' in response to a statx() request with the AT_STATX_FORCE_SYNC
> flag set, then why shouldn't my application be able to assume it can
> serve those prices right out of memory instead of having to go to disk?

I would think that such an application would be using inotify rather
than having to poll.  But certainly we should have a clear statement of
quality-of-service parameters in the documentation.
If we agree that perfect atomicity is what we want to promise, and that
the cost to the filesystem and the statx call is acceptable, then so be it.

My point wasn't to say that atomicity is bad.  It was that:
 - if the i_version change is visible before the change itself is
   visible, then that is a correctness problem.
 - if the i_version change is only visible some time after the change
   itself is visible, then that is a quality-of-service issue.
I cannot see any room for debating the first.  I do see some room to
debate the second.

Cached writes, directory ops, and attribute changes are, I think, easy
enough to provide truly atomic i_version updates with the change being
visible.

Changes to shared memory-mapped files are probably the hardest to
provide timely i_version updates for.  We might want to document an
explicit exception for those.  Alternately each request for i_version
would need to find all pages that are writable, remap them read-only to
catch future writes, then update i_version if any were writable (i.e.
->mkwrite had been called).  That is the only way I can think of to
provide atomicity.

O_DIRECT writes are a little easier than mmapped files.  I suspect we
should update the i_version once the device reports that the write is
complete, but a parallel reader could have seen some of the write before
that moment.  True atomicity could only be provided by taking some
exclusive lock that blocked all O_DIRECT writes.  Jeff seems to be
suggesting this, but I doubt the stock broker application would be
willing to make the call in that case.  I don't think I would either.

NeilBrown

> 
> -- 
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> trond.myklebust@hammerspace.com
> 
> 
>
Jeff Layton Sept. 9, 2022, 11:53 a.m. UTC | #21
On Fri, 2022-09-09 at 08:29 +1000, NeilBrown wrote:
> On Thu, 08 Sep 2022, Jeff Layton wrote:
> > On Thu, 2022-09-08 at 10:40 +1000, NeilBrown wrote:
> > > On Thu, 08 Sep 2022, Jeff Layton wrote:
> > > > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote:
> > > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote:
> > > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote:
> > > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> > > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with
> > > > > > > > > > respect to the
> > > > > > > > > > +other changes in the inode. On a write, for instance, the
> > > > > > > > > > i_version it usually
> > > > > > > > > > +incremented before the data is copied into the pagecache.
> > > > > > > > > > Therefore it is
> > > > > > > > > > +possible to see a new i_version value while a read still
> > > > > > > > > > shows the old data.
> > > > > > > > > 
> > > > > > > > > Doesn't that make the value useless?
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > No, I don't think so. It's only really useful for comparing to an
> > > > > > > > older
> > > > > > > > sample anyway. If you do "statx; read; statx" and the value
> > > > > > > > hasn't
> > > > > > > > changed, then you know that things are stable. 
> > > > > > > 
> > > > > > > I don't see how that helps.  It's still possible to get:
> > > > > > > 
> > > > > > >                 reader          writer
> > > > > > >                 ------          ------
> > > > > > >                                 i_version++
> > > > > > >                 statx
> > > > > > >                 read
> > > > > > >                 statx
> > > > > > >                                 update page cache
> > > > > > > 
> > > > > > > right?
> > > > > > > 
> > > > > > 
> > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In
> > > > > > that case, maybe this is useless then other than for testing purposes
> > > > > > and userland NFS servers.
> > > > > > 
> > > > > > Would it be better to not consume a statx field with this if so? What
> > > > > > could we use as an alternate interface? ioctl? Some sort of global
> > > > > > virtual xattr? It does need to be something per-inode.
> > > > > 
> > > > > I don't see how a non-atomic change attribute is remotely useful even
> > > > > for NFS.
> > > > > 
> > > > > The main problem is not so much the above (although NFS clients are
> > > > > vulnerable to that too) but the behaviour w.r.t. directory changes.
> > > > > 
> > > > > If the server can't guarantee that file/directory/... creation and
> > > > > unlink are atomically recorded with change attribute updates, then the
> > > > > client has to always assume that the server is lying, and that it has
> > > > > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr
> > > > > requests after each and every directory modification in order to check
> > > > > that some other client didn't also sneak in a change of their own.
> > > > > 
> > > > 
> > > > We generally hold the parent dir's inode->i_rwsem exclusively over most
> > > > important directory changes, and the times/i_version are also updated
> > > > while holding it. What we don't do is serialize reads of this value vs.
> > > > the i_rwsem, so you could see new directory contents alongside an old
> > > > i_version. Maybe we should be taking it for read when we query it on a
> > > > directory?
> > > 
> > > We do hold i_rwsem today.  I'm working on changing that.  Preserving
> > > atomic directory changeinfo will be a challenge.  The only mechanism I
> > > can think of is to pass a "u64*" to all the directory modification ops,
> > > and they fill in the version number at the point where it is incremented
> > > (inode_maybe_inc_iversion_return()).  The (nfsd) caller assumes that
> > > "before" was one less than "after".  If you don't want to internally
> > > require single increments, then you would need to pass a 'u64 [2]' to
> > > get two iversions back.
> > > 
> > 
> > That's a major redesign of what the i_version counter is today. It may
> > very well end up being needed, but that's going to touch a lot of stuff
> > in the VFS. Are you planning to do that as a part of your locking
> > changes?
> > 
> 
> "A major design"?  How?  The "one less than" might be, but allowing a
> directory morphing op to fill in a "u64 [2]" is just a new interface to
> existing data.  One that allows fine grained atomicity.
> 
> This would actually be really good for NFS.  nfs_mkdir (for example)
> could easily have access to the atomic pre/post changeid provided by
> the server, and so could easily provide them to nfsd.
> 
> I'm not planning to do this as part of my locking changes.  In the first
> instance only NFS changes behaviour, and it doesn't provide atomic
> changeids, so there is no loss of functionality.
> 
> When some other filesystem wants to opt-in to shared-locking on
> directories - that would be the time to push through a better interface.
> 

I think nfsd does provide atomic changeids for directory operations
currently. AFAICT, any operation where we're changing directory contents
is done while holding the i_rwsem exclusively, and we hold that lock
over the pre and post i_version fetch for the change_info4.

If you change nfsd to allow parallel directory morphing operations
without addressing this, then I think that would be a regression.

> 
> > > > 
> > > > Achieving atomicity with file writes though is another matter entirely.
> > > > I'm not sure that's even doable or how to approach it if so.
> > > > Suggestions?
> > > 
> > > Call inode_maybe_inc_version(page->host) in __folio_mark_dirty() ??
> > > 
> > 
> > Writes can cover multiple folios so we'd be doing several increments per
> > write. Maybe that's ok? Should we also be updating the ctime at that
> > point as well?
> 
> You would only do several increments if something was reading the value
> concurrently, and then you really should do several increments for
> correctness.
> 

Agreed.

> > 
> > Fetching the i_version under the i_rwsem is probably sufficient to fix
> > this though. Most of the write_iter ops already bump the i_version while
> > holding that lock, so this wouldn't add any extra locking to the write
> > codepaths.
> 
> Adding new locking doesn't seem like a good idea.  It's bound to have
> performance implications.  It may well end up serialising the directory
> op that I'm currently trying to make parallelisable.
> 

The new locking would only be in the NFSv4 GETATTR codepath:

    https://lore.kernel.org/linux-nfs/20220908172448.208585-9-jlayton@kernel.org/T/#u

Maybe we'd still be better off taking a hit in the write codepath instead
of doing this, but with this, most of the penalty would be paid by nfsd
which I would think would be preferred here.

The problem of mmap writes is another matter though. Not sure what we
can do about that without making i_version bumps a lot more expensive.
Theodore Ts'o Sept. 9, 2022, 12:11 p.m. UTC | #22
On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> 
> Ted, how would we access this? Maybe we could just add a new (generic)
> super_block field for this that ext4 (and other filesystems) could
> populate at mount time?

Yeah, I was thinking about just adding it to struct super, with some
value (perhaps 0 or ~0) meaning that the file system didn't support
it.  If people were concerned about struct super bloat, we could also
add some new function to struct super_ops that would return one or
more values that are used rarely by most of the kernel code, and so
doesn't need to be in the struct super data structure.  I don't have
strong feelings one way or another.

On another note, my personal opinion is that at least as far as ext4
is concerned, i_version on disk's only use is for NFS's convenience,
and so I have absolutely no problem with changing how and when
i_version gets updated modulo concerns about impacting performance.
That's one of the reasons why being able to update i_version only
lazily, so that if we had, say, some workload that was doing O_DIRECT
writes followed by fdatasync(), there wouldn't be any obligation to
flush the inode out to disk just because we had bumped i_version
appeals to me.

But aside from that, I don't consider when i_version gets updated on
disk, or what the semantics are after a crash, to be set in stone; if
we need to change things so that NFS can be more performant, I'm happy
to accommodate.  One of the reasons why we implemented the ext4 fast
commit feature was to improve performance for NFS workloads.

I know some XFS developers have some concerns here, but I just wanted
to make it explicit that (a) I'm not aware of any users who are
depending on the i_version on-disk semantics, and (b) if they are
depending on something which, as far as I'm concerned, is an internal
implementation detail, we've made no promises to them, and they can
get to keep both pieces.  :-)  This is especially since up until now,
there is no supported, portable userspace interface to make i_version
available to userspace.

Cheers,

					- Ted
Jeff Layton Sept. 9, 2022, 12:47 p.m. UTC | #23
On Fri, 2022-09-09 at 08:11 -0400, Theodore Ts'o wrote:
> On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> > 
> > Ted, how would we access this? Maybe we could just add a new (generic)
> > super_block field for this that ext4 (and other filesystems) could
> > populate at mount time?
> 
> Yeah, I was thinking about just adding it to struct super, with some
> value (perhaps 0 or ~0) meaning that the file system didn't support
> it.  If people were concerned about struct super bloat, we could also
> add some new function to struct super_ops that would return one or
> more values that are used rarely by most of the kernel code, and so
> doesn't need to be in the struct super data structure.  I don't have
> strong feelings one way or another.
> 

Either would be fine, I think.

> On another note, my personal opinion is that at least as far as ext4
> is concerned, i_version on disk's only use is for NFS's convenience,

Technically, IMA uses it too, but it needs the same behavior as NFSv4.

> and so I have absolutely no problem with changing how and when
> i_version gets updated modulo concerns about impacting performance.
> That's one of the reasons why being able to update i_version only
> lazily, so that if we had, say, some workload that was doing O_DIRECT
> writes followed by fdatasync(), there wouldn't be any obligation to
> flush the inode out to disk just because we had bumped i_version
> appeals to me.
> 

i_version only changes now if someone has queried it since it was last
changed. That makes a huge difference in performance. We can try to
optimize it further, but it probably wouldn't move the needle much under
real workloads.

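The mechanism, in sketch form (the real code in include/linux/iversion.h
does this with atomic cmpxchg loops; this is simplified, not the literal
kernel code): the low bit of the stored value records whether anyone has
queried it since the last bump.

	/* stored layout: (counter << 1) | queried_flag */
	#define I_VERSION_QUERIED	1ULL

	static bool maybe_inc_iversion(u64 *ivers, bool force)
	{
		if (!force && !(*ivers & I_VERSION_QUERIED))
			return false;		/* nobody saw the old value */

		/* new value: counter + 1, with the "queried" flag cleared */
		*ivers = (*ivers | I_VERSION_QUERIED) + 1;
		return true;
	}

	static u64 query_iversion(u64 *ivers)
	{
		*ivers |= I_VERSION_QUERIED;	/* remember that it was seen */
		return *ivers >> 1;
	}
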
> But aside from that, I don't consider when i_version gets updated on
> disk, or what the semantics are after a crash, to be set in stone; if
> we need to change things so that NFS can be more performant, I'm happy
> to accommodate.  One of the reasons why we implemented the ext4 fast
> commit feature was to improve performance for NFS workloads.
> 
> I know some XFS developers have some concerns here, but I just wanted
> to make it explicit that (a) I'm not aware of any users who are
> depending on the i_version on-disk semantics, and (b) if they are
> depending on something which, as far as I'm concerned, is an internal
> implementation detail, we've made no promises to them, and they can
> get to keep both pieces.  :-)  This is especially since up until now,
> there is no supported, portable userspace interface to make i_version
> available to userspace.
> 

Great! That's what I was hoping for with ext4. Would you be willing to
pick up these two patches for v6.1?

https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u
https://lore.kernel.org/linux-ext4/20220908172448.208585-4-jlayton@kernel.org/T/#u

They should be able to go in independently of the rest of the series and
I don't forsee any big changes to them.

Thanks,
Theodore Ts'o Sept. 9, 2022, 1:48 p.m. UTC | #24
On Fri, Sep 09, 2022 at 08:47:17AM -0400, Jeff Layton wrote:
> 
> i_version only changes now if someone has queried it since it was last
> changed. That makes a huge difference in performance. We can try to
> optimize it further, but it probably wouldn't move the needle much under
> real workloads.

Good point.  And to be clear, from NFS's perspective, you only need to
have i_version bumped if there is a user-visible change to the
file. --- with an explicit exception here of the FIEMAP system call,
since in the case of a delayed allocation, FIEMAP might change from
reporting:

 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:          0..         0:      0:             last,unknown_loc,delalloc,eof

to this:

 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:  190087172.. 190087172:      1:             last,eof

after a sync(2) or fsync(2) call, or after time passes.

> Great! That's what I was hoping for with ext4. Would you be willing to
> pick up these two patches for v6.1?
> 
> https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u
> https://lore.kernel.org/linux-ext4/20220908172448.208585-4-jlayton@kernel.org/T/#u

I think you mean:

https://lore.kernel.org/linux-ext4/20220908172448.208585-2-jlayton@kernel.org/T/#u
https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u

Right?

BTW, sorry for not responding to these patches earlier; between
preparing for the various Linux conferences in Dublin next week, and
being in Zurich and meeting with colleagues at $WORK all of this week,
I'm a bit behind on my patch reviews.

Cheers,

					- Ted
Jeff Layton Sept. 9, 2022, 2:43 p.m. UTC | #25
On Fri, 2022-09-09 at 09:48 -0400, Theodore Ts'o wrote:
> On Fri, Sep 09, 2022 at 08:47:17AM -0400, Jeff Layton wrote:
> > 
> > i_version only changes now if someone has queried it since it was last
> > changed. That makes a huge difference in performance. We can try to
> > optimize it further, but it probably wouldn't move the needle much under
> > real workloads.
> 
> Good point.  And to be clear, from NFS's perspective, you only need to
> have i_version bumped if there is a user-visible change to the
> file. --- with an explicit exception here of the FIEMAP system call,
> since in the case of a delayed allocation, FIEMAP might change from
> reporting:
> 
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       0:          0..         0:      0:             last,unknown_loc,delalloc,eof
> 
> to this:
> 
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..       0:  190087172.. 190087172:      1:             last,eof
> 
> after a sync(2) or fsync(2) call, or after time passes.
> 

In general, we want to bump i_version if the ctime changes. I'm guessing
that we don't change ctime on a delalloc? If it's not visible to NFS,
then NFS won't care about it.  We can't project FIEMAP info across the
wire at this time, so we'd probably like to avoid seeing an i_version
bump due to delalloc.

> > Great! That's what I was hoping for with ext4. Would you be willing to
> > pick up these two patches for v6.1?
> > 
> > https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u
> > https://lore.kernel.org/linux-ext4/20220908172448.208585-4-jlayton@kernel.org/T/#u
> 
> I think you mean:
> 
> https://lore.kernel.org/linux-ext4/20220908172448.208585-2-jlayton@kernel.org/T/#u
> https://lore.kernel.org/linux-ext4/20220908172448.208585-3-jlayton@kernel.org/T/#u
> 
> Right?
> 
> BTW, sorry for not responding to these patches earlier; between
> preparing for the various Linux conferences in Dublin next week, and
> being in Zurich and meeting with colleagues at $WORK all of this week,
> I'm a bit behind on my patch reviews.
> 

No worries. As long as they're on your radar, that's fine.

Thanks!
Theodore Ts'o Sept. 9, 2022, 2:58 p.m. UTC | #26
On Fri, Sep 09, 2022 at 10:43:30AM -0400, Jeff Layton wrote:

> In general, we want to bump i_version if the ctime changes. I'm guessing
> that we don't change ctime on a delalloc? If it's not visible to NFS,
> then NFS won't care about it.  We can't project FIEMAP info across the
> wire at this time, so we'd probably like to avoid seeing an i_version
> bump in due to delalloc.

Right, currently nothing user-visible changes when delayed allocation
is resolved; ctime isn't bumped, and i_version shouldn't be bumped
either.

If we crash before delayed allocation is resolved, there might be
cases (mounting with data=writeback is the one which I'm most worried
about, but I haven't experimented to be sure) where the inode might
become a zero-length file after the reboot without i_version or ctime
changing, but given that NFS forces a fsync(2) before it acknowledges
a client request, that shouldn't be an issue for NFS.

This is where, as far as I'm concerned, for ext4, i_version has only one
customer to keep happy, and it's NFS.  :-)    Now, if we expose i_version
via statx(2), we might need to be a tad bit more careful about what
semantics we guarantee to userspace, especially with respect to what
might be returned before and after a crash recovery.  If we can leave
things such that there is maximal freedom for file system
implementations, that would be my preference.

						- Ted
J. Bruce Fields Sept. 9, 2022, 3:45 p.m. UTC | #27
On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote:
> On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote:
> > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> > > Yeah, ok. That does make some sense. So we would mix this into the
> > > i_version instead of the ctime when it was available. Preferably, we'd
> > > mix that in when we store the i_version rather than adding it afterward.
> > > 
> > > Ted, how would we access this? Maybe we could just add a new (generic)
> > > super_block field for this that ext4 (and other filesystems) could
> > > populate at mount time?
> > 
> > Couldn't the filesystem just return an ino_version that already includes
> > it?
> > 
> 
> Yes. That's simple if we want to just fold it in during getattr. If we
> want to fold that into the values stored on disk, then I'm a little less
> clear on how that will work.
> 
> Maybe I need a concrete example of how that will work:
> 
> Suppose we have an i_version value X with the previous crash counter
> already factored in that makes it to disk. We hand out a newer version
> X+1 to a client, but that value never makes it to disk.
> 
> The machine crashes and comes back up, and we get a query for i_version
> and it comes back as X. Fine, it's an old version. Now there is a write.
> What do we do to ensure that the new value doesn't collide with X+1? 

I was assuming we could partition i_version's 64 bits somehow: e.g., top
16 bits store the crash counter.  You increment the i_version by: 1)
replacing the top bits by the new crash counter, if it has changed, and
2) incrementing.

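In code, something like this (purely illustrative; the split and the
names are made up):

	/* 16-bit crash counter in the top bits, 48-bit counter below */
	#define CRASH_SHIFT	48
	#define COUNTER_MASK	((1ULL << CRASH_SHIFT) - 1)

	static u64 bump_i_version(u64 cur, u16 crash_counter)
	{
		/* replace the top bits with the current crash counter... */
		u64 new = (u64)crash_counter << CRASH_SHIFT;

		/* ...and increment the low-order counter */
		return new | ((cur + 1) & COUNTER_MASK);
	}
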
Do the numbers work out?  2^16 mounts after unclean shutdowns sounds
like a lot for one filesystem, as does 2^48 changes to a single file,
but people do weird things.  Maybe there's a better partitioning, or
some more flexible way of maintaining an i_version that still allows you
to identify whether a given i_version preceded a crash.

--b.
Jeff Layton Sept. 9, 2022, 4:36 p.m. UTC | #28
On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote:
> On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote:
> > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote:
> > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> > > > Yeah, ok. That does make some sense. So we would mix this into the
> > > > i_version instead of the ctime when it was available. Preferably, we'd
> > > > mix that in when we store the i_version rather than adding it afterward.
> > > > 
> > > > Ted, how would we access this? Maybe we could just add a new (generic)
> > > > super_block field for this that ext4 (and other filesystems) could
> > > > populate at mount time?
> > > 
> > > Couldn't the filesystem just return an ino_version that already includes
> > > it?
> > > 
> > 
> > Yes. That's simple if we want to just fold it in during getattr. If we
> > want to fold that into the values stored on disk, then I'm a little less
> > clear on how that will work.
> > 
> > Maybe I need a concrete example of how that will work:
> > 
> > Suppose we have an i_version value X with the previous crash counter
> > already factored in that makes it to disk. We hand out a newer version
> > X+1 to a client, but that value never makes it to disk.
> > 
> > The machine crashes and comes back up, and we get a query for i_version
> > and it comes back as X. Fine, it's an old version. Now there is a write.
> > What do we do to ensure that the new value doesn't collide with X+1? 
> 
> I was assuming we could partition i_version's 64 bits somehow: e.g., top
> 16 bits store the crash counter.  You increment the i_version by: 1)
> replacing the top bits by the new crash counter, if it has changed, and
> 2) incrementing.
> 
> Do the numbers work out?  2^16 mounts after unclean shutdowns sounds
> like a lot for one filesystem, as does 2^48 changes to a single file,
> but people do weird things.  Maybe there's a better partitioning, or
> some more flexible way of maintaining an i_version that still allows you
> to identify whether a given i_version preceded a crash.
> 

We consume one bit to keep track of the "seen" flag, so it would be a
16+47 split. I assume that we'd also reset the version counter to 0 when
the crash counter changes? Maybe that doesn't matter as long as we don't
overflow into the crash counter.

I'm not sure we can get away with 16 bits for the crash counter, as
it'll leave us subject to the version counter wrapping after a long
uptime.

If you increment a counter every nanosecond, how long until that counter
wraps? With 63 bits, that's 292 years (and change). With 16+47 bits,
that's less than two days. An 8+55 split would give us ~416 days which
seems a bit more reasonable?

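Quick sanity check of that arithmetic (one bump per nanosecond):

	#include <stdio.h>

	int main(void)
	{
		int bits[] = { 47, 55, 63 };

		for (int i = 0; i < 3; i++) {
			/* seconds until an N-bit counter wraps at 1 bump/ns */
			double secs = (double)(1ULL << bits[i]) / 1e9;

			printf("%d bits: ~%.0f days\n", bits[i], secs / 86400);
		}
		return 0;	/* ~2 days, ~417 days, ~106752 days (~292y) */
	}
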
For NFS, we can probably live with even less bits in the crash counter. 

If the crash counter changes, then that means the NFS server itself has
(likely) also crashed. The client will have to reestablish sockets,
reclaim, etc. It should get new attributes for the inodes it cares about
at that time.
John Stoffel Sept. 9, 2022, 8:34 p.m. UTC | #29
>>>>> "Jeff" == Jeff Layton <jlayton@kernel.org> writes:

> On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote:
>> On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
>> > Yeah, ok. That does make some sense. So we would mix this into the
>> > i_version instead of the ctime when it was available. Preferably, we'd
>> > mix that in when we store the i_version rather than adding it afterward.
>> > 
>> > Ted, how would we access this? Maybe we could just add a new (generic)
>> > super_block field for this that ext4 (and other filesystems) could
>> > populate at mount time?
>> 
>> Couldn't the filesystem just return an ino_version that already includes
>> it?
>> 

> Yes. That's simple if we want to just fold it in during getattr. If we
> want to fold that into the values stored on disk, then I'm a little less
> clear on how that will work.

I wonder if this series should also include some updates to the
various xfstests to hopefully document in code what this statx() call
will do in various situations.  Or at least document how to test it in
some manner?  Especially since it's layers on top of layers to make
this work. 

My assumption is that if the underlying filesystem doesn't support the
new values, it just returns 0 or c_time?

John
Jeff Layton Sept. 10, 2022, 12:39 p.m. UTC | #30
On Fri, 2022-09-09 at 16:41 +1000, NeilBrown wrote:
> > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote:
> > > > > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote:
> > > > > > > > On Fri, 09 Sep 2022, NeilBrown wrote:
> > > > > > > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases
> > > > > > > > > > > > below,
> > > > > > > > > > > > the
> > > > > > > > > > > > application reads 'state B' as having occurred if any data was
> > > > > > > > > > > > committed to disk before the crash.
> > > > > > > > > > > > 
> > > > > > > > > > > > Application                             Filesystem
> > > > > > > > > > > > ===========                             =========
> > > > > > > > > > > > read change attr <- 'state A'
> > > > > > > > > > > > read data <- 'state A'
> > > > > > > > > > > >                                         write data -> 'state B'
> > > > > > > > > > > >                                         <crash>+<reboot>
> > > > > > > > > > > > read change attr <- 'state B'
> > > > > > > > > > 
> > > > > > > > > > The important thing here is to not see 'state A'.  Seeing 'state
> > > > > > > > > > C'
> > > > > > > > > > should be acceptable.  Worst case we could merge in wall-clock
> > > > > > > > > > time
> > > > > > > > > > of
> > > > > > > > > > system boot, but the filesystem should be able to be more helpful
> > > > > > > > > > than
> > > > > > > > > > that.
> > > > > > > > > > 
> > > > > > > > 
> > > > > > > > Actually, without the crash+reboot it would still be acceptable to
> > > > > > > > see
> > > > > > > > "state A" at the end there - but preferably not for long.
> > > > > > > > From the NFS perspective, the changeid needs to update by the time
> > > > > > > > of
> > > > > > > > a
> > > > > > > > close or unlock (so it is visible to open or lock), but before that
> > > > > > > > it
> > > > > > > > is just best-effort.
> > > > > > 
> > > > > > Nope. That will inevitably lead to data corruption, since the
> > > > > > application might decide to use the data from state A instead of
> > > > > > revalidating it.
> > > > > > 
> > > > 
> > > > The point is, NFS is not the only potential use case for change
> > > > attributes. We wouldn't be bothering to discuss statx() if it was.
> > 
> > My understanding is that it was primarily a desire to add fstests to
> > exercise the i_version which motivated the statx extension.
> > Obviously we should prepare for other uses though.
> > 

Mainly. Also, userland nfs servers might also like this for obvious
reasons. For now though, in the v5 set, I've backed off on trying to
expose this to userland in favor of trying to just clean up the internal
implementation.

I'd still like to expose this via statx if possible, but I don't want to
get too bogged down in interface design just now as we have Real Bugs to
fix. That patchset should make it simple to expose it later though.

> > > > 
> > > > I could be using O_DIRECT, and all the tricks in order to ensure
> > > > that
> > > > my stock broker application (to choose one example) has access
> > > > to the
> > > > absolute very latest prices when I'm trying to execute a trade.
> > > > When the filesystem then says 'the prices haven't changed since
> > > > your
> > > > last read because the change attribute on the database file is
> > > > the
> > > > same' in response to a statx() request with the
> > > > AT_STATX_FORCE_SYNC
> > > > flag set, then why shouldn't my application be able to assume it
> > > > can
> > > > serve those prices right out of memory instead of having to go
> > > > to disk?
> > 
> > I would think that such an application would be using inotify rather
> > than having to poll.  But certainly we should have a clear statement
> > of
> > quality-of-service parameters in the documentation.
> > If we agree that perfect atomicity is what we want to promise, and
> > that
> > the cost to the filesystem and the statx call is acceptable, then so
> > be it.
> > 
> > My point wasn't to say that atomicity is bad.  It was that:
> >  - if the i_version change is visible before the change itself is
> >    visible, then that is a correctness problem.
> >  - if the i_version change is only visible some time after the
> > change
> >    itself is visible, then that is a quality-of-service issue.
> > I cannot see any room for debating the first.  I do see some room to
> > debate the second.
> > 
> > Cached writes, directory ops, and attribute changes are, I think,
> > easy
> > enough to provide truly atomic i_version updates with the change
> > being
> > visible.
> > 
> > Changes to shared memory-mapped files are probably the hardest to
> > provide timely i_version updates for.  We might want to document an
> > explicit exception for those.  Alternately each request for
> > i_version
> > would need to find all pages that are writable, remap them read-only
> > to
> > catch future writes, then update i_version if any were writable
> > (i.e.
> > ->mkwrite had been called).  That is the only way I can think of to
> > provide atomicity.
> > 

I don't think we really want to make i_version bumps that expensive.
Documenting that you can't expect perfect consistency vs. mmap with NFS
seems like the best thing to do. We do our best, but that sort of
synchronization requires real locking.

> > O_DIRECT writes are a little easier than mmapped files.  I suspect we
> > should update the i_version once the device reports that the write is
> > complete, but a parallel reader could have seen some of the write before
> > that moment.  True atomicity could only be provided by taking some
> > exclusive lock that blocked all O_DIRECT writes.  Jeff seems to be
> > suggesting this, but I doubt the stock broker application would be
> > willing to make the call in that case.  I don't think I would either.

Well, only blocked for long enough to run the getattr. Granted, with a
slow underlying filesystem that can take a while.

To summarize, there are two main uses for the change attr in NFSv4:

1/ to provide change_info4 for directory morphing operations (CREATE,
LINK, OPEN, REMOVE, and RENAME). It turns out that this is already
atomic in the current nfsd code (AFAICT) by virtue of the fact that we
hold the i_rwsem exclusively over these operations. The change attr is
also queried pre and post while the lock is held, so that should ensure
that we get true atomicity for this.

2/ as an adjunct for the ctime when fetching attributes to validate
caches. We don't expect perfect consistency between read (and readlike)
operations and GETATTR, even when they're in the same compound.

IOW, a READ+GETATTR compound can legally give you a short (or zero-
length) read, and then the getattr indicates a size that is larger than
where the READ data stops, due to a write or truncate racing in after
the read.

Ideally, the attributes in the GETATTR reply should be consistent
between themselves though. IOW, all of the attrs should accurately
represent the state of the file at a single point in time.
change+size+times+etc. should all be consistent with one another.

I think we get all of this by taking the inode_lock around the
vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant
solution, but it should give us the atomicity we need, and it doesn't
require adding extra operations or locking to the write codepaths.

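Roughly what I have in mind (a sketch only; the real change would go
through the usual svc_fh/export plumbing in nfsd4_encode_fattr):

	/*
	 * Fetch attributes with the inode lock held so that the change
	 * attribute, size and timestamps are consistent with each other.
	 * A shared lock might turn out to be enough.
	 */
	static int nfsd_getattr_atomic(const struct path *path,
				       struct kstat *stat)
	{
		struct inode *inode = d_inode(path->dentry);
		int err;

		inode_lock(inode);
		err = vfs_getattr(path, stat, STATX_BASIC_STATS,
				  AT_STATX_SYNC_AS_STAT);
		inode_unlock(inode);
		return err;
	}
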
We could also consider less invasive ways to achieve this (maybe some
sort of seqretry loop around the vfs_getattr call?), but I'd rather not
do extra work in the write codepaths if we can get away with it.
J. Bruce Fields Sept. 10, 2022, 2:56 p.m. UTC | #31
On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote:
> On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote:
> > On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote:
> > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote:
> > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> > > > > Yeah, ok. That does make some sense. So we would mix this into the
> > > > > i_version instead of the ctime when it was available. Preferably, we'd
> > > > > mix that in when we store the i_version rather than adding it afterward.
> > > > > 
> > > > > Ted, how would we access this? Maybe we could just add a new (generic)
> > > > > super_block field for this that ext4 (and other filesystems) could
> > > > > populate at mount time?
> > > > 
> > > > Couldn't the filesystem just return an ino_version that already includes
> > > > it?
> > > > 
> > > 
> > > Yes. That's simple if we want to just fold it in during getattr. If we
> > > want to fold that into the values stored on disk, then I'm a little less
> > > clear on how that will work.
> > > 
> > > Maybe I need a concrete example of how that will work:
> > > 
> > > Suppose we have an i_version value X with the previous crash counter
> > > already factored in that makes it to disk. We hand out a newer version
> > > X+1 to a client, but that value never makes it to disk.
> > > 
> > > The machine crashes and comes back up, and we get a query for i_version
> > > and it comes back as X. Fine, it's an old version. Now there is a write.
> > > What do we do to ensure that the new value doesn't collide with X+1? 
> > 
> > I was assuming we could partition i_version's 64 bits somehow: e.g., top
> > 16 bits store the crash counter.  You increment the i_version by: 1)
> > replacing the top bits by the new crash counter, if it has changed, and
> > 2) incrementing.
> > 
> > Do the numbers work out?  2^16 mounts after unclean shutdowns sounds
> > like a lot for one filesystem, as does 2^48 changes to a single file,
> > but people do weird things.  Maybe there's a better partitioning, or
> > some more flexible way of maintaining an i_version that still allows you
> > to identify whether a given i_version preceded a crash.
> > 
> 
> We consume one bit to keep track of the "seen" flag, so it would be a
> 16+47 split. I assume that we'd also reset the version counter to 0 when
> the crash counter changes? Maybe that doesn't matter as long as we don't
> overflow into the crash counter.
> 
> I'm not sure we can get away with 16 bits for the crash counter, as
> it'll leave us subject to the version counter wrapping after a long
> uptime.
> 
> If you increment a counter every nanosecond, how long until that counter
> wraps? With 63 bits, that's 292 years (and change). With 16+47 bits,
> that's less than two days. An 8+55 split would give us ~416 days which
> seems a bit more reasonable?

Though now it's starting to seem a little limiting to allow only 2^8
mounts after unclean shutdowns.

Another way to think of it might be: multiply that 8-bit crash counter
by 2^48, and think of it as a 64-bit value that we believe (based on
practical limits on how many times you can modify a single file) is
guaranteed to be larger than any i_version that we gave out before the
most recent crash.

Our goal is to ensure that after a crash, any *new* i_versions that we
give out or write to disk are larger than any that have previously been
given out.  We can do that by ensuring that they're equal to at least
that old maximum.

So think of the 64-bit value we're storing in the superblock as a
ceiling on i_version values across all the filesystem's inodes.  Call it
s_version_max or something.  We also need to know what the maximum was
before the most recent crash.  Call that s_version_max_old.

Then we could get correct behavior if we generated i_versions with
something like:

	i_version++;
	if (i_version < s_version_max_old)
		i_version = s_version_max_old;
	if (i_version > s_version_max)
		s_version_max = i_version + 1;

But that last step makes this ludicrously expensive, because for this to
be safe across crashes we need to update that value on disk as well, and
we need to do that frequently.

Fortunately, s_version_max doesn't have to be a tight bound at all.  We
can easily just initialize it to, say, 2^40, and only bump it by 2^40 at
a time.  And recognize when we're running up against it way ahead of
time, so we only need to say "here's an updated value, could you please
make sure it gets to disk sometime in the next twenty minutes"?
(Numbers made up.)

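In code, the lazy variant might look something like this (names and
numbers all made up):

	#define VERSION_CHUNK	(1ULL << 40)

	u64 s_version_max;	/* in-memory ceiling, persisted lazily */
	u64 s_version_max_old;	/* ceiling that was on disk at mount time */

	static u64 next_i_version(u64 i_version)
	{
		i_version++;
		if (i_version < s_version_max_old)
			i_version = s_version_max_old;

		/* top up the on-disk ceiling long before we can reach it */
		if (i_version + VERSION_CHUNK / 2 > s_version_max) {
			s_version_max += VERSION_CHUNK;
			/* ...and arrange for the new ceiling to hit disk
			 * sometime in the next few minutes */
		}
		return i_version;
	}
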
Sorry, that was way too many words.  But I think something like that
could work, and make it very difficult to hit any hard limits, and
actually not be too complicated??  Unless I missed something.

--b.
Al Viro Sept. 10, 2022, 7:46 p.m. UTC | #32
On Thu, Sep 08, 2022 at 10:40:43AM +1000, NeilBrown wrote:

> We do hold i_rwsem today.  I'm working on changing that.  Preserving
> atomic directory changeinfo will be a challenge.  The only mechanism I
> can think of is to pass a "u64*" to all the directory modification ops,
> and they fill in the version number at the point where it is incremented
> (inode_maybe_inc_iversion_return()).  The (nfsd) caller assumes that
> "before" was one less than "after".  If you don't want to internally
> require single increments, then you would need to pass a 'u64 [2]' to
> get two iversions back.

Are you serious?  What kind of boilerplate would that inflict on the
filesystems not, er, opting in for that... scalability improvement
experiment?
NeilBrown Sept. 10, 2022, 10:53 p.m. UTC | #33
On Sat, 10 Sep 2022, Jeff Layton wrote:
> On Fri, 2022-09-09 at 16:41 +1000, NeilBrown wrote:
> > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote:
> > > > > > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote:
> > > > > > > > > On Fri, 09 Sep 2022, NeilBrown wrote:
> > > > > > > > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > > > > > > > > 
> > > > > > > > > > > > > 
> > > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases
> > > > > > > > > > > > > below,
> > > > > > > > > > > > > the
> > > > > > > > > > > > > application reads 'state B' as having occurred if any data was
> > > > > > > > > > > > > committed to disk before the crash.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Application                             Filesystem
> > > > > > > > > > > > > ===========                             =========
> > > > > > > > > > > > > read change attr <- 'state A'
> > > > > > > > > > > > > read data <- 'state A'
> > > > > > > > > > > > >                                         write data -> 'state B'
> > > > > > > > > > > > >                                         <crash>+<reboot>
> > > > > > > > > > > > > read change attr <- 'state B'
> > > > > > > > > > > 
> > > > > > > > > > > The important thing here is to not see 'state A'.  Seeing 'state
> > > > > > > > > > > C'
> > > > > > > > > > > should be acceptable.  Worst case we could merge in wall-clock
> > > > > > > > > > > time
> > > > > > > > > > > of
> > > > > > > > > > > system boot, but the filesystem should be able to be more helpful
> > > > > > > > > > > than
> > > > > > > > > > > that.
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Actually, without the crash+reboot it would still be acceptable to
> > > > > > > > > see
> > > > > > > > > "state A" at the end there - but preferably not for long.
> > > > > > > > > From the NFS perspective, the changeid needs to update by the time
> > > > > > > > > of
> > > > > > > > > a
> > > > > > > > > close or unlock (so it is visible to open or lock), but before that
> > > > > > > > > it
> > > > > > > > > is just best-effort.
> > > > > > > 
> > > > > > > Nope. That will inevitably lead to data corruption, since the
> > > > > > > application might decide to use the data from state A instead of
> > > > > > > revalidating it.
> > > > > > > 
> > > > > 
> > > > > The point is, NFS is not the only potential use case for change
> > > > > attributes. We wouldn't be bothering to discuss statx() if it was.
> > > 
> > > My understanding is that it was primarily a desire to add fstests to
> > > exercise the i_version which motivated the statx extension.
> > > Obviously we should prepare for other uses though.
> > > 
> 
> Mainly. Also, userland nfs servers might also like this for obvious
> reasons. For now though, in the v5 set, I've backed off on trying to
> expose this to userland in favor of trying to just clean up the internal
> implementation.
> 
> I'd still like to expose this via statx if possible, but I don't want to
> get too bogged down in interface design just now as we have Real Bugs to
> fix. That patchset should make it simple to expose it later though.
> 
> > > > > 
> > > > > I could be using O_DIRECT, and all the tricks in order to ensure
> > > > > that
> > > > > my stock broker application (to choose one example) has access
> > > > > to the
> > > > > absolute very latest prices when I'm trying to execute a trade.
> > > > > When the filesystem then says 'the prices haven't changed since
> > > > > your
> > > > > last read because the change attribute on the database file is
> > > > > the
> > > > > same' in response to a statx() request with the
> > > > > AT_STATX_FORCE_SYNC
> > > > > flag set, then why shouldn't my application be able to assume it
> > > > > can
> > > > > serve those prices right out of memory instead of having to go
> > > > > to disk?
> > > 
> > > I would think that such an application would be using inotify rather
> > > than having to poll.  But certainly we should have a clear statement
> > > of
> > > quality-of-service parameters in the documentation.
> > > If we agree that perfect atomicity is what we want to promise, and
> > > that
> > > the cost to the filesystem and the statx call is acceptable, then so
> > > be it.
> > > 
> > > My point wasn't to say that atomicity is bad.  It was that:
> > >  - if the i_version change is visible before the change itself is
> > >    visible, then that is a correctness problem.
> > >  - if the i_version change is only visible some time after the
> > > change
> > >    itself is visible, then that is a quality-of-service issue.
> > > I cannot see any room for debating the first.  I do see some room to
> > > debate the second.
> > > 
> > > Cached writes, directory ops, and attribute changes are, I think,
> > > easy
> > > enough to provide truly atomic i_version updates with the change
> > > being
> > > visible.
> > > 
> > > Changes to shared memory-mapped files are probably the hardest to
> > > provide timely i_version updates for.  We might want to document an
> > > explicit exception for those.  Alternately each request for
> > > i_version
> > > would need to find all pages that are writable, remap them read-only
> > > to
> > > catch future writes, then update i_version if any were writable
> > > (i.e.
> > > ->mkwrite had been called).  That is the only way I can think of to
> > > provide atomicity.
> > > 
> 
> I don't think we really want to make i_version bumps that expensive.
> Documenting that you can't expect perfect consistency vs. mmap with NFS
> seems like the best thing to do. We do our best, but that sort of
> synchronization requires real locking.
> 
> > > O_DIRECT writes are a little easier than mmapped files.  I suspect we
> > > should update the i_version once the device reports that the write is
> > > complete, but a parallel reader could have seen some of the write before
> > > that moment.  True atomicity could only be provided by taking some
> > > exclusive lock that blocked all O_DIRECT writes.  Jeff seems to be
> > > suggesting this, but I doubt the stock broker application would be
> > > willing to make the call in that case.  I don't think I would either.
> 
> Well, only blocked for long enough to run the getattr. Granted, with a
> slow underlying filesystem that can take a while.

Maybe I misunderstand, but this doesn't seem to make much sense.

If you want i_version updates to appear to be atomic w.r.t O_DIRECT
writes, then you need to prevent accessing the i_version while any write
is on-going. At that time there is no meaningful value for i_version.
So you need a lock (at least shared) around the actual write, and you
need an exclusive lock around the get_i_version().
So accessing the i_version would have to wait for all pending O_DIRECT
writes to complete, and would block any new O_DIRECT writes from
starting.

This could be expensive.

There is not currently any locking around O_DIRECT writes.  You cannot
synchronise with them.

The best you can do is update the i_version immediately after all the
O_DIRECT writes in a single request complete.
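
A rough sketch of the sort of thing I mean, modeled on an iomap ->end_io
hook (illustrative only -- the function name is made up and this is not
existing fs code):

/* bump i_version from the DIO completion callback, so the new value
 * only becomes visible once the written data is */
static int foofs_dio_write_end_io(struct kiocb *iocb, ssize_t size,
                                  int error, unsigned int flags)
{
        struct inode *inode = file_inode(iocb->ki_filp);

        if (!error && size > 0)
                inode_maybe_inc_iversion(inode, true);
        return error;
}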

> 
> To summarize, there are two main uses for the change attr in NFSv4:
> 
> 1/ to provide change_info4 for directory morphing operations (CREATE,
> LINK, OPEN, REMOVE, and RENAME). It turns out that this is already
> atomic in the current nfsd code (AFAICT) by virtue of the fact that we
> hold the i_rwsem exclusively over these operations. The change attr is
> also queried pre and post while the lock is held, so that should ensure
> that we get true atomicity for this.

Yes, directory ops are relatively easy.

> 
> 2/ as an adjunct for the ctime when fetching attributes to validate
> caches. We don't expect perfect consistency between read (and readlike)
> operations and GETATTR, even when they're in the same compound.
> 
> IOW, a READ+GETATTR compound can legally give you a short (or zero-
> length) read, and then the getattr indicates a size that is larger than
> where the READ data stops, due to a write or truncate racing in after
> the read.

I agree that atomicity is neither necessary nor practical.  Ordering is
important though.  I don't think a truncate(0) racing with a READ can
credibly result in a non-zero size AFTER a zero-length read.  A truncate
that extends the size could have that effect though.

> 
> Ideally, the attributes in the GETATTR reply should be consistent
> between themselves though. IOW, all of the attrs should accurately
> represent the state of the file at a single point in time.
> change+size+times+etc. should all be consistent with one another.
> 
> I think we get all of this by taking the inode_lock around the
> vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant
> solution, but it should give us the atomicity we need, and it doesn't
> require adding extra operations or locking to the write codepaths.

Explicit attribute changes (chown/chmod/utimes/truncate etc) are always
done under the inode lock.  Implicit changes via inode_update_time() are
not (though xfs does take the lock, ext4 doesn't, haven't checked
others).  So taking the inode lock won't ensure those are internally
consistent.

I think using inode_lock_shared() is acceptable.  It doesn't promise
perfect atomicity, but it is probably good enough.

We'd need a good reason to want perfect atomicity to go further, and I
cannot think of one.

NeilBrown


> 
> We could also consider less invasive ways to achieve this (maybe some
> sort of seqretry loop around the vfs_getattr call?), but I'd rather not
> do extra work in the write codepaths if we can get away with it.
> -- 
> Jeff Layton <jlayton@kernel.org>
> 
>
NeilBrown Sept. 10, 2022, 10:58 p.m. UTC | #34
On Fri, 09 Sep 2022, Jeff Layton wrote:
> On Fri, 2022-09-09 at 08:29 +1000, NeilBrown wrote:
> > On Thu, 08 Sep 2022, Jeff Layton wrote:
> > > On Thu, 2022-09-08 at 10:40 +1000, NeilBrown wrote:
> > > > On Thu, 08 Sep 2022, Jeff Layton wrote:
> > > > > On Wed, 2022-09-07 at 13:55 +0000, Trond Myklebust wrote:
> > > > > > On Wed, 2022-09-07 at 09:12 -0400, Jeff Layton wrote:
> > > > > > > On Wed, 2022-09-07 at 08:52 -0400, J. Bruce Fields wrote:
> > > > > > > > On Wed, Sep 07, 2022 at 08:47:20AM -0400, Jeff Layton wrote:
> > > > > > > > > On Wed, 2022-09-07 at 21:37 +1000, NeilBrown wrote:
> > > > > > > > > > On Wed, 07 Sep 2022, Jeff Layton wrote:
> > > > > > > > > > > +The change to \fIstatx.stx_ino_version\fP is not atomic with
> > > > > > > > > > > respect to the
> > > > > > > > > > > +other changes in the inode. On a write, for instance, the
> > > > > > > > > > > i_version it usually
> > > > > > > > > > > +incremented before the data is copied into the pagecache.
> > > > > > > > > > > Therefore it is
> > > > > > > > > > > +possible to see a new i_version value while a read still
> > > > > > > > > > > shows the old data.
> > > > > > > > > > 
> > > > > > > > > > Doesn't that make the value useless?
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > No, I don't think so. It's only really useful for comparing to an
> > > > > > > > > older
> > > > > > > > > sample anyway. If you do "statx; read; statx" and the value
> > > > > > > > > hasn't
> > > > > > > > > changed, then you know that things are stable. 
> > > > > > > > 
> > > > > > > > I don't see how that helps.  It's still possible to get:
> > > > > > > > 
> > > > > > > >                 reader          writer
> > > > > > > >                 ------          ------
> > > > > > > >                                 i_version++
> > > > > > > >                 statx
> > > > > > > >                 read
> > > > > > > >                 statx
> > > > > > > >                                 update page cache
> > > > > > > > 
> > > > > > > > right?
> > > > > > > > 
> > > > > > > 
> > > > > > > Yeah, I suppose so -- the statx wouldn't necessitate any locking. In
> > > > > > > that case, maybe this is useless then other than for testing purposes
> > > > > > > and userland NFS servers.
> > > > > > > 
> > > > > > > Would it be better to not consume a statx field with this if so? What
> > > > > > > could we use as an alternate interface? ioctl? Some sort of global
> > > > > > > virtual xattr? It does need to be something per-inode.
> > > > > > 
> > > > > > I don't see how a non-atomic change attribute is remotely useful even
> > > > > > for NFS.
> > > > > > 
> > > > > > The main problem is not so much the above (although NFS clients are
> > > > > > vulnerable to that too) but the behaviour w.r.t. directory changes.
> > > > > > 
> > > > > > If the server can't guarantee that file/directory/... creation and
> > > > > > unlink are atomically recorded with change attribute updates, then the
> > > > > > client has to always assume that the server is lying, and that it has
> > > > > > to revalidate all its caches anyway. Cue endless readdir/lookup/getattr
> > > > > > requests after each and every directory modification in order to check
> > > > > > that some other client didn't also sneak in a change of their own.
> > > > > > 
> > > > > 
> > > > > We generally hold the parent dir's inode->i_rwsem exclusively over most
> > > > > important directory changes, and the times/i_version are also updated
> > > > > while holding it. What we don't do is serialize reads of this value vs.
> > > > > the i_rwsem, so you could see new directory contents alongside an old
> > > > > i_version. Maybe we should be taking it for read when we query it on a
> > > > > directory?
> > > > 
> > > > We do hold i_rwsem today.  I'm working on changing that.  Preserving
> > > > atomic directory changeinfo will be a challenge.  The only mechanism I
> > > > can think of is to pass a "u64*" to all the directory modification ops,
> > > > and they fill in the version number at the point where it is incremented
> > > > (inode_maybe_inc_iversion_return()).  The (nfsd) caller assumes that
> > > > "before" was one less than "after".  If you don't want to internally
> > > > require single increments, then you would need to pass a 'u64 [2]' to
> > > > get two iversions back.
> > > > 
> > > 
> > > That's a major redesign of what the i_version counter is today. It may
> > > very well end up being needed, but that's going to touch a lot of stuff
> > > in the VFS. Are you planning to do that as a part of your locking
> > > changes?
> > > 
> > 
> > "A major design"?  How?  The "one less than" might be, but allowing a
> > directory morphing op to fill in a "u64 [2]" is just a new interface to
> > existing data.  One that allows fine grained atomicity.
> > 
> > This would actually be really good for NFS.  nfs_mkdir (for example)
> > could easily have access to the atomic pre/post changeid provided by
> > the server, and so could easily provide them to nfsd.
> > 
> > I'm not planning to do this as part of my locking changes.  In the first
> > instance only NFS changes behaviour, and it doesn't provide atomic
> > changeids, so there is no loss of functionality.
> > 
> > When some other filesystem wants to opt-in to shared-locking on
> > directories - that would be the time to push through a better interface.
> > 
> 
> I think nfsd does provide atomic changeids for directory operations
> currently. AFAICT, any operation where we're changing directory contents
> is done while holding the i_rwsem exclusively, and we hold that lock
> over the pre and post i_version fetch for the change_info4.
> 
> If you change nfsd to allow parallel directory morphing operations
> without addressing this, then I think that would be a regression.

Of course.

As I said, in the first instance only NFS allows parallel directory
morphing ops, and NFS doesn't provide atomic pre/post already.  No
regression.

Parallel directory morphing is opt-in - at least until all file systems
can be converted and these other issues are resolved.

> 
> > 
> > > > > 
> > > > > Achieving atomicity with file writes though is another matter entirely.
> > > > > I'm not sure that's even doable or how to approach it if so.
> > > > > Suggestions?
> > > > 
> > > > Call inode_maybe_inc_iversion(page->host) in __folio_mark_dirty() ??
> > > > 
> > > 
> > > Writes can cover multiple folios so we'd be doing several increments per
> > > write. Maybe that's ok? Should we also be updating the ctime at that
> > > point as well?
> > 
> > You would only do several increments if something was reading the value
> > concurrently, and then you really should do several increments for
> > correctness.
> > 
> 
> Agreed.
> 
> > > 
> > > Fetching the i_version under the i_rwsem is probably sufficient to fix
> > > this though. Most of the write_iter ops already bump the i_version while
> > > holding that lock, so this wouldn't add any extra locking to the write
> > > codepaths.
> > 
> > Adding new locking doesn't seem like a good idea.  It's bound to have
> > performance implications.  It may well end up serialising the directory
> > op that I'm currently trying to make parallelisable.
> > 
> 
> The new locking would only be in the NFSv4 GETATTR codepath:
> 
>     https://lore.kernel.org/linux-nfs/20220908172448.208585-9-jlayton@kernel.org/T/#u
> 
> Maybe we'd still better off taking a hit in the write codepath instead
> of doing this, but with this, most of the penalty would be paid by nfsd
> which I would think would be preferred here.

inode_lock_shared() would be acceptable here.  inode_lock() is unnecessary.

> 
> The problem of mmap writes is another matter though. Not sure what we
> can do about that without making i_version bumps a lot more expensive.
> 

Agreed.  We need to document our way out of that one.

NeilBrown

> -- 
> Jeff Layton <jlayton@kernel.org>
>
NeilBrown Sept. 10, 2022, 11 p.m. UTC | #35
On Sun, 11 Sep 2022, Al Viro wrote:
> On Thu, Sep 08, 2022 at 10:40:43AM +1000, NeilBrown wrote:
> 
> > We do hold i_rwsem today.  I'm working on changing that.  Preserving
> > atomic directory changeinfo will be a challenge.  The only mechanism I
> > can think of is to pass a "u64*" to all the directory modification ops,
> > and they fill in the version number at the point where it is incremented
> > (inode_maybe_inc_iversion_return()).  The (nfsd) caller assumes that
> > "before" was one less than "after".  If you don't want to internally
> > require single increments, then you would need to pass a 'u64 [2]' to
> > get two iversions back.
> 
> Are you serious?  What kind of boilerplate would that inflict on the
> filesystems not, er, opting in for that... scalability improvement
> experiment?
> 

Why would you think there would be any boilerplate?  Only filesystems
that opt in would need to do anything, and only when the caller asked
(by passing a non-NULL array pointer).
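
To illustrate (purely hypothetical -- not an existing VFS interface; the
foofs_* names are made up, other arguments are omitted, and
inode_maybe_inc_iversion_return() is the proposed helper from above, not
an existing one):

static int foofs_mkdir(struct inode *dir, struct dentry *dentry,
                       umode_t mode, u64 *changeid)
{
        int err = foofs_add_entry(dir, dentry, mode);

        if (err)
                return err;
        dir->i_mtime = dir->i_ctime = current_time(dir);
        if (changeid) {
                /* record the value at the point of the increment; the
                 * caller assumes "before" was one less than "after" */
                changeid[1] = inode_maybe_inc_iversion_return(dir);
                changeid[0] = changeid[1] - 1;
        } else {
                inode_maybe_inc_iversion(dir, true);
        }
        return 0;
}

A filesystem that hasn't opted in just ignores the pointer, and callers
that don't care pass NULL.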

NeilBrown
Jeff Layton Sept. 12, 2022, 10:25 a.m. UTC | #36
On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote:
> On Sat, 10 Sep 2022, Jeff Layton wrote:
> > On Fri, 2022-09-09 at 16:41 +1000, NeilBrown wrote:
> > > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > > > On Fri, 2022-09-09 at 01:10 +0000, Trond Myklebust wrote:
> > > > > > > > On Fri, 2022-09-09 at 11:07 +1000, NeilBrown wrote:
> > > > > > > > > > On Fri, 09 Sep 2022, NeilBrown wrote:
> > > > > > > > > > > > On Fri, 09 Sep 2022, Trond Myklebust wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > IOW: the minimal condition needs to be that for all cases
> > > > > > > > > > > > > > below,
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > application reads 'state B' as having occurred if any data was
> > > > > > > > > > > > > > committed to disk before the crash.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Application                             Filesystem
> > > > > > > > > > > > > > ===========                             =========
> > > > > > > > > > > > > > read change attr <- 'state A'
> > > > > > > > > > > > > > read data <- 'state A'
> > > > > > > > > > > > > >                                         write data -> 'state B'
> > > > > > > > > > > > > >                                         <crash>+<reboot>
> > > > > > > > > > > > > > read change attr <- 'state B'
> > > > > > > > > > > > 
> > > > > > > > > > > > The important thing here is to not see 'state A'.  Seeing 'state
> > > > > > > > > > > > C'
> > > > > > > > > > > > should be acceptable.  Worst case we could merge in wall-clock
> > > > > > > > > > > > time
> > > > > > > > > > > > of
> > > > > > > > > > > > system boot, but the filesystem should be able to be more helpful
> > > > > > > > > > > > than
> > > > > > > > > > > > that.
> > > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Actually, without the crash+reboot it would still be acceptable to
> > > > > > > > > > see
> > > > > > > > > > "state A" at the end there - but preferably not for long.
> > > > > > > > > > From the NFS perspective, the changeid needs to update by the time
> > > > > > > > > > of
> > > > > > > > > > a
> > > > > > > > > > close or unlock (so it is visible to open or lock), but before that
> > > > > > > > > > it
> > > > > > > > > > is just best-effort.
> > > > > > > > 
> > > > > > > > Nope. That will inevitably lead to data corruption, since the
> > > > > > > > application might decide to use the data from state A instead of
> > > > > > > > revalidating it.
> > > > > > > > 
> > > > > > 
> > > > > > The point is, NFS is not the only potential use case for change
> > > > > > attributes. We wouldn't be bothering to discuss statx() if it was.
> > > > 
> > > > My understanding is that it was primarily a desire to add fstests to
> > > > exercise the i_version which motivated the statx extension.
> > > > Obviously we should prepare for other uses though.
> > > > 
> > 
> > Mainly. Also, userland nfs servers might also like this for obvious
> > reasons. For now though, in the v5 set, I've backed off on trying to
> > expose this to userland in favor of trying to just clean up the internal
> > implementation.
> > 
> > I'd still like to expose this via statx if possible, but I don't want to
> > get too bogged down in interface design just now as we have Real Bugs to
> > fix. That patchset should make it simple to expose it later though.
> > 
> > > > > > 
> > > > > > I could be using O_DIRECT, and all the tricks in order to ensure
> > > > > > that
> > > > > > my stock broker application (to choose one example) has access
> > > > > > to the
> > > > > > absolute very latest prices when I'm trying to execute a trade.
> > > > > > When the filesystem then says 'the prices haven't changed since
> > > > > > your
> > > > > > last read because the change attribute on the database file is
> > > > > > the
> > > > > > same' in response to a statx() request with the
> > > > > > AT_STATX_FORCE_SYNC
> > > > > > flag set, then why shouldn't my application be able to assume it
> > > > > > can
> > > > > > serve those prices right out of memory instead of having to go
> > > > > > to disk?
> > > > 
> > > > I would think that such an application would be using inotify rather
> > > > than having to poll.  But certainly we should have a clear statement
> > > > of
> > > > quality-of-service parameters in the documentation.
> > > > If we agree that perfect atomicity is what we want to promise, and
> > > > that
> > > > the cost to the filesystem and the statx call is acceptable, then so
> > > > be it.
> > > > 
> > > > My point wasn't to say that atomicity is bad.  It was that:
> > > >  - if the i_version change is visible before the change itself is
> > > >    visible, then that is a correctness problem.
> > > >  - if the i_version change is only visible some time after the
> > > > change
> > > >    itself is visible, then that is a quality-of-service issue.
> > > > I cannot see any room for debating the first.  I do see some room to
> > > > debate the second.
> > > > 
> > > > Cached writes, directory ops, and attribute changes are, I think,
> > > > easy
> > > > enough to provide truly atomic i_version updates with the change
> > > > being
> > > > visible.
> > > > 
> > > > Changes to a shared memory-mapped files is probably the hardest to
> > > > provide timely i_version updates for.  We might want to document an
> > > > explicit exception for those.  Alternately each request for
> > > > i_version
> > > > would need to find all pages that are writable, remap them read-only
> > > > to
> > > > catch future writes, then update i_version if any were writable
> > > > (i.e.
> > > > ->mkwrite had been called).  That is the only way I can think of to
> > > > provide atomicity.
> > > > 
> > 
> > I don't think we really want to make i_version bumps that expensive.
> > Documenting that you can't expect perfect consistency vs. mmap with NFS
> > seems like the best thing to do. We do our best, but that sort of
> > synchronization requires real locking.
> > 
> > > > O_DIRECT writes are a little easier than mmapped files.  I suspect we
> > > > should update the i_version once the device reports that the write is
> > > > complete, but a parallel reader could have seen some of the write before
> > > > that moment.  True atomicity could only be provided by taking some
> > > > exclusive lock that blocked all O_DIRECT writes.  Jeff seems to be
> > > > suggesting this, but I doubt the stock broker application would be
> > > > willing to make the call in that case.  I don't think I would either.
> > 
> > Well, only blocked for long enough to run the getattr. Granted, with a
> > slow underlying filesystem that can take a while.
> 
> Maybe I misunderstand, but this doesn't seem to make much sense.
> 
> If you want i_version updates to appear to be atomic w.r.t O_DIRECT
> writes, then you need to prevent accessing the i_version while any write
> is on-going. At that time there is no meaningful value for i_version.
> So you need a lock (At least shared) around the actual write, and you
> need an exclusive lock around the get_i_version().
> So accessing the i_version would have to wait for all pending O_DIRECT
> writes to complete, and would block any new O_DIRECT writes from
> starting.
> 
> This could be expensive.
> 
> There is not currently any locking around O_DIRECT writes.  You cannot
> synchronise with them.
> 

AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold
inode_lock_shared across the I/O. That was why patch #8 takes the
inode_lock (exclusive) across the getattr.

> The best you can do is update the i_version immediately after all the
> O_DIRECT writes in a single request complete.
> 
> > 
> > To summarize, there are two main uses for the change attr in NFSv4:
> > 
> > 1/ to provide change_info4 for directory morphing operations (CREATE,
> > LINK, OPEN, REMOVE, and RENAME). It turns out that this is already
> > atomic in the current nfsd code (AFAICT) by virtue of the fact that we
> > hold the i_rwsem exclusively over these operations. The change attr is
> > also queried pre and post while the lock is held, so that should ensure
> > that we get true atomicity for this.
> 
> Yes, directory ops are relatively easy.
> 
> > 
> > 2/ as an adjunct for the ctime when fetching attributes to validate
> > caches. We don't expect perfect consistency between read (and readlike)
> > operations and GETATTR, even when they're in the same compound.
> > 
> > IOW, a READ+GETATTR compound can legally give you a short (or zero-
> > length) read, and then the getattr indicates a size that is larger than
> > where the READ data stops, due to a write or truncate racing in after
> > the read.
> 
> I agree that atomicity is neither necessary nor practical.  Ordering is
> important though.  I don't think a truncate(0) racing with a READ can
> credibly result in a non-zero size AFTER a zero-length read.  A truncate
> that extends the size could have that effect though.
> 
> > 
> > Ideally, the attributes in the GETATTR reply should be consistent
> > between themselves though. IOW, all of the attrs should accurately
> > represent the state of the file at a single point in time.
> > change+size+times+etc. should all be consistent with one another.
> > 
> > I think we get all of this by taking the inode_lock around the
> > vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant
> > solution, but it should give us the atomicity we need, and it doesn't
> > require adding extra operations or locking to the write codepaths.
> 
> Explicit attribute changes (chown/chmod/utimes/truncate etc) are always
> done under the inode lock.  Implicit changes via inode_update_time() are
> not (though xfs does take the lock, ext4 doesn't, haven't checked
> others).  So taking the inode lock won't ensure those are internally
> consistent.
> 
> I think using inode_lock_shared() is acceptable.  It doesn't promise
> perfect atomicity, but it is probably good enough.
> 
> We'd need a good reason to want perfect atomicity to go further, and I
> cannot think of one.
> 
> 

Taking inode_lock_shared is sufficient to block out buffered and DAX
writes. DIO writes sometimes only take the shared lock (e.g. when the
data is already properly aligned). If we want to ensure the getattr
doesn't run while _any_ writes are running, we'd need the exclusive
lock.

Maybe that's overkill, though it seems like we could have a race like
this without taking inode_lock across the getattr:

reader				writer
-----------------------------------------------------------------
				i_version++
getattr
read
				DIO write to backing store


Given that we can't fully exclude mmap writes, maybe we can just
document that mixing DIO or mmap writes on the server + NFS may not be
fully cache coherent.
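
For reference, the shape of what patch #8 does is roughly this
(simplified sketch, not the actual patch; the function name is made up):

/* hold the inode lock across the attribute fetch so that the change
 * attr, size, times, etc. in the reply are consistent with one another */
static int nfsd_getattr_atomic(const struct path *path, struct kstat *stat)
{
        struct inode *inode = d_inode(path->dentry);
        int err;

        inode_lock(inode);              /* exclusive, to block DIO too */
        err = vfs_getattr(path, stat, STATX_BASIC_STATS,
                          AT_STATX_SYNC_AS_STAT);
        inode_unlock(inode);
        return err;
}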

> 
> > 
> > We could also consider less invasive ways to achieve this (maybe some
> > sort of seqretry loop around the vfs_getattr call?), but I'd rather not
> > do extra work in the write codepaths if we can get away with it.
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> > 
> >
Jeff Layton Sept. 12, 2022, 10:43 a.m. UTC | #37
On Sun, 2022-09-11 at 08:13 +1000, NeilBrown wrote:
> On Fri, 09 Sep 2022, Jeff Layton wrote:
> > 
> > The machine crashes and comes back up, and we get a query for i_version
> > and it comes back as X. Fine, it's an old version. Now there is a write.
> > What do we do to ensure that the new value doesn't collide with X+1? 
> 
> (I missed this bit in my earlier reply..)
> 
> How is it "Fine" to see an old version?
> The file could have changed without the version changing.
> And I thought one of the goals of the crash-count was to be able to
> provide a monotonic change id.
> 

"Fine" in the sense that we expect that to happen in this situation.
It's not fine for the clients obviously, which is why we're discussing
mitigation techniques.
Jeff Layton Sept. 12, 2022, 11:42 a.m. UTC | #38
On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote:
> On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote:
> > On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote:
> > > On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote:
> > > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote:
> > > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
> > > > > > Yeah, ok. That does make some sense. So we would mix this into the
> > > > > > i_version instead of the ctime when it was available. Preferably, we'd
> > > > > > mix that in when we store the i_version rather than adding it afterward.
> > > > > > 
> > > > > > Ted, how would we access this? Maybe we could just add a new (generic)
> > > > > > super_block field for this that ext4 (and other filesystems) could
> > > > > > populate at mount time?
> > > > > 
> > > > > Couldn't the filesystem just return an ino_version that already includes
> > > > > it?
> > > > > 
> > > > 
> > > > Yes. That's simple if we want to just fold it in during getattr. If we
> > > > want to fold that into the values stored on disk, then I'm a little less
> > > > clear on how that will work.
> > > > 
> > > > Maybe I need a concrete example of how that will work:
> > > > 
> > > > Suppose we have an i_version value X with the previous crash counter
> > > > already factored in that makes it to disk. We hand out a newer version
> > > > X+1 to a client, but that value never makes it to disk.
> > > > 
> > > > The machine crashes and comes back up, and we get a query for i_version
> > > > and it comes back as X. Fine, it's an old version. Now there is a write.
> > > > What do we do to ensure that the new value doesn't collide with X+1? 
> > > 
> > > I was assuming we could partition i_version's 64 bits somehow: e.g., top
> > > 16 bits store the crash counter.  You increment the i_version by: 1)
> > > replacing the top bits by the new crash counter, if it has changed, and
> > > 2) incrementing.
> > > 
> > > Do the numbers work out?  2^16 mounts after unclean shutdowns sounds
> > > like a lot for one filesystem, as does 2^48 changes to a single file,
> > > but people do weird things.  Maybe there's a better partitioning, or
> > > some more flexible way of maintaining an i_version that still allows you
> > > to identify whether a given i_version preceded a crash.
> > > 
> > 
> > We consume one bit to keep track of the "seen" flag, so it would be a
> > 16+47 split. I assume that we'd also reset the version counter to 0 when
> > the crash counter changes? Maybe that doesn't matter as long as we don't
> > overflow into the crash counter.
> > 
> > I'm not sure we can get away with 16 bits for the crash counter, as
> > it'll leave us subject to the version counter wrapping after a long
> > uptime. 
> > 
> > If you increment a counter every nanosecond, how long until that counter
> > wraps? With 63 bits, that's 292 years (and change). With 16+47 bits,
> > that's less than two days. An 8+55 split would give us ~416 days which
> > seems a bit more reasonable?
> 
> Though now it's starting to seem a little limiting to allow only 2^8
> mounts after unclean shutdowns.
> 
> Another way to think of it might be: multiply that 8-bit crash counter
> by 2^48, and think of it as a 64-bit value that we believe (based on
> practical limits on how many times you can modify a single file) is
> guaranteed to be larger than any i_version that we gave out before the
> most recent crash.
> 
> Our goal is to ensure that after a crash, any *new* i_versions that we
> give out or write to disk are larger than any that have previously been
> given out.  We can do that by ensuring that they're equal to at least
> that old maximum.
> 
> So think of the 64-bit value we're storing in the superblock as a
> ceiling on i_version values across all the filesystem's inodes.  Call it
> s_version_max or something.  We also need to know what the maximum was
> before the most recent crash.  Call that s_version_max_old.
> 
> Then we could get correct behavior if we generated i_versions with
> something like:
> 
> 	i_version++;
> 	if (i_version < s_version_max_old)
> 		i_version = s_version_max_old;
> 	if (i_version > s_version_max)
> 		s_version_max = i_version + 1;
> 
> But that last step makes this ludicrously expensive, because for this to
> be safe across crashes we need to update that value on disk as well, and
> we need to do that frequently.
> 
> Fortunately, s_version_max doesn't have to be a tight bound at all.  We
> can easily just initialize it to, say, 2^40, and only bump it by 2^40 at
> a time.  And recognize when we're running up against it way ahead of
> time, so we only need to say "here's an updated value, could you please
> make sure it gets to disk sometime in the next twenty minutes"?
> (Numbers made up.)
> 
> Sorry, that was way too many words.  But I think something like that
> could work, and make it very difficult to hit any hard limits, and
> actually not be too complicated??  Unless I missed something.
> 

That's not too many words -- I appreciate a good "for dummies"
explanation!

A scheme like that could work. It might be hard to do it without a
spinlock or something, but maybe that's ok. Thinking more about how we'd
implement this in the underlying filesystems:

To do this we'd need 2 64-bit fields in the on-disk and in-memory 
superblocks for ext4, xfs and btrfs. On the first mount after a crash,
the filesystem would need to bump s_version_max by the significant
increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
to do that.

Would there be a way to ensure that the new s_version_max value has made
it to disk? Bumping it by a large value and hoping for the best might be
ok for most cases, but there are always outliers, so it might be
worthwhile to make an i_version increment wait on that if necessary.
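
Roughly what I'm picturing for the mount-time side (hand-wavy sketch;
the sb field, helpers, and bump size are all made up, not real
ext4/xfs/btrfs code):

#define I_VERSION_CRASH_BUMP    (1ULL << 40)

static int foofs_init_version_max(struct super_block *sb, bool unclean)
{
        struct foofs_sb_info *sbi = sb->s_fs_info;

        if (!unclean)
                return 0;

        /* raise the ceiling past anything we might have handed out
         * before the crash, and make it durable before handing out
         * any new i_version values */
        sbi->s_version_max += I_VERSION_CRASH_BUMP;
        return foofs_commit_super_wait(sb);
}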
Florian Weimer Sept. 12, 2022, 12:13 p.m. UTC | #39
* Jeff Layton:

> To do this we'd need 2 64-bit fields in the on-disk and in-memory 
> superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> the filesystem would need to bump s_version_max by the significant
> increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> to do that.
>
> Would there be a way to ensure that the new s_version_max value has made
> it to disk? Bumping it by a large value and hoping for the best might be
> ok for most cases, but there are always outliers, so it might be
> worthwhile to make an i_version increment wait on that if necessary. 

How common are unclean shutdowns in practice?  Do ext4/XFS/btrfs keep
counters in the superblocks for journal replays that can be read easily?

Several useful i_version applications could be negatively impacted by
frequent i_version invalidation.

Thanks,
Florian
J. Bruce Fields Sept. 12, 2022, 12:54 p.m. UTC | #40
On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote:
> A scheme like that could work. It might be hard to do it without a
> spinlock or something, but maybe that's ok. Thinking more about how we'd
> implement this in the underlying filesystems:
> 
> To do this we'd need 2 64-bit fields in the on-disk and in-memory 
> superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> the filesystem would need to bump s_version_max by the significant
> increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> to do that.
> 
> Would there be a way to ensure that the new s_version_max value has made
> it to disk? Bumping it by a large value and hoping for the best might be
> ok for most cases, but there are always outliers, so it might be
> worthwhile to make an i_version increment wait on that if necessary. 

I was imagining that when you recognize you're getting close, you kick
off something which writes s_version_max+2^40 to disk, and then updates
s_version_max to that new value on success of the write.

The code that increments i_version checks to make sure it wouldn't
exceed s_version_max.  If it would, something has gone wrong--a write
has failed or taken a long time--so it waits or errors out or something,
depending on desired filesystem behavior in that case.

No locking required in the normal case?
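
Something like this, say (sketch only; the names are made up):

	i_version++;
	if (i_version >= s_version_max - MARGIN)
		kick_off_max_update();	/* async write of s_version_max + 2^40;
					   bumps s_version_max on success */
	if (i_version >= s_version_max)
		wait_or_error();	/* the async write failed or is slow */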

--b.
Jeff Layton Sept. 12, 2022, 12:55 p.m. UTC | #41
On Mon, 2022-09-12 at 14:13 +0200, Florian Weimer wrote:
> * Jeff Layton:
> 
> > To do this we'd need 2 64-bit fields in the on-disk and in-memory 
> > superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> > the filesystem would need to bump s_version_max by the significant
> > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> > to do that.
> > 
> > Would there be a way to ensure that the new s_version_max value has made
> > it to disk? Bumping it by a large value and hoping for the best might be
> > ok for most cases, but there are always outliers, so it might be
> > worthwhile to make an i_version increment wait on that if necessary. 
> 
> How common are unclean shutdowns in practice?  Do ext4/XFS/btrfs keep
> counters in the superblocks for journal replays that can be read easily?
> 
> Several useful i_version applications could be negatively impacted by
> frequent i_version invalidation.
> 

One would hope "not very often", but Oopses _are_ something that happens
occasionally, even in very stable environments, and it would be best if
what we're building can cope with them. Consider:

reader				writer
----------------------------------------------------------
start with i_version 1
				inode updated in memory, i_version++
query, get i_version 2

 <<< CRASH : update never makes it to disk, back at 1 after reboot >>>

query, get i_version 1
				application restarts and redoes write, i_version at 2^40+1
query, get i_version 2^40+1 

The main thing we have to avoid here is giving out an i_version that
represents two different states of the same inode. This should achieve
that.

Something else we should consider though is that with enough crashes on
a long-lived filesystem, the value could eventually wrap. I think we
should acknowledge that fact in advance, and plan to deal with it
(particularly if we're going to expose this to userland eventually).

Because of the "seen" flag, we have a 63 bit counter to play with. Could
we use a similar scheme to the one we use to handle when "jiffies"
wraps? Assume that we'd never compare two values that were more than
2^62 apart? We could add i_version_before/i_version_after macros to make
it simple to handle this.
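
Something like this, say (illustrative only; names modeled on
time_before()/time_after()):

/* assume two values being compared are never more than 2^62 apart */
#define i_version_after(a, b)   ((s64)((b) - (a)) < 0)
#define i_version_before(a, b)  i_version_after((b), (a))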
Jeff Layton Sept. 12, 2022, 12:59 p.m. UTC | #42
On Mon, 2022-09-12 at 08:54 -0400, J. Bruce Fields wrote:
> On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote:
> > A scheme like that could work. It might be hard to do it without a
> > spinlock or something, but maybe that's ok. Thinking more about how we'd
> > implement this in the underlying filesystems:
> > 
> > To do this we'd need 2 64-bit fields in the on-disk and in-memory 
> > superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> > the filesystem would need to bump s_version_max by the significant
> > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> > to do that.
> > 
> > Would there be a way to ensure that the new s_version_max value has made
> > it to disk? Bumping it by a large value and hoping for the best might be
> > ok for most cases, but there are always outliers, so it might be
> > worthwhile to make an i_version increment wait on that if necessary. 
> 
> I was imagining that when you recognize you're getting close, you kick
> off something which writes s_version_max+2^40 to disk, and then updates
> s_version_max to that new value on success of the write.
> 

Ok, that makes sense.

> The code that increments i_version checks to make sure it wouldn't
> exceed s_version_max.  If it would, something has gone wrong--a write
> has failed or taken a long time--so it waits or errors out or something,
> depending on desired filesystem behavior in that case.
> 

Maybe we could just throw a big scary pr_warn too? I'd have to think about
how we'd want to handle this case.

> No locking required in the normal case?

Yeah, maybe not.
Florian Weimer Sept. 12, 2022, 1:20 p.m. UTC | #43
* Jeff Layton:

> On Mon, 2022-09-12 at 14:13 +0200, Florian Weimer wrote:
>> * Jeff Layton:
>> 
>> > To do this we'd need 2 64-bit fields in the on-disk and in-memory 
>> > superblocks for ext4, xfs and btrfs. On the first mount after a crash,
>> > the filesystem would need to bump s_version_max by the significant
>> > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
>> > to do that.
>> > 
>> > Would there be a way to ensure that the new s_version_max value has made
>> > it to disk? Bumping it by a large value and hoping for the best might be
>> > ok for most cases, but there are always outliers, so it might be
>> > worthwhile to make an i_version increment wait on that if necessary. 
>> 
>> How common are unclean shutdowns in practice?  Do ext4/XFS/btrfs keep
>> counters in the superblocks for journal replays that can be read easily?
>> 
>> Several useful i_version applications could be negatively impacted by
>> frequent i_version invalidation.
>> 
>
> One would hope "not very often", but Oopses _are_ something that happens
> occasionally, even in very stable environments, and it would be best if
> what we're building can cope with them.

I was wondering if such unclean shutdown events are associated with SSD
“unsafe shutdowns”, as identified by the SMART counter.  I think those
aren't necessarily restricted to oopses or various forms of power loss
(maybe depending on file system/devicemapper configuration)?

I admit it's possible that the file system is shut down cleanly before
the kernel requests the power-off state from the firmware, but the
underlying SSD is not.

Thanks,
Florian
J. Bruce Fields Sept. 12, 2022, 1:42 p.m. UTC | #44
On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote:
> On Fri, 09 Sep 2022, Jeff Layton wrote:
> > 
> > The machine crashes and comes back up, and we get a query for i_version
> > and it comes back as X. Fine, it's an old version. Now there is a write.
> > What do we do to ensure that the new value doesn't collide with X+1? 
> 
> (I missed this bit in my earlier reply..)
> 
> How is it "Fine" to see an old version?
> The file could have changed without the version changing.
> And I thought one of the goals of the crash-count was to be able to
> provide a monotonic change id.

I was still mainly thinking about how to provide reliable close-to-open
semantics between NFS clients.  In the case the writer was an NFS
client, it wasn't done writing (or it would have COMMITted), so those
writes will come in and bump the change attribute soon, and as long as
we avoid the small chance of reusing an old change attribute, we're OK,
and I think it'd even still be OK to advertise
CHANGE_TYPE_IS_MONOTONIC_INCR.

If we're trying to do better than that, I'm just not sure what's right.

--b.
Jeff Layton Sept. 12, 2022, 1:49 p.m. UTC | #45
On Mon, 2022-09-12 at 15:20 +0200, Florian Weimer wrote:
> * Jeff Layton:
> 
> > On Mon, 2022-09-12 at 14:13 +0200, Florian Weimer wrote:
> > > * Jeff Layton:
> > > 
> > > > To do this we'd need 2 64-bit fields in the on-disk and in-memory 
> > > > superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> > > > the filesystem would need to bump s_version_max by the significant
> > > > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> > > > to do that.
> > > > 
> > > > Would there be a way to ensure that the new s_version_max value has made
> > > > it to disk? Bumping it by a large value and hoping for the best might be
> > > > ok for most cases, but there are always outliers, so it might be
> > > > worthwhile to make an i_version increment wait on that if necessary. 
> > > 
> > > How common are unclean shutdowns in practice?  Do ext4/XFS/btrfs keep
> > > counters in the superblocks for journal replays that can be read easily?
> > > 
> > > Several useful i_version applications could be negatively impacted by
> > > frequent i_version invalidation.
> > > 
> > 
> > One would hope "not very often", but Oopses _are_ something that happens
> > occasionally, even in very stable environments, and it would be best if
> > what we're building can cope with them.
> 
> I was wondering if such unclean shutdown events are associated with SSD
> “unsafe shutdowns”, as identified by the SMART counter.  I think those
> aren't necessarily restricted to oopses or various forms of powerless
> (maybe depending on file system/devicemapper configuration)?
> 
> I admit it's possible that the file system is shut down cleanly before
> the kernel requests the power-off state from the firmware, but the
> underlying SSD is not.
> 

Yeah filesystem integrity is mostly what we're concerned with here.

I think most local filesystems effectively set a flag in the superblock
that is cleared when it's cleanly unmounted. If that flag is set
when you go to mount then you know there was a crash. We'd probably key
off of that in some way internally.
J. Bruce Fields Sept. 12, 2022, 1:51 p.m. UTC | #46
On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote:
> Because of the "seen" flag, we have a 63 bit counter to play with. Could
> we use a similar scheme to the one we use to handle when "jiffies"
> wraps? Assume that we'd never compare two values that were more than
> 2^62 apart? We could add i_version_before/i_version_after macros to make
> it simple to handle this.

As far as I recall the protocol just assumes it can never wrap.  I guess
you could add a new change_attr_type that works the way you describe.
But without some new protocol clients aren't going to know what to do
with a change attribute that wraps.

I think this just needs to be designed so that wrapping is impossible in
any realistic scenario.  I feel like that's doable?

If we feel we have to catch that case, the only 100% correct behavior
would probably be to make the filesystem readonly.

--b.
Jeff Layton Sept. 12, 2022, 2:02 p.m. UTC | #47
On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote:
> On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote:
> > Because of the "seen" flag, we have a 63 bit counter to play with. Could
> > we use a similar scheme to the one we use to handle when "jiffies"
> > wraps? Assume that we'd never compare two values that were more than
> > 2^62 apart? We could add i_version_before/i_version_after macros to make
> > it simple to handle this.
> 
> As far as I recall the protocol just assumes it can never wrap.  I guess
> you could add a new change_attr_type that works the way you describe.
> But without some new protocol clients aren't going to know what to do
> with a change attribute that wraps.
> 

Right, I think that's the case now, and with contemporary hardware that
shouldn't ever happen, but in 10 years when we're looking at femtosecond
latencies, could this be different? I don't know.

> I think this just needs to be designed so that wrapping is impossible in
> any realistic scenario.  I feel like that's doable?
> 
> If we feel we have to catch that case, the only 100% correct behavior
> would probably be to make the filesystem readonly.

What would be the recourse at that point? Rebuild the fs from scratch, I
guess?
Trond Myklebust Sept. 12, 2022, 2:15 p.m. UTC | #48
On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote:
> On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote:
> > Because of the "seen" flag, we have a 63 bit counter to play with.
> > Could
> > we use a similar scheme to the one we use to handle when "jiffies"
> > wraps? Assume that we'd never compare two values that were more
> > than
> > 2^62 apart? We could add i_version_before/i_version_after macros to
> > make
> > it simple to handle this.
> 
> As far as I recall the protocol just assumes it can never wrap.  I
> guess
> you could add a new change_attr_type that works the way you describe.
> But without some new protocol clients aren't going to know what to do
> with a change attribute that wraps.
> 
> I think this just needs to be designed so that wrapping is impossible
> in
> any realistic scenario.  I feel like that's doable?
> 
> If we feel we have to catch that case, the only 100% correct behavior
> would probably be to make the filesystem readonly.
> 

Which protocol? If you're talking about basic NFSv4, it doesn't assume
anything about the change attribute and wrapping.

The NFSv4.2 protocol did introduce the optional attribute
'change_attr_type' that tries to describe the change attribute
behaviour to the client. It tells you if the behaviour is monotonically
increasing, but doesn't say anything about the behaviour when the
attribute value overflows.

That said, the Linux NFSv4.2 client, which uses that change_attr_type
attribute does deal with overflow by assuming standard uint64_t wrap
around rules. i.e. it assumes bit values > 63 are truncated, meaning
that the value obtained by incrementing (2^64-1) is 0.
J. Bruce Fields Sept. 12, 2022, 2:47 p.m. UTC | #49
On Mon, Sep 12, 2022 at 10:02:27AM -0400, Jeff Layton wrote:
> On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote:
> > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote:
> > > Because of the "seen" flag, we have a 63 bit counter to play with. Could
> > > we use a similar scheme to the one we use to handle when "jiffies"
> > > wraps? Assume that we'd never compare two values that were more than
> > > 2^62 apart? We could add i_version_before/i_version_after macros to make
> > > it simple to handle this.
> > 
> > As far as I recall the protocol just assumes it can never wrap.  I guess
> > you could add a new change_attr_type that works the way you describe.
> > But without some new protocol clients aren't going to know what to do
> > with a change attribute that wraps.
> > 
> 
> Right, I think that's the case now, and with contemporary hardware that
> shouldn't ever happen, but in 10 years when we're looking at femtosecond
> latencies, could this be different? I don't know.

That doesn't sound likely.  We probably need not just 2^63 writes to a
single file, but a dependent sequence of 2^63 interspersed writes and
change attribute reads.

Then there's the question of how many crashes and remounts are possible
for a single filesystem in the worst case.

> 
> > I think this just needs to be designed so that wrapping is impossible in
> > any realistic scenario.  I feel like that's doable?
> > 
> > If we feel we have to catch that case, the only 100% correct behavior
> > would probably be to make the filesystem readonly.
> 
> What would be the recourse at that point? Rebuild the fs from scratch, I
> guess?

I guess.

--b.
J. Bruce Fields Sept. 12, 2022, 2:50 p.m. UTC | #50
On Mon, Sep 12, 2022 at 02:15:16PM +0000, Trond Myklebust wrote:
> On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote:
> > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote:
> > > Because of the "seen" flag, we have a 63 bit counter to play with.
> > > Could
> > > we use a similar scheme to the one we use to handle when "jiffies"
> > > wraps? Assume that we'd never compare two values that were more
> > > than
> > > 2^62 apart? We could add i_version_before/i_version_after macros to
> > > make
> > > it simple to handle this.
> > 
> > As far as I recall the protocol just assumes it can never wrap.  I
> > guess
> > you could add a new change_attr_type that works the way you describe.
> > But without some new protocol clients aren't going to know what to do
> > with a change attribute that wraps.
> > 
> > I think this just needs to be designed so that wrapping is impossible
> > in
> > any realistic scenario.  I feel like that's doable?
> > 
> > If we feel we have to catch that case, the only 100% correct behavior
> > would probably be to make the filesystem readonly.
> > 
> 
> Which protocol? If you're talking about basic NFSv4, it doesn't assume
> anything about the change attribute and wrapping.
> 
> The NFSv4.2 protocol did introduce the optional attribute
> 'change_attr_type' that tries to describe the change attribute
> behaviour to the client. It tells you if the behaviour is monotonically
> increasing, but doesn't say anything about the behaviour when the
> attribute value overflows.
> 
> That said, the Linux NFSv4.2 client, which uses that change_attr_type
> attribute does deal with overflow by assuming standard uint64_t wrap
> around rules. i.e. it assumes bit values > 63 are truncated, meaning
> that the value obtained by incrementing (2^64-1) is 0.

Yeah, it was the MONOTONIC_INCR case I was thinking of.  That's
interesting, I didn't know the client did that.

--b.
Trond Myklebust Sept. 12, 2022, 2:56 p.m. UTC | #51
On Mon, 2022-09-12 at 10:50 -0400, J. Bruce Fields wrote:
> On Mon, Sep 12, 2022 at 02:15:16PM +0000, Trond Myklebust wrote:
> > On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote:
> > > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote:
> > > > Because of the "seen" flag, we have a 63 bit counter to play
> > > > with.
> > > > Could
> > > > we use a similar scheme to the one we use to handle when
> > > > "jiffies"
> > > > wraps? Assume that we'd never compare two values that were more
> > > > than
> > > > 2^62 apart? We could add i_version_before/i_version_after
> > > > macros to
> > > > make
> > > > it simple to handle this.
> > > 
> > > As far as I recall the protocol just assumes it can never wrap. 
> > > I
> > > guess
> > > you could add a new change_attr_type that works the way you
> > > describe.
> > > But without some new protocol clients aren't going to know what
> > > to do
> > > with a change attribute that wraps.
> > > 
> > > I think this just needs to be designed so that wrapping is
> > > impossible
> > > in
> > > any realistic scenario.  I feel like that's doable?
> > > 
> > > If we feel we have to catch that case, the only 100% correct
> > > behavior
> > > would probably be to make the filesystem readonly.
> > > 
> > 
> > Which protocol? If you're talking about basic NFSv4, it doesn't
> > assume
> > anything about the change attribute and wrapping.
> > 
> > The NFSv4.2 protocol did introduce the optional attribute
> > 'change_attr_type' that tries to describe the change attribute
> > behaviour to the client. It tells you if the behaviour is
> > monotonically
> > increasing, but doesn't say anything about the behaviour when the
> > attribute value overflows.
> > 
> > That said, the Linux NFSv4.2 client, which uses that
> > change_attr_type
> > attribute does deal with overflow by assuming standard uint64_t
> > wrap
> > around rules. i.e. it assumes bit values > 63 are truncated,
> > meaning
> > that the value obtained by incrementing (2^64-1) is 0.
> 
> Yeah, it was the MONOTONIC_INCRE case I was thinking of.  That's
> interesting, I didn't know the client did that.
> 

If you look at where we compare version numbers, it is always some
variant of the following:

static int nfs_inode_attrs_cmp_monotonic(const struct nfs_fattr *fattr,
                                         const struct inode *inode)
{
        s64 diff = fattr->change_attr - inode_peek_iversion_raw(inode);
        if (diff > 0)
                return 1;
        return diff == 0 ? 0 : -1;
}

i.e. we do an unsigned 64-bit subtraction, and then cast it to the
signed 64-bit equivalent in order to figure out which is the more
recent value.
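
So with illustrative numbers, a wrap gets handled as you'd expect:

	u64 old = U64_MAX - 4;	/* value cached in the inode */
	u64 new = 3;		/* value from the server, post-wrap */
	s64 diff = new - old;	/* == 8, positive, so "new" is more recent */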
Trond Myklebust Sept. 12, 2022, 3:32 p.m. UTC | #52
On Mon, 2022-09-12 at 14:56 +0000, Trond Myklebust wrote:
> On Mon, 2022-09-12 at 10:50 -0400, J. Bruce Fields wrote:
> > On Mon, Sep 12, 2022 at 02:15:16PM +0000, Trond Myklebust wrote:
> > > On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote:
> > > > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote:
> > > > > Because of the "seen" flag, we have a 63 bit counter to play
> > > > > with.
> > > > > Could
> > > > > we use a similar scheme to the one we use to handle when
> > > > > "jiffies"
> > > > > wraps? Assume that we'd never compare two values that were
> > > > > more
> > > > > than
> > > > > 2^62 apart? We could add i_version_before/i_version_after
> > > > > macros to
> > > > > make
> > > > > it simple to handle this.
> > > > 
> > > > As far as I recall the protocol just assumes it can never
> > > > wrap. 
> > > > I
> > > > guess
> > > > you could add a new change_attr_type that works the way you
> > > > describe.
> > > > But without some new protocol clients aren't going to know what
> > > > to do
> > > > with a change attribute that wraps.
> > > > 
> > > > I think this just needs to be designed so that wrapping is
> > > > impossible
> > > > in
> > > > any realistic scenario.  I feel like that's doable?
> > > > 
> > > > If we feel we have to catch that case, the only 100% correct
> > > > behavior
> > > > would probably be to make the filesystem readonly.
> > > > 
> > > 
> > > Which protocol? If you're talking about basic NFSv4, it doesn't
> > > assume
> > > anything about the change attribute and wrapping.
> > > 
> > > The NFSv4.2 protocol did introduce the optional attribute
> > > 'change_attr_type' that tries to describe the change attribute
> > > behaviour to the client. It tells you if the behaviour is
> > > monotonically
> > > increasing, but doesn't say anything about the behaviour when the
> > > attribute value overflows.
> > > 
> > > That said, the Linux NFSv4.2 client, which uses that
> > > change_attr_type
> > > attribute does deal with overflow by assuming standard uint64_t
> > > wrap
> > > around rules. i.e. it assumes bit values > 63 are truncated,
> > > meaning
> > > that the value obtained by incrementing (2^64-1) is 0.
> > 
> > Yeah, it was the MONOTONIC_INCRE case I was thinking of.  That's
> > interesting, I didn't know the client did that.
> > 
> 
> If you look at where we compare version numbers, it is always some
> variant of the following:
> 
> static int nfs_inode_attrs_cmp_monotonic(const struct nfs_fattr
> *fattr,
>                                          const struct inode *inode)
> {
>         s64 diff = fattr->change_attr -
> inode_peek_iversion_raw(inode);
>         if (diff > 0)
>                 return 1;
>         return diff == 0 ? 0 : -1;
> }
> 
> i.e. we do an unsigned 64-bit subtraction, and then cast it to the
> signed 64-bit equivalent in order to figure out which is the more
> recent value.
> 

...and by the way, yes this does mean that if you suddenly add a value
of 2^63 to the change attribute, then you are likely to cause the
client to think that you just handed it an old value.

i.e. you're better off having the crash counter increment the change
attribute by a relatively small value. One that is guaranteed to be
larger than the values that may have been lost, but that is not
excessively large.
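
Concretely (illustrative values only), a 2^63-sized jump flips the sign
of that comparison:

	u64 old = 100;			/* what the client has cached */
	u64 new = old + (1ULL << 63);	/* server bumped by 2^63 */
	s64 diff = new - old;		/* == S64_MIN, i.e. negative */

so the client would treat the new value as older than the one it already
has.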
Jeff Layton Sept. 12, 2022, 3:49 p.m. UTC | #53
On Mon, 2022-09-12 at 15:32 +0000, Trond Myklebust wrote:
> On Mon, 2022-09-12 at 14:56 +0000, Trond Myklebust wrote:
> > On Mon, 2022-09-12 at 10:50 -0400, J. Bruce Fields wrote:
> > > On Mon, Sep 12, 2022 at 02:15:16PM +0000, Trond Myklebust wrote:
> > > > On Mon, 2022-09-12 at 09:51 -0400, J. Bruce Fields wrote:
> > > > > On Mon, Sep 12, 2022 at 08:55:04AM -0400, Jeff Layton wrote:
> > > > > > Because of the "seen" flag, we have a 63 bit counter to play
> > > > > > with.
> > > > > > Could
> > > > > > we use a similar scheme to the one we use to handle when
> > > > > > "jiffies"
> > > > > > wraps? Assume that we'd never compare two values that were
> > > > > > more
> > > > > > than
> > > > > > 2^62 apart? We could add i_version_before/i_version_after
> > > > > > macros to
> > > > > > make
> > > > > > it simple to handle this.
> > > > > 
> > > > > As far as I recall the protocol just assumes it can never
> > > > > wrap. 
> > > > > I
> > > > > guess
> > > > > you could add a new change_attr_type that works the way you
> > > > > describe.
> > > > > But without some new protocol clients aren't going to know what
> > > > > to do
> > > > > with a change attribute that wraps.
> > > > > 
> > > > > I think this just needs to be designed so that wrapping is
> > > > > impossible
> > > > > in
> > > > > any realistic scenario.  I feel like that's doable?
> > > > > 
> > > > > If we feel we have to catch that case, the only 100% correct
> > > > > behavior
> > > > > would probably be to make the filesystem readonly.
> > > > > 
> > > > 
> > > > Which protocol? If you're talking about basic NFSv4, it doesn't
> > > > assume
> > > > anything about the change attribute and wrapping.
> > > > 
> > > > The NFSv4.2 protocol did introduce the optional attribute
> > > > 'change_attr_type' that tries to describe the change attribute
> > > > behaviour to the client. It tells you if the behaviour is
> > > > monotonically
> > > > increasing, but doesn't say anything about the behaviour when the
> > > > attribute value overflows.
> > > > 
> > > > That said, the Linux NFSv4.2 client, which uses that
> > > > change_attr_type
> > > > attribute does deal with overflow by assuming standard uint64_t
> > > > wrap
> > > > around rules. i.e. it assumes bit values > 63 are truncated,
> > > > meaning
> > > > that the value obtained by incrementing (2^64-1) is 0.
> > > 
> > > Yeah, it was the MONOTONIC_INCRE case I was thinking of.  That's
> > > interesting, I didn't know the client did that.
> > > 
> > 
> > If you look at where we compare version numbers, it is always some
> > variant of the following:
> > 
> > static int nfs_inode_attrs_cmp_monotonic(const struct nfs_fattr
> > *fattr,
> >                                          const struct inode *inode)
> > {
> >         s64 diff = fattr->change_attr -
> > inode_peek_iversion_raw(inode);
> >         if (diff > 0)
> >                 return 1;
> >         return diff == 0 ? 0 : -1;
> > }
> > 
> > i.e. we do an unsigned 64-bit subtraction, and then cast it to the
> > signed 64-bit equivalent in order to figure out which is the more
> > recent value.
> > 

Good! This seems like the reasonable thing to do, given that the spec
doesn't really say that the change attribute has to start at low values.

> 
> ...and by the way, yes this does mean that if you suddenly add a value
> of 2^63 to the change attribute, then you are likely to cause the
> client to think that you just handed it an old value.
> 
> i.e. you're better off having the crash counter increment the change
> attribute by a relatively small value. One that is guaranteed to be
> larger than the values that may have been lost, but that is not
> excessively large.
> 

Yeah.

Like with jiffies, you need to make sure the samples you're comparing
aren't _too_ far off. That should be doable here -- 62 bits is plenty of
room to store a lot of change values.

My benchmark (maybe wrong, but maybe good enough) is to figure on an
increment per microsecond for a worst-case scenario. With that, 2^40
microseconds is >12 days. Maybe that's overkill.

2^32 microseconds is about an hour and 12 minutes. That's probably a
reasonable value to use. If we can't get a new value onto disk in that time then
something is probably very wrong.
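
For illustration, the i_version_before()/i_version_after() helpers
mentioned above might look roughly like this. This is only a sketch,
using the same signed-difference trick as the
nfs_inode_attrs_cmp_monotonic() snippet quoted earlier; the helper
names are the hypothetical ones from the discussion:

#include <linux/types.h>

/*
 * Compare two change attribute samples in a wrapping 64-bit space,
 * modelled on time_before()/time_after() for jiffies.  Only valid
 * while the two samples are less than 2^63 apart.
 */
static inline bool i_version_before(u64 a, u64 b)
{
	return (s64)(a - b) < 0;
}

static inline bool i_version_after(u64 a, u64 b)
{
	return (s64)(a - b) > 0;
}
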
NeilBrown Sept. 12, 2022, 11:29 p.m. UTC | #54
On Mon, 12 Sep 2022, Jeff Layton wrote:
> On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote:
> > This could be expensive.
> > 
> > There is not currently any locking around O_DIRECT writes.  You cannot
> > synchronise with them.
> > 
> 
> AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold
> inode_lock_shared across the I/O. That was why patch #8 takes the
> inode_lock (exclusive) across the getattr.

Looking at ext4_dio_write_iter() it certainly does take
inode_lock_shared() before starting the write and in some cases it
requests, using IOMAP_DIO_FORCE_WAIT, that iomap_dio_rw() should wait for
the write to complete.  But not in all cases.
So I don't think it always holds the shared lock across all direct IO.

> 
> > The best you can do is update the i_version immediately after all the
> > O_DIRECT writes in a single request complete.
> > 
> > > 
> > > To summarize, there are two main uses for the change attr in NFSv4:
> > > 
> > > 1/ to provide change_info4 for directory morphing operations (CREATE,
> > > LINK, OPEN, REMOVE, and RENAME). It turns out that this is already
> > > atomic in the current nfsd code (AFAICT) by virtue of the fact that we
> > > hold the i_rwsem exclusively over these operations. The change attr is
> > > also queried pre and post while the lock is held, so that should ensure
> > > that we get true atomicity for this.
> > 
> > Yes, directory ops are relatively easy.
> > 
> > > 
> > > 2/ as an adjunct for the ctime when fetching attributes to validate
> > > caches. We don't expect perfect consistency between read (and readlike)
> > > operations and GETATTR, even when they're in the same compound.
> > > 
> > > IOW, a READ+GETATTR compound can legally give you a short (or zero-
> > > length) read, and then the getattr indicates a size that is larger than
> > > where the READ data stops, due to a write or truncate racing in after
> > > the read.
> > 
> > I agree that atomicity is neither necessary nor practical.  Ordering is
> > important though.  I don't think a truncate(0) racing with a READ can
> > credibly result in a non-zero size AFTER a zero-length read.  A truncate
> > that extends the size could have that effect though.
> > 
> > > 
> > > Ideally, the attributes in the GETATTR reply should be consistent
> > > between themselves though. IOW, all of the attrs should accurately
> > > represent the state of the file at a single point in time.
> > > change+size+times+etc. should all be consistent with one another.
> > > 
> > > I think we get all of this by taking the inode_lock around the
> > > vfs_getattr call in nfsd4_encode_fattr. It may not be the most elegant
> > > solution, but it should give us the atomicity we need, and it doesn't
> > > require adding extra operations or locking to the write codepaths.
> > 
> > Explicit attribute changes (chown/chmod/utimes/truncate etc) are always
> > done under the inode lock.  Implicit changes via inode_update_time() are
> > not (though xfs does take the lock, ext4 doesn't, haven't checked
> > others).  So taking the inode lock won't ensure those are internally
> > consistent.
> > 
> > I think using inode_lock_shared() is acceptable.  It doesn't promise
> > perfect atomicity, but it is probably good enough.
> > 
> > We'd need a good reason to want perfect atomicity to go further, and I
> > cannot think of one.
> > 
> > 
> 
> Taking inode_lock_shared is sufficient to block out buffered and DAX
> writes. DIO writes sometimes only take the shared lock (e.g. when the
> data is already properly aligned). If we want to ensure the getattr
> doesn't run while _any_ writes are running, we'd need the exclusive
> lock.

But the exclusive lock is bad for scalability.

> 
> Maybe that's overkill, though it seems like we could have a race like
> this without taking inode_lock across the getattr:
> 
> reader				writer
> -----------------------------------------------------------------
> 				i_version++
> getattr
> read
> 				DIO write to backing store
> 

This is why I keep saying that the i_version increment must be after the
write, not before it.

> 
> Given that we can't fully exclude mmap writes, maybe we can just
> document that mixing DIO or mmap writes on the server + NFS may not be
> fully cache coherent.

"fully cache coherent" is really more than anyone needs.
The i_version must be seen to change no earlier than the related change
becomes visible, and no later than the request which initiated that
change is acknowledged as complete.

NeilBrown
John Stoffel Sept. 13, 2022, 12:29 a.m. UTC | #55
>>>>> "Jeff" == Jeff Layton <jlayton@kernel.org> writes:

> On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote:
>> On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote:
>> > On Fri, 2022-09-09 at 11:45 -0400, J. Bruce Fields wrote:
>> > > On Thu, Sep 08, 2022 at 03:07:58PM -0400, Jeff Layton wrote:
>> > > > On Thu, 2022-09-08 at 14:22 -0400, J. Bruce Fields wrote:
>> > > > > On Thu, Sep 08, 2022 at 01:40:11PM -0400, Jeff Layton wrote:
>> > > > > > Yeah, ok. That does make some sense. So we would mix this into the
>> > > > > > i_version instead of the ctime when it was available. Preferably, we'd
>> > > > > > mix that in when we store the i_version rather than adding it afterward.
>> > > > > > 
>> > > > > > Ted, how would we access this? Maybe we could just add a new (generic)
>> > > > > > super_block field for this that ext4 (and other filesystems) could
>> > > > > > populate at mount time?
>> > > > > 
>> > > > > Couldn't the filesystem just return an ino_version that already includes
>> > > > > it?
>> > > > > 
>> > > > 
>> > > > Yes. That's simple if we want to just fold it in during getattr. If we
>> > > > want to fold that into the values stored on disk, then I'm a little less
>> > > > clear on how that will work.
>> > > > 
>> > > > Maybe I need a concrete example of how that will work:
>> > > > 
>> > > > Suppose we have an i_version value X with the previous crash counter
>> > > > already factored in that makes it to disk. We hand out a newer version
>> > > > X+1 to a client, but that value never makes it to disk.
>> > > > 
>> > > > The machine crashes and comes back up, and we get a query for i_version
>> > > > and it comes back as X. Fine, it's an old version. Now there is a write.
>> > > > What do we do to ensure that the new value doesn't collide with X+1? 
>> > > 
>> > > I was assuming we could partition i_version's 64 bits somehow: e.g., top
>> > > 16 bits store the crash counter.  You increment the i_version by: 1)
>> > > replacing the top bits by the new crash counter, if it has changed, and
>> > > 2) incrementing.
>> > > 
>> > > Do the numbers work out?  2^16 mounts after unclean shutdowns sounds
>> > > like a lot for one filesystem, as does 2^48 changes to a single file,
>> > > but people do weird things.  Maybe there's a better partitioning, or
>> > > some more flexible way of maintaining an i_version that still allows you
>> > > to identify whether a given i_version preceded a crash.
>> > > 
>> > 
>> > We consume one bit to keep track of the "seen" flag, so it would be a
>> > 16+47 split. I assume that we'd also reset the version counter to 0 when
>> > the crash counter changes? Maybe that doesn't matter as long as we don't
>> > overflow into the crash counter.
>> > 
>> > I'm not sure we can get away with 16 bits for the crash counter, as
>> > it'll leave us subject to the version counter wrapping after a long
>> > uptime. 
>> > 
>> > If you increment a counter every nanosecond, how long until that counter
>> > wraps? With 63 bits, that's 292 years (and change). With 16+47 bits,
>> > that's less than two days. An 8+55 split would give us ~416 days which
>> > seems a bit more reasonable?
>> 
>> Though now it's starting to seem a little limiting to allow only 2^8
>> mounts after unclean shutdowns.
>> 
>> Another way to think of it might be: multiply that 8-bit crash counter
>> by 2^48, and think of it as a 64-bit value that we believe (based on
>> practical limits on how many times you can modify a single file) is
>> guaranteed to be larger than any i_version that we gave out before the
>> most recent crash.
>> 
>> Our goal is to ensure that after a crash, any *new* i_versions that we
>> give out or write to disk are larger than any that have previously been
>> given out.  We can do that by ensuring that they're equal to at least
>> that old maximum.
>> 
>> So think of the 64-bit value we're storing in the superblock as a
>> ceiling on i_version values across all the filesystem's inodes.  Call it
>> s_version_max or something.  We also need to know what the maximum was
>> before the most recent crash.  Call that s_version_max_old.
>> 
>> Then we could get correct behavior if we generated i_versions with
>> something like:
>> 
>> 	i_version++;
>> 	if (i_version < s_version_max_old)
>> 		i_version = s_version_max_old;
>> 	if (i_version > s_version_max)
>> 		s_version_max = i_version + 1;
>> 
>> But that last step makes this ludicrously expensive, because for this to
>> be safe across crashes we need to update that value on disk as well, and
>> we need to do that frequently.
>> 
>> Fortunately, s_version_max doesn't have to be a tight bound at all.  We
>> can easily just initialize it to, say, 2^40, and only bump it by 2^40 at
>> a time.  And recognize when we're running up against it way ahead of
>> time, so we only need to say "here's an updated value, could you please
>> make sure it gets to disk sometime in the next twenty minutes"?
>> (Numbers made up.)
>> 
>> Sorry, that was way too many words.  But I think something like that
>> could work, and make it very difficult to hit any hard limits, and
>> actually not be too complicated??  Unless I missed something.
>> 

> That's not too many words -- I appreciate a good "for dummies"
> explanation!

> A scheme like that could work. It might be hard to do it without a
> spinlock or something, but maybe that's ok. Thinking more about how we'd
> implement this in the underlying filesystems:

> To do this we'd need 2 64-bit fields in the on-disk and in-memory 
> superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> the filesystem would need to bump s_version_max by the significant
> increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> to do that.

> Would there be a way to ensure that the new s_version_max value has made
> it to disk? Bumping it by a large value and hoping for the best might be
> ok for most cases, but there are always outliers, so it might be
> worthwhile to make an i_version increment wait on that if necessary. 


Would it be silly to steal the same idea from the DNS folks, where they
can wrap the 32-bit serial number around by incrementing it by a large
amount, pushing out the change, then setting it back to 1
to wrap the counter?

I just worry about space-limited counters that don't automatically
wrap, or don't allow people to force them to wrap gracefully without
major hassles.

But I come at this all from the IT side of things, not the
programming/kernel side.

John
Dave Chinner Sept. 13, 2022, 12:41 a.m. UTC | #56
On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote:
> On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote:
> > On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote:
> > Our goal is to ensure that after a crash, any *new* i_versions that we
> > give out or write to disk are larger than any that have previously been
> > given out.  We can do that by ensuring that they're equal to at least
> > that old maximum.
> > 
> > So think of the 64-bit value we're storing in the superblock as a
> > ceiling on i_version values across all the filesystem's inodes.  Call it
> > s_version_max or something.  We also need to know what the maximum was
> > before the most recent crash.  Call that s_version_max_old.
> > 
> > Then we could get correct behavior if we generated i_versions with
> > something like:
> > 
> > 	i_version++;
> > 	if (i_version < s_version_max_old)
> > 		i_version = s_version_max_old;
> > 	if (i_version > s_version_max)
> > 		s_version_max = i_version + 1;
> > 
> > But that last step makes this ludicrously expensive, because for this to
> > be safe across crashes we need to update that value on disk as well, and
> > we need to do that frequently.
> > 
> > Fortunately, s_version_max doesn't have to be a tight bound at all.  We
> > can easily just initialize it to, say, 2^40, and only bump it by 2^40 at
> > a time.  And recognize when we're running up against it way ahead of
> > time, so we only need to say "here's an updated value, could you please
> > make sure it gets to disk sometime in the next twenty minutes"?
> > (Numbers made up.)
> > 
> > Sorry, that was way too many words.  But I think something like that
> > could work, and make it very difficult to hit any hard limits, and
> > actually not be too complicated??  Unless I missed something.
> > 
> 
> That's not too many words -- I appreciate a good "for dummies"
> explanation!
> 
> A scheme like that could work. It might be hard to do it without a
> spinlock or something, but maybe that's ok. Thinking more about how we'd
> implement this in the underlying filesystems:
> 
> To do this we'd need 2 64-bit fields in the on-disk and in-memory 
> superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> the filesystem would need to bump s_version_max by the significant
> increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> to do that.

Why only increment on crash? If the filesystem has been unmounted,
then any cached data is -stale- and must be discarded. e.g. unmount,
run fsck which cleans up corrupt files but does not modify
i_version, then mount. Remote caches are now invalid, but i_version
may not have changed, so we still need the clean unmount-mount cycle
to invalidate caches.

IOWs, what we want is a salted i_version value, with the filesystem
providing the unique per-mount salt that gets added to the
externally visible i_version values.

If that's the case, the salt doesn't need to be restricted to just
modifying the upper bits - as long as the salt increments
substantially and independently to the on-disk inode i_version then
we just don't care what bits of the superblock salt change from
mount to mount.

For XFS we already have a unique 64 bit salt we could use for every
mount - clean or unclean - and guarantee it is larger for every
mount. It also gets substantially bumped by fsck, too. It's called a
Log Sequence Number and we use them to track and strictly order
every modification we write into the log. This is exactly what is
needed for a i_version salt, and it's already guaranteed to be
persistent.

> Would there be a way to ensure that the new s_version_max value has made
> it to disk?

Yes, but that's not really relevant to the definition of the salt:
we don't need to design the filesystem implementation of a
persistent per-mount salt value. All we need is to define the
behaviour of the salt (e.g. must always increase across a
umount/mount cycle) and then you can let the filesystem developers
worry about how to provide the required salt behaviour and it's
persistence.

In the meantime, you can implement and test the salting by
using the system time to seed the superblock salt - that's good
enough for proof of concept, and as a fallback for filesystems that
cannot provide the required per-mount salt persistence....

> Bumping it by a large value and hoping for the best might be
> ok for most cases, but there are always outliers, so it might be
> worthwhile to make an i_version increment wait on that if necessary. 

Nothing should be able to query i_version until the filesystem is
fully recovered, mounted and the salt has been set. Hence no
application (kernel or userspace) should ever see an unsalted
i_version value....

-Dave.
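
As a sketch of the salting idea: the externally visible value is the
on-disk counter plus a per-mount salt that only ever grows across
mounts. Here sb->s_iversion_salt is a made-up field standing in for
whatever the filesystem provides (e.g. the XFS LSN sampled at mount
time); inode_peek_iversion() is the existing helper:

#include <linux/fs.h>
#include <linux/iversion.h>

/* Report a salted change attribute; s_iversion_salt is hypothetical. */
static u64 salted_change_attr(struct inode *inode)
{
	return inode_peek_iversion(inode) + inode->i_sb->s_iversion_salt;
}
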
Dave Chinner Sept. 13, 2022, 1:15 a.m. UTC | #57
On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote:
> On Mon, 12 Sep 2022, Jeff Layton wrote:
> > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote:
> > > This could be expensive.
> > > 
> > > There is not currently any locking around O_DIRECT writes.  You cannot
> > > synchronise with them.
> > > 
> > 
> > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold
> > inode_lock_shared across the I/O. That was why patch #8 takes the
> > inode_lock (exclusive) across the getattr.
> 
> Looking at ext4_dio_write_iter() it certainly does take
> inode_lock_shared() before starting the write and in some cases it
> requests, using IOMAP_DIO_FORCE_WAIT, that iomap_dio_rw() should wait for
> the write to complete.  But not in all cases.
> So I don't think it always holds the shared lock across all direct IO.

To serialise against dio writes, one must:

	// Lock the inode exclusively to block new DIO submissions
	inode_lock(inode);

	// Wait for all in flight DIO reads and writes to complete
	inode_dio_wait(inode);

This is how truncate, fallocate, etc serialise against AIO+DIO which
do not hold the inode lock across the entire IO. These have to
serialise against DIO reads, too, because we can't have IO in
progress over a range of the file that we are about to free....

> > Taking inode_lock_shared is sufficient to block out buffered and DAX
> > writes. DIO writes sometimes only take the shared lock (e.g. when the
> > data is already properly aligned). If we want to ensure the getattr
> > doesn't run while _any_ writes are running, we'd need the exclusive
> > lock.
> 
> But the exclusive lock is bad for scalability.

Serialisation against DIO is -expensive- and -slow-. It's not a
solution for what is supposed to be a fast unlocked read-only
operation like statx().

> > Maybe that's overkill, though it seems like we could have a race like
> > this without taking inode_lock across the getattr:
> > 
> > reader				writer
> > -----------------------------------------------------------------
> > 				i_version++
> > getattr
> > read
> > 				DIO write to backing store
> > 
> 
> This is why I keep saying that the i_version increment must be after the
> write, not before it.

Sure, but that ignores the reason why we actually need to bump
i_version *before* we submit a DIO write. DIO write invalidates the
page cache over the range of the write, so any racing read will
re-populate the page cache during the DIO write.

Hence buffered reads can return before the DIO write has completed,
and the contents of the read can contain none, some, or all of the
contents of the DIO write. Hence i_version has to be incremented
before the DIO write is submitted so that racing getattrs will
indicate that the local caches have been invalidated and that data
needs to be refetched.

But, yes, to actually be safe here, we *also* should be bumping
i_version on DIO write completion so that racing
i_version and data reads that occur *after* the initial i_version
bump are invalidated immediately.

IOWs, to avoid getattr/read races missing stale data invalidations
during DIO writes, we really need to bump i_version both _before and
after_ DIO write submission.

It's corner cases like this where "i_version should only be bumped
when ctime changes" fails completely. i.e. there are concurrent IO
situations which can only really be handled correctly by bumping
i_version whenever either in-memory and/or on-disk persistent data/
metadata state changes occur.....

Cheers,

Dave.
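
A rough sketch of the "bump before and after" idea for a DIO write
path. foo_submit_dio_write() is a stand-in for the filesystem's real
submission path, not an existing function; inode_maybe_inc_iversion()
is the existing helper, which only does work if the current value has
been queried:

static ssize_t foo_dio_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/*
	 * Bump before submission: the page cache is about to be
	 * invalidated, so racing getattrs must already see a change.
	 */
	inode_maybe_inc_iversion(inode, false);

	ret = foo_submit_dio_write(iocb, from);	/* hypothetical */

	/*
	 * Bump again once the new data is visible.  For AIO this
	 * second bump would belong in the dio completion instead.
	 */
	if (ret > 0)
		inode_maybe_inc_iversion(inode, false);

	return ret;
}
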
Dave Chinner Sept. 13, 2022, 2:41 a.m. UTC | #58
On Tue, Sep 13, 2022 at 11:49:03AM +1000, NeilBrown wrote:
> On Tue, 13 Sep 2022, Dave Chinner wrote:
> > On Mon, Sep 12, 2022 at 07:42:16AM -0400, Jeff Layton wrote:
> > > On Sat, 2022-09-10 at 10:56 -0400, J. Bruce Fields wrote:
> > > > On Fri, Sep 09, 2022 at 12:36:29PM -0400, Jeff Layton wrote:
> > > > Our goal is to ensure that after a crash, any *new* i_versions that we
> > > > give out or write to disk are larger than any that have previously been
> > > > given out.  We can do that by ensuring that they're equal to at least
> > > > that old maximum.
> > > > 
> > > > So think of the 64-bit value we're storing in the superblock as a
> > > > ceiling on i_version values across all the filesystem's inodes.  Call it
> > > > s_version_max or something.  We also need to know what the maximum was
> > > > before the most recent crash.  Call that s_version_max_old.
> > > > 
> > > > Then we could get correct behavior if we generated i_versions with
> > > > something like:
> > > > 
> > > > 	i_version++;
> > > > 	if (i_version < s_version_max_old)
> > > > 		i_version = s_version_max_old;
> > > > 	if (i_version > s_version_max)
> > > > 		s_version_max = i_version + 1;
> > > > 
> > > > But that last step makes this ludicrously expensive, because for this to
> > > > be safe across crashes we need to update that value on disk as well, and
> > > > we need to do that frequently.
> > > > 
> > > > Fortunately, s_version_max doesn't have to be a tight bound at all.  We
> > > > can easily just initialize it to, say, 2^40, and only bump it by 2^40 at
> > > > a time.  And recognize when we're running up against it way ahead of
> > > > time, so we only need to say "here's an updated value, could you please
> > > > make sure it gets to disk sometime in the next twenty minutes"?
> > > > (Numbers made up.)
> > > > 
> > > > Sorry, that was way too many words.  But I think something like that
> > > > could work, and make it very difficult to hit any hard limits, and
> > > > actually not be too complicated??  Unless I missed something.
> > > > 
> > > 
> > > That's not too many words -- I appreciate a good "for dummies"
> > > explanation!
> > > 
> > > A scheme like that could work. It might be hard to do it without a
> > > spinlock or something, but maybe that's ok. Thinking more about how we'd
> > > implement this in the underlying filesystems:
> > > 
> > > To do this we'd need 2 64-bit fields in the on-disk and in-memory 
> > > superblocks for ext4, xfs and btrfs. On the first mount after a crash,
> > > the filesystem would need to bump s_version_max by the significant
> > > increment (2^40 bits or whatever). On a "clean" mount, it wouldn't need
> > > to do that.
> > 
> > Why only increment on crash? If the filesystem has been unmounted,
> > then any cached data is -stale- and must be discarded. e.g. unmount,
> > run fsck which cleans up corrupt files but does not modify
> > i_version, then mount. Remote caches are now invalid, but i_version
> > may not have changed, so we still need the clean unmount-mount cycle
> > to invalidate caches.
> 
> I disagree.  We do need fsck to cause caches to be invalidated IF IT
FOUND SOMETHING TO REPAIR, but not if the filesystem was truly clean.

<sigh>

Neil, why the fuck are you shouting at me for making the obvious
observation that data in cleanly unmounted filesystems can be modified
when they are offline?

Indeed, we know there are many systems out there that mount a
filesystem, preallocate and map the blocks that are allocated to a
large file, unmount the filesystem, mmap the ranges of the block
device and pass them to RDMA hardware, then have sensor arrays RDMA
data directly into the block device. Then when the measurement
application is done they walk the on-disk metadata to remove the
unwritten flags on the extents, mount the filesystem again and
export the file data to an HPC cluster for post-processing.....

So how does the filesystem know whether data the storage contains
for its files has been modified while it is unmounted and so needs
to change the salt?

The short answer is that it can't, and so we cannot make assumptions
that a unmount/mount cycle has not changed the filesystem in any
way....

> > IOWs, what we want is a salted i_version value, with the filesystem
> > providing the unique per-mount salt that gets added to the
> > externally visible i_version values.
> 
> I agree this is a simple approach.  Possible the best.
> 
> > 
> > If that's the case, the salt doesn't need to be restricted to just
> > modifying the upper bits - as long as the salt increments
> > substantially and independently to the on-disk inode i_version then
> > we just don't care what bits of the superblock salt change from
> > mount to mount.
> > 
> > For XFS we already have a unique 64 bit salt we could use for every
> > mount - clean or unclean - and guarantee it is larger for every
> > mount. It also gets substantially bumped by fsck, too. It's called a
> > Log Sequence Number and we use them to track and strictly order
> > every modification we write into the log. This is exactly what is
> > needed for a i_version salt, and it's already guaranteed to be
> > persistent.
> 
> Invalidating the client cache on EVERY unmount/mount could impose
> unnecessary cost.  Imagine a client that caches a lot of data (several
> large files) from a server which is expected to fail-over from one
> cluster node to another from time to time.  Adding extra delays to a
> fail-over is not likely to be well received.

HA fail-over is something that happens rarely, and isn't something
we should be trying to optimise i_version for.  Indeed, HA failover
is usually a result of an active server crash/failure, in which case
server side filesystem recovery is required before the new node can
export the filesystem again. That's exactly the case you are talking
about needing to have the salt change to invalidate potentially
stale client side i_version values....

If the HA system needs to control the salt for co-ordinated, cache
coherent hand-over then -add an option for the HA server to control
the salt value itself-. HA orchestration has to handle so much state
hand-over between server nodes already that handling a salt value
for the mount is no big deal. This really is not something that
individual local filesystems need to care about, ever.

-Dave.
Theodore Ts'o Sept. 13, 2022, 9:38 a.m. UTC | #59
On Tue, Sep 13, 2022 at 01:30:58PM +1000, NeilBrown wrote:
> On Tue, 13 Sep 2022, Dave Chinner wrote:
> > 
> > Indeed, we know there are many systems out there that mount a
> > filesystem, preallocate and map the blocks that are allocated to a
> > large file, unmount the filesystem, mmap the ranges of the block
> > device and pass them to RDMA hardware, then have sensor arrays RDMA
> > data directly into the block device.....
> 
> And this tool doesn't update the i_version?  Sounds like a bug.

Tools that do this include "grub" and "lilo".  Fortunately, most
people aren't trying to export their /boot directory over NFS.  :-P

That being said, all we can strive for is "good enough" and not
"perfection".  So if I were to add a "crash counter" to the ext4
superblock, I can make sure it gets incremented (a) whenever the
journal is replayed (assuming that we decide to use lazytime-style
update for i_version for performance reasons), or (b) when fsck needs
to fix some file system inconsistency, or (c) when some external tool
like debugfs or fuse2fs is modifying the file system.

Will this get *everything*?  No.  For example, in addition to Linux boot
loaders, there might be userspace which uses FIEMAP to get the
physical blocks #'s for a file, and then reads and writes to those
blocks using a kernel-bypass interface for high-speed SSDs, for
example.  I happen to know of thousands of machines that are doing
this with ext4 in production today, so this isn't a hypothetical
example; fortunately, they aren't exporting their file system over NFS,
nor are they likely to do so.  :-)

		    	    	      		   - Ted
Jeff Layton Sept. 13, 2022, 7:01 p.m. UTC | #60
On Tue, 2022-09-13 at 11:15 +1000, Dave Chinner wrote:
> On Tue, Sep 13, 2022 at 09:29:48AM +1000, NeilBrown wrote:
> > On Mon, 12 Sep 2022, Jeff Layton wrote:
> > > On Sun, 2022-09-11 at 08:53 +1000, NeilBrown wrote:
> > > > This could be expensive.
> > > > 
> > > > There is not currently any locking around O_DIRECT writes.  You cannot
> > > > synchronise with them.
> > > > 
> > > 
> > > AFAICT, DIO write() implementations in btrfs, ext4, and xfs all hold
> > > inode_lock_shared across the I/O. That was why patch #8 takes the
> > > inode_lock (exclusive) across the getattr.
> > 
> > Looking at ext4_dio_write_iter() it certainly does take
> > inode_lock_shared() before starting the write and in some cases it
> > requests, using IOMAP_DIO_FORCE_WAIT, that iomap_dio_rw() should wait for
> > the write to complete.  But not in all cases.
> > So I don't think it always holds the shared lock across all direct IO.
> 
> To serialise against dio writes, one must:
> 
> 	// Lock the inode exclusively to block new DIO submissions
> 	inode_lock(inode);
> 
> 	// Wait for all in flight DIO reads and writes to complete
> 	inode_dio_wait(inode);
> 
> This is how truncate, fallocate, etc serialise against AIO+DIO which
> do not hold the inode lock across the entire IO. These have to
> serialise against DIO reads, too, because we can't have IO in
> progress over a range of the file that we are about to free....
> 

Thanks, that clarifies a bit.

> > > Taking inode_lock_shared is sufficient to block out buffered and DAX
> > > writes. DIO writes sometimes only take the shared lock (e.g. when the
> > > data is already properly aligned). If we want to ensure the getattr
> > > doesn't run while _any_ writes are running, we'd need the exclusive
> > > lock.
> > 
> > But the exclusive lock is bad for scalability.
> 
> Serialisation against DIO is -expensive- and -slow-. It's not a
> solution for what is supposed to be a fast unlocked read-only
> operation like statx().
> 

Fair enough. I labeled that patch with RFC as I suspected that it would
be too expensive. I don't think we can guarantee perfect consistency vs.
mmap either, so carving out DIO is not a stretch (at least not for
NFSv4).

> > > Maybe that's overkill, though it seems like we could have a race like
> > > this without taking inode_lock across the getattr:
> > > 
> > > reader				writer
> > > -----------------------------------------------------------------
> > > 				i_version++
> > > getattr
> > > read
> > > 				DIO write to backing store
> > > 
> > 
> > This is why I keep saying that the i_version increment must be after the
> > write, not before it.
> 
> Sure, but that ignores the reason why we actually need to bump
> i_version *before* we submit a DIO write. DIO write invalidates the
> page cache over the range of the write, so any racing read will
> re-populate the page cache during the DIO write.
> 
> Hence buffered reads can return before the DIO write has completed,
> and the contents of the read can contain none, some, or all of the
> contents of the DIO write. Hence i_version has to be incremented
> before the DIO write is submitted so that racing getattrs will
> indicate that the local caches have been invalidated and that data
> needs to be refetched.
> 

Bumping the change attribute after the write is done would be sufficient
for serving NFSv4. The clients just invalidate their caches if they see
the value change. Bumping it before and after would be fine too. We
might get some spurious cache invalidations but they'd be infrequent.

FWIW, we've never guaranteed any real atomicity with NFS readers vs.
writers. Clients may see the intermediate stages of a write from a
different client if their reads race in at the right time. If you need
real atomicity, then you really should be using locking. What we _do_
try to ensure is timely pagecache invalidation when this occurs.

If we want to expose this to userland via statx in the future, then we
may need a stronger guarantee because we can't as easily predict how
people will want to use this.

At that point, bumping i_version both before and after makes a bit more
sense, since it better ensures that a change will be noticed, whether
the related read op comes before or after the statx.

> But, yes, to actually be safe here, we *also* should be bumping
> i_version on DIO write completion so that racing
> i_version and data reads that occur *after* the initial i_version
> bump are invalidated immediately.
> 
> IOWs, to avoid getattr/read races missing stale data invalidations
> during DIO writes, we really need to bump i_version both _before and
> after_ DIO write submission.
> 
> It's corner cases like this where "i_version should only be bumped
> when ctime changes" fails completely. i.e. there are concurrent IO
> situations which can only really be handled correctly by bumping
> i_version whenever either in-memory and/or on-disk persistent data/
> metadata state changes occur.....

I think we have two choices (so far) when it comes to closing the race
window between the i_version bump and the write. Either should be fine
for serving NFSv4.

1/ take the inode_lock in some form across the getattr call for filling
out GETATTR/READDIR/NVERIFY info. This is what the RFC patch in my
latest set does. That's obviously too expensive though. We could take
inode_lock_shared, which wouldn't exclude DIO, but would cover the
buffered and DAX codepaths. This is somewhat ugly though, particularly
with slow backend network filesystems (like NFS). That getattr could
take a while, and meanwhile all writes are stuck...

...or...

2/ start bumping the i_version after a write completes. Bumping it twice
(before and after) would be fine too. In most cases the second one will
be a no-op anyway. We might get the occasional false cache invalidations
there with NFS, but they should be pretty rare and that's preferable to
holding on to invalid cached data (which I think is a danger today).

To do #2, I guess we'd need to add an inode_maybe_inc_iversion call at
the end of the relevant ->write_iter ops, and then dirty the inode if
that comes back true? That should be pretty rare.

We do also still need some way to mitigate potential repeated versions
due to crashes, but that's orthogonal to the above issue (and being
discussed in a different branch of this thread).
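
For concreteness, a minimal sketch of what option 2 might look like at
the end of a ->write_iter. The surrounding function is invented;
inode_maybe_inc_iversion() and mark_inode_dirty() are the existing
helpers:

static ssize_t foo_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	ret = generic_file_write_iter(iocb, from);

	/*
	 * Bump the change attribute only after the write is visible.
	 * inode_maybe_inc_iversion() returns true only when it really
	 * changed the value (i.e. it had been queried), so dirtying
	 * the inode here should be rare.
	 */
	if (ret > 0 && inode_maybe_inc_iversion(inode, false))
		mark_inode_dirty(inode);

	return ret;
}
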
J. Bruce Fields Sept. 13, 2022, 7:02 p.m. UTC | #61
On Tue, Sep 13, 2022 at 11:49:03AM +1000, NeilBrown wrote:
> Invalidating the client cache on EVERY unmount/mount could impose
> unnecessary cost.  Imagine a client that caches a lot of data (several
> large files) from a server which is expected to fail-over from one
> cluster node to another from time to time.  Adding extra delays to a
> fail-over is not likely to be well received.
> 
> I don't *know* this cost would be unacceptable, and I *would* like to
> leave it to the filesystem to decide how to manage its own i_version
> values.  So maybe XFS can use the LSN for a salt.  If people notice the
> extra cost, they can complain.

I'd expect complaints.

NFS is actually even worse than this: it allows clients to reacquire
file locks across server restart and unmount/remount, even though
obviously the kernel will do nothing to prevent someone else from
locking (or modifying) the file in between.

Administrators are just supposed to know not to allow other applications
access to the filesystem until nfsd's started.  It's always been this
way.

You can imagine all sorts of measures to prevent that, and if anyone
wants to work on ways to prevent people from shooting themselves in the
foot here, great.

Just taking away the ability to cache or lock across reboots wouldn't
make people happy, though....

--b.
NeilBrown Sept. 13, 2022, 11:19 p.m. UTC | #62
On Wed, 14 Sep 2022, J. Bruce Fields wrote:
> On Tue, Sep 13, 2022 at 11:49:03AM +1000, NeilBrown wrote:
> > Invalidating the client cache on EVERY unmount/mount could impose
> > unnecessary cost.  Imagine a client that caches a lot of data (several
> > large files) from a server which is expected to fail-over from one
> > cluster node to another from time to time.  Adding extra delays to a
> > fail-over is not likely to be well received.
> > 
> > I don't *know* this cost would be unacceptable, and I *would* like to
> > leave it to the filesystem to decide how to manage its own i_version
> > values.  So maybe XFS can use the LSN for a salt.  If people notice the
> > extra cost, they can complain.
> 
> I'd expect complaints.
> 
> NFS is actually even worse than this: it allows clients to reacquire
> file locks across server restart and unmount/remount, even though
> obviously the kernel will do nothing to prevent someone else from
> locking (or modifying) the file in between.

I don't understand this comment.  You seem to be implying that changing
the i_version during a server restart would stop a client from
reclaiming locks.  Is that correct?
I would have thought that the client would largely ignore i_version
while it has a lock or open or delegation, as these tend to imply some
degree of exclusive access ("open" being least exclusive).

Thanks,
NeilBrown


> 
> Administrators are just supposed to know not to allow other applications
> access to the filesystem until nfsd's started.  It's always been this
> way.
> 
> You can imagine all sorts of measures to prevent that, and if anyone
> wants to work on ways to prevent people from shooting themselves in the
> foot here, great.
> 
> Just taking away the ability to cache or lock across reboots wouldn't
> make people happy, though....
> 
> --b.
>
NeilBrown Sept. 13, 2022, 11:24 p.m. UTC | #63
On Wed, 14 Sep 2022, Jeff Layton wrote:
>
> At that point, bumping i_version both before and after makes a bit more
> sense, since it better ensures that a change will be noticed, whether
> the related read op comes before or after the statx.

How does bumping it before make any sense at all?  Maybe it wouldn't
hurt much, but how does it help anyone at all?

  i_version must appear to change no sooner than the change it reflects
  becomes visible and no later than the request which initiated that
  change is acknowledged as complete.

Why would that definition ever not be satisfactory?

NeilBrown
J. Bruce Fields Sept. 14, 2022, 12:08 a.m. UTC | #64
On Wed, Sep 14, 2022 at 09:19:22AM +1000, NeilBrown wrote:
> On Wed, 14 Sep 2022, J. Bruce Fields wrote:
> > On Tue, Sep 13, 2022 at 11:49:03AM +1000, NeilBrown wrote:
> > > Invalidating the client cache on EVERY unmount/mount could impose
> > > unnecessary cost.  Imagine a client that caches a lot of data (several
> > > large files) from a server which is expected to fail-over from one
> > > cluster node to another from time to time.  Adding extra delays to a
> > > fail-over is not likely to be well received.
> > > 
> > > I don't *know* this cost would be unacceptable, and I *would* like to
> > > leave it to the filesystem to decide how to manage its own i_version
> > > values.  So maybe XFS can use the LSN for a salt.  If people notice the
> > > extra cost, they can complain.
> > 
> > I'd expect complaints.
> > 
> > NFS is actually even worse than this: it allows clients to reacquire
> > file locks across server restart and unmount/remount, even though
> > obviously the kernel will do nothing to prevent someone else from
> > locking (or modifying) the file in between.
> 
> I don't understand this comment.  You seem to be implying that changing
> the i_version during a server restart would stop a client from
> reclaiming locks.  Is that correct?

No, sorry, I'm probably being confusing.

I was just saying: we've always depended in a lot of ways on the
assumption that filesystems aren't messed with while nfsd's not running.
You can produce all sorts of incorrect behavior by violating that
assumption.  That tools might fool with unmounted filesystems is just
another such example, and fixing that wouldn't be very high on my list
of priorities.

??

--b.

> I would have thought that the client would largely ignore i_version
> while it has a lock or open or delegation, as these tend to imply some
> degree of exclusive access ("open" being least exclusive).
> 
> Thanks,
> NeilBrown
> 
> 
> > 
> > Administrators are just supposed to know not to allow other applications
> > access to the filesystem until nfsd's started.  It's always been this
> > way.
> > 
> > You can imagine all sorts of measures to prevent that, and if anyone
> > wants to work on ways to prevent people from shooting themselves in the
> > foot here, great.
> > 
> > Just taking away the ability to cache or lock across reboots wouldn't
> > make people happy, though....
> > 
> > --b.
> >
Jeff Layton Sept. 14, 2022, 11:51 a.m. UTC | #65
On Wed, 2022-09-14 at 09:24 +1000, NeilBrown wrote:
> On Wed, 14 Sep 2022, Jeff Layton wrote:
> > 
> > At that point, bumping i_version both before and after makes a bit more
> > sense, since it better ensures that a change will be noticed, whether
> > the related read op comes before or after the statx.
> 
> How does bumping it before make any sense at all?  Maybe it wouldn't
> hurt much, but how does it help anyone at all?
> 

My assumption (maybe wrong) was that timestamp updates were done before
the actual write by design. Does doing it before the write increase
the chances that the inode metadata writeout will get done in the same
physical I/O as the data write? IDK, just speculating here.

If there's no benefit to doing it before then we should just move it
afterward.


>   i_version must appear to change no sooner than the change it reflects
>   becomes visible and no later than the request which initiated that
>   change is acknowledged as complete.
> 
> Why would that definition ever not be satisfactory?

It's fine with me.
NeilBrown Sept. 14, 2022, 10:45 p.m. UTC | #66
On Wed, 14 Sep 2022, Jeff Layton wrote:
> On Wed, 2022-09-14 at 09:24 +1000, NeilBrown wrote:
> > On Wed, 14 Sep 2022, Jeff Layton wrote:
> > > 
> > > At that point, bumping i_version both before and after makes a bit more
> > > sense, since it better ensures that a change will be noticed, whether
> > > the related read op comes before or after the statx.
> > 
> > How does bumping it before make any sense at all?  Maybe it wouldn't
> > hurt much, but how does it help anyone at all?
> > 
> 
> My assumption (maybe wrong) was that timestamp updates were done before
> the actual write by design. Does doing it before the write increase
> the chances that the inode metadata writeout will get done in the same
> physical I/O as the data write? IDK, just speculating here.

When the code was written, the inode semaphore (before mutexes) was held
over the whole thing, and timestamp resolution was 1 second.  So
ordering didn't really matter.  Since then locking has been reduced and
precision increased but no-one saw any need to fix the ordering.  I
think that is fine for timestamps.

But i_version is about absolute precision, so we need to think carefully
about what meets our needs.

> 
> If there's no benefit to doing it before then we should just move it
> afterward.

Great!
Thanks,
NeilBrown
J. Bruce Fields Sept. 15, 2022, 2:06 p.m. UTC | #67
On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote:
> > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > 
> > > > The machine crashes and comes back up, and we get a query for i_version
> > > > and it comes back as X. Fine, it's an old version. Now there is a write.
> > > > What do we do to ensure that the new value doesn't collide with X+1? 
> > > 
> > > (I missed this bit in my earlier reply..)
> > > 
> > > How is it "Fine" to see an old version?
> > > The file could have changed without the version changing.
> > > And I thought one of the goals of the crash-count was to be able to
> > > provide a monotonic change id.
> > 
> > I was still mainly thinking about how to provide reliable close-to-open
> > semantics between NFS clients.  In the case the writer was an NFS
> > client, it wasn't done writing (or it would have COMMITted), so those
> > writes will come in and bump the change attribute soon, and as long as
> > we avoid the small chance of reusing an old change attribute, we're OK,
> > and I think it'd even still be OK to advertise
> > CHANGE_TYPE_IS_MONOTONIC_INCR.
> 
> You seem to be assuming that the client doesn't crash at the same time
> as the server (maybe they are both VMs on a host that lost power...)
> 
> If client A reads and caches, client B writes, the server crashes after
> writing some data (to already allocated space so no inode update needed)
> but before writing the new i_version, then client B crashes.
> When server comes back the i_version will be unchanged but the data has
> changed.  Client A will cache old data indefinitely...

I guess I assume that if all we're promising is close-to-open, then a
client isn't allowed to trust its cache in that situation.  Maybe that's
an overly draconian interpretation of close-to-open.

Also, I'm trying to think about how to improve things incrementally.
Incorporating something like a crash count into the on-disk i_version
fixes some cases without introducing any new ones or regressing
performance after a crash.

If we subsequently wanted to close those remaining holes, I think we'd
need the change attribute increment to be seen as atomic with respect to
its associated change, both to clients and (separately) on disk.  (That
would still allow the change attribute to go backwards after a crash, to
the value it held as of the on-disk state of the file.  I think clients
should be able to deal with that case.)

But, I don't know, maybe a bigger hammer would be OK:

> I think we need to require the filesystem to ensure that the i_version
> is seen to increase shortly after any change becomes visible in the
> file, and no later than the moment when the request that initiated the
> change is acknowledged as being complete.  In the case of an unclean
> restart, any file that is not known to have been unchanged immediately
> before the crash must have i_version increased.
> 
> The simplest implementation is to have an unclean-restart counter and to
> always included this multiplied by some constant X in the reported
> i_version.  The filesystem guarantees to record (e.g.  to journal
> at least) the i_version if it comes close to X more than the previous
> record.  The filesystem gets to choose X.

So the question is whether people can live with invalidating all client
caches after a cache.  I don't know.

> A more complex solution would be to record (similar to the way orphans
> are recorded) any file which is open for write, and to add X to the
> i_version for any "dirty" file still recorded during an unclean
> restart.  This would avoid bumping the i_version for read-only files.

Is that practical?  Working out the performance tradeoffs sounds like a
project.

> There may be other solutions, but we should leave that up to the
> filesystem.  Each filesystem might choose something different.

Sure.

--b.
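
For illustration, the "unclean-restart counter times a constant X"
scheme quoted above might look roughly like this. All the names and
the value of X are invented; the filesystem would pick its own X and
its own way of persisting the counter:

#define IVERS_X	(1ULL << 40)	/* example X; the filesystem's choice */

/*
 * Reported value = on-disk counter + (unclean restarts * X), where
 * s_unclean_restarts is a hypothetical superblock field.  To keep
 * this safe, the filesystem must persist (journal) the counter
 * whenever it gets close to X more than the last value it recorded,
 * so that adding another X after an unclean restart always exceeds
 * anything previously handed out.
 */
static u64 foo_report_iversion(const struct super_block *sb, u64 disk_version)
{
	return disk_version + (u64)sb->s_unclean_restarts * IVERS_X;
}
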
Trond Myklebust Sept. 15, 2022, 3:08 p.m. UTC | #68
On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote:
> > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > 
> > > > > The machine crashes and comes back up, and we get a query for
> > > > > i_version
> > > > > and it comes back as X. Fine, it's an old version. Now there
> > > > > is a write.
> > > > > What do we do to ensure that the new value doesn't collide
> > > > > with X+1? 
> > > > 
> > > > (I missed this bit in my earlier reply..)
> > > > 
> > > > How is it "Fine" to see an old version?
> > > > The file could have changed without the version changing.
> > > > And I thought one of the goals of the crash-count was to be
> > > > able to
> > > > provide a monotonic change id.
> > > 
> > > I was still mainly thinking about how to provide reliable close-
> > > to-open
> > > semantics between NFS clients.  In the case the writer was an NFS
> > > client, it wasn't done writing (or it would have COMMITted), so
> > > those
> > > writes will come in and bump the change attribute soon, and as
> > > long as
> > > we avoid the small chance of reusing an old change attribute,
> > > we're OK,
> > > and I think it'd even still be OK to advertise
> > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > 
> > You seem to be assuming that the client doesn't crash at the same
> > time
> > as the server (maybe they are both VMs on a host that lost
> > power...)
> > 
> > If client A reads and caches, client B writes, the server crashes
> > after
> > writing some data (to already allocated space so no inode update
> > needed)
> > but before writing the new i_version, then client B crashes.
> > When server comes back the i_version will be unchanged but the data
> > has
> > changed.  Client A will cache old data indefinitely...
> 
> I guess I assume that if all we're promising is close-to-open, then a
> client isn't allowed to trust its cache in that situation.  Maybe
> that's
> an overly draconian interpretation of close-to-open.
> 
> Also, I'm trying to think about how to improve things incrementally.
> Incorporating something like a crash count into the on-disk i_version
> fixes some cases without introducing any new ones or regressing
> performance after a crash.
> 
> If we subsequently wanted to close those remaining holes, I think
> we'd
> need the change attribute increment to be seen as atomic with respect
> to
> its associated change, both to clients and (separately) on disk. 
> (That
> would still allow the change attribute to go backwards after a crash,
> to
> the value it held as of the on-disk state of the file.  I think
> clients
> should be able to deal with that case.)
> 
> But, I don't know, maybe a bigger hammer would be OK:
> 

If you're not going to meet the minimum bar of data integrity, then
this whole exercise is just a massive waste of everyone's time. The
answer then going forward is just to recommend never using Linux as an
NFS server. Makes my life much easier, because I no longer have to
debug any of the issues.

>
Jeff Layton Sept. 15, 2022, 3:41 p.m. UTC | #69
On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote:
> > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > 
> > > > > The machine crashes and comes back up, and we get a query for i_version
> > > > > and it comes back as X. Fine, it's an old version. Now there is a write.
> > > > > What do we do to ensure that the new value doesn't collide with X+1? 
> > > > 
> > > > (I missed this bit in my earlier reply..)
> > > > 
> > > > How is it "Fine" to see an old version?
> > > > The file could have changed without the version changing.
> > > > And I thought one of the goals of the crash-count was to be able to
> > > > provide a monotonic change id.
> > > 
> > > I was still mainly thinking about how to provide reliable close-to-open
> > > semantics between NFS clients.  In the case the writer was an NFS
> > > client, it wasn't done writing (or it would have COMMITted), so those
> > > writes will come in and bump the change attribute soon, and as long as
> > > we avoid the small chance of reusing an old change attribute, we're OK,
> > > and I think it'd even still be OK to advertise
> > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > 
> > You seem to be assuming that the client doesn't crash at the same time
> > as the server (maybe they are both VMs on a host that lost power...)
> > 
> > If client A reads and caches, client B writes, the server crashes after
> > writing some data (to already allocated space so no inode update needed)
> > but before writing the new i_version, then client B crashes.
> > When server comes back the i_version will be unchanged but the data has
> > changed.  Client A will cache old data indefinitely...
> 
> I guess I assume that if all we're promising is close-to-open, then a
> client isn't allowed to trust its cache in that situation.  Maybe that's
> an overly draconian interpretation of close-to-open.
> 
> Also, I'm trying to think about how to improve things incrementally.
> Incorporating something like a crash count into the on-disk i_version
> fixes some cases without introducing any new ones or regressing
> performance after a crash.
> 

I think we ought to start there.

> If we subsequently wanted to close those remaining holes, I think we'd
> need the change attribute increment to be seen as atomic with respect to
> its associated change, both to clients and (separately) on disk.  (That
> would still allow the change attribute to go backwards after a crash, to
> the value it held as of the on-disk state of the file.  I think clients
> should be able to deal with that case.)
> 
> But, I don't know, maybe a bigger hammer would be OK:
> 
> > I think we need to require the filesystem to ensure that the i_version
> > is seen to increase shortly after any change becomes visible in the
> > file, and no later than the moment when the request that initiated the
> > change is acknowledged as being complete.  In the case of an unclean
> > restart, any file that is not known to have been unchanged immediately
> > before the crash must have i_version increased.
> > 
> > The simplest implementation is to have an unclean-restart counter and to
> > always included this multiplied by some constant X in the reported
> > i_version.  The filesystem guarantees to record (e.g.  to journal
> > at least) the i_version if it comes close to X more than the previous
> > record.  The filesystem gets to choose X.
>
> So the question is whether people can live with invalidating all client
> caches after a cache.  I don't know.
> 

I assume you mean "after a crash". Yeah, that is pretty nasty. We don't
get perfect crash resilience with incorporating this into the on-disk
value, but I like that better than factoring it in at presentation time.

That would mean that the servers would end up getting hammered with read
activity after a crash (at least in some environments). I don't think
that would be worth the tradeoff. There's a real benefit to preserving
caches when we can.

> > A more complex solution would be to record (similar to the way orphans
> > are recorded) any file which is open for write, and to add X to the
> > i_version for any "dirty" file still recorded during an unclean
> > restart.  This would avoid bumping the i_version for read-only files.
> 
> Is that practical?  Working out the performance tradeoffs sounds like a
> project.
>
> 
> > There may be other solutions, but we should leave that up to the
> > filesystem.  Each filesystem might choose something different.
> 
> Sure.
> 

Agreed here too. I think we need to allow for some flexibility here. 

Here's what I'm thinking:

We'll carve out the upper 16 bits of the i_version counter to be the
crash counter field. That gives us 64k crashes before we have to worry
about collisions. Hopefully the remaining 47 bits of counter will be
plenty, given that we only increment it when it has been queried and
something else has changed. (Can we mitigate wrapping here somehow?)

The easiest way to do this would be to add a u16 s_crash_counter to
struct super_block. We'd initialize that to 0, and the filesystem could
fill that value out at mount time.

Then inode_maybe_inc_iversion can just shift the s_crash_counter value
left by 48 bits and plop it into the top of the value we're preparing
to cmpxchg into place.

This is backward compatible too, at least for i_version counter values
that are <2^47. With anything larger, we might end up with something
going backward and a possible collision, but it's (hopefully) a small
risk.
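
Concretely, the composition I'm describing would look something like the
sketch below. This is illustrative only -- the macro names and helper are
made up, and this is not what iversion.h does today: bit 0 is the
"queried" flag, bits 1-47 hold the change counter, and bits 48-63 hold
the crash counter.

#include <linux/types.h>

#define IV_QUERIED		(1ULL << 0)
#define IV_COUNTER_SHIFT	1
#define IV_COUNTER_MASK		((1ULL << 47) - 1)
#define IV_CRASH_SHIFT		48

/* sketch: compose crash counter + change counter + queried flag */
static u64 iv_compose(u16 crash_counter, u64 counter, bool queried)
{
	return ((u64)crash_counter << IV_CRASH_SHIFT) |
	       ((counter & IV_COUNTER_MASK) << IV_COUNTER_SHIFT) |
	       (queried ? IV_QUERIED : 0);
}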
Jeff Layton Sept. 15, 2022, 4:45 p.m. UTC | #70
On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote:
> On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote:
> > > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > > 
> > > > > > The machine crashes and comes back up, and we get a query for
> > > > > > i_version
> > > > > > and it comes back as X. Fine, it's an old version. Now there
> > > > > > is a write.
> > > > > > What do we do to ensure that the new value doesn't collide
> > > > > > with X+1? 
> > > > > 
> > > > > (I missed this bit in my earlier reply..)
> > > > > 
> > > > > How is it "Fine" to see an old version?
> > > > > The file could have changed without the version changing.
> > > > > And I thought one of the goals of the crash-count was to be
> > > > > able to
> > > > > provide a monotonic change id.
> > > > 
> > > > I was still mainly thinking about how to provide reliable close-
> > > > to-open
> > > > semantics between NFS clients.  In the case the writer was an NFS
> > > > client, it wasn't done writing (or it would have COMMITted), so
> > > > those
> > > > writes will come in and bump the change attribute soon, and as
> > > > long as
> > > > we avoid the small chance of reusing an old change attribute,
> > > > we're OK,
> > > > and I think it'd even still be OK to advertise
> > > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > > 
> > > You seem to be assuming that the client doesn't crash at the same
> > > time
> > > as the server (maybe they are both VMs on a host that lost
> > > power...)
> > > 
> > > If client A reads and caches, client B writes, the server crashes
> > > after
> > > writing some data (to already allocated space so no inode update
> > > needed)
> > > but before writing the new i_version, then client B crashes.
> > > When server comes back the i_version will be unchanged but the data
> > > has
> > > changed.  Client A will cache old data indefinitely...
> > 
> > I guess I assume that if all we're promising is close-to-open, then a
> > client isn't allowed to trust its cache in that situation.  Maybe
> > that's
> > an overly draconian interpretation of close-to-open.
> > 
> > Also, I'm trying to think about how to improve things incrementally.
> > Incorporating something like a crash count into the on-disk i_version
> > fixes some cases without introducing any new ones or regressing
> > performance after a crash.
> > 
> > If we subsequently wanted to close those remaining holes, I think
> > we'd
> > need the change attribute increment to be seen as atomic with respect
> > to
> > its associated change, both to clients and (separately) on disk. 
> > (That
> > would still allow the change attribute to go backwards after a crash,
> > to
> > the value it held as of the on-disk state of the file.  I think
> > clients
> > should be able to deal with that case.)
> > 
> > But, I don't know, maybe a bigger hammer would be OK:
> > 
> 
> If you're not going to meet the minimum bar of data integrity, then
> this whole exercise is just a massive waste of everyone's time. The
> answer then going forward is just to recommend never using Linux as an
> NFS server. Makes my life much easier, because I no longer have to
> debug any of the issues.
> 
> 

To be clear, you believe any scheme that would allow the client to see
an old change attr after a crash is insufficient?

The only way I can see to fix that (at least with only a crash counter)
would be to factor it in at presentation time like Neil suggested.
Basically we'd just mask off the top 16 bits and plop the crash counter
in there before presenting it.

In principle, I suppose we could do that at the nfsd level as well (and
that might be the simplest way to fix this). We probably wouldn't be
able to advertise a change attr type of MONOTONIC with this scheme
though.
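
Roughly, the nfsd-level variant would be something like the sketch below
(the helper is made up, and where the crash counter comes from is left
open here):

/*
 * Sketch only: keep the low 48 bits of the filesystem's change counter
 * and put the crash counter in the top 16 bits at presentation time.
 */
static u64 nfsd_present_change_attr(u64 i_version, u16 crash_counter)
{
	return (i_version & ((1ULL << 48) - 1)) | ((u64)crash_counter << 48);
}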
Trond Myklebust Sept. 15, 2022, 5:49 p.m. UTC | #71
On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote:
> On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote:
> > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote:
> > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > > > 
> > > > > > > The machine crashes and comes back up, and we get a query
> > > > > > > for
> > > > > > > i_version
> > > > > > > and it comes back as X. Fine, it's an old version. Now
> > > > > > > there
> > > > > > > is a write.
> > > > > > > What do we do to ensure that the new value doesn't
> > > > > > > collide
> > > > > > > with X+1? 
> > > > > > 
> > > > > > (I missed this bit in my earlier reply..)
> > > > > > 
> > > > > > How is it "Fine" to see an old version?
> > > > > > The file could have changed without the version changing.
> > > > > > And I thought one of the goals of the crash-count was to be
> > > > > > able to
> > > > > > provide a monotonic change id.
> > > > > 
> > > > > I was still mainly thinking about how to provide reliable
> > > > > close-
> > > > > to-open
> > > > > semantics between NFS clients.  In the case the writer was an
> > > > > NFS
> > > > > client, it wasn't done writing (or it would have COMMITted),
> > > > > so
> > > > > those
> > > > > writes will come in and bump the change attribute soon, and
> > > > > as
> > > > > long as
> > > > > we avoid the small chance of reusing an old change attribute,
> > > > > we're OK,
> > > > > and I think it'd even still be OK to advertise
> > > > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > > > 
> > > > You seem to be assuming that the client doesn't crash at the
> > > > same
> > > > time
> > > > as the server (maybe they are both VMs on a host that lost
> > > > power...)
> > > > 
> > > > If client A reads and caches, client B writes, the server
> > > > crashes
> > > > after
> > > > writing some data (to already allocated space so no inode
> > > > update
> > > > needed)
> > > > but before writing the new i_version, then client B crashes.
> > > > When server comes back the i_version will be unchanged but the
> > > > data
> > > > has
> > > > changed.  Client A will cache old data indefinitely...
> > > 
> > > I guess I assume that if all we're promising is close-to-open,
> > > then a
> > > client isn't allowed to trust its cache in that situation.  Maybe
> > > that's
> > > an overly draconian interpretation of close-to-open.
> > > 
> > > Also, I'm trying to think about how to improve things
> > > incrementally.
> > > Incorporating something like a crash count into the on-disk
> > > i_version
> > > fixes some cases without introducing any new ones or regressing
> > > performance after a crash.
> > > 
> > > If we subsequently wanted to close those remaining holes, I think
> > > we'd
> > > need the change attribute increment to be seen as atomic with
> > > respect
> > > to
> > > its associated change, both to clients and (separately) on disk. 
> > > (That
> > > would still allow the change attribute to go backwards after a
> > > crash,
> > > to
> > > the value it held as of the on-disk state of the file.  I think
> > > clients
> > > should be able to deal with that case.)
> > > 
> > > But, I don't know, maybe a bigger hammer would be OK:
> > > 
> > 
> > If you're not going to meet the minimum bar of data integrity, then
> > this whole exercise is just a massive waste of everyone's time. The
> > answer then going forward is just to recommend never using Linux as
> > an
> > NFS server. Makes my life much easier, because I no longer have to
> > debug any of the issues.
> > 
> > 
> 
> To be clear, you believe any scheme that would allow the client to
> see
> an old change attr after a crash is insufficient?
> 

Correct. If an NFSv4 client or userspace application cannot trust that
it will always see a change to the change attribute value when the file
data changes, then you will eventually see data corruption due to the
cached data no longer matching the stored data.

A false positive update of the change attribute (i.e. a case where the
change attribute changes despite the data/metadata staying the same) is
not desirable because it causes performance issues, but false negatives
are far worse because they mean your data backup, cache, etc... are not
consistent. Applications that have strong consistency requirements will
have no option but to revalidate by always reading the entire file data
+ metadata.

> The only way I can see to fix that (at least with only a crash
> counter)
> would be to factor it in at presentation time like Neil suggested.
> Basically we'd just mask off the top 16 bits and plop the crash
> counter
> in there before presenting it.
> 
> In principle, I suppose we could do that at the nfsd level as well
> (and
> that might be the simplest way to fix this). We probably wouldn't be
> able to advertise a change attr type of MONOTONIC with this scheme
> though.

Why would you want to limit the crash counter to 16 bits?
Jeff Layton Sept. 15, 2022, 6:11 p.m. UTC | #72
On Thu, 2022-09-15 at 17:49 +0000, Trond Myklebust wrote:
> On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote:
> > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote:
> > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote:
> > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > > > > 
> > > > > > > > The machine crashes and comes back up, and we get a query
> > > > > > > > for
> > > > > > > > i_version
> > > > > > > > and it comes back as X. Fine, it's an old version. Now
> > > > > > > > there
> > > > > > > > is a write.
> > > > > > > > What do we do to ensure that the new value doesn't
> > > > > > > > collide
> > > > > > > > with X+1? 
> > > > > > > 
> > > > > > > (I missed this bit in my earlier reply..)
> > > > > > > 
> > > > > > > How is it "Fine" to see an old version?
> > > > > > > The file could have changed without the version changing.
> > > > > > > And I thought one of the goals of the crash-count was to be
> > > > > > > able to
> > > > > > > provide a monotonic change id.
> > > > > > 
> > > > > > I was still mainly thinking about how to provide reliable
> > > > > > close-
> > > > > > to-open
> > > > > > semantics between NFS clients.  In the case the writer was an
> > > > > > NFS
> > > > > > client, it wasn't done writing (or it would have COMMITted),
> > > > > > so
> > > > > > those
> > > > > > writes will come in and bump the change attribute soon, and
> > > > > > as
> > > > > > long as
> > > > > > we avoid the small chance of reusing an old change attribute,
> > > > > > we're OK,
> > > > > > and I think it'd even still be OK to advertise
> > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > > > > 
> > > > > You seem to be assuming that the client doesn't crash at the
> > > > > same
> > > > > time
> > > > > as the server (maybe they are both VMs on a host that lost
> > > > > power...)
> > > > > 
> > > > > If client A reads and caches, client B writes, the server
> > > > > crashes
> > > > > after
> > > > > writing some data (to already allocated space so no inode
> > > > > update
> > > > > needed)
> > > > > but before writing the new i_version, then client B crashes.
> > > > > When server comes back the i_version will be unchanged but the
> > > > > data
> > > > > has
> > > > > changed.  Client A will cache old data indefinitely...
> > > > 
> > > > I guess I assume that if all we're promising is close-to-open,
> > > > then a
> > > > client isn't allowed to trust its cache in that situation.  Maybe
> > > > that's
> > > > an overly draconian interpretation of close-to-open.
> > > > 
> > > > Also, I'm trying to think about how to improve things
> > > > incrementally.
> > > > Incorporating something like a crash count into the on-disk
> > > > i_version
> > > > fixes some cases without introducing any new ones or regressing
> > > > performance after a crash.
> > > > 
> > > > If we subsequently wanted to close those remaining holes, I think
> > > > we'd
> > > > need the change attribute increment to be seen as atomic with
> > > > respect
> > > > to
> > > > its associated change, both to clients and (separately) on disk. 
> > > > (That
> > > > would still allow the change attribute to go backwards after a
> > > > crash,
> > > > to
> > > > the value it held as of the on-disk state of the file.  I think
> > > > clients
> > > > should be able to deal with that case.)
> > > > 
> > > > But, I don't know, maybe a bigger hammer would be OK:
> > > > 
> > > 
> > > If you're not going to meet the minimum bar of data integrity, then
> > > this whole exercise is just a massive waste of everyone's time. The
> > > answer then going forward is just to recommend never using Linux as
> > > an
> > > NFS server. Makes my life much easier, because I no longer have to
> > > debug any of the issues.
> > > 
> > > 
> > 
> > To be clear, you believe any scheme that would allow the client to
> > see
> > an old change attr after a crash is insufficient?
> > 
> 
> Correct. If a NFSv4 client or userspace application cannot trust that
> it will always see a change to the change attribute value when the file
> data changes, then you will eventually see data corruption due to the
> cached data no longer matching the stored data.
> 
> A false positive update of the change attribute (i.e. a case where the
> change attribute changes despite the data/metadata staying the same) is
> not desirable because it causes performance issues, but false negatives
> are far worse because they mean your data backup, cache, etc... are not
> consistent. Applications that have strong consistency requirements will
> have no option but to revalidate by always reading the entire file data
> + metadata.
> 
> > The only way I can see to fix that (at least with only a crash
> > counter)
> > would be to factor it in at presentation time like Neil suggested.
> > Basically we'd just mask off the top 16 bits and plop the crash
> > counter
> > in there before presenting it.
> > 
> > In principle, I suppose we could do that at the nfsd level as well
> > (and
> > that might be the simplest way to fix this). We probably wouldn't be
> > able to advertise a change attr type of MONOTONIC with this scheme
> > though.
> 
> Why would you want to limit the crash counter to 16 bits?
> 

To leave more room for the "real" counter. Otherwise, an inode that gets
frequent writes after a long period of no crashes could experience the
counter wrap.

IOW, we have 63 bits to play with. Whatever part we dedicate to the
crash counter will not be available for the actual version counter.

I'm proposing a 16+47+1 split, but I'm happy to hear arguments for a
different one.
Trond Myklebust Sept. 15, 2022, 7:03 p.m. UTC | #73
On Thu, 2022-09-15 at 14:11 -0400, Jeff Layton wrote:
> On Thu, 2022-09-15 at 17:49 +0000, Trond Myklebust wrote:
> > On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote:
> > > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote:
> > > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> > > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown
> > > > > > > wrote:
> > > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > > > > > 
> > > > > > > > > The machine crashes and comes back up, and we get a
> > > > > > > > > query
> > > > > > > > > for
> > > > > > > > > i_version
> > > > > > > > > and it comes back as X. Fine, it's an old version.
> > > > > > > > > Now
> > > > > > > > > there
> > > > > > > > > is a write.
> > > > > > > > > What do we do to ensure that the new value doesn't
> > > > > > > > > collide
> > > > > > > > > with X+1? 
> > > > > > > > 
> > > > > > > > (I missed this bit in my earlier reply..)
> > > > > > > > 
> > > > > > > > How is it "Fine" to see an old version?
> > > > > > > > The file could have changed without the version
> > > > > > > > changing.
> > > > > > > > And I thought one of the goals of the crash-count was
> > > > > > > > to be
> > > > > > > > able to
> > > > > > > > provide a monotonic change id.
> > > > > > > 
> > > > > > > I was still mainly thinking about how to provide reliable
> > > > > > > close-
> > > > > > > to-open
> > > > > > > semantics between NFS clients.  In the case the writer
> > > > > > > was an
> > > > > > > NFS
> > > > > > > client, it wasn't done writing (or it would have
> > > > > > > COMMITted),
> > > > > > > so
> > > > > > > those
> > > > > > > writes will come in and bump the change attribute soon,
> > > > > > > and
> > > > > > > as
> > > > > > > long as
> > > > > > > we avoid the small chance of reusing an old change
> > > > > > > attribute,
> > > > > > > we're OK,
> > > > > > > and I think it'd even still be OK to advertise
> > > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > > > > > 
> > > > > > You seem to be assuming that the client doesn't crash at
> > > > > > the
> > > > > > same
> > > > > > time
> > > > > > as the server (maybe they are both VMs on a host that lost
> > > > > > power...)
> > > > > > 
> > > > > > If client A reads and caches, client B writes, the server
> > > > > > crashes
> > > > > > after
> > > > > > writing some data (to already allocated space so no inode
> > > > > > update
> > > > > > needed)
> > > > > > but before writing the new i_version, then client B
> > > > > > crashes.
> > > > > > When server comes back the i_version will be unchanged but
> > > > > > the
> > > > > > data
> > > > > > has
> > > > > > changed.  Client A will cache old data indefinitely...
> > > > > 
> > > > > I guess I assume that if all we're promising is close-to-
> > > > > open,
> > > > > then a
> > > > > client isn't allowed to trust its cache in that situation. 
> > > > > Maybe
> > > > > that's
> > > > > an overly draconian interpretation of close-to-open.
> > > > > 
> > > > > Also, I'm trying to think about how to improve things
> > > > > incrementally.
> > > > > Incorporating something like a crash count into the on-disk
> > > > > i_version
> > > > > fixes some cases without introducing any new ones or
> > > > > regressing
> > > > > performance after a crash.
> > > > > 
> > > > > If we subsequently wanted to close those remaining holes, I
> > > > > think
> > > > > we'd
> > > > > need the change attribute increment to be seen as atomic with
> > > > > respect
> > > > > to
> > > > > its associated change, both to clients and (separately) on
> > > > > disk. 
> > > > > (That
> > > > > would still allow the change attribute to go backwards after
> > > > > a
> > > > > crash,
> > > > > to
> > > > > the value it held as of the on-disk state of the file.  I
> > > > > think
> > > > > clients
> > > > > should be able to deal with that case.)
> > > > > 
> > > > > But, I don't know, maybe a bigger hammer would be OK:
> > > > > 
> > > > 
> > > > If you're not going to meet the minimum bar of data integrity,
> > > > then
> > > > this whole exercise is just a massive waste of everyone's time.
> > > > The
> > > > answer then going forward is just to recommend never using
> > > > Linux as
> > > > an
> > > > NFS server. Makes my life much easier, because I no longer have
> > > > to
> > > > debug any of the issues.
> > > > 
> > > > 
> > > 
> > > To be clear, you believe any scheme that would allow the client
> > > to
> > > see
> > > an old change attr after a crash is insufficient?
> > > 
> > 
> > Correct. If a NFSv4 client or userspace application cannot trust
> > that
> > it will always see a change to the change attribute value when the
> > file
> > data changes, then you will eventually see data corruption due to
> > the
> > cached data no longer matching the stored data.
> > 
> > A false positive update of the change attribute (i.e. a case where
> > the
> > change attribute changes despite the data/metadata staying the
> > same) is
> > not desirable because it causes performance issues, but false
> > negatives
> > are far worse because they mean your data backup, cache, etc... are
> > not
> > consistent. Applications that have strong consistency requirements
> > will
> > have no option but to revalidate by always reading the entire file
> > data
> > + metadata.
> > 
> > > The only way I can see to fix that (at least with only a crash
> > > counter)
> > > would be to factor it in at presentation time like Neil
> > > suggested.
> > > Basically we'd just mask off the top 16 bits and plop the crash
> > > counter
> > > in there before presenting it.
> > > 
> > > In principle, I suppose we could do that at the nfsd level as
> > > well
> > > (and
> > > that might be the simplest way to fix this). We probably wouldn't
> > > be
> > > able to advertise a change attr type of MONOTONIC with this
> > > scheme
> > > though.
> > 
> > Why would you want to limit the crash counter to 16 bits?
> > 
> 
> To leave more room for the "real" counter. Otherwise, an inode that
> gets
> frequent writes after a long period of no crashes could experience
> the
> counter wrap.
> 
> IOW, we have 63 bits to play with. Whatever part we dedicate to the
> crash counter will not be available for the actual version counter.
> 
> I'm proposing a 16+47+1 split, but I'm happy to hear arguments for a
> different one.


What is the expectation when you have an unclean shutdown or crash? Do
all change attribute values get updated to reflect the new crash
counter value, or only some?

If the answer is that 'all values change', then why store the crash
counter in the inode at all? Why not just add it as an offset when
you're generating the user-visible change attribute?

i.e. statx.change_attr = inode->i_version + (crash counter * offset)

(where offset is chosen to be larger than the max number of
inode->i_version updates that could get lost by an inode in a crash).

Presumably that offset could be significantly smaller than 2^63...
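
To put made-up numbers on that formula: suppose offset = 4096 and the
crash counter goes from 2 to 3 across an unclean restart. If the on-disk
i_version was 1000 but the in-memory value had reached 1003 before the
crash, clients had seen at most 1003 + 2*4096 = 9195 beforehand; after
the restart the same file presents at least 1000 + 3*4096 = 13288. So
the visible value still moves forward, as long as fewer than 4096
increments were lost in the crash.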
Jeff Layton Sept. 15, 2022, 7:25 p.m. UTC | #74
On Thu, 2022-09-15 at 19:03 +0000, Trond Myklebust wrote:
> On Thu, 2022-09-15 at 14:11 -0400, Jeff Layton wrote:
> > On Thu, 2022-09-15 at 17:49 +0000, Trond Myklebust wrote:
> > > On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote:
> > > > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote:
> > > > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> > > > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > > > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown
> > > > > > > > wrote:
> > > > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > > > > > > 
> > > > > > > > > > The machine crashes and comes back up, and we get a
> > > > > > > > > > query
> > > > > > > > > > for
> > > > > > > > > > i_version
> > > > > > > > > > and it comes back as X. Fine, it's an old version.
> > > > > > > > > > Now
> > > > > > > > > > there
> > > > > > > > > > is a write.
> > > > > > > > > > What do we do to ensure that the new value doesn't
> > > > > > > > > > collide
> > > > > > > > > > with X+1? 
> > > > > > > > > 
> > > > > > > > > (I missed this bit in my earlier reply..)
> > > > > > > > > 
> > > > > > > > > How is it "Fine" to see an old version?
> > > > > > > > > The file could have changed without the version
> > > > > > > > > changing.
> > > > > > > > > And I thought one of the goals of the crash-count was
> > > > > > > > > to be
> > > > > > > > > able to
> > > > > > > > > provide a monotonic change id.
> > > > > > > > 
> > > > > > > > I was still mainly thinking about how to provide reliable
> > > > > > > > close-
> > > > > > > > to-open
> > > > > > > > semantics between NFS clients.  In the case the writer
> > > > > > > > was an
> > > > > > > > NFS
> > > > > > > > client, it wasn't done writing (or it would have
> > > > > > > > COMMITted),
> > > > > > > > so
> > > > > > > > those
> > > > > > > > writes will come in and bump the change attribute soon,
> > > > > > > > and
> > > > > > > > as
> > > > > > > > long as
> > > > > > > > we avoid the small chance of reusing an old change
> > > > > > > > attribute,
> > > > > > > > we're OK,
> > > > > > > > and I think it'd even still be OK to advertise
> > > > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > > > > > > 
> > > > > > > You seem to be assuming that the client doesn't crash at
> > > > > > > the
> > > > > > > same
> > > > > > > time
> > > > > > > as the server (maybe they are both VMs on a host that lost
> > > > > > > power...)
> > > > > > > 
> > > > > > > If client A reads and caches, client B writes, the server
> > > > > > > crashes
> > > > > > > after
> > > > > > > writing some data (to already allocated space so no inode
> > > > > > > update
> > > > > > > needed)
> > > > > > > but before writing the new i_version, then client B
> > > > > > > crashes.
> > > > > > > When server comes back the i_version will be unchanged but
> > > > > > > the
> > > > > > > data
> > > > > > > has
> > > > > > > changed.  Client A will cache old data indefinitely...
> > > > > > 
> > > > > > I guess I assume that if all we're promising is close-to-
> > > > > > open,
> > > > > > then a
> > > > > > client isn't allowed to trust its cache in that situation. 
> > > > > > Maybe
> > > > > > that's
> > > > > > an overly draconian interpretation of close-to-open.
> > > > > > 
> > > > > > Also, I'm trying to think about how to improve things
> > > > > > incrementally.
> > > > > > Incorporating something like a crash count into the on-disk
> > > > > > i_version
> > > > > > fixes some cases without introducing any new ones or
> > > > > > regressing
> > > > > > performance after a crash.
> > > > > > 
> > > > > > If we subsequently wanted to close those remaining holes, I
> > > > > > think
> > > > > > we'd
> > > > > > need the change attribute increment to be seen as atomic with
> > > > > > respect
> > > > > > to
> > > > > > its associated change, both to clients and (separately) on
> > > > > > disk. 
> > > > > > (That
> > > > > > would still allow the change attribute to go backwards after
> > > > > > a
> > > > > > crash,
> > > > > > to
> > > > > > the value it held as of the on-disk state of the file.  I
> > > > > > think
> > > > > > clients
> > > > > > should be able to deal with that case.)
> > > > > > 
> > > > > > But, I don't know, maybe a bigger hammer would be OK:
> > > > > > 
> > > > > 
> > > > > If you're not going to meet the minimum bar of data integrity,
> > > > > then
> > > > > this whole exercise is just a massive waste of everyone's time.
> > > > > The
> > > > > answer then going forward is just to recommend never using
> > > > > Linux as
> > > > > an
> > > > > NFS server. Makes my life much easier, because I no longer have
> > > > > to
> > > > > debug any of the issues.
> > > > > 
> > > > > 
> > > > 
> > > > To be clear, you believe any scheme that would allow the client
> > > > to
> > > > see
> > > > an old change attr after a crash is insufficient?
> > > > 
> > > 
> > > Correct. If a NFSv4 client or userspace application cannot trust
> > > that
> > > it will always see a change to the change attribute value when the
> > > file
> > > data changes, then you will eventually see data corruption due to
> > > the
> > > cached data no longer matching the stored data.
> > > 
> > > A false positive update of the change attribute (i.e. a case where
> > > the
> > > change attribute changes despite the data/metadata staying the
> > > same) is
> > > not desirable because it causes performance issues, but false
> > > negatives
> > > are far worse because they mean your data backup, cache, etc... are
> > > not
> > > consistent. Applications that have strong consistency requirements
> > > will
> > > have no option but to revalidate by always reading the entire file
> > > data
> > > + metadata.
> > > 
> > > > The only way I can see to fix that (at least with only a crash
> > > > counter)
> > > > would be to factor it in at presentation time like Neil
> > > > suggested.
> > > > Basically we'd just mask off the top 16 bits and plop the crash
> > > > counter
> > > > in there before presenting it.
> > > > 
> > > > In principle, I suppose we could do that at the nfsd level as
> > > > well
> > > > (and
> > > > that might be the simplest way to fix this). We probably wouldn't
> > > > be
> > > > able to advertise a change attr type of MONOTONIC with this
> > > > scheme
> > > > though.
> > > 
> > > Why would you want to limit the crash counter to 16 bits?
> > > 
> > 
> > To leave more room for the "real" counter. Otherwise, an inode that
> > gets
> > frequent writes after a long period of no crashes could experience
> > the
> > counter wrap.
> > 
> > IOW, we have 63 bits to play with. Whatever part we dedicate to the
> > crash counter will not be available for the actual version counter.
> > 
> > I'm proposing a 16+47+1 split, but I'm happy to hear arguments for a
> > different one.
> 
> 
> What is the expectation when you have an unclean shutdown or crash? Do
> all change attribute values get updated to reflect the new crash
> counter value, or only some?
> 
> If the answer is that 'all values change', then why store the crash
> counter in the inode at all? Why not just add it as an offset when
> you're generating the user-visible change attribute?
> 
> i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> 
> (where offset is chosen to be larger than the max number of inode-
> > i_version updates that could get lost by an inode in a crash).
> 
> Presumably that offset could be significantly smaller than 2^63...
> 


Yes, if we plan to ensure that all the change attrs change after a
crash, we can do that.

So what would make sense for an offset? Maybe 2**12? One would hope that
there wouldn't be more than 4k increments before one of them made it to
disk. OTOH, maybe that can happen with teeny-tiny writes.

If we want to leave this up to the filesystem, I guess we could just add
a new struct super_block.s_version_offset field and let the filesystem
precompute that value and set it at mount time. Then we can just add
that in after querying i_version.
NeilBrown Sept. 15, 2022, 10:23 p.m. UTC | #75
On Fri, 16 Sep 2022, Jeff Layton wrote:
> On Thu, 2022-09-15 at 19:03 +0000, Trond Myklebust wrote:
> > On Thu, 2022-09-15 at 14:11 -0400, Jeff Layton wrote:
> > > On Thu, 2022-09-15 at 17:49 +0000, Trond Myklebust wrote:
> > > > On Thu, 2022-09-15 at 12:45 -0400, Jeff Layton wrote:
> > > > > On Thu, 2022-09-15 at 15:08 +0000, Trond Myklebust wrote:
> > > > > > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> > > > > > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > > > > > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > > > > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown
> > > > > > > > > wrote:
> > > > > > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > > > > > > > 
> > > > > > > > > > > The machine crashes and comes back up, and we get a
> > > > > > > > > > > query
> > > > > > > > > > > for
> > > > > > > > > > > i_version
> > > > > > > > > > > and it comes back as X. Fine, it's an old version.
> > > > > > > > > > > Now
> > > > > > > > > > > there
> > > > > > > > > > > is a write.
> > > > > > > > > > > What do we do to ensure that the new value doesn't
> > > > > > > > > > > collide
> > > > > > > > > > > with X+1? 
> > > > > > > > > > 
> > > > > > > > > > (I missed this bit in my earlier reply..)
> > > > > > > > > > 
> > > > > > > > > > How is it "Fine" to see an old version?
> > > > > > > > > > The file could have changed without the version
> > > > > > > > > > changing.
> > > > > > > > > > And I thought one of the goals of the crash-count was
> > > > > > > > > > to be
> > > > > > > > > > able to
> > > > > > > > > > provide a monotonic change id.
> > > > > > > > > 
> > > > > > > > > I was still mainly thinking about how to provide reliable
> > > > > > > > > close-
> > > > > > > > > to-open
> > > > > > > > > semantics between NFS clients.  In the case the writer
> > > > > > > > > was an
> > > > > > > > > NFS
> > > > > > > > > client, it wasn't done writing (or it would have
> > > > > > > > > COMMITted),
> > > > > > > > > so
> > > > > > > > > those
> > > > > > > > > writes will come in and bump the change attribute soon,
> > > > > > > > > and
> > > > > > > > > as
> > > > > > > > > long as
> > > > > > > > > we avoid the small chance of reusing an old change
> > > > > > > > > attribute,
> > > > > > > > > we're OK,
> > > > > > > > > and I think it'd even still be OK to advertise
> > > > > > > > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > > > > > > > 
> > > > > > > > You seem to be assuming that the client doesn't crash at
> > > > > > > > the
> > > > > > > > same
> > > > > > > > time
> > > > > > > > as the server (maybe they are both VMs on a host that lost
> > > > > > > > power...)
> > > > > > > > 
> > > > > > > > If client A reads and caches, client B writes, the server
> > > > > > > > crashes
> > > > > > > > after
> > > > > > > > writing some data (to already allocated space so no inode
> > > > > > > > update
> > > > > > > > needed)
> > > > > > > > but before writing the new i_version, then client B
> > > > > > > > crashes.
> > > > > > > > When server comes back the i_version will be unchanged but
> > > > > > > > the
> > > > > > > > data
> > > > > > > > has
> > > > > > > > changed.  Client A will cache old data indefinitely...
> > > > > > > 
> > > > > > > I guess I assume that if all we're promising is close-to-
> > > > > > > open,
> > > > > > > then a
> > > > > > > client isn't allowed to trust its cache in that situation. 
> > > > > > > Maybe
> > > > > > > that's
> > > > > > > an overly draconian interpretation of close-to-open.
> > > > > > > 
> > > > > > > Also, I'm trying to think about how to improve things
> > > > > > > incrementally.
> > > > > > > Incorporating something like a crash count into the on-disk
> > > > > > > i_version
> > > > > > > fixes some cases without introducing any new ones or
> > > > > > > regressing
> > > > > > > performance after a crash.
> > > > > > > 
> > > > > > > If we subsequently wanted to close those remaining holes, I
> > > > > > > think
> > > > > > > we'd
> > > > > > > need the change attribute increment to be seen as atomic with
> > > > > > > respect
> > > > > > > to
> > > > > > > its associated change, both to clients and (separately) on
> > > > > > > disk. 
> > > > > > > (That
> > > > > > > would still allow the change attribute to go backwards after
> > > > > > > a
> > > > > > > crash,
> > > > > > > to
> > > > > > > the value it held as of the on-disk state of the file.  I
> > > > > > > think
> > > > > > > clients
> > > > > > > should be able to deal with that case.)
> > > > > > > 
> > > > > > > But, I don't know, maybe a bigger hammer would be OK:
> > > > > > > 
> > > > > > 
> > > > > > If you're not going to meet the minimum bar of data integrity,
> > > > > > then
> > > > > > this whole exercise is just a massive waste of everyone's time.
> > > > > > The
> > > > > > answer then going forward is just to recommend never using
> > > > > > Linux as
> > > > > > an
> > > > > > NFS server. Makes my life much easier, because I no longer have
> > > > > > to
> > > > > > debug any of the issues.
> > > > > > 
> > > > > > 
> > > > > 
> > > > > To be clear, you believe any scheme that would allow the client
> > > > > to
> > > > > see
> > > > > an old change attr after a crash is insufficient?
> > > > > 
> > > > 
> > > > Correct. If a NFSv4 client or userspace application cannot trust
> > > > that
> > > > it will always see a change to the change attribute value when the
> > > > file
> > > > data changes, then you will eventually see data corruption due to
> > > > the
> > > > cached data no longer matching the stored data.
> > > > 
> > > > A false positive update of the change attribute (i.e. a case where
> > > > the
> > > > change attribute changes despite the data/metadata staying the
> > > > same) is
> > > > not desirable because it causes performance issues, but false
> > > > negatives
> > > > are far worse because they mean your data backup, cache, etc... are
> > > > not
> > > > consistent. Applications that have strong consistency requirements
> > > > will
> > > > have no option but to revalidate by always reading the entire file
> > > > data
> > > > + metadata.
> > > > 
> > > > > The only way I can see to fix that (at least with only a crash
> > > > > counter)
> > > > > would be to factor it in at presentation time like Neil
> > > > > suggested.
> > > > > Basically we'd just mask off the top 16 bits and plop the crash
> > > > > counter
> > > > > in there before presenting it.
> > > > > 
> > > > > In principle, I suppose we could do that at the nfsd level as
> > > > > well
> > > > > (and
> > > > > that might be the simplest way to fix this). We probably wouldn't
> > > > > be
> > > > > able to advertise a change attr type of MONOTONIC with this
> > > > > scheme
> > > > > though.
> > > > 
> > > > Why would you want to limit the crash counter to 16 bits?
> > > > 
> > > 
> > > To leave more room for the "real" counter. Otherwise, an inode that
> > > gets
> > > frequent writes after a long period of no crashes could experience
> > > the
> > > counter wrap.
> > > 
> > > IOW, we have 63 bits to play with. Whatever part we dedicate to the
> > > crash counter will not be available for the actual version counter.
> > > 
> > > I'm proposing a 16+47+1 split, but I'm happy to hear arguments for a
> > > different one.
> > 
> > 
> > What is the expectation when you have an unclean shutdown or crash? Do
> > all change attribute values get updated to reflect the new crash
> > counter value, or only some?
> > 
> > If the answer is that 'all values change', then why store the crash
> > counter in the inode at all? Why not just add it as an offset when
> > you're generating the user-visible change attribute?
> > 
> > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> > 
> > (where offset is chosen to be larger than the max number of inode-
> > > i_version updates that could get lost by an inode in a crash).
> > 
> > Presumably that offset could be significantly smaller than 2^63...
> > 
> 
> 
> Yes, if we plan to ensure that all the change attrs change after a
> crash, we can do that.
> 
> So what would make sense for an offset? Maybe 2**12? One would hope that
> there wouldn't be more than 4k increments before one of them made it to
> disk. OTOH, maybe that can happen with teeny-tiny writes.

Leave it up to the filesystem to decide.  The VFS and/or NFSD should
not have any part in calculating the i_version.  It should be entirely
in the filesystem - though support code could be provided if common
patterns exist across filesystems.

A filesystem *could* decide to ensure the on-disk i_version is updated
when the difference between in-memory and on-disk reaches X/2, and add X
after an unclean restart.  Or it could just choose a large X and hope.
Or it could do something else that neither of us has thought of.  But
PLEASE leave the filesystem in control, do not make it fit with our
pre-conceived ideas of what would be easy for it.
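
For example, the first option above might look roughly like the
following (illustrative pseudocode only -- the structure and the
journalling helper are invented, and X is whatever the filesystem
picks):

#include <linux/types.h>

#define IV_SLACK	4096	/* "X": chosen by the filesystem */

struct fs_inode {
	u64 i_version_mem;	/* current in-memory value */
	u64 i_version_disk;	/* last value written to the journal */
};

void fs_journal_iversion(struct fs_inode *fi);	/* invented helper */

static void fs_bump_iversion(struct fs_inode *fi)
{
	fi->i_version_mem++;
	/* don't let the in-memory value drift more than ~X ahead of disk */
	if (fi->i_version_mem - fi->i_version_disk >= IV_SLACK / 2) {
		fi->i_version_disk = fi->i_version_mem;
		fs_journal_iversion(fi);
	}
}

static u64 fs_iversion_after_unclean_restart(u64 on_disk)
{
	/*
	 * Anything handed out before the crash was below on_disk + X,
	 * assuming the journalled record commits before another X/2 of
	 * drift can accumulate.
	 */
	return on_disk + IV_SLACK;
}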

> 
> If we want to leave this up to the filesystem, I guess we could just add
> a new struct super_block.s_version_offset field and let the filesystem
> precompute that value and set it at mount time. Then we can just add
> that in after querying i_version.

If we are leaving "this up to the filesystem", then we don't add anything
to struct super_block and we don't add anything "in after querying
i_version".  Rather, we "leave this up to the filesystem" and use
exactly the i_version that the filesystem provides.  We only provide
advice as to minimum requirements, preferred behaviours, and possible
implementation suggestions.

NeilBrown


> -- 
> Jeff Layton <jlayton@kernel.org>
>
Theodore Ts'o Sept. 16, 2022, 6:54 a.m. UTC | #76
On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > If the answer is that 'all values change', then why store the crash
> > > counter in the inode at all? Why not just add it as an offset when
> > > you're generating the user-visible change attribute?
> > > 
> > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)

I had suggested just hashing the crash counter with the file system's
on-disk i_version number, which is essentially what you are suggesting.

> > Yes, if we plan to ensure that all the change attrs change after a
> > crash, we can do that.
> > 
> > So what would make sense for an offset? Maybe 2**12? One would hope that
> > there wouldn't be more than 4k increments before one of them made it to
> > disk. OTOH, maybe that can happen with teeny-tiny writes.
> 
> Leave it up the to filesystem to decide.  The VFS and/or NFSD should
> have not have part in calculating the i_version.  It should be entirely
> in the filesystem - though support code could be provided if common
> patterns exist across filesystems.

Oh, *heck* no.  This parameter is for the NFS implementation to
decide, because it's NFS's caching algorithms which are at stake here.

As a file system maintainer, I had offered to make an on-disk
"crash counter" which would get updated when the journal had gotten
replayed, in addition to the on-disk i_version number.  This will be
available for the Linux implementation of NFSD to use, but that's up
to *you* to decide how you want to use them.

I was perfectly happy with hashing the crash counter and the i_version
because I had assumed that not *that* much stuff was going to be
cached, and so invalidating all of the caches in the unusual case
where there was a crash was acceptable.  After all it's a !@#?!@
cache.  Caches sometimes get invalidated.  "That is the order of
things." (as Ramata'Klan once said in "Rocks and Shoals")

But if people expect that multiple TBs of data are going to be stored;
that cache invalidation is unacceptable; and that an itsy-weeny chance
of false negative failures which might cause data corruption might be
an acceptable tradeoff, hey, that's for the system which is providing
caching semantics to determine.

PLEASE don't put this tradeoff on the file system authors; I would
much prefer to leave this tradeoff in the hands of the system which is
trying to do the caching.

						- Ted
Jeff Layton Sept. 16, 2022, 11:32 a.m. UTC | #77
On Fri, 2022-09-16 at 08:42 +1000, NeilBrown wrote:
> On Fri, 16 Sep 2022, Jeff Layton wrote:
> > On Thu, 2022-09-15 at 10:06 -0400, J. Bruce Fields wrote:
> > > On Tue, Sep 13, 2022 at 09:14:32AM +1000, NeilBrown wrote:
> > > > On Mon, 12 Sep 2022, J. Bruce Fields wrote:
> > > > > On Sun, Sep 11, 2022 at 08:13:11AM +1000, NeilBrown wrote:
> > > > > > On Fri, 09 Sep 2022, Jeff Layton wrote:
> > > > > > > 
> > > > > > > The machine crashes and comes back up, and we get a query for i_version
> > > > > > > and it comes back as X. Fine, it's an old version. Now there is a write.
> > > > > > > What do we do to ensure that the new value doesn't collide with X+1? 
> > > > > > 
> > > > > > (I missed this bit in my earlier reply..)
> > > > > > 
> > > > > > How is it "Fine" to see an old version?
> > > > > > The file could have changed without the version changing.
> > > > > > And I thought one of the goals of the crash-count was to be able to
> > > > > > provide a monotonic change id.
> > > > > 
> > > > > I was still mainly thinking about how to provide reliable close-to-open
> > > > > semantics between NFS clients.  In the case the writer was an NFS
> > > > > client, it wasn't done writing (or it would have COMMITted), so those
> > > > > writes will come in and bump the change attribute soon, and as long as
> > > > > we avoid the small chance of reusing an old change attribute, we're OK,
> > > > > and I think it'd even still be OK to advertise
> > > > > CHANGE_TYPE_IS_MONOTONIC_INCR.
> > > > 
> > > > You seem to be assuming that the client doesn't crash at the same time
> > > > as the server (maybe they are both VMs on a host that lost power...)
> > > > 
> > > > If client A reads and caches, client B writes, the server crashes after
> > > > writing some data (to already allocated space so no inode update needed)
> > > > but before writing the new i_version, then client B crashes.
> > > > When server comes back the i_version will be unchanged but the data has
> > > > changed.  Client A will cache old data indefinitely...
> > > 
> > > I guess I assume that if all we're promising is close-to-open, then a
> > > client isn't allowed to trust its cache in that situation.  Maybe that's
> > > an overly draconian interpretation of close-to-open.
> > > 
> > > Also, I'm trying to think about how to improve things incrementally.
> > > Incorporating something like a crash count into the on-disk i_version
> > > fixes some cases without introducing any new ones or regressing
> > > performance after a crash.
> > > 
> > 
> > I think we ought to start there.
> > 
> > > If we subsequently wanted to close those remaining holes, I think we'd
> > > need the change attribute increment to be seen as atomic with respect to
> > > its associated change, both to clients and (separately) on disk.  (That
> > > would still allow the change attribute to go backwards after a crash, to
> > > the value it held as of the on-disk state of the file.  I think clients
> > > should be able to deal with that case.)
> > > 
> > > But, I don't know, maybe a bigger hammer would be OK:
> > > 
> > > > I think we need to require the filesystem to ensure that the i_version
> > > > is seen to increase shortly after any change becomes visible in the
> > > > file, and no later than the moment when the request that initiated the
> > > > change is acknowledged as being complete.  In the case of an unclean
> > > > restart, any file that is not known to have been unchanged immediately
> > > > before the crash must have i_version increased.
> > > > 
> > > > The simplest implementation is to have an unclean-restart counter and to
> > > > always included this multiplied by some constant X in the reported
> > > > i_version.  The filesystem guarantees to record (e.g.  to journal
> > > > at least) the i_version if it comes close to X more than the previous
> > > > record.  The filesystem gets to choose X.
> > > 
> > > So the question is whether people can live with invalidating all client
> > > caches after a cache.  I don't know.
> > > 
> > 
> > I assume you mean "after a crash". Yeah, that is pretty nasty. We don't
> > get perfect crash resilience with incorporating this into the on-disk
> > value, but I like that better than factoring it in at presentation time.
> > 
> > That would mean that the servers would end up getting hammered with read
> > activity after a crash (at least in some environments). I don't think
> > that would be worth the tradeoff. There's a real benefit to preserving
> > caches when we can.
> 
> Would it really mean the server gets hammered?
> 

Traditionally, yes. That was the rationale for fscache, after all.
It's particularly bad in large renderfarms: when a large swath of
client machines reboots, they all come up with blank caches and
hammer the server with READs.

We'll be back to that behavior after a crash with this scheme, since
fscache uses the change attribute to determine cache validity. I guess
that's unavoidable for now.

> For files and NFSv4, any significant cache should be held on the basis
> of a delegation, and if the client holds a delegation then it shouldn't
> be paying attention to i_version.
> 
> I'm not entirely sure of this.  Section 10.2.1 of RFC 5661 seems to
> suggest that when the client uses CLAIM_DELEG_PREV to reclaim a
> delegation, it must then return the delegation.  However the explanation
> seems to be mostly about WRITE delegations and immediately flushing
> cached changes.  Do we know if there is a way for the server to say "OK,
> you have that delegation again" in a way that the client can keep the
> delegation and continue to ignore i_version?
> 

Delegations may change that calculus. In general I've noticed that the
client tends to ignore attribute cache changes when it has a delegation.

> For directories, which cannot be delegated the same way but can still be
> cached, the issues are different.  All directory morphing operations
> will be journalled by the filesystem so it should be able to keep the
> i_version up to date.  So the (journalling) filesystem should *NOT* add
> a crash-count to the i_version for directories even if it does for files.
> 

Interesting and good point. We should be able to make that distinction
and just mix in the crash counter for regular files.
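
i.e. something like this sketch (s_crash_counter being the hypothetical
per-superblock field we've been discussing, not an existing member):

/*
 * Sketch only: only regular files get the crash counter mixed in;
 * directory morphing ops are journalled, so directories keep the
 * plain counter. Assumes the counter itself stays below 2^48.
 */
#include <linux/fs.h>

static u64 mix_in_crash_counter(const struct inode *inode, u64 i_version)
{
	if (!S_ISREG(inode->i_mode))
		return i_version;

	/* s_crash_counter: hypothetical field, set at mount time */
	return i_version | ((u64)inode->i_sb->s_crash_counter << 48);
}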

> 
> 
> > 
> > > > A more complex solution would be to record (similar to the way orphans
> > > > are recorded) any file which is open for write, and to add X to the
> > > > i_version for any "dirty" file still recorded during an unclean
> > > > restart.  This would avoid bumping the i_version for read-only files.
> > > 
> > > Is that practical?  Working out the performance tradeoffs sounds like a
> > > project.
> > > 
> > > 
> > > > There may be other solutions, but we should leave that up to the
> > > > filesystem.  Each filesystem might choose something different.
> > > 
> > > Sure.
> > > 
> > 
> > Agreed here too. I think we need to allow for some flexibility here. 
> > 
> > Here's what I'm thinking:
> > 
> > We'll carve out the upper 16 bits in the i_version counter to be the
> > crash counter field. That gives us 8k crashes before we have to worry
> > about collisions. Hopefully the remaining 47 bits of counter will be
> > plenty given that we don't increment it when it's not being queried or
> > nothing else changes. (Can we mitigate wrapping here somehow?)
> > 
> > The easiest way to do this would be to add a u16 s_crash_counter to
> > struct super_block. We'd initialize that to 0, and the filesystem could
> > fill that value out at mount time.
> > 
> > Then inode_maybe_inc_iversion can just shift the s_crash_counter that
> > left by 24 bits and and plop it into the top of the value we're
> > preparing to cmpxchg into place.
> > 
> > This is backward compatible too, at least for i_version counter values
> > that are <2^47. With anything larger, we might end up with something
> > going backward and a possible collision, but it's (hopefully) a small
> > risk.
> > 
> > -- 
> > Jeff Layton <jlayton@kernel.org>
> >
Jeff Layton Sept. 16, 2022, 11:36 a.m. UTC | #78
On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote:
> On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > > If the answer is that 'all values change', then why store the crash
> > > > counter in the inode at all? Why not just add it as an offset when
> > > > you're generating the user-visible change attribute?
> > > > 
> > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> 
> I had suggested just hashing the crash counter with the file system's
> on-disk i_version number, which is essentially what you are suggested.
> 
> > > Yes, if we plan to ensure that all the change attrs change after a
> > > crash, we can do that.
> > > 
> > > So what would make sense for an offset? Maybe 2**12? One would hope that
> > > there wouldn't be more than 4k increments before one of them made it to
> > > disk. OTOH, maybe that can happen with teeny-tiny writes.
> > 
> > Leave it up the to filesystem to decide.  The VFS and/or NFSD should
> > have not have part in calculating the i_version.  It should be entirely
> > in the filesystem - though support code could be provided if common
> > patterns exist across filesystems.
> 
> Oh, *heck* no.  This parameter is for the NFS implementation to
> decide, because it's NFS's caching algorithms which are at stake here.
> 
> As a the file system maintainer, I had offered to make an on-disk
> "crash counter" which would get updated when the journal had gotten
> replayed, in addition to the on-disk i_version number.  This will be
> available for the Linux implementation of NFSD to use, but that's up
> to *you* to decide how you want to use them.
> 
> I was perfectly happy with hashing the crash counter and the i_version
> because I had assumed that not *that* much stuff was going to be
> cached, and so invalidating all of the caches in the unusual case
> where there was a crash was acceptable.  After all it's a !@#?!@
> cache.  Caches sometimmes get invalidated.  "That is the order of
> things." (as Ramata'Klan once said in "Rocks and Shoals")
> 
> But if people expect that multiple TB's of data is going to be stored;
> that cache invalidation is unacceptable; and that a itsy-weeny chance
> of false negative failures which might cause data corruption might be
> acceptable tradeoff, hey, that's for the system which is providing
> caching semantics to determine.
> 
> PLEASE don't put this tradeoff on the file system authors; I would
> much prefer to leave this tradeoff in the hands of the system which is
> trying to do the caching.
> 

Yeah, if we were designing this from scratch, I might agree with leaving
more up to the filesystem, but the existing users all have pretty much
the same needs. I plan to try to keep most of this in the common
infrastructure defined in iversion.h.

Ted, for the ext4 crash counter, what wordsize were you thinking? I
doubt we'll be able to use much more than 32 bits so a larger integer is
probably not worthwhile. There are several holes in struct super_block
(at least on x86_64), so adding this field to the generic structure
needn't grow it.
Jeff Layton Sept. 16, 2022, 3:11 p.m. UTC | #79
On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote:
> On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote:
> > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > > > If the answer is that 'all values change', then why store the crash
> > > > > counter in the inode at all? Why not just add it as an offset when
> > > > > you're generating the user-visible change attribute?
> > > > > 
> > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> > 
> > I had suggested just hashing the crash counter with the file system's
> > on-disk i_version number, which is essentially what you are suggested.
> > 
> > > > Yes, if we plan to ensure that all the change attrs change after a
> > > > crash, we can do that.
> > > > 
> > > > So what would make sense for an offset? Maybe 2**12? One would hope that
> > > > there wouldn't be more than 4k increments before one of them made it to
> > > > disk. OTOH, maybe that can happen with teeny-tiny writes.
> > > 
> > > Leave it up the to filesystem to decide.  The VFS and/or NFSD should
> > > have not have part in calculating the i_version.  It should be entirely
> > > in the filesystem - though support code could be provided if common
> > > patterns exist across filesystems.
> > 
> > Oh, *heck* no.  This parameter is for the NFS implementation to
> > decide, because it's NFS's caching algorithms which are at stake here.
> > 
> > As a the file system maintainer, I had offered to make an on-disk
> > "crash counter" which would get updated when the journal had gotten
> > replayed, in addition to the on-disk i_version number.  This will be
> > available for the Linux implementation of NFSD to use, but that's up
> > to *you* to decide how you want to use them.
> > 
> > I was perfectly happy with hashing the crash counter and the i_version
> > because I had assumed that not *that* much stuff was going to be
> > cached, and so invalidating all of the caches in the unusual case
> > where there was a crash was acceptable.  After all it's a !@#?!@
> > cache.  Caches sometimmes get invalidated.  "That is the order of
> > things." (as Ramata'Klan once said in "Rocks and Shoals")
> > 
> > But if people expect that multiple TB's of data is going to be stored;
> > that cache invalidation is unacceptable; and that a itsy-weeny chance
> > of false negative failures which might cause data corruption might be
> > acceptable tradeoff, hey, that's for the system which is providing
> > caching semantics to determine.
> > 
> > PLEASE don't put this tradeoff on the file system authors; I would
> > much prefer to leave this tradeoff in the hands of the system which is
> > trying to do the caching.
> > 
> 
> Yeah, if we were designing this from scratch, I might agree with leaving
> more up to the filesystem, but the existing users all have pretty much
> the same needs. I'm going to plan to try to keep most of this in the
> common infrastructure defined in iversion.h.
> 
> Ted, for the ext4 crash counter, what wordsize were you thinking? I
> doubt we'll be able to use much more than 32 bits so a larger integer is
> probably not worthwhile. There are several holes in struct super_block
> (at least on x86_64), so adding this field to the generic structure
> needn't grow it.

That said, now that I've taken a swipe at implementing this, I need more
information than just the crash counter. We need to multiply the crash
counter by a reasonable estimate of the maximum number of individual
writes that could occur between an i_version being incremented and that
value making it to the backing store.

IOW, given a write that bumps the i_version to X, how many more write
calls could race in before X makes it to the platter? I took a SWAG and
said 4k in an earlier email, but I don't really have a way to know, and
that could vary wildly with different filesystems and storage.

What I'd like to see is this in struct super_block:

	u32		s_version_offset;

...and then individual filesystems can calculate:

	crash_counter * max_number_of_writes

and put the correct value in there at mount time.
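
To make that concrete, here's a rough sketch of what I have in mind. The
s_version_offset field, the fs_* helper names and the 4k figure are all
illustrative assumptions, not existing code:

    #include <linux/fs.h>
    #include <linux/iversion.h>

    #define MAX_UNLOGGED_WRITES	4096	/* the 4k SWAG above */

    /* the filesystem fills this in at mount time */
    static void fs_init_version_offset(struct super_block *sb, u32 crash_counter)
    {
            sb->s_version_offset = crash_counter * MAX_UNLOGGED_WRITES;
    }

    /* what nfsd/statx would then report as the change attribute */
    static u64 fs_change_attr(const struct inode *inode)
    {
            return inode_peek_iversion(inode) + inode->i_sb->s_version_offset;
    }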
Dave Chinner Sept. 18, 2022, 11:53 p.m. UTC | #80
On Fri, Sep 16, 2022 at 11:11:34AM -0400, Jeff Layton wrote:
> On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote:
> > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote:
> > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > > > > If the answer is that 'all values change', then why store the crash
> > > > > > counter in the inode at all? Why not just add it as an offset when
> > > > > > you're generating the user-visible change attribute?
> > > > > > 
> > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> > > 
> > > I had suggested just hashing the crash counter with the file system's
> > > on-disk i_version number, which is essentially what you are suggested.
> > > 
> > > > > Yes, if we plan to ensure that all the change attrs change after a
> > > > > crash, we can do that.
> > > > > 
> > > > > So what would make sense for an offset? Maybe 2**12? One would hope that
> > > > > there wouldn't be more than 4k increments before one of them made it to
> > > > > disk. OTOH, maybe that can happen with teeny-tiny writes.
> > > > 
> > > > Leave it up the to filesystem to decide.  The VFS and/or NFSD should
> > > > have not have part in calculating the i_version.  It should be entirely
> > > > in the filesystem - though support code could be provided if common
> > > > patterns exist across filesystems.
> > > 
> > > Oh, *heck* no.  This parameter is for the NFS implementation to
> > > decide, because it's NFS's caching algorithms which are at stake here.
> > > 
> > > As a the file system maintainer, I had offered to make an on-disk
> > > "crash counter" which would get updated when the journal had gotten
> > > replayed, in addition to the on-disk i_version number.  This will be
> > > available for the Linux implementation of NFSD to use, but that's up
> > > to *you* to decide how you want to use them.
> > > 
> > > I was perfectly happy with hashing the crash counter and the i_version
> > > because I had assumed that not *that* much stuff was going to be
> > > cached, and so invalidating all of the caches in the unusual case
> > > where there was a crash was acceptable.  After all it's a !@#?!@
> > > cache.  Caches sometimmes get invalidated.  "That is the order of
> > > things." (as Ramata'Klan once said in "Rocks and Shoals")
> > > 
> > > But if people expect that multiple TB's of data is going to be stored;
> > > that cache invalidation is unacceptable; and that a itsy-weeny chance
> > > of false negative failures which might cause data corruption might be
> > > acceptable tradeoff, hey, that's for the system which is providing
> > > caching semantics to determine.
> > > 
> > > PLEASE don't put this tradeoff on the file system authors; I would
> > > much prefer to leave this tradeoff in the hands of the system which is
> > > trying to do the caching.
> > > 
> > 
> > Yeah, if we were designing this from scratch, I might agree with leaving
> > more up to the filesystem, but the existing users all have pretty much
> > the same needs. I'm going to plan to try to keep most of this in the
> > common infrastructure defined in iversion.h.
> > 
> > Ted, for the ext4 crash counter, what wordsize were you thinking? I
> > doubt we'll be able to use much more than 32 bits so a larger integer is
> > probably not worthwhile. There are several holes in struct super_block
> > (at least on x86_64), so adding this field to the generic structure
> > needn't grow it.
> 
> That said, now that I've taken a swipe at implementing this, I need more
> information than just the crash counter. We need to multiply the crash
> counter with a reasonable estimate of the maximum number of individual
> writes that could occur between an i_version being incremented and that
> value making it to the backing store.
> 
> IOW, given a write that bumps the i_version to X, how many more write
> calls could race in before X makes it to the platter? I took a SWAG and
> said 4k in an earlier email, but I don't really have a way to know, and
> that could vary wildly with different filesystems and storage.
> 
> What I'd like to see is this in struct super_block:
> 
> 	u32		s_version_offset;

	u64		s_version_salt;

> ...and then individual filesystems can calculate:
> 
> 	crash_counter * max_number_of_writes
> 
> and put the correct value in there at mount time.

Other filesystems might not have a crash counter but have other
information that can be substituted, like a mount counter or a
global change sequence number that is guaranteed to increment from
one mount to the next. 

Further, have you thought about what "max number of writes" might
be in ten years' time? e.g. what happens if a filesystem has a "max
number of writes" greater than 2^32? I mean, we already have
machines out there running Linux with 64-128TB of physical RAM, so
it's already practical to hold > 2^32 individual writes to a single
inode that each bump i_version in memory....

So when we consider this sort of scale, the "crash counter * max
writes" scheme largely falls apart because "max writes" is a really
large number to begin with. We're going to be stuck with whatever
algorithm is decided on for the foreseeable future, so we must
recognise that _we've already overrun 32 bit counter schemes_ in
terms of tracking "i_version changes in memory vs what we have on
disk".

Hence I really think that we should be leaving the implementation of
the salt value to the individual filesystems, as different
filesystems are aimed at different use cases and so may not
necessarily all have to care about the same things (like 2^32
max-write overruns).  All the high level VFS code then needs to do
is add the two together:

	statx.change_attr = inode->i_version + sb->s_version_salt;

Cheers,

Dave.
Jeff Layton Sept. 19, 2022, 1:13 p.m. UTC | #81
On Mon, 2022-09-19 at 09:53 +1000, Dave Chinner wrote:
> On Fri, Sep 16, 2022 at 11:11:34AM -0400, Jeff Layton wrote:
> > On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote:
> > > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote:
> > > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > > > > > If the answer is that 'all values change', then why store the crash
> > > > > > > counter in the inode at all? Why not just add it as an offset when
> > > > > > > you're generating the user-visible change attribute?
> > > > > > > 
> > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> > > > 
> > > > I had suggested just hashing the crash counter with the file system's
> > > > on-disk i_version number, which is essentially what you are suggested.
> > > > 
> > > > > > Yes, if we plan to ensure that all the change attrs change after a
> > > > > > crash, we can do that.
> > > > > > 
> > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that
> > > > > > there wouldn't be more than 4k increments before one of them made it to
> > > > > > disk. OTOH, maybe that can happen with teeny-tiny writes.
> > > > > 
> > > > > Leave it up the to filesystem to decide.  The VFS and/or NFSD should
> > > > > have not have part in calculating the i_version.  It should be entirely
> > > > > in the filesystem - though support code could be provided if common
> > > > > patterns exist across filesystems.
> > > > 
> > > > Oh, *heck* no.  This parameter is for the NFS implementation to
> > > > decide, because it's NFS's caching algorithms which are at stake here.
> > > > 
> > > > As a the file system maintainer, I had offered to make an on-disk
> > > > "crash counter" which would get updated when the journal had gotten
> > > > replayed, in addition to the on-disk i_version number.  This will be
> > > > available for the Linux implementation of NFSD to use, but that's up
> > > > to *you* to decide how you want to use them.
> > > > 
> > > > I was perfectly happy with hashing the crash counter and the i_version
> > > > because I had assumed that not *that* much stuff was going to be
> > > > cached, and so invalidating all of the caches in the unusual case
> > > > where there was a crash was acceptable.  After all it's a !@#?!@
> > > > cache.  Caches sometimmes get invalidated.  "That is the order of
> > > > things." (as Ramata'Klan once said in "Rocks and Shoals")
> > > > 
> > > > But if people expect that multiple TB's of data is going to be stored;
> > > > that cache invalidation is unacceptable; and that a itsy-weeny chance
> > > > of false negative failures which might cause data corruption might be
> > > > acceptable tradeoff, hey, that's for the system which is providing
> > > > caching semantics to determine.
> > > > 
> > > > PLEASE don't put this tradeoff on the file system authors; I would
> > > > much prefer to leave this tradeoff in the hands of the system which is
> > > > trying to do the caching.
> > > > 
> > > 
> > > Yeah, if we were designing this from scratch, I might agree with leaving
> > > more up to the filesystem, but the existing users all have pretty much
> > > the same needs. I'm going to plan to try to keep most of this in the
> > > common infrastructure defined in iversion.h.
> > > 
> > > Ted, for the ext4 crash counter, what wordsize were you thinking? I
> > > doubt we'll be able to use much more than 32 bits so a larger integer is
> > > probably not worthwhile. There are several holes in struct super_block
> > > (at least on x86_64), so adding this field to the generic structure
> > > needn't grow it.
> > 
> > That said, now that I've taken a swipe at implementing this, I need more
> > information than just the crash counter. We need to multiply the crash
> > counter with a reasonable estimate of the maximum number of individual
> > writes that could occur between an i_version being incremented and that
> > value making it to the backing store.
> > 
> > IOW, given a write that bumps the i_version to X, how many more write
> > calls could race in before X makes it to the platter? I took a SWAG and
> > said 4k in an earlier email, but I don't really have a way to know, and
> > that could vary wildly with different filesystems and storage.
> > 
> > What I'd like to see is this in struct super_block:
> > 
> > 	u32		s_version_offset;
> 
> 	u64		s_version_salt;
> 

IDK...it _is_ an offset since we're folding it in with addition, and it
has a real meaning. Filesystems do need to be cognizant of that fact, I
think.

Also does anyone have a preference on doing this vs. a get_version_salt
or get_version_offset sb operation? I figured the value should be mostly
static so it'd be nice to avoid an operation for it.

> > ...and then individual filesystems can calculate:
> > 
> > 	crash_counter * max_number_of_writes
> > 
> > and put the correct value in there at mount time.
> 
> Other filesystems might not have a crash counter but have other
> information that can be substituted, like a mount counter or a
> global change sequence number that is guaranteed to increment from
> one mount to the next. 
> 

The problem there is that you're going to cause the invalidation of all
of the NFS client's cached regular files, even on clean server reboots.
That's not a desirable outcome.

> Further, have you thought about what "max number of writes" might
> be in ten years time? e.g.  what happens if a filesysetm as "max
> number of writes" being greater than 2^32? I mean, we already have
> machines out there running Linux with 64-128TB of physical RAM, so
> it's already practical to hold > 2^32 individual writes to a single
> inode that each bump i_version in memory....

> So when we consider this sort of scale, the "crash counter * max
> writes" scheme largely falls apart because "max writes" is a really
> large number to begin with. We're going to be stuck with whatever
> algorithm is decided on for the foreseeable future, so we must
> recognise that _we've already overrun 32 bit counter schemes_ in
> terms of tracking "i_version changes in memory vs what we have on
> disk".
> 
> Hence I really think that we should be leaving the implementation of
> the salt value to the individual filesysetms as different
> filesytsems are aimed at different use cases and so may not
> necessarily have to all care about the same things (like 2^32 bit
> max write overruns).  All the high level VFS code then needs to do
> is add the two together:
> 
> 	statx.change_attr = inode->i_version + sb->s_version_salt;
> 

Yeah, I have thought about that. I was really hoping that file systems
wouldn't leave so many ephemeral changes lying around before logging
something. It's actually not as bad as it sounds. You'd need that number
of inode changes in memory + queries of i_version, alternating. When
there are no queries, nothing changes. But, the number of queries is
hard to gauge too as it's very dependent on workload, hardware, etc.

If the sky really is the limit on unlogged inode changes, then what do
you suggest? One idea:

We could try to kick off a write_inode in the background when the
i_version gets halfway to the limit. Eventually the nfs server could
just return NFS4ERR_DELAY on a GETATTR if it looked like the reported
version was going to cross the threshold. It'd be ugly, but hopefully
wouldn't happen much if things are tuned well.

Tracking that info might be expensive though. We'd need at least another
u64 field in struct inode for the latest on-disk version. Maybe we can
keep that in the fs-specific part of the inode somehow so we don't need
to grow generic struct inode?
Dave Chinner Sept. 20, 2022, 12:16 a.m. UTC | #82
On Mon, Sep 19, 2022 at 09:13:00AM -0400, Jeff Layton wrote:
> On Mon, 2022-09-19 at 09:53 +1000, Dave Chinner wrote:
> > On Fri, Sep 16, 2022 at 11:11:34AM -0400, Jeff Layton wrote:
> > > On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote:
> > > > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote:
> > > > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > > > > > > If the answer is that 'all values change', then why store the crash
> > > > > > > > counter in the inode at all? Why not just add it as an offset when
> > > > > > > > you're generating the user-visible change attribute?
> > > > > > > > 
> > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> > > > > 
> > > > > I had suggested just hashing the crash counter with the file system's
> > > > > on-disk i_version number, which is essentially what you are suggested.
> > > > > 
> > > > > > > Yes, if we plan to ensure that all the change attrs change after a
> > > > > > > crash, we can do that.
> > > > > > > 
> > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that
> > > > > > > there wouldn't be more than 4k increments before one of them made it to
> > > > > > > disk. OTOH, maybe that can happen with teeny-tiny writes.
> > > > > > 
> > > > > > Leave it up the to filesystem to decide.  The VFS and/or NFSD should
> > > > > > have not have part in calculating the i_version.  It should be entirely
> > > > > > in the filesystem - though support code could be provided if common
> > > > > > patterns exist across filesystems.
> > > > > 
> > > > > Oh, *heck* no.  This parameter is for the NFS implementation to
> > > > > decide, because it's NFS's caching algorithms which are at stake here.
> > > > > 
> > > > > As a the file system maintainer, I had offered to make an on-disk
> > > > > "crash counter" which would get updated when the journal had gotten
> > > > > replayed, in addition to the on-disk i_version number.  This will be
> > > > > available for the Linux implementation of NFSD to use, but that's up
> > > > > to *you* to decide how you want to use them.
> > > > > 
> > > > > I was perfectly happy with hashing the crash counter and the i_version
> > > > > because I had assumed that not *that* much stuff was going to be
> > > > > cached, and so invalidating all of the caches in the unusual case
> > > > > where there was a crash was acceptable.  After all it's a !@#?!@
> > > > > cache.  Caches sometimmes get invalidated.  "That is the order of
> > > > > things." (as Ramata'Klan once said in "Rocks and Shoals")
> > > > > 
> > > > > But if people expect that multiple TB's of data is going to be stored;
> > > > > that cache invalidation is unacceptable; and that a itsy-weeny chance
> > > > > of false negative failures which might cause data corruption might be
> > > > > acceptable tradeoff, hey, that's for the system which is providing
> > > > > caching semantics to determine.
> > > > > 
> > > > > PLEASE don't put this tradeoff on the file system authors; I would
> > > > > much prefer to leave this tradeoff in the hands of the system which is
> > > > > trying to do the caching.
> > > > > 
> > > > 
> > > > Yeah, if we were designing this from scratch, I might agree with leaving
> > > > more up to the filesystem, but the existing users all have pretty much
> > > > the same needs. I'm going to plan to try to keep most of this in the
> > > > common infrastructure defined in iversion.h.
> > > > 
> > > > Ted, for the ext4 crash counter, what wordsize were you thinking? I
> > > > doubt we'll be able to use much more than 32 bits so a larger integer is
> > > > probably not worthwhile. There are several holes in struct super_block
> > > > (at least on x86_64), so adding this field to the generic structure
> > > > needn't grow it.
> > > 
> > > That said, now that I've taken a swipe at implementing this, I need more
> > > information than just the crash counter. We need to multiply the crash
> > > counter with a reasonable estimate of the maximum number of individual
> > > writes that could occur between an i_version being incremented and that
> > > value making it to the backing store.
> > > 
> > > IOW, given a write that bumps the i_version to X, how many more write
> > > calls could race in before X makes it to the platter? I took a SWAG and
> > > said 4k in an earlier email, but I don't really have a way to know, and
> > > that could vary wildly with different filesystems and storage.
> > > 
> > > What I'd like to see is this in struct super_block:
> > > 
> > > 	u32		s_version_offset;
> > 
> > 	u64		s_version_salt;
> > 
> 
> IDK...it _is_ an offset since we're folding it in with addition, and it
> has a real meaning. Filesystems do need to be cognizant of that fact, I
> think.
> 
> Also does anyone have a preference on doing this vs. a get_version_salt
> or get_version_offset sb operation? I figured the value should be mostly
> static so it'd be nice to avoid an operation for it.
> 
> > > ...and then individual filesystems can calculate:
> > > 
> > > 	crash_counter * max_number_of_writes
> > > 
> > > and put the correct value in there at mount time.
> > 
> > Other filesystems might not have a crash counter but have other
> > information that can be substituted, like a mount counter or a
> > global change sequence number that is guaranteed to increment from
> > one mount to the next. 
> > 
> 
> The problem there is that you're going to cause the invalidation of all
> of the NFS client's cached regular files, even on clean server reboots.
> That's not a desirable outcome.

Stop saying "anything less than perfect is unacceptable". I *know*
that changing the salt on every mount might result in less than
perfect results, but the fact is that a -false negative- is a data
corruption event, whilst a false positive is not. False positives
may not be desirable, but false negatives are *not acceptable at
all*.

XFS can give you a guarantee of no false negatives right now with no
on-disk format changes necessary, but it comes with the downside of
false positives. That's not the end of the world, and it gives NFS
the functionality it needs immediately and allows us time to add
purpose-built on-disk functionality that gives NFS exactly what it
wants. The reality is that this purpose-built on-disk change will
take years to roll out to production systems, whilst using what we
have now is just a kernel patch and upgrade away....

Changing on-disk metadata formats takes time, no matter how simple
the change, and this timeframe is not something the NFS server
actually controls.

But there is a way for the NFS server to define and control its own
on-disk persistent metadata: *extended attributes*.

How about we set a "crash" extended attribute on the root of an NFS
export when the filesystem is exported, and then remove it when the
filesystem is unexported?

This gives the NFS server its own persistent attribute that tells
it whether the filesystem was *unexported* cleanly. If the exportfs
code calls syncfs() before the xattr is removed, then it guarantees
that everything the NFS clients have written and modified will be
exactly present the next time the filesystem is exported. If the
"crash" xattr is present when the filesystem is exported, then it
wasn't cleanly synced before it was taken out of service, and so
something may have been lost and the "crash counter" needs to be
bumped.

Yes, the "crash counter" is held in another xattr, so that it is
persistent across crash and mount/unmount cycles. If the crash
xattr is present, the NFSD reads, bumps and writes the crash counter
xattr, and uses the new value for the life of that export. If the
crash xattr is not present, then it just reads the counter xattr and
uses it unchanged.
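
In userspace terms (error handling omitted), the export/unexport flow
might look roughly like the sketch below; the xattr names are just
placeholders for illustration, not an existing nfsd convention:

    #define _GNU_SOURCE		/* for syncfs() */
    #include <stdint.h>
    #include <sys/xattr.h>
    #include <unistd.h>

    #define CRASH_XATTR	"trusted.nfsd.crash"
    #define COUNTER_XATTR	"trusted.nfsd.crash_counter"

    /* On export: detect an unclean previous export, return the counter. */
    static uint64_t export_crash_counter(const char *export_root)
    {
            uint64_t counter = 0;

            getxattr(export_root, COUNTER_XATTR, &counter, sizeof(counter));

            /* "crash" marker still present => not cleanly unexported */
            if (getxattr(export_root, CRASH_XATTR, NULL, 0) >= 0) {
                    counter++;
                    setxattr(export_root, COUNTER_XATTR, &counter,
                             sizeof(counter), 0);
            }

            /* mark the filesystem as in service */
            setxattr(export_root, CRASH_XATTR, "", 0, 0);
            return counter;
    }

    /* On clean unexport: flush everything, then drop the marker. */
    static void unexport_clean(const char *export_root, int export_fd)
    {
            syncfs(export_fd);
            removexattr(export_root, CRASH_XATTR);
    }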

IOWs, the NFS server can define its own on-disk persistent metadata
using xattrs, and you don't need local filesystems to be modified at
all. You can add the crash epoch into the change attr that is sent
to NFS clients without having to change the VFS i_version
implementation at all.

This whole problem is solvable entirely within the NFS server code,
and we don't need to change local filesystems at all. NFS can
control the persistence and format of the xattrs it uses, and it
does not need new custom on-disk format changes from every
filesystem to support this new application requirement.

At this point, NFS server developers don't need to care what the
underlying filesystem format provides - the xattrs provide the crash
detection and enumeration the NFS server functionality requires.

-Dave.
Jeff Layton Sept. 20, 2022, 10:26 a.m. UTC | #83
On Tue, 2022-09-20 at 10:16 +1000, Dave Chinner wrote:
> On Mon, Sep 19, 2022 at 09:13:00AM -0400, Jeff Layton wrote:
> > On Mon, 2022-09-19 at 09:53 +1000, Dave Chinner wrote:
> > > On Fri, Sep 16, 2022 at 11:11:34AM -0400, Jeff Layton wrote:
> > > > On Fri, 2022-09-16 at 07:36 -0400, Jeff Layton wrote:
> > > > > On Fri, 2022-09-16 at 02:54 -0400, Theodore Ts'o wrote:
> > > > > > On Fri, Sep 16, 2022 at 08:23:55AM +1000, NeilBrown wrote:
> > > > > > > > > If the answer is that 'all values change', then why store the crash
> > > > > > > > > counter in the inode at all? Why not just add it as an offset when
> > > > > > > > > you're generating the user-visible change attribute?
> > > > > > > > > 
> > > > > > > > > i.e. statx.change_attr = inode->i_version + (crash counter * offset)
> > > > > > 
> > > > > > I had suggested just hashing the crash counter with the file system's
> > > > > > on-disk i_version number, which is essentially what you are suggested.
> > > > > > 
> > > > > > > > Yes, if we plan to ensure that all the change attrs change after a
> > > > > > > > crash, we can do that.
> > > > > > > > 
> > > > > > > > So what would make sense for an offset? Maybe 2**12? One would hope that
> > > > > > > > there wouldn't be more than 4k increments before one of them made it to
> > > > > > > > disk. OTOH, maybe that can happen with teeny-tiny writes.
> > > > > > > 
> > > > > > > Leave it up the to filesystem to decide.  The VFS and/or NFSD should
> > > > > > > have not have part in calculating the i_version.  It should be entirely
> > > > > > > in the filesystem - though support code could be provided if common
> > > > > > > patterns exist across filesystems.
> > > > > > 
> > > > > > Oh, *heck* no.  This parameter is for the NFS implementation to
> > > > > > decide, because it's NFS's caching algorithms which are at stake here.
> > > > > > 
> > > > > > As a the file system maintainer, I had offered to make an on-disk
> > > > > > "crash counter" which would get updated when the journal had gotten
> > > > > > replayed, in addition to the on-disk i_version number.  This will be
> > > > > > available for the Linux implementation of NFSD to use, but that's up
> > > > > > to *you* to decide how you want to use them.
> > > > > > 
> > > > > > I was perfectly happy with hashing the crash counter and the i_version
> > > > > > because I had assumed that not *that* much stuff was going to be
> > > > > > cached, and so invalidating all of the caches in the unusual case
> > > > > > where there was a crash was acceptable.  After all it's a !@#?!@
> > > > > > cache.  Caches sometimmes get invalidated.  "That is the order of
> > > > > > things." (as Ramata'Klan once said in "Rocks and Shoals")
> > > > > > 
> > > > > > But if people expect that multiple TB's of data is going to be stored;
> > > > > > that cache invalidation is unacceptable; and that a itsy-weeny chance
> > > > > > of false negative failures which might cause data corruption might be
> > > > > > acceptable tradeoff, hey, that's for the system which is providing
> > > > > > caching semantics to determine.
> > > > > > 
> > > > > > PLEASE don't put this tradeoff on the file system authors; I would
> > > > > > much prefer to leave this tradeoff in the hands of the system which is
> > > > > > trying to do the caching.
> > > > > > 
> > > > > 
> > > > > Yeah, if we were designing this from scratch, I might agree with leaving
> > > > > more up to the filesystem, but the existing users all have pretty much
> > > > > the same needs. I'm going to plan to try to keep most of this in the
> > > > > common infrastructure defined in iversion.h.
> > > > > 
> > > > > Ted, for the ext4 crash counter, what wordsize were you thinking? I
> > > > > doubt we'll be able to use much more than 32 bits so a larger integer is
> > > > > probably not worthwhile. There are several holes in struct super_block
> > > > > (at least on x86_64), so adding this field to the generic structure
> > > > > needn't grow it.
> > > > 
> > > > That said, now that I've taken a swipe at implementing this, I need more
> > > > information than just the crash counter. We need to multiply the crash
> > > > counter with a reasonable estimate of the maximum number of individual
> > > > writes that could occur between an i_version being incremented and that
> > > > value making it to the backing store.
> > > > 
> > > > IOW, given a write that bumps the i_version to X, how many more write
> > > > calls could race in before X makes it to the platter? I took a SWAG and
> > > > said 4k in an earlier email, but I don't really have a way to know, and
> > > > that could vary wildly with different filesystems and storage.
> > > > 
> > > > What I'd like to see is this in struct super_block:
> > > > 
> > > > 	u32		s_version_offset;
> > > 
> > > 	u64		s_version_salt;
> > > 
> > 
> > IDK...it _is_ an offset since we're folding it in with addition, and it
> > has a real meaning. Filesystems do need to be cognizant of that fact, I
> > think.
> > 
> > Also does anyone have a preference on doing this vs. a get_version_salt
> > or get_version_offset sb operation? I figured the value should be mostly
> > static so it'd be nice to avoid an operation for it.
> > 
> > > > ...and then individual filesystems can calculate:
> > > > 
> > > > 	crash_counter * max_number_of_writes
> > > > 
> > > > and put the correct value in there at mount time.
> > > 
> > > Other filesystems might not have a crash counter but have other
> > > information that can be substituted, like a mount counter or a
> > > global change sequence number that is guaranteed to increment from
> > > one mount to the next. 
> > > 
> > 
> > The problem there is that you're going to cause the invalidation of all
> > of the NFS client's cached regular files, even on clean server reboots.
> > That's not a desirable outcome.
> 
> Stop saying "anything less than perfect is unacceptible". I *know*
> that changing the salt on every mount might result in less than
> perfect results, but the fact is that a -false negative- is a data
> corruption event, whilst a false positive is not. False positives
> may not be desirable, but false negatives are *not acceptible at
> all*.
> 
> XFS can give you a guarantee of no false negatives right now with no
> on-disk format changes necessary, but it comes with the downside of
> false positives. That's not the end of the world, and it gives NFS
> the functionality it needs immediately and allows us time to add
> purpose-built on-disk functionality that gives NFS exactly what it
> wants. The reality is that this purpose-built on-disk change will
> take years to roll out to production systems, whilst using what we
> have now is just a kernel patch and upgrade away....
> 
> Changing on-disk metadata formats takes time, no matter how simple
> the change, and this timeframe is not something the NFS server
> actually controls.
> 
> But there is a way for the NFS server to define and control it's own
> on-disk persistent metadata: *extended attributes*.
> 
> How about we set a "crash" extended attribute on the root of an NFS
> export when the filesystem is exported, and then remove it when the
> filesystem is unexported.
> 
> This gives the NFS server it's own persistent attribute that tells
> it whether the filesystem was *unexported* cleanly. If the exportfs
> code calls syncfs() before the xattr is removed, then it guarantees
> that everything the NFS clients have written and modified will be
> exactly present the next time the filesystem is exported. If the
> "crash" xattr is present when the filesystem is exported, then it
> wasn't cleanly synced before it was taken out of service, and so
> something may have been lost and the "crash counter" needs to be
> bumped.
> 
> Yes, the "crash counter" is held in another xattr, so that it is
> persistent across crash and mount/unmount cycles. If the crash
> xattr is present, the NFSD reads, bumps and writes the crash counter
> xattr, and uses the new value for the life of that export. If the
> crash xattr is not present, then is just reads the counter xattr and
> uses it unchanged.
> 
> IOWs, the NFS server can define it's own on-disk persistent metadata
> using xattrs, and you don't need local filesystems to be modified at
> all. You can add the crash epoch into the change attr that is sent
> to NFS clients without having to change the VFS i_version
> implementation at all.
> 
> This whole problem is solvable entirely within the NFS server code,
> and we don't need to change local filesystems at all. NFS can
> control the persistence and format of the xattrs it uses, and it
> does not need new custom on-disk format changes from every
> filesystem to support this new application requirement.
> 
> At this point, NFS server developers don't need to care what the
> underlying filesystem format provides - the xattrs provide the crash
> detection and enumeration the NFS server functionality requires.
> 

Doesn't the filesystem already detect when it's been mounted after an
unclean shutdown? I'm not sure what good we'll get out of bolting this
scheme onto the NFS server, when the filesystem could just as easily
give us this info.

In any case, the main problem at this point is not so much in detecting
when there has been an unclean shutdown, but rather what to do when
there is one. We need to advance the presented change attributes
beyond the largest possible one that may have been handed out prior to
the crash. 

How do we determine what that offset should be? Your last email
suggested that there really is no limit to the number of i_version bumps
that can happen in memory before one of them makes it to disk. What can
we do to address that?
Dave Chinner Sept. 21, 2022, midnight UTC | #84
On Tue, Sep 20, 2022 at 06:26:05AM -0400, Jeff Layton wrote:
> On Tue, 2022-09-20 at 10:16 +1000, Dave Chinner wrote:
> > IOWs, the NFS server can define it's own on-disk persistent metadata
> > using xattrs, and you don't need local filesystems to be modified at
> > all. You can add the crash epoch into the change attr that is sent
> > to NFS clients without having to change the VFS i_version
> > implementation at all.
> > 
> > This whole problem is solvable entirely within the NFS server code,
> > and we don't need to change local filesystems at all. NFS can
> > control the persistence and format of the xattrs it uses, and it
> > does not need new custom on-disk format changes from every
> > filesystem to support this new application requirement.
> > 
> > At this point, NFS server developers don't need to care what the
> > underlying filesystem format provides - the xattrs provide the crash
> > detection and enumeration the NFS server functionality requires.
> > 
> 
> Doesn't the filesystem already detect when it's been mounted after an
> unclean shutdown?

Not every filesystem will be able to guarantee unclean shutdown
detection at the next mount. That's the whole problem - NFS
developers are asking for something that cannot be provided as
generic functionality by individual filesystems, so the NFS server
application is going to have to work around any filesystem that
cannot provide the information it needs.

e.g. ext4 has its journal replayed by the userspace tools prior
to mount, so when it then gets mounted by the kernel it's seen as a
clean mount.

If we shut an XFS filesystem down due to a filesystem corruption or
failed IO to the journal code, the kernel might not be able to
replay the journal on mount (i.e. it is corrupt).  We then run
xfs_repair, and that fixes the corruption issue and -cleans the
log-. When we next mount the filesystem, it results in a _clean
mount_, and the kernel filesystem code can not signal to NFS that an
unclean mount occurred and so it should bump its crash counter.

IOWs, this whole "filesystems need to tell NFS about crashes"
propagates all the way through *every filesystem tool chain*, not
just the kernel mount code. And we most certainly don't control
every 3rd party application that walks around in the filesystem on
disk format, and so there are -zero- guarantees that the kernel
filesystem mount code can give that an unclean shutdown occurred
prior to the current mount.

And then for niche NFS server applications (like transparent
fail-over between HA NFS servers) there are even more rigid
constraints on NFS change attributes. And you're asking local
filesystems to know about these application constraints and bake
them into their on-disk format again.

This whole discussion has come about because we baked certain
behaviour for NFS into the on-disk format many, many years ago, and
it's only now that it is considered inadequate for *new* NFS
application related functionality (e.g. fscache integration and
cache validity across server side mount cycles).

We've learnt a valuable lesson from this: don't bake application
specific persistent metadata requirements into the on-disk format
because when the application needs to change, it requires every
filesystem that supports that application level functionality
to change their on-disk formats...

> I'm not sure what good we'll get out of bolting this
> scheme onto the NFS server, when the filesystem could just as easily
> give us this info.

The xattr scheme guarantees the correct application behaviour that the NFS
server requires, all at the NFS application level without requiring
local filesystems to support the NFS requirements in their on-disk
format. The NFS server controls the format and versioning of its
on-disk persistent metadata (i.e. the xattrs it uses) and so any
changes to the application level requirements of that functionality
are now completely under the control of the application.

i.e. the application gets to manage version control, backwards and
forwards compatibility of its persistent metadata, etc. What you
are asking is that every local filesystem takes responsibility for
managing the long term persistent metadata that only NFS requires.
It's more complex to do this at the filesystem level, and we have to
replicate the same work for every filesystem that is going to
support this on-disk functionality.

Using xattrs means the functionality is implemented once, it's
common across all local filesystems, and no exportable filesystem
needs to know anything about it as it's all self-contained in the
NFS server code. The code is smaller, easier to maintain, consistent
across all systems, easy to test, etc.

It also can be implemented and rolled out *immediately* to all
existing supported NFS server implementations, without having to
wait months/years (or never!) for local filesystem on-disk format
changes to roll out to production systems.

Asking individual filesystems to implement application specific
persistent metadata is a *last resort* and should only be done if
correctness or performance cannot be obtained in any other way.

So, yeah, the only sane direction to take here is to use xattrs to
store this NFS application level information. It's less work for
everyone, and in the long term it means when the NFS application
requirements change again, we don't need to modify the on-disk
format of multiple local filesystems.

> In any case, the main problem at this point is not so much in detecting
> when there has been an unclean shutdown, but rather what to do when
> there is one. We need to to advance the presented change attributes
> beyond the largest possible one that may have been handed out prior to
> the crash. 

Sure, but you're missing my point: by using xattrs for detection,
you don't need to involve anything to do with local filesystems at
all.

> How do we determine what that offset should be? Your last email
> suggested that there really is no limit to the number of i_version bumps
> that can happen in memory before one of them makes it to disk. What can
> we do to address that?

<shrug>

I'm just pointing out problems I see when defining this as behaviour
for on-disk format purposes. If we define it as part of the on-disk
format, then we have to be concerned about how it may be used
outside the scope of just the NFS server application. 

However, if NFS keeps this metadata and functionality entirely
contained at the application level via xattrs, I really don't care
what algorithm NFS developers decide to use for their crash
sequencing. It's not my concern at this point, and that's precisely
why NFS should be using xattrs for this NFS specific functionality.

-Dave.
Jeff Layton Sept. 21, 2022, 10:33 a.m. UTC | #85
On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote:
> On Tue, Sep 20, 2022 at 06:26:05AM -0400, Jeff Layton wrote:
> > On Tue, 2022-09-20 at 10:16 +1000, Dave Chinner wrote:
> > > IOWs, the NFS server can define it's own on-disk persistent metadata
> > > using xattrs, and you don't need local filesystems to be modified at
> > > all. You can add the crash epoch into the change attr that is sent
> > > to NFS clients without having to change the VFS i_version
> > > implementation at all.
> > > 
> > > This whole problem is solvable entirely within the NFS server code,
> > > and we don't need to change local filesystems at all. NFS can
> > > control the persistence and format of the xattrs it uses, and it
> > > does not need new custom on-disk format changes from every
> > > filesystem to support this new application requirement.
> > > 
> > > At this point, NFS server developers don't need to care what the
> > > underlying filesystem format provides - the xattrs provide the crash
> > > detection and enumeration the NFS server functionality requires.
> > > 
> > 
> > Doesn't the filesystem already detect when it's been mounted after an
> > unclean shutdown?
> 
> Not every filesystem will be able to guarantee unclean shutdown
> detection at the next mount. That's the whole problem - NFS
> developers are asking for something that cannot be provided as
> generic functionality by individual filesystems, so the NFS server
> application is going to have to work around any filesytem that
> cannot provide the information it needs.
> 
> e.g. ext4 has it journal replayed by the userspace tools prior
> to mount, so when it then gets mounted by the kernel it's seen as a
> clean mount.
> 
> If we shut an XFS filesystem down due to a filesystem corruption or
> failed IO to the journal code, the kernel might not be able to
> replay the journal on mount (i.e. it is corrupt).  We then run
> xfs_repair, and that fixes the corruption issue and -cleans the
> log-. When we next mount the filesystem, it results in a _clean
> mount_, and the kernel filesystem code can not signal to NFS that an
> unclean mount occurred and so it should bump it's crash counter.
> 
> IOWs, this whole "filesystems need to tell NFS about crashes"
> propagates all the way through *every filesystem tool chain*, not
> just the kernel mount code. And we most certainly don't control
> every 3rd party application that walks around in the filesystem on
> disk format, and so there are -zero- guarantees that the kernel
> filesystem mount code can give that an unclean shutdown occurred
> prior to the current mount.
> 
> And then for niche NFS server applications (like transparent
> fail-over between HA NFS servers) there are even more rigid
> constraints on NFS change attributes. And you're asking local
> filesystems to know about these application constraints and bake
> them into their on-disk format again.
> 
> This whole discussion has come about because we baked certain
> behaviour for NFS into the on-disk format many, many years ago, and
> it's only now that it is considered inadequate for *new* NFS
> application related functionality (e.g. fscache integration and
> cache validity across server side mount cycles).
> 
> We've learnt a valuable lesson from this: don't bake application
> specific persistent metadata requirements into the on-disk format
> because when the application needs to change, it requires every
> filesystem that supports taht application level functionality
> to change their on-disk formats...
> 
> > I'm not sure what good we'll get out of bolting this
> > scheme onto the NFS server, when the filesystem could just as easily
> > give us this info.
> 
> The xattr scheme guarantees the correct application behaviour that the NFS
> server requires, all at the NFS application level without requiring
> local filesystems to support the NFS requirements in their on-disk
> format. THe NFS server controls the format and versioning of it's
> on-disk persistent metadata (i.e. the xattrs it uses) and so any
> changes to the application level requirements of that functionality
> are now completely under the control of the application.
> 
> i.e. the application gets to manage version control, backwards and
> forwards compatibility of it's persistent metadata, etc. What you
> are asking is that every local filesystem takes responsibility for
> managing the long term persistent metadata that only NFS requires.
> It's more complex to do this at the filesystem level, and we have to
> replicate the same work for every filesystem that is going to
> support this on-disk functionality.
> 
> Using xattrs means the functionality is implemented once, it's
> common across all local filesystems, and no exportable filesystem
> needs to know anything about it as it's all self-contained in the
> NFS server code. THe code is smaller, easier to maintain, consistent
> across all systems, easy to test, etc.
> 
> It also can be implemented and rolled out *immediately* to all
> existing supported NFS server implementations, without having to
> wait months/years (or never!) for local filesystem on-disk format
> changes to roll out to production systems.
> 
> Asking individual filesystems to implement application specific
> persistent metadata is a *last resort* and should only be done if
> correctness or performance cannot be obtained in any other way.
> 
> So, yeah, the only sane direction to take here is to use xattrs to
> store this NFS application level information. It's less work for
> everyone, and in the long term it means when the NFS application
> requirements change again, we don't need to modify the on-disk
> format of multiple local filesystems.
> 
> > In any case, the main problem at this point is not so much in detecting
> > when there has been an unclean shutdown, but rather what to do when
> > there is one. We need to to advance the presented change attributes
> > beyond the largest possible one that may have been handed out prior to
> > the crash. 
> 
> Sure, but you're missing my point: by using xattrs for detection,
> you don't need to involve anything to do with local filesystems at
> all.
> 
> > How do we determine what that offset should be? Your last email
> > suggested that there really is no limit to the number of i_version bumps
> > that can happen in memory before one of them makes it to disk. What can
> > we do to address that?
> 
> <shrug>
> 
> I'm just pointing out problems I see when defining this as behaviour
> for on-disk format purposes. If we define it as part of the on-disk
> format, then we have to be concerned about how it may be used
> outside the scope of just the NFS server application. 
> 
> However, If NFS keeps this metadata and functionaly entirely
> contained at the application level via xattrs, I really don't care
> what algorithm NFS developers decides to use for their crash
> sequencing. It's not my concern at this point, and that's precisely
> why NFS should be using xattrs for this NFS specific functionality.
> 

I get it: you'd rather not have to deal with what you see as an NFS
problem, but I don't get how what you're proposing solves anything. We
might be able to use that scheme to detect crashes, but that's only part
of the problem (and it's a relatively simple part of the problem to
solve, really).

Maybe you can clarify it for me:

Suppose we go with what you're saying and store some information in
xattrs that allows us to detect crashes in some fashion. The server
crashes and comes back up and we detect that there was a crash earlier.

What does nfsd need to do now to ensure that it doesn't hand out a
duplicate change attribute? 

Until we can answer that question, detecting crashes doesn't matter.
Dave Chinner Sept. 21, 2022, 9:41 p.m. UTC | #86
On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote:
> On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote:
> > > How do we determine what that offset should be? Your last email
> > > suggested that there really is no limit to the number of i_version bumps
> > > that can happen in memory before one of them makes it to disk. What can
> > > we do to address that?
> > 
> > <shrug>
> > 
> > I'm just pointing out problems I see when defining this as behaviour
> > for on-disk format purposes. If we define it as part of the on-disk
> > format, then we have to be concerned about how it may be used
> > outside the scope of just the NFS server application. 
> > 
> > However, If NFS keeps this metadata and functionaly entirely
> > contained at the application level via xattrs, I really don't care
> > what algorithm NFS developers decides to use for their crash
> > sequencing. It's not my concern at this point, and that's precisely
> > why NFS should be using xattrs for this NFS specific functionality.
> > 
> 
> I get it: you'd rather not have to deal with what you see as an NFS
> problem, but I don't get how what you're proposing solves anything. We
> might be able to use that scheme to detect crashes, but that's only part
> of the problem (and it's a relatively simple part of the problem to
> solve, really).
> 
> Maybe you can clarify it for me:
> 
> Suppose we go with what you're saying and store some information in
> xattrs that allows us to detect crashes in some fashion. The server
> crashes and comes back up and we detect that there was a crash earlier.
> 
> What does nfsd need to do now to ensure that it doesn't hand out a
> duplicate change attribute? 

As I've already stated, the NFS server can hold the persistent NFS
crash counter value in a second xattr that it bumps whenever it
detects a crash and hence we take the local filesystem completely
out of the equation.  How the crash counter is then used by the nfsd
to fold it into the NFS protocol change attribute is an nfsd problem,
not a local filesystem problem.

If you're worried about maximum number of writes outstanding vs
i_version bumps that are held in memory, then *bound the maximum
number of uncommitted i_version changes that the NFS server will
allow to build up in memory*. By moving the crash counter to being an
NFS-server-only function, the NFS server controls the entire
algorithm and it doesn't have to care about external 3rd party
considerations like local filesystems have to.

e.g. The NFS server can track the i_version values when the NFSD
syncs/commits a given inode. The nfsd can sample i_version when it
calls ->commit_metadata or flushes data on the inode, and then when
it peeks at i_version when gathering post-op attrs (or any other
getattr op) it can decide that there is too much in-memory change
(e.g. 10,000 counts since last sync) and sync the inode.

i.e. the NFS server can trivially cap the maximum number of
uncommitted NFS change attr bumps it allows to build up in memory.
At that point, the NFS server has a bounded "maximum write count" that
can be used in conjunction with the xattr based crash counter to
determine how the change_attr is bumped by the crash counter.
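
As a rough sketch of that sampling, assuming nfsd keeps its own record
of the last value it committed for the inode (a hypothetical value, not
something that exists today):

    #include <linux/iversion.h>
    #include <linux/types.h>

    #define NFSD_MAX_IN_MEMORY_BUMPS	10000	/* example figure from above */

    /* call after ->commit_metadata()/data flush succeeds for the inode */
    static void nfsd_note_committed_version(struct inode *inode, u64 *committed)
    {
            *committed = inode_peek_iversion(inode);
    }

    /* call while gathering post-op attrs: does the inode need a sync first? */
    static bool nfsd_too_much_in_memory_change(struct inode *inode, u64 committed)
    {
            return inode_peek_iversion(inode) - committed > NFSD_MAX_IN_MEMORY_BUMPS;
    }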

-Dave.
Jeff Layton Sept. 22, 2022, 10:18 a.m. UTC | #87
On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote:
> On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote:
> > On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote:
> > > > How do we determine what that offset should be? Your last email
> > > > suggested that there really is no limit to the number of i_version bumps
> > > > that can happen in memory before one of them makes it to disk. What can
> > > > we do to address that?
> > > 
> > > <shrug>
> > > 
> > > I'm just pointing out problems I see when defining this as behaviour
> > > for on-disk format purposes. If we define it as part of the on-disk
> > > format, then we have to be concerned about how it may be used
> > > outside the scope of just the NFS server application. 
> > > 
> > > However, If NFS keeps this metadata and functionaly entirely
> > > contained at the application level via xattrs, I really don't care
> > > what algorithm NFS developers decides to use for their crash
> > > sequencing. It's not my concern at this point, and that's precisely
> > > why NFS should be using xattrs for this NFS specific functionality.
> > > 
> > 
> > I get it: you'd rather not have to deal with what you see as an NFS
> > problem, but I don't get how what you're proposing solves anything. We
> > might be able to use that scheme to detect crashes, but that's only part
> > of the problem (and it's a relatively simple part of the problem to
> > solve, really).
> > 
> > Maybe you can clarify it for me:
> > 
> > Suppose we go with what you're saying and store some information in
> > xattrs that allows us to detect crashes in some fashion. The server
> > crashes and comes back up and we detect that there was a crash earlier.
> > 
> > What does nfsd need to do now to ensure that it doesn't hand out a
> > duplicate change attribute? 
> 
> As I've already stated, the NFS server can hold the persistent NFS
> crash counter value in a second xattr that it bumps whenever it
> detects a crash and hence we take the local filesystem completely
> out of the equation.  How the crash counter is then used by the nfsd
> to fold it into the NFS protocol change attribute is a nfsd problem,
> not a local filesystem problem.
> 

OK, assuming you mean putting this in an xattr that lives at the root of the
export? We only need this for IS_I_VERSION filesystems (btrfs, xfs, and
ext4), and they all support xattrs so this scheme should work.

> If you're worried about maximum number of writes outstanding vs
> i_version bumps that are held in memory, then *bound the maximum
> number of uncommitted i_version changes that the NFS server will
> allow to build up in memory*. By moving the crash counter to being a
> NFS server only function, the NFS server controls the entire
> algorithm and it doesn't have to care about external 3rd party
> considerations like local filesystems have to.
> 

Yeah, this is the bigger consideration.

> e.g. The NFS server can track the i_version values when the NFSD
> syncs/commits a given inode. The nfsd can sample i_version it when
> calls ->commit_metadata or flushed data on the inode, and then when
> it peeks at i_version when gathering post-op attrs (or any other
> getattr op) it can decide that there is too much in-memory change
> (e.g. 10,000 counts since last sync) and sync the inode.
> 
> i.e. the NFS server can trivially cap the maximum number of
> uncommitted NFS change attr bumps it allows to build up in memory.
> At that point, the NFS server has a bound "maximum write count" that
> can be used in conjunction with the xattr based crash counter to
> determine how the change_attr is bumped by the crash counter.

Well, not "trivially". This is the bit where we have to grow struct
inode (or the fs-specific inode), as we'll need to know what the latest
on-disk value is for the inode.

I'm leaning toward doing this on the query side. Basically, when nfsd
goes to query the i_version, it'll check the delta between the current
version and the latest one on disk. If it's bigger than X then we'd just
return NFS4ERR_DELAY to the client.

If the delta is >X/2, maybe it can kick off a workqueue job or something
that calls write_inode with WB_SYNC_ALL to try to get the thing onto the
platter ASAP.
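
Roughly like the sketch below; the "X" figure, the on-disk version
tracking and the helper name are assumptions, and the flush is done
inline here rather than from a workqueue just to keep the example short:

    #include <linux/fs.h>		/* write_inode_now() */
    #include <linux/iversion.h>
    #include "nfsd.h"			/* nfserr_jukebox, nfs_ok (nfsd context assumed) */

    #define NFSD_IVERSION_DELTA_MAX	4096	/* the "X" above */

    static __be32 nfsd_check_change_attr_delta(struct inode *inode, u64 ondisk_version)
    {
            u64 delta = inode_peek_iversion(inode) - ondisk_version;

            if (delta > NFSD_IVERSION_DELTA_MAX)
                    return nfserr_jukebox;		/* NFS4ERR_DELAY */

            if (delta > NFSD_IVERSION_DELTA_MAX / 2)
                    write_inode_now(inode, 1);	/* WB_SYNC_ALL semantics */

            return nfs_ok;
    }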
Jeff Layton Sept. 22, 2022, 8:18 p.m. UTC | #88
On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote:
> On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote:
> > On Wed, Sep 21, 2022 at 06:33:28AM -0400, Jeff Layton wrote:
> > > On Wed, 2022-09-21 at 10:00 +1000, Dave Chinner wrote:
> > > > > How do we determine what that offset should be? Your last email
> > > > > suggested that there really is no limit to the number of i_version bumps
> > > > > that can happen in memory before one of them makes it to disk. What can
> > > > > we do to address that?
> > > > 
> > > > <shrug>
> > > > 
> > > > I'm just pointing out problems I see when defining this as behaviour
> > > > for on-disk format purposes. If we define it as part of the on-disk
> > > > format, then we have to be concerned about how it may be used
> > > > outside the scope of just the NFS server application. 
> > > > 
> > > > However, If NFS keeps this metadata and functionaly entirely
> > > > contained at the application level via xattrs, I really don't care
> > > > what algorithm NFS developers decides to use for their crash
> > > > sequencing. It's not my concern at this point, and that's precisely
> > > > why NFS should be using xattrs for this NFS specific functionality.
> > > > 
> > > 
> > > I get it: you'd rather not have to deal with what you see as an NFS
> > > problem, but I don't get how what you're proposing solves anything. We
> > > might be able to use that scheme to detect crashes, but that's only part
> > > of the problem (and it's a relatively simple part of the problem to
> > > solve, really).
> > > 
> > > Maybe you can clarify it for me:
> > > 
> > > Suppose we go with what you're saying and store some information in
> > > xattrs that allows us to detect crashes in some fashion. The server
> > > crashes and comes back up and we detect that there was a crash earlier.
> > > 
> > > What does nfsd need to do now to ensure that it doesn't hand out a
> > > duplicate change attribute? 
> > 
> > As I've already stated, the NFS server can hold the persistent NFS
> > crash counter value in a second xattr that it bumps whenever it
> > detects a crash and hence we take the local filesystem completely
> > out of the equation.  How the crash counter is then used by the nfsd
> > to fold it into the NFS protocol change attribute is a nfsd problem,
> > not a local filesystem problem.
> > 
> 
> Ok, assuming you mean put this in an xattr that lives at the root of the
> export? We only need this for IS_I_VERSION filesystems (btrfs, xfs, and
> ext4), and they all support xattrs so this scheme should work.
> 

I had a look at this today and it's not as straightforward as it
sounds. 

In particular, there is no guarantee that an export will not cross
filesystem boundaries. Also, nfsd and mountd are very much "demand
driven". We might not touch an exported filesystem at all if nothing
asks for it. Ensuring we can do something to every exported filesystem
after a crash is more difficult than it sounds.

So trying to do something with xattrs on the exported filesystems is
probably not what we want. It's also sort of janky since we do strive to
leave a "light footprint" on the exported filesystem.

Maybe we don't need that though. Chuck reminded me that nfsdcltrack
could be used here instead. We can punt this to userland!

nfsdcltrack could keep track of a global crash "salt", and feed that to
nfsd when it starts up. When starting a grace period, it can set a
RUNNING flag in the db. If it's set when the server starts, we know
there was a crash and can bump the crash counter. When nfsd is shutting
down cleanly, it can call sync() and then clear the flag (this may
require a new cld upcall cmd). We then mix that value into the change
attribute for IS_I_VERSION inodes.
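
Roughly, on the cltrack side (C-ish sketch; none of these db helpers exist
and the names are made up):

/* At nfsd startup, before the grace period begins: */
static uint64_t cltrack_get_crash_salt(void)
{
	uint64_t salt = db_get_u64("crash_salt");	/* hypothetical sqlite helpers */

	if (db_get_flag("running"))	/* still set => previous instance died uncleanly */
		db_put_u64("crash_salt", ++salt);
	db_set_flag("running", true);
	return salt;	/* handed to nfsd and mixed into the change attribute */
}

/* On clean shutdown (new cld upcall): nfsd syncs the exports first, then: */
static void cltrack_clean_shutdown(void)
{
	db_set_flag("running", false);
}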

That's probably good enough for nfsd, but if we wanted to present this
to userland via statx, we'd need a different mechanism. For now, I'm
going to plan to fix this up in nfsd and then we'll see where we are.

> > If you're worried about maximum number of writes outstanding vs
> > i_version bumps that are held in memory, then *bound the maximum
> > number of uncommitted i_version changes that the NFS server will
> > allow to build up in memory*. By moving the crash counter to being a
> > NFS server only function, the NFS server controls the entire
> > algorithm and it doesn't have to care about external 3rd party
> > considerations like local filesystems have to.
> > 
> 
> Yeah, this is the bigger consideration.
> 
> > e.g. The NFS server can track the i_version values when the NFSD
> > syncs/commits a given inode. The nfsd can sample i_version it when
> > calls ->commit_metadata or flushed data on the inode, and then when
> > it peeks at i_version when gathering post-op attrs (or any other
> > getattr op) it can decide that there is too much in-memory change
> > (e.g. 10,000 counts since last sync) and sync the inode.
> > 
> > i.e. the NFS server can trivially cap the maximum number of
> > uncommitted NFS change attr bumps it allows to build up in memory.
> > At that point, the NFS server has a bound "maximum write count" that
> > can be used in conjunction with the xattr based crash counter to
> > determine how the change_attr is bumped by the crash counter.
> 
> Well, not "trivially". This is the bit where we have to grow struct
> inode (or the fs-specific inode), as we'll need to know what the latest
> on-disk value is for the inode.
> 
> I'm leaning toward doing this on the query side. Basically, when nfsd
> goes to query the i_version, it'll check the delta between the current
> version and the latest one on disk. If it's bigger than X then we'd just
> return NFS4ERR_DELAY to the client.
> 
> If the delta is >X/2, maybe it can kick off a workqueue job or something
> that calls write_inode with WB_SYNC_ALL to try to get the thing onto the
> platter ASAP.

Still looking at this bit too. Probably we can just kick off a
WB_SYNC_NONE filemap_fdatawrite at that point and hope for the best?
Jan Kara Sept. 23, 2022, 9:56 a.m. UTC | #89
On Thu 22-09-22 16:18:02, Jeff Layton wrote:
> On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote:
> > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote:
> > > e.g. The NFS server can track the i_version values when the NFSD
> > > syncs/commits a given inode. The nfsd can sample i_version it when
> > > calls ->commit_metadata or flushed data on the inode, and then when
> > > it peeks at i_version when gathering post-op attrs (or any other
> > > getattr op) it can decide that there is too much in-memory change
> > > (e.g. 10,000 counts since last sync) and sync the inode.
> > > 
> > > i.e. the NFS server can trivially cap the maximum number of
> > > uncommitted NFS change attr bumps it allows to build up in memory.
> > > At that point, the NFS server has a bound "maximum write count" that
> > > can be used in conjunction with the xattr based crash counter to
> > > determine how the change_attr is bumped by the crash counter.
> > 
> > Well, not "trivially". This is the bit where we have to grow struct
> > inode (or the fs-specific inode), as we'll need to know what the latest
> > on-disk value is for the inode.
> > 
> > I'm leaning toward doing this on the query side. Basically, when nfsd
> > goes to query the i_version, it'll check the delta between the current
> > version and the latest one on disk. If it's bigger than X then we'd just
> > return NFS4ERR_DELAY to the client.
> > 
> > If the delta is >X/2, maybe it can kick off a workqueue job or something
> > that calls write_inode with WB_SYNC_ALL to try to get the thing onto the
> > platter ASAP.
> 
> Still looking at this bit too. Probably we can just kick off a
> WB_SYNC_NONE filemap_fdatawrite at that point and hope for the best?

"Hope" is not a great assurance regarding data integrity ;) Anyway, it
depends on how you imagine the "i_version on disk" is going to be
maintained. It could be maintained by NFSD inside commit_inode_metadata() -
fetch current i_version value before asking filesystem for the sync and by the
time commit_metadata() returns we know that value is on disk. If we detect the
current - on_disk is > X/2, we call commit_inode_metadata() and we are
done. It is not even *that* expensive because usually filesystems optimize
away unnecessary IO when the inode didn't change since last time it got
synced.
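
Something like this, completely untested ("on_disk_version" standing in for
wherever nfsd decides to stash the last committed value):

static int nfsd_commit_and_track(struct inode *inode)
{
	u64 sampled = inode_query_iversion(inode);	/* sample before the sync */
	int err = sync_inode_metadata(inode, 1);	/* or the ->commit_metadata path */

	if (!err)
		WRITE_ONCE(inode->on_disk_version, sampled);	/* hypothetical field */
	return err;
}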

								Honza
Jeff Layton Sept. 23, 2022, 10:19 a.m. UTC | #90
On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote:
> On Thu 22-09-22 16:18:02, Jeff Layton wrote:
> > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote:
> > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote:
> > > > e.g. The NFS server can track the i_version values when the NFSD
> > > > syncs/commits a given inode. The nfsd can sample i_version it when
> > > > calls ->commit_metadata or flushed data on the inode, and then when
> > > > it peeks at i_version when gathering post-op attrs (or any other
> > > > getattr op) it can decide that there is too much in-memory change
> > > > (e.g. 10,000 counts since last sync) and sync the inode.
> > > > 
> > > > i.e. the NFS server can trivially cap the maximum number of
> > > > uncommitted NFS change attr bumps it allows to build up in memory.
> > > > At that point, the NFS server has a bound "maximum write count" that
> > > > can be used in conjunction with the xattr based crash counter to
> > > > determine how the change_attr is bumped by the crash counter.
> > > 
> > > Well, not "trivially". This is the bit where we have to grow struct
> > > inode (or the fs-specific inode), as we'll need to know what the latest
> > > on-disk value is for the inode.
> > > 
> > > I'm leaning toward doing this on the query side. Basically, when nfsd
> > > goes to query the i_version, it'll check the delta between the current
> > > version and the latest one on disk. If it's bigger than X then we'd just
> > > return NFS4ERR_DELAY to the client.
> > > 
> > > If the delta is >X/2, maybe it can kick off a workqueue job or something
> > > that calls write_inode with WB_SYNC_ALL to try to get the thing onto the
> > > platter ASAP.
> > 
> > Still looking at this bit too. Probably we can just kick off a
> > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the best?
> 
> "Hope" is not a great assurance regarding data integrity ;) 

By "hoping for the best", I meant hoping that we never have to take the
the drastic action of returning NFS4ERR_DELAY on GETATTR operations. We
definitely don't want to jeopardize data integrity. 

> Anyway, it
> depends on how you imagine the "i_version on disk" is going to be
> maintained. It could be maintained by NFSD inside commit_inode_metadata() -
> fetch current i_version value before asking filesystem for the sync and by the
> time commit_metadata() returns we know that value is on disk. If we detect the
> current - on_disk is > X/2, we call commit_inode_metadata() and we are
> done. It is not even *that* expensive because usually filesystems optimize
> away unnecessary IO when the inode didn't change since last time it got
> synced.
> 

At >X/2 we don't really want to start blocking or anything. I'd prefer
if we could kick off something in the background for this, but if it's
not too expensive then maybe just calling commit_inode_metadata
synchronously in this codepath is OK. Alternatively, we could consider
doing that in a workqueue job too.

I need to do a bit more research here, but I think we have some options.
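
For the background option, something like this might do (untested sketch;
ignores whether we want a dedicated workqueue and whether taking an inode
reference here is always safe):

struct nfsd_iver_flush {
	struct work_struct	work;
	struct inode		*inode;
};

static void nfsd_iver_flush_workfn(struct work_struct *work)
{
	struct nfsd_iver_flush *f = container_of(work, struct nfsd_iver_flush, work);

	write_inode_now(f->inode, 1);	/* WB_SYNC_ALL writeback of the inode */
	iput(f->inode);
	kfree(f);
}

static void nfsd_queue_iversion_flush(struct inode *inode)
{
	struct nfsd_iver_flush *f = kzalloc(sizeof(*f), GFP_KERNEL);

	if (!f)
		return;
	ihold(inode);
	f->inode = inode;
	INIT_WORK(&f->work, nfsd_iver_flush_workfn);
	schedule_work(&f->work);
}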
Trond Myklebust Sept. 23, 2022, 1:44 p.m. UTC | #91
On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote:
> On Thu 22-09-22 16:18:02, Jeff Layton wrote:
> > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote:
> > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote:
> > > > e.g. The NFS server can track the i_version values when the
> > > > NFSD
> > > > syncs/commits a given inode. The nfsd can sample i_version it
> > > > when
> > > > calls ->commit_metadata or flushed data on the inode, and then
> > > > when
> > > > it peeks at i_version when gathering post-op attrs (or any
> > > > other
> > > > getattr op) it can decide that there is too much in-memory
> > > > change
> > > > (e.g. 10,000 counts since last sync) and sync the inode.
> > > > 
> > > > i.e. the NFS server can trivially cap the maximum number of
> > > > uncommitted NFS change attr bumps it allows to build up in
> > > > memory.
> > > > At that point, the NFS server has a bound "maximum write count"
> > > > that
> > > > can be used in conjunction with the xattr based crash counter
> > > > to
> > > > determine how the change_attr is bumped by the crash counter.
> > > 
> > > Well, not "trivially". This is the bit where we have to grow
> > > struct
> > > inode (or the fs-specific inode), as we'll need to know what the
> > > latest
> > > on-disk value is for the inode.
> > > 
> > > I'm leaning toward doing this on the query side. Basically, when
> > > nfsd
> > > goes to query the i_version, it'll check the delta between the
> > > current
> > > version and the latest one on disk. If it's bigger than X then
> > > we'd just
> > > return NFS4ERR_DELAY to the client.
> > > 
> > > If the delta is >X/2, maybe it can kick off a workqueue job or
> > > something
> > > that calls write_inode with WB_SYNC_ALL to try to get the thing
> > > onto the
> > > platter ASAP.
> > 
> > Still looking at this bit too. Probably we can just kick off a
> > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the
> > best?
> 
> "Hope" is not a great assurance regarding data integrity ;) Anyway,
> it
> depends on how you imagine the "i_version on disk" is going to be
> maintained. It could be maintained by NFSD inside
> commit_inode_metadata() -
> fetch current i_version value before asking filesystem for the sync
> and by the
> time commit_metadata() returns we know that value is on disk. If we
> detect the
> current - on_disk is > X/2, we call commit_inode_metadata() and we
> are
> done. It is not even *that* expensive because usually filesystems
> optimize
> away unnecessary IO when the inode didn't change since last time it
> got
> synced.
> 

Note that these approaches requiring 3rd party help in order to track
i_version integrity across filesystem crashes all make the idea of
adding i_version to statx() a no-go.

It is one thing for knfsd to add specialised machinery for integrity
checking, but if all applications need to do so, then they are highly
unlikely to want to adopt this attribute.
Jeff Layton Sept. 23, 2022, 1:50 p.m. UTC | #92
On Fri, 2022-09-23 at 13:44 +0000, Trond Myklebust wrote:
> On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote:
> > On Thu 22-09-22 16:18:02, Jeff Layton wrote:
> > > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote:
> > > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote:
> > > > > e.g. The NFS server can track the i_version values when the
> > > > > NFSD
> > > > > syncs/commits a given inode. The nfsd can sample i_version it
> > > > > when
> > > > > calls ->commit_metadata or flushed data on the inode, and then
> > > > > when
> > > > > it peeks at i_version when gathering post-op attrs (or any
> > > > > other
> > > > > getattr op) it can decide that there is too much in-memory
> > > > > change
> > > > > (e.g. 10,000 counts since last sync) and sync the inode.
> > > > > 
> > > > > i.e. the NFS server can trivially cap the maximum number of
> > > > > uncommitted NFS change attr bumps it allows to build up in
> > > > > memory.
> > > > > At that point, the NFS server has a bound "maximum write count"
> > > > > that
> > > > > can be used in conjunction with the xattr based crash counter
> > > > > to
> > > > > determine how the change_attr is bumped by the crash counter.
> > > > 
> > > > Well, not "trivially". This is the bit where we have to grow
> > > > struct
> > > > inode (or the fs-specific inode), as we'll need to know what the
> > > > latest
> > > > on-disk value is for the inode.
> > > > 
> > > > I'm leaning toward doing this on the query side. Basically, when
> > > > nfsd
> > > > goes to query the i_version, it'll check the delta between the
> > > > current
> > > > version and the latest one on disk. If it's bigger than X then
> > > > we'd just
> > > > return NFS4ERR_DELAY to the client.
> > > > 
> > > > If the delta is >X/2, maybe it can kick off a workqueue job or
> > > > something
> > > > that calls write_inode with WB_SYNC_ALL to try to get the thing
> > > > onto the
> > > > platter ASAP.
> > > 
> > > Still looking at this bit too. Probably we can just kick off a
> > > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the
> > > best?
> > 
> > "Hope" is not a great assurance regarding data integrity ;) Anyway,
> > it
> > depends on how you imagine the "i_version on disk" is going to be
> > maintained. It could be maintained by NFSD inside
> > commit_inode_metadata() -
> > fetch current i_version value before asking filesystem for the sync
> > and by the
> > time commit_metadata() returns we know that value is on disk. If we
> > detect the
> > current - on_disk is > X/2, we call commit_inode_metadata() and we
> > are
> > done. It is not even *that* expensive because usually filesystems
> > optimize
> > away unnecessary IO when the inode didn't change since last time it
> > got
> > synced.
> > 
> 
> Note that these approaches requiring 3rd party help in order to track
> i_version integrity across filesystem crashes all make the idea of
> adding i_version to statx() a no-go.
> 
> It is one thing for knfsd to add specialised machinery for integrity
> checking, but if all applications need to do so, then they are highly
> unlikely to want to adopt this attribute.
> 
> 

Absolutely. That is the downside of this approach, but the priority here
has always been to improve nfsd. If we don't get the ability to present
this info via statx, then so be it. Later on, I suppose we can move that
handling into the kernel in some fashion if we decide it's worthwhile.

That said, not having this in statx makes it more difficult to test
i_version behavior. Maybe we can add a generic ioctl for that in the
interim?
Frank Filz Sept. 23, 2022, 2:58 p.m. UTC | #93
> -----Original Message-----
> From: Jeff Layton [mailto:jlayton@kernel.org]
> Sent: Friday, September 23, 2022 6:50 AM
> To: Trond Myklebust <trondmy@hammerspace.com>; jack@suse.cz
> Cc: zohar@linux.ibm.com; djwong@kernel.org; brauner@kernel.org; linux-
> xfs@vger.kernel.org; bfields@fieldses.org; linux-api@vger.kernel.org;
> neilb@suse.de; david@fromorbit.com; fweimer@redhat.com; linux-
> kernel@vger.kernel.org; chuck.lever@oracle.com; linux-man@vger.kernel.org;
> linux-nfs@vger.kernel.org; linux-ext4@vger.kernel.org; tytso@mit.edu;
> viro@zeniv.linux.org.uk; xiubli@redhat.com; linux-fsdevel@vger.kernel.org;
> adilger.kernel@dilger.ca; lczerner@redhat.com; ceph-devel@vger.kernel.org;
> linux-btrfs@vger.kernel.org
> Subject: Re: [man-pages RFC PATCH v4] statx, inode: document the new
> STATX_INO_VERSION field
> 
> On Fri, 2022-09-23 at 13:44 +0000, Trond Myklebust wrote:
> > On Fri, 2022-09-23 at 11:56 +0200, Jan Kara wrote:
> > > On Thu 22-09-22 16:18:02, Jeff Layton wrote:
> > > > On Thu, 2022-09-22 at 06:18 -0400, Jeff Layton wrote:
> > > > > On Thu, 2022-09-22 at 07:41 +1000, Dave Chinner wrote:
> > > > > > e.g. The NFS server can track the i_version values when the
> > > > > > NFSD syncs/commits a given inode. The nfsd can sample
> > > > > > i_version it when calls ->commit_metadata or flushed data on
> > > > > > the inode, and then when it peeks at i_version when gathering
> > > > > > post-op attrs (or any other getattr op) it can decide that
> > > > > > there is too much in-memory change (e.g. 10,000 counts since
> > > > > > last sync) and sync the inode.
> > > > > >
> > > > > > i.e. the NFS server can trivially cap the maximum number of
> > > > > > uncommitted NFS change attr bumps it allows to build up in
> > > > > > memory.
> > > > > > At that point, the NFS server has a bound "maximum write count"
> > > > > > that
> > > > > > can be used in conjunction with the xattr based crash counter
> > > > > > to determine how the change_attr is bumped by the crash
> > > > > > counter.
> > > > >
> > > > > Well, not "trivially". This is the bit where we have to grow
> > > > > struct inode (or the fs-specific inode), as we'll need to know
> > > > > what the latest on-disk value is for the inode.
> > > > >
> > > > > I'm leaning toward doing this on the query side. Basically, when
> > > > > nfsd goes to query the i_version, it'll check the delta between
> > > > > the current version and the latest one on disk. If it's bigger
> > > > > than X then we'd just return NFS4ERR_DELAY to the client.
> > > > >
> > > > > If the delta is >X/2, maybe it can kick off a workqueue job or
> > > > > something that calls write_inode with WB_SYNC_ALL to try to get
> > > > > the thing onto the platter ASAP.
> > > >
> > > > Still looking at this bit too. Probably we can just kick off a
> > > > WB_SYNC_NONE filemap_fdatawrite at that point and hope for the
> > > > best?
> > >
> > > "Hope" is not a great assurance regarding data integrity ;) Anyway,
> > > it depends on how you imagine the "i_version on disk" is going to be
> > > maintained. It could be maintained by NFSD inside
> > > commit_inode_metadata() -
> > > fetch current i_version value before asking filesystem for the sync
> > > and by the time commit_metadata() returns we know that value is on
> > > disk. If we detect the current - on_disk is > X/2, we call
> > > commit_inode_metadata() and we are done. It is not even *that*
> > > expensive because usually filesystems optimize away unnecessary IO
> > > when the inode didn't change since last time it got synced.
> > >
> >
> > Note that these approaches requiring 3rd party help in order to track
> > i_version integrity across filesystem crashes all make the idea of
> > adding i_version to statx() a no-go.
> >
> > It is one thing for knfsd to add specialised machinery for integrity
> > checking, but if all applications need to do so, then they are highly
> > unlikely to want to adopt this attribute.
> >
> >
> 
> Absolutely. That is the downside of this approach, but the priority here
> has always been to improve nfsd. If we don't get the ability to present
> this info via statx, then so be it. Later on, I suppose we can move that
> handling into the kernel in some fashion if we decide it's worthwhile.
> 
> That said, not having this in statx makes it more difficult to test
> i_version behavior. Maybe we can add a generic ioctl for that in the
> interim?

Having i_version in statx would be really nice for nfs-ganesha. I would
consider doing the extra integrity work, or in some cases we may be able to
rely on the filesystem; in any case, i_version would be an improvement
over using ctime or mtime as a change attribute.

Frank
NeilBrown Sept. 26, 2022, 10:43 p.m. UTC | #94
On Fri, 23 Sep 2022, Jeff Layton wrote:
> 
> Absolutely. That is the downside of this approach, but the priority here
> has always been to improve nfsd. If we don't get the ability to present
> this info via statx, then so be it. Later on, I suppose we can move that
> handling into the kernel in some fashion if we decide it's worthwhile.
> 
> That said, not having this in statx makes it more difficult to test
> i_version behavior. Maybe we can add a generic ioctl for that in the
> interim?

I wonder if we are over-thinking this, trying too hard, making "perfect"
the enemy of "good".
While we agree that the current implementation of i_version is
imperfect, it isn't causing major data corruption all around the world.
I don't think there are even any known bug reports, are there?
So while we do want to fix it as best we can, we don't need to make that
the first priority.

I think the first priority should be to document how we want it to work,
which is what this thread is really all about.  The documentation can
note that some (all) filesystems do not provide perfect semantics across
unclean restarts, and can list any other anomalies that we are aware of.
And on that basis we can export the current i_version to user-space via
statx and start trying to write some test code.

We can then look at moving the i_version/ctime update from *before* the
write to *after* the write, and any other improvements that can be
achieved easily in common code.  We can then update the man page to say
"since Linux 6.42, this list of anomalies is no longer present".

Then we can explore some options for handling unclean restart - in a
context where we can write tests and maybe even demonstrate a concrete
problem before we start trying to fix it.

NeilBrown
Jeff Layton Sept. 27, 2022, 11:14 a.m. UTC | #95
On Tue, 2022-09-27 at 08:43 +1000, NeilBrown wrote:
> On Fri, 23 Sep 2022, Jeff Layton wrote:
> > 
> > Absolutely. That is the downside of this approach, but the priority here
> > has always been to improve nfsd. If we don't get the ability to present
> > this info via statx, then so be it. Later on, I suppose we can move that
> > handling into the kernel in some fashion if we decide it's worthwhile.
> > 
> > That said, not having this in statx makes it more difficult to test
> > i_version behavior. Maybe we can add a generic ioctl for that in the
> > interim?
> 
> I wonder if we are over-thinking this, trying too hard, making "perfect"
> the enemy of "good".

I tend to think we are.

> While we agree that the current implementation of i_version is
> imperfect, it isn't causing major data corruption all around the world.
> I don't think there are even any known bug reports are there?
> So while we do want to fix it as best we can, we don't need to make that
> the first priority.
> 

I'm not aware of any bug reports aside from the issue of atime updates
affecting the change attribute, but the effects of misbehavior here can
be very subtle.


> I think the first priority should be to document how we want it to work,
> which is what this thread is really all about.  The documentation can
> note that some (all) filesystems do not provide perfect semantics across
> unclean restarts, and can list any other anomalies that we are aware of.
> And on that basis we can export the current i_version to user-space via
> statx and start trying to write some test code.
> 
> We can then look at moving the i_version/ctime update from *before* the
> write to *after* the write, and any other improvements that can be
> achieved easily in common code.  We can then update the man page to say
> "since Linux 6.42, this list of anomalies is no longer present".
> 

I have a patch for this for ext4, and started looking at the same for
btrfs and xfs.

> Then we can explore some options for handling unclean restart - in a
> context where we can write tests and maybe even demonstrate a concrete
> problem before we start trying to fix it.
> 

I think too that we need to recognize that there are multiple distinct
issues around i_version handling:

1/ atime updates affecting i_version in ext4 and xfs, which harms
performance

2/ ext4 should enable the change attribute by default

3/ we currently mix the ctime into the change attr for directories,
which is unnecessary.

4/ we'd like to be able to report NFS4_CHANGE_TYPE_IS_MONOTONIC_INCR
from nfsd, but the change attr on regular files can appear to go
backward after a crash+clock jump.

5/ testing i_version behavior is very difficult since there is no way to
query it from userland.

We can work on the first three without having to solve the last two
right away.
Jeff Layton Sept. 27, 2022, 1:18 p.m. UTC | #96
On Tue, 2022-09-27 at 08:43 +1000, NeilBrown wrote:
> On Fri, 23 Sep 2022, Jeff Layton wrote:
> > 
> > Absolutely. That is the downside of this approach, but the priority here
> > has always been to improve nfsd. If we don't get the ability to present
> > this info via statx, then so be it. Later on, I suppose we can move that
> > handling into the kernel in some fashion if we decide it's worthwhile.
> > 
> > That said, not having this in statx makes it more difficult to test
> > i_version behavior. Maybe we can add a generic ioctl for that in the
> > interim?
> 
> I wonder if we are over-thinking this, trying too hard, making "perfect"
> the enemy of "good".
> While we agree that the current implementation of i_version is
> imperfect, it isn't causing major data corruption all around the world.
> I don't think there are even any known bug reports are there?
> So while we do want to fix it as best we can, we don't need to make that
> the first priority.
> 
> I think the first priority should be to document how we want it to work,
> which is what this thread is really all about.  The documentation can
> note that some (all) filesystems do not provide perfect semantics across
> unclean restarts, and can list any other anomalies that we are aware of.
> And on that basis we can export the current i_version to user-space via
> statx and start trying to write some test code.
> 
> We can then look at moving the i_version/ctime update from *before* the
> write to *after* the write, and any other improvements that can be
> achieved easily in common code.  We can then update the man page to say
> "since Linux 6.42, this list of anomalies is no longer present".
> 
> Then we can explore some options for handling unclean restart - in a
> context where we can write tests and maybe even demonstrate a concrete
> problem before we start trying to fix it.
> 

We can also argue that crash resilience isn't a hard requirement for all
possible applications. We'll definitely need some sort of mitigation for
nfsd so we can claim that it's MONOTONIC [1], but local applications may
not care whether the value rolls backward after a crash, since they
would have presumably crashed as well and may not be persisting values.

IOW, I think I agree with Dave C. that crash resilience for regular
files is best handled at the application level (with the first
application being knfsd). RFC 7862 requires that the change_attr_type be
homogeneous across the entire filesystem, so we don't have the option of
deciding that on a per-inode basis. If we want to advertise it, we have to
ensure that all inode types conform.

I think for nfsd, a crash counter tracked in userland by nfsdcld
multiplied by some large number of reasonable version bumps in a jiffy
would work well and allow us to go back to advertising the value as
MONOTONIC. That's a bit of a project though and may take a while.
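
i.e. something along these lines, where the stride is just a placeholder
that has to be larger than whatever bound we enforce on uncommitted bumps,
and nfsd_crash_counter is the (hypothetical) value handed up by nfsdcld:

#define NFSD_CHANGE_ATTR_STRIDE	(1ULL << 32)	/* placeholder */

static u64 nfsd_change_attr(struct inode *inode)
{
	return nfsd_crash_counter * NFSD_CHANGE_ATTR_STRIDE +
	       inode_query_iversion(inode);
}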

For presentation via statx, maybe we can create a
STATX_ATTR_VERSION_MONOTONIC bit for stx_attributes for when the
filesystem can provide that sort of guarantee. I may just add that
internally for now anyway, since that would make for nicer layering.
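
For reference, consuming this from userland against the draft interface in
the patch below would look something like this (STATX_INO_VERSION,
stx_ino_version and STATX_ATTR_VERSION_MONOTONIC are all proposed names
that don't exist in any released headers yet):

#define _GNU_SOURCE
#include <stdio.h>
#include <fcntl.h>		/* AT_FDCWD */
#include <sys/stat.h>		/* statx(2) wrapper, glibc >= 2.28 */

int main(int argc, char **argv)
{
	struct statx stx;

	if (argc < 2 || statx(AT_FDCWD, argv[1], 0, STATX_INO_VERSION, &stx) < 0)
		return 1;
	if (!(stx.stx_mask & STATX_INO_VERSION))
		return 2;	/* filesystem doesn't support i_version */
	printf("change attr: %llu\n",
	       (unsigned long long)stx.stx_ino_version);
	if (stx.stx_attributes & STATX_ATTR_VERSION_MONOTONIC)
		printf("monotonic across crashes\n");
	return 0;
}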

[1]: https://datatracker.ietf.org/doc/html/rfc7862#section-12.2.3
diff mbox series

Patch

diff --git a/man2/statx.2 b/man2/statx.2
index 0d1b4591f74c..d98d5148a442 100644
--- a/man2/statx.2
+++ b/man2/statx.2
@@ -62,6 +62,7 @@  struct statx {
     __u32 stx_dev_major;   /* Major ID */
     __u32 stx_dev_minor;   /* Minor ID */
     __u64 stx_mnt_id;      /* Mount ID */
+    __u64 stx_ino_version; /* Inode change attribute */
 };
 .EE
 .in
@@ -247,6 +248,7 @@  STATX_BTIME	Want stx_btime
 STATX_ALL	The same as STATX_BASIC_STATS | STATX_BTIME.
 	It is deprecated and should not be used.
 STATX_MNT_ID	Want stx_mnt_id (since Linux 5.8)
+STATX_INO_VERSION	Want stx_ino_version (DRAFT)
 .TE
 .in
 .PP
@@ -407,10 +409,16 @@  This is the same number reported by
 .BR name_to_handle_at (2)
 and corresponds to the number in the first field in one of the records in
 .IR /proc/self/mountinfo .
+.TP
+.I stx_ino_version
+The inode version, also known as the inode change attribute. See
+.BR inode (7)
+for details.
 .PP
 For further information on the above fields, see
 .BR inode (7).
 .\"
+.TP
 .SS File attributes
 The
 .I stx_attributes
diff --git a/man7/inode.7 b/man7/inode.7
index 9b255a890720..8e83836594d8 100644
--- a/man7/inode.7
+++ b/man7/inode.7
@@ -184,6 +184,12 @@  Last status change timestamp (ctime)
 This is the file's last status change timestamp.
 It is changed by writing or by setting inode information
 (i.e., owner, group, link count, mode, etc.).
+.TP
+Inode version (i_version)
+(not returned in the \fIstat\fP structure); \fIstatx.stx_ino_version\fP
+.IP
+This is the inode change counter. See the discussion of
+\fBthe inode version counter\fP, below.
 .PP
 The timestamp fields report time measured with a zero point at the
 .IR Epoch ,
@@ -424,6 +430,39 @@  on a directory means that a file
 in that directory can be renamed or deleted only by the owner
 of the file, by the owner of the directory, and by a privileged
 process.
+.SS The inode version counter
+.PP
+The
+.I statx.stx_ino_version
+field is the inode change counter. Any operation that would result in a
+change to \fIstatx.stx_ctime\fP must result in an increase to this value.
+The value must increase even in the case where the ctime change is not
+evident due to coarse timestamp granularity.
+.PP
+An observer cannot infer anything about the nature or magnitude of the
+change from the amount of the increase. If the returned value is different
+from the last time it was checked, then something has made an explicit
+data and/or metadata change to the inode.
+.PP
+The change to \fIstatx.stx_ino_version\fP is not atomic with respect to the
+other changes in the inode. On a write, for instance, the i_version is usually
+incremented before the data is copied into the pagecache. Therefore it is
+possible to see a new i_version value while a read still shows the old data.
+.PP
+In the event of a system crash, this value can appear to go backward,
+if it were queried before ever being written to the backing store. If
+the value were then incremented again after restart, then an observer
+could miss noticing a change.
+.PP
+In order to guard against this, it is recommended to also watch the
+\fIstatx.stx_ctime\fP for changes when watching this value. As long as the
+system clock doesn't jump backward during the crash, an observer can be
+reasonably sure that the i_version and ctime together represent a unique inode
+state.
+.PP
+The i_version is a Linux extension and is not supported by all filesystems.
+The application must verify that the \fISTATX_INO_VERSION\fP bit is set in the
+returned \fIstatx.stx_mask\fP before relying on this field.
 .SH STANDARDS
 If you need to obtain the definition of the
 .I blkcnt_t