mbox series

[00/13] Allow readpage to return a locked page

Message ID 20200917151050.5363-1-willy@infradead.org
Headers show
Series Allow readpage to return a locked page | expand

Message

Matthew Wilcox Sept. 17, 2020, 3:10 p.m. UTC
Linus recently made the page lock more fair.  That means that the old
pattern where we returned from ->readpage with the page unlocked and
then attempted to re-lock it will send us to the back of the queue for
this page's lock.

Ideally all filesystems would return from ->readpage with the
page Uptodate and Locked, but it's a bit painful to convert all the
asynchronous readpage implementations to synchronous.  These ones are
already synchronous, so convert them while I work on iomap.

A further benefit is that a synchronous readpage implementation allows
us to return an error to someone who might actually care about it.
There's no need to SetPageError, but I don't want to learn about how
a dozen filesystems handle I/O errors (hint: they're all different),
so I have not attempted to change that.

Please review your filesystem carefully.  I've tried to catch all the
places where a filesystem calls its own internal readpage implementation
without going through ->readpage, but I may have missed some.

Matthew Wilcox (Oracle) (13):
  mm: Add AOP_UPDATED_PAGE return value
  9p: Tell the VFS that readpage was synchronous
  afs: Tell the VFS that readpage was synchronous
  ceph: Tell the VFS that readpage was synchronous
  cifs: Tell the VFS that readpage was synchronous
  cramfs: Tell the VFS that readpage was synchronous
  ecryptfs: Tell the VFS that readpage was synchronous
  fuse: Tell the VFS that readpage was synchronous
  hostfs: Tell the VFS that readpage was synchronous
  jffs2: Tell the VFS that readpage was synchronous
  ubifs: Tell the VFS that readpage was synchronous
  udf: Tell the VFS that readpage was synchronous
  vboxsf: Tell the VFS that readpage was synchronous

 Documentation/filesystems/locking.rst |  7 ++++---
 Documentation/filesystems/vfs.rst     | 21 ++++++++++++++-------
 fs/9p/vfs_addr.c                      |  6 +++++-
 fs/afs/file.c                         |  3 ++-
 fs/ceph/addr.c                        |  9 +++++----
 fs/cifs/file.c                        |  8 ++++++--
 fs/cramfs/inode.c                     |  5 ++---
 fs/ecryptfs/mmap.c                    | 11 ++++++-----
 fs/fuse/file.c                        |  2 ++
 fs/hostfs/hostfs_kern.c               |  2 ++
 fs/jffs2/file.c                       |  6 ++++--
 fs/ubifs/file.c                       | 16 ++++++++++------
 fs/udf/file.c                         |  3 +--
 fs/vboxsf/file.c                      |  2 ++
 include/linux/fs.h                    |  5 +++++
 mm/filemap.c                          | 12 ++++++++++--
 16 files changed, 80 insertions(+), 38 deletions(-)

Comments

Matthew Wilcox Sept. 17, 2020, 10:03 p.m. UTC | #1
On Thu, Sep 17, 2020 at 04:10:38PM +0100, Matthew Wilcox (Oracle) wrote:
> +++ b/mm/filemap.c

> @@ -2254,8 +2254,10 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,

>  		 * PG_error will be set again if readpage fails.

>  		 */

>  		ClearPageError(page);

> -		/* Start the actual read. The read will unlock the page. */

> +		/* Start the actual read. The read may unlock the page. */

>  		error = mapping->a_ops->readpage(filp, page);

> +		if (error == AOP_UPDATED_PAGE)

> +			goto page_ok;

>  

>  		if (unlikely(error)) {

>  			if (error == AOP_TRUNCATED_PAGE) {


If anybody wants to actually test this, this hunk is wrong.

+++ b/mm/filemap.c
@@ -2256,8 +2256,11 @@ ssize_t generic_file_buffered_read(struct kiocb *iocb,
                ClearPageError(page);
                /* Start the actual read. The read may unlock the page. */
                error = mapping->a_ops->readpage(filp, page);
-               if (error == AOP_UPDATED_PAGE)
+               if (error == AOP_UPDATED_PAGE) {
+                       unlock_page(page);
+                       error = 0;
                        goto page_ok;
+               }
 
                if (unlikely(error)) {
                        if (error == AOP_TRUNCATED_PAGE) {

> @@ -2619,7 +2621,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)

>  	 */

>  	if (unlikely(!PageUptodate(page)))

>  		goto page_not_uptodate;

> -

> +page_ok:

>  	/*

>  	 * We've made it this far and we had to drop our mmap_lock, now is the

>  	 * time to return to the upper layer and have it re-find the vma and

> @@ -2654,6 +2656,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)

>  	ClearPageError(page);

>  	fpin = maybe_unlock_mmap_for_io(vmf, fpin);

>  	error = mapping->a_ops->readpage(file, page);

> +	if (error == AOP_UPDATED_PAGE)

> +		goto page_ok;

>  	if (!error) {

>  		wait_on_page_locked(page);

>  		if (!PageUptodate(page))

> @@ -2867,6 +2871,10 @@ static struct page *do_read_cache_page(struct address_space *mapping,

>  			err = filler(data, page);

>  		else

>  			err = mapping->a_ops->readpage(data, page);

> +		if (err == AOP_UPDATED_PAGE) {

> +			unlock_page(page);

> +			goto out;

> +		}

>  

>  		if (err < 0) {

>  			put_page(page);

> -- 

> 2.28.0

>
Dominique Martinet Sept. 18, 2020, 5:59 a.m. UTC | #2
Matthew Wilcox (Oracle) wrote on Thu, Sep 17, 2020:
> diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c

> index cce9ace651a2..506ca0ba2ec7 100644

> --- a/fs/9p/vfs_addr.c

> +++ b/fs/9p/vfs_addr.c

> @@ -280,6 +280,10 @@ static int v9fs_write_begin(struct file *filp, struct address_space *mapping,

>  		goto out;

>  

>  	retval = v9fs_fid_readpage(v9inode->writeback_fid, page);

> +	if (retval == AOP_UPDATED_PAGE) {

> +		retval = 0;

> +		goto out;

> +	}


FWIW this is a change of behaviour; for some reason the code used to
loop back to grab_cache_page_write_begin() and bail out on
PageUptodate() I suppose; some sort of race check?
The whole pattern is a bit weird to me and 9p has no guarantee on
concurrent writes to a file with cache enabled (except that it will
corrupt something), so this part is fine with me.

What I'm curious about is the page used to be both unlocked and put, but
now isn't either and the return value hasn't changed for the caller to
make a difference on write_begin / I don't see any code change in the
vfs  to handle that.
What did I miss?


(FWIW at least cifs in the series has the same pattern change; didn't
check all of them)


Thanks,
-- 
Dominique
Matthew Wilcox Sept. 18, 2020, 11:19 a.m. UTC | #3
On Fri, Sep 18, 2020 at 07:59:19AM +0200, Dominique Martinet wrote:
> Matthew Wilcox (Oracle) wrote on Thu, Sep 17, 2020:

> > diff --git a/fs/9p/vfs_addr.c b/fs/9p/vfs_addr.c

> > index cce9ace651a2..506ca0ba2ec7 100644

> > --- a/fs/9p/vfs_addr.c

> > +++ b/fs/9p/vfs_addr.c

> > @@ -280,6 +280,10 @@ static int v9fs_write_begin(struct file *filp, struct address_space *mapping,

> >  		goto out;

> >  

> >  	retval = v9fs_fid_readpage(v9inode->writeback_fid, page);

> > +	if (retval == AOP_UPDATED_PAGE) {

> > +		retval = 0;

> > +		goto out;

> > +	}

> 

> FWIW this is a change of behaviour; for some reason the code used to

> loop back to grab_cache_page_write_begin() and bail out on

> PageUptodate() I suppose; some sort of race check?

> The whole pattern is a bit weird to me and 9p has no guarantee on

> concurrent writes to a file with cache enabled (except that it will

> corrupt something), so this part is fine with me.

> 

> What I'm curious about is the page used to be both unlocked and put, but

> now isn't either and the return value hasn't changed for the caller to

> make a difference on write_begin / I don't see any code change in the

> vfs  to handle that.

> What did I miss?


The page cache is kind of subtle.  The grab_cache_page_write_begin()
will return a Locked page with an increased refcount.  If it's Uptodate,
that's exactly what we want, and we return it.  If we have to read the
page, readpage used to unlock the page before returning, and rather than
re-lock it, we would drop the reference to the page and look it up again.
It's possible that after dropping the lock on that page that the page
was replaced in the page cache and so we'd get a different page.

Anyway, now (unless fscache is involved), v9fs_fid_readpage will return
the page without unlocking it.  So we don't need to do the dance of
dropping the lock, putting the refcount and looking the page back up
again.  We can just return the page.  The VFS doesn't need a special
return code because nothing has changed from the VFS's point of view --
it asked you to get a page and you got the page.