| Message ID | 161340389201.1303470.14353807284546854878.stgit@warthog.procyon.org.uk |
|---|---|
| State | New |
| Series | Network fs helper library & fscache kiocb API [ver #3] |
I plan to try and use readahead_expand in Orangefs...

-Mike

On Tue, Feb 16, 2021 at 8:28 AM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Tue, Feb 16, 2021 at 11:32:15AM +0100, Christoph Hellwig wrote:
> > On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> > > Provide a function, readahead_expand(), that expands the set of pages
> > > specified by a readahead_control object to encompass a revised area with a
> > > proposed size and length.
> > >
> > > The proposed area must include all of the old area and may be expanded yet
> > > more by this function so that the edges align on (transparent huge) page
> > > boundaries as allocated.
> > >
> > > The expansion will be cut short if a page already exists in either of the
> > > areas being expanded into.  Note that any expansion made in such a case is
> > > not rolled back.
> > >
> > > This will be used by fscache so that reads can be expanded to cache granule
> > > boundaries, thereby allowing whole granules to be stored in the cache, but
> > > there are other potential users also.
> >
> > So looking at linux-next this seems to have a user, but that user is
> > dead wood given that nothing implements ->expand_readahead.
> >
> > Looking at the code structure I think netfs_readahead and
> > netfs_rreq_expand is a complete mess and needs to be turned upside
> > down, that is instead of calling back from netfs_readahead to the
> > calling file system, split it into a few helpers called by the
> > caller.
>
> That's funny, we modelled it after iomap.
>
> > But even after this can't we just expose the cache granule boundary
> > to the VM so that the read-ahead request gets setup correctly from
> > the very beginning?
>
> The intent is that this be usable by filesystems which want to (for
> example) compress variable sized blocks.  So they won't know which pages
> they want to readahead until they're in their iomap actor routine,
> see that the extent they're in is compressed, and find out how large
> the extent is.
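To make the compressed-extent case concrete, here is a minimal sketch (not part of the posted series) of how such a filesystem's ->readahead might use readahead_expand() once it knows the extent geometry; the extent type and both example_fs_* helpers are invented for illustration:

	/* Illustrative extent descriptor; not a real kernel type. */
	struct example_extent {
		loff_t start;		/* byte offset of the extent */
		size_t len;		/* length in bytes */
		bool compressed;
	};

	static void example_fs_readahead(struct readahead_control *ractl)
	{
		loff_t pos = readahead_pos(ractl);
		struct example_extent ext;

		/* Only here, inside ->readahead, do we learn which extent
		 * the request falls in and how big it is. */
		example_fs_lookup_extent(ractl->mapping->host, pos, &ext);

		/* A compressed extent must be read and decompressed whole,
		 * so widen the window to cover all of it.  The window may
		 * come back smaller if pages were already present. */
		if (ext.compressed)
			readahead_expand(ractl, ext.start, ext.len);

		example_fs_submit_read(ractl);
	}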
Mike Marshall <hubcap@omnibond.com> wrote:
> I plan to try and use readahead_expand in Orangefs...
Would it help if I shuffled the readahead_expand patch to the bottom of the
pack?
David
On Mon, Feb 15, 2021 at 03:44:52PM +0000, David Howells wrote:
> +++ b/include/linux/pagemap.h
> @@ -761,6 +761,8 @@ extern void __delete_from_page_cache(struct page *page, void *shadow);
>  int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
>  void delete_from_page_cache_batch(struct address_space *mapping,
>  				  struct pagevec *pvec);
> +void readahead_expand(struct readahead_control *ractl,
> +		      loff_t new_start, size_t new_len);

If we're revising this patchset, I'd rather this lived with the other
readahead declarations, ie after the definition of readahead_control.

> +	/* Expand the trailing edge upwards */
> +	while (ractl->_nr_pages < new_nr_pages) {
> +		unsigned long index = ractl->_index + ractl->_nr_pages;
> +		struct page *page = xa_load(&mapping->i_pages, index);
> +
> +		if (page && !xa_is_value(page))
> +			return; /* Page apparently present */
> +
> +		page = __page_cache_alloc(gfp_mask);
> +		if (!page)
> +			return;
> +		if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
> +			put_page(page);
> +			return;
> +		}
> +		ractl->_nr_pages++;
> +	}

We're defeating the ondemand_readahead() algorithm here.  Let's suppose
userspace is doing 64kB reads, the filesystem is OrangeFS which only
wants to do 4MB reads, the page cache is initially empty and there's
only one thread doing a sequential read.  ondemand_readahead() calls
get_init_ra_size() which tells it to allocate 128kB and set the async
marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
remainder of the 4MB.

After the app has read the first 64kB, it comes back to read the next
64kB, sees the readahead marker and tries to trigger the next batch of
readahead, but it's already present, so it does nothing (see
page_cache_ra_unbounded() for what happens with pages present).

Then it keeps going through the 4MB that's been read, not seeing any
more readahead markers, gets to 4MB and asks for ... 256kB?  Not quite
sure.  Anyway, it then has to wait for the next 4MB because the
readahead didn't overlap with the application processing.

So readahead_expand() needs to adjust the file's f_ra so that when the
application gets to 64kB, it kicks off the readahead of the 4MB-8MB
chunk (and then when we get to 4MB+256kB, it kicks off the readahead of
8MB-12MB, and so on).

Unless someone sees a better way to do this?  I don't want to
inadvertently break POSIX_FADV_WILLNEED which calls
force_page_cache_readahead() and should not perturb the kernel's
ondemand algorithm.  Perhaps we need to add an 'ra' pointer to the
ractl to indicate whether the file_ra_state should be updated by
readahead_expand()?
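For what it's worth, a minimal sketch of that last idea might look like the fragment below, assuming a 'struct file_ra_state *ra' member is added to the ractl (left NULL by paths such as POSIX_FADV_WILLNEED that must not perturb the ondemand state); the adjustment policy here is a guess, not code from this series:

	/* Expand the trailing edge upwards */
	while (ractl->_nr_pages < new_nr_pages) {
		/* ... page allocation exactly as in the posted patch ... */
		ractl->_nr_pages++;
		if (ractl->ra) {
			/* Grow the recorded window and push the async marker
			 * out with it, so the app's next read past the old
			 * marker still kicks off the following chunk. */
			ractl->ra->size++;
			ractl->ra->async_size++;
		}
	}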
Matthew has looked at how I'm fumbling about
trying to deal with Orangefs's need for much larger
than page-sized IO...

I think I need to implement orangefs_readahead
and from there fire off an asynchronous read
and while that's going I'll call readahead_page
with a rac that I've cranked up with readahead_expand
and when the read gets done I'll have plenty of pages
for the large IO I did.

Even if what I think I need to do is somewhere near
right, the async code in the Orangefs kernel module
didn't make it into the upstream version, so I have
to refurbish that.

All that to say: I don't need readahead_expand
"tomorrow", but it fits into my plan to get Orangefs
the extra pages it needs without me having open-coded
page cache code in orangefs_readpage.

-Mike

On Wed, Feb 17, 2021 at 10:42 AM David Howells <dhowells@redhat.com> wrote:
>
> Mike Marshall <hubcap@omnibond.com> wrote:
>
> > I plan to try and use readahead_expand in Orangefs...
>
> Would it help if I shuffled the readahead_expand patch to the bottom of the
> pack?
>
> David
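As a rough illustration of that plan, a hypothetical orangefs_readahead() might look like the sketch below; the 4MB preference, the round_down() alignment policy and the orangefs_do_big_read() helper are all assumptions, and the async plumbing Mike mentions is elided:

	/* Hypothetical sketch only; not the upstream OrangeFS code. */
	static void orangefs_readahead(struct readahead_control *rac)
	{
		struct page *page;

		/* Crank the window up to a 4MB IO; the expansion may stop
		 * early if pages are already present in the cache. */
		readahead_expand(rac, round_down(readahead_pos(rac), SZ_4M),
				 SZ_4M);

		/* One large read covering whatever window we actually got. */
		orangefs_do_big_read(rac->mapping->host, readahead_pos(rac),
				     readahead_length(rac));

		/* Then fill and release the pages the VM allocated for us. */
		while ((page = readahead_page(rac))) {
			/* ... copy the matching slice of the big read ... */
			SetPageUptodate(page);
			unlock_page(page);
			put_page(page);
		}
	}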
Mike Marshall <hubcap@omnibond.com> wrote:

> Matthew has looked at how I'm fumbling about
> trying to deal with Orangefs's need for much larger
> than page-sized IO...
>
> I think I need to implement orangefs_readahead
> and from there fire off an asynchronous read
> and while that's going I'll call readahead_page
> with a rac that I've cranked up with readahead_expand
> and when the read gets done I'll have plenty of pages
> for the large IO I did.

Would the netfs helper lib in patches 5-13 here be of use to orangefs?  Most
of the information about it is on patch 8.

David
Matthew Wilcox <willy@infradead.org> wrote:

> We're defeating the ondemand_readahead() algorithm here.  Let's suppose
> userspace is doing 64kB reads, the filesystem is OrangeFS which only
> wants to do 4MB reads, the page cache is initially empty and there's
> only one thread doing a sequential read.  ondemand_readahead() calls
> get_init_ra_size() which tells it to allocate 128kB and set the async
> marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
> remainder of the 4MB.  After the app has read the first 64kB, it comes
> back to read the next 64kB, sees the readahead marker and tries to trigger
> the next batch of readahead, but it's already present, so it does nothing
> (see page_cache_ra_unbounded() for what happens with pages present).

It sounds like Christoph is on the right track and the vm needs to ask
the filesystem (and by extension, the cache) before doing the allocation and
before setting the trigger flag.  Then we don't need to call back into the vm
to expand the readahead.

Also, there's Steve's request to try and keep at least two requests in flight
for CIFS/SMB at the same time to consider.

David
On Wed, Feb 17, 2021 at 10:34:39PM +0000, David Howells wrote:
> Matthew Wilcox <willy@infradead.org> wrote:
>
> > We're defeating the ondemand_readahead() algorithm here.  Let's suppose
> > userspace is doing 64kB reads, the filesystem is OrangeFS which only
> > wants to do 4MB reads, the page cache is initially empty and there's
> > only one thread doing a sequential read.  ondemand_readahead() calls
> > get_init_ra_size() which tells it to allocate 128kB and set the async
> > marker at 64kB.  Then orangefs calls readahead_expand() to allocate the
> > remainder of the 4MB.  After the app has read the first 64kB, it comes
> > back to read the next 64kB, sees the readahead marker and tries to trigger
> > the next batch of readahead, but it's already present, so it does nothing
> > (see page_cache_ra_unbounded() for what happens with pages present).
>
> It sounds like Christoph is on the right track and the vm needs to ask
> the filesystem (and by extension, the cache) before doing the allocation and
> before setting the trigger flag.  Then we don't need to call back into the vm
> to expand the readahead.

Doesn't work.  You could read my reply to Christoph, or try to figure out
how to get rid of
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n742
for yourself.

> Also, there's Steve's request to try and keep at least two requests in flight
> for CIFS/SMB at the same time to consider.

That's not relevant to this problem.
Matthew Wilcox <willy@infradead.org> wrote:

> So readahead_expand() needs to adjust the file's f_ra so that when the
> application gets to 64kB, it kicks off the readahead of the 4MB-8MB chunk
> (and then when we get to 4MB+256kB, it kicks off the readahead of 8MB-12MB,
> and so on).

Ummm...  Two questions:

Firstly, how do I do that?  Set ->async_size?  And to what?  The expansion
could be 2MB from a ceph stripe, 256k from the cache.  Just to add to the
fun, the leading edge of the window might also be rounded downwards and the
RA trigger could be before where the app is going to start reading.

Secondly, what happens if, say, a 4MB read is covered by a single 4MB THP?

David
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 365a28ece763..d2786607d297 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -761,6 +761,8 @@ extern void __delete_from_page_cache(struct page *page, void *shadow);
 int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
 void delete_from_page_cache_batch(struct address_space *mapping,
 				  struct pagevec *pvec);
+void readahead_expand(struct readahead_control *ractl,
+		      loff_t new_start, size_t new_len);
 
 /*
  * Like add_to_page_cache_locked, but used to add newly allocated pages:
diff --git a/mm/readahead.c b/mm/readahead.c
index c5b0457415be..4446dada0bc2 100644
--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -638,3 +638,73 @@ SYSCALL_DEFINE3(readahead, int, fd, loff_t, offset, size_t, count)
 {
 	return ksys_readahead(fd, offset, count);
 }
+
+/**
+ * readahead_expand - Expand a readahead request
+ * @ractl: The request to be expanded
+ * @new_start: The revised start
+ * @new_len: The revised size of the request
+ *
+ * Attempt to expand a readahead request outwards from the current size to the
+ * specified size by inserting locked pages before and after the current window
+ * to increase the size to the new window.  This may involve the insertion of
+ * THPs, in which case the window may get expanded even beyond what was
+ * requested.
+ *
+ * The algorithm will stop if it encounters a conflicting page already in the
+ * pagecache and leave a smaller expansion than requested.
+ *
+ * The caller must check for this by examining the revised @ractl object for a
+ * different expansion than was requested.
+ */
+void readahead_expand(struct readahead_control *ractl,
+		      loff_t new_start, size_t new_len)
+{
+	struct address_space *mapping = ractl->mapping;
+	pgoff_t new_index, new_nr_pages;
+	gfp_t gfp_mask = readahead_gfp_mask(mapping);
+
+	new_index = new_start / PAGE_SIZE;
+
+	/* Expand the leading edge downwards */
+	while (ractl->_index > new_index) {
+		unsigned long index = ractl->_index - 1;
+		struct page *page = xa_load(&mapping->i_pages, index);
+
+		if (page && !xa_is_value(page))
+			return; /* Page apparently present */
+
+		page = __page_cache_alloc(gfp_mask);
+		if (!page)
+			return;
+		if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
+			put_page(page);
+			return;
+		}
+
+		ractl->_nr_pages++;
+		ractl->_index = page->index;
+	}
+
+	new_len += new_start - readahead_pos(ractl);
+	new_nr_pages = DIV_ROUND_UP(new_len, PAGE_SIZE);
+
+	/* Expand the trailing edge upwards */
+	while (ractl->_nr_pages < new_nr_pages) {
+		unsigned long index = ractl->_index + ractl->_nr_pages;
+		struct page *page = xa_load(&mapping->i_pages, index);
+
+		if (page && !xa_is_value(page))
+			return; /* Page apparently present */
+
+		page = __page_cache_alloc(gfp_mask);
+		if (!page)
+			return;
+		if (add_to_page_cache_lru(page, mapping, index, gfp_mask) < 0) {
+			put_page(page);
+			return;
+		}
+		ractl->_nr_pages++;
+	}
+}
+EXPORT_SYMBOL(readahead_expand);
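Since the expansion can stop early, a caller presumably has to size its IO from the revised ractl rather than from what it asked for; a hedged caller-side sketch, in which CACHE_GRANULE_SIZE and submit_read() are illustrative stand-ins:

	loff_t start = round_down(readahead_pos(ractl), CACHE_GRANULE_SIZE);
	size_t len = round_up(readahead_pos(ractl) + readahead_length(ractl),
			      CACHE_GRANULE_SIZE) - start;

	readahead_expand(ractl, start, len);

	/* The window may be smaller than requested, so read only what the
	 * revised ractl actually covers. */
	submit_read(ractl->mapping->host, readahead_pos(ractl),
		    readahead_length(ractl));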
Provide a function, readahead_expand(), that expands the set of pages
specified by a readahead_control object to encompass a revised area with a
proposed size and length.

The proposed area must include all of the old area and may be expanded yet
more by this function so that the edges align on (transparent huge) page
boundaries as allocated.

The expansion will be cut short if a page already exists in either of the
areas being expanded into.  Note that any expansion made in such a case is
not rolled back.

This will be used by fscache so that reads can be expanded to cache granule
boundaries, thereby allowing whole granules to be stored in the cache, but
there are other potential users also.

Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Matthew Wilcox (Oracle) <willy@infradead.org>
cc: Alexander Viro <viro@zeniv.linux.org.uk>
cc: Christoph Hellwig <hch@lst.de>
cc: linux-mm@kvack.org
cc: linux-cachefs@redhat.com
cc: linux-afs@lists.infradead.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: ceph-devel@vger.kernel.org
cc: v9fs-developer@lists.sourceforge.net
cc: linux-fsdevel@vger.kernel.org
---

 include/linux/pagemap.h |    2 +
 mm/readahead.c          |   70 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+)