[v4,4/5] cramfs: add mmap support

Message ID 20170927233224.31676-5-nicolas.pitre@linaro.org
State New
Series cramfs refresh for embedded usage

Commit Message

Nicolas Pitre Sept. 27, 2017, 11:32 p.m. UTC
When cramfs_physmem is used, we have the opportunity to map files
directly from ROM, directly into user space, saving on RAM usage.
This gives us Execute-In-Place (XIP) support.

For a file to be mmap()-able, the map area has to correspond to a range
of uncompressed and contiguous blocks, and in the MMU case it also has
to be page aligned. A version of mkcramfs with appropriate support is
necessary to create such a filesystem image.

In the MMU case it may happen that a vma extends beyond the actual file
size. This is notably the case in binfmt_elf.c:elf_map(). Or the file's
last block may be shared with other files and cannot be mapped as is.
Rather than refusing to mmap it, we do a partial map and set up a
special vm_ops fault handler that splits the vma in two: the direct
mapping vma and the memory-backed vma populated by the readpage method.
In practice the unmapped area is seldom accessed, so the split might
never occur before this area is discarded.

In the non-MMU case it is the get_unmapped_area method that is responsible
for providing the address where the actual data can be found. No mapping
is necessary of course.

Signed-off-by: Nicolas Pitre <nico@linaro.org>
Tested-by: Chris Brandt <chris.brandt@renesas.com>

---
 fs/cramfs/Kconfig |   2 +-
 fs/cramfs/inode.c | 295 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 296 insertions(+), 1 deletion(-)

-- 
2.9.5

Comments

Christoph Hellwig Oct. 1, 2017, 8:30 a.m. UTC | #1
up_read(&mm->mmap_sem) in the fault path is still a complete
no-go,

NAK
Nicolas Pitre Oct. 1, 2017, 10:29 p.m. UTC | #2
On Sun, 1 Oct 2017, Christoph Hellwig wrote:

> up_read(&mm->mmap_sem) in the fault path is still a complete
> no-go,
> 
> NAK

Care to elaborate?

What about mm/filemap.c:__lock_page_or_retry() then?

Why the special handling on mm->mmap_sem with VM_FAULT_RETRY?

What are the potential problems with my approach that I didn't cover yet?

Seriously: I'm simply looking for solutions here.


Nicolas
Richard Weinberger Oct. 2, 2017, 10:45 p.m. UTC | #3
On Mon, Oct 2, 2017 at 12:29 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Sun, 1 Oct 2017, Christoph Hellwig wrote:
>
>> up_read(&mm->mmap_sem) in the fault path is still a complete
>> no-go,
>>
>> NAK
>
> Care to elaborate?
>
> What about mm/filemap.c:__lock_page_or_retry() then?

As soon as you up_read() in the page fault path, other tasks will race
with you before you're able to grab the write lock.

HTH

-- 
Thanks,
//richard
Nicolas Pitre Oct. 2, 2017, 11:33 p.m. UTC | #4
On Tue, 3 Oct 2017, Richard Weinberger wrote:

> On Mon, Oct 2, 2017 at 12:29 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > On Sun, 1 Oct 2017, Christoph Hellwig wrote:
> >
> >> up_read(&mm->mmap_sem) in the fault path is still a complete
> >> no-go,
> >>
> >> NAK
> >
> > Care to elaborate?
> >
> > What about mm/filemap.c:__lock_page_or_retry() then?
> 
> As soon as you up_read() in the page fault path, other tasks will race
> with you before you're able to grab the write lock.

But I _know_ that.

Could you highlight an area in my code where this is not accounted for?


Nicolas
Christoph Hellwig Oct. 3, 2017, 2:57 p.m. UTC | #5
On Mon, Oct 02, 2017 at 07:33:29PM -0400, Nicolas Pitre wrote:
> On Tue, 3 Oct 2017, Richard Weinberger wrote:
> 
> > On Mon, Oct 2, 2017 at 12:29 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > > On Sun, 1 Oct 2017, Christoph Hellwig wrote:
> > >
> > >> up_read(&mm->mmap_sem) in the fault path is still a complete
> > >> no-go,
> > >>
> > >> NAK
> > >
> > > Care to elaborate?
> > >
> > > What about mm/filemap.c:__lock_page_or_retry() then?
> > 
> > As soon as you up_read() in the page fault path, other tasks will race
> > with you before you're able to grab the write lock.
> 
> But I _know_ that.
> 
> Could you highlight an area in my code where this is not accounted for?

Existing users of lock_page_or_retry return VM_FAULT_RETRY right after
up()ing mmap_sem, and they must already have a reference to the page
which is the only thing touched until then.

Your patch instead goes for an exclusive mmap_sem if it can, and
even if there is nothing that breaks with that scheme right now,
there is nothing documenting that this is actually safe, and we are
way down in the complex page fault path.
Nicolas Pitre Oct. 3, 2017, 3:30 p.m. UTC | #6
On Tue, 3 Oct 2017, Christoph Hellwig wrote:

> On Mon, Oct 02, 2017 at 07:33:29PM -0400, Nicolas Pitre wrote:
> > On Tue, 3 Oct 2017, Richard Weinberger wrote:
> > 
> > > On Mon, Oct 2, 2017 at 12:29 AM, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> > > > On Sun, 1 Oct 2017, Christoph Hellwig wrote:
> > > >
> > > >> up_read(&mm->mmap_sem) in the fault path is still a complete
> > > >> no-go,
> > > >>
> > > >> NAK
> > > >
> > > > Care to elaborate?
> > > >
> > > > What about mm/filemap.c:__lock_page_or_retry() then?
> > > 
> > > As soon as you up_read() in the page fault path, other tasks will race
> > > with you before you're able to grab the write lock.
> > 
> > But I _know_ that.
> > 
> > Could you highlight an area in my code where this is not accounted for?
> 
> Existing users of lock_page_or_retry return VM_FAULT_RETRY right after
> up()ing mmap_sem, and they must already have a reference to the page
> which is the only thing touched until then.
> 
> Your patch instead goes for an exclusive mmap_sem if it can, and
> even if there is nothing that breaks with that scheme right now,
> there is nothing documenting that this is actually safe, and we are
> way down in the complex page fault path.

It is pretty obvious looking at the existing code that if you want to 
safely manipulate a vma you need the write lock. There are many things 
in the kernel tree that are not explicitly documented. Did that stop 
people from adding new code?

I agree that the fault path is quite complex. I've studied it carefully 
before coming up with this scheme. This is not something that came about 
just because the sunshine felt good when I woke up one day.

So if you agree that I've done a reasonable job creating a scheme that 
currently doesn't break, then IMHO this should be good enough, 
*especially* for such an isolated and specialized use case with zero 
impact on anyone else. And if things break in the future, then I will be 
the one picking up the pieces, not you, and _that_ can be written down 
somewhere if necessary so nobody has an obligation to bend over 
backwards to avoid breaking it.

Unless you have a better scheme altogether to suggest, of course, given 
the existing constraints.


Nicolas
Christoph Hellwig Oct. 3, 2017, 3:37 p.m. UTC | #7
On Tue, Oct 03, 2017 at 11:30:50AM -0400, Nicolas Pitre wrote:
> Unless you have a better scheme altogether to suggest, of course, given 
> the existing constraints.


I still can't understand why this convoluted fault path that finds
the vma, contends with all kinds of races and then tries to update things
like vm_ops is even necessary.

We have direct mappings of physical addresses working perfectly in the
DAX code (even with write support!) or in drivers using remap_pfn_range,
so a really good explanation why neither scheme can be used is needed
first.
Nicolas Pitre Oct. 3, 2017, 3:40 p.m. UTC | #8
On Tue, 3 Oct 2017, Christoph Hellwig wrote:

> On Tue, Oct 03, 2017 at 11:30:50AM -0400, Nicolas Pitre wrote:
> > Unless you have a better scheme altogether to suggest, of course, given 
> > the existing constraints.
> 
> I still can't understand why this convoluted fault path that finds
> the vma, contends with all kinds of races and then tries to update things
> like vm_ops is even necessary.
> 
> We have direct mappings of physical addresses working perfectly in the
> DAX code (even with write support!) or in drivers using remap_pfn_range,
> so a really good explanation why neither scheme can be used is needed
> first.

I provided that explanation several times by now in my cover letter. And 
separately even to you directly at least once.  What else should I do?


Nicolas
Christoph Hellwig Oct. 4, 2017, 7:25 a.m. UTC | #9
On Tue, Oct 03, 2017 at 11:40:28AM -0400, Nicolas Pitre wrote:
> I provided that explanation several times by now in my cover letter. And 
> separately even to you directly at least once.  What else should I do?

You should do the right things instead of stating irrelevant things
in your cover letter.  As said in my last mail: look at the VM_MIXEDMAP
flag and how it is used by DAX, and you'll get out of the vma splitting
business in the fault path.

If the fs/dax.c code scares you take a look at drivers/dax/device.c
instead.
Nicolas Pitre Oct. 4, 2017, 8:47 p.m. UTC | #10
On Wed, 4 Oct 2017, Christoph Hellwig wrote:

> As said in my last mail: look at the VM_MIXEDMAP flag and how it is 
> used by DAX, and you'll get out of the vma splitting business in the 
> fault path.

Alright, it appears to work.

The only downside so far is the lack of visibility from user space to 
confirm it actually works as intended. With the vma splitting approach 
you clearly see what gets directly mapped in /proc/*/maps thanks to 
remap_pfn_range() storing the actual physical address in vma->vm_pgoff. 
With VM_MIXEDMAP things are no longer visible. Any opinion on the best 
way to overcome this?

Anyway, here's a replacement for patch 4/5 below:

----- >8
Subject: cramfs: add mmap support

When cramfs_physmem is used, we have the opportunity to map files
directly from ROM, directly into user space, saving on RAM usage.
This gives us Execute-In-Place (XIP) support.

For a file to be mmap()-able, the map area has to correspond to a range
of uncompressed and contiguous blocks, and in the MMU case it also has
to be page aligned. A version of mkcramfs with appropriate support is
necessary to create such a filesystem image.

In the MMU case it may happen that a vma extends beyond the actual file
size. This is notably the case in binfmt_elf.c:elf_map(). Or the file's
last block may be shared with other files and cannot be mapped as is.
Rather than refusing to mmap it, we do a "mixed" map and let the
regular fault handler populate the unmapped area with RAM-backed pages.
In practice the unmapped area is seldom accessed, so page faults might
never occur before this area is discarded.

In the non-MMU case it is the get_unmapped_area method that is responsible
for providing the address where the actual data can be found. No mapping
is necessary of course.

Signed-off-by: Nicolas Pitre <nico@linaro.org>


diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 2fc886092b..9d5d0c1f7d 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -15,7 +15,10 @@
 
 #include <linux/module.h>
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/pfn_t.h>
+#include <linux/ramfs.h>
 #include <linux/init.h>
 #include <linux/string.h>
 #include <linux/blkdev.h>
@@ -49,6 +52,7 @@ static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
 static const struct super_operations cramfs_ops;
 static const struct inode_operations cramfs_dir_inode_operations;
 static const struct file_operations cramfs_directory_operations;
+static const struct file_operations cramfs_physmem_fops;
 static const struct address_space_operations cramfs_aops;
 
 static DEFINE_MUTEX(read_mutex);
@@ -96,6 +100,10 @@ static struct inode *get_cramfs_inode(struct super_block *sb,
 	case S_IFREG:
 		inode->i_fop = &generic_ro_fops;
 		inode->i_data.a_ops = &cramfs_aops;
+		if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) &&
+		    CRAMFS_SB(sb)->flags & CRAMFS_FLAG_EXT_BLOCK_POINTERS &&
+		    CRAMFS_SB(sb)->linear_phys_addr)
+			inode->i_fop = &cramfs_physmem_fops;
 		break;
 	case S_IFDIR:
 		inode->i_op = &cramfs_dir_inode_operations;
@@ -277,6 +285,188 @@ static void *cramfs_read(struct super_block *sb, unsigned int offset,
 		return NULL;
 }
 
+/*
+ * For a mapping to be possible, we need a range of uncompressed and
+ * contiguous blocks. Return the offset for the first block and number of
+ * valid blocks for which that is true, or zero otherwise.
+ */
+static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 *pages)
+{
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	int i;
+	u32 *blockptrs, blockaddr;
+
+	/*
+	 * We can dereference memory directly here as this code may be
+	 * reached only when there is a direct filesystem image mapping
+	 * available in memory.
+	 */
+	blockptrs = (u32 *)(sbi->linear_virt_addr + OFFSET(inode) + pgoff*4);
+	blockaddr = blockptrs[0] & ~CRAMFS_BLK_FLAGS;
+	i = 0;
+	do {
+		u32 expect = blockaddr + i * (PAGE_SIZE >> 2);
+		expect |= CRAMFS_BLK_FLAG_DIRECT_PTR|CRAMFS_BLK_FLAG_UNCOMPRESSED;
+		if (blockptrs[i] != expect) {
+			pr_debug("range: block %d/%d got %#x expects %#x\n",
+				 pgoff+i, pgoff+*pages-1, blockptrs[i], expect);
+			if (i == 0)
+				return 0;
+			break;
+		}
+	} while (++i < *pages);
+
+	*pages = i;
+
+	/* stored "direct" block ptrs are shifted down by 2 bits */
+	return blockaddr << 2;
+}
+
+static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	unsigned int pages, vma_pages, max_pages, offset;
+	unsigned long address;
+	char *fail_reason;
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_MMU))
+		return vma->vm_flags & (VM_SHARED | VM_MAYSHARE) ? 0 : -ENOSYS;
+
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		return -EINVAL;
+
+	/* Could COW work here? */
+	fail_reason = "vma is writable";
+	if (vma->vm_flags & VM_WRITE)
+		goto fail;
+
+	vma_pages = (vma->vm_end - vma->vm_start + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	fail_reason = "beyond file limit";
+	if (vma->vm_pgoff >= max_pages)
+		goto fail;
+	pages = vma_pages;
+	if (pages > max_pages - vma->vm_pgoff)
+		pages = max_pages - vma->vm_pgoff;
+
+	offset = cramfs_get_block_range(inode, vma->vm_pgoff, &pages);
+	fail_reason = "unsuitable block layout";
+	if (!offset)
+		goto fail;
+	address = sbi->linear_phys_addr + offset;
+	fail_reason = "data is not page aligned";
+	if (!PAGE_ALIGNED(address))
+		goto fail;
+
+	/* Don't map the last page if it contains some other data */
+	if (unlikely(vma->vm_pgoff + pages == max_pages)) {
+		unsigned int partial = offset_in_page(inode->i_size);
+		if (partial) {
+			char *data = sbi->linear_virt_addr + offset;
+			data += (max_pages - 1) * PAGE_SIZE + partial;
+			while ((unsigned long)data & 7)
+				if (*data++ != 0)
+					goto nonzero;
+			while (offset_in_page(data)) {
+				if (*(u64 *)data != 0) {
+					nonzero:
+					pr_debug("mmap: %s: last page is shared\n",
+						 file_dentry(file)->d_name.name);
+					pages--;
+					break;
+				}
+				data += 8;
+			}
+		}
+	}
+
+	if (!pages) {
+		fail_reason = "no suitable block remaining";
+		goto fail;
+	} else if (pages != vma_pages) {
+		/*
+		 * Let's create a mixed map if we can't map it all.
+		 * The normal paging machinery will take care of the
+		 * unpopulated vma via cramfs_readpage().
+		 */
+		int i;
+		vma->vm_flags |= VM_MIXEDMAP;
+		for (i = 0; i < pages; i++) {
+			unsigned long vaddr = vma->vm_start + i*PAGE_SIZE;
+			pfn_t pfn = phys_to_pfn_t(address + i*PAGE_SIZE, PFN_DEV);
+			ret = vm_insert_mixed(vma, vaddr, pfn);
+			if (ret)
+				return ret;
+		}
+		vma->vm_ops = &generic_file_vm_ops;
+	} else {
+		ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
+				      pages * PAGE_SIZE, vma->vm_page_prot);
+		if (ret)
+			return ret;
+	}
+
+	pr_debug("mapped %s at 0x%08lx (%u/%u pages) to vma 0x%08lx, "
+		 "page_prot 0x%llx\n", file_dentry(file)->d_name.name,
+		 address, pages, vma_pages, vma->vm_start,
+		 (unsigned long long)pgprot_val(vma->vm_page_prot));
+	return 0;
+
+fail:
+	pr_debug("%s: direct mmap failed: %s\n",
+		 file_dentry(file)->d_name.name, fail_reason);
+
+	/* We failed to do a direct map, but normal paging is still possible */
+	vma->vm_ops = &generic_file_vm_ops;
+	return 0;
+}
+
+#ifndef CONFIG_MMU
+
+static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
+			unsigned long addr, unsigned long len,
+			unsigned long pgoff, unsigned long flags)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	unsigned int pages, block_pages, max_pages, offset;
+
+	pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (pgoff >= max_pages || pages > max_pages - pgoff)
+		return -EINVAL;
+	block_pages = pages;
+	offset = cramfs_get_block_range(inode, pgoff, &block_pages);
+	if (!offset || block_pages != pages)
+		return -ENOSYS;
+	addr = sbi->linear_phys_addr + offset;
+	pr_debug("get_unmapped for %s ofs %#lx siz %lu at 0x%08lx\n",
+		 file_dentry(file)->d_name.name, pgoff*PAGE_SIZE, len, addr);
+	return addr;
+}
+
+static unsigned cramfs_physmem_mmap_capabilities(struct file *file)
+{
+	return NOMMU_MAP_COPY | NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_EXEC;
+}
+#endif
+
+static const struct file_operations cramfs_physmem_fops = {
+	.llseek			= generic_file_llseek,
+	.read_iter		= generic_file_read_iter,
+	.splice_read		= generic_file_splice_read,
+	.mmap			= cramfs_physmem_mmap,
+#ifndef CONFIG_MMU
+	.get_unmapped_area	= cramfs_physmem_get_unmapped_area,
+	.mmap_capabilities	= cramfs_physmem_mmap_capabilities,
+#endif
+};
+
 static void cramfs_blkdev_kill_sb(struct super_block *sb)
 {
 	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
Christoph Hellwig Oct. 5, 2017, 7:15 a.m. UTC | #11
On Wed, Oct 04, 2017 at 04:47:52PM -0400, Nicolas Pitre wrote:
> The only downside so far is the lack of visibility from user space to 
> confirm it actually works as intended. With the vma splitting approach 
> you clearly see what gets directly mapped in /proc/*/maps thanks to 
> remap_pfn_range() storing the actual physical address in vma->vm_pgoff. 
> With VM_MIXEDMAP things are no longer visible. Any opinion on the best 
> way to overcome this?

Add trace points that allow you to trace it using trace-cmd, perf
or just tracefs?
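
For illustration only, a minimal sketch of such a tracepoint, assuming a
hypothetical fs/cramfs/trace.h and event name (the patch as posted defines
no tracepoints at all):

	#undef TRACE_SYSTEM
	#define TRACE_SYSTEM cramfs

	#if !defined(_TRACE_CRAMFS_H) || defined(TRACE_HEADER_MULTI_READ)
	#define _TRACE_CRAMFS_H

	#include <linux/tracepoint.h>

	/* fires once for each successful direct (XIP) mapping */
	TRACE_EVENT(cramfs_direct_map,
		TP_PROTO(const char *name, unsigned long phys, unsigned int pages),
		TP_ARGS(name, phys, pages),
		TP_STRUCT__entry(
			__string(name, name)
			__field(unsigned long, phys)
			__field(unsigned int, pages)
		),
		TP_fast_assign(
			__assign_str(name, name);
			__entry->phys = phys;
			__entry->pages = pages;
		),
		TP_printk("%s phys=%#lx pages=%u",
			  __get_str(name), __entry->phys, __entry->pages)
	);

	#endif /* _TRACE_CRAMFS_H */

	/* header lives outside include/trace/events, hence the explicit path */
	#undef TRACE_INCLUDE_PATH
	#define TRACE_INCLUDE_PATH .
	#define TRACE_INCLUDE_FILE trace
	#include <trace/define_trace.h>

cramfs_physmem_mmap() would then call trace_cramfs_direct_map() at the
point where its final pr_debug() sits, and the mappings would become
visible under /sys/kernel/debug/tracing/events/cramfs/.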

> Anyway, here's a replacement for patch 4/5 below:

This looks much better, and is about 100 lines less than the previous
version.  More (mostly cosmetic) comments below:

> +	blockptrs = (u32 *)(sbi->linear_virt_addr + OFFSET(inode) + pgoff*4);

missing spaces around the *

> +	blockaddr = blockptrs[0] & ~CRAMFS_BLK_FLAGS;
> +	i = 0;
> +	do {
> +		u32 expect = blockaddr + i * (PAGE_SIZE >> 2);

There are a lot of magic numbers in here.  It seems like that's standard
for cramfs, but if you really plan to bring it back to life it would be
great to sort that out.

> +		expect |= CRAMFS_BLK_FLAG_DIRECT_PTR|CRAMFS_BLK_FLAG_UNCOMPRESSED;

Too long line.

Just turn this into:

		 u32 expect = blockaddr + i * (PAGE_SIZE >> 2) |
		 		CRAMFS_BLK_FLAG_DIRECT_PTR |
				CRAMFS_BLK_FLAG_UNCOMPRESSED;

and it will be a lot more readable.

> +static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	struct inode *inode = file_inode(file);
> +	struct super_block *sb = inode->i_sb;
> +	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
> +	unsigned int pages, vma_pages, max_pages, offset;
> +	unsigned long address;
> +	char *fail_reason;
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_MMU))
> +		return vma->vm_flags & (VM_SHARED | VM_MAYSHARE) ? 0 : -ENOSYS;

Given that you have a separate #ifndef CONFIG_MMU section below just
have a separate implementation of cramfs_physmem_mmap for it, which
makes the code a lot more obvious.
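
For illustration, such a split might look like this sketch (the return
expression is the one from the patch above; everything else is elided):

	#ifndef CONFIG_MMU
	/* no page tables to populate: just check that the map can be shared */
	static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
	{
		return vma->vm_flags & (VM_SHARED | VM_MAYSHARE) ? 0 : -ENOSYS;
	}
	#else
	static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
	{
		/* ... the MMU implementation as above, minus the IS_ENABLED() test ... */
	}
	#endif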

> +	/* Could COW work here? */
> +	fail_reason = "vma is writable";
> +	if (vma->vm_flags & VM_WRITE)
> +		goto fail;

The fail_reason is a rather unusual style, is there any good reason
why you need it here?  We generally don't add a debug printk for every
possible failure case.

> +	vma_pages = (vma->vm_end - vma->vm_start + PAGE_SIZE - 1) >> PAGE_SHIFT;

Just use vma_pages() - the definition is different, but given that vm_end
and vm_start must be page aligned anyway it should not make a difference.

> +	if (pages > max_pages - vma->vm_pgoff)
> +		pages = max_pages - vma->vm_pgoff;

Use min() or min_t().
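
That is, something like this one-liner sketch (min_t() rather than min()
because vma->vm_pgoff is an unsigned long while the page counts are
unsigned int):

	pages = min_t(unsigned int, vma_pages, max_pages - vma->vm_pgoff);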

> +	/* Don't map the last page if it contains some other data */
> +	if (unlikely(vma->vm_pgoff + pages == max_pages)) {
> +		unsigned int partial = offset_in_page(inode->i_size);
> +		if (partial) {
> +			char *data = sbi->linear_virt_addr + offset;
> +			data += (max_pages - 1) * PAGE_SIZE + partial;
> +			while ((unsigned long)data & 7)
> +				if (*data++ != 0)
> +					goto nonzero;
> +			while (offset_in_page(data)) {
> +				if (*(u64 *)data != 0) {
> +					nonzero:
> +					pr_debug("mmap: %s: last page is shared\n",
> +						 file_dentry(file)->d_name.name);
> +					pages--;
> +					break;
> +				}
> +				data += 8;
> +			}

The nonzero label is in a rather unusual place, both having weird
indentation and being in the middle of the loop.

It seems like this whole partial section should just go into a little
helper where the nonzero case is at the end of said helper to make it
readable.  Also lots of magic numbers again, and generally a little
too much magic for the code to be easily understandable: why do you
operate on pointers cast to longs, increment in 8-byte steps?
Why is offset_in_page used for an operation that doesn't operate on
struct page at all?  Any reason you can't just use memchr_inv?
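
For illustration, a memchr_inv() based version might look like this sketch
(reusing the patch's locals; memchr_inv() returns a pointer to the first
byte that differs from the given value, or NULL if all bytes match):

	/* Don't map the last page if it contains some other data */
	if (vma->vm_pgoff + pages == max_pages) {
		unsigned int partial = offset_in_page(inode->i_size);
		if (partial) {
			char *data = sbi->linear_virt_addr + offset +
				     (max_pages - 1) * PAGE_SIZE + partial;
			/* anything nonzero past EOF means the last page is shared */
			if (memchr_inv(data, 0, PAGE_SIZE - partial)) {
				pr_debug("mmap: %s: last page is shared\n",
					 file_dentry(file)->d_name.name);
				pages--;
			}
		}
	}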

> +	if (!pages) {
> +		fail_reason = "no suitable block remaining";
> +		goto fail;
> +	} else if (pages != vma_pages) {

No if/else please if you goto a different label; that just confuses the
reader.

> +		/*
> +		 * Let's create a mixed map if we can't map it all.
> +		 * The normal paging machinery will take care of the
> +		 * unpopulated vma via cramfs_readpage().
> +		 */
> +		int i;
> +		vma->vm_flags |= VM_MIXEDMAP;
> +		for (i = 0; i < pages; i++) {
> +			unsigned long vaddr = vma->vm_start + i*PAGE_SIZE;
> +			pfn_t pfn = phys_to_pfn_t(address + i*PAGE_SIZE, PFN_DEV);
> +			ret = vm_insert_mixed(vma, vaddr, pfn);

Please use spaces around the * operator, and don't use overly long
lines.

A local variable might help doing that in a readable way:

			unsigned long off = i * PAGE_SIZE;

			ret = vm_insert_mixed(vma, vma->vm_start + off,
					phys_to_pfn_t(address + off, PFN_DEV));

> +	/* We failed to do a direct map, but normal paging is still possible */
> +	vma->vm_ops = &generic_file_vm_ops;

Maybe let the mixedmap case fall through to this instead of having
a duplicate vm_ops assignment.

> +static unsigned cramfs_physmem_mmap_capabilities(struct file *file)
> +{
> +	return NOMMU_MAP_COPY | NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_EXEC;

Too long line.
Nicolas Pitre Oct. 5, 2017, 5:52 p.m. UTC | #12
On Thu, 5 Oct 2017, Christoph Hellwig wrote:

> On Wed, Oct 04, 2017 at 04:47:52PM -0400, Nicolas Pitre wrote:
> > The only downside so far is the lack of visibility from user space to 
> > confirm it actually works as intended. With the vma splitting approach 
> > you clearly see what gets directly mapped in /proc/*/maps thanks to 
> > remap_pfn_range() storing the actual physical address in vma->vm_pgoff. 
> > With VM_MIXEDMAP things are no longer visible. Any opinion on the best 
> > way to overcome this?
> 
> Add trace points that allow you to trace it using trace-cmd, perf
> or just tracefs?

In memory-constrained embedded environments those facilities are 
sometimes too big to be practical. And the /proc/*/maps content is 
static, i.e. it is always there regardless of how many tasks you have and 
how long they've been running, which makes it extremely handy.

> > Anyway, here's a replacement for patch 4/5 below:
> 
> This looks much better, and is about 100 lines less than the previous
> version.  More (mostly cosmetic) comments below:

[...]
> > +	fail_reason = "vma is writable";
> > +	if (vma->vm_flags & VM_WRITE)
> > +		goto fail;
> 
> The fail_reason is a rather unusual style, is there any good reason
> why you need it here?  We generally don't add a debug printk for every
> possible failure case.

There are many things that might make your files not XIP, and they're 
mostly related to how the file is mmap'd or how mkcramfs was used. When 
looking at where some of your memory has gone because some files are not 
directly mapped, it is nice to have a hint as to why at run time. Doing 
it that way also works as comments for someone reading the code, and the 
compiler optimizes those strings away when DEBUG is not defined anyway.

I did s/fail/bailout/ though, as those are not hard failures. The hard 
failures have no such debugging messages.

[...]
> It seems like this whole partial section should just go into a little
> helper where the nonzero case is at the end of said helper to make it
> readable.  Also lots of magic numbers again, and generally a little
> too much magic for the code to be easily understandable: why do you
> operate on pointers cast to longs, increment in 8-byte steps?
> Why is offset_in_page used for an operation that doesn't operate on
> struct page at all?  Any reason you can't just use memchr_inv?

Ahhh... memchr_inv is in fact exactly what I was looking for.
You learn something every day.

[...]
> > +	/* We failed to do a direct map, but normal paging is still possible */
> > +	vma->vm_ops = &generic_file_vm_ops;
> 
> Maybe let the mixedmap case fall through to this instead of having
> a duplicate vm_ops assignment.

The code flow is different and that makes it hard to have a common 
assignment in this case.

Otherwise I've applied all your suggestions.

Thanks for your comments. Much appreciated.


Nicolas
Chris Brandt Oct. 5, 2017, 8 p.m. UTC | #13
On Wednesday, October 04, 2017, Nicolas Pitre wrote:
> On Wed, 4 Oct 2017, Christoph Hellwig wrote:
> 
> > As said in my last mail: look at the VM_MIXEDMAP flag and how it is
> > used by DAX, and you'll get out of the vma splitting business in the
> > fault path.
> 
> Alright, it appears to work.
> 
> The only downside so far is the lack of visibility from user space to
> confirm it actually works as intended. With the vma splitting approach
> you clearly see what gets directly mapped in /proc/*/maps thanks to
> remap_pfn_range() storing the actual physical address in vma->vm_pgoff.
> With VM_MIXEDMAP things are no longer visible. Any opinion on the best
> way to overcome this?
> 
> Anyway, here's a replacement for patch 4/5 below:
> 
> ----- >8
> Subject: cramfs: add mmap support
> 
> When cramfs_physmem is used, we have the opportunity to map files
> directly from ROM, directly into user space, saving on RAM usage.
> This gives us Execute-In-Place (XIP) support.

Tested on my setup:
 * Cortex A9 (with MMU)
 * CONFIG_XIP_KERNEL=y
 * booted with XIP CRAMFS as my rootfs 
 * all apps and libraries marked as XIP in my cramfs image

So far, functionally it seems to work the same as [PATCH v4 4/5].

As Nicolas said, before you could easily see that all my apps and 
libraries were XIP from Flash:

$ cat /proc/self/maps
00008000-000a1000 r-xp 1b005000 00:0c 18192      /bin/busybox
000a9000-000aa000 rw-p 00099000 00:0c 18192      /bin/busybox
000aa000-000ac000 rw-p 00000000 00:00 0          [heap]
b6e69000-b6f42000 r-xp 1b0bc000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f42000-b6f4a000 ---p 1b195000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f4a000-b6f4c000 r--p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f4c000-b6f4d000 rw-p 000db000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f4d000-b6f50000 rw-p 00000000 00:00 0
b6f50000-b6f67000 r-xp 1b0a4000 00:0c 670372     /lib/ld-2.18-2013.10.so
b6f6a000-b6f6b000 rw-p 00000000 00:00 0
b6f6c000-b6f6e000 rw-p 00000000 00:00 0
b6f6e000-b6f6f000 r--p 00016000 00:0c 670372     /lib/ld-2.18-2013.10.so
b6f6f000-b6f70000 rw-p 00017000 00:0c 670372     /lib/ld-2.18-2013.10.so
beac0000-beae1000 rw-p 00000000 00:00 0          [stack]
bebc9000-bebca000 r-xp 00000000 00:00 0          [sigpage]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

But now just busybox looks like it's XIP:

$ cat /proc/self/maps
00008000-000a1000 r-xp 1b005000 00:0c 18192      /bin/busybox
000a9000-000aa000 rw-p 00099000 00:0c 18192      /bin/busybox
000aa000-000ac000 rw-p 00000000 00:00 0          [heap]
b6e4d000-b6f26000 r-xp 00000000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f26000-b6f2e000 ---p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f2e000-b6f30000 r--p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f30000-b6f31000 rw-p 000db000 00:0c 766540     /lib/libc-2.18-2013.10.so
b6f31000-b6f34000 rw-p 00000000 00:00 0
b6f34000-b6f4b000 r-xp 00000000 00:0c 670372     /lib/ld-2.18-2013.10.so
b6f4e000-b6f4f000 rw-p 00000000 00:00 0
b6f50000-b6f52000 rw-p 00000000 00:00 0
b6f52000-b6f53000 r--p 00016000 00:0c 670372     /lib/ld-2.18-2013.10.so
b6f53000-b6f54000 rw-p 00017000 00:0c 670372     /lib/ld-2.18-2013.10.so
bec93000-becb4000 rw-p 00000000 00:00 0          [stack]
befad000-befae000 r-xp 00000000 00:00 0          [sigpage]
ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

Regardless, from a functional standpoint:

Tested-by: Chris Brandt <chris.brandt@renesas.com>

Just FYI, the previous [PATCH v4 4/5] also included this (which was the 
only real difference between v3 and v4):

diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
index 5b4e0b7e13..306549be25 100644
--- a/fs/cramfs/Kconfig
+++ b/fs/cramfs/Kconfig
@@ -30,7 +30,7 @@ config CRAMFS_BLOCKDEV
 
 config CRAMFS_PHYSMEM
 	bool "Support CramFs image directly mapped in physical memory"
-	depends on CRAMFS
+	depends on CRAMFS = y
 	default y if !CRAMFS_BLOCKDEV
 	help
 	  This option allows the CramFs driver to load data directly from


Chris
Nicolas Pitre Oct. 5, 2017, 9:15 p.m. UTC | #14
On Thu, 5 Oct 2017, Chris Brandt wrote:

> On Wednesday, October 04, 2017, Nicolas Pitre wrote:
> > Anyway, here's a replacement for patch 4/5 below:
> > 
> > ----- >8
> > Subject: cramfs: add mmap support
> > 
> > When cramfs_physmem is used, we have the opportunity to map files
> > directly from ROM, directly into user space, saving on RAM usage.
> > This gives us Execute-In-Place (XIP) support.
> 
> Tested on my setup:
>  * Cortex A9 (with MMU)
>  * CONFIG_XIP_KERNEL=y
>  * booted with XIP CRAMFS as my rootfs
>  * all apps and libraries marked as XIP in my cramfs image
> 
> So far, functionally it seems to work the same as [PATCH v4 4/5].
> 
> As Nicolas said, before you could easily see that all my apps and 
> libraries were XIP from Flash:
> 
> $ cat /proc/self/maps
> 00008000-000a1000 r-xp 1b005000 00:0c 18192      /bin/busybox
> 000a9000-000aa000 rw-p 00099000 00:0c 18192      /bin/busybox
> 000aa000-000ac000 rw-p 00000000 00:00 0          [heap]
> b6e69000-b6f42000 r-xp 1b0bc000 00:0c 766540     /lib/libc-2.18-2013.10.so
> b6f42000-b6f4a000 ---p 1b195000 00:0c 766540     /lib/libc-2.18-2013.10.so
> b6f4a000-b6f4c000 r--p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
> b6f4c000-b6f4d000 rw-p 000db000 00:0c 766540     /lib/libc-2.18-2013.10.so
> b6f4d000-b6f50000 rw-p 00000000 00:00 0
> b6f50000-b6f67000 r-xp 1b0a4000 00:0c 670372     /lib/ld-2.18-2013.10.so
> b6f6a000-b6f6b000 rw-p 00000000 00:00 0
> b6f6c000-b6f6e000 rw-p 00000000 00:00 0
> b6f6e000-b6f6f000 r--p 00016000 00:0c 670372     /lib/ld-2.18-2013.10.so
> b6f6f000-b6f70000 rw-p 00017000 00:0c 670372     /lib/ld-2.18-2013.10.so
> beac0000-beae1000 rw-p 00000000 00:00 0          [stack]
> bebc9000-bebca000 r-xp 00000000 00:00 0          [sigpage]
> ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]
> 
> But now just busybox looks like it's XIP:
> 
> $ cat /proc/self/maps
> 00008000-000a1000 r-xp 1b005000 00:0c 18192      /bin/busybox
> 000a9000-000aa000 rw-p 00099000 00:0c 18192      /bin/busybox
> 000aa000-000ac000 rw-p 00000000 00:00 0          [heap]
> b6e4d000-b6f26000 r-xp 00000000 00:0c 766540     /lib/libc-2.18-2013.10.so
> b6f26000-b6f2e000 ---p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
> b6f2e000-b6f30000 r--p 000d9000 00:0c 766540     /lib/libc-2.18-2013.10.so
> b6f30000-b6f31000 rw-p 000db000 00:0c 766540     /lib/libc-2.18-2013.10.so
> b6f31000-b6f34000 rw-p 00000000 00:00 0
> b6f34000-b6f4b000 r-xp 00000000 00:0c 670372     /lib/ld-2.18-2013.10.so
> b6f4e000-b6f4f000 rw-p 00000000 00:00 0
> b6f50000-b6f52000 rw-p 00000000 00:00 0
> b6f52000-b6f53000 r--p 00016000 00:0c 670372     /lib/ld-2.18-2013.10.so
> b6f53000-b6f54000 rw-p 00017000 00:0c 670372     /lib/ld-2.18-2013.10.so
> bec93000-becb4000 rw-p 00000000 00:00 0          [stack]
> befad000-befae000 r-xp 00000000 00:00 0          [sigpage]
> ffff0000-ffff1000 r-xp 00000000 00:00 0          [vectors]

Do you have the same amount of free memory once booted in both cases?

> Regardless, from a functional standpoint:
> 
> Tested-by: Chris Brandt <chris.brandt@renesas.com>

Thanks.

> Just FYI, the previous [PATCH v4 4/5] also included this (which was the 
> only real difference between v3 and v4):
> 
> diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
> index 5b4e0b7e13..306549be25 100644
> --- a/fs/cramfs/Kconfig
> +++ b/fs/cramfs/Kconfig
> @@ -30,7 +30,7 @@ config CRAMFS_BLOCKDEV
>  
>  config CRAMFS_PHYSMEM
>  	bool "Support CramFs image directly mapped in physical memory"
> -	depends on CRAMFS
> +	depends on CRAMFS = y

Yeah, that was necessary because split_vma() wasn't exported to modules. 
Now that split_vma() is no longer used, the no-module restriction has 
also been removed.


Nicolas
Chris Brandt Oct. 5, 2017, 11:49 p.m. UTC | #15
On Thursday, October 05, 2017, Nicolas Pitre wrote:
> Do you have the same amount of free memory once booted in both cases?

Yes, almost exactly the same, so obviously it must be working the same in
both cases. That's enough evidence for me.

Thanks.

Chris

Patch

diff --git a/fs/cramfs/Kconfig b/fs/cramfs/Kconfig
index 5b4e0b7e13..306549be25 100644
--- a/fs/cramfs/Kconfig
+++ b/fs/cramfs/Kconfig
@@ -30,7 +30,7 @@ config CRAMFS_BLOCKDEV
 
 config CRAMFS_PHYSMEM
 	bool "Support CramFs image directly mapped in physical memory"
-	depends on CRAMFS
+	depends on CRAMFS = y
 	default y if !CRAMFS_BLOCKDEV
 	help
 	  This option allows the CramFs driver to load data directly from
diff --git a/fs/cramfs/inode.c b/fs/cramfs/inode.c
index 2fc886092b..1d7d61354b 100644
--- a/fs/cramfs/inode.c
+++ b/fs/cramfs/inode.c
@@ -15,7 +15,9 @@ 
 
 #include <linux/module.h>
 #include <linux/fs.h>
+#include <linux/file.h>
 #include <linux/pagemap.h>
+#include <linux/ramfs.h>
 #include <linux/init.h>
 #include <linux/string.h>
 #include <linux/blkdev.h>
@@ -49,6 +51,7 @@ static inline struct cramfs_sb_info *CRAMFS_SB(struct super_block *sb)
 static const struct super_operations cramfs_ops;
 static const struct inode_operations cramfs_dir_inode_operations;
 static const struct file_operations cramfs_directory_operations;
+static const struct file_operations cramfs_physmem_fops;
 static const struct address_space_operations cramfs_aops;
 
 static DEFINE_MUTEX(read_mutex);
@@ -96,6 +99,10 @@ static struct inode *get_cramfs_inode(struct super_block *sb,
 	case S_IFREG:
 		inode->i_fop = &generic_ro_fops;
 		inode->i_data.a_ops = &cramfs_aops;
+		if (IS_ENABLED(CONFIG_CRAMFS_PHYSMEM) &&
+		    CRAMFS_SB(sb)->flags & CRAMFS_FLAG_EXT_BLOCK_POINTERS &&
+		    CRAMFS_SB(sb)->linear_phys_addr)
+			inode->i_fop = &cramfs_physmem_fops;
 		break;
 	case S_IFDIR:
 		inode->i_op = &cramfs_dir_inode_operations;
@@ -277,6 +284,294 @@ static void *cramfs_read(struct super_block *sb, unsigned int offset,
 		return NULL;
 }
 
+/*
+ * For a mapping to be possible, we need a range of uncompressed and
+ * contiguous blocks. Return the offset for the first block and number of
+ * valid blocks for which that is true, or zero otherwise.
+ */
+static u32 cramfs_get_block_range(struct inode *inode, u32 pgoff, u32 *pages)
+{
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	int i;
+	u32 *blockptrs, blockaddr;
+
+	/*
+	 * We can dereference memory directly here as this code may be
+	 * reached only when there is a direct filesystem image mapping
+	 * available in memory.
+	 */
+	blockptrs = (u32 *)(sbi->linear_virt_addr + OFFSET(inode) + pgoff*4);
+	blockaddr = blockptrs[0] & ~CRAMFS_BLK_FLAGS;
+	i = 0;
+	do {
+		u32 expect = blockaddr + i * (PAGE_SIZE >> 2);
+		expect |= CRAMFS_BLK_FLAG_DIRECT_PTR|CRAMFS_BLK_FLAG_UNCOMPRESSED;
+		if (blockptrs[i] != expect) {
+			pr_debug("range: block %d/%d got %#x expects %#x\n",
+				 pgoff+i, pgoff+*pages-1, blockptrs[i], expect);
+			if (i == 0)
+				return 0;
+			break;
+		}
+	} while (++i < *pages);
+
+	*pages = i;
+
+	/* stored "direct" block ptrs are shifted down by 2 bits */
+	return blockaddr << 2;
+}
+
+/*
+ * It is possible for cramfs_physmem_mmap() to partially populate the mapping
+ * causing page faults in the unmapped area. When that happens, we need to
+ * split the vma so that the unmapped area gets its own vma that can be backed
+ * with actual memory pages and loaded normally. This is necessary because
+ * remap_pfn_range() overwrites vma->vm_pgoff with the pfn and filemap_fault()
+ * no longer works with it. Furthermore this makes /proc/x/maps right.
+ * Q: is there a way to do split vma at mmap() time?
+ */
+static const struct vm_operations_struct cramfs_vmasplit_ops;
+static int cramfs_vmasplit_fault(struct vm_fault *vmf)
+{
+	struct mm_struct *mm = vmf->vma->vm_mm;
+	struct vm_area_struct *vma, *new_vma;
+	struct file *vma_file = get_file(vmf->vma->vm_file);
+	unsigned long split_val, split_addr;
+	unsigned int split_pgoff;
+	int ret;
+
+	/* We have some vma surgery to do and need the write lock. */
+	up_read(&mm->mmap_sem);
+	if (down_write_killable(&mm->mmap_sem)) {
+		fput(vma_file);
+		return VM_FAULT_RETRY;
+	}
+
+	/* Make sure the vma didn't change between the locks */
+	ret = VM_FAULT_SIGSEGV;
+	vma = find_vma(mm, vmf->address);
+	if (!vma)
+		goto out_fput;
+
+	/*
+	 * Someone else might have raced with us and handled the fault,
+	 * changed the vma, etc. If so let it go back to user space and
+	 * fault again if necessary.
+	 */
+	ret = VM_FAULT_NOPAGE;
+	if (vma->vm_ops != &cramfs_vmasplit_ops || vma->vm_file != vma_file)
+		goto out_fput;
+	fput(vma_file);
+
+	/* Retrieve the vma split address and validate it */
+	split_val = (unsigned long)vma->vm_private_data;
+	split_pgoff = split_val & 0xfff;
+	split_addr = (split_val >> 12) << PAGE_SHIFT;
+	if (split_addr < vma->vm_start) {
+		/* bottom of vma was unmapped */
+		split_pgoff += (vma->vm_start - split_addr) >> PAGE_SHIFT;
+		split_addr = vma->vm_start;
+	}
+	pr_debug("fault: addr=%#lx vma=%#lx-%#lx split=%#lx\n",
+		 vmf->address, vma->vm_start, vma->vm_end, split_addr);
+	ret = VM_FAULT_SIGSEGV;
+	if (!split_val || split_addr > vmf->address || vma->vm_end <= vmf->address)
+		goto out;
+
+	if (unlikely(vma->vm_start == split_addr)) {
+		/* nothing to split */
+		new_vma = vma;
+	} else {
+		/* Split away the directly mapped area */
+		ret = VM_FAULT_OOM;
+		if (split_vma(mm, vma, split_addr, 0) != 0)
+			goto out;
+
+		/* The direct vma should no longer ever fault */
+		vma->vm_ops = NULL;
+
+		/* Retrieve the new vma covering the unmapped area */
+		new_vma = find_vma(mm, split_addr);
+		BUG_ON(new_vma == vma);
+		ret = VM_FAULT_SIGSEGV;
+		if (!new_vma)
+			goto out;
+	}
+
+	/*
+	 * Readjust the new vma with the actual file based pgoff and
+	 * process the fault normally on it.
+	 */
+	new_vma->vm_pgoff = split_pgoff;
+	new_vma->vm_ops = &generic_file_vm_ops;
+	new_vma->vm_flags &= ~(VM_IO | VM_PFNMAP | VM_DONTEXPAND);
+	vmf->vma = new_vma;
+	vmf->pgoff = split_pgoff;
+	vmf->pgoff += (vmf->address - new_vma->vm_start) >> PAGE_SHIFT;
+	downgrade_write(&mm->mmap_sem);
+	return filemap_fault(vmf);
+
+out_fput:
+	fput(vma_file);
+out:
+	downgrade_write(&mm->mmap_sem);
+	return ret;
+}
+
+static const struct vm_operations_struct cramfs_vmasplit_ops = {
+	.fault	= cramfs_vmasplit_fault,
+};
+
+static int cramfs_physmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	unsigned int pages, vma_pages, max_pages, offset;
+	unsigned long address;
+	char *fail_reason;
+	int ret;
+
+	if (!IS_ENABLED(CONFIG_MMU))
+		return vma->vm_flags & (VM_SHARED | VM_MAYSHARE) ? 0 : -ENOSYS;
+
+	if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+		return -EINVAL;
+
+	/* Could COW work here? */
+	fail_reason = "vma is writable";
+	if (vma->vm_flags & VM_WRITE)
+		goto fail;
+
+	vma_pages = (vma->vm_end - vma->vm_start + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	fail_reason = "beyond file limit";
+	if (vma->vm_pgoff >= max_pages)
+		goto fail;
+	pages = vma_pages;
+	if (pages > max_pages - vma->vm_pgoff)
+		pages = max_pages - vma->vm_pgoff;
+
+	offset = cramfs_get_block_range(inode, vma->vm_pgoff, &pages);
+	fail_reason = "unsuitable block layout";
+	if (!offset)
+		goto fail;
+	address = sbi->linear_phys_addr + offset;
+	fail_reason = "data is not page aligned";
+	if (!PAGE_ALIGNED(address))
+		goto fail;
+
+	/* Don't map the last page if it contains some other data */
+	if (unlikely(vma->vm_pgoff + pages == max_pages)) {
+		unsigned int partial = offset_in_page(inode->i_size);
+		if (partial) {
+			char *data = sbi->linear_virt_addr + offset;
+			data += (max_pages - 1) * PAGE_SIZE + partial;
+			while ((unsigned long)data & 7)
+				if (*data++ != 0)
+					goto nonzero;
+			while (offset_in_page(data)) {
+				if (*(u64 *)data != 0) {
+					nonzero:
+					pr_debug("mmap: %s: last page is shared\n",
+						 file_dentry(file)->d_name.name);
+					pages--;
+					break;
+				}
+				data += 8;
+			}
+		}
+	}
+
+	if (pages) {
+		/*
+		 * If we can't map it all, page faults will occur if the
+		 * unmapped area is accessed. Let's handle them to split the
+		 * vma and let the normal paging machinery take care of the
+		 * rest through cramfs_readpage(). Because remap_pfn_range()
+		 * repurposes vma->vm_pgoff, we have to save it somewhere.
+		 * Let's use vma->vm_private_data to hold both the pgoff and
+		 * the actual address split point. Maximum file size is 16MB
+		 * (12 bits pgoff) and max 20 bits pfn where a long is 32 bits
+		 * so we can pack both together.
+		 */
+		if (pages != vma_pages) {
+			unsigned int split_pgoff = vma->vm_pgoff + pages;
+			unsigned long split_pfn = (vma->vm_start >> PAGE_SHIFT) + pages;
+			unsigned long split_val = split_pgoff | (split_pfn << 12);
+			vma->vm_private_data = (void *)split_val;
+			vma->vm_ops = &cramfs_vmasplit_ops;
+			/* to keep remap_pfn_range() happy */
+			vma->vm_end = vma->vm_start + pages * PAGE_SIZE;
+		}
+
+		ret = remap_pfn_range(vma, vma->vm_start, address >> PAGE_SHIFT,
+				      pages * PAGE_SIZE, vma->vm_page_prot);
+		/* restore vm_end in case we cheated it above */
+		vma->vm_end = vma->vm_start + vma_pages * PAGE_SIZE;
+		if (ret)
+			return ret;
+
+		pr_debug("mapped %s at 0x%08lx (%u/%u pages) to vma 0x%08lx, "
+			 "page_prot 0x%llx\n", file_dentry(file)->d_name.name,
+			 address, pages, vma_pages, vma->vm_start,
+			 (unsigned long long)pgprot_val(vma->vm_page_prot));
+		return 0;
+	}
+	fail_reason = "no suitable block remaining";
+
+fail:
+	pr_debug("%s: direct mmap failed: %s\n",
+		 file_dentry(file)->d_name.name, fail_reason);
+
+	/* We failed to do a direct map, but normal paging will do it */
+	vma->vm_ops = &generic_file_vm_ops;
+	return 0;
+}
+
+#ifndef CONFIG_MMU
+
+static unsigned long cramfs_physmem_get_unmapped_area(struct file *file,
+			unsigned long addr, unsigned long len,
+			unsigned long pgoff, unsigned long flags)
+{
+	struct inode *inode = file_inode(file);
+	struct super_block *sb = inode->i_sb;
+	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);
+	unsigned int pages, block_pages, max_pages, offset;
+
+	pages = (len + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	max_pages = (inode->i_size + PAGE_SIZE - 1) >> PAGE_SHIFT;
+	if (pgoff >= max_pages || pages > max_pages - pgoff)
+		return -EINVAL;
+	block_pages = pages;
+	offset = cramfs_get_block_range(inode, pgoff, &block_pages);
+	if (!offset || block_pages != pages)
+		return -ENOSYS;
+	addr = sbi->linear_phys_addr + offset;
+	pr_debug("get_unmapped for %s ofs %#lx siz %lu at 0x%08lx\n",
+		 file_dentry(file)->d_name.name, pgoff*PAGE_SIZE, len, addr);
+	return addr;
+}
+
+static unsigned cramfs_physmem_mmap_capabilities(struct file *file)
+{
+	return NOMMU_MAP_COPY | NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_EXEC;
+}
+#endif
+
+static const struct file_operations cramfs_physmem_fops = {
+	.llseek			= generic_file_llseek,
+	.read_iter		= generic_file_read_iter,
+	.splice_read		= generic_file_splice_read,
+	.mmap			= cramfs_physmem_mmap,
+#ifndef CONFIG_MMU
+	.get_unmapped_area	= cramfs_physmem_get_unmapped_area,
+	.mmap_capabilities	= cramfs_physmem_mmap_capabilities,
+#endif
+};
+
 static void cramfs_blkdev_kill_sb(struct super_block *sb)
 {
 	struct cramfs_sb_info *sbi = CRAMFS_SB(sb);