From patchwork Tue Nov 1 14:40:32 2011
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 4887
From: John Stultz
To: Dave Hansen
Cc: John Stultz
Subject: [PATCH] madvise: Add _VOLATILE, _ISVOLATILE, and _NONVOLATILE flags
Date: Tue, 1 Nov 2011 07:40:32 -0700
Message-Id: <1320158432-25229-1-git-send-email-john.stultz@linaro.org>

This patch provides new madvise flags that can be used to mark memory
as volatile, which will allow it to be discarded if the kernel wants to
reclaim memory. This patch does not do the actual discard; instead it
provides the infrastructure for specifying whether page ranges are
volatile or not. The discard itself will be done in a following patch.

This is very much influenced by the Android Ashmem interface by Robert
Love, so credit goes to him and the Android developers.
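For illustration, a minimal userspace sketch of the intended calling
pattern (not part of the patch itself): it assumes the MADV_* values
added below in asm-generic/mman-common.h, an example shmem-backed file
path, and the assumption, inferred from madvise_nonvolatile() below,
that a purged range is reported as a positive return value.

        #include <fcntl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        /* Values from the mman-common.h hunk below; not yet in libc headers. */
        #ifndef MADV_VOLATILE
        #define MADV_VOLATILE           16
        #define MADV_ISVOLATILE         17
        #define MADV_NONVOLATILE        18
        #endif

        int main(void)
        {
                size_t len = 16 * 4096;
                /* Example shmem-backed file; the new hooks require vma->vm_file. */
                int fd = open("/dev/shm/volatile-demo", O_RDWR | O_CREAT, 0600);

                if (fd < 0 || ftruncate(fd, len) < 0)
                        return 1;

                char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
                if (buf == MAP_FAILED)
                        return 1;

                /* ... fill buf with cache data that can be regenerated ... */

                /* Idle for a while: let the kernel reclaim this range if needed. */
                madvise(buf, len, MADV_VOLATILE);

                /* About to reuse the data: unmark it. A positive return would
                 * indicate the range was purged and must be rebuilt (assumption
                 * based on madvise_nonvolatile() returning the purged state). */
                if (madvise(buf, len, MADV_NONVOLATILE) > 0) {
                        /* ... regenerate buf ... */
                }

                munmap(buf, len);
                close(fd);
                return 0;
        }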
In many cases the code & logic come directly from the ashmem patch.
The intent of this patch is to allow for ashmem-like behavior, but it
embeds the idea a little deeper into the VM code instead of isolating
it in a specific driver.

Also, many thanks to Dave Hansen, who helped design and develop the
initial version of this patch and has provided continued review and
mentoring in the VM code.

Signed-off-by: John Stultz
---
 fs/inode.c                        |    1 +
 include/asm-generic/mman-common.h |    3 +
 include/linux/fs.h                |   13 +++
 mm/madvise.c                      |  172 +++++++++++++++++++++++++++++++++++++
 4 files changed, 189 insertions(+), 0 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index ecbb68d..4285cef 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -278,6 +278,7 @@ void address_space_init_once(struct address_space *mapping)
 	spin_lock_init(&mapping->private_lock);
 	INIT_RAW_PRIO_TREE_ROOT(&mapping->i_mmap);
 	INIT_LIST_HEAD(&mapping->i_mmap_nonlinear);
+	INIT_LIST_HEAD(&mapping->unpinned_list);
 }
 EXPORT_SYMBOL(address_space_init_once);
 
diff --git a/include/asm-generic/mman-common.h b/include/asm-generic/mman-common.h
index 787abbb..c633d13 100644
--- a/include/asm-generic/mman-common.h
+++ b/include/asm-generic/mman-common.h
@@ -47,6 +47,9 @@
 #define MADV_HUGEPAGE	14		/* Worth backing with hugepages */
 #define MADV_NOHUGEPAGE	15		/* Not worth backing with hugepages */
 
+#define MADV_VOLATILE	16		/* _can_ toss, but don't toss now */
+#define MADV_ISVOLATILE	17		/* Check if page is marked volatile or not */
+#define MADV_NONVOLATILE 18		/* Remove VOLATILE flag */
 
 /* compatibility flags */
 #define MAP_FILE	0
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7a049fd..935d8b4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -632,6 +632,18 @@ int pagecache_write_end(struct file *, struct address_space *mapping,
 				loff_t pos, unsigned len, unsigned copied,
 				struct page *page, void *fsdata);
+
+
+/* unpinned_mem_range & range macros from Robert Love's Ashmem patch */
+struct unpinned_mem_range {
+	struct address_space *addrsp;	/* associated address_space */
+	struct list_head unpinned;	/* Next unpinned range */
+	size_t start;
+	size_t end;
+	unsigned int purged;
+};
+
+
 struct backing_dev_info;
 struct address_space {
 	struct inode		*host;		/* owner: inode, block_device */
@@ -650,6 +662,7 @@ struct address_space {
 	spinlock_t		private_lock;	/* for use by the address_space */
 	struct list_head	private_list;	/* ditto */
 	struct address_space	*assoc_mapping;	/* ditto */
+	struct list_head	unpinned_list;	/* unpinned area list */
 } __attribute__((aligned(sizeof(long))));
 	/*
 	 * On most architectures that alignment is already the case; but
diff --git a/mm/madvise.c b/mm/madvise.c
index 74bf193..4f59049 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -225,6 +225,169 @@ static long madvise_remove(struct vm_area_struct *vma,
 	return error;
 }
 
+
+
+#define range_size(range) \
+	((range)->end - (range)->start + 1)
+#define page_range_subsumes_range(range, start_addr, end_addr) \
+	(((range)->start >= (start_addr)) && ((range)->end <= (end_addr)))
+#define page_range_subsumed_by_range(range, start_addr, end_addr) \
+	(((range)->start <= (start_addr)) && ((range)->end >= (end_addr)))
+#define page_in_range(range, page) \
+	(((range)->start <= (page)) && ((range)->end >= (page)))
+#define page_range_in_range(range, start_addr, end_addr) \
+	(page_in_range(range, start_addr) || page_in_range(range, end_addr) || \
+	 page_range_subsumes_range(range, start_addr, end_addr))
+#define range_before_page(range, page) \
+	((range)->end < (page))
+
+
+static int unpinned_range_alloc(struct address_space *addrsp,
+				struct unpinned_mem_range *prev_range,
+				unsigned int purged, size_t start, size_t end)
+{
+	struct unpinned_mem_range *range;
+
+	range = kzalloc(sizeof(struct unpinned_mem_range), GFP_KERNEL);
+	if (unlikely(!range))
+		return -ENOMEM;
+
+	range->addrsp = addrsp;
+	range->start = start;
+	range->end = end;
+	range->purged = purged;
+
+	list_add_tail(&range->unpinned, &prev_range->unpinned);
+
+	return 0;
+}
+
+static void unpinned_range_del(struct unpinned_mem_range *range)
+{
+	list_del(&range->unpinned);
+	kfree(range);
+}
+
+static inline void unpinned_range_shrink(struct unpinned_mem_range *range,
+					 size_t start, size_t end)
+{
+	range->start = start;
+	range->end = end;
+}
+
+
+static long madvise_volatile(struct vm_area_struct *vma,
+			     unsigned long start, unsigned long end)
+{
+	struct unpinned_mem_range *range, *next;
+	unsigned int purged = 0;
+	int ret;
+	struct address_space *addrsp;
+
+	/* XXX - check start->end is within mmapped shm region */
+	if (!vma->vm_file)
+		return -1;
+
+	addrsp = vma->vm_file->f_mapping;
+
+restart:
+	list_for_each_entry_safe(range, next, &addrsp->unpinned_list,
+				 unpinned) {
+		if (range_before_page(range, start))
+			break;
+
+		if (page_range_subsumed_by_range(range, start, end))
+			return 0;
+		if (page_range_in_range(range, start, end)) {
+			start = min_t(size_t, range->start, start);
+			end = max_t(size_t, range->end, end);
+			purged |= range->purged;
+			unpinned_range_del(range);
+			goto restart;
+		}
+	}
+	ret = unpinned_range_alloc(addrsp, range, purged, start, end);
+	return ret;
+}
+
+static long madvise_nonvolatile(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end)
+{
+	struct unpinned_mem_range *range, *next;
+	int ret = 0;
+	struct address_space *addrsp;
+
+	if (!vma->vm_file)
+		return -1;
+	addrsp = vma->vm_file->f_mapping;
+
+	list_for_each_entry_safe(range, next, &addrsp->unpinned_list,
+				 unpinned) {
+		if (range_before_page(range, start))
+			break;
+
+		if (page_range_in_range(range, start, end)) {
+			ret |= range->purged;
+			/* Case #1: Easy. Just nuke the whole thing. */
+			if (page_range_subsumes_range(range, start, end)) {
+				unpinned_range_del(range);
+				continue;
+			}
+
+			/* Case #2: We overlap from the start, so adjust it */
+			if (range->start >= start) {
+				unpinned_range_shrink(range, end + 1,
+						      range->end);
+				continue;
+			}
+
+			/* Case #3: We overlap from the rear, so adjust it */
+			if (range->end <= end) {
+				unpinned_range_shrink(range, range->start,
						      start - 1);
+				continue;
+			}
+
+			/*
+			 * Case #4: We eat a chunk out of the middle. A bit
+			 * more complicated, we allocate a new range for the
+			 * second half and adjust the first chunk's endpoint.
+			 */
+			unpinned_range_alloc(addrsp, range,
+					     range->purged, end + 1,
+					     range->end);
+			unpinned_range_shrink(range, range->start, start - 1);
+		}
+	}
+	return ret;
+}
+
+static long madvise_isvolatile(struct vm_area_struct *vma,
+			       unsigned long start, unsigned long end)
+{
+	struct unpinned_mem_range *range;
+	long ret = 0;
+	struct address_space *addrsp;
+
+	if (!vma->vm_file)
+		return -1;
+	addrsp = vma->vm_file->f_mapping;
+
+	list_for_each_entry(range, &addrsp->unpinned_list, unpinned) {
+		if (range_before_page(range, start))
+			break;
+		if (page_range_in_range(range, start, end)) {
+			ret = 1;
+			break;
+		}
+	}
+	return ret;
+}
+
 #ifdef CONFIG_MEMORY_FAILURE
 /*
  * Error injection support for memory error handling.
@@ -268,6 +431,12 @@ madvise_vma(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		return madvise_willneed(vma, prev, start, end);
 	case MADV_DONTNEED:
 		return madvise_dontneed(vma, prev, start, end);
+	case MADV_VOLATILE:
+		return madvise_volatile(vma, start, end);
+	case MADV_ISVOLATILE:
+		return madvise_isvolatile(vma, start, end);
+	case MADV_NONVOLATILE:
+		return madvise_nonvolatile(vma, start, end);
 	default:
 		return madvise_behavior(vma, prev, start, end, behavior);
 	}
@@ -293,6 +462,9 @@ madvise_behavior_valid(int behavior)
 	case MADV_HUGEPAGE:
 	case MADV_NOHUGEPAGE:
 #endif
+	case MADV_VOLATILE:
+	case MADV_ISVOLATILE:
+	case MADV_NONVOLATILE:
 		return 1;
 	default: