From patchwork Fri Mar 14 18:33:31 2014
X-Patchwork-Submitter: John Stultz
X-Patchwork-Id: 26283
From: John Stultz
To: LKML
Cc: John Stultz, Andrew Morton, Android Kernel Team, Johannes Weiner,
    Robert Love, Mel Gorman, Hugh Dickins, Dave Hansen, Rik van Riel,
    Dmitry Adamushko, Neil Brown, Andrea Arcangeli, Mike Hommey,
    Taras Glek, Dhaval Giani, Jan Kara, KOSAKI Motohiro,
    Michel Lespinasse, Minchan Kim, "linux-mm@kvack.org"
Subject: [PATCH 1/3] vrange: Add vrange syscall and handle splitting/merging and marking vmas
Date: Fri, 14 Mar 2014 11:33:31 -0700
Message-Id: <1394822013-23804-2-git-send-email-john.stultz@linaro.org>
X-Mailer: git-send-email 1.8.3.2
In-Reply-To: <1394822013-23804-1-git-send-email-john.stultz@linaro.org>
References: <1394822013-23804-1-git-send-email-john.stultz@linaro.org>

This patch introduces the vrange() syscall, which allows ranges of
memory to be marked as volatile, so that the system may discard them.

This initial patch simply adds the syscall and the VMA handling:
splitting and merging VMAs as needed and marking them with
VM_VOLATILE.

No purging or discarding of volatile ranges is done at this point.
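To illustrate the splitting described above, here is a minimal userspace model of how a single VMA is carved up when only part of it falls inside the requested range. struct range and split_for_range() are hypothetical names for this sketch, not kernel API; the model tracks one volatile bit and deliberately leaves out vma_merge(), so neighbouring pieces with equal flags are not coalesced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a half-open range [start, end) plus a
 * single volatile bit. This is not the kernel's vm_area_struct. */
struct range {
	unsigned long start;
	unsigned long end;
	int is_volatile;
};

/*
 * Apply "mark [start, end) volatile (or not)" to one VMA-like range.
 * Writes up to three pieces to out[] in address order and returns the
 * number written. Pieces outside the request keep the old flag; the
 * overlapping piece gets the new one. No overlap means no change.
 */
static size_t split_for_range(struct range vma, unsigned long start,
			      unsigned long end, int is_volatile,
			      struct range out[3])
{
	size_t n = 0;

	assert(vma.start < vma.end && start < end);

	/* Clamp the request to this VMA, as do_vrange() does with tmp. */
	if (start < vma.start)
		start = vma.start;
	if (end > vma.end)
		end = vma.end;
	if (start >= end) {
		out[n++] = vma;		/* no overlap: VMA unchanged */
		return n;
	}

	/* Head piece, analogous to split_vma(mm, vma, start, 1). */
	if (start > vma.start)
		out[n++] = (struct range){ vma.start, start, vma.is_volatile };

	/* Middle piece carries the new flag. */
	out[n++] = (struct range){ start, end, is_volatile };

	/* Tail piece, analogous to split_vma(mm, vma, tmp, 0). */
	if (end < vma.end)
		out[n++] = (struct range){ end, vma.end, vma.is_volatile };

	return n;
}
```

For a VMA covering [0x1000, 0x5000) and a request for [0x2000, 0x3000), the model yields three pieces with only the middle one volatile, which corresponds to the two split_vma() calls in do_vrange(); a request covering the whole VMA yields a single piece whose flag simply changes.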
Example man page:

NAME
	vrange - Mark or unmark a range of memory as volatile

SYNOPSIS
	int vrange(unsigned long start, size_t length, int mode,
		   int *purged);

DESCRIPTION
	Applications can use vrange(2) to advise the kernel how it
	should handle paging I/O in this VM area. The idea is to let
	the kernel discard vrange pages, instead of reclaiming them,
	when memory pressure occurs. In other words, the kernel does
	not discard any vrange pages while there is no memory pressure.

	mode:
	VRANGE_VOLATILE
		Hint to the kernel that the VM may discard pages in the
		range when memory pressure occurs.
	VRANGE_NONVOLATILE
		Hint to the kernel that the VM should no longer discard
		pages in the range.

	If a process accesses purged memory without first calling
	vrange() with VRANGE_NONVOLATILE, it may receive SIGBUS if the
	page was discarded by the kernel.

	purged:
		Pointer to an integer that is set to 1 if
		mode == VRANGE_NONVOLATILE and any page in the affected
		range was purged. If purged is zero after a
		mode == VRANGE_NONVOLATILE call, all of the pages in the
		range are intact.

RETURN VALUE
	On success, vrange() returns the number of bytes marked or
	unmarked. Similar to write(), it may return fewer bytes than
	specified if it ran into a problem.

	If an error is returned, no changes were made.

ERRORS
	EINVAL	This error can occur for the following reasons:
		* length is less than one page (length is rounded down
		  to a multiple of the page size).
		* start is not page-aligned.
		* mode is not a valid value.
	ENOMEM	Not enough memory.
	EFAULT	purged is an invalid pointer.

This is a simplified implementation which reuses some of the logic
from Minchan's earlier efforts, so credit to Minchan for his work.
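Because vrange(), like write(), may cover fewer bytes than requested, a caller will usually loop until the whole range is done. The following is a hedged userspace sketch: vrange_all(), vrange_fn and fake_vrange() are hypothetical names, and the raw call is injected as a function pointer rather than assumed to exist in libc. On a kernel carrying this patch it could be wired to syscall(2) with number 316 from the syscall_64.tbl hunk, but that is an assumption; kernels without the patch may use that number for a different system call. Treating a zero return as an error is a choice made here to avoid looping forever, not something the man page specifies.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Mode values from this patch's include/linux/vrange.h. */
#define VRANGE_NONVOLATILE 0
#define VRANGE_VOLATILE    1

/* Hypothetical type for the raw call (libc-style: -1 and errno on error). */
typedef ssize_t (*vrange_fn)(unsigned long start, size_t len, int mode,
			     int *purged);

/*
 * Hypothetical helper: repeat the call until [start, start + len) has been
 * fully processed. *any_purged becomes 1 if any call reported purged
 * pages. Returns 0 on success, or -1 with errno set.
 */
static int vrange_all(vrange_fn fn, unsigned long start, size_t len,
		      int mode, int *any_purged)
{
	assert(fn != NULL && any_purged != NULL);
	*any_purged = 0;
	while (len > 0) {
		int purged = 0;
		ssize_t done = fn(start, len, mode, &purged);

		if (done < 0)
			return -1;		/* errno comes from the raw call */
		if (done == 0 || (size_t)done > len) {
			errno = EINVAL;		/* no progress or bogus count */
			return -1;
		}
		if (purged)
			*any_purged = 1;
		start += (unsigned long)done;
		len -= (size_t)done;
	}
	return 0;
}

/*
 * Illustrative stand-in for the raw call, so the loop can be exercised
 * without a patched kernel: it handles at most one 4 KiB page per call
 * and reports the page at 0x2000 as purged when unmarking.
 */
static ssize_t fake_vrange(unsigned long start, size_t len, int mode,
			   int *purged)
{
	*purged = (mode == VRANGE_NONVOLATILE && start == 0x2000);
	return len < 4096 ? (ssize_t)len : 4096;
}
```

A cache built on this would call vrange_all(..., VRANGE_NONVOLATILE, &purged) before reusing a buffer and regenerate its contents when purged is 1, instead of reading data that may have been discarded. Since the kernel rounds length down to whole pages, callers should pass page-multiple lengths; otherwise the final partial page ends in EINVAL.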
Cc: Andrew Morton
Cc: Android Kernel Team
Cc: Johannes Weiner
Cc: Robert Love
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: Dave Hansen
Cc: Rik van Riel
Cc: Dmitry Adamushko
Cc: Neil Brown
Cc: Andrea Arcangeli
Cc: Mike Hommey
Cc: Taras Glek
Cc: Dhaval Giani
Cc: Jan Kara
Cc: KOSAKI Motohiro
Cc: Michel Lespinasse
Cc: Minchan Kim
Cc: linux-mm@kvack.org
Signed-off-by: John Stultz
---
 arch/x86/syscalls/syscall_64.tbl |   1 +
 include/linux/mm.h               |   1 +
 include/linux/vrange.h           |   7 ++
 mm/Makefile                      |   2 +-
 mm/vrange.c                      | 155 +++++++++++++++++++++++++++++++++++++++
 5 files changed, 165 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/vrange.h
 create mode 100644 mm/vrange.c

diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..7ae3940 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	vrange			sys_vrange
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/mm.h b/include/linux/mm.h
index c1b7414..a1f11da 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -117,6 +117,7 @@ extern unsigned int kobjsize(const void *objp);
 #define VM_IO           0x00004000	/* Memory mapped I/O or similar */
 
 					/* Used by sys_madvise() */
+#define VM_VOLATILE	0x00001000	/* VMA is volatile */
 #define VM_SEQ_READ	0x00008000	/* App will access data sequentially */
 #define VM_RAND_READ	0x00010000	/* App will not benefit from clustered reads */
 
diff --git a/include/linux/vrange.h b/include/linux/vrange.h
new file mode 100644
index 0000000..652396b
--- /dev/null
+++ b/include/linux/vrange.h
@@ -0,0 +1,7 @@
+#ifndef _LINUX_VRANGE_H
+#define _LINUX_VRANGE_H
+
+#define VRANGE_NONVOLATILE 0
+#define VRANGE_VOLATILE 1
+
+#endif /* _LINUX_VRANGE_H */
diff --git a/mm/Makefile b/mm/Makefile
index 310c90a..20229e2 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -16,7 +16,7 @@ obj-y			:= filemap.o mempool.o oom_kill.o fadvise.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   util.o mmzone.o vmstat.o backing-dev.o \
 			   mm_init.o mmu_context.o percpu.o slab_common.o \
-			   compaction.o balloon_compaction.o \
+			   compaction.o balloon_compaction.o vrange.o \
 			   interval_tree.o list_lru.o $(mmu-y)
 
 obj-y += init-mm.o
diff --git a/mm/vrange.c b/mm/vrange.c
new file mode 100644
index 0000000..acb4356
--- /dev/null
+++ b/mm/vrange.c
@@ -0,0 +1,155 @@
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include "internal.h"
+
+static ssize_t do_vrange(struct mm_struct *mm, unsigned long start,
+				unsigned long end, int mode, int *purged)
+{
+	struct vm_area_struct *vma, *prev;
+	unsigned long orig_start = start;
+	ssize_t count = 0, ret = 0;
+	int lpurged = 0;
+
+	down_write(&mm->mmap_sem);
+
+	vma = find_vma_prev(mm, start, &prev);
+	if (vma && start > vma->vm_start)
+		prev = vma;
+
+	for (;;) {
+		unsigned long new_flags;
+		pgoff_t pgoff;
+		unsigned long tmp;
+
+		if (!vma)
+			goto out;
+
+		if (vma->vm_flags & (VM_SPECIAL|VM_LOCKED|VM_MIXEDMAP|
+					VM_HUGETLB))
+			goto out;
+
+		/* We don't support volatility on files for now */
+		if (vma->vm_file) {
+			ret = -EINVAL;
+			goto out;
+		}
+
+		new_flags = vma->vm_flags;
+
+		if (start < vma->vm_start) {
+			start = vma->vm_start;
+			if (start >= end)
+				goto out;
+		}
+		tmp = vma->vm_end;
+		if (end < tmp)
+			tmp = end;
+
+		switch (mode) {
+		case VRANGE_VOLATILE:
+			new_flags |= VM_VOLATILE;
+			break;
+		case VRANGE_NONVOLATILE:
+			new_flags &= ~VM_VOLATILE;
+		}
+
+		pgoff = vma->vm_pgoff + ((start - vma->vm_start) >> PAGE_SHIFT);
+		prev = vma_merge(mm, prev, start, tmp, new_flags,
+					vma->anon_vma, vma->vm_file, pgoff,
+					vma_policy(vma));
+		if (prev) {
+			vma = prev;	/* vma may have been merged away */
+			goto success;
+		}
+
+		if (start != vma->vm_start) {
+			ret = split_vma(mm, vma, start, 1);
+			if (ret)
+				goto out;
+		}
+
+		if (tmp != vma->vm_end) {
+			ret = split_vma(mm, vma, tmp, 0);
+			if (ret)
+				goto out;
+		}
+
+		prev = vma;
+success:
+		vma->vm_flags = new_flags;
+		*purged = lpurged;
+
+		/* update count to distance covered so far */
+		count = tmp - orig_start;
+
+		if (prev && start < prev->vm_end)
+			start = prev->vm_end;
+		if (start >= end)
+			goto out;
+		if (prev)
+			vma = prev->vm_next;
+		else /* not expected: prev is always set above */
+			vma = find_vma(mm, start);
+	}
+out:
+	up_write(&mm->mmap_sem);
+
+	/* report bytes successfully marked, even if we're exiting on error */
+	if (count)
+		return count;
+
+	return ret;
+}
+
+SYSCALL_DEFINE4(vrange, unsigned long, start,
+		size_t, len, int, mode, int __user *, purged)
+{
+	unsigned long end;
+	struct mm_struct *mm = current->mm;
+	ssize_t ret = -EINVAL;
+	int p = 0;
+
+	if (mode != VRANGE_VOLATILE && mode != VRANGE_NONVOLATILE)
+		goto out;
+
+	if (start & ~PAGE_MASK)
+		goto out;
+
+	len &= PAGE_MASK;
+	if (!len)
+		goto out;
+
+	end = start + len;
+	if (end < start)
+		goto out;
+
+	if (start >= TASK_SIZE)
+		goto out;
+
+	if (purged) {
+		/* Test pointer is valid before making any changes */
+		if (put_user(p, purged))
+			return -EFAULT;
+	}
+
+	ret = do_vrange(mm, start, end, mode, &p);
+
+	if (purged) {
+		if (put_user(p, purged)) {
+			/*
+			 * This would be bad, since we've modified volatility
+			 * and the change in purged state would be lost.
+			 */
+			WARN_ONCE(1, "vrange: purge state possibly lost\n");
+		}
+	}
+
+out:
+	return ret;
+}