From patchwork Tue May 9 10:04:53 2017
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Arnd Bergmann
X-Patchwork-Id: 98902
From: Arnd Bergmann
To: Ben Hutchings
Cc: stable@vger.kernel.org, "Peter Zijlstra (Intel)",
    Arnaldo Carvalho de Melo, Javi Merino, Linus Torvalds,
    Mathieu Desnoyers, Oleg Nesterov, Paul Mackerras, Petr Mladek,
    Steven Rostedt, Tom Zanussi, Vaibhav Nagarnaik, Ingo Molnar,
    Arnd Bergmann
Subject: [PATCH 3.16-stable 05/14] perf: Avoid horrible stack usage
Date: Tue, 9 May 2017 12:04:53 +0200
Message-Id: <20170509100502.1358298-6-arnd@arndb.de>
X-Mailer: git-send-email 2.9.0
In-Reply-To: <20170509100502.1358298-1-arnd@arndb.de>
References: <20170509100502.1358298-1-arnd@arndb.de>
From: "Peter Zijlstra (Intel)"

commit 86038c5ea81b519a8a1fcfcd5e4599aab0cdd119 upstream.

Both Linus (most recent) and Steve (a while ago) reported that perf
related callbacks have massive stack bloat.

The problem is that software events need a pt_regs in order to
properly report the event location and unwind stack. And because we
could not assume one was present we allocated one on stack and filled
it with minimal bits required for operation.

Now, pt_regs is quite large, so this is undesirable. Furthermore it
turns out that most sites actually have a pt_regs pointer available,
making this even more onerous, as the stack space is pointless waste.

This patch addresses the problem by observing that software events
have well defined nesting semantics, therefore we can use static
per-cpu storage instead of on-stack.

Linus made the further observation that all but the scheduler callers
of perf_sw_event() have a pt_regs available, so we change the regular
perf_sw_event() to require a valid pt_regs (where it used to be
optional) and add perf_sw_event_sched() for the scheduler.

We have a scheduler specific call instead of a more generic _noregs()
like construct because we can assume non-recursion from the scheduler
and thereby simplify the code further (_noregs would have to put the
recursion context call inline in order to ascertain which __perf_regs
element to use).

One last note on the implementation of perf_trace_buf_prepare(); we
allow .regs = NULL for those cases where we already have a pt_regs
pointer available and do not need another.
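The core idea is easy to see in isolation. The sketch below is a minimal,
single-threaded userspace C model of what the commit describes, not the
kernel code itself: instead of placing a large register snapshot on the
stack at every software-event call site, keep one statically allocated
slot per nesting level and hand out the slot matching the current
recursion context. All names in it (fake_pt_regs, report_sw_event,
get_recursion_context and so on) are illustrative stand-ins; the real
implementation uses per-CPU storage (DEFINE_PER_CPU(struct pt_regs,
__perf_regs[4])) and the perf recursion-context helpers visible in the
diff below.

/*
 * Minimal single-threaded sketch of the idea, not the kernel code:
 * one statically allocated register snapshot per nesting level,
 * handed out according to the current recursion context, instead of
 * a large on-stack snapshot at every event site.  All names below
 * are illustrative stand-ins.
 */
#include <stdio.h>
#include <string.h>

#define NR_CONTEXTS 4            /* task, softirq, hardirq, NMI in the kernel */

struct fake_pt_regs {
	unsigned long ip;
	unsigned long sp;
	unsigned long gpr[32];   /* stands in for the large architectural state */
};

/* One slot per nesting level; per-CPU in the kernel, a plain global here. */
static struct fake_pt_regs event_regs[NR_CONTEXTS];
static int recursion[NR_CONTEXTS];
static int nesting_level;        /* which context we are "in" right now */

/* Claim the slot for the current context, or -1 if it is already in use. */
static int get_recursion_context(void)
{
	int rctx = nesting_level;

	if (recursion[rctx])
		return -1;
	recursion[rctx] = 1;
	return rctx;
}

static void put_recursion_context(int rctx)
{
	recursion[rctx] = 0;
}

/* Event site: no struct fake_pt_regs on the stack, only a pointer. */
static void report_sw_event(unsigned long ip)
{
	int rctx = get_recursion_context();
	struct fake_pt_regs *regs;

	if (rctx < 0)
		return;          /* recursed into ourselves, drop the event */

	regs = &event_regs[rctx];
	memset(regs, 0, sizeof(*regs));
	regs->ip = ip;           /* fill only the bits consumers need */

	printf("event at ip=%#lx using slot %d\n", regs->ip, rctx);
	put_recursion_context(rctx);
}

int main(void)
{
	report_sw_event(0x1000);
	nesting_level = 1;       /* pretend an interrupt nested on top */
	report_sw_event(0x2000);
	return 0;
}

The per-level slot is what makes static storage safe here: an event
raised by an interrupt that nests on top of another event gets a
different slot, so neither snapshot overwrites the other.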
Reported-by: Linus Torvalds
Reported-by: Steven Rostedt
Signed-off-by: Peter Zijlstra (Intel)
Cc: Arnaldo Carvalho de Melo
Cc: Javi Merino
Cc: Linus Torvalds
Cc: Mathieu Desnoyers
Cc: Oleg Nesterov
Cc: Paul Mackerras
Cc: Petr Mladek
Cc: Steven Rostedt
Cc: Tom Zanussi
Cc: Vaibhav Nagarnaik
Link: http://lkml.kernel.org/r/20141216115041.GW3337@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
Signed-off-by: Arnd Bergmann
---
 include/linux/ftrace_event.h    |  2 +-
 include/linux/perf_event.h      | 28 +++++++++++++++++++++-------
 include/trace/ftrace.h          |  7 ++++---
 kernel/events/core.c            | 23 +++++++++++++++++------
 kernel/sched/core.c             |  2 +-
 kernel/trace/trace_event_perf.c |  4 +++-
 kernel/trace/trace_kprobe.c     |  4 ++--
 kernel/trace/trace_syscalls.c   |  4 ++--
 kernel/trace/trace_uprobe.c     |  2 +-
 9 files changed, 52 insertions(+), 24 deletions(-)

-- 
2.9.0

diff --git a/include/linux/ftrace_event.h b/include/linux/ftrace_event.h
index cff3106ffe2c..c817d7938b4b 100644
--- a/include/linux/ftrace_event.h
+++ b/include/linux/ftrace_event.h
@@ -621,7 +621,7 @@ extern int ftrace_profile_set_filter(struct perf_event *event, int event_id,
 				     char *filter_str);
 extern void ftrace_profile_free_filter(struct perf_event *event);
 extern void *perf_trace_buf_prepare(int size, unsigned short type,
-				    struct pt_regs *regs, int *rctxp);
+				    struct pt_regs **regs, int *rctxp);
 
 static inline void
 perf_trace_buf_submit(void *raw_data, int size, int rctx, u64 addr,
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 7e8445e9dcbf..d86153fef0e0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -642,6 +642,7 @@ static inline int is_software_event(struct perf_event *event)
 
 extern struct static_key perf_swevent_enabled[PERF_COUNT_SW_MAX];
 
+extern void ___perf_sw_event(u32, u64, struct pt_regs *, u64);
 extern void __perf_sw_event(u32, u64, struct pt_regs *, u64);
 
 #ifndef perf_arch_fetch_caller_regs
@@ -666,14 +667,25 @@ static inline void perf_fetch_caller_regs(struct pt_regs *regs)
 
 static __always_inline void
 perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 {
-	struct pt_regs hot_regs;
+	if (static_key_false(&perf_swevent_enabled[event_id]))
+		__perf_sw_event(event_id, nr, regs, addr);
+}
+
+DECLARE_PER_CPU(struct pt_regs, __perf_regs[4]);
 
+/*
+ * 'Special' version for the scheduler, it hard assumes no recursion,
+ * which is guaranteed by us not actually scheduling inside other swevents
+ * because those disable preemption.
+ */
+static __always_inline void
+perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)
+{
 	if (static_key_false(&perf_swevent_enabled[event_id])) {
-		if (!regs) {
-			perf_fetch_caller_regs(&hot_regs);
-			regs = &hot_regs;
-		}
-		__perf_sw_event(event_id, nr, regs, addr);
+		struct pt_regs *regs = this_cpu_ptr(&__perf_regs[0]);
+
+		perf_fetch_caller_regs(regs);
+		___perf_sw_event(event_id, nr, regs, addr);
 	}
 }
@@ -689,7 +701,7 @@ static inline void perf_event_task_sched_in(struct task_struct *prev,
 static inline void perf_event_task_sched_out(struct task_struct *prev,
 					     struct task_struct *next)
 {
-	perf_sw_event(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, NULL, 0);
+	perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
 
 	if (static_key_false(&perf_sched_events.key))
 		__perf_event_task_sched_out(prev, next);
@@ -800,7 +812,7 @@ static inline int perf_event_refresh(struct perf_event *event, int refresh)
 static inline void
 perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)	{ }
 static inline void
+perf_sw_event_sched(u32 event_id, u64 nr, u64 addr)			{ }
+static inline void
 perf_bp_event(struct perf_event *event, void *data)			{ }
 
 static inline int perf_register_guest_info_callbacks
diff --git a/include/trace/ftrace.h b/include/trace/ftrace.h
index 26b4f2e13275..bb1f5d82ad49 100644
--- a/include/trace/ftrace.h
+++ b/include/trace/ftrace.h
@@ -765,7 +765,7 @@ perf_trace_##call(void *__data, proto)				\
 	struct ftrace_event_call *event_call = __data;			\
 	struct ftrace_data_offsets_##call __maybe_unused __data_offsets;\
 	struct ftrace_raw_##call *entry;				\
-	struct pt_regs __regs;						\
+	struct pt_regs *__regs;						\
 	u64 __addr = 0, __count = 1;					\
 	struct task_struct *__task = NULL;				\
 	struct hlist_head *head;					\
@@ -784,18 +784,19 @@ perf_trace_##call(void *__data, proto)				\
 			     sizeof(u64));				\
 	__entry_size -= sizeof(u32);					\
 									\
-	perf_fetch_caller_regs(&__regs);				\
 	entry = perf_trace_buf_prepare(__entry_size,			\
 			event_call->event.type, &__regs, &rctx);	\
 	if (!entry)							\
 		return;							\
 									\
+	perf_fetch_caller_regs(__regs);					\
+									\
 	tstruct								\
 									\
 	{ assign; }							\
 									\
 	perf_trace_buf_submit(entry, __entry_size, rctx, __addr,	\
-		__count, &__regs, head, __task);			\
+		__count, __regs, head, __task);				\
 }
 
 /*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2faaed3ba61b..48988c7ac954 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5934,6 +5934,8 @@ end:
 	rcu_read_unlock();
 }
 
+DEFINE_PER_CPU(struct pt_regs, __perf_regs[4]);
+
 int perf_swevent_get_recursion_context(void)
 {
 	struct swevent_htable *swhash = &__get_cpu_var(swevent_htable);
@@ -5949,21 +5951,30 @@ inline void perf_swevent_put_recursion_context(int rctx)
 	put_recursion_context(swhash->recursion, rctx);
 }
 
-void __perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
+void ___perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
 {
 	struct perf_sample_data data;
-	int rctx;
 
-	preempt_disable_notrace();
-	rctx = perf_swevent_get_recursion_context();
-	if (rctx < 0)
+	if (WARN_ON_ONCE(!regs))
 		return;
 
 	perf_sample_data_init(&data, addr, 0);
-
 	do_perf_sw_event(PERF_TYPE_SOFTWARE, event_id, nr, &data, regs);
+}
+
+void __perf_sw_event(u32 event_id, u64 nr, struct pt_regs *regs, u64 addr)
+{
+	int rctx;
+
+	preempt_disable_notrace();
+	rctx = perf_swevent_get_recursion_context();
+	if (unlikely(rctx < 0))
+		goto fail;
+
+	___perf_sw_event(event_id, nr, regs, addr);
 
 	perf_swevent_put_recursion_context(rctx);
+fail:
 	preempt_enable_notrace();
 }
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index f2c0bcc4ba6c..e5cd1088c1dc 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -1085,7 +1085,7 @@ void set_task_cpu(struct task_struct *p, unsigned int new_cpu)
 		if (p->sched_class->migrate_task_rq)
 			p->sched_class->migrate_task_rq(p, new_cpu);
 		p->se.nr_migrations++;
-		perf_sw_event(PERF_COUNT_SW_CPU_MIGRATIONS, 1, NULL, 0);
+		perf_sw_event_sched(PERF_COUNT_SW_CPU_MIGRATIONS, 1, 0);
 
 		tmn.task = p;
 		tmn.from_cpu = task_cpu(p);
diff --git a/kernel/trace/trace_event_perf.c b/kernel/trace/trace_event_perf.c
index 5d12bb407b44..a818e5601f89 100644
--- a/kernel/trace/trace_event_perf.c
+++ b/kernel/trace/trace_event_perf.c
@@ -249,7 +249,7 @@ void perf_trace_del(struct perf_event *p_event, int flags)
 }
 
 void *perf_trace_buf_prepare(int size, unsigned short type,
-			     struct pt_regs *regs, int *rctxp)
+			     struct pt_regs **regs, int *rctxp)
 {
 	struct trace_entry *entry;
 	unsigned long flags;
@@ -268,6 +268,8 @@ void *perf_trace_buf_prepare(int size, unsigned short type,
 	if (*rctxp < 0)
 		return NULL;
 
+	if (regs)
+		*regs = this_cpu_ptr(&__perf_regs[*rctxp]);
 	raw_data = this_cpu_ptr(perf_trace_buf[*rctxp]);
 
 	/* zero the dead bytes from align to not leak stack to user */
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 282f6e4e5539..22328503aa3e 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1158,7 +1158,7 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 	size = ALIGN(__size + sizeof(u32), sizeof(u64));
 	size -= sizeof(u32);
 
-	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+	entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
 	if (!entry)
 		return;
 
@@ -1189,7 +1189,7 @@ kretprobe_perf_func(struct trace_kprobe *tk, struct kretprobe_instance *ri,
 	size = ALIGN(__size + sizeof(u32), sizeof(u64));
 	size -= sizeof(u32);
 
-	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+	entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
 	if (!entry)
 		return;
 
diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
index 7e3cd7aaec83..ba78f0e3477d 100644
--- a/kernel/trace/trace_syscalls.c
+++ b/kernel/trace/trace_syscalls.c
@@ -586,7 +586,7 @@ static void perf_syscall_enter(void *ignore, struct pt_regs *regs, long id)
 	size -= sizeof(u32);
 
 	rec = (struct syscall_trace_enter *)perf_trace_buf_prepare(size,
-				sys_data->enter_event->event.type, regs, &rctx);
+				sys_data->enter_event->event.type, NULL, &rctx);
 	if (!rec)
 		return;
 
@@ -659,7 +659,7 @@ static void perf_syscall_exit(void *ignore, struct pt_regs *regs, long ret)
 	size -= sizeof(u32);
 
 	rec = (struct syscall_trace_exit *)perf_trace_buf_prepare(size,
-				sys_data->exit_event->event.type, regs, &rctx);
+				sys_data->exit_event->event.type, NULL, &rctx);
 	if (!rec)
 		return;
 
diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 3c9b97e6b1f4..5224e836acde 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1116,7 +1116,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 	if (hlist_empty(head))
 		goto out;
 
-	entry = perf_trace_buf_prepare(size, call->event.type, regs, &rctx);
+	entry = perf_trace_buf_prepare(size, call->event.type, NULL, &rctx);
 	if (!entry)
 		goto out;