From patchwork Tue Mar 30 18:19:06 2021
From: Frank van der Linden
Subject: [PATCH 4.14 1/5] mm: memcontrol: fix NR_WRITEBACK leak in memcg and system stats
Date: Tue, 30 Mar 2021 18:19:06 +0000
Message-ID: <20210330181910.15378-2-fllinden@amazon.com>
In-Reply-To: <20210330181910.15378-1-fllinden@amazon.com>
X-Mailing-List: stable@vger.kernel.org

From: Johannes Weiner

commit c3cc39118c3610eb6ab4711bc624af7fc48a35fe upstream.

After commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting"), we observed slowly upward creeping NR_WRITEBACK
counts over the course of several days, both in the per-memcg stats as
well as in the system counter in e.g. /proc/meminfo.

The conversion from full per-cpu stat counts to per-cpu cached atomic
stat counts introduced an irq-unsafe RMW operation into the updates.

Most stat updates come from process context, but one notable exception
is the NR_WRITEBACK counter. While writebacks are issued from process
context, they are retired from (soft)irq context.

When writeback completions interrupt the RMW counter updates of new
writebacks being issued, the decs from the completions are lost.

Since the global updates are routed through the joint lruvec API, both
the memcg counters as well as the system counters are affected.

This patch makes the joint stat and event API irq safe.

Link: http://lkml.kernel.org/r/20180203082353.17284-1-hannes@cmpxchg.org
Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner
Debugged-by: Tejun Heo
Reviewed-by: Rik van Riel
Reviewed-by: Andrew Morton
Cc: Vladimir Davydov
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/memcontrol.h | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 882046863581..c46016bb25eb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -523,9 +523,11 @@ static inline void __mod_memcg_state(struct mem_cgroup *memcg,
 static inline void mod_memcg_state(struct mem_cgroup *memcg,
                                    int idx, int val)
 {
-        preempt_disable();
+        unsigned long flags;
+
+        local_irq_save(flags);
         __mod_memcg_state(memcg, idx, val);
-        preempt_enable();
+        local_irq_restore(flags);
 }
 
 /**
@@ -606,9 +608,11 @@ static inline void __mod_lruvec_state(struct lruvec *lruvec,
 static inline void mod_lruvec_state(struct lruvec *lruvec,
                                     enum node_stat_item idx, int val)
 {
-        preempt_disable();
+        unsigned long flags;
+
+        local_irq_save(flags);
         __mod_lruvec_state(lruvec, idx, val);
-        preempt_enable();
+        local_irq_restore(flags);
 }
 
 static inline void __mod_lruvec_page_state(struct page *page,
@@ -630,9 +634,11 @@ static inline void __mod_lruvec_page_state(struct page *page,
 static inline void mod_lruvec_page_state(struct page *page,
                                          enum node_stat_item idx, int val)
 {
-        preempt_disable();
+        unsigned long flags;
+
+        local_irq_save(flags);
         __mod_lruvec_page_state(page, idx, val);
-        preempt_enable();
+        local_irq_restore(flags);
 }
 
 unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
@@ -659,9 +665,11 @@ static inline void __count_memcg_events(struct mem_cgroup *memcg,
 static inline void count_memcg_events(struct mem_cgroup *memcg,
                                       int idx, unsigned long count)
 {
-        preempt_disable();
+        unsigned long flags;
+
+        local_irq_save(flags);
         __count_memcg_events(memcg, idx, count);
-        preempt_enable();
+        local_irq_restore(flags);
 }
 
 /* idx can be of type enum memcg_event_item or vm_event_item */
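
A standalone userspace sketch of the race (hypothetical demo code, not
part of the patch): a SIGALRM handler stands in for the softirq that
retires writeback, and a separate load/store pair stands in for the
irq-unsafe per-cpu RMW. Note that preempt_disable() would not help here,
since it only prevents rescheduling and does not mask the "interrupt";
local_irq_save(), which the patch switches to, is what closes the window.

/*
 * Lost-update demo (illustrative only). Build: cc -O2 rmw_race.c
 */
#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile long counter;       /* stands in for the NR_WRITEBACK stat */
static volatile long handler_runs;  /* decrements issued from the "irq" side */

static void on_alarm(int sig)
{
        (void)sig;
        counter--;      /* a "writeback completion" retiring one page */
        handler_runs++;
}

int main(void)
{
        struct sigaction sa = { 0 };
        struct itimerval it = { 0 };
        const long issued = 20 * 1000 * 1000;
        long i;

        sa.sa_handler = on_alarm;
        sigaction(SIGALRM, &sa, NULL);
        it.it_interval.tv_usec = 100;
        it.it_value.tv_usec = 100;
        setitimer(ITIMER_REAL, &it, NULL);

        for (i = 0; i < issued; i++) {
                long tmp = counter;     /* load ... */
                counter = tmp + 1;      /* ... store: a signal landing in
                                         * between erases its decrement */
        }

        /* Race-free, counter would equal issued - handler_runs; any
         * surplus is the same slow upward creep seen in NR_WRITEBACK. */
        printf("lost decrements: %ld\n", counter - (issued - handler_runs));
        return 0;
}

On typical runs the printed count is nonzero and grows with runtime,
mirroring the counters creeping upward over several days.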

From patchwork Tue Mar 30 18:19:07 2021
From: Frank van der Linden
Subject: [PATCH 4.14 2/5] mm: memcg: make sure memory.events is uptodate when waking pollers
Date: Tue, 30 Mar 2021 18:19:07 +0000
Message-ID: <20210330181910.15378-3-fllinden@amazon.com>
In-Reply-To: <20210330181910.15378-1-fllinden@amazon.com>
X-Mailing-List: stable@vger.kernel.org

From: Johannes Weiner

commit 2d978806d863e926345185d084a90a4c35846e37 upstream.

Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") added per-cpu drift to all memory cgroup stats
and events shown in memory.stat and memory.events.

For memory.stat this is acceptable. But memory.events issues file
notifications, and somebody polling the file for changes will be
confused when the counters in it are unchanged after a wakeup.

Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
MEMCG_OOM - are sufficiently rare and high-level that we don't need
per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
frequent, but they're counting invocations of reclaim, which is a
complex operation that touches many shared cachelines.

This splits memory.events from the generic VM events and tracks them in
their own, unbuffered atomic counters. That's also cleaner, as it
eliminates the ugly enum nesting of VM and cgroup events.

[hannes@cmpxchg.org: "array subscript is above array bounds"]
Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
Signed-off-by: Johannes Weiner
Reported-by: Tejun Heo
Acked-by: Tejun Heo
Acked-by: Michal Hocko
Cc: Vladimir Davydov
Cc: Roman Gushchin
Cc: Rik van Riel
Cc: Stephen Rothwell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/memcontrol.h | 35 ++++++++++++++++++-----------------
 mm/memcontrol.c            | 28 ++++++++++++++++------------
 mm/vmscan.c                |  2 +-
 3 files changed, 35 insertions(+), 30 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c46016bb25eb..6503a9ca27c1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -48,13 +48,12 @@ enum memcg_stat_item {
         MEMCG_NR_STAT,
 };
 
-/* Cgroup-specific events, on top of universal VM events */
-enum memcg_event_item {
-        MEMCG_LOW = NR_VM_EVENT_ITEMS,
+enum memcg_memory_event {
+        MEMCG_LOW,
         MEMCG_HIGH,
         MEMCG_MAX,
         MEMCG_OOM,
-        MEMCG_NR_EVENTS,
+        MEMCG_NR_MEMORY_EVENTS,
 };
 
 struct mem_cgroup_reclaim_cookie {
@@ -88,7 +87,7 @@ enum mem_cgroup_events_target {
 
 struct mem_cgroup_stat_cpu {
         long count[MEMCG_NR_STAT];
-        unsigned long events[MEMCG_NR_EVENTS];
+        unsigned long events[NR_VM_EVENT_ITEMS];
         unsigned long nr_page_events;
         unsigned long targets[MEM_CGROUP_NTARGETS];
 };
@@ -202,7 +201,8 @@ struct mem_cgroup {
         /* OOM-Killer disable */
         int oom_kill_disable;
 
-        /* handle for "memory.events" */
+        /* memory.events */
+        atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
         struct cgroup_file events_file;
 
         /* protect arrays of thresholds */
@@ -231,9 +231,10 @@ struct mem_cgroup {
         struct task_struct *move_lock_task;
         unsigned long move_lock_flags;
 
+        /* memory.stat */
         struct mem_cgroup_stat_cpu __percpu *stat_cpu;
         atomic_long_t stat[MEMCG_NR_STAT];
-        atomic_long_t events[MEMCG_NR_EVENTS];
+        atomic_long_t events[NR_VM_EVENT_ITEMS];
 
         unsigned long socket_pressure;
 
@@ -645,9 +646,9 @@ unsigned long mem_cgroup_soft_limit_reclaim(pg_data_t *pgdat, int order,
                                             gfp_t gfp_mask,
                                             unsigned long *total_scanned);
 
-/* idx can be of type enum memcg_event_item or vm_event_item */
 static inline void __count_memcg_events(struct mem_cgroup *memcg,
-                                        int idx, unsigned long count)
+                                        enum vm_event_item idx,
+                                        unsigned long count)
 {
         unsigned long x;
 
@@ -663,7 +664,8 @@ static inline void __count_memcg_events(struct mem_cgroup *memcg,
 }
 
 static inline void count_memcg_events(struct mem_cgroup *memcg,
-                                      int idx, unsigned long count)
+                                      enum vm_event_item idx,
+                                      unsigned long count)
 {
         unsigned long flags;
 
@@ -672,9 +674,8 @@ static inline void count_memcg_events(struct mem_cgroup *memcg,
         local_irq_restore(flags);
 }
 
-/* idx can be of type enum memcg_event_item or vm_event_item */
 static inline void count_memcg_page_event(struct page *page,
-                                          int idx)
+                                          enum vm_event_item idx)
 {
         if (page->mem_cgroup)
                 count_memcg_events(page->mem_cgroup, idx, 1);
@@ -698,10 +699,10 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
         rcu_read_unlock();
 }
 
-static inline void mem_cgroup_event(struct mem_cgroup *memcg,
-                                    enum memcg_event_item event)
+static inline void memcg_memory_event(struct mem_cgroup *memcg,
+                                      enum memcg_memory_event event)
 {
-        count_memcg_events(memcg, event, 1);
+        atomic_long_inc(&memcg->memory_events[event]);
         cgroup_file_notify(&memcg->events_file);
 }
 
@@ -721,8 +722,8 @@ static inline bool mem_cgroup_disabled(void)
         return true;
 }
 
-static inline void mem_cgroup_event(struct mem_cgroup *memcg,
-                                    enum memcg_event_item event)
+static inline void memcg_memory_event(struct mem_cgroup *memcg,
+                                      enum memcg_memory_event event)
 {
 }
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e763cdccb33..31972189a827 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1872,7 +1872,7 @@ static int memcg_hotplug_cpu_dead(unsigned int cpu)
                 }
         }
 
-        for (i = 0; i < MEMCG_NR_EVENTS; i++) {
+        for (i = 0; i < NR_VM_EVENT_ITEMS; i++) {
                 long x;
 
                 x = this_cpu_xchg(memcg->stat_cpu->events[i], 0);
@@ -1891,7 +1891,7 @@ static void reclaim_high(struct mem_cgroup *memcg,
         do {
                 if (page_counter_read(&memcg->memory) <= memcg->high)
                         continue;
-                mem_cgroup_event(memcg, MEMCG_HIGH);
+                memcg_memory_event(memcg, MEMCG_HIGH);
                 try_to_free_mem_cgroup_pages(memcg, nr_pages, gfp_mask, true);
         } while ((memcg = parent_mem_cgroup(memcg)));
 }
@@ -1982,7 +1982,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
         if (!gfpflags_allow_blocking(gfp_mask))
                 goto nomem;
 
-        mem_cgroup_event(mem_over_limit, MEMCG_MAX);
+        memcg_memory_event(mem_over_limit, MEMCG_MAX);
 
         nr_reclaimed = try_to_free_mem_cgroup_pages(mem_over_limit, nr_pages,
                                                     gfp_mask, may_swap);
@@ -2025,7 +2025,7 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
         if (fatal_signal_pending(current))
                 goto force;
 
-        mem_cgroup_event(mem_over_limit, MEMCG_OOM);
+        memcg_memory_event(mem_over_limit, MEMCG_OOM);
 
         mem_cgroup_oom(mem_over_limit, gfp_mask,
                        get_order(nr_pages * PAGE_SIZE));
@@ -2790,10 +2790,10 @@ static void tree_events(struct mem_cgroup *memcg, unsigned long *events)
         struct mem_cgroup *iter;
         int i;
 
-        memset(events, 0, sizeof(*events) * MEMCG_NR_EVENTS);
+        memset(events, 0, sizeof(*events) * NR_VM_EVENT_ITEMS);
 
         for_each_mem_cgroup_tree(iter, memcg) {
-                for (i = 0; i < MEMCG_NR_EVENTS; i++)
+                for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
                         events[i] += memcg_sum_events(iter, i);
         }
 }
@@ -5299,7 +5299,7 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
                         continue;
                 }
 
-                mem_cgroup_event(memcg, MEMCG_OOM);
+                memcg_memory_event(memcg, MEMCG_OOM);
                 if (!mem_cgroup_out_of_memory(memcg, GFP_KERNEL, 0))
                         break;
         }
@@ -5312,10 +5312,14 @@ static int memory_events_show(struct seq_file *m, void *v)
 {
         struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
 
-        seq_printf(m, "low %lu\n", memcg_sum_events(memcg, MEMCG_LOW));
-        seq_printf(m, "high %lu\n", memcg_sum_events(memcg, MEMCG_HIGH));
-        seq_printf(m, "max %lu\n", memcg_sum_events(memcg, MEMCG_MAX));
-        seq_printf(m, "oom %lu\n", memcg_sum_events(memcg, MEMCG_OOM));
+        seq_printf(m, "low %lu\n",
+                   atomic_long_read(&memcg->memory_events[MEMCG_LOW]));
+        seq_printf(m, "high %lu\n",
+                   atomic_long_read(&memcg->memory_events[MEMCG_HIGH]));
+        seq_printf(m, "max %lu\n",
+                   atomic_long_read(&memcg->memory_events[MEMCG_MAX]));
+        seq_printf(m, "oom %lu\n",
+                   atomic_long_read(&memcg->memory_events[MEMCG_OOM]));
         seq_printf(m, "oom_kill %lu\n", memcg_sum_events(memcg, OOM_KILL));
 
         return 0;
@@ -5325,7 +5329,7 @@ static int memory_stat_show(struct seq_file *m, void *v)
 {
         struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(m));
         unsigned long stat[MEMCG_NR_STAT];
-        unsigned long events[MEMCG_NR_EVENTS];
+        unsigned long events[NR_VM_EVENT_ITEMS];
         int i;
 
         /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5ee6fbdec8a8..b37e6dd50925 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2628,7 +2628,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
                                         sc->memcg_low_skipped = 1;
                                         continue;
                                 }
-                                mem_cgroup_event(memcg, MEMCG_LOW);
+                                memcg_memory_event(memcg, MEMCG_LOW);
                         }
 
                         reclaimed = sc->nr_reclaimed;
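
A sketch of the visibility difference (hypothetical demo code, not part
of the patch; the batch size mimics the per-cpu buffering and the
function names are invented — in the kernel the buffered path is
__count_memcg_events() and the unbuffered one is the new
memcg_memory_event()):

#include <stdatomic.h>
#include <stdio.h>

#define BATCH 32
#define NCPUS 4

static long percpu_buf[NCPUS];       /* buffered: unflushed per-cpu deltas */
static _Atomic long buffered_total;  /* buffered: only sees flushed batches */
static _Atomic long exact_total;     /* unbuffered: what the patch uses */

/* Batched update, roughly how the shared VM event path buffers. */
static void count_buffered(int cpu)
{
        long x = percpu_buf[cpu] + 1;

        if (x > BATCH) {             /* flushed only every BATCH events */
                atomic_fetch_add(&buffered_total, x);
                x = 0;
        }
        percpu_buf[cpu] = x;
}

/* Unbuffered update: count, then the caller notifies the poller. */
static void count_exact(void)
{
        atomic_fetch_add(&exact_total, 1);
}

int main(void)
{
        /* One MEMCG_OOM-style event on each of two "cpus". */
        count_buffered(0);
        count_buffered(1);
        count_exact();
        count_exact();

        /* A poller woken right after these events reads 0 from the
         * buffered counter - both events still sit in percpu_buf[] -
         * while the unbuffered counter matches the notification. */
        printf("buffered: %ld, exact: %ld\n",
               atomic_load(&buffered_total), atomic_load(&exact_total));
        return 0;
}

This prints "buffered: 0, exact: 2" — the first reading is exactly the
unchanged-counter-after-wakeup confusion the patch removes.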

From patchwork Tue Mar 30 18:19:08 2021
From: Frank van der Linden
Subject: [PATCH 4.14 3/5] mem_cgroup: make sure moving_account, move_lock_task and stat_cpu in the same cacheline
Date: Tue, 30 Mar 2021 18:19:08 +0000
Message-ID: <20210330181910.15378-4-fllinden@amazon.com>
In-Reply-To: <20210330181910.15378-1-fllinden@amazon.com>
X-Mailing-List: stable@vger.kernel.org

From: Aaron Lu

commit e81bf9793b1861d74953ef041b4f6c7faecc2dbd upstream.

The LKP robot found a 27% will-it-scale/page_fault3 performance
regression regarding commit e27be240df53 ("mm: memcg: make sure
memory.events is uptodate when waking pollers").

What the test does is:
 1 mkstemp() a 128M file on a tmpfs;
 2 start $nr_cpu processes, each to loop the following:
   2.1 mmap() this file in shared write mode;
   2.2 write 0 to this file in a PAGE_SIZE step till the end of the file;
   2.3 unmap() this file and repeat this process.
 3 After 5 minutes, check how many loops they managed to complete, the
   higher the better.

The commit itself looks innocent enough, as it merely changed some
event counting mechanism and this test didn't trigger those events at
all. Perf shows increased cycles spent on accessing
root_mem_cgroup->stat_cpu in count_memcg_event_mm() (called by
handle_mm_fault()) and in __mod_memcg_state() called by
page_add_file_rmap(). So it's likely due to the changed layout of
'struct mem_cgroup' that either makes stat_cpu fall into a constantly
modified cacheline or makes some hot fields stop being in the same
cacheline. I verified this by moving memory_events[] back to where it
was:

: --- a/include/linux/memcontrol.h
: +++ b/include/linux/memcontrol.h
: @@ -205,7 +205,6 @@ struct mem_cgroup {
:         int oom_kill_disable;
: 
:         /* memory.events */
: -       atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
:         struct cgroup_file events_file;
: 
:         /* protect arrays of thresholds */
: @@ -238,6 +237,7 @@ struct mem_cgroup {
:         struct mem_cgroup_stat_cpu __percpu *stat_cpu;
:         atomic_long_t stat[MEMCG_NR_STAT];
:         atomic_long_t events[NR_VM_EVENT_ITEMS];
: +       atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
: 
:         unsigned long socket_pressure;

and performance was restored.

Later investigation found that as long as the three fields
moving_account, move_lock_task and stat_cpu are in the same cacheline,
performance will be good. To avoid future performance surprises from
other commits changing the layout of 'struct mem_cgroup', this patch
makes sure the three fields stay in the same cacheline.

One concern with this approach: moving_account and move_lock_task can
be modified when a process changes memory cgroup, while stat_cpu is an
always-read field, so it might hurt to place them in the same
cacheline. I assume it is rare for a process to change memory cgroup,
so this should be OK.

Link: https://lkml.kernel.org/r/20180528114019.GF9904@yexl-desktop
Link: http://lkml.kernel.org/r/20180601071115.GA27302@intel.com
Signed-off-by: Aaron Lu
Reported-by: kernel test robot
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/memcontrol.h | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6503a9ca27c1..c7876eadd206 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -155,6 +155,15 @@ enum memcg_kmem_state {
         KMEM_ONLINE,
 };
 
+#if defined(CONFIG_SMP)
+struct memcg_padding {
+        char x[0];
+} ____cacheline_internodealigned_in_smp;
+#define MEMCG_PADDING(name)      struct memcg_padding name;
+#else
+#define MEMCG_PADDING(name)
+#endif
+
 /*
  * The memory controller data structure. The memory controller controls both
  * page cache and RSS per cgroup. We would eventually like to provide
@@ -202,7 +211,6 @@ struct mem_cgroup {
         int oom_kill_disable;
 
         /* memory.events */
-        atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
         struct cgroup_file events_file;
 
         /* protect arrays of thresholds */
@@ -222,19 +230,26 @@ struct mem_cgroup {
          * mem_cgroup ? And what type of charges should we move ?
          */
         unsigned long move_charge_at_immigrate;
+        /* taken only while moving_account > 0 */
+        spinlock_t move_lock;
+        unsigned long move_lock_flags;
+
+        MEMCG_PADDING(_pad1_);
+
         /*
          * set > 0 if pages under this cgroup are moving to other cgroup.
          */
         atomic_t moving_account;
-        /* taken only while moving_account > 0 */
-        spinlock_t move_lock;
         struct task_struct *move_lock_task;
-        unsigned long move_lock_flags;
 
         /* memory.stat */
         struct mem_cgroup_stat_cpu __percpu *stat_cpu;
+
+        MEMCG_PADDING(_pad2_);
+
         atomic_long_t stat[MEMCG_NR_STAT];
         atomic_long_t events[NR_VM_EVENT_ITEMS];
+        atomic_long_t memory_events[MEMCG_NR_MEMORY_EVENTS];
 
         unsigned long socket_pressure;
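
The same grouping idea as a standalone sketch (hypothetical demo code,
not part of the patch; alignas(64) stands in for
MEMCG_PADDING()/____cacheline_internodealigned_in_smp, and the field
set and array sizes are made up). Fields that every stat update only
reads are kept together, and padding pushes the constantly-written
atomics onto their own cachelines so the writers never invalidate the
readers' line:

#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct fake_memcg {
        /* read on every stat/event update; keep on one cacheline */
        _Atomic int moving_account;
        void *move_lock_task;
        long *stat_cpu;                 /* models the __percpu pointer */

        /* fresh cacheline: write-heavy atomics live here, so bumping
         * them cannot false-share with the read-mostly fields above */
        alignas(64) _Atomic long stat[8];
        _Atomic long events[8];
        _Atomic long memory_events[4];
};

int main(void)
{
        /* Check that the two groups land on different 64-byte lines. */
        printf("read-mostly group at offset %zu, written group at %zu\n",
               offsetof(struct fake_memcg, moving_account),
               offsetof(struct fake_memcg, stat));
        return 0;
}

With 64-byte cachelines this prints offsets 0 and 64: the read-mostly
trio shares line 0 and the written atomics start on a new line, which
is the layout the MEMCG_PADDING() markers enforce in struct mem_cgroup.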

From patchwork Tue Mar 30 18:19:09 2021
From: Frank van der Linden
Subject: [PATCH 4.14 4/5] mm: fix oom_kill event handling
Date: Tue, 30 Mar 2021 18:19:09 +0000
Message-ID: <20210330181910.15378-5-fllinden@amazon.com>
In-Reply-To: <20210330181910.15378-1-fllinden@amazon.com>
X-Mailing-List: stable@vger.kernel.org

From: Roman Gushchin

commit fe6bdfc8e1e131720abbe77a2eb990c94c9024cb upstream.

Commit e27be240df53 ("mm: memcg: make sure memory.events is uptodate
when waking pollers") converted most of the memcg event counters to
per-memcg atomics, which made them less confusing for a user. The
"oom_kill" counter remained untouched, so now it behaves differently
than other counters (including "oom"). This adds nothing but confusion.

Let's fix this by adding the MEMCG_OOM_KILL event, and follow the
MEMCG_OOM approach. This also removes a hack from
count_memcg_event_mm(), introduced earlier specially for the OOM_KILL
counter.

[akpm@linux-foundation.org: fix for droppage of memcg-replace-mm-owner-with-mm-memcg.patch]
Link: http://lkml.kernel.org/r/20180508124637.29984-1-guro@fb.com
Signed-off-by: Roman Gushchin
Acked-by: Konstantin Khlebnikov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Vladimir Davydov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
[fllinden@amazon.com: backport to 4.14, minor contextual changes]
Signed-off-by: Frank van der Linden
---
 include/linux/memcontrol.h | 26 ++++++++++++++++++++++----
 mm/memcontrol.c            |  6 ++++--
 mm/oom_kill.c              |  2 +-
 3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c7876eadd206..b5cd86e320ff 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,7 @@ enum memcg_memory_event {
         MEMCG_HIGH,
         MEMCG_MAX,
         MEMCG_OOM,
+        MEMCG_OOM_KILL,
         MEMCG_NR_MEMORY_EVENTS,
 };
 
@@ -706,11 +707,8 @@ static inline void count_memcg_event_mm(struct mm_struct *mm,
 
         rcu_read_lock();
         memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
-        if (likely(memcg)) {
+        if (likely(memcg))
                 count_memcg_events(memcg, idx, 1);
-                if (idx == OOM_KILL)
-                        cgroup_file_notify(&memcg->events_file);
-        }
         rcu_read_unlock();
 }
 
@@ -721,6 +719,21 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg,
         cgroup_file_notify(&memcg->events_file);
 }
 
+static inline void memcg_memory_event_mm(struct mm_struct *mm,
+                                         enum memcg_memory_event event)
+{
+        struct mem_cgroup *memcg;
+
+        if (mem_cgroup_disabled())
+                return;
+
+        rcu_read_lock();
+        memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+        if (likely(memcg))
+                memcg_memory_event(memcg, event);
+        rcu_read_unlock();
+}
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 void mem_cgroup_split_huge_fixup(struct page *head);
 #endif
@@ -742,6 +755,11 @@ static inline void memcg_memory_event(struct mem_cgroup *memcg,
 {
 }
 
+static inline void memcg_memory_event_mm(struct mm_struct *mm,
+                                         enum memcg_memory_event event)
+{
+}
+
 static inline bool mem_cgroup_low(struct mem_cgroup *root,
                                   struct mem_cgroup *memcg)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 31972189a827..ef6d996a920a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3648,7 +3648,8 @@ static int mem_cgroup_oom_control_read(struct seq_file *sf, void *v)
 
         seq_printf(sf, "oom_kill_disable %d\n", memcg->oom_kill_disable);
         seq_printf(sf, "under_oom %d\n", (bool)memcg->under_oom);
-        seq_printf(sf, "oom_kill %lu\n", memcg_sum_events(memcg, OOM_KILL));
+        seq_printf(sf, "oom_kill %lu\n",
+                   atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL]));
         return 0;
 }
 
@@ -5320,7 +5321,8 @@ static int memory_events_show(struct seq_file *m, void *v)
                    atomic_long_read(&memcg->memory_events[MEMCG_MAX]));
         seq_printf(m, "oom %lu\n",
                    atomic_long_read(&memcg->memory_events[MEMCG_OOM]));
-        seq_printf(m, "oom_kill %lu\n", memcg_sum_events(memcg, OOM_KILL));
+        seq_printf(m, "oom_kill %lu\n",
+                   atomic_long_read(&memcg->memory_events[MEMCG_OOM_KILL]));
 
         return 0;
 }
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 6482d743c5c8..6f1bed211122 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -917,7 +917,7 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
 
         /* Raise event before sending signal: task reaper must see this */
         count_vm_event(OOM_KILL);
-        count_memcg_event_mm(mm, OOM_KILL);
+        memcg_memory_event_mm(mm, MEMCG_OOM_KILL);
 
         /*
          * We should send SIGKILL before granting access to memory reserves
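
The pattern the series converges on for every memory.events entry,
oom_kill now included, as a standalone sketch (hypothetical demo code,
not part of the patch; an eventfd stands in for cgroup_file_notify()
and the function name is invented): bump the exact atomic counter
first, then kick the notification, so a reader woken by poll() is
guaranteed to observe a changed value:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/eventfd.h>
#include <unistd.h>

static _Atomic long oom_kill_count;
static int notify_fd;

/* Mirrors memcg_memory_event(): count first, then notify. */
static void memory_event_oom_kill(void)
{
        uint64_t one = 1;

        atomic_fetch_add(&oom_kill_count, 1);
        if (write(notify_fd, &one, sizeof(one)) != sizeof(one))
                perror("write");
}

int main(void)
{
        uint64_t n;

        notify_fd = eventfd(0, 0);
        if (notify_fd == -1) {
                perror("eventfd");
                return 1;
        }

        memory_event_oom_kill();

        /* The "poller": by the time the read (wakeup) completes, the
         * counter is already incremented - no wakeup with a stale value. */
        if (read(notify_fd, &n, sizeof(n)) == sizeof(n))
                printf("oom_kill %ld\n", atomic_load(&oom_kill_count));
        return 0;
}

Before this patch, oom_kill was the one entry still funneled through
the buffered VM-event path, so it was the one entry for which this
wakeup/value agreement did not hold.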

From patchwork Tue Mar 30 18:19:10 2021
From: Frank van der Linden
Subject: [PATCH 4.14 5/5] mm: writeback: use exact memcg dirty counts
Date: Tue, 30 Mar 2021 18:19:10 +0000
Message-ID: <20210330181910.15378-6-fllinden@amazon.com>
In-Reply-To: <20210330181910.15378-1-fllinden@amazon.com>
X-Mailing-List: stable@vger.kernel.org

From: Greg Thelen

commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.

Since commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
memory.stat reporting") memcg dirty and writeback counters are managed
as:

 1) per-memcg per-cpu values in range of [-32..32]
 2) per-memcg atomic counter

When a per-cpu counter cannot fit in [-32..32] it's flushed to the
atomic. Stat readers only check the atomic. Thus readers such as
balance_dirty_pages() may see a nontrivial error margin: 32 pages per
cpu. Assuming 100 cpus:

  4k x86 page_size:  13 MiB error per memcg
 64k ppc page_size: 200 MiB error per memcg

Considering that dirty+writeback are used together for some decisions,
the errors double.

This inaccuracy can lead to undeserved oom kills. One nasty case is
when all per-cpu counters hold positive values offsetting an atomic
negative value (i.e. per_cpu[*]=32, atomic=n_cpu*-32).
balance_dirty_pages() only consults the atomic and does not consider
throttling the next n_cpu*32 dirty pages. If the file_lru is in the
13..200 MiB range then there's absolutely no dirty throttling, which
burdens vmscan with only dirty+writeback pages, thus resorting to oom
kill.

It could be argued that tiny containers are not supported, but it's
more subtle than that. It's the amount of space available for the file
lru that matters. If a container has memory.max-200MiB of
non-reclaimable memory, then it will also suffer such oom kills on a
100 cpu machine.

The following test reliably ooms without this patch. This patch avoids
oom kills.

  $ cat test
  mount -t cgroup2 none /dev/cgroup
  cd /dev/cgroup
  echo +io +memory > cgroup.subtree_control
  mkdir test
  cd test
  echo 10M > memory.max
  (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
  (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)

  $ cat memcg-writeback-stress.c
  /*
   * Dirty pages from all but one cpu.
   * Clean pages from the non dirtying cpu.
   * This is to stress per cpu counter imbalance.
   * On a 100 cpu machine:
   * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
   * - per memcg atomic is -99*32 pages
   * - thus the complete dirty limit: sum of all counters 0
   * - balance_dirty_pages() only sees atomic count -99*32 pages, which
   *   it max()s to 0.
   * - So a workload can dirty -99*32 pages before balance_dirty_pages()
   *   cares.
   */
  #define _GNU_SOURCE
  #include <err.h>
  #include <fcntl.h>
  #include <sched.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>
  #include <sys/sysinfo.h>
  #include <sys/types.h>
  #include <unistd.h>

  static char *buf;
  static int bufSize;

  static void set_affinity(int cpu)
  {
          cpu_set_t affinity;

          CPU_ZERO(&affinity);
          CPU_SET(cpu, &affinity);
          if (sched_setaffinity(0, sizeof(affinity), &affinity))
                  err(1, "sched_setaffinity");
  }

  static void dirty_on(int output_fd, int cpu)
  {
          int i, wrote;

          set_affinity(cpu);
          for (i = 0; i < 32; i++) {
                  for (wrote = 0; wrote < bufSize; ) {
                          int ret = write(output_fd, buf+wrote,
                                          bufSize-wrote);
                          if (ret == -1)
                                  err(1, "write");
                          wrote += ret;
                  }
          }
  }

  int main(int argc, char **argv)
  {
          int cpu, flush_cpu = 1, output_fd;
          const char *output;

          if (argc != 2)
                  errx(1, "usage: output_file");
          output = argv[1];
          bufSize = getpagesize();
          buf = malloc(getpagesize());
          if (buf == NULL)
                  errx(1, "malloc failed");
          output_fd = open(output, O_CREAT|O_RDWR);
          if (output_fd == -1)
                  err(1, "open(%s)", output);

          for (cpu = 0; cpu < get_nprocs(); cpu++) {
                  if (cpu != flush_cpu)
                          dirty_on(output_fd, cpu);
          }

          set_affinity(flush_cpu);
          if (fsync(output_fd))
                  err(1, "fsync(%s)", output);
          if (close(output_fd))
                  err(1, "close(%s)", output);
          free(buf);
  }

Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
collect exact per memcg counters. This avoids the aforementioned oom
kills.

This does not affect the overhead of memory.stat, which still reads the
single atomic counter.

Why not use percpu_counter? memcg already handles cpus going offline,
so no need for that overhead from percpu_counter. And the
percpu_counter spinlocks are more heavyweight than is required.

It probably also makes sense to use exact dirty and writeback counters
in memcg oom reports. But that is saved for later.

Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
Signed-off-by: Greg Thelen
Reviewed-by: Roman Gushchin
Acked-by: Johannes Weiner
Cc: Michal Hocko
Cc: Vladimir Davydov
Cc: Tejun Heo
Cc: [4.16+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 include/linux/memcontrol.h |  5 ++++-
 mm/memcontrol.c            | 20 ++++++++++++++++++--
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b5cd86e320ff..365d5079d1cb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -507,7 +507,10 @@ struct mem_cgroup *lock_page_memcg(struct page *page);
 void __unlock_page_memcg(struct mem_cgroup *memcg);
 void unlock_page_memcg(struct page *page);
 
-/* idx can be of type enum memcg_stat_item or node_stat_item */
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page_state().
+ */
 static inline unsigned long memcg_page_state(struct mem_cgroup *memcg,
                                              int idx)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ef6d996a920a..5e8b8e1b7d90 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3701,6 +3701,22 @@ struct wb_domain *mem_cgroup_wb_domain(struct bdi_writeback *wb)
         return &memcg->cgwb_domain;
 }
 
+/*
+ * idx can be of type enum memcg_stat_item or node_stat_item.
+ * Keep in sync with memcg_exact_page().
+ */
+static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg, int idx)
+{
+        long x = atomic_long_read(&memcg->stat[idx]);
+        int cpu;
+
+        for_each_online_cpu(cpu)
+                x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
+        if (x < 0)
+                x = 0;
+        return x;
+}
+
 /**
  * mem_cgroup_wb_stats - retrieve writeback related stats from its memcg
  * @wb: bdi_writeback in question
@@ -3726,10 +3742,10 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages,
         struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css);
         struct mem_cgroup *parent;
 
-        *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY);
+        *pdirty = memcg_exact_page_state(memcg, NR_FILE_DIRTY);
 
         /* this should eventually include NR_UNSTABLE_NFS */
-        *pwriteback = memcg_page_state(memcg, NR_WRITEBACK);
+        *pwriteback = memcg_exact_page_state(memcg, NR_WRITEBACK);
         *pfilepages = mem_cgroup_nr_lru_pages(memcg, (1 << LRU_INACTIVE_FILE) |
                                                      (1 << LRU_ACTIVE_FILE));
         *pheadroom = PAGE_COUNTER_MAX;
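
The exact-read arithmetic from the commit message in runnable form
(hypothetical demo code, not part of the patch; a plain array models
the per-cpu slots and the sizes are made up — lazy_read() plays the
old memcg_page_state() consumer, exact_read() plays
memcg_exact_page_state()):

#include <stdatomic.h>
#include <stdio.h>

#define NCPUS 4
#define BATCH 32

static _Atomic long stat_atomic;   /* flushed totals, may go negative */
static long stat_cpu[NCPUS];       /* unflushed per-cpu deltas */

static long lazy_read(void)        /* pre-patch: atomic only */
{
        long x = atomic_load(&stat_atomic);

        return x < 0 ? 0 : x;
}

static long exact_read(void)       /* post-patch: fold in per-cpu deltas */
{
        long x = atomic_load(&stat_atomic);
        int cpu;

        for (cpu = 0; cpu < NCPUS; cpu++)
                x += stat_cpu[cpu];     /* deltas may be negative */
        return x < 0 ? 0 : x;
}

int main(void)
{
        int cpu;

        /* The pathological state from the commit message: each per-cpu
         * slot holds +BATCH, offset by a negative flushed total. */
        atomic_store(&stat_atomic, -(long)NCPUS * BATCH);
        for (cpu = 0; cpu < NCPUS; cpu++)
                stat_cpu[cpu] = BATCH;

        /* The workload then dirties 100 more pages, which happen to be
         * flushed into the atomic. True dirty count: 100 pages. */
        atomic_fetch_add(&stat_atomic, 100);

        printf("lazy: %ld, exact: %ld\n", lazy_read(), exact_read());
        return 0;
}

This prints "lazy: 0, exact: 100": the lazy reader believes nothing is
dirty and throttles nothing, while the exact reader sees the 100 truly
dirty pages. Scaled to 100 cpus and 64k pages, that blind spot is the
200 MiB that produced the undeserved oom kills.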