diff mbox series

[v2,8/8] mm: Centralize & improve oom reporting in show_mem.c

Message ID 20220421234837.3629927-14-kent.overstreet@gmail.com
State New
Headers show
Series [v2,1/8] lib/printbuf: New data structure for heap-allocated strings | expand

Commit Message

Kent Overstreet April 21, 2022, 11:48 p.m. UTC
This patch:
 - Changes show_mem() to always report on slab usage
 - Instead of reporting on all slabs, we only report on top 10 slabs,
   and in sorted order
 - Also reports on shrinkers, with the new shrinkers_to_text().
   Shrinkers need to be included in OOM/allocation failure reporting
   because they're responsible for memory reclaim - if a shrinker isn't
   giving up its memory, we need to know which one and why.

More OOM reporting can be moved to show_mem.c and improved, this patch
is only a start.

New example output on OOM/memory allocation failure:

00177 Mem-Info:
00177 active_anon:13706 inactive_anon:32266 isolated_anon:16
00177  active_file:1653 inactive_file:1822 isolated_file:0
00177  unevictable:0 dirty:0 writeback:0
00177  slab_reclaimable:6242 slab_unreclaimable:11168
00177  mapped:3824 shmem:3 pagetables:1266 bounce:0
00177  kernel_misc_reclaimable:0
00177  free:4362 free_pcp:35 free_cma:0
00177 Node 0 active_anon:54824kB inactive_anon:129064kB active_file:6612kB inactive_file:7288kB unevictable:0kB isolated(anon):64kB isolated(file):0kB mapped:15296kB dirty:0kB writeback:0kB shmem:12kB writeback_tmp:0kB kernel_stack:3392kB pagetables:5064kB all_unreclaimable? no
00177 DMA free:2232kB boost:0kB min:88kB low:108kB high:128kB reserved_highatomic:0KB active_anon:2924kB inactive_anon:6596kB active_file:428kB inactive_file:384kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 426 426 426
00177 DMA32 free:15092kB boost:5836kB min:8432kB low:9080kB high:9728kB reserved_highatomic:0KB active_anon:52196kB inactive_anon:122392kB active_file:6176kB inactive_file:7068kB unevictable:0kB writepending:0kB present:507760kB managed:441816kB mlocked:0kB bounce:0kB free_pcp:72kB local_pcp:0kB free_cma:0kB
00177 lowmem_reserve[]: 0 0 0 0
00177 DMA: 284*4kB (UM) 53*8kB (UM) 21*16kB (U) 11*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2248kB
00177 DMA32: 2765*4kB (UME) 375*8kB (UME) 57*16kB (UM) 5*32kB (U) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 15132kB
00177 4656 total pagecache pages
00177 1031 pages in swap cache
00177 Swap cache stats: add 6572399, delete 6572173, find 488603/3286476
00177 Free swap  = 509112kB
00177 Total swap = 2097148kB
00177 130938 pages RAM
00177 0 pages HighMem/MovableOnly
00177 16644 pages reserved
00177 Unreclaimable slab info:
00177 9p-fcall-cache    total: 8.25 MiB active: 8.25 MiB
00177 kernfs_node_cache total: 2.15 MiB active: 2.15 MiB
00177 kmalloc-64        total: 2.08 MiB active: 2.07 MiB
00177 task_struct       total: 1.95 MiB active: 1.95 MiB
00177 kmalloc-4k        total: 1.50 MiB active: 1.50 MiB
00177 signal_cache      total: 1.34 MiB active: 1.34 MiB
00177 kmalloc-2k        total: 1.16 MiB active: 1.16 MiB
00177 bch_inode_info    total: 1.02 MiB active: 922 KiB
00177 perf_event        total: 1.02 MiB active: 1.02 MiB
00177 biovec-max        total: 992 KiB active: 960 KiB
00177 Shrinkers:
00177 super_cache_scan: objects: 127
00177 super_cache_scan: objects: 106
00177 jbd2_journal_shrink_scan: objects: 32
00177 ext4_es_scan: objects: 32
00177 bch2_btree_cache_scan: objects: 8
00177   nr nodes:          24
00177   nr dirty:          0
00177   cannibalize lock:  0000000000000000
00177
00177 super_cache_scan: objects: 8
00177 super_cache_scan: objects: 1

Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
---
 mm/oom_kill.c    | 23 ---------------------
 mm/show_mem.c    | 14 +++++++++++++
 mm/slab.h        |  6 ++++--
 mm/slab_common.c | 53 ++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 63 insertions(+), 33 deletions(-)

Comments

Michal Hocko April 22, 2022, 12:58 p.m. UTC | #1
On Thu 21-04-22 19:48:37, Kent Overstreet wrote:
> This patch:
>  - Changes show_mem() to always report on slab usage
>  - Instead of reporting on all slabs, we only report on top 10 slabs,
>    and in sorted order

As I've already pointed out in the email thread for the previous
version, this would be better in its own patch explaining why we want to
make this unconditional and why to limit the number caches to print.
Why the trashold shouldn't be absolute size based?

>  - Also reports on shrinkers, with the new shrinkers_to_text().
>    Shrinkers need to be included in OOM/allocation failure reporting
>    because they're responsible for memory reclaim - if a shrinker isn't
>    giving up its memory, we need to know which one and why.

Again, I do agree that information about shrinkers can be useful but
there are two main things to consider. Do we want to dump that
information unconditionaly? E.g. does it make sense to print for all
allocation requests (even high order, GFP_NOWAIT...)? Should there be
any explicit trigger when to dump this data (like too many shrinkers
failing etc)?

Last but not least let me echo the concern from the other reply. Memory
allocations are not really reasonable to be done from the oom context so
the pr_buf doesn't sound like a good tool here.

Also please make sure to provide an example of the output and _explain_
how the new output is better than the existing one.
Kent Overstreet April 22, 2022, 11:48 p.m. UTC | #2
On Fri, Apr 22, 2022 at 08:09:48AM -0700, Roman Gushchin wrote:
> To add a concern: largest shrinkers are usually memcg-aware. Scanning
> over the whole cgroup tree (with potentially hundreds or thousands of cgroups)
> and over all shrinkers from the oom context sounds like a bad idea to me.

Why would we be scanning over the whole cgroup tree? We don't do that in the
vmscan code, nor the new report. If the OOM is for a specific cgroup, we should
probably only be reporting on memory usage for that cgroup (show_mem() is not
currently cgroup aware, but perhaps it should be).

> IMO it's more appropriate to do from userspace by oomd or a similar daemon,
> well before the in-kernel OOM kicks in.

The reason I've been introducing printbufs and the .to_text() method was
specifically to make this code general enough to be available from
sysfs/debugfs - so I see no reasons why a userspace oomd couldn't make use of it
as well.

> > Last but not least let me echo the concern from the other reply. Memory
> > allocations are not really reasonable to be done from the oom context so
> > the pr_buf doesn't sound like a good tool here.
> 
> +1

In my experience, it's rare to be _so_ out of memory that small kmalloc
allocations are failing - we'll be triggering the show_mem() report before that
happens.

However, if this turns out not to be the case in practice, or if there's a
consensus now that we really want to guard against this, I have some thoughts.
We could either:

 - mempool-ify printbufs as a whole

 - reserve some memory just for the show_mem() report, which would mean either
   adding support to printbuf for external buffers (subsuming what seq_buf
   does), or shrinker .to_text() methods would have to output to seq_buf instead
   of printbuf (ew, API fragmentation).
Kent Overstreet April 23, 2022, 12:46 a.m. UTC | #3
On Fri, Apr 22, 2022 at 05:27:41PM -0700, Roman Gushchin wrote:
> You're scanning over a small portion of all shrinker lists (on a machine with
> cgroups), so the top-10 list has little value.
> Global ->count_objects() return the number of objects on the system/root_mem_cgroup
> level, not the shrinker's total.

Not quite following what you're saying here...?

If you're complaining that my current top-10-shrinker report isn't memcg aware,
that's valid - I can fix that.

> > In my experience, it's rare to be _so_ out of memory that small kmalloc
> > allocations are failing - we'll be triggering the show_mem() report before that
> > happens.
> 
> I agree. However the OOM killer _has_ to make the progress even in such rare
> circumstances.

Oh, and the concern is allocator recursion? Yeah, that's a good point.

Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
generate the report on slabs, since we're taking the slab mutex there.
Michal Hocko April 25, 2022, 9:28 a.m. UTC | #4
On Fri 22-04-22 20:46:07, Kent Overstreet wrote:
> On Fri, Apr 22, 2022 at 05:27:41PM -0700, Roman Gushchin wrote:
[...]
> > > In my experience, it's rare to be _so_ out of memory that small kmalloc
> > > allocations are failing - we'll be triggering the show_mem() report before that
> > > happens.
> > 
> > I agree. However the OOM killer _has_ to make the progress even in such rare
> > circumstances.

Absolutely agreed!

> Oh, and the concern is allocator recursion? Yeah, that's a good point.

No, not really. The oom killer is running with PF_MEMALLOC context so no
reclaim recursion is allowed. As I've already pointed out in other
reply the context will have access to memory reserves without any
constrains so it could deplete them completely resulting in other issues
during the recovery.

> Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
> or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
> generate the report on slabs, since we're taking the slab mutex there.

No it's not. You simply _cannot_ allocate from the oom context.
Kent Overstreet April 25, 2022, 3:28 p.m. UTC | #5
On Mon, Apr 25, 2022 at 11:28:26AM +0200, Michal Hocko wrote:
> 
> > Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
> > or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
> > generate the report on slabs, since we're taking the slab mutex there.
> 
> No it's not. You simply _cannot_ allocate from the oom context.

Hmm, no, that can't be right. I've been using the patch set and it definitely
works, at least in my testing. Do you mean to say that we shouldn't? Can you
explain why?
Michal Hocko April 26, 2022, 7:17 a.m. UTC | #6
On Mon 25-04-22 11:28:11, Kent Overstreet wrote:
> On Mon, Apr 25, 2022 at 11:28:26AM +0200, Michal Hocko wrote:
> > 
> > > Do you know if using memalloc_noreclaim_(save|restore) is sufficient for that,
> > > or do we want GFP_ATOMIC? I'm already using GFP_ATOMIC for allocations when we
> > > generate the report on slabs, since we're taking the slab mutex there.
> > 
> > No it's not. You simply _cannot_ allocate from the oom context.
> 
> Hmm, no, that can't be right. I've been using the patch set and it definitely
> works, at least in my testing.

Yes, the world will not fall down and it really depends on the workload
what kind of effect this might have.

> Do you mean to say that we shouldn't? Can you explain why?

I have already touched on that but let me reiterate. Allocation context
called from the oom path will have an unbound access to memory reserves.
Those are a last resort emergency pools of memory that are not available
normally and there are areas which really depend on them to make a
further progress to release the memory pressure.

Swap over NFS would be one such example. If some other code path messes
with those reserves the swap IO path could fail with all sorts of
fallouts.

So to be really exact in my statement. You can allocate from the OOM
context but it is _strongly_ discouraged unless there is no other way
around that.

I would even claim that the memory reclaim in general shouldn't rely on
memory allocations (other than mempools). If an allocation is really
necessary then an extra care has to prevent from complete memory
depletion.
Kent Overstreet April 26, 2022, 7:26 a.m. UTC | #7
On Tue, Apr 26, 2022 at 09:17:39AM +0200, Michal Hocko wrote:
> I have already touched on that but let me reiterate. Allocation context
> called from the oom path will have an unbound access to memory reserves.
> Those are a last resort emergency pools of memory that are not available
> normally and there are areas which really depend on them to make a
> further progress to release the memory pressure.
> 
> Swap over NFS would be one such example. If some other code path messes
> with those reserves the swap IO path could fail with all sorts of
> fallouts.
> 
> So to be really exact in my statement. You can allocate from the OOM
> context but it is _strongly_ discouraged unless there is no other way
> around that.
> 
> I would even claim that the memory reclaim in general shouldn't rely on
> memory allocations (other than mempools). If an allocation is really
> necessary then an extra care has to prevent from complete memory
> depletion.

100% agreement with this, I've always made sure IO paths I touched were fully
mempool-ified (some of my early work was actually for making sure bio allocation
underneath generic_make_request() won't deadlock - previously allocated bios
won't make forward progress and be freed due to generic_make_request() turning
recursion into iteration, but that's all ancient history).

Anyways, the reason I think this allocation is fine is it's GFP_NOWAIT and it's
completely fine if it fails - all we lose is some diagnostics, and also it's
released right away.

But there's also no need for it to be a point of contention, the way I'm going
with printbufs it'll be trivial to mempool-ify this if we want.

Before I get back to this I'm changing the approach I'm taking with printbufs
and first using it to clean up vsnprintf() and all the related code, which is..
a bit of an undertaking. End result is going to be really cool though.
Michal Hocko April 26, 2022, 7:40 a.m. UTC | #8
On Tue 26-04-22 03:26:12, Kent Overstreet wrote:
[...]
> Anyways, the reason I think this allocation is fine is it's GFP_NOWAIT and it's
> completely fine if it fails - all we lose is some diagnostics, and also it's
> released right away.

I think you are still missing the PF_MEMALLOC point. Please have a look
how this leads to no reclaim recursion so GFP_NOWAIT has no meaning when
you are allocating from PF_MEMALLOC context (which the oom killer and
any reclaim path is). Also have a look at how __gfp_pfmemalloc_flags
makes the allocation request from that context ALLOC_NO_WATERMARKS.
See?
diff mbox series

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 832fb33037..659c7d6376 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -171,27 +171,6 @@  static bool oom_unkillable_task(struct task_struct *p)
 	return false;
 }
 
-/*
- * Check whether unreclaimable slab amount is greater than
- * all user memory(LRU pages).
- * dump_unreclaimable_slab() could help in the case that
- * oom due to too much unreclaimable slab used by kernel.
-*/
-static bool should_dump_unreclaim_slab(void)
-{
-	unsigned long nr_lru;
-
-	nr_lru = global_node_page_state(NR_ACTIVE_ANON) +
-		 global_node_page_state(NR_INACTIVE_ANON) +
-		 global_node_page_state(NR_ACTIVE_FILE) +
-		 global_node_page_state(NR_INACTIVE_FILE) +
-		 global_node_page_state(NR_ISOLATED_ANON) +
-		 global_node_page_state(NR_ISOLATED_FILE) +
-		 global_node_page_state(NR_UNEVICTABLE);
-
-	return (global_node_page_state_pages(NR_SLAB_UNRECLAIMABLE_B) > nr_lru);
-}
-
 /**
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
@@ -465,8 +444,6 @@  static void dump_header(struct oom_control *oc, struct task_struct *p)
 		mem_cgroup_print_oom_meminfo(oc->memcg);
 	else {
 		show_mem(SHOW_MEM_FILTER_NODES, oc->nodemask);
-		if (should_dump_unreclaim_slab())
-			dump_unreclaimable_slab();
 	}
 	if (sysctl_oom_dump_tasks)
 		dump_tasks(oc);
diff --git a/mm/show_mem.c b/mm/show_mem.c
index 1c26c14ffb..24b662f64d 100644
--- a/mm/show_mem.c
+++ b/mm/show_mem.c
@@ -7,11 +7,15 @@ 
 
 #include <linux/mm.h>
 #include <linux/cma.h>
+#include <linux/printbuf.h>
+
+#include "slab.h"
 
 void show_mem(unsigned int filter, nodemask_t *nodemask)
 {
 	pg_data_t *pgdat;
 	unsigned long total = 0, reserved = 0, highmem = 0;
+	struct printbuf buf = PRINTBUF;
 
 	printk("Mem-Info:\n");
 	show_free_areas(filter, nodemask);
@@ -41,4 +45,14 @@  void show_mem(unsigned int filter, nodemask_t *nodemask)
 #ifdef CONFIG_MEMORY_FAILURE
 	printk("%lu pages hwpoisoned\n", atomic_long_read(&num_poisoned_pages));
 #endif
+
+	pr_info("Unreclaimable slab info:\n");
+	dump_unreclaimable_slab(&buf);
+	printk("%s", printbuf_str(&buf));
+	printbuf_reset(&buf);
+
+	printk("Shrinkers:\n");
+	shrinkers_to_text(&buf);
+	printk("%s", printbuf_str(&buf));
+	printbuf_exit(&buf);
 }
diff --git a/mm/slab.h b/mm/slab.h
index c7f2abc2b1..abefbf7674 100644
--- a/mm/slab.h
+++ b/mm/slab.h
@@ -788,10 +788,12 @@  static inline struct kmem_cache_node *get_node(struct kmem_cache *s, int node)
 
 #endif
 
+struct printbuf;
+
 #if defined(CONFIG_SLAB) || defined(CONFIG_SLUB_DEBUG)
-void dump_unreclaimable_slab(void);
+void dump_unreclaimable_slab(struct printbuf *);
 #else
-static inline void dump_unreclaimable_slab(void)
+static inline void dump_unreclaimable_slab(struct printbuf *out)
 {
 }
 #endif
diff --git a/mm/slab_common.c b/mm/slab_common.c
index 23f2ab0713..1209480797 100644
--- a/mm/slab_common.c
+++ b/mm/slab_common.c
@@ -24,6 +24,7 @@ 
 #include <asm/tlbflush.h>
 #include <asm/page.h>
 #include <linux/memcontrol.h>
+#include <linux/printbuf.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/kmem.h>
@@ -1084,10 +1085,15 @@  static int slab_show(struct seq_file *m, void *p)
 	return 0;
 }
 
-void dump_unreclaimable_slab(void)
+void dump_unreclaimable_slab(struct printbuf *out)
 {
 	struct kmem_cache *s;
 	struct slabinfo sinfo;
+	struct slab_by_mem {
+		struct kmem_cache *s;
+		size_t total, active;
+	} slabs_by_mem[10], n;
+	int i, nr = 0;
 
 	/*
 	 * Here acquiring slab_mutex is risky since we don't prefer to get
@@ -1097,12 +1103,11 @@  void dump_unreclaimable_slab(void)
 	 * without acquiring the mutex.
 	 */
 	if (!mutex_trylock(&slab_mutex)) {
-		pr_warn("excessive unreclaimable slab but cannot dump stats\n");
+		pr_buf(out, "excessive unreclaimable slab but cannot dump stats\n");
 		return;
 	}
 
-	pr_info("Unreclaimable slab info:\n");
-	pr_info("Name                      Used          Total\n");
+	printbuf_atomic_inc(out);
 
 	list_for_each_entry(s, &slab_caches, list) {
 		if (s->flags & SLAB_RECLAIM_ACCOUNT)
@@ -1110,11 +1115,43 @@  void dump_unreclaimable_slab(void)
 
 		get_slabinfo(s, &sinfo);
 
-		if (sinfo.num_objs > 0)
-			pr_info("%-17s %10luKB %10luKB\n", s->name,
-				(sinfo.active_objs * s->size) / 1024,
-				(sinfo.num_objs * s->size) / 1024);
+		if (!sinfo.num_objs)
+			continue;
+
+		n.s = s;
+		n.total = sinfo.num_objs * s->size;
+		n.active = sinfo.active_objs * s->size;
+
+		for (i = 0; i < nr; i++)
+			if (n.total < slabs_by_mem[i].total)
+				break;
+
+		if (nr < ARRAY_SIZE(slabs_by_mem)) {
+			memmove(&slabs_by_mem[i + 1],
+				&slabs_by_mem[i],
+				sizeof(slabs_by_mem[0]) * (nr - i));
+			nr++;
+		} else if (i) {
+			i--;
+			memmove(&slabs_by_mem[0],
+				&slabs_by_mem[1],
+				sizeof(slabs_by_mem[0]) * i);
+		} else {
+			continue;
+		}
+
+		slabs_by_mem[i] = n;
+	}
+
+	for (i = nr - 1; i >= 0; --i) {
+		pr_buf(out, "%-17s total: ", slabs_by_mem[i].s->name);
+		pr_human_readable_u64(out, slabs_by_mem[i].total);
+		pr_buf(out, " active: ");
+		pr_human_readable_u64(out, slabs_by_mem[i].active);
+		pr_newline(out);
 	}
+
+	printbuf_atomic_dec(out);
 	mutex_unlock(&slab_mutex);
 }