diff mbox series

[1/2] mm, memcg: cg2 memory{.swap,}.peak write handlers

Message ID 20240723233149.3226636-2-davidf@vimeo.com
State Superseded

Commit Message

David Finkel July 23, 2024, 11:31 p.m. UTC
Other mechanisms for querying the peak memory usage of either a process
or v1 memory cgroup allow for resetting the high watermark. Restore
parity with those mechanisms, but with a less racy API.

For example:
 - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets
   the high watermark.
 - Writing "5" to the clear_refs pseudo-file in a process's proc
   directory resets the peak RSS.

This change is an evolution of a previous patch, which mostly copied the
cgroup v1 behavior. However, there were concerns about races/ownership
issues with a global reset, so this change instead makes the reset
file-descriptor-local.

Writing a specific string to the memory.peak and memory.swap.peak
pseudo-files resets the high watermark to the current usage for
subsequent reads through that same fd.
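
As an illustration of the fd-local semantics described above, here is a
minimal userspace sketch. The helper names peak_reset() and peak_read()
are hypothetical and not part of the patch, and reading back with
pread(2) at offset 0 is an assumption about how a caller would re-read
the file. The "reset" string matches this revision of the patch only.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to restart peak tracking for this
 * fd only. Other open fds on the same file keep their own view.
 */
static int peak_reset(int fd)
{
	static const char cmd[] = "reset\n";
	ssize_t len = (ssize_t)(sizeof(cmd) - 1);

	return write(fd, cmd, (size_t)len) == len ? 0 : -1;
}

/*
 * Hypothetical helper: read the peak (in bytes) as seen through this fd.
 * Returns 0 on success and -1 on error.
 */
static int peak_read(int fd, unsigned long long *bytes)
{
	char buf[32];
	ssize_t n = pread(fd, buf, sizeof(buf) - 1, 0);

	if (n <= 0)
		return -1;
	buf[n] = '\0';
	*bytes = strtoull(buf, NULL, 10);
	return 0;
}
```

A worker would presumably open its cgroup's memory.peak once with
O_RDWR, call peak_reset() before each work item, and call peak_read()
after it, all on the same fd.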

Notably, following Johannes's suggestion, this implementation moves the
O(fds that have written) behavior onto the fd write(2) path. On the
page-allocation path, we instead simply add one additional watermark to
conditionally bump per hierarchy level in the page-counter.

Additionally, this takes Longman's suggestion of nesting the
page-charging-path checks for the two watermarks to reduce the number of
common-case comparisons.
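
The nested check can be modelled in plain userspace C. This is only a
sketch: struct counter_model and the model_*() functions are
hypothetical stand-ins for the page_counter code, with no atomics and no
hierarchy walk. It relies on the invariant local_watermark <= watermark,
so the global watermark is only compared when the local one has just
been raised.

```c
/* Hypothetical single-level model of the two watermarks. */
struct counter_model {
	unsigned long usage;
	unsigned long watermark;	/* peak since creation */
	unsigned long local_watermark;	/* peak since the last fd-local reset */
};

/* Charge pages and bump the watermarks with the nested comparison. */
static void model_charge(struct counter_model *c, unsigned long nr_pages)
{
	unsigned long new_usage;

	c->usage += nr_pages;
	new_usage = c->usage;

	/* Common case: below the local watermark, so one comparison only. */
	if (new_usage > c->local_watermark) {
		c->local_watermark = new_usage;
		if (new_usage > c->watermark)
			c->watermark = new_usage;
	}
}

/* Uncharging never lowers either watermark. */
static void model_uncharge(struct counter_model *c, unsigned long nr_pages)
{
	c->usage -= nr_pages;
}

/* An fd-local reset drops only the local watermark to current usage. */
static void model_local_reset(struct counter_model *c)
{
	c->local_watermark = c->usage;
}
```

After a local reset, charges that stay below the old global peak raise
only local_watermark, and the global watermark moves again once usage
exceeds it.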

This behavior is particularly useful for work scheduling systems that
need to track memory usage of worker processes/cgroups per-work-item.
Since memory can't be squeezed like CPU can (the OOM-killer has
opinions), these systems need to track the peak memory usage to compute
system/container fullness when binpacking workitems.

Most notably, Vimeo's use-case involves a system that's doing global
binpacking across many Kubernetes pods/containers, and while we can use
PSI for some local decisions about overload, we strive to avoid packing
workloads too tightly in the first place. To facilitate this, we track
the peak memory usage. However, since we run with long-lived workers (to
amortize startup costs) we need a way to track the high watermark while
a work-item is executing. Polling runs the risk of missing short spikes
that last for timescales below the polling interval, and peak memory
tracking at the cgroup level is otherwise perfect for this use-case.

As this data is used to ensure that binpacked work ends up with
sufficient headroom, this use-case mostly avoids the inaccuracies
surrounding reclaimable memory.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Waiman Long <longman@redhat.com>
Signed-off-by: David Finkel <davidf@vimeo.com>
---
 Documentation/admin-guide/cgroup-v2.rst |  26 ++++--
 include/linux/cgroup-defs.h             |   5 +
 include/linux/cgroup.h                  |   3 +
 include/linux/memcontrol.h              |   5 +
 include/linux/page_counter.h            |   6 +-
 kernel/cgroup/cgroup-internal.h         |   2 +
 kernel/cgroup/cgroup.c                  |   7 ++
 mm/memcontrol.c                         | 117 ++++++++++++++++++++++--
 mm/page_counter.c                       |  30 ++++--
 9 files changed, 174 insertions(+), 27 deletions(-)

Comments

Johannes Weiner July 24, 2024, 11:49 a.m. UTC | #1
On Tue, Jul 23, 2024 at 09:55:19PM -0400, Waiman Long wrote:
> Could you use the "-v <n>" option of git-format-patch to add a version 
> number to the patch title? Without that, it can be confusing as to 
> whether the patch is new or a resend of the previous one.

+1

> > @@ -775,6 +775,11 @@ struct cgroup_subsys {
> >   
> >   extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
> >   
> > +struct cgroup_of_peak {
> > +	long			value;
> > +	struct list_head	list;
> > +};
> The name "cgroup_of_peak" is kind of confusing. Maybe local_peak?

It's the peak associated with an 'of' (which is a known concept in
cgroup code), and it pairs up nicely with of_peak(). I'd prefer
keeping that over local_peak.

> > @@ -26,6 +26,7 @@ struct page_counter {
> >   	atomic_long_t children_low_usage;
> >   
> >   	unsigned long watermark;
> > +	unsigned long local_watermark; /* track min of fd-local resets */
> track "min"? I thought it is used to track local maximum after a reset.

Yeah, the comment doesn't sound quite right.

However, I think we'd be hard-pressed to explain correctly and
comprehensively what this thing does in <40 characters.

I'd just remove the comment tbh.

> > @@ -78,7 +79,10 @@ int page_counter_memparse(const char *buf, const char *max,
> >   
> >   static inline void page_counter_reset_watermark(struct page_counter *counter)
> >   {
> > -	counter->watermark = page_counter_read(counter);
> > +	unsigned long usage = page_counter_read(counter);
> > +
> > +	counter->watermark = usage;
> > +	counter->local_watermark = usage;
> >   }
> >   
> 
> Could you set the local_watermark first before setting watermark? There 
> is a very small time window that the invariant "local_watermark <= 
> watermark" is not true.

Does it matter? Only cgroup1 supports global resets; only cgroup2
supports local peaks watching. This doesn't add anything to the race
that already exists between reset and global watermark update on cg1.

> > @@ -3950,12 +3955,90 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
> >   	return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
> >   }
> >   
> > -static u64 memory_peak_read(struct cgroup_subsys_state *css,
> > -			    struct cftype *cft)
> > +static int peak_show(struct seq_file *sf, void *v, struct page_counter *pc)
> >   {
> > -	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > +	struct cgroup_of_peak *ofp = of_peak(sf->private);
> > +	s64 fd_peak = ofp->value, peak;
> > +
> > +	/* User wants global or local peak? */
> > +	if (fd_peak == -1)
> > +		peak = pc->watermark;
> > +	else
> > +		peak = max(fd_peak, (s64)pc->local_watermark);
> Should you save the local_watermark value into ofp->value if 
> local_watermark is bigger? This will ensure that each successive read of 
> the fd is monotonically increasing. Otherwise the value may go up or 
> down if there are multiple resets in between.

The reset saves local_watermark into ofp->value if it's bigger..?

I do see another problem, though. The compiler might issue multiple
reads to ofp->value in arbitrary order. We could print max(-1, ...)
which is nonsense. Saving ofp->value into a local variable is the
right idea, but the compiler might still issue two reads anyway. It
needs a READ_ONCE() to force a single read.

I'd use unsigned long for fd_peak. This way the "specialness" is on
the -1UL comparison. The max() must be between two positive numbers,
so the (s64) there is confusing.
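
A userspace sketch of the read path with both suggestions applied might
look like the following. It is an illustration, not the code from any
later revision: READ_ONCE() is approximated with a volatile access
(using the GNU __typeof__ extension), and OFP_PEAK_UNSET,
struct of_peak_model and model_peak_pages() are hypothetical names.

```c
/* Userspace approximation of the kernel's READ_ONCE(): one volatile load. */
#define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/* Hypothetical name for the "no reset on this fd yet" sentinel (-1UL). */
#define OFP_PEAK_UNSET (~0UL)

/* Hypothetical stand-in for the value field of struct cgroup_of_peak. */
struct of_peak_model {
	unsigned long value;
};

/* Return the peak, in pages, that a read through this fd should report. */
static unsigned long model_peak_pages(struct of_peak_model *ofp,
				      unsigned long watermark,
				      unsigned long local_watermark)
{
	/* Snapshot once so the sentinel test and the max see the same value. */
	unsigned long fd_peak = READ_ONCE(ofp->value);

	if (fd_peak == OFP_PEAK_UNSET)
		return watermark;	/* never reset: global peak */

	/* Both operands are ordinary unsigned page counts here. */
	return fd_peak > local_watermark ? fd_peak : local_watermark;
}
```

Keeping the sentinel as the only special value means the comparison
against it is the single place where -1UL appears, and no signed cast is
needed in the max.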
David Finkel July 24, 2024, 4:11 p.m. UTC | #2
On Wed, Jul 24, 2024 at 7:49 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Jul 23, 2024 at 09:55:19PM -0400, Waiman Long wrote:
> > Could you use the "-v <n>" option of git-format-patch to add a version
> > number to the patch title? Without that, it can be confusing as to
> > whether the patch is new or a resend of the previous one.
>
> +1

Sorry, I forgot that that flag exists.
I'll use that with the next patch. (which I'll send out shortly)

>
> > > @@ -775,6 +775,11 @@ struct cgroup_subsys {
> > >
> > >   extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
> > >
> > > +struct cgroup_of_peak {
> > > +   long                    value;
> > > +   struct list_head        list;
> > > +};
> > The name "cgroup_of_peak" is kind of confusing. Maybe local_peak?
>
> It's the peak associated with an 'of' (which is a known concept in
> cgroup code), and it pairs up nicely with of_peak(). I'd prefer
> keeping that over local_peak.
>
> > > @@ -26,6 +26,7 @@ struct page_counter {
> > >     atomic_long_t children_low_usage;
> > >
> > >     unsigned long watermark;
> > > +   unsigned long local_watermark; /* track min of fd-local resets */
> > track "min"? I thought it is used to track local maximum after a reset.
>
> Yeah, the comment doesn't sound quite right.
Yeah, it's not explicitly the min. At reset-time, it's set to the
current value, and all the fd-local watermarks will be greater than or
equal to it. Which does effectively make it the min of watermarks at the
time it's being set by the reset code.

However, yeah, the page-charging code will increase it, which makes it
not a min.

>
> However, I think we'd be hard-pressed to explain correctly and
> comprehensively what this thing does in <40 characters.
>
> I'd just remove the comment tbh.
Yeah, I definitely didn't think that comment through.
Deleting.
>
> > > @@ -78,7 +79,10 @@ int page_counter_memparse(const char *buf, const char *max,
> > >
> > >   static inline void page_counter_reset_watermark(struct page_counter *counter)
> > >   {
> > > -   counter->watermark = page_counter_read(counter);
> > > +   unsigned long usage = page_counter_read(counter);
> > > +
> > > +   counter->watermark = usage;
> > > +   counter->local_watermark = usage;
> > >   }
> > >
> >
> > Could you set the local_watermark first before setting watermark? There
> > is a very small time window that the invariant "local_watermark <=
> > watermark" is not true.
>
> Does it matter? Only cgroup1 supports global resets; only cgroup2
> supports local peaks watching. This doesn't add anything to the race
> that already exists between reset and global watermark update on cg1.
>
Hmm, since the global watermark update is now conditional on both watermarks
being <= the current usage, it does make sense.
With that said, since we're assigning without any barriers, as-is,
the CPU and compiler are quite free to re-order them anyway.

I've swapped them and added a comment.
> > > @@ -3950,12 +3955,90 @@ static u64 memory_current_read(struct cgroup_subsys_state *css,
> > >     return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
> > >   }
> > >
> > > -static u64 memory_peak_read(struct cgroup_subsys_state *css,
> > > -                       struct cftype *cft)
> > > +static int peak_show(struct seq_file *sf, void *v, struct page_counter *pc)
> > >   {
> > > -   struct mem_cgroup *memcg = mem_cgroup_from_css(css);
> > > +   struct cgroup_of_peak *ofp = of_peak(sf->private);
> > > +   s64 fd_peak = ofp->value, peak;
> > > +
> > > +   /* User wants global or local peak? */
> > > +   if (fd_peak == -1)
> > > +           peak = pc->watermark;
> > > +   else
> > > +           peak = max(fd_peak, (s64)pc->local_watermark);
> > Should you save the local_watermark value into ofp->value if
> > local_watermark is bigger? This will ensure that each successive read of
> > the fd is monotonically increasing. Otherwise the value may go up or
> > down if there are multiple resets in between.
>
> The reset saves local_watermark into ofp->value if it's bigger..?
>
> I do see another problem, though. The compiler might issue multiple
> reads to ofp->value in arbitrary order. We could print max(-1, ...)
> which is nonsense. Saving ofp->value into a local variable is the
> right idea, but the compiler might still issue two reads anyway. It
> needs a READ_ONCE() to force a single read.
Thanks, I didn't realize the compiler had the latitude to decide to
read from that struct field a second time when referencing the local
variable.

Added.
>
> I'd use unsigned long for fd_peak. This way the "specialness" is on
> the -1UL comparison. The max() must be between two positive numbers,
> so the (s64) there is confusing.
I've switched fd_peak to `u64`.

Thanks again,

--
David Finkel
Senior Principal Software Engineer, Core Services
Tejun Heo July 26, 2024, 8:22 p.m. UTC | #3
Hello, David.

On Tue, Jul 23, 2024 at 07:31:48PM -0400, David Finkel wrote:
...
> +	A write of the string "reset" to this file resets it to the
> +	current memory usage for subsequent reads through the same
> +	file descriptor.
> +	Attempts to write any other non-empty string will return EINVAL
> +	(modulo leading and trailing whitespace).

Let's just please do any write. We don't want to add complex write semantics
to these files. Writing anything to reset these files is an established
pattern and I don't think we gain anything by making this more complicated.

Thanks.
David Finkel July 29, 2024, 1:37 p.m. UTC | #4
Hi Tejun,

On Fri, Jul 26, 2024 at 4:22 PM Tejun Heo <tj@kernel.org> wrote:
>
> Hello, David.
>
> On Tue, Jul 23, 2024 at 07:31:48PM -0400, David Finkel wrote:
> ...
> > +     A write of the string "reset" to this file resets it to the
> > +     current memory usage for subsequent reads through the same
> > +     file descriptor.
> > +     Attempts to write any other non-empty string will return EINVAL
> > +     (modulo leading and trailing whitespace).
>
> Let's just please do any write. We don't want to add complex write semantics
> to these files. Writing anything to reset these files is an established
> pattern and I don't think we gain anything by making this more complicated.

I still think something more limited is right here, but it seems that
there's consensus that accepting all non-empty writes is the right
option here, so I've removed the check.
The next patchset will accept any (non-empty) write.

>
> Thanks.
>
> --
> tejun


Thanks,

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 86311c2907cd3..01554cf6e55b4 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1333,11 +1333,16 @@  The following nested keys are defined.
 	all the existing limitations and potential future extensions.
 
   memory.peak
-	A read-only single value file which exists on non-root
-	cgroups.
+	A read-write single value file which exists on non-root cgroups.
+
+	The max memory usage recorded for the cgroup and its descendants since
+	either the creation of the cgroup or the most recent reset for that FD.
 
-	The max memory usage recorded for the cgroup and its
-	descendants since the creation of the cgroup.
+	A write of the string "reset" to this file resets it to the
+	current memory usage for subsequent reads through the same
+	file descriptor.
+	Attempts to write any other non-empty string will return EINVAL
+	(modulo leading and trailing whitespace).
 
   memory.oom.group
 	A read-write single value file which exists on non-root
@@ -1663,11 +1668,16 @@  The following nested keys are defined.
 	Healthy workloads are not expected to reach this limit.
 
   memory.swap.peak
-	A read-only single value file which exists on non-root
-	cgroups.
+	A read-write single value file which exists on non-root cgroups.
+
+	The max swap usage recorded for the cgroup and its descendants since
+	the creation of the cgroup or the most recent reset for that FD.
 
-	The max swap usage recorded for the cgroup and its
-	descendants since the creation of the cgroup.
+	A write of the string "reset" to this file resets it to the
+	current memory usage for subsequent reads through the same
+	file descriptor.
+	Attempts to write any other non-empty string will return EINVAL
+	(modulo leading and trailing whitespace).
 
   memory.swap.max
 	A read-write single value file which exists on non-root
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index ae04035b6cbe5..2188ea76ab327 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -775,6 +775,11 @@  struct cgroup_subsys {
 
 extern struct percpu_rw_semaphore cgroup_threadgroup_rwsem;
 
+struct cgroup_of_peak {
+	long			value;
+	struct list_head	list;
+};
+
 /**
  * cgroup_threadgroup_change_begin - threadgroup exclusion for cgroups
  * @tsk: target task
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index c60ba0ab14627..3e0563753cc3e 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -11,6 +11,7 @@ 
 
 #include <linux/sched.h>
 #include <linux/nodemask.h>
+#include <linux/list.h>
 #include <linux/rculist.h>
 #include <linux/cgroupstats.h>
 #include <linux/fs.h>
@@ -854,4 +855,6 @@  static inline void cgroup_bpf_put(struct cgroup *cgrp) {}
 
 struct cgroup *task_get_cgroup1(struct task_struct *tsk, int hierarchy_id);
 
+struct cgroup_of_peak *of_peak(struct kernfs_open_file *of);
+
 #endif /* _LINUX_CGROUP_H */
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7e2eb091049a0..4fa4f0e931d26 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -192,6 +192,11 @@  struct mem_cgroup {
 		struct page_counter memsw;	/* v1 only */
 	};
 
+	/* registered local peak watchers */
+	struct list_head memory_peaks;
+	struct list_head swap_peaks;
+	spinlock_t	 peaks_lock;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index 904c52f97284f..860f313182e77 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -26,6 +26,7 @@  struct page_counter {
 	atomic_long_t children_low_usage;
 
 	unsigned long watermark;
+	unsigned long local_watermark; /* track min of fd-local resets */
 	unsigned long failcnt;
 
 	/* Keep all the read most fields in a separete cacheline. */
@@ -78,7 +79,10 @@  int page_counter_memparse(const char *buf, const char *max,
 
 static inline void page_counter_reset_watermark(struct page_counter *counter)
 {
-	counter->watermark = page_counter_read(counter);
+	unsigned long usage = page_counter_read(counter);
+
+	counter->watermark = usage;
+	counter->local_watermark = usage;
 }
 
 void page_counter_calculate_protection(struct page_counter *root,
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 520b90dd97eca..c964dd7ff967a 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/cgroup-internal.h
@@ -81,6 +81,8 @@  struct cgroup_file_ctx {
 	struct {
 		struct cgroup_pidlist	*pidlist;
 	} procs1;
+
+	struct cgroup_of_peak peak;
 };
 
 /*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c8e4b62b436a4..0a97cb2ef1245 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -1972,6 +1972,13 @@  static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param
 	return -EINVAL;
 }
 
+struct cgroup_of_peak *of_peak(struct kernfs_open_file *of)
+{
+	struct cgroup_file_ctx *ctx = of->priv;
+
+	return &ctx->peak;
+}
+
 static void apply_cgroup_root_flags(unsigned int root_flags)
 {
 	if (current->nsproxy->cgroup_ns == &init_cgroup_ns) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9603717886877..2176a2da1aa83 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -25,6 +25,7 @@ 
  * Copyright (C) 2020 Alibaba, Inc, Alex Shi
  */
 
+#include <linux/cgroup-defs.h>
 #include <linux/page_counter.h>
 #include <linux/memcontrol.h>
 #include <linux/cgroup.h>
@@ -41,6 +42,7 @@ 
 #include <linux/rcupdate.h>
 #include <linux/limits.h>
 #include <linux/export.h>
+#include <linux/list.h>
 #include <linux/mutex.h>
 #include <linux/rbtree.h>
 #include <linux/slab.h>
@@ -3558,6 +3560,9 @@  static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
 
 	INIT_WORK(&memcg->high_work, high_work_func);
 	vmpressure_init(&memcg->vmpressure);
+	INIT_LIST_HEAD(&memcg->memory_peaks);
+	INIT_LIST_HEAD(&memcg->swap_peaks);
+	spin_lock_init(&memcg->peaks_lock);
 	memcg->socket_pressure = jiffies;
 	memcg1_memcg_init(memcg);
 	memcg->kmemcg_id = -1;
@@ -3950,12 +3955,90 @@  static u64 memory_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->memory) * PAGE_SIZE;
 }
 
-static u64 memory_peak_read(struct cgroup_subsys_state *css,
-			    struct cftype *cft)
+static int peak_show(struct seq_file *sf, void *v, struct page_counter *pc)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct cgroup_of_peak *ofp = of_peak(sf->private);
+	s64 fd_peak = ofp->value, peak;
+
+	/* User wants global or local peak? */
+	if (fd_peak == -1)
+		peak = pc->watermark;
+	else
+		peak = max(fd_peak, (s64)pc->local_watermark);
+
+	seq_printf(sf, "%lld\n", peak * PAGE_SIZE);
+	return 0;
+}
 
-	return (u64)memcg->memory.watermark * PAGE_SIZE;
+static int memory_peak_show(struct seq_file *sf, void *v)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf));
+
+	return peak_show(sf, v, &memcg->memory);
+}
+
+static int peak_open(struct kernfs_open_file *of)
+{
+	struct cgroup_of_peak *ofp = of_peak(of);
+
+	ofp->value = -1;
+	return 0;
+}
+
+static void peak_release(struct kernfs_open_file *of)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	struct cgroup_of_peak *ofp = of_peak(of);
+
+	if (ofp->value == -1) {
+		/* fast path (no writes on this fd) */
+		return;
+	}
+	spin_lock(&memcg->peaks_lock);
+	list_del(&ofp->list);
+	spin_unlock(&memcg->peaks_lock);
+}
+
+static ssize_t peak_write(struct kernfs_open_file *of, char *buf, size_t nbytes,
+			  loff_t off, struct page_counter *pc,
+			  struct list_head *watchers)
+{
+	unsigned long usage;
+	struct cgroup_of_peak *peer_ctx;
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	struct cgroup_of_peak *ofp = of_peak(of);
+
+	buf = strstrip(buf);
+	/* Only allow "reset" to keep the API clear */
+	if (strcmp(buf, "reset"))
+		return -EINVAL;
+
+	spin_lock(&memcg->peaks_lock);
+
+	pc->local_watermark = page_counter_read(pc);
+	usage = pc->local_watermark;
+
+	list_for_each_entry(peer_ctx, watchers, list)
+		if (usage > peer_ctx->value)
+			peer_ctx->value = usage;
+
+	/* initial write, register watcher */
+	if (ofp->value == -1)
+		list_add(&ofp->list, watchers);
+
+	ofp->value = usage;
+	spin_unlock(&memcg->peaks_lock);
+
+	return nbytes;
+}
+
+static ssize_t memory_peak_write(struct kernfs_open_file *of, char *buf,
+				 size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+
+	return peak_write(of, buf, nbytes, off, &memcg->memory,
+			  &memcg->memory_peaks);
 }
 
 static int memory_min_show(struct seq_file *m, void *v)
@@ -4307,7 +4390,10 @@  static struct cftype memory_files[] = {
 	{
 		.name = "peak",
 		.flags = CFTYPE_NOT_ON_ROOT,
-		.read_u64 = memory_peak_read,
+		.open = peak_open,
+		.release = peak_release,
+		.seq_show = memory_peak_show,
+		.write = memory_peak_write,
 	},
 	{
 		.name = "min",
@@ -5099,12 +5185,20 @@  static u64 swap_current_read(struct cgroup_subsys_state *css,
 	return (u64)page_counter_read(&memcg->swap) * PAGE_SIZE;
 }
 
-static u64 swap_peak_read(struct cgroup_subsys_state *css,
-			  struct cftype *cft)
+static int swap_peak_show(struct seq_file *sf, void *v)
 {
-	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+	struct mem_cgroup *memcg = mem_cgroup_from_css(seq_css(sf));
+
+	return peak_show(sf, v, &memcg->swap);
+}
+
+static ssize_t swap_peak_write(struct kernfs_open_file *of, char *buf,
+			       size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 
-	return (u64)memcg->swap.watermark * PAGE_SIZE;
+	return peak_write(of, buf, nbytes, off, &memcg->swap,
+			  &memcg->swap_peaks);
 }
 
 static int swap_high_show(struct seq_file *m, void *v)
@@ -5188,7 +5282,10 @@  static struct cftype swap_files[] = {
 	{
 		.name = "swap.peak",
 		.flags = CFTYPE_NOT_ON_ROOT,
-		.read_u64 = swap_peak_read,
+		.open = peak_open,
+		.release = peak_release,
+		.seq_show = swap_peak_show,
+		.write = swap_peak_write,
 	},
 	{
 		.name = "swap.events",
diff --git a/mm/page_counter.c b/mm/page_counter.c
index 0153f5bb31611..ad9bdde5d5d20 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -79,9 +79,22 @@  void page_counter_charge(struct page_counter *counter, unsigned long nr_pages)
 		/*
 		 * This is indeed racy, but we can live with some
 		 * inaccuracy in the watermark.
+		 *
+		 * Notably, we have two watermarks to allow for both a globally
+		 * visible peak and one that can be reset at a smaller scope.
+		 *
+		 * Since we reset both watermarks when the global reset occurs,
+		 * we can guarantee that watermark >= local_watermark, so we
+		 * don't need to do both comparisons every time.
+		 *
+		 * On systems with branch predictors, the inner condition should
+		 * be almost free.
 		 */
-		if (new > READ_ONCE(c->watermark))
-			WRITE_ONCE(c->watermark, new);
+		if (new > READ_ONCE(c->local_watermark)) {
+			WRITE_ONCE(c->local_watermark, new);
+			if (new > READ_ONCE(c->watermark))
+				WRITE_ONCE(c->watermark, new);
+		}
 	}
 }
 
@@ -129,12 +142,13 @@  bool page_counter_try_charge(struct page_counter *counter,
 			goto failed;
 		}
 		propagate_protected_usage(c, new);
-		/*
-		 * Just like with failcnt, we can live with some
-		 * inaccuracy in the watermark.
-		 */
-		if (new > READ_ONCE(c->watermark))
-			WRITE_ONCE(c->watermark, new);
+
+		/* see comment on page_counter_charge */
+		if (new > READ_ONCE(c->local_watermark)) {
+			WRITE_ONCE(c->local_watermark, new);
+			if (new > READ_ONCE(c->watermark))
+				WRITE_ONCE(c->watermark, new);
+		}
 	}
 	return true;