[v3,0/7] mm: workingset reporting

Message ID 20240813165619.748102-1-yuanchu@google.com

Message

Yuanchu Xie Aug. 13, 2024, 4:56 p.m. UTC
Changes from PATCH v2 -> v3:
- Fixed typos in commit messages and documentation
  (Lance Yang, Randy Dunlap)
- Split out the force_scan patch to be reviewed separately
- Added benchmarks from Ghait Ouled Amar Ben Cheikh
- Fixed reported compile error without CONFIG_MEMCG

Changes from PATCH v1 -> v2:
- Updated selftest to use ksft_test_result_code instead of switch-case
  (Muhammad Usama Anjum)
- Included more use cases in the cover letter
  (Huang, Ying)
- Added documentation for sysfs and memcg interfaces
- Added an aging-specific struct lru_gen_mm_walk in struct pglist_data
  to avoid allocating for each lruvec.

Changes from RFC v3 -> PATCH v1:
- Updated selftest to use ksft_print_msg instead of fprintf(stderr, ...)
  (Muhammad Usama Anjum)
- Included more detail in the patch that skips pmd_young with force_scan
  (Huang, Ying)
- Deferred reaccess histogram as a followup
- Removed per-memcg page age interval configs for simplicity

Changes from RFC v2 -> RFC v3:
- Updated to v6.8
- Added an aging kernel thread (gated behind config)
- Added basic selftests for sysfs interface files
- Track swapped out pages for reaccesses
- Refactoring and cleanup
- Dropped the virtio-balloon extension to make things manageable

Changes from RFC v1 -> RFC v2:
- Refactored the patches into smaller pieces
- Renamed interfaces and functions from wss to wsr (Working Set Reporting)
- Fixed build errors when CONFIG_WSR is not set
- Changed working_set_num_bins to u8 for virtio-balloon
- Added support for per-NUMA node reporting for virtio-balloon

[rfc v1]
https://lore.kernel.org/linux-mm/20230509185419.1088297-1-yuanchu@google.com/
[rfc v2]
https://lore.kernel.org/linux-mm/20230621180454.973862-1-yuanchu@google.com/
[rfc v3]
https://lore.kernel.org/linux-mm/20240327213108.2384666-1-yuanchu@google.com/

This patch series provides workingset reporting of user pages in
lruvecs, whose coldness can be tracked by accessed bits and fd
references. However, the concept of a workingset applies generically to
all types of memory, which could be kernel slab caches, discardable
userspace caches (databases), or CXL.mem. Therefore, data sources might
come from slab shrinkers, device drivers, or userspace. IMO, the kernel
should provide a set of workingset interfaces generic enough to
accommodate the various use cases and extensible to potential future
ones. The currently proposed interfaces are not sufficient in that
regard, but I would like to start somewhere, solicit feedback, and
iterate.

Use cases
==========
Job scheduling
On overcommitted hosts, workingset information allows the job scheduler
to right-size each job and land more jobs on the same host or NUMA node.
In the case of a job with a growing workingset, policy decisions can be
made to migrate other jobs off the host/NUMA node, or to oom-kill the
misbehaving job. If the job shape is very different from the machine
shape, knowing the per-node workingset can also help inform page
allocation policies.

Proactive reclaim
Workingset information allows a container manager to proactively
reclaim memory without impacting a job's performance. While PSI may
provide a reactive measure of when proactive reclaim has reclaimed too
much, workingset reporting allows the policy to be more accurate and
flexible.
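
As an illustration, a minimal userspace sketch of such a policy
("reclaim everything colder than 10 seconds") might look like the
following. The cgroup path is hypothetical, and the byte units and
bucket semantics are my reading of the histogram format; treat this as
a sketch, not part of this series:

/*
 * Sketch: reclaim all bytes reported colder than 10 seconds.
 * Assumes each histogram line reads "<interval_ms> anon=<bytes> file=<bytes>"
 * and that buckets labeled >= the cutoff hold sufficiently cold bytes.
 * Per-node header lines like "N0" fail to parse and are skipped.
 */
#include <stdio.h>

static long long cold_bytes(const char *path, long long cutoff_ms)
{
        char line[256];
        long long total = 0;
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                long long ms, anon, file;

                if (sscanf(line, "%lld anon=%lld file=%lld",
                           &ms, &anon, &file) == 3 && ms >= cutoff_ms)
                        total += anon + file;
        }
        fclose(f);
        return total;
}

int main(void)
{
        const char *cg = "/sys/fs/cgroup/workload";    /* hypothetical */
        char path[256];
        long long bytes;
        FILE *f;

        snprintf(path, sizeof(path), "%s/memory.workingset.page_age", cg);
        bytes = cold_bytes(path, 10000);
        if (bytes <= 0)
                return 0;
        snprintf(path, sizeof(path), "%s/memory.reclaim", cg);
        f = fopen(path, "w");
        if (f) {
                fprintf(f, "%lld", bytes);      /* memory.reclaim takes bytes */
                fclose(f);
        }
        return 0;
}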

Ballooning (similar to proactive reclaim)
While this patch series does not extend the virtio-balloon device,
balloon policies can use workingset information to determine the size
of the memory balloon more precisely. On desktops/laptops/mobile
devices where memory is scarce and overcommitted, the balloon sizes of
multiple VMs running on the same device can be orchestrated with
workingset reports from each one.

Promotion/Demotion
If different mechanisms are used for promotion and demotion, workingset
information can help connect the two and avoid pages being migrated back
and forth.
For example, consider a promotion hot-page threshold defined as a
reaccess distance of N seconds (promote pages accessed more often than
every N seconds). The threshold N should be set so that ~80% (e.g.) of
pages on the fast memory node pass it. This calculation can be done
with workingset reports, as the sketch below illustrates.
To be directly useful for promotion policies, the workingset report
interfaces need to be extended to report hotness and gather hotness
information from the devices[1].

[1]
https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
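
For illustration, the threshold calculation above could be sketched as
follows. The helper is hypothetical: it assumes the page_age buckets
have been parsed into (interval, bytes) pairs ordered youngest to
oldest, which is my reading of the report format:

/*
 * Sketch: pick the smallest reported interval N (ms) such that ~80%
 * of the fast node's bytes sit in buckets no older than N. Pages
 * reaccessed more often than every N would then pass the promotion
 * threshold.
 */
struct bucket {
        long long interval_ms;
        long long bytes;        /* anon + file bytes in this age bucket */
};

static long long hot_threshold_ms(const struct bucket *b, int n)
{
        long long total = 0, acc = 0;
        int i;

        for (i = 0; i < n; i++)
                total += b[i].bytes;
        for (i = 0; i < n; i++) {
                acc += b[i].bytes;
                if (acc * 10 >= total * 8)      /* >= 80% covered */
                        return b[i].interval_ms;
        }
        return n ? b[n - 1].interval_ms : 0;
}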

Sysfs and Cgroup Interfaces
==========
The interfaces are detailed in the patches that introduce them. The main
idea is that we break down the workingset per node and per memcg into
time intervals (ms), e.g.

1000 anon=137368 file=24530
20000 anon=34342 file=0
30000 anon=353232 file=333608
40000 anon=407198 file=206052
9223372036854775807 anon=4925624 file=892892
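
The documentation patch below also describes programs waiting on
page_age being woken up under memory pressure, subject to
report_threshold. A waiter could be sketched as follows, assuming the
usual sysfs notification pattern (read to arm, then poll for POLLPRI);
I have not pinned this against the exact semantics implemented here:

/*
 * Sketch: block until a node's page_age report is refreshed under
 * memory pressure, then reread the histogram.
 */
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

static int wait_for_report(const char *path, char *buf, size_t len)
{
        struct pollfd pfd;
        int fd = open(path, O_RDONLY);

        if (fd < 0)
                return -1;
        read(fd, buf, len);             /* initial read arms the notification */
        pfd.fd = fd;
        pfd.events = POLLPRI;
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI)) {
                lseek(fd, 0, SEEK_SET);
                read(fd, buf, len);     /* fresh histogram */
        }
        close(fd);
        return 0;
}

e.g. called on "/sys/devices/system/node/node0/workingset_report/page_age".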

I realize this does not generalize well to hotness information, but I
lack the intuition for an abstraction that presents hotness in a useful
way. Based on a recent proposal for move_phys_pages[2], it seems like
userspace tiering software would like to move specific physical pages,
instead of informing the kernel "move x number of hot pages to y
device". Please advise.

[2]
https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/

Implementation
==========
Currently, the reporting of user pages is based on MGLRU, and
therefore requires CONFIG_LRU_GEN=y. We would benefit from more MGLRU
generations for a finer-grained workingset report. I will make the
generation count configurable in the next version. The workingset
reporting mechanism is gated behind CONFIG_WORKINGSET_REPORT, and the
aging thread is behind CONFIG_WORKINGSET_REPORT_AGING.

Benchmarks
==========
Ghait Ouled Amar Ben Cheikh has implemented a simple "reclaim everything
colder than 10 seconds every 40 seconds" policy and ran the Linux
kernel compile and Redis benchmarks from the Phoronix Test Suite. The
results are in his repo:
https://github.com/miloudi98/WMO

Yuanchu Xie (7):
  mm: aggregate working set information into histograms
  mm: use refresh interval to rate-limit workingset report aggregation
  mm: report workingset during memory pressure driven scanning
  mm: extend working set reporting to memcgs
  mm: add kernel aging thread for workingset reporting
  selftest: test system-wide workingset reporting
  Docs/admin-guide/mm/workingset_report: document sysfs and memcg
    interfaces

 Documentation/admin-guide/mm/index.rst        |   1 +
 .../admin-guide/mm/workingset_report.rst      | 105 ++++
 drivers/base/node.c                           |   6 +
 include/linux/memcontrol.h                    |  21 +
 include/linux/mmzone.h                        |   9 +
 include/linux/workingset_report.h             |  97 +++
 mm/Kconfig                                    |  15 +
 mm/Makefile                                   |   2 +
 mm/internal.h                                 |  18 +
 mm/memcontrol.c                               | 184 +++++-
 mm/mm_init.c                                  |   2 +
 mm/mmzone.c                                   |   2 +
 mm/vmscan.c                                   |  56 +-
 mm/workingset_report.c                        | 561 ++++++++++++++++++
 mm/workingset_report_aging.c                  | 127 ++++
 tools/testing/selftests/mm/.gitignore         |   1 +
 tools/testing/selftests/mm/Makefile           |   3 +
 tools/testing/selftests/mm/run_vmtests.sh     |   5 +
 .../testing/selftests/mm/workingset_report.c  | 306 ++++++++++
 .../testing/selftests/mm/workingset_report.h  |  39 ++
 .../selftests/mm/workingset_report_test.c     | 330 +++++++++++
 21 files changed, 1885 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c
 create mode 100644 mm/workingset_report_aging.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.c
 create mode 100644 tools/testing/selftests/mm/workingset_report.h
 create mode 100644 tools/testing/selftests/mm/workingset_report_test.c

Comments

Andrew Morton Aug. 13, 2024, 6:33 p.m. UTC | #1
On Tue, 13 Aug 2024 09:56:11 -0700 Yuanchu Xie <yuanchu@google.com> wrote:

> This patch series provides workingset reporting of user pages in
> lruvecs, whose coldness can be tracked by accessed bits and fd
> references.

Very little reviewer interest.  I wonder why.  Will Google be the only
organization which finds this useful?

> Benchmarks
> ==========
> Ghait Ouled Amar Ben Cheikh has implemented a simple "reclaim everything
> colder than 10 seconds every 40 seconds" policy and ran Linux compile
> and redis from the phoronix test suite. The results are in his repo:
> https://github.com/miloudi98/WMO

I'd suggest at least summarizing these results here in the [0/N].  The
Linux kernel will probably outlive that URL!

Randy Dunlap Aug. 13, 2024, 11:45 p.m. UTC | #2
Hi,

On 8/13/24 9:56 AM, Yuanchu Xie wrote:
> Add workingset reporting documentation for better discoverability of
> its sysfs and memcg interfaces. Also document the required kernel
> config to enable workingset reporting.
> 
> Change-Id: Ib9dfc9004473baa6ef26ca7277d220b6199517de
> Signed-off-by: Yuanchu Xie <yuanchu@google.com>
> ---
>  Documentation/admin-guide/mm/index.rst        |   1 +
>  .../admin-guide/mm/workingset_report.rst      | 105 ++++++++++++++++++
>  2 files changed, 106 insertions(+)
>  create mode 100644 Documentation/admin-guide/mm/workingset_report.rst
> 

> diff --git a/Documentation/admin-guide/mm/workingset_report.rst b/Documentation/admin-guide/mm/workingset_report.rst
> new file mode 100644
> index 000000000000..ddcc0c33a8df
> --- /dev/null
> +++ b/Documentation/admin-guide/mm/workingset_report.rst
> @@ -0,0 +1,105 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +=================
> +Workingset Report
> +=================
> +Workingset report provides a view of memory coldness in user-defined
> +time intervals, i.e. X bytes are Y milliseconds cold. It breaks down

                   e.g., X bytes are Y milliseconds cold.

> +the user pages in the system per-NUMA node, per-memcg, for both
> +anonymous and file pages into histograms that look like:
> +::
> +
> +    1000 anon=137368 file=24530
> +    20000 anon=34342 file=0
> +    30000 anon=353232 file=333608
> +    40000 anon=407198 file=206052
> +    9223372036854775807 anon=4925624 file=892892
> +
> +The workingset reports can be used to drive proactive reclaim, by
> +identifying the number of cold bytes in a memcg, then writing to
> +``memory.reclaim``.
> +
> +Quick start
> +===========
> +Build the kernel with the following configurations. The report relies
> +on Multi-gen LRU for page coldness.
> +
> +* ``CONFIG_LRU_GEN=y``
> +* ``CONFIG_LRU_GEN_ENABLED=y``
> +* ``CONFIG_WORKINGSET_REPORT=y``
> +
> +Optionally, the aging kernel daemon can be enabled with the following
> +configuration.
> +* ``CONFIG_WORKINGSET_REPORT_AGING=y``
> +
> +Sysfs interfaces
> +================
> +``/sys/devices/system/node/nodeX/workingset_report/page_age`` provides
> +a per-node page age histogram, showing an aggregate of the node's lruvecs.
> +Reading this file causes a hierarchical aging of all lruvecs, scanning
> +pages and creating a new Multi-gen LRU generation in each lruvec.
> +For example:
> +::
> +
> +    1000 anon=0 file=0
> +    2000 anon=0 file=0
> +    100000 anon=5533696 file=5566464
> +    18446744073709551615 anon=0 file=0
> +
> +``/sys/devices/system/node/nodeX/workingset_report/page_age_intervals``
> +is a comma separated list of time in milliseconds that configures what

        comma-separated

> +the page age histogram uses for aggregation. For the above histogram,
> +the intervals are:
> +::

I guess just change the "are:" to "are::" and change the line that only
contains "::" to a blank line.  Otherwise there is a warning:

Documentation/admin-guide/mm/workingset_report.rst:54: ERROR: Unexpected indentation.


> +    1000,2000,100000
> +
> +``/sys/devices/system/node/nodeX/workingset_report/refresh_interval``
> +defines the amount of time the report is valid for in milliseconds.
> +When a report is still valid, reading the ``page_age`` file shows
> +the existing valid report, instead of generating a new one.
> +
> +``/sys/devices/system/node/nodeX/workingset_report/report_threshold``
> +specifies how often the userspace agent can be notified of node
> +memory pressure, in milliseconds. When a node reaches its low
> +watermarks and wakes up kswapd, programs waiting on ``page_age`` are
> +woken up so they can read the histogram and make policy decisions.
> +
> +Memcg interface
> +===============
> +While ``page_age_intervals`` is defined per-node in sysfs, ``page_age``,
> +``refresh_interval`` and ``report_threshold`` are available per-memcg.
> +
> +``/sys/fs/cgroup/.../memory.workingset.page_age``
> +The memcg equivalent of the sysfs workingset page age histogram
> +breaks down the workingset of this memcg and its children into
> +page age intervals. Each node is prefixed with a node header and
> +a newline. Non-proactive direct reclaim on this memcg can also
> +wake up userspace agents that are waiting on this file.
> +e.g.

   E.g.

> +::
> +
> +    N0
> +    1000 anon=0 file=0
> +    2000 anon=0 file=0
> +    3000 anon=0 file=0
> +    4000 anon=0 file=0
> +    5000 anon=0 file=0
> +    18446744073709551615 anon=0 file=0
> +
> +``/sys/fs/cgroup/.../memory.workingset.refresh_interval``
> +The memcg equivalent of the sysfs refresh interval. A per-node
> +number of how much time a page age histogram is valid for, in
> +milliseconds.
> +e.g.

   E.g.

> +::
> +
> +    echo N0=2000 > memory.workingset.refresh_interval
> +
> +``/sys/fs/cgroup/.../memory.workingset.report_threshold``
> +The memcg equivalent of the sysfs report threshold. A per-node
> +number of how often a userspace agent waiting on the page age
> +histogram can be woken up, in milliseconds.
> +e.g.

   E.g.

> +::
> +
> +    echo N0=1000 > memory.workingset.report_threshold

Gregory Price Aug. 20, 2024, 1 p.m. UTC | #3
On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> This patch series provides workingset reporting of user pages in
> lruvecs, whose coldness can be tracked by accessed bits and fd
> references. However, the concept of a workingset applies generically to
> all types of memory, which could be kernel slab caches, discardable
> userspace caches (databases), or CXL.mem. Therefore, data sources might
> come from slab shrinkers, device drivers, or userspace. IMO, the kernel
> should provide a set of workingset interfaces generic enough to
> accommodate the various use cases and extensible to potential future
> ones. The currently proposed interfaces are not sufficient in that
> regard, but I would like to start somewhere, solicit feedback, and
> iterate.
>
... snip ... 
> Use cases
> ==========
> Promotion/Demotion
> If different mechanisms are used for promotion and demotion, workingset
> information can help connect the two and avoid pages being migrated back
> and forth.
> For example, consider a promotion hot-page threshold defined as a
> reaccess distance of N seconds (promote pages accessed more often than
> every N seconds). The threshold N should be set so that ~80% (e.g.) of
> pages on the fast memory node pass it. This calculation can be done
> with workingset reports.
> To be directly useful for promotion policies, the workingset report
> interfaces need to be extended to report hotness and gather hotness
> information from the devices[1].
> 
> [1]
> https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> 
> Sysfs and Cgroup Interfaces
> ==========
> The interfaces are detailed in the patches that introduce them. The main
> idea is that we break down the workingset per node and per memcg into
> time intervals (ms), e.g.
> 
> 1000 anon=137368 file=24530
> 20000 anon=34342 file=0
> 30000 anon=353232 file=333608
> 40000 anon=407198 file=206052
> 9223372036854775807 anon=4925624 file=892892
> 
> I realize this does not generalize well to hotness information, but I
> lack the intuition for an abstraction that presents hotness in a useful
> way. Based on a recent proposal for move_phys_pages[2], it seems like
> userspace tiering software would like to move specific physical pages,
> instead of informing the kernel "move x number of hot pages to y
> device". Please advise.
> 
> [2]
> https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> 

Just as a note on this work, this is really a testing interface.  The
end goal is not to merge a user-facing interface like move_phys_pages,
but instead to have something like a triggered kernel task with a
directive of "Promote X pages from Device A".

This work is more of an open collaboration for prototyping, so that we
can assess the usefulness of the hardware hotness collection mechanism
without having to plumb everything through the kernel from the start.

---

More generally on promotion, I have been considering recently a problem
with promoting unmapped pagecache pages - since they are not subject to
NUMA hint faults.  I started looking at PG_accessed and PG_workingset as
a potential mechanism to trigger promotion - but I'm starting to see a
pattern of competing priorities between reclaim (LRU/MGLRU) logic and
promotion logic.

Reclaim is triggered largely under memory pressure - which means co-opting
reclaim logic for promotion is at best logically confusing, and at worst
likely to introduce regressions.  The LRU/MGLRU logic is written largely
for reclaim, not promotion.  This makes hacking promotion in after the
fact rather dubious - the design choices don't match.

One example: if a page moves from inactive->active (or old->young), we
could treat this as a page "becoming hot" and mark it for promotion, but
this potentially punishes pages on the "active/younger" lists which are
themselves hotter.

I'm starting to think separate demotion/reclaim and promotion components
are warranted. This could take the form of a separate kernel worker that
occasionally gets scheduled to manage a promotion list, or even the
addition of a PG_promote flag to decouple reclaim and promotion logic
completely.  Separating the structures entirely would be good to allow
both demotion/reclaim and promotion to occur concurrently (although this
seems problematic under memory pressure).

Would like to know your thoughts here.  If we can decide to segregate
promotion and demotion logic, it might go a long way to simplify the
existing interfaces and formalize transactions between the two.

(also if you're going to LPC, might be worth a chat in person)

~Gregory

Yuanchu Xie Aug. 26, 2024, 11:43 p.m. UTC | #4
On Tue, Aug 20, 2024 at 6:00 AM Gregory Price <gourry@gourry.net> wrote:
>
> On Tue, Aug 13, 2024 at 09:56:11AM -0700, Yuanchu Xie wrote:
> > This patch series provides workingset reporting of user pages in
> > lruvecs, whose coldness can be tracked by accessed bits and fd
> > references. However, the concept of a workingset applies generically to
> > all types of memory, which could be kernel slab caches, discardable
> > userspace caches (databases), or CXL.mem. Therefore, data sources might
> > come from slab shrinkers, device drivers, or userspace. IMO, the kernel
> > should provide a set of workingset interfaces generic enough to
> > accommodate the various use cases and extensible to potential future
> > ones. The currently proposed interfaces are not sufficient in that
> > regard, but I would like to start somewhere, solicit feedback, and
> > iterate.
> >
> ... snip ...
> > Use cases
> > ==========
> > Promotion/Demotion
> > If different mechanisms are used for promotion and demotion, workingset
> > information can help connect the two and avoid pages being migrated back
> > and forth.
> > For example, consider a promotion hot-page threshold defined as a
> > reaccess distance of N seconds (promote pages accessed more often than
> > every N seconds). The threshold N should be set so that ~80% (e.g.) of
> > pages on the fast memory node pass it. This calculation can be done
> > with workingset reports.
> > To be directly useful for promotion policies, the workingset report
> > interfaces need to be extended to report hotness and gather hotness
> > information from the devices[1].
> >
> > [1]
> > https://www.opencompute.org/documents/ocp-cms-hotness-tracking-requirements-white-paper-pdf-1
> >
> > Sysfs and Cgroup Interfaces
> > ==========
> > The interfaces are detailed in the patches that introduce them. The main
> > idea is that we break down the workingset per node and per memcg into
> > time intervals (ms), e.g.
> >
> > 1000 anon=137368 file=24530
> > 20000 anon=34342 file=0
> > 30000 anon=353232 file=333608
> > 40000 anon=407198 file=206052
> > 9223372036854775807 anon=4925624 file=892892
> >
> > I realize this does not generalize well to hotness information, but I
> > lack the intuition for an abstraction that presents hotness in a useful
> > way. Based on a recent proposal for move_phys_pages[2], it seems like
> > userspace tiering software would like to move specific physical pages,
> > instead of informing the kernel "move x number of hot pages to y
> > device". Please advise.
> >
> > [2]
> > https://lore.kernel.org/lkml/20240319172609.332900-1-gregory.price@memverge.com/
> >
>
> Just as a note on this work, this is really a testing interface.  The
> end-goal is not to merge such an interface that is user-facing like
> move_phys_pages, but instead to have something like a triggered kernel
> task that has a directive of "Promote X pages from Device A".
>
> This work is more of an open collaboration for prototyping such that we
> don't have to plumb it through the kernel from the start and assess the
> usefulness of the hardware hotness collection mechanism.

Understood. I think we previously had this exchange and I forgot to
remove the mentions from the cover letter.

>
> ---
>
> More generally on promotion, I have been considering recently a problem
> with promoting unmapped pagecache pages - since they are not subject to
> NUMA hint faults.  I started looking at PG_accessed and PG_workingset as
> a potential mechanism to trigger promotion - but i'm starting to see a
> pattern of competing priorities between reclaim (LRU/MGLRU) logic and
> promotion logic.

In this case, IMO hardware support would be good as it could provide
the kernel with exactly which pages are hot, and it would not care
whether a page is mapped or not. I recall there being some CXL
proposal on this, but I'm not sure whether it has settled into a
standard yet.

>
> Reclaim is triggered largely under memory pressure - which means co-opting
> reclaim logic for promotion is at best logically confusing, and at worst
> likely to introduce regressions.  The LRU/MGLRU logic is written largely
> for reclaim, not promotion.  This makes hacking promotion in after the
> fact rather dubious - the design choices don't match.
>
> One example: if a page moves from inactive->active (or old->young), we
> could treat this as a page "becoming hot" and mark it for promotion, but
> this potentially punishes pages on the "active/younger" lists which are
> themselves hotter.

To avoid punishing pages on the "young" list, one could insert the
page into a "less young" generation, but it would be difficult to have
a fixed policy for this in the kernel, so it may be best for this to
be configurable via BPF. One could insert the page in the middle of
the active/inactive list, but that would in effect create multiple
generations.

>
> I'm starting to think separate demotion/reclaim and promotion components
> are warranted. This could take the form of a separate kernel worker that
> occasionally gets scheduled to manage a promotion list, or even the
> addition of a PG_promote flag to decouple reclaim and promotion logic
> completely.  Separating the structures entirely would be good to allow
> both demotion/reclaim and promotion to occur concurrently (although this
> seems problematic under memory pressure).
>
> Would like to know your thoughts here.  If we can decide to segregate
> promotion and demotion logic, it might go a long way to simplify the
> existing interfaces and formalize transactions between the two.

The two systems still have to interact, so separating the two would
essentially create a new policy that decides whether the
demotion/reclaim or the promotion policy is in effect. If promotion
could figure out where to insert the page in terms of generations,
wouldn't that be simpler?

>
> (also if you're going to LPC, might be worth a chat in person)

I cannot make it to LPC. :( Sadness

Yuanchu