From patchwork Wed Mar 27 21:31:00 2024
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 783387
Date: Wed, 27 Mar 2024 14:31:00 -0700
Message-ID: <20240327213108.2384666-2-yuanchu@google.com>
In-Reply-To: <20240327213108.2384666-1-yuanchu@google.com>
References: <20240327213108.2384666-1-yuanchu@google.com>
Subject: [RFC PATCH v3 1/8] mm: multi-gen LRU: ignore non-leaf pmd_young for force_scan=true
From: Yuanchu Xie
To: David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz, Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying
Cc: Wei Xu, David Rientjes, Greg Kroah-Hartman, "Rafael J. Wysocki", Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Shuah Khan, Yosry Ahmed, Matthew Wilcox, Sudarshan Rajagopalan, Kairui Song, "Michael S. Tsirkin", Vasily Averin, Nhat Pham, Miaohe Lin, Qi Zheng, Abel Wu, "Vishal Moola (Oracle)", Kefeng Wang, Yuanchu Xie, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org

When non-leaf pmd accessed bits are available, MGLRU page table walks can
clear the non-leaf pmd accessed bit and promptly ignore the accessed bit
on the pte because the pte belongs to a different node, so the walk does
not update the generation of that page. When the next scan comes around
on the right node, the non-leaf pmd accessed bit may remain cleared and
the pte accessed bits will not be checked. While this is sufficient for
reclaim-driven aging, where the goal is to select a reasonably cold page,
the access can be missed when aging proactively to measure the working
set size of a node or memcg.

Since force_scan already disables various other optimizations, check
force_scan and ignore the non-leaf pmd accessed bit when it is set.
Signed-off-by: Yuanchu Xie
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4f9c854ce6cc..1a7c7d537db6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3522,7 +3522,7 @@ static void walk_pmd_range(pud_t *pud, unsigned long start, unsigned long end,

 		walk->mm_stats[MM_NONLEAF_TOTAL]++;

-		if (should_clear_pmd_young()) {
+		if (!walk->force_scan && should_clear_pmd_young()) {
 			if (!pmd_young(val))
 				continue;

From patchwork Wed Mar 27 21:31:01 2024
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 783792
Date: Wed, 27 Mar 2024 14:31:01 -0700
Message-ID: <20240327213108.2384666-3-yuanchu@google.com>
In-Reply-To: <20240327213108.2384666-1-yuanchu@google.com>
References: <20240327213108.2384666-1-yuanchu@google.com>
Subject: [RFC PATCH v3 2/8] mm: aggregate working set information into histograms
From: Yuanchu Xie
To: David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz, Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying
Cc: Wei Xu, David Rientjes, Greg Kroah-Hartman, "Rafael J. Wysocki", Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Shuah Khan, Yosry Ahmed, Matthew Wilcox, Sudarshan Rajagopalan, Kairui Song, "Michael S. Tsirkin", Vasily Averin, Nhat Pham, Miaohe Lin, Qi Zheng, Abel Wu, "Vishal Moola (Oracle)", Kefeng Wang, Yuanchu Xie, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org

Hierarchically aggregate all memcgs' MGLRU generations and their page
counts into working set page age histograms. The histograms break down
the system's working set per node and per anon/file type.

The sysfs interfaces are as follows:

/sys/devices/system/node/nodeX/page_age
	A per-node page age histogram, showing an aggregate of the
	node's lruvecs.
	The information is extracted from MGLRU's per-generation page
	counters. Reading this file causes a hierarchical aging of all
	lruvecs, scanning pages and creating a new generation in each
	lruvec.

	For example:
		1000 anon=0 file=0
		2000 anon=0 file=0
		100000 anon=5533696 file=5566464
		18446744073709551615 anon=0 file=0

/sys/devices/system/node/nodeX/page_age_interval
	A comma-separated list of times in milliseconds that configures
	the bin boundaries the page age histogram uses for aggregation.

Signed-off-by: Yuanchu Xie
---
 drivers/base/node.c               |   3 +
 include/linux/mmzone.h            |   4 +
 include/linux/workingset_report.h |  69 +++++
 mm/Kconfig                        |   9 +
 mm/Makefile                       |   1 +
 mm/internal.h                     |   9 +
 mm/memcontrol.c                   |   2 +
 mm/mmzone.c                       |   2 +
 mm/vmscan.c                       |  34 ++-
 mm/workingset_report.c            | 413 ++++++++++++++++++++++++++++++
 10 files changed, 545 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/workingset_report.h
 create mode 100644 mm/workingset_report.c

diff --git a/drivers/base/node.c b/drivers/base/node.c
index 1c05640461dd..4f589b8253f4 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include <linux/workingset_report.h>

 static const struct bus_type node_subsys = {
 	.name = "node",
@@ -625,6 +626,7 @@ static int register_node(struct node *node, int num)
 	} else {
 		hugetlb_register_node(node);
 		compaction_register_node(node);
+		wsr_register_node(node);
 	}

 	return error;
@@ -641,6 +643,7 @@ void unregister_node(struct node *node)
 {
 	hugetlb_unregister_node(node);
 	compaction_unregister_node(node);
+	wsr_unregister_node(node);

 	node_remove_accesses(node);
 	node_remove_caches(node);
 	device_unregister(&node->dev);

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a497f189d988..8839931646ee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include <linux/workingset_report.h>

 /* Free memory management - zoned buddy allocator.
  */
 #ifndef CONFIG_ARCH_FORCE_MAX_ORDER
@@ -625,6 +626,9 @@ struct lruvec {
 	struct lru_gen_mm_state mm_state;
 #endif
 #endif /* CONFIG_LRU_GEN */
+#ifdef CONFIG_WORKINGSET_REPORT
+	struct wsr_state wsr;
+#endif /* CONFIG_WORKINGSET_REPORT */
 #ifdef CONFIG_MEMCG
 	struct pglist_data *pgdat;
 #endif

diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
new file mode 100644
index 000000000000..0de640cb1ef0
--- /dev/null
+++ b/include/linux/workingset_report.h
@@ -0,0 +1,69 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_WORKINGSET_REPORT_H
+#define _LINUX_WORKINGSET_REPORT_H
+
+#include
+#include
+
+struct mem_cgroup;
+struct pglist_data;
+struct node;
+struct lruvec;
+
+#ifdef CONFIG_WORKINGSET_REPORT
+
+#define WORKINGSET_REPORT_MIN_NR_BINS 2
+#define WORKINGSET_REPORT_MAX_NR_BINS 32
+
+#define WORKINGSET_INTERVAL_MAX ((unsigned long)-1)
+#define ANON_AND_FILE 2
+
+struct wsr_report_bin {
+	unsigned long idle_age;
+	unsigned long nr_pages[ANON_AND_FILE];
+};
+
+struct wsr_report_bins {
+	unsigned long nr_bins;
+	/* last bin contains WORKINGSET_INTERVAL_MAX */
+	struct wsr_report_bin bins[WORKINGSET_REPORT_MAX_NR_BINS];
+};
+
+struct wsr_page_age_histo {
+	unsigned long timestamp;
+	struct wsr_report_bins bins;
+};
+
+struct wsr_state {
+	/* breakdown of workingset by page age */
+	struct mutex page_age_lock;
+	struct wsr_page_age_histo *page_age;
+};
+
+void wsr_init(struct lruvec *lruvec);
+void wsr_destroy(struct lruvec *lruvec);
+
+/*
+ * Returns true if the wsr is configured to be refreshed.
+ * The next refresh time is stored in refresh_time.
+ */
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat);
+void wsr_register_node(struct node *node);
+void wsr_unregister_node(struct node *node);
+#else
+static inline void wsr_init(struct lruvec *lruvec)
+{
+}
+static inline void wsr_destroy(struct lruvec *lruvec)
+{
+}
+static inline void wsr_register_node(struct node *node)
+{
+}
+static inline void wsr_unregister_node(struct node *node)
+{
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
+#endif /* _LINUX_WORKINGSET_REPORT_H */

diff --git a/mm/Kconfig b/mm/Kconfig
index ffc3a2ba3a8c..212f203b10b9 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1261,6 +1261,15 @@ config LOCK_MM_AND_FIND_VMA
 config IOMMU_MM_DATA
 	bool

+config WORKINGSET_REPORT
+	bool "Working set reporting"
+	depends on LRU_GEN && SYSFS
+	help
+	  Report system and per-memcg working set to userspace.
+
+	  This option exports stats and events giving the user more insight
+	  into its memory working set.
+
 source "mm/damon/Kconfig"

 endmenu

diff --git a/mm/Makefile b/mm/Makefile
index e4b5b75aaec9..57093657030d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@ obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
+obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif

diff --git a/mm/internal.h b/mm/internal.h
index f309a010d50f..5e0caba64ee4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -198,12 +198,21 @@ extern unsigned long highest_memmap_pfn;
 /*
  * in mm/vmscan.c:
  */
+struct scan_control;
 bool isolate_lru_page(struct page *page);
 bool folio_isolate_lru(struct folio *folio);
 void putback_lru_page(struct page *page);
 void folio_putback_lru(struct folio *folio);
 extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason);

+#ifdef CONFIG_WORKINGSET_REPORT
+/*
+ * in mm/wsr.c
+ */
+/* Requires wsr->page_age_lock held */
+void wsr_refresh_scan(struct lruvec *lruvec);
+#endif
+
 /*
  * in mm/rmap.c:
  */

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ed40f9d3a27..2f07141de16c 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -65,6 +65,7 @@
 #include
 #include
 #include
+#include <linux/workingset_report.h>
 #include "internal.h"
 #include
 #include
@@ -5457,6 +5458,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
 	if (!pn)
 		return;

+	wsr_destroy(&pn->lruvec);
 	free_percpu(pn->lruvec_stats_percpu);
 	kfree(pn);
 }

diff --git a/mm/mmzone.c b/mm/mmzone.c
index c01896eca736..efca44c1b84b 100644
--- a/mm/mmzone.c
+++ b/mm/mmzone.c
@@ -90,6 +90,8 @@ void lruvec_init(struct lruvec *lruvec)
 	 */
 	list_del(&lruvec->lists[LRU_UNEVICTABLE]);

+	wsr_init(lruvec);
+
 	lru_gen_init_lruvec(lruvec);
 }

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1a7c7d537db6..b694d80ab2d1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -56,6 +56,7 @@
 #include
 #include
 #include
+#include <linux/workingset_report.h>
 #include
 #include
@@ -3815,7 +3816,7 @@ static bool inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 	return success;
 }

-static bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
+bool try_to_inc_max_seq(struct lruvec *lruvec, unsigned long max_seq,
 			struct scan_control *sc, bool can_swap, bool force_scan)
 {
 	bool success;
@@ -5606,6 +5607,8 @@ static int __init init_lru_gen(void)
 	if (sysfs_create_group(mm_kobj, &lru_gen_attr_group))
 		pr_err("lru_gen: failed to create sysfs group\n");

+	wsr_register_node(NULL);
+
 	debugfs_create_file("lru_gen", 0644, NULL, NULL, &lru_gen_rw_fops);
 	debugfs_create_file("lru_gen_full", 0444, NULL, NULL, &lru_gen_ro_fops);

@@ -5613,6 +5616,35 @@ static int __init init_lru_gen(void)
 };

 late_initcall(init_lru_gen);

+/******************************************************************************
+ *                          workingset reporting
+ ******************************************************************************/
+#ifdef CONFIG_WORKINGSET_REPORT
+void wsr_refresh_scan(struct lruvec *lruvec)
+{
+	DEFINE_MAX_SEQ(lruvec);
+	struct scan_control sc = {
+		.may_writepage = true,
+		.may_unmap = true,
+		.may_swap = true,
+		.proactive = true,
+		.reclaim_idx = MAX_NR_ZONES - 1,
+		.gfp_mask = GFP_KERNEL,
+	};
+	unsigned int flags;
+
+	set_task_reclaim_state(current, &sc.reclaim_state);
+	flags = memalloc_noreclaim_save();
+	/*
+	 * setting can_swap=true and force_scan=true ensures
+	 * proper workingset stats when the system cannot swap.
+	 */
+	try_to_inc_max_seq(lruvec, max_seq, &sc, true, true);
+	memalloc_noreclaim_restore(flags);
+	set_task_reclaim_state(current, NULL);
+}
+#endif /* CONFIG_WORKINGSET_REPORT */
+
 #else /* !CONFIG_LRU_GEN */

 static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)

diff --git a/mm/workingset_report.c b/mm/workingset_report.c
new file mode 100644
index 000000000000..98cdaffcb6b4
--- /dev/null
+++ b/mm/workingset_report.c
@@ -0,0 +1,413 @@
+// SPDX-License-Identifier: GPL-2.0
+//
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include "internal.h"
+
+void wsr_init(struct lruvec *lruvec)
+{
+	struct wsr_state *wsr = &lruvec->wsr;
+
+	memset(wsr, 0, sizeof(*wsr));
+	mutex_init(&wsr->page_age_lock);
+}
+
+void wsr_destroy(struct lruvec *lruvec)
+{
+	struct wsr_state *wsr = &lruvec->wsr;
+
+	mutex_destroy(&wsr->page_age_lock);
+	kfree(wsr->page_age);
+	memset(wsr, 0, sizeof(*wsr));
+}
+
+static int workingset_report_intervals_parse(char *src,
+					     struct wsr_report_bins *bins)
+{
+	int err = 0, i = 0;
+	char *cur, *next = strim(src);
+
+	if (*next == '\0')
+		return 0;
+
+	while ((cur = strsep(&next, ","))) {
+		unsigned int interval;
+
+		err = kstrtouint(cur, 0, &interval);
+		if (err)
+			goto out;
+
+		bins->bins[i].idle_age = msecs_to_jiffies(interval);
+		if (i > 0 && bins->bins[i].idle_age <= bins->bins[i - 1].idle_age) {
+			err = -EINVAL;
+			goto out;
+		}
+
+		if (++i == WORKINGSET_REPORT_MAX_NR_BINS) {
+			err = -ERANGE;
+			goto out;
+		}
+	}
+
+	if (i && i < WORKINGSET_REPORT_MIN_NR_BINS - 1) {
+		err = -ERANGE;
+		goto out;
+	}
+
+	bins->nr_bins = i;
+	bins->bins[i].idle_age = WORKINGSET_INTERVAL_MAX;
+out:
+	return err ?: i;
+}
+
+static unsigned long get_gen_start_time(const struct lru_gen_folio *lrugen,
+					unsigned long seq,
+					unsigned long max_seq,
+					unsigned long curr_timestamp)
+{
+	int younger_gen;
+
+	if (seq == max_seq)
+		return curr_timestamp;
+	younger_gen = lru_gen_from_seq(seq + 1);
+	return READ_ONCE(lrugen->timestamps[younger_gen]);
+}
+
+static void collect_page_age_type(const struct lru_gen_folio *lrugen,
+				  struct wsr_report_bin *bin,
+				  unsigned long max_seq, unsigned long min_seq,
+				  unsigned long curr_timestamp, int type)
+{
+	unsigned long seq;
+
+	for (seq = max_seq; seq + 1 > min_seq; seq--) {
+		int gen, zone;
+		unsigned long gen_end, gen_start, size = 0;
+
+		gen = lru_gen_from_seq(seq);
+
+		for (zone = 0; zone < MAX_NR_ZONES; zone++)
+			size += max(
+				READ_ONCE(lrugen->nr_pages[gen][type][zone]),
+				0L);
+
+		gen_start = get_gen_start_time(lrugen, seq, max_seq,
+					       curr_timestamp);
+		gen_end = READ_ONCE(lrugen->timestamps[gen]);
+
+		while (bin->idle_age != WORKINGSET_INTERVAL_MAX &&
+		       time_before(gen_end + bin->idle_age, curr_timestamp)) {
+			unsigned long gen_in_bin = (long)gen_start -
+						   (long)curr_timestamp +
+						   (long)bin->idle_age;
+			unsigned long gen_len = (long)gen_start - (long)gen_end;
+
+			if (!gen_len)
+				break;
+			if (gen_in_bin) {
+				unsigned long split_bin =
+					size / gen_len * gen_in_bin;
+
+				bin->nr_pages[type] += split_bin;
+				size -= split_bin;
+			}
+			gen_start = curr_timestamp - bin->idle_age;
+			bin++;
+		}
+		bin->nr_pages[type] += size;
+	}
+}
+
+/*
+ * proportionally aggregate Multi-gen LRU bins into a working set report
+ * MGLRU generations:
+ * current time
+ * |        max_seq timestamp
+ * |        |        max_seq - 1 timestamp
+ * |        |        |        unbounded
+ * |        |        |        |
+ * --------------------------------
+ * | max_seq |  ...  |  ...  | min_seq
+ * --------------------------------
+ *
+ * Bins:
+ *
+ * current time
+ * |        current - idle_age[0]
+ * |        |        current - idle_age[1]
+ * |        |        |        unbounded
+ * |        |        |        |
+ * ------------------------------
+ * | bin 0  |  ...  |  ...  | bin n-1
+ * ------------------------------
+ *
+ * Assume the heuristic that pages are in the MGLRU generation
+ * through uniform accesses, so we can aggregate them
+ * proportionally into bins.
+ */
+static void collect_page_age(struct wsr_page_age_histo *page_age,
+			     const struct lruvec *lruvec)
+{
+	int type;
+	const struct lru_gen_folio *lrugen = &lruvec->lrugen;
+	unsigned long curr_timestamp = jiffies;
+	unsigned long max_seq = READ_ONCE((lruvec)->lrugen.max_seq);
+	unsigned long min_seq[ANON_AND_FILE] = {
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_ANON]),
+		READ_ONCE(lruvec->lrugen.min_seq[LRU_GEN_FILE]),
+	};
+	struct wsr_report_bins *bins = &page_age->bins;
+
+	for (type = 0; type < ANON_AND_FILE; type++) {
+		struct wsr_report_bin *bin = &bins->bins[0];
+
+		collect_page_age_type(lrugen, bin, max_seq, min_seq[type],
+				      curr_timestamp, type);
+	}
+}
+
+/* First step: hierarchically scan child memcgs. */
+static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
+			 struct pglist_data *pgdat)
+{
+	struct mem_cgroup *memcg;
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		wsr_refresh_scan(lruvec);
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+}
+
+/* Second step: aggregate child memcgs into the page age histogram. */
+static void refresh_aggregate(struct wsr_page_age_histo *page_age,
+			      struct mem_cgroup *root,
+			      struct pglist_data *pgdat)
+{
+	struct mem_cgroup *memcg;
+	struct wsr_report_bin *bin;
+
+	/*
+	 * page_age_intervals should free the page_age struct
+	 * if no intervals are provided.
+	 */
+	VM_WARN_ON_ONCE(page_age->bins.bins[0].idle_age ==
+			WORKINGSET_INTERVAL_MAX);
+
+	for (bin = page_age->bins.bins;
+	     bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) {
+		bin->nr_pages[0] = 0;
+		bin->nr_pages[1] = 0;
+	}
+	/* the last used bin has idle_age == WORKINGSET_INTERVAL_MAX. */
+	bin->nr_pages[0] = 0;
+	bin->nr_pages[1] = 0;
+
+	memcg = mem_cgroup_iter(root, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+
+		collect_page_age(page_age, lruvec);
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
+	WRITE_ONCE(page_age->timestamp, jiffies);
+}
+
+bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
+			struct pglist_data *pgdat)
+{
+	struct wsr_page_age_histo *page_age;
+
+	if (!READ_ONCE(wsr->page_age))
+		return false;
+
+	refresh_scan(wsr, root, pgdat);
+	mutex_lock(&wsr->page_age_lock);
+	page_age = READ_ONCE(wsr->page_age);
+	if (page_age)
+		refresh_aggregate(page_age, root, pgdat);
+	mutex_unlock(&wsr->page_age_lock);
+	return !!page_age;
+}
+EXPORT_SYMBOL_GPL(wsr_refresh_report);
+
+static struct pglist_data *kobj_to_pgdat(struct kobject *kobj)
+{
+	int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id :
+					    first_memory_node;
+
+	return NODE_DATA(nid);
+}
+
+static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
+{
+	return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
+}
+
+static ssize_t page_age_intervals_show(struct kobject *kobj,
+				       struct kobj_attribute *attr, char *buf)
+{
+	int len = 0;
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+	mutex_lock(&wsr->page_age_lock);
+
+	if (!!wsr->page_age) {
+		int i;
+		int nr_bins = wsr->page_age->bins.nr_bins;
+
+		for (i = 0; i < nr_bins; ++i) {
+			struct wsr_report_bin *bin =
+				&wsr->page_age->bins.bins[i];
+
+			len += sysfs_emit_at(buf, len, "%u",
+					     jiffies_to_msecs(bin->idle_age));
+			if (i + 1 < nr_bins)
+				len += sysfs_emit_at(buf, len, ",");
+		}
+	}
+	len += sysfs_emit_at(buf, len, "\n");
+
+	mutex_unlock(&wsr->page_age_lock);
+	return len;
+}
+
+static ssize_t page_age_intervals_store(struct kobject *kobj,
+					struct kobj_attribute *attr,
+					const char *src, size_t len)
+{
+	struct wsr_page_age_histo *page_age = NULL, *old;
+	char *buf = NULL;
+	int err = 0;
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+	buf = kstrdup(src, GFP_KERNEL);
+	if (!buf) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	page_age =
+		kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL_ACCOUNT);
+	if (!page_age) {
+		err = -ENOMEM;
+		goto failed;
+	}
+
+	err = workingset_report_intervals_parse(buf, &page_age->bins);
+	if (err < 0)
+		goto failed;
+
+	if (err == 0) {
+		kfree(page_age);
+		page_age = NULL;
+	}
+
+	mutex_lock(&wsr->page_age_lock);
+	old = xchg(&wsr->page_age, page_age);
+	mutex_unlock(&wsr->page_age_lock);
+	kfree(old);
+	kfree(buf);
+	return len;
+failed:
+	kfree(page_age);
+	kfree(buf);
+
+	return err;
+}
+
+static struct kobj_attribute page_age_intervals_attr =
+	__ATTR_RW(page_age_intervals);
+
+static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
+			     char *buf)
+{
+	struct wsr_report_bin *bin;
+	int ret = 0;
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+	if (!READ_ONCE(wsr->page_age))
+		return -EINVAL;
+
+	wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
+
+	mutex_lock(&wsr->page_age_lock);
+	if (!wsr->page_age) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	for (bin = wsr->page_age->bins.bins;
+	     bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++)
+		ret += sysfs_emit_at(buf, ret, "%u anon=%lu file=%lu\n",
+				     jiffies_to_msecs(bin->idle_age),
+				     bin->nr_pages[0] * PAGE_SIZE,
+				     bin->nr_pages[1] * PAGE_SIZE);
+
+	ret += sysfs_emit_at(buf, ret, "%lu anon=%lu file=%lu\n",
+			     WORKINGSET_INTERVAL_MAX,
+			     bin->nr_pages[0] * PAGE_SIZE,
+			     bin->nr_pages[1] * PAGE_SIZE);
+
+unlock:
+	mutex_unlock(&wsr->page_age_lock);
+	return ret;
+}
+
+static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
+
+static struct attribute *workingset_report_attrs[] = {
+	&page_age_intervals_attr.attr, &page_age_attr.attr, NULL
+};
+
+static const struct attribute_group workingset_report_attr_group = {
+	.name = "workingset_report",
+	.attrs = workingset_report_attrs,
+};
+
+void wsr_register_node(struct node *node)
+{
+	struct kobject *kobj = node ? &node->dev.kobj : mm_kobj;
+	struct wsr_state *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+
+	if (sysfs_create_group(kobj, &workingset_report_attr_group)) {
+		pr_warn("WSR failed to create group\n");
+		return;
+	}
+}
+EXPORT_SYMBOL_GPL(wsr_register_node);
+
+void wsr_unregister_node(struct node *node)
+{
+	struct kobject *kobj = &node->dev.kobj;
+	struct wsr_state *wsr;
+
+	if (IS_ENABLED(CONFIG_NUMA) && !node)
+		return;
+
+	wsr = kobj_to_wsr(kobj);
+	sysfs_remove_group(kobj, &workingset_report_attr_group);
+	wsr_destroy(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj)));
+}
+EXPORT_SYMBOL_GPL(wsr_unregister_node);

From patchwork Wed Mar 27 21:31:02 2024
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 783386
kvWbYqwEdlFfRy/cLtIli6CAataHcT+OevtsQIGnYrqDXHgvAr78rQtjpaAp+uLqfFRp cDrA== X-Forwarded-Encrypted: i=1; AJvYcCXxEOrqFcFCmX53wy97c2OaXzOYMNJvr4r/eJr8sdpkXN+pvceVM9a8IWwtLLxwfxEH6C9gylrYfQ7kVSK/0nPUVQv2Z15vT8PA5FZeUPgk X-Gm-Message-State: AOJu0Yz4Pjl3Nj6XBBBU1g/ZSOysud9EuJ0O4WKE7brVsqKVuIzMKi0b +O4m5vNy8rXDLGf7esGIx6vkR6WoSpDB7DzSBOEyoFd9gqVV/iK7eLJOkTOuh05mCNSmweABcii kM7Y2+Q== X-Google-Smtp-Source: AGHT+IGSGqv5KUSJmM8bk0cJQQF7BQEKotxJoozxsxA3rktWM2RtKyv04koLrGenDZXEe0nH0BWimglXW+HF X-Received: from yuanchu-desktop.svl.corp.google.com ([2620:15c:2a3:200:6df3:ef42:a58e:a6b1]) (user=yuanchu job=sendgmr) by 2002:a0d:ca8b:0:b0:60a:e67:2ed0 with SMTP id m133-20020a0dca8b000000b0060a0e672ed0mr212398ywd.9.1711575094066; Wed, 27 Mar 2024 14:31:34 -0700 (PDT) Date: Wed, 27 Mar 2024 14:31:02 -0700 In-Reply-To: <20240327213108.2384666-1-yuanchu@google.com> Precedence: bulk X-Mailing-List: linux-kselftest@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Mime-Version: 1.0 References: <20240327213108.2384666-1-yuanchu@google.com> X-Mailer: git-send-email 2.44.0.396.g6e790dbe36-goog Message-ID: <20240327213108.2384666-4-yuanchu@google.com> Subject: [RFC PATCH v3 3/8] mm: use refresh interval to rate-limit workingset report aggregation From: Yuanchu Xie To: David Hildenbrand , "Aneesh Kumar K.V" , Khalid Aziz , Henry Huang , Yu Zhao , Dan Williams , Gregory Price , Huang Ying Cc: Wei Xu , David Rientjes , Greg Kroah-Hartman , "Rafael J. Wysocki" , Andrew Morton , Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Shuah Khan , Yosry Ahmed , Matthew Wilcox , Sudarshan Rajagopalan , Kairui Song , "Michael S. Tsirkin" , Vasily Averin , Nhat Pham , Miaohe Lin , Qi Zheng , Abel Wu , "Vishal Moola (Oracle)" , Kefeng Wang , Yuanchu Xie , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org The refresh interval is a rate limiting factor to workingset page age histogram reads. 
When a workingset report is generated, a timestamp is recorded, and
reads return the same report until it ages past the refresh interval,
at which point a new report is generated.

Sysfs interface:
/sys/devices/system/node/nodeX/workingset_report/refresh_interval
	time in milliseconds specifying how long the report is valid

Signed-off-by: Yuanchu Xie
---
 include/linux/workingset_report.h |  1 +
 mm/internal.h                     |  2 +-
 mm/vmscan.c                       | 27 ++++++++------
 mm/workingset_report.c            | 58 ++++++++++++++++++++++++++-----
 4 files changed, 69 insertions(+), 19 deletions(-)

diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index 0de640cb1ef0..23d2ae747a31 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -35,6 +35,7 @@ struct wsr_page_age_histo {
 };
 
 struct wsr_state {
+	unsigned long refresh_interval;
 	/* breakdown of workingset by page age */
 	struct mutex page_age_lock;
 	struct wsr_page_age_histo *page_age;
diff --git a/mm/internal.h b/mm/internal.h
index 5e0caba64ee4..151f09c6983e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -210,7 +210,7 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
  * in mm/wsr.c
  */
 /* Requires wsr->page_age_lock held */
-void wsr_refresh_scan(struct lruvec *lruvec);
+void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval);
 #endif
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index b694d80ab2d1..5f04a04f5261 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5620,7 +5620,7 @@ late_initcall(init_lru_gen);
  * workingset reporting
  ******************************************************************************/
 #ifdef CONFIG_WORKINGSET_REPORT
-void wsr_refresh_scan(struct lruvec *lruvec)
+void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval)
 {
 	DEFINE_MAX_SEQ(lruvec);
 	struct scan_control sc = {
@@ -5633,15 +5633,22 @@ void wsr_refresh_scan(struct lruvec *lruvec)
 	};
 	unsigned int flags;
 
-	set_task_reclaim_state(current,
&sc.reclaim_state);
-	flags = memalloc_noreclaim_save();
-	/*
-	 * setting can_swap=true and force_scan=true ensures
-	 * proper workingset stats when the system cannot swap.
-	 */
-	try_to_inc_max_seq(lruvec, max_seq, &sc, true, true);
-	memalloc_noreclaim_restore(flags);
-	set_task_reclaim_state(current, NULL);
+	if (refresh_interval) {
+		int gen = lru_gen_from_seq(max_seq);
+		unsigned long birth = READ_ONCE(lruvec->lrugen.timestamps[gen]);
+
+		if (time_is_before_jiffies(birth + refresh_interval)) {
+			set_task_reclaim_state(current, &sc.reclaim_state);
+			flags = memalloc_noreclaim_save();
+			/*
+			 * setting can_swap=true and force_scan=true ensures
+			 * proper workingset stats when the system cannot swap.
+			 */
+			try_to_inc_max_seq(lruvec, max_seq, &sc, true, true);
+			memalloc_noreclaim_restore(flags);
+			set_task_reclaim_state(current, NULL);
+		}
+	}
 }
 #endif /* CONFIG_WORKINGSET_REPORT */
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index 98cdaffcb6b4..370e7d355604 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -181,7 +181,8 @@ static void collect_page_age(struct wsr_page_age_histo *page_age,
 /* First step: hierarchically scan child memcgs.
 */
 static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
-			 struct pglist_data *pgdat)
+			 struct pglist_data *pgdat,
+			 unsigned long refresh_interval)
 {
 	struct mem_cgroup *memcg;
@@ -189,7 +190,7 @@ static void refresh_scan(struct wsr_state *wsr, struct mem_cgroup *root,
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
 
-		wsr_refresh_scan(lruvec);
+		wsr_refresh_scan(lruvec, refresh_interval);
 		cond_resched();
 	} while ((memcg = mem_cgroup_iter(root, memcg, NULL)));
 }
@@ -231,16 +232,25 @@ static void refresh_aggregate(struct wsr_page_age_histo *page_age,
 bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
 			struct pglist_data *pgdat)
 {
-	struct wsr_page_age_histo *page_age;
+	struct wsr_page_age_histo *page_age = NULL;
+	unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval);
 
 	if (!READ_ONCE(wsr->page_age))
 		return false;
 
-	refresh_scan(wsr, root, pgdat);
+	if (!refresh_interval)
+		return false;
+
 	mutex_lock(&wsr->page_age_lock);
 	page_age = READ_ONCE(wsr->page_age);
-	if (page_age)
-		refresh_aggregate(page_age, root, pgdat);
+	if (!page_age)
+		goto unlock;
+	if (time_is_after_jiffies(page_age->timestamp + refresh_interval))
+		goto unlock;
+	refresh_scan(wsr, root, pgdat, refresh_interval);
+	refresh_aggregate(page_age, root, pgdat);
+
+unlock:
 	mutex_unlock(&wsr->page_age_lock);
 	return !!page_age;
 }
@@ -259,6 +269,35 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
 	return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
 }
 
+static ssize_t refresh_interval_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+	unsigned int interval = READ_ONCE(wsr->refresh_interval);
+
+	return sysfs_emit(buf, "%u\n", jiffies_to_msecs(interval));
+}
+
+static ssize_t refresh_interval_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t len)
+{
+	unsigned int interval;
+	int err;
+	struct wsr_state *wsr =
kobj_to_wsr(kobj);
+
+	err = kstrtouint(buf, 0, &interval);
+	if (err)
+		return err;
+
+	WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(interval));
+
+	return len;
+}
+
+static struct kobj_attribute refresh_interval_attr =
+	__ATTR_RW(refresh_interval);
+
 static ssize_t page_age_intervals_show(struct kobject *kobj,
 				       struct kobj_attribute *attr, char *buf)
 {
@@ -267,7 +306,7 @@ static ssize_t page_age_intervals_show(struct kobject *kobj,
 
 	mutex_lock(&wsr->page_age_lock);
 
-	if (!!wsr->page_age) {
+	if (wsr->page_age) {
 		int i;
 		int nr_bins = wsr->page_age->bins.nr_bins;
 
@@ -373,7 +412,10 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
 static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
 
 static struct attribute *workingset_report_attrs[] = {
-	&page_age_intervals_attr.attr, &page_age_attr.attr, NULL
+	&refresh_interval_attr.attr,
+	&page_age_intervals_attr.attr,
+	&page_age_attr.attr,
+	NULL
 };
 
 static const struct attribute_group workingset_report_attr_group = {

From patchwork Wed Mar 27 21:31:03 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 783791
Date: Wed, 27 Mar 2024 14:31:03 -0700
In-Reply-To: <20240327213108.2384666-1-yuanchu@google.com>
References: <20240327213108.2384666-1-yuanchu@google.com>
Message-ID: <20240327213108.2384666-5-yuanchu@google.com>
Subject: [RFC PATCH v3 4/8] mm: report workingset during memory pressure driven scanning
From: Yuanchu Xie
To: David Hildenbrand , "Aneesh Kumar K.V" , Khalid Aziz , Henry Huang , Yu Zhao , Dan Williams , Gregory Price , Huang Ying
Cc: Wei Xu , David Rientjes , Greg Kroah-Hartman , "Rafael J. Wysocki" , Andrew Morton , Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Shuah Khan , Yosry Ahmed , Matthew Wilcox , Sudarshan Rajagopalan , Kairui Song , "Michael S.
Tsirkin" , Vasily Averin , Nhat Pham , Miaohe Lin , Qi Zheng , Abel Wu , "Vishal Moola (Oracle)" , Kefeng Wang , Yuanchu Xie , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org

When a node reaches its low watermarks and wakes up kswapd, notify all
userspace programs waiting on the workingset page age histogram of the
memory pressure, so a userspace agent can read the workingset report in
time and make policy decisions, such as logging, OOM-killing, or
migration.

Sysfs interface:
/sys/devices/system/node/nodeX/workingset_report/report_threshold
	time in milliseconds that specifies how often the userspace
	agent can be notified of node memory pressure

Signed-off-by: Yuanchu Xie
---
 include/linux/workingset_report.h |  4 +++
 mm/internal.h                     |  6 +++++
 mm/vmscan.c                       | 44 +++++++++++++++++++++++++++++++
 mm/workingset_report.c            | 39 +++++++++++++++++++++++++++
 4 files changed, 93 insertions(+)

diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index 23d2ae747a31..589d240d6251 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -35,7 +35,11 @@ struct wsr_page_age_histo {
 };
 
 struct wsr_state {
+	unsigned long report_threshold;
 	unsigned long refresh_interval;
+
+	struct kernfs_node *page_age_sys_file;
+
 	/* breakdown of workingset by page age */
 	struct mutex page_age_lock;
 	struct wsr_page_age_histo *page_age;
diff --git a/mm/internal.h b/mm/internal.h
index 151f09c6983e..36480c7ac0dd 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -209,8 +209,14 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason
 /*
  * in mm/wsr.c
  */
+void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat);
 /* Requires wsr->page_age_lock held */
 void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval);
+#else
+static inline void notify_workingset(struct mem_cgroup *memcg,
+				     struct pglist_data *pgdat)
+{
+}
 #endif
 
 /*
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5f04a04f5261..c6acd5265b3f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2535,6 +2535,15 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
 	return can_demote(pgdat->node_id, sc);
 }
 
+#ifdef CONFIG_WORKINGSET_REPORT
+static void try_to_report_workingset(struct pglist_data *pgdat, struct scan_control *sc);
+#else
+static inline void try_to_report_workingset(struct pglist_data *pgdat,
+					    struct scan_control *sc)
+{
+}
+#endif
+
 #ifdef CONFIG_LRU_GEN
 
 #ifdef CONFIG_LRU_GEN_ENABLED
@@ -3936,6 +3945,8 @@ static void lru_gen_age_node(struct pglist_data *pgdat, struct scan_control *sc)
 	if (!min_ttl || sc->order || sc->priority == DEF_PRIORITY)
 		return;
 
+	try_to_report_workingset(pgdat, sc);
+
 	memcg = mem_cgroup_iter(NULL, NULL, NULL);
 	do {
 		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
@@ -5650,6 +5661,36 @@ void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval)
 		}
 	}
 }
+
+static void try_to_report_workingset(struct pglist_data *pgdat,
+				     struct scan_control *sc)
+{
+	struct mem_cgroup *memcg = sc->target_mem_cgroup;
+	struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
+	unsigned long threshold = READ_ONCE(wsr->report_threshold);
+
+	if (sc->priority == DEF_PRIORITY)
+		return;
+
+	if (!threshold)
+		return;
+
+	if (!mutex_trylock(&wsr->page_age_lock))
+		return;
+
+	if (!wsr->page_age) {
+		mutex_unlock(&wsr->page_age_lock);
+		return;
+	}
+
+	if (time_is_after_jiffies(wsr->page_age->timestamp + threshold)) {
+		mutex_unlock(&wsr->page_age_lock);
+		return;
+	}
+
+	mutex_unlock(&wsr->page_age_lock);
+	notify_workingset(memcg, pgdat);
+}
 #endif /* CONFIG_WORKINGSET_REPORT */
 
 #else /* !CONFIG_LRU_GEN */
@@ -6177,6 +6218,9 @@ static void shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
 		if (zone->zone_pgdat == last_pgdat)
 			continue;
 		last_pgdat = zone->zone_pgdat;
+
+		if (!sc->proactive)
+			try_to_report_workingset(zone->zone_pgdat, sc);
 		shrink_node(zone->zone_pgdat, sc);
 	}
 
diff
--git a/mm/workingset_report.c b/mm/workingset_report.c
index 370e7d355604..3ed3b0e8f8ad 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -269,6 +269,33 @@ static struct wsr_state *kobj_to_wsr(struct kobject *kobj)
 	return &mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj))->wsr;
 }
 
+static ssize_t report_threshold_show(struct kobject *kobj,
+				     struct kobj_attribute *attr, char *buf)
+{
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+	unsigned int threshold = READ_ONCE(wsr->report_threshold);
+
+	return sysfs_emit(buf, "%u\n", jiffies_to_msecs(threshold));
+}
+
+static ssize_t report_threshold_store(struct kobject *kobj,
+				      struct kobj_attribute *attr,
+				      const char *buf, size_t len)
+{
+	unsigned int threshold;
+	struct wsr_state *wsr = kobj_to_wsr(kobj);
+
+	if (kstrtouint(buf, 0, &threshold))
+		return -EINVAL;
+
+	WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(threshold));
+
+	return len;
+}
+
+static struct kobj_attribute report_threshold_attr =
+	__ATTR_RW(report_threshold);
+
 static ssize_t refresh_interval_show(struct kobject *kobj,
 				     struct kobj_attribute *attr, char *buf)
 {
@@ -412,6 +439,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
 static struct kobj_attribute page_age_attr = __ATTR_RO(page_age);
 
 static struct attribute *workingset_report_attrs[] = {
+	&report_threshold_attr.attr,
 	&refresh_interval_attr.attr,
 	&page_age_intervals_attr.attr,
 	&page_age_attr.attr,
@@ -437,6 +465,9 @@ void wsr_register_node(struct node *node)
 		pr_warn("WSR failed to create group\n");
 		return;
 	}
+
+	wsr->page_age_sys_file =
+		kernfs_walk_and_get(kobj->sd, "workingset_report/page_age");
 }
 EXPORT_SYMBOL_GPL(wsr_register_node);
 
@@ -450,6 +481,14 @@ void wsr_unregister_node(struct node *node)
 	wsr = kobj_to_wsr(kobj);
 	sysfs_remove_group(kobj, &workingset_report_attr_group);
+	kernfs_put(wsr->page_age_sys_file);
 	wsr_destroy(mem_cgroup_lruvec(NULL, kobj_to_pgdat(kobj)));
 }
 EXPORT_SYMBOL_GPL(wsr_unregister_node);
+
+void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat)
+{
+	struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr;
+
+	kernfs_notify(wsr->page_age_sys_file);
+}

From patchwork Wed Mar 27 21:31:04 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 783385
Date: Wed, 27 Mar 2024 14:31:04 -0700
In-Reply-To: <20240327213108.2384666-1-yuanchu@google.com>
References: <20240327213108.2384666-1-yuanchu@google.com>
Message-ID: <20240327213108.2384666-6-yuanchu@google.com>
Subject: [RFC PATCH v3 5/8] mm: extend working set reporting to memcgs
From: Yuanchu Xie
To: David Hildenbrand , "Aneesh Kumar K.V" , Khalid Aziz , Henry Huang , Yu Zhao , Dan Williams , Gregory Price , Huang Ying
Cc: Wei Xu , David Rientjes , Greg Kroah-Hartman , "Rafael J. Wysocki" , Andrew Morton , Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Shuah Khan , Yosry Ahmed , Matthew Wilcox , Sudarshan Rajagopalan , Kairui Song , "Michael S. Tsirkin" , Vasily Averin , Nhat Pham , Miaohe Lin , Qi Zheng , Abel Wu , "Vishal Moola (Oracle)" , Kefeng Wang , Yuanchu Xie , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org

Break down the system-wide working set reporting into per-memcg
reports, which aggregate their children hierarchically. The per-node
working set reporting histograms and refresh/report threshold files are
presented as memcg files, showing a report containing all the nodes.

Memcg interface:
/sys/fs/cgroup/.../memory.workingset.page_age
	The memcg equivalent of the sysfs workingset page age histogram;
	breaks down the workingset of this memcg and its children into
	page age intervals. Each node is prefixed with a node header and
	a newline. Non-proactive direct reclaim on this memcg can also
	wake up userspace agents that are waiting on this file.
	e.g.
	N0
	1000 anon=0 file=0
	2000 anon=0 file=0
	3000 anon=0 file=0
	4000 anon=0 file=0
	5000 anon=0 file=0
	18446744073709551615 anon=0 file=0

/sys/fs/cgroup/.../memory.workingset.page_age_intervals
	Configures the intervals for the page age histogram. This file
	operates on a per-node basis, allowing different intervals for
	each node.
	e.g.
	echo N0=1000,2000,3000,4000,5000 > memory.workingset.page_age_intervals

/sys/fs/cgroup/.../memory.workingset.refresh_interval
	The memcg equivalent of the sysfs refresh interval. A per-node
	number of how long a page age histogram is valid for, in
	milliseconds.
	e.g.
	echo N0=2000 > memory.workingset.refresh_interval

/sys/fs/cgroup/.../memory.workingset.report_threshold
	The memcg equivalent of the sysfs report threshold. A per-node
	number of how often a userspace agent waiting on the page age
	histogram can be woken up, in milliseconds.
	e.g.
	echo N0=1000 > memory.workingset.report_threshold

Signed-off-by: Yuanchu Xie
---
 include/linux/memcontrol.h        |   5 +
 include/linux/workingset_report.h |   6 +-
 mm/internal.h                     |   2 +
 mm/memcontrol.c                   | 267 +++++++++++++++++++++++++++++-
 mm/workingset_report.c            |  10 +-
 5 files changed, 286 insertions(+), 4 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20ff87f8e001..7d7bc0928961 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -335,6 +335,11 @@ struct mem_cgroup {
 	struct lru_gen_mm_list mm_list;
 #endif
 
+#ifdef CONFIG_WORKINGSET_REPORT
+	/* memory.workingset.page_age file */
+	struct cgroup_file workingset_page_age_file;
+#endif
+
 	struct mem_cgroup_per_node *nodeinfo[];
 };
 
diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h
index 589d240d6251..502542c812b3 100644
--- a/include/linux/workingset_report.h
+++ b/include/linux/workingset_report.h
@@ -9,6 +9,7 @@ struct mem_cgroup;
 struct pglist_data;
 struct node;
 struct lruvec;
+struct cgroup_file;
 
 #ifdef CONFIG_WORKINGSET_REPORT
 
@@ -38,7 +39,10 @@ struct wsr_state {
unsigned long report_threshold; unsigned long refresh_interval; - struct kernfs_node *page_age_sys_file; + union { + struct kernfs_node *page_age_sys_file; + struct cgroup_file *page_age_cgroup_file; + }; /* breakdown of workingset by page age */ struct mutex page_age_lock; diff --git a/mm/internal.h b/mm/internal.h index 36480c7ac0dd..3730c8399ad4 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -212,6 +212,8 @@ extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat); /* Requires wsr->page_age_lock held */ void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval); +int workingset_report_intervals_parse(char *src, + struct wsr_report_bins *bins); #else static inline void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 2f07141de16c..75bda5f7994d 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7005,6 +7005,245 @@ static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf, return nbytes; } +#ifdef CONFIG_WORKINGSET_REPORT +static int memory_ws_page_age_intervals_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr; + struct wsr_page_age_histo *page_age; + int i, nr_bins; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + mutex_lock(&wsr->page_age_lock); + page_age = wsr->page_age; + if (!page_age) + goto no_page_age; + seq_printf(m, "N%d=", nid); + nr_bins = page_age->bins.nr_bins; + for (i = 0; i < nr_bins; ++i) { + struct wsr_report_bin *bin = + &page_age->bins.bins[i]; + + seq_printf(m, "%u", jiffies_to_msecs(bin->idle_age)); + if (i + 1 < nr_bins) + seq_putc(m, ','); + } + seq_putc(m, ' '); +no_page_age: + mutex_unlock(&wsr->page_age_lock); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_wsr_interval_parse(struct 
kernfs_open_file *of, char *buf, + size_t nbytes, unsigned int *nid_out, + struct wsr_report_bins *bins) +{ + char *node, *intervals; + unsigned int nid; + int err; + + buf = strstrip(buf); + intervals = buf; + node = strsep(&intervals, "="); + + if (*node != 'N') + return -EINVAL; + + err = kstrtouint(node + 1, 0, &nid); + if (err) + return err; + + if (nid >= nr_node_ids || !node_state(nid, N_MEMORY)) + return -EINVAL; + + err = workingset_report_intervals_parse(intervals, bins); + if (err < 0) + return err; + + *nid_out = nid; + return err; +} + +static ssize_t memory_ws_page_age_intervals_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid; + int err; + struct wsr_page_age_histo *page_age = NULL, *old; + struct wsr_state *wsr; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + page_age = + kzalloc(sizeof(struct wsr_page_age_histo), GFP_KERNEL_ACCOUNT); + + if (!page_age) { + err = -ENOMEM; + goto failed; + } + + err = memory_wsr_interval_parse(of, buf, nbytes, &nid, &page_age->bins); + if (err < 0) + goto failed; + + if (err == 0) { + kfree(page_age); + page_age = NULL; + } + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + mutex_lock(&wsr->page_age_lock); + old = xchg(&wsr->page_age, page_age); + mutex_unlock(&wsr->page_age_lock); + kfree(old); + return nbytes; +failed: + kfree(page_age); + return err; +} + +static int memory_ws_refresh_interval_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + + seq_printf(m, "N%d=%u ", nid, + jiffies_to_msecs(READ_ONCE(wsr->refresh_interval))); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_wsr_threshold_parse(char *buf, size_t nbytes, + unsigned int *nid_out, + unsigned int *msecs) +{ + char *node, *threshold; + unsigned int nid; + int err; + + buf = strstrip(buf); + 
threshold = buf; + node = strsep(&threshold, "="); + + if (*node != 'N') + return -EINVAL; + + err = kstrtouint(node + 1, 0, &nid); + if (err) + return err; + + if (nid >= nr_node_ids || !node_state(nid, N_MEMORY)) + return -EINVAL; + + err = kstrtouint(threshold, 0, msecs); + if (err) + return err; + + *nid_out = nid; + + return nbytes; +} + +static ssize_t memory_ws_refresh_interval_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid, msecs; + struct wsr_state *wsr; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs); + + if (ret < 0) + return ret; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(msecs)); + return ret; +} + +static int memory_ws_report_threshold_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + + seq_printf(m, "N%d=%u ", nid, + jiffies_to_msecs(READ_ONCE(wsr->report_threshold))); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_ws_report_threshold_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid, msecs; + struct wsr_state *wsr; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs); + + if (ret < 0) + return ret; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + WRITE_ONCE(wsr->report_threshold, msecs_to_jiffies(msecs)); + return ret; +} + +static int memory_ws_page_age_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + struct wsr_report_bin *bin; + + if 
(!READ_ONCE(wsr->page_age)) + continue; + + wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + mutex_lock(&wsr->page_age_lock); + if (!wsr->page_age) + goto unlock; + seq_printf(m, "N%d\n", nid); + for (bin = wsr->page_age->bins.bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) + seq_printf(m, "%u anon=%lu file=%lu\n", + jiffies_to_msecs(bin->idle_age), + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + + seq_printf(m, "%lu anon=%lu file=%lu\n", WORKINGSET_INTERVAL_MAX, + bin->nr_pages[0] * PAGE_SIZE, + bin->nr_pages[1] * PAGE_SIZE); + +unlock: + mutex_unlock(&wsr->page_age_lock); + } + + return 0; +} +#endif + static struct cftype memory_files[] = { { .name = "current", @@ -7073,7 +7312,33 @@ static struct cftype memory_files[] = { .flags = CFTYPE_NS_DELEGATABLE, .write = memory_reclaim, }, - { } /* terminate */ +#ifdef CONFIG_WORKINGSET_REPORT + { + .name = "workingset.page_age_intervals", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_page_age_intervals_show, + .write = memory_ws_page_age_intervals_write, + }, + { + .name = "workingset.refresh_interval", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_refresh_interval_show, + .write = memory_ws_refresh_interval_write, + }, + { + .name = "workingset.report_threshold", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_report_threshold_show, + .write = memory_ws_report_threshold_write, + }, + { + .name = "workingset.page_age", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .file_offset = offsetof(struct mem_cgroup, workingset_page_age_file), + .seq_show = memory_ws_page_age_show, + }, +#endif + {} /* terminate */ }; struct cgroup_subsys memory_cgrp_subsys = { diff --git a/mm/workingset_report.c b/mm/workingset_report.c index 3ed3b0e8f8ad..b00ffbfebcab 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -20,9 +20,12 @@ void wsr_init(struct lruvec *lruvec) { struct wsr_state *wsr 
= &lruvec->wsr; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); memset(wsr, 0, sizeof(*wsr)); mutex_init(&wsr->page_age_lock); + if (memcg && !mem_cgroup_is_root(memcg)) + wsr->page_age_cgroup_file = &memcg->workingset_page_age_file; } void wsr_destroy(struct lruvec *lruvec) @@ -34,7 +37,7 @@ void wsr_destroy(struct lruvec *lruvec) memset(wsr, 0, sizeof(*wsr)); } -static int workingset_report_intervals_parse(char *src, +int workingset_report_intervals_parse(char *src, struct wsr_report_bins *bins) { int err = 0, i = 0; @@ -490,5 +493,8 @@ void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) { struct wsr_state *wsr = &mem_cgroup_lruvec(memcg, pgdat)->wsr; - kernfs_notify(wsr->page_age_sys_file); + if (mem_cgroup_is_root(memcg)) + kernfs_notify(wsr->page_age_sys_file); + else + cgroup_file_notify(wsr->page_age_cgroup_file); } From patchwork Wed Mar 27 21:31:05 2024
Date: Wed, 27 Mar 2024 14:31:05 -0700 In-Reply-To: <20240327213108.2384666-1-yuanchu@google.com> References: <20240327213108.2384666-1-yuanchu@google.com> Message-ID: <20240327213108.2384666-7-yuanchu@google.com> Subject: [RFC PATCH v3 6/8] mm: add per-memcg reaccess histogram From: Yuanchu Xie To: David Hildenbrand , "Aneesh Kumar K.V" , Khalid Aziz , Henry Huang , Yu Zhao , Dan Williams , Gregory Price , Huang Ying Cc: Wei Xu , David Rientjes , Greg Kroah-Hartman , "Rafael J. Wysocki" , Andrew Morton , Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Shuah Khan , Yosry Ahmed , Matthew Wilcox , Sudarshan Rajagopalan , Kairui Song , "Michael S. 
Tsirkin" , Vasily Averin , Nhat Pham , Miaohe Lin , Qi Zheng , Abel Wu , "Vishal Moola (Oracle)" , Kefeng Wang , Yuanchu Xie , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org A reaccess refers to detecting an access on a page via refault or access bit harvesting after the initial access. Similar to the working set histogram, the reaccess histogram breaks down reaccesses into user-defined bins. It tracks reaccesses from MGLRU walks, where a move from older generations to the young generation counts as a reaccess. Swapped out pages are tracked with the generation number encoded in mm/workingset.c, and additional tracking is added for enabled memory cgroups to track an additional 4 swapped out generations. Memcg interfaces: /sys/fs/cgroup/.../memory.workingset.reaccess The format is identical to memory.workingset.page_age, but the content breaks down reaccesses into pre-defined intervals. e.g. N0 1000 anon=6330 file=0 2000 anon=72 file=0 4000 anon=0 file=0 18446744073709551615 anon=0 file=0 N1 18446744073709551615 anon=0 file=0 /sys/fs/cgroup/.../memory.workingset.reaccess_intervals Defines the per-node intervals for memory.workingset.reaccess. e.g. 
echo N0=120000,240000,480000 > memory.workingset.reaccess_intervals Signed-off-by: Yuanchu Xie --- include/linux/workingset_report.h | 20 +++ mm/internal.h | 28 ++++ mm/memcontrol.c | 112 ++++++++++++++ mm/vmscan.c | 8 +- mm/workingset.c | 9 +- mm/workingset_report.c | 249 ++++++++++++++++++++++++++++++ 6 files changed, 419 insertions(+), 7 deletions(-) diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index 502542c812b3..e908c5678b1e 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -4,6 +4,7 @@ #include #include +#include struct mem_cgroup; struct pglist_data; @@ -19,6 +20,12 @@ struct cgroup_file; #define WORKINGSET_INTERVAL_MAX ((unsigned long)-1) #define ANON_AND_FILE 2 +/* + * MAX_NR_EVICTED_GENS is set to 4 so we can track the same number of + * generations as MGLRU has resident. + */ +#define MAX_NR_EVICTED_GENS 4 + struct wsr_report_bin { unsigned long idle_age; unsigned long nr_pages[ANON_AND_FILE]; @@ -35,6 +42,18 @@ struct wsr_page_age_histo { struct wsr_report_bins bins; }; +struct wsr_evicted_gen { + unsigned long timestamp; + int seq; +}; + +struct wsr_reaccess_histo { + struct rcu_head rcu; + /* evicted gens start from min_seq[LRU_GEN_ANON] - 1 */ + struct wsr_evicted_gen gens[MAX_NR_EVICTED_GENS]; + struct wsr_report_bins bins; +}; + struct wsr_state { unsigned long report_threshold; unsigned long refresh_interval; @@ -47,6 +66,7 @@ struct wsr_state { /* breakdown of workingset by page age */ struct mutex page_age_lock; struct wsr_page_age_histo *page_age; + struct wsr_reaccess_histo __rcu *reaccess; }; void wsr_init(struct lruvec *lruvec); diff --git a/mm/internal.h b/mm/internal.h index 3730c8399ad4..077340b526e8 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -205,16 +205,44 @@ void putback_lru_page(struct page *page); void folio_putback_lru(struct folio *folio); extern void reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason); +/* + * in mm/workingset.c 
+ */ +#define WORKINGSET_SHIFT 1 +#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ + WORKINGSET_SHIFT + NODES_SHIFT + \ + MEM_CGROUP_ID_SHIFT) +#define EVICTION_MASK (~0UL >> EVICTION_SHIFT) + #ifdef CONFIG_WORKINGSET_REPORT /* * in mm/wsr.c */ +void report_lru_gen_eviction(struct lruvec *lruvec, int type, int min_seq); +void lru_gen_report_reaccess(struct lruvec *lruvec, + struct lru_gen_mm_walk *walk); +void report_reaccess_refault(struct lruvec *lruvec, unsigned long token, + int type, int nr_pages); void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat); /* Requires wsr->page_age_lock held */ void wsr_refresh_scan(struct lruvec *lruvec, unsigned long refresh_interval); int workingset_report_intervals_parse(char *src, struct wsr_report_bins *bins); #else +struct lru_gen_mm_walk; +static inline void report_lru_gen_eviction(struct lruvec *lruvec, int type, + int min_seq) +{ +} +static inline void lru_gen_report_reaccess(struct lruvec *lruvec, + struct lru_gen_mm_walk *walk) +{ +} +static inline void report_reaccess_refault(struct lruvec *lruvec, + unsigned long token, int type, + int nr_pages) +{ +} static inline void notify_workingset(struct mem_cgroup *memcg, struct pglist_data *pgdat) { diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 75bda5f7994d..2a39a4445bb7 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -7108,6 +7108,71 @@ static ssize_t memory_ws_page_age_intervals_write(struct kernfs_open_file *of, return err; } +static int memory_ws_reaccess_intervals_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr; + struct wsr_reaccess_histo *reaccess; + int i, nr_bins; + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) + goto unlock; + seq_printf(m, "N%d=", nid); + nr_bins = reaccess->bins.nr_bins; + for (i = 0; i < 
nr_bins; ++i) { + struct wsr_report_bin *bin = &reaccess->bins.bins[i]; + + seq_printf(m, "%u", jiffies_to_msecs(bin->idle_age)); + if (i + 1 < nr_bins) + seq_putc(m, ','); + } + seq_putc(m, ' '); +unlock: + rcu_read_unlock(); + } + seq_putc(m, '\n'); + + return 0; +} + +static ssize_t memory_ws_reaccess_intervals_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, + loff_t off) +{ + unsigned int nid; + int err; + struct wsr_state *wsr; + struct wsr_reaccess_histo *reaccess = NULL, *old; + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + + reaccess = kzalloc(sizeof(struct wsr_reaccess_histo), GFP_KERNEL); + if (!reaccess) + return -ENOMEM; + + err = memory_wsr_interval_parse(of, buf, nbytes, &nid, &reaccess->bins); + if (err < 0) + goto failed; + + if (err == 0) { + kfree(reaccess); + reaccess = NULL; + } + + wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + old = xchg(&wsr->reaccess, reaccess); + kfree_rcu(old, rcu); + return nbytes; +failed: + kfree(reaccess); + return err; +} + static int memory_ws_refresh_interval_show(struct seq_file *m, void *v) { int nid; @@ -7242,6 +7307,42 @@ static int memory_ws_page_age_show(struct seq_file *m, void *v) return 0; } + +static int memory_ws_reaccess_histogram_show(struct seq_file *m, void *v) +{ + int nid; + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + for_each_node_state(nid, N_MEMORY) { + struct wsr_state *wsr = + &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr; + struct wsr_reaccess_histo *reaccess; + struct wsr_report_bin *bin; + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + + if (!reaccess) + goto unlock; + + wsr_refresh_report(wsr, memcg, NODE_DATA(nid)); + + seq_printf(m, "N%d\n", nid); + for (bin = reaccess->bins.bins; + bin->idle_age != WORKINGSET_INTERVAL_MAX; bin++) + seq_printf(m, "%u anon=%lu file=%lu\n", + jiffies_to_msecs(bin->idle_age), + bin->nr_pages[0], bin->nr_pages[1]); + + seq_printf(m, "%lu anon=%lu file=%lu\n", WORKINGSET_INTERVAL_MAX, + 
bin->nr_pages[0], bin->nr_pages[1]); + +unlock: + rcu_read_unlock(); + } + + return 0; +} #endif static struct cftype memory_files[] = { @@ -7337,6 +7438,17 @@ static struct cftype memory_files[] = { .file_offset = offsetof(struct mem_cgroup, workingset_page_age_file), .seq_show = memory_ws_page_age_show, }, + { + .name = "workingset.reaccess_intervals", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_reaccess_intervals_show, + .write = memory_ws_reaccess_intervals_write, + }, + { + .name = "workingset.reaccess", + .flags = CFTYPE_NOT_ON_ROOT | CFTYPE_NS_DELEGATABLE, + .seq_show = memory_ws_reaccess_histogram_show, + }, #endif {} /* terminate */ }; diff --git a/mm/vmscan.c b/mm/vmscan.c index c6acd5265b3f..4d9245e2c0d1 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -3637,6 +3637,7 @@ static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_ mem_cgroup_unlock_pages(); if (walk->batched) { + lru_gen_report_reaccess(lruvec, walk); spin_lock_irq(&lruvec->lru_lock); reset_batch_size(lruvec, walk); spin_unlock_irq(&lruvec->lru_lock); @@ -3709,6 +3710,7 @@ static bool inc_min_seq(struct lruvec *lruvec, int type, bool can_swap) } done: reset_ctrl_pos(lruvec, type, true); + report_lru_gen_eviction(lruvec, type, lrugen->min_seq[type] + 1); WRITE_ONCE(lrugen->min_seq[type], lrugen->min_seq[type] + 1); return true; @@ -3750,6 +3752,7 @@ static bool try_to_inc_min_seq(struct lruvec *lruvec, bool can_swap) continue; reset_ctrl_pos(lruvec, type, true); + report_lru_gen_eviction(lruvec, type, min_seq[type]); WRITE_ONCE(lrugen->min_seq[type], min_seq[type]); success = true; } @@ -4565,11 +4568,14 @@ static int evict_folios(struct lruvec *lruvec, struct scan_control *sc, int swap sc->nr_scanned -= folio_nr_pages(folio); } + walk = current->reclaim_state->mm_walk; + if (walk && walk->batched) + lru_gen_report_reaccess(lruvec, walk); + spin_lock_irq(&lruvec->lru_lock); move_folios_to_lru(lruvec, &list); - walk = 
current->reclaim_state->mm_walk; if (walk && walk->batched) reset_batch_size(lruvec, walk); diff --git a/mm/workingset.c b/mm/workingset.c index 226012974328..057fbedd91ea 100644 --- a/mm/workingset.c +++ b/mm/workingset.c @@ -17,6 +17,8 @@ #include #include +#include "internal.h" + /* * Double CLOCK lists * @@ -179,12 +181,6 @@ * refault distance will immediately activate the refaulting page. */ -#define WORKINGSET_SHIFT 1 -#define EVICTION_SHIFT ((BITS_PER_LONG - BITS_PER_XA_VALUE) + \ - WORKINGSET_SHIFT + NODES_SHIFT + \ - MEM_CGROUP_ID_SHIFT) -#define EVICTION_MASK (~0UL >> EVICTION_SHIFT) - /* * Eviction timestamps need to be able to cover the full range of * actionable refaults. However, bits are tight in the xarray @@ -294,6 +290,7 @@ static void lru_gen_refault(struct folio *folio, void *shadow) goto unlock; mod_lruvec_state(lruvec, WORKINGSET_REFAULT_BASE + type, delta); + report_reaccess_refault(lruvec, token, type, delta); if (!recent) goto unlock; diff --git a/mm/workingset_report.c b/mm/workingset_report.c index b00ffbfebcab..504d840bbe6a 100644 --- a/mm/workingset_report.c +++ b/mm/workingset_report.c @@ -34,6 +34,7 @@ void wsr_destroy(struct lruvec *lruvec) mutex_destroy(&wsr->page_age_lock); kfree(wsr->page_age); + kfree_rcu(wsr->reaccess, rcu); memset(wsr, 0, sizeof(*wsr)); } @@ -259,6 +260,254 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root, } EXPORT_SYMBOL_GPL(wsr_refresh_report); +static void lru_gen_collect_reaccess_refault(struct wsr_report_bins *bins, + unsigned long timestamp, int type, + int nr_pages) +{ + unsigned long curr_timestamp = jiffies; + struct wsr_report_bin *bin = &bins->bins[0]; + + while (bin->idle_age != WORKINGSET_INTERVAL_MAX && + time_before(timestamp + bin->idle_age, curr_timestamp)) + bin++; + + bin->nr_pages[type] += nr_pages; +} + +static void collect_reaccess_type(struct lru_gen_mm_walk *walk, + const struct lru_gen_folio *lrugen, + struct wsr_report_bin *bin, + unsigned long max_seq, 
unsigned long min_seq, + unsigned long curr_timestamp, int type) +{ + unsigned long seq; + + /* Skip max_seq because a reaccess moves a page from another seq + * to max_seq. We use the negative change in page count from + * other seqs to track the number of reaccesses. + */ + for (seq = max_seq - 1; seq + 1 > min_seq; seq--) { + int younger_gen, gen, zone; + unsigned long gen_end, gen_start; + long delta = 0; + + gen = lru_gen_from_seq(seq); + + for (zone = 0; zone < MAX_NR_ZONES; zone++) { + long nr_pages = walk->nr_pages[gen][type][zone]; + + if (nr_pages < 0) + delta += -nr_pages; + } + + gen_end = READ_ONCE(lrugen->timestamps[gen]); + younger_gen = lru_gen_from_seq(seq + 1); + gen_start = READ_ONCE(lrugen->timestamps[younger_gen]); + + /* ensure gen_start is within idle_age of bin */ + while (bin->idle_age != WORKINGSET_INTERVAL_MAX && + time_before(gen_start + bin->idle_age, curr_timestamp)) + bin++; + + while (bin->idle_age != WORKINGSET_INTERVAL_MAX && + time_before(gen_end + bin->idle_age, curr_timestamp)) { + unsigned long proportion = (long)gen_start - + (long)curr_timestamp + + (long)bin->idle_age; + unsigned long gen_len = (long)gen_start - (long)gen_end; + + if (!gen_len) + break; + if (proportion) { + unsigned long split_bin = + delta / gen_len * proportion; + bin->nr_pages[type] += split_bin; + delta -= split_bin; + } + gen_start = curr_timestamp - bin->idle_age; + bin++; + } + bin->nr_pages[type] += delta; + } +} + +/* + * Reaccesses are propagated up the memcg hierarchy during scanning/refault. + * Collect the reaccess information from a multi-gen LRU walk. 
+ */ +static void lru_gen_collect_reaccess(struct wsr_report_bins *bins, + struct lru_gen_folio *lrugen, + struct lru_gen_mm_walk *walk) +{ + int type; + unsigned long curr_timestamp = jiffies; + unsigned long max_seq = READ_ONCE(walk->max_seq); + unsigned long min_seq[ANON_AND_FILE] = { + READ_ONCE(lrugen->min_seq[LRU_GEN_ANON]), + READ_ONCE(lrugen->min_seq[LRU_GEN_FILE]), + }; + + for (type = 0; type < ANON_AND_FILE; type++) { + struct wsr_report_bin *bin = &bins->bins[0]; + + collect_reaccess_type(walk, lrugen, bin, max_seq, + min_seq[type], curr_timestamp, type); + } +} + +void lru_gen_report_reaccess(struct lruvec *lruvec, struct lru_gen_mm_walk *walk) +{ + struct lru_gen_folio *lrugen = &lruvec->lrugen; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + for (memcg = lruvec_memcg(lruvec); memcg; + memcg = parent_mem_cgroup(memcg)) { + struct lruvec *memcg_lruvec = + mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); + struct wsr_state *wsr = &memcg_lruvec->wsr; + struct wsr_reaccess_histo *reaccess; + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) { + rcu_read_unlock(); + continue; + } + lru_gen_collect_reaccess(&reaccess->bins, lrugen, walk); + rcu_read_unlock(); + } +} + +static inline int evicted_gen_from_seq(unsigned long seq) +{ + return seq % MAX_NR_EVICTED_GENS; +} + +void report_lru_gen_eviction(struct lruvec *lruvec, int type, int min_seq) +{ + int seq; + struct wsr_reaccess_histo *reaccess = NULL; + struct lru_gen_folio *lrugen = &lruvec->lrugen; + struct wsr_state *wsr = &lruvec->wsr; + + /* + * Since file can go ahead of anon, min_seq[file] >= min_seq[anon] + * only record evictions when anon moves forward. + */ + if (type != LRU_GEN_ANON) + return; + + /* + * lru_lock is held during eviction, so reaccess accounting + * can be serialized. 
+ */ + lockdep_assert_held(&lruvec->lru_lock); + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) + goto unlock; + + for (seq = READ_ONCE(lrugen->min_seq[LRU_GEN_ANON]); seq < min_seq; + ++seq) { + int evicted_gen = evicted_gen_from_seq(seq); + int gen = lru_gen_from_seq(seq); + + WRITE_ONCE(reaccess->gens[evicted_gen].seq, seq); + WRITE_ONCE(reaccess->gens[evicted_gen].timestamp, + READ_ONCE(lrugen->timestamps[gen])); + } + +unlock: + rcu_read_unlock(); +} + +/* + * May yield an incorrect timestamp if the token collides with + * a recently evicted generation. + */ +static int timestamp_from_workingset_token(struct lruvec *lruvec, + unsigned long token, + unsigned long *timestamp) +{ + int type, err = -EEXIST; + unsigned long seq, evicted_min_seq; + struct wsr_reaccess_histo *reaccess = NULL; + struct lru_gen_folio *lrugen = &lruvec->lrugen; + struct wsr_state *wsr = &lruvec->wsr; + unsigned long min_seq[ANON_AND_FILE] = { + READ_ONCE(lrugen->min_seq[LRU_GEN_ANON]), + READ_ONCE(lrugen->min_seq[LRU_GEN_FILE]) + }; + + token >>= LRU_REFS_WIDTH; + + /* recent eviction */ + for (type = 0; type < ANON_AND_FILE; ++type) { + if (token == + (min_seq[type] & (EVICTION_MASK >> LRU_REFS_WIDTH))) { + int gen = lru_gen_from_seq(min_seq[type]); + + *timestamp = READ_ONCE(lrugen->timestamps[gen]); + return 0; + } + } + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) + goto unlock; + + /* look up in evicted gen buffer */ + evicted_min_seq = min_seq[LRU_GEN_ANON] - MAX_NR_EVICTED_GENS; + if (min_seq[LRU_GEN_ANON] < MAX_NR_EVICTED_GENS) + evicted_min_seq = 0; + for (seq = min_seq[LRU_GEN_ANON]; seq > evicted_min_seq; --seq) { + int gen = evicted_gen_from_seq(seq - 1); + + if (token == (reaccess->gens[gen].seq & + (EVICTION_MASK >> LRU_REFS_WIDTH))) { + *timestamp = reaccess->gens[gen].timestamp; + + goto unlock; + } + } + +unlock: + rcu_read_unlock(); + return err; +} + +void report_reaccess_refault(struct lruvec 
*lruvec, unsigned long token, + int type, int nr_pages) +{ + unsigned long timestamp; + int err; + struct mem_cgroup *memcg = lruvec_memcg(lruvec); + + err = timestamp_from_workingset_token(lruvec, token, &timestamp); + if (err) + return; + + for (memcg = lruvec_memcg(lruvec); memcg; + memcg = parent_mem_cgroup(memcg)) { + struct lruvec *memcg_lruvec = + mem_cgroup_lruvec(memcg, lruvec_pgdat(lruvec)); + struct wsr_state *wsr = &memcg_lruvec->wsr; + struct wsr_reaccess_histo *reaccess = NULL; + + rcu_read_lock(); + reaccess = rcu_dereference(wsr->reaccess); + if (!reaccess) { + rcu_read_unlock(); + continue; + } + lru_gen_collect_reaccess_refault(&reaccess->bins, timestamp, + type, nr_pages); + rcu_read_unlock(); + } +} + static struct pglist_data *kobj_to_pgdat(struct kobject *kobj) { int nid = IS_ENABLED(CONFIG_NUMA) ? kobj_to_dev(kobj)->id : From patchwork Wed Mar 27 21:31:06 2024
Date: Wed, 27 Mar 2024 14:31:06 -0700 In-Reply-To: <20240327213108.2384666-1-yuanchu@google.com> References: <20240327213108.2384666-1-yuanchu@google.com> Message-ID: <20240327213108.2384666-8-yuanchu@google.com> Subject: [RFC PATCH v3 7/8] mm: add kernel aging thread for workingset reporting From: Yuanchu Xie To: David Hildenbrand , "Aneesh Kumar K.V" , Khalid Aziz , Henry Huang , Yu Zhao , Dan Williams , Gregory Price , Huang Ying Cc: Wei Xu , David Rientjes , Greg Kroah-Hartman , "Rafael J. Wysocki" , Andrew Morton , Johannes Weiner , Michal Hocko , Roman Gushchin , Muchun Song , Shuah Khan , Yosry Ahmed , Matthew Wilcox , Sudarshan Rajagopalan , Kairui Song , "Michael S. 
Tsirkin" , Vasily Averin , Nhat Pham , Miaohe Lin , Qi Zheng , Abel Wu , "Vishal Moola (Oracle)" , Kefeng Wang , Yuanchu Xie , linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org For reliable and timely aging on memcgs, one has to read the page age histograms on time. A kernel thread makes it easier by aging memcgs with valid page_age_intervals and refresh_interval when they can be refreshed, and also reduces the latency of any userspace consumers of the page age histogram. The kernel aging thread is gated behind CONFIG_WORKINGSET_REPORT_AGING. Debugging stats may be added in the future for when aging cannot keep up with the configured refresh_interval. Signed-off-by: Yuanchu Xie --- include/linux/workingset_report.h | 11 ++- mm/Kconfig | 6 ++ mm/Makefile | 1 + mm/memcontrol.c | 11 ++- mm/workingset_report.c | 14 +++- mm/workingset_report_aging.c | 127 ++++++++++++++++++++++++++++++ 6 files changed, 163 insertions(+), 7 deletions(-) create mode 100644 mm/workingset_report_aging.c diff --git a/include/linux/workingset_report.h b/include/linux/workingset_report.h index e908c5678b1e..759486a3a285 100644 --- a/include/linux/workingset_report.h +++ b/include/linux/workingset_report.h @@ -77,9 +77,18 @@ void wsr_destroy(struct lruvec *lruvec); * The next refresh time is stored in refresh_time. 
 */
 bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
-			struct pglist_data *pgdat);
+			struct pglist_data *pgdat, unsigned long *refresh_time);
 void wsr_register_node(struct node *node);
 void wsr_unregister_node(struct node *node);
+
+#ifdef CONFIG_WORKINGSET_REPORT_AGING
+void wsr_wakeup_aging_thread(void);
+#else /* CONFIG_WORKINGSET_REPORT_AGING */
+static inline void wsr_wakeup_aging_thread(void)
+{
+}
+#endif /* CONFIG_WORKINGSET_REPORT_AGING */
+
 #else
 static inline void wsr_init(struct lruvec *lruvec)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 212f203b10b9..1e6aa1bd63f2 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1270,6 +1270,12 @@ config WORKINGSET_REPORT
 	  This option exports stats and events giving the user more insight
 	  into its memory working set.
 
+config WORKINGSET_REPORT_AGING
+	bool "Workingset report kernel aging thread"
+	depends on WORKINGSET_REPORT
+	help
+	  Performs aging on memcgs with their configured refresh intervals.
+
 source "mm/damon/Kconfig"
 
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 57093657030d..7caae7f2d6cf 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -93,6 +93,7 @@ obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
 obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o
 obj-$(CONFIG_WORKINGSET_REPORT) += workingset_report.o
+obj-$(CONFIG_WORKINGSET_REPORT_AGING) += workingset_report_aging.o
 ifdef CONFIG_SWAP
 obj-$(CONFIG_MEMCG) += swap_cgroup.o
 endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2a39a4445bb7..86e15b9fc8e2 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -7102,6 +7102,8 @@ static ssize_t memory_ws_page_age_intervals_write(struct kernfs_open_file *of,
 	old = xchg(&wsr->page_age, page_age);
 	mutex_unlock(&wsr->page_age_lock);
 	kfree(old);
+	if (err && READ_ONCE(wsr->refresh_interval))
+		wsr_wakeup_aging_thread();
 	return nbytes;
 failed:
 	kfree(page_age);
@@ -7227,14 +7229,18 @@ static ssize_t memory_ws_refresh_interval_write(struct kernfs_open_file *of,
 {
 	unsigned int nid, msecs;
 	struct wsr_state *wsr;
+	unsigned long old_interval;
 	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
 	ssize_t ret = memory_wsr_threshold_parse(buf, nbytes, &nid, &msecs);
 
 	if (ret < 0)
 		return ret;
 
 	wsr = &mem_cgroup_lruvec(memcg, NODE_DATA(nid))->wsr;
+	old_interval = READ_ONCE(wsr->refresh_interval);
 	WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(msecs));
+	if (msecs && (!old_interval || jiffies_to_msecs(old_interval) > msecs))
+		wsr_wakeup_aging_thread();
 	return ret;
 }
@@ -7285,7 +7290,7 @@ static int memory_ws_page_age_show(struct seq_file *m, void *v)
 		if (!READ_ONCE(wsr->page_age))
 			continue;
 
-		wsr_refresh_report(wsr, memcg, NODE_DATA(nid));
+		wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL);
 		mutex_lock(&wsr->page_age_lock);
 		if (!wsr->page_age)
 			goto unlock;
@@ -7325,7 +7330,7 @@ static int memory_ws_reaccess_histogram_show(struct seq_file *m, void *v)
 		if (!reaccess)
 			goto unlock;
 
-		wsr_refresh_report(wsr, memcg, NODE_DATA(nid));
+		wsr_refresh_report(wsr, memcg, NODE_DATA(nid), NULL);
 		seq_printf(m, "N%d\n", nid);
 
 		for (bin = reaccess->bins.bins;
diff --git a/mm/workingset_report.c b/mm/workingset_report.c
index 504d840bbe6a..da658967eac2 100644
--- a/mm/workingset_report.c
+++ b/mm/workingset_report.c
@@ -234,7 +234,7 @@ static void refresh_aggregate(struct wsr_page_age_histo *page_age,
 }
 
 bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
-			struct pglist_data *pgdat)
+			struct pglist_data *pgdat, unsigned long *refresh_time)
 {
 	struct wsr_page_age_histo *page_age = NULL;
 	unsigned long refresh_interval = READ_ONCE(wsr->refresh_interval);
@@ -253,6 +253,8 @@ bool wsr_refresh_report(struct wsr_state *wsr, struct mem_cgroup *root,
 		goto unlock;
 	refresh_scan(wsr, root, pgdat, refresh_interval);
 	refresh_aggregate(page_age, root, pgdat);
+	if (refresh_time)
+		*refresh_time = page_age->timestamp + refresh_interval;
 unlock:
 	mutex_unlock(&wsr->page_age_lock);
@@ -564,12 +566,16 @@ static ssize_t refresh_interval_store(struct kobject *kobj,
 	unsigned int interval;
 	int err;
 	struct wsr_state *wsr = kobj_to_wsr(kobj);
+	unsigned long old_interval;
 
 	err = kstrtouint(buf, 0, &interval);
 	if (err)
 		return err;
 
-	WRITE_ONCE(wsr->refresh_interval, msecs_to_jiffies(interval));
+	old_interval = xchg(&wsr->refresh_interval, msecs_to_jiffies(interval));
+	if (interval &&
+	    (!old_interval || jiffies_to_msecs(old_interval) > interval))
+		wsr_wakeup_aging_thread();
 	return len;
 }
@@ -642,6 +648,8 @@ static ssize_t page_age_intervals_store(struct kobject *kobj,
 	mutex_unlock(&wsr->page_age_lock);
 	kfree(old);
 	kfree(buf);
+	if (err && READ_ONCE(wsr->refresh_interval))
+		wsr_wakeup_aging_thread();
 	return len;
 failed:
 	kfree(page_age);
@@ -663,7 +671,7 @@ static ssize_t page_age_show(struct kobject *kobj, struct kobj_attribute *attr,
 	if (!READ_ONCE(wsr->page_age))
 		return -EINVAL;
 
-	wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj));
+	wsr_refresh_report(wsr, NULL, kobj_to_pgdat(kobj), NULL);
 
 	mutex_lock(&wsr->page_age_lock);
 	if (!wsr->page_age) {
diff --git a/mm/workingset_report_aging.c b/mm/workingset_report_aging.c
new file mode 100644
index 000000000000..91ad5020778a
--- /dev/null
+++ b/mm/workingset_report_aging.c
@@ -0,0 +1,127 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Workingset report kernel aging thread
+ *
+ * Performs aging on behalf of memcgs with their configured refresh interval.
+ * While a userspace program can periodically read the page age breakdown
+ * per-memcg and trigger aging, aging in the kernel has lower overhead and is
+ * more consistent and reliable for the use case where every memcg should be
+ * aged according to its refresh interval.
+ */
+#define pr_fmt(fmt) "workingset report aging: " fmt
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+static DECLARE_WAIT_QUEUE_HEAD(aging_wait);
+static bool refresh_pending;
+
+static bool do_aging_node(int nid, unsigned long *next_wake_time)
+{
+	struct mem_cgroup *memcg;
+	bool should_wait = true;
+	struct pglist_data *pgdat = NODE_DATA(nid);
+
+	memcg = mem_cgroup_iter(NULL, NULL, NULL);
+	do {
+		struct lruvec *lruvec = mem_cgroup_lruvec(memcg, pgdat);
+		struct wsr_state *wsr = &lruvec->wsr;
+		unsigned long refresh_time;
+
+		/* use returned time to decide when to wake up next */
+		if (wsr_refresh_report(wsr, memcg, pgdat, &refresh_time)) {
+			if (should_wait) {
+				should_wait = false;
+				*next_wake_time = refresh_time;
+			} else if (time_before(refresh_time, *next_wake_time)) {
+				*next_wake_time = refresh_time;
+			}
+		}
+
+		cond_resched();
+	} while ((memcg = mem_cgroup_iter(NULL, memcg, NULL)));
+
+	return should_wait;
+}
+
+static int do_aging(void *unused)
+{
+	while (!kthread_should_stop()) {
+		int nid;
+		long timeout_ticks;
+		unsigned long next_wake_time;
+		bool should_wait = true;
+
+		WRITE_ONCE(refresh_pending, false);
+		for_each_node_state(nid, N_MEMORY) {
+			unsigned long node_next_wake_time;
+
+			if (do_aging_node(nid, &node_next_wake_time))
+				continue;
+			if (should_wait) {
+				should_wait = false;
+				next_wake_time = node_next_wake_time;
+			} else if (time_before(node_next_wake_time,
+					       next_wake_time)) {
+				next_wake_time = node_next_wake_time;
+			}
+		}
+
+		if (should_wait) {
+			wait_event_interruptible(aging_wait, refresh_pending);
+			continue;
+		}
+
+		/* sleep until next aging */
+		timeout_ticks = next_wake_time - jiffies;
+		if (timeout_ticks > 0 &&
+		    timeout_ticks != MAX_SCHEDULE_TIMEOUT) {
+			schedule_timeout_idle(timeout_ticks);
+			continue;
+		}
+	}
+	return 0;
+}
+
+/* Invoked when refresh_interval shortens or changes to a non-zero value. */
+void wsr_wakeup_aging_thread(void)
+{
+	WRITE_ONCE(refresh_pending, true);
+	wake_up_interruptible(&aging_wait);
+}
+
+static struct task_struct *aging_thread;
+
+static int aging_init(void)
+{
+	struct task_struct *task;
+
+	task = kthread_run(do_aging, NULL, "kagingd");
+	if (IS_ERR(task)) {
+		pr_err("Failed to create aging kthread\n");
+		return PTR_ERR(task);
+	}
+
+	aging_thread = task;
+	pr_info("module loaded\n");
+	return 0;
+}
+
+static void aging_exit(void)
+{
+	kthread_stop(aging_thread);
+	aging_thread = NULL;
+	pr_info("module unloaded\n");
+}
+
+module_init(aging_init);
+module_exit(aging_exit);
From patchwork Wed Mar 27 21:31:07 2024
X-Patchwork-Submitter: Yuanchu Xie
X-Patchwork-Id: 783789
Date: Wed, 27 Mar 2024 14:31:07 -0700
In-Reply-To: <20240327213108.2384666-1-yuanchu@google.com>
References: <20240327213108.2384666-1-yuanchu@google.com>
Message-ID: <20240327213108.2384666-9-yuanchu@google.com>
X-Mailing-List: linux-kselftest@vger.kernel.org
Subject: [RFC PATCH v3 8/8] mm: test system-wide workingset reporting
From: Yuanchu Xie
To: David Hildenbrand, "Aneesh Kumar K.V", Khalid Aziz, Henry Huang, Yu Zhao, Dan Williams, Gregory Price, Huang Ying
Cc: Wei Xu, David Rientjes, Greg Kroah-Hartman, "Rafael J. Wysocki", Andrew Morton, Johannes Weiner, Michal Hocko, Roman Gushchin, Muchun Song, Shuah Khan, Yosry Ahmed, Matthew Wilcox, Sudarshan Rajagopalan, Kairui Song, "Michael S. Tsirkin", Vasily Averin, Nhat Pham, Miaohe Lin, Qi Zheng, Abel Wu, "Vishal Moola (Oracle)", Kefeng Wang, Yuanchu Xie, linux-kernel@vger.kernel.org, linux-mm@kvack.org, cgroups@vger.kernel.org, linux-kselftest@vger.kernel.org

A basic test that verifies the working set size of a simple memory
accessor. It should work with or without the aging thread.

Question: I don't know how to best test file memory in selftests.
Is there a place where I should put the temporary file? /tmp can be tmpfs mounted in many distros. Signed-off-by: Yuanchu Xie --- tools/testing/selftests/mm/.gitignore | 1 + tools/testing/selftests/mm/Makefile | 3 + .../testing/selftests/mm/workingset_report.c | 315 +++++++++++++++++ .../testing/selftests/mm/workingset_report.h | 37 ++ .../selftests/mm/workingset_report_test.c | 328 ++++++++++++++++++ 5 files changed, 684 insertions(+) create mode 100644 tools/testing/selftests/mm/workingset_report.c create mode 100644 tools/testing/selftests/mm/workingset_report.h create mode 100644 tools/testing/selftests/mm/workingset_report_test.c diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore index 4ff10ea61461..14a2412c8257 100644 --- a/tools/testing/selftests/mm/.gitignore +++ b/tools/testing/selftests/mm/.gitignore @@ -46,3 +46,4 @@ gup_longterm mkdirty va_high_addr_switch hugetlb_fault_after_madv +workingset_report_test diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile index 2453add65d12..c0869bf07e99 100644 --- a/tools/testing/selftests/mm/Makefile +++ b/tools/testing/selftests/mm/Makefile @@ -70,6 +70,7 @@ TEST_GEN_FILES += ksm_tests TEST_GEN_FILES += ksm_functional_tests TEST_GEN_FILES += mdwe_test TEST_GEN_FILES += hugetlb_fault_after_madv +TEST_GEN_FILES += workingset_report_test ifneq ($(ARCH),arm64) TEST_GEN_FILES += soft-dirty @@ -123,6 +124,8 @@ $(TEST_GEN_FILES): vm_util.c thp_settings.c $(OUTPUT)/uffd-stress: uffd-common.c $(OUTPUT)/uffd-unit-tests: uffd-common.c +$(OUTPUT)/workingset_report_test: workingset_report.c + ifeq ($(ARCH),x86_64) BINARIES_32 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_32)) BINARIES_64 := $(patsubst %,$(OUTPUT)/%,$(BINARIES_64)) diff --git a/tools/testing/selftests/mm/workingset_report.c b/tools/testing/selftests/mm/workingset_report.c new file mode 100644 index 000000000000..93387f0f30ee --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report.c @@ 
-0,0 +1,315 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "workingset_report.h" + +#include +#include +#include +#include +#include +#include +#include +#include + +#define SYSFS_NODE_ONLINE "/sys/devices/system/node/online" +#define PROC_DROP_CACHES "/proc/sys/vm/drop_caches" + +/* Returns read len on success, or -errno on failure. */ +static ssize_t read_text(const char *path, char *buf, size_t max_len) +{ + ssize_t len; + int fd, err; + size_t bytes_read = 0; + + if (!max_len) + return -EINVAL; + + fd = open(path, O_RDONLY); + if (fd < 0) + return -errno; + + while (bytes_read < max_len - 1) { + len = read(fd, buf + bytes_read, max_len - 1 - bytes_read); + + if (len <= 0) + break; + bytes_read += len; + } + + buf[bytes_read] = '\0'; + + err = -errno; + close(fd); + return len < 0 ? err : bytes_read; +} + +/* Returns written len on success, or -errno on failure. */ +static ssize_t write_text(const char *path, const char *buf, ssize_t max_len) +{ + int fd, len, err; + size_t bytes_written = 0; + + fd = open(path, O_WRONLY | O_APPEND); + if (fd < 0) + return -errno; + + while (bytes_written < max_len) { + len = write(fd, buf + bytes_written, max_len - bytes_written); + + if (len < 0) + break; + bytes_written += len; + } + + err = -errno; + close(fd); + return len < 0 ? 
err : bytes_written; +} + +static long read_num(const char *path) +{ + char buf[21]; + + if (read_text(path, buf, sizeof(buf)) <= 0) + return -1; + return (long)strtoul(buf, NULL, 10); +} + +static int write_num(const char *path, unsigned long n) +{ + char buf[21]; + + sprintf(buf, "%lu", n); + if (write_text(path, buf, strlen(buf)) < 0) + return -1; + return 0; +} + +long sysfs_get_refresh_interval(int nid) +{ + char file[128]; + + snprintf( + file, + sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/refresh_interval", + nid); + return read_num(file); +} + +int sysfs_set_refresh_interval(int nid, long interval) +{ + char file[128]; + + snprintf( + file, + sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/refresh_interval", + nid); + return write_num(file, interval); +} + +int sysfs_get_page_age_intervals_str(int nid, char *buf, int len) +{ + char path[128]; + + snprintf( + path, + sizeof(path), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return read_text(path, buf, len); + +} + +int sysfs_set_page_age_intervals_str(int nid, const char *buf, int len) +{ + char path[128]; + + snprintf( + path, + sizeof(path), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return write_text(path, buf, len); +} + +int sysfs_set_page_age_intervals(int nid, const char *intervals[], + int nr_intervals) +{ + char file[128]; + char buf[1024]; + int i; + int err, len = 0; + + for (i = 0; i < nr_intervals; ++i) { + err = snprintf(buf + len, sizeof(buf) - len, "%s", intervals[i]); + + if (err < 0) + return err; + len += err; + + if (i < nr_intervals - 1) { + err = snprintf(buf + len, sizeof(buf) - len, ","); + if (err < 0) + return err; + len += err; + } + } + + snprintf( + file, + sizeof(file), + "/sys/devices/system/node/node%d/workingset_report/page_age_intervals", + nid); + return write_text(file, buf, len); +} + +int get_nr_nodes(void) +{ + char buf[22]; + char *found; + + if 
(read_text(SYSFS_NODE_ONLINE, buf, sizeof(buf)) <= 0)
+		return -1;
+	found = strstr(buf, "-");
+	if (found)
+		return (int)strtoul(found + 1, NULL, 10) + 1;
+	return (int)strtoul(buf, NULL, 10) + 1;
+}
+
+int drop_pagecache(void)
+{
+	return write_num(PROC_DROP_CACHES, 1);
+}
+
+ssize_t sysfs_page_age_read(int nid, char *buf, size_t len)
+{
+	char file[128];
+
+	snprintf(file, sizeof(file),
+		 "/sys/devices/system/node/node%d/workingset_report/page_age",
+		 nid);
+	return read_text(file, buf, len);
+}
+
+/*
+ * Finds the first occurrence of "N\n".
+ * Modifies buf to terminate before the next occurrence of "N".
+ * Returns a substring of buf starting after "N\n".
+ */
+char *page_age_split_node(char *buf, int nid, char **next)
+{
+	char node_str[5];
+	char *found;
+	int node_str_len;
+
+	node_str_len = snprintf(node_str, sizeof(node_str), "N%u\n", nid);
+
+	/* find the node prefix first */
+	found = strstr(buf, node_str);
+	if (!found) {
+		fprintf(stderr, "cannot find '%s' in page_idle_age", node_str);
+		return NULL;
+	}
+	found += node_str_len;
+
+	*next = strchr(found, 'N');
+	if (*next)
+		*(*next - 1) = '\0';
+
+	return found;
+}
+
+ssize_t page_age_read(const char *buf, const char *interval, int pagetype)
+{
+	static const char * const type[ANON_AND_FILE] = { "anon=", "file=" };
+	char *found;
+
+	found = strstr(buf, interval);
+	if (!found) {
+		fprintf(stderr, "cannot find %s in page_age", interval);
+		return -1;
+	}
+	found = strstr(found, type[pagetype]);
+	if (!found) {
+		fprintf(stderr, "cannot find %s in page_age", type[pagetype]);
+		return -1;
+	}
+	found += strlen(type[pagetype]);
+	return (long)strtoul(found, NULL, 10);
+}
+
+static const char *TEMP_FILE = "/tmp/workingset_selftest";
+void cleanup_file_workingset(void)
+{
+	remove(TEMP_FILE);
+}
+
+int alloc_file_workingset(void *arg)
+{
+	int err = 0;
+	char *ptr;
+	int fd;
+	int ppid;
+	char *mapped;
+	size_t size = (size_t)arg;
+	size_t page_size = getpagesize();
+
+	ppid = getppid();
+
+	fd = open(TEMP_FILE, O_RDWR | O_CREAT, 0600);
+	if (fd < 0) {
+		err = -errno;
+		perror("failed to open temp file");
+		goto cleanup;
+	}
+
+	if (fallocate(fd, 0, 0, size) < 0) {
+		err = -errno;
+		perror("fallocate");
+		goto cleanup;
+	}
+
+	mapped = (char *)mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
+			      fd, 0);
+	if (mapped == MAP_FAILED) {
+		err = -errno;
+		perror("mmap");
+		goto cleanup;
+	}
+
+	while (getppid() == ppid) {
+		sync();
+		for (ptr = mapped; ptr < mapped + size; ptr += page_size)
+			*ptr = *ptr ^ 0xFF;
+	}
+
+cleanup:
+	cleanup_file_workingset();
+	return err;
+}
+
+int alloc_anon_workingset(void *arg)
+{
+	char *buf, *ptr;
+	int ppid = getppid();
+	size_t size = (size_t)arg;
+	size_t page_size = getpagesize();
+
+	buf = malloc(size);
+	if (!buf) {
+		fprintf(stderr, "cannot allocate anon workingset");
+		exit(1);
+	}
+
+	while (getppid() == ppid) {
+		for (ptr = buf; ptr < buf + size; ptr += page_size)
+			*ptr = *ptr ^ 0xFF;
+	}
+
+	free(buf);
+	return 0;
+}
diff --git a/tools/testing/selftests/mm/workingset_report.h b/tools/testing/selftests/mm/workingset_report.h
new file mode 100644
index 000000000000..f72a931298e0
--- /dev/null
+++ b/tools/testing/selftests/mm/workingset_report.h
@@ -0,0 +1,37 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef WORKINGSET_REPORT_H_
+#define WORKINGSET_REPORT_H_
+
+#define _GNU_SOURCE
+
+#include
+#include
+#include
+#include
+#include
+
+#define PAGETYPE_ANON 0
+#define PAGETYPE_FILE 1
+#define ANON_AND_FILE 2
+
+int get_nr_nodes(void);
+int drop_pagecache(void);
+
+long sysfs_get_refresh_interval(int nid);
+int sysfs_set_refresh_interval(int nid, long interval);
+
+int sysfs_get_page_age_intervals_str(int nid, char *buf, int len);
+int sysfs_set_page_age_intervals_str(int nid, const char *buf, int len);
+
+int sysfs_set_page_age_intervals(int nid, const char *intervals[],
+				 int nr_intervals);
+
+char *page_age_split_node(char *buf, int nid, char **next);
+ssize_t sysfs_page_age_read(int nid, char *buf, size_t len);
+ssize_t page_age_read(const char *buf, const char *interval, int pagetype); + +int alloc_file_workingset(void *arg); +void cleanup_file_workingset(void); +int alloc_anon_workingset(void *arg); + +#endif /* WORKINGSET_REPORT_H_ */ diff --git a/tools/testing/selftests/mm/workingset_report_test.c b/tools/testing/selftests/mm/workingset_report_test.c new file mode 100644 index 000000000000..e6e857d8fe35 --- /dev/null +++ b/tools/testing/selftests/mm/workingset_report_test.c @@ -0,0 +1,328 @@ +// SPDX-License-Identifier: GPL-2.0 +#include "workingset_report.h" + +#include +#include +#include +#include + +#include "../clone3/clone3_selftests.h" + +#define REFRESH_INTERVAL 5000 +#define MB(x) (x << 20) + +static void sleep_ms(int milliseconds) +{ + struct timespec ts; + + ts.tv_sec = milliseconds / 1000; + ts.tv_nsec = (milliseconds % 1000) * 1000000; + nanosleep(&ts, NULL); +} + +/* + * Checks if two given values differ by less than err% of their sum. + */ +static inline int values_close(long a, long b, int err) +{ + return abs(a - b) <= (a + b) / 100 * err; +} + +static const char * const PAGE_AGE_INTERVALS[] = { + "6000", "10000", "15000", "18446744073709551615", +}; +#define NR_PAGE_AGE_INTERVALS (ARRAY_SIZE(PAGE_AGE_INTERVALS)) +/* add one for the catch all last interval */ + +static int set_page_age_intervals_all_nodes(const char *intervals, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = sysfs_set_page_age_intervals_str( + i, &intervals[i * 1024], strlen(&intervals[i * 1024])); + + if (err < 0) + return err; + } + return 0; +} + +static int get_page_age_intervals_all_nodes(char *intervals, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = sysfs_get_page_age_intervals_str( + i, &intervals[i * 1024], 1024); + + if (err < 0) + return err; + } + return 0; +} + +static int set_refresh_interval_all_nodes(const long *interval, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + int err = 
sysfs_set_refresh_interval(i, interval[i]); + + if (err < 0) + return err; + } + return 0; +} + +static int get_refresh_interval_all_nodes(long *interval, int nr_nodes) +{ + int i; + + for (i = 0; i < nr_nodes; ++i) { + long val = sysfs_get_refresh_interval(i); + + if (val < 0) + return val; + interval[i] = val; + } + return 0; +} + +static pid_t clone_and_run(int fn(void *arg), void *arg) +{ + pid_t pid; + + struct __clone_args args = { + .exit_signal = SIGCHLD, + }; + + pid = sys_clone3(&args, sizeof(struct __clone_args)); + + if (pid == 0) + exit(fn(arg)); + + return pid; +} + +static int read_workingset(int pagetype, int nid, + unsigned long page_age[NR_PAGE_AGE_INTERVALS]) +{ + int i, err; + char buf[4096]; + + err = sysfs_page_age_read(nid, buf, sizeof(buf)); + if (err < 0) + return err; + + for (i = 0; i < NR_PAGE_AGE_INTERVALS; ++i) { + err = page_age_read(buf, PAGE_AGE_INTERVALS[i], pagetype); + if (err < 0) + return err; + page_age[i] = err; + } + + return 0; +} + +static ssize_t read_interval_all_nodes(int pagetype, int interval) +{ + int i, err; + unsigned long page_age[NR_PAGE_AGE_INTERVALS]; + ssize_t ret = 0; + int nr_nodes = get_nr_nodes(); + + for (i = 0; i < nr_nodes; ++i) { + err = read_workingset(pagetype, i, page_age); + if (err < 0) + return err; + + ret += page_age[interval]; + } + + return ret; +} + +#define TEST_SIZE MB(500l) + +static int run_test(int f(void)) +{ + int i, err, test_result; + long *old_refresh_intervals; + long *new_refresh_intervals; + char *old_page_age_intervals; + int nr_nodes = get_nr_nodes(); + + if (nr_nodes <= 0) { + fprintf(stderr, "failed to get nr_nodes\n"); + return KSFT_FAIL; + } + + old_refresh_intervals = calloc(nr_nodes, sizeof(long)); + new_refresh_intervals = calloc(nr_nodes, sizeof(long)); + old_page_age_intervals = calloc(nr_nodes, 1024); + + if (!(old_refresh_intervals && new_refresh_intervals && + old_page_age_intervals)) { + fprintf(stderr, "failed to allocate memory for intervals\n"); + return 
KSFT_FAIL; + } + + err = get_refresh_interval_all_nodes(old_refresh_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to read refresh interval\n"); + return KSFT_FAIL; + } + + err = get_page_age_intervals_all_nodes(old_page_age_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to read page age interval\n"); + return KSFT_FAIL; + } + + for (i = 0; i < nr_nodes; ++i) + new_refresh_intervals[i] = REFRESH_INTERVAL; + err = set_refresh_interval_all_nodes(new_refresh_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to set refresh interval\n"); + test_result = KSFT_FAIL; + goto fail; + } + + for (i = 0; i < nr_nodes; ++i) { + err = sysfs_set_page_age_intervals(i, PAGE_AGE_INTERVALS, + NR_PAGE_AGE_INTERVALS - 1); + if (err < 0) { + fprintf(stderr, "failed to set page age interval\n"); + test_result = KSFT_FAIL; + goto fail; + } + } + + sync(); + drop_pagecache(); + + test_result = f(); + +fail: + err = set_refresh_interval_all_nodes(old_refresh_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to restore refresh interval\n"); + test_result = KSFT_FAIL; + } + err = set_page_age_intervals_all_nodes(old_page_age_intervals, nr_nodes); + if (err < 0) { + fprintf(stderr, "failed to restore page age interval\n"); + test_result = KSFT_FAIL; + } + return test_result; +} + +static int test_file(void) +{ + ssize_t ws_size_ref, ws_size_test; + int ret = KSFT_FAIL, i; + pid_t pid = 0; + + ws_size_ref = read_interval_all_nodes(PAGETYPE_FILE, 0); + if (ws_size_ref < 0) + goto cleanup; + + pid = clone_and_run(alloc_file_workingset, (void *)TEST_SIZE); + if (pid < 0) + goto cleanup; + + read_interval_all_nodes(PAGETYPE_FILE, 0); + sleep_ms(REFRESH_INTERVAL); + + for (i = 0; i < 3; ++i) { + sleep_ms(REFRESH_INTERVAL); + ws_size_test = read_interval_all_nodes(PAGETYPE_FILE, 0); + + if (!values_close(ws_size_test - ws_size_ref, TEST_SIZE, 10)) { + fprintf(stderr, + "file working set size difference too large: actual=%ld, 
expected=%ld\n", + ws_size_test - ws_size_ref, TEST_SIZE); + goto cleanup; + } + } + ret = KSFT_PASS; + +cleanup: + if (pid > 0) + kill(pid, SIGKILL); + cleanup_file_workingset(); + return ret; +} + +static int test_anon(void) +{ + ssize_t ws_size_ref, ws_size_test; + pid_t pid = 0; + int ret = KSFT_FAIL, i; + + ws_size_ref = read_interval_all_nodes(PAGETYPE_ANON, 0); + if (ws_size_ref < 0) + goto cleanup; + + pid = clone_and_run(alloc_anon_workingset, (void *)TEST_SIZE); + if (pid < 0) + goto cleanup; + + sleep_ms(REFRESH_INTERVAL); + read_interval_all_nodes(PAGETYPE_ANON, 0); + + for (i = 0; i < 5; ++i) { + sleep_ms(REFRESH_INTERVAL); + ws_size_test = read_interval_all_nodes(PAGETYPE_ANON, 0); + if (ws_size_test < 0) + goto cleanup; + + if (!values_close(ws_size_test - ws_size_ref, TEST_SIZE, 10)) { + fprintf(stderr, + "anon working set size difference too large: actual=%ld, expected=%ld\n", + ws_size_test - ws_size_ref, TEST_SIZE); + /* goto cleanup; */ + } + } + ret = KSFT_PASS; + +cleanup: + if (pid > 0) + kill(pid, SIGKILL); + return ret; +} + + +#define T(x) { x, #x } +struct workingset_test { + int (*fn)(void); + const char *name; +} tests[] = { + T(test_anon), + T(test_file), +}; +#undef T + +int main(int argc, char **argv) +{ + int ret = EXIT_SUCCESS, i, err; + + for (i = 0; i < ARRAY_SIZE(tests); i++) { + err = run_test(tests[i].fn); + switch (err) { + case KSFT_PASS: + ksft_test_result_pass("%s\n", tests[i].name); + break; + case KSFT_SKIP: + ksft_test_result_skip("%s\n", tests[i].name); + break; + default: + ret = EXIT_FAILURE; + ksft_test_result_fail("%s with error %d\n", + tests[i].name, err); + break; + } + } + return ret; +}