Message ID | 20250505182328.4148265-1-joshua.hahnjy@gmail.com
---|---
State | New
Series | [v8] mm/mempolicy: Weighted Interleave Auto-tuning
Hi, Joshua,

Thank you for the updated version! And sorry for the late reply.

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.
>
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
>
> Previously, weighted interleave's default weights were just 1s -- which
> would be equivalent to the (unweighted) interleave mempolicy, which goes
> through the nodes in a round-robin fashion, ignoring bandwidth information.
>
> This patch has two main goals:
> First, it makes weighted interleave easier to use for users who wish to
> relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
> By providing a set of "real" default weights that just work out of the
> box, users who might not have the capability (or wish) to perform
> experimentation to find the most optimal weights for their system can
> still take advantage of bandwidth-informed weighted interleave.
>
> Second, it allows for weighted interleave to dynamically adjust to
> hotplugged memory with new bandwidth information. Instead of manually
> updating node weights every time new bandwidth information is reported
> or taken off, weighted interleave adjusts and provides a new set of
> default weights for weighted interleave to use when there is a change
> in bandwidth information.
>
> To meet these goals, this patch introduces an auto-configuration mode
> for the interleave weights that provides a reasonable set of default
> weights, calculated using bandwidth data reported by the system. In auto
> mode, weights are dynamically adjusted based on whatever the current
> bandwidth information reports (and responds to hotplug events).
>
> This patch still supports users manually writing weights into the nodeN
> sysfs interface by entering into manual mode. When a user enters manual
> mode, the system stops dynamically updating any of the node weights,
> even during hotplug events that shift the optimal weight distribution.
>
> A new sysfs interface "auto" is introduced, which allows users to switch
> between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
> system also automatically enters manual mode when a nodeN interface is
> manually written to.
>
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights was 1, which meant that writing
> 0 and 1 were functionally equivalent. With this patch, writing 0 is invalid.
>
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
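
To make the ratio semantics above concrete: weighted interleave walks the
weight table, handing out each node's full weight before moving on to the
next node. A toy userspace model (not from the patch; the two-node weight
table {3, 1} is made up) shows the resulting 3:1 placement:

    /* Toy model of weighted interleave placement; userspace, not kernel code. */
    #include <stdio.h>

    int main(void)
    {
            const unsigned char iw_table[2] = { 3, 1 }; /* hypothetical weights */
            unsigned int weight_total = iw_table[0] + iw_table[1];
            unsigned long ilx; /* interleave index, advances per allocation */

            for (ilx = 0; ilx < 8; ilx++) {
                    unsigned int target = ilx % weight_total;
                    int nid = 0;

                    /* consume each node's weight in turn */
                    while (target >= iw_table[nid]) {
                            target -= iw_table[nid];
                            nid++;
                    }
                    printf("page %lu -> node %d\n", ilx, nid);
            }
            return 0;
    }

Pages 0-2 land on node 0 and page 3 on node 1, then the pattern repeats.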
>
> Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
> Suggested-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
> Suggested-by: Harry Yoo <harry.yoo@oracle.com>
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
> Changelog
> v8:
> - Rebased on top of mm-new (93129fd5a3a9c87fe0e53ff8b0bfa30ee43a873a)
> - Specifically, to include Rakie Kim's patchset mm/mempolicy:
>   Enhance sysfs handling for memory hotplug in weighted interleave
>   (1f553871c3925bd1fc474b5fad53a0e8609cd5b0)
> - Rebasing also included cleaning up this patch's sysfs creation.
> - Fixes to prevent racing between rcu_access_pointer and rcu_dereference
> - Wordsmithing
>
> v7:
> - Wordsmithing
> - Rename iw_table_lock to wi_state_lock
> - Clean up reduce_interleave_weights, as suggested by Yunjeong Mun.
> - Combine iw_table allocation & initialization to be outside the function.
> - Skip scaling to [1,100] before scaling to [1,weightiness].
> - Removed the second part of this patch, which prevented creating weight
>   sysfs interfaces for memoryless nodes.
> - Added Suggested-by tags; I should have done this much, much earlier.
>
> v6:
> - iw_weights and mode_auto are combined into one rcu-protected struct.
> - Protection against memoryless nodes, as suggested by Oscar Salvador
> - Wordsmithing (documentation, commit message and comments), as suggested
>   by Andrew Morton.
> - Removed unnecessary #include statement in hmat.c, as pointed out by
>   Harry (Hyeonggon) Yoo and Ying Huang.
> - Bandwidth values changed from u64_t to unsigned int, as pointed out by
>   Ying Huang and Dan Carpenter.
> - RCU optimizations, as suggested by Ying Huang.
> - A second patch is included to fix unintended behavior that creates a
>   weight knob for memoryless nodes as well.
> - Sysfs show/store functions use str_true_false & kstrtobool.
> - Fix a build error in 32-bit systems, which are unable to perform
>   64-bit division by casting 64-bit values to 32-bit, if under the range.
>
> v5:
> - I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
>   this patch. Added to the subject now!
> - Wordsmithing, correcting typos, and re-naming variables for clarity.
> - No functional changes.
>
> v4:
> - Renamed the mode interface to the "auto" interface, which now only
>   emits either 'Y' or 'N'. Users can now interact with it by
>   writing 'Y', '1', 'N', or '0' to it.
> - Added additional documentation to the nodeN sysfs interface.
> - Makes sure iw_table locks are properly held.
> - Removed unlikely() call in reduce_interleave_weights.
> - Wordsmithing
>
> v3:
> - Weightiness (max_node_weight) is now fixed to 32.
> - Instead, the sysfs interface now exposes a "mode" parameter, which
>   can either be "auto" or "manual".
> - Thank you Hyeonggon and Honggyu for the feedback.
> - Documentation updated to reflect new sysfs interface, explicitly
>   specifies that 0 is invalid.
> - Thank you Gregory and Ying for the discussion on how best to
>   handle the 0 case.
> - Re-worked nodeN sysfs store to handle auto --> manual shifts
> - mempolicy_set_node_perf internally handles the auto / manual
>   case differently now. bw is always updated, iw updates depend on
>   what mode the user is in.
> - Wordsmithing comments for clarity.
> - Removed RFC tag.
>
> v2:
> - Name of the interface is changed: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>   interleave weight table is initialized with the defaults, if bandwidth
>   information is available.
> - In addition, all sections that handle iw_table have been changed
>   to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>   support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>   behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
>
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  35 +-
>  drivers/base/node.c                           |   9 +
>  include/linux/mempolicy.h                     |  13 +
>  mm/mempolicy.c                                | 311 ++++++++++++++----
>  4 files changed, 307 insertions(+), 61 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..ec13382c606f 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,35 @@ Description:	Weight configuration interface for nodeN
>  		Minimum weight: 1
>  		Maximum weight: 255
>
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +		Changing the weight to a valid value will automatically
> +		update the system to manual mode as well.

s/update/switch/ ? But my English is poor, please keep your version if
you think that it's better.

> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> +Date:		May 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration mode for weighted interleave. 'true' indicates
> +		that the system is in auto mode, and a 'false' indicates that
> +		the system is in manual mode.
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available during either boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		Note that nodes that are onlined with previously set weights
> +		will reuse those weights. If they were not previously set or
> +		are onlined with missing bandwidth data, the weights will use
> +		a default weight of 1.
> +
> +		Writing any true value string (e.g. Y or 1) will enable auto
> +		mode, while writing any false value string (e.g. N or 0) will
> +		enable manual mode. All other strings are ignored and will
> +		return -EINVAL.
> +
> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically switch the system to manual mode.
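
For reference, exercising the ABI described above from userspace might look
like the sketch below (the sysfs paths are the documented ones; the helper
itself is illustrative and keeps error handling minimal):

    /* Illustrative driver for the weighted_interleave sysfs knobs. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define WI_DIR "/sys/kernel/mm/mempolicy/weighted_interleave/"

    static void wi_write(const char *file, const char *val)
    {
            char path[256];
            int fd;

            snprintf(path, sizeof(path), WI_DIR "%s", file);
            fd = open(path, O_WRONLY);
            if (fd < 0 || write(fd, val, strlen(val)) < 0)
                    perror(path);
            if (fd >= 0)
                    close(fd);
    }

    int main(void)
    {
            wi_write("auto", "Y");  /* enable bandwidth-derived weights */
            wi_write("node0", "7"); /* a direct nodeN write drops back to manual */
            return 0;
    }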
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index cd13ef287011..25ab9ec14eb8 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/vmstat.h>
>  #include <linux/notifier.h>
>  #include <linux/node.h>
> @@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>  			break;
>  		}
>  	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord)) {
> +			pr_info("failed to set mempolicy attrs for node %d\n",
> +				nid);
> +		}
> +	}
>  }
>  EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ce9885e0178a..3e8da8ba1146 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
>  #include <linux/slab.h>
>  #include <linux/rbtree.h>
>  #include <linux/spinlock.h>
> +#include <linux/node.h>
>  #include <linux/nodemask.h>
>  #include <linux/pagemap.h>
>  #include <uapi/linux/mempolicy.h>
> @@ -56,6 +57,15 @@ struct mempolicy {
>  	} w;
>  };
>
> +/*
> + * A null weighted_interleave_state is interpreted as having .mode = "auto",
> + * and .iw_table is interpreted as an array of 1s with length nr_node_ids.
> + */

Better to move the comment above the "wi_state" definition?

> +struct weighted_interleave_state {
> +	bool mode_auto;
> +	u8 iw_table[];
> +};
> +

Why do you put the type definition in mempolicy.h instead of mempolicy.c?
I don't find other users except mempolicy.c.

>  /*
>   * Support for managing mempolicy data objects (clone, copy, destroy)
>   * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
> @@ -178,6 +188,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>
>  extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>
> +extern int mempolicy_set_node_perf(unsigned int node,
> +				   struct access_coordinate *coords);
> +
>  #else
>
>  struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index f43951668c41..f542691b7123 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
>  #include <linux/swapops.h>
> +#include <linux/gcd.h>
>
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
> @@ -140,31 +141,130 @@ static struct mempolicy default_policy = {
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>
>  /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> - * iw_table is RCU protected
> + * weightiness balances the tradeoff between small weights (cycles through nodes
> + * faster, more fair/even distribution) and large weights (smaller errors
> + * between actual bandwidth ratios and weight ratios). 32 is a number that has
> + * been found to perform at a reasonable compromise between the two goals.
> + */
> +static const int weightiness = 32;
> +
> +static struct weighted_interleave_state __rcu *wi_state;
> +static unsigned int *node_bw_table;
> +
> +/*
> + * wi_state_lock protects both wi_state and node_bw_table.
> + * node_bw_table is only used by writers to update wi_state.
> */ > -static u8 __rcu *iw_table; > -static DEFINE_MUTEX(iw_table_lock); > +static DEFINE_MUTEX(wi_state_lock); > > static u8 get_il_weight(int node) > { > - u8 *table; > - u8 weight; > + struct weighted_interleave_state *state; > + u8 weight = 1; > > rcu_read_lock(); > - table = rcu_dereference(iw_table); > - /* if no iw_table, use system default */ > - weight = table ? table[node] : 1; > - /* if value in iw_table is 0, use system default */ > - weight = weight ? weight : 1; > + state = rcu_dereference(wi_state); > + if (state) > + weight = state->iw_table[node]; > rcu_read_unlock(); > return weight; > } > > +/* > + * Convert bandwidth values into weighted interleave weights. > + * Call with wi_state_lock. > + */ > +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw) > +{ > + u64 sum_bw = 0; > + unsigned int cast_sum_bw, scaling_factor = 1, iw_gcd = 0; > + int nid; > + > + for_each_node_state(nid, N_MEMORY) > + sum_bw += bw[nid]; > + > + /* Scale bandwidths to whole numbers in the range [1, weightiness] */ > + for_each_node_state(nid, N_MEMORY) { > + /* > + * Try not to perform 64-bit division. > + * If sum_bw < scaling_factor, then sum_bw < U32_MAX. > + * If sum_bw > scaling_factor, then round the weight up to 1. > + */ > + scaling_factor = weightiness * bw[nid]; > + if (bw[nid] && sum_bw < scaling_factor) { > + cast_sum_bw = (unsigned int)sum_bw; > + new_iw[nid] = scaling_factor / cast_sum_bw; > + } else { > + new_iw[nid] = 1; > + } > + if (!iw_gcd) > + iw_gcd = new_iw[nid]; > + iw_gcd = gcd(iw_gcd, new_iw[nid]); > + } > + > + /* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */ > + for_each_node_state(nid, N_MEMORY) > + new_iw[nid] /= iw_gcd; > +} > + > +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords) > +{ > + struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL; > + unsigned int *old_bw, *new_bw; > + unsigned int bw_val; > + int i; > + > + bw_val = min(coords->read_bandwidth, coords->write_bandwidth); > + new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL); > + if (!new_bw) > + return -ENOMEM; > + > + new_wi_state = kmalloc(struct_size(new_wi_state, iw_table, nr_node_ids), > + GFP_KERNEL); > + if (!new_wi_state) { > + kfree(new_bw); > + return -ENOMEM; > + } > + new_wi_state->mode_auto = true; > + for (i = 0; i < nr_node_ids; i++) > + new_wi_state->iw_table[i] = 1; > + > + /* > + * Update bandwidth info, even in manual mode. That way, when switching > + * to auto mode in the future, iw_table can be overwritten using > + * accurate bw data. 
> +
> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *old_bw, *new_bw;
> +	unsigned int bw_val;
> +	int i;
> +
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_wi_state = kmalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +	new_wi_state->mode_auto = true;
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_wi_state->iw_table[i] = 1;
> +
> +	/*
> +	 * Update bandwidth info, even in manual mode. That way, when switching
> +	 * to auto mode in the future, iw_table can be overwritten using
> +	 * accurate bw data.
> +	 */
> +	mutex_lock(&wi_state_lock);
> +
> +	old_bw = node_bw_table;
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(*old_bw));
> +	new_bw[node] = bw_val;
> +	node_bw_table = new_bw;
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +				lockdep_is_held(&wi_state_lock));
> +	if (old_wi_state && !old_wi_state->mode_auto) {
> +		/* Manual mode; skip reducing weights and updating wi_state */
> +		mutex_unlock(&wi_state_lock);
> +		kfree(new_wi_state);
> +		goto out;
> +	}
> +
> +	/* NULL wi_state assumes auto=true; reduce weights and update wi_state */
> +	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +out:
> +	kfree(old_bw);
> +	return 0;
> +}
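
The function above follows the standard RCU publish-and-reclaim shape:
build the replacement off to the side, publish it with rcu_assign_pointer()
while holding the writer mutex, then synchronize_rcu() before freeing the
old copy so no reader can still hold a reference. A condensed sketch with
a simplified state type (illustrative, not the patch's actual structures):

    struct state { int data; };
    static struct state __rcu *cur;
    static DEFINE_MUTEX(update_lock);

    static int update_state(int data)
    {
            struct state *new, *old;

            new = kmalloc(sizeof(*new), GFP_KERNEL); /* 1. build off-line */
            if (!new)
                    return -ENOMEM;
            new->data = data;

            mutex_lock(&update_lock); /* 2. serialize writers */
            old = rcu_dereference_protected(cur, lockdep_is_held(&update_lock));
            rcu_assign_pointer(cur, new); /* 3. publish */
            mutex_unlock(&update_lock);

            if (old) {
                    synchronize_rcu(); /* 4. wait out existing readers */
                    kfree(old);
            }
            return 0;
    }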
> +
>  /**
>   * numa_nearest_node - Find nearest node by state
>   * @node: Node id to start the search
> @@ -2023,26 +2123,28 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
>
>  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  {
> +	struct weighted_interleave_state *state;
>  	nodemask_t nodemask;
>  	unsigned int target, nr_nodes;
> -	u8 *table;
> +	u8 *table = NULL;
>  	unsigned int weight_total = 0;
>  	u8 weight;
> -	int nid;
> +	int nid = 0;
>
>  	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
>  	if (!nr_nodes)
>  		return numa_node_id();
>
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> +
> +	state = rcu_dereference(wi_state);
> +	/* Uninitialized wi_state means we should assume all weights are 1 */
> +	if (state)
> +		table = state->iw_table;
> +
>  	/* calculate the total weight */
> -	for_each_node_mask(nid, nodemask) {
> -		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> -		weight_total += weight;
> -	}
> +	for_each_node_mask(nid, nodemask)
> +		weight_total += table ? table[nid] : 1;
>
>  	/* Calculate the node offset based on totals */
>  	target = ilx % weight_total;
> @@ -2050,7 +2152,6 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  	while (target) {
>  		/* detect system default usage */
>  		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
>  		if (target < weight)
>  			break;
>  		target -= weight;
> @@ -2451,13 +2552,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
>  {
> +	struct weighted_interleave_state *state;
>  	struct task_struct *me = current;
>  	unsigned int cpuset_mems_cookie;
>  	unsigned long total_allocated = 0;
>  	unsigned long nr_allocated = 0;
>  	unsigned long rounds;
>  	unsigned long node_pages, delta;
> -	u8 *table, *weights, weight;
> +	u8 *weights, weight;
>  	unsigned int weight_total = 0;
>  	unsigned long rem_pages = nr_pages;
>  	nodemask_t nodes;
> @@ -2507,17 +2609,19 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		return total_allocated;
>
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	if (table)
> -		memcpy(weights, table, nr_node_ids);
> -	rcu_read_unlock();
> +	state = rcu_dereference(wi_state);
> +	if (state) {
> +		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
> +		rcu_read_unlock();
> +	} else {
> +		rcu_read_unlock();
> +		for (i = 0; i < nr_node_ids; i++)
> +			weights[i] = 1;
> +	}
>
>  	/* calculate total, detect system default usage */
> -	for_each_node_mask(node, nodes) {
> -		if (!weights[node])
> -			weights[node] = 1;
> +	for_each_node_mask(node, nodes)
>  		weight_total += weights[node];
> -	}
>
>  	/*
>  	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> @@ -3431,6 +3535,7 @@ struct iw_node_attr {
>  struct sysfs_wi_group {
>  	struct kobject wi_kobj;
>  	struct mutex kobj_lock;
> +	struct kobj_attribute auto_kobj_attr;
>  	struct iw_node_attr *nattrs[];
>  };
>
> @@ -3450,31 +3555,104 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
>  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>  			  const char *buf, size_t count)
>  {
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
>  	struct iw_node_attr *node_attr;
> -	u8 *new;
> -	u8 *old;
>  	u8 weight = 0;
> +	int i;
>
>  	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
>  	if (count == 0 || sysfs_streq(buf, ""))
>  		weight = 0;

According to the revised ABI, we should return -EINVAL here?
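
The point here: as written, an empty write leaves weight at 0 and falls
through to the store path, so a weight of 0 gets recorded even though the
revised ABI text declares empty input invalid. A minimal change along the
lines Ying suggests might be (illustrative only):

    if (count == 0 || sysfs_streq(buf, ""))
            return -EINVAL;
    if (kstrtou8(buf, 0, &weight) || weight == 0)
            return -EINVAL;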
> -	else if (kstrtou8(buf, 0, &weight))
> +	else if (kstrtou8(buf, 0, &weight) || weight == 0)
>  		return -EINVAL;
>
> -	new = kzalloc(nr_node_ids, GFP_KERNEL);
> -	if (!new)
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
>  		return -ENOMEM;
>
> -	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	if (old)
> -		memcpy(new, old, nr_node_ids);
> -	new[node_attr->nid] = weight;
> -	rcu_assign_pointer(iw_table, new);
> -	mutex_unlock(&iw_table_lock);
> -	synchronize_rcu();
> -	kfree(old);
> +	mutex_lock(&wi_state_lock);
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +				lockdep_is_held(&wi_state_lock));
> +	if (old_wi_state) {
> +		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +		       nr_node_ids * sizeof(u8));
> +	} else {
> +		for (i = 0; i < nr_node_ids; i++)
> +			new_wi_state->iw_table[i] = 1;
> +	}
> +	new_wi_state->iw_table[node_attr->nid] = weight;
> +	new_wi_state->mode_auto = false;
> +
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +	return count;
> +}
> +
> +static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	struct weighted_interleave_state *state;
> +	bool wi_auto = true;
> +
> +	rcu_read_lock();
> +	state = rcu_dereference(wi_state);
> +	if (state)
> +		wi_auto = state->mode_auto;
> +	rcu_read_unlock();
> +
> +	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
> +}
> +
> +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *bw;
> +	bool input;
> +	int i;
> +
> +	if (kstrtobool(buf, &input))
> +		return -EINVAL;
> +
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_wi_state->iw_table[i] = 1;
> +
> +	mutex_lock(&wi_state_lock);
> +	if (!input) {

If input == old_wi_state->mode_auto, we can return directly?
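
A sketch of that early return (illustrative; it only covers the case where
wi_state is already allocated, since a NULL wi_state still needs fresh
weights computed when enabling auto mode):

    mutex_lock(&wi_state_lock);
    old_wi_state = rcu_dereference_protected(wi_state,
                            lockdep_is_held(&wi_state_lock));
    if (old_wi_state && old_wi_state->mode_auto == input) {
            mutex_unlock(&wi_state_lock);
            kfree(new_wi_state);
            return count;
    }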
> +		old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +		if (old_wi_state)
> +			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +			       nr_node_ids * sizeof(u8));
> +		goto update_wi_state;
> +	}
> +
> +	bw = node_bw_table;
> +	if (!bw) {
> +		mutex_unlock(&wi_state_lock);
> +		kfree(new_wi_state);
> +		return -ENODEV;
> +	}
> +
> +	new_wi_state->mode_auto = true;
> +	reduce_interleave_weights(bw, new_wi_state->iw_table);
> +
> +update_wi_state:
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
>  	return count;
>  }
>
> @@ -3508,23 +3686,31 @@ static void sysfs_wi_node_delete_all(void)
>  		sysfs_wi_node_delete(nid);
>  }
>
> -static void iw_table_free(void)
> +static void wi_state_free(void)
>  {
> -	u8 *old;
> +	struct weighted_interleave_state *old_wi_state;
>
> -	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	rcu_assign_pointer(iw_table, NULL);
> -	mutex_unlock(&iw_table_lock);
> +	mutex_lock(&wi_state_lock);
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +				lockdep_is_held(&wi_state_lock));
> +	if (!old_wi_state) {
> +		mutex_unlock(&wi_state_lock);
> +		goto out;
> +	}
>
> +	rcu_assign_pointer(wi_state, NULL);
> +	mutex_unlock(&wi_state_lock);
>  	synchronize_rcu();
> -	kfree(old);
> +	kfree(old_wi_state);
> +out:
> +	kfree(&wi_group->wi_kobj);
>  }
>
>  static void wi_cleanup(void)
>  {
> +	sysfs_remove_file(&wi_group->wi_kobj, &wi_group->auto_kobj_attr.attr);

Why not just

	sysfs_remove_file(&wi_group->wi_kobj, &wi_auto_attr.attr);

?

>  	sysfs_wi_node_delete_all();
> -	iw_table_free();
> +	wi_state_free();
>  }
>
>  static void wi_kobj_release(struct kobject *wi_kobj)
> @@ -3612,6 +3798,10 @@ static int wi_node_notifier(struct notifier_block *nb,
>  	return NOTIFY_OK;
>  }
>
> +static struct kobj_attribute wi_auto_attr =
> +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> +	       weighted_interleave_auto_store);
> +
>  static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
>  {
>  	int nid, err;
> @@ -3627,6 +3817,11 @@ static int __init add_weighted_interleave_group(struct kobject *mempolicy_kobj)
>  	if (err)
>  		goto err_put_kobj;
>
> +	err = sysfs_create_file(&wi_group->wi_kobj, &wi_auto_attr.attr);
> +	if (err)
> +		goto err_put_kobj;
> +	wi_group->auto_kobj_attr = wi_auto_attr;
> +
>  	for_each_online_node(nid) {
>  		if (!node_state(nid, N_MEMORY))
>  			continue;

---

Best Regards,
Huang, Ying