From patchwork Mon Apr 7 23:42:02 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 878969
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 01/14] swapfile: rearrange functions
Date: Mon, 7 Apr 2025 16:42:02 -0700
Message-ID: <20250407234223.1059191-2-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>

Rearrange some functions in preparation for the rest of the series. No functional change intended.
Signed-off-by: Nhat Pham --- mm/swapfile.c | 230 +++++++++++++++++++++++++------------------------- 1 file changed, 115 insertions(+), 115 deletions(-) diff --git a/mm/swapfile.c b/mm/swapfile.c index df7c4e8b089c..27cf985e08ac 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -124,11 +124,6 @@ static struct swap_info_struct *swap_type_to_swap_info(int type) return READ_ONCE(swap_info[type]); /* rcu_dereference() */ } -static inline unsigned char swap_count(unsigned char ent) -{ - return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ -} - /* * Use the second highest bit of inuse_pages counter as the indicator * if one swap device is on the available plist, so the atomic can @@ -161,6 +156,11 @@ static long swap_usage_in_pages(struct swap_info_struct *si) /* Reclaim directly, bypass the slot cache and don't touch device lock */ #define TTRS_DIRECT 0x8 +static inline unsigned char swap_count(unsigned char ent) +{ + return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ +} + static bool swap_is_has_cache(struct swap_info_struct *si, unsigned long offset, int nr_pages) { @@ -1326,46 +1326,6 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) return NULL; } -static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, - unsigned long offset, - unsigned char usage) -{ - unsigned char count; - unsigned char has_cache; - - count = si->swap_map[offset]; - - has_cache = count & SWAP_HAS_CACHE; - count &= ~SWAP_HAS_CACHE; - - if (usage == SWAP_HAS_CACHE) { - VM_BUG_ON(!has_cache); - has_cache = 0; - } else if (count == SWAP_MAP_SHMEM) { - /* - * Or we could insist on shmem.c using a special - * swap_shmem_free() and free_shmem_swap_and_cache()... - */ - count = 0; - } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) { - if (count == COUNT_CONTINUED) { - if (swap_count_continued(si, offset, count)) - count = SWAP_MAP_MAX | COUNT_CONTINUED; - else - count = SWAP_MAP_MAX; - } else - count--; - } - - usage = count | has_cache; - if (usage) - WRITE_ONCE(si->swap_map[offset], usage); - else - WRITE_ONCE(si->swap_map[offset], SWAP_HAS_CACHE); - - return usage; -} - /* * When we get a swap entry, if there aren't some other ways to * prevent swapoff, such as the folio in swap cache is locked, RCU @@ -1432,6 +1392,46 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry) return NULL; } +static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, + unsigned long offset, + unsigned char usage) +{ + unsigned char count; + unsigned char has_cache; + + count = si->swap_map[offset]; + + has_cache = count & SWAP_HAS_CACHE; + count &= ~SWAP_HAS_CACHE; + + if (usage == SWAP_HAS_CACHE) { + VM_BUG_ON(!has_cache); + has_cache = 0; + } else if (count == SWAP_MAP_SHMEM) { + /* + * Or we could insist on shmem.c using a special + * swap_shmem_free() and free_shmem_swap_and_cache()... 
+ */ + count = 0; + } else if ((count & ~COUNT_CONTINUED) <= SWAP_MAP_MAX) { + if (count == COUNT_CONTINUED) { + if (swap_count_continued(si, offset, count)) + count = SWAP_MAP_MAX | COUNT_CONTINUED; + else + count = SWAP_MAP_MAX; + } else + count--; + } + + usage = count | has_cache; + if (usage) + WRITE_ONCE(si->swap_map[offset], usage); + else + WRITE_ONCE(si->swap_map[offset], SWAP_HAS_CACHE); + + return usage; +} + static unsigned char __swap_entry_free(struct swap_info_struct *si, swp_entry_t entry) { @@ -1585,25 +1585,6 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) unlock_cluster(ci); } -void swapcache_free_entries(swp_entry_t *entries, int n) -{ - int i; - struct swap_cluster_info *ci; - struct swap_info_struct *si = NULL; - - if (n <= 0) - return; - - for (i = 0; i < n; ++i) { - si = _swap_info_get(entries[i]); - if (si) { - ci = lock_cluster(si, swp_offset(entries[i])); - swap_entry_range_free(si, ci, entries[i], 1); - unlock_cluster(ci); - } - } -} - int __swap_count(swp_entry_t entry) { struct swap_info_struct *si = swp_swap_info(entry); @@ -1717,57 +1698,6 @@ static bool folio_swapped(struct folio *folio) return swap_page_trans_huge_swapped(si, entry, folio_order(folio)); } -static bool folio_swapcache_freeable(struct folio *folio) -{ - VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); - - if (!folio_test_swapcache(folio)) - return false; - if (folio_test_writeback(folio)) - return false; - - /* - * Once hibernation has begun to create its image of memory, - * there's a danger that one of the calls to folio_free_swap() - * - most probably a call from __try_to_reclaim_swap() while - * hibernation is allocating its own swap pages for the image, - * but conceivably even a call from memory reclaim - will free - * the swap from a folio which has already been recorded in the - * image as a clean swapcache folio, and then reuse its swap for - * another page of the image. On waking from hibernation, the - * original folio might be freed under memory pressure, then - * later read back in from swap, now with the wrong data. - * - * Hibernation suspends storage while it is writing the image - * to disk so check that here. - */ - if (pm_suspended_storage()) - return false; - - return true; -} - -/** - * folio_free_swap() - Free the swap space used for this folio. - * @folio: The folio to remove. - * - * If swap is getting full, or if there are no more mappings of this folio, - * then call folio_free_swap to free its swap space. - * - * Return: true if we were able to release the swap space. - */ -bool folio_free_swap(struct folio *folio) -{ - if (!folio_swapcache_freeable(folio)) - return false; - if (folio_swapped(folio)) - return false; - - delete_from_swap_cache(folio); - folio_set_dirty(folio); - return true; -} - /** * free_swap_and_cache_nr() - Release reference on range of swap entries and * reclaim their cache if no more references remain. 
@@ -1842,6 +1772,76 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) put_swap_device(si); } +void swapcache_free_entries(swp_entry_t *entries, int n) +{ + int i; + struct swap_cluster_info *ci; + struct swap_info_struct *si = NULL; + + if (n <= 0) + return; + + for (i = 0; i < n; ++i) { + si = _swap_info_get(entries[i]); + if (si) { + ci = lock_cluster(si, swp_offset(entries[i])); + swap_entry_range_free(si, ci, entries[i], 1); + unlock_cluster(ci); + } + } +} + +static bool folio_swapcache_freeable(struct folio *folio) +{ + VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); + + if (!folio_test_swapcache(folio)) + return false; + if (folio_test_writeback(folio)) + return false; + + /* + * Once hibernation has begun to create its image of memory, + * there's a danger that one of the calls to folio_free_swap() + * - most probably a call from __try_to_reclaim_swap() while + * hibernation is allocating its own swap pages for the image, + * but conceivably even a call from memory reclaim - will free + * the swap from a folio which has already been recorded in the + * image as a clean swapcache folio, and then reuse its swap for + * another page of the image. On waking from hibernation, the + * original folio might be freed under memory pressure, then + * later read back in from swap, now with the wrong data. + * + * Hibernation suspends storage while it is writing the image + * to disk so check that here. + */ + if (pm_suspended_storage()) + return false; + + return true; +} + +/** + * folio_free_swap() - Free the swap space used for this folio. + * @folio: The folio to remove. + * + * If swap is getting full, or if there are no more mappings of this folio, + * then call folio_free_swap to free its swap space. + * + * Return: true if we were able to release the swap space. 
+ */ +bool folio_free_swap(struct folio *folio) +{ + if (!folio_swapcache_freeable(folio)) + return false; + if (folio_swapped(folio)) + return false; + + delete_from_swap_cache(folio); + folio_set_dirty(folio); + return true; +} + #ifdef CONFIG_HIBERNATION swp_entry_t get_swap_page_of_type(int type)

From patchwork Mon Apr 7 23:42:03 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 879276
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 02/14] mm: swap: add an abstract API for locking out swapoff
Date: Mon, 7 Apr 2025 16:42:03 -0700
Message-ID: <20250407234223.1059191-3-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>

Currently, we take a reference on the backing swap device in order to lock out swapoff and ensure the swap entry remains valid. This will no longer be sufficient, or even possible, once swap entries are decoupled from their backing stores - a swap entry might not have any backing swap device at all. In preparation for this decoupling work, abstract the swapoff-lockout behavior into a generic API (whose implementation will eventually diverge between the old and the new swap implementations).
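For illustration only, below is a minimal, self-contained userspace sketch of the caller pattern this API is meant to support. The types and the trylock_swapoff()/unlock_swapoff() bodies here are stand-in mocks, not the kernel implementation (which wraps get_swap_device()/put_swap_device() in this patch):

	/*
	 * Illustrative mock only: stand-ins for the kernel types, showing
	 * the trylock_swapoff()/unlock_swapoff() calling convention.
	 */
	#include <stdbool.h>
	#include <stdio.h>

	typedef struct { unsigned long val; } swp_entry_t;
	struct swap_info_struct { int placeholder; };

	static struct swap_info_struct mock_si;

	static bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si)
	{
		/* Mock: pretend any nonzero entry still has a live swap device. */
		*si = entry.val ? &mock_si : NULL;
		return *si != NULL;
	}

	static void unlock_swapoff(swp_entry_t entry, struct swap_info_struct *si)
	{
		/* Mock: the real helper drops the reference taken above. */
		(void)entry;
		(void)si;
	}

	/* Caller pattern, mirroring do_swap_page() / shmem_swapin_folio(). */
	static int swapin_one(swp_entry_t entry)
	{
		struct swap_info_struct *si = NULL;
		bool swapoff_locked = trylock_swapoff(entry, &si);

		if (!swapoff_locked)
			return -1;	/* entry is no longer backed; bail out */

		printf("swapoff held off while handling entry %lu\n", entry.val);
		/* ... swap-in work that must not race with swapoff ... */

		unlock_swapoff(entry, si);
		return 0;
	}

	int main(void)
	{
		swp_entry_t entry = { .val = 42 };

		return swapin_one(entry);
	}

The point of the abstraction is that callers only express "keep swapoff away while I work on this entry"; how that is achieved (today, a reference on the backing swap device) becomes an implementation detail.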
Signed-off-by: Nhat Pham --- include/linux/swap.h | 12 ++++++++++++ mm/memory.c | 13 +++++++------ mm/shmem.c | 7 +++---- mm/swap_state.c | 10 ++++------ 4 files changed, 26 insertions(+), 16 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index b13b72645db3..e479fd31c6d6 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -706,5 +706,17 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +static inline bool trylock_swapoff(swp_entry_t entry, + struct swap_info_struct **si) +{ + return get_swap_device(entry); +} + +static inline void unlock_swapoff(swp_entry_t entry, + struct swap_info_struct *si) +{ + put_swap_device(si); +} + #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff --git a/mm/memory.c b/mm/memory.c index fb7b8dc75167..e92914df5ca7 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4305,6 +4305,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) struct swap_info_struct *si = NULL; rmap_t rmap_flags = RMAP_NONE; bool need_clear_cache = false; + bool swapoff_locked = false; bool exclusive = false; swp_entry_t entry; pte_t pte; @@ -4365,8 +4366,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) } /* Prevent swapoff from happening to us. */ - si = get_swap_device(entry); - if (unlikely(!si)) + swapoff_locked = trylock_swapoff(entry, &si); + if (unlikely(!swapoff_locked)) goto out; folio = swap_cache_get_folio(entry, vma, vmf->address); @@ -4713,8 +4714,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } - if (si) - put_swap_device(si); + if (swapoff_locked) + unlock_swapoff(entry, si); return ret; out_nomap: if (vmf->pte) @@ -4732,8 +4733,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (waitqueue_active(&swapcache_wq)) wake_up(&swapcache_wq); } - if (si) - put_swap_device(si); + if (swapoff_locked) + unlock_swapoff(entry, si); return ret; } diff --git a/mm/shmem.c b/mm/shmem.c index 1ede0800e846..8ef72dcc592e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2262,8 +2262,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (is_poisoned_swp_entry(swap)) return -EIO; - si = get_swap_device(swap); - if (!si) { + if (!trylock_swapoff(swap, &si)) { if (!shmem_confirm_swap(mapping, index, swap)) return -EEXIST; else @@ -2411,7 +2410,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, } folio_mark_dirty(folio); swap_free_nr(swap, nr_pages); - put_swap_device(si); + unlock_swapoff(swap, si); *foliop = folio; return 0; @@ -2428,7 +2427,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, folio_unlock(folio); folio_put(folio); } - put_swap_device(si); + unlock_swapoff(swap, si); return error; } diff --git a/mm/swap_state.c b/mm/swap_state.c index ca42b2be64d9..81f69b2df550 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -419,12 +419,11 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping, if (non_swap_entry(swp)) return ERR_PTR(-ENOENT); /* Prevent swapoff from happening to us */ - si = get_swap_device(swp); - if (!si) + if (!trylock_swapoff(swp, &si)) return ERR_PTR(-ENOENT); index = swap_cache_index(swp); folio = filemap_get_folio(swap_address_space(swp), index); - put_swap_device(si); + unlock_swapoff(swp, si); return folio; } @@ -439,8 +438,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, void *shadow = NULL; *new_page_allocated = false; - si = get_swap_device(entry); - if (!si) + if (!trylock_swapoff(entry, &si)) return NULL; for (;;) { @@ -538,7 +536,7 @@ 
struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, put_swap_folio(new_folio, entry); folio_unlock(new_folio); put_and_return: - put_swap_device(si); + unlock_swapoff(entry, si); if (!(*new_page_allocated) && new_folio) folio_put(new_folio); return result;

From patchwork Mon Apr 7 23:42:04 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 878967
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 03/14] mm: swap: add a separate type for physical swap slots
Date: Mon, 7 Apr 2025 16:42:04 -0700
Message-ID: <20250407234223.1059191-4-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>

In preparation for swap virtualization, add a new type to represent the physical swap slots of a swapfile. This allows us to separate:

1. The logical view of the swap entry (i.e., what is stored in page table entries and used to index into the swap cache), represented by the old swp_entry_t type.

from:

2. Its physical backing state (i.e., the actual backing slot on the swap device), represented by the new swp_slot_t type.

The functions that operate at the physical level (i.e., on swp_slot_t values) are renamed where appropriate (e.g., prefixed with swp_slot_*). We also take this opportunity to rearrange the header files (include/linux/swap.h and swapops.h), grouping the swap API into the following categories:

1. Virtual swap API (i.e., functions on the swp_entry_t type).
2. Swap cache API (mm/swap_state.c).
3. Swap slot cache API (mm/swap_slots.c).
4. Physical swap slots and device API (mm/swapfile.c).

Note that there is no behavioral change yet - the mapping between the two types is the identity mapping, as sketched below.
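As a standalone illustration of the type split (not kernel code; the SWP_TYPE_SHIFT/SWP_OFFSET_MASK values below are placeholders - the kernel derives them from the architecture's PTE layout):

	#include <assert.h>
	#include <stdio.h>

	/* Placeholder encoding; assumes 64-bit unsigned long. */
	#define SWP_TYPE_SHIFT	58
	#define SWP_OFFSET_MASK	((1UL << SWP_TYPE_SHIFT) - 1)

	/* Logical handle: what page tables and the swap cache index on. */
	typedef struct { unsigned long val; } swp_entry_t;
	/* Physical handle: a slot on a concrete swap device. */
	typedef struct { unsigned long val; } swp_slot_t;

	static swp_slot_t swp_slot(unsigned long type, unsigned long offset)
	{
		return (swp_slot_t) { (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK) };
	}

	static unsigned int swp_slot_type(swp_slot_t slot)
	{
		return slot.val >> SWP_TYPE_SHIFT;
	}

	static unsigned long swp_slot_offset(swp_slot_t slot)
	{
		return slot.val & SWP_OFFSET_MASK;
	}

	/* For now the two handle spaces coincide, so conversion is the identity. */
	static swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
	{
		return (swp_slot_t) { entry.val };
	}

	static swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
	{
		return (swp_entry_t) { slot.val };
	}

	int main(void)
	{
		swp_slot_t slot = swp_slot(1, 1234);	/* device type 1, offset 1234 */
		swp_entry_t entry = swp_slot_to_swp_entry(slot);

		assert(swp_entry_to_swp_slot(entry).val == slot.val);
		printf("type=%u offset=%lu\n",
		       swp_slot_type(swp_entry_to_swp_slot(entry)),
		       swp_slot_offset(swp_entry_to_swp_slot(entry)));
		return 0;
	}

Keeping the conversion helpers as the identity map makes this patch a purely type-level separation; later patches can change the mapping without touching the converted callers again.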
In later patches, we shall dynamically allocate a virtual swap slot (of type swp_entry_t) for each swapped out page to store in the page table entry, and associate it with a backing store. A physical swap slot (i.e a slot on a physical swap device) is one of the backing options. Signed-off-by: Nhat Pham --- include/linux/mm_types.h | 7 ++ include/linux/swap.h | 124 +++++++++++++------ include/linux/swap_slots.h | 2 +- include/linux/swapops.h | 25 ++++ kernel/power/swap.c | 6 +- mm/internal.h | 10 +- mm/memory.c | 7 +- mm/page_io.c | 33 +++-- mm/shmem.c | 21 +++- mm/swap.h | 17 +-- mm/swap_cgroup.c | 10 +- mm/swap_slots.c | 28 ++--- mm/swap_state.c | 26 +++- mm/swapfile.c | 243 ++++++++++++++++++++----------------- 14 files changed, 351 insertions(+), 208 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 0234f14f2aa6..7d93bb2c3dae 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -283,6 +283,13 @@ typedef struct { unsigned long val; } swp_entry_t; +/* + * Physical (i.e disk-based) swap slot handle. + */ +typedef struct { + unsigned long val; +} swp_slot_t; + /** * struct folio - Represents a contiguous set of bytes. * @flags: Identical to the page flags. diff --git a/include/linux/swap.h b/include/linux/swap.h index e479fd31c6d6..674089bc4cd1 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -277,7 +277,7 @@ enum swap_cluster_flags { * cluster to which it belongs being marked free. Therefore 0 is safe to use as * a sentinel to indicate an entry is not valid. */ -#define SWAP_ENTRY_INVALID 0 +#define SWAP_SLOT_INVALID 0 #ifdef CONFIG_THP_SWAP #define SWAP_NR_ORDERS (PMD_ORDER + 1) @@ -452,25 +452,45 @@ extern void __meminit kswapd_run(int nid); extern void __meminit kswapd_stop(int nid); #ifdef CONFIG_SWAP - -int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, - unsigned long nr_pages, sector_t start_block); -int generic_swapfile_activate(struct swap_info_struct *, struct file *, - sector_t *); - +/* Virtual swap API */ +swp_entry_t folio_alloc_swap(struct folio *folio); +bool folio_free_swap(struct folio *folio); +void put_swap_folio(struct folio *folio, swp_entry_t entry); +int add_swap_count_continuation(swp_entry_t, gfp_t); +void swap_shmem_alloc(swp_entry_t, int); +int swap_duplicate(swp_entry_t); +int swapcache_prepare(swp_entry_t entry, int nr); +void swap_free_nr(swp_entry_t entry, int nr_pages); +void free_swap_and_cache_nr(swp_entry_t entry, int nr); +int __swap_count(swp_entry_t entry); +int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry); +int swp_swapcount(swp_entry_t entry); + +/* Swap cache API (mm/swap_state.c) */ static inline unsigned long total_swapcache_pages(void) { return global_node_page_state(NR_SWAPCACHE); } -void free_swap_cache(struct folio *folio); void free_page_and_swap_cache(struct page *); void free_pages_and_swap_cache(struct encoded_page **, int); -/* linux/mm/swapfile.c */ +void free_swap_cache(struct folio *folio); +int init_swap_address_space(unsigned int type, unsigned long nr_pages); +void exit_swap_address_space(unsigned int type); + +/* Swap slot cache API (mm/swap_slot.c) */ +swp_slot_t folio_alloc_swap_slot(struct folio *folio); + +/* Physical swap slots and device API (mm/swapfile.c) */ +int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, + unsigned long nr_pages, sector_t start_block); +int generic_swapfile_activate(struct swap_info_struct *, struct file *, + sector_t *); + extern atomic_long_t nr_swap_pages; extern 
long total_swap_pages; extern atomic_t nr_rotate_swap; -extern bool has_usable_swap(void); +bool has_usable_swap(void); /* Swap 50% full? Release swapcache more aggressively.. */ static inline bool vm_swap_full(void) @@ -483,50 +503,37 @@ static inline long get_nr_swap_pages(void) return atomic_long_read(&nr_swap_pages); } -extern void si_swapinfo(struct sysinfo *); -swp_entry_t folio_alloc_swap(struct folio *folio); -bool folio_free_swap(struct folio *folio); -void put_swap_folio(struct folio *folio, swp_entry_t entry); -extern swp_entry_t get_swap_page_of_type(int); -extern int get_swap_pages(int n, swp_entry_t swp_entries[], int order); -extern int add_swap_count_continuation(swp_entry_t, gfp_t); -extern void swap_shmem_alloc(swp_entry_t, int); -extern int swap_duplicate(swp_entry_t); -extern int swapcache_prepare(swp_entry_t entry, int nr); -extern void swap_free_nr(swp_entry_t entry, int nr_pages); -extern void swapcache_free_entries(swp_entry_t *entries, int n); -extern void free_swap_and_cache_nr(swp_entry_t entry, int nr); +void si_swapinfo(struct sysinfo *); +swp_slot_t swap_slot_alloc_of_type(int); +int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order); +void swap_slot_free_nr(swp_slot_t slot, int nr_pages); +void swap_slot_cache_free_slots(swp_slot_t *slots, int n); int swap_type_of(dev_t device, sector_t offset); +sector_t swapdev_block(int, pgoff_t); int find_first_swap(dev_t *device); -extern unsigned int count_swap_pages(int, int); -extern sector_t swapdev_block(int, pgoff_t); -extern int __swap_count(swp_entry_t entry); -extern int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry); -extern int swp_swapcount(swp_entry_t entry); -struct swap_info_struct *swp_swap_info(swp_entry_t entry); +unsigned int count_swap_pages(int, int); +struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot); struct backing_dev_info; -extern int init_swap_address_space(unsigned int type, unsigned long nr_pages); -extern void exit_swap_address_space(unsigned int type); -extern struct swap_info_struct *get_swap_device(swp_entry_t entry); +struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot); sector_t swap_folio_sector(struct folio *folio); -static inline void put_swap_device(struct swap_info_struct *si) +static inline void swap_slot_put_swap_info(struct swap_info_struct *si) { percpu_ref_put(&si->users); } #else /* CONFIG_SWAP */ -static inline struct swap_info_struct *swp_swap_info(swp_entry_t entry) +static inline struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot) { return NULL; } -static inline struct swap_info_struct *get_swap_device(swp_entry_t entry) +static inline struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot) { return NULL; } -static inline void put_swap_device(struct swap_info_struct *si) +static inline void swap_slot_put_swap_info(struct swap_info_struct *si) { } @@ -575,7 +582,7 @@ static inline void swap_free_nr(swp_entry_t entry, int nr_pages) { } -static inline void put_swap_folio(struct folio *folio, swp_entry_t swp) +static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) { } @@ -606,12 +613,24 @@ static inline bool folio_free_swap(struct folio *folio) return false; } +static inline swp_slot_t folio_alloc_swap_slot(struct folio *folio) +{ + swp_slot_t slot; + + slot.val = 0; + return slot; +} + static inline int add_swap_extent(struct swap_info_struct *sis, unsigned long start_page, unsigned long nr_pages, sector_t start_block) { return -EINVAL; } + +static inline void 
swap_slot_free_nr(swp_slot_t slot, int nr_pages) +{ +} #endif /* CONFIG_SWAP */ static inline void free_swap_and_cache(swp_entry_t entry) @@ -706,16 +725,43 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +/** + * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a + * virtual swap slot. + * @entry: the virtual swap slot. + * + * Return: the physical swap slot corresponding to the virtual swap slot. + */ +static inline swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) +{ + return (swp_slot_t) { entry.val }; +} + +/** + * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a + * physical swap slot. + * @slot: the physical swap slot. + * + * Return: the virtual swap slot corresponding to the physical swap slot. + */ +static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) +{ + return (swp_entry_t) { slot.val }; +} + static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) { - return get_swap_device(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + + *si = swap_slot_tryget_swap_info(slot); + return *si; } static inline void unlock_swapoff(swp_entry_t entry, struct swap_info_struct *si) { - put_swap_device(si); + swap_slot_put_swap_info(si); } #endif /* __KERNEL__*/ diff --git a/include/linux/swap_slots.h b/include/linux/swap_slots.h index 840aec3523b2..1ac926d46389 100644 --- a/include/linux/swap_slots.h +++ b/include/linux/swap_slots.h @@ -13,7 +13,7 @@ struct swap_slots_cache { bool lock_initialized; struct mutex alloc_lock; /* protects slots, nr, cur */ - swp_entry_t *slots; + swp_slot_t *slots; int nr; int cur; int n_ret; diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 96f26e29fefe..2a4101c9bba4 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -618,5 +618,30 @@ static inline int non_swap_entry(swp_entry_t entry) return swp_type(entry) >= MAX_SWAPFILES; } +/* Physical swap slots operations */ + +/* + * Store a swap device type + offset into a swp_slot_t handle. + */ +static inline swp_slot_t swp_slot(unsigned long type, pgoff_t offset) +{ + swp_slot_t ret; + + ret.val = (type << SWP_TYPE_SHIFT) | (offset & SWP_OFFSET_MASK); + return ret; +} + +/* Extract the `type' field from a swp_slot_t. */ +static inline unsigned swp_slot_type(swp_slot_t slot) +{ + return (slot.val >> SWP_TYPE_SHIFT); +} + +/* Extract the `offset' field from a swp_slot_t. 
*/ +static inline pgoff_t swp_slot_offset(swp_slot_t slot) +{ + return slot.val & SWP_OFFSET_MASK; +} + #endif /* CONFIG_MMU */ #endif /* _LINUX_SWAPOPS_H */ diff --git a/kernel/power/swap.c b/kernel/power/swap.c index 82b884b67152..32b236a81dbb 100644 --- a/kernel/power/swap.c +++ b/kernel/power/swap.c @@ -178,10 +178,10 @@ sector_t alloc_swapdev_block(int swap) { unsigned long offset; - offset = swp_offset(get_swap_page_of_type(swap)); + offset = swp_slot_offset(swap_slot_alloc_of_type(swap)); if (offset) { if (swsusp_extents_insert(offset)) - swap_free(swp_entry(swap, offset)); + swap_slot_free_nr(swp_slot(swap, offset), 1); else return swapdev_block(swap, offset); } @@ -203,7 +203,7 @@ void free_all_swap_pages(int swap) ext = rb_entry(node, struct swsusp_extent, node); rb_erase(node, &swsusp_extents); - swap_free_nr(swp_entry(swap, ext->start), + swap_slot_free_nr(swp_slot(swap, ext->start), ext->end - ext->start + 1); kfree(ext); diff --git a/mm/internal.h b/mm/internal.h index 20b3535935a3..2d63f6537e35 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -275,9 +275,13 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, */ static inline pte_t pte_move_swp_offset(pte_t pte, long delta) { - swp_entry_t entry = pte_to_swp_entry(pte); - pte_t new = __swp_entry_to_pte(__swp_entry(swp_type(entry), - (swp_offset(entry) + delta))); + swp_entry_t entry = pte_to_swp_entry(pte), new_entry; + swp_slot_t slot = swp_entry_to_swp_slot(entry); + pte_t new; + + new_entry = swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), + swp_slot_offset(slot) + delta)); + new = swp_entry_to_pte(new_entry); if (pte_swp_soft_dirty(pte)) new = pte_swp_mksoft_dirty(new); diff --git a/mm/memory.c b/mm/memory.c index e92914df5ca7..c44e845b5320 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4125,8 +4125,9 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf) #ifdef CONFIG_TRANSPARENT_HUGEPAGE static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { - struct swap_info_struct *si = swp_swap_info(entry); - pgoff_t offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + struct swap_info_struct *si = swap_slot_swap_info(slot); + pgoff_t offset = swp_slot_offset(slot); int i; /* @@ -4308,6 +4309,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) bool swapoff_locked = false; bool exclusive = false; swp_entry_t entry; + swp_slot_t slot; pte_t pte; vm_fault_t ret = 0; void *shadow = NULL; @@ -4369,6 +4371,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) swapoff_locked = trylock_swapoff(entry, &si); if (unlikely(!swapoff_locked)) goto out; + slot = swp_entry_to_swp_slot(entry); folio = swap_cache_get_folio(entry, vma, vmf->address); if (folio) diff --git a/mm/page_io.c b/mm/page_io.c index 9b983de351f9..182851c47f43 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -204,14 +204,17 @@ static bool is_folio_zero_filled(struct folio *folio) static void swap_zeromap_folio_set(struct folio *folio) { struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = + swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); int nr_pages = folio_nr_pages(folio); swp_entry_t entry; + swp_slot_t slot; unsigned int i; for (i = 0; i < folio_nr_pages(folio); i++) { entry = page_swap_entry(folio_page(folio, i)); - set_bit(swp_offset(entry), sis->zeromap); + slot = swp_entry_to_swp_slot(entry); + set_bit(swp_slot_offset(slot), sis->zeromap); } count_vm_events(SWPOUT_ZERO, nr_pages); @@ 
-223,13 +226,16 @@ static void swap_zeromap_folio_set(struct folio *folio) static void swap_zeromap_folio_clear(struct folio *folio) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = + swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); swp_entry_t entry; + swp_slot_t slot; unsigned int i; for (i = 0; i < folio_nr_pages(folio); i++) { entry = page_swap_entry(folio_page(folio, i)); - clear_bit(swp_offset(entry), sis->zeromap); + slot = swp_entry_to_swp_slot(entry); + clear_bit(swp_slot_offset(slot), sis->zeromap); } } @@ -358,7 +364,8 @@ static void sio_write_complete(struct kiocb *iocb, long ret) * messages. */ pr_err_ratelimited("Write error %ld on dio swapfile (%llu)\n", - ret, swap_dev_pos(page_swap_entry(page))); + ret, + swap_slot_pos(swp_entry_to_swp_slot(page_swap_entry(page)))); for (p = 0; p < sio->pages; p++) { page = sio->bvec[p].bv_page; set_page_dirty(page); @@ -374,10 +381,11 @@ static void sio_write_complete(struct kiocb *iocb, long ret) static void swap_writepage_fs(struct folio *folio, struct writeback_control *wbc) { + swp_slot_t slot = swp_entry_to_swp_slot(folio->swap); struct swap_iocb *sio = NULL; - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = swap_slot_swap_info(slot); struct file *swap_file = sis->swap_file; - loff_t pos = swap_dev_pos(folio->swap); + loff_t pos = swap_slot_pos(slot); count_swpout_vm_event(folio); folio_start_writeback(folio); @@ -452,7 +460,8 @@ static void swap_writepage_bdev_async(struct folio *folio, void __swap_writepage(struct folio *folio, struct writeback_control *wbc) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = + swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio); /* @@ -543,9 +552,10 @@ static bool swap_read_folio_zeromap(struct folio *folio) static void swap_read_folio_fs(struct folio *folio, struct swap_iocb **plug) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_slot_t slot = swp_entry_to_swp_slot(folio->swap); + struct swap_info_struct *sis = swap_slot_swap_info(slot); struct swap_iocb *sio = NULL; - loff_t pos = swap_dev_pos(folio->swap); + loff_t pos = swap_slot_pos(slot); if (plug) sio = *plug; @@ -614,7 +624,8 @@ static void swap_read_folio_bdev_async(struct folio *folio, void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + struct swap_info_struct *sis = + swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO; bool workingset = folio_test_workingset(folio); unsigned long pflags; diff --git a/mm/shmem.c b/mm/shmem.c index 8ef72dcc592e..f8efa49eb499 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1387,6 +1387,7 @@ static int shmem_find_swap_entries(struct address_space *mapping, XA_STATE(xas, &mapping->i_pages, start); struct folio *folio; swp_entry_t entry; + swp_slot_t slot; rcu_read_lock(); xas_for_each(&xas, folio, ULONG_MAX) { @@ -1397,11 +1398,13 @@ static int shmem_find_swap_entries(struct address_space *mapping, continue; entry = radix_to_swp_entry(folio); + slot = swp_entry_to_swp_slot(entry); + /* * swapin error entries can be found in the mapping. But they're * deliberately ignored here as we've done everything we can do. 
*/ - if (swp_type(entry) != type) + if (swp_slot_type(slot) != type) continue; indices[folio_batch_count(fbatch)] = xas.xa_index; @@ -1619,7 +1622,6 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) if (!swap.val) { if (nr_pages > 1) goto try_split; - goto redirty; } @@ -2164,6 +2166,7 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, XA_STATE_ORDER(xas, &mapping->i_pages, index, 0); void *alloced_shadow = NULL; int alloced_order = 0, i; + swp_slot_t slot = swp_entry_to_swp_slot(swap); /* Convert user data gfp flags to xarray node gfp flags */ gfp &= GFP_RECLAIM_MASK; @@ -2202,11 +2205,14 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, */ for (i = 0; i < 1 << order; i++) { pgoff_t aligned_index = round_down(index, 1 << order); - swp_entry_t tmp; + swp_entry_t tmp_entry; + swp_slot_t tmp_slot; - tmp = swp_entry(swp_type(swap), swp_offset(swap) + i); + tmp_slot = + swp_slot(swp_slot_type(slot), swp_slot_offset(slot) + i); + tmp_entry = swp_slot_to_swp_entry(tmp_slot); __xa_store(&mapping->i_pages, aligned_index + i, - swp_to_radix_entry(tmp), 0); + swp_to_radix_entry(tmp_entry), 0); } } @@ -2253,10 +2259,12 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, struct folio *folio = NULL; bool skip_swapcache = false; swp_entry_t swap; + swp_slot_t slot; int error, nr_pages, order, split_order; VM_BUG_ON(!*foliop || !xa_is_value(*foliop)); swap = radix_to_swp_entry(*foliop); + slot = swp_entry_to_swp_slot(swap); *foliop = NULL; if (is_poisoned_swp_entry(swap)) @@ -2328,7 +2336,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (split_order > 0) { pgoff_t offset = index - round_down(index, 1 << split_order); - swap = swp_entry(swp_type(swap), swp_offset(swap) + offset); + swap = swp_slot_to_swp_entry(swp_slot( + swp_slot_type(slot), swp_slot_offset(slot) + offset)); } /* Here we actually start the io */ diff --git a/mm/swap.h b/mm/swap.h index ad2f121de970..d5f8effa8015 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -32,12 +32,10 @@ extern struct address_space *swapper_spaces[]; (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ >> SWAP_ADDRESS_SPACE_SHIFT]) -/* - * Return the swap device position of the swap entry. - */ -static inline loff_t swap_dev_pos(swp_entry_t entry) +/* Return the swap device position of the swap slot. 
*/ +static inline loff_t swap_slot_pos(swp_slot_t slot) { - return ((loff_t)swp_offset(entry)) << PAGE_SHIFT; + return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT; } /* @@ -78,7 +76,9 @@ struct folio *swapin_readahead(swp_entry_t entry, gfp_t flag, static inline unsigned int folio_swap_flags(struct folio *folio) { - return swp_swap_info(folio->swap)->flags; + swp_slot_t swp_slot = swp_entry_to_swp_slot(folio->swap); + + return swap_slot_swap_info(swp_slot)->flags; } /* @@ -89,8 +89,9 @@ static inline unsigned int folio_swap_flags(struct folio *folio) static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap) { - struct swap_info_struct *sis = swp_swap_info(entry); - unsigned long start = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + struct swap_info_struct *sis = swap_slot_swap_info(slot); + unsigned long start = swp_slot_offset(slot); unsigned long end = start + max_nr; bool first_bit; diff --git a/mm/swap_cgroup.c b/mm/swap_cgroup.c index 1007c30f12e2..5e4c91d694a0 100644 --- a/mm/swap_cgroup.c +++ b/mm/swap_cgroup.c @@ -65,11 +65,12 @@ void swap_cgroup_record(struct folio *folio, unsigned short id, swp_entry_t ent) { unsigned int nr_ents = folio_nr_pages(folio); + swp_slot_t slot = swp_entry_to_swp_slot(ent); struct swap_cgroup *map; pgoff_t offset, end; unsigned short old; - offset = swp_offset(ent); + offset = swp_slot_offset(slot); end = offset + nr_ents; map = swap_cgroup_ctrl[swp_type(ent)].map; @@ -92,12 +93,12 @@ void swap_cgroup_record(struct folio *folio, unsigned short id, */ unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents) { - pgoff_t offset = swp_offset(ent); + swp_slot_t slot = swp_entry_to_swp_slot(ent); + pgoff_t offset = swp_slot_offset(slot); pgoff_t end = offset + nr_ents; struct swap_cgroup *map; unsigned short old, iter = 0; - offset = swp_offset(ent); end = offset + nr_ents; map = swap_cgroup_ctrl[swp_type(ent)].map; @@ -120,12 +121,13 @@ unsigned short swap_cgroup_clear(swp_entry_t ent, unsigned int nr_ents) unsigned short lookup_swap_cgroup_id(swp_entry_t ent) { struct swap_cgroup_ctrl *ctrl; + swp_slot_t slot = swp_entry_to_swp_slot(ent); if (mem_cgroup_disabled()) return 0; ctrl = &swap_cgroup_ctrl[swp_type(ent)]; - return __swap_cgroup_id_lookup(ctrl->map, swp_offset(ent)); + return __swap_cgroup_id_lookup(ctrl->map, swp_slot_offset(slot)); } int swap_cgroup_swapon(int type, unsigned long max_pages) diff --git a/mm/swap_slots.c b/mm/swap_slots.c index 9c7c171df7ba..4ec2de0c2756 100644 --- a/mm/swap_slots.c +++ b/mm/swap_slots.c @@ -111,14 +111,14 @@ static bool check_cache_active(void) static int alloc_swap_slot_cache(unsigned int cpu) { struct swap_slots_cache *cache; - swp_entry_t *slots; + swp_slot_t *slots; /* * Do allocation outside swap_slots_cache_mutex * as kvzalloc could trigger reclaim and folio_alloc_swap, * which can lock swap_slots_cache_mutex. 
*/ - slots = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_entry_t), + slots = kvcalloc(SWAP_SLOTS_CACHE_SIZE, sizeof(swp_slot_t), GFP_KERNEL); if (!slots) return -ENOMEM; @@ -160,7 +160,7 @@ static void drain_slots_cache_cpu(unsigned int cpu, bool free_slots) cache = &per_cpu(swp_slots, cpu); if (cache->slots) { mutex_lock(&cache->alloc_lock); - swapcache_free_entries(cache->slots + cache->cur, cache->nr); + swap_slot_cache_free_slots(cache->slots + cache->cur, cache->nr); cache->cur = 0; cache->nr = 0; if (free_slots && cache->slots) { @@ -238,22 +238,22 @@ static int refill_swap_slots_cache(struct swap_slots_cache *cache) cache->cur = 0; if (swap_slot_cache_active) - cache->nr = get_swap_pages(SWAP_SLOTS_CACHE_SIZE, + cache->nr = swap_slot_alloc(SWAP_SLOTS_CACHE_SIZE, cache->slots, 0); return cache->nr; } -swp_entry_t folio_alloc_swap(struct folio *folio) +swp_slot_t folio_alloc_swap_slot(struct folio *folio) { - swp_entry_t entry; + swp_slot_t slot; struct swap_slots_cache *cache; - entry.val = 0; + slot.val = 0; if (folio_test_large(folio)) { if (IS_ENABLED(CONFIG_THP_SWAP)) - get_swap_pages(1, &entry, folio_order(folio)); + swap_slot_alloc(1, &slot, folio_order(folio)); goto out; } @@ -273,7 +273,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio) if (cache->slots) { repeat: if (cache->nr) { - entry = cache->slots[cache->cur]; + slot = cache->slots[cache->cur]; cache->slots[cache->cur++].val = 0; cache->nr--; } else if (refill_swap_slots_cache(cache)) { @@ -281,15 +281,11 @@ swp_entry_t folio_alloc_swap(struct folio *folio) } } mutex_unlock(&cache->alloc_lock); - if (entry.val) + if (slot.val) goto out; } - get_swap_pages(1, &entry, 0); + swap_slot_alloc(1, &slot, 0); out: - if (mem_cgroup_try_charge_swap(folio, entry)) { - put_swap_folio(folio, entry); - entry.val = 0; - } - return entry; + return slot; } diff --git a/mm/swap_state.c b/mm/swap_state.c index 81f69b2df550..055e555d3382 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -167,6 +167,19 @@ void __delete_from_swap_cache(struct folio *folio, __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr); } +swp_entry_t folio_alloc_swap(struct folio *folio) +{ + swp_slot_t slot = folio_alloc_swap_slot(folio); + swp_entry_t entry = swp_slot_to_swp_entry(slot); + + if (entry.val && mem_cgroup_try_charge_swap(folio, entry)) { + put_swap_folio(folio, entry); + entry.val = 0; + } + + return entry; +} + /** * add_to_swap - allocate swap space for a folio * @folio: folio we want to move to swap @@ -548,7 +561,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * A failure return means that either the page allocation failed or that * the swap entry is no longer in use. * - * get/put_swap_device() aren't needed to call this function, because + * swap_slot_(get|put)_swap_info() aren't needed to call this function, because * __read_swap_cache_async() call them and swap_read_folio() holds the * swap cache folio lock. 
*/ @@ -654,11 +667,12 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, struct mempolicy *mpol, pgoff_t ilx) { struct folio *folio; - unsigned long entry_offset = swp_offset(entry); - unsigned long offset = entry_offset; + swp_slot_t slot = swp_entry_to_swp_slot(entry); + unsigned long slot_offset = swp_slot_offset(slot); + unsigned long offset = slot_offset; unsigned long start_offset, end_offset; unsigned long mask; - struct swap_info_struct *si = swp_swap_info(entry); + struct swap_info_struct *si = swap_slot_swap_info(slot); struct blk_plug plug; struct swap_iocb *splug = NULL; bool page_allocated; @@ -679,13 +693,13 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, for (offset = start_offset; offset <= end_offset ; offset++) { /* Ok, do the async read-ahead now */ folio = __read_swap_cache_async( - swp_entry(swp_type(entry), offset), + swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), offset)), gfp_mask, mpol, ilx, &page_allocated, false); if (!folio) continue; if (page_allocated) { swap_read_folio(folio, &splug); - if (offset != entry_offset) { + if (offset != slot_offset) { folio_set_readahead(folio); count_vm_event(SWAP_RA); } diff --git a/mm/swapfile.c b/mm/swapfile.c index 27cf985e08ac..a1dd7e998e90 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -53,9 +53,9 @@ static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); static void free_swap_count_continuations(struct swap_info_struct *); -static void swap_entry_range_free(struct swap_info_struct *si, +static void swap_slot_range_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages); + swp_slot_t slot, unsigned int nr_pages); static void swap_range_alloc(struct swap_info_struct *si, unsigned int nr_entries); static bool folio_swapcache_freeable(struct folio *folio); @@ -203,7 +203,8 @@ static bool swap_is_last_map(struct swap_info_struct *si, static int __try_to_reclaim_swap(struct swap_info_struct *si, unsigned long offset, unsigned long flags) { - swp_entry_t entry = swp_entry(si->type, offset); + swp_entry_t entry = swp_slot_to_swp_entry(swp_slot(si->type, offset)); + swp_slot_t slot; struct address_space *address_space = swap_address_space(entry); struct swap_cluster_info *ci; struct folio *folio; @@ -229,7 +230,8 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, /* offset could point to the middle of a large folio */ entry = folio->swap; - offset = swp_offset(entry); + slot = swp_entry_to_swp_slot(entry); + offset = swp_slot_offset(slot); need_reclaim = ((flags & TTRS_ANYWAY) || ((flags & TTRS_UNMAPPED) && !folio_mapped(folio)) || @@ -263,7 +265,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, folio_set_dirty(folio); ci = lock_cluster(si, offset); - swap_entry_range_free(si, ci, entry, nr_pages); + swap_slot_range_free(si, ci, slot, nr_pages); unlock_cluster(ci); ret = nr_pages; out_unlock: @@ -344,12 +346,12 @@ offset_to_swap_extent(struct swap_info_struct *sis, unsigned long offset) sector_t swap_folio_sector(struct folio *folio) { - struct swap_info_struct *sis = swp_swap_info(folio->swap); + swp_slot_t slot = swp_entry_to_swp_slot(folio->swap); + struct swap_info_struct *sis = swap_slot_swap_info(slot); struct swap_extent *se; sector_t sector; - pgoff_t offset; + pgoff_t offset = swp_slot_offset(slot); - offset = swp_offset(folio->swap); se = offset_to_swap_extent(sis, offset); sector = se->start_block + (offset - se->start_page); return sector << 
(PAGE_SHIFT - 9); @@ -387,15 +389,15 @@ static void discard_swap_cluster(struct swap_info_struct *si, #ifdef CONFIG_THP_SWAP #define SWAPFILE_CLUSTER HPAGE_PMD_NR -#define swap_entry_order(order) (order) +#define swap_slot_order(order) (order) #else #define SWAPFILE_CLUSTER 256 /* - * Define swap_entry_order() as constant to let compiler to optimize + * Define swap_slot_order() as constant to let compiler to optimize * out some code if !CONFIG_THP_SWAP */ -#define swap_entry_order(order) 0 +#define swap_slot_order(order) 0 #endif #define LATENCY_LIMIT 256 @@ -779,7 +781,7 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si, unsigned int order, unsigned char usage) { - unsigned int next = SWAP_ENTRY_INVALID, found = SWAP_ENTRY_INVALID; + unsigned int next = SWAP_SLOT_INVALID, found = SWAP_SLOT_INVALID; unsigned long start = ALIGN_DOWN(offset, SWAPFILE_CLUSTER); unsigned long end = min(start + SWAPFILE_CLUSTER, si->max); unsigned int nr_pages = 1 << order; @@ -883,7 +885,7 @@ static void swap_reclaim_work(struct work_struct *work) * pool (a cluster). This might involve allocating a new cluster for current CPU * too. */ -static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si, int order, +static unsigned long cluster_alloc_swap_slot(struct swap_info_struct *si, int order, unsigned char usage) { struct swap_cluster_info *ci; @@ -1135,7 +1137,7 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, */ for (i = 0; i < nr_entries; i++) { clear_bit(offset + i, si->zeromap); - zswap_invalidate(swp_entry(si->type, offset + i)); + zswap_invalidate(swp_slot_to_swp_entry(swp_slot(si->type, offset + i))); } if (si->flags & SWP_BLKDEV) @@ -1162,16 +1164,16 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, static int cluster_alloc_swap(struct swap_info_struct *si, unsigned char usage, int nr, - swp_entry_t slots[], int order) + swp_slot_t slots[], int order) { int n_ret = 0; while (n_ret < nr) { - unsigned long offset = cluster_alloc_swap_entry(si, order, usage); + unsigned long offset = cluster_alloc_swap_slot(si, order, usage); if (!offset) break; - slots[n_ret++] = swp_entry(si->type, offset); + slots[n_ret++] = swp_slot(si->type, offset); } return n_ret; @@ -1179,7 +1181,7 @@ static int cluster_alloc_swap(struct swap_info_struct *si, static int scan_swap_map_slots(struct swap_info_struct *si, unsigned char usage, int nr, - swp_entry_t slots[], int order) + swp_slot_t slots[], int order) { unsigned int nr_pages = 1 << order; @@ -1231,9 +1233,9 @@ static bool get_swap_device_info(struct swap_info_struct *si) return true; } -int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) +int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order) { - int order = swap_entry_order(entry_order); + int order = swap_slot_order(entry_order); unsigned long size = 1 << order; struct swap_info_struct *si, *next; long avail_pgs; @@ -1260,8 +1262,8 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, - n_goal, swp_entries, order); - put_swap_device(si); + n_goal, swp_slots, order); + swap_slot_put_swap_info(si); if (n_ret || size > 1) goto check_out; } @@ -1292,36 +1294,36 @@ int get_swap_pages(int n_goal, swp_entry_t swp_entries[], int entry_order) return n_ret; } -static struct swap_info_struct *_swap_info_get(swp_entry_t entry) +static struct 
swap_info_struct *_swap_info_get(swp_slot_t slot) { struct swap_info_struct *si; unsigned long offset; - if (!entry.val) + if (!slot.val) goto out; - si = swp_swap_info(entry); + si = swap_slot_swap_info(slot); if (!si) goto bad_nofile; if (data_race(!(si->flags & SWP_USED))) goto bad_device; - offset = swp_offset(entry); + offset = swp_slot_offset(slot); if (offset >= si->max) goto bad_offset; - if (data_race(!si->swap_map[swp_offset(entry)])) + if (data_race(!si->swap_map[swp_slot_offset(slot)])) goto bad_free; return si; bad_free: - pr_err("%s: %s%08lx\n", __func__, Unused_offset, entry.val); + pr_err("%s: %s%08lx\n", __func__, Unused_offset, slot.val); goto out; bad_offset: - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); + pr_err("%s: %s%08lx\n", __func__, Bad_offset, slot.val); goto out; bad_device: - pr_err("%s: %s%08lx\n", __func__, Unused_file, entry.val); + pr_err("%s: %s%08lx\n", __func__, Unused_file, slot.val); goto out; bad_nofile: - pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val); + pr_err("%s: %s%08lx\n", __func__, Bad_file, slot.val); out: return NULL; } @@ -1331,8 +1333,9 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) * prevent swapoff, such as the folio in swap cache is locked, RCU * reader side is locked, etc., the swap entry may become invalid * because of swapoff. Then, we need to enclose all swap related - * functions with get_swap_device() and put_swap_device(), unless the - * swap functions call get/put_swap_device() by themselves. + * functions with swap_slot_tryget_swap_info() and + * swap_slot_put_swap_info(), unless the swap functions call + * swap_slot_(tryget|put)_swap_info by themselves. * * RCU reader side lock (including any spinlock) is sufficient to * prevent swapoff, because synchronize_rcu() is called in swapoff() @@ -1341,11 +1344,11 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) * Check whether swap entry is valid in the swap device. If so, * return pointer to swap_info_struct, and keep the swap entry valid * via preventing the swap device from being swapoff, until - * put_swap_device() is called. Otherwise return NULL. + * swap_slot_put_swap_info() is called. Otherwise return NULL. * * Notice that swapoff or swapoff+swapon can still happen before the - * percpu_ref_tryget_live() in get_swap_device() or after the - * percpu_ref_put() in put_swap_device() if there isn't any other way + * percpu_ref_tryget_live() in swap_slot_tryget_swap_info() or after the + * percpu_ref_put() in swap_slot_put_swap_info() if there isn't any other way * to prevent swapoff. The caller must be prepared for that. For * example, the following situation is possible. * @@ -1365,34 +1368,34 @@ static struct swap_info_struct *_swap_info_get(swp_entry_t entry) * changed with the page table locked to check whether the swap device * has been swapoff or swapoff+swapon. 
*/ -struct swap_info_struct *get_swap_device(swp_entry_t entry) +struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot) { struct swap_info_struct *si; unsigned long offset; - if (!entry.val) + if (!slot.val) goto out; - si = swp_swap_info(entry); + si = swap_slot_swap_info(slot); if (!si) goto bad_nofile; if (!get_swap_device_info(si)) goto out; - offset = swp_offset(entry); + offset = swp_slot_offset(slot); if (offset >= si->max) goto put_out; return si; bad_nofile: - pr_err("%s: %s%08lx\n", __func__, Bad_file, entry.val); + pr_err("%s: %s%08lx\n", __func__, Bad_file, slot.val); out: return NULL; put_out: - pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val); + pr_err("%s: %s%08lx\n", __func__, Bad_offset, slot.val); percpu_ref_put(&si->users); return NULL; } -static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, +static unsigned char __swap_slot_free_locked(struct swap_info_struct *si, unsigned long offset, unsigned char usage) { @@ -1432,27 +1435,27 @@ static unsigned char __swap_entry_free_locked(struct swap_info_struct *si, return usage; } -static unsigned char __swap_entry_free(struct swap_info_struct *si, - swp_entry_t entry) +static unsigned char __swap_slot_free(struct swap_info_struct *si, + swp_slot_t slot) { struct swap_cluster_info *ci; - unsigned long offset = swp_offset(entry); + unsigned long offset = swp_slot_offset(slot); unsigned char usage; ci = lock_cluster(si, offset); - usage = __swap_entry_free_locked(si, offset, 1); + usage = __swap_slot_free_locked(si, offset, 1); if (!usage) - swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + swap_slot_range_free(si, ci, swp_slot(si->type, offset), 1); unlock_cluster(ci); return usage; } -static bool __swap_entries_free(struct swap_info_struct *si, - swp_entry_t entry, int nr) +static bool __swap_slots_free(struct swap_info_struct *si, + swp_slot_t slot, int nr) { - unsigned long offset = swp_offset(entry); - unsigned int type = swp_type(entry); + unsigned long offset = swp_slot_offset(slot); + unsigned int type = swp_slot_type(slot); struct swap_cluster_info *ci; bool has_cache = false; unsigned char count; @@ -1472,7 +1475,7 @@ static bool __swap_entries_free(struct swap_info_struct *si, for (i = 0; i < nr; i++) WRITE_ONCE(si->swap_map[offset + i], SWAP_HAS_CACHE); if (!has_cache) - swap_entry_range_free(si, ci, entry, nr); + swap_slot_range_free(si, ci, slot, nr); unlock_cluster(ci); return has_cache; @@ -1480,7 +1483,7 @@ static bool __swap_entries_free(struct swap_info_struct *si, fallback: for (i = 0; i < nr; i++) { if (data_race(si->swap_map[offset + i])) { - count = __swap_entry_free(si, swp_entry(type, offset + i)); + count = __swap_slot_free(si, swp_slot(type, offset + i)); if (count == SWAP_HAS_CACHE) has_cache = true; } else { @@ -1494,13 +1497,14 @@ static bool __swap_entries_free(struct swap_info_struct *si, * Drop the last HAS_CACHE flag of swap entries, caller have to * ensure all entries belong to the same cgroup. 
*/ -static void swap_entry_range_free(struct swap_info_struct *si, +static void swap_slot_range_free(struct swap_info_struct *si, struct swap_cluster_info *ci, - swp_entry_t entry, unsigned int nr_pages) + swp_slot_t slot, unsigned int nr_pages) { - unsigned long offset = swp_offset(entry); + unsigned long offset = swp_slot_offset(slot); unsigned char *map = si->swap_map + offset; unsigned char *map_end = map + nr_pages; + swp_entry_t entry = swp_slot_to_swp_entry(slot); /* It should never free entries across different clusters */ VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1)); @@ -1531,23 +1535,19 @@ static void cluster_swap_free_nr(struct swap_info_struct *si, ci = lock_cluster(si, offset); do { - if (!__swap_entry_free_locked(si, offset, usage)) - swap_entry_range_free(si, ci, swp_entry(si->type, offset), 1); + if (!__swap_slot_free_locked(si, offset, usage)) + swap_slot_range_free(si, ci, swp_slot(si->type, offset), 1); } while (++offset < end); unlock_cluster(ci); } -/* - * Caller has made sure that the swap device corresponding to entry - * is still around or has not been recycled. - */ -void swap_free_nr(swp_entry_t entry, int nr_pages) +void swap_slot_free_nr(swp_slot_t slot, int nr_pages) { int nr; struct swap_info_struct *sis; - unsigned long offset = swp_offset(entry); + unsigned long offset = swp_slot_offset(slot); - sis = _swap_info_get(entry); + sis = _swap_info_get(slot); if (!sis) return; @@ -1559,27 +1559,37 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) } } +/* + * Caller has made sure that the swap device corresponding to entry + * is still around or has not been recycled. + */ +void swap_free_nr(swp_entry_t entry, int nr_pages) +{ + swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages); +} + /* * Called after dropping swapcache to decrease refcnt to swap entries. 
*/ void put_swap_folio(struct folio *folio, swp_entry_t entry) { - unsigned long offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + unsigned long offset = swp_slot_offset(slot); struct swap_cluster_info *ci; struct swap_info_struct *si; - int size = 1 << swap_entry_order(folio_order(folio)); + int size = 1 << swap_slot_order(folio_order(folio)); - si = _swap_info_get(entry); + si = _swap_info_get(slot); if (!si) return; ci = lock_cluster(si, offset); if (swap_is_has_cache(si, offset, size)) - swap_entry_range_free(si, ci, entry, size); + swap_slot_range_free(si, ci, slot, size); else { - for (int i = 0; i < size; i++, entry.val++) { - if (!__swap_entry_free_locked(si, offset + i, SWAP_HAS_CACHE)) - swap_entry_range_free(si, ci, entry, 1); + for (int i = 0; i < size; i++, slot.val++) { + if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE)) + swap_slot_range_free(si, ci, slot, 1); } } unlock_cluster(ci); @@ -1587,8 +1597,9 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) int __swap_count(swp_entry_t entry) { - struct swap_info_struct *si = swp_swap_info(entry); - pgoff_t offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + struct swap_info_struct *si = swap_slot_swap_info(slot); + pgoff_t offset = swp_slot_offset(slot); return swap_count(si->swap_map[offset]); } @@ -1600,7 +1611,8 @@ int __swap_count(swp_entry_t entry) */ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry) { - pgoff_t offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + pgoff_t offset = swp_slot_offset(slot); struct swap_cluster_info *ci; int count; @@ -1616,6 +1628,7 @@ int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry) */ int swp_swapcount(swp_entry_t entry) { + swp_slot_t slot = swp_entry_to_swp_slot(entry); int count, tmp_count, n; struct swap_info_struct *si; struct swap_cluster_info *ci; @@ -1623,11 +1636,11 @@ int swp_swapcount(swp_entry_t entry) pgoff_t offset; unsigned char *map; - si = _swap_info_get(entry); + si = _swap_info_get(slot); if (!si) return 0; - offset = swp_offset(entry); + offset = swp_slot_offset(slot); ci = lock_cluster(si, offset); @@ -1659,10 +1672,11 @@ int swp_swapcount(swp_entry_t entry) static bool swap_page_trans_huge_swapped(struct swap_info_struct *si, swp_entry_t entry, int order) { + swp_slot_t slot = swp_entry_to_swp_slot(entry); struct swap_cluster_info *ci; unsigned char *map = si->swap_map; unsigned int nr_pages = 1 << order; - unsigned long roffset = swp_offset(entry); + unsigned long roffset = swp_slot_offset(slot); unsigned long offset = round_down(roffset, nr_pages); int i; bool ret = false; @@ -1687,7 +1701,8 @@ static bool swap_page_trans_huge_swapped(struct swap_info_struct *si, static bool folio_swapped(struct folio *folio) { swp_entry_t entry = folio->swap; - struct swap_info_struct *si = _swap_info_get(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + struct swap_info_struct *si = _swap_info_get(slot); if (!si) return false; @@ -1710,7 +1725,8 @@ static bool folio_swapped(struct folio *folio) */ void free_swap_and_cache_nr(swp_entry_t entry, int nr) { - const unsigned long start_offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + const unsigned long start_offset = swp_slot_offset(slot); const unsigned long end_offset = start_offset + nr; struct swap_info_struct *si; bool any_only_cache = false; @@ -1719,7 +1735,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) if (non_swap_entry(entry)) 
return; - si = get_swap_device(entry); + si = swap_slot_tryget_swap_info(slot); if (!si) return; @@ -1729,7 +1745,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) /* * First free all entries in the range. */ - any_only_cache = __swap_entries_free(si, entry, nr); + any_only_cache = __swap_slots_free(si, slot, nr); /* * Short-circuit the below loop if none of the entries had their @@ -1742,7 +1758,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) * Now go back over the range trying to reclaim the swap cache. This is * more efficient for large folios because we will only try to reclaim * the swap once per folio in the common case. If we do - * __swap_entry_free() and __try_to_reclaim_swap() in the same loop, the + * __swap_slot_free() and __try_to_reclaim_swap() in the same loop, the * latter will get a reference and lock the folio for every individual * page but will only succeed once the swap slot for every subpage is * zero. @@ -1769,10 +1785,10 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) } out: - put_swap_device(si); + swap_slot_put_swap_info(si); } -void swapcache_free_entries(swp_entry_t *entries, int n) +void swap_slot_cache_free_slots(swp_slot_t *slots, int n) { int i; struct swap_cluster_info *ci; @@ -1782,10 +1798,10 @@ void swapcache_free_entries(swp_entry_t *entries, int n) return; for (i = 0; i < n; ++i) { - si = _swap_info_get(entries[i]); + si = _swap_info_get(slots[i]); if (si) { - ci = lock_cluster(si, swp_offset(entries[i])); - swap_entry_range_free(si, ci, entries[i], 1); + ci = lock_cluster(si, swp_slot_offset(slots[i])); + swap_slot_range_free(si, ci, slots[i], 1); unlock_cluster(ci); } } @@ -1844,22 +1860,22 @@ bool folio_free_swap(struct folio *folio) #ifdef CONFIG_HIBERNATION -swp_entry_t get_swap_page_of_type(int type) +swp_slot_t swap_slot_alloc_of_type(int type) { struct swap_info_struct *si = swap_type_to_swap_info(type); - swp_entry_t entry = {0}; + swp_slot_t slot = {0}; if (!si) goto fail; /* This is called for allocating swap entry, not cache */ if (get_swap_device_info(si)) { - if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &entry, 0)) + if ((si->flags & SWP_WRITEOK) && scan_swap_map_slots(si, 1, 1, &slot, 0)) atomic_long_dec(&nr_swap_pages); - put_swap_device(si); + swap_slot_put_swap_info(si); } fail: - return entry; + return slot; } /* @@ -2081,6 +2097,7 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, unsigned long offset; unsigned char swp_count; swp_entry_t entry; + swp_slot_t slot; int ret; pte_t ptent; @@ -2096,10 +2113,12 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd, continue; entry = pte_to_swp_entry(ptent); - if (swp_type(entry) != type) + slot = swp_entry_to_swp_slot(entry); + + if (swp_slot_type(slot) != type) continue; - offset = swp_offset(entry); + offset = swp_slot_offset(slot); pte_unmap(pte); pte = NULL; @@ -2281,6 +2300,7 @@ static int try_to_unuse(unsigned int type) struct swap_info_struct *si = swap_info[type]; struct folio *folio; swp_entry_t entry; + swp_slot_t slot; unsigned int i; if (!swap_usage_in_pages(si)) @@ -2328,7 +2348,8 @@ static int try_to_unuse(unsigned int type) !signal_pending(current) && (i = find_next_to_unuse(si, i)) != 0) { - entry = swp_entry(type, i); + slot = swp_slot(type, i); + entry = swp_slot_to_swp_entry(slot); folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry)); if (IS_ERR(folio)) continue; @@ -2737,7 +2758,7 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile) 
reenable_swap_slots_cache_unlock(); /* - * Wait for swap operations protected by get/put_swap_device() + * Wait for swap operations protected by get/swap_slot_put_swap_info() * to complete. Because of synchronize_rcu() here, all swap * operations protected by RCU reader side lock (including any * spinlock) will be waited too. This makes it easy to @@ -3196,7 +3217,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si, cluster = per_cpu_ptr(si->percpu_cluster, cpu); for (i = 0; i < SWAP_NR_ORDERS; i++) - cluster->next[i] = SWAP_ENTRY_INVALID; + cluster->next[i] = SWAP_SLOT_INVALID; local_lock_init(&cluster->lock); } } else { @@ -3205,7 +3226,7 @@ static struct swap_cluster_info *setup_clusters(struct swap_info_struct *si, if (!si->global_cluster) goto err_free; for (i = 0; i < SWAP_NR_ORDERS; i++) - si->global_cluster->next[i] = SWAP_ENTRY_INVALID; + si->global_cluster->next[i] = SWAP_SLOT_INVALID; spin_lock_init(&si->global_cluster_lock); } @@ -3538,6 +3559,7 @@ void si_swapinfo(struct sysinfo *val) */ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) { + swp_slot_t slot = swp_entry_to_swp_slot(entry); struct swap_info_struct *si; struct swap_cluster_info *ci; unsigned long offset; @@ -3545,13 +3567,13 @@ static int __swap_duplicate(swp_entry_t entry, unsigned char usage, int nr) unsigned char has_cache; int err, i; - si = swp_swap_info(entry); + si = swap_slot_swap_info(slot); if (WARN_ON_ONCE(!si)) { pr_err("%s%08lx\n", Bad_file, entry.val); return -EINVAL; } - offset = swp_offset(entry); + offset = swp_slot_offset(slot); VM_WARN_ON(nr > SWAPFILE_CLUSTER - offset % SWAPFILE_CLUSTER); VM_WARN_ON(usage == 1 && nr > 1); ci = lock_cluster(si, offset); @@ -3653,14 +3675,15 @@ int swapcache_prepare(swp_entry_t entry, int nr) void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) { - unsigned long offset = swp_offset(entry); + swp_slot_t slot = swp_entry_to_swp_slot(entry); + unsigned long offset = swp_slot_offset(slot); cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); } -struct swap_info_struct *swp_swap_info(swp_entry_t entry) +struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot) { - return swap_type_to_swap_info(swp_type(entry)); + return swap_type_to_swap_info(swp_slot_type(slot)); } /* @@ -3668,7 +3691,8 @@ struct swap_info_struct *swp_swap_info(swp_entry_t entry) */ struct address_space *swapcache_mapping(struct folio *folio) { - return swp_swap_info(folio->swap)->swap_file->f_mapping; + return swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)) + ->swap_file->f_mapping; } EXPORT_SYMBOL_GPL(swapcache_mapping); @@ -3702,6 +3726,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) struct page *list_page; pgoff_t offset; unsigned char count; + swp_slot_t slot = swp_entry_to_swp_slot(entry); int ret = 0; /* @@ -3710,7 +3735,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) */ page = alloc_page(gfp_mask | __GFP_HIGHMEM); - si = get_swap_device(entry); + si = swap_slot_tryget_swap_info(slot); if (!si) { /* * An acceptable race has occurred since the failing @@ -3719,7 +3744,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) goto outer; } - offset = swp_offset(entry); + offset = swp_slot_offset(slot); ci = lock_cluster(si, offset); @@ -3782,7 +3807,7 @@ int add_swap_count_continuation(swp_entry_t entry, gfp_t gfp_mask) spin_unlock(&si->cont_lock); out: unlock_cluster(ci); - put_swap_device(si); + swap_slot_put_swap_info(si); outer: if (page) 
__free_page(page);

From patchwork Mon Apr 7 23:42:05 2025 X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 878968 From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org Subject: [RFC PATCH 04/14] mm: swap: swap cache support for virtualized swap Date: Mon, 7 Apr 2025 16:42:05 -0700 Message-ID: <20250407234223.1059191-5-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com> References: <20250407234223.1059191-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0

Currently, the swap cache code assumes that the swap space is of a fixed size. The virtual swap space is dynamically sized, so the existing partitioning code cannot be easily reused. Dynamic partitioning is planned, but for now keep the design simple and just use a flat swapcache for vswap. Since the vswap implementation has begun to diverge from the old one, we also introduce a new build config (CONFIG_VIRTUAL_SWAP). Users who do not select this config will get the old implementation, with no behavioral change.

Signed-off-by: Nhat Pham --- mm/Kconfig | 13 ++++++++++ mm/swap.h | 22 ++++++++++------ mm/swap_state.c | 68 +++++++++++++++++++++++++++++++++++++++++-------- 3 files changed, 85 insertions(+), 18 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 1b501db06417..1a6acdb64333 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -22,6 +22,19 @@ menuconfig SWAP used to provide more virtual memory than the actual RAM present in your computer. If unsure say Y. +config VIRTUAL_SWAP + bool "Swap space virtualization" + depends on SWAP + default n + help + When this is selected, the kernel is built with the new swap + design. This will allow us to decouple the swap backends + (zswap, on-disk swapfile, etc.), and save disk space when we + use zswap (or the zero-filled swap page optimization).
+ + There might be more lock contentions with heavy swap use, since + the swap cache is no longer range partitioned. + config ZSWAP bool "Compressed cache for swap pages" depends on SWAP diff --git a/mm/swap.h b/mm/swap.h index d5f8effa8015..06e20b1d79c4 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -22,22 +22,27 @@ void swap_write_unplug(struct swap_iocb *sio); int swap_writepage(struct page *page, struct writeback_control *wbc); void __swap_writepage(struct folio *folio, struct writeback_control *wbc); -/* linux/mm/swap_state.c */ -/* One swap address space for each 64M swap space */ +/* Return the swap device position of the swap slot. */ +static inline loff_t swap_slot_pos(swp_slot_t slot) +{ + return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT; +} + #define SWAP_ADDRESS_SPACE_SHIFT 14 #define SWAP_ADDRESS_SPACE_PAGES (1 << SWAP_ADDRESS_SPACE_SHIFT) #define SWAP_ADDRESS_SPACE_MASK (SWAP_ADDRESS_SPACE_PAGES - 1) + +/* linux/mm/swap_state.c */ +#ifdef CONFIG_VIRTUAL_SWAP +extern struct address_space *swap_address_space(swp_entry_t entry); +#define swap_cache_index(entry) entry.val +#else +/* One swap address space for each 64M swap space */ extern struct address_space *swapper_spaces[]; #define swap_address_space(entry) \ (&swapper_spaces[swp_type(entry)][swp_offset(entry) \ >> SWAP_ADDRESS_SPACE_SHIFT]) -/* Return the swap device position of the swap slot. */ -static inline loff_t swap_slot_pos(swp_slot_t slot) -{ - return ((loff_t)swp_slot_offset(slot)) << PAGE_SHIFT; -} - /* * Return the swap cache index of the swap entry. */ @@ -46,6 +51,7 @@ static inline pgoff_t swap_cache_index(swp_entry_t entry) BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK); return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK; } +#endif void show_swap_cache_info(void); bool add_to_swap(struct folio *folio); diff --git a/mm/swap_state.c b/mm/swap_state.c index 055e555d3382..268338a0ea57 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -38,10 +38,19 @@ static const struct address_space_operations swap_aops = { #endif }; +#ifdef CONFIG_VIRTUAL_SWAP +static struct address_space swapper_space __read_mostly; + +struct address_space *swap_address_space(swp_entry_t entry) +{ + return &swapper_space; +} +#else struct address_space *swapper_spaces[MAX_SWAPFILES] __read_mostly; static unsigned int nr_swapper_spaces[MAX_SWAPFILES] __read_mostly; -static bool enable_vma_readahead __read_mostly = true; +#endif +static bool enable_vma_readahead __read_mostly = true; #define SWAP_RA_ORDER_CEILING 5 #define SWAP_RA_WIN_SHIFT (PAGE_SHIFT / 2) @@ -260,6 +269,28 @@ void delete_from_swap_cache(struct folio *folio) folio_ref_sub(folio, folio_nr_pages(folio)); } +#ifdef CONFIG_VIRTUAL_SWAP +void clear_shadow_from_swap_cache(int type, unsigned long begin, + unsigned long end) +{ + swp_slot_t slot = swp_slot(type, begin); + swp_entry_t entry = swp_slot_to_swp_entry(slot); + unsigned long index = swap_cache_index(entry); + struct address_space *address_space = swap_address_space(entry); + void *old; + XA_STATE(xas, &address_space->i_pages, index); + + xas_set_update(&xas, workingset_update_node); + + xa_lock_irq(&address_space->i_pages); + xas_for_each(&xas, old, entry.val + end - begin) { + if (!xa_is_value(old)) + continue; + xas_store(&xas, NULL); + } + xa_unlock_irq(&address_space->i_pages); +} +#else void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end) { @@ -290,6 +321,7 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin, break; } } +#endif /* * If we are 
the only user, then try to free up the swap cache. @@ -718,23 +750,34 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, return folio; } +static void init_swapper_space(struct address_space *space) +{ + xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ); + atomic_set(&space->i_mmap_writable, 0); + space->a_ops = &swap_aops; + /* swap cache doesn't use writeback related tags */ + mapping_set_no_writeback_tags(space); +} + +#ifdef CONFIG_VIRTUAL_SWAP int init_swap_address_space(unsigned int type, unsigned long nr_pages) { - struct address_space *spaces, *space; + return 0; +} + +void exit_swap_address_space(unsigned int type) {} +#else +int init_swap_address_space(unsigned int type, unsigned long nr_pages) +{ + struct address_space *spaces; unsigned int i, nr; nr = DIV_ROUND_UP(nr_pages, SWAP_ADDRESS_SPACE_PAGES); spaces = kvcalloc(nr, sizeof(struct address_space), GFP_KERNEL); if (!spaces) return -ENOMEM; - for (i = 0; i < nr; i++) { - space = spaces + i; - xa_init_flags(&space->i_pages, XA_FLAGS_LOCK_IRQ); - atomic_set(&space->i_mmap_writable, 0); - space->a_ops = &swap_aops; - /* swap cache doesn't use writeback related tags */ - mapping_set_no_writeback_tags(space); - } + for (i = 0; i < nr; i++) + init_swapper_space(spaces + i); nr_swapper_spaces[type] = nr; swapper_spaces[type] = spaces; @@ -752,6 +795,7 @@ void exit_swap_address_space(unsigned int type) nr_swapper_spaces[type] = 0; swapper_spaces[type] = NULL; } +#endif static int swap_vma_ra_win(struct vm_fault *vmf, unsigned long *start, unsigned long *end) @@ -930,6 +974,10 @@ static int __init swap_init_sysfs(void) int err; struct kobject *swap_kobj; +#ifdef CONFIG_VIRTUAL_SWAP + init_swapper_space(&swapper_space); +#endif + swap_kobj = kobject_create_and_add("swap", mm_kobj); if (!swap_kobj) { pr_err("failed to create swap kobject\n");

From patchwork Mon Apr 7 23:42:06 2025 X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 879275 From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org Subject: [RFC
PATCH 05/14] zswap: unify zswap tree for virtualized swap Date: Mon, 7 Apr 2025 16:42:06 -0700 Message-ID: <20250407234223.1059191-6-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com> References: <20250407234223.1059191-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Similar to swap cache, the zswap tree code, specifically the range partition logic, can no longer easily be reused for the new virtual swap space design. Use a simple unified zswap tree in the new implementation for now. As in the case of swap cache, range partitioning is planned as a follow up work. Signed-off-by: Nhat Pham --- mm/zswap.c | 38 ++++++++++++++++++++++++++++++++------ 1 file changed, 32 insertions(+), 6 deletions(-) diff --git a/mm/zswap.c b/mm/zswap.c index 23365e76a3ce..c1327569ce80 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -203,8 +203,6 @@ struct zswap_entry { struct list_head lru; }; -static struct xarray *zswap_trees[MAX_SWAPFILES]; -static unsigned int nr_zswap_trees[MAX_SWAPFILES]; /* RCU-protected iteration */ static LIST_HEAD(zswap_pools); @@ -231,12 +229,28 @@ static bool zswap_has_pool; * helpers and fwd declarations **********************************/ +#ifdef CONFIG_VIRTUAL_SWAP +static DEFINE_XARRAY(zswap_tree); + +static inline struct xarray *swap_zswap_tree(swp_entry_t swp) +{ + return &zswap_tree; +} + +#define zswap_tree_index(entry) entry.val +#else +static struct xarray *zswap_trees[MAX_SWAPFILES]; +static unsigned int nr_zswap_trees[MAX_SWAPFILES]; + static inline struct xarray *swap_zswap_tree(swp_entry_t swp) { return &zswap_trees[swp_type(swp)][swp_offset(swp) >> SWAP_ADDRESS_SPACE_SHIFT]; } +#define zswap_tree_index(entry) swp_offset(entry) +#endif + #define zswap_pool_debug(msg, p) \ pr_debug("%s pool %s/%s\n", msg, (p)->tfm_name, \ zpool_get_type((p)->zpool)) @@ -1047,7 +1061,7 @@ static int zswap_writeback_entry(struct zswap_entry *entry, swp_entry_t swpentry) { struct xarray *tree; - pgoff_t offset = swp_offset(swpentry); + pgoff_t offset = zswap_tree_index(swpentry); struct folio *folio; struct mempolicy *mpol; bool folio_was_allocated; @@ -1463,7 +1477,7 @@ static bool zswap_store_page(struct page *page, goto compress_failed; old = xa_store(swap_zswap_tree(page_swpentry), - swp_offset(page_swpentry), + zswap_tree_index(page_swpentry), entry, GFP_KERNEL); if (xa_is_err(old)) { int err = xa_err(old); @@ -1612,7 +1626,7 @@ bool zswap_store(struct folio *folio) bool zswap_load(struct folio *folio) { swp_entry_t swp = folio->swap; - pgoff_t offset = swp_offset(swp); + pgoff_t offset = zswap_tree_index(swp); bool swapcache = folio_test_swapcache(folio); struct xarray *tree = swap_zswap_tree(swp); struct zswap_entry *entry; @@ -1670,7 +1684,7 @@ bool zswap_load(struct folio *folio) void zswap_invalidate(swp_entry_t swp) { - pgoff_t offset = swp_offset(swp); + pgoff_t offset = zswap_tree_index(swp); struct xarray *tree = swap_zswap_tree(swp); struct zswap_entry *entry; @@ -1682,6 +1696,16 @@ void zswap_invalidate(swp_entry_t swp) zswap_entry_free(entry); } +#ifdef CONFIG_VIRTUAL_SWAP +int zswap_swapon(int type, unsigned long nr_pages) +{ + return 0; +} + +void zswap_swapoff(int type) +{ +} +#else int zswap_swapon(int type, unsigned long nr_pages) { struct xarray *trees, *tree; @@ -1718,6 +1742,8 @@ void zswap_swapoff(int type) nr_zswap_trees[type] = 0; zswap_trees[type] = NULL; } +#endif /* CONFIG_VIRTUAL_SWAP */ + 
/********************************* * debugfs functions

From patchwork Mon Apr 7 23:42:07 2025 X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 878966 From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org Subject: [RFC PATCH 06/14] mm: swap: allocate a virtual swap slot for each swapped out page Date: Mon, 7 Apr 2025 16:42:07 -0700 Message-ID: <20250407234223.1059191-7-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com> References: <20250407234223.1059191-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0

For the new virtual swap space design, dynamically allocate a virtual slot (as well as an associated metadata structure) for each swapped out page, and associate it with the (physical) swap slot on the swapfile/swap partition. For now, there is always a physical slot in the swapfile associated with each virtual swap slot (except those about to be freed). The virtual swap slot's lifetime is still tied to the lifetime of its physical swap slot.

We also maintain a backward map to look up the virtual swap slot from its associated physical swap slot on the swapfile. This is used in cluster readahead, as well as in several swapfile operations, such as the swap slot reclamation that happens when the swapfile is almost full. It will also be used in a future patch that simplifies swapoff.
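To make the forward/backward mapping concrete, here is a minimal illustrative sketch (not the actual code of this patch) of how the two lookups could be backed by the xarrays that mm/vswap.c introduces below. Locking, RCU protection, and error handling are omitted, and the assumption that vswap_rmap stores the virtual slot as an xa_mk_value() is mine, not the patch's:

	swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry)
	{
		/* forward map: virtual slot -> swap descriptor -> physical slot */
		struct swp_desc *desc = xa_load(&vswap_map, entry.val);

		return desc ? desc->slot : (swp_slot_t){ 0 };
	}

	swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot)
	{
		/* backward map: physical slot -> virtual slot */
		void *rmap = xa_load(&vswap_rmap, slot.val);

		return (swp_entry_t){ rmap ? xa_to_value(rmap) : 0 };
	}

The forward direction is what the swapfile and swap cache paths above use to reach the backing device, while the backward direction serves the physical-to-virtual callers mentioned here (cluster readahead, swap slot reclamation, and eventually swapoff).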
Signed-off-by: Nhat Pham --- MAINTAINERS | 7 + include/linux/swap.h | 28 ++- include/linux/swapops.h | 12 ++ mm/Kconfig | 4 + mm/Makefile | 1 + mm/internal.h | 43 ++++- mm/shmem.c | 10 +- mm/swap.h | 2 + mm/swap_state.c | 39 +++-- mm/swapfile.c | 18 +- mm/vswap.c | 375 ++++++++++++++++++++++++++++++++++++++++ 11 files changed, 508 insertions(+), 31 deletions(-) create mode 100644 mm/vswap.c diff --git a/MAINTAINERS b/MAINTAINERS index 00e94bec401e..65108bf2a5f1 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -25290,6 +25290,13 @@ S: Maintained F: Documentation/devicetree/bindings/iio/light/vishay,veml6030.yaml F: drivers/iio/light/veml6030.c +VIRTUAL SWAP SPACE +M: Nhat Pham +M: Johannes Weiner +L: linux-mm@kvack.org +S: Maintained +F: mm/vswap.c + VISHAY VEML6075 UVA AND UVB LIGHT SENSOR DRIVER M: Javier Carrasco S: Maintained diff --git a/include/linux/swap.h b/include/linux/swap.h index 674089bc4cd1..d32a2c300924 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -455,7 +455,6 @@ extern void __meminit kswapd_stop(int nid); /* Virtual swap API */ swp_entry_t folio_alloc_swap(struct folio *folio); bool folio_free_swap(struct folio *folio); -void put_swap_folio(struct folio *folio, swp_entry_t entry); int add_swap_count_continuation(swp_entry_t, gfp_t); void swap_shmem_alloc(swp_entry_t, int); int swap_duplicate(swp_entry_t); @@ -504,6 +503,7 @@ static inline long get_nr_swap_pages(void) } void si_swapinfo(struct sysinfo *); +void swap_slot_put_folio(swp_slot_t slot, struct folio *folio); swp_slot_t swap_slot_alloc_of_type(int); int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order); void swap_slot_free_nr(swp_slot_t slot, int nr_pages); @@ -725,6 +725,26 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +#ifdef CONFIG_VIRTUAL_SWAP +int vswap_init(void); +void vswap_exit(void); +void vswap_free(swp_entry_t entry); +swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry); +swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot); +#else +static inline int vswap_init(void) +{ + return 0; +} + +static inline void vswap_exit(void) +{ +} + +static inline void vswap_free(swp_entry_t entry) +{ +} + /** * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a * virtual swap slot. @@ -748,6 +768,12 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) { return (swp_entry_t) { slot.val }; } +#endif + +static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) +{ + swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio); +} static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) diff --git a/include/linux/swapops.h b/include/linux/swapops.h index 2a4101c9bba4..ba7364e1400a 100644 --- a/include/linux/swapops.h +++ b/include/linux/swapops.h @@ -27,6 +27,18 @@ #define SWP_TYPE_SHIFT (BITS_PER_XA_VALUE - MAX_SWAPFILES_SHIFT) #define SWP_OFFSET_MASK ((1UL << SWP_TYPE_SHIFT) - 1) +#ifdef CONFIG_VIRTUAL_SWAP +#if SWP_TYPE_SHIFT > 32 +#define MAX_VSWAP U32_MAX +#else +/* + * The range of virtual swap slots is the same as the range of physical swap + * slots. + */ +#define MAX_VSWAP (((MAX_SWAPFILES - 1) << SWP_TYPE_SHIFT) | SWP_OFFSET_MASK) +#endif +#endif + /* * Definitions only for PFN swap entries (see is_pfn_swap_entry()). To * store PFN, we only need SWP_PFN_BITS bits. 
Each of the pfn swap entries diff --git a/mm/Kconfig b/mm/Kconfig index 1a6acdb64333..d578b8e6ab6a 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -32,6 +32,10 @@ config VIRTUAL_SWAP (zswap, on-disk swapfile, etc.), and save disk space when we use zswap (or the zero-filled swap page optimization). + In this new design, for each swap entry, a virtual swap slot is + allocated and stored in the page table entry, rather than the + handle to the physical swap slot on the swap device itself. + There might be more lock contentions with heavy swap use, since the swap cache is no longer range partitioned. diff --git a/mm/Makefile b/mm/Makefile index 850386a67b3e..b7216c714fa1 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -76,6 +76,7 @@ ifdef CONFIG_MMU endif obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o swap_slots.o +obj-$(CONFIG_VIRTUAL_SWAP) += vswap.o obj-$(CONFIG_ZSWAP) += zswap.o obj-$(CONFIG_HAS_DMA) += dmapool.o obj-$(CONFIG_HUGETLBFS) += hugetlb.o diff --git a/mm/internal.h b/mm/internal.h index 2d63f6537e35..ca28729f822a 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -262,6 +262,40 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, return min(ptep - start_ptep, max_nr); } +#ifdef CONFIG_VIRTUAL_SWAP +static inline swp_entry_t swap_nth(swp_entry_t entry, long n) +{ + return (swp_entry_t) { entry.val + n }; +} + +/* similar to swap_nth, but check the backing physical slots as well. */ +static inline swp_entry_t swap_move(swp_entry_t entry, long delta) +{ + swp_slot_t slot = swp_entry_to_swp_slot(entry), next_slot; + swp_entry_t next_entry = swap_nth(entry, delta); + + next_slot = swp_entry_to_swp_slot(next_entry); + if (swp_slot_type(slot) != swp_slot_type(next_slot) || + swp_slot_offset(slot) + delta != swp_slot_offset(next_slot)) + next_entry.val = 0; + + return next_entry; +} +#else +static inline swp_entry_t swap_nth(swp_entry_t entry, long n) +{ + swp_slot_t slot = swp_entry_to_swp_slot(entry); + + return swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), + swp_slot_offset(slot) + n)); +} + +static inline swp_entry_t swap_move(swp_entry_t entry, long delta) +{ + return swap_nth(entry, delta); +} +#endif + /** * pte_move_swp_offset - Move the swap entry offset field of a swap pte * forward or backward by delta @@ -275,13 +309,8 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr, */ static inline pte_t pte_move_swp_offset(pte_t pte, long delta) { - swp_entry_t entry = pte_to_swp_entry(pte), new_entry; - swp_slot_t slot = swp_entry_to_swp_slot(entry); - pte_t new; - - new_entry = swp_slot_to_swp_entry(swp_slot(swp_slot_type(slot), - swp_slot_offset(slot) + delta)); - new = swp_entry_to_pte(new_entry); + swp_entry_t entry = pte_to_swp_entry(pte); + pte_t new = swp_entry_to_pte(swap_move(entry, delta)); if (pte_swp_soft_dirty(pte)) new = pte_swp_mksoft_dirty(new); diff --git a/mm/shmem.c b/mm/shmem.c index f8efa49eb499..4c00b4673468 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -2166,7 +2166,6 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, XA_STATE_ORDER(xas, &mapping->i_pages, index, 0); void *alloced_shadow = NULL; int alloced_order = 0, i; - swp_slot_t slot = swp_entry_to_swp_slot(swap); /* Convert user data gfp flags to xarray node gfp flags */ gfp &= GFP_RECLAIM_MASK; @@ -2205,12 +2204,8 @@ static int shmem_split_large_entry(struct inode *inode, pgoff_t index, */ for (i = 0; i < 1 << order; i++) { pgoff_t aligned_index = round_down(index, 1 << order); - swp_entry_t tmp_entry; - swp_slot_t tmp_slot; + 
swp_entry_t tmp_entry = swap_nth(swap, i); - tmp_slot = - swp_slot(swp_slot_type(slot), swp_slot_offset(slot) + i); - tmp_entry = swp_slot_to_swp_entry(tmp_slot); __xa_store(&mapping->i_pages, aligned_index + i, swp_to_radix_entry(tmp_entry), 0); } @@ -2336,8 +2331,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, if (split_order > 0) { pgoff_t offset = index - round_down(index, 1 << split_order); - swap = swp_slot_to_swp_entry(swp_slot( - swp_slot_type(slot), swp_slot_offset(slot) + offset)); + swap = swap_nth(swap, offset); } /* Here we actually start the io */ diff --git a/mm/swap.h b/mm/swap.h index 06e20b1d79c4..31c94671cb44 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -36,6 +36,8 @@ static inline loff_t swap_slot_pos(swp_slot_t slot) #ifdef CONFIG_VIRTUAL_SWAP extern struct address_space *swap_address_space(swp_entry_t entry); #define swap_cache_index(entry) entry.val + +void virt_clear_shadow_from_swap_cache(swp_entry_t entry); #else /* One swap address space for each 64M swap space */ extern struct address_space *swapper_spaces[]; diff --git a/mm/swap_state.c b/mm/swap_state.c index 268338a0ea57..eb4cd6ba2068 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -176,6 +176,7 @@ void __delete_from_swap_cache(struct folio *folio, __lruvec_stat_mod_folio(folio, NR_SWAPCACHE, -nr); } +#ifndef CONFIG_VIRTUAL_SWAP swp_entry_t folio_alloc_swap(struct folio *folio) { swp_slot_t slot = folio_alloc_swap_slot(folio); @@ -188,6 +189,7 @@ swp_entry_t folio_alloc_swap(struct folio *folio) return entry; } +#endif /** * add_to_swap - allocate swap space for a folio @@ -270,26 +272,30 @@ void delete_from_swap_cache(struct folio *folio) } #ifdef CONFIG_VIRTUAL_SWAP -void clear_shadow_from_swap_cache(int type, unsigned long begin, - unsigned long end) +/* + * In the virtual swap implementation, we index the swap cache by virtual swap + * slots rather than physical ones. As a result, we only clear the shadow when + * the virtual swap slot is freed (via virt_clear_shadow_from_swap_cache()), + * not when the physical swap slot is freed (via clear_shadow_from_swap_cache() + * in the old implementation, which is a no-op in the new implementation). 
+ */ +void virt_clear_shadow_from_swap_cache(swp_entry_t entry) { - swp_slot_t slot = swp_slot(type, begin); - swp_entry_t entry = swp_slot_to_swp_entry(slot); - unsigned long index = swap_cache_index(entry); struct address_space *address_space = swap_address_space(entry); - void *old; + pgoff_t index = swap_cache_index(entry); XA_STATE(xas, &address_space->i_pages, index); xas_set_update(&xas, workingset_update_node); - xa_lock_irq(&address_space->i_pages); - xas_for_each(&xas, old, entry.val + end - begin) { - if (!xa_is_value(old)) - continue; + if (xa_is_value(xas_load(&xas))) xas_store(&xas, NULL); - } xa_unlock_irq(&address_space->i_pages); } + +void clear_shadow_from_swap_cache(int type, unsigned long begin, + unsigned long end) +{ +} #else void clear_shadow_from_swap_cache(int type, unsigned long begin, unsigned long end) @@ -978,10 +984,17 @@ static int __init swap_init_sysfs(void) init_swapper_space(&swapper_space); #endif + err = vswap_init(); + if (err) { + pr_err("failed to initialize virtual swap space\n"); + return err; + } + swap_kobj = kobject_create_and_add("swap", mm_kobj); if (!swap_kobj) { pr_err("failed to create swap kobject\n"); - return -ENOMEM; + err = -ENOMEM; + goto vswap_exit; } err = sysfs_create_group(swap_kobj, &swap_attr_group); if (err) { @@ -992,6 +1005,8 @@ static int __init swap_init_sysfs(void) delete_obj: kobject_put(swap_kobj); +vswap_exit: + vswap_exit(); return err; } subsys_initcall(swap_init_sysfs); diff --git a/mm/swapfile.c b/mm/swapfile.c index a1dd7e998e90..533011c97e03 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1505,6 +1505,20 @@ static void swap_slot_range_free(struct swap_info_struct *si, unsigned char *map = si->swap_map + offset; unsigned char *map_end = map + nr_pages; swp_entry_t entry = swp_slot_to_swp_entry(slot); + int i; + +#ifndef CONFIG_VIRTUAL_SWAP + /* + * In the new (i.e virtual swap) implementation, we will let the virtual + * swap layer handle the cgroup swap accounting and charging. + */ + mem_cgroup_uncharge_swap(entry, nr_pages); +#endif + /* release all the associated (virtual) swap slots */ + for (i = 0; i < nr_pages; i++) { + vswap_free(entry); + entry.val++; + } /* It should never free entries across different clusters */ VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1)); @@ -1517,7 +1531,6 @@ static void swap_slot_range_free(struct swap_info_struct *si, *map = 0; } while (++map < map_end); - mem_cgroup_uncharge_swap(entry, nr_pages); swap_range_free(si, offset, nr_pages); if (!ci->count) @@ -1571,9 +1584,8 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) /* * Called after dropping swapcache to decrease refcnt to swap entries. */ -void put_swap_folio(struct folio *folio, swp_entry_t entry) +void swap_slot_put_folio(swp_slot_t slot, struct folio *folio) { - swp_slot_t slot = swp_entry_to_swp_slot(entry); unsigned long offset = swp_slot_offset(slot); struct swap_cluster_info *ci; struct swap_info_struct *si; diff --git a/mm/vswap.c b/mm/vswap.c new file mode 100644 index 000000000000..23a05c3393d8 --- /dev/null +++ b/mm/vswap.c @@ -0,0 +1,375 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Virtual swap space + * + * Copyright (C) 2024 Meta Platforms, Inc., Nhat Pham + */ +#include +#include +#include +#include +#include +#include "swap.h" + +/* + * Virtual Swap Space + * + * We associate with each swapped out page a virtual swap slot. This will allow + * us to change the backing state of a swapped out page without having to + * update every single page table entries referring to it. 
+ * + * For now, there is a one-to-one correspondence between a virtual swap slot + * and its associated physical swap slot. + */ + +/** + * Swap descriptor - metadata of a swapped out page. + * + * @slot: The handle to the physical swap slot backing this page. + * @rcu: The RCU head to free the descriptor with an RCU grace period. + */ +struct swp_desc { + swp_slot_t slot; + struct rcu_head rcu; +}; + +/* Virtual swap space - swp_entry_t -> struct swp_desc */ +static DEFINE_XARRAY_FLAGS(vswap_map, XA_FLAGS_TRACK_FREE); + +static const struct xa_limit vswap_map_limit = { + .max = MAX_VSWAP, + /* reserve the 0 virtual swap slot to indicate errors */ + .min = 1, +}; + +/* Physical (swp_slot_t) to virtual (swp_entry_t) swap slots. */ +static DEFINE_XARRAY(vswap_rmap); + +/* + * For swapping large folio of size n, we reserve an empty PMD-sized cluster + * of contiguous and aligned virtual swap slots, then allocate the first n + * virtual swap slots from the cluster. + */ +#define VSWAP_CLUSTER_SHIFT HPAGE_PMD_ORDER +#define VSWAP_CLUSTER_SIZE (1UL << VSWAP_CLUSTER_SHIFT) + +/* + * Map from a cluster id to the number of allocated virtual swap slots in the + * (PMD-sized) cluster. This allows us to quickly allocate an empty cluster + * for a large folio being swapped out. + */ +static DEFINE_XARRAY_FLAGS(vswap_cluster_map, XA_FLAGS_TRACK_FREE); + +static const struct xa_limit vswap_cluster_map_limit = { + /* Do not allocate from the last cluster if it does not have enough slots. */ + .max = (((MAX_VSWAP + 1) >> (VSWAP_CLUSTER_SHIFT)) - 1), + /* + * First cluster is never handed out for large folios, since the 0 virtual + * swap slot is reserved for errors. + */ + .min = 1, +}; + +static struct kmem_cache *swp_desc_cache; +static atomic_t vswap_alloc_reject; +static atomic_t vswap_used; + +#ifdef CONFIG_DEBUG_FS +#include + +static struct dentry *vswap_debugfs_root; + +static int vswap_debug_fs_init(void) +{ + if (!debugfs_initialized()) + return -ENODEV; + + vswap_debugfs_root = debugfs_create_dir("vswap", NULL); + debugfs_create_atomic_t("alloc_reject", 0444, + vswap_debugfs_root, &vswap_alloc_reject); + debugfs_create_atomic_t("used", 0444, vswap_debugfs_root, &vswap_used); + + return 0; +} +#else +static int vswap_debug_fs_init(void) +{ + return 0; +} +#endif + +/* Allolcate a contiguous range of virtual swap slots */ +static swp_entry_t vswap_alloc(int nr) +{ + struct swp_desc **descs; + swp_entry_t entry; + u32 index, cluster_id; + void *cluster_entry; + unsigned long cluster_count; + int i, err; + + entry.val = 0; + descs = kcalloc(nr, sizeof(*descs), GFP_KERNEL); + if (!descs) { + atomic_add(nr, &vswap_alloc_reject); + return (swp_entry_t){0}; + } + + if (unlikely(!kmem_cache_alloc_bulk( + swp_desc_cache, GFP_KERNEL, nr, (void **)descs))) { + atomic_add(nr, &vswap_alloc_reject); + kfree(descs); + return (swp_entry_t){0}; + } + + for (i = 0; i < nr; i++) + descs[i]->slot.val = 0; + + xa_lock(&vswap_map); + if (nr == 1) { + if (__xa_alloc(&vswap_map, &index, descs[0], vswap_map_limit, + GFP_KERNEL)) + goto unlock; + else { + /* + * Increment the allocation count of the cluster which the + * allocated virtual swap slot belongs to. + */ + cluster_id = index >> VSWAP_CLUSTER_SHIFT; + cluster_entry = xa_load(&vswap_cluster_map, cluster_id); + cluster_count = cluster_entry ? 
xa_to_value(cluster_entry) : 0; + cluster_count++; + VM_WARN_ON(cluster_count > VSWAP_CLUSTER_SIZE); + + if (xa_err(xa_store(&vswap_cluster_map, cluster_id, + xa_mk_value(cluster_count), GFP_KERNEL))) { + __xa_erase(&vswap_map, index); + goto unlock; + } + } + } else { + /* allocate an unused cluster */ + cluster_entry = xa_mk_value(nr); + if (xa_alloc(&vswap_cluster_map, &cluster_id, cluster_entry, + vswap_cluster_map_limit, GFP_KERNEL)) + goto unlock; + + index = cluster_id << VSWAP_CLUSTER_SHIFT; + + for (i = 0; i < nr; i++) { + err = __xa_insert(&vswap_map, index + i, descs[i], GFP_KERNEL); + VM_WARN_ON(err == -EBUSY); + if (err) { + while (--i >= 0) + __xa_erase(&vswap_map, index + i); + xa_erase(&vswap_cluster_map, cluster_id); + goto unlock; + } + } + } + + VM_WARN_ON(!index); + VM_WARN_ON(index + nr - 1 > MAX_VSWAP); + entry.val = index; + atomic_add(nr, &vswap_used); +unlock: + xa_unlock(&vswap_map); + if (!entry.val) { + atomic_add(nr, &vswap_alloc_reject); + kmem_cache_free_bulk(swp_desc_cache, nr, (void **)descs); + } + kfree(descs); + return entry; +} + +static inline void release_vswap_slot(unsigned long index) +{ + unsigned long cluster_id = index >> VSWAP_CLUSTER_SHIFT, cluster_count; + void *cluster_entry; + + xa_lock(&vswap_map); + __xa_erase(&vswap_map, index); + cluster_entry = xa_load(&vswap_cluster_map, cluster_id); + VM_WARN_ON(!cluster_entry); + cluster_count = xa_to_value(cluster_entry); + cluster_count--; + + VM_WARN_ON(cluster_count < 0); + + if (cluster_count) + xa_store(&vswap_cluster_map, cluster_id, + xa_mk_value(cluster_count), GFP_KERNEL); + else + xa_erase(&vswap_cluster_map, cluster_id); + xa_unlock(&vswap_map); + atomic_dec(&vswap_used); +} + +/** + * vswap_free - free a virtual swap slot. + * @id: the virtual swap slot to free + */ +void vswap_free(swp_entry_t entry) +{ + struct swp_desc *desc; + + if (!entry.val) + return; + + /* do not immediately erase the virtual slot to prevent its reuse */ + desc = xa_load(&vswap_map, entry.val); + if (!desc) + return; + + virt_clear_shadow_from_swap_cache(entry); + + if (desc->slot.val) { + /* we only charge after linkage was established */ + mem_cgroup_uncharge_swap(entry, 1); + xa_erase(&vswap_rmap, desc->slot.val); + } + + /* erase forward mapping and release the virtual slot for reallocation */ + release_vswap_slot(entry.val); + kfree_rcu(desc, rcu); +} + +/** + * folio_alloc_swap - allocate virtual swap slots for a folio. + * @folio: the folio. + * + * Return: the first allocated slot if success, or the zero virtuals swap slot + * on failure. + */ +swp_entry_t folio_alloc_swap(struct folio *folio) +{ + int i, err, nr = folio_nr_pages(folio); + bool manual_freeing = true; + struct swp_desc *desc; + swp_entry_t entry; + swp_slot_t slot; + + entry = vswap_alloc(nr); + if (!entry.val) + return entry; + + /* + * XXX: for now, we always allocate a physical swap slot for each virtual + * swap slot, and their lifetime are coupled. This will change once we + * decouple virtual swap slots from their backing states, and only allocate + * physical swap slots for them on demand (i.e on zswap writeback, or + * fallback from zswap store failure). + */ + slot = folio_alloc_swap_slot(folio); + if (!slot.val) + goto vswap_free; + + /* establish the vrtual <-> physical swap slots linkages. 
*/ + for (i = 0; i < nr; i++) { + err = xa_insert(&vswap_rmap, slot.val + i, + xa_mk_value(entry.val + i), GFP_KERNEL); + VM_WARN_ON(err == -EBUSY); + if (err) { + while (--i >= 0) + xa_erase(&vswap_rmap, slot.val + i); + goto put_physical_swap; + } + } + + i = 0; + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + desc->slot.val = slot.val + i; + i++; + } + rcu_read_unlock(); + + manual_freeing = false; + /* + * XXX: for now, we charge towards the memory cgroup's swap limit on virtual + * swap slots allocation. This is acceptable because as noted above, each + * virtual swap slot corresponds to a physical swap slot. Once we have + * decoupled virtual and physical swap slots, we will only charge when we + * actually allocate a physical swap slot. + */ + if (!mem_cgroup_try_charge_swap(folio, entry)) + return entry; + +put_physical_swap: + /* + * There is no any linkage between virtual and physical swap slots yet. We + * have to manually and separately free the allocated virtual and physical + * swap slots. + */ + swap_slot_put_folio(slot, folio); +vswap_free: + if (manual_freeing) { + for (i = 0; i < nr; i++) + vswap_free((swp_entry_t){entry.val + i}); + } + entry.val = 0; + return entry; +} + +/** + * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a + * virtual swap slot. + * @entry: the virtual swap slot. + * + * Return: the physical swap slot corresponding to the virtual swap slot. + */ +swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) +{ + struct swp_desc *desc; + + if (!entry.val) + return (swp_slot_t){0}; + + desc = xa_load(&vswap_map, entry.val); + return desc ? desc->slot : (swp_slot_t){0}; +} + +/** + * swp_slot_to_swp_entry - look up the virtual swap slot corresponding to a + * physical swap slot. + * @slot: the physical swap slot. + * + * Return: the virtual swap slot corresponding to the physical swap slot. + */ +swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) +{ + void *entry = xa_load(&vswap_rmap, slot.val); + + /* + * entry can be NULL if we fail to link the virtual and physical swap slot + * during the swap slot allocation process. + */ + return entry ? 
(swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0};
+}
+
+int vswap_init(void)
+{
+	swp_desc_cache = KMEM_CACHE(swp_desc, 0);
+	if (!swp_desc_cache)
+		return -ENOMEM;
+
+	if (xa_insert(&vswap_cluster_map, 0, xa_mk_value(1), GFP_KERNEL)) {
+		kmem_cache_destroy(swp_desc_cache);
+		return -ENOMEM;
+	}
+
+	if (vswap_debug_fs_init())
+		pr_warn("Failed to initialize vswap debugfs\n");
+
+	return 0;
+}
+
+void vswap_exit(void)
+{
+	kmem_cache_destroy(swp_desc_cache);
+}

From patchwork Mon Apr 7 23:42:08 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 879274
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 07/14] swap: implement the swap_cgroup API using virtual swap
Date: Mon, 7 Apr 2025 16:42:08 -0700
Message-ID: <20250407234223.1059191-8-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>
MIME-Version: 1.0

Once we decouple a swap entry from its backing store via the virtual swap, we can no longer statically allocate an array to store the swap entries' cgroup information. Move it to the swap descriptor.
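The gist of the new lookup path, with the RCU protection and error handling of the real code stripped out (illustrative sketch only; vswap_map and struct swp_desc are the xarray and swap descriptor introduced earlier in this series, and sketch_lookup_swap_cgroup_id() is not a function added by this patch):

	/* Illustrative sketch -- the actual lookup_swap_cgroup_id() is in the diff below. */
	static unsigned short sketch_lookup_swap_cgroup_id(swp_entry_t entry)
	{
		/* virtual swap slot -> swap descriptor */
		struct swp_desc *desc = xa_load(&vswap_map, entry.val);

		/* the cgroup id now lives in the per-entry descriptor, not in a per-device array */
		return desc ? (unsigned short)atomic_read(&desc->memcgid) : 0;
	}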
Signed-off-by: Nhat Pham --- mm/Makefile | 2 ++ mm/vswap.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++- 2 files changed, 73 insertions(+), 1 deletion(-) diff --git a/mm/Makefile b/mm/Makefile index b7216c714fa1..35f2f282c8da 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -101,8 +101,10 @@ obj-$(CONFIG_PAGE_COUNTER) += page_counter.o obj-$(CONFIG_MEMCG_V1) += memcontrol-v1.o obj-$(CONFIG_MEMCG) += memcontrol.o vmpressure.o ifdef CONFIG_SWAP +ifndef CONFIG_VIRTUAL_SWAP obj-$(CONFIG_MEMCG) += swap_cgroup.o endif +endif obj-$(CONFIG_CGROUP_HUGETLB) += hugetlb_cgroup.o obj-$(CONFIG_GUP_TEST) += gup_test.o obj-$(CONFIG_DMAPOOL_TEST) += dmapool_test.o diff --git a/mm/vswap.c b/mm/vswap.c index 23a05c3393d8..3792fa7f766b 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -27,10 +27,14 @@ * * @slot: The handle to the physical swap slot backing this page. * @rcu: The RCU head to free the descriptor with an RCU grace period. + * @memcgid: The memcg id of the owning memcg, if any. */ struct swp_desc { swp_slot_t slot; struct rcu_head rcu; +#ifdef CONFIG_MEMCG + atomic_t memcgid; +#endif }; /* Virtual swap space - swp_entry_t -> struct swp_desc */ @@ -122,8 +126,10 @@ static swp_entry_t vswap_alloc(int nr) return (swp_entry_t){0}; } - for (i = 0; i < nr; i++) + for (i = 0; i < nr; i++) { descs[i]->slot.val = 0; + atomic_set(&descs[i]->memcgid, 0); + } xa_lock(&vswap_map); if (nr == 1) { @@ -352,6 +358,70 @@ swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) return entry ? (swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0}; } +#ifdef CONFIG_MEMCG +static unsigned short vswap_cgroup_record(swp_entry_t entry, + unsigned short memcgid, unsigned int nr_ents) +{ + struct swp_desc *desc; + unsigned short oldid, iter = 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr_ents - 1) { + if (xas_retry(&xas, desc)) + continue; + + oldid = atomic_xchg(&desc->memcgid, memcgid); + if (!iter) + iter = oldid; + VM_WARN_ON(iter != oldid); + } + rcu_read_unlock(); + + return oldid; +} + +void swap_cgroup_record(struct folio *folio, unsigned short memcgid, + swp_entry_t entry) +{ + unsigned short oldid = + vswap_cgroup_record(entry, memcgid, folio_nr_pages(folio)); + + VM_WARN_ON(oldid); +} + +unsigned short swap_cgroup_clear(swp_entry_t entry, unsigned int nr_ents) +{ + return vswap_cgroup_record(entry, 0, nr_ents); +} + +unsigned short lookup_swap_cgroup_id(swp_entry_t entry) +{ + struct swp_desc *desc; + unsigned short ret; + + /* + * Note that the virtual swap slot can be freed under us, for instance in + * the invocation of mem_cgroup_swapin_charge_folio. We need to wrap the + * entire lookup in RCU read-side critical section, and double check the + * existence of the swap descriptor. + */ + rcu_read_lock(); + desc = xa_load(&vswap_map, entry.val); + ret = desc ? 
atomic_read(&desc->memcgid) : 0;
+	rcu_read_unlock();
+	return ret;
+}
+
+int swap_cgroup_swapon(int type, unsigned long max_pages)
+{
+	return 0;
+}
+
+void swap_cgroup_swapoff(int type) {}
+#endif /* CONFIG_MEMCG */
+
 int vswap_init(void)
 {
 	swp_desc_cache = KMEM_CACHE(swp_desc, 0);

From patchwork Mon Apr 7 23:42:09 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 878965
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 08/14] swap: manage swap entry lifetime at the virtual swap layer
Date: Mon, 7 Apr 2025 16:42:09 -0700
Message-ID: <20250407234223.1059191-9-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>
MIME-Version: 1.0

This patch moves swap entry lifetime management to the virtual swap layer (when swap virtualization is enabled), by adding the following fields to the swap descriptor:

1. Whether the swap entry is in the swap cache (or about to be added to it).

2. The swap count of the swap entry, which counts the number of page table entries at which the swap entry is installed.

3. The reference count of the swap entry, which is its swap count, plus one if the entry has been added to the swap cache.

We also re-implement all of the swap entry lifetime API (swap_duplicate(), swap_free_nr(), swapcache_prepare(), etc.) in the virtual swap layer.

Note that the swapfile's swap map can now be reduced under the virtual swap implementation, as each slot can only have 3 states: unallocated, allocated, and bad. However, I leave this simplification to future work, to minimize the amount of code change to review in this series.
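For reference, a walkthrough of how these fields evolve for a single page that is swapped out and mapped by two PTEs (illustrative only; the call sites named are approximate, and transient references such as those taken for swapoff protection later in the series are ignored). The invariant maintained is refcnt == swap_count + in_swapcache:

	folio_alloc_swap()                        swap_count=0  in_swapcache=1  refcnt=1
	entry installed in two swap PTEs          swap_count=2  in_swapcache=1  refcnt=3
	  (swap_duplicate() once per PTE)
	folio reclaimed, dropped from swap cache  swap_count=2  in_swapcache=0  refcnt=2
	  (put_swap_folio() -> swapcache_clear())
	both swap PTEs zapped                     swap_count=0  in_swapcache=0  refcnt=0
	  (swap_free_nr() once per PTE)           -> vswap_free() releases the slot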
Signed-off-by: Nhat Pham --- include/linux/swap.h | 33 ++-- mm/memory.c | 2 + mm/swapfile.c | 128 +++++++++++--- mm/vswap.c | 400 ++++++++++++++++++++++++++++++++++++++++++- 4 files changed, 522 insertions(+), 41 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index d32a2c300924..1d8679bd57f3 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -225,6 +225,11 @@ enum { #define SWAP_CLUSTER_MAX_SKIPPED (SWAP_CLUSTER_MAX << 10) #define COMPACT_CLUSTER_MAX SWAP_CLUSTER_MAX +#ifdef CONFIG_VIRTUAL_SWAP +/* Swapfile's swap map state*/ +#define SWAP_MAP_ALLOCATED 0x01 /* Page is allocated */ +#define SWAP_MAP_BAD 0x02 /* Page is bad */ +#else /* Bit flag in swap_map */ #define SWAP_HAS_CACHE 0x40 /* Flag page is cached, in first swap_map */ #define COUNT_CONTINUED 0x80 /* Flag swap_map continuation for full count */ @@ -236,6 +241,7 @@ enum { /* Special value in each swap_map continuation */ #define SWAP_CONT_MAX 0x7f /* Max count */ +#endif /* CONFIG_VIRTUAL_SWAP */ /* * We use this to track usage of a cluster. A cluster is a block of swap disk @@ -455,7 +461,6 @@ extern void __meminit kswapd_stop(int nid); /* Virtual swap API */ swp_entry_t folio_alloc_swap(struct folio *folio); bool folio_free_swap(struct folio *folio); -int add_swap_count_continuation(swp_entry_t, gfp_t); void swap_shmem_alloc(swp_entry_t, int); int swap_duplicate(swp_entry_t); int swapcache_prepare(swp_entry_t entry, int nr); @@ -559,11 +564,6 @@ static inline void free_swap_cache(struct folio *folio) { } -static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) -{ - return 0; -} - static inline void swap_shmem_alloc(swp_entry_t swp, int nr) { } @@ -728,9 +728,13 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) #ifdef CONFIG_VIRTUAL_SWAP int vswap_init(void); void vswap_exit(void); -void vswap_free(swp_entry_t entry); swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry); swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot); +bool folio_swapped(struct folio *folio); +bool vswap_swapcache_only(swp_entry_t entry, int nr); +int non_swapcache_batch(swp_entry_t entry, int nr); +bool swap_free_nr_any_cache_only(swp_entry_t entry, int nr); +void put_swap_folio(struct folio *folio, swp_entry_t entry); #else static inline int vswap_init(void) { @@ -741,10 +745,6 @@ static inline void vswap_exit(void) { } -static inline void vswap_free(swp_entry_t entry) -{ -} - /** * swp_entry_to_swp_slot - look up the physical swap slot corresponding to a * virtual swap slot. 
@@ -768,12 +768,12 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) { return (swp_entry_t) { slot.val }; } -#endif static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) { swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio); } +#endif static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) @@ -790,5 +790,14 @@ static inline void unlock_swapoff(swp_entry_t entry, swap_slot_put_swap_info(si); } +#if defined(CONFIG_SWAP) && !defined(CONFIG_VIRTUAL_SWAP) +int add_swap_count_continuation(swp_entry_t, gfp_t); +#else +static inline int add_swap_count_continuation(swp_entry_t swp, gfp_t gfp_mask) +{ + return 0; +} +#endif + #endif /* __KERNEL__*/ #endif /* _LINUX_SWAP_H */ diff --git a/mm/memory.c b/mm/memory.c index c44e845b5320..a1d3631ad85f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4123,6 +4123,7 @@ static struct folio *__alloc_swap_folio(struct vm_fault *vmf) } #ifdef CONFIG_TRANSPARENT_HUGEPAGE +#ifndef CONFIG_VIRTUAL_SWAP static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) { swp_slot_t slot = swp_entry_to_swp_slot(entry); @@ -4143,6 +4144,7 @@ static inline int non_swapcache_batch(swp_entry_t entry, int max_nr) return i; } +#endif /* * Check if the PTEs within a range are contiguous swap entries diff --git a/mm/swapfile.c b/mm/swapfile.c index 533011c97e03..babb545acffd 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -50,8 +50,10 @@ #include "internal.h" #include "swap.h" +#ifndef CONFIG_VIRTUAL_SWAP static bool swap_count_continued(struct swap_info_struct *, pgoff_t, unsigned char); +#endif static void free_swap_count_continuations(struct swap_info_struct *); static void swap_slot_range_free(struct swap_info_struct *si, struct swap_cluster_info *ci, @@ -156,6 +158,25 @@ static long swap_usage_in_pages(struct swap_info_struct *si) /* Reclaim directly, bypass the slot cache and don't touch device lock */ #define TTRS_DIRECT 0x8 +#ifdef CONFIG_VIRTUAL_SWAP +static inline unsigned char swap_count(unsigned char ent) +{ + return ent; +} + +static bool swap_is_has_cache(struct swap_info_struct *si, + unsigned long offset, int nr_pages) +{ + swp_entry_t entry = swp_slot_to_swp_entry(swp_slot(si->type, offset)); + + return vswap_swapcache_only(entry, nr_pages); +} + +static bool swap_cache_only(struct swap_info_struct *si, unsigned long offset) +{ + return swap_is_has_cache(si, offset, 1); +} +#else static inline unsigned char swap_count(unsigned char ent) { return ent & ~SWAP_HAS_CACHE; /* may include COUNT_CONTINUED flag */ @@ -176,6 +197,11 @@ static bool swap_is_has_cache(struct swap_info_struct *si, return true; } +static bool swap_cache_only(struct swap_info_struct *si, unsigned long offset) +{ + return READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE; +} + static bool swap_is_last_map(struct swap_info_struct *si, unsigned long offset, int nr_pages, bool *has_cache) { @@ -194,6 +220,7 @@ static bool swap_is_last_map(struct swap_info_struct *si, *has_cache = !!(count & SWAP_HAS_CACHE); return true; } +#endif /* * returns number of pages in the folio that backs the swap entry. If positive, @@ -250,7 +277,11 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si, if (!need_reclaim) goto out_unlock; - if (!(flags & TTRS_DIRECT)) { + /* + * For now, virtual swap implementation only supports freeing through the + * swap slot cache... 
+ */ + if (!(flags & TTRS_DIRECT) || IS_ENABLED(CONFIG_VIRTUAL_SWAP)) { /* Free through slot cache */ delete_from_swap_cache(folio); folio_set_dirty(folio); @@ -700,7 +731,12 @@ static bool cluster_reclaim_range(struct swap_info_struct *si, case 0: offset++; break; +#ifdef CONFIG_VIRTUAL_SWAP + /* __try_to_reclaim_swap() checks if the slot is in-cache only */ + case SWAP_MAP_ALLOCATED: +#else case SWAP_HAS_CACHE: +#endif nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); if (nr_reclaim > 0) offset += nr_reclaim; @@ -731,19 +767,20 @@ static bool cluster_scan_range(struct swap_info_struct *si, { unsigned long offset, end = start + nr_pages; unsigned char *map = si->swap_map; + unsigned char count; for (offset = start; offset < end; offset++) { - switch (READ_ONCE(map[offset])) { - case 0: + count = READ_ONCE(map[offset]); + if (!count) continue; - case SWAP_HAS_CACHE: + + if (swap_cache_only(si, offset)) { if (!vm_swap_full()) return false; *need_reclaim = true; continue; - default: - return false; } + return false; } return true; @@ -836,7 +873,6 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) long to_scan = 1; unsigned long offset, end; struct swap_cluster_info *ci; - unsigned char *map = si->swap_map; int nr_reclaim; if (force) @@ -848,7 +884,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force) to_scan--; while (offset < end) { - if (READ_ONCE(map[offset]) == SWAP_HAS_CACHE) { + if (swap_cache_only(si, offset)) { spin_unlock(&ci->lock); nr_reclaim = __try_to_reclaim_swap(si, offset, TTRS_ANYWAY | TTRS_DIRECT); @@ -1168,6 +1204,10 @@ static int cluster_alloc_swap(struct swap_info_struct *si, { int n_ret = 0; +#ifdef CONFIG_VIRTUAL_SWAP + VM_WARN_ON(usage != SWAP_MAP_ALLOCATED); +#endif + while (n_ret < nr) { unsigned long offset = cluster_alloc_swap_slot(si, order, usage); @@ -1185,6 +1225,10 @@ static int scan_swap_map_slots(struct swap_info_struct *si, { unsigned int nr_pages = 1 << order; +#ifdef CONFIG_VIRTUAL_SWAP + VM_WARN_ON(usage != SWAP_MAP_ALLOCATED); +#endif + /* * We try to cluster swap pages by allocating them sequentially * in swap. 
Once we've allocated SWAPFILE_CLUSTER pages this @@ -1241,7 +1285,13 @@ int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order) long avail_pgs; int n_ret = 0; int node; + unsigned char usage; +#ifdef CONFIG_VIRTUAL_SWAP + usage = SWAP_MAP_ALLOCATED; +#else + usage = SWAP_HAS_CACHE; +#endif spin_lock(&swap_avail_lock); avail_pgs = atomic_long_read(&nr_swap_pages) / size; @@ -1261,8 +1311,7 @@ int swap_slot_alloc(int n_goal, swp_slot_t swp_slots[], int entry_order) plist_requeue(&si->avail_lists[node], &swap_avail_heads[node]); spin_unlock(&swap_avail_lock); if (get_swap_device_info(si)) { - n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, - n_goal, swp_slots, order); + n_ret = scan_swap_map_slots(si, usage, n_goal, swp_slots, order); swap_slot_put_swap_info(si); if (n_ret || size > 1) goto check_out; @@ -1395,6 +1444,17 @@ struct swap_info_struct *swap_slot_tryget_swap_info(swp_slot_t slot) return NULL; } +#ifdef CONFIG_VIRTUAL_SWAP +static unsigned char __swap_slot_free_locked(struct swap_info_struct *si, + unsigned long offset, + unsigned char usage) +{ + VM_WARN_ON(usage != 1); + VM_WARN_ON(si->swap_map[offset] != SWAP_MAP_ALLOCATED); + + return 0; +} +#else static unsigned char __swap_slot_free_locked(struct swap_info_struct *si, unsigned long offset, unsigned char usage) @@ -1492,6 +1552,7 @@ static bool __swap_slots_free(struct swap_info_struct *si, } return has_cache; } +#endif /* CONFIG_VIRTUAL_SWAP */ /* * Drop the last HAS_CACHE flag of swap entries, caller have to @@ -1504,21 +1565,18 @@ static void swap_slot_range_free(struct swap_info_struct *si, unsigned long offset = swp_slot_offset(slot); unsigned char *map = si->swap_map + offset; unsigned char *map_end = map + nr_pages; - swp_entry_t entry = swp_slot_to_swp_entry(slot); - int i; + unsigned char usage; -#ifndef CONFIG_VIRTUAL_SWAP +#ifdef CONFIG_VIRTUAL_SWAP + usage = SWAP_MAP_ALLOCATED; +#else /* * In the new (i.e virtual swap) implementation, we will let the virtual * swap layer handle the cgroup swap accounting and charging. */ - mem_cgroup_uncharge_swap(entry, nr_pages); + mem_cgroup_uncharge_swap(swp_slot_to_swp_entry(slot), nr_pages); + usage = SWAP_HAS_CACHE; #endif - /* release all the associated (virtual) swap slots */ - for (i = 0; i < nr_pages; i++) { - vswap_free(entry); - entry.val++; - } /* It should never free entries across different clusters */ VM_BUG_ON(ci != offset_to_cluster(si, offset + nr_pages - 1)); @@ -1527,7 +1585,7 @@ static void swap_slot_range_free(struct swap_info_struct *si, ci->count -= nr_pages; do { - VM_BUG_ON(*map != SWAP_HAS_CACHE); + VM_BUG_ON(*map != usage); *map = 0; } while (++map < map_end); @@ -1572,6 +1630,7 @@ void swap_slot_free_nr(swp_slot_t slot, int nr_pages) } } +#ifndef CONFIG_VIRTUAL_SWAP /* * Caller has made sure that the swap device corresponding to entry * is still around or has not been recycled. @@ -1580,9 +1639,11 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) { swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages); } +#endif /* - * Called after dropping swapcache to decrease refcnt to swap entries. + * This should only be called in contexts in which the slot has + * been allocated but not associated with any swap entries. 
*/ void swap_slot_put_folio(swp_slot_t slot, struct folio *folio) { @@ -1590,23 +1651,31 @@ void swap_slot_put_folio(swp_slot_t slot, struct folio *folio) struct swap_cluster_info *ci; struct swap_info_struct *si; int size = 1 << swap_slot_order(folio_order(folio)); + unsigned char usage; si = _swap_info_get(slot); if (!si) return; +#ifdef CONFIG_VIRTUAL_SWAP + usage = SWAP_MAP_ALLOCATED; +#else + usage = SWAP_HAS_CACHE; +#endif + ci = lock_cluster(si, offset); if (swap_is_has_cache(si, offset, size)) swap_slot_range_free(si, ci, slot, size); else { for (int i = 0; i < size; i++, slot.val++) { - if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE)) + if (!__swap_slot_free_locked(si, offset + i, usage)) swap_slot_range_free(si, ci, slot, 1); } } unlock_cluster(ci); } +#ifndef CONFIG_VIRTUAL_SWAP int __swap_count(swp_entry_t entry) { swp_slot_t slot = swp_entry_to_swp_slot(entry); @@ -1777,7 +1846,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) */ for (offset = start_offset; offset < end_offset; offset += nr) { nr = 1; - if (READ_ONCE(si->swap_map[offset]) == SWAP_HAS_CACHE) { + if (swap_cache_only(si, offset)) { /* * Folios are always naturally aligned in swap so * advance forward to the next boundary. Zero means no @@ -1799,6 +1868,7 @@ void free_swap_and_cache_nr(swp_entry_t entry, int nr) out: swap_slot_put_swap_info(si); } +#endif /* CONFIG_VIRTUAL_SWAP */ void swap_slot_cache_free_slots(swp_slot_t *slots, int n) { @@ -3558,6 +3628,7 @@ void si_swapinfo(struct sysinfo *val) spin_unlock(&swap_lock); } +#ifndef CONFIG_VIRTUAL_SWAP /* * Verify that nr swap entries are valid and increment their swap map counts. * @@ -3692,6 +3763,7 @@ void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) cluster_swap_free_nr(si, offset, nr, SWAP_HAS_CACHE); } +#endif /* CONFIG_VIRTUAL_SWAP */ struct swap_info_struct *swap_slot_swap_info(swp_slot_t slot) { @@ -3714,6 +3786,15 @@ pgoff_t __folio_swap_cache_index(struct folio *folio) } EXPORT_SYMBOL_GPL(__folio_swap_cache_index); +/* + * Swap count continuations helpers. Note that we do not use continuation in + * virtual swap implementation. + */ +#ifdef CONFIG_VIRTUAL_SWAP +static void free_swap_count_continuations(struct swap_info_struct *si) +{ +} +#else /* * add_swap_count_continuation - called when a swap count is duplicated * beyond SWAP_MAP_MAX, it allocates a new page and links that to the entry's @@ -3936,6 +4017,7 @@ static void free_swap_count_continuations(struct swap_info_struct *si) } } } +#endif #if defined(CONFIG_MEMCG) && defined(CONFIG_BLK_CGROUP) void __folio_throttle_swaprate(struct folio *folio, gfp_t gfp) diff --git a/mm/vswap.c b/mm/vswap.c index 3792fa7f766b..1b8cf894390c 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -18,23 +18,55 @@ * us to change the backing state of a swapped out page without having to * update every single page table entries referring to it. * - * For now, there is a one-to-one correspondence between a virtual swap slot - * and its associated physical swap slot. + * + * I. Swap Entry Lifetime + * + * The swap entry's lifetime is now managed at the virtual swap layer. We + * assign each virtual swap slot a reference count, which includes: + * + * 1. The number of page table entries that refer to the virtual swap slot, i.e + * its swap count. + * + * 2. Whether the virtual swap slot has been added to the swap cache - if so, + * its reference count is incremented by 1. 
+ * + * Each virtual swap slot starts out with a reference count of 1 (since it is + * about to be added to the swap cache). Its reference count is incremented or + * decremented every time it is mapped to or unmapped from a PTE, as well as + * when it is added to or removed from the swap cache. Finally, when its + * reference count reaches 0, the virtual swap slot is freed. */ /** * Swap descriptor - metadata of a swapped out page. * + * @vswap: The virtual swap slot of the swap entry. * @slot: The handle to the physical swap slot backing this page. * @rcu: The RCU head to free the descriptor with an RCU grace period. + * @lock: The lock protecting the swap slot backing field. * @memcgid: The memcg id of the owning memcg, if any. + * @in_swapcache: whether the swapped out page is currently in swap cache. + * This flag is also used to establish the primary owner of + * the swap page in many cases. + * @refcnt: The number of references to this swap descriptor. This includes the + * reference from the swap cache (in_swapcache), and references from + * the page table entries (i.e swap_count). + * @swap_count: The number of page table entries referring to this swap entry. */ struct swp_desc { + swp_entry_t vswap; swp_slot_t slot; struct rcu_head rcu; + + rwlock_t lock; + #ifdef CONFIG_MEMCG atomic_t memcgid; #endif + + atomic_t in_swapcache; + struct kref refcnt; + atomic_t swap_count; }; /* Virtual swap space - swp_entry_t -> struct swp_desc */ @@ -129,6 +161,11 @@ static swp_entry_t vswap_alloc(int nr) for (i = 0; i < nr; i++) { descs[i]->slot.val = 0; atomic_set(&descs[i]->memcgid, 0); + atomic_set(&descs[i]->swap_count, 0); + /* swap entity is about to be added to the swap cache */ + atomic_set(&descs[i]->in_swapcache, 1); + kref_init(&descs[i]->refcnt); + rwlock_init(&descs[i]->lock); } xa_lock(&vswap_map); @@ -142,6 +179,7 @@ static swp_entry_t vswap_alloc(int nr) * allocated virtual swap slot belongs to. */ cluster_id = index >> VSWAP_CLUSTER_SHIFT; + descs[0]->vswap.val = index; cluster_entry = xa_load(&vswap_cluster_map, cluster_id); cluster_count = cluster_entry ? xa_to_value(cluster_entry) : 0; cluster_count++; @@ -171,6 +209,7 @@ static swp_entry_t vswap_alloc(int nr) xa_erase(&vswap_cluster_map, cluster_id); goto unlock; } + descs[i]->vswap.val = index + i; } } @@ -215,7 +254,7 @@ static inline void release_vswap_slot(unsigned long index) * vswap_free - free a virtual swap slot. * @id: the virtual swap slot to free */ -void vswap_free(swp_entry_t entry) +static void vswap_free(swp_entry_t entry) { struct swp_desc *desc; @@ -233,6 +272,7 @@ void vswap_free(swp_entry_t entry) /* we only charge after linkage was established */ mem_cgroup_uncharge_swap(entry, 1); xa_erase(&vswap_rmap, desc->slot.val); + swap_slot_free_nr(desc->slot, 1); } /* erase forward mapping and release the virtual slot for reallocation */ @@ -240,6 +280,13 @@ void vswap_free(swp_entry_t entry) kfree_rcu(desc, rcu); } +static void vswap_ref_release(struct kref *refcnt) +{ + struct swp_desc *desc = container_of(refcnt, struct swp_desc, refcnt); + + vswap_free(desc->vswap); +} + /** * folio_alloc_swap - allocate virtual swap slots for a folio. * @folio: the folio. @@ -332,12 +379,24 @@ swp_entry_t folio_alloc_swap(struct folio *folio) swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) { struct swp_desc *desc; + swp_slot_t slot; if (!entry.val) return (swp_slot_t){0}; + rcu_read_lock(); desc = xa_load(&vswap_map, entry.val); - return desc ? 
desc->slot : (swp_slot_t){0}; + if (!desc) { + rcu_read_unlock(); + return (swp_slot_t){0}; + } + + read_lock(&desc->lock); + slot = desc->slot; + read_unlock(&desc->lock); + rcu_read_unlock(); + + return slot; } /** @@ -349,13 +408,342 @@ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) */ swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) { - void *entry = xa_load(&vswap_rmap, slot.val); + swp_entry_t ret; + void *entry; + rcu_read_lock(); /* * entry can be NULL if we fail to link the virtual and physical swap slot * during the swap slot allocation process. */ - return entry ? (swp_entry_t){xa_to_value(entry)} : (swp_entry_t){0}; + entry = xa_load(&vswap_rmap, slot.val); + if (!entry) + ret.val = 0; + else + ret = (swp_entry_t){xa_to_value(entry)}; + rcu_read_unlock(); + return ret; +} + +/** + * swap_free_nr_any_cache_only - decrease the swap count of nr contiguous swap + * entries by 1 (when the swap entries are removed + * from a range of PTEs), and check if any of the + * swap entries are in swap cache only after the + * its swap count is decreased. + * @entry: the first entry in the range. + * @nr: the number of entries in the range. + * + * Return: true if any of the swap entries are in swap cache only, or false + * otherwise, after the swap counts are decreased. + */ +bool swap_free_nr_any_cache_only(swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + bool ret = false; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (atomic_read(&desc->swap_count) == 1 && + atomic_read(&desc->in_swapcache)) + ret = true; + atomic_dec(&desc->swap_count); + kref_put(&desc->refcnt, vswap_ref_release); + } + rcu_read_unlock(); + return ret; +} + +/** + * swap_free_nr - decrease the swap count of nr contiguous swap entries by 1 + * (when the swap entries are removed from a range of PTEs). + * @entry: the first entry in the range. + * @nr: the number of entries in the range. + */ +void swap_free_nr(swp_entry_t entry, int nr) +{ + swap_free_nr_any_cache_only(entry, nr); +} + +static int swap_duplicate_nr(swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + int i = 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc || !kref_get_unless_zero(&desc->refcnt)) + goto done; + atomic_inc(&desc->swap_count); + i++; + } +done: + rcu_read_unlock(); + if (i && i < nr) + swap_free_nr(entry, i); + + return i == nr ? 0 : -ENOENT; +} + +/** + * swap_duplicate - increase the swap count of the swap entry by 1 (i.e when + * the swap entry is stored at a new PTE). + * @entry: the swap entry. + * + * Return: 0 (always). + * + * Note that according to the existing API, we ALWAYS returns 0 unless a swap + * continuation is required (which is no longer the case in the new design). 
+ */ +int swap_duplicate(swp_entry_t entry) +{ + swap_duplicate_nr(entry, 1); + return 0; +} + +bool folio_swapped(struct folio *folio) +{ + swp_entry_t entry = folio->swap; + int nr = folio_nr_pages(folio); + struct swp_desc *desc; + bool swapped = false; + + if (!entry.val) + return false; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (desc && atomic_read(&desc->swap_count)) { + swapped = true; + break; + } + } + rcu_read_unlock(); + return swapped; +} + +/** + * swp_swapcount - return the swap count of the swap entry. + * @id: the swap entry. + * + * Note that all the swap count functions are identical in the new design, + * since we no longer need swap count continuation. + * + * Return: the swap count of the swap entry. + */ +int swp_swapcount(swp_entry_t entry) +{ + struct swp_desc *desc; + unsigned int ret; + + rcu_read_lock(); + desc = xa_load(&vswap_map, entry.val); + ret = desc ? atomic_read(&desc->swap_count) : 0; + rcu_read_unlock(); + + return ret; +} + +int __swap_count(swp_entry_t entry) +{ + return swp_swapcount(entry); +} + +int swap_swapcount(struct swap_info_struct *si, swp_entry_t entry) +{ + return swp_swapcount(entry); +} + +void swap_shmem_alloc(swp_entry_t entry, int nr) +{ + swap_duplicate_nr(entry, nr); +} + +void swapcache_clear(struct swap_info_struct *si, swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + + if (!nr) + return; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + atomic_dec(&desc->in_swapcache); + kref_put(&desc->refcnt, vswap_ref_release); + } + rcu_read_unlock(); +} + +int swapcache_prepare(swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + int i = 0, ret = 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc || !kref_get_unless_zero(&desc->refcnt)) { + ret = -ENOENT; + goto done; + } + + if (atomic_cmpxchg(&desc->in_swapcache, 0, 1)) { + ret = -EEXIST; + kref_put(&desc->refcnt, vswap_ref_release); + goto done; + } + i++; + } +done: + rcu_read_unlock(); + if (i && i < nr) + swapcache_clear(NULL, entry, i); + if (i < nr && !ret) + ret = -ENOENT; + return ret; +} + +/** + * vswap_swapcache_only - check if all the slots in the range are still valid, + * and are in swap cache only (i.e not stored in any + * PTEs). + * @entry: the first slot in the range. + * @nr: the number of slots in the range. + * + * Return: true if all the slots in the range are still valid, and are in swap + * cache only, or false otherwise. + */ +bool vswap_swapcache_only(swp_entry_t entry, int nr) +{ + struct swp_desc *desc; + int i = 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc || atomic_read(&desc->swap_count) || + !atomic_read(&desc->in_swapcache)) + goto done; + i++; + } +done: + rcu_read_unlock(); + return i == nr; +} + +/** + * non_swapcache_batch - count the longest range starting from a particular + * swap slot that are stil valid, but not in swap cache. + * @entry: the first slot to check. + * @max_nr: the maximum number of slots to check. + * + * Return: the number of slots in the longest range that are still valid, but + * not in swap cache. 
+ */ +int non_swapcache_batch(swp_entry_t entry, int max_nr) +{ + struct swp_desc *desc; + int i = 0; + + if (!entry.val) + return 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + max_nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!atomic_read(&desc->swap_count) || + atomic_read(&desc->in_swapcache)) + goto done; + i++; + } +done: + rcu_read_unlock(); + return i; +} + +/** + * free_swap_and_cache_nr() - Release a swap count on range of swap entries and + * reclaim their cache if no more references remain. + * @entry: First entry of range. + * @nr: Number of entries in range. + * + * For each swap entry in the contiguous range, release a swap count. If any + * swap entries have their swap count decremented to zero, try to reclaim their + * associated swap cache pages. + */ +void free_swap_and_cache_nr(swp_entry_t entry, int nr) +{ + int i = 0, incr = 1; + struct folio *folio; + + if (non_swap_entry(entry)) + return; + + if (swap_free_nr_any_cache_only(entry, nr)) { + while (i < nr) { + incr = 1; + if (vswap_swapcache_only(entry, 1)) { + folio = filemap_get_folio(swap_address_space(entry), + swap_cache_index(entry)); + if (IS_ERR(folio)) + goto next; + if (!folio_trylock(folio)) { + folio_put(folio); + goto next; + } + incr = folio_nr_pages(folio); + folio_free_swap(folio); + folio_unlock(folio); + folio_put(folio); + } +next: + i += incr; + entry.val += incr; + } + } +} + +/* + * Called after dropping swapcache to decrease refcnt to swap entries. + */ +void put_swap_folio(struct folio *folio, swp_entry_t entry) +{ + int nr = folio_nr_pages(folio); + + VM_WARN_ON(!folio_test_locked(folio)); + swapcache_clear(NULL, entry, nr); } #ifdef CONFIG_MEMCG From patchwork Mon Apr 7 23:42:10 2025 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Nhat Pham X-Patchwork-Id: 879273 Received: from mail-yb1-f174.google.com (mail-yb1-f174.google.com [209.85.219.174]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 70C3A22DF8C; Mon, 7 Apr 2025 23:42:31 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.219.174 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744069353; cv=none; b=ALKGEPxhjaUMFzx40IdzuN7PGUD//8mXxjW9kZOVRsHaOCtXnvAP1HHoTRyPYHlOfgtQxcWKq4A/LKuFVCsrvILrgMUDl+DxSCHqgh94dsOPH+PHv0PrtSKcI7Hf3LYrxBYYq2CjebOlX6Zsc1LQA8X/ssBwcp4tws3e8BQCb5s= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1744069353; c=relaxed/simple; bh=+tImoJN8LOAx3XNgUjBTI+HCxDl7uh4xChIEILnZeQU=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=ErUl5AQZlQzFb+edws/+TsYWj9B9l//Yb8yXfORgd1xoAvfRkhbO6CLgwutPbVPwybRKaHBWrRGIH1PfLKwgtkDKXwz6JvVSDp5yXOM73nl0fQV1aq92LyN/huYOdjl+kjBA6m0Ol531kZc2HWjaaSajloLSenGcoiHFod2lzBI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=NGqKT58k; arc=none smtp.client-ip=209.85.219.174 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com 
header.b="NGqKT58k" Received: by mail-yb1-f174.google.com with SMTP id 3f1490d57ef6-e6e09fb66c7so4061734276.2; Mon, 07 Apr 2025 16:42:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1744069350; x=1744674150; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=hEmnwad/N64NAZz/aFOTPsiQL1svvHVYAYQ6CpUFW0k=; b=NGqKT58kxfwB302Zuac9qrJek5JadKsV6bAGOEEucvVc1gWkBtflBG2Ui5gzgxHPYS FuQn14P/LrMHPgqhcxr51430K7XWnSy38C1LisCbs1sumhSJajcedwGvAKME6M6P6Wh/ OSiktrfYv654o/ZY5hJ+FzxZc5IHeL1ImmnBhVEOXu6eDAvCK23rB/Ftdyu4ZLRb7Gsc 4s392D0FDpASSLRoZagRalqGAEFzMh/sLertVdTCAInIVpLxCRKMjyWUfhdImKfOnxNN f8G9hsw/gEOay04QYgEuVEDJczJ5qBz9V752WQoMrqzzzuA6CLe+LeocAyKFKXYjAdf/ VS7Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1744069350; x=1744674150; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=hEmnwad/N64NAZz/aFOTPsiQL1svvHVYAYQ6CpUFW0k=; b=SwUrPt10LxTAHPVBz4fe0J45AfnTh/zvFMm/bX93SRY4W9D25Yh1PxdsOWSrOWbQ2B zYMTg8UV3nUYPOXKUr9M2PWgRUNPJAWTWlakEYfhCJ9Hi8KN2pnQExHvbXrAvpvoWPH2 EMhd04KbOlkBj1DdqUkV+w7MSx41hcexPNJ/MUU4HSAPImPzjY3bQ77cTT5Bs9+FmL2v hALG0NDYPfzCHQMQrHYQ6VQJeWeLCGvQlbOeMfhrrTAIVX/eDNZIC7icXfkGbIYq5VKR swUxjXEczaS9aQ5e0C0Mo8EHdZ9iYkAZ7k3F9XQdmrLKapUntik0mA73aBM4+fA6H1r+ Wxug== X-Forwarded-Encrypted: i=1; AJvYcCUpqFGx5iPsLWVJ0Q2c59DjN5GODK0Ad3V0ambaXLTJuc7ENgA61M0S39O+qf6ZBjWb6sqVk6mA@vger.kernel.org, AJvYcCVLS/R5c116GuE9OmiGDZAGsO8AspSazTIbsPT+z/dieyycaBbPuxY55iCxlGxNFr0LSnevbF/ZinWdXy3b@vger.kernel.org, AJvYcCW9JrDrkKWVDB2wII2qdxZZwB6Y3SUkKv7fn+ba3Zh/AWj9B4WhzDRmUgBQ/LUJwCpMMEEKFyKTvCE=@vger.kernel.org X-Gm-Message-State: AOJu0Yx4hXkzUf44aUfE0QXDJjuPUeYIjxSGPNrTkli9WMa5s5gU54uR 0dZxBbI4GqyWfAxCecRx2cmRqkjaiP/gnZ5sV0LRdmlfbjnjdVV1 X-Gm-Gg: ASbGncuC8HGhU11gruqU1+1boIbdD69yRGEX7u2SwvUmKKpmMig6aY05hSmr0OUOpmt Vy/UIRjLiptnpAluTiBRjq5pk4P4KOFjAsKy+Z7trj2NXP140izyQFMqnQudRh3C1xsbkukDWpH lx/Uu2s6SLVkeNZEFGkAgViTCXG5Hghx9WuGzwozCoRXoAIIyXdbgS60VIO8VTqcBo7Gbyl1763 EePYnxq2hncPovi09eN+un2lwN7+4Zzm3jiz3MRI46RpqDaS0/h0BsVUHUZl6bkB+Dn/xX1PF9M mDMLjp4Ou074QoR6nvsdSXCxwrySe3RrM4Tqi4z/xCUEn5U= X-Google-Smtp-Source: AGHT+IFtGhkxbSoAukRbj+0OU4tWjwIUdMbR+YxJ19NpEyyOkHcigzH97IKjBGD1dt64XVgLyZVotw== X-Received: by 2002:a05:690c:670e:b0:6f9:4bb6:eb4e with SMTP id 00721157ae682-703e1623ab2mr255941227b3.31.1744069350152; Mon, 07 Apr 2025 16:42:30 -0700 (PDT) Received: from localhost ([2a03:2880:25ff:73::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-703d1fdf9ebsm27726707b3.123.2025.04.07.16.42.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 07 Apr 2025 16:42:29 -0700 (PDT) From: Nhat Pham To: linux-mm@kvack.org Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org Subject: [RFC PATCH 09/14] swap: implement locking out swapoff using virtual swap slot Date: Mon, 7 Apr 2025 16:42:10 -0700 
Message-ID: <20250407234223.1059191-10-nphamcs@gmail.com> X-Mailer: git-send-email 2.47.1 In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com> References: <20250407234223.1059191-1-nphamcs@gmail.com> Precedence: bulk X-Mailing-List: linux-pm@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 In the old design, we obtain a reference to the swap device to maintain the validity of the device's metadata struct (i.e swap_info_struct), as well as the swap entry itself, before various operations. In the new virtual swap space design, however, this is no longer necessary - we can simply acquire a reference to the virtual swap slot itself to ensure it remains valid. Furthermore, once we decouple virtual swap slots from their backing, obtaining a reference to the backing swap device itself is not sufficient or even possible anyway, as the backing of a virtual swap slot can change under it. Signed-off-by: Nhat Pham --- include/linux/swap.h | 24 +++++++++++++++++++++++- mm/vswap.c | 36 ++++++++++++++++++++++++++++++++++++ 2 files changed, 59 insertions(+), 1 deletion(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 1d8679bd57f3..7f6200f1db33 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -730,11 +730,33 @@ int vswap_init(void); void vswap_exit(void); swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry); swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot); +bool vswap_tryget(swp_entry_t entry); +void vswap_put(swp_entry_t entry); bool folio_swapped(struct folio *folio); bool vswap_swapcache_only(swp_entry_t entry, int nr); int non_swapcache_batch(swp_entry_t entry, int nr); bool swap_free_nr_any_cache_only(swp_entry_t entry, int nr); void put_swap_folio(struct folio *folio, swp_entry_t entry); + +static inline bool trylock_swapoff(swp_entry_t entry, + struct swap_info_struct **si) +{ + if (!vswap_tryget(entry)) + return false; + + /* + * No need to hold a reference to the swap device. The virtual swap slot pins + * the physical swap slot, which in turns pin the swap device. + */ + *si = swap_slot_swap_info(swp_entry_to_swp_slot(entry)); + return true; +} + +static inline void unlock_swapoff(swp_entry_t entry, + struct swap_info_struct *si) +{ + vswap_put(entry); +} #else static inline int vswap_init(void) { @@ -773,7 +795,6 @@ static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) { swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio); } -#endif static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) @@ -789,6 +810,7 @@ static inline void unlock_swapoff(swp_entry_t entry, { swap_slot_put_swap_info(si); } +#endif #if defined(CONFIG_SWAP) && !defined(CONFIG_VIRTUAL_SWAP) int add_swap_count_continuation(swp_entry_t, gfp_t); diff --git a/mm/vswap.c b/mm/vswap.c index 1b8cf894390c..8a518ebd20e4 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -425,6 +425,42 @@ swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) return ret; } +/** + * vswap_tryget - try to obtain an ephemeral reference to a virtual swap slot. + * + * @entry: the virtual swap slot. + * + * Return: true if the reference was obtained. + */ +bool vswap_tryget(swp_entry_t entry) +{ + struct swp_desc *desc; + bool ret; + + rcu_read_lock(); + desc = xa_load(&vswap_map, entry.val); + if (!desc) { + rcu_read_unlock(); + return false; + } + + ret = kref_get_unless_zero(&desc->refcnt); + rcu_read_unlock(); + return ret; +} + +/** + * vswap_put - release an ephemeral reference to the virtual swap slot. + * + * @entry: the virtual swap slot. 
+ */ +void vswap_put(swp_entry_t entry) +{ + struct swp_desc *desc = xa_load(&vswap_map, entry.val); + + kref_put(&desc->refcnt, vswap_ref_release); +} + /** * swap_free_nr_any_cache_only - decrease the swap count of nr contiguous swap * entries by 1 (when the swap entries are removed

From patchwork Mon Apr 7 23:42:11 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 879270
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com, yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev, shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com, chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org, huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk, baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com, christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 10/14] mm: swap: decouple virtual swap slot from backing store
Date: Mon, 7 Apr 2025 16:42:11 -0700
Message-ID: <20250407234223.1059191-11-nphamcs@gmail.com>
X-Mailer: git-send-email 2.47.1
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>
MIME-Version: 1.0

This patch presents the first real use case of the new virtual swap design. It leverages the virtualization of the swap space to decouple a swap entry from its backing storage. A swap entry can now be backed by one of the following options:

1. A slot on a physical swapfile/swap partition.
2. A "zero swap page".
3. A compressed object in the zswap pool.
4. An in-memory page. This can happen when a page is loaded (exclusively) from the zswap pool, or if the page is rejected by zswap and zswap writeback is disabled.

This allows us to use zswap and the zero swap page optimization without having to reserve a slot on a swapfile, or even have a swapfile at all.

Note that we disable THP swapin in the virtual swap implementation for now. Similarly, we only operate on one swap entry at a time when we zap a PTE range. There is no real reason why we cannot build support for these in the new design; it is simply to keep the patch smaller and more manageable for reviewers - these capabilities will be restored in a following patch.

In the same vein, for now we still charge virtual swap slots towards the memcg's swap usage.
In a following patch, we will change this behavior and only charge physical (i.e on swapfile) swap slots towards the memcg's swap usage. Signed-off-by: Nhat Pham --- include/linux/swap.h | 71 +++++- mm/Kconfig | 11 + mm/huge_memory.c | 5 +- mm/internal.h | 16 +- mm/memcontrol.c | 70 ++++-- mm/memory.c | 73 ++++-- mm/migrate.c | 1 + mm/page_io.c | 31 ++- mm/shmem.c | 7 +- mm/swap.h | 10 + mm/swap_state.c | 23 +- mm/swapfile.c | 20 +- mm/vmscan.c | 26 ++- mm/vswap.c | 531 ++++++++++++++++++++++++++++++++++++++----- mm/zswap.c | 34 ++- 15 files changed, 771 insertions(+), 158 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 7f6200f1db33..073835335667 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -461,6 +461,7 @@ extern void __meminit kswapd_stop(int nid); /* Virtual swap API */ swp_entry_t folio_alloc_swap(struct folio *folio); bool folio_free_swap(struct folio *folio); +void put_swap_folio(struct folio *folio, swp_entry_t entry); void swap_shmem_alloc(swp_entry_t, int); int swap_duplicate(swp_entry_t); int swapcache_prepare(swp_entry_t entry, int nr); @@ -508,7 +509,6 @@ static inline long get_nr_swap_pages(void) } void si_swapinfo(struct sysinfo *); -void swap_slot_put_folio(swp_slot_t slot, struct folio *folio); swp_slot_t swap_slot_alloc_of_type(int); int swap_slot_alloc(int n, swp_slot_t swp_slots[], int order); void swap_slot_free_nr(swp_slot_t slot, int nr_pages); @@ -725,9 +725,12 @@ static inline bool mem_cgroup_swap_full(struct folio *folio) } #endif +struct zswap_entry; + #ifdef CONFIG_VIRTUAL_SWAP int vswap_init(void); void vswap_exit(void); +swp_slot_t vswap_alloc_swap_slot(struct folio *folio); swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry); swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot); bool vswap_tryget(swp_entry_t entry); @@ -736,7 +739,13 @@ bool folio_swapped(struct folio *folio); bool vswap_swapcache_only(swp_entry_t entry, int nr); int non_swapcache_batch(swp_entry_t entry, int nr); bool swap_free_nr_any_cache_only(swp_entry_t entry, int nr); -void put_swap_folio(struct folio *folio, swp_entry_t entry); +bool vswap_disk_backed(swp_entry_t entry, int nr); +void vswap_split_huge_page(struct folio *head, struct folio *subpage); +void vswap_migrate(struct folio *src, struct folio *dst); +bool vswap_folio_backed(swp_entry_t entry, int nr); +void vswap_store_folio(swp_entry_t entry, struct folio *folio); +void swap_zeromap_folio_set(struct folio *folio); +void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry); static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) @@ -745,16 +754,33 @@ static inline bool trylock_swapoff(swp_entry_t entry, return false; /* - * No need to hold a reference to the swap device. The virtual swap slot pins - * the physical swap slot, which in turns pin the swap device. + * Note that this function does not provide any guarantee that the virtual + * swap slot's backing state will be stable, only that the slot itself will + * not be freed. This has several implications: + * + * 1. If the virtual swap slot is backed by a physical swap slot, we need to + * also acquire a reference to the swap device before returning it. + * Otherwise, the virtual swap slot can change its backend, allowing swapoff + * to free the swap device looked up here. + * + * 2. The swap device we are looking up here might be outdated by the time we + * return to the caller. It is perfectly OK, if the swap_info_struct is only + * used in a best-effort manner (i.e optimization). 
If we need the precise + * backing state, we need to re-check after the entry is pinned in swap cache. */ - *si = swap_slot_swap_info(swp_entry_to_swp_slot(entry)); + if (vswap_disk_backed(entry, 1)) + *si = swap_slot_tryget_swap_info(swp_entry_to_swp_slot(entry)); + else + *si = NULL; + return true; } static inline void unlock_swapoff(swp_entry_t entry, struct swap_info_struct *si) { + if (si) + swap_slot_put_swap_info(si); vswap_put(entry); } #else @@ -791,9 +817,32 @@ static inline swp_entry_t swp_slot_to_swp_entry(swp_slot_t slot) return (swp_entry_t) { slot.val }; } -static inline void put_swap_folio(struct folio *folio, swp_entry_t entry) +static inline swp_slot_t vswap_alloc_swap_slot(struct folio *folio) +{ + return swp_entry_to_swp_slot(folio->swap); +} + +static inline void vswap_split_huge_page(struct folio *head, + struct folio *subpage) +{ +} + +static inline void vswap_migrate(struct folio *src, struct folio *dst) +{ +} + +static inline bool vswap_folio_backed(swp_entry_t entry, int nr) +{ + return false; +} + +static inline void vswap_store_folio(swp_entry_t entry, struct folio *folio) +{ +} + +static inline void vswap_assoc_zswap(swp_entry_t entry, + struct zswap_entry *zswap_entry) { - swap_slot_put_folio(swp_entry_to_swp_slot(entry), folio); } static inline bool trylock_swapoff(swp_entry_t entry, @@ -810,8 +859,16 @@ static inline void unlock_swapoff(swp_entry_t entry, { swap_slot_put_swap_info(si); } + #endif +static inline struct swap_info_struct *vswap_get_device(swp_entry_t entry) +{ + swp_slot_t slot = swp_entry_to_swp_slot(entry); + + return slot.val ? swap_slot_tryget_swap_info(slot) : NULL; +} + #if defined(CONFIG_SWAP) && !defined(CONFIG_VIRTUAL_SWAP) int add_swap_count_continuation(swp_entry_t, gfp_t); #else diff --git a/mm/Kconfig b/mm/Kconfig index d578b8e6ab6a..e9abc33fe4b7 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -35,6 +35,17 @@ config VIRTUAL_SWAP In this new design, for each swap entry, a virtual swap slot is allocated and stored in the page table entry, rather than the handle to the physical swap slot on the swap device itself. + Swap entries that are: + + 1. Zero-filled + + 2. Stored in the zswap pool. + + 3. Rejected by zswap but cannot be written back to a backing + swap device. + + no longer take up any disk storage (i.e they do not occupy any + slot in the backing swap device). There might be more lock contentions with heavy swap use, since the swap cache is no longer range partitioned. diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 373781b21e5c..e6832ec2b07a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -3172,6 +3172,7 @@ static void __split_huge_page_tail(struct folio *folio, int tail, { struct page *head = &folio->page; struct page *page_tail = head + tail; + /* * Careful: new_folio is not a "real" folio before we cleared PageTail. * Don't pass it around before clear_compound_head(). @@ -3227,8 +3228,10 @@ static void __split_huge_page_tail(struct folio *folio, int tail, VM_WARN_ON_ONCE_PAGE(true, page_tail); page_tail->private = 0; } - if (folio_test_swapcache(folio)) + if (folio_test_swapcache(folio)) { new_folio->swap.val = folio->swap.val + tail; + vswap_split_huge_page(folio, new_folio); + } /* Page flags must be visible before we make the page non-compound. 
*/ smp_wmb(); diff --git a/mm/internal.h b/mm/internal.h index ca28729f822a..51061691a731 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -268,17 +268,12 @@ static inline swp_entry_t swap_nth(swp_entry_t entry, long n) return (swp_entry_t) { entry.val + n }; } -/* similar to swap_nth, but check the backing physical slots as well. */ +/* temporary disallow batched swap operations */ static inline swp_entry_t swap_move(swp_entry_t entry, long delta) { - swp_slot_t slot = swp_entry_to_swp_slot(entry), next_slot; - swp_entry_t next_entry = swap_nth(entry, delta); - - next_slot = swp_entry_to_swp_slot(next_entry); - if (swp_slot_type(slot) != swp_slot_type(next_slot) || - swp_slot_offset(slot) + delta != swp_slot_offset(next_slot)) - next_entry.val = 0; + swp_entry_t next_entry; + next_entry.val = 0; return next_entry; } #else @@ -349,6 +344,8 @@ static inline pte_t pte_next_swp_offset(pte_t pte) * max_nr must be at least one and must be limited by the caller so scanning * cannot exceed a single page table. * + * Note that for virtual swap space, we will not batch anything for now. + * * Return: the number of table entries in the batch. */ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) @@ -363,6 +360,9 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) VM_WARN_ON(!is_swap_pte(pte)); VM_WARN_ON(non_swap_entry(entry)); + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + return 1; + cgroup_id = lookup_swap_cgroup_id(entry); while (ptep < end_ptep) { pte = ptep_get(ptep); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index a037ec92881d..126b2d0e6aaa 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5095,10 +5095,23 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) rcu_read_unlock(); } +static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg); + long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg) { - long nr_swap_pages = get_nr_swap_pages(); + long nr_swap_pages, nr_zswap_pages = 0; + + /* + * If swap is virtualized and zswap is enabled, we can still use zswap even + * if there is no space left in any swap file/partition. 
+ */ + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled() && + (mem_cgroup_disabled() || do_memsw_account() || + mem_cgroup_may_zswap(memcg))) { + nr_zswap_pages = PAGE_COUNTER_MAX; + } + nr_swap_pages = max_t(long, nr_zswap_pages, get_nr_swap_pages()); if (mem_cgroup_disabled() || do_memsw_account()) return nr_swap_pages; for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) @@ -5267,6 +5280,29 @@ static struct cftype swap_files[] = { }; #ifdef CONFIG_ZSWAP +static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg) +{ + struct mem_cgroup *memcg; + + for (memcg = original_memcg; !mem_cgroup_is_root(memcg); + memcg = parent_mem_cgroup(memcg)) { + unsigned long max = READ_ONCE(memcg->zswap_max); + unsigned long pages; + + if (max == PAGE_COUNTER_MAX) + continue; + if (max == 0) + return false; + + /* Force flush to get accurate stats for charging */ + __mem_cgroup_flush_stats(memcg, true); + pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE; + if (pages >= max) + return false; + } + return true; +} + /** * obj_cgroup_may_zswap - check if this cgroup can zswap * @objcg: the object cgroup @@ -5281,34 +5317,15 @@ static struct cftype swap_files[] = { */ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) { - struct mem_cgroup *memcg, *original_memcg; + struct mem_cgroup *memcg; bool ret = true; if (!cgroup_subsys_on_dfl(memory_cgrp_subsys)) return true; - original_memcg = get_mem_cgroup_from_objcg(objcg); - for (memcg = original_memcg; !mem_cgroup_is_root(memcg); - memcg = parent_mem_cgroup(memcg)) { - unsigned long max = READ_ONCE(memcg->zswap_max); - unsigned long pages; - - if (max == PAGE_COUNTER_MAX) - continue; - if (max == 0) { - ret = false; - break; - } - - /* Force flush to get accurate stats for charging */ - __mem_cgroup_flush_stats(memcg, true); - pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE; - if (pages < max) - continue; - ret = false; - break; - } - mem_cgroup_put(original_memcg); + memcg = get_mem_cgroup_from_objcg(objcg); + ret = mem_cgroup_may_zswap(memcg); + mem_cgroup_put(memcg); return ret; } @@ -5452,6 +5469,11 @@ static struct cftype zswap_files[] = { }, { } /* terminate */ }; +#else +static inline bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg) +{ + return false; +} #endif /* CONFIG_ZSWAP */ static int __init mem_cgroup_swap_init(void) diff --git a/mm/memory.c b/mm/memory.c index a1d3631ad85f..c5c34efafa81 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4226,8 +4226,10 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) * A large swapped out folio could be partially or fully in zswap. We * lack handling for such cases, so fallback to swapping in order-0 * folio. + * + * We also disable THP swapin on the virtual swap implementation, for now. 
*/ - if (!zswap_never_enabled()) + if (!zswap_never_enabled() || IS_ENABLED(CONFIG_VIRTUAL_SWAP)) goto fallback; entry = pte_to_swp_entry(vmf->orig_pte); @@ -4305,12 +4307,12 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) struct folio *swapcache, *folio = NULL; DECLARE_WAITQUEUE(wait, current); struct page *page; - struct swap_info_struct *si = NULL; + struct swap_info_struct *si = NULL, *stable_si; rmap_t rmap_flags = RMAP_NONE; bool need_clear_cache = false; bool swapoff_locked = false; bool exclusive = false; - swp_entry_t entry; + swp_entry_t orig_entry, entry; swp_slot_t slot; pte_t pte; vm_fault_t ret = 0; @@ -4324,6 +4326,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) goto out; entry = pte_to_swp_entry(vmf->orig_pte); + /* + * entry might change if we get a large folio - remember the original entry + * for unlocking swapoff etc. + */ + orig_entry = entry; if (unlikely(non_swap_entry(entry))) { if (is_migration_entry(entry)) { migration_entry_wait(vma->vm_mm, vmf->pmd, @@ -4381,7 +4388,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) swapcache = folio; if (!folio) { - if (data_race(si->flags & SWP_SYNCHRONOUS_IO) && + if (si && data_race(si->flags & SWP_SYNCHRONOUS_IO) && __swap_count(entry) == 1) { /* skip swapcache */ folio = alloc_swap_folio(vmf); @@ -4591,27 +4598,43 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) * swapcache -> certainly exclusive. */ exclusive = true; - } else if (exclusive && folio_test_writeback(folio) && - data_race(si->flags & SWP_STABLE_WRITES)) { + } else if (exclusive && folio_test_writeback(folio)) { /* - * This is tricky: not all swap backends support - * concurrent page modifications while under writeback. - * - * So if we stumble over such a page in the swapcache - * we must not set the page exclusive, otherwise we can - * map it writable without further checks and modify it - * while still under writeback. - * - * For these problematic swap backends, simply drop the - * exclusive marker: this is perfectly fine as we start - * writeback only if we fully unmapped the page and - * there are no unexpected references on the page after - * unmapping succeeded. After fully unmapped, no - * further GUP references (FOLL_GET and FOLL_PIN) can - * appear, so dropping the exclusive marker and mapping - * it only R/O is fine. + * We need to look up the swap device again here, for the virtual + * swap case. The si we got from trylock_swapoff() is not + * guaranteed to be stable, as at that time we have not pinned + * the virtual swap slot's backing storage. With the folio locked + * and loaded into the swap cache, we can now guarantee a stable + * backing state. */ - exclusive = false; + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + stable_si = vswap_get_device(entry); + else + stable_si = si; + if (stable_si && data_race(stable_si->flags & SWP_STABLE_WRITES)) { + /* + * This is tricky: not all swap backends support + * concurrent page modifications while under writeback. + * + * So if we stumble over such a page in the swapcache + * we must not set the page exclusive, otherwise we can + * map it writable without further checks and modify it + * while still under writeback. + * + * For these problematic swap backends, simply drop the + * exclusive marker: this is perfectly fine as we start + * writeback only if we fully unmapped the page and + * there are no unexpected references on the page after + * unmapping succeeded. 
After fully unmapped, no + * further GUP references (FOLL_GET and FOLL_PIN) can + * appear, so dropping the exclusive marker and mapping + * it only R/O is fine. + */ + exclusive = false; + } + + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && stable_si) + swap_slot_put_swap_info(stable_si); } } @@ -4720,7 +4743,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) wake_up(&swapcache_wq); } if (swapoff_locked) - unlock_swapoff(entry, si); + unlock_swapoff(orig_entry, si); return ret; out_nomap: if (vmf->pte) @@ -4739,7 +4762,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) wake_up(&swapcache_wq); } if (swapoff_locked) - unlock_swapoff(entry, si); + unlock_swapoff(orig_entry, si); return ret; } diff --git a/mm/migrate.c b/mm/migrate.c index 97f0edf0c032..3a2cf62f47ea 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -523,6 +523,7 @@ static int __folio_migrate_mapping(struct address_space *mapping, if (folio_test_swapcache(folio)) { folio_set_swapcache(newfolio); newfolio->private = folio_get_private(folio); + vswap_migrate(folio, newfolio); entries = nr; } else { entries = 1; diff --git a/mm/page_io.c b/mm/page_io.c index 182851c47f43..83fc4a466db8 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -201,6 +201,12 @@ static bool is_folio_zero_filled(struct folio *folio) return true; } +#ifdef CONFIG_VIRTUAL_SWAP +static void swap_zeromap_folio_clear(struct folio *folio) +{ + vswap_store_folio(folio->swap, folio); +} +#else static void swap_zeromap_folio_set(struct folio *folio) { struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); @@ -238,6 +244,7 @@ static void swap_zeromap_folio_clear(struct folio *folio) clear_bit(swp_slot_offset(slot), sis->zeromap); } } +#endif /* CONFIG_VIRTUAL_SWAP */ /* * We may have stale swap cache pages in memory: notice @@ -246,6 +253,7 @@ static void swap_zeromap_folio_clear(struct folio *folio) int swap_writepage(struct page *page, struct writeback_control *wbc) { struct folio *folio = page_folio(page); + swp_slot_t slot; int ret; if (folio_free_swap(folio)) { @@ -275,9 +283,8 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) return 0; } else { /* - * Clear bits this folio occupies in the zeromap to prevent - * zero data being read in from any previous zero writes that - * occupied the same swap entries. + * Clear the zeromap state to prevent zero data being read in from any + * previous zero writes that occupied the same swap entries. 
*/ swap_zeromap_folio_clear(folio); } @@ -291,6 +298,13 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) return AOP_WRITEPAGE_ACTIVATE; } + /* fall back to physical swap device */ + slot = vswap_alloc_swap_slot(folio); + if (!slot.val) { + folio_mark_dirty(folio); + return AOP_WRITEPAGE_ACTIVATE; + } + __swap_writepage(folio, wbc); return 0; } @@ -624,14 +638,11 @@ static void swap_read_folio_bdev_async(struct folio *folio, void swap_read_folio(struct folio *folio, struct swap_iocb **plug) { - struct swap_info_struct *sis = - swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); - bool synchronous = sis->flags & SWP_SYNCHRONOUS_IO; - bool workingset = folio_test_workingset(folio); + struct swap_info_struct *sis; + bool synchronous, workingset = folio_test_workingset(folio); unsigned long pflags; bool in_thrashing; - VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio); VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio); VM_BUG_ON_FOLIO(folio_test_uptodate(folio), folio); @@ -657,6 +668,10 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug) /* We have to read from slower devices. Increase zswap protection. */ zswap_folio_swapin(folio); + sis = swap_slot_swap_info(swp_entry_to_swp_slot(folio->swap)); + synchronous = sis->flags & SWP_SYNCHRONOUS_IO; + VM_BUG_ON_FOLIO(!folio_test_swapcache(folio) && !synchronous, folio); + if (data_race(sis->flags & SWP_FS_OPS)) { swap_read_folio_fs(folio, plug); } else if (synchronous) { diff --git a/mm/shmem.c b/mm/shmem.c index 4c00b4673468..609971a2b365 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1404,7 +1404,7 @@ static int shmem_find_swap_entries(struct address_space *mapping, * swapin error entries can be found in the mapping. But they're * deliberately ignored here as we've done everything we can do. */ - if (swp_slot_type(slot) != type) + if (!slot.val || swp_slot_type(slot) != type) continue; indices[folio_batch_count(fbatch)] = xas.xa_index; @@ -1554,7 +1554,7 @@ static int shmem_writepage(struct page *page, struct writeback_control *wbc) if ((info->flags & VM_LOCKED) || sbinfo->noswap) goto redirty; - if (!total_swap_pages) + if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP) && !total_swap_pages) goto redirty; /* @@ -2295,7 +2295,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index, fallback_order0 = true; /* Skip swapcache for synchronous device. */ - if (!fallback_order0 && data_race(si->flags & SWP_SYNCHRONOUS_IO)) { + if (!fallback_order0 && si && + data_race(si->flags & SWP_SYNCHRONOUS_IO)) { folio = shmem_swap_alloc_folio(inode, vma, index, swap, order, gfp); if (!IS_ERR(folio)) { skip_swapcache = true; diff --git a/mm/swap.h b/mm/swap.h index 31c94671cb44..411282d08a15 100644 --- a/mm/swap.h +++ b/mm/swap.h @@ -86,9 +86,18 @@ static inline unsigned int folio_swap_flags(struct folio *folio) { swp_slot_t swp_slot = swp_entry_to_swp_slot(folio->swap); + /* + * In the virtual swap implementation, the folio might not be backed by any + * physical swap slots (for e.g zswap-backed only). + */ + if (!swp_slot.val) + return 0; return swap_slot_swap_info(swp_slot)->flags; } +#ifdef CONFIG_VIRTUAL_SWAP +int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap); +#else /* * Return the count of contiguous swap entries that share the same * zeromap status as the starting entry. 
If is_zeromap is not NULL, @@ -114,6 +123,7 @@ static inline int swap_zeromap_batch(swp_entry_t entry, int max_nr, else return find_next_bit(sis->zeromap, end, start) - start; } +#endif #else /* CONFIG_SWAP */ struct swap_iocb; diff --git a/mm/swap_state.c b/mm/swap_state.c index eb4cd6ba2068..3a549c4583e9 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -494,6 +494,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, for (;;) { int err; + /* * First check the swap cache. Since this is normally * called after swap_cache_get_folio() failed, re-calling @@ -531,8 +532,20 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask, * Swap entry may have been freed since our caller observed it. */ err = swapcache_prepare(entry, 1); - if (!err) + if (!err) { + /* This might be invoked by swap_cluster_readahead(), which can + * race with shmem_swapin_folio(). The latter might have already + * called delete_from_swap_cache(), allowing swapcache_prepare() + * to succeed here. This can lead to reading bogus data to populate + * the page. To prevent this, skip folio-backed virtual swap slots, + * and let caller retry if necessary. + */ + if (vswap_folio_backed(entry, 1)) { + swapcache_clear(si, entry, 1); + goto put_and_return; + } break; + } else if (err != -EEXIST) goto put_and_return; @@ -715,6 +728,14 @@ struct folio *swap_cluster_readahead(swp_entry_t entry, gfp_t gfp_mask, struct swap_iocb *splug = NULL; bool page_allocated; + /* + * If swap is virtualized, the swap entry might not be backed by any + * physical swap slot. In that case, just skip readahead and bring in the + * target entry. + */ + if (!slot.val) + goto skip; + mask = swapin_nr_pages(offset) - 1; if (!mask) goto skip; diff --git a/mm/swapfile.c b/mm/swapfile.c index babb545acffd..59b34d51b16b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1165,16 +1165,22 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset, unsigned long begin = offset; unsigned long end = offset + nr_entries - 1; void (*swap_slot_free_notify)(struct block_device *, unsigned long); +#ifndef CONFIG_VIRTUAL_SWAP unsigned int i; /* * Use atomic clear_bit operations only on zeromap instead of non-atomic * bitmap_clear to prevent adjacent bits corruption due to simultaneous writes. + * + * Note that in the virtual swap implementation, we do not need to perform + * these operations, since zswap and zero-filled pages are not backed by + * physical swapfile. */ for (i = 0; i < nr_entries; i++) { clear_bit(offset + i, si->zeromap); zswap_invalidate(swp_slot_to_swp_entry(swp_slot(si->type, offset + i))); } +#endif if (si->flags & SWP_BLKDEV) swap_slot_free_notify = @@ -1639,43 +1645,35 @@ void swap_free_nr(swp_entry_t entry, int nr_pages) { swap_slot_free_nr(swp_entry_to_swp_slot(entry), nr_pages); } -#endif /* * This should only be called in contexts in which the slot has * been allocated but not associated with any swap entries. 
*/ -void swap_slot_put_folio(swp_slot_t slot, struct folio *folio) +void put_swap_folio(struct folio *folio, swp_entry_t entry) { + swp_slot_t slot = swp_entry_to_swp_slot(entry); unsigned long offset = swp_slot_offset(slot); struct swap_cluster_info *ci; struct swap_info_struct *si; int size = 1 << swap_slot_order(folio_order(folio)); - unsigned char usage; si = _swap_info_get(slot); if (!si) return; -#ifdef CONFIG_VIRTUAL_SWAP - usage = SWAP_MAP_ALLOCATED; -#else - usage = SWAP_HAS_CACHE; -#endif - ci = lock_cluster(si, offset); if (swap_is_has_cache(si, offset, size)) swap_slot_range_free(si, ci, slot, size); else { for (int i = 0; i < size; i++, slot.val++) { - if (!__swap_slot_free_locked(si, offset + i, usage)) + if (!__swap_slot_free_locked(si, offset + i, SWAP_HAS_CACHE)) swap_slot_range_free(si, ci, slot, 1); } } unlock_cluster(ci); } -#ifndef CONFIG_VIRTUAL_SWAP int __swap_count(swp_entry_t entry) { swp_slot_t slot = swp_entry_to_swp_slot(entry); diff --git a/mm/vmscan.c b/mm/vmscan.c index c767d71c43d7..db4178bf5f6f 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -341,10 +341,15 @@ static inline bool can_reclaim_anon_pages(struct mem_cgroup *memcg, { if (memcg == NULL) { /* - * For non-memcg reclaim, is there - * space in any swap device? + * For non-memcg reclaim: + * + * If swap is virtualized, we can still use zswap even if there is no + * space left in any swap file/partition. + * + * Otherwise, check if there is space in any swap device? */ - if (get_nr_swap_pages() > 0) + if ((IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled()) || + get_nr_swap_pages() > 0) return true; } else { /* Is the memcg below its swap limit? */ @@ -2611,12 +2616,15 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc, static bool can_age_anon_pages(struct pglist_data *pgdat, struct scan_control *sc) { - /* Aging the anon LRU is valuable if swap is present: */ - if (total_swap_pages > 0) - return true; - - /* Also valuable if anon pages can be demoted: */ - return can_demote(pgdat->node_id, sc); + /* + * Aging the anon LRU is valuable if: + * 1. Swap is virtualized and zswap is enabled. + * 2. There are physical swap slots available. + * 3. Anon pages can be demoted + */ + return (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled()) || + total_swap_pages > 0 || + can_demote(pgdat->node_id, sc); } #ifdef CONFIG_LRU_GEN diff --git a/mm/vswap.c b/mm/vswap.c index 8a518ebd20e4..3146c231ca69 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -34,9 +34,33 @@ * about to be added to the swap cache). Its reference count is incremented or * decremented every time it is mapped to or unmapped from a PTE, as well as * when it is added to or removed from the swap cache. Finally, when its - * reference count reaches 0, the virtual swap slot is freed. + * reference count reaches 0, the virtual swap slot is freed, and its backing + * store released. + * + * + * II. Backing State + * + * Each virtual swap slot be backed by: + * + * 1. A slot on a physical swap device (i.e a swapfile or a swap partition). + * 2. A swapped out zero-filled page. + * 3. A compressed object in zswap. + * 4. An in-memory folio, that is not backed by neither a physical swap device + * nor zswap (i.e only in swap cache). This is used for pages that are + * rejected by zswap, but not (yet) backed by a physical swap device, + * (for e.g, due to zswap.writeback = 0), or for pages that were previously + * stored in zswap, but has since been loaded back into memory (and has its + * zswap copy invalidated). 
*/ +/* The backing state options of a virtual swap slot */ +enum swap_type { + VSWAP_SWAPFILE, + VSWAP_ZERO, + VSWAP_ZSWAP, + VSWAP_FOLIO +}; + /** * Swap descriptor - metadata of a swapped out page. * @@ -44,7 +68,11 @@ * @slot: The handle to the physical swap slot backing this page. * @rcu: The RCU head to free the descriptor with an RCU grace period. * @lock: The lock protecting the swap slot backing field. + * @folio: The folio that backs the virtual swap slot. + * @zswap_entry: The zswap entry that backs the virtual swap slot. + * @lock: The lock protecting the swap slot backing fields. * @memcgid: The memcg id of the owning memcg, if any. + * @type: The backing store type of the swap entry. * @in_swapcache: whether the swapped out page is currently in swap cache. * This flag is also used to establish the primary owner of * the swap page in many cases. @@ -55,10 +83,15 @@ */ struct swp_desc { swp_entry_t vswap; - swp_slot_t slot; + union { + swp_slot_t slot; + struct folio *folio; + struct zswap_entry *zswap_entry; + }; struct rcu_head rcu; rwlock_t lock; + enum swap_type type; #ifdef CONFIG_MEMCG atomic_t memcgid; @@ -159,6 +192,7 @@ static swp_entry_t vswap_alloc(int nr) } for (i = 0; i < nr; i++) { + descs[i]->type = VSWAP_SWAPFILE; descs[i]->slot.val = 0; atomic_set(&descs[i]->memcgid, 0); atomic_set(&descs[i]->swap_count, 0); @@ -250,6 +284,74 @@ static inline void release_vswap_slot(unsigned long index) atomic_dec(&vswap_used); } +/* + * Caller enters with swap descriptor unlocked, but needs to handle races with + * other operations themselves. + * + * For instance, this function is safe to be called in contexts where the swap + * entry has been added to the swap cache and the associated folio is locked. + * We cannot race with other accessors, and the swap entry is guaranteed to be + * valid the whole time (since in_swapcache status implies one reference + * count). + * + * We also need to make sure the backing state of the entire range matches. + * This is usually already checked by upstream callers. + */ +static inline void release_backing(swp_entry_t entry, int nr) +{ + swp_slot_t slot = (swp_slot_t){0}; + struct swap_info_struct *si; + struct folio *folio = NULL; + enum swap_type type; + struct swp_desc *desc; + int i = 0; + + VM_WARN_ON(!entry.val); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + VM_WARN_ON(!desc); + write_lock(&desc->lock); + if (!i) { + type = desc->type; + if (type == VSWAP_FOLIO) + folio = desc->folio; + else if (type == VSWAP_SWAPFILE) + slot = desc->slot; + } else { + VM_WARN_ON(type != desc->type); + VM_WARN_ON(type == VSWAP_FOLIO && desc->folio != folio); + VM_WARN_ON(type == VSWAP_SWAPFILE && slot.val && + desc->slot.val != slot.val + i); + } + + if (desc->type == VSWAP_ZSWAP) + zswap_invalidate((swp_entry_t){entry.val + i}); + else if (desc->type == VSWAP_SWAPFILE) { + if (desc->slot.val) { + xa_erase(&vswap_rmap, desc->slot.val); + desc->slot.val = 0; + } + } + write_unlock(&desc->lock); + i++; + } + rcu_read_unlock(); + + if (slot.val) { + si = swap_slot_tryget_swap_info(slot); + if (si) { + swap_slot_free_nr(slot, nr); + swap_slot_put_swap_info(si); + } + } +} + /** * vswap_free - free a virtual swap slot. 
* @id: the virtual swap slot to free @@ -261,20 +363,11 @@ static void vswap_free(swp_entry_t entry) if (!entry.val) return; - /* do not immediately erase the virtual slot to prevent its reuse */ desc = xa_load(&vswap_map, entry.val); - if (!desc) - return; virt_clear_shadow_from_swap_cache(entry); - - if (desc->slot.val) { - /* we only charge after linkage was established */ - mem_cgroup_uncharge_swap(entry, 1); - xa_erase(&vswap_rmap, desc->slot.val); - swap_slot_free_nr(desc->slot, 1); - } - + release_backing(entry, 1); + mem_cgroup_uncharge_swap(entry, 1); /* erase forward mapping and release the virtual slot for reallocation */ release_vswap_slot(entry.val); kfree_rcu(desc, rcu); @@ -288,34 +381,78 @@ static void vswap_ref_release(struct kref *refcnt) } /** - * folio_alloc_swap - allocate virtual swap slots for a folio. - * @folio: the folio. + * folio_alloc_swap - allocate virtual swap slots for a folio, and + * set their backing store to the folio. + * @folio: the folio to allocate virtual swap slots for. * * Return: the first allocated slot if success, or the zero virtuals swap slot * on failure. */ swp_entry_t folio_alloc_swap(struct folio *folio) { - int i, err, nr = folio_nr_pages(folio); - bool manual_freeing = true; - struct swp_desc *desc; swp_entry_t entry; - swp_slot_t slot; + struct swp_desc *desc; + int i, nr = folio_nr_pages(folio); entry = vswap_alloc(nr); if (!entry.val) return entry; /* - * XXX: for now, we always allocate a physical swap slot for each virtual - * swap slot, and their lifetime are coupled. This will change once we - * decouple virtual swap slots from their backing states, and only allocate - * physical swap slots for them on demand (i.e on zswap writeback, or - * fallback from zswap store failure). + * XXX: for now, we charge towards the memory cgroup's swap limit on virtual + * swap slots allocation. This will be changed soon - we will only charge on + * physical swap slots allocation. */ + if (mem_cgroup_try_charge_swap(folio, entry)) { + for (i = 0; i < nr; i++) { + vswap_free(entry); + entry.val++; + } + atomic_add(nr, &vswap_alloc_reject); + entry.val = 0; + return entry; + } + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + desc->folio = folio; + desc->type = VSWAP_FOLIO; + } + rcu_read_unlock(); + return entry; +} + +/** + * vswap_alloc_swap_slot - allocate physical swap space for a folio that is + * already associated with virtual swap slots. + * @folio: folio we want to allocate physical swap space for. + * + * Return: the first allocated physical swap slot, if succeeds. + */ +swp_slot_t vswap_alloc_swap_slot(struct folio *folio) +{ + int i, err, nr = folio_nr_pages(folio); + swp_slot_t slot = { .val = 0 }; + swp_entry_t entry = folio->swap; + struct swp_desc *desc; + bool fallback = false; + + /* + * We might have already allocated a backing physical swap slot in past + * attempts (for instance, when we disable zswap). + */ + slot = swp_entry_to_swp_slot(entry); + if (slot.val) + return slot; + slot = folio_alloc_swap_slot(folio); if (!slot.val) - goto vswap_free; + return slot; /* establish the vrtual <-> physical swap slots linkages. */ for (i = 0; i < nr; i++) { @@ -325,7 +462,13 @@ swp_entry_t folio_alloc_swap(struct folio *folio) if (err) { while (--i >= 0) xa_erase(&vswap_rmap, slot.val + i); - goto put_physical_swap; + /* + * We have not updated the backing type of the virtual swap slot. 
+ * Simply free up the physical swap slots here! + */ + swap_slot_free_nr(slot, nr); + slot.val = 0; + return slot; } } @@ -337,36 +480,31 @@ swp_entry_t folio_alloc_swap(struct folio *folio) if (xas_retry(&xas, desc)) continue; + write_lock(&desc->lock); + if (desc->type == VSWAP_FOLIO) { + /* case 1: fallback from zswap store failure */ + fallback = true; + if (!folio) + folio = desc->folio; + else + VM_WARN_ON(folio != desc->folio); + } else { + /* + * Case 2: zswap writeback. + * + * No need to free zswap entry here - it will be freed once zswap + * writeback suceeds. + */ + VM_WARN_ON(desc->type != VSWAP_ZSWAP); + VM_WARN_ON(fallback); + } + desc->type = VSWAP_SWAPFILE; desc->slot.val = slot.val + i; + write_unlock(&desc->lock); i++; } rcu_read_unlock(); - - manual_freeing = false; - /* - * XXX: for now, we charge towards the memory cgroup's swap limit on virtual - * swap slots allocation. This is acceptable because as noted above, each - * virtual swap slot corresponds to a physical swap slot. Once we have - * decoupled virtual and physical swap slots, we will only charge when we - * actually allocate a physical swap slot. - */ - if (!mem_cgroup_try_charge_swap(folio, entry)) - return entry; - -put_physical_swap: - /* - * There is no any linkage between virtual and physical swap slots yet. We - * have to manually and separately free the allocated virtual and physical - * swap slots. - */ - swap_slot_put_folio(slot, folio); -vswap_free: - if (manual_freeing) { - for (i = 0; i < nr; i++) - vswap_free((swp_entry_t){entry.val + i}); - } - entry.val = 0; - return entry; + return slot; } /** @@ -374,7 +512,9 @@ swp_entry_t folio_alloc_swap(struct folio *folio) * virtual swap slot. * @entry: the virtual swap slot. * - * Return: the physical swap slot corresponding to the virtual swap slot. + * Return: the physical swap slot corresponding to the virtual swap slot, if + * exists, or the zero physical swap slot if the virtual swap slot is not + * backed by any physical slot on a swapfile. */ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) { @@ -392,7 +532,10 @@ swp_slot_t swp_entry_to_swp_slot(swp_entry_t entry) } read_lock(&desc->lock); - slot = desc->slot; + if (desc->type != VSWAP_SWAPFILE) + slot = (swp_slot_t){0}; + else + slot = desc->slot; read_unlock(&desc->lock); rcu_read_unlock(); @@ -729,6 +872,286 @@ int non_swapcache_batch(swp_entry_t entry, int max_nr) return i; } +/** + * vswap_split_huge_page - update a subpage's swap descriptor to point to the + * recently split out subpage folio descriptor. + * @head: the original head's folio descriptor. + * @subpage: the subpage's folio descriptor. + */ +void vswap_split_huge_page(struct folio *head, struct folio *subpage) +{ + struct swp_desc *desc = xa_load(&vswap_map, subpage->swap.val); + + write_lock(&desc->lock); + if (desc->type == VSWAP_FOLIO) { + VM_WARN_ON(desc->folio != head); + desc->folio = subpage; + } + write_unlock(&desc->lock); +} + +/** + * vswap_migrate - update the swap entries of the original folio to refer to + * the new folio for migration. + * @old: the old folio. + * @new: the new folio. 
+ */ +void vswap_migrate(struct folio *src, struct folio *dst) +{ + long nr = folio_nr_pages(src), nr_folio_backed = 0; + struct swp_desc *desc; + + VM_WARN_ON(!folio_test_locked(src)); + VM_WARN_ON(!folio_test_swapcache(src)); + + XA_STATE(xas, &vswap_map, src->swap.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, src->swap.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + write_lock(&desc->lock); + if (desc->type == VSWAP_FOLIO) { + VM_WARN_ON(desc->folio != src); + desc->folio = dst; + nr_folio_backed++; + } + write_unlock(&desc->lock); + } + rcu_read_unlock(); + + /* we should not see mixed backing states for swap entries in swap cache */ + VM_WARN_ON(nr_folio_backed && nr_folio_backed != nr); +} + +/** + * vswap_store_folio - set a folio as the backing of a range of virtual swap + * slots. + * @entry: the first virtual swap slot in the range. + * @folio: the folio. + */ +void vswap_store_folio(swp_entry_t entry, struct folio *folio) +{ + int nr = folio_nr_pages(folio); + struct swp_desc *desc; + + VM_BUG_ON(!folio_test_locked(folio)); + VM_BUG_ON(folio->swap.val != entry.val); + + release_backing(entry, nr); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + write_lock(&desc->lock); + desc->type = VSWAP_FOLIO; + desc->folio = folio; + write_unlock(&desc->lock); + } + rcu_read_unlock(); +} + +/** + * vswap_assoc_zswap - associate a virtual swap slot to a zswap entry. + * @entry: the virtual swap slot. + * @zswap_entry: the zswap entry. + */ +void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry) +{ + struct swp_desc *desc; + + release_backing(entry, 1); + + desc = xa_load(&vswap_map, entry.val); + write_lock(&desc->lock); + desc->type = VSWAP_ZSWAP; + desc->zswap_entry = zswap_entry; + write_unlock(&desc->lock); +} + +/** + * swap_zeromap_folio_set - mark a range of virtual swap slots corresponding to + * a folio as zero-filled. + * @folio: the folio + */ +void swap_zeromap_folio_set(struct folio *folio) +{ + struct obj_cgroup *objcg = get_obj_cgroup_from_folio(folio); + swp_entry_t entry = folio->swap; + int nr = folio_nr_pages(folio); + struct swp_desc *desc; + + VM_BUG_ON(!folio_test_locked(folio)); + VM_BUG_ON(!entry.val); + + release_backing(entry, nr); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + write_lock(&desc->lock); + desc->type = VSWAP_ZERO; + write_unlock(&desc->lock); + } + rcu_read_unlock(); + + count_vm_events(SWPOUT_ZERO, nr); + if (objcg) { + count_objcg_events(objcg, SWPOUT_ZERO, nr); + obj_cgroup_put(objcg); + } +} + +/* + * Iterate through the entire range of virtual swap slots, returning the + * longest contiguous range of slots starting from the first slot that satisfies: + * + * 1. If the first slot is zero-mapped, the entire range should be + * zero-mapped. + * 2. If the first slot is backed by a swapfile, the entire range should + * be backed by a range of contiguous swap slots on the same swapfile. + * 3. If the first slot is zswap-backed, the entire range should be + * zswap-backed. + * 4. If the first slot is backed by a folio, the entire range should + * be backed by the same folio. + * + * Note that this check is racy unless we can ensure that the entire range + * has their backing state stable - for instance, if the caller was the one + * who set the in_swapcache flag of the entire field. 
+ */ +static int vswap_check_backing(swp_entry_t entry, enum swap_type *type, int nr) +{ + unsigned int swapfile_type; + enum swap_type first_type; + struct swp_desc *desc; + pgoff_t first_offset; + struct folio *folio; + int i = 0; + + if (!entry.val || non_swap_entry(entry)) + return 0; + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc) + goto done; + + read_lock(&desc->lock); + if (!i) { + first_type = desc->type; + if (first_type == VSWAP_SWAPFILE) { + swapfile_type = swp_slot_type(desc->slot); + first_offset = swp_slot_offset(desc->slot); + } else if (first_type == VSWAP_FOLIO) { + folio = desc->folio; + } + } else if (desc->type != first_type) { + read_unlock(&desc->lock); + goto done; + } else if (first_type == VSWAP_SWAPFILE && + (swp_slot_type(desc->slot) != swapfile_type || + swp_slot_offset(desc->slot) != first_offset + i)) { + read_unlock(&desc->lock); + goto done; + } else if (first_type == VSWAP_FOLIO && desc->folio != folio) { + read_unlock(&desc->lock); + goto done; + } + read_unlock(&desc->lock); + i++; + } +done: + rcu_read_unlock(); + if (type) + *type = first_type; + return i; +} + +/** + * vswap_disk_backed - check if the virtual swap slots are backed by physical + * swap slots. + * @entry: the first entry in the range. + * @nr: the number of entries in the range. + */ +bool vswap_disk_backed(swp_entry_t entry, int nr) +{ + enum swap_type type; + + return vswap_check_backing(entry, &type, nr) == nr + && type == VSWAP_SWAPFILE; +} + +/** + * vswap_folio_backed - check if the virtual swap slots are backed by in-memory + * pages. + * @entry: the first virtual swap slot in the range. + * @nr: the number of slots in the range. + */ +bool vswap_folio_backed(swp_entry_t entry, int nr) +{ + enum swap_type type; + + return vswap_check_backing(entry, &type, nr) == nr + && type == VSWAP_FOLIO; +} + +/* + * Return the count of contiguous swap entries that share the same + * VSWAP_ZERO status as the starting entry. If is_zeromap is not NULL, + * it will return the VSWAP_ZERO status of the starting entry. + */ +int swap_zeromap_batch(swp_entry_t entry, int max_nr, bool *is_zeromap) +{ + struct swp_desc *desc; + int i = 0; + bool is_zero = false; + + VM_WARN_ON(!entry.val || non_swap_entry(entry)); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + max_nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + if (!desc) + goto done; + + read_lock(&desc->lock); + if (!i) { + is_zero = (desc->type == VSWAP_ZERO); + } else { + if ((desc->type == VSWAP_ZERO) != is_zero) { + read_unlock(&desc->lock); + goto done; + } + } + read_unlock(&desc->lock); + i++; + } +done: + rcu_read_unlock(); + if (i && is_zeromap) + *is_zeromap = is_zero; + + return i; +} + /** * free_swap_and_cache_nr() - Release a swap count on range of swap entries and * reclaim their cache if no more references remain. 
diff --git a/mm/zswap.c b/mm/zswap.c index c1327569ce80..15429825d667 100644 --- a/mm/zswap.c +++ b/mm/zswap.c @@ -1068,6 +1068,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry, struct writeback_control wbc = { .sync_mode = WB_SYNC_NONE, }; + struct zswap_entry *new_entry; + swp_slot_t slot; /* try to allocate swap cache folio */ mpol = get_task_policy(current); @@ -1088,6 +1090,10 @@ static int zswap_writeback_entry(struct zswap_entry *entry, return -EEXIST; } + slot = vswap_alloc_swap_slot(folio); + if (!slot.val) + goto release_folio; + /* * folio is locked, and the swapcache is now secured against * concurrent swapping to and from the slot, and concurrent @@ -1098,12 +1104,9 @@ static int zswap_writeback_entry(struct zswap_entry *entry, * be dereferenced. */ tree = swap_zswap_tree(swpentry); - if (entry != xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL)) { - delete_from_swap_cache(folio); - folio_unlock(folio); - folio_put(folio); - return -ENOMEM; - } + new_entry = xa_cmpxchg(tree, offset, entry, NULL, GFP_KERNEL); + if (entry != new_entry) + goto fail; zswap_decompress(entry, folio); @@ -1124,6 +1127,14 @@ static int zswap_writeback_entry(struct zswap_entry *entry, folio_put(folio); return 0; + +fail: + vswap_assoc_zswap(swpentry, new_entry); +release_folio: + delete_from_swap_cache(folio); + folio_unlock(folio); + folio_put(folio); + return -ENOMEM; } /********************************* @@ -1487,6 +1498,8 @@ static bool zswap_store_page(struct page *page, goto store_failed; } + vswap_assoc_zswap(page_swpentry, entry); + /* * We may have had an existing entry that became stale when * the folio was redirtied and now the new version is being @@ -1608,7 +1621,7 @@ bool zswap_store(struct folio *folio) */ if (!ret) { unsigned type = swp_type(swp); - pgoff_t offset = swp_offset(swp); + pgoff_t offset = zswap_tree_index(swp); struct zswap_entry *entry; struct xarray *tree; @@ -1618,6 +1631,12 @@ bool zswap_store(struct folio *folio) if (entry) zswap_entry_free(entry); } + + /* + * We might have also partially associated some virtual swap slots with + * zswap entries. Undo this. 
+		 */
+		vswap_store_folio(swp, folio);
 	}
 
 	return ret;
 }
@@ -1674,6 +1693,7 @@ bool zswap_load(struct folio *folio)
 		count_objcg_events(entry->objcg, ZSWPIN, 1);
 
 	if (swapcache) {
+		vswap_store_folio(swp, folio);
 		zswap_entry_free(entry);
 		folio_mark_dirty(folio);
 	}

From patchwork Mon Apr 7 23:42:12 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 879272
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
 yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
 chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
 huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk,
 baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com,
 christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 11/14] memcg: swap: only charge physical swap slots
Date: Mon, 7 Apr 2025 16:42:12 -0700
Message-ID: <20250407234223.1059191-12-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>
MIME-Version: 1.0

Now that zswap and the zero-filled swap page optimization no longer take
up any physical swap space, we should not charge against the memcg's swap
usage and limits in these cases. We only record the memcg id when the
virtual swap slot is allocated, and defer the physical swap charge (i.e.
towards memory.swap.current) until the virtual swap slot is backed by an
actual physical swap slot (on fallback from a failed zswap store, or on
zswap writeback).
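
For illustration, the intended split can be sketched as follows. This is a
condensed sketch, not the patch code itself: error handling, locking and
the zswap bookkeeping are elided, and alloc_physical_slot() is a
placeholder name standing in for the physical slot allocation path, which
is not spelled out here.

	/* Condensed sketch - not the actual mm/vswap.c code. */

	/* Swapout: allocate a virtual slot and record the memcg id only. */
	swp_entry_t folio_alloc_swap(struct folio *folio)
	{
		swp_entry_t entry = vswap_alloc(folio_nr_pages(folio));

		if (entry.val)
			mem_cgroup_record_swap(folio, entry);	/* no charge yet */
		return entry;
	}

	/*
	 * Zswap writeback, or fallback from a failed zswap store: back the
	 * virtual slot with a physical slot, and only now charge
	 * memory.swap.current.
	 */
	swp_slot_t vswap_alloc_swap_slot(struct folio *folio)
	{
		int nr = folio_nr_pages(folio);
		/* placeholder for the real physical slot allocation */
		swp_slot_t slot = alloc_physical_slot(folio);

		if (slot.val && mem_cgroup_try_charge_swap(folio, folio->swap)) {
			/* charge failed: give the physical slot back */
			swap_slot_free_nr(slot, nr);
			slot.val = 0;
		}
		return slot;
	}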
Signed-off-by: Nhat Pham --- include/linux/swap.h | 17 ++++++++ mm/memcontrol.c | 102 ++++++++++++++++++++++++++++++++++--------- mm/vswap.c | 43 ++++++++---------- 3 files changed, 118 insertions(+), 44 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 073835335667..98cdfe0c1da7 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -679,6 +679,23 @@ static inline void folio_throttle_swaprate(struct folio *folio, gfp_t gfp) #if defined(CONFIG_MEMCG) && defined(CONFIG_SWAP) void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry); + +void __mem_cgroup_record_swap(struct folio *folio, swp_entry_t entry); +static inline void mem_cgroup_record_swap(struct folio *folio, + swp_entry_t entry) +{ + if (!mem_cgroup_disabled()) + __mem_cgroup_record_swap(folio, entry); +} + +void __mem_cgroup_unrecord_swap(swp_entry_t entry, unsigned int nr_pages); +static inline void mem_cgroup_unrecord_swap(swp_entry_t entry, + unsigned int nr_pages) +{ + if (!mem_cgroup_disabled()) + __mem_cgroup_unrecord_swap(entry, nr_pages); +} + int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry); static inline int mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 126b2d0e6aaa..c6bee12f2016 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -5020,6 +5020,46 @@ void mem_cgroup_swapout(struct folio *folio, swp_entry_t entry) css_put(&memcg->css); } +/** + * __mem_cgroup_record_swap - record the folio's cgroup for the swap entries. + * @folio: folio being swapped out. + * @entry: the first swap entry in the range. + * + * In the virtual swap implementation, we only record the folio's cgroup + * for the virtual swap slots on their allocation. We will only charge + * physical swap slots towards the cgroup's swap usage, i.e when physical swap + * slots are allocated for zswap writeback or fallback from zswap store + * failure. + */ +void __mem_cgroup_record_swap(struct folio *folio, swp_entry_t entry) +{ + unsigned int nr_pages = folio_nr_pages(folio); + struct mem_cgroup *memcg; + + memcg = folio_memcg(folio); + + VM_WARN_ON_ONCE_FOLIO(!memcg, folio); + if (!memcg) + return; + + memcg = mem_cgroup_id_get_online(memcg); + if (nr_pages > 1) + mem_cgroup_id_get_many(memcg, nr_pages - 1); + swap_cgroup_record(folio, mem_cgroup_id(memcg), entry); +} + +void __mem_cgroup_unrecord_swap(swp_entry_t entry, unsigned int nr_pages) +{ + unsigned short id = swap_cgroup_clear(entry, nr_pages); + struct mem_cgroup *memcg; + + rcu_read_lock(); + memcg = mem_cgroup_from_id(id); + if (memcg) + mem_cgroup_id_put_many(memcg, nr_pages); + rcu_read_unlock(); +} + /** * __mem_cgroup_try_charge_swap - try charging swap space for a folio * @folio: folio being added to swap @@ -5038,34 +5078,47 @@ int __mem_cgroup_try_charge_swap(struct folio *folio, swp_entry_t entry) if (do_memsw_account()) return 0; - memcg = folio_memcg(folio); + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) { + /* + * In the virtual swap implementation, we already record the cgroup + * on virtual swap allocation. Note that the virtual swap slot holds + * a reference to memcg, so this lookup should be safe. 
+ */ + rcu_read_lock(); + memcg = mem_cgroup_from_id(lookup_swap_cgroup_id(entry)); + rcu_read_unlock(); + } else { + memcg = folio_memcg(folio); - VM_WARN_ON_ONCE_FOLIO(!memcg, folio); - if (!memcg) - return 0; + VM_WARN_ON_ONCE_FOLIO(!memcg, folio); + if (!memcg) + return 0; - if (!entry.val) { - memcg_memory_event(memcg, MEMCG_SWAP_FAIL); - return 0; - } + if (!entry.val) { + memcg_memory_event(memcg, MEMCG_SWAP_FAIL); + return 0; + } - memcg = mem_cgroup_id_get_online(memcg); + memcg = mem_cgroup_id_get_online(memcg); + } if (!mem_cgroup_is_root(memcg) && !page_counter_try_charge(&memcg->swap, nr_pages, &counter)) { memcg_memory_event(memcg, MEMCG_SWAP_MAX); memcg_memory_event(memcg, MEMCG_SWAP_FAIL); - mem_cgroup_id_put(memcg); + if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + mem_cgroup_id_put(memcg); return -ENOMEM; } - /* Get references for the tail pages, too */ - if (nr_pages > 1) - mem_cgroup_id_get_many(memcg, nr_pages - 1); + if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP)) { + /* Get references for the tail pages, too */ + if (nr_pages > 1) + mem_cgroup_id_get_many(memcg, nr_pages - 1); + swap_cgroup_record(folio, mem_cgroup_id(memcg), entry); + } mod_memcg_state(memcg, MEMCG_SWAP, nr_pages); - swap_cgroup_record(folio, mem_cgroup_id(memcg), entry); - return 0; } @@ -5079,7 +5132,11 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) struct mem_cgroup *memcg; unsigned short id; - id = swap_cgroup_clear(entry, nr_pages); + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + id = lookup_swap_cgroup_id(entry); + else + id = swap_cgroup_clear(entry, nr_pages); + rcu_read_lock(); memcg = mem_cgroup_from_id(id); if (memcg) { @@ -5090,7 +5147,8 @@ void __mem_cgroup_uncharge_swap(swp_entry_t entry, unsigned int nr_pages) page_counter_uncharge(&memcg->swap, nr_pages); } mod_memcg_state(memcg, MEMCG_SWAP, -nr_pages); - mem_cgroup_id_put_many(memcg, nr_pages); + if (!IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + mem_cgroup_id_put_many(memcg, nr_pages); } rcu_read_unlock(); } @@ -5099,7 +5157,7 @@ static bool mem_cgroup_may_zswap(struct mem_cgroup *original_memcg); long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg) { - long nr_swap_pages, nr_zswap_pages = 0; + long nr_swap_pages; /* * If swap is virtualized and zswap is enabled, we can still use zswap even @@ -5108,10 +5166,14 @@ long mem_cgroup_get_nr_swap_pages(struct mem_cgroup *memcg) if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && zswap_is_enabled() && (mem_cgroup_disabled() || do_memsw_account() || mem_cgroup_may_zswap(memcg))) { - nr_zswap_pages = PAGE_COUNTER_MAX; + /* + * No need to check swap cgroup limits, since zswap is not charged + * towards swap consumption. 
+ */ + return PAGE_COUNTER_MAX; } - nr_swap_pages = max_t(long, nr_zswap_pages, get_nr_swap_pages()); + nr_swap_pages = get_nr_swap_pages(); if (mem_cgroup_disabled() || do_memsw_account()) return nr_swap_pages; for (; !mem_cgroup_is_root(memcg); memcg = parent_mem_cgroup(memcg)) diff --git a/mm/vswap.c b/mm/vswap.c index 3146c231ca69..fcc7807ba89b 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -349,6 +349,7 @@ static inline void release_backing(swp_entry_t entry, int nr) swap_slot_free_nr(slot, nr); swap_slot_put_swap_info(si); } + mem_cgroup_uncharge_swap(entry, nr); } } @@ -367,7 +368,7 @@ static void vswap_free(swp_entry_t entry) virt_clear_shadow_from_swap_cache(entry); release_backing(entry, 1); - mem_cgroup_uncharge_swap(entry, 1); + mem_cgroup_unrecord_swap(entry, 1); /* erase forward mapping and release the virtual slot for reallocation */ release_vswap_slot(entry.val); kfree_rcu(desc, rcu); @@ -392,27 +393,13 @@ swp_entry_t folio_alloc_swap(struct folio *folio) { swp_entry_t entry; struct swp_desc *desc; - int i, nr = folio_nr_pages(folio); + int nr = folio_nr_pages(folio); entry = vswap_alloc(nr); if (!entry.val) return entry; - /* - * XXX: for now, we charge towards the memory cgroup's swap limit on virtual - * swap slots allocation. This will be changed soon - we will only charge on - * physical swap slots allocation. - */ - if (mem_cgroup_try_charge_swap(folio, entry)) { - for (i = 0; i < nr; i++) { - vswap_free(entry); - entry.val++; - } - atomic_add(nr, &vswap_alloc_reject); - entry.val = 0; - return entry; - } - + mem_cgroup_record_swap(folio, entry); XA_STATE(xas, &vswap_map, entry.val); rcu_read_lock(); @@ -454,6 +441,9 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio) if (!slot.val) return slot; + if (mem_cgroup_try_charge_swap(folio, entry)) + goto free_phys_swap; + /* establish the vrtual <-> physical swap slots linkages. */ for (i = 0; i < nr; i++) { err = xa_insert(&vswap_rmap, slot.val + i, @@ -462,13 +452,7 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio) if (err) { while (--i >= 0) xa_erase(&vswap_rmap, slot.val + i); - /* - * We have not updated the backing type of the virtual swap slot. - * Simply free up the physical swap slots here! - */ - swap_slot_free_nr(slot, nr); - slot.val = 0; - return slot; + goto uncharge; } } @@ -505,6 +489,17 @@ swp_slot_t vswap_alloc_swap_slot(struct folio *folio) } rcu_read_unlock(); return slot; + +uncharge: + mem_cgroup_uncharge_swap(entry, nr); +free_phys_swap: + /* + * We have not updated the backing type of the virtual swap slot. + * Simply free up the physical swap slots here! 
+	 */
+	swap_slot_free_nr(slot, nr);
+	slot.val = 0;
+	return slot;
+}
 
 /**

From patchwork Mon Apr 7 23:42:13 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 878964
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
 yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
 chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
 huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk,
 baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com,
 christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 12/14] vswap: support THP swapin and batch free_swap_and_cache
Date: Mon, 7 Apr 2025 16:42:13 -0700
Message-ID: <20250407234223.1059191-13-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>
MIME-Version: 1.0

This patch implements the functionality required for THP swapin and
batched free_swap_and_cache() in the virtual swap space design. The
central requirement is that the range of entries we are working with must
not have mixed backing states:

1. For now, zswap-backed entries are not supported for these batched
   operations.
2. All the entries must have the same backing type.
3. If the swap entries in the batch are backed by an in-memory folio, it
   must be the same folio (i.e. they correspond to the subpages of that
   folio).
4. If the swap entries in the batch are backed by slots on swapfiles, it
   must be the same swapfile, and these physical swap slots must also be
   contiguous.
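
To make the uniformity requirement above concrete, here is a rough sketch
of how a caller could verify a candidate batch before treating it as one
unit. batch_is_uniform() is a hypothetical helper written only for
illustration; in the patch itself this role is played by
vswap_can_swapin_thp() and by swap_move() inside swap_pte_batch().

	/* Hypothetical helper, for illustration only. */
	static bool batch_is_uniform(swp_entry_t first, int nr)
	{
		long i;

		for (i = 1; i < nr; i++) {
			/*
			 * swap_move() returns 0 unless entry (first + i) has the
			 * same backing type as 'first' (and is not zswap-backed),
			 * refers to the same folio or swapfile, and sits at the
			 * expected offset - i.e. the conditions 1-4 above.
			 */
			if (!swap_move(first, i).val)
				return false;
		}
		return true;
	}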
Signed-off-by: Nhat Pham --- include/linux/swap.h | 6 +++ mm/internal.h | 14 +------ mm/memory.c | 16 ++++++-- mm/vswap.c | 89 ++++++++++++++++++++++++++++++++++++++++++++ 4 files changed, 109 insertions(+), 16 deletions(-) diff --git a/include/linux/swap.h b/include/linux/swap.h index 98cdfe0c1da7..c3a10c952116 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -763,6 +763,7 @@ bool vswap_folio_backed(swp_entry_t entry, int nr); void vswap_store_folio(swp_entry_t entry, struct folio *folio); void swap_zeromap_folio_set(struct folio *folio); void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry); +bool vswap_can_swapin_thp(swp_entry_t entry, int nr); static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) @@ -862,6 +863,11 @@ static inline void vswap_assoc_zswap(swp_entry_t entry, { } +static inline bool vswap_can_swapin_thp(swp_entry_t entry, int nr) +{ + return true; +} + static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) { diff --git a/mm/internal.h b/mm/internal.h index 51061691a731..6694e7a14745 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -268,14 +268,7 @@ static inline swp_entry_t swap_nth(swp_entry_t entry, long n) return (swp_entry_t) { entry.val + n }; } -/* temporary disallow batched swap operations */ -static inline swp_entry_t swap_move(swp_entry_t entry, long delta) -{ - swp_entry_t next_entry; - - next_entry.val = 0; - return next_entry; -} +swp_entry_t swap_move(swp_entry_t entry, long delta); #else static inline swp_entry_t swap_nth(swp_entry_t entry, long n) { @@ -344,8 +337,6 @@ static inline pte_t pte_next_swp_offset(pte_t pte) * max_nr must be at least one and must be limited by the caller so scanning * cannot exceed a single page table. * - * Note that for virtual swap space, we will not batch anything for now. - * * Return: the number of table entries in the batch. */ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) @@ -360,9 +351,6 @@ static inline int swap_pte_batch(pte_t *start_ptep, int max_nr, pte_t pte) VM_WARN_ON(!is_swap_pte(pte)); VM_WARN_ON(non_swap_entry(entry)); - if (IS_ENABLED(CONFIG_VIRTUAL_SWAP)) - return 1; - cgroup_id = lookup_swap_cgroup_id(entry); while (ptep < end_ptep) { pte = ptep_get(ptep); diff --git a/mm/memory.c b/mm/memory.c index c5c34efafa81..5abb464913ef 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -4226,10 +4226,8 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf) * A large swapped out folio could be partially or fully in zswap. We * lack handling for such cases, so fallback to swapping in order-0 * folio. - * - * We also disable THP swapin on the virtual swap implementation, for now. */ - if (!zswap_never_enabled() || IS_ENABLED(CONFIG_VIRTUAL_SWAP)) + if (!zswap_never_enabled()) goto fallback; entry = pte_to_swp_entry(vmf->orig_pte); @@ -4419,6 +4417,18 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) } need_clear_cache = true; + /* + * Recheck to make sure the entire range is still + * THP-swapin-able. Note that before we call + * swapcache_prepare(), entries in the range can + * still have their backing status changed. 
+ */ + if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && + !vswap_can_swapin_thp(entry, nr_pages)) { + schedule_timeout_uninterruptible(1); + goto out_page; + } + mem_cgroup_swapin_uncharge_swap(entry, nr_pages); shadow = get_shadow_from_swap_cache(entry); diff --git a/mm/vswap.c b/mm/vswap.c index fcc7807ba89b..c09a7efc2aeb 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -9,6 +9,7 @@ #include #include #include +#include "internal.h" #include "swap.h" /* @@ -1104,6 +1105,94 @@ bool vswap_folio_backed(swp_entry_t entry, int nr) && type == VSWAP_FOLIO; } +/** + * vswap_can_swapin_thp - check if the swap entries can be swapped in as a THP. + * @entry: the first virtual swap slot in the range. + * @nr: the number of slots in the range. + * + * For now, we can only swap in a THP if the entire range is zero-filled, or if + * the entire range is backed by a contiguous range of physical swap slots on a + * swapfile. + */ +bool vswap_can_swapin_thp(swp_entry_t entry, int nr) +{ + enum swap_type type; + + return vswap_check_backing(entry, &type, nr) == nr && + (type == VSWAP_ZERO || type == VSWAP_SWAPFILE); +} + +/** + * swap_move - increment the swap slot by delta, checking the backing state and + * return 0 if the backing state does not match (i.e wrong backing + * state type, or wrong offset on the backing stores). + * @entry: the original virtual swap slot. + * @delta: the offset to increment the original slot. + * + * Note that this function is racy unless we can pin the backing state of these + * swap slots down with swapcache_prepare(). + * + * Caller should only rely on this function as a best-effort hint otherwise, + * and should double-check after ensuring the whole range is pinned down. + * + * Return: the incremented virtual swap slot if the backing state matches, or + * 0 if the backing state does not match. + */ +swp_entry_t swap_move(swp_entry_t entry, long delta) +{ + struct swp_desc *desc, *next_desc; + swp_entry_t next_entry; + bool invalid = true; + struct folio *folio; + enum swap_type type; + swp_slot_t slot; + + next_entry.val = entry.val + delta; + + rcu_read_lock(); + desc = xa_load(&vswap_map, entry.val); + next_desc = xa_load(&vswap_map, next_entry.val); + + if (!desc || !next_desc) { + rcu_read_unlock(); + return (swp_entry_t){0}; + } + + read_lock(&desc->lock); + if (desc->type == VSWAP_ZSWAP) { + read_unlock(&desc->lock); + goto rcu_unlock; + } + + type = desc->type; + if (type == VSWAP_FOLIO) + folio = desc->folio; + + if (type == VSWAP_SWAPFILE) + slot = desc->slot; + read_unlock(&desc->lock); + + read_lock(&next_desc->lock); + if (next_desc->type != type) + goto next_unlock; + + if (type == VSWAP_SWAPFILE && + (swp_slot_type(next_desc->slot) != swp_slot_type(slot) || + swp_slot_offset(next_desc->slot) != + swp_slot_offset(slot) + delta)) + goto next_unlock; + + if (type == VSWAP_FOLIO && next_desc->folio != folio) + goto next_unlock; + + invalid = false; +next_unlock: + read_unlock(&next_desc->lock); +rcu_unlock: + rcu_read_unlock(); + return invalid ? (swp_entry_t){0} : next_entry; +} + /* * Return the count of contiguous swap entries that share the same * VSWAP_ZERO status as the starting entry. 
If is_zeromap is not NULL,

From patchwork Mon Apr 7 23:42:14 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 878963
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
 yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
 chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
 huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk,
 baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com,
 christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 13/14] swap: simplify swapoff using virtual swap
Date: Mon, 7 Apr 2025 16:42:14 -0700
Message-ID: <20250407234223.1059191-14-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>
MIME-Version: 1.0

This patch presents the second application of the virtual swap design -
simplifying and optimizing swapoff. With virtual swap slots stored in page
table entries and used as indices into the various swap-related data
structures, we no longer have to perform a page table walk in swapoff. We
simply iterate through all the allocated swap slots on the swapfile,
invoke the backward map, and fault them in. This is significantly cleaner,
as well as slightly more performant, especially when there are a lot of
unrelated VMAs (since the old swapoff code would have to traverse all of
them).

In a simple benchmark, in which we swap off a 32 GB swapfile that is 50%
full, while another process maps a 128 GB file into memory:

Baseline:
real: 25.54s
user: 0.00s
sys: 11.48s

New Design:
real: 11.69s
user: 0.00s
sys: 9.96s

Disregarding the real time reduction (which is mostly due to more IO
asynchrony), the new design reduces the kernel CPU time by about 13%.
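
The core of the new swapoff path can be summarized by the condensed sketch
below. It is illustrative only: the real try_to_unuse() in this patch runs
two passes (one that submits the reads under a block plug, one that fixes
up the backing state), and handles retries, writeback waits and pending
signals, all of which are omitted here. The local variables (si, type,
offset, splug, mpol) are assumed to be set up as in the patch.

	for_each_allocated_offset(si, offset) {
		swp_slot_t slot = swp_slot(type, offset);
		swp_entry_t entry = swp_slot_to_swp_entry(slot);	/* backward map */
		struct folio *folio;

		if (!entry.val)
			continue;

		/* read the entry into the swap cache */
		folio = pagein(entry, &splug, mpol);
		if (!folio)
			return -ENOMEM;

		folio_lock(folio);
		/* re-point the virtual slot at the folio, free the physical slot */
		vswap_swapoff(entry, folio, slot);
		folio_unlock(folio);
		folio_put(folio);
	}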
Signed-off-by: Nhat Pham --- include/linux/shmem_fs.h | 3 + include/linux/swap.h | 1 + mm/shmem.c | 2 + mm/swapfile.c | 189 ++++++++++++++++++++++++++++++++------- mm/vswap.c | 61 +++++++++++++ 5 files changed, 225 insertions(+), 31 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index 0b273a7b9f01..668b6add3b8f 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -108,7 +108,10 @@ extern void shmem_unlock_mapping(struct address_space *mapping); extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping, pgoff_t index, gfp_t gfp_mask); extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end); + +#ifndef CONFIG_VIRTUAL_SWAP int shmem_unuse(unsigned int type); +#endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE unsigned long shmem_allowable_huge_orders(struct inode *inode, diff --git a/include/linux/swap.h b/include/linux/swap.h index c3a10c952116..177f6640a026 100644 --- a/include/linux/swap.h +++ b/include/linux/swap.h @@ -764,6 +764,7 @@ void vswap_store_folio(swp_entry_t entry, struct folio *folio); void swap_zeromap_folio_set(struct folio *folio); void vswap_assoc_zswap(swp_entry_t entry, struct zswap_entry *zswap_entry); bool vswap_can_swapin_thp(swp_entry_t entry, int nr); +void vswap_swapoff(swp_entry_t entry, struct folio *folio, swp_slot_t slot); static inline bool trylock_swapoff(swp_entry_t entry, struct swap_info_struct **si) diff --git a/mm/shmem.c b/mm/shmem.c index 609971a2b365..fa792769e422 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1380,6 +1380,7 @@ static void shmem_evict_inode(struct inode *inode) #endif } +#ifndef CONFIG_VIRTUAL_SWAP static int shmem_find_swap_entries(struct address_space *mapping, pgoff_t start, struct folio_batch *fbatch, pgoff_t *indices, unsigned int type) @@ -1525,6 +1526,7 @@ int shmem_unuse(unsigned int type) return error; } +#endif /* CONFIG_VIRTUAL_SWAP */ /* * Move the page from the page cache to the swap cache. diff --git a/mm/swapfile.c b/mm/swapfile.c index 59b34d51b16b..d1251a9264fa 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2053,6 +2053,163 @@ unsigned int count_swap_pages(int type, int free) } #endif /* CONFIG_HIBERNATION */ +/* + * Scan swap_map from current position to next entry still in use. + * Return 0 if there are no inuse entries after prev till end of + * the map. + */ +static unsigned int find_next_to_unuse(struct swap_info_struct *si, + unsigned int prev) +{ + unsigned int i; + unsigned char count; + + /* + * No need for swap_lock here: we're just looking + * for whether an entry is in use, not modifying it; false + * hits are okay, and sys_swapoff() has already prevented new + * allocations from this area (while holding swap_lock). 
+ */ + for (i = prev + 1; i < si->max; i++) { + count = READ_ONCE(si->swap_map[i]); + if (count && swap_count(count) != SWAP_MAP_BAD) + break; + if ((i % LATENCY_LIMIT) == 0) + cond_resched(); + } + + if (i == si->max) + i = 0; + + return i; +} + +#ifdef CONFIG_VIRTUAL_SWAP +#define for_each_allocated_offset(si, offset) \ + while (swap_usage_in_pages(si) && \ + !signal_pending(current) && \ + (offset = find_next_to_unuse(si, offset)) != 0) + +static struct folio *pagein(swp_entry_t entry, struct swap_iocb **splug, + struct mempolicy *mpol) +{ + bool folio_was_allocated; + struct folio *folio = __read_swap_cache_async(entry, GFP_KERNEL, mpol, + NO_INTERLEAVE_INDEX, &folio_was_allocated, false); + + if (folio_was_allocated) + swap_read_folio(folio, splug); + return folio; +} + +static int try_to_unuse(unsigned int type) +{ + struct swap_info_struct *si = swap_info[type]; + struct swap_iocb *splug = NULL; + struct mempolicy *mpol; + struct blk_plug plug; + unsigned long offset; + struct folio *folio; + swp_entry_t entry; + swp_slot_t slot; + int ret = 0; + + if (!atomic_long_read(&si->inuse_pages)) + goto success; + + mpol = get_task_policy(current); + blk_start_plug(&plug); + + /* first round - submit the reads */ + offset = 0; + for_each_allocated_offset(si, offset) { + slot = swp_slot(type, offset); + entry = swp_slot_to_swp_entry(slot); + if (!entry.val) + continue; + + folio = pagein(entry, &splug, mpol); + if (folio) + folio_put(folio); + } + blk_finish_plug(&plug); + swap_read_unplug(splug); + lru_add_drain(); + + /* second round - updating the virtual swap slots' backing state */ + offset = 0; + for_each_allocated_offset(si, offset) { + slot = swp_slot(type, offset); +retry: + entry = swp_slot_to_swp_entry(slot); + if (!entry.val) + continue; + + /* try to allocate swap cache folio */ + folio = pagein(entry, &splug, mpol); + if (!folio) { + if (!swp_slot_to_swp_entry(swp_slot(type, offset)).val) + continue; + + ret = -ENOMEM; + pr_err("swapoff: unable to allocate swap cache folio for %lu\n", + entry.val); + goto finish; + } + + folio_lock(folio); + /* + * We need to check if the folio is still in swap cache. We can, for + * instance, race with zswap writeback, obtaining the temporary folio + * it allocated for decompression and writeback, which would be + * promply deleted from swap cache. By the time we lock that folio, + * it might have already contained stale data. + * + * Concurrent swap operations might have also come in before we + * reobtain the lock, deleting the folio from swap cache, invalidating + * the virtual swap slot, then swapping out the folio again. + * + * In all of these cases, we must retry the physical -> virtual lookup. + * + * Note that if everything is still valid, then virtual swap slot must + * corresponds to the head page (since all previous swap slots are + * freed). + */ + if (!folio_test_swapcache(folio) || folio->swap.val != entry.val) { + folio_unlock(folio); + folio_put(folio); + if (signal_pending(current)) + break; + schedule_timeout_uninterruptible(1); + goto retry; + } + + folio_wait_writeback(folio); + vswap_swapoff(entry, folio, slot); + folio_unlock(folio); + folio_put(folio); + } + +finish: + if (ret == -ENOMEM) + return ret; + + /* concurrent swappers might still be releasing physical swap slots... 
*/ + while (swap_usage_in_pages(si)) { + if (signal_pending(current)) + return -EINTR; + schedule_timeout_uninterruptible(1); + } + +success: + /* + * Make sure that further cleanups after try_to_unuse() returns happen + * after swap_range_free() reduces si->inuse_pages to 0. + */ + smp_mb(); + return 0; +} +#else static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte) { return pte_same(pte_swp_clear_flags(pte), swp_pte); @@ -2340,37 +2497,6 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type) return ret; } -/* - * Scan swap_map from current position to next entry still in use. - * Return 0 if there are no inuse entries after prev till end of - * the map. - */ -static unsigned int find_next_to_unuse(struct swap_info_struct *si, - unsigned int prev) -{ - unsigned int i; - unsigned char count; - - /* - * No need for swap_lock here: we're just looking - * for whether an entry is in use, not modifying it; false - * hits are okay, and sys_swapoff() has already prevented new - * allocations from this area (while holding swap_lock). - */ - for (i = prev + 1; i < si->max; i++) { - count = READ_ONCE(si->swap_map[i]); - if (count && swap_count(count) != SWAP_MAP_BAD) - break; - if ((i % LATENCY_LIMIT) == 0) - cond_resched(); - } - - if (i == si->max) - i = 0; - - return i; -} - static int try_to_unuse(unsigned int type) { struct mm_struct *prev_mm; @@ -2474,6 +2600,7 @@ static int try_to_unuse(unsigned int type) smp_mb(); return 0; } +#endif /* CONFIG_VIRTUAL_SWAP */ /* * After a successful try_to_unuse, if no swap is now in use, we know diff --git a/mm/vswap.c b/mm/vswap.c index c09a7efc2aeb..c0da71d5d592 100644 --- a/mm/vswap.c +++ b/mm/vswap.c @@ -1289,6 +1289,67 @@ void put_swap_folio(struct folio *folio, swp_entry_t entry) swapcache_clear(NULL, entry, nr); } +/** + * vswap_swapoff - unlink a range of virtual swap slots from their backing + * physical swap slots on a swapfile that is being swapped off, + * and associate them with the swapped in folio. + * @entry: the first virtual swap slot in the range. + * @folio: the folio swapped in and loaded into swap cache. + * @slot: the first physical swap slot in the range. + */ +void vswap_swapoff(swp_entry_t entry, struct folio *folio, swp_slot_t slot) +{ + int i = 0, nr = folio_nr_pages(folio); + struct swp_desc *desc; + unsigned int type = swp_slot_type(slot); + unsigned int offset = swp_slot_offset(slot); + + XA_STATE(xas, &vswap_map, entry.val); + + rcu_read_lock(); + xas_for_each(&xas, desc, entry.val + nr - 1) { + if (xas_retry(&xas, desc)) + continue; + + write_lock(&desc->lock); + /* + * There might be concurrent swap operations that might invalidate the + * originally obtained virtual swap slot, allowing it to be + * re-allocated, or change its backing state. + * + * We must re-check here to make sure we are not performing bogus backing + * store changes. + */ + if (desc->type != VSWAP_SWAPFILE || + swp_slot_type(desc->slot) != type) { + /* there should not be mixed backing states among the subpages */ + VM_WARN_ON(i); + write_unlock(&desc->lock); + break; + } + + VM_WARN_ON(swp_slot_offset(desc->slot) != offset + i); + + xa_erase(&vswap_rmap, desc->slot.val); + desc->type = VSWAP_FOLIO; + desc->folio = folio; + write_unlock(&desc->lock); + i++; + } + rcu_read_unlock(); + + if (i) { + /* + * If we update the virtual swap slots' backing, mark the folio as + * dirty so that reclaimers will try to page it out again. 
+		 */
+		folio_mark_dirty(folio);
+		swap_slot_free_nr(slot, nr);
+		/* folio is in swap cache, so entries are guaranteed to be valid */
+		mem_cgroup_uncharge_swap(entry, nr);
+	}
+}
+
 #ifdef CONFIG_MEMCG
 static unsigned short vswap_cgroup_record(swp_entry_t entry,
 		unsigned short memcgid, unsigned int nr_ents)

From patchwork Mon Apr 7 23:42:15 2025
X-Patchwork-Submitter: Nhat Pham
X-Patchwork-Id: 879271
From: Nhat Pham
To: linux-mm@kvack.org
Cc: akpm@linux-foundation.org, hannes@cmpxchg.org, hughd@google.com,
 yosry.ahmed@linux.dev, mhocko@kernel.org, roman.gushchin@linux.dev,
 shakeel.butt@linux.dev, muchun.song@linux.dev, len.brown@intel.com,
 chengming.zhou@linux.dev, kasong@tencent.com, chrisl@kernel.org,
 huang.ying.caritas@gmail.com, ryan.roberts@arm.com, viro@zeniv.linux.org.uk,
 baohua@kernel.org, osalvador@suse.de, lorenzo.stoakes@oracle.com,
 christophe.leroy@csgroup.eu, pavel@kernel.org, kernel-team@meta.com,
 linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-pm@vger.kernel.org
Subject: [RFC PATCH 14/14] zswap: do not start zswap shrinker if there are no
 physical swap slots
Date: Mon, 7 Apr 2025 16:42:15 -0700
Message-ID: <20250407234223.1059191-15-nphamcs@gmail.com>
In-Reply-To: <20250407234223.1059191-1-nphamcs@gmail.com>
References: <20250407234223.1059191-1-nphamcs@gmail.com>
MIME-Version: 1.0

When swap is virtualized, we no longer pre-allocate a slot on the swapfile
for each zswap entry. Do not start the zswap shrinker if there are no
physical swap slots available.

Signed-off-by: Nhat Pham
---
 mm/zswap.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/zswap.c b/mm/zswap.c
index 15429825d667..f2f412cc1911 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1277,6 +1277,14 @@ static unsigned long zswap_shrinker_count(struct shrinker *shrinker,
 	if (!zswap_shrinker_enabled || !mem_cgroup_zswap_writeback_enabled(memcg))
 		return 0;
 
+	/*
+	 * When swap is virtualized, we do not have any swap slots on the
+	 * swapfile preallocated for zswap objects. If there is no slot
+	 * available, we cannot write back and should just bail out here.
+	 */
+	if (IS_ENABLED(CONFIG_VIRTUAL_SWAP) && !get_nr_swap_pages())
+		return 0;
+
 	/*
 	 * The shrinker resumes swap writeback, which will enter block
 	 * and may enter fs. XXX: Harmonize with vmscan.c __GFP_FS