[v7,00/15] zswap IAA compress batching

Message ID 20250228100024.332528-1-kanchana.p.sridhar@intel.com

Message

Sridhar, Kanchana P Feb. 28, 2025, 10 a.m. UTC
IAA Compression Batching with crypto_acomp Request Chaining:
============================================================

This patch-series introduces the use of the Intel Analytics Accelerator
(IAA) for parallel batch compression of pages in large folios to improve
zswap swapout latency. It does this by first creating a generic batching
framework in crypto_acomp using request chaining, then invoking the request
chaining API from the iaa_crypto driver to compress/decompress a batch.
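
To make the batching idea concrete, here is a small self-contained
user-space analogue of the two chain-processing strategies discussed later
in this thread (serial, and submit-then-poll). Every name and type below is
an illustrative stand-in, not the kernel crypto_acomp API:

    /*
     * User-space analogue of the two chain-processing strategies:
     * serial (one request at a time) and submit-then-poll (all
     * requests in flight at once). Not the kernel crypto_acomp API.
     */
    #include <stdio.h>

    struct req {
            int id;
            int pending;            /* nonzero while the "device" is busy */
            struct req *next;       /* next request in the chain */
    };

    /* Serial: run each request to completion before starting the next. */
    static int do_req_chain(struct req *head, int (*op)(struct req *))
    {
            for (struct req *r = head; r; r = r->next) {
                    int err = op(r);

                    if (err)
                            return err;
            }
            return 0;
    }

    /* Parallel: submit every request first, then poll for completions. */
    static int do_async_req_chain(struct req *head,
                                  int (*op_submit)(struct req *),
                                  int (*op_poll)(struct req *))
    {
            for (struct req *r = head; r; r = r->next) {
                    int err = op_submit(r);

                    if (err)
                            return err;
            }
            for (struct req *r = head; r; r = r->next) {
                    int err;

                    /* Busy-poll until this request completes (>0 == busy). */
                    while ((err = op_poll(r)) > 0)
                            ;
                    if (err)
                            return err;
            }
            return 0;
    }

    /* Toy "device": a request completes one poll after being submitted. */
    static int submit(struct req *r)
    {
            r->pending = 1;
            return 0;
    }

    static int poll_done(struct req *r)
    {
            if (r->pending) {
                    r->pending = 0;
                    return 1;       /* still in flight */
            }
            printf("req %d done\n", r->id);
            return 0;
    }

    static int process_sync(struct req *r)
    {
            submit(r);
            while (poll_done(r) > 0)
                    ;
            return 0;
    }

    int main(void)
    {
            struct req r2 = { .id = 2 };
            struct req r1 = { .id = 1, .next = &r2 };

            do_req_chain(&r1, process_sync);            /* serial */
            do_async_req_chain(&r1, submit, poll_done); /* overlapped */
            return 0;
    }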

Comments

Yosry Ahmed March 1, 2025, 1:12 a.m. UTC | #1
On Sat, Mar 01, 2025 at 01:09:22AM +0000, Sridhar, Kanchana P wrote:
> Hi All,
> 
> > Performance testing (Kernel compilation, allmodconfig):
> > =======================================================
> > 
> > The kernel compilation experiments (32 threads, in tmpfs) use the
> > "allmodconfig" build, which takes ~12 minutes and has considerable
> > swapout/swapin activity. The cgroup's memory.max is set to 2G.
> > 
> > 
> >  64K folios: Kernel compilation/allmodconfig:
> >  ============================================
> > 
> >  -------------------------------------------------------------------------------
> >                      mm-unstable               v7   mm-unstable            v7
> >  -------------------------------------------------------------------------------
> >  zswap compressor    deflate-iaa      deflate-iaa          zstd          zstd
> >  -------------------------------------------------------------------------------
> >  real_sec                 775.83           765.90        769.39        772.63
> >  user_sec              15,659.10        15,659.14     15,666.28     15,665.98
> >  sys_sec                4,209.69         4,040.44      5,277.86      5,358.61
> >  -------------------------------------------------------------------------------
> >  Max_Res_Set_Size_KB   1,871,116        1,874,128     1,873,200     1,873,488
> >  -------------------------------------------------------------------------------
> >  memcg_high                    0                0             0             0
> >  memcg_swap_fail               0                0             0             0
> >  zswpout             107,305,181      106,985,511    86,621,912    89,355,274
> >  zswpin               32,418,991       32,184,517    25,337,514    26,522,042
> >  pswpout                     272               80            94            16
> >  pswpin                      274               69            54            16
> >  thp_swpout                    0                0             0             0
> >  thp_swpout_fallback           0                0             0             0
> >  64kB_swpout_fallback        494                0             0             0
> >  pgmajfault           34,577,545       34,333,290    26,892,991    28,132,682
> >  ZSWPOUT-64kB          3,498,796        3,460,751     2,737,544     2,823,211
> >  SWPOUT-64kB                  17                4             4             1
> >  -------------------------------------------------------------------------------
> > 
> > [...]
> >
> > Summary:
> > ========
> > The performance data from the usemem (30 processes) and kernel
> > compilation tests show 61%-73% throughput gains and 27%-37% sys time
> > reduction (usemem30), and a 4% sys time reduction (kernel compilation),
> > with zswap_store() of large folios using IAA compress batching as
> > compared to sequential IAA compression. There is no performance
> > regression for zstd/usemem30, and a slight 1.5% sys time regression for
> > zstd with the kernel compilation allmodconfig test.
> 
> I think I know why kernel_compilation with zstd shows a regression whereas
> usemem30 does not. It is because I lock/unlock the acomp_ctx mutex once
> per folio. This can cause decomp jobs to wait on the mutex, which can
> trigger more compressions, and this repeats. kernel_compilation has 25M+
> decomps with zstd, whereas usemem30 has practically no decomps, but is
> compression-intensive, because of which it benefits from the
> once-per-folio lock acquire/release.
> 
> I am testing a fix where I revert to zswap_compress() doing the mutex
> lock/unlock, and expect to post v8 by end of the day. I would appreciate
> it if you can hold off on reviewing just the zswap patches [14, 15] in my
> v7 and instead review the v8 versions of these two patches.
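
A minimal sketch of the two lock scopes being compared above, with
hypothetical, heavily simplified names (not the actual zswap code):

    /* v7: mutex held across the whole folio. Decompression jobs that
     * need the same per-CPU acomp_ctx wait until all nr_pages are
     * compressed. */
    static void store_folio_per_folio_lock(struct folio *folio,
                                           struct crypto_acomp_ctx *ctx,
                                           long nr_pages)
    {
            long i;

            mutex_lock(&ctx->mutex);
            for (i = 0; i < nr_pages; i++)
                    compress_one_page(folio, i, ctx); /* hypothetical helper */
            mutex_unlock(&ctx->mutex);
    }

    /* v8 fix: lock/unlock once per page (inside zswap_compress() in
     * the real code), so decompression jobs can interleave. */
    static void store_folio_per_page_lock(struct folio *folio,
                                          struct crypto_acomp_ctx *ctx,
                                          long nr_pages)
    {
            long i;

            for (i = 0; i < nr_pages; i++) {
                    mutex_lock(&ctx->mutex);
                    compress_one_page(folio, i, ctx);
                    mutex_unlock(&ctx->mutex);
            }
    }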

I was planning to take a look at v7 next week, so take your time, no
rush to post it on a Friday afternoon.

Anyway, thanks for the heads up, I appreciate you trying to save
everyone's time.
Sridhar, Kanchana P March 3, 2025, 8:55 a.m. UTC | #2
Hi Yosry, All,

I posted v8 of this patch-series with a fix for the zstd kernel compilation
regression, and a consolidation of common code between non-batching and
batching compressors, following up on suggestions from Yosry, Chengming,
Nhat and Johannes.

You can disregard v7 and review v8 instead.

Thanks,
Kanchana
Herbert Xu March 4, 2025, 5:19 a.m. UTC | #3
On Fri, Feb 28, 2025 at 02:00:10AM -0800, Kanchana P Sridhar wrote:
>
>  Step 2: Process the request chain using the specified compress/decompress
>          "op":
> 
>   2.a) Synchronous: the chain of requests is processed in series:
> 
>        int acomp_do_req_chain(struct acomp_req *req,
>                               int (*op)(struct acomp_req *req));
> 
>   2.b) Asynchronous: the chain of requests is processed in parallel using a
>        submit-poll paradigm:
> 
>        int acomp_do_async_req_chain(struct acomp_req *req,
>                                     int (*op_submit)(struct acomp_req *req),
>                                     int (*op_poll)(struct acomp_req *req));
> 
> Request chaining will be used in subsequent patches to implement
> compress/decompress batching in the iaa_crypto driver for the two supported
> IAA driver sync_modes:
> 
>   sync_mode = 'sync' will use (2.a),
>   sync_mode = 'async' will use (2.b).

There shouldn't be any sync/async toggle.  The whole zswap code is
synchronous only and it makes zero sense to expose this to the user.
Just do whatever is the fastest from the driver's point of view.

I've actually implemented acomp chaining in my tree and I will be
reposting soon.

Cheers,
Sridhar, Kanchana P March 4, 2025, 9:14 p.m. UTC | #4
> On Monday, March 3, 2025 9:20 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> 
> There shouldn't be any sync/async toggle.  The whole zswap code is
> synchronous only and it makes zero sense to expose this to the user.
> Just do whatever is the fastest from the driver's point of view.

These sync/async modes are mentioned here to describe how request chaining
will be supported from the iaa_crypto driver's perspective. As you are
aware, the iaa_crypto driver supports a fully synchronous mode
(sync_mode = 'sync') and a fully asynchronous, non-irq mode
(sync_mode = 'async').

As mentioned in the cover letter, from zswap's perspective, the call into
crypto is exactly the same whether batching (with request chaining) or
sequential compression is used:

crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
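
For reference, here is a sketch of how the driver side might dispatch a
chain using the two helpers from patch 1. The acomp_do_req_chain() and
acomp_do_async_req_chain() signatures are the ones quoted above; the
iaa_comp_* callbacks are hypothetical stand-ins, not the actual iaa_crypto
entry points:

    static int iaa_comp_op(struct acomp_req *req);     /* run to completion */
    static int iaa_comp_submit(struct acomp_req *req); /* enqueue to hardware */
    static int iaa_comp_poll(struct acomp_req *req);   /* check completion */

    static int iaa_compress_chain(struct acomp_req *head, bool async)
    {
            if (!async)
                    /* sync_mode = 'sync': requests processed in series. */
                    return acomp_do_req_chain(head, iaa_comp_op);

            /*
             * sync_mode = 'async': submit every request in the chain,
             * then poll, so the compressions overlap on the accelerator.
             */
            return acomp_do_async_req_chain(head, iaa_comp_submit,
                                            iaa_comp_poll);
    }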


> 
> I've actually implemented acomp chaining in my tree and I will be
> reposting soon.

Thanks for the update. I took a look at your v2 [1], and will provide some
code review comments in [1]. My main open question is: how is this supposed
to work with iaa_crypto's submit-poll paradigm, which is crucial for
deriving the benefits of hardware parallelism? IIUC, your v2 acomp request
chaining only works in fully synchronous mode; correct me if I am wrong.

In fact, at the start of acomp_do_req_chain(), if the algorithm has opted
in to request chaining, it returns after processing the first request,
without processing the chained requests. These are some of the issues I had
to resolve when using the ahash reference implementation you provided to
develop my versions of acomp_do_req_chain() and acomp_do_async_req_chain()
in patch 1 of my series.

I see that in your v2, you have introduced support for virtual addresses.
One suggestion: could you please incorporate my implementation of acomp
request chaining (for both of the above APIs) in a v3 of your patch-series,
so that it enables both acomp_do_req_chain() and acomp_do_async_req_chain(),
fixes some of the issues I pointed out above, and adds the virtual address
support? Please let me know if this would be a good way for us to proceed
in getting zswap to realize the benefits of IAA batch compressions.

[1] https://patchwork.kernel.org/project/linux-mm/patch/a11883ded326c4f4f80dcf0307ad05fd8e31abc7.1741080140.git.herbert@gondor.apana.org.au/

Thanks,
Kanchana
