Message ID: 20250228100024.332528-1-kanchana.p.sridhar@intel.com
Series: zswap IAA compress batching
On Sat, Mar 01, 2025 at 01:09:22AM +0000, Sridhar, Kanchana P wrote:
> Hi All,
>
> > Performance testing (Kernel compilation, allmodconfig):
> > =======================================================
> >
> > The experiments with the kernel compilation test, 32 threads, in tmpfs
> > use the "allmodconfig" that takes ~12 minutes and has considerable
> > swapout/swapin activity. The cgroup's memory.max is set to 2G.
> >
> >
> > 64K folios: Kernel compilation/allmodconfig:
> > ============================================
> >
> > -------------------------------------------------------------------------------
> >                        mm-unstable           v7  mm-unstable          v7
> > -------------------------------------------------------------------------------
> > zswap compressor       deflate-iaa  deflate-iaa         zstd        zstd
> > -------------------------------------------------------------------------------
> > real_sec                    775.83       765.90       769.39      772.63
> > user_sec                 15,659.10    15,659.14    15,666.28   15,665.98
> > sys_sec                   4,209.69     4,040.44     5,277.86    5,358.61
> > -------------------------------------------------------------------------------
> > Max_Res_Set_Size_KB      1,871,116    1,874,128    1,873,200   1,873,488
> > -------------------------------------------------------------------------------
> > memcg_high                       0            0            0           0
> > memcg_swap_fail                  0            0            0           0
> > zswpout                107,305,181  106,985,511   86,621,912  89,355,274
> > zswpin                  32,418,991   32,184,517   25,337,514  26,522,042
> > pswpout                        272           80           94          16
> > pswpin                         274           69           54          16
> > thp_swpout                       0            0            0           0
> > thp_swpout_fallback              0            0            0           0
> > 64kB_swpout_fallback           494            0            0           0
> > pgmajfault              34,577,545   34,333,290   26,892,991  28,132,682
> > ZSWPOUT-64kB             3,498,796    3,460,751    2,737,544   2,823,211
> > SWPOUT-64kB                     17            4            4           1
> > -------------------------------------------------------------------------------
> >
> > [...]
> >
> > Summary:
> > ========
> > The performance testing data with usemem 30 processes and the kernel
> > compilation test show 61%-73% throughput gains and 27%-37% sys time
> > reduction (usemem30) and 4% sys time reduction (kernel compilation) with
> > zswap_store() of large folios using IAA compress batching as compared to
> > IAA sequential. There is no performance regression for zstd/usemem30,
> > and a slight 1.5% sys time zstd regression with the kernel compilation
> > allmodconfig.
>
> I think I know why kernel_compilation with zstd shows a regression whereas
> usemem30 does not. It is because I lock/unlock the acomp_ctx mutex once
> per folio. This can cause decomp jobs to wait for the mutex, which can
> cause more compressions, and this repeats. kernel_compilation has 25M+
> decomps with zstd, whereas usemem30 has practically no decomps, but is
> compression-intensive, because of which it benefits from the once-per-folio
> lock acquire/release.
>
> I am testing a fix where I return to zswap_compress() doing the mutex
> lock/unlock, and expect to post v8 by end of the day. I would appreciate it
> if you can hold off on reviewing only the zswap patches [14, 15] in my v7
> and instead review the v8 versions of these two patches.

I was planning to take a look at v7 next week, so take your time, no
rush to post it on a Friday afternoon.

Anyway, thanks for the heads up, I appreciate you trying to save
everyone's time.
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Friday, February 28, 2025 5:18 PM
> To: Yosry Ahmed <yosry.ahmed@linux.dev>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; nphamcs@gmail.com; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> ying.huang@linux.alibaba.com; akpm@linux-foundation.org;
> linux-crypto@vger.kernel.org; herbert@gondor.apana.org.au;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v7 00/15] zswap IAA compress batching
>
> [...]
>
> Thanks Yosry!

Hi Yosry, All,

I posted a v8 of this patch-series with a fix for the zstd kernel
compilation regression, and consolidation of common code between
non-batching and batching compressors, to follow up on suggestions from
Yosry, Chengming, Nhat and Johannes. You can disregard v7 and review v8
instead.

Thanks,
Kanchana
On Fri, Feb 28, 2025 at 02:00:10AM -0800, Kanchana P Sridhar wrote:
>
> Step 2: Process the request chain using the specified compress/decompress
> "op":
>
> 2.a) Synchronous: the chain of requests is processed in series:
>
>      int acomp_do_req_chain(struct acomp_req *req,
>                             int (*op)(struct acomp_req *req));
>
> 2.b) Asynchronous: the chain of requests is processed in parallel using a
>      submit-poll paradigm:
>
>      int acomp_do_async_req_chain(struct acomp_req *req,
>                                   int (*op_submit)(struct acomp_req *req),
>                                   int (*op_poll)(struct acomp_req *req));
>
> Request chaining will be used in subsequent patches to implement
> compress/decompress batching in the iaa_crypto driver for the two
> supported IAA driver sync_modes:
>
>      sync_mode = 'sync' will use (2.a),
>      sync_mode = 'async' will use (2.b).

There shouldn't be any sync/async toggle. The whole zswap code is
synchronous only and it makes zero sense to expose this to the user.
Just do whatever is the fastest from the driver's point of view.

I've actually implemented acomp chaining in my tree and I will be
reposting soon.

Cheers,
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Monday, March 3, 2025 9:20 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Subject: Re: [PATCH v7 01/15] crypto: acomp - Add
> synchronous/asynchronous acomp request chaining.
>
> [...]
>
> There shouldn't be any sync/async toggle. The whole zswap code is
> synchronous only and it makes zero sense to expose this to the user.
> Just do whatever is the fastest from the driver's point of view.
These sync/async modes are mentioned here to distinguish how request
chaining will be supported from the iaa_crypto driver's perspective. As you
are aware, the iaa_crypto driver supports a fully synchronous mode
(sync_mode = 'sync') and a fully asynchronous, non-irq mode
(sync_mode = 'async').

As mentioned in the cover letter, from zswap's perspective the calls to
crypto are exactly the same whether batching (with request chaining) or
sequential compressions are used:

    crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]),
                    &acomp_ctx->wait);

> I've actually implemented acomp chaining in my tree and I will be
> reposting soon.

Thanks for the update. I took a look at your v2 [1], and will provide some
code review comments in [1]. My main open question is: how is this supposed
to work with iaa_crypto's submit-poll paradigm, which is crucial to derive
the benefits of hardware parallelism? IIUC, your v2 acomp request chaining
only works in fully synchronous mode; correct me if I am wrong. In fact, at
the start of acomp_do_req_chain(), if the algorithm has opted in to request
chaining, it returns after processing the first request, without processing
the chained requests. These are some of the issues I had to resolve when
using the ahash reference implementation you provided to develop my versions
of acomp_do_request_chain() and acomp_do_async_request_chain() in patch 1 of
my series.

I see that in your v2, you have introduced support for virtual addresses.
One suggestion: could you please incorporate my implementation of acomp
request chaining (for both of the above APIs) in a v3 of your patch-series,
so that it enables both acomp_do_async_request_chain() and
acomp_do_request_chain(), fixes some of the issues I pointed out above, and
adds in the virtual address support? Please let me know if this would be a
good way for us to proceed in getting zswap to realize the benefits of IAA
batch compressions.
[1] https://patchwork.kernel.org/project/linux-mm/patch/a11883ded326c4f4f80dcf0307ad05fd8e31abc7.1741080140.git.herbert@gondor.apana.org.au/

Thanks,
Kanchana

> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt