Message ID | 20241123070127.332773-10-kanchana.p.sridhar@intel.com
State      | Superseded
Series     | zswap IAA compress batching
On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar <kanchana.p.sridhar@intel.com> wrote:
>
> This patch does the following:
>
> 1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
>    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
>    "struct crypto_acomp_ctx" to contain the nr of resources that will be
>    allocated in the cpu onlining code.
>
> 2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
>    crypto_acomp created for the pool (in other words, the zswap compression
>    algorithm) has registered an implementation for batch_compress() and
>    batch_decompress(). If so, it will set "nr_reqs" to
>    SWAP_CRYPTO_BATCH_SIZE and allocate these many reqs/buffers, and set
>    the acomp_ctx->nr_reqs accordingly. If the crypto_acomp does not support
>    batching, "nr_reqs" defaults to 1.
>
> 3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will set to
>    true if the batching API are present for the crypto_acomp.

Why do we need this "can_batch" field? IIUC, this can be determined
from the compressor internal fields itself, no?

acomp_has_async_batching(acomp);

Is this just for convenience, or is this actually an expensive thing to compute?

> SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress batching

I like a sane default value as much as the next guy, but this seems a
bit odd to me:

1. The placement of this constant/default value seems strange to me.
This is a compressor-specific value, no? Why are we enforcing this
batching size at the zswap level, and uniformly at that? What if we
introduce a new batch compression algorithm...? Or am I missing
something, and this is a sane default for other compressors too?

2. Why is this value set to 8? Experimentation? Could you add some
justification in documentation?
Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Monday, December 2, 2024 11:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
>
> Why do we need this "can_batch" field? IIUC, this can be determined
> from the compressor internal fields itself, no?
>
> acomp_has_async_batching(acomp);
>
> Is this just for convenience, or is this actually an expensive thing to compute?

Thanks for your comments. This is a good question. I tried not to imply that
batching resources have been allocated for the cpu based only on what
acomp_has_async_batching() returns. It is possible that the cpu onlining
code ran into an -ENOMEM error on any particular cpu. In this case, I set
pool->can_batch to "false", mainly for convenience, so that zswap can be
somewhat insulated from migration. I agree that this may not be the best
solution; whether or not batching is enabled can be directly determined just
before the call to crypto_acomp_batch_compress(), based on:

acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;

I currently have a BUG_ON() for this condition not being met, which relies on
pool->can_batch gating the flow to get to zswap_batch_compress(). I think a
better solution would be to check for SWAP_CRYPTO_BATCH_SIZE acomp_ctx
resources right after we acquire the acomp_ctx->mutex and before the call to
crypto_acomp_batch_compress(). If so, we proceed; if not, we call
crypto_acomp_compress(). It seems this might be the only way to know for sure
whether the crypto batching API can be called, given that migration is
possible at any point in zswap_store(). Once we have obtained the mutex lock,
it seems we can proceed with batching based on this check (although the UAF
situation remains a larger issue, beyond the scope of this patch). I would
appreciate other ideas as well.

Also, I have submitted a patch-series [1] with Yosry's and Johannes'
suggestions for this series. It sets up a consolidated
zswap_store()/zswap_store_pages() code path for batching and non-batching
compressors. My goal is for [1] to go through code reviews and be able to
transition to batching, with a simple check:

if (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE)
    zswap_batch_compress();
else
    zswap_compress();

Please feel free to provide code review comments in [1]. Thanks!

[1]: https://patchwork.kernel.org/project/linux-mm/list/?series=912937

> > SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress batching
>
> I like a sane default value as much as the next guy, but this seems a
> bit odd to me:
>
> 1. The placement of this constant/default value seems strange to me.
> This is a compressor-specific value no? Why are we enforcing this
> batching size at the zswap level, and uniformly at that? What if we
> introduce a new batch compression algorithm...? Or am I missing
> something, and this is a sane default for other compressors too?

You bring up an excellent point. This is a compressor-specific value.
Instead of setting this up as a constant, which, as you correctly observe,
may not make sense for a non-IAA compressor, one way to get this could be by
querying the compressor, say:

int acomp_get_max_batchsize(struct crypto_acomp *tfm) {...};

to then allocate sufficient acomp_reqs/buffers/etc. in the zswap cpu
onlining code.

> 2. Why is this value set to 8? Experimentation? Could you add some
> justification in documentation?

Can I get back to you later this week with a proposal for this? We plan to
have a team discussion on how best to approach this for current and future
hardware.

Thanks,
Kanchana
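As a rough illustration of the onlining-time negotiation proposed above, here is a minimal userspace C sketch. Everything in it is a mock-up, not the real crypto or zswap API: acomp_get_max_batchsize() is the hypothetical query, and the structures stand in for crypto_acomp and the per-cpu acomp_ctx. A compressor reports its preferred batch size, and the per-cpu context allocates that many resource slots, defaulting to 1 when no batching is supported.

```c
#include <assert.h>
#include <stdlib.h>

/* Mock of the proposed query-then-allocate flow; names are hypothetical. */
struct mock_acomp { int max_batchsize; };           /* stands in for crypto_acomp */
struct mock_acomp_ctx { int nr_reqs; void **buffers; };

static int acomp_get_max_batchsize(struct mock_acomp *tfm)
{
	/* A compressor with no batching support reports 1 (sequential). */
	return tfm->max_batchsize > 0 ? tfm->max_batchsize : 1;
}

/* Stand-in for the zswap_cpu_comp_prepare() onlining step. */
static int ctx_prepare(struct mock_acomp_ctx *ctx, struct mock_acomp *tfm)
{
	int nr = acomp_get_max_batchsize(tfm);

	ctx->buffers = calloc(nr, sizeof(*ctx->buffers));
	if (!ctx->buffers)
		return -1;	/* the real code would handle -ENOMEM here */
	ctx->nr_reqs = nr;
	return 0;
}
```

With this shape, the store path only needs to compare ctx->nr_reqs against the compressor's batch size to decide between batched and sequential compression, which is the check discussed above.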
On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
>
> > Why do we need this "can_batch" field? IIUC, this can be determined
> > from the compressor internal fields itself, no?
> >
> > acomp_has_async_batching(acomp);
> >
> > Is this just for convenience, or is this actually an expensive thing to compute?
>
> Thanks for your comments. This is a good question. I tried not to imply that
> batching resources have been allocated for the cpu based only on what
> acomp_has_async_batching() returns. It is possible that the cpu onlining
> code ran into an -ENOMEM error on any particular cpu. In this case, I set
> the pool->can_batch to "false", mainly for convenience, so that zswap
> can be somewhat insulated from migration. I agree that this may not be
> the best solution; and whether or not batching is enabled can be directly
> determined just before the call to crypto_acomp_batch_compress()
> based on:
>
> acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;

With ahash request chaining, the idea is to accumulate as much data as you
can before you provide it to the Crypto API. The API is responsible for
dividing it up if the underlying driver is only able to handle one request
at a time.

So that would be the ideal model to use for compression as well. Provide as
much data as you can and let the API handle the case where the data needs
to be divided up.

Cheers,
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Tuesday, December 3, 2024 12:01 AM
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
>
> With ahash request chaining, the idea is to accumulate as much
> data as you can before you provide it to the Crypto API. The
> API is responsible for dividing it up if the underlying driver
> is only able to handle one request at a time.
>
> So that would be the ideal model to use for compression as well.
> Provide as much data as you can and let the API handle the case
> where the data needs to be divided up.

Thanks for this suggestion! This sounds like a clean way to handle the
batching/sequential compress/decompress within the crypto API, as long as it
can be contained in the crypto acompress layer. If the zswap maintainers
don't have any objections, I can look into the feasibility of doing this.

Thanks,
Kanchana

> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P <kanchana.p.sridhar@intel.com> wrote:
>
> > With ahash request chaining, the idea is to accumulate as much
> > data as you can before you provide it to the Crypto API. The
> > API is responsible for dividing it up if the underlying driver
> > is only able to handle one request at a time.
> >
> > So that would be the ideal model to use for compression as well.
> > Provide as much data as you can and let the API handle the case
> > where the data needs to be divided up.
>
> Thanks for this suggestion! This sounds like a clean way to handle the
> batching/sequential compress/decompress within the crypto API as long
> as it can be contained in the crypto acompress layer.
> If the zswap maintainers don't have any objections, I can look into the
> feasibility of doing this.

Does this mean that instead of zswap breaking down the folio into
SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the crypto
layer and let it do the batching as it pleases?

It sounds nice on the surface, but this implies that we have to allocate
folio_nr_pages() buffers in zswap, essentially an allocation the same size
as the folio itself. While the allocation does not need to be contiguous,
making a large number of allocations in the reclaim path is definitely not
something we want. For a 2M THP, we'd need to allocate 2M in zswap_store().

If we choose to keep preallocating, assuming the maximum THP size is 2M, we
need to allocate 2M * nr_cpus worth of buffers. That's a lot of memory.

I feel like I am missing something.
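The memory estimate in the concern above can be sanity-checked with a quick calculation. This sketch is purely illustrative arithmetic: the 4K page size and the 64-CPU count are assumptions for the example, not values stated in the thread.

```c
#include <assert.h>

/* Arithmetic behind "2M * nr_cpus worth of buffers": covering a whole
 * 2M THP means one PAGE_SZ destination buffer per page of the folio,
 * preallocated on every CPU. */
#define PAGE_SZ	4096UL
#define THP_SZ	(2UL * 1024 * 1024)

static unsigned long prealloc_bytes(unsigned long nr_cpus)
{
	unsigned long buffers_per_cpu = THP_SZ / PAGE_SZ;	/* 512 buffers */

	return buffers_per_cpu * PAGE_SZ * nr_cpus;		/* 2 MiB per CPU */
}
```

For example, on a hypothetical 64-CPU machine this comes to 128 MiB of preallocated buffers, which is why capping the batch size (rather than sizing for a whole folio) is attractive.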
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, December 3, 2024 1:44 PM
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
>
> Does this mean that instead of zswap breaking down the folio into
> SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the
> crypto layer and let it do the batching as it pleases?

If I understand Herbert's suggestion correctly, I think what he meant was
that we always allocate only SWAP_CRYPTO_BATCH_SIZE (say, 8) buffers in
zswap during cpu onlining. The acomp_has_async_batching() API can be used
to determine whether to allocate more than one acomp_req and crypto_wait
(fyi, I am creating SWAP_CRYPTO_BATCH_SIZE crypto_waits for request
chaining, with the goal of understanding performance wrt the existing
implementation of crypto_acomp_batch_compress()). In zswap_store_folio(),
we process the large folio in batches of 8 pages and call
crypto_acomp_batch_compress() for each batch. Based on earlier discussions
in this thread, it might make sense to add a bool option to
crypto_acomp_batch_compress() as follows:

static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
					       struct crypto_wait *waits[],
					       struct page *pages[],
					       u8 *dsts[],
					       unsigned int dlens[],
					       int errors[],
					       int nr_pages,
					       bool parallel);

zswap would acquire the per-cpu acomp_ctx->mutex and pass
(acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) for the "parallel"
parameter. This indicates to crypto_acomp_batch_compress() whether or not
SWAP_CRYPTO_BATCH_SIZE elements are available in "reqs" and "waits".

If we have multiple "reqs" (parallel == true), we use request chaining (or
the existing asynchronous poll implementation) for IAA batching. If
(parallel == false), crypto_acomp_batch_compress() will look something like
this:

static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
					       struct crypto_wait *waits[],
					       struct page *pages[],
					       u8 *dsts[],
					       unsigned int dlens[],
					       int errors[],
					       int nr_pages,
					       bool parallel)
{
	if (!parallel) {
		struct scatterlist input, output;
		int i;

		for (i = 0; i < nr_pages; ++i) {
			/*
			 * For pages[i], buffers[i], dlens[i]: borrow the
			 * first half of zswap_compress() functionality.
			 */
			dst = acomp_ctx->buffers[i];
			sg_init_table(&input, 1);
			sg_set_page(&input, pages[i], PAGE_SIZE, 0);

			sg_init_one(&output, dst, PAGE_SIZE * 2);
			acomp_request_set_params(acomp_ctx->reqs[0],
						 &input, &output,
						 PAGE_SIZE, dlens[i]);

			comp_ret = crypto_wait_req(
				crypto_acomp_compress(acomp_ctx->reqs[0]),
				acomp_ctx->waits[0]);
			dlens[i] = acomp_ctx->reqs[0]->dlen;
		}
	}

	/*
	 * At this point we would have sequentially compressed the batch.
	 * zswap_store_folio() can process the buffers and dlens using
	 * common code for batching and non-batching compressors.
	 */
}

IIUC, this suggestion appears to be along the lines of using common code in
zswap as far as possible, for compressors that do and do not support
batching. Herbert can correct me if I am wrong.

If this is indeed the case, the memory penalty for software compressors
would be:
1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
   zswap_cpu_comp_prepare().
2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
   zswap_store_folio().

This would be an additional memory penalty for what we gain by having the
common code paths in zswap for compressors that do and do not support
batching.

Thanks,
Kanchana
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, December 3, 2024 2:18 PM
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
>
> IIUC, this suggestion appears to be along the lines of using common
> code in zswap as far as possible, for compressors that do and do not
> support batching. Herbert can correct me if I am wrong.
>
> If this is indeed the case, the memory penalty for software compressors
> would be:
> 1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
>    zswap_cpu_comp_prepare().
> 2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
>    zswap_store_folio().
>
> This would be an additional memory penalty for what we gain by
> having the common code paths in zswap for compressors that do
> and do not support batching.

Alternately, we could use request chaining always, even for software
compressors, for a larger memory penalty per-cpu, by allocating
SWAP_CRYPTO_BATCH_SIZE reqs/waits by default. I don't know if this would
have functional issues, because the chain of requests will be processed
sequentially (basically all requests are added to a list), but maybe
Herbert is suggesting this (not sure).

Thanks,
Kanchana
On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
>
> Does this mean that instead of zswap breaking down the folio into
> SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the
> crypto layer and let it do the batching as it pleases?

You provide as much (or little) as you're comfortable with. Just
treat the acomp API as one that can take as much as you want to
give it.

Cheers,
On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> >
> > Does this mean that instead of zswap breaking down the folio into
> > SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the
> > crypto layer and let it do the batching as it pleases?
>
> You provide as much (or little) as you're comfortable with. Just
> treat the acomp API as one that can take as much as you want to
> give it.

In this case, it seems like the batch size is completely up to zswap,
and not necessarily dependent on the compressor. That being said,
Intel IAA will naturally prefer a batch size that maximizes the
parallelization.

How about this: we can define a fixed max batch size in zswap, to
provide a hard limit on the number of buffers we preallocate (e.g.
MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
min(MAX_BATCH_SIZE, compressor_batch_size).

Assuming software compressors provide 1 for the batch size, if
MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
software compressors get the same behavior as today. This abstracts
the batch size needed by the compressor while making sure zswap does
not preallocate a ridiculous amount of memory.

Does this make sense to everyone or am I missing something?
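The min(MAX_BATCH_SIZE, compressor_batch_size) negotiation proposed above can be sketched in a few lines. MAX_BATCH_SIZE and the function name here are hypothetical, following the naming used in the message:

```c
#include <assert.h>

/* Hypothetical zswap-side hard limit on preallocated per-cpu buffers. */
#define MAX_BATCH_SIZE 8U

/*
 * The compressor's desired batch size is only a hint: zswap clamps it
 * to its own maximum so the compressor cannot force an arbitrarily
 * large preallocation.
 */
static unsigned int zswap_batch_size(unsigned int compressor_batch_size)
{
	return compressor_batch_size < MAX_BATCH_SIZE ?
	       compressor_batch_size : MAX_BATCH_SIZE;
}
```

With this shape, a software compressor hinting 1 keeps today's single-buffer behavior, while an IAA-style hint of 8 gets the full batch.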
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, December 4, 2024 2:36 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> if the crypto_alg supports batching.
>
> On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au>
> wrote:
> >
> > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > >
> > > Does this mean that instead of zswap breaking down the folio into
> > > SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the
> > > crypto layer and let it do the batching as it pleases?
> >
> > You provide as much (or little) as you're comfortable with. Just
> > treat the acomp API as one that can take as much as you want to
> > give it.
>
> In this case, it seems like the batch size is completely up to zswap,
> and not necessarily dependent on the compressor. That being said,
> Intel IAA will naturally prefer a batch size that maximizes the
> parallelization.
>
> How about this, we can define a fixed max batch size in zswap, to
> provide a hard limit on the number of buffers we preallocate (e.g.
> MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> min(MAX_BATCH_SIZE, compressor_batch_size).
>
> Assuming software compressors provide 1 for the batch size, if
> MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
> software compressors get the same behavior as today. This abstracts
> the batch size needed by the compressor while making sure zswap does
> not preallocate a ridiculous amount of memory.
>
> Does this make sense to everyone or am I missing something?

Thanks Yosry, this makes perfect sense. I can declare a default
CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can query,
acomp_get_batch_size(struct crypto_acomp *tfm), that calls a crypto
algorithm interface if one is registered, e.g. crypto_get_batch_size(),
which IAA can register to return the max batch size for IAA. If a
compressor does not provide an implementation of
crypto_get_batch_size(), we would return CRYPTO_ACOMP_BATCH_SIZE. This
way, nothing specific will need to be done for the software compressors
for now. Unless they define a specific batch_size via, say, another
interface, crypto_set_batch_size(), acomp_get_batch_size() will
return 1.

Thanks,
Kanchana
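The query-with-fallback scheme described above (an optional per-algorithm hook, defaulting to 1) can be sketched in userspace. All names here are hypothetical, modeled on the proposal, not on an existing crypto API:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical default batch size when an algorithm registers no hook. */
#define CRYPTO_ACOMP_BATCH_SIZE 1U

/* Stand-in for the algorithm descriptor: software compressors leave the
 * hook NULL; a batching driver like iaa_crypto would register one. */
struct acomp_alg_sketch {
	unsigned int (*get_batch_size)(void);
};

static unsigned int iaa_get_batch_size(void)
{
	return 8; /* hypothetical IAA maximum batch size */
}

/* Query helper: use the registered hook if present, else the default. */
static unsigned int acomp_get_batch_size_sketch(const struct acomp_alg_sketch *alg)
{
	if (alg->get_batch_size)
		return alg->get_batch_size();
	return CRYPTO_ACOMP_BATCH_SIZE;
}
```

The design choice is that software compressors need no change at all: a NULL hook transparently yields the single-request behavior zswap has today.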
On Wed, Dec 4, 2024 at 2:49 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
> > On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au>
> > wrote:
> > >
> > > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > > >
> > > > Does this mean that instead of zswap breaking down the folio into
> > > > SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to the
> > > > crypto layer and let it do the batching as it pleases?
> > >
> > > You provide as much (or little) as you're comfortable with. Just
> > > treat the acomp API as one that can take as much as you want to
> > > give it.
> >
> > In this case, it seems like the batch size is completely up to zswap,
> > and not necessarily dependent on the compressor. That being said,
> > Intel IAA will naturally prefer a batch size that maximizes the
> > parallelization.
> >
> > How about this, we can define a fixed max batch size in zswap, to
> > provide a hard limit on the number of buffers we preallocate (e.g.
> > MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> > desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> > min(MAX_BATCH_SIZE, compressor_batch_size).
> >
> > Assuming software compressors provide 1 for the batch size, if
> > MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
> > software compressors get the same behavior as today. This abstracts
> > the batch size needed by the compressor while making sure zswap does
> > not preallocate a ridiculous amount of memory.
> >
> > Does this make sense to everyone or am I missing something?
>
> Thanks Yosry, this makes perfect sense. I can declare a default
> CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can query,
> acomp_get_batch_size(struct crypto_acomp *tfm), that can call a crypto
> algorithm interface if it is registered, e.g. crypto_get_batch_size(),
> which IAA can register to return the max batch size for IAA. If a
> compressor does not provide an implementation for
> crypto_get_batch_size(), we would return CRYPTO_ACOMP_BATCH_SIZE. This
> way, nothing specific will need to be done for the software compressors
> for now. Unless they define a specific batch_size via, say, another
> interface, crypto_set_batch_size(), acomp_get_batch_size() will
> return 1.

I still think zswap should define its own maximum to avoid having the
compressors have complete control over the amount of memory that zswap
preallocates.

For the acomp stuff I will let Herbert decide what he thinks is best.
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, December 4, 2024 2:55 PM
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> if the crypto_alg supports batching.
>
> On Wed, Dec 4, 2024 at 2:49 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > > On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu
> > > <herbert@gondor.apana.org.au> wrote:
> > > >
> > > > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > > > >
> > > > > Does this mean that instead of zswap breaking down the folio into
> > > > > SWAP_CRYPTO_BATCH_SIZE-sized batches, we pass all the pages to
> > > > > the crypto layer and let it do the batching as it pleases?
> > > >
> > > > You provide as much (or little) as you're comfortable with. Just
> > > > treat the acomp API as one that can take as much as you want to
> > > > give it.
> > >
> > > In this case, it seems like the batch size is completely up to zswap,
> > > and not necessarily dependent on the compressor. That being said,
> > > Intel IAA will naturally prefer a batch size that maximizes the
> > > parallelization.
> > >
> > > How about this, we can define a fixed max batch size in zswap, to
> > > provide a hard limit on the number of buffers we preallocate (e.g.
> > > MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> > > desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> > > min(MAX_BATCH_SIZE, compressor_batch_size).
> > >
> > > Assuming software compressors provide 1 for the batch size, if
> > > MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants,
> > > and software compressors get the same behavior as today. This
> > > abstracts the batch size needed by the compressor while making sure
> > > zswap does not preallocate a ridiculous amount of memory.
> > >
> > > Does this make sense to everyone or am I missing something?
> >
> > Thanks Yosry, this makes perfect sense. I can declare a default
> > CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can query,
> > acomp_get_batch_size(struct crypto_acomp *tfm), that can call a crypto
> > algorithm interface if it is registered, e.g. crypto_get_batch_size(),
> > which IAA can register to return the max batch size for IAA. If a
> > compressor does not provide an implementation for
> > crypto_get_batch_size(), we would return CRYPTO_ACOMP_BATCH_SIZE.
> > This way, nothing specific will need to be done for the software
> > compressors for now. Unless they define a specific batch_size via,
> > say, another interface, crypto_set_batch_size(), acomp_get_batch_size()
> > will return 1.
>
> I still think zswap should define its own maximum to avoid having the
> compressors have complete control over the amount of memory that zswap
> preallocates.

For sure, zswap should set the MAX_BATCH_SIZE for this purpose.

>
> For the acomp stuff I will let Herbert decide what he thinks is best.
>
> From the zswap side, I just want:
> - A hard limit on the amount of memory we preallocate.
> - No change for the software compressors.

Sounds good!

Thanks,
Kanchana

>
> >
> > Thanks,
> > Kanchana
Hi Nhat,

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Monday, December 2, 2024 4:31 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> if the crypto_alg supports batching.
>
> Hi Nhat,
>
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Monday, December 2, 2024 11:16 AM
> > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching
> > resources if the crypto_alg supports batching.
> >
> > On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > This patch does the following:
> > >
> > > 1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
> > >    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
> > >    "struct crypto_acomp_ctx" to contain the nr of resources that will
> > >    be allocated in the cpu onlining code.
> > >
> > > 2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
> > >    crypto_acomp created for the pool (in other words, the zswap
> > >    compression algorithm) has registered an implementation for
> > >    batch_compress() and batch_decompress(). If so, it will set
> > >    "nr_reqs" to SWAP_CRYPTO_BATCH_SIZE and allocate these many
> > >    reqs/buffers, and set the acomp_ctx->nr_reqs accordingly. If the
> > >    crypto_acomp does not support batching, "nr_reqs" defaults to 1.
> > >
> > > 3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will
> > >    set to true if the batching API are present for the crypto_acomp.
> >
> > Why do we need this "can_batch" field? IIUC, this can be determined
> > from the compressor internal fields itself, no?
> >
> > acomp_has_async_batching(acomp);
> >
> > Is this just for convenience, or is this actually an expensive thing to
> > compute?
>
> Thanks for your comments. This is a good question. I tried not to imply
> that batching resources have been allocated for the cpu based only on what
> acomp_has_async_batching() returns. It is possible that the cpu onlining
> code ran into an -ENOMEM error on any particular cpu. In this case, I set
> the pool->can_batch to "false", mainly for convenience, so that zswap
> can be somewhat insulated from migration. I agree that this may not be
> the best solution; and whether or not batching is enabled can be directly
> determined just before the call to crypto_acomp_batch_compress()
> based on:
>
> acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
>
> I currently have a BUG_ON() for this condition not being met, that relies
> on the pool->can_batch gating the flow to get to zswap_batch_compress().
>
> I think a better solution would be to check for having
> SWAP_CRYPTO_BATCH_SIZE # of acomp_ctx resources right after we acquire
> the acomp_ctx->mutex and before the call to
> crypto_acomp_batch_compress(). If so, we proceed, and if not, we call
> crypto_acomp_compress(). It seems this might be the only way to know for
> sure whether the crypto batching API can be called, given that migration
> is possible at any point in zswap_store(). Once we have obtained the
> mutex_lock, it seems we can proceed with batching based on this check
> (although the UAF situation remains as a larger issue, beyond the scope
> of this patch). I would appreciate other ideas as well.
>
> Also, I have submitted a patch-series [1] with Yosry's & Johannes'
> suggestions to this series. This is setting up a consolidated
> zswap_store()/zswap_store_pages() code path for batching and
> non-batching compressors. My goal is for [1] to go through code reviews
> and be able to transition to batching, with a simple check:
>
> if (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE)
>         zswap_batch_compress();
> else
>         zswap_compress();
>
> Please feel free to provide code review comments in [1]. Thanks!
>
> [1]: https://patchwork.kernel.org/project/linux-mm/list/?series=912937
>
> >
> > >
> > > SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress
> > > batching
> >
> > I like a sane default value as much as the next guy, but this seems a
> > bit odd to me:
> >
> > 1. The placement of this constant/default value seems strange to me.
> > This is a compressor-specific value no? Why are we enforcing this
> > batching size at the zswap level, and uniformly at that? What if we
> > introduce a new batch compression algorithm...? Or am I missing
> > something, and this is a sane default for other compressors too?
>
> You bring up an excellent point. This is a compressor-specific value.
> Instead of setting this up as a constant, which as you correctly observe,
> may not make sense for a non-IAA compressor, one way to get
> this could be by querying the compressor, say:
>
> int acomp_get_max_batchsize(struct crypto_acomp *tfm) {...};
>
> to then allocate sufficient acomp_reqs/buffers/etc. in the zswap
> cpu onlining code.
>
> >
> > 2. Why is this value set to 8? Experimentation? Could you add some
> > justification in documentation?
>
> Can I get back to you later this week with a proposal for this? We plan
> to have a team discussion on how best to approach this for current
> and future hardware.

Sorry it took me quite a while to get back to you on this. I have been
busy with implementing request chaining, and other major improvements
to this series based on the comments received thus far.

I will be submitting a v5 of this series shortly, in which I have
implemented an IAA_CRYPTO_MAX_BATCH_SIZE in the iaa_crypto driver. For
now I set this to 8, since we have done all our testing with a batch
size of 8, but we are still running experiments to figure this out;
hence this #define in the iaa_crypto driver (in v5) can potentially
change.

Further, there is a zswap-specific ZSWAP_MAX_BATCH_SIZE in v5, which is
also 8. I would appreciate code review comments for v5. If the approach
I've taken in v5 is acceptable, I will add more details/justification
in the documentation in a v6.

Thanks,
Kanchana

>
> Thanks,
> Kanchana
diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index d961ead91bf1..9ad27ab3d222 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -7,6 +7,13 @@
 
 struct lruvec;
 
+/*
+ * For IAA compression batching:
+ * Maximum number of IAA acomp compress requests that will be processed
+ * in a batch: in parallel, if iaa_crypto async/no irq mode is enabled
+ * (the default); else sequentially, if iaa_crypto sync mode is in effect.
+ */
+#define SWAP_CRYPTO_BATCH_SIZE 8UL
 extern atomic_long_t zswap_stored_pages;
 
 #ifdef CONFIG_ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb23..173f7632990e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -143,9 +143,10 @@ bool zswap_never_enabled(void)
 
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
+	struct acomp_req **reqs;
+	u8 **buffers;
+	unsigned int nr_reqs;
 	struct crypto_wait wait;
-	u8 *buffer;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -158,6 +159,7 @@ struct crypto_acomp_ctx {
  */
 struct zswap_pool {
 	struct zpool *zpool;
+	bool can_batch;
 	struct crypto_acomp_ctx __percpu *acomp_ctx;
 	struct percpu_ref ref;
 	struct list_head list;
@@ -285,6 +287,8 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 		goto error;
 	}
 
+	pool->can_batch = false;
+
 	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
 				       &pool->node);
 	if (ret)
@@ -818,49 +822,90 @@ static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
-	int ret;
+	unsigned int nr_reqs = 1;
+	int ret = -ENOMEM;
+	int i, j;
 
 	mutex_init(&acomp_ctx->mutex);
-
-	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!acomp_ctx->buffer)
-		return -ENOMEM;
+	acomp_ctx->nr_reqs = 0;
 
 	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR(acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
-		goto acomp_fail;
+		return PTR_ERR(acomp);
 	}
 	acomp_ctx->acomp = acomp;
 	acomp_ctx->is_sleepable = acomp_is_async(acomp);
 
-	req = acomp_request_alloc(acomp_ctx->acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
+	/*
+	 * Create the necessary batching resources if the crypto acomp alg
+	 * implements the batch_compress and batch_decompress API.
+	 */
+	if (acomp_has_async_batching(acomp)) {
+		pool->can_batch = true;
+		nr_reqs = SWAP_CRYPTO_BATCH_SIZE;
+		pr_info_once("Creating acomp_ctx with %d reqs for batching since crypto acomp %s\nhas registered batch_compress() and batch_decompress()\n",
+			     nr_reqs, pool->tfm_name);
+	}
+
+	acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *),
+					  GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffers)
+		goto buf_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL,
+						     cpu_to_node(cpu));
+		if (!acomp_ctx->buffers[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(acomp_ctx->buffers[j]);
+			kfree(acomp_ctx->buffers);
+			ret = -ENOMEM;
+			goto buf_fail;
+		}
+	}
+
+	acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *),
+				       GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->reqs)
 		goto req_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
+		if (!acomp_ctx->reqs[i]) {
+			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
+			       i, pool->tfm_name);
+			for (j = 0; j < i; ++j)
+				acomp_request_free(acomp_ctx->reqs[j]);
+			kfree(acomp_ctx->reqs);
+			ret = -ENOMEM;
+			goto req_fail;
+		}
 	}
-	acomp_ctx->req = req;
 
+	/*
+	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
+	 * or non-poll mode of acomp, hence there is only one "wait" per
+	 * acomp_ctx, with callback set to reqs[0], under the assumption that
+	 * there is at least 1 request per acomp_ctx.
+	 */
 	crypto_init_wait(&acomp_ctx->wait);
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
+	acomp_ctx->nr_reqs = nr_reqs;
 	return 0;
 
 req_fail:
+	for (i = 0; i < nr_reqs; ++i)
+		kfree(acomp_ctx->buffers[i]);
+	kfree(acomp_ctx->buffers);
+buf_fail:
 	crypto_free_acomp(acomp_ctx->acomp);
-acomp_fail:
-	kfree(acomp_ctx->buffer);
+	pool->can_batch = false;
 	return ret;
 }
 
@@ -870,11 +915,22 @@ static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 
 	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
+		int i;
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+				acomp_request_free(acomp_ctx->reqs[i]);
+		kfree(acomp_ctx->reqs);
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			kfree(acomp_ctx->buffers[i]);
+		kfree(acomp_ctx->buffers);
+
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
+
+		acomp_ctx->nr_reqs = 0;
+		acomp_ctx = NULL;
 	}
 
 	return 0;
@@ -897,7 +953,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -907,7 +963,7 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * giving the dst buffer with enough length to avoid buffer overflow.
 	 */
 	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
 
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,
@@ -921,8 +977,8 @@ static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * but in different threads running on different cpu, we have different
 	 * acomp instance, so multiple threads can do (de)compression in parallel.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+	dlen = acomp_ctx->reqs[0]->dlen;
 	if (comp_ret)
 		goto unlock;
 
@@ -975,20 +1031,20 @@ static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	 */
 	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
 	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
+		memcpy(acomp_ctx->buffers[0], src, entry->length);
+		src = acomp_ctx->buffers[0];
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
-	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
+	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
+	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
 	mutex_unlock(&acomp_ctx->mutex);
 
-	if (src != acomp_ctx->buffer)
+	if (src != acomp_ctx->buffers[0])
 		zpool_unmap_handle(zpool, entry->handle);
 }
This patch does the following:

1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
   configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
   "struct crypto_acomp_ctx" to contain the number of resources that
   will be allocated in the cpu onlining code.

2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
   crypto_acomp created for the pool (in other words, the zswap
   compression algorithm) has registered an implementation for
   batch_compress() and batch_decompress(). If so, it will set
   "nr_reqs" to SWAP_CRYPTO_BATCH_SIZE, allocate that many
   reqs/buffers, and set acomp_ctx->nr_reqs accordingly. If the
   crypto_acomp does not support batching, "nr_reqs" defaults to 1.

3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will
   set to true if the batching APIs are present for the crypto_acomp.

SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress
batching "sub-batch" size when zswap_batch_store() is processing a
large folio. This represents the number of buffers that can be
compressed/decompressed in parallel by Intel IAA hardware.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |   7 +++
 mm/zswap.c            | 120 +++++++++++++++++++++++++++++++-----------
 2 files changed, 95 insertions(+), 32 deletions(-)
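The core of the onlining change in this patch is the allocate-or-unwind pattern: allocate nr_reqs resources, and on the first failure free everything allocated so far so the per-cpu context is left empty. A userspace sketch of that pattern (hypothetical struct and function names; malloc stands in for kmalloc_node):

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal stand-in for the per-cpu context: an array of nr_reqs buffers. */
struct ctx_sketch {
	void **buffers;
	unsigned int nr_reqs;
};

/* Allocate nr_reqs buffers; on failure, free partial work and report -1.
 * nr_reqs is set only after every allocation has succeeded, so teardown
 * code can trust it. */
static int ctx_alloc_buffers(struct ctx_sketch *ctx, unsigned int nr_reqs,
			     size_t buf_size)
{
	unsigned int i, j;

	ctx->nr_reqs = 0;
	ctx->buffers = malloc(nr_reqs * sizeof(void *));
	if (!ctx->buffers)
		return -1;

	for (i = 0; i < nr_reqs; ++i) {
		ctx->buffers[i] = malloc(buf_size);
		if (!ctx->buffers[i]) {
			/* unwind: free everything allocated before slot i */
			for (j = 0; j < i; ++j)
				free(ctx->buffers[j]);
			free(ctx->buffers);
			ctx->buffers = NULL;
			return -1;
		}
	}
	ctx->nr_reqs = nr_reqs;
	return 0;
}

/* Mirror of the offlining teardown: free per-slot buffers, then the array. */
static void ctx_free_buffers(struct ctx_sketch *ctx)
{
	unsigned int i;

	for (i = 0; i < ctx->nr_reqs; ++i)
		free(ctx->buffers[i]);
	free(ctx->buffers);
	ctx->buffers = NULL;
	ctx->nr_reqs = 0;
}
```

Deferring the nr_reqs assignment to the end is the same choice the patch makes with acomp_ctx->nr_reqs: the cpu offlining path iterates over nr_reqs, so it must never exceed the number of successfully allocated resources.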