
[v4,09/10] mm: zswap: Allocate pool batching resources if the crypto_alg supports batching.

Message ID 20241123070127.332773-10-kanchana.p.sridhar@intel.com
State Superseded
Series zswap IAA compress batching

Commit Message

Kanchana P Sridhar Nov. 23, 2024, 7:01 a.m. UTC
This patch does the following:

1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
   configurable number of acomp_reqs and buffers. Adds an "nr_reqs" member
   to "struct crypto_acomp_ctx" to store the number of resources that will
   be allocated in the cpu onlining code.

2) The zswap_cpu_comp_prepare() cpu onlining code will detect whether the
   crypto_acomp created for the pool (in other words, the zswap compression
   algorithm) has registered implementations of batch_compress() and
   batch_decompress(). If so, it will set "nr_reqs" to
   SWAP_CRYPTO_BATCH_SIZE, allocate that many reqs/buffers, and set
   acomp_ctx->nr_reqs accordingly. If the crypto_acomp does not support
   batching, "nr_reqs" defaults to 1.

3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will set to
   true if the batching APIs are present for the crypto_acomp.

SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress batching
"sub-batch" size when zswap_batch_store() is processing a large folio. This
represents the number of buffers that can be compressed/decompressed in
parallel by Intel IAA hardware.

Signed-off-by: Kanchana P Sridhar <kanchana.p.sridhar@intel.com>
---
 include/linux/zswap.h |   7 +++
 mm/zswap.c            | 120 +++++++++++++++++++++++++++++++-----------
 2 files changed, 95 insertions(+), 32 deletions(-)

Comments

Nhat Pham Dec. 2, 2024, 7:15 p.m. UTC | #1
On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
<kanchana.p.sridhar@intel.com> wrote:
>
> This patch does the following:
>
> 1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
>    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
>    "struct crypto_acomp_ctx" to contain the nr of resources that will be
>    allocated in the cpu onlining code.
>
> 2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
>    crypto_acomp created for the pool (in other words, the zswap compression
>    algorithm) has registered an implementation for batch_compress() and
>    batch_decompress(). If so, it will set "nr_reqs" to
>    SWAP_CRYPTO_BATCH_SIZE and allocate these many reqs/buffers, and set
>    the acomp_ctx->nr_reqs accordingly. If the crypto_acomp does not support
>    batching, "nr_reqs" defaults to 1.
>
> 3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will set to
>    true if the batching API are present for the crypto_acomp.

Why do we need this "can_batch" field? IIUC, this can be determined
from the compressor internal fields itself, no?

acomp_has_async_batching(acomp);

Is this just for convenience, or is this actually an expensive thing to compute?

>
> SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress batching

I like a sane default value as much as the next guy, but this seems a
bit odd to me:

1. The placement of this constant/default value seems strange to me.
This is a compressor-specific value no? Why are we enforcing this
batching size at the zswap level, and uniformly at that? What if we
introduce a new batch compression algorithm...? Or am I missing
something, and this is a sane default for other compressors too?

2. Why is this value set to 8? Experimentation? Could you add some
justification in documentation?
Kanchana P Sridhar Dec. 3, 2024, 12:30 a.m. UTC | #2
Hi Nhat,

> -----Original Message-----
> From: Nhat Pham <nphamcs@gmail.com>
> Sent: Monday, December 2, 2024 11:16 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
> <kanchana.p.sridhar@intel.com> wrote:
> >
> > This patch does the following:
> >
> > 1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
> >    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
> >    "struct crypto_acomp_ctx" to contain the nr of resources that will be
> >    allocated in the cpu onlining code.
> >
> > 2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
> >    crypto_acomp created for the pool (in other words, the zswap
> compression
> >    algorithm) has registered an implementation for batch_compress() and
> >    batch_decompress(). If so, it will set "nr_reqs" to
> >    SWAP_CRYPTO_BATCH_SIZE and allocate these many reqs/buffers, and
> set
> >    the acomp_ctx->nr_reqs accordingly. If the crypto_acomp does not
> support
> >    batching, "nr_reqs" defaults to 1.
> >
> > 3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will set to
> >    true if the batching API are present for the crypto_acomp.
> 
> Why do we need this "can_batch" field? IIUC, this can be determined
> from the compressor internal fields itself, no?
> 
> acomp_has_async_batching(acomp);
> 
> Is this just for convenience, or is this actually an expensive thing to compute?

Thanks for your comments. This is a good question. I tried not to imply that
batching resources have been allocated for the cpu based only on what
acomp_has_async_batching() returns. It is possible that the cpu onlining
code ran into an -ENOMEM error on any particular cpu. In this case, I set
the pool->can_batch to "false", mainly for convenience, so that zswap
can be somewhat insulated from migration. I agree that this may not be
the best solution; and whether or not batching is enabled can be directly
determined just before the call to crypto_acomp_batch_compress()
based on:

acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;

I currently have a BUG_ON() for the case where this condition is not met,
which relies on pool->can_batch gating the flow into zswap_batch_compress().

I think a better solution would be to check for having SWAP_CRYPTO_BATCH_SIZE
# of acomp_ctx resources right after we acquire the acomp_ctx->mutex and before
the call to crypto_acomp_batch_compress(). If so, we proceed, and if not, we call
crypto_acomp_compress(). It seems this might be the only way to know for sure
whether the crypto batching API can be called, given that migration is possible
at any point in zswap_store(). Once we have obtained the mutex_lock, it seems
we can proceed with batching based on this check (although the UAF situation
remains as a larger issue, beyond the scope of this patch). I would appreciate
other ideas as well.

Also, I have submitted a patch-series [1] with Yosry's & Johannes' suggestions
to this series. This is setting up a consolidated zswap_store()/zswap_store_pages()
code path for batching and non-batching compressors. My goal is for [1] to
go through code reviews and be able to transition to batching, with a simple
check:

if (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE)
         zswap_batch_compress();
else
         zswap_compress();

Please feel free to provide code review comments in [1]. Thanks!

[1]: https://patchwork.kernel.org/project/linux-mm/list/?series=912937

> 
> >
> > SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress
> batching
> 
> I like a sane default value as much as the next guy, but this seems a
> bit odd to me:
> 
> 1. The placement of this constant/default value seems strange to me.
> This is a compressor-specific value no? Why are we enforcing this
> batching size at the zswap level, and uniformly at that? What if we
> introduce a new batch compression algorithm...? Or am I missing
> something, and this is a sane default for other compressors too?

You bring up an excellent point. This is a compressor-specific value.
Instead of setting this up as a constant, which, as you correctly observe,
may not make sense for a non-IAA compressor, one way to obtain this
value could be to query the compressor, say:

int acomp_get_max_batchsize(struct crypto_acomp *tfm) {...};

and then allocate sufficient acomp_reqs/buffers/etc. in the zswap
cpu onlining code.

> 
> 2. Why is this value set to 8? Experimentation? Could you add some
> justification in documentation?

Can I get back to you later this week with a proposal for this? We plan
to have a team discussion on how best to approach this for current
and future hardware.

Thanks,
Kanchana
Herbert Xu Dec. 3, 2024, 8 a.m. UTC | #3
On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
>
> > Why do we need this "can_batch" field? IIUC, this can be determined
> > from the compressor internal fields itself, no?
> > 
> > acomp_has_async_batching(acomp);
> > 
> > Is this just for convenience, or is this actually an expensive thing to compute?
> 
> Thanks for your comments. This is a good question. I tried not to imply that
> batching resources have been allocated for the cpu based only on what
> acomp_has_async_batching() returns. It is possible that the cpu onlining
> code ran into an -ENOMEM error on any particular cpu. In this case, I set
> the pool->can_batch to "false", mainly for convenience, so that zswap
> can be somewhat insulated from migration. I agree that this may not be
> the best solution; and whether or not batching is enabled can be directly
> determined just before the call to crypto_acomp_batch_compress()
> based on:
> 
> acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;

With ahash request chaining, the idea is to accumulate as much
data as you can before you provide it to the Crypto API.  The
API is responsible for dividing it up if the underlying driver
is only able to handle one request at a time.

So that would be the ideal model to use for compression as well.
Provide as much data as you can and let the API handle the case
where the data needs to be divided up.

Cheers,
Kanchana P Sridhar Dec. 3, 2024, 9:37 p.m. UTC | #4
> -----Original Message-----
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Sent: Tuesday, December 3, 2024 12:01 AM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Nhat Pham <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org; hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> >
> > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > from the compressor internal fields itself, no?
> > >
> > > acomp_has_async_batching(acomp);
> > >
> > > Is this just for convenience, or is this actually an expensive thing to
> compute?
> >
> > Thanks for your comments. This is a good question. I tried not to imply that
> > batching resources have been allocated for the cpu based only on what
> > acomp_has_async_batching() returns. It is possible that the cpu onlining
> > code ran into an -ENOMEM error on any particular cpu. In this case, I set
> > the pool->can_batch to "false", mainly for convenience, so that zswap
> > can be somewhat insulated from migration. I agree that this may not be
> > the best solution; and whether or not batching is enabled can be directly
> > determined just before the call to crypto_acomp_batch_compress()
> > based on:
> >
> > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> 
> With ahash request chaining, the idea is to accumulate as much
> data as you can before you provide it to the Crypto API.  The
> API is responsible for dividing it up if the underlying driver
> is only able to handle one request at a time.
> 
> So that would be the ideal model to use for compression as well.
> Provide as much data as you can and let the API handle the case
> where the data needs to be divided up.

Thanks for this suggestion! This sounds like a clean way to handle the
batching/sequential compress/decompress within the crypto API as long
as it can be contained in the crypto acompress layer. 
If the zswap maintainers don't have any objections, I can look into the
feasibility of doing this.

Thanks,
Kanchana

> 
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Yosry Ahmed Dec. 3, 2024, 9:44 p.m. UTC | #5
On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> >
> > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > >
> > > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > > from the compressor internal fields itself, no?
> > > >
> > > > acomp_has_async_batching(acomp);
> > > >
> > > > Is this just for convenience, or is this actually an expensive thing to
> > compute?
> > >
> > > Thanks for your comments. This is a good question. I tried not to imply that
> > > batching resources have been allocated for the cpu based only on what
> > > acomp_has_async_batching() returns. It is possible that the cpu onlining
> > > code ran into an -ENOMEM error on any particular cpu. In this case, I set
> > > the pool->can_batch to "false", mainly for convenience, so that zswap
> > > can be somewhat insulated from migration. I agree that this may not be
> > > the best solution; and whether or not batching is enabled can be directly
> > > determined just before the call to crypto_acomp_batch_compress()
> > > based on:
> > >
> > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> >
> > With ahash request chaining, the idea is to accumulate as much
> > data as you can before you provide it to the Crypto API.  The
> > API is responsible for dividing it up if the underlying driver
> > is only able to handle one request at a time.
> >
> > So that would be the ideal model to use for compression as well.
> > Provide as much data as you can and let the API handle the case
> > where the data needs to be divided up.
>
> Thanks for this suggestion! This sounds like a clean way to handle the
> batching/sequential compress/decompress within the crypto API as long
> as it can be contained in the crypto acompress layer.
> If the zswap maintainers don't have any objections, I can look into the
> feasibility of doing this.

Does this mean that instead of zswap breaking down the folio into
SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
crypto layer and let it do the batching as it pleases?

It sounds nice on the surface, but this implies that we have to
allocate folio_nr_pages() buffers in zswap, essentially as the
allocation is the same size as the folio itself. While the allocation
does not need to be contiguous, making a large number of allocations
in the reclaim path is definitely not something we want. For a 2M THP,
we'd need to allocate 2M in zswap_store().

If we choose to keep preallocating, assuming the maximum THP size is
2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
memory.

I feel like I am missing something.

>
> Thanks,
> Kanchana
>
> >
> > Cheers,
> > --
> > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > Home Page: http://gondor.apana.org.au/~herbert/
> > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Kanchana P Sridhar Dec. 3, 2024, 10:17 p.m. UTC | #6
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Tuesday, December 3, 2024 1:44 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > >
> > > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > > >
> > > > > Why do we need this "can_batch" field? IIUC, this can be determined
> > > > > from the compressor internal fields itself, no?
> > > > >
> > > > > acomp_has_async_batching(acomp);
> > > > >
> > > > > Is this just for convenience, or is this actually an expensive thing to
> > > compute?
> > > >
> > > > Thanks for your comments. This is a good question. I tried not to imply
> that
> > > > batching resources have been allocated for the cpu based only on what
> > > > acomp_has_async_batching() returns. It is possible that the cpu onlining
> > > > code ran into an -ENOMEM error on any particular cpu. In this case, I
> set
> > > > the pool->can_batch to "false", mainly for convenience, so that zswap
> > > > can be somewhat insulated from migration. I agree that this may not be
> > > > the best solution; and whether or not batching is enabled can be directly
> > > > determined just before the call to crypto_acomp_batch_compress()
> > > > based on:
> > > >
> > > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> > >
> > > With ahash request chaining, the idea is to accumulate as much
> > > data as you can before you provide it to the Crypto API.  The
> > > API is responsible for dividing it up if the underlying driver
> > > is only able to handle one request at a time.
> > >
> > > So that would be the ideal model to use for compression as well.
> > > Provide as much data as you can and let the API handle the case
> > > where the data needs to be divided up.
> >
> > Thanks for this suggestion! This sounds like a clean way to handle the
> > batching/sequential compress/decompress within the crypto API as long
> > as it can be contained in the crypto acompress layer.
> > If the zswap maintainers don't have any objections, I can look into the
> > feasibility of doing this.
> 
> Does this mean that instead of zswap breaking down the folio into
> SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> crypto layer and let it do the batching as it pleases?

If I understand Herbert's suggestion correctly, I think what he meant was
that we allocate only SWAP_CRYPTO_BATCH_SIZE # of buffers in zswap (say, 8)
during the cpu onlining always. The acomp_has_async_batching() API can
be used to determine whether to allocate more than one acomp_req and
crypto_wait (fyi, I am creating SWAP_CRYPTO_BATCH_SIZE # of crypto_wait
for the request chaining with the goal of understanding performance wrt the
existing implementation of crypto_acomp_batch_compress()).
In zswap_store_folio(), we process the large folio in batches of 8 pages
and call "crypto_acomp_batch_compress()" for each batch. Based on earlier
discussions in this thread, it might make sense to add a bool option to
crypto_acomp_batch_compress() as follows:

static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
					       struct crypto_wait *waits[],
					       struct page *pages[],
					       u8 *dsts[],
					       unsigned int dlens[],
					       int errors[],
					       int nr_pages,
					       bool parallel);

zswap would acquire the per-cpu acomp_ctx->mutex, and pass
(acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) for the "parallel" parameter.
This indicates to crypto_acomp_batch_compress() whether or not
SWAP_CRYPTO_BATCH_SIZE # of elements are available in "reqs" and "waits".

If we have multiple "reqs" (parallel == true), we use request chaining (or the
existing asynchronous poll implementation) for IAA batching. If (parallel == false),
crypto_acomp_batch_compress() will look something like this:

static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
					       struct crypto_wait *waits[],
					       struct page *pages[],
					       u8 *dsts[],
					       unsigned int dlens[],
					       int errors[],
					       int nr_pages,
					       bool parallel)
{
	if (!parallel) {
		struct scatterlist input, output;
		int i;

		for (i = 0; i < nr_pages; ++i) {
			/*
			 * For pages[i], dsts[i], dlens[i]: borrow the first
			 * half of zswap_compress() functionality:
			 */
			sg_init_table(&input, 1);
			sg_set_page(&input, pages[i], PAGE_SIZE, 0);

			sg_init_one(&output, dsts[i], PAGE_SIZE * 2);
			acomp_request_set_params(reqs[0], &input, &output,
						 PAGE_SIZE, dlens[i]);

			errors[i] = crypto_wait_req(crypto_acomp_compress(reqs[0]),
						    waits[0]);
			dlens[i] = reqs[0]->dlen;
		}
	}

	/*
	 * At this point we would have sequentially compressed the batch
	 * (the parallel/request-chaining path is elided in this sketch).
	 * zswap_store_folio() can process the buffers and dlens using
	 * common code for batching and non-batching compressors.
	 */
	return true;
}

IIUC, this suggestion appears to be along the lines of using common
code in zswap as far as possible, for compressors that do and do not
support batching. Herbert can correct me if I am wrong.

If this is indeed the case, the memory penalty for software compressors
would be:
1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
    zswap_cpu_comp_prepare().
2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
    zswap_store_folio().

This would be an additional memory penalty for what we gain by
having the common code paths in zswap for compressors that do
and do not support batching.

Thanks,
Kanchana

> 
> It sounds nice on the surface, but this implies that we have to
> allocate folio_nr_pages() buffers in zswap, essentially as the
> allocation is the same size as the folio itself. While the allocation
> does not need to be contiguous, making a large number of allocations
> in the reclaim path is definitely not something we want. For a 2M THP,
> we'd need to allocate 2M in zswap_store().
> 
> If we choose to keep preallocating, assuming the maximum THP size is
> 2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
> memory.
> 
> I feel like I am missing something.
> 
> >
> > Thanks,
> > Kanchana
> >
> > >
> > > Cheers,
> > > --
> > > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > > Home Page: http://gondor.apana.org.au/~herbert/
> > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Kanchana P Sridhar Dec. 3, 2024, 10:24 p.m. UTC | #7
> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Tuesday, December 3, 2024 2:18 PM
> To: Yosry Ahmed <yosryahmed@google.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> davem@davemloft.net; clabbe@baylibre.com; ardb@kernel.org;
> ebiggers@google.com; surenb@google.com; Accardi, Kristen C
> <kristen.c.accardi@intel.com>; Feghali, Wajdi K <wajdi.k.feghali@intel.com>;
> Gopal, Vinodh <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> 
> >
> > On Tue, Dec 3, 2024 at 1:37 PM Sridhar, Kanchana P
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > >
> > > >
> > > > On Tue, Dec 03, 2024 at 12:30:30AM +0000, Sridhar, Kanchana P wrote:
> > > > >
> > > > > > Why do we need this "can_batch" field? IIUC, this can be
> determined
> > > > > > from the compressor internal fields itself, no?
> > > > > >
> > > > > > acomp_has_async_batching(acomp);
> > > > > >
> > > > > > Is this just for convenience, or is this actually an expensive thing to
> > > > compute?
> > > > >
> > > > > Thanks for your comments. This is a good question. I tried not to imply
> > that
> > > > > batching resources have been allocated for the cpu based only on what
> > > > > acomp_has_async_batching() returns. It is possible that the cpu
> onlining
> > > > > code ran into an -ENOMEM error on any particular cpu. In this case, I
> > set
> > > > > the pool->can_batch to "false", mainly for convenience, so that zswap
> > > > > can be somewhat insulated from migration. I agree that this may not
> be
> > > > > the best solution; and whether or not batching is enabled can be
> directly
> > > > > determined just before the call to crypto_acomp_batch_compress()
> > > > > based on:
> > > > >
> > > > > acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> > > >
> > > > With ahash request chaining, the idea is to accumulate as much
> > > > data as you can before you provide it to the Crypto API.  The
> > > > API is responsible for dividing it up if the underlying driver
> > > > is only able to handle one request at a time.
> > > >
> > > > So that would be the ideal model to use for compression as well.
> > > > Provide as much data as you can and let the API handle the case
> > > > where the data needs to be divided up.
> > >
> > > Thanks for this suggestion! This sounds like a clean way to handle the
> > > batching/sequential compress/decompress within the crypto API as long
> > > as it can be contained in the crypto acompress layer.
> > > If the zswap maintainers don't have any objections, I can look into the
> > > feasibility of doing this.
> >
> > Does this mean that instead of zswap breaking down the folio into
> > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > crypto layer and let it do the batching as it pleases?
> 
> If I understand Herbert's suggestion correctly, I think what he meant was
> that we allocate only SWAP_CRYPTO_BATCH_SIZE # of buffers in zswap (say, 8)
> during the cpu onlining always. The acomp_has_async_batching() API can
> be used to determine whether to allocate more than one acomp_req and
> crypto_wait (fyi, I am creating SWAP_CRYPTO_BATCH_SIZE # of crypto_wait
> for the request chaining with the goal of understanding performance wrt the
> existing implementation of crypto_acomp_batch_compress()).
> In zswap_store_folio(), we process the large folio in batches of 8 pages
> and call "crypto_acomp_batch_compress()" for each batch. Based on earlier
> discussions in this thread, it might make sense to add a bool option to
> crypto_acomp_batch_compress() as follows:
> 
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
> 					       struct crypto_wait *waits[],
> 					       struct page *pages[],
> 					       u8 *dsts[],
> 					       unsigned int dlens[],
> 					       int errors[],
> 					       int nr_pages,
> 					       bool parallel);
> 
> zswap would acquire the per-cpu acomp_ctx->mutex, and pass
> (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE) for the "parallel" parameter.
> This indicates to crypto_acomp_batch_compress() whether or not
> SWAP_CRYPTO_BATCH_SIZE # of elements are available in "reqs" and "waits".
> 
> If we have multiple "reqs" (parallel == true), we use request chaining (or the
> existing asynchronous poll implementation) for IAA batching. If (parallel == false),
> crypto_acomp_batch_compress() will look something like this:
> 
> static inline bool crypto_acomp_batch_compress(struct acomp_req *reqs[],
> 					       struct crypto_wait *waits[],
> 					       struct page *pages[],
> 					       u8 *dsts[],
> 					       unsigned int dlens[],
> 					       int errors[],
> 					       int nr_pages,
> 					       bool parallel)
> {
> 	if (!parallel) {
> 		struct scatterlist input, output;
> 		int i;
> 
> 		for (i = 0; i < nr_pages; ++i) {
> 			/* For pages[i], dsts[i], dlens[i]: borrow the first
> 			 * half of zswap_compress() functionality:
> 			 */
> 			sg_init_table(&input, 1);
> 			sg_set_page(&input, pages[i], PAGE_SIZE, 0);
> 			sg_init_one(&output, dsts[i], PAGE_SIZE * 2);
> 			acomp_request_set_params(reqs[0], &input, &output,
> 						 PAGE_SIZE, dlens[i]);
> 
> 			errors[i] = crypto_wait_req(
> 					crypto_acomp_compress(reqs[0]),
> 					waits[0]);
> 			dlens[i] = reqs[0]->dlen;
> 		}
> 	}
> 
> 	/*
> 	 * At this point we would have sequentially compressed the batch.
> 	 * zswap_store_folio() can process the buffers and dlens using
> 	 * common code for batching and non-batching compressors.
> 	 */
> }
> 
> IIUC, this suggestion appears to be along the lines of using common
> code in zswap as far as possible, for compressors that do and do not
> support batching. Herbert can correct me if I am wrong.
> 
> If this is indeed the case, the memory penalty for software compressors
> would be:
> 1) pre-allocating SWAP_CRYPTO_BATCH_SIZE acomp_ctx->buffers in
>     zswap_cpu_comp_prepare().
> 2) SWAP_CRYPTO_BATCH_SIZE stack variables for pages and dlens in
>     zswap_store_folio().
> 
> This would be an additional memory penalty for what we gain by
> having the common code paths in zswap for compressors that do
> and do not support batching.

Alternately, we could use request chaining always, even for software
compressors for a larger memory penalty per-cpu, by allocating
SWAP_CRYPTO_BATCH_SIZE # of reqs/waits by default. I don't know
if this would have functional issues because the chain of requests
will be processed sequentially (basically all requests are added to a
list), but maybe Herbert is suggesting this (not sure).

Thanks,
Kanchana

> 
> Thanks,
> Kanchana
> 
> >
> > It sounds nice on the surface, but this implies that we have to
> > allocate folio_nr_pages() buffers in zswap, essentially as the
> > allocation is the same size as the folio itself. While the allocation
> > does not need to be contiguous, making a large number of allocations
> > in the reclaim path is definitely not something we want. For a 2M THP,
> > we'd need to allocate 2M in zswap_store().
> >
> > If we choose to keep preallocating, assuming the maximum THP size is
> > 2M, we need to allocate 2M * nr_cpus worth of buffers. That's a lot of
> > memory.
> >
> > I feel like I am missing something.
> >
> > >
> > > Thanks,
> > > Kanchana
> > >
> > > >
> > > > Cheers,
> > > > --
> > > > Email: Herbert Xu <herbert@gondor.apana.org.au>
> > > > Home Page: http://gondor.apana.org.au/~herbert/
> > > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Herbert Xu Dec. 4, 2024, 1:42 a.m. UTC | #8
On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
>
> Does this mean that instead of zswap breaking down the folio into
> SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> crypto layer and let it do the batching as it pleases?

You provide as much (or little) as you're comfortable with.  Just
treat the acomp API as one that can take as much as you want to
give it.

Cheers,
Yosry Ahmed Dec. 4, 2024, 10:35 p.m. UTC | #9
On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> >
> > Does this mean that instead of zswap breaking down the folio into
> > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > crypto layer and let it do the batching as it pleases?
>
> You provide as much (or little) as you're comfortable with.  Just
> treat the acomp API as one that can take as much as you want to
> give it.

In this case, it seems like the batch size is completely up to zswap,
and not necessarily dependent on the compressor. That being said,
Intel IAA will naturally prefer a batch size that maximizes the
parallelization.

How about this, we can define a fixed max batch size in zswap, to
provide a hard limit on the number of buffers we preallocate (e.g.
MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
min(MAX_BATCH_SIZE, compressor_batch_size).

Assuming software compressors provide 1 for the batch size, if
MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
software compressors get the same behavior as today. This abstracts
the batch size needed by the compressor while making sure zswap does
not preallocate a ridiculous amount of memory.

Does this make sense to everyone or am I missing something?
Kanchana P Sridhar Dec. 4, 2024, 10:49 p.m. UTC | #10
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, December 4, 2024 2:36 PM
> To: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au>
> wrote:
> >
> > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > >
> > > Does this mean that instead of zswap breaking down the folio into
> > > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > > crypto layer and let it do the batching as it pleases?
> >
> > You provide as much (or little) as you're comfortable with.  Just
> > treat the acomp API as one that can take as much as you want to
> > give it.
> 
> In this case, it seems like the batch size is completely up to zswap,
> and not necessarily dependent on the compressor. That being said,
> Intel IAA will naturally prefer a batch size that maximizes the
> parallelization.
> 
> How about this, we can define a fixed max batch size in zswap, to
> provide a hard limit on the number of buffers we preallocate (e.g.
> MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> min(MAX_BATCH_SIZE, compressor_batch_size).
> 
> Assuming software compressors provide 1 for the batch size, if
> MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
> software compressors get the same behavior as today. This abstracts
> the batch size needed by the compressor while making sure zswap does
> not preallocate a ridiculous amount of memory.
> 
> Does this make sense to everyone or am I missing something?

Thanks Yosry, this makes perfect sense. I can declare a default
CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can
query, acomp_get_batch_size(struct crypto_acomp *tfm) that
can call a crypto algorithm interface if it is registered, for e.g.
crypto_get_batch_size() that IAA can register to return the max
batch size for IAA. If a compressor does not provide an
implementation for crypto_get_batch_size(), we would return
CRYPTO_ACOMP_BATCH_SIZE. This way, nothing specific will
need to be done for the software compressors for now. Unless
they define a specific batch_size via say, another interface,
crypto_set_batch_size(), the acomp_get_batch_size() will return 1.

Thanks,
Kanchana
Yosry Ahmed Dec. 4, 2024, 10:55 p.m. UTC | #11
On Wed, Dec 4, 2024 at 2:49 PM Sridhar, Kanchana P
<kanchana.p.sridhar@intel.com> wrote:
>
>
> >
> > On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu <herbert@gondor.apana.org.au>
> > wrote:
> > >
> > > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > > >
> > > > Does this mean that instead of zswap breaking down the folio into
> > > > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to the
> > > > crypto layer and let it do the batching as it pleases?
> > >
> > > You provide as much (or little) as you're comfortable with.  Just
> > > treat the acomp API as one that can take as much as you want to
> > > give it.
> >
> > In this case, it seems like the batch size is completely up to zswap,
> > and not necessarily dependent on the compressor. That being said,
> > Intel IAA will naturally prefer a batch size that maximizes the
> > parallelization.
> >
> > How about this, we can define a fixed max batch size in zswap, to
> > provide a hard limit on the number of buffers we preallocate (e.g.
> > MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> > desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> > min(MAX_BATCH_SIZE, compressor_batch_size).
> >
> > Assuming software compressors provide 1 for the batch size, if
> > MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
> > software compressors get the same behavior as today. This abstracts
> > the batch size needed by the compressor while making sure zswap does
> > not preallocate a ridiculous amount of memory.
> >
> > Does this make sense to everyone or am I missing something?
>
> Thanks Yosry, this makes perfect sense. I can declare a default
> CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can
> query, acomp_get_batch_size(struct crypto_acomp *tfm) that
> can call a crypto algorithm interface if it is registered, for e.g.
> crypto_get_batch_size() that IAA can register to return the max
> batch size for IAA. If a compressor does not provide an
> implementation for crypto_get_batch_size(), we would return
> CRYPTO_ACOMP_BATCH_SIZE. This way, nothing specific will
> need to be done for the software compressors for now. Unless
> they define a specific batch_size via say, another interface,
> crypto_set_batch_size(), the acomp_get_batch_size() will return 1.

I still think zswap should define its own maximum to avoid having the
compressors have complete control over the amount of memory that zswap
preallocates.

For the acomp stuff I will let Herbert decide what he thinks is best.
Kanchana P Sridhar Dec. 4, 2024, 11:12 p.m. UTC | #12
> -----Original Message-----
> From: Yosry Ahmed <yosryahmed@google.com>
> Sent: Wednesday, December 4, 2024 2:55 PM
> To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>; Nhat Pham
> <nphamcs@gmail.com>; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; chengming.zhou@linux.dev;
> usamaarif642@gmail.com; ryan.roberts@arm.com; ying.huang@intel.com;
> 21cnbao@gmail.com; akpm@linux-foundation.org; linux-
> crypto@vger.kernel.org; davem@davemloft.net; clabbe@baylibre.com;
> ardb@kernel.org; ebiggers@google.com; surenb@google.com; Accardi,
> Kristen C <kristen.c.accardi@intel.com>; Feghali, Wajdi K
> <wajdi.k.feghali@intel.com>; Gopal, Vinodh <vinodh.gopal@intel.com>
> Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> On Wed, Dec 4, 2024 at 2:49 PM Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com> wrote:
> >
> >
> > >
> > > On Tue, Dec 3, 2024 at 5:42 PM Herbert Xu
> <herbert@gondor.apana.org.au>
> > > wrote:
> > > >
> > > > On Tue, Dec 03, 2024 at 01:44:00PM -0800, Yosry Ahmed wrote:
> > > > >
> > > > > Does this mean that instead of zswap breaking down the folio into
> > > > > SWAP_CRYPTO_BATCH_SIZE -sized batches, we pass all the pages to
> the
> > > > > crypto layer and let it do the batching as it pleases?
> > > >
> > > > You provide as much (or little) as you're comfortable with.  Just
> > > > treat the acomp API as one that can take as much as you want to
> > > > give it.
> > >
> > > In this case, it seems like the batch size is completely up to zswap,
> > > and not necessarily dependent on the compressor. That being said,
> > > Intel IAA will naturally prefer a batch size that maximizes the
> > > parallelization.
> > >
> > > How about this, we can define a fixed max batch size in zswap, to
> > > provide a hard limit on the number of buffers we preallocate (e.g.
> > > MAX_BATCH_SIZE). The compressors can provide zswap a hint with their
> > > desired batch size (e.g. 8 for Intel IAA). Then zswap can allocate
> > > min(MAX_BATCH_SIZE, compressor_batch_size).
> > >
> > > Assuming software compressors provide 1 for the batch size, if
> > > MAX_BATCH_SIZE is >= 8, Intel IAA gets the batching rate it wants, and
> > > software compressors get the same behavior as today. This abstracts
> > > the batch size needed by the compressor while making sure zswap does
> > > not preallocate a ridiculous amount of memory.
> > >
> > > Does this make sense to everyone or am I missing something?
> >
> > Thanks Yosry, this makes perfect sense. I can declare a default
> > CRYPTO_ACOMP_BATCH_SIZE=1, and a crypto API that zswap can
> > query, acomp_get_batch_size(struct crypto_acomp *tfm) that
> > can call a crypto algorithm interface if it is registered, for e.g.
> > crypto_get_batch_size() that IAA can register to return the max
> > batch size for IAA. If a compressor does not provide an
> > implementation for crypto_get_batch_size(), we would return
> > CRYPTO_ACOMP_BATCH_SIZE. This way, nothing specific will
> > need to be done for the software compressors for now. Unless
> > they define a specific batch_size via say, another interface,
> > crypto_set_batch_size(), the acomp_get_batch_size() will return 1.
> 
> I still think zswap should define its own maximum to avoid having the
> compressors have complete control over the amount of memory that zswap
> preallocates.

For sure, zswap should set the MAX_BATCH_SIZE for this purpose.

> 
> For the acomp stuff I will let Herbert decide what he thinks is best.
> From the zswap side, I just want:
> - A hard limit on the amount of memory we preallocate.
> - No change for the software compressors.

Sounds good!

Thanks,
Kanchana

> 
> >
> > Thanks,
> > Kanchana
Kanchana P Sridhar Dec. 21, 2024, 6:30 a.m. UTC | #13
Hi Nhat,

> -----Original Message-----
> From: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> Sent: Monday, December 2, 2024 4:31 PM
> To: Nhat Pham <nphamcs@gmail.com>
> Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> hannes@cmpxchg.org; yosryahmed@google.com;
> chengming.zhou@linux.dev; usamaarif642@gmail.com;
> ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> herbert@gondor.apana.org.au; davem@davemloft.net;
> clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> <vinodh.gopal@intel.com>; Sridhar, Kanchana P
> <kanchana.p.sridhar@intel.com>
> Subject: RE: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources if
> the crypto_alg supports batching.
> 
> Hi Nhat,
> 
> > -----Original Message-----
> > From: Nhat Pham <nphamcs@gmail.com>
> > Sent: Monday, December 2, 2024 11:16 AM
> > To: Sridhar, Kanchana P <kanchana.p.sridhar@intel.com>
> > Cc: linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> > hannes@cmpxchg.org; yosryahmed@google.com;
> > chengming.zhou@linux.dev; usamaarif642@gmail.com;
> > ryan.roberts@arm.com; ying.huang@intel.com; 21cnbao@gmail.com;
> > akpm@linux-foundation.org; linux-crypto@vger.kernel.org;
> > herbert@gondor.apana.org.au; davem@davemloft.net;
> > clabbe@baylibre.com; ardb@kernel.org; ebiggers@google.com;
> > surenb@google.com; Accardi, Kristen C <kristen.c.accardi@intel.com>;
> > Feghali, Wajdi K <wajdi.k.feghali@intel.com>; Gopal, Vinodh
> > <vinodh.gopal@intel.com>
> > Subject: Re: [PATCH v4 09/10] mm: zswap: Allocate pool batching resources
> if
> > the crypto_alg supports batching.
> >
> > On Fri, Nov 22, 2024 at 11:01 PM Kanchana P Sridhar
> > <kanchana.p.sridhar@intel.com> wrote:
> > >
> > > This patch does the following:
> > >
> > > 1) Modifies the definition of "struct crypto_acomp_ctx" to represent a
> > >    configurable number of acomp_reqs and buffers. Adds a "nr_reqs" to
> > >    "struct crypto_acomp_ctx" to contain the nr of resources that will be
> > >    allocated in the cpu onlining code.
> > >
> > > 2) The zswap_cpu_comp_prepare() cpu onlining code will detect if the
> > >    crypto_acomp created for the pool (in other words, the zswap
> > compression
> > >    algorithm) has registered an implementation for batch_compress() and
> > >    batch_decompress(). If so, it will set "nr_reqs" to
> > >    SWAP_CRYPTO_BATCH_SIZE and allocate these many reqs/buffers, and
> > set
> > >    the acomp_ctx->nr_reqs accordingly. If the crypto_acomp does not
> > support
> > >    batching, "nr_reqs" defaults to 1.
> > >
> > > 3) Adds a "bool can_batch" to "struct zswap_pool" that step (2) will set to
> > >    true if the batching API are present for the crypto_acomp.
> >
> > Why do we need this "can_batch" field? IIUC, this can be determined
> > from the compressor internal fields itself, no?
> >
> > acomp_has_async_batching(acomp);
> >
> > Is this just for convenience, or is this actually an expensive thing
> > to compute?
> 
> Thanks for your comments. This is a good question. I tried not to imply that
> batching resources have been allocated for the cpu based only on what
> acomp_has_async_batching() returns. It is possible that the cpu onlining
> code ran into an -ENOMEM error on any particular cpu. In this case, I set
> the pool->can_batch to "false", mainly for convenience, so that zswap
> can be somewhat insulated from migration. I agree that this may not be
> the best solution; and whether or not batching is enabled can be directly
> determined just before the call to crypto_acomp_batch_compress()
> based on:
> 
> acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE;
> 
> I currently have a BUG_ON() for this condition not being met, that relies
> on the pool->can_batch gating the flow to get to zswap_batch_compress().
> 
> I think a better solution would be to check for having
> SWAP_CRYPTO_BATCH_SIZE # of acomp_ctx resources right after we acquire
> the acomp_ctx->mutex and before the call to
> crypto_acomp_batch_compress(). If so, we proceed, and if not, we call
> crypto_acomp_compress(). It seems this might be the only way to know
> for sure whether the crypto batching API can be called, given that
> migration is possible at any point in zswap_store(). Once we have
> obtained the mutex_lock, it seems we can proceed with batching based on
> this check (although the UAF situation remains as a larger issue,
> beyond the scope of this patch). I would appreciate other ideas as
> well.
> 
> Also, I have submitted a patch-series [1] with Yosry's & Johannes'
> suggestions to this series. This is setting up a consolidated
> zswap_store()/zswap_store_pages() code path for batching and
> non-batching compressors. My goal is for [1] to go through code reviews
> and be able to transition to batching, with a simple check:
> 
> if (acomp_ctx->nr_reqs == SWAP_CRYPTO_BATCH_SIZE)
>          zswap_batch_compress();
> else
>          zswap_compress();
> 
> Please feel free to provide code review comments in [1]. Thanks!
> 
> [1]: https://patchwork.kernel.org/project/linux-mm/list/?series=912937
> 
> >
> > >
> > > SWAP_CRYPTO_BATCH_SIZE is set to 8, which will be the IAA compress
> > > batching
> >
> > I like a sane default value as much as the next guy, but this seems a
> > bit odd to me:
> >
> > 1. The placement of this constant/default value seems strange to me.
> > This is a compressor-specific value no? Why are we enforcing this
> > batching size at the zswap level, and uniformly at that? What if we
> > introduce a new batch compression algorithm...? Or am I missing
> > something, and this is a sane default for other compressors too?
> 
> You bring up an excellent point. This is a compressor-specific value.
> Instead of setting this up as a constant, which as you correctly observe,
> may not make sense for a non-IAA compressor, one way to get
> this could be by querying the compressor, say:
> 
> int acomp_get_max_batchsize(struct crypto_acomp *tfm) {...};
> 
> to then allocate sufficient acomp_reqs/buffers/etc. in the zswap
> cpu onlining code.
> 
> >
> > 2. Why is this value set to 8? Experimentation? Could you add some
> > justification in documentation?
> 
> Can I get back to you later this week with a proposal for this? We plan
> to have a team discussion on how best to approach this for current
> and future hardware.

Sorry it took me quite a while to get back to you on this. I have been busy
with implementing request chaining, and other major improvements to this
series based on the comments received thus far.

I will be submitting a v5 of this series shortly, in which I have implemented
an IAA_CRYPTO_MAX_BATCH_SIZE in the iaa_crypto driver. For now I set this
to 8 since we have done all our testing with a batch size of 8, but we are still
running experiments to figure this out, hence this #define in the iaa_crypto
driver (in v5) can potentially change. Further, there is a zswap-specific
ZSWAP_MAX_BATCH_SIZE in v5, which is also 8. I would appreciate code
review comments for v5. If the approach I've taken in v5 is acceptable, I
will add more details/justification in the documentation in a v6.

Thanks,
Kanchana

> 
> Thanks,
> Kanchana
diff mbox series

Patch

diff --git a/include/linux/zswap.h b/include/linux/zswap.h
index d961ead91bf1..9ad27ab3d222 100644
--- a/include/linux/zswap.h
+++ b/include/linux/zswap.h
@@ -7,6 +7,13 @@ 
 
 struct lruvec;
 
+/*
+ * For IAA compression batching:
+ * Maximum number of IAA acomp compress requests that will be processed
+ * in a batch: in parallel, if iaa_crypto async/no irq mode is enabled
+ * (the default); else sequentially, if iaa_crypto sync mode is in effect.
+ */
+#define SWAP_CRYPTO_BATCH_SIZE 8UL
 extern atomic_long_t zswap_stored_pages;
 
 #ifdef CONFIG_ZSWAP
diff --git a/mm/zswap.c b/mm/zswap.c
index f6316b66fb23..173f7632990e 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -143,9 +143,10 @@  bool zswap_never_enabled(void)
 
 struct crypto_acomp_ctx {
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
+	struct acomp_req **reqs;
+	u8 **buffers;
+	unsigned int nr_reqs;
 	struct crypto_wait wait;
-	u8 *buffer;
 	struct mutex mutex;
 	bool is_sleepable;
 };
@@ -158,6 +159,7 @@  struct crypto_acomp_ctx {
  */
 struct zswap_pool {
 	struct zpool *zpool;
+	bool can_batch;
 	struct crypto_acomp_ctx __percpu *acomp_ctx;
 	struct percpu_ref ref;
 	struct list_head list;
@@ -285,6 +287,8 @@  static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
 		goto error;
 	}
 
+	pool->can_batch = false;
+
 	ret = cpuhp_state_add_instance(CPUHP_MM_ZSWP_POOL_PREPARE,
 				       &pool->node);
 	if (ret)
@@ -818,49 +822,90 @@  static int zswap_cpu_comp_prepare(unsigned int cpu, struct hlist_node *node)
 	struct zswap_pool *pool = hlist_entry(node, struct zswap_pool, node);
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 	struct crypto_acomp *acomp;
-	struct acomp_req *req;
-	int ret;
+	unsigned int nr_reqs = 1;
+	int ret = -ENOMEM;
+	int i, j;
 
 	mutex_init(&acomp_ctx->mutex);
-
-	acomp_ctx->buffer = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
-	if (!acomp_ctx->buffer)
-		return -ENOMEM;
+	acomp_ctx->nr_reqs = 0;
 
 	acomp = crypto_alloc_acomp_node(pool->tfm_name, 0, 0, cpu_to_node(cpu));
 	if (IS_ERR(acomp)) {
 		pr_err("could not alloc crypto acomp %s : %ld\n",
 				pool->tfm_name, PTR_ERR(acomp));
-		ret = PTR_ERR(acomp);
-		goto acomp_fail;
+		return PTR_ERR(acomp);
 	}
 	acomp_ctx->acomp = acomp;
 	acomp_ctx->is_sleepable = acomp_is_async(acomp);
 
-	req = acomp_request_alloc(acomp_ctx->acomp);
-	if (!req) {
-		pr_err("could not alloc crypto acomp_request %s\n",
-		       pool->tfm_name);
-		ret = -ENOMEM;
+	/*
+	 * Create the necessary batching resources if the crypto acomp alg
+	 * implements the batch_compress and batch_decompress API.
+	 */
+	if (acomp_has_async_batching(acomp)) {
+		pool->can_batch = true;
+		nr_reqs = SWAP_CRYPTO_BATCH_SIZE;
+		pr_info_once("Creating acomp_ctx with %d reqs for batching since crypto acomp %s\nhas registered batch_compress() and batch_decompress()\n",
+			nr_reqs, pool->tfm_name);
+	}
+
+	acomp_ctx->buffers = kmalloc_node(nr_reqs * sizeof(u8 *), GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->buffers)
+		goto buf_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->buffers[i] = kmalloc_node(PAGE_SIZE * 2, GFP_KERNEL, cpu_to_node(cpu));
+		if (!acomp_ctx->buffers[i]) {
+			for (j = 0; j < i; ++j)
+				kfree(acomp_ctx->buffers[j]);
+			kfree(acomp_ctx->buffers);
+			ret = -ENOMEM;
+			goto buf_fail;
+		}
+	}
+
+	acomp_ctx->reqs = kmalloc_node(nr_reqs * sizeof(struct acomp_req *), GFP_KERNEL, cpu_to_node(cpu));
+	if (!acomp_ctx->reqs)
 		goto req_fail;
+
+	for (i = 0; i < nr_reqs; ++i) {
+		acomp_ctx->reqs[i] = acomp_request_alloc(acomp_ctx->acomp);
+		if (!acomp_ctx->reqs[i]) {
+			pr_err("could not alloc crypto acomp_request reqs[%d] %s\n",
+			       i, pool->tfm_name);
+			for (j = 0; j < i; ++j)
+				acomp_request_free(acomp_ctx->reqs[j]);
+			kfree(acomp_ctx->reqs);
+			ret = -ENOMEM;
+			goto req_fail;
+		}
 	}
-	acomp_ctx->req = req;
 
+	/*
+	 * The crypto_wait is used only in fully synchronous, i.e., with scomp
+	 * or non-poll mode of acomp, hence there is only one "wait" per
+	 * acomp_ctx, with callback set to reqs[0], under the assumption that
+	 * there is at least 1 request per acomp_ctx.
+	 */
 	crypto_init_wait(&acomp_ctx->wait);
 	/*
 	 * if the backend of acomp is async zip, crypto_req_done() will wakeup
 	 * crypto_wait_req(); if the backend of acomp is scomp, the callback
 	 * won't be called, crypto_wait_req() will return without blocking.
 	 */
-	acomp_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
+	acomp_request_set_callback(acomp_ctx->reqs[0], CRYPTO_TFM_REQ_MAY_BACKLOG,
 				   crypto_req_done, &acomp_ctx->wait);
 
+	acomp_ctx->nr_reqs = nr_reqs;
 	return 0;
 
 req_fail:
+	for (i = 0; i < nr_reqs; ++i)
+		kfree(acomp_ctx->buffers[i]);
+	kfree(acomp_ctx->buffers);
+buf_fail:
 	crypto_free_acomp(acomp_ctx->acomp);
-acomp_fail:
-	kfree(acomp_ctx->buffer);
+	pool->can_batch = false;
 	return ret;
 }
 
@@ -870,11 +915,22 @@  static int zswap_cpu_comp_dead(unsigned int cpu, struct hlist_node *node)
 	struct crypto_acomp_ctx *acomp_ctx = per_cpu_ptr(pool->acomp_ctx, cpu);
 
 	if (!IS_ERR_OR_NULL(acomp_ctx)) {
-		if (!IS_ERR_OR_NULL(acomp_ctx->req))
-			acomp_request_free(acomp_ctx->req);
+		int i;
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			if (!IS_ERR_OR_NULL(acomp_ctx->reqs[i]))
+				acomp_request_free(acomp_ctx->reqs[i]);
+		kfree(acomp_ctx->reqs);
+
+		for (i = 0; i < acomp_ctx->nr_reqs; ++i)
+			kfree(acomp_ctx->buffers[i]);
+		kfree(acomp_ctx->buffers);
+
 		if (!IS_ERR_OR_NULL(acomp_ctx->acomp))
 			crypto_free_acomp(acomp_ctx->acomp);
-		kfree(acomp_ctx->buffer);
+
+		acomp_ctx->nr_reqs = 0;
+		acomp_ctx = NULL;
 	}
 
 	return 0;
@@ -897,7 +953,7 @@  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 
 	mutex_lock(&acomp_ctx->mutex);
 
-	dst = acomp_ctx->buffer;
+	dst = acomp_ctx->buffers[0];
 	sg_init_table(&input, 1);
 	sg_set_page(&input, page, PAGE_SIZE, 0);
 
@@ -907,7 +963,7 @@  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * giving the dst buffer with enough length to avoid buffer overflow.
 	 */
 	sg_init_one(&output, dst, PAGE_SIZE * 2);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, PAGE_SIZE, dlen);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, PAGE_SIZE, dlen);
 
 	/*
 	 * it maybe looks a little bit silly that we send an asynchronous request,
@@ -921,8 +977,8 @@  static bool zswap_compress(struct page *page, struct zswap_entry *entry,
 	 * but in different threads running on different cpu, we have different
 	 * acomp instance, so multiple threads can do (de)compression in parallel.
 	 */
-	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->req), &acomp_ctx->wait);
-	dlen = acomp_ctx->req->dlen;
+	comp_ret = crypto_wait_req(crypto_acomp_compress(acomp_ctx->reqs[0]), &acomp_ctx->wait);
+	dlen = acomp_ctx->reqs[0]->dlen;
 	if (comp_ret)
 		goto unlock;
 
@@ -975,20 +1031,20 @@  static void zswap_decompress(struct zswap_entry *entry, struct folio *folio)
 	 */
 	if ((acomp_ctx->is_sleepable && !zpool_can_sleep_mapped(zpool)) ||
 	    !virt_addr_valid(src)) {
-		memcpy(acomp_ctx->buffer, src, entry->length);
-		src = acomp_ctx->buffer;
+		memcpy(acomp_ctx->buffers[0], src, entry->length);
+		src = acomp_ctx->buffers[0];
 		zpool_unmap_handle(zpool, entry->handle);
 	}
 
 	sg_init_one(&input, src, entry->length);
 	sg_init_table(&output, 1);
 	sg_set_folio(&output, folio, PAGE_SIZE, 0);
-	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, PAGE_SIZE);
-	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait));
-	BUG_ON(acomp_ctx->req->dlen != PAGE_SIZE);
+	acomp_request_set_params(acomp_ctx->reqs[0], &input, &output, entry->length, PAGE_SIZE);
+	BUG_ON(crypto_wait_req(crypto_acomp_decompress(acomp_ctx->reqs[0]), &acomp_ctx->wait));
+	BUG_ON(acomp_ctx->reqs[0]->dlen != PAGE_SIZE);
 	mutex_unlock(&acomp_ctx->mutex);
 
-	if (src != acomp_ctx->buffer)
+	if (src != acomp_ctx->buffers[0])
 		zpool_unmap_handle(zpool, entry->handle);
 }