Message ID: 20170720114000.13840-1-ard.biesheuvel@linaro.org
On Thu, Jul 20, 2017 at 12:40:00PM +0100, Ard Biesheuvel wrote:
> The scompress code unconditionally allocates 2 per-CPU scratch buffers
> of 128 KB each, in order to avoid allocation overhead in the async
> wrapper that encapsulates the synchronous compression algorithm, since
> it may execute in atomic context.

The whole point of pre-allocation is that we cannot allocate 128K
(or 64K as it was before scomp) at run-time, and in particular,
for IPsec which runs in softirq path. Am I missing something?

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
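To put numbers on Herbert's constraint: in the softirq path only GFP_ATOMIC is available, and a 128 KB buffer is an order-5 allocation on 4 KB pages, which is exactly the kind of high-order atomic allocation that fails readily once memory is fragmented. A minimal illustration (hypothetical helper, not from the thread or the patch):

        #include <linux/gfp.h>
        #include <linux/sizes.h>

        /*
         * Hypothetical sketch: allocating the scratch buffer on demand from
         * softirq context. GFP_ATOMIC cannot sleep or reclaim, and
         * get_order(SZ_128K) == 5 with 4 KB pages, so this is precisely the
         * failure-prone case that the pre-allocated per-CPU scratch buffers
         * exist to avoid.
         */
        static void *scomp_scratch_alloc_in_softirq(void)
        {
                return (void *)__get_free_pages(GFP_ATOMIC, get_order(SZ_128K));
        }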
On 21 July 2017 at 13:42, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Thu, Jul 20, 2017 at 12:40:00PM +0100, Ard Biesheuvel wrote:
>> The scompress code unconditionally allocates 2 per-CPU scratch buffers
>> of 128 KB each, in order to avoid allocation overhead in the async
>> wrapper that encapsulates the synchronous compression algorithm, since
>> it may execute in atomic context.
>
> The whole point of pre-allocation is that we cannot allocate 128K
> (or 64K as it was before scomp) at run-time, and in particular,
> for IPsec which runs in softirq path. Am I missing something?

Right. And is req->dst guaranteed to be assigned in that case? Because
crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
scatterlist if req->dst == NULL.

Is there any way we could make these scratch buffers part of the
request structure instead? Or at least defer allocating them until the
first call to crypto_scomp_init_tfm()? And on top of that, we should
probably only use the per-CPU scratch buffers if
CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
pre-emptible to begin with, and the concern does not apply.
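For reference, here is a condensed paraphrase (not a verbatim excerpt) of the 4.13-era scomp_acomp_comp_decomp() under discussion, with the length validation and the final copy-out elided. The hazard Ard describes is visible: get_cpu() disables preemption, yet the destination scatterlist may still be allocated with GFP_KERNEL.

        #include <crypto/internal/acompress.h>
        #include <crypto/internal/scompress.h>
        #include <crypto/scatterwalk.h>

        /* file-scope per-CPU scratch pointers in crypto/scompress.c;
         * crypto_scomp_sg_alloc() is a static helper in the same file
         */
        static void * __percpu *scomp_src_scratches;
        static void * __percpu *scomp_dst_scratches;

        static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
        {
                struct crypto_acomp *tfm = crypto_acomp_reqtfm(req);
                void **tfm_ctx = acomp_tfm_ctx(tfm);
                struct crypto_scomp *scomp = *tfm_ctx;
                void **ctx = acomp_request_ctx(req);
                const int cpu = get_cpu();      /* preemption disabled from here */
                u8 *scratch_src = *per_cpu_ptr(scomp_src_scratches, cpu);
                u8 *scratch_dst = *per_cpu_ptr(scomp_dst_scratches, cpu);
                int ret;

                scatterwalk_map_and_copy(scratch_src, req->src, 0, req->slen, 0);
                if (dir)
                        ret = crypto_scomp_compress(scomp, scratch_src, req->slen,
                                                    scratch_dst, &req->dlen, *ctx);
                else
                        ret = crypto_scomp_decompress(scomp, scratch_src, req->slen,
                                                      scratch_dst, &req->dlen, *ctx);
                if (!ret && !req->dst)
                        /* may sleep, although preemption is disabled */
                        req->dst = crypto_scomp_sg_alloc(req->dlen,
                                        req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
                                        GFP_KERNEL : GFP_ATOMIC);

                put_cpu();                      /* preemption re-enabled */
                return ret;
        }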
On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
>
> Right. And is req->dst guaranteed to be assigned in that case? Because
> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
> scatterlist if req->dst == NULL.
>
> Is there any way we could make these scratch buffers part of the
> request structure instead? Or at least defer allocating them until the
> first call to crypto_scomp_init_tfm()? And on top of that, we should
> probably only use the per-CPU scratch buffers if
> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
> pre-emptible to begin with, and the concern does not apply.

Well for now scomp is new code so nothing actually uses it. But
the idea is to convert IPcomp over to acomp and scomp will fill in
as the compatibility glue.

The medium/long-term plan is to convert IPcomp over to a partial
decompression model so that we can allocate pages instead of a
contiguous chunk as we do now.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On 21 July 2017 at 14:11, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
>>
>> Right. And is req->dst guaranteed to be assigned in that case? Because
>> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
>> scatterlist if req->dst == NULL.
>>
>> Is there any way we could make these scratch buffers part of the
>> request structure instead? Or at least defer allocating them until the
>> first call to crypto_scomp_init_tfm()? And on top of that, we should
>> probably only use the per-CPU scratch buffers if
>> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
>> pre-emptible to begin with, and the concern does not apply.
>
> Well for now scomp is new code so nothing actually uses it. But
> the idea is to convert IPcomp over to acomp and scomp will fill in
> as the compatibility glue.
>

Yes, I guessed as much.

> The medium/long-term plan is to convert IPcomp over to a partial
> decompression model so that we can allocate pages instead of a
> contiguous chunk as we do now.
>

OK, but that doesn't really answer any of my questions:
- Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
  exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
  crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
  other, because the current situation is buggy.
- Could we move the scratch buffers to the request structure instead?
  Or may that be allocated from softirq context as well?
- Would you mind a patch that defers the allocation of the per-CPU
  buffers to the first invocation of crypto_scomp_init_tfm()?
- Would you mind a patch that makes the code only use the per-CPU
  buffers if we are running atomically to begin with?

Thanks,
Ard.
On 21 July 2017 at 14:24, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> On 21 July 2017 at 14:11, Herbert Xu <herbert@gondor.apana.org.au> wrote:
>> On Fri, Jul 21, 2017 at 02:09:39PM +0100, Ard Biesheuvel wrote:
>>>
>>> Right. And is req->dst guaranteed to be assigned in that case? Because
>>> crypto_scomp_sg_alloc() happily allocates pages and kmalloc()s the
>>> scatterlist if req->dst == NULL.
>>>
>>> Is there any way we could make these scratch buffers part of the
>>> request structure instead? Or at least defer allocating them until the
>>> first call to crypto_scomp_init_tfm()? And on top of that, we should
>>> probably only use the per-CPU scratch buffers if
>>> CRYPTO_TFM_REQ_MAY_SLEEP is cleared, because in that case, we are not
>>> pre-emptible to begin with, and the concern does not apply.
>>
>> Well for now scomp is new code so nothing actually uses it. But
>> the idea is to convert IPcomp over to acomp and scomp will fill in
>> as the compatibility glue.
>>
>
> Yes, I guessed as much.
>
>> The medium/long-term plan is to convert IPcomp over to a partial
>> decompression model so that we can allocate pages instead of a
>> contiguous chunk as we do now.
>>
>
> OK, but that doesn't really answer any of my questions:
> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
>   exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
>   crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
>   other, because the current situation is buggy.

I mean CRYPTO_ACOMP_ALLOC_OUTPUT should not be set if
CRYPTO_TFM_REQ_MAY_SLEEP is cleared.

> - Could we move the scratch buffers to the request structure instead?
>   Or may that be allocated from softirq context as well?
> - Would you mind a patch that defers the allocation of the per-CPU
>   buffers to the first invocation of crypto_scomp_init_tfm()?
> - Would you mind a patch that makes the code only use the per-CPU
>   buffers if we are running atomically to begin with?
>
> Thanks,
> Ard.
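A hypothetical helper, not part of the acomp API, expressing that rule as a check: a request must not ask the API to allocate the output scatterlist (which may require a sleeping GFP_KERNEL allocation) while declaring that it cannot sleep.

        #include <crypto/acompress.h>

        /*
         * Hypothetical sketch, not part of the acomp API: the flag rule Ard
         * states above. API-allocated output implies a possibly sleeping
         * allocation, so it is only coherent for sleepable requests.
         */
        static inline bool acomp_req_flags_consistent(struct acomp_req *req)
        {
                return !(req->flags & CRYPTO_ACOMP_ALLOC_OUTPUT) ||
                       (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP);
        }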
On Fri, Jul 21, 2017 at 02:24:02PM +0100, Ard Biesheuvel wrote:
>
> OK, but that doesn't really answer any of my questions:
> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
> other, because the current situation is buggy.

Yeah I'm not surprised that it is broken, because there are currently
no users of the API. We should fix the acomp API.

> - Could we move the scratch buffers to the request structure instead?
> Or may that be allocated from softirq context as well?

Yes the IPcomp requests would be allocated from softirq context.

> - Would you mind a patch that defers the allocation of the per-CPU
> buffers to the first invocation of crypto_scomp_init_tfm()?

Sure. That's what we currently do for IPcomp too.

> - Would you mind a patch that makes the code only use the per-CPU
> buffers if we are running atomically to begin with?

That would mean dropping the first packet so no.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
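One possible shape of that deferral, sketched against the helpers that the patch below removes (the scomp_lock mutex and the refcounted crypto_scomp_alloc_all_scratches() / crypto_scomp_free_all_scratches() pair): allocate the per-CPU scratches when the first tfm is instantiated rather than at registration time, so a registered-but-unused algorithm costs nothing.

        /*
         * Sketch only: assumes the existing scomp_lock mutex and the
         * refcounted alloc/free helpers from crypto/scompress.c; the exit
         * side would need to be wired up via the tfm's cra_exit hook.
         */
        static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
        {
                int ret;

                mutex_lock(&scomp_lock);
                ret = crypto_scomp_alloc_all_scratches();
                mutex_unlock(&scomp_lock);

                return ret;
        }

        static void crypto_scomp_exit_tfm(struct crypto_tfm *tfm)
        {
                mutex_lock(&scomp_lock);
                crypto_scomp_free_all_scratches();
                mutex_unlock(&scomp_lock);
        }

Because scomp_scratch_users is refcounted, the buffers would persist across concurrently live tfms and be freed when the last one goes away.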
On 21 July 2017 at 14:31, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Fri, Jul 21, 2017 at 02:24:02PM +0100, Ard Biesheuvel wrote:
>>
>> OK, but that doesn't really answer any of my questions:
>> - Should we enforce that CRYPTO_ACOMP_ALLOC_OUTPUT is mutually
>> exclusive with CRYPTO_TFM_REQ_MAY_SLEEP, or should
>> crypto_scomp_sg_alloc() always use GFP_ATOMIC? We need one or the
>> other, because the current situation is buggy.
>
> Yeah I'm not surprised that it is broken, because there are currently
> no users of the API. We should fix the acomp API.
>
>> - Could we move the scratch buffers to the request structure instead?
>> Or may that be allocated from softirq context as well?
>
> Yes the IPcomp requests would be allocated from softirq context.
>

__acomp_request_alloc() currently uses GFP_KERNEL explicitly, so that
needs to be fixed as well then.

>> - Would you mind a patch that defers the allocation of the per-CPU
>> buffers to the first invocation of crypto_scomp_init_tfm()?
>
> Sure. That's what we currently do for IPcomp too.
>

OK.

>> - Would you mind a patch that makes the code only use the per-CPU
>> buffers if we are running atomically to begin with?
>
> That would mean dropping the first packet so no.
>

I think you misunderstood me: the idea is not to allocate the per-CPU
buffers at that time, but to avoid them and use the heap directly (as
my patch does now). This way, we don't unnecessarily disable
preemption for calls executing in process context. I.e.,

        if (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP) {
                scratch_src = (u8 *)__get_free_pages(gfp, alloc_order);
                ...
        } else {
                cpu = get_cpu();
                scratch_src = *per_cpu_ptr(scomp_src_scratches, cpu);
                ...
        }

and make the put_cpu() at the end conditional in the same way.

In any case, my main gripe with this code is the unconditional
allocation of the scratch buffers (taking up memory and vmalloc
space), so if we can defer that to crypto_scomp_init_tfm() instead, I
already have most of what I need.

-- 
Ard.
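Filled out, Ard's fragment might look like the pair of helpers below (hypothetical, never merged, and Herbert declines the idea in the next message): sleepable callers take freshly allocated pages, while atomic callers keep the pre-allocated per-CPU scratch, so preemption is only disabled for callers that are already non-preemptible.

        /*
         * Hypothetical helpers completing the fragment above; assumes the
         * file-scope scomp_src_scratches per-CPU buffers from scompress.c.
         */
        static u8 *scomp_get_src_scratch(struct acomp_req *req, gfp_t gfp,
                                         int alloc_order)
        {
                if (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP)
                        return (u8 *)__get_free_pages(gfp, alloc_order);

                /* atomic caller: preemption is already off, per-CPU is safe */
                return *per_cpu_ptr(scomp_src_scratches, get_cpu());
        }

        static void scomp_put_src_scratch(struct acomp_req *req, u8 *scratch,
                                          int alloc_order)
        {
                if (req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP)
                        free_pages((unsigned long)scratch, alloc_order);
                else
                        put_cpu();
        }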
On Fri, Jul 21, 2017 at 02:42:20PM +0100, Ard Biesheuvel wrote:
>
> >> - Would you mind a patch that makes the code only use the per-CPU
> >> buffers if we are running atomically to begin with?
> >
> > That would mean dropping the first packet so no.
> >
>
> I think you misunderstood me: the idea is not to allocate the per-CPU
> buffers at that time, but to avoid them and use the heap directly (as
> my patch does now). This way, we don't unnecessarily disable
> preemption for calls executing in process context.

When we actually have a user other than IPcomp then we could
try this optimisation. Otherwise I think it'd be a waste of
time.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
On 21 July 2017 at 14:44, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Fri, Jul 21, 2017 at 02:42:20PM +0100, Ard Biesheuvel wrote:
>>
>> >> - Would you mind a patch that makes the code only use the per-CPU
>> >> buffers if we are running atomically to begin with?
>> >
>> > That would mean dropping the first packet so no.
>> >
>>
>> I think you misunderstood me: the idea is not to allocate the per-CPU
>> buffers at that time, but to avoid them and use the heap directly (as
>> my patch does now). This way, we don't unnecessarily disable
>> preemption for calls executing in process context.
>
> When we actually have a user other than IPcomp then we could
> try this optimisation. Otherwise I think it'd be a waste of
> time.
>

Fair enough.
diff --git a/crypto/scompress.c b/crypto/scompress.c
index ae1d3cf209e4..3d4c35a4c5a8 100644
--- a/crypto/scompress.c
+++ b/crypto/scompress.c
@@ -30,10 +30,6 @@
 #include "internal.h"
 
 static const struct crypto_type crypto_scomp_type;
-static void * __percpu *scomp_src_scratches;
-static void * __percpu *scomp_dst_scratches;
-static int scomp_scratch_users;
-static DEFINE_MUTEX(scomp_lock);
 
 #ifdef CONFIG_NET
 static int crypto_scomp_report(struct sk_buff *skb, struct crypto_alg *alg)
@@ -70,67 +66,6 @@ static int crypto_scomp_init_tfm(struct crypto_tfm *tfm)
 	return 0;
 }
 
-static void crypto_scomp_free_scratches(void * __percpu *scratches)
-{
-	int i;
-
-	if (!scratches)
-		return;
-
-	for_each_possible_cpu(i)
-		vfree(*per_cpu_ptr(scratches, i));
-
-	free_percpu(scratches);
-}
-
-static void * __percpu *crypto_scomp_alloc_scratches(void)
-{
-	void * __percpu *scratches;
-	int i;
-
-	scratches = alloc_percpu(void *);
-	if (!scratches)
-		return NULL;
-
-	for_each_possible_cpu(i) {
-		void *scratch;
-
-		scratch = vmalloc_node(SCOMP_SCRATCH_SIZE, cpu_to_node(i));
-		if (!scratch)
-			goto error;
-		*per_cpu_ptr(scratches, i) = scratch;
-	}
-
-	return scratches;
-
-error:
-	crypto_scomp_free_scratches(scratches);
-	return NULL;
-}
-
-static void crypto_scomp_free_all_scratches(void)
-{
-	if (!--scomp_scratch_users) {
-		crypto_scomp_free_scratches(scomp_src_scratches);
-		crypto_scomp_free_scratches(scomp_dst_scratches);
-		scomp_src_scratches = NULL;
-		scomp_dst_scratches = NULL;
-	}
-}
-
-static int crypto_scomp_alloc_all_scratches(void)
-{
-	if (!scomp_scratch_users++) {
-		scomp_src_scratches = crypto_scomp_alloc_scratches();
-		if (!scomp_src_scratches)
-			return -ENOMEM;
-		scomp_dst_scratches = crypto_scomp_alloc_scratches();
-		if (!scomp_dst_scratches)
-			return -ENOMEM;
-	}
-	return 0;
-}
-
 static void crypto_scomp_sg_free(struct scatterlist *sgl)
 {
 	int i, n;
@@ -184,24 +119,31 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 	void **tfm_ctx = acomp_tfm_ctx(tfm);
 	struct crypto_scomp *scomp = *tfm_ctx;
 	void **ctx = acomp_request_ctx(req);
-	const int cpu = get_cpu();
-	u8 *scratch_src = *per_cpu_ptr(scomp_src_scratches, cpu);
-	u8 *scratch_dst = *per_cpu_ptr(scomp_dst_scratches, cpu);
+	u8 *scratch_src;
+	u8 *scratch_dst;
+	int alloc_order;
+	gfp_t gfp;
 	int ret;
 
-	if (!req->src || !req->slen || req->slen > SCOMP_SCRATCH_SIZE) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (!req->src || !req->slen || req->slen > SCOMP_SCRATCH_SIZE)
+		return -EINVAL;
 
-	if (req->dst && !req->dlen) {
-		ret = -EINVAL;
-		goto out;
-	}
+	if (req->dst && !req->dlen)
+		return -EINVAL;
 
 	if (!req->dlen || req->dlen > SCOMP_SCRATCH_SIZE)
 		req->dlen = SCOMP_SCRATCH_SIZE;
 
+	gfp = req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ? GFP_KERNEL
+							 : GFP_ATOMIC;
+
+	alloc_order = get_order(PAGE_ALIGN(req->slen) + req->dlen);
+	scratch_src = (u8 *)__get_free_pages(gfp, alloc_order);
+	if (!scratch_src)
+		return -ENOMEM;
+
+	scratch_dst = scratch_src + PAGE_ALIGN(req->slen);
+
 	scatterwalk_map_and_copy(scratch_src, req->src, 0, req->slen, 0);
 	if (dir)
 		ret = crypto_scomp_compress(scomp, scratch_src, req->slen,
@@ -211,9 +153,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 			scratch_dst, &req->dlen, *ctx);
 	if (!ret) {
 		if (!req->dst) {
-			req->dst = crypto_scomp_sg_alloc(req->dlen,
-					req->base.flags & CRYPTO_TFM_REQ_MAY_SLEEP ?
-					GFP_KERNEL : GFP_ATOMIC);
+			req->dst = crypto_scomp_sg_alloc(req->dlen, gfp);
 			if (!req->dst)
 				goto out;
 		}
@@ -221,7 +161,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir)
 					 1);
 	}
 out:
-	put_cpu();
+	free_pages((unsigned long)scratch_src, alloc_order);
 	return ret;
 }
 
@@ -316,40 +256,18 @@ static const struct crypto_type crypto_scomp_type = {
 int crypto_register_scomp(struct scomp_alg *alg)
 {
 	struct crypto_alg *base = &alg->base;
-	int ret = -ENOMEM;
-
-	mutex_lock(&scomp_lock);
-	if (crypto_scomp_alloc_all_scratches())
-		goto error;
 
 	base->cra_type = &crypto_scomp_type;
 	base->cra_flags &= ~CRYPTO_ALG_TYPE_MASK;
 	base->cra_flags |= CRYPTO_ALG_TYPE_SCOMPRESS;
 
-	ret = crypto_register_alg(base);
-	if (ret)
-		goto error;
-
-	mutex_unlock(&scomp_lock);
-	return ret;
-
-error:
-	crypto_scomp_free_all_scratches();
-	mutex_unlock(&scomp_lock);
-	return ret;
+	return crypto_register_alg(base);
 }
 EXPORT_SYMBOL_GPL(crypto_register_scomp);
 
 int crypto_unregister_scomp(struct scomp_alg *alg)
 {
-	int ret;
-
-	mutex_lock(&scomp_lock);
-	ret = crypto_unregister_alg(&alg->base);
-	crypto_scomp_free_all_scratches();
-	mutex_unlock(&scomp_lock);
-
-	return ret;
+	return crypto_unregister_alg(&alg->base);
 }
 EXPORT_SYMBOL_GPL(crypto_unregister_scomp);
 
The scompress code unconditionally allocates 2 per-CPU scratch buffers
of 128 KB each, in order to avoid allocation overhead in the async
wrapper that encapsulates the synchronous compression algorithm, since
it may execute in atomic context.

There are several issues with this:

- The code may be invoked in process context with
  CRYPTO_TFM_REQ_MAY_SLEEP set in the request flags. If the
  CRYPTO_ACOMP_ALLOC_OUTPUT flag is also set, we end up calling
  kmalloc_array() and alloc_page() with the GFP flags set to
  GFP_KERNEL, and so we may attempt to sleep even though we are
  running with preemption disabled (due to the use of get_cpu()).

- On a system such as Cavium ThunderX, which has 96 cores, the per-CPU
  buffers waste 24 MB of memory.

- On 32-bit systems, claiming vmalloc space of this order (e.g., 1 MB
  on a 4-core system) is undesirable as well, given how small the
  vmalloc space typically is on such systems.

- Lengthy CPU-bound tasks such as compression should not be run with
  preemption disabled if we can avoid it.

So replace the per-CPU buffers with on-demand page allocations in the
handler. This reduces the memory and vmalloc space footprint of this
driver to ~0 unless you actually use it.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
---
The req->dst case could be further optimized to reuse the allocated
scratch_dst buffer as the buffer that is returned to the caller. Given
that there are no occurrences of crypto_alloc_acomp() anywhere in the
kernel, I will leave that to the next person.

 crypto/scompress.c | 126 ++++----------------
 1 file changed, 22 insertions(+), 104 deletions(-)

-- 
2.9.3