diff mbox series

crypto: api - Fix generic algorithm self-test races

Message ID ZtQgVOnK6WzdIDlU@gondor.apana.org.au
State Accepted
Commit e7a4142b35ce489fc8908d75596c51549711ade0
Headers show
Series crypto: api - Fix generic algorithm self-test races | expand

Commit Message

Herbert Xu Sept. 1, 2024, 8:05 a.m. UTC
On Fri, Aug 30, 2024 at 10:51:54AM -0700, Eric Biggers wrote:
>
> Given below in defconfig form, use 'make olddefconfig' to apply.  The failures
> are nondeterministic and sometimes there are different ones, for example:
> 
> [    0.358017] alg: skcipher: failed to allocate transform for cbc(twofish-generic): -2
> [    0.358365] alg: self-tests for cbc(twofish) using cbc(twofish-generic) failed (rc=-2)
> [    0.358535] alg: skcipher: failed to allocate transform for cbc(camellia-generic): -2
> [    0.358918] alg: self-tests for cbc(camellia) using cbc(camellia-generic) failed (rc=-2)
> [    0.371533] alg: skcipher: failed to allocate transform for xts(ecb(aes-generic)): -2
> [    0.371922] alg: self-tests for xts(aes) using xts(ecb(aes-generic)) failed (rc=-2)
> 
> Modules are not enabled, maybe that matters (I haven't checked yet).

Yes I think that was the key.  This triggers a massive self-test
run which executes in parallel and reveals a few race conditions
in the system.  I think it boils down to the following scenario:

Base algorithm X-generic, X-optimised
Template Y
Optimised algorithm Y-X-optimised

Everything gets registered, and then the self-tests are started.
When Y-X-optimised gets tested, it requests the creation of the
generic Y(X-generic).  Which then itself undergoes testing.

The race is that after Y(X-generic) gets registered, but just
before it gets tested, X-optimised finally finishes self-testing
which then causes all spawns of X-generic to be destroyed.  So
by the time the self-test for Y(X-generic) comes along, it can
no longer find the algorithm.  This error then bubbles up all
the way up to the self-test of Y-X-optimised which then fails.

Note that there is some complexity that I've omitted here because
when the generic self-test fails to find Y(X-generic) it actually
triggers the construction of it again which then fails for various
other reasons (these are not important because the construction
should *not* be triggered at this point).

So in a way the error is expected, and we should probably remove
the pr_err for the case where ENOENT is returned for the algorithm
that we're currently testing.

The solution is two-fold.  First when an algorithm undergoes
self-testing it should not trigger its construction.  Secondly
if an instance larval fails to materialise due to it being destroyed
by a more optimised algorithm coming along, it should obviously
retry the construction.

Remove the check in __crypto_alg_lookup that stops a larval from
matching new requests based on differences in the mask.  It is better
to block new requests even if it is wrong and then simply retry the
lookup.  If this ends up being the wrong larval it will sort iself
out during the retry.

Reduce the CRYPTO_ALG_TYPE_MASK bits in type during larval creation
as otherwise LSKCIPHER algorithms may not match SKCIPHER larvals.

Also block the instance creation during self-testing in the function
crypto_larval_lookup by checking for CRYPTO_ALG_TESTED in the mask
field.

Finally change the return value when crypto_alg_lookup fails in
crypto_larval_wait to EAGAIN to redo the lookup.

Fixes: 37da5d0ffa7b ("crypto: api - Do not wait for tests during registration")
Reported-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Comments

Eric Biggers Sept. 2, 2024, 5:05 p.m. UTC | #1
On Sun, Sep 01, 2024 at 04:05:40PM +0800, Herbert Xu wrote:
> On Fri, Aug 30, 2024 at 10:51:54AM -0700, Eric Biggers wrote:
> >
> > Given below in defconfig form, use 'make olddefconfig' to apply.  The failures
> > are nondeterministic and sometimes there are different ones, for example:
> > 
> > [    0.358017] alg: skcipher: failed to allocate transform for cbc(twofish-generic): -2
> > [    0.358365] alg: self-tests for cbc(twofish) using cbc(twofish-generic) failed (rc=-2)
> > [    0.358535] alg: skcipher: failed to allocate transform for cbc(camellia-generic): -2
> > [    0.358918] alg: self-tests for cbc(camellia) using cbc(camellia-generic) failed (rc=-2)
> > [    0.371533] alg: skcipher: failed to allocate transform for xts(ecb(aes-generic)): -2
> > [    0.371922] alg: self-tests for xts(aes) using xts(ecb(aes-generic)) failed (rc=-2)
> > 
> > Modules are not enabled, maybe that matters (I haven't checked yet).
> 
> Yes I think that was the key.  This triggers a massive self-test
> run which executes in parallel and reveals a few race conditions
> in the system.  I think it boils down to the following scenario:
> 
> Base algorithm X-generic, X-optimised
> Template Y
> Optimised algorithm Y-X-optimised
> 
> Everything gets registered, and then the self-tests are started.
> When Y-X-optimised gets tested, it requests the creation of the
> generic Y(X-generic).  Which then itself undergoes testing.
> 
> The race is that after Y(X-generic) gets registered, but just
> before it gets tested, X-optimised finally finishes self-testing
> which then causes all spawns of X-generic to be destroyed.  So
> by the time the self-test for Y(X-generic) comes along, it can
> no longer find the algorithm.  This error then bubbles up all
> the way up to the self-test of Y-X-optimised which then fails.
> 
> Note that there is some complexity that I've omitted here because
> when the generic self-test fails to find Y(X-generic) it actually
> triggers the construction of it again which then fails for various
> other reasons (these are not important because the construction
> should *not* be triggered at this point).
> 
> So in a way the error is expected, and we should probably remove
> the pr_err for the case where ENOENT is returned for the algorithm
> that we're currently testing.
> 
> The solution is two-fold.  First when an algorithm undergoes
> self-testing it should not trigger its construction.  Secondly
> if an instance larval fails to materialise due to it being destroyed
> by a more optimised algorithm coming along, it should obviously
> retry the construction.
> 
> Remove the check in __crypto_alg_lookup that stops a larval from
> matching new requests based on differences in the mask.  It is better
> to block new requests even if it is wrong and then simply retry the
> lookup.  If this ends up being the wrong larval it will sort iself
> out during the retry.
> 
> Reduce the CRYPTO_ALG_TYPE_MASK bits in type during larval creation
> as otherwise LSKCIPHER algorithms may not match SKCIPHER larvals.
> 
> Also block the instance creation during self-testing in the function
> crypto_larval_lookup by checking for CRYPTO_ALG_TESTED in the mask
> field.
> 
> Finally change the return value when crypto_alg_lookup fails in
> crypto_larval_wait to EAGAIN to redo the lookup.
> 
> Fixes: 37da5d0ffa7b ("crypto: api - Do not wait for tests during registration")
> Reported-by: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
> 
> diff --git a/crypto/api.c b/crypto/api.c
> index bbe29d438815..bfd177a4313a 100644
> --- a/crypto/api.c
> +++ b/crypto/api.c
> @@ -70,11 +70,6 @@ static struct crypto_alg *__crypto_alg_lookup(const char *name, u32 type,
>  		if ((q->cra_flags ^ type) & mask)
>  			continue;
>  
> -		if (crypto_is_larval(q) &&
> -		    !crypto_is_test_larval((struct crypto_larval *)q) &&
> -		    ((struct crypto_larval *)q)->mask != mask)
> -			continue;
> -
>  		exact = !strcmp(q->cra_driver_name, name);
>  		fuzzy = !strcmp(q->cra_name, name);
>  		if (!exact && !(fuzzy && q->cra_priority > best))
> @@ -113,6 +108,8 @@ struct crypto_larval *crypto_larval_alloc(const char *name, u32 type, u32 mask)
>  	if (!larval)
>  		return ERR_PTR(-ENOMEM);
>  
> +	type &= ~CRYPTO_ALG_TYPE_MASK | (mask ?: CRYPTO_ALG_TYPE_MASK);
> +
>  	larval->mask = mask;
>  	larval->alg.cra_flags = CRYPTO_ALG_LARVAL | type;
>  	larval->alg.cra_priority = -1;
> @@ -229,7 +226,7 @@ static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg)
>  		type = alg->cra_flags & ~(CRYPTO_ALG_LARVAL | CRYPTO_ALG_DEAD);
>  		mask = larval->mask;
>  		alg = crypto_alg_lookup(alg->cra_name, type, mask) ?:
> -		      ERR_PTR(-ENOENT);
> +		      ERR_PTR(-EAGAIN);
>  	} else if (IS_ERR(alg))
>  		;
>  	else if (crypto_is_test_larval(larval) &&
> @@ -308,8 +305,12 @@ static struct crypto_alg *crypto_larval_lookup(const char *name, u32 type,
>  
>  	if (!IS_ERR_OR_NULL(alg) && crypto_is_larval(alg))
>  		alg = crypto_larval_wait(alg);
> -	else if (!alg)
> +	else if (alg)
> +		;
> +	else if (!(mask & CRYPTO_ALG_TESTED))
>  		alg = crypto_larval_add(name, type, mask);
> +	else
> +		alg = ERR_PTR(-ENOENT);
>  
>  	return alg;
>  }

With both this patch "crypto: api - Fix generic algorithm self-test races" and
your other patch "crypto: algboss - Pass instance creation error up" applied,
I'm still getting errors occasionally, e.g.:

    [    5.155587] alg: skcipher: failed to allocate transform for cbc(sm4-generic): -2
    [    5.155954] alg: self-tests for cbc(sm4) using cbc(sm4-generic) failed (rc=-2)
    [    5.372511] alg: aead: failed to allocate transform for gcm_base(ctr(aes-generic),ghash-generic): -2
    [    5.372861] alg: self-tests for gcm(aes) using gcm_base(ctr(aes-generic),ghash-generic) failed (rc=-2)

I can't follow your explanation of what is going on here and what the fix is.
Would it make any sense to just revert the commits that introduced this problem?

- Eric
Herbert Xu Sept. 2, 2024, 11:07 p.m. UTC | #2
On Mon, Sep 02, 2024 at 10:05:54AM -0700, Eric Biggers wrote:
>
> With both this patch "crypto: api - Fix generic algorithm self-test races" and
> your other patch "crypto: algboss - Pass instance creation error up" applied,
> I'm still getting errors occasionally, e.g.:
> 
>     [    5.155587] alg: skcipher: failed to allocate transform for cbc(sm4-generic): -2
>     [    5.155954] alg: self-tests for cbc(sm4) using cbc(sm4-generic) failed (rc=-2)
>     [    5.372511] alg: aead: failed to allocate transform for gcm_base(ctr(aes-generic),ghash-generic): -2
>     [    5.372861] alg: self-tests for gcm(aes) using gcm_base(ctr(aes-generic),ghash-generic) failed (rc=-2)
> 
> I can't follow your explanation of what is going on here and what the fix is.
> Would it make any sense to just revert the commits that introduced this problem?

As I said earlier, these errors are expected.  What's happening
is this:

__ecb-sm4-aesni-avx gets registered (but not tested)

cbc(sm4-generic) gets registered (but not tested)

__ecb-sm4-aesni-avx finishes testing
	with lskcipher this is equivalent to crypto_cipher sm4
	so it triggers the destruction of all instances of sm4

cbc(sm4-generic) gets marked as dead

cbc(sm4-generic) fails self-test because it's already dead (ENOENT)

It's harmless because whatever that is asking for cbc(sm4-generic)
(in this case it's the extra-test mechanism) will simply retry the
allocation which will then succeed.

I will send a patch to disable the warning when allocating X returns
ENOENT while we're testing X itself.  This can always happen if X
gets killed for the reason mentioned above and it's perfectly harmless.

It's just that the race window was tiny previously because testing
occurred immediately after registration.  But now we've magnified
that window many times with asynchronous testing.

Cheers,
diff mbox series

Patch

diff --git a/crypto/api.c b/crypto/api.c
index bbe29d438815..bfd177a4313a 100644
--- a/crypto/api.c
+++ b/crypto/api.c
@@ -70,11 +70,6 @@  static struct crypto_alg *__crypto_alg_lookup(const char *name, u32 type,
 		if ((q->cra_flags ^ type) & mask)
 			continue;
 
-		if (crypto_is_larval(q) &&
-		    !crypto_is_test_larval((struct crypto_larval *)q) &&
-		    ((struct crypto_larval *)q)->mask != mask)
-			continue;
-
 		exact = !strcmp(q->cra_driver_name, name);
 		fuzzy = !strcmp(q->cra_name, name);
 		if (!exact && !(fuzzy && q->cra_priority > best))
@@ -113,6 +108,8 @@  struct crypto_larval *crypto_larval_alloc(const char *name, u32 type, u32 mask)
 	if (!larval)
 		return ERR_PTR(-ENOMEM);
 
+	type &= ~CRYPTO_ALG_TYPE_MASK | (mask ?: CRYPTO_ALG_TYPE_MASK);
+
 	larval->mask = mask;
 	larval->alg.cra_flags = CRYPTO_ALG_LARVAL | type;
 	larval->alg.cra_priority = -1;
@@ -229,7 +226,7 @@  static struct crypto_alg *crypto_larval_wait(struct crypto_alg *alg)
 		type = alg->cra_flags & ~(CRYPTO_ALG_LARVAL | CRYPTO_ALG_DEAD);
 		mask = larval->mask;
 		alg = crypto_alg_lookup(alg->cra_name, type, mask) ?:
-		      ERR_PTR(-ENOENT);
+		      ERR_PTR(-EAGAIN);
 	} else if (IS_ERR(alg))
 		;
 	else if (crypto_is_test_larval(larval) &&
@@ -308,8 +305,12 @@  static struct crypto_alg *crypto_larval_lookup(const char *name, u32 type,
 
 	if (!IS_ERR_OR_NULL(alg) && crypto_is_larval(alg))
 		alg = crypto_larval_wait(alg);
-	else if (!alg)
+	else if (alg)
+		;
+	else if (!(mask & CRYPTO_ALG_TESTED))
 		alg = crypto_larval_add(name, type, mask);
+	else
+		alg = ERR_PTR(-ENOENT);
 
 	return alg;
 }