diff mbox series

btrfs: fix possible free space tree corruption with online conversion

Message ID 0d49d6227962f3f3d34b6c7ccfd0c330f98517af.1607545035.git.josef@toxicpanda.com
State Accepted
Commit 2f96e40212d435b328459ba6b3956395eed8fa9f
Headers show
Series btrfs: fix possible free space tree corruption with online conversion | expand

Commit Message

Josef Bacik Dec. 9, 2020, 8:17 p.m. UTC
While running btrfs/011 in a loop I would often ASSERT() while trying to
add a new free space entry that already existed, or get an -EEXIST while
adding a new block to the extent tree, which is another indication of
double allocation.

This occurs because when we do the free space tree population, we create
the new root and then populate the tree and commit the transaction.
The problem is when you create a new root, the root node and commit root
node are the same.  This means that caching a block group before the
transaction is committed can race with other operations modifying the
free space tree, and thus you can get double adds and other sort of
shenanigans.  This is only a problem for the first transaction, once
we've committed the transaction we created the free space tree in we're
OK to use the free space tree to cache block groups.

Fix this by marking the fs_info as unsafe to load the free space tree,
and fall back on the old slow method.  We could be smarter than this,
for example caching the block group while we're populating the free
space tree, but since this is a serious problem I've opted for the
simplest solution.

cc: stable@vger.kernel.org
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
---
 fs/btrfs/block-group.c     | 11 ++++++++++-
 fs/btrfs/ctree.h           |  3 +++
 fs/btrfs/free-space-tree.c |  9 ++++++++-
 3 files changed, 21 insertions(+), 2 deletions(-)

Comments

Nikolay Borisov Dec. 10, 2020, 9:22 a.m. UTC | #1
On 9.12.20 г. 22:17 ч., Josef Bacik wrote:
> While running btrfs/011 in a loop I would often ASSERT() while trying to

> add a new free space entry that already existed, or get an -EEXIST while

> adding a new block to the extent tree, which is another indication of

> double allocation.

> 

> This occurs because when we do the free space tree population, we create

> the new root and then populate the tree and commit the transaction.

> The problem is when you create a new root, the root node and commit root

> node are the same.  This means that caching a block group before the

> transaction is committed can race with other operations modifying the

> free space tree, and thus you can get double adds and other sort of


FST creation happens during mount so what would initiate block group
caching at that time, the race scenario should be better described in
the change log. E.g. what those other operations might be, considering
we are in mount ?

> shenanigans.  This is only a problem for the first transaction, once

> we've committed the transaction we created the free space tree in we're

> OK to use the free space tree to cache block groups.

> 

> Fix this by marking the fs_info as unsafe to load the free space tree,

> and fall back on the old slow method.  We could be smarter than this,

> for example caching the block group while we're populating the free

> space tree, but since this is a serious problem I've opted for the

> simplest solution.

> 

> cc: stable@vger.kernel.org

> Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")

> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

> ---

>  fs/btrfs/block-group.c     | 11 ++++++++++-

>  fs/btrfs/ctree.h           |  3 +++

>  fs/btrfs/free-space-tree.c |  9 ++++++++-

>  3 files changed, 21 insertions(+), 2 deletions(-)

> 


?<snip>


>  /*

> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c

> index e33a65bd9a0c..8fbda221f4b5 100644

> --- a/fs/btrfs/free-space-tree.c

> +++ b/fs/btrfs/free-space-tree.c

> @@ -1150,6 +1150,7 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)

>  		return PTR_ERR(trans);

>  

>  	set_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);

> +	set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);

>  	free_space_root = btrfs_create_tree(trans,

>  					    BTRFS_FREE_SPACE_TREE_OBJECTID);

>  	if (IS_ERR(free_space_root)) {

> @@ -1171,8 +1172,14 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)

>  	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);

>  	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID);

>  	clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);

> +	ret = btrfs_commit_transaction(trans);

>  

> -	return btrfs_commit_transaction(trans);

> +	/*

> +	 * Now that we've committed the transaction any reading of our commit

> +	 * root will be safe, so we can caching from the free space tree now.

> +	 */

> +	clear_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);

> +	return ret;


I guess you can't simply move the clearing of the
BTRFS_FS_CREATING_FREE_SPACE_TREE after the commit since it blocks
delayed refs running.

>  

>  abort:

>  	clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);


Shouldn't the new flag be cleared on abort ?

>
Josef Bacik Dec. 10, 2020, 2:26 p.m. UTC | #2
On 12/10/20 4:22 AM, Nikolay Borisov wrote:
> 

> 

> On 9.12.20 г. 22:17 ч., Josef Bacik wrote:

>> While running btrfs/011 in a loop I would often ASSERT() while trying to

>> add a new free space entry that already existed, or get an -EEXIST while

>> adding a new block to the extent tree, which is another indication of

>> double allocation.

>>

>> This occurs because when we do the free space tree population, we create

>> the new root and then populate the tree and commit the transaction.

>> The problem is when you create a new root, the root node and commit root

>> node are the same.  This means that caching a block group before the

>> transaction is committed can race with other operations modifying the

>> free space tree, and thus you can get double adds and other sort of

> 

> FST creation happens during mount so what would initiate block group

> caching at that time, the race scenario should be better described in

> the change log. E.g. what those other operations might be, considering

> we are in mount ?

> 


It's happening during the transaction commit.  Creating the free space tree 
allocates blocks, which updates the FST via delayed refs.  These run during 
transaction commit, which allocates more blocks because we're now COW'ing the 
extent tree.  At this point if we have to cache a block group we start doing 
that, which happens in another thread.  Now we have the caching thread trying to 
cache while the tree is changing because of delayed refs running.

>> shenanigans.  This is only a problem for the first transaction, once

>> we've committed the transaction we created the free space tree in we're

>> OK to use the free space tree to cache block groups.

>>

>> Fix this by marking the fs_info as unsafe to load the free space tree,

>> and fall back on the old slow method.  We could be smarter than this,

>> for example caching the block group while we're populating the free

>> space tree, but since this is a serious problem I've opted for the

>> simplest solution.

>>

>> cc: stable@vger.kernel.org

>> Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")

>> Signed-off-by: Josef Bacik <josef@toxicpanda.com>

>> ---

>>   fs/btrfs/block-group.c     | 11 ++++++++++-

>>   fs/btrfs/ctree.h           |  3 +++

>>   fs/btrfs/free-space-tree.c |  9 ++++++++-

>>   3 files changed, 21 insertions(+), 2 deletions(-)

>>

> 

> ?<snip>

> 

> 

>>   /*

>> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c

>> index e33a65bd9a0c..8fbda221f4b5 100644

>> --- a/fs/btrfs/free-space-tree.c

>> +++ b/fs/btrfs/free-space-tree.c

>> @@ -1150,6 +1150,7 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)

>>   		return PTR_ERR(trans);

>>   

>>   	set_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);

>> +	set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);

>>   	free_space_root = btrfs_create_tree(trans,

>>   					    BTRFS_FREE_SPACE_TREE_OBJECTID);

>>   	if (IS_ERR(free_space_root)) {

>> @@ -1171,8 +1172,14 @@ int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)

>>   	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);

>>   	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID);

>>   	clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);

>> +	ret = btrfs_commit_transaction(trans);

>>   

>> -	return btrfs_commit_transaction(trans);

>> +	/*

>> +	 * Now that we've committed the transaction any reading of our commit

>> +	 * root will be safe, so we can caching from the free space tree now.

>> +	 */

>> +	clear_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);

>> +	return ret;

> 

> I guess you can't simply move the clearing of the

> BTRFS_FS_CREATING_FREE_SPACE_TREE after the commit since it blocks

> delayed refs running.

> 

>>   

>>   abort:

>>   	clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);

> 

> Shouldn't the new flag be cleared on abort ?

> 


Yup I'll fix that, thanks,

Josef
diff mbox series

Patch

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 52f2198d44c9..b8bbdd95743e 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -673,7 +673,16 @@  static noinline void caching_thread(struct btrfs_work *work)
 		wake_up(&caching_ctl->wait);
 	}
 
-	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))
+	/*
+	 * If we are in the transaction that populated the free space tree we
+	 * can't actually cache from the free space tree as our commit root and
+	 * real root are the same, so we could change the contents of the blocks
+	 * while caching.  Instead do the slow caching in this case, and after
+	 * the transaction has committed we will be safe.
+	 */
+	if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE) &&
+	    !(test_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
+		       &fs_info->flags)))
 		ret = load_free_space_tree(caching_ctl);
 	else
 		ret = load_extent_tree_free(caching_ctl);
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3935d297d198..4a60d81da5cb 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -562,6 +562,9 @@  enum {
 
 	/* Indicate that we need to cleanup space cache v1 */
 	BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
+
+	/* Indicate that we can't trust the free space tree for caching yet. */
+	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
 };
 
 /*
diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
index e33a65bd9a0c..8fbda221f4b5 100644
--- a/fs/btrfs/free-space-tree.c
+++ b/fs/btrfs/free-space-tree.c
@@ -1150,6 +1150,7 @@  int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
 		return PTR_ERR(trans);
 
 	set_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);
+	set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
 	free_space_root = btrfs_create_tree(trans,
 					    BTRFS_FREE_SPACE_TREE_OBJECTID);
 	if (IS_ERR(free_space_root)) {
@@ -1171,8 +1172,14 @@  int btrfs_create_free_space_tree(struct btrfs_fs_info *fs_info)
 	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE);
 	btrfs_set_fs_compat_ro(fs_info, FREE_SPACE_TREE_VALID);
 	clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);
+	ret = btrfs_commit_transaction(trans);
 
-	return btrfs_commit_transaction(trans);
+	/*
+	 * Now that we've committed the transaction any reading of our commit
+	 * root will be safe, so we can caching from the free space tree now.
+	 */
+	clear_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
+	return ret;
 
 abort:
 	clear_bit(BTRFS_FS_CREATING_FREE_SPACE_TREE, &fs_info->flags);