
[4/7] btrfs: qgroup: try to flush qgroup space when we get -EDQUOT

Message ID 740e4978ebebfc08491db3f52264f7b5ba60ed96.1628845854.git.anand.jain@oracle.com
State New
Series make btrfs/153 successful on 5.4.y

Commit Message

Anand Jain Aug. 13, 2021, 9:55 a.m. UTC
From: Qu Wenruo <wqu@suse.com>

commit c53e9653605dbf708f5be02902de51831be4b009 upstream

[PROBLEM]
There are known problems related to how btrfs handles qgroup reserved
space.  One of the most obvious cases is the test case btrfs/153,
which does fallocate, then writes into the preallocated range.

  btrfs/153 1s ... - output mismatch (see xfstests-dev/results//btrfs/153.out.bad)
      --- tests/btrfs/153.out     2019-10-22 15:18:14.068965341 +0800
      +++ xfstests-dev/results//btrfs/153.out.bad      2020-07-01 20:24:40.730000089 +0800
      @@ -1,2 +1,5 @@
       QA output created by 153
      +pwrite: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
      +/mnt/scratch/testfile2: Disk quota exceeded
       Silence is golden
      ...
      (Run 'diff -u xfstests-dev/tests/btrfs/153.out xfstests-dev/results//btrfs/153.out.bad'  to see the entire diff)

[CAUSE]
Since commit c6887cd11149 ("Btrfs: don't do nocow check unless we have to"),
we always reserve space, no matter whether the write is COW or not.

This behavior change was made mostly for performance, and reverting it
is not a good idea anyway.

For a preallocated extent, we have already reserved qgroup data space
for it, and since we also reserve qgroup data space at buffered write
time, writing into the preallocated range needs twice the space.

This leads to -EDQUOT in the buffered write routine.
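The double-charging can be illustrated with a toy model of the qgroup
accounting (not btrfs code; the limit, the sizes and the function names
here are all made up for illustration):

```c
#include <errno.h>

/* Toy qgroup model: a single limit and a single reserved-bytes counter.
 * All names and numbers are hypothetical, for illustration only. */
#define QUOTA_LIMIT (1024ULL << 20)		/* 1 GiB qgroup limit */
static unsigned long long reserved;

static int qgroup_reserve(unsigned long long bytes)
{
	if (reserved + bytes > QUOTA_LIMIT)
		return -EDQUOT;
	reserved += bytes;
	return 0;
}

/* fallocate 600M, then buffered-write the same 600M range: both steps
 * reserve, so the quota sees 1200M even though only 600M of data will
 * ever exist on disk. */
int demo_double_reservation(void)
{
	reserved = 0;
	if (qgroup_reserve(600ULL << 20))	/* prealloc reservation */
		return -1;
	return qgroup_reserve(600ULL << 20);	/* buffered write: -EDQUOT */
}
```

With a 1 GiB limit, the second reservation pushes the total to 1200M and
fails, which is exactly the spurious "Disk quota exceeded" that btrfs/153
catches.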

And we can't follow the same solution as the data/metadata space check:
unlike that check, qgroup reserved space is shared between data and
metadata.  The -EDQUOT can happen at the metadata reservation, so doing
a NODATACOW check after a qgroup reservation failure is not a solution.

[FIX]
To solve the problem, we don't return -EDQUOT directly.  Instead, every
time we get -EDQUOT, we try to flush qgroup space:

- Flush all inodes of the root
  NODATACOW writes will free their qgroup reservation at run_delalloc_range().
  However, we don't have the infrastructure to flush only NODATACOW
  inodes, so here we flush all inodes anyway.

- Wait for ordered extents
  This converts the preallocated metadata space into per-trans
  metadata, which can be freed by a later transaction commit.

- Commit transaction
  This will free all per-trans metadata space.
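The resulting reserve path can be sketched as follows (a simplified
stand-in, not the actual btrfs functions; the names and the accounting
are hypothetical, and the flush here just models space being reclaimed):

```c
#include <errno.h>

/* space_left is what the quota still allows; reclaimable is what a
 * flush (delalloc + ordered extents + transaction commit) could free. */
static long space_left;
static long reclaimable;

static int qgroup_reserve(long bytes)
{
	if (bytes > space_left)
		return -EDQUOT;
	space_left -= bytes;
	return 0;
}

static int try_flush_qgroup(void)
{
	space_left += reclaimable;	/* flushing frees reclaimable space */
	reclaimable = 0;
	return 0;
}

/* Mirrors the shape of btrfs_qgroup_reserve_data() after this patch:
 * on -EDQUOT, flush once and retry; any other result is final. */
int reserve_with_flush(long bytes)
{
	int ret = qgroup_reserve(bytes);

	if (ret != -EDQUOT)
		return ret;
	ret = try_flush_qgroup();
	if (ret < 0)
		return ret;
	return qgroup_reserve(bytes);
}

int demo(void)
{
	space_left = 100;
	reclaimable = 400;
	if (reserve_with_flush(300) != 0)	/* -EDQUOT, flush, retry OK */
		return 1;
	if (reserve_with_flush(300) != -EDQUOT)	/* nothing left to reclaim */
		return 2;
	return 0;
}
```

Note the retry happens at most once per reservation attempt; if the
flush reclaims nothing, the caller still gets -EDQUOT.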

Also, we don't want to trigger the flush multiple times, so we introduce
a per-root wait queue and a new root status bit to ensure only one
thread starts the flushing.

Fixes: c6887cd11149 ("Btrfs: don't do nocow check unless we have to")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Anand Jain <anand.jain@oracle.com>
---
 fs/btrfs/ctree.h   |   3 ++
 fs/btrfs/disk-io.c |   1 +
 fs/btrfs/qgroup.c  | 100 +++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 96 insertions(+), 8 deletions(-)

Comments

Qu Wenruo Aug. 13, 2021, 10:26 a.m. UTC | #1
On 2021/8/13 下午5:55, Anand Jain wrote:
> From: Qu Wenruo <wqu@suse.com>
>
> commit c53e9653605dbf708f5be02902de51831be4b009 upstream

This lacks certain upstream fixes for it:

f9baa501b4fd6962257853d46ddffbc21f27e344 btrfs: fix deadlock when
cloning inline extents and using qgroups

4d14c5cde5c268a2bc26addecf09489cb953ef64 btrfs: don't flush from
btrfs_delayed_inode_reserve_metadata

6f23277a49e68f8a9355385c846939ad0b1261e7 btrfs: qgroup: don't commit
transaction when we already hold the handle

All these fixes ensure we don't try to flush in contexts where we
shouldn't.

Without them, it can hit various deadlocks.

Thanks,
Qu
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 7960359dbc70..5448dc62e915 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -945,6 +945,8 @@ enum {
>   	BTRFS_ROOT_DEAD_TREE,
>   	/* The root has a log tree. Used only for subvolume roots. */
>   	BTRFS_ROOT_HAS_LOG_TREE,
> +	/* Qgroup flushing is in progress */
> +	BTRFS_ROOT_QGROUP_FLUSHING,
>   };
>
>   /*
> @@ -1097,6 +1099,7 @@ struct btrfs_root {
>   	spinlock_t qgroup_meta_rsv_lock;
>   	u64 qgroup_meta_rsv_pertrans;
>   	u64 qgroup_meta_rsv_prealloc;
> +	wait_queue_head_t qgroup_flush_wait;
>
>   	/* Number of active swapfiles */
>   	atomic_t nr_swapfiles;
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index e6aa94a583e9..e3bcab38a166 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1154,6 +1154,7 @@ static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
>   	mutex_init(&root->log_mutex);
>   	mutex_init(&root->ordered_extent_mutex);
>   	mutex_init(&root->delalloc_mutex);
> +	init_waitqueue_head(&root->qgroup_flush_wait);
>   	init_waitqueue_head(&root->log_writer_wait);
>   	init_waitqueue_head(&root->log_commit_wait[0]);
>   	init_waitqueue_head(&root->log_commit_wait[1]);
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index 50c45b4fcfd4..b312ac645e08 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -3479,17 +3479,58 @@ static int qgroup_unreserve_range(struct btrfs_inode *inode,
>   }
>
>   /*
> - * Reserve qgroup space for range [start, start + len).
> + * Try to free some space for qgroup.
>    *
> - * This function will either reserve space from related qgroups or doing
> - * nothing if the range is already reserved.
> + * For qgroup, there are only 3 ways to free qgroup space:
> + * - Flush nodatacow write
> + *   Any nodatacow write will free its reserved data space at run_delalloc_range().
> + *   In theory, we should only flush nodatacow inodes, but it's not yet
> + *   possible, so we need to flush the whole root.
>    *
> - * Return 0 for successful reserve
> - * Return <0 for error (including -EQUOT)
> + * - Wait for ordered extents
> + *   When ordered extents are finished, their reserved metadata is finally
> + *   converted to per_trans status, which can be freed by later commit
> + *   transaction.
>    *
> - * NOTE: this function may sleep for memory allocation.
> + * - Commit transaction
> + *   This would free the meta_per_trans space.
> + *   In theory this shouldn't provide much space, but any more qgroup space
> + *   is needed.
>    */
> -int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
> +static int try_flush_qgroup(struct btrfs_root *root)
> +{
> +	struct btrfs_trans_handle *trans;
> +	int ret;
> +
> +	/*
> +	 * We don't want to run flush again and again, so if there is a running
> +	 * one, we won't try to start a new flush, but exit directly.
> +	 */
> +	if (test_and_set_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state)) {
> +		wait_event(root->qgroup_flush_wait,
> +			!test_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state));
> +		return 0;
> +	}
> +
> +	ret = btrfs_start_delalloc_snapshot(root);
> +	if (ret < 0)
> +		goto out;
> +	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
> +
> +	trans = btrfs_join_transaction(root);
> +	if (IS_ERR(trans)) {
> +		ret = PTR_ERR(trans);
> +		goto out;
> +	}
> +
> +	ret = btrfs_commit_transaction(trans);
> +out:
> +	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
> +	wake_up(&root->qgroup_flush_wait);
> +	return ret;
> +}
> +
> +static int qgroup_reserve_data(struct btrfs_inode *inode,
>   			struct extent_changeset **reserved_ret, u64 start,
>   			u64 len)
>   {
> @@ -3542,6 +3583,34 @@ int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
>   	return ret;
>   }
>
> +/*
> + * Reserve qgroup space for range [start, start + len).
> + *
> + * This function will either reserve space from related qgroups or do nothing
> + * if the range is already reserved.
> + *
> + * Return 0 for successful reservation
> + * Return <0 for error (including -EQUOT)
> + *
> + * NOTE: This function may sleep for memory allocation, dirty page flushing and
> + *	 commit transaction. So caller should not hold any dirty page locked.
> + */
> +int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
> +			struct extent_changeset **reserved_ret, u64 start,
> +			u64 len)
> +{
> +	int ret;
> +
> +	ret = qgroup_reserve_data(inode, reserved_ret, start, len);
> +	if (ret <= 0 && ret != -EDQUOT)
> +		return ret;
> +
> +	ret = try_flush_qgroup(inode->root);
> +	if (ret < 0)
> +		return ret;
> +	return qgroup_reserve_data(inode, reserved_ret, start, len);
> +}
> +
>   /* Free ranges specified by @reserved, normally in error path */
>   static int qgroup_free_reserved_data(struct btrfs_inode *inode,
>   			struct extent_changeset *reserved, u64 start, u64 len)
> @@ -3712,7 +3781,7 @@ static int sub_root_meta_rsv(struct btrfs_root *root, int num_bytes,
>   	return num_bytes;
>   }
>
> -int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
> +static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
>   				enum btrfs_qgroup_rsv_type type, bool enforce)
>   {
>   	struct btrfs_fs_info *fs_info = root->fs_info;
> @@ -3739,6 +3808,21 @@ int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
>   	return ret;
>   }
>
> +int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
> +				enum btrfs_qgroup_rsv_type type, bool enforce)
> +{
> +	int ret;
> +
> +	ret = qgroup_reserve_meta(root, num_bytes, type, enforce);
> +	if (ret <= 0 && ret != -EDQUOT)
> +		return ret;
> +
> +	ret = try_flush_qgroup(root);
> +	if (ret < 0)
> +		return ret;
> +	return qgroup_reserve_meta(root, num_bytes, type, enforce);
> +}
> +
>   void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
>   {
>   	struct btrfs_fs_info *fs_info = root->fs_info;
>
Anand Jain Aug. 13, 2021, 10:30 a.m. UTC | #2
On 13/08/2021 18:26, Qu Wenruo wrote:
> 
> 
> On 2021/8/13 下午5:55, Anand Jain wrote:
>> From: Qu Wenruo <wqu@suse.com>
>>
>> commit c53e9653605dbf708f5be02902de51831be4b009 upstream
> 
> This lacks certain upstream fixes for it:
> 
> f9baa501b4fd6962257853d46ddffbc21f27e344 btrfs: fix deadlock when
> cloning inline extents and using qgroups
> 
> 4d14c5cde5c268a2bc26addecf09489cb953ef64 btrfs: don't flush from
> btrfs_delayed_inode_reserve_metadata
> 
> 6f23277a49e68f8a9355385c846939ad0b1261e7 btrfs: qgroup: don't commit
> transaction when we already hold the handle
> 
> All these fixes are to ensure we don't try to flush in context where we
> shouldn't.
> 
> Without them, it can hit various deadlock.
>

Qu,

    Thanks for taking a look. I will send it in v2.

-Anand


Qu Wenruo Aug. 13, 2021, 10:39 a.m. UTC | #3
On 2021/8/13 下午6:30, Anand Jain wrote:
> 
> 
> On 13/08/2021 18:26, Qu Wenruo wrote:
>>
>>
>> On 2021/8/13 下午5:55, Anand Jain wrote:
>>> From: Qu Wenruo <wqu@suse.com>
>>>
>>> commit c53e9653605dbf708f5be02902de51831be4b009 upstream
>>
>> This lacks certain upstream fixes for it:
>>
>> f9baa501b4fd6962257853d46ddffbc21f27e344 btrfs: fix deadlock when
>> cloning inline extents and using qgroups
>>
>> 4d14c5cde5c268a2bc26addecf09489cb953ef64 btrfs: don't flush from
>> btrfs_delayed_inode_reserve_metadata
>>
>> 6f23277a49e68f8a9355385c846939ad0b1261e7 btrfs: qgroup: don't commit
>> transaction when we already hold the handle
>>
>> All these fixes are to ensure we don't try to flush in context where we
>> shouldn't.
>>
>> Without them, it can hit various deadlock.
>>
> 
> Qu,
> 
>     Thanks for taking a look. I will send it in v2.

I guess you only need to add the missing fixes?

Thanks,
Qu
Greg KH Aug. 13, 2021, 10:56 a.m. UTC | #4
On Fri, Aug 13, 2021 at 06:41:53PM +0800, Anand Jain wrote:
> 
> 
> On 13/08/2021 18:39, Qu Wenruo wrote:
> > 
> > 
> > On 2021/8/13 下午6:30, Anand Jain wrote:
> > > 
> > > 
> > > On 13/08/2021 18:26, Qu Wenruo wrote:
> > > > 
> > > > 
> > > > On 2021/8/13 下午5:55, Anand Jain wrote:
> > > > > From: Qu Wenruo <wqu@suse.com>
> > > > > 
> > > > > commit c53e9653605dbf708f5be02902de51831be4b009 upstream
> > > > 
> > > > This lacks certain upstream fixes for it:
> > > > 
> > > > f9baa501b4fd6962257853d46ddffbc21f27e344 btrfs: fix deadlock when
> > > > cloning inline extents and using qgroups
> > > > 
> > > > 4d14c5cde5c268a2bc26addecf09489cb953ef64 btrfs: don't flush from
> > > > btrfs_delayed_inode_reserve_metadata
> > > > 
> > > > 6f23277a49e68f8a9355385c846939ad0b1261e7 btrfs: qgroup: don't commit
> > > > transaction when we already hold the handle
> > > > 
> > > > All these fixes are to ensure we don't try to flush in contexts where we
> > > > shouldn't.
> > > > 
> > > > Without them, it can hit various deadlocks.
> > > > 
> > > 
> > > Qu,
> > > 
> > >     Thanks for taking a look. I will send it in v2.
> > 
> > I guess you only need to add the missing fixes?
> 
>   Yeah, maybe it's better to send it as a new set.

So should I drop the existing patches and wait for a whole new series,
or will you send these as an additional set?

And at least one of the above commits needs to go to the 5.10.y tree, I
did not check them all...

thanks,

greg k-h
Anand Jain Aug. 13, 2021, 11:06 a.m. UTC | #5
On 13/08/2021 18:56, Greg KH wrote:
> On Fri, Aug 13, 2021 at 06:41:53PM +0800, Anand Jain wrote:
>>
>>
>> On 13/08/2021 18:39, Qu Wenruo wrote:
>>>
>>>
>>> On 2021/8/13 下午6:30, Anand Jain wrote:
>>>>
>>>>
>>>> On 13/08/2021 18:26, Qu Wenruo wrote:
>>>>>
>>>>>
>>>>> On 2021/8/13 下午5:55, Anand Jain wrote:
>>>>>> From: Qu Wenruo <wqu@suse.com>
>>>>>>
>>>>>> commit c53e9653605dbf708f5be02902de51831be4b009 upstream
>>>>>
>>>>> This lacks certain upstream fixes for it:
>>>>>
>>>>> f9baa501b4fd6962257853d46ddffbc21f27e344 btrfs: fix deadlock when
>>>>> cloning inline extents and using qgroups
>>>>>
>>>>> 4d14c5cde5c268a2bc26addecf09489cb953ef64 btrfs: don't flush from
>>>>> btrfs_delayed_inode_reserve_metadata
>>>>>
>>>>> 6f23277a49e68f8a9355385c846939ad0b1261e7 btrfs: qgroup: don't commit
>>>>> transaction when we already hold the handle
>>>>>
>>>>> All these fixes are to ensure we don't try to flush in contexts where we
>>>>> shouldn't.
>>>>>
>>>>> Without them, it can hit various deadlocks.
>>>>>
>>>>
>>>> Qu,
>>>>
>>>>      Thanks for taking a look. I will send it in v2.
>>>
>>> I guess you only need to add the missing fixes?
>>
>>    Yeah, maybe it's better to send it as a new set.
> 
> So should I drop the existing patches and wait for a whole new series,
> or will you send these as an additional set?

  Greg, I am sending it as an additional set.

> And at least one of the above commits needs to go to the 5.10.y tree, I
> did not check them all...

  I need to look into it.

Thanks, Anand

> thanks,
> 
> greg k-h
>
Anand Jain Aug. 30, 2021, 10:27 p.m. UTC | #6
On 13/08/2021 19:06, Anand Jain wrote:
>
> On 13/08/2021 18:56, Greg KH wrote:
>> On Fri, Aug 13, 2021 at 06:41:53PM +0800, Anand Jain wrote:
>>>
>>> On 13/08/2021 18:39, Qu Wenruo wrote:
>>>>
>>>> On 2021/8/13 下午6:30, Anand Jain wrote:
>>>>>
>>>>> On 13/08/2021 18:26, Qu Wenruo wrote:
>>>>>>
>>>>>> On 2021/8/13 下午5:55, Anand Jain wrote:
>>>>>>> From: Qu Wenruo <wqu@suse.com>
>>>>>>>
>>>>>>> commit c53e9653605dbf708f5be02902de51831be4b009 upstream
>>>>>>
>>>>>> This lacks certain upstream fixes for it:
>>>>>>
>>>>>> f9baa501b4fd6962257853d46ddffbc21f27e344 btrfs: fix deadlock when
>>>>>> cloning inline extents and using qgroups
>>>>>>
>>>>>> 4d14c5cde5c268a2bc26addecf09489cb953ef64 btrfs: don't flush from
>>>>>> btrfs_delayed_inode_reserve_metadata
>>>>>>
>>>>>> 6f23277a49e68f8a9355385c846939ad0b1261e7 btrfs: qgroup: don't commit
>>>>>> transaction when we already hold the handle
>>>>>>
>>>>>> All these fixes are to ensure we don't try to flush in contexts where
>>>>>> we shouldn't.
>>>>>>
>>>>>> Without them, it can hit various deadlocks.
>>>>>
>>>>> Qu,
>>>>>
>>>>>      Thanks for taking a look. I will send it in v2.
>>>>
>>>> I guess you only need to add the missing fixes?
>>>
>>>    Yeah, maybe it's better to send it as a new set.
>>
>> So should I drop the existing patches and wait for a whole new series,
>> or will you send these as an additional set?
>
>   Greg, I am sending it as an additional set.
>
>> And at least one of the above commits needs to go to the 5.10.y tree, I
>> did not check them all...
>
>   I need to look into it.

We don't need 1/7 in 5.10.y; it was a preparatory patch in 5.4.y:
  [PATCH 1/7] btrfs: make qgroup_free_reserved_data take btrfs_inode

The rest of the patches (in patchsets 1 and 2) are already in
stable-5.10.y.

Thx, Anand

> Thanks, Anand
>
>> thanks,
>>
>> greg k-h
>>
diff mbox series

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 7960359dbc70..5448dc62e915 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -945,6 +945,8 @@  enum {
 	BTRFS_ROOT_DEAD_TREE,
 	/* The root has a log tree. Used only for subvolume roots. */
 	BTRFS_ROOT_HAS_LOG_TREE,
+	/* Qgroup flushing is in progress */
+	BTRFS_ROOT_QGROUP_FLUSHING,
 };
 
 /*
@@ -1097,6 +1099,7 @@  struct btrfs_root {
 	spinlock_t qgroup_meta_rsv_lock;
 	u64 qgroup_meta_rsv_pertrans;
 	u64 qgroup_meta_rsv_prealloc;
+	wait_queue_head_t qgroup_flush_wait;
 
 	/* Number of active swapfiles */
 	atomic_t nr_swapfiles;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e6aa94a583e9..e3bcab38a166 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1154,6 +1154,7 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	mutex_init(&root->log_mutex);
 	mutex_init(&root->ordered_extent_mutex);
 	mutex_init(&root->delalloc_mutex);
+	init_waitqueue_head(&root->qgroup_flush_wait);
 	init_waitqueue_head(&root->log_writer_wait);
 	init_waitqueue_head(&root->log_commit_wait[0]);
 	init_waitqueue_head(&root->log_commit_wait[1]);
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 50c45b4fcfd4..b312ac645e08 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3479,17 +3479,58 @@  static int qgroup_unreserve_range(struct btrfs_inode *inode,
 }
 
 /*
- * Reserve qgroup space for range [start, start + len).
+ * Try to free some space for qgroup.
  *
- * This function will either reserve space from related qgroups or doing
- * nothing if the range is already reserved.
+ * For qgroup, there are only 3 ways to free qgroup space:
+ * - Flush nodatacow write
+ *   Any nodatacow write will free its reserved data space at run_delalloc_range().
+ *   In theory, we should only flush nodatacow inodes, but it's not yet
+ *   possible, so we need to flush the whole root.
  *
- * Return 0 for successful reserve
- * Return <0 for error (including -EQUOT)
+ * - Wait for ordered extents
+ *   When ordered extents are finished, their reserved metadata is finally
+ *   converted to per_trans status, which can be freed by later commit
+ *   transaction.
  *
- * NOTE: this function may sleep for memory allocation.
+ * - Commit transaction
+ *   This would free the meta_per_trans space.
+ *   In theory this shouldn't provide much space, but any more qgroup space
+ *   is needed.
  */
-int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+static int try_flush_qgroup(struct btrfs_root *root)
+{
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	/*
+	 * We don't want to run flush again and again, so if there is a running
+	 * one, we won't try to start a new flush, but exit directly.
+	 */
+	if (test_and_set_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state)) {
+		wait_event(root->qgroup_flush_wait,
+			!test_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state));
+		return 0;
+	}
+
+	ret = btrfs_start_delalloc_snapshot(root);
+	if (ret < 0)
+		goto out;
+	btrfs_wait_ordered_extents(root, U64_MAX, 0, (u64)-1);
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+
+	ret = btrfs_commit_transaction(trans);
+out:
+	clear_bit(BTRFS_ROOT_QGROUP_FLUSHING, &root->state);
+	wake_up(&root->qgroup_flush_wait);
+	return ret;
+}
+
+static int qgroup_reserve_data(struct btrfs_inode *inode,
 			struct extent_changeset **reserved_ret, u64 start,
 			u64 len)
 {
@@ -3542,6 +3583,34 @@  int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
 	return ret;
 }
 
+/*
+ * Reserve qgroup space for range [start, start + len).
+ *
+ * This function will either reserve space from related qgroups or do nothing
+ * if the range is already reserved.
+ *
+ * Return 0 for successful reservation
+ * Return <0 for error (including -EQUOT)
+ *
+ * NOTE: This function may sleep for memory allocation, dirty page flushing and
+ *	 commit transaction. So caller should not hold any dirty page locked.
+ */
+int btrfs_qgroup_reserve_data(struct btrfs_inode *inode,
+			struct extent_changeset **reserved_ret, u64 start,
+			u64 len)
+{
+	int ret;
+
+	ret = qgroup_reserve_data(inode, reserved_ret, start, len);
+	if (ret <= 0 && ret != -EDQUOT)
+		return ret;
+
+	ret = try_flush_qgroup(inode->root);
+	if (ret < 0)
+		return ret;
+	return qgroup_reserve_data(inode, reserved_ret, start, len);
+}
+
 /* Free ranges specified by @reserved, normally in error path */
 static int qgroup_free_reserved_data(struct btrfs_inode *inode,
 			struct extent_changeset *reserved, u64 start, u64 len)
@@ -3712,7 +3781,7 @@  static int sub_root_meta_rsv(struct btrfs_root *root, int num_bytes,
 	return num_bytes;
 }
 
-int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+static int qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 				enum btrfs_qgroup_rsv_type type, bool enforce)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -3739,6 +3808,21 @@  int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
 	return ret;
 }
 
+int __btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes,
+				enum btrfs_qgroup_rsv_type type, bool enforce)
+{
+	int ret;
+
+	ret = qgroup_reserve_meta(root, num_bytes, type, enforce);
+	if (ret <= 0 && ret != -EDQUOT)
+		return ret;
+
+	ret = try_flush_qgroup(root);
+	if (ret < 0)
+		return ret;
+	return qgroup_reserve_meta(root, num_bytes, type, enforce);
+}
+
 void btrfs_qgroup_free_meta_all_pertrans(struct btrfs_root *root)
 {
 	struct btrfs_fs_info *fs_info = root->fs_info;