| Message ID | 20200824043121.13421-1-wanghonghao@bytedance.com |
|---|---|
| State | Superseded |
| Series | [1/2] QSLIST: add atomic replace operation |
Hi, I'd like to know if there are any other problems with this patch, or if there is a better implementation to improve the coroutine pool.

王洪浩 <wanghonghao@bytedance.com> wrote on Wed, Aug 26, 2020 at 2:06 PM:
>
> The purpose of this patch is to improve performance without increasing
> memory consumption.
>
> My test case:
> QEMU command line arguments:
> -drive file=/dev/nvme2n1p1,format=raw,if=none,id=local0,cache=none,aio=native \
> -device virtio-blk,id=blk0,drive=local0,iothread=iothread0,num-queues=4 \
> -drive file=/dev/nvme3n1p1,format=raw,if=none,id=local1,cache=none,aio=native \
> -device virtio-blk,id=blk1,drive=local1,iothread=iothread1,num-queues=4 \
>
> Run these two fio jobs at the same time:
>
> [job-vda]
> filename=/dev/vda
> iodepth=64
> ioengine=libaio
> rw=randrw
> bs=4k
> size=300G
> rwmixread=80
> direct=1
> numjobs=2
> runtime=60
>
> [job-vdb]
> filename=/dev/vdb
> iodepth=64
> ioengine=libaio
> rw=randrw
> bs=4k
> size=300G
> rwmixread=90
> direct=1
> numjobs=2
> loops=1
> runtime=60
>
> Without this patch, tested 3 times:
> total IOPS: 278548.1, 312374.1, 276638.2
> With this patch, tested 3 times:
> total IOPS: 368370.9, 335693.2, 327693.1
>
> That is an 18.9% improvement on average.
>
> In addition, we are also using a distributed block storage whose I/O
> latency is much higher than that of local NVMe devices because of the
> network overhead, so it needs a higher iodepth (>= 256) to reach its
> maximum throughput.
> Without this patch, it has a more than 5% chance of calling
> `qemu_coroutine_new`, and the IOPS is less than 100K, while the IOPS
> is about 260K with this patch.
>
> On the other hand, a simpler way to reduce or eliminate the cost of
> `qemu_coroutine_new` is to increase POOL_BATCH_SIZE. But that would
> also bring much more memory consumption, which we don't want; hence
> this patch.
>
> Stefan Hajnoczi <stefanha@redhat.com> wrote on Tue, Aug 25, 2020 at 10:52 PM:
> >
> > On Mon, Aug 24, 2020 at 12:31:21PM +0800, wanghonghao wrote:
> > > This patch replaces the global coroutine queue with a lock-free
> > > stack whose elements are coroutine queues. Threads can put
> > > coroutine queues onto the stack or take queues from it, and each
> > > coroutine queue holds exactly POOL_BATCH_SIZE coroutines. Note that
> > > the stack is not strictly LIFO, but that's enough for a buffer pool.
> > >
> > > Coroutines are put into thread-local pools first on release. Now
> > > the fast paths of both allocation and release are atomic-free, and
> > > there won't be too many coroutines remaining in a single thread,
> > > since POOL_BATCH_SIZE has been reduced to 16.
> > >
> > > In practice, I've run a VM with two block devices bound to two
> > > different iothreads, and run fio with iodepth 128 on each device.
> > > Without this patch, it maintains around 400 coroutines and has
> > > about a 1% chance of calling `qemu_coroutine_new`. With this patch,
> > > it maintains no more than 273 coroutines and doesn't call
> > > `qemu_coroutine_new` after the initial allocations.
> >
> > Does throughput or IOPS change?
> >
> > Is the main purpose of this patch to reduce memory consumption?
> >
> > Stefan
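To make the design discussed above concrete, here is a minimal sketch of a two-level pool along those lines: an atomic-free thread-local queue backed by a global lock-free stack whose elements are full coroutine queues. This is an illustration under assumed names, not the patch itself: `CoroutineBatch`, `global_stack`, `local_pool`, `pool_get`, and `pool_put` are all hypothetical, the real change would live in `util/qemu-coroutine.c`, and the actual series presumably avoids the separate batch allocation used here. It relies only on the `QSLIST` macros from `include/qemu/queue.h` (including the `QSLIST_REPLACE_ATOMIC` added by this patch) and the existing `pool_next` link in `Coroutine`.

```c
/*
 * Hypothetical sketch of the two-level coroutine pool described in the
 * thread above; NOT the actual patch.  Fast paths touch only
 * thread-local state; full batches of exactly POOL_BATCH_SIZE
 * coroutines move through a global lock-free stack.
 */
#include "qemu/osdep.h"
#include "qemu/queue.h"
#include "qemu/coroutine_int.h"     /* Coroutine, with its pool_next link */

#define POOL_BATCH_SIZE 16

typedef QSLIST_HEAD(CoroutinePool, Coroutine) CoroutinePool;

typedef struct CoroutineBatch {
    CoroutinePool list;                 /* exactly POOL_BATCH_SIZE coroutines */
    QSLIST_ENTRY(CoroutineBatch) next;  /* link on the global stack */
} CoroutineBatch;

static QSLIST_HEAD(, CoroutineBatch) global_stack;
static __thread CoroutinePool local_pool;   /* fast path: no atomics */
static __thread unsigned local_size;

static Coroutine *pool_get(void)
{
    Coroutine *co = QSLIST_FIRST(&local_pool);

    if (co) {                                    /* atomic-free fast path */
        QSLIST_REMOVE_HEAD(&local_pool, pool_next);
        local_size--;
        return co;
    }

    /*
     * Slow path: swap our empty list with the whole global stack (one
     * atomic_xchg via the new macro), keep one batch, and push the rest
     * back one by one -- which is why the stack is not strictly LIFO.
     */
    QSLIST_HEAD(, CoroutineBatch) taken = QSLIST_HEAD_INITIALIZER(taken);
    CoroutineBatch *batch;

    QSLIST_REPLACE_ATOMIC(&global_stack, &taken);
    batch = QSLIST_FIRST(&taken);
    if (!batch) {
        return NULL;        /* caller falls back to qemu_coroutine_new() */
    }
    QSLIST_REMOVE_HEAD(&taken, next);
    while (!QSLIST_EMPTY(&taken)) {
        CoroutineBatch *b = QSLIST_FIRST(&taken);
        QSLIST_REMOVE_HEAD(&taken, next);
        QSLIST_INSERT_HEAD_ATOMIC(&global_stack, b, next);
    }

    local_pool = batch->list;           /* adopt the batch's coroutines */
    local_size = POOL_BATCH_SIZE;
    g_free(batch);
    return pool_get();                  /* now hits the fast path */
}

static void pool_put(Coroutine *co)
{
    QSLIST_INSERT_HEAD(&local_pool, co, pool_next);  /* atomic-free fast path */
    if (++local_size < POOL_BATCH_SIZE) {
        return;
    }
    /* Local pool is full: publish it as one batch on the global stack. */
    CoroutineBatch *batch = g_new(CoroutineBatch, 1);
    batch->list = local_pool;
    QSLIST_INIT(&local_pool);
    local_size = 0;
    QSLIST_INSERT_HEAD_ATOMIC(&global_stack, batch, next);
}
```

Draining the whole stack and pushing spare batches back is one simple way to get a lock-free pop without ABA hazards, and it is consistent with the note in the thread that the stack need not be strictly LIFO for a buffer pool.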
On Tue, Sep 29, 2020 at 11:24:14AM +0800, 王洪浩 wrote:
> Hi, I'd like to know if there are any other problems with this patch,
> or if there is a better implementation to improve the coroutine pool.

Please rebase onto qemu.git/master and resend the patch as a top-level
email thread. I think v2 was overlooked because it was sent as a reply:

https://wiki.qemu.org/Contribute/SubmitAPatch

Thanks,
Stefan
```diff
diff --git a/include/qemu/queue.h b/include/qemu/queue.h
index 456a5b01ee..a3ff544193 100644
--- a/include/qemu/queue.h
+++ b/include/qemu/queue.h
@@ -226,6 +226,10 @@ struct {                                                \
         (dest)->slh_first = atomic_xchg(&(src)->slh_first, NULL);        \
 } while (/*CONSTCOND*/0)
 
+#define QSLIST_REPLACE_ATOMIC(dest, src) do {                            \
+        (src)->slh_first = atomic_xchg(&(dest)->slh_first, (src)->slh_first); \
+} while (/*CONSTCOND*/0)
+
 #define QSLIST_REMOVE_HEAD(head, field) do {                             \
         typeof((head)->slh_first) elm = (head)->slh_first;               \
         (head)->slh_first = elm->field.sle_next;                         \
```
Replace a queue with another atomically. It's useful when we need to
transfer queues between threads.

Signed-off-by: wanghonghao <wanghonghao@bytedance.com>
---
 include/qemu/queue.h | 4 ++++
 1 file changed, 4 insertions(+)
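As a usage illustration (the `Item` type and `exchange_with_shared` helper below are made up for this sketch, not part of the patch): the macro performs a single `atomic_xchg` on `dest`'s head, so in one step the caller publishes its private list and takes ownership of whatever `dest` held before. Only `dest` is accessed atomically, so `src` must remain private to the calling thread for this to be safe.

```c
#include "qemu/osdep.h"
#include "qemu/queue.h"

typedef struct Item {
    QSLIST_ENTRY(Item) next;
} Item;

typedef QSLIST_HEAD(ItemList, Item) ItemList;

/* A list head shared between threads. */
static ItemList shared = QSLIST_HEAD_INITIALIZER(shared);

/*
 * Swap a thread-private list with the shared one in one atomic
 * operation: "shared" receives *mine's elements, and *mine is left
 * holding whatever "shared" contained before the call.
 */
static void exchange_with_shared(ItemList *mine)
{
    QSLIST_REPLACE_ATOMIC(&shared, mine);
}
```

Unlike the existing `QSLIST_MOVE_ATOMIC`, which only drains `src` into `dest`, the replace variant swaps the two lists, which is what makes it suitable for handing whole queues back and forth between threads.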