Message ID | 20230509065230.32552-3-ed.tsai@mediatek.com |
---|---|
State | New |
Series | block: improve the share tag set performance |
Hi,

On 2023/05/13 2:12, Bart Van Assche wrote:
> The fair tag sharing algorithm has a negative impact on all SCSI devices
> with multiple logical units. This is because logical units are considered
> active until (request timeout) seconds have elapsed after the logical
> unit stopped being used (see also the blk_mq_tag_idle() call in
> blk_mq_timeout_work()). UFS users are hit by this because UFS 3.0 devices
> have a limited queue depth (32) and because power management commands are
> submitted to a logical unit (WLUN). Hence, it happens often that the
> block layer "active queue" counter is equal to 2 while only one logical
> unit is being used actively (a logical unit backed by NAND flash). The
> performance difference between queue depths 16 and 32 for UFS devices is
> significant.

We met a similar problem before, but I think removing fair tag sharing
might cause some problems, because getting a tag is not fair currently.
For example, if 2 devices share 32 tags and device a issues a large
amount of I/O concurrently while device b issues only one I/O, then with
fair tag sharing removed, device b can see bad I/O latency.

By the way, I tried to propose a way to work around this as follows:

1) disable fair tag sharing until getting a tag finds that no tag is
   available;
2) enable fair tag sharing again if the disk doesn't fail to get a tag
   for a period of time.

Can this approach be considered?

Thanks,
Kuai
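For reference, the equal split under discussion boils down to arithmetic
along the following lines. This is a simplified sketch with illustrative
names, not the exact upstream hctx_may_queue() code:

#include <stdbool.h>

/*
 * Simplified illustration of the fair-sharing cap: when more than one
 * queue is marked active, each queue is limited to roughly an equal
 * fraction of the shared tag space.
 */
static bool may_queue(unsigned int total_tags, unsigned int active_queues,
		      unsigned int my_active_requests)
{
	unsigned int depth = total_tags;

	if (active_queues > 1) {
		/* Divide the tag space evenly over the active queues. */
		depth = total_tags / active_queues;
		if (depth == 0)
			depth = 1;
	}

	return my_active_requests < depth;
}

With total_tags = 32 and active_queues = 2 (the UFS case above, where the
WLUN counts as a second active queue), each queue is capped at 16 requests.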
On 5/12/23 20:09, Yu Kuai wrote:
> On 2023/05/13 2:12, Bart Van Assche wrote:
>> The fair tag sharing algorithm has a negative impact on all SCSI
>> devices with multiple logical units. This is because logical units are
>> considered active until (request timeout) seconds have elapsed after
>> the logical unit stopped being used (see also the blk_mq_tag_idle()
>> call in blk_mq_timeout_work()). UFS users are hit by this because UFS
>> 3.0 devices have a limited queue depth (32) and because power
>> management commands are submitted to a logical unit (WLUN). Hence, it
>> happens often that the block layer "active queue" counter is equal to
>> 2 while only one logical unit is being used actively (a logical unit
>> backed by NAND flash). The performance difference between queue depths
>> 16 and 32 for UFS devices is significant.
>
> We met a similar problem before, but I think removing fair tag sharing
> might cause some problems, because getting a tag is not fair currently.
> For example, if 2 devices share 32 tags and device a issues a large
> amount of I/O concurrently while device b issues only one I/O, then with
> fair tag sharing removed, device b can see bad I/O latency.
>
> By the way, I tried to propose a way to work around this as follows:
>
> 1) disable fair tag sharing until getting a tag finds that no tag is
>    available;
> 2) enable fair tag sharing again if the disk doesn't fail to get a tag
>    for a period of time.
>
> Can this approach be considered?

I'm afraid that this approach won't help for the UFS driver since it is
likely that all tags are in use by a single logical unit during an IOPS
test. Hence, fair sharing would be enabled even when we don't want it to
be enabled.

I propose that we switch to one of these two approaches:
* Either remove the fair tag sharing code entirely and rely on the
  fairness mechanism provided by the sbitmap code. I'm referring to how
  __sbitmap_queue_wake_up() uses the wake_index member variable.
* Or make the behavior of the fairness algorithm configurable from user
  space. One possible approach is to make the proportion of tags for a
  logical unit / NVMe namespace configurable via sysfs. This would make
  it possible to reduce the number of tags for the WLUN of UFS devices.

Thanks,

Bart.
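The second proposal could, for example, reduce to arithmetic like the
following; the function name and the share_pct value (which would come
from a sysfs attribute that does not exist today) are hypothetical:

/*
 * Hypothetical per-LU tag limit derived from a user-configurable share.
 * share_pct would be set from user space, e.g. a small value for a WLUN.
 */
static unsigned int lu_tag_limit(unsigned int total_tags, unsigned int share_pct)
{
	unsigned int depth = total_tags * share_pct / 100;

	return depth ? depth : 1;	/* never starve a logical unit completely */
}

A WLUN could then be confined to a few tags while the NAND-backed logical
unit keeps almost the entire tag space.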
Hi,

On 2023/05/16 23:12, Bart Van Assche wrote:
> On 5/12/23 20:09, Yu Kuai wrote:
>> On 2023/05/13 2:12, Bart Van Assche wrote:
>>> The fair tag sharing algorithm has a negative impact on all SCSI
>>> devices with multiple logical units. This is because logical units
>>> are considered active until (request timeout) seconds have elapsed
>>> after the logical unit stopped being used (see also the
>>> blk_mq_tag_idle() call in blk_mq_timeout_work()). UFS users are hit
>>> by this because UFS 3.0 devices have a limited queue depth (32) and
>>> because power management commands are submitted to a logical unit
>>> (WLUN). Hence, it happens often that the block layer "active queue"
>>> counter is equal to 2 while only one logical unit is being used
>>> actively (a logical unit backed by NAND flash). The performance
>>> difference between queue depths 16 and 32 for UFS devices is
>>> significant.
>>
>> We met a similar problem before, but I think removing fair tag sharing
>> might cause some problems, because getting a tag is not fair currently.
>> For example, if 2 devices share 32 tags and device a issues a large
>> amount of I/O concurrently while device b issues only one I/O, then
>> with fair tag sharing removed, device b can see bad I/O latency.
>>
>> By the way, I tried to propose a way to work around this as follows:
>>
>> 1) disable fair tag sharing until getting a tag finds that no tag is
>>    available;
>> 2) enable fair tag sharing again if the disk doesn't fail to get a tag
>>    for a period of time.
>>
>> Can this approach be considered?
>
> I'm afraid that this approach won't help for the UFS driver since it is
> likely that all tags are in use by a single logical unit during an IOPS
> test. Hence, fair sharing would be enabled even when we don't want it to
> be enabled.

You're right, my original method is not flexible enough.

> I propose that we switch to one of these two approaches:

How about a smoothing method in which a device with more I/O shares more
tags, and each device gets at least one tag?

Thanks,
Kuai

> * Either remove the fair tag sharing code entirely and rely on the
>   fairness mechanism provided by the sbitmap code. I'm referring to how
>   __sbitmap_queue_wake_up() uses the wake_index member variable.
> * Or make the behavior of the fairness algorithm configurable from user
>   space. One possible approach is to make the proportion of tags for a
>   logical unit / NVMe namespace configurable via sysfs. This would make
>   it possible to reduce the number of tags for the WLUN of UFS devices.
>
> Thanks,
>
> Bart.
On 5/17/23 00:49, Yu Kuai wrote:
> On 2023/05/16 23:12, Bart Van Assche wrote:
>> I propose that we switch to one of these two approaches:
>
> How about a smoothing method in which a device with more I/O shares
> more tags, and each device gets at least one tag?

Hi Yu,

hctx_may_queue() is called from the hot path (blk_mq_get_tag()). I'm
pretty sure that adding any nontrivial code in that path will cause a
performance (IOPS) regression. So I don't think that adding a smoothing
method in hctx_may_queue() is a realistic option.

Thanks,

Bart.
Hi,

On 2023/05/18 2:23, Bart Van Assche wrote:
> On 5/17/23 00:49, Yu Kuai wrote:
>> On 2023/05/16 23:12, Bart Van Assche wrote:
>>> I propose that we switch to one of these two approaches:
>>
>> How about a smoothing method in which a device with more I/O shares
>> more tags, and each device gets at least one tag?
>
> Hi Yu,
>
> hctx_may_queue() is called from the hot path (blk_mq_get_tag()). I'm
> pretty sure that adding any nontrivial code in that path will cause a
> performance (IOPS) regression. So I don't think that adding a smoothing
> method in hctx_may_queue() is a realistic option.

Currently, fair sharing in hctx_may_queue() requires two atomic_read()
calls (active_queues and active_requests). I think this smoothing method
could be placed in the get-tag failure path: for example, the more often
a disk fails to get a tag within a period of time, the more tags that
disk is allowed to get, and all the bookkeeping can be updated there
(perhaps by directly recording how many tags a disk may get, so that
hctx_may_queue() still only requires two atomic_read() calls).

Thanks,
Kuai
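A minimal sketch of that bookkeeping, with all names and the window/decay
policy invented purely for illustration, might look like this:

/*
 * Rough sketch of the idea above: count tag-allocation failures per disk
 * over a time window and let disks that failed more often claim a larger
 * slice in the next window, so the hot path only compares against a
 * precomputed limit.
 */
struct disk_share {
	unsigned int fail_count;   /* bumped in the get-tag failure path */
	unsigned int tag_limit;    /* consulted by the fast path */
};

static void recompute_limit(struct disk_share *ds, unsigned int total_tags,
			    unsigned int active_queues)
{
	unsigned int base = total_tags / active_queues;
	unsigned int bonus = ds->fail_count < base ? ds->fail_count : base;

	/* More failures in the last window => a bigger slice, capped at 2x. */
	ds->tag_limit = base + bonus;
	ds->fail_count = 0;        /* start a new observation window */
}

The fast path would then only compare the per-disk request count against
tag_limit, keeping its cost comparable to the current two atomic_read()
calls.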
On 5/17/23 18:49, Yu Kuai wrote:
> Currently, fair sharing in hctx_may_queue() requires two atomic_read()
> calls (active_queues and active_requests). I think this smoothing method
> could be placed in the get-tag failure path: for example, the more often
> a disk fails to get a tag within a period of time, the more tags that
> disk is allowed to get, and all the bookkeeping can be updated there
> (perhaps by directly recording how many tags a disk may get, so that
> hctx_may_queue() still only requires two atomic_read() calls).

That sounds interesting to me. Do you perhaps plan to implement this
approach and to post it as a patch?

Thanks,

Bart.
Hi,

On 2023/05/18 10:23, Bart Van Assche wrote:
> On 5/17/23 18:49, Yu Kuai wrote:
>> Currently, fair sharing in hctx_may_queue() requires two atomic_read()
>> calls (active_queues and active_requests). I think this smoothing
>> method could be placed in the get-tag failure path: for example, the
>> more often a disk fails to get a tag within a period of time, the more
>> tags that disk is allowed to get, and all the bookkeeping can be
>> updated there (perhaps by directly recording how many tags a disk may
>> get, so that hctx_may_queue() still only requires two atomic_read()
>> calls).
>
> That sounds interesting to me. Do you perhaps plan to implement this
> approach and to post it as a patch?

Of course, I'll try to send an RFC patch.

Thanks,
Kuai

> Thanks,
>
> Bart.
On 5/18/23 00:55, Yu Kuai wrote:
> On 2023/05/18 10:23, Bart Van Assche wrote:
>> On 5/17/23 18:49, Yu Kuai wrote:
>>> Currently, fair sharing in hctx_may_queue() requires two atomic_read()
>>> calls (active_queues and active_requests). I think this smoothing
>>> method could be placed in the get-tag failure path: for example, the
>>> more often a disk fails to get a tag within a period of time, the more
>>> tags that disk is allowed to get, and all the bookkeeping can be
>>> updated there (perhaps by directly recording how many tags a disk may
>>> get, so that hctx_may_queue() still only requires two atomic_read()
>>> calls).
>>
>> That sounds interesting to me. Do you perhaps plan to implement this
>> approach and to post it as a patch?
>
> Of course, I'll try to send an RFC patch.

Hi Kuai,

Has this RFC patch already been posted or did I perhaps miss it?

Thanks,

Bart.
diff --git a/drivers/ufs/core/ufshcd.c b/drivers/ufs/core/ufshcd.c
index 17d7bb875fee..e96a50265285 100644
--- a/drivers/ufs/core/ufshcd.c
+++ b/drivers/ufs/core/ufshcd.c
@@ -5149,6 +5149,9 @@ static int ufshcd_slave_configure(struct scsi_device *sdev)
 	blk_queue_update_dma_pad(q, PRDT_DATA_BYTE_COUNT_PAD - 1);
 	if (hba->quirks & UFSHCD_QUIRK_4KB_DMA_ALIGNMENT)
 		blk_queue_update_dma_alignment(q, 4096 - 1);
+
+	blk_queue_flag_clear(QUEUE_FLAG_FAIR_TAG_SHARING, q);
+
 	/*
 	 * Block runtime-pm until all consumers are added.
 	 * Refer ufshcd_setup_links().
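For context, the flag cleared above presumably gates the per-queue depth
cap somewhere in the tag allocation path (wired up by an earlier patch in
this series). One plausible shape of that check, assumed here rather than
taken from the series, would be:

/*
 * Assumed interaction with the flag cleared above (not taken from the
 * series): the fair-share depth cap is skipped for request queues that
 * opted out via QUEUE_FLAG_FAIR_TAG_SHARING.
 */
static bool fair_share_applies(struct request_queue *q)
{
	return test_bit(QUEUE_FLAG_FAIR_TAG_SHARING, &q->queue_flags);
}

Clearing the flag in ufshcd_slave_configure() opts every UFS logical unit
out of the cap while leaving the default behavior unchanged for other
drivers.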
Tag allocation is limited by the fair sharing algorithm. This hurts
performance for UFS devices, because the queue depth available for
general I/O is cut in half as soon as a control command is sent to the
UFS device.

Signed-off-by: Ed Tsai <ed.tsai@mediatek.com>
---
 drivers/ufs/core/ufshcd.c | 3 +++
 1 file changed, 3 insertions(+)