Message ID: 1591810159-240929-1-git-send-email-john.garry@huawei.com
Series: blk-mq/scsi: Provide hostwide shared tags for SCSI HBAs
On Thu, Jun 11, 2020 at 01:29:07AM +0800, John Garry wrote:
> Hi all,
>
> Here is v7 of the patchset.
>
> In this version of the series, we keep the shared sbitmap for driver tags,
> and introduce changes to fix up the tag budgeting across request queues
> (and associated debugfs changes).
>
> Some performance figures:
>
> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are included,
> but it is not always an appropriate scheduler to use.
>
> Tag depth             4000 (default)    260**
>
> Baseline:
> none sched:           2290K IOPS        894K
> mq-deadline sched:    2341K IOPS        2313K
>
> Final, host_tagset=0 in LLDD*
> none sched:           2289K IOPS        703K
> mq-deadline sched:    2337K IOPS        2291K
>
> Final:
> none sched:           2281K IOPS        1101K
> mq-deadline sched:    2322K IOPS        1278K
>
> * this is relevant as this is the performance in supporting but not
> enabling the feature
> ** depth=260 is relevant as the point where we are regularly waiting for
> tags to be available. Figures were a bit unstable here for testing.
>
> A copy of the patches can be found here:
> https://github.com/hisilicon/kernel-dev/commits/private-topic-blk-mq-shared-tags-rfc-v7
>
> And to progress this series, we need the following to go in first, when ready:
> https://lore.kernel.org/linux-scsi/20200430131904.5847-1-hare@suse.de/

I'd suggest adding options to enable shared tags for null_blk & scsi_debug
in V8, so that it is easier to verify the changes without real hardware.

Thanks,
Ming
On 11/06/2020 04:07, Ming Lei wrote:
[...]
> I'd suggest adding options to enable shared tags for null_blk & scsi_debug
> in V8, so that it is easier to verify the changes without real hardware.

ok, fine, I can look at including that. To stop the series getting too
large, I might spin off the early patches, which are not strictly related.

Thanks,
John
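A rough sketch of how the testing suggested above might look once such options exist. The shared_tag_bitmap parameter name is an assumption (it is exactly the knob being requested here); queue_mode, submit_queues, numactl and the fio options shown are existing ones.

# Load null_blk in multi-queue mode with several submission queues; the
# shared-tags knob below is hypothetical until the option discussed above lands.
modprobe null_blk queue_mode=2 submit_queues=4 shared_tag_bitmap=1

# Drive IO from a single CPU and switch schedulers, mirroring the tests in this thread.
numactl -C 13 fio --name=shared-tags-test --filename=/dev/nullb0 --rw=randread \
        --bs=4k --iodepth=32 --numjobs=1 --ioscheduler=mq-deadline --group_reporting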
> On 11/06/2020 04:07, Ming Lei wrote:
> >> Using 12x SAS SSDs on hisi_sas v3 hw. mq-deadline results are
> >> included, but it is not always an appropriate scheduler to use.
[...]

John -

I tried the V7 series and debugged further on the mq-deadline interface.
This time I used another setup, since an HDD-based setup is not readily
available to me. In fact, I was able to reproduce the issue very easily
using a single scsi_device as well. BTW, this is not an issue with this RFC
but a generic issue; since I have converted the Broadcom product to
nr_hw_queue > 1 using this RFC, it becomes noticeable now.

Problem - Using the below command I see heavy CPU utilization in
"native_queued_spin_lock_slowpath". This is because the kblockd work queue
is submitting IO from all the CPUs even though fio is bound to a single CPU.
Lock contention from "dd_dispatch_request" is causing this.

numactl -C 13 fio single.fio --iodepth=32 --bs=4k --rw=randread
--ioscheduler=none --numjobs=1 --cpus_allowed_policy=split
--ioscheduler=mq-deadline --group_reporting --filename=/dev/sdd

While running the above command, ideally we expect only kworker/13 to be
active. But as you can see below, all the CPUs are attempting submission,
and a lot of the CPU consumption is due to lock contention.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 2726 root       0 -20       0      0      0 R  56.5  0.0   0:53.20 kworker/13:1H-k
 7815 root      20   0  712404  15536   2228 R  43.2  0.0   0:05.03 fio
 2792 root       0 -20       0      0      0 I  26.6  0.0   0:22.19 kworker/18:1H-k
 2791 root       0 -20       0      0      0 I  19.9  0.0   0:17.17 kworker/19:1H-k
 1419 root       0 -20       0      0      0 I  19.6  0.0   0:17.03 kworker/20:1H-k
 2793 root       0 -20       0      0      0 I  18.3  0.0   0:15.64 kworker/21:1H-k
 1424 root       0 -20       0      0      0 I  17.3  0.0   0:14.99 kworker/22:1H-k
 2626 root       0 -20       0      0      0 I  16.9  0.0   0:14.68 kworker/26:1H-k
 2794 root       0 -20       0      0      0 I  16.9  0.0   0:14.87 kworker/23:1H-k
 2795 root       0 -20       0      0      0 I  16.9  0.0   0:14.81 kworker/24:1H-k
 2797 root       0 -20       0      0      0 I  16.9  0.0   0:14.62 kworker/27:1H-k
 1415 root       0 -20       0      0      0 I  16.6  0.0   0:14.44 kworker/30:1H-k
 2669 root       0 -20       0      0      0 I  16.6  0.0   0:14.38 kworker/31:1H-k
 2796 root       0 -20       0      0      0 I  16.6  0.0   0:14.74 kworker/25:1H-k
 2799 root       0 -20       0      0      0 I  16.6  0.0   0:14.56 kworker/28:1H-k
 1425 root       0 -20       0      0      0 I  16.3  0.0   0:14.21 kworker/34:1H-k
 2746 root       0 -20       0      0      0 I  16.3  0.0   0:14.33 kworker/32:1H-k
 2798 root       0 -20       0      0      0 I  16.3  0.0   0:14.50 kworker/29:1H-k
 2800 root       0 -20       0      0      0 I  16.3  0.0   0:14.27 kworker/33:1H-k
 1423 root       0 -20       0      0      0 I  15.9  0.0   0:14.10 kworker/54:1H-k
 1784 root       0 -20       0      0      0 I  15.9  0.0   0:14.03 kworker/55:1H-k
 2801 root       0 -20       0      0      0 I  15.9  0.0   0:14.15 kworker/35:1H-k
 2815 root       0 -20       0      0      0 I  15.9  0.0   0:13.97 kworker/56:1H-k
 1484 root       0 -20       0      0      0 I  15.6  0.0   0:13.90 kworker/57:1H-k
 1485 root       0 -20       0      0      0 I  15.6  0.0   0:13.82 kworker/59:1H-k
 1519 root       0 -20       0      0      0 I  15.6  0.0   0:13.64 kworker/62:1H-k
 2315 root       0 -20       0      0      0 I  15.6  0.0   0:13.87 kworker/58:1H-k
 2627 root       0 -20       0      0      0 I  15.6  0.0   0:13.69 kworker/61:1H-k
 2816 root       0 -20       0      0      0 I  15.6  0.0   0:13.75 kworker/60:1H-k

I root-caused this issue:

The block layer always queues IO on the hctx mapped to CPU-13, but the hw
queue is run from all the hctx contexts. In my test, hctx48 has queued all
the IOs; no other hctx has queued IO, but all the hctxs are accumulating
"run" counts.

# cat hctx48/queued
2087058

# cat hctx*/run
151318
30038
83110
50680
69907
60391
111239
18036
33935
91648
34582
22853
61286
19489

The below patch has a fix: run the hctx queue for which the request was
completed instead of running all the hardware queues. If this looks like a
valid fix, please include it in V8, or I can post a separate patch for it.
I just want some level of review from this discussion.

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 0652acd..f52118f 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -554,6 +554,7 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
        struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(req);
        struct scsi_device *sdev = cmd->device;
        struct request_queue *q = sdev->request_queue;
+       struct blk_mq_hw_ctx *mq_hctx = req->mq_hctx;

        if (blk_update_request(req, error, bytes))
                return true;
@@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
            !list_empty(&sdev->host->starved_list))
                kblockd_schedule_work(&sdev->requeue_work);
        else
-               blk_mq_run_hw_queues(q, true);
+               blk_mq_run_hw_queue(mq_hctx, true);
+               //blk_mq_run_hw_queues(q, true);

        percpu_ref_put(&q->q_usage_counter);
        return false;

After the above patch, only kworker/13 is actively doing submission:

 3858 root       0 -20       0      0      0 I  22.9  0.0   3:24.04 kworker/13:1H-k
16768 root      20   0  712008  14968   2180 R  21.6  0.0   0:03.27 fio
16769 root      20   0  712012  14968   2180 R  21.6  0.0   0:03.27 fio

Without the above patch the 24-SSD setup gives 1.5M IOPS, and with it 3.2M
IOPS. I will continue my testing.

Thanks,
Kashyap
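One way to confirm this kind of contention (a sketch; /dev/sdd and the bound CPU are simply the values from the report above) is to profile system-wide while the fio job runs and then read the same per-hctx debugfs counters quoted above:

# Heavy time in native_queued_spin_lock_slowpath called from dd_dispatch_request
# indicates the scheduler lock bouncing between the per-CPU kblockd workers.
perf record -a -g -- sleep 10
perf report --sort symbol | head -20

# Per-hctx submission/run counters referenced in the report:
grep . /sys/kernel/debug/block/sdd/hctx*/queued
grep . /sys/kernel/debug/block/sdd/hctx*/run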
On Sat, Jun 13, 2020 at 12:17:37AM +0530, Kashyap Desai wrote:
[...]
> @@ -595,7 +596,8 @@ static bool scsi_end_request(struct request *req, blk_status_t error,
>             !list_empty(&sdev->host->starved_list))
>                 kblockd_schedule_work(&sdev->requeue_work);
>         else
> -               blk_mq_run_hw_queues(q, true);
> +               blk_mq_run_hw_queue(mq_hctx, true);
> +               //blk_mq_run_hw_queues(q, true);

This way may cause IO hang because ->device_busy is shared by all hctxs.

Thanks,
Ming
> > -               blk_mq_run_hw_queues(q, true);
> > +               blk_mq_run_hw_queue(mq_hctx, true);
> > +               //blk_mq_run_hw_queues(q, true);
>
> This way may cause IO hang because ->device_busy is shared by all hctxs.

From the SCSI stack, if we attempt to run all hw queues, is it possible
that the block layer actually runs hw queues which have not queued any IO?
Currently, in the case of mq-deadline, IOs are inserted using
"dd_insert_request". This function adds IOs to the elevator data, which is
per request queue and not per hctx. When there is an attempt to run an
hctx, "blk_mq_sched_has_work" checks for pending work, which is again per
request queue and not per hctx. Because of this, IOs queued on only one
hctx get run from all the hctxs, and this creates unnecessary lock
contention.

How about the below patch?

diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..1d30bd3 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!hctx->queued)
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

Kashyap
On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
[...]
> From the SCSI stack, if we attempt to run all hw queues, is it possible
> that the block layer actually runs hw queues which have not queued any IO?
> Currently, in the case of mq-deadline, IOs are inserted using
> "dd_insert_request". This function adds IOs to the elevator data, which is
> per request queue and not per hctx. When there is an attempt to run an
> hctx, "blk_mq_sched_has_work" checks for pending work, which is again per
> request queue and not per hctx. Because of this, IOs queued on only one
> hctx get run from all the hctxs, and this creates unnecessary lock
> contention.

Deadline is meant for HDDs / slow disks, so the lock contention shouldn't
have been a problem, given that multi-queue HDDs were seldom seen before
this patchset.

Another related issue is the default scheduler: I guess deadline should
still have been the default io scheduler for HDDs attached to this kind of
HBA, with multiple reply queues and a single submission queue.

> How about the below patch?
>
> diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> index 126021f..1d30bd3 100644
> --- a/block/blk-mq-sched.h
> +++ b/block/blk-mq-sched.h
> @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
>  {
>         struct elevator_queue *e = hctx->queue->elevator;
>
> +       /* If current hctx has not queued any request, there is no need to run.
> +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> +        * running specific hctx.
> +        */
> +       if (!hctx->queued)
> +               return false;
> +
>         if (e && e->type->ops.has_work)
>                 return e->type->ops.has_work(hctx);

->queued is only ever increased, never decreased (it exists for debug
purposes so far), so it can't be relied on for this.

One approach is to add a similar counter and maintain it in the scheduler's
insert/dispatch callbacks.

Thanks,
Ming
> On Mon, Jun 15, 2020 at 12:27:30PM +0530, Kashyap Desai wrote:
[...]
> ->queued is only ever increased, never decreased (it exists for debug
> purposes so far), so it can't be relied on for this.

Thanks. I overlooked that it is only an incrementing counter.

> One approach is to add a similar counter and maintain it in the
> scheduler's insert/dispatch callbacks.

I tried the below and I see performance in the expected range.

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index fdcc2c1..ea201d0 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -485,6 +485,7 @@ void blk_mq_sched_insert_request(struct request *rq, bool at_head,

                list_add(&rq->queuelist, &list);
                e->type->ops.insert_requests(hctx, &list, at_head);
+               atomic_inc(&hctx->elevator_queued);
        } else {
                spin_lock(&ctx->lock);
                __blk_mq_insert_request(hctx, rq, at_head);
@@ -511,8 +512,10 @@ void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
        percpu_ref_get(&q->q_usage_counter);

        e = hctx->queue->elevator;
-       if (e && e->type->ops.insert_requests)
+       if (e && e->type->ops.insert_requests) {
                e->type->ops.insert_requests(hctx, list, false);
+               atomic_inc(&hctx->elevator_queued);
+       }
        else {
                /*
                 * try to issue requests directly if the hw queue isn't
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..48f1824 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -517,8 +517,10 @@ void blk_mq_free_request(struct request *rq)
        struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

        if (rq->rq_flags & RQF_ELVPRIV) {
-               if (e && e->type->ops.finish_request)
+               if (e && e->type->ops.finish_request) {
                        e->type->ops.finish_request(rq);
+                       atomic_dec(&hctx->elevator_queued);
+               }
                if (rq->elv.icq) {
                        put_io_context(rq->elv.icq->ioc);
                        rq->elv.icq = NULL;
@@ -2571,6 +2573,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
                goto free_hctx;

        atomic_set(&hctx->nr_active, 0);
+       atomic_set(&hctx->elevator_queued, 0);
        if (node == NUMA_NO_NODE)
                node = set->numa_node;
        hctx->numa_node = node;
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
         * shared across request queues.
         */
        atomic_t nr_active;
+       /**
+        * @elevator_queued: Number of queued requests on hctx.
+        */
+       atomic_t elevator_queued;

        /** @cpuhp_online: List to store request if CPU is going to die */
        struct hlist_node cpuhp_online;
On 6/17/20 1:26 PM, Kashyap Desai wrote:
[...]
> diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
> index 66711c7..ea1ddb1 100644
> --- a/include/linux/blk-mq.h
> +++ b/include/linux/blk-mq.h
> @@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
>          * shared across request queues.
>          */
>         atomic_t nr_active;
> +       /**
> +        * @elevator_queued: Number of queued requests on hctx.
> +        */
> +       atomic_t elevator_queued;
>
>         /** @cpuhp_online: List to store request if CPU is going to die */
>         struct hlist_node cpuhp_online;

Would it make sense to move it into the elevator itself?
It's a bit odd having a value 'elevator_queued' with no direct reference to
the elevator...

Cheers,

Hannes
--
Dr. Hannes Reinecke                Teamlead Storage & Networking
hare@suse.de                                   +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer
On Mon, Jun 22, 2020 at 08:24:39AM +0200, Hannes Reinecke wrote:
[...]
> Would it make sense to move it into the elevator itself?

That was my initial suggestion: do the counter just for bfq & mq-deadline,
then we needn't pay the cost for the others.

Thanks,
Ming
> > Would it make sense to move it into the elevator itself?

I am not sure where exactly I should add this counter, since I need a
counter per hctx and the elevator data is per request-queue object.
Please suggest.

> That was my initial suggestion: do the counter just for bfq & mq-deadline,
> then we needn't pay the cost for the others.

I have updated the patch:

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index a1123d4..3e0005c 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -4640,6 +4640,12 @@ static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
        /*
         * Avoiding lock: a race on bfqd->busy_queues should cause at
         * most a call to dispatch for nothing
@@ -5554,6 +5561,7 @@ static void bfq_insert_requests(struct blk_mq_hw_ctx *hctx,
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                bfq_insert_request(hctx, rq, at_head);
+               atomic_inc(&hctx->elevator_queued);
        }
 }

@@ -5925,6 +5933,7 @@ static void bfq_finish_requeue_request(struct request *rq)

        if (likely(rq->rq_flags & RQF_STARTED)) {
                unsigned long flags;
+               struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

                spin_lock_irqsave(&bfqd->lock, flags);

@@ -5934,6 +5943,7 @@ static void bfq_finish_requeue_request(struct request *rq)
                bfq_completed_request(bfqq, bfqd);
                bfq_finish_requeue_request_body(bfqq);

+               atomic_dec(&hctx->elevator_queued);
                spin_unlock_irqrestore(&bfqd->lock, flags);
        } else {
                /*
diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
index 126021f..946b47a 100644
--- a/block/blk-mq-sched.h
+++ b/block/blk-mq-sched.h
@@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct elevator_queue *e = hctx->queue->elevator;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        if (e && e->type->ops.has_work)
                return e->type->ops.has_work(hctx);

diff --git a/block/blk-mq.c b/block/blk-mq.c
index f73a2f9..82dd152 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -2571,6 +2571,7 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
                goto free_hctx;

        atomic_set(&hctx->nr_active, 0);
+       atomic_set(&hctx->elevator_queued, 0);
        if (node == NUMA_NO_NODE)
                node = set->numa_node;
        hctx->numa_node = node;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index b57470e..703ac55 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -533,6 +533,7 @@ static void dd_insert_requests(struct blk_mq_hw_ctx *hctx,
                rq = list_first_entry(list, struct request, queuelist);
                list_del_init(&rq->queuelist);
                dd_insert_request(hctx, rq, at_head);
+               atomic_inc(&hctx->elevator_queued);
        }
        spin_unlock(&dd->lock);
 }
@@ -562,6 +563,7 @@ static void dd_prepare_request(struct request *rq)
 static void dd_finish_request(struct request *rq)
 {
        struct request_queue *q = rq->q;
+       struct blk_mq_hw_ctx *hctx = rq->mq_hctx;

        if (blk_queue_is_zoned(q)) {
                struct deadline_data *dd = q->elevator->elevator_data;
@@ -570,15 +572,23 @@ static void dd_finish_request(struct request *rq)
                spin_lock_irqsave(&dd->zone_lock, flags);
                blk_req_zone_write_unlock(rq);
                if (!list_empty(&dd->fifo_list[WRITE]))
-                       blk_mq_sched_mark_restart_hctx(rq->mq_hctx);
+                       blk_mq_sched_mark_restart_hctx(hctx);
                spin_unlock_irqrestore(&dd->zone_lock, flags);
        }
+       atomic_dec(&hctx->elevator_queued);
 }

 static bool dd_has_work(struct blk_mq_hw_ctx *hctx)
 {
        struct deadline_data *dd = hctx->queue->elevator->elevator_data;

+       /* If current hctx has not queued any request, there is no need to run.
+        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
+        * running specific hctx.
+        */
+       if (!atomic_read(&hctx->elevator_queued))
+               return false;
+
        return !list_empty_careful(&dd->dispatch) ||
                !list_empty_careful(&dd->fifo_list[0]) ||
                !list_empty_careful(&dd->fifo_list[1]);
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 66711c7..ea1ddb1 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -139,6 +139,10 @@ struct blk_mq_hw_ctx {
         * shared across request queues.
         */
        atomic_t nr_active;
+       /**
+        * @elevator_queued: Number of queued requests on hctx.
+        */
+       atomic_t elevator_queued;

        /** @cpuhp_online: List to store request if CPU is going to die */
        struct hlist_node cpuhp_online;
> > diff --git a/block/blk-mq-sched.h b/block/blk-mq-sched.h
> > index 126021f..946b47a 100644
> > --- a/block/blk-mq-sched.h
> > +++ b/block/blk-mq-sched.h
> > @@ -74,6 +74,13 @@ static inline bool blk_mq_sched_has_work(struct blk_mq_hw_ctx *hctx)
> >  {
> >         struct elevator_queue *e = hctx->queue->elevator;
> >
> > +       /* If current hctx has not queued any request, there is no need to run.
> > +        * blk_mq_run_hw_queue() on hctx which has queued IO will handle
> > +        * running specific hctx.
> > +        */
> > +       if (!atomic_read(&hctx->elevator_queued))
> > +               return false;
> > +

I missed this one. I will remove the above code, since this is now handled
in the mq-deadline and bfq-iosched *has_work* callbacks.

> >         if (e && e->type->ops.has_work)
> >                 return e->type->ops.has_work(hctx);
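A possible way to verify a counter-based fix like the one above, reusing the debugfs files quoted earlier in the thread (the device name and CPU binding are just the earlier example values): after rerunning the bound fio job, only the hctx that shows a non-zero "queued" count should keep accumulating "run".

echo mq-deadline > /sys/block/sdd/queue/scheduler
numactl -C 13 fio --name=verify --filename=/dev/sdd --rw=randread --bs=4k \
        --iodepth=32 --numjobs=1 --group_reporting --time_based --runtime=30
# Compare which hctxs queued requests against which ones were actually run.
grep . /sys/kernel/debug/block/sdd/hctx*/queued /sys/kernel/debug/block/sdd/hctx*/run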