Message ID | YHaez6iN2HHYxYOh@T590 |
---|---|
State | New |
Series | [bug report] shared tags causes IO hang and performance drop |
> Hi Ming,
>
> > It is reported inside RH that CPU utilization is increased ~20% when
> > running a simple FIO test inside a VM whose disk is built on an image stored
> > on XFS/megaraid_sas.
> >
> > When I try to investigate by reproducing the issue via scsi_debug, I
> > found an IO hang when running randread IO (8k, direct IO, libaio) on a
> > scsi_debug disk created by the following command:
> >
> > modprobe scsi_debug host_max_queue=128 submit_queues=$NR_CPUS virtual_gb=256
> >
> So I can recreate this hang when using the mq-deadline IO sched for scsi_debug,
> in that fio does not exit. I'm using v5.12-rc7.

I can also recreate this issue using mq-deadline. Using <none>, there is no
IO hang.
Also, if I run a script that changes the scheduler periodically
(none, mq-deadline), the sysfs entry hangs. Here is the call trace:

Call Trace:
[ 1229.879862]  __schedule+0x29d/0x7a0
[ 1229.879871]  schedule+0x3c/0xa0
[ 1229.879875]  blk_mq_freeze_queue_wait+0x62/0x90
[ 1229.879880]  ? finish_wait+0x80/0x80
[ 1229.879884]  elevator_switch+0x12/0x40
[ 1229.879888]  elv_iosched_store+0x79/0x120
[ 1229.879892]  ? kernfs_fop_write_iter+0xc7/0x1b0
[ 1229.879897]  queue_attr_store+0x42/0x70
[ 1229.879901]  kernfs_fop_write_iter+0x11f/0x1b0
[ 1229.879905]  new_sync_write+0x11f/0x1b0
[ 1229.879912]  vfs_write+0x184/0x250
[ 1229.879915]  ksys_write+0x59/0xd0
[ 1229.879917]  do_syscall_64+0x33/0x40
[ 1229.879922]  entry_SYSCALL_64_after_hwframe+0x44/0xae

I tried both 5.12.0-rc1 and 5.11.0-rc2+, and the behavior is the same.
Let me also check megaraid_sas and see whether this is something generic or a
special case of scsi_debug.

> Do you have any idea of what changed to cause this, as we would have tested
> this before? Or maybe only the none IO sched on scsi_debug. And normally 4k
> block size and only rw=read (for me, anyway).
>
> Note that host_max_queue=128 will cap the submit queue depth at 128, while it
> would be 192 by default.
>
> Will check more... including CPU utilization.
>
> Thanks,
> John
>
> > Looks it is caused by SCHED_RESTART, because the current RESTART is just
> > done on the current hctx, and we may need to restart all hctxs for shared
> > tags; the issue can be fixed by the appended patch. However, IOPS
> > drops more than 10% with the patch.
> >
> > So any idea for this issue and the original performance drop?
> >
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
> > index e1e997af89a0..45188f7aa789 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -59,10 +59,18 @@ EXPORT_SYMBOL_GPL(blk_mq_sched_mark_restart_hctx);
> >
> >  void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> >  {
> > +	bool shared_tag = blk_mq_is_sbitmap_shared(hctx->flags);
> > +
> > +	if (shared_tag)
> > +		blk_mq_run_hw_queues(hctx->queue, true);
> > +
> >  	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> >  		return;
> >  	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> >
> > +	if (shared_tag)
> > +		return;
> > +
> >  	/*
> >  	 * Order clearing SCHED_RESTART and list_empty_careful(&hctx->dispatch)
> >  	 * in blk_mq_run_hw_queue(). Its pair is the barrier in
> >
> > Thanks,
> > Ming
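A minimal sketch of the reproducer described in this report, assuming a scsi_debug disk and with the device path, iodepth and job count as placeholders to adjust for the local setup (the full script Ming used appears later in the thread):

    modprobe scsi_debug host_max_queue=128 submit_queues=$(nproc) virtual_gb=256
    DEV=/dev/sdX    # placeholder: the scsi_debug disk reported by lsscsi
    echo mq-deadline > /sys/block/$(basename $DEV)/queue/scheduler
    fio --name=hang-test --filename=$DEV --rw=randread --bs=8k \
        --ioengine=libaio --iodepth=64 --direct=1 --runtime=60 \
        --numjobs=4 --group_reporting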
> > I tried both - 5.12.0-rc1 and 5.11.0-rc2+ and there is a same behavior.
> > Let me also check megaraid_sas and see if anything generic or this is
> > a special case of scsi_debug.
>
> As I mentioned, it could be one generic issue wrt. SCHED_RESTART.
> shared tags might have to restart all hctx since all share same tags.

Ming - I tried many combinations on the MR shared host tag driver, but there is
not a single instance of an IO hang.
I will keep trying, but when I looked at the scsi_debug driver code I found the
odd settings below:
can_queue of the adapter is set to 128, but queue_depth of the sdev is set to 255.

If I apply the patch below, the scsi_debug driver's hang is also resolved.
Ideally sdev->queue_depth should not exceed shost->can_queue.
Not sure why cmd_per_lun is 255 in the scsi_debug driver, which can easily
exceed can_queue. I will simulate something similar in the MR driver and see
how it behaves w.r.t. the IO hang issue.

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 70165be10f00..dded762540ee 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -218,7 +218,7 @@ static const char *sdebug_version_date = "20200710";
  */
 #define SDEBUG_CANQUEUE_WORDS	3	/* a WORD is bits in a long */
 #define SDEBUG_CANQUEUE	(SDEBUG_CANQUEUE_WORDS * BITS_PER_LONG)
-#define DEF_CMD_PER_LUN	255
+#define DEF_CMD_PER_LUN	SDEBUG_CANQUEUE

 /* UA - Unit Attention; SA - Service Action; SSU - Start Stop Unit */
 #define F_D_IN			1	/* Data-in command (e.g. READ) */
@@ -7558,6 +7558,7 @@ static int sdebug_driver_probe(struct device *dev)
 	sdbg_host = to_sdebug_host(dev);

 	sdebug_driver_template.can_queue = sdebug_max_queue;
+	sdebug_driver_template.cmd_per_lun = sdebug_max_queue;
 	if (!sdebug_clustering)
 		sdebug_driver_template.dma_boundary = PAGE_SIZE - 1;

>
> Thanks,
> Ming
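The mismatch Kashyap describes can be checked on a live system from sysfs; the host number and disk name below are placeholders:

    # host-wide queue depth advertised by the LLD (shost->can_queue)
    cat /sys/class/scsi_host/host0/can_queue
    cat /sys/class/scsi_host/host0/cmd_per_lun
    # per-device queue depth (sdev->queue_depth)
    cat /sys/block/sdb/device/queue_depth

With the scsi_debug defaults quoted above, the per-device value (255) exceeds can_queue (128), which is the inconsistency the patch removes.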
On 2021-04-14 9:59 a.m., Kashyap Desai wrote: >>> I tried both - 5.12.0-rc1 and 5.11.0-rc2+ and there is a same > behavior. >>> Let me also check megaraid_sas and see if anything generic or this is >>> a special case of scsi_debug. >> >> As I mentioned, it could be one generic issue wrt. SCHED_RESTART. >> shared tags might have to restart all hctx since all share same tags. > > Ming - I tried many combination on MR shared host tag driver but there is > no single instance of IO hang. > I will keep trying, but when I look at scsi_debug driver code I found > below odd settings in scsi_debug driver. > can_queue of adapter is set to 128 but queue_depth of sdev is set to 255. > > If I apply below patch, scsi_debug driver's hang is also resolved. Ideally > sdev->queue depth cannot exceed shost->can_queue. > Not sure why cmd_per_lun is 255 in scsi_debug driver which can easily > exceed can_queue. I will simulate something similar in MR driver and see > how it behaves w.r.t IO hang issue. > > diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c > index 70165be10f00..dded762540ee 100644 > --- a/drivers/scsi/scsi_debug.c > +++ b/drivers/scsi/scsi_debug.c > @@ -218,7 +218,7 @@ static const char *sdebug_version_date = "20200710"; > */ > #define SDEBUG_CANQUEUE_WORDS 3 /* a WORD is bits in a long */ > #define SDEBUG_CANQUEUE (SDEBUG_CANQUEUE_WORDS * BITS_PER_LONG) So SDEBUG_CANQUEUE is 3*64 = 192 and is a hard limit (it is used to dimension an array). Should it be upped to 4, say? [That will slow things down a bit if that is an issue.] > -#define DEF_CMD_PER_LUN 255 > +#define DEF_CMD_PER_LUN SDEBUG_CANQUEUE > > /* UA - Unit Attention; SA - Service Action; SSU - Start Stop Unit */ > #define F_D_IN 1 /* Data-in command (e.g. READ) */ > @@ -7558,6 +7558,7 @@ static int sdebug_driver_probe(struct device *dev) > sdbg_host = to_sdebug_host(dev); > > sdebug_driver_template.can_queue = sdebug_max_queue; > + sdebug_driver_template.cmd_per_lun = sdebug_max_queue; I'll push out a patch shortly. Doug Gilbert > if (!sdebug_clustering) > sdebug_driver_template.dma_boundary = PAGE_SIZE - 1; >> >> >> Thanks, >> Ming
On Wed, Apr 14, 2021 at 07:29:07PM +0530, Kashyap Desai wrote: > > > I tried both - 5.12.0-rc1 and 5.11.0-rc2+ and there is a same > behavior. > > > Let me also check megaraid_sas and see if anything generic or this is > > > a special case of scsi_debug. > > > > As I mentioned, it could be one generic issue wrt. SCHED_RESTART. > > shared tags might have to restart all hctx since all share same tags. > > Ming - I tried many combination on MR shared host tag driver but there is > no single instance of IO hang. > I will keep trying, but when I look at scsi_debug driver code I found > below odd settings in scsi_debug driver. > can_queue of adapter is set to 128 but queue_depth of sdev is set to 255. > > If I apply below patch, scsi_debug driver's hang is also resolved. Ideally > sdev->queue depth cannot exceed shost->can_queue. > Not sure why cmd_per_lun is 255 in scsi_debug driver which can easily > exceed can_queue. I will simulate something similar in MR driver and see > how it behaves w.r.t IO hang issue. > > diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c > index 70165be10f00..dded762540ee 100644 > --- a/drivers/scsi/scsi_debug.c > +++ b/drivers/scsi/scsi_debug.c > @@ -218,7 +218,7 @@ static const char *sdebug_version_date = "20200710"; > */ > #define SDEBUG_CANQUEUE_WORDS 3 /* a WORD is bits in a long */ > #define SDEBUG_CANQUEUE (SDEBUG_CANQUEUE_WORDS * BITS_PER_LONG) > -#define DEF_CMD_PER_LUN 255 > +#define DEF_CMD_PER_LUN SDEBUG_CANQUEUE > > /* UA - Unit Attention; SA - Service Action; SSU - Start Stop Unit */ > #define F_D_IN 1 /* Data-in command (e.g. READ) */ > @@ -7558,6 +7558,7 @@ static int sdebug_driver_probe(struct device *dev) > sdbg_host = to_sdebug_host(dev); > > sdebug_driver_template.can_queue = sdebug_max_queue; > + sdebug_driver_template.cmd_per_lun = sdebug_max_queue; > if (!sdebug_clustering) > sdebug_driver_template.dma_boundary = PAGE_SIZE - 1; Yeah, the change makes the issue disappear. That looks scsi's restart (scsi_run_queue_async<-scsi_end_request) finally restarts the queue. Thanks, Ming
On Wed, Apr 14, 2021 at 01:06:25PM +0100, John Garry wrote: > On 14/04/2021 12:12, Ming Lei wrote: > > On Wed, Apr 14, 2021 at 04:12:22PM +0530, Kashyap Desai wrote: > > > > Hi Ming, > > > > > > > > > It is reported inside RH that CPU utilization is increased ~20% when > > > > > running simple FIO test inside VM which disk is built on image stored > > > > > on XFS/megaraid_sas. > > > > > > > > > > When I try to investigate by reproducing the issue via scsi_debug, I > > > > > found IO hang when running randread IO(8k, direct IO, libaio) on > > > > > scsi_debug disk created by the following command: > > > > > > > > > > modprobe scsi_debug host_max_queue=128 > > > > submit_queues=$NR_CPUS > > > > > virtual_gb=256 > > > > > > > > > So I can recreate this hang for using mq-deadline IO sched for scsi debug, > > > > in > > > > that fio does not exit. I'm using v5.12-rc7. > > > I can also recreate this issue using mq-deadline. Using <none>, there is no > > > IO hang issue. > > > Also if I run script to change scheduler periodically (none, mq-deadline), > > > sysfs entry hangs. > > > > > > Here is call trace- > > > Call Trace: > > > [ 1229.879862] __schedule+0x29d/0x7a0 > > > [ 1229.879871] schedule+0x3c/0xa0 > > > [ 1229.879875] blk_mq_freeze_queue_wait+0x62/0x90 > > > [ 1229.879880] ? finish_wait+0x80/0x80 > > > [ 1229.879884] elevator_switch+0x12/0x40 > > > [ 1229.879888] elv_iosched_store+0x79/0x120 > > > [ 1229.879892] ? kernfs_fop_write_iter+0xc7/0x1b0 > > > [ 1229.879897] queue_attr_store+0x42/0x70 > > > [ 1229.879901] kernfs_fop_write_iter+0x11f/0x1b0 > > > [ 1229.879905] new_sync_write+0x11f/0x1b0 > > > [ 1229.879912] vfs_write+0x184/0x250 > > > [ 1229.879915] ksys_write+0x59/0xd0 > > > [ 1229.879917] do_syscall_64+0x33/0x40 > > > [ 1229.879922] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > > > > > > > I tried both - 5.12.0-rc1 and 5.11.0-rc2+ and there is a same behavior. > > > Let me also check megaraid_sas and see if anything generic or this is a > > > special case of scsi_debug. > > As I mentioned, it could be one generic issue wrt. SCHED_RESTART. > > shared tags might have to restart all hctx since all share same tags. > > I tested on hisi_sas v2 hw (which now sets host_tagset), and can reproduce. > Seems to be combination of mq-deadline and fio rw=randread settings required > to reproduce from limited experiments. 
> Incidentally, about the mq-deadline vs none IO scheduler on the same host, I
> get this with 6x SAS SSD:
>
> rw=read
>               CPU util                 IOPs
> mq-deadline   usr=26.80%, sys=52.78%   650K
> none          usr=22.99%, sys=74.10%   475K
>
> rw=randread
>               CPU util                 IOPs
> mq-deadline   usr=21.72%, sys=44.18%   423K
> none          usr=23.15%, sys=74.01%   450K

Today I re-ran the scsi_debug test on two server machines (32 cores, dual NUMA
nodes), and the CPU utilization issue can be reproduced. The test results
follow:

1) randread test on ibm-x3850x6[*] with deadline

             |IOPS | FIO CPU util
------------------------------------------------
hosttags     | 94k | usr=1.13%, sys=14.75%
------------------------------------------------
non hosttags | 124k| usr=1.12%, sys=10.65%

2) randread test on ibm-x3850x6[*] with none

             |IOPS | FIO CPU util
------------------------------------------------
hosttags     | 120k| usr=0.89%, sys=6.55%
------------------------------------------------
non hosttags | 121k| usr=1.07%, sys=7.35%
------------------------------------------------

*:
- that is the machine on which Yanhui reported the VM cpu utilization increase of ~20%
- kernel: latest linus tree (v5.12-rc7, commit 7f75285ca57)
- the same test was also run on another 32-core machine; the IOPS drop isn't
  observed there, but the CPU utilization increase is still obvious

3) test script

#!/bin/bash

run_fio() {
	RTIME=$1
	JOBS=$2
	DEVS=$3
	BS=$4

	QD=64
	BATCH=16

	fio --bs=$BS --ioengine=libaio \
		--iodepth=$QD \
		--iodepth_batch_submit=$BATCH \
		--iodepth_batch_complete_min=$BATCH \
		--filename=$DEVS \
		--direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \
		--name=test --group_reporting
}

SCHED=$1

NRQS=`getconf _NPROCESSORS_ONLN`

rmmod scsi_debug
modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256
sleep 2
DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
echo $SCHED > /sys/block/`basename $DEV`/queue/scheduler
echo 128 > /sys/block/`basename $DEV`/device/queue_depth
run_fio 20 16 $DEV 8K

rmmod scsi_debug
modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256
sleep 2
DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
echo $SCHED > /sys/block/`basename $DEV`/queue/scheduler
echo 128 > /sys/block/`basename $DEV`/device/queue_depth
run_fio 20 16 $DEV 8k

Thanks,
Ming
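Assuming the script above is saved as sdebug-test.sh (a placeholder name), each scheduler is covered by a separate run, for example:

    ./sdebug-test.sh mq-deadline   # hosttags pass, then single-queue pass, with deadline
    ./sdebug-test.sh none          # same two passes without an IO scheduler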
Hi Ming, I'll have a look. BTW, are you intentionally using scsi_debug over null_blk? null_blk supports shared sbitmap as well, and performance figures there are generally higher than scsi_debug for similar fio settings. Thanks, john EOM >> IOPs >> mq-deadline usr=21.72%, sys=44.18%, 423K >> none usr=23.15%, sys=74.01% 450K > Today I re-run the scsi_debug test on two server hardwares(32cores, dual > numa nodes), and the CPU utilization issue can be reproduced, follow > the test result: > > 1) randread test on ibm-x3850x6[*] with deadline > > |IOPS | FIO CPU util > ------------------------------------------------ > hosttags | 94k | usr=1.13%, sys=14.75% > ------------------------------------------------ > non hosttags | 124k | usr=1.12%, sys=10.65%, > > > 2) randread test on ibm-x3850x6[*] with none > |IOPS | FIO CPU util > ------------------------------------------------ > hosttags | 120k | usr=0.89%, sys=6.55% > ------------------------------------------------ > non hosttags | 121k | usr=1.07%, sys=7.35% > ------------------------------------------------ > > *: > - that is the machine Yanhui reported VM cpu utilization increased by 20% > - kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57) > - also run same test on another 32cores machine, IOPS drop isn't > observed, but CPU utilization is increased obviously > > 3) test script > #/bin/bash > > run_fio() { > RTIME=$1 > JOBS=$2 > DEVS=$3 > BS=$4 > > QD=64 > BATCH=16 > > fio --bs=$BS --ioengine=libaio \ > --iodepth=$QD \ > --iodepth_batch_submit=$BATCH \ > --iodepth_batch_complete_min=$BATCH \ > --filename=$DEVS \ > --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \ > --name=test --group_reporting > } > > SCHED=$1 > > NRQS=`getconf _NPROCESSORS_ONLN` > > rmmod scsi_debug > modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256 > sleep 2 > DEV=`lsscsi | grep scsi_debug | awk '{print $6}'` > echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler > echo 128 >/sys/block/`basename $DEV`/device/queue_depth > run_fio 20 16 $DEV 8K > > > rmmod scsi_debug > modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256 > sleep 2 > DEV=`lsscsi | grep scsi_debug | awk '{print $6}'` > echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler > echo 128 >/sys/block/`basename $DEV`/device/queue_depth > run_fio 20 16 $DEV 8k
On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote: > Hi Ming, > > I'll have a look. > > BTW, are you intentionally using scsi_debug over null_blk? null_blk supports > shared sbitmap as well, and performance figures there are generally higher > than scsi_debug for similar fio settings. I use both, but scsi_debug can cover scsi stack test. Thanks, Ming > > Thanks, > john > > EOM > > > > > IOPs > > > mq-deadline usr=21.72%, sys=44.18%, 423K > > > none usr=23.15%, sys=74.01% 450K > > Today I re-run the scsi_debug test on two server hardwares(32cores, dual > > numa nodes), and the CPU utilization issue can be reproduced, follow > > the test result: > > > > 1) randread test on ibm-x3850x6[*] with deadline > > > > |IOPS | FIO CPU util > > ------------------------------------------------ > > hosttags | 94k | usr=1.13%, sys=14.75% > > ------------------------------------------------ > > non hosttags | 124k | usr=1.12%, sys=10.65%, > > > > > > 2) randread test on ibm-x3850x6[*] with none > > |IOPS | FIO CPU util > > ------------------------------------------------ > > hosttags | 120k | usr=0.89%, sys=6.55% > > ------------------------------------------------ > > non hosttags | 121k | usr=1.07%, sys=7.35% > > ------------------------------------------------ > > > > *: > > - that is the machine Yanhui reported VM cpu utilization increased by 20% > > - kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57) > > - also run same test on another 32cores machine, IOPS drop isn't > > observed, but CPU utilization is increased obviously > > > > 3) test script > > #/bin/bash > > > > run_fio() { > > RTIME=$1 > > JOBS=$2 > > DEVS=$3 > > BS=$4 > > > > QD=64 > > BATCH=16 > > > > fio --bs=$BS --ioengine=libaio \ > > --iodepth=$QD \ > > --iodepth_batch_submit=$BATCH \ > > --iodepth_batch_complete_min=$BATCH \ > > --filename=$DEVS \ > > --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \ > > --name=test --group_reporting > > } > > > > SCHED=$1 > > > > NRQS=`getconf _NPROCESSORS_ONLN` > > > > rmmod scsi_debug > > modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256 > > sleep 2 > > DEV=`lsscsi | grep scsi_debug | awk '{print $6}'` > > echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler > > echo 128 >/sys/block/`basename $DEV`/device/queue_depth > > run_fio 20 16 $DEV 8K > > > > > > rmmod scsi_debug > > modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256 > > sleep 2 > > DEV=`lsscsi | grep scsi_debug | awk '{print $6}'` > > echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler > > echo 128 >/sys/block/`basename $DEV`/device/queue_depth > > run_fio 20 16 $DEV 8k > > -- Ming
On 15/04/2021 13:18, Ming Lei wrote: > On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote: >> Hi Ming, >> >> I'll have a look. >> >> BTW, are you intentionally using scsi_debug over null_blk? null_blk supports >> shared sbitmap as well, and performance figures there are generally higher >> than scsi_debug for similar fio settings. > I use both, but scsi_debug can cover scsi stack test. > Hi Ming, I can't seem to recreate your same issue. Are you mainline defconfig, or a special disto config? What is am seeing is that scsi_debug throughput is fixed @ ~ 32K IOPS for scsi_debug with both modprobe configs and both none and mq-deadline IO sched. CPU util seems a bit higher for hosttags with none. When I tried null_blk, the performance diff for hosttags and using none IO scheduler was noticeable, but not for mq-deadline: 1) randread test with deadline |IOPS | FIO CPU util ------------------------------------------------ hosttags* | 325K usr=1.34%, sys=76.49% ------------------------------------------------ non hosttags** | 325k usr=1.36%, sys=76.25% 2) randread test with none |IOPS | FIO CPU util ------------------------------------------------ hosttags* |6421k | usr=23.84%, sys=76.06% ------------------------------------------------ non hosttags** | 6893K | usr=25.57%, sys=74.33% ------------------------------------------------ * insmod null_blk.ko submit_queues=32 shared_tag_bitmap=1 ** insmod null_blk.ko submit_queues=32 However I don't think that the null_blk test is a good like-for-like comparison, as setting shared_tag_bitmap means just just the same tagset over all hctx, but still have same count of hctx. Just setting submit_queues=1 gives a big drop in performance, as would be expected. Thanks, John >> >> EOM >> >> >>>> IOPs >>>> mq-deadline usr=21.72%, sys=44.18%, 423K >>>> none usr=23.15%, sys=74.01% 450K >>> Today I re-run the scsi_debug test on two server hardwares(32cores, dual >>> numa nodes), and the CPU utilization issue can be reproduced, follow >>> the test result: >>> >>> 1) randread test on ibm-x3850x6[*] with deadline >>> >>> |IOPS | FIO CPU util >>> ------------------------------------------------ >>> hosttags | 94k | usr=1.13%, sys=14.75% >>> ------------------------------------------------ >>> non hosttags | 124k | usr=1.12%, sys=10.65%, >>> >>> >>> 2) randread test on ibm-x3850x6[*] with none >>> |IOPS | FIO CPU util >>> ------------------------------------------------ >>> hosttags | 120k | usr=0.89%, sys=6.55% >>> ------------------------------------------------ >>> non hosttags | 121k | usr=1.07%, sys=7.35% >>> ------------------------------------------------ >>> >>> *: >>> - that is the machine Yanhui reported VM cpu utilization increased by 20% >>> - kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57) >>> - also run same test on another 32cores machine, IOPS drop isn't >>> observed, but CPU utilization is increased obviously >>> >>> 3) test script >>> #/bin/bash >>> >>> run_fio() { >>> RTIME=$1 >>> JOBS=$2 >>> DEVS=$3 >>> BS=$4 >>> >>> QD=64 >>> BATCH=16 >>> >>> fio --bs=$BS --ioengine=libaio \ >>> --iodepth=$QD \ >>> --iodepth_batch_submit=$BATCH \ >>> --iodepth_batch_complete_min=$BATCH \ >>> --filename=$DEVS \ >>> --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \ >>> --name=test --group_reporting >>> } >>> >>> SCHED=$1 >>> >>> NRQS=`getconf _NPROCESSORS_ONLN` >>> >>> rmmod scsi_debug >>> modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256 >>> sleep 2 >>> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'` >>> echo $SCHED 
>/sys/block/`basename $DEV`/queue/scheduler >>> echo 128 >/sys/block/`basename $DEV`/device/queue_depth >>> run_fio 20 16 $DEV 8K >>> >>> >>> rmmod scsi_debug >>> modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256 >>> sleep 2 >>> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'` >>> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler >>> echo 128 >/sys/block/`basename $DEV`/device/queue_depth >>> run_fio 20 16 $DEV 8k
On Thu, Apr 15, 2021 at 04:41:06PM +0100, John Garry wrote: > On 15/04/2021 13:18, Ming Lei wrote: > > On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote: > > > Hi Ming, > > > > > > I'll have a look. > > > > > > BTW, are you intentionally using scsi_debug over null_blk? null_blk supports > > > shared sbitmap as well, and performance figures there are generally higher > > > than scsi_debug for similar fio settings. > > I use both, but scsi_debug can cover scsi stack test. > > > > Hi Ming, > > I can't seem to recreate your same issue. Are you mainline defconfig, or a > special disto config? The config is rhel8 config. As I mentioned, with deadline, IOPS drop is observed on one hardware(ibm-x3850x6) which is exactly the machine Yanhui reported the cpu utilization issue. On another machine(HP DL380G10), still 32cores, dual numa nodes, IOPS drop can't be observed, but cpu utilization difference is still obserable. I use scsi_debug just because it is hard to run the virt workloads on that machine. And the reported issue is on megaraid_sas, which is a scsi device, so null_blk isn't good to simulate here because it can't cover scsi stack. Thanks, Ming
On 16/04/2021 01:46, Ming Lei wrote: >> I can't seem to recreate your same issue. Are you mainline defconfig, or a >> special disto config? > The config is rhel8 config. > Can you share that? Has anyone tested against mainline x86 config? > As I mentioned, with deadline, IOPS drop is observed on one hardware(ibm-x3850x6) > which is exactly the machine Yanhui reported the cpu utilization issue. > > On another machine(HP DL380G10), still 32cores, dual numa nodes, IOPS drop can't be > observed, but cpu utilization difference is still obserable. > > I use scsi_debug just because it is hard to run the virt workloads on > that machine. And the reported issue is on megaraid_sas, which is a scsi > device, so null_blk isn't good to simulate here because it can't cover > scsi stack. Understood. Apart from scsi_debug, the only thing I can readily compare hosttags vs non-hosttags for scsi stack on is a megaraid_sas system with 1x SATA disk. Or hack with hisi_sas to support both. Thanks, John
On Fri, Apr 16, 2021 at 09:29:37AM +0100, John Garry wrote: > On 16/04/2021 01:46, Ming Lei wrote: > > > I can't seem to recreate your same issue. Are you mainline defconfig, or a > > > special disto config? > > The config is rhel8 config. > > > > Can you share that? Has anyone tested against mainline x86 config? Sure, see the attachment. Thanks, Ming
On 16/04/2021 09:39, Ming Lei wrote:
> On Fri, Apr 16, 2021 at 09:29:37AM +0100, John Garry wrote:
>> On 16/04/2021 01:46, Ming Lei wrote:
>>>> I can't seem to recreate your same issue. Are you mainline defconfig, or a
>>>> special disto config?
>>> The config is rhel8 config.
>>>
>> Can you share that? Has anyone tested against mainline x86 config?
> Sure, see the attachment.

Thanks. I assume that this is not seen on mainline x86 defconfig.

Unfortunately it's anything but easy for me to install an x86 kernel ATM.

And I am still seeing this on hisi_sas v2 hw with 5.12-rc7:

[  214.448368] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  214.454468] rcu:  Tasks blocked on level-1 rcu_node (CPUs 0-15):
[  214.460474]  (detected by 40, t=5255 jiffies, g=2229, q=1110)
[  214.466208] rcu: All QSes seen, last rcu_preempt kthread activity 1 (4294945760-4294945759), jiffies_till_next_fqs=1, root ->qsmask 0x1
[  214.478466] BUG: scheduling while atomic: irq/151-hisi_sa/503/0x00000004
[  214.485162] Modules linked in:
[  214.488208] CPU: 40 PID: 503 Comm: irq/151-hisi_sa Not tainted 5.11.0 #75
[  214.494985] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
[  214.504105] Call trace:
[  214.506540]  dump_backtrace+0x0/0x1b0
[  214.510208]  show_stack+0x18/0x68
[  214.513513]  dump_stack+0xd8/0x134
[  214.516907]  __schedule_bug+0x60/0x78
[  214.520560]  __schedule+0x620/0x6d8
[  214.524039]  schedule+0x70/0x108
[  214.527256]  irq_thread+0xdc/0x230
[  214.530648]  kthread+0x154/0x158
[  214.533866]  ret_from_fork+0x10/0x30
john@ubuntu:~$

For rw=randread and mq-deadline only, it seems. v5.11 has the same. Not sure
if this is a driver or other issue.

Today I don't have access to other boards with enough disks to get a decent
throughput to test against :(

Thanks,
John
On 2021-04-16 10:59 a.m., John Garry wrote: > On 16/04/2021 09:39, Ming Lei wrote: >> On Fri, Apr 16, 2021 at 09:29:37AM +0100, John Garry wrote: >>> On 16/04/2021 01:46, Ming Lei wrote: >>>>> I can't seem to recreate your same issue. Are you mainline defconfig, or a >>>>> special disto config? >>>> The config is rhel8 config. >>>> >>> Can you share that? Has anyone tested against mainline x86 config? >> Sure, see the attachment. > > Thanks. I assume that this is not seen on mainline x86 defconfig. > > Unfortunately it's anything but easy for me to install an x86 kernel ATM. > > And I am still seeing this on hisi_sas v2 hw with 5.12-rc7: > > [ 214.448368] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks: > [ 214.454468] rcu:Tasks blocked on level-1 rcu_node (CPUs 0-15): > [ 214.460474] (detected by 40, t=5255 jiffies, g=2229, q=1110) > [ 214.466208] rcu: All QSes seen, last rcu_preempt kthread activity 1 > (4294945760-4294945759), jiffies_till_next_fqs=1, root ->qsmask 0x1 > [ 214.478466] BUG: scheduling while atomic: irq/151-hisi_sa/503/0x00000004 > [ 214.485162] Modules linked in: > [ 214.488208] CPU: 40 PID: 503 Comm: irq/151-hisi_sa Not tainted 5.11.0 #75 > [ 214.494985] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 > Nemo 2.0 RC0 04/18/2018 > [ 214.504105] Call trace: > [ 214.506540] dump_backtrace+0x0/0x1b0 > [ 214.510208] show_stack+0x18/0x68 > [ 214.513513] dump_stack+0xd8/0x134 > [ 214.516907] __schedule_bug+0x60/0x78 > [ 214.520560] __schedule+0x620/0x6d8 > [ 214.524039] schedule+0x70/0x108 > [ 214.527256] irq_thread+0xdc/0x230 > [ 214.530648] kthread+0x154/0x158 > [ 214.533866] ret_from_fork+0x10/0x30 > john@ubuntu:~$ > > For rw=randread and mq-deadline only, it seems. v5.11 has the same. Not sure if > this is a driver or other issue. > > Today I don't have access to other boards with enough disks to get a decent > throughput to test against :( I have always suspected under extreme pressure the block layer (or scsi mid-level) does strange things, like an IO hang, attempts to prove that usually lead back to my own code :-). But I have one example recently where upwards of 10 commands had been submitted (blk_execute_rq_nowait()) and the following one stalled (all on the same thread). Seconds later those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and my dd variant went to its conclusion (reporting 10 errors). Following copies showed no ill effects. My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last two monitor for stalls during the copy. Each submitted READ and WRITE command gets its pack_id from an incrementing atomic and a management thread in those copies checks every 300 milliseconds that that atomic value is greater than the previous check. If not, dump the state of the sg driver. The stalled request was in busy state with a timeout of 1 nanosecond which indicated that blk_execute_rq_nowait() had not been called. So the chief suspect would be blk_get_request() followed by the bio setup calls IMO. So it certainly looked like an IO hang, not a locking, resource nor corruption issue IMO. That was with a branch off MKP's 5.13/scsi-staging branch taken a few weeks back. So basically lk 5.12.0-rc1 . Doug Gilbert
On 4/19/21 8:06 PM, Douglas Gilbert wrote: > I have always suspected under extreme pressure the block layer (or scsi > mid-level) does strange things, like an IO hang, attempts to prove that > usually lead back to my own code :-). But I have one example recently > where upwards of 10 commands had been submitted (blk_execute_rq_nowait()) > and the following one stalled (all on the same thread). Seconds later > those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and > my dd variant went to its conclusion (reporting 10 errors). Following > copies showed no ill effects. > > My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last > two monitor for stalls during the copy. Each submitted READ and WRITE > command gets its pack_id from an incrementing atomic and a management > thread in those copies checks every 300 milliseconds that that atomic > value is greater than the previous check. If not, dump the state of the > sg driver. The stalled request was in busy state with a timeout of 1 > nanosecond which indicated that blk_execute_rq_nowait() had not been > called. So the chief suspect would be blk_get_request() followed by > the bio setup calls IMO. > > So it certainly looked like an IO hang, not a locking, resource nor > corruption issue IMO. That was with a branch off MKP's > 5.13/scsi-staging branch taken a few weeks back. So basically > lk 5.12.0-rc1 . Hi Doug, If it would be possible to develop a script that reproduces this hang and if that script can be shared I will help with root-causing and fixing this hang. Thanks, Bart.
On 2021-04-19 11:22 p.m., Bart Van Assche wrote:
> On 4/19/21 8:06 PM, Douglas Gilbert wrote:
>> I have always suspected under extreme pressure the block layer (or scsi
>> mid-level) does strange things, like an IO hang, attempts to prove that
>> usually lead back to my own code :-). But I have one example recently
>> where upwards of 10 commands had been submitted (blk_execute_rq_nowait())
>> and the following one stalled (all on the same thread). Seconds later
>> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and
>> my dd variant went to its conclusion (reporting 10 errors). Following
>> copies showed no ill effects.
>>
>> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last
>> two monitor for stalls during the copy. Each submitted READ and WRITE
>> command gets its pack_id from an incrementing atomic and a management
>> thread in those copies checks every 300 milliseconds that that atomic
>> value is greater than the previous check. If not, dump the state of the
>> sg driver. The stalled request was in busy state with a timeout of 1
>> nanosecond which indicated that blk_execute_rq_nowait() had not been
>> called. So the chief suspect would be blk_get_request() followed by
>> the bio setup calls IMO.
>>
>> So it certainly looked like an IO hang, not a locking, resource nor
>> corruption issue IMO. That was with a branch off MKP's
>> 5.13/scsi-staging branch taken a few weeks back. So basically
>> lk 5.12.0-rc1 .
>
> Hi Doug,
>
> If it would be possible to develop a script that reproduces this hang and
> if that script can be shared I will help with root-causing and fixing this
> hang.

Possible, but not very practical:
  1) apply supplied 83 patches to sg driver
  2) apply pending patch to scsi_debug driver
  3) find a stable kernel platform (perhaps not lk 5.12.0-rc1)
  4) run supplied scripts for three weeks
  5) dig through the output and maybe find one case (there were lots
     of EAGAINs from blk_get_request() but they are expected when
     thrashing the storage layers)

My basic testing strategy may be useful for others:

    sg_dd iflag=random bs=512 of=/dev/sg6
    sg_dd if=/dev/sg6 bs=512 of=/dev/sg7
    sg_dd --verify if=/dev/sg6 bs=512 of=/dev/sg7

If the copy works, so should the verify (compare). The sg_dd utility in
sg3_utils release 1.46 is needed to support iflag=random in the first line
and the --verify in the third line. If the backing LLD is scsi_debug, then
per_host_store=1 is needed. Best not to use SSDs.

The above pattern will work just as well for /dev/sd* device nodes, but
iflag= and oflag= lists must contain the sgio flag. Then
ioctl(/dev/sd*, SG_IO, ...) is used for IO.

The limitations of the third line could be bypassed with something like:

    cmp /dev/sd6 /dev/sd7

If real disks are used, all user data will be trashed.

Doug Gilbert
On Tue, Apr 20, 2021 at 12:54 PM Douglas Gilbert <dgilbert@interlog.com> wrote: > > On 2021-04-19 11:22 p.m., Bart Van Assche wrote: > > On 4/19/21 8:06 PM, Douglas Gilbert wrote: > >> I have always suspected under extreme pressure the block layer (or scsi > >> mid-level) does strange things, like an IO hang, attempts to prove that > >> usually lead back to my own code :-). But I have one example recently > >> where upwards of 10 commands had been submitted (blk_execute_rq_nowait()) > >> and the following one stalled (all on the same thread). Seconds later > >> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and > >> my dd variant went to its conclusion (reporting 10 errors). Following > >> copies showed no ill effects. > >> > >> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last > >> two monitor for stalls during the copy. Each submitted READ and WRITE > >> command gets its pack_id from an incrementing atomic and a management > >> thread in those copies checks every 300 milliseconds that that atomic > >> value is greater than the previous check. If not, dump the state of the > >> sg driver. The stalled request was in busy state with a timeout of 1 > >> nanosecond which indicated that blk_execute_rq_nowait() had not been > >> called. So the chief suspect would be blk_get_request() followed by > >> the bio setup calls IMO. > >> > >> So it certainly looked like an IO hang, not a locking, resource nor > >> corruption issue IMO. That was with a branch off MKP's > >> 5.13/scsi-staging branch taken a few weeks back. So basically > >> lk 5.12.0-rc1 . > > > > Hi Doug, > > > > If it would be possible to develop a script that reproduces this hang and > > if that script can be shared I will help with root-causing and fixing this > > hang. > > Possible, but not very practical: > 1) apply supplied 83 patches to sg driver > 2) apply pending patch to scsi_debug driver > 3) find a stable kernel platform (perhaps not lk 5.12.0-rc1) > 4) run supplied scripts for three weeks > 5) dig through the output and maybe find one case (there were lots > of EAGAINs from blk_get_request() but they are expected when > thrashing the storage layers) Or collecting the debugfs log after IO hang is triggered in your test: (cd /sys/kernel/debug/block/$SDEV && find . -type f -exec grep -aH . {} \;) $SDEV is the disk on which IO hang is observed. Thanks, Ming
On 2021-04-20 2:52 a.m., Ming Lei wrote: > On Tue, Apr 20, 2021 at 12:54 PM Douglas Gilbert <dgilbert@interlog.com> wrote: >> >> On 2021-04-19 11:22 p.m., Bart Van Assche wrote: >>> On 4/19/21 8:06 PM, Douglas Gilbert wrote: >>>> I have always suspected under extreme pressure the block layer (or scsi >>>> mid-level) does strange things, like an IO hang, attempts to prove that >>>> usually lead back to my own code :-). But I have one example recently >>>> where upwards of 10 commands had been submitted (blk_execute_rq_nowait()) >>>> and the following one stalled (all on the same thread). Seconds later >>>> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and >>>> my dd variant went to its conclusion (reporting 10 errors). Following >>>> copies showed no ill effects. >>>> >>>> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last >>>> two monitor for stalls during the copy. Each submitted READ and WRITE >>>> command gets its pack_id from an incrementing atomic and a management >>>> thread in those copies checks every 300 milliseconds that that atomic >>>> value is greater than the previous check. If not, dump the state of the >>>> sg driver. The stalled request was in busy state with a timeout of 1 >>>> nanosecond which indicated that blk_execute_rq_nowait() had not been >>>> called. So the chief suspect would be blk_get_request() followed by >>>> the bio setup calls IMO. >>>> >>>> So it certainly looked like an IO hang, not a locking, resource nor >>>> corruption issue IMO. That was with a branch off MKP's >>>> 5.13/scsi-staging branch taken a few weeks back. So basically >>>> lk 5.12.0-rc1 . >>> >>> Hi Doug, >>> >>> If it would be possible to develop a script that reproduces this hang and >>> if that script can be shared I will help with root-causing and fixing this >>> hang. >> >> Possible, but not very practical: >> 1) apply supplied 83 patches to sg driver >> 2) apply pending patch to scsi_debug driver >> 3) find a stable kernel platform (perhaps not lk 5.12.0-rc1) >> 4) run supplied scripts for three weeks >> 5) dig through the output and maybe find one case (there were lots >> of EAGAINs from blk_get_request() but they are expected when >> thrashing the storage layers) > > Or collecting the debugfs log after IO hang is triggered in your test: > > (cd /sys/kernel/debug/block/$SDEV && find . -type f -exec grep -aH . {} \;) > > $SDEV is the disk on which IO hang is observed. Thanks. I'll try adding that to my IO hang trigger code. My patches on the sg driver add debugfs support so these produce the same output: cat /proc/scsi/sg/debug cat /sys/kernel/debug/scsi_generic/snapshot There is also a /sys/kernel/debug/scsi_generic/snapped file whose contents reflect the driver's state when ioctl(<sg_fd>, SG_DEBUG, &one) was last called. When I test, the root file system is usually on a NVMe SSD so the state of all SCSI disks present should be dumped as they are part of my test. Also I find the netconsole module extremely useful and have an old laptop on my network running: socat udp-recv:6665 - > socat.txt picking up the UDP packets from netconsole on port 6665. Not quite as good as monitoring a hardware serial console, but less fiddly. And most modern laptops don't have access to a serial console so netconsole is the only option. Another observation: upper level issues seem to impact the submission side of request handling (e.g. the IO hang I was trying to describe) while error injection I can do (e.g. 
using the scsi_debug driver) impacts the completion side (obviously). Are there any tools to inject errors into the block layer submission code? Doug Gilbert
On Tue, Apr 20, 2021 at 04:22:53PM -0400, Douglas Gilbert wrote: > On 2021-04-20 2:52 a.m., Ming Lei wrote: > > On Tue, Apr 20, 2021 at 12:54 PM Douglas Gilbert <dgilbert@interlog.com> wrote: > > > > > > On 2021-04-19 11:22 p.m., Bart Van Assche wrote: > > > > On 4/19/21 8:06 PM, Douglas Gilbert wrote: > > > > > I have always suspected under extreme pressure the block layer (or scsi > > > > > mid-level) does strange things, like an IO hang, attempts to prove that > > > > > usually lead back to my own code :-). But I have one example recently > > > > > where upwards of 10 commands had been submitted (blk_execute_rq_nowait()) > > > > > and the following one stalled (all on the same thread). Seconds later > > > > > those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and > > > > > my dd variant went to its conclusion (reporting 10 errors). Following > > > > > copies showed no ill effects. > > > > > > > > > > My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last > > > > > two monitor for stalls during the copy. Each submitted READ and WRITE > > > > > command gets its pack_id from an incrementing atomic and a management > > > > > thread in those copies checks every 300 milliseconds that that atomic > > > > > value is greater than the previous check. If not, dump the state of the > > > > > sg driver. The stalled request was in busy state with a timeout of 1 > > > > > nanosecond which indicated that blk_execute_rq_nowait() had not been > > > > > called. So the chief suspect would be blk_get_request() followed by > > > > > the bio setup calls IMO. > > > > > > > > > > So it certainly looked like an IO hang, not a locking, resource nor > > > > > corruption issue IMO. That was with a branch off MKP's > > > > > 5.13/scsi-staging branch taken a few weeks back. So basically > > > > > lk 5.12.0-rc1 . > > > > > > > > Hi Doug, > > > > > > > > If it would be possible to develop a script that reproduces this hang and > > > > if that script can be shared I will help with root-causing and fixing this > > > > hang. > > > > > > Possible, but not very practical: > > > 1) apply supplied 83 patches to sg driver > > > 2) apply pending patch to scsi_debug driver > > > 3) find a stable kernel platform (perhaps not lk 5.12.0-rc1) > > > 4) run supplied scripts for three weeks > > > 5) dig through the output and maybe find one case (there were lots > > > of EAGAINs from blk_get_request() but they are expected when > > > thrashing the storage layers) > > > > Or collecting the debugfs log after IO hang is triggered in your test: > > > > (cd /sys/kernel/debug/block/$SDEV && find . -type f -exec grep -aH . {} \;) > > > > $SDEV is the disk on which IO hang is observed. > > Thanks. I'll try adding that to my IO hang trigger code. > > My patches on the sg driver add debugfs support so these produce > the same output: > cat /proc/scsi/sg/debug > cat /sys/kernel/debug/scsi_generic/snapshot > > There is also a /sys/kernel/debug/scsi_generic/snapped file whose > contents reflect the driver's state when ioctl(<sg_fd>, SG_DEBUG, &one) > was last called. > > When I test, the root file system is usually on a NVMe SSD so the > state of all SCSI disks present should be dumped as they are part > of my test. Also I find the netconsole module extremely useful and > have an old laptop on my network running: > socat udp-recv:6665 - > socat.txt > > picking up the UDP packets from netconsole on port 6665. Not quite as > good as monitoring a hardware serial console, but less fiddly. 
And > most modern laptops don't have access to a serial console so > netconsole is the only option. Yeah, years ago netconsole does help me much when serial console isn't available, especially ssh doesn't work at that time, such as kernel panic. > > Another observation: upper level issues seem to impact the submission > side of request handling (e.g. the IO hang I was trying to describe) > while error injection I can do (e.g. using the scsi_debug driver) > impacts the completion side (obviously). Are there any tools to inject > errors into the block layer submission code? block layer supports fault injection in submission side, see should_fail_bio() which is called from submit_bio_checks(). thanks, Ming
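For reference, the submission-side fault injection Ming mentions is driven from debugfs when the kernel is built with CONFIG_FAIL_MAKE_REQUEST; a rough example follows, with the disk name as a placeholder and the knob values only illustrative (see Documentation/fault-injection/fault-injection.rst for the exact semantics):

    # inject bio submission failures (should_fail_bio()) for sdb
    echo 10  > /sys/kernel/debug/fail_make_request/probability
    echo 100 > /sys/kernel/debug/fail_make_request/interval
    echo -1  > /sys/kernel/debug/fail_make_request/times
    echo 1   > /sys/kernel/debug/fail_make_request/verbose
    echo 1   > /sys/block/sdb/make-it-fail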
On 15/04/2021 04:46, Ming Lei wrote: >> CPU util IOPs >> mq-deadline usr=21.72%, sys=44.18%, 423K >> none usr=23.15%, sys=74.01% 450K > Today I re-run the scsi_debug test on two server hardwares(32cores, dual > numa nodes), and the CPU utilization issue can be reproduced, follow > the test result: > I haven't forgotten about this. I finally got your .config working in x86 qemu with only a 4-CPU system. > 1) randread test on ibm-x3850x6[*] with deadline > > |IOPS | FIO CPU util > ------------------------------------------------ > hosttags | 94k | usr=1.13%, sys=14.75% > ------------------------------------------------ > non hosttags | 124k | usr=1.12%, sys=10.65%, > Getting these results for mq-deadline: hosttags 100K cpu 1.52 4.47 non-hosttags 109K cpu 1.74 5.49 So I still don't see the same CPU usage increase for hosttags. But throughput is down, so at least I can check on that... > > 2) randread test on ibm-x3850x6[*] with none > |IOPS | FIO CPU util > ------------------------------------------------ > hosttags | 120k | usr=0.89%, sys=6.55% > ------------------------------------------------ > non hosttags | 121k | usr=1.07%, sys=7.35% > ------------------------------------------------ > Here I get: hosttags 113K cpu 2.04 5.83 non-hosttags 108K cpu 1.71 5.05 Thanks, John > *: > - that is the machine Yanhui reported VM cpu utilization increased by 20% > - kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57) > - also run same test on another 32cores machine, IOPS drop isn't > observed, but CPU utilization is increased obviously > > 3) test script > #/bin/bash > > run_fio() { > RTIME=$1 > JOBS=$2 > DEVS=$3 > BS=$4 > > QD=64 > BATCH=16 > > fio --bs=$BS --ioengine=libaio \ > --iodepth=$QD \ > --iodepth_batch_submit=$BATCH \ > --iodepth_batch_complete_min=$BATCH \ > --filename=$DEVS \ > --direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \ > --name=test --group_reporting > } > > SCHED=$1 > > NRQS=`getconf _NPROCESSORS_ONLN` > > rmmod scsi_debug > modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256 > sleep 2 > DEV=`lsscsi | grep scsi_debug | awk '{print $6}'` > echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler > echo 128 >/sys/block/`basename $DEV`/device/queue_depth > run_fio 20 16 $DEV 8K > > > rmmod scsi_debug > modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256 > sleep 2 > DEV=`lsscsi | grep scsi_debug | awk '{print $6}'` > echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler > echo 128 >/sys/block/`basename $DEV`/device/queue_depth > run_fio 20 16 $DEV 8k
On 23/04/2021 09:43, John Garry wrote:
>> 1) randread test on ibm-x3850x6[*] with deadline
>>
>>              |IOPS | FIO CPU util
>> ------------------------------------------------
>> hosttags     | 94k | usr=1.13%, sys=14.75%
>> ------------------------------------------------
>> non hosttags | 124k| usr=1.12%, sys=10.65%
>>
>
> Getting these results for mq-deadline:
>
> hosttags
> 100K cpu 1.52 4.47
>
> non-hosttags
> 109K cpu 1.74 5.49
>
> So I still don't see the same CPU usage increase for hosttags.
>
> But throughput is down, so at least I can check on that...
>
>> 2) randread test on ibm-x3850x6[*] with none
>>
>>              |IOPS | FIO CPU util
>> ------------------------------------------------
>> hosttags     | 120k| usr=0.89%, sys=6.55%
>> ------------------------------------------------
>> non hosttags | 121k| usr=1.07%, sys=7.35%
>> ------------------------------------------------
>
> Here I get:
> hosttags
> 113K cpu 2.04 5.83
>
> non-hosttags
> 108K cpu 1.71 5.05

Hi Ming,

One thing I noticed is that for the non-hosttags scenario I am hitting the
IO scheduler tag exhaustion path in blk_mq_get_tag() often; here's some perf
output:

    |--15.88%--blk_mq_submit_bio
    |          |
    |          |--11.27%--__blk_mq_alloc_request
    |          |          |
    |          |           --11.19%--blk_mq_get_tag
    |          |                     |
    |          |                     |--6.00%--__blk_mq_delay_run_hw_queue
    |          |                     |          |
    ...
    |          |                     |
    |          |                     |--3.29%--io_schedule
    |          |                     |          |
    ...
    |          |                     |          |
    |          |                     |           --1.32%--io_schedule_prepare
    |          |                     |
    ...
    |          |                     |
    |          |                     |--0.60%--sbitmap_finish_wait
    |          |                     |
    |          |                      --0.56%--sbitmap_get

I don't see this for hostwide tags - this may be because we have multiple
hctx, and the IO sched tags are per hctx, so less chance of exhaustion. But
this is not from hostwide tags specifically, but from multiple HW queues in
general. As I understood, sched tags were meant to be per request queue,
right? Am I reading this correctly?

I can barely remember some debate on this, but could not find the thread.
Hannes did have a patch related to the topic, but it was dropped:

https://lore.kernel.org/linux-scsi/20191202153914.84722-7-hare@suse.de/#t

Thanks,
John
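The scheduler tag pressure described above can also be inspected directly through the blk-mq debugfs entries, along the lines of the collection command Ming suggested earlier in the thread; the disk name is a placeholder:

    # per-hctx scheduler tag state (depth, busy count, etc.)
    grep -aH . /sys/kernel/debug/block/sdb/hctx*/sched_tags
    # driver tag state for comparison
    grep -aH . /sys/kernel/debug/block/sdb/hctx*/tags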
On Mon, Apr 26, 2021 at 11:53:45AM +0100, John Garry wrote: > On 23/04/2021 09:43, John Garry wrote: > > > 1) randread test on ibm-x3850x6[*] with deadline > > > > > > |IOPS | FIO CPU util > > > ------------------------------------------------ > > > hosttags | 94k | usr=1.13%, sys=14.75% > > > ------------------------------------------------ > > > non hosttags | 124k | usr=1.12%, sys=10.65%, > > > > > > > Getting these results for mq-deadline: > > > > hosttags > > 100K cpu 1.52 4.47 > > > > non-hosttags > > 109K cpu 1.74 5.49 > > > > So I still don't see the same CPU usage increase for hosttags. > > > > But throughput is down, so at least I can check on that... > > > > > > > > 2) randread test on ibm-x3850x6[*] with none > > > |IOPS | FIO CPU util > > > ------------------------------------------------ > > > hosttags | 120k | usr=0.89%, sys=6.55% > > > ------------------------------------------------ > > > non hosttags | 121k | usr=1.07%, sys=7.35% > > > ------------------------------------------------ > > > > > > > Here I get: > > hosttags > > 113K cpu 2.04 5.83 > > > > non-hosttags > > 108K cpu 1.71 5.05 > > Hi Ming, > > One thing I noticed is that for the non-hosttags scenario is that I am > hitting the IO scheduler tag exhaustion path in blk_mq_get_tag() often; > here's some perf output: > > |--15.88%--blk_mq_submit_bio > | | > | |--11.27%--__blk_mq_alloc_request > | | | > | | --11.19%--blk_mq_get_tag > | | | > | | |--6.00%--__blk_mq_delay_run_hw_queue > | | | | > > ... > > | | | > | | |--3.29%--io_schedule > | | | | > > .... > > | | | | > | | | --1.32%--io_schedule_prepare > | | | > > ... > > | | | > | | |--0.60%--sbitmap_finish_wait > | | | > --0.56%--sbitmap_get > > I don't see this for hostwide tags - this may be because we have multiple > hctx, and the IO sched tags are per hctx, so less chance of exhaustion. But > this is not from hostwide tags specifically, but for multiple HW queues in > general. As I understood, sched tags were meant to be per request queue, > right? I am reading this correctly? sched tags is still per-hctx. I just found that you didn't change sched tags into per-request-queue shared tags. Then for hostwide tags, each hctx still has its own standalone sched tags and request pool, that is one big difference with non hostwide tags. That is why you observe that scheduler tag exhaustion is easy to trigger in case of non-hostwide tags. I'd suggest to add one per-request-queue sched tags, and make all hctxs sharing it, just like what you did for driver tag. Thanks, Ming
On 26/04/2021 15:48, Ming Lei wrote: >> --0.56%--sbitmap_get >> >> I don't see this for hostwide tags - this may be because we have multiple >> hctx, and the IO sched tags are per hctx, so less chance of exhaustion. But >> this is not from hostwide tags specifically, but for multiple HW queues in >> general. As I understood, sched tags were meant to be per request queue, >> right? I am reading this correctly? > sched tags is still per-hctx. > > I just found that you didn't change sched tags into per-request-queue > shared tags. > Then for hostwide tags, each hctx still has its own > standalone sched tags and request pool, that is one big difference with > non hostwide tags. For both hostwide and non-hostwide tags, we have standalone sched tags and request pool per hctx when q->nr_hw_queues > 1. > That is why you observe that scheduler tag exhaustion > is easy to trigger in case of non-hostwide tags. > > I'd suggest to add one per-request-queue sched tags, and make all hctxs > sharing it, just like what you did for driver tag. > That sounds reasonable. But I don't see how this is related to hostwide tags specifically, but rather just having q->nr_hw_queues > 1, which NVMe PCI and some other SCSI MQ HBAs have (without using hostwide tags). Thanks, John
On Mon, Apr 26, 2021 at 04:52:28PM +0100, John Garry wrote: > On 26/04/2021 15:48, Ming Lei wrote: > > > --0.56%--sbitmap_get > > > > > > I don't see this for hostwide tags - this may be because we have multiple > > > hctx, and the IO sched tags are per hctx, so less chance of exhaustion. But > > > this is not from hostwide tags specifically, but for multiple HW queues in > > > general. As I understood, sched tags were meant to be per request queue, > > > right? I am reading this correctly? > > sched tags is still per-hctx. > > > > I just found that you didn't change sched tags into per-request-queue > > shared tags. Then for hostwide tags, each hctx still has its own > > standalone sched tags and request pool, that is one big difference with > > non hostwide tags. > > For both hostwide and non-hostwide tags, we have standalone sched tags and > request pool per hctx when q->nr_hw_queues > 1. driver tags is shared for hostwide tags. > > > That is why you observe that scheduler tag exhaustion > > is easy to trigger in case of non-hostwide tags. > > > > I'd suggest to add one per-request-queue sched tags, and make all hctxs > > sharing it, just like what you did for driver tag. > > > > That sounds reasonable. > > But I don't see how this is related to hostwide tags specifically, but > rather just having q->nr_hw_queues > 1, which NVMe PCI and some other SCSI > MQ HBAs have (without using hostwide tags). Before hostwide tags, the whole scheduler queue depth should be 256. After hostwide tags, the whole scheduler queue depth becomes 256 * nr_hw_queues. But the driver tag queue depth is _not_ changed. More requests come and are tried to dispatch to LLD and can't succeed because of limited driver tag depth, and CPU utilization could be increased. Thanks, Ming
On 26/04/2021 17:03, Ming Lei wrote: >> For both hostwide and non-hostwide tags, we have standalone sched tags and >> request pool per hctx when q->nr_hw_queues > 1. > driver tags is shared for hostwide tags. > >>> That is why you observe that scheduler tag exhaustion >>> is easy to trigger in case of non-hostwide tags. >>> >>> I'd suggest to add one per-request-queue sched tags, and make all hctxs >>> sharing it, just like what you did for driver tag. >>> >> That sounds reasonable. >> >> But I don't see how this is related to hostwide tags specifically, but >> rather just having q->nr_hw_queues > 1, which NVMe PCI and some other SCSI >> MQ HBAs have (without using hostwide tags). > Before hostwide tags, the whole scheduler queue depth should be 256. > After hostwide tags, the whole scheduler queue depth becomes 256 * > nr_hw_queues. But the driver tag queue depth is_not_ changed. Fine. > > More requests come and are tried to dispatch to LLD and can't succeed > because of limited driver tag depth, and CPU utilization could be increased. Right, maybe this is a problem. I quickly added some debug, and see that __blk_mq_get_driver_tag()->__sbitmap_queue_get() fails ~7% for hostwide tags and 3% for non-hostwide tags. Having it fail at all for non-hostwide tags seems a bit dubious... here's the code for deciding the rq sched tag depth: q->nr_requests = 2 * min(q->tags_set->queue_depth [128], BLK_DEV_MAX_RQ [128]) So we get 256 for our test scenario, which is appreciably bigger than q->tags_set->queue_depth, so the failures make sense. Anyway, I'll look at adding code for a per-request queue sched tags to see if it helps. But I would plan to continue to use a per hctx sched request pool. Thanks, John
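The depths being compared in this exchange are all visible from sysfs, which makes the imbalance easy to confirm on a test system; the device and host names below are placeholders:

    cat /sys/block/sdb/queue/nr_requests       # scheduler depth, per the 2 * min(...) formula above
    cat /sys/block/sdb/device/queue_depth      # per-sdev depth
    cat /sys/class/scsi_host/host0/can_queue   # host-wide driver tag depth

With can_queue=128 this gives nr_requests=256, so more scheduler tags than driver tags can be outstanding, which is consistent with the __blk_mq_get_driver_tag() failures John measured.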
On Mon, Apr 26, 2021 at 06:02:31PM +0100, John Garry wrote:
> On 26/04/2021 17:03, Ming Lei wrote:
> > > For both hostwide and non-hostwide tags, we have standalone sched tags and
> > > request pool per hctx when q->nr_hw_queues > 1.
> > driver tags are shared for hostwide tags.
> >
> > > > That is why you observe that scheduler tag exhaustion
> > > > is easy to trigger in case of non-hostwide tags.
> > > >
> > > > I'd suggest to add one per-request-queue sched tags, and make all hctxs
> > > > share it, just like what you did for driver tags.
> > > >
> > > That sounds reasonable.
> > >
> > > But I don't see how this is related to hostwide tags specifically, but
> > > rather just having q->nr_hw_queues > 1, which NVMe PCI and some other SCSI
> > > MQ HBAs have (without using hostwide tags).
> > Before hostwide tags, the whole scheduler queue depth should be 256.
> > After hostwide tags, the whole scheduler queue depth becomes 256 *
> > nr_hw_queues. But the driver tag queue depth is _not_ changed.
>
> Fine.
>
> > More requests come in and are tried to be dispatched to the LLD, but they
> > can't succeed because of the limited driver tag depth, and CPU utilization
> > could be increased.
>
> Right, maybe this is a problem.
>
> I quickly added some debug, and see that
> __blk_mq_get_driver_tag()->__sbitmap_queue_get() fails ~7% for hostwide tags
> and 3% for non-hostwide tags.
>
> Having it fail at all for non-hostwide tags seems a bit dubious... here's
> the code for deciding the rq sched tag depth:
>
> q->nr_requests = 2 * min(q->tag_set->queue_depth [128], BLK_DEV_MAX_RQ [128])
>
> So we get 256 for our test scenario, which is appreciably bigger than
> q->tag_set->queue_depth, so the failures make sense.
>
> Anyway, I'll look at adding code for per-request-queue sched tags to see
> if it helps. But I would plan to continue to use a per-hctx sched request
> pool.

Why not switch to per hctx sched request pool?

Thanks,
Ming
On 27/04/2021 00:59, Ming Lei wrote:
>> Anyway, I'll look at adding code for per-request-queue sched tags to see
>> if it helps. But I would plan to continue to use a per-hctx sched request
>> pool.
> Why not switch to per hctx sched request pool?

I don't understand. The current code uses a per-hctx sched request pool, and
I said that I don't plan to change that.

Thanks,
John
On Tue, Apr 27, 2021 at 08:52:53AM +0100, John Garry wrote:
> On 27/04/2021 00:59, Ming Lei wrote:
> > > Anyway, I'll look at adding code for per-request-queue sched tags to see
> > > if it helps. But I would plan to continue to use a per-hctx sched request
> > > pool.
> > Why not switch to per hctx sched request pool?
>
> I don't understand. The current code uses a per-hctx sched request pool, and
> I said that I don't plan to change that.

I forget why you didn't do that, because for hostwide tags, requests are
always 1:1 with either sched tags (real io sched) or driver tags (none).

Maybe you want to keep requests local to the hctx, but I never saw related
performance data supporting that point; the sbitmap queue allocator has been
intelligent enough to allocate tags freed from the native CPU.

Then you just waste lots of memory; I remember that the scsi request payload
is a bit big.

Thanks,
Ming
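As a rough feel for that waste, here is a small back-of-the-envelope sketch;
the per-request size and the queue/depth counts are illustrative assumptions,
not measured values:

/* Static request memory with one pool per hctx vs a single shared pool.
 * All numbers are assumptions chosen only to illustrate the scaling.
 */
#include <stdio.h>

int main(void)
{
	unsigned int nr_hw_queues = 16;	/* assumed hctx count */
	unsigned int depth = 256;	/* sched tag depth per hctx */
	unsigned int rq_bytes = 1024;	/* assumed request + SCSI payload size */

	unsigned long per_hctx_pools = (unsigned long)nr_hw_queues * depth * rq_bytes;
	unsigned long shared_pool = (unsigned long)depth * rq_bytes;

	printf("per-hctx static request pools: %lu KB\n", per_hctx_pools >> 10);
	printf("single shared pool:            %lu KB\n", shared_pool >> 10);
	return 0;
}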
On 27/04/2021 10:11, Ming Lei wrote:
> On Tue, Apr 27, 2021 at 08:52:53AM +0100, John Garry wrote:
>> On 27/04/2021 00:59, Ming Lei wrote:
>>>> Anyway, I'll look at adding code for per-request-queue sched tags to see
>>>> if it helps. But I would plan to continue to use a per-hctx sched request
>>>> pool.
>>> Why not switch to per hctx sched request pool?
>> I don't understand. The current code uses a per-hctx sched request pool, and
>> I said that I don't plan to change that.
> I forget why you didn't do that, because for hostwide tags, requests are
> always 1:1 with either sched tags (real io sched) or driver tags (none).
>
> Maybe you want to keep requests local to the hctx, but I never saw related
> performance data supporting that point; the sbitmap queue allocator has been
> intelligent enough to allocate tags freed from the native CPU.
>
> Then you just waste lots of memory; I remember that the scsi request payload
> is a bit big.

It's true that we waste much memory for regular static requests when using
hostwide tags today.

One problem in trying to use a single set of "hostwide" static requests is
that we call blk_mq_init_request(..., hctx_idx, ...) ->
set->ops->init_request(.., hctx_idx, ...) for each static rq, and this would
not work for a single set of "hostwide" requests.

And I see a similar problem for "request queue-wide" sched static requests.

Maybe we can improve this in future.

BTW, for the performance issue which Yanhui witnessed with megaraid sas, do
you think it may be because of the IO sched tags issue of total sched tag
depth growing vs driver tags? Are there lots of LUNs? I can imagine that
megaraid sas has much larger can_queue than scsi_debug :)

Thanks,
John
On Tue, Apr 27, 2021 at 10:37:39AM +0100, John Garry wrote:
> On 27/04/2021 10:11, Ming Lei wrote:
> > On Tue, Apr 27, 2021 at 08:52:53AM +0100, John Garry wrote:
> > > On 27/04/2021 00:59, Ming Lei wrote:
> > > > > Anyway, I'll look at adding code for per-request-queue sched tags to see
> > > > > if it helps. But I would plan to continue to use a per-hctx sched request
> > > > > pool.
> > > > Why not switch to per hctx sched request pool?
> > > I don't understand. The current code uses a per-hctx sched request pool, and
> > > I said that I don't plan to change that.
> > I forget why you didn't do that, because for hostwide tags, requests are
> > always 1:1 with either sched tags (real io sched) or driver tags (none).
> >
> > Maybe you want to keep requests local to the hctx, but I never saw related
> > performance data supporting that point; the sbitmap queue allocator has been
> > intelligent enough to allocate tags freed from the native CPU.
> >
> > Then you just waste lots of memory; I remember that the scsi request payload
> > is a bit big.
>
> It's true that we waste much memory for regular static requests when using
> hostwide tags today.
>
> One problem in trying to use a single set of "hostwide" static requests is
> that we call blk_mq_init_request(..., hctx_idx, ...) ->
> set->ops->init_request(.., hctx_idx, ...) for each static rq, and this would
> not work for a single set of "hostwide" requests.
>
> And I see a similar problem for "request queue-wide" sched static requests.
>
> Maybe we can improve this in future.

OK, fair enough.

> BTW, for the performance issue which Yanhui witnessed with megaraid sas, do
> you think it may be because of the IO sched tags issue of total sched tag
> depth growing vs driver tags?

I think it is highly possible. Will you work on a patch to convert to
per-request-queue sched tags?

> Are there lots of LUNs? I can imagine that megaraid
> sas has much larger can_queue than scsi_debug :)

No, there are just two LUNs: the 1st LUN is one commodity SSD (queue depth
is 32) and the performance issue is reported on this LUN; the other is one
HDD (queue depth is 256) which is the root disk, but the megaraid host tag
depth is 228, another weird setting. But the issue can still be reproduced
after we set the 2nd LUN's depth to 64 to avoid driver tag contention.

Thanks,
Ming
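For reference, the driver tag arithmetic behind that test change, using the
depths quoted above (a trivial standalone sketch, nothing more):

/* Sum of per-LUN queue depths vs the megaraid host tag depth, before and
 * after capping the HDD at 64 to rule out driver tag contention.
 */
#include <stdio.h>

int main(void)
{
	unsigned int host_tags = 228;	/* megaraid host tag depth */
	unsigned int ssd = 32;		/* 1st LUN queue depth */
	unsigned int hdd = 256;		/* 2nd LUN (root disk) queue depth */
	unsigned int hdd_capped = 64;	/* 2nd LUN depth used for the test */

	printf("default: %u + %u = %u possible outstanding vs %u host tags\n",
	       ssd, hdd, ssd + hdd, host_tags);
	printf("capped:  %u + %u = %u possible outstanding vs %u host tags\n",
	       ssd, hdd_capped, ssd + hdd_capped, host_tags);
	return 0;
}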
On 27/04/2021 10:52, Ming Lei wrote:
>> BTW, for the performance issue which Yanhui witnessed with megaraid sas, do
>> you think it may be because of the IO sched tags issue of total sched tag
>> depth growing vs driver tags?
> I think it is highly possible. Will you work on a patch to convert to
> per-request-queue sched tags?
>

Sure, I'm just hacking now to see what difference it can make to
performance. Early results look promising...

>> Are there lots of LUNs? I can imagine that megaraid
>> sas has much larger can_queue than scsi_debug :)
> No, there are just two LUNs: the 1st LUN is one commodity SSD (queue depth
> is 32) and the performance issue is reported on this LUN; the other is one
> HDD (queue depth is 256) which is the root disk, but the megaraid host tag
> depth is 228, another weird setting. But the issue can still be reproduced
> after we set the 2nd LUN's depth to 64 to avoid driver tag contention.
>

BTW, one more thing which Kashyap and I looked at when initially developing
the hostwide tag support was the wait struct usage in the tag exhaustion
scenario:
https://lore.kernel.org/linux-block/ecaeccf029c6fe377ebd4f30f04df9f1@mail.gmail.com/

IIRC, we looked at a "hostwide" wait_index - it didn't seem to make a
difference then, and we didn't end up making any changes here, but it is
still worth remembering.

Thanks,
John
On 27/04/2021 10:52, Ming Lei wrote:
>>> Then you just waste lots of memory; I remember that the scsi request payload
>>> is a bit big.
>> It's true that we waste much memory for regular static requests when using
>> hostwide tags today.
>>
>> One problem in trying to use a single set of "hostwide" static requests is
>> that we call blk_mq_init_request(..., hctx_idx, ...) ->
>> set->ops->init_request(.., hctx_idx, ...) for each static rq, and this would
>> not work for a single set of "hostwide" requests.
>>
>> And I see a similar problem for "request queue-wide" sched static requests.
>>
>> Maybe we can improve this in future.
> OK, fair enough.
>

JFYI, I am working on this now. My idea is to introduce hostwide and request
queue-wide static requests (for the hostwide tagset case), and have the
per-hctx tags point at them, so that we don't need to allocate per-hctx
static requests.

The SCSI init_request callback just ignores the hctx_idx passed, so then we
can just call init_request(hctx_idx = 0) for those hostwide/request
queue-wide static requests.

Thanks,
John
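For illustration, a minimal userspace model of that idea follows; the struct
layout, field names, and init_request() signature are simplified stand-ins
rather than the real blk-mq definitions:

/* One shared array of static requests for the whole tag set, with each
 * per-hctx tags struct pointing at it, and init_request() called once
 * with hctx_idx = 0. Names and layout are illustrative only.
 */
#include <stdio.h>
#include <stdlib.h>

struct request { int tag; };

struct hctx_tags {
	struct request **static_rqs;	/* points at the shared pool */
};

static int init_request(struct request *rq, unsigned int hctx_idx)
{
	/* A SCSI-like ->init_request() that ignores hctx_idx */
	(void)hctx_idx;
	rq->tag = -1;
	return 0;
}

int main(void)
{
	unsigned int depth = 4, nr_hw_queues = 2;
	struct request **shared = calloc(depth, sizeof(*shared));
	struct hctx_tags *tags = calloc(nr_hw_queues, sizeof(*tags));

	/* Allocate and init the static requests once, not once per hctx */
	for (unsigned int i = 0; i < depth; i++) {
		shared[i] = malloc(sizeof(struct request));
		init_request(shared[i], 0);
	}

	/* Every hctx's tags simply reference the shared pool */
	for (unsigned int q = 0; q < nr_hw_queues; q++)
		tags[q].static_rqs = shared;

	printf("hctx 0 and hctx 1 share rq[0]: %s\n",
	       tags[0].static_rqs[0] == tags[1].static_rqs[0] ? "yes" : "no");
	return 0;
}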
diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e1e997af89a0..45188f7aa789 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -59,10 +59,18 @@ EXPORT_SYMBOL_GPL(blk_mq_sched_mark_restart_hctx);
 
 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
 {
+	bool shared_tag = blk_mq_is_sbitmap_shared(hctx->flags);
+
+	if (shared_tag)
+		blk_mq_run_hw_queues(hctx->queue, true);
+
 	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
 		return;
 	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
+	if (shared_tag)
+		return;
+
 	/*
 	 * Order clearing SCHED_RESTART and list_empty_careful(&hctx->dispatch)
 	 * in blk_mq_run_hw_queue(). Its pair is the barrier in