
[bug report] shared tags causes IO hang and performance drop

Message ID YHaez6iN2HHYxYOh@T590
State New
Series [bug report] shared tags causes IO hang and performance drop

Commit Message

Ming Lei April 14, 2021, 7:50 a.m. UTC
Hi John,

It is reported inside RH that CPU utilization is increased by ~20% when
running a simple FIO test inside a VM whose disk is built on an image
stored on XFS/megaraid_sas.

When I tried to investigate by reproducing the issue via scsi_debug, I
found an IO hang when running randread IO (8k, direct IO, libaio) on a
scsi_debug disk created by the following command:

	modprobe scsi_debug host_max_queue=128 submit_queues=$NR_CPUS virtual_gb=256

It looks like it is caused by SCHED_RESTART: the current RESTART is only
done on the current hctx, and we may need to restart all hctxs for shared
tags. The issue can be fixed by the appended patch (quoted in full in the
first reply below). However, IOPS drops by more than 10% with the patch.

So any idea for this issue and the original performance drop?


Thanks, 
Ming

Comments

Kashyap Desai April 14, 2021, 10:42 a.m. UTC | #1
> Hi Ming,
>
> >
> > It is reported inside RH that CPU utilization is increased ~20% when
> > running simple FIO test inside VM which disk is built on image stored
> > on XFS/megaraid_sas.
> >
> > When I try to investigate by reproducing the issue via scsi_debug, I
> > found IO hang when running randread IO(8k, direct IO, libaio) on
> > scsi_debug disk created by the following command:
> >
> > 	modprobe scsi_debug host_max_queue=128
> submit_queues=$NR_CPUS
> > virtual_gb=256
> >
>
> So I can recreate this hang for using mq-deadline IO sched for scsi debug,
> in
> that fio does not exit. I'm using v5.12-rc7.

I can also recreate this issue using mq-deadline; using <none>, there is
no IO hang issue.
Also, if I run a script that changes the scheduler periodically (none,
mq-deadline), the sysfs entry hangs.

Here is the call trace:
Call Trace:
[ 1229.879862]  __schedule+0x29d/0x7a0
[ 1229.879871]  schedule+0x3c/0xa0
[ 1229.879875]  blk_mq_freeze_queue_wait+0x62/0x90
[ 1229.879880]  ? finish_wait+0x80/0x80
[ 1229.879884]  elevator_switch+0x12/0x40
[ 1229.879888]  elv_iosched_store+0x79/0x120
[ 1229.879892]  ? kernfs_fop_write_iter+0xc7/0x1b0
[ 1229.879897]  queue_attr_store+0x42/0x70
[ 1229.879901]  kernfs_fop_write_iter+0x11f/0x1b0
[ 1229.879905]  new_sync_write+0x11f/0x1b0
[ 1229.879912]  vfs_write+0x184/0x250
[ 1229.879915]  ksys_write+0x59/0xd0
[ 1229.879917]  do_syscall_64+0x33/0x40
[ 1229.879922]  entry_SYSCALL_64_after_hwframe+0x44/0xae
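
The periodic scheduler switch mentioned above is nothing special; a
hypothetical reconstruction of that kind of loop (the disk name below is
just a placeholder) would be:

	while true; do
		echo mq-deadline > /sys/block/sdc/queue/scheduler
		sleep 1
		echo none > /sys/block/sdc/queue/scheduler
		sleep 1
	done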


I tried both 5.12.0-rc1 and 5.11.0-rc2+, and the behavior is the same.
Let me also check megaraid_sas and see whether this is something generic
or a special case of scsi_debug.

>
> Do you have any idea of what changed to cause this, as we would have
> tested
> this before? Or maybe only none IO sched on scsi_debug. And normally 4k
> block size and only rw=read (for me, anyway).
>
> Note that host_max_queue=128 will cap submit queue depth at 128, while
> would be 192 by default.
>
> Will check more...including CPU utilization.
>
> Thanks,
> John
>
> > Looks it is caused by SCHED_RESTART because current RESTART is just
> > done on current hctx, and we may need to restart all hctxs for shared
> > tags, and the issue can be fixed by the append patch. However, IOPS
> > drops more than 10% with the patch.
> >
> > So any idea for this issue and the original performance drop?
> >
> > diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c index
> > e1e997af89a0..45188f7aa789 100644
> > --- a/block/blk-mq-sched.c
> > +++ b/block/blk-mq-sched.c
> > @@ -59,10 +59,18 @@
> EXPORT_SYMBOL_GPL(blk_mq_sched_mark_restart_hctx);
> >
> >   void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
> >   {
> > +	bool shared_tag = blk_mq_is_sbitmap_shared(hctx->flags);
> > +
> > +	if (shared_tag)
> > +		blk_mq_run_hw_queues(hctx->queue, true);
> > +
> >   	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
> >   		return;
> >   	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
> >
> > +	if (shared_tag)
> > +		return;
> > +
> >   	/*
> >   	 * Order clearing SCHED_RESTART and list_empty_careful(&hctx-
> >dispatch)
> >   	 * in blk_mq_run_hw_queue(). Its pair is the barrier in
> >
> > Thanks,
> > Ming
> >
> > .
> >
Kashyap Desai April 14, 2021, 1:59 p.m. UTC | #2
> > I tried both - 5.12.0-rc1 and 5.11.0-rc2+ and there is a same
behavior.
> > Let me also check  megaraid_sas and see if anything generic or this is
> > a special case of scsi_debug.
>
> As I mentioned, it could be one generic issue wrt. SCHED_RESTART.
> shared tags might have to restart all hctx since all share same tags.

Ming - I tried many combinations on the MR shared host tag driver, but
there is not a single instance of an IO hang.
I will keep trying, but when I looked at the scsi_debug driver code I
found the odd settings below in the scsi_debug driver:
can_queue of the adapter is set to 128, but queue_depth of the sdev is set to 255.

If I apply the patch below, the scsi_debug driver's hang is also resolved.
Ideally sdev->queue_depth should not exceed shost->can_queue.
I am not sure why cmd_per_lun is 255 in the scsi_debug driver, which can
easily exceed can_queue. I will simulate something similar in the MR
driver and see how it behaves w.r.t. the IO hang issue.

diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index 70165be10f00..dded762540ee 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -218,7 +218,7 @@ static const char *sdebug_version_date = "20200710";
  */
 #define SDEBUG_CANQUEUE_WORDS  3       /* a WORD is bits in a long */
 #define SDEBUG_CANQUEUE  (SDEBUG_CANQUEUE_WORDS * BITS_PER_LONG)
-#define DEF_CMD_PER_LUN  255
+#define DEF_CMD_PER_LUN  SDEBUG_CANQUEUE

 /* UA - Unit Attention; SA - Service Action; SSU - Start Stop Unit */
 #define F_D_IN                 1       /* Data-in command (e.g. READ) */
@@ -7558,6 +7558,7 @@ static int sdebug_driver_probe(struct device *dev)
        sdbg_host = to_sdebug_host(dev);

        sdebug_driver_template.can_queue = sdebug_max_queue;
+       sdebug_driver_template.cmd_per_lun = sdebug_max_queue;
        if (!sdebug_clustering)
                sdebug_driver_template.dma_boundary = PAGE_SIZE - 1;
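
As an aside, the invariant described above (sdev->queue_depth never
exceeding shost->can_queue) can also be enforced from a driver's
->slave_configure() hook. The sketch below is only an illustration of
that constraint, not the actual scsi_debug fix; example_slave_configure
is a hypothetical name, while scsi_change_queue_depth() is the standard
midlayer helper assumed here:

#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

/* Illustration only: cap the per-LUN queue depth at the host can_queue. */
static int example_slave_configure(struct scsi_device *sdev)
{
	struct Scsi_Host *shost = sdev->host;

	if (sdev->queue_depth > shost->can_queue)
		scsi_change_queue_depth(sdev, shost->can_queue);

	return 0;
}
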
>
>
> Thanks,
> Ming
Douglas Gilbert April 14, 2021, 5:03 p.m. UTC | #3
On 2021-04-14 9:59 a.m., Kashyap Desai wrote:
>>> I tried both - 5.12.0-rc1 and 5.11.0-rc2+ and there is a same
> behavior.
>>> Let me also check  megaraid_sas and see if anything generic or this is
>>> a special case of scsi_debug.
>>
>> As I mentioned, it could be one generic issue wrt. SCHED_RESTART.
>> shared tags might have to restart all hctx since all share same tags.
> 
> Ming - I tried many combination on MR shared host tag driver but there is
> no single instance of IO hang.
> I will keep trying, but when I look at scsi_debug driver code I found
> below odd settings in scsi_debug driver.
> can_queue of adapter is set to 128 but queue_depth of sdev is set to 255.
> 
> If I apply below patch, scsi_debug driver's hang is also resolved. Ideally
> sdev->queue depth cannot exceed shost->can_queue.
> Not sure why cmd_per_lun is 255 in scsi_debug driver which can easily
> exceed can_queue.  I will simulate something similar in MR driver and see
> how it behaves w.r.t IO hang issue.
> 
> diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
> index 70165be10f00..dded762540ee 100644
> --- a/drivers/scsi/scsi_debug.c
> +++ b/drivers/scsi/scsi_debug.c
> @@ -218,7 +218,7 @@ static const char *sdebug_version_date = "20200710";
>    */
>   #define SDEBUG_CANQUEUE_WORDS  3       /* a WORD is bits in a long */
>   #define SDEBUG_CANQUEUE  (SDEBUG_CANQUEUE_WORDS * BITS_PER_LONG)

So SDEBUG_CANQUEUE is 3*64 = 192 and is a hard limit (it is used to
dimension an array). Should SDEBUG_CANQUEUE_WORDS be upped to 4 (i.e. 256),
say? [That will slow things down a bit if that is an issue.]

> -#define DEF_CMD_PER_LUN  255
> +#define DEF_CMD_PER_LUN  SDEBUG_CANQUEUE
> 
>   /* UA - Unit Attention; SA - Service Action; SSU - Start Stop Unit */
>   #define F_D_IN                 1       /* Data-in command (e.g. READ) */
> @@ -7558,6 +7558,7 @@ static int sdebug_driver_probe(struct device *dev)
>          sdbg_host = to_sdebug_host(dev);
> 
>          sdebug_driver_template.can_queue = sdebug_max_queue;
> +       sdebug_driver_template.cmd_per_lun = sdebug_max_queue;

I'll  push out a patch shortly.

Doug Gilbert


>          if (!sdebug_clustering)
>                  sdebug_driver_template.dma_boundary = PAGE_SIZE - 1;
>>
>>
>> Thanks,
>> Ming
Ming Lei April 15, 2021, 12:58 a.m. UTC | #4
On Wed, Apr 14, 2021 at 07:29:07PM +0530, Kashyap Desai wrote:
> > > I tried both - 5.12.0-rc1 and 5.11.0-rc2+ and there is a same

> behavior.

> > > Let me also check  megaraid_sas and see if anything generic or this is

> > > a special case of scsi_debug.

> >

> > As I mentioned, it could be one generic issue wrt. SCHED_RESTART.

> > shared tags might have to restart all hctx since all share same tags.

> 

> Ming - I tried many combination on MR shared host tag driver but there is

> no single instance of IO hang.

> I will keep trying, but when I look at scsi_debug driver code I found

> below odd settings in scsi_debug driver.

> can_queue of adapter is set to 128 but queue_depth of sdev is set to 255.

> 

> If I apply below patch, scsi_debug driver's hang is also resolved. Ideally

> sdev->queue depth cannot exceed shost->can_queue.

> Not sure why cmd_per_lun is 255 in scsi_debug driver which can easily

> exceed can_queue.  I will simulate something similar in MR driver and see

> how it behaves w.r.t IO hang issue.

> 

> diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c

> index 70165be10f00..dded762540ee 100644

> --- a/drivers/scsi/scsi_debug.c

> +++ b/drivers/scsi/scsi_debug.c

> @@ -218,7 +218,7 @@ static const char *sdebug_version_date = "20200710";

>   */

>  #define SDEBUG_CANQUEUE_WORDS  3       /* a WORD is bits in a long */

>  #define SDEBUG_CANQUEUE  (SDEBUG_CANQUEUE_WORDS * BITS_PER_LONG)

> -#define DEF_CMD_PER_LUN  255

> +#define DEF_CMD_PER_LUN  SDEBUG_CANQUEUE

> 

>  /* UA - Unit Attention; SA - Service Action; SSU - Start Stop Unit */

>  #define F_D_IN                 1       /* Data-in command (e.g. READ) */

> @@ -7558,6 +7558,7 @@ static int sdebug_driver_probe(struct device *dev)

>         sdbg_host = to_sdebug_host(dev);

> 

>         sdebug_driver_template.can_queue = sdebug_max_queue;

> +       sdebug_driver_template.cmd_per_lun = sdebug_max_queue;

>         if (!sdebug_clustering)

>                 sdebug_driver_template.dma_boundary = PAGE_SIZE - 1;


Yeah, the change makes the issue disappear. It looks like SCSI's restart
(scsi_run_queue_async <- scsi_end_request) finally restarts the queue.


Thanks,
Ming
Ming Lei April 15, 2021, 3:46 a.m. UTC | #5
On Wed, Apr 14, 2021 at 01:06:25PM +0100, John Garry wrote:
> On 14/04/2021 12:12, Ming Lei wrote:

> > On Wed, Apr 14, 2021 at 04:12:22PM +0530, Kashyap Desai wrote:

> > > > Hi Ming,

> > > > 

> > > > > It is reported inside RH that CPU utilization is increased ~20% when

> > > > > running simple FIO test inside VM which disk is built on image stored

> > > > > on XFS/megaraid_sas.

> > > > > 

> > > > > When I try to investigate by reproducing the issue via scsi_debug, I

> > > > > found IO hang when running randread IO(8k, direct IO, libaio) on

> > > > > scsi_debug disk created by the following command:

> > > > > 

> > > > > 	modprobe scsi_debug host_max_queue=128

> > > > submit_queues=$NR_CPUS

> > > > > virtual_gb=256

> > > > > 

> > > > So I can recreate this hang for using mq-deadline IO sched for scsi debug,

> > > > in

> > > > that fio does not exit. I'm using v5.12-rc7.

> > > I can also recreate this issue using mq-deadline. Using <none>, there is no

> > > IO hang issue.

> > > Also if I run script to change scheduler periodically (none, mq-deadline),

> > > sysfs entry hangs.

> > > 

> > > Here is call trace-

> > > Call Trace:

> > > [ 1229.879862]  __schedule+0x29d/0x7a0

> > > [ 1229.879871]  schedule+0x3c/0xa0

> > > [ 1229.879875]  blk_mq_freeze_queue_wait+0x62/0x90

> > > [ 1229.879880]  ? finish_wait+0x80/0x80

> > > [ 1229.879884]  elevator_switch+0x12/0x40

> > > [ 1229.879888]  elv_iosched_store+0x79/0x120

> > > [ 1229.879892]  ? kernfs_fop_write_iter+0xc7/0x1b0

> > > [ 1229.879897]  queue_attr_store+0x42/0x70

> > > [ 1229.879901]  kernfs_fop_write_iter+0x11f/0x1b0

> > > [ 1229.879905]  new_sync_write+0x11f/0x1b0

> > > [ 1229.879912]  vfs_write+0x184/0x250

> > > [ 1229.879915]  ksys_write+0x59/0xd0

> > > [ 1229.879917]  do_syscall_64+0x33/0x40

> > > [ 1229.879922]  entry_SYSCALL_64_after_hwframe+0x44/0xae

> > > 

> > > 

> > > I tried both - 5.12.0-rc1 and 5.11.0-rc2+ and there is a same behavior.

> > > Let me also check  megaraid_sas and see if anything generic or this is a

> > > special case of scsi_debug.

> > As I mentioned, it could be one generic issue wrt. SCHED_RESTART.

> > shared tags might have to restart all hctx since all share same tags.

> 

> I tested on hisi_sas v2 hw (which now sets host_tagset), and can reproduce.

> Seems to be combination of mq-deadline and fio rw=randread settings required

> to reproduce from limited experiments.

> 

> Incidentally, about the mq-deadline vs none IO scheduler on the same host, I

> get this with 6x SAS SSD:

> 

> rw=read

> 		CPU util			IOPs

> mq-deadline	usr=26.80%, sys=52.78%		650K

> none		usr=22.99%, sys=74.10%		475K

> 

> rw=randread

> 		CPU util			IOPs

> mq-deadline	usr=21.72%, sys=44.18%,		423K

> none		usr=23.15%, sys=74.01%		450K


Today I re-ran the scsi_debug test on two server machines (32 cores, dual
NUMA nodes), and the CPU utilization issue can be reproduced; the test
results follow:

1) randread test on ibm-x3850x6[*] with deadline

              |IOPS    | FIO CPU util 
------------------------------------------------
hosttags      | 94k    | usr=1.13%, sys=14.75%
------------------------------------------------
non hosttags  | 124k   | usr=1.12%, sys=10.65%,


2) randread test on ibm-x3850x6[*] with none
              |IOPS    | FIO CPU util 
------------------------------------------------
hosttags      | 120k   | usr=0.89%, sys=6.55%
------------------------------------------------
non hosttags  | 121k   | usr=1.07%, sys=7.35%
------------------------------------------------

 *:
 	- that is the machine Yanhui reported VM cpu utilization increased by 20% 
	- kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57)
	- also run same test on another 32cores machine, IOPS drop isn't
	  observed, but CPU utilization is increased obviously

3) test script
#!/bin/bash

run_fio() {
	RTIME=$1
	JOBS=$2
	DEVS=$3
	BS=$4

	QD=64
	BATCH=16

	fio --bs=$BS --ioengine=libaio \
		--iodepth=$QD \
	    --iodepth_batch_submit=$BATCH \
		--iodepth_batch_complete_min=$BATCH \
		--filename=$DEVS \
		--direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \
		--name=test --group_reporting
}

SCHED=$1

NRQS=`getconf _NPROCESSORS_ONLN`

rmmod scsi_debug
modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256
sleep 2
DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
echo $SCHED > /sys/block/`basename $DEV`/queue/scheduler
echo 128 > /sys/block/`basename $DEV`/device/queue_depth
run_fio 20 16 $DEV 8K


rmmod scsi_debug
modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256
sleep 2
DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`
echo $SCHED > /sys/block/`basename $DEV`/queue/scheduler
echo 128 > /sys/block/`basename $DEV`/device/queue_depth
run_fio 20 16 $DEV 8k
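
For reference, the script takes the scheduler name as its only argument
and needs root (it loads and unloads scsi_debug); an assumed invocation,
with "test.sh" as a placeholder file name, is:

	./test.sh none
	./test.sh mq-deadline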



Thanks,
Ming
John Garry April 15, 2021, 10:41 a.m. UTC | #6
Hi Ming,

I'll have a look.

BTW, are you intentionally using scsi_debug over null_blk? null_blk 
supports shared sbitmap as well, and performance figures there are 
generally higher than scsi_debug for similar fio settings.

Thanks,
john

EOM


>> 		IOPs

>> mq-deadline	usr=21.72%, sys=44.18%,		423K

>> none		usr=23.15%, sys=74.01%		450K

> Today I re-run the scsi_debug test on two server hardwares(32cores, dual

> numa nodes), and the CPU utilization issue can be reproduced, follow

> the test result:

> 

> 1) randread test on ibm-x3850x6[*] with deadline

> 

>                |IOPS    | FIO CPU util

> ------------------------------------------------

> hosttags      | 94k    | usr=1.13%, sys=14.75%

> ------------------------------------------------

> non hosttags  | 124k   | usr=1.12%, sys=10.65%,

> 

> 

> 2) randread test on ibm-x3850x6[*] with none

>                |IOPS    | FIO CPU util

> ------------------------------------------------

> hosttags      | 120k   | usr=0.89%, sys=6.55%

> ------------------------------------------------

> non hosttags  | 121k   | usr=1.07%, sys=7.35%

> ------------------------------------------------

> 

>   *:

>   	- that is the machine Yanhui reported VM cpu utilization increased by 20%

> 	- kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57)

> 	- also run same test on another 32cores machine, IOPS drop isn't

> 	  observed, but CPU utilization is increased obviously

> 

> 3) test script

> #/bin/bash

> 

> run_fio() {

> 	RTIME=$1

> 	JOBS=$2

> 	DEVS=$3

> 	BS=$4

> 

> 	QD=64

> 	BATCH=16

> 

> 	fio --bs=$BS --ioengine=libaio \

> 		--iodepth=$QD \

> 	    --iodepth_batch_submit=$BATCH \

> 		--iodepth_batch_complete_min=$BATCH \

> 		--filename=$DEVS \

> 		--direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \

> 		--name=test --group_reporting

> }

> 

> SCHED=$1

> 

> NRQS=`getconf _NPROCESSORS_ONLN`

> 

> rmmod scsi_debug

> modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256

> sleep 2

> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`

> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler

> echo 128 >/sys/block/`basename $DEV`/device/queue_depth

> run_fio 20 16 $DEV 8K

> 

> 

> rmmod scsi_debug

> modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256

> sleep 2

> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`

> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler

> echo 128 >/sys/block/`basename $DEV`/device/queue_depth

> run_fio 20 16 $DEV 8k
Ming Lei April 15, 2021, 12:18 p.m. UTC | #7
On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote:
> Hi Ming,

> 

> I'll have a look.

> 

> BTW, are you intentionally using scsi_debug over null_blk? null_blk supports

> shared sbitmap as well, and performance figures there are generally higher

> than scsi_debug for similar fio settings.


I use both, but scsi_debug can cover scsi stack test.

Thanks,
Ming

> 

> Thanks,

> john

> 

> EOM

> 

> 

> > > 		IOPs

> > > mq-deadline	usr=21.72%, sys=44.18%,		423K

> > > none		usr=23.15%, sys=74.01%		450K

> > Today I re-run the scsi_debug test on two server hardwares(32cores, dual

> > numa nodes), and the CPU utilization issue can be reproduced, follow

> > the test result:

> > 

> > 1) randread test on ibm-x3850x6[*] with deadline

> > 

> >                |IOPS    | FIO CPU util

> > ------------------------------------------------

> > hosttags      | 94k    | usr=1.13%, sys=14.75%

> > ------------------------------------------------

> > non hosttags  | 124k   | usr=1.12%, sys=10.65%,

> > 

> > 

> > 2) randread test on ibm-x3850x6[*] with none

> >                |IOPS    | FIO CPU util

> > ------------------------------------------------

> > hosttags      | 120k   | usr=0.89%, sys=6.55%

> > ------------------------------------------------

> > non hosttags  | 121k   | usr=1.07%, sys=7.35%

> > ------------------------------------------------

> > 

> >   *:

> >   	- that is the machine Yanhui reported VM cpu utilization increased by 20%

> > 	- kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57)

> > 	- also run same test on another 32cores machine, IOPS drop isn't

> > 	  observed, but CPU utilization is increased obviously

> > 

> > 3) test script

> > #/bin/bash

> > 

> > run_fio() {

> > 	RTIME=$1

> > 	JOBS=$2

> > 	DEVS=$3

> > 	BS=$4

> > 

> > 	QD=64

> > 	BATCH=16

> > 

> > 	fio --bs=$BS --ioengine=libaio \

> > 		--iodepth=$QD \

> > 	    --iodepth_batch_submit=$BATCH \

> > 		--iodepth_batch_complete_min=$BATCH \

> > 		--filename=$DEVS \

> > 		--direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \

> > 		--name=test --group_reporting

> > }

> > 

> > SCHED=$1

> > 

> > NRQS=`getconf _NPROCESSORS_ONLN`

> > 

> > rmmod scsi_debug

> > modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256

> > sleep 2

> > DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`

> > echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler

> > echo 128 >/sys/block/`basename $DEV`/device/queue_depth

> > run_fio 20 16 $DEV 8K

> > 

> > 

> > rmmod scsi_debug

> > modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256

> > sleep 2

> > DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`

> > echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler

> > echo 128 >/sys/block/`basename $DEV`/device/queue_depth

> > run_fio 20 16 $DEV 8k

> 

> 


-- 
Ming
John Garry April 15, 2021, 3:41 p.m. UTC | #8
On 15/04/2021 13:18, Ming Lei wrote:
> On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote:

>> Hi Ming,

>>

>> I'll have a look.

>>

>> BTW, are you intentionally using scsi_debug over null_blk? null_blk supports

>> shared sbitmap as well, and performance figures there are generally higher

>> than scsi_debug for similar fio settings.

> I use both, but scsi_debug can cover scsi stack test.

> 


Hi Ming,

I can't seem to recreate your issue. Are you using a mainline defconfig,
or a special distro config?

What I am seeing is that scsi_debug throughput is fixed at ~32K IOPS
with both modprobe configs and with both the none and mq-deadline IO
schedulers. CPU util seems a bit higher for hosttags with none.

When I tried null_blk, the performance difference between hosttags and
non-hosttags with the none IO scheduler was noticeable, but not with
mq-deadline:

1) randread test with deadline
               |IOPS   | FIO CPU util
------------------------------------------------
hosttags*      | 325K  | usr=1.34%, sys=76.49%
------------------------------------------------
non hosttags** | 325K  | usr=1.36%, sys=76.25%

2) randread test with none
               |IOPS   | FIO CPU util
------------------------------------------------
hosttags*      | 6421K | usr=23.84%, sys=76.06%
------------------------------------------------
non hosttags** | 6893K | usr=25.57%, sys=74.33%
------------------------------------------------

* insmod null_blk.ko submit_queues=32 shared_tag_bitmap=1
** insmod null_blk.ko submit_queues=32

However, I don't think that the null_blk test is a good like-for-like
comparison, as setting shared_tag_bitmap just means the same tagset is
shared over all hctxs, while still having the same count of hctxs.

Just setting submit_queues=1 gives a big drop in performance, as would 
be expected.

Thanks,
John


>>

>> EOM

>>

>>

>>>> 		IOPs

>>>> mq-deadline	usr=21.72%, sys=44.18%,		423K

>>>> none		usr=23.15%, sys=74.01%		450K

>>> Today I re-run the scsi_debug test on two server hardwares(32cores, dual

>>> numa nodes), and the CPU utilization issue can be reproduced, follow

>>> the test result:

>>>

>>> 1) randread test on ibm-x3850x6[*] with deadline

>>>

>>>                 |IOPS    | FIO CPU util

>>> ------------------------------------------------

>>> hosttags      | 94k    | usr=1.13%, sys=14.75%

>>> ------------------------------------------------

>>> non hosttags  | 124k   | usr=1.12%, sys=10.65%,

>>>

>>>

>>> 2) randread test on ibm-x3850x6[*] with none

>>>                 |IOPS    | FIO CPU util

>>> ------------------------------------------------

>>> hosttags      | 120k   | usr=0.89%, sys=6.55%

>>> ------------------------------------------------

>>> non hosttags  | 121k   | usr=1.07%, sys=7.35%

>>> ------------------------------------------------

>>>

>>>    *:

>>>    	- that is the machine Yanhui reported VM cpu utilization increased by 20%

>>> 	- kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57)

>>> 	- also run same test on another 32cores machine, IOPS drop isn't

>>> 	  observed, but CPU utilization is increased obviously

>>>

>>> 3) test script

>>> #/bin/bash

>>>

>>> run_fio() {

>>> 	RTIME=$1

>>> 	JOBS=$2

>>> 	DEVS=$3

>>> 	BS=$4

>>>

>>> 	QD=64

>>> 	BATCH=16

>>>

>>> 	fio --bs=$BS --ioengine=libaio \

>>> 		--iodepth=$QD \

>>> 	    --iodepth_batch_submit=$BATCH \

>>> 		--iodepth_batch_complete_min=$BATCH \

>>> 		--filename=$DEVS \

>>> 		--direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \

>>> 		--name=test --group_reporting

>>> }

>>>

>>> SCHED=$1

>>>

>>> NRQS=`getconf _NPROCESSORS_ONLN`

>>>

>>> rmmod scsi_debug

>>> modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256

>>> sleep 2

>>> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`

>>> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler

>>> echo 128 >/sys/block/`basename $DEV`/device/queue_depth

>>> run_fio 20 16 $DEV 8K

>>>

>>>

>>> rmmod scsi_debug

>>> modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256

>>> sleep 2

>>> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`

>>> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler

>>> echo 128 >/sys/block/`basename $DEV`/device/queue_depth

>>> run_fio 20 16 $DEV 8k
Ming Lei April 16, 2021, 12:46 a.m. UTC | #9
On Thu, Apr 15, 2021 at 04:41:06PM +0100, John Garry wrote:
> On 15/04/2021 13:18, Ming Lei wrote:

> > On Thu, Apr 15, 2021 at 11:41:52AM +0100, John Garry wrote:

> > > Hi Ming,

> > > 

> > > I'll have a look.

> > > 

> > > BTW, are you intentionally using scsi_debug over null_blk? null_blk supports

> > > shared sbitmap as well, and performance figures there are generally higher

> > > than scsi_debug for similar fio settings.

> > I use both, but scsi_debug can cover scsi stack test.

> > 

> 

> Hi Ming,

> 

> I can't seem to recreate your same issue. Are you mainline defconfig, or a

> special disto config?


The config is rhel8 config.

As I mentioned, with deadline the IOPS drop is observed on one machine (ibm-x3850x6),
which is exactly the machine on which Yanhui reported the CPU utilization issue.

On another machine (HP DL380G10), still 32 cores, dual NUMA nodes, the IOPS drop can't be
observed, but the CPU utilization difference is still observable.

I use scsi_debug just because it is hard to run the virt workloads on
that machine. And the reported issue is on megaraid_sas, which is a SCSI
device, so null_blk isn't a good simulation here because it can't cover
the SCSI stack.


Thanks,
Ming
John Garry April 16, 2021, 8:29 a.m. UTC | #10
On 16/04/2021 01:46, Ming Lei wrote:
>> I can't seem to recreate your same issue. Are you mainline defconfig, or a

>> special disto config?

> The config is rhel8 config.

> 


Can you share that? Has anyone tested against mainline x86 config?

> As I mentioned, with deadline, IOPS drop is observed on one hardware(ibm-x3850x6)

> which is exactly the machine Yanhui reported the cpu utilization issue.

> 

> On another machine(HP DL380G10), still 32cores, dual numa nodes, IOPS drop can't be

> observed, but cpu utilization difference is still obserable.

> 

> I use scsi_debug just because it is hard to run the virt workloads on

> that machine. And the reported issue is on megaraid_sas, which is a scsi

> device, so null_blk isn't good to simulate here because it can't cover

> scsi stack.


Understood. Apart from scsi_debug, the only thing I can readily compare 
hosttags vs non-hosttags for scsi stack on is a megaraid_sas system with 
1x SATA disk. Or hack with hisi_sas to support both.

Thanks,
John
Ming Lei April 16, 2021, 8:39 a.m. UTC | #11
On Fri, Apr 16, 2021 at 09:29:37AM +0100, John Garry wrote:
> On 16/04/2021 01:46, Ming Lei wrote:

> > > I can't seem to recreate your same issue. Are you mainline defconfig, or a

> > > special disto config?

> > The config is rhel8 config.

> > 

> 

> Can you share that? Has anyone tested against mainline x86 config?


Sure, see the attachment.

Thanks,
Ming
John Garry April 16, 2021, 2:59 p.m. UTC | #12
On 16/04/2021 09:39, Ming Lei wrote:
> On Fri, Apr 16, 2021 at 09:29:37AM +0100, John Garry wrote:

>> On 16/04/2021 01:46, Ming Lei wrote:

>>>> I can't seem to recreate your same issue. Are you mainline defconfig, or a

>>>> special disto config?

>>> The config is rhel8 config.

>>>

>> Can you share that? Has anyone tested against mainline x86 config?

> Sure, see the attachment.


Thanks. I assume that this is not seen on mainline x86 defconfig.

Unfortunately it's anything but easy for me to install an x86 kernel ATM.

And I am still seeing this on hisi_sas v2 hw with 5.12-rc7:

[  214.448368] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[  214.454468] rcu: Tasks blocked on level-1 rcu_node (CPUs 0-15):
[  214.460474]  (detected by 40, t=5255 jiffies, g=2229, q=1110)
[  214.466208] rcu: All QSes seen, last rcu_preempt kthread activity 1 (4294945760-4294945759), jiffies_till_next_fqs=1, root ->qsmask 0x1
[  214.478466] BUG: scheduling while atomic: irq/151-hisi_sa/503/0x00000004
[  214.485162] Modules linked in:
[  214.488208] CPU: 40 PID: 503 Comm: irq/151-hisi_sa Not tainted 5.11.0 #75
[  214.494985] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 Nemo 2.0 RC0 04/18/2018
[  214.504105] Call trace:
[  214.506540]  dump_backtrace+0x0/0x1b0
[  214.510208]  show_stack+0x18/0x68
[  214.513513]  dump_stack+0xd8/0x134
[  214.516907]  __schedule_bug+0x60/0x78
[  214.520560]  __schedule+0x620/0x6d8
[  214.524039]  schedule+0x70/0x108
[  214.527256]  irq_thread+0xdc/0x230
[  214.530648]  kthread+0x154/0x158
[  214.533866]  ret_from_fork+0x10/0x30
john@ubuntu:~$

For rw=randread and mq-deadline only, it seems. v5.11 has the same. Not 
sure if this is a driver or other issue.

Today I don't have access to other boards with enough disks to get a 
decent throughput to test against :(

Thanks,
John
Douglas Gilbert April 20, 2021, 3:06 a.m. UTC | #13
On 2021-04-16 10:59 a.m., John Garry wrote:
> On 16/04/2021 09:39, Ming Lei wrote:

>> On Fri, Apr 16, 2021 at 09:29:37AM +0100, John Garry wrote:

>>> On 16/04/2021 01:46, Ming Lei wrote:

>>>>> I can't seem to recreate your same issue. Are you mainline defconfig, or a

>>>>> special disto config?

>>>> The config is rhel8 config.

>>>>

>>> Can you share that? Has anyone tested against mainline x86 config?

>> Sure, see the attachment.

> 

> Thanks. I assume that this is not seen on mainline x86 defconfig.

> 

> Unfortunately it's anything but easy for me to install an x86 kernel ATM.

> 

> And I am still seeing this on hisi_sas v2 hw with 5.12-rc7:

> 

> [  214.448368] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:

> [  214.454468] rcu:Tasks blocked on level-1 rcu_node (CPUs 0-15):

> [  214.460474]  (detected by 40, t=5255 jiffies, g=2229, q=1110)

> [  214.466208] rcu: All QSes seen, last rcu_preempt kthread activity 1 

> (4294945760-4294945759), jiffies_till_next_fqs=1, root ->qsmask 0x1

> [  214.478466] BUG: scheduling while atomic: irq/151-hisi_sa/503/0x00000004

> [  214.485162] Modules linked in:

> [  214.488208] CPU: 40 PID: 503 Comm: irq/151-hisi_sa Not tainted 5.11.0 #75

> [  214.494985] Hardware name: Huawei Taishan 2280 /D05, BIOS Hisilicon D05 IT21 

> Nemo 2.0 RC0 04/18/2018

> [  214.504105] Call trace:

> [  214.506540]  dump_backtrace+0x0/0x1b0

> [  214.510208]  show_stack+0x18/0x68

> [  214.513513]  dump_stack+0xd8/0x134

> [  214.516907]  __schedule_bug+0x60/0x78

> [  214.520560]  __schedule+0x620/0x6d8

> [  214.524039]  schedule+0x70/0x108

> [  214.527256]  irq_thread+0xdc/0x230

> [  214.530648]  kthread+0x154/0x158

> [  214.533866]  ret_from_fork+0x10/0x30

> john@ubuntu:~$

> 

> For rw=randread and mq-deadline only, it seems. v5.11 has the same. Not sure if 

> this is a driver or other issue.

> 

> Today I don't have access to other boards with enough disks to get a decent 

> throughput to test against :(


I have always suspected under extreme pressure the block layer (or scsi
mid-level) does strange things, like an IO hang, attempts to prove that
usually lead back to my own code :-). But I have one example recently
where upwards of 10 commands had been submitted (blk_execute_rq_nowait())
and the following one stalled (all on the same thread). Seconds later
those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and
my dd variant went to its conclusion (reporting 10 errors). Following
copies showed no ill effects.

My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last
two monitor for stalls during the copy. Each submitted READ and WRITE
command gets its pack_id from an incrementing atomic and a management
thread in those copies checks every 300 milliseconds that that atomic
value is greater than the previous check. If not, dump the state of the
sg driver. The stalled request was in busy state with a timeout of 1
nanosecond which indicated that blk_execute_rq_nowait() had not been
called. So the chief suspect would be blk_get_request() followed by
the bio setup calls IMO.

So it certainly looked like an IO hang, not a locking, resource nor
corruption issue IMO. That was with a branch off MKP's
5.13/scsi-staging branch taken a few weeks back. So basically
lk 5.12.0-rc1 .

Doug Gilbert
Bart Van Assche April 20, 2021, 3:22 a.m. UTC | #14
On 4/19/21 8:06 PM, Douglas Gilbert wrote:
> I have always suspected under extreme pressure the block layer (or scsi

> mid-level) does strange things, like an IO hang, attempts to prove that

> usually lead back to my own code :-). But I have one example recently

> where upwards of 10 commands had been submitted (blk_execute_rq_nowait())

> and the following one stalled (all on the same thread). Seconds later

> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and

> my dd variant went to its conclusion (reporting 10 errors). Following

> copies showed no ill effects.

> 

> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last

> two monitor for stalls during the copy. Each submitted READ and WRITE

> command gets its pack_id from an incrementing atomic and a management

> thread in those copies checks every 300 milliseconds that that atomic

> value is greater than the previous check. If not, dump the state of the

> sg driver. The stalled request was in busy state with a timeout of 1

> nanosecond which indicated that blk_execute_rq_nowait() had not been

> called. So the chief suspect would be blk_get_request() followed by

> the bio setup calls IMO.

> 

> So it certainly looked like an IO hang, not a locking, resource nor

> corruption issue IMO. That was with a branch off MKP's

> 5.13/scsi-staging branch taken a few weeks back. So basically

> lk 5.12.0-rc1 .


Hi Doug,

If it would be possible to develop a script that reproduces this hang and
if that script can be shared I will help with root-causing and fixing this
hang.

Thanks,

Bart.
Douglas Gilbert April 20, 2021, 4:54 a.m. UTC | #15
On 2021-04-19 11:22 p.m., Bart Van Assche wrote:
> On 4/19/21 8:06 PM, Douglas Gilbert wrote:

>> I have always suspected under extreme pressure the block layer (or scsi

>> mid-level) does strange things, like an IO hang, attempts to prove that

>> usually lead back to my own code :-). But I have one example recently

>> where upwards of 10 commands had been submitted (blk_execute_rq_nowait())

>> and the following one stalled (all on the same thread). Seconds later

>> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and

>> my dd variant went to its conclusion (reporting 10 errors). Following

>> copies showed no ill effects.

>>

>> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last

>> two monitor for stalls during the copy. Each submitted READ and WRITE

>> command gets its pack_id from an incrementing atomic and a management

>> thread in those copies checks every 300 milliseconds that that atomic

>> value is greater than the previous check. If not, dump the state of the

>> sg driver. The stalled request was in busy state with a timeout of 1

>> nanosecond which indicated that blk_execute_rq_nowait() had not been

>> called. So the chief suspect would be blk_get_request() followed by

>> the bio setup calls IMO.

>>

>> So it certainly looked like an IO hang, not a locking, resource nor

>> corruption issue IMO. That was with a branch off MKP's

>> 5.13/scsi-staging branch taken a few weeks back. So basically

>> lk 5.12.0-rc1 .

> 

> Hi Doug,

> 

> If it would be possible to develop a script that reproduces this hang and

> if that script can be shared I will help with root-causing and fixing this

> hang.


Possible, but not very practical:
    1) apply supplied 83 patches to sg driver
    2) apply pending patch to scsi_debug driver
    3) find a stable kernel platform (perhaps not lk 5.12.0-rc1)
    4) run supplied scripts for three weeks
    5) dig through the output and maybe find one case (there were lots
       of EAGAINs from blk_get_request() but they are expected when
       thrashing the storage layers)


My basic testing strategy may be useful for others:
     sg_dd iflag=random bs=512 of=/dev/sg6
     sg_dd if=/dev/sg6 bs=512 of=/dev/sg7
     sg_dd --verify if=/dev/sg6 bs=512 of=/dev/sg7

If the copy works, so should the verify (compare). The sg_dd utility in
sg3_utils release 1.46 is needed to support iflag=random in the first
line and the --verify in the third line.

If the backing LLD is scsi_debug, then per_host_store=1 is needed. Best
not to use SSDs. The above pattern will work just as well for /dev/sd*
device nodes, but iflag= and oflag= lists must contain the sgio flag.
Then ioctl(/dev/sd*, SG_IO, ...) is used for IO. The limitations of the
third line could be bypassed with something like:
     cmp /dev/sd6 /dev/sd7

If real disks are used, all user data will be trashed.

Doug Gilbert
Ming Lei April 20, 2021, 6:52 a.m. UTC | #16
On Tue, Apr 20, 2021 at 12:54 PM Douglas Gilbert <dgilbert@interlog.com> wrote:
>

> On 2021-04-19 11:22 p.m., Bart Van Assche wrote:

> > On 4/19/21 8:06 PM, Douglas Gilbert wrote:

> >> I have always suspected under extreme pressure the block layer (or scsi

> >> mid-level) does strange things, like an IO hang, attempts to prove that

> >> usually lead back to my own code :-). But I have one example recently

> >> where upwards of 10 commands had been submitted (blk_execute_rq_nowait())

> >> and the following one stalled (all on the same thread). Seconds later

> >> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and

> >> my dd variant went to its conclusion (reporting 10 errors). Following

> >> copies showed no ill effects.

> >>

> >> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last

> >> two monitor for stalls during the copy. Each submitted READ and WRITE

> >> command gets its pack_id from an incrementing atomic and a management

> >> thread in those copies checks every 300 milliseconds that that atomic

> >> value is greater than the previous check. If not, dump the state of the

> >> sg driver. The stalled request was in busy state with a timeout of 1

> >> nanosecond which indicated that blk_execute_rq_nowait() had not been

> >> called. So the chief suspect would be blk_get_request() followed by

> >> the bio setup calls IMO.

> >>

> >> So it certainly looked like an IO hang, not a locking, resource nor

> >> corruption issue IMO. That was with a branch off MKP's

> >> 5.13/scsi-staging branch taken a few weeks back. So basically

> >> lk 5.12.0-rc1 .

> >

> > Hi Doug,

> >

> > If it would be possible to develop a script that reproduces this hang and

> > if that script can be shared I will help with root-causing and fixing this

> > hang.

>

> Possible, but not very practical:

>     1) apply supplied 83 patches to sg driver

>     2) apply pending patch to scsi_debug driver

>     3) find a stable kernel platform (perhaps not lk 5.12.0-rc1)

>     4) run supplied scripts for three weeks

>     5) dig through the output and maybe find one case (there were lots

>        of EAGAINs from blk_get_request() but they are expected when

>        thrashing the storage layers)


Or collecting the debugfs log after IO hang is triggered in your test:

(cd /sys/kernel/debug/block/$SDEV && find . -type f -exec grep -aH . {} \;)

$SDEV is the disk on which IO hang is observed.

Thanks,
Ming
Douglas Gilbert April 20, 2021, 8:22 p.m. UTC | #17
On 2021-04-20 2:52 a.m., Ming Lei wrote:
> On Tue, Apr 20, 2021 at 12:54 PM Douglas Gilbert <dgilbert@interlog.com> wrote:

>>

>> On 2021-04-19 11:22 p.m., Bart Van Assche wrote:

>>> On 4/19/21 8:06 PM, Douglas Gilbert wrote:

>>>> I have always suspected under extreme pressure the block layer (or scsi

>>>> mid-level) does strange things, like an IO hang, attempts to prove that

>>>> usually lead back to my own code :-). But I have one example recently

>>>> where upwards of 10 commands had been submitted (blk_execute_rq_nowait())

>>>> and the following one stalled (all on the same thread). Seconds later

>>>> those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and

>>>> my dd variant went to its conclusion (reporting 10 errors). Following

>>>> copies showed no ill effects.

>>>>

>>>> My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last

>>>> two monitor for stalls during the copy. Each submitted READ and WRITE

>>>> command gets its pack_id from an incrementing atomic and a management

>>>> thread in those copies checks every 300 milliseconds that that atomic

>>>> value is greater than the previous check. If not, dump the state of the

>>>> sg driver. The stalled request was in busy state with a timeout of 1

>>>> nanosecond which indicated that blk_execute_rq_nowait() had not been

>>>> called. So the chief suspect would be blk_get_request() followed by

>>>> the bio setup calls IMO.

>>>>

>>>> So it certainly looked like an IO hang, not a locking, resource nor

>>>> corruption issue IMO. That was with a branch off MKP's

>>>> 5.13/scsi-staging branch taken a few weeks back. So basically

>>>> lk 5.12.0-rc1 .

>>>

>>> Hi Doug,

>>>

>>> If it would be possible to develop a script that reproduces this hang and

>>> if that script can be shared I will help with root-causing and fixing this

>>> hang.

>>

>> Possible, but not very practical:

>>      1) apply supplied 83 patches to sg driver

>>      2) apply pending patch to scsi_debug driver

>>      3) find a stable kernel platform (perhaps not lk 5.12.0-rc1)

>>      4) run supplied scripts for three weeks

>>      5) dig through the output and maybe find one case (there were lots

>>         of EAGAINs from blk_get_request() but they are expected when

>>         thrashing the storage layers)

> 

> Or collecting the debugfs log after IO hang is triggered in your test:

> 

> (cd /sys/kernel/debug/block/$SDEV && find . -type f -exec grep -aH . {} \;)

> 

> $SDEV is the disk on which IO hang is observed.


Thanks. I'll try adding that to my IO hang trigger code.

My patches on the sg driver add debugfs support so these produce
the same output:
     cat /proc/scsi/sg/debug
     cat /sys/kernel/debug/scsi_generic/snapshot

There is also a /sys/kernel/debug/scsi_generic/snapped file whose
contents reflect the driver's state when ioctl(<sg_fd>, SG_DEBUG, &one)
was last called.

When I test, the root file system is usually on a NVMe SSD so the
state of all SCSI disks present should be dumped as they are part
of my test. Also I find the netconsole module extremely useful and
have an old laptop on my network running:
    socat udp-recv:6665 - > socat.txt

picking up the UDP packets from netconsole on port 6665. Not quite as
good as monitoring a hardware serial console, but less fiddly. And
most modern laptops don't have access to a serial console so
netconsole is the only option.
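
For anyone recreating that setup, the sender side is configured with the
netconsole= module parameter (syntax per
Documentation/networking/netconsole.rst); the addresses, interface and
MAC below are placeholders, with the target port matching the 6665
receiver above:

   modprobe netconsole netconsole=6665@192.168.1.20/eth0,6665@192.168.1.10/aa:bb:cc:dd:ee:ff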

Another observation: upper level issues seem to impact the submission
side of request handling (e.g. the IO hang I was trying to describe)
while error injection I can do (e.g. using the scsi_debug driver)
impacts the completion side (obviously). Are there any tools to inject
errors into the block layer submission code?

Doug Gilbert
Ming Lei April 21, 2021, 1:40 a.m. UTC | #18
On Tue, Apr 20, 2021 at 04:22:53PM -0400, Douglas Gilbert wrote:
> On 2021-04-20 2:52 a.m., Ming Lei wrote:

> > On Tue, Apr 20, 2021 at 12:54 PM Douglas Gilbert <dgilbert@interlog.com> wrote:

> > > 

> > > On 2021-04-19 11:22 p.m., Bart Van Assche wrote:

> > > > On 4/19/21 8:06 PM, Douglas Gilbert wrote:

> > > > > I have always suspected under extreme pressure the block layer (or scsi

> > > > > mid-level) does strange things, like an IO hang, attempts to prove that

> > > > > usually lead back to my own code :-). But I have one example recently

> > > > > where upwards of 10 commands had been submitted (blk_execute_rq_nowait())

> > > > > and the following one stalled (all on the same thread). Seconds later

> > > > > those 10 commands reported DID_TIME_OUT, the stalled thread awoke, and

> > > > > my dd variant went to its conclusion (reporting 10 errors). Following

> > > > > copies showed no ill effects.

> > > > > 

> > > > > My weapons of choice are sg_dd, actually sgh_dd and sg_mrq_dd. Those last

> > > > > two monitor for stalls during the copy. Each submitted READ and WRITE

> > > > > command gets its pack_id from an incrementing atomic and a management

> > > > > thread in those copies checks every 300 milliseconds that that atomic

> > > > > value is greater than the previous check. If not, dump the state of the

> > > > > sg driver. The stalled request was in busy state with a timeout of 1

> > > > > nanosecond which indicated that blk_execute_rq_nowait() had not been

> > > > > called. So the chief suspect would be blk_get_request() followed by

> > > > > the bio setup calls IMO.

> > > > > 

> > > > > So it certainly looked like an IO hang, not a locking, resource nor

> > > > > corruption issue IMO. That was with a branch off MKP's

> > > > > 5.13/scsi-staging branch taken a few weeks back. So basically

> > > > > lk 5.12.0-rc1 .

> > > > 

> > > > Hi Doug,

> > > > 

> > > > If it would be possible to develop a script that reproduces this hang and

> > > > if that script can be shared I will help with root-causing and fixing this

> > > > hang.

> > > 

> > > Possible, but not very practical:

> > >      1) apply supplied 83 patches to sg driver

> > >      2) apply pending patch to scsi_debug driver

> > >      3) find a stable kernel platform (perhaps not lk 5.12.0-rc1)

> > >      4) run supplied scripts for three weeks

> > >      5) dig through the output and maybe find one case (there were lots

> > >         of EAGAINs from blk_get_request() but they are expected when

> > >         thrashing the storage layers)

> > 

> > Or collecting the debugfs log after IO hang is triggered in your test:

> > 

> > (cd /sys/kernel/debug/block/$SDEV && find . -type f -exec grep -aH . {} \;)

> > 

> > $SDEV is the disk on which IO hang is observed.

> 

> Thanks. I'll try adding that to my IO hang trigger code.

> 

> My patches on the sg driver add debugfs support so these produce

> the same output:

>     cat /proc/scsi/sg/debug

>     cat /sys/kernel/debug/scsi_generic/snapshot

> 

> There is also a /sys/kernel/debug/scsi_generic/snapped file whose

> contents reflect the driver's state when ioctl(<sg_fd>, SG_DEBUG, &one)

> was last called.

> 

> When I test, the root file system is usually on a NVMe SSD so the

> state of all SCSI disks present should be dumped as they are part

> of my test. Also I find the netconsole module extremely useful and

> have an old laptop on my network running:

>    socat udp-recv:6665 - > socat.txt

> 

> picking up the UDP packets from netconsole on port 6665. Not quite as

> good as monitoring a hardware serial console, but less fiddly. And

> most modern laptops don't have access to a serial console so

> netconsole is the only option.


Yeah, years ago netconsole helped me a lot when a serial console wasn't
available, especially when ssh didn't work either, such as during a
kernel panic.

> 

> Another observation: upper level issues seem to impact the submission

> side of request handling (e.g. the IO hang I was trying to describe)

> while error injection I can do (e.g. using the scsi_debug driver)

> impacts the completion side (obviously). Are there any tools to inject

> errors into the block layer submission code?


The block layer supports fault injection on the submission side; see
should_fail_bio(), which is called from submit_bio_checks().
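
For example, assuming a kernel built with CONFIG_FAIL_MAKE_REQUEST=y,
that hook can be driven from userspace roughly like this (knob names per
Documentation/fault-injection/fault-injection.rst; sdc is a placeholder
device):

   echo 1 > /sys/block/sdc/make-it-fail
   echo 10 > /sys/kernel/debug/fail_make_request/probability
   echo 100 > /sys/kernel/debug/fail_make_request/times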


thanks, 
Ming
John Garry April 23, 2021, 8:43 a.m. UTC | #19
On 15/04/2021 04:46, Ming Lei wrote:
>> CPU util			IOPs

>> mq-deadline	usr=21.72%, sys=44.18%,		423K

>> none		usr=23.15%, sys=74.01%		450K

> Today I re-run the scsi_debug test on two server hardwares(32cores, dual

> numa nodes), and the CPU utilization issue can be reproduced, follow

> the test result:

> 


I haven't forgotten about this.

I finally got your .config working in x86 qemu with only a 4-CPU system.

> 1) randread test on ibm-x3850x6[*] with deadline

> 

>                |IOPS    | FIO CPU util

> ------------------------------------------------

> hosttags      | 94k    | usr=1.13%, sys=14.75%

> ------------------------------------------------

> non hosttags  | 124k   | usr=1.12%, sys=10.65%,

> 


Getting these results for mq-deadline:

hosttags
100K cpu 1.52 4.47

non-hosttags
109K cpu 1.74 5.49

So I still don't see the same CPU usage increase for hosttags.

But throughput is down, so at least I can check on that...

> 

> 2) randread test on ibm-x3850x6[*] with none

>                |IOPS    | FIO CPU util

> ------------------------------------------------

> hosttags      | 120k   | usr=0.89%, sys=6.55%

> ------------------------------------------------

> non hosttags  | 121k   | usr=1.07%, sys=7.35%

> ------------------------------------------------

> 


Here I get:
hosttags
113K cpu 2.04 5.83

non-hosttags
108K cpu 1.71 5.05

Thanks,
John



>   *:

>   	- that is the machine Yanhui reported VM cpu utilization increased by 20%

> 	- kernel: latest linus tree(v5.12-rc7, commit: 7f75285ca57)

> 	- also run same test on another 32cores machine, IOPS drop isn't

> 	  observed, but CPU utilization is increased obviously

> 

> 3) test script

> #/bin/bash

> 

> run_fio() {

> 	RTIME=$1

> 	JOBS=$2

> 	DEVS=$3

> 	BS=$4

> 

> 	QD=64

> 	BATCH=16

> 

> 	fio --bs=$BS --ioengine=libaio \

> 		--iodepth=$QD \

> 	    --iodepth_batch_submit=$BATCH \

> 		--iodepth_batch_complete_min=$BATCH \

> 		--filename=$DEVS \

> 		--direct=1 --runtime=$RTIME --numjobs=$JOBS --rw=randread \

> 		--name=test --group_reporting

> }

> 

> SCHED=$1

> 

> NRQS=`getconf _NPROCESSORS_ONLN`

> 

> rmmod scsi_debug

> modprobe scsi_debug host_max_queue=128 submit_queues=$NRQS virtual_gb=256

> sleep 2

> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`

> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler

> echo 128 >/sys/block/`basename $DEV`/device/queue_depth

> run_fio 20 16 $DEV 8K

> 

> 

> rmmod scsi_debug

> modprobe scsi_debug max_queue=128 submit_queues=1 virtual_gb=256

> sleep 2

> DEV=`lsscsi | grep scsi_debug | awk '{print $6}'`

> echo $SCHED >/sys/block/`basename $DEV`/queue/scheduler

> echo 128 >/sys/block/`basename $DEV`/device/queue_depth

> run_fio 20 16 $DEV 8k
John Garry April 26, 2021, 10:53 a.m. UTC | #20
On 23/04/2021 09:43, John Garry wrote:
>> 1) randread test on ibm-x3850x6[*] with deadline

>>

>>                |IOPS    | FIO CPU util

>> ------------------------------------------------

>> hosttags      | 94k    | usr=1.13%, sys=14.75%

>> ------------------------------------------------

>> non hosttags  | 124k   | usr=1.12%, sys=10.65%,

>>

> 

> Getting these results for mq-deadline:

> 

> hosttags

> 100K cpu 1.52 4.47

> 

> non-hosttags

> 109K cpu 1.74 5.49

> 

> So I still don't see the same CPU usage increase for hosttags.

> 

> But throughput is down, so at least I can check on that...

> 

>>

>> 2) randread test on ibm-x3850x6[*] with none

>>                |IOPS    | FIO CPU util

>> ------------------------------------------------

>> hosttags      | 120k   | usr=0.89%, sys=6.55%

>> ------------------------------------------------

>> non hosttags  | 121k   | usr=1.07%, sys=7.35%

>> ------------------------------------------------

>>

> 

> Here I get:

> hosttags

> 113K cpu 2.04 5.83

> 

> non-hosttags

> 108K cpu 1.71 5.05


Hi Ming,

One thing I noticed for the non-hosttags scenario is that I am hitting
the IO scheduler tag exhaustion path in blk_mq_get_tag() often; here's
some perf output:

|--15.88%--blk_mq_submit_bio
|     |
|     |--11.27%--__blk_mq_alloc_request
|     |      |
|     |       --11.19%--blk_mq_get_tag
|     |      |
|     |      |--6.00%--__blk_mq_delay_run_hw_queue
|     |      |     |

...

|     |      |
|     |      |--3.29%--io_schedule
|     |      |     |

....

|     |      |     |
|     |      |     --1.32%--io_schedule_prepare
|     |      |

...

|     |      |
|     |      |--0.60%--sbitmap_finish_wait
|     |      |
      --0.56%--sbitmap_get

I don't see this for hostwide tags - this may be because we have
multiple hctxs, and the IO sched tags are per hctx, so there is less
chance of exhaustion. But this is not about hostwide tags specifically,
but rather about multiple HW queues in general. As I understood it,
sched tags were meant to be per request queue, right? Am I reading this
correctly?

I can barely remember some debate on this, but could not find the 
thread. Hannes did have a patch related to topic, but was dropped:
https://lore.kernel.org/linux-scsi/20191202153914.84722-7-hare@suse.de/#t

Thanks,
John
Ming Lei April 26, 2021, 2:48 p.m. UTC | #21
On Mon, Apr 26, 2021 at 11:53:45AM +0100, John Garry wrote:
> On 23/04/2021 09:43, John Garry wrote:

> > > 1) randread test on ibm-x3850x6[*] with deadline

> > > 

> > >                |IOPS    | FIO CPU util

> > > ------------------------------------------------

> > > hosttags      | 94k    | usr=1.13%, sys=14.75%

> > > ------------------------------------------------

> > > non hosttags  | 124k   | usr=1.12%, sys=10.65%,

> > > 

> > 

> > Getting these results for mq-deadline:

> > 

> > hosttags

> > 100K cpu 1.52 4.47

> > 

> > non-hosttags

> > 109K cpu 1.74 5.49

> > 

> > So I still don't see the same CPU usage increase for hosttags.

> > 

> > But throughput is down, so at least I can check on that...

> > 

> > > 

> > > 2) randread test on ibm-x3850x6[*] with none

> > >                |IOPS    | FIO CPU util

> > > ------------------------------------------------

> > > hosttags      | 120k   | usr=0.89%, sys=6.55%

> > > ------------------------------------------------

> > > non hosttags  | 121k   | usr=1.07%, sys=7.35%

> > > ------------------------------------------------

> > > 

> > 

> > Here I get:

> > hosttags

> > 113K cpu 2.04 5.83

> > 

> > non-hosttags

> > 108K cpu 1.71 5.05

> 

> Hi Ming,

> 

> One thing I noticed is that for the non-hosttags scenario is that I am

> hitting the IO scheduler tag exhaustion path in blk_mq_get_tag() often;

> here's some perf output:

> 

> |--15.88%--blk_mq_submit_bio

> |     |

> |     |--11.27%--__blk_mq_alloc_request

> |     |      |

> |     |       --11.19%--blk_mq_get_tag

> |     |      |

> |     |      |--6.00%--__blk_mq_delay_run_hw_queue

> |     |      |     |

> 

> ...

> 

> |     |      |

> |     |      |--3.29%--io_schedule

> |     |      |     |

> 

> ....

> 

> |     |      |     |

> |     |      |     --1.32%--io_schedule_prepare

> |     |      |

> 

> ...

> 

> |     |      |

> |     |      |--0.60%--sbitmap_finish_wait

> |     |      |

>      --0.56%--sbitmap_get

> 

> I don't see this for hostwide tags - this may be because we have multiple

> hctx, and the IO sched tags are per hctx, so less chance of exhaustion. But

> this is not from hostwide tags specifically, but for multiple HW queues in

> general. As I understood, sched tags were meant to be per request queue,

> right? I am reading this correctly?


sched tags are still per-hctx.

I just found that you didn't change sched tags into per-request-queue
shared tags. So for hostwide tags, each hctx still has its own
standalone sched tags and request pool; that is one big difference from
non-hostwide tags. That is why you observe that scheduler tag exhaustion
is easy to trigger in the case of non-hostwide tags.

I'd suggest adding one per-request-queue sched tags structure and making
all hctxs share it, just like what you did for the driver tags.


Thanks,
Ming
John Garry April 26, 2021, 3:52 p.m. UTC | #22
On 26/04/2021 15:48, Ming Lei wrote:
>>       --0.56%--sbitmap_get
>>
>> I don't see this for hostwide tags - this may be because we have multiple
>> hctx, and the IO sched tags are per hctx, so less chance of exhaustion. But
>> this is not from hostwide tags specifically, but for multiple HW queues in
>> general. As I understood, sched tags were meant to be per request queue,
>> right? I am reading this correctly?
> sched tags is still per-hctx.
> 
> I just found that you didn't change sched tags into per-request-queue
> shared tags. Then for hostwide tags, each hctx still has its own
> standalone sched tags and request pool, that is one big difference with
> non hostwide tags.


For both hostwide and non-hostwide tags, we have standalone sched tags 
and request pool per hctx when q->nr_hw_queues > 1.

> That is why you observe that scheduler tag exhaustion
> is easy to trigger in case of non-hostwide tags.
> 
> I'd suggest to add one per-request-queue sched tags, and make all hctxs
> sharing it, just like what you did for driver tag.
> 


That sounds reasonable.

But I don't see how this is related to hostwide tags specifically, but 
rather just having q->nr_hw_queues > 1, which NVMe PCI and some other 
SCSI MQ HBAs have (without using hostwide tags).

Thanks,
John
Ming Lei April 26, 2021, 4:03 p.m. UTC | #23
On Mon, Apr 26, 2021 at 04:52:28PM +0100, John Garry wrote:
> On 26/04/2021 15:48, Ming Lei wrote:
> > >       --0.56%--sbitmap_get
> > > 
> > > I don't see this for hostwide tags - this may be because we have multiple
> > > hctx, and the IO sched tags are per hctx, so less chance of exhaustion. But
> > > this is not from hostwide tags specifically, but for multiple HW queues in
> > > general. As I understood, sched tags were meant to be per request queue,
> > > right? I am reading this correctly?
> > sched tags is still per-hctx.
> > 
> > I just found that you didn't change sched tags into per-request-queue
> > shared tags. Then for hostwide tags, each hctx still has its own
> > standalone sched tags and request pool, that is one big difference with
> > non hostwide tags.
> 
> For both hostwide and non-hostwide tags, we have standalone sched tags and
> request pool per hctx when q->nr_hw_queues > 1.


driver tags is shared for hostwide tags.

> 
> > That is why you observe that scheduler tag exhaustion
> > is easy to trigger in case of non-hostwide tags.
> > 
> > I'd suggest to add one per-request-queue sched tags, and make all hctxs
> > sharing it, just like what you did for driver tag.
> > 
> 
> That sounds reasonable.
> 
> But I don't see how this is related to hostwide tags specifically, but
> rather just having q->nr_hw_queues > 1, which NVMe PCI and some other SCSI
> MQ HBAs have (without using hostwide tags).


Before hostwide tags, the whole scheduler queue depth should be 256.
After hostwide tags, the whole scheduler queue depth becomes 256 *
nr_hw_queues. But the driver tag queue depth is _not_ changed.

More requests come and are tried to dispatch to LLD and can't succeed
because of limited driver tag depth, and CPU utilization could be increased.

Thanks,
Ming
John Garry April 26, 2021, 5:02 p.m. UTC | #24
On 26/04/2021 17:03, Ming Lei wrote:
>> For both hostwide and non-hostwide tags, we have standalone sched tags and
>> request pool per hctx when q->nr_hw_queues > 1.
> driver tags is shared for hostwide tags.
> 
>>> That is why you observe that scheduler tag exhaustion
>>> is easy to trigger in case of non-hostwide tags.
>>>
>>> I'd suggest to add one per-request-queue sched tags, and make all hctxs
>>> sharing it, just like what you did for driver tag.
>>>
>> That sounds reasonable.
>>
>> But I don't see how this is related to hostwide tags specifically, but
>> rather just having q->nr_hw_queues > 1, which NVMe PCI and some other SCSI
>> MQ HBAs have (without using hostwide tags).
> Before hostwide tags, the whole scheduler queue depth should be 256.
> After hostwide tags, the whole scheduler queue depth becomes 256 *
> nr_hw_queues. But the driver tag queue depth is _not_ changed.


Fine.

> 
> More requests come and are tried to dispatch to LLD and can't succeed
> because of limited driver tag depth, and CPU utilization could be increased.


Right, maybe this is a problem.

I quickly added some debug, and see that 
__blk_mq_get_driver_tag()->__sbitmap_queue_get() fails ~7% for hostwide 
tags and 3% for non-hostwide tags.

Having it fail at all for non-hostwide tags seems a bit dubious... 
here's the code for deciding the rq sched tag depth:

q->nr_requests = 2 * min(q->tags_set->queue_depth [128], BLK_DEV_MAX_RQ 
[128])

So we get 256 for our test scenario, which is appreciably bigger than 
q->tags_set->queue_depth, so the failures make sense.
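
As a quick sanity check of that arithmetic, a throwaway sketch (queue_depth
and the 128 cap are the values quoted above; nr_hw_queues is an assumption):

#include <stdio.h>

int main(void)
{
	unsigned int queue_depth  = 128;  /* scsi_debug host_max_queue=128 */
	unsigned int max_rq_cap   = 128;  /* the 128 cap quoted above */
	unsigned int nr_hw_queues = 32;   /* assumption: submit_queues=$NR_CPUS */

	/* Per-hctx sched depth, as derived above. */
	unsigned int nr_requests = 2 * (queue_depth < max_rq_cap ?
					queue_depth : max_rq_cap);

	printf("q->nr_requests (per hctx)    = %u\n", nr_requests);
	printf("total sched depth, all hctxs = %u\n",
	       nr_requests * nr_hw_queues);
	printf("shared driver tag depth      = %u\n", queue_depth);
	return 0;
}

So even a single hctx can have twice as many scheduler requests in flight
as there are driver tags, and with per-hctx sched tags the total grows with
nr_hw_queues.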

Anyway, I'll look at adding code for a per-request queue sched tags to 
see if it helps. But I would plan to continue to use a per hctx sched 
request pool.

Thanks,
John
Ming Lei April 26, 2021, 11:59 p.m. UTC | #25
On Mon, Apr 26, 2021 at 06:02:31PM +0100, John Garry wrote:
> On 26/04/2021 17:03, Ming Lei wrote:
> > > For both hostwide and non-hostwide tags, we have standalone sched tags and
> > > request pool per hctx when q->nr_hw_queues > 1.
> > driver tags is shared for hostwide tags.
> > 
> > > > That is why you observe that scheduler tag exhaustion
> > > > is easy to trigger in case of non-hostwide tags.
> > > > 
> > > > I'd suggest to add one per-request-queue sched tags, and make all hctxs
> > > > sharing it, just like what you did for driver tag.
> > > > 
> > > That sounds reasonable.
> > > 
> > > But I don't see how this is related to hostwide tags specifically, but
> > > rather just having q->nr_hw_queues > 1, which NVMe PCI and some other SCSI
> > > MQ HBAs have (without using hostwide tags).
> > Before hostwide tags, the whole scheduler queue depth should be 256.
> > After hostwide tags, the whole scheduler queue depth becomes 256 *
> > nr_hw_queues. But the driver tag queue depth is _not_ changed.
> 
> Fine.
> 
> > 
> > More requests come and are tried to dispatch to LLD and can't succeed
> > because of limited driver tag depth, and CPU utilization could be increased.
> 
> Right, maybe this is a problem.
> 
> I quickly added some debug, and see that
> __blk_mq_get_driver_tag()->__sbitmap_queue_get() fails ~7% for hostwide tags
> and 3% for non-hostwide tags.
> 
> Having it fail at all for non-hostwide tags seems a bit dubious... here's
> the code for deciding the rq sched tag depth:
> 
> q->nr_requests = 2 * min(q->tags_set->queue_depth [128], BLK_DEV_MAX_RQ
> [128])
> 
> So we get 256 for our test scenario, which is appreciably bigger than
> q->tags_set->queue_depth, so the failures make sense.
> 
> Anyway, I'll look at adding code for a per-request queue sched tags to see
> if it helps. But I would plan to continue to use a per hctx sched request
> pool.

Why not switch to per hctx sched request pool?

Thanks,
Ming
John Garry April 27, 2021, 7:52 a.m. UTC | #26
On 27/04/2021 00:59, Ming Lei wrote:
>> Anyway, I'll look at adding code for a per-request queue sched tags to see
>> if it helps. But I would plan to continue to use a per hctx sched request
>> pool.
> Why not switch to per hctx sched request pool?


I don't understand. The current code uses a per-hctx sched request pool, 
and I said that I don't plan to change that.

Thanks,
John
Ming Lei April 27, 2021, 9:11 a.m. UTC | #27
On Tue, Apr 27, 2021 at 08:52:53AM +0100, John Garry wrote:
> On 27/04/2021 00:59, Ming Lei wrote:
> > > Anyway, I'll look at adding code for a per-request queue sched tags to see
> > > if it helps. But I would plan to continue to use a per hctx sched request
> > > pool.
> > Why not switch to per hctx sched request pool?
> 
> I don't understand. The current code uses a per-hctx sched request pool, and
> I said that I don't plan to change that.


I forget why you didn't do that, because for hostwide tags, request
is always 1:1 for either sched tags(real io sched) or driver tags(none).

Maybe you want to keep request local to hctx, but never see related
performance data for supporting the point, sbitmap queue allocator has
been intelligent enough to allocate tag freed from native cpu.
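
A tiny userspace sketch of that per-CPU free-hint idea, heavily simplified
compared with the real sbitmap allocator; the CPU count and depth are made
up:

#include <stdio.h>

#define NR_CPU 4
#define DEPTH  32

static int used[DEPTH];
static int hint[NR_CPU];   /* last tag freed on each CPU, tried first */

/* Allocate a tag, starting the search at the caller CPU's hint so a tag
 * freed on this CPU tends to be handed straight back to it. */
static int tag_get(int cpu)
{
	for (int i = 0; i < DEPTH; i++) {
		int t = (hint[cpu] + i) % DEPTH;

		if (!used[t]) {
			used[t] = 1;
			return t;
		}
	}
	return -1;
}

static void tag_put(int cpu, int tag)
{
	used[tag] = 0;
	hint[cpu] = tag;   /* remember it for the next allocation here */
}

int main(void)
{
	int t0 = tag_get(0);   /* CPU0 takes tag 0 */
	int t1 = tag_get(1);   /* CPU1 takes tag 1 */

	tag_put(0, t0);        /* CPU0 frees its tag ... */

	/* ... and gets the same (cache-hot) tag back on its next allocation. */
	printf("CPU0: freed %d, reallocated %d\n", t0, tag_get(0));
	printf("CPU1: still holds %d\n", t1);
	return 0;
}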

Then you just waste lots of memory, I remember that scsi request payload
is a bit big.


Thanks,
Ming
John Garry April 27, 2021, 9:37 a.m. UTC | #28
On 27/04/2021 10:11, Ming Lei wrote:
> On Tue, Apr 27, 2021 at 08:52:53AM +0100, John Garry wrote:
>> On 27/04/2021 00:59, Ming Lei wrote:
>>>> Anyway, I'll look at adding code for a per-request queue sched tags to see
>>>> if it helps. But I would plan to continue to use a per hctx sched request
>>>> pool.
>>> Why not switch to per hctx sched request pool?
>> I don't understand. The current code uses a per-hctx sched request pool, and
>> I said that I don't plan to change that.
> I forget why you didn't do that, because for hostwide tags, request
> is always 1:1 for either sched tags(real io sched) or driver tags(none).
> 
> Maybe you want to keep request local to hctx, but never see related
> performance data for supporting the point, sbitmap queue allocator has
> been intelligent enough to allocate tag freed from native cpu.
> 
> Then you just waste lots of memory, I remember that scsi request payload
> is a bit big.


It's true that we waste much memory for regular static requests for when 
using hostwide tags today.

One problem in trying to use a single set of "hostwide" static requests 
is that we call blk_mq_init_request(..., hctx_idx, ...) -> 
set->ops->init_request(.., hctx_idx, ...) for each static rq, and this 
would not work for a single set of "hostwide" requests.

And I see a similar problem for a "request queue-wide" sched static 
requests.

Maybe we can improve this in future.

BTW, for the performance issue which Yanhui witnessed with megaraid sas, 
do you think it may because of the IO sched tags issue of total sched 
tag depth growing vs driver tags? Are there lots of LUNs? I can imagine 
that megaraid sas has much larger can_queue than scsi_debug :)

Thanks,
John
Ming Lei April 27, 2021, 9:52 a.m. UTC | #29
On Tue, Apr 27, 2021 at 10:37:39AM +0100, John Garry wrote:
> On 27/04/2021 10:11, Ming Lei wrote:
> > On Tue, Apr 27, 2021 at 08:52:53AM +0100, John Garry wrote:
> > > On 27/04/2021 00:59, Ming Lei wrote:
> > > > > Anyway, I'll look at adding code for a per-request queue sched tags to see
> > > > > if it helps. But I would plan to continue to use a per hctx sched request
> > > > > pool.
> > > > Why not switch to per hctx sched request pool?
> > > I don't understand. The current code uses a per-hctx sched request pool, and
> > > I said that I don't plan to change that.
> > I forget why you didn't do that, because for hostwide tags, request
> > is always 1:1 for either sched tags(real io sched) or driver tags(none).
> > 
> > Maybe you want to keep request local to hctx, but never see related
> > performance data for supporting the point, sbitmap queue allocator has
> > been intelligent enough to allocate tag freed from native cpu.
> > 
> > Then you just waste lots of memory, I remember that scsi request payload
> > is a bit big.
> 
> It's true that we waste much memory for regular static requests for when
> using hostwide tags today.
> 
> One problem in trying to use a single set of "hostwide" static requests is
> that we call blk_mq_init_request(..., hctx_idx, ...) ->
> set->ops->init_request(.., hctx_idx, ...) for each static rq, and this would
> not work for a single set of "hostwide" requests.
> 
> And I see a similar problem for a "request queue-wide" sched static
> requests.
> 
> Maybe we can improve this in future.


OK, fair enough.

> 
> BTW, for the performance issue which Yanhui witnessed with megaraid sas, do
> you think it may because of the IO sched tags issue of total sched tag depth
> growing vs driver tags?


I think it is highly possible. Will you work a patch to convert to
per-request-queue sched tag?

> Are there lots of LUNs? I can imagine that megaraid
> sas has much larger can_queue than scsi_debug :)


No, there are just two LUNs, the 1st LUN is one commodity SSD(queue
depth is 32) and the performance issue is reported on this LUN, another is one
HDD(queue depth is 256) which is root disk, but the megaraid host tag depth is
228, another weird setting. But the issue still can be reproduced after we set
2nd LUN's depth as 64 for avoiding driver tag contention.



Thanks,
Ming
John Garry April 27, 2021, 10:15 a.m. UTC | #30
On 27/04/2021 10:52, Ming Lei wrote:
>> BTW, for the performance issue which Yanhui witnessed with megaraid sas, do
>> you think it may because of the IO sched tags issue of total sched tag depth
>> growing vs driver tags?
> I think it is highly possible. Will you work a patch to convert to
> per-request-queue sched tag?
> 


Sure, I'm just hacking now to see what difference it can make to 
performance. Early results look promising...

>> Are there lots of LUNs? I can imagine that megaraid
>> sas has much larger can_queue than scsi_debug :)
> No, there are just two LUNs, the 1st LUN is one commodity SSD(queue
> depth is 32) and the performance issue is reported on this LUN, another is one
> HDD(queue depth is 256) which is root disk, but the megaraid host tag depth is
> 228, another weird setting. But the issue still can be reproduced after we set
> 2nd LUN's depth as 64 for avoiding driver tag contention.
> 
> 


BTW, one more thing which Kashyap and I looked at when initially 
developing the hostwide tag support was the wait struct usage in the tag 
exhaustion scenario:

https://lore.kernel.org/linux-block/ecaeccf029c6fe377ebd4f30f04df9f1@mail.gmail.com/

IIRC, we looked at a "hostwide" wait_index - it didn't seem to make a 
difference then, and we didn't end up making any changes here, but still 
worth remembering.

Thanks,
John
John Garry July 7, 2021, 5:06 p.m. UTC | #31
On 27/04/2021 10:52, Ming Lei wrote:
>>> Then you just waste lots of memory, I remember that scsi request payload
>>> is a bit big.
>> It's true that we waste much memory for regular static requests for when
>> using hostwide tags today.
>>
>> One problem in trying to use a single set of "hostwide" static requests is
>> that we call blk_mq_init_request(..., hctx_idx, ...) ->
>> set->ops->init_request(.., hctx_idx, ...) for each static rq, and this would
>> not work for a single set of "hostwide" requests.
>>
>> And I see a similar problem for a "request queue-wide" sched static
>> requests.
>>
>> Maybe we can improve this in future.
> OK, fair enough.
> 


JFYI, I am working on this now.

My idea is to introduce hostwide and request-queue-wide static requests 
(for a hostwide tagset) and have the per-hctx tags point at them, so that 
we don't need to allocate per-hctx static requests.

The SCSI init_request callback just ignores the hctx_idx passed to it, so 
we can simply call init_request(hctx_idx = 0) for those hostwide/request 
queue-wide static requests.
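
To get a feel for the memory involved, a rough back-of-the-envelope sketch;
every number below is an assumption picked for illustration, not a real
driver value:

#include <stdio.h>

int main(void)
{
	unsigned long nr_hw_queues = 16;    /* assumption */
	unsigned long tag_depth    = 1024;  /* assumption: hostwide tag depth */
	unsigned long rq_payload   = 768;   /* assumption: request + SCSI cmd, bytes */

	unsigned long per_hctx_pools  = nr_hw_queues * tag_depth * rq_payload;
	unsigned long one_shared_pool = tag_depth * rq_payload;

	printf("per-hctx static requests: %lu KiB\n", per_hctx_pools / 1024);
	printf("hostwide static requests: %lu KiB\n", one_shared_pool / 1024);
	printf("saved:                    %lu KiB\n",
	       (per_hctx_pools - one_shared_pool) / 1024);
	return 0;
}

Whatever the real numbers are, the per-hctx duplication scales with
nr_hw_queues, so sharing one set of static requests removes an
(nr_hw_queues - 1)/nr_hw_queues fraction of that allocation.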

Thanks,
John
diff mbox series

Patch

diff --git a/block/blk-mq-sched.c b/block/blk-mq-sched.c
index e1e997af89a0..45188f7aa789 100644
--- a/block/blk-mq-sched.c
+++ b/block/blk-mq-sched.c
@@ -59,10 +59,18 @@  EXPORT_SYMBOL_GPL(blk_mq_sched_mark_restart_hctx);
 
 void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
 {
+	bool shared_tag = blk_mq_is_sbitmap_shared(hctx->flags);
+
+	if (shared_tag)
+		blk_mq_run_hw_queues(hctx->queue, true);
+
 	if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
 		return;
 	clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
 
+	if (shared_tag)
+		return;
+
 	/*
 	 * Order clearing SCHED_RESTART and list_empty_careful(&hctx->dispatch)
 	 * in blk_mq_run_hw_queue(). Its pair is the barrier in