Message ID | 20210305232151.1531-1-melanieplageman@gmail.com |
---|---|
State | New |
Headers | show |
Series | [v1] scsi: storvsc: Cap cmd_per_lun at can_queue | expand |
From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, March 5, 2021 3:22 PM > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > allocation. > > Cap cmd_per_lun at can_queue to avoid dispatch errors. > > Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> > --- > drivers/scsi/storvsc_drv.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > index 6bc5453cea8a..d7953a6e00e6 100644 > --- a/drivers/scsi/storvsc_drv.c > +++ b/drivers/scsi/storvsc_drv.c > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > (max_sub_channels + 1) * > (100 - ring_avail_percent_lowater) / 100; > > + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); > + I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? Be aware that the calculation of "can_queue" in this driver is somewhat flawed -- it should not be based on the size of the ring buffer, but instead on the maximum number of requests Hyper-V will queue. And even then, can_queue doesn't provide the cap you might expect because the blk-mq layer allocates can_queue tags for each HW queue, not as a total. I agree that the cmd_per_lun setting is also too big, but we should fix that in the context of getting all of these different settings working together correctly, and not piecemeal. Michael > host = scsi_host_alloc(&scsi_driver, > sizeof(struct hv_host_device)); > if (!host) > -- > 2.20.1
On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote: > From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, March 5, 2021 3:22 PM > > > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > > allocation. > > > > Cap cmd_per_lun at can_queue to avoid dispatch errors. > > > > Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> > > --- > > drivers/scsi/storvsc_drv.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > > index 6bc5453cea8a..d7953a6e00e6 100644 > > --- a/drivers/scsi/storvsc_drv.c > > +++ b/drivers/scsi/storvsc_drv.c > > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > > (max_sub_channels + 1) * > > (100 - ring_avail_percent_lowater) / 100; > > > > + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); > > + > > I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set Scsi_Host->cmd_per_lun in storvsc_probe(). In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is called and sets the scsi_device->queue_depth to the Scsi_Host's cmd_per_lun with this code: scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? sdev->host->cmd_per_lun : 1); During dispatch, the scsi_device->queue_depth is used in scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine whether or not the device can queue another command. On some machines, with the 2048 value of cmd_per_lun that was used to set the initial scsi_device->queue_depth, commands can be queued that are later not able to be dispatched after running out of space in the ringbuffer. On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD (running an fio workload that I can provide if needed), storvsc_do_io() ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. This is the call stack: hv_get_bytes_to_write hv_ringbuffer_write vmbus_send_packet storvsc_dio_io storvsc_queuecommand scsi_dispatch_cmd scsi_queue_rq dispatch_rq_list > Be aware that the calculation of "can_queue" in this driver is somewhat > flawed -- it should not be based on the size of the ring buffer, but instead on > the maximum number of requests Hyper-V will queue. And even then, > can_queue doesn't provide the cap you might expect because the blk-mq layer > allocates can_queue tags for each HW queue, not as a total. The docs for scsi_mid_low_api document Scsi_Host can_queue this way: can_queue - must be greater than 0; do not send more than can_queue commands to the adapter. I did notice that in scsi_host.h, the comment for can_queue does say can_queue is the "maximum number of simultaneous commands a single hw queue in HBA will accept." However, I don't see it being used this way in the code. During dispatch, In scsi_target_queue_ready(), there is this code: if (busy >= starget->can_queue) goto starved; And the scsi_target->can_queue value should be coming from Scsi_host as mentioned in the scsi_target definition in scsi_device.h /* * LLDs should set this in the slave_alloc host template callout. * If set to zero then there is not limit. */ unsigned int can_queue; So, I don't really see how this would be per hardware queue. > > I agree that the cmd_per_lun setting is also too big, but we should fix that in > the context of getting all of these different settings working together correctly, > and not piecemeal. > Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe will also prevent the LUN queue_depth from being set to a value that is higher than it can ever be set to again by the user when storvsc_change_queue_depth() is invoked. Also in scsi_sysfs sdev_store_queue_depth() there is this check: if (depth < 1 || depth > sdev->host->can_queue) return -EINVAL; I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun is set to the min of the configured cmd_per_lun and Scsi_Host->can_queue: shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); Best, Melanie
On 08/03/2021 17:56, Melanie Plageman wrote: > On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote: >> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, March 5, 2021 3:22 PM >>> >>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during >>> allocation. >>> >>> Cap cmd_per_lun at can_queue to avoid dispatch errors. >>> >>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> >>> --- >>> drivers/scsi/storvsc_drv.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c >>> index 6bc5453cea8a..d7953a6e00e6 100644 >>> --- a/drivers/scsi/storvsc_drv.c >>> +++ b/drivers/scsi/storvsc_drv.c >>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, >>> (max_sub_channels + 1) * >>> (100 - ring_avail_percent_lowater) / 100; >>> >>> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); >>> + >> >> I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? > > The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set > Scsi_Host->cmd_per_lun in storvsc_probe(). > > In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is > called and sets the scsi_device->queue_depth to the Scsi_Host's > cmd_per_lun with this code: > > scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? > sdev->host->cmd_per_lun : 1); > > During dispatch, the scsi_device->queue_depth is used in > scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine > whether or not the device can queue another command. > > On some machines, with the 2048 value of cmd_per_lun that was used to > set the initial scsi_device->queue_depth, commands can be queued that > are later not able to be dispatched after running out of space in the > ringbuffer. > > On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD > (running an fio workload that I can provide if needed), storvsc_do_io() > ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. > > This is the call stack: > > hv_get_bytes_to_write > hv_ringbuffer_write > vmbus_send_packet > storvsc_dio_io > storvsc_queuecommand > scsi_dispatch_cmd > scsi_queue_rq > dispatch_rq_list > >> Be aware that the calculation of "can_queue" in this driver is somewhat >> flawed -- it should not be based on the size of the ring buffer, but instead on >> the maximum number of requests Hyper-V will queue. And even then, >> can_queue doesn't provide the cap you might expect because the blk-mq layer >> allocates can_queue tags for each HW queue, not as a total. > > > The docs for scsi_mid_low_api document Scsi_Host can_queue this way: > > can_queue > - must be greater than 0; do not send more than can_queue > commands to the adapter. > > I did notice that in scsi_host.h, the comment for can_queue does say > can_queue is the "maximum number of simultaneous commands a single hw > queue in HBA will accept." However, I don't see it being used this way > in the code. > JFYI, the block layer ensures that no more than can_queue requests are sent to the host. See scsi_mq_setup_tags(), and how the tagset queue depth is set to shost->can_queue. Thanks, John > During dispatch, In scsi_target_queue_ready(), there is this code: > > if (busy >= starget->can_queue) > goto starved; > > And the scsi_target->can_queue value should be coming from Scsi_host as > mentioned in the scsi_target definition in scsi_device.h > /* > * LLDs should set this in the slave_alloc host template callout. > * If set to zero then there is not limit. > */ > unsigned int can_queue; > > So, I don't really see how this would be per hardware queue. > >> >> I agree that the cmd_per_lun setting is also too big, but we should fix that in >> the context of getting all of these different settings working together correctly, >> and not piecemeal. >> > > Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe > will also prevent the LUN queue_depth from being set to a value that is > higher than it can ever be set to again by the user when > storvsc_change_queue_depth() is invoked. > > Also in scsi_sysfs sdev_store_queue_depth() there is this check: > > if (depth < 1 || depth > sdev->host->can_queue) > return -EINVAL; > > I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun > is set to the min of the configured cmd_per_lun and > Scsi_Host->can_queue: > > shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); > > Best, > Melanie > . >
From: Melanie Plageman <melanieplageman@gmail.com> Sent: Monday, March 8, 2021 9:56 AM > > On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote: > > From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, March > 5, 2021 3:22 PM > > > > > > The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > > > allocation. > > > > > > Cap cmd_per_lun at can_queue to avoid dispatch errors. > > > > > > Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> > > > --- > > > drivers/scsi/storvsc_drv.c | 2 ++ > > > 1 file changed, 2 insertions(+) > > > > > > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > > > index 6bc5453cea8a..d7953a6e00e6 100644 > > > --- a/drivers/scsi/storvsc_drv.c > > > +++ b/drivers/scsi/storvsc_drv.c > > > @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > > > (max_sub_channels + 1) * > > > (100 - ring_avail_percent_lowater) / 100; > > > > > > + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, > scsi_driver.can_queue); > > > + > > > > I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? > > The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set > Scsi_Host->cmd_per_lun in storvsc_probe(). > > In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is > called and sets the scsi_device->queue_depth to the Scsi_Host's > cmd_per_lun with this code: > > scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? > sdev->host->cmd_per_lun : 1); > > During dispatch, the scsi_device->queue_depth is used in > scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine > whether or not the device can queue another command. > > On some machines, with the 2048 value of cmd_per_lun that was used to > set the initial scsi_device->queue_depth, commands can be queued that > are later not able to be dispatched after running out of space in the > ringbuffer. > > On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD > (running an fio workload that I can provide if needed), storvsc_do_io() > ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. > > This is the call stack: > > hv_get_bytes_to_write > hv_ringbuffer_write > vmbus_send_packet > storvsc_dio_io > storvsc_queuecommand > scsi_dispatch_cmd > scsi_queue_rq > dispatch_rq_list > > > Be aware that the calculation of "can_queue" in this driver is somewhat > > flawed -- it should not be based on the size of the ring buffer, but instead on > > the maximum number of requests Hyper-V will queue. And even then, > > can_queue doesn't provide the cap you might expect because the blk-mq layer > > allocates can_queue tags for each HW queue, not as a total. > > > The docs for scsi_mid_low_api document Scsi_Host can_queue this way: > > can_queue > - must be greater than 0; do not send more than can_queue > commands to the adapter. > > I did notice that in scsi_host.h, the comment for can_queue does say > can_queue is the "maximum number of simultaneous commands a single hw > queue in HBA will accept." Yes, this comment is correct. The can_queue value is per HW queue. > However, I don't see it being used this way > in the code. > > During dispatch, In scsi_target_queue_ready(), there is this code: > > if (busy >= starget->can_queue) > goto starved; > > And the scsi_target->can_queue value should be coming from Scsi_host as > mentioned in the scsi_target definition in scsi_device.h > /* > * LLDs should set this in the slave_alloc host template callout. > * If set to zero then there is not limit. > */ > unsigned int can_queue; > > So, I don't really see how this would be per hardware queue. For the storvsc driver, the can_queue value in the scsi_target is initialized to zero in scsi_alloc_target() and it remains unchanged. Maybe I'm missing something, but the only place I see that sets starget->can_queue to a non-zero value is iscsi_target_alloc(). The storvsc slave_alloc() function does not set it. So the test in scsi_target_queue_ready() for exceeding can_queue never executes. We've run live tests, and can see that the number of requests sent to the storvsc driver exceeds the can_queue value when the # of HW queues is greater than 1. That result is consistent with what I see in the code. Michael > > > > > I agree that the cmd_per_lun setting is also too big, but we should fix that in > > the context of getting all of these different settings working together correctly, > > and not piecemeal. > > > > Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe > will also prevent the LUN queue_depth from being set to a value that is > higher than it can ever be set to again by the user when > storvsc_change_queue_depth() is invoked. > > Also in scsi_sysfs sdev_store_queue_depth() there is this check: > > if (depth < 1 || depth > sdev->host->can_queue) > return -EINVAL; > > I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun > is set to the min of the configured cmd_per_lun and > Scsi_Host->can_queue: > > shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); > > Best, > Melanie
From: John Garry <john.garry@huawei.com> Sent: Tuesday, March 9, 2021 2:10 AM > > On 08/03/2021 17:56, Melanie Plageman wrote: > > On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote: > >> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, > March 5, 2021 3:22 PM > >>> > >>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > >>> allocation. > >>> > >>> Cap cmd_per_lun at can_queue to avoid dispatch errors. > >>> > >>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> > >>> --- > >>> drivers/scsi/storvsc_drv.c | 2 ++ > >>> 1 file changed, 2 insertions(+) > >>> > >>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > >>> index 6bc5453cea8a..d7953a6e00e6 100644 > >>> --- a/drivers/scsi/storvsc_drv.c > >>> +++ b/drivers/scsi/storvsc_drv.c > >>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > >>> (max_sub_channels + 1) * > >>> (100 - ring_avail_percent_lowater) / 100; > >>> > >>> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, > scsi_driver.can_queue); > >>> + > >> > >> I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? > > > > The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set > > Scsi_Host->cmd_per_lun in storvsc_probe(). > > > > In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is > > called and sets the scsi_device->queue_depth to the Scsi_Host's > > cmd_per_lun with this code: > > > > scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? > > sdev->host->cmd_per_lun : 1); > > > > During dispatch, the scsi_device->queue_depth is used in > > scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine > > whether or not the device can queue another command. > > > > On some machines, with the 2048 value of cmd_per_lun that was used to > > set the initial scsi_device->queue_depth, commands can be queued that > > are later not able to be dispatched after running out of space in the > > ringbuffer. > > > > On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD > > (running an fio workload that I can provide if needed), storvsc_do_io() > > ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. > > > > This is the call stack: > > > > hv_get_bytes_to_write > > hv_ringbuffer_write > > vmbus_send_packet > > storvsc_dio_io > > storvsc_queuecommand > > scsi_dispatch_cmd > > scsi_queue_rq > > dispatch_rq_list > > > >> Be aware that the calculation of "can_queue" in this driver is somewhat > >> flawed -- it should not be based on the size of the ring buffer, but instead on > >> the maximum number of requests Hyper-V will queue. And even then, > >> can_queue doesn't provide the cap you might expect because the blk-mq layer > >> allocates can_queue tags for each HW queue, not as a total. > > > > > > The docs for scsi_mid_low_api document Scsi_Host can_queue this way: > > > > can_queue > > - must be greater than 0; do not send more than can_queue > > commands to the adapter. > > > > I did notice that in scsi_host.h, the comment for can_queue does say > > can_queue is the "maximum number of simultaneous commands a single hw > > queue in HBA will accept." However, I don't see it being used this way > > in the code. > > > > JFYI, the block layer ensures that no more than can_queue requests are > sent to the host. See scsi_mq_setup_tags(), and how the tagset queue > depth is set to shost->can_queue. > > Thanks, > John Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(), which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag set queue_depth as needed until it succeeds. The key thing is that __blk_mq_alloc_rq_maps() iterates over the number of HW queues calling __blk_mq_alloc_map_and_request(). The latter function allocates the map and the requests with a count of the tag set's queue_depth. There's no logic to apportion the can_queue value across multiple HW queues. So each HW queue gets can_queue tags allocated, and the SCSI host driver may see up to (can_queue * # HW queues) simultaneous requests. I'm certainly not an expert in this area, but that's what I see in the code. We've run live experiments, and can see the number simultaneous requests sent to the storvsc driver be greater than can_queue when the # of HW queues is greater than 1, which seems to be consistent with the code. Michael > > > > During dispatch, In scsi_target_queue_ready(), there is this code: > > > > if (busy >= starget->can_queue) > > goto starved; > > > > And the scsi_target->can_queue value should be coming from Scsi_host as > > mentioned in the scsi_target definition in scsi_device.h > > /* > > * LLDs should set this in the slave_alloc host template callout. > > * If set to zero then there is not limit. > > */ > > unsigned int can_queue; > > > > So, I don't really see how this would be per hardware queue. > > > >> > >> I agree that the cmd_per_lun setting is also too big, but we should fix that in > >> the context of getting all of these different settings working together correctly, > >> and not piecemeal. > >> > > > > Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe > > will also prevent the LUN queue_depth from being set to a value that is > > higher than it can ever be set to again by the user when > > storvsc_change_queue_depth() is invoked. > > > > Also in scsi_sysfs sdev_store_queue_depth() there is this check: > > > > if (depth < 1 || depth > sdev->host->can_queue) > > return -EINVAL; > > > > I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun > > is set to the min of the configured cmd_per_lun and > > Scsi_Host->can_queue: > > > > shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); > > > > Best, > > Melanie > > . > >
On 09/03/2021 15:57, Michael Kelley wrote: > From: John Garry <john.garry@huawei.com> Sent: Tuesday, March 9, 2021 2:10 AM >> >> On 08/03/2021 17:56, Melanie Plageman wrote: >>> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote: >>>> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, >> March 5, 2021 3:22 PM >>>>> >>>>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during >>>>> allocation. >>>>> >>>>> Cap cmd_per_lun at can_queue to avoid dispatch errors. >>>>> >>>>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> >>>>> --- >>>>> drivers/scsi/storvsc_drv.c | 2 ++ >>>>> 1 file changed, 2 insertions(+) >>>>> >>>>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c >>>>> index 6bc5453cea8a..d7953a6e00e6 100644 >>>>> --- a/drivers/scsi/storvsc_drv.c >>>>> +++ b/drivers/scsi/storvsc_drv.c >>>>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, >>>>> (max_sub_channels + 1) * >>>>> (100 - ring_avail_percent_lowater) / 100; >>>>> >>>>> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, >> scsi_driver.can_queue); >>>>> + >>>> >>>> I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? >>> >>> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set >>> Scsi_Host->cmd_per_lun in storvsc_probe(). >>> >>> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is >>> called and sets the scsi_device->queue_depth to the Scsi_Host's >>> cmd_per_lun with this code: >>> >>> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? >>> sdev->host->cmd_per_lun : 1); >>> >>> During dispatch, the scsi_device->queue_depth is used in >>> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine >>> whether or not the device can queue another command. >>> >>> On some machines, with the 2048 value of cmd_per_lun that was used to >>> set the initial scsi_device->queue_depth, commands can be queued that >>> are later not able to be dispatched after running out of space in the >>> ringbuffer. >>> >>> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD >>> (running an fio workload that I can provide if needed), storvsc_do_io() >>> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. >>> >>> This is the call stack: >>> >>> hv_get_bytes_to_write >>> hv_ringbuffer_write >>> vmbus_send_packet >>> storvsc_dio_io >>> storvsc_queuecommand >>> scsi_dispatch_cmd >>> scsi_queue_rq >>> dispatch_rq_list >>> >>>> Be aware that the calculation of "can_queue" in this driver is somewhat >>>> flawed -- it should not be based on the size of the ring buffer, but instead on >>>> the maximum number of requests Hyper-V will queue. And even then, >>>> can_queue doesn't provide the cap you might expect because the blk-mq layer >>>> allocates can_queue tags for each HW queue, not as a total. >>> >>> >>> The docs for scsi_mid_low_api document Scsi_Host can_queue this way: >>> >>> can_queue >>> - must be greater than 0; do not send more than can_queue >>> commands to the adapter. >>> >>> I did notice that in scsi_host.h, the comment for can_queue does say >>> can_queue is the "maximum number of simultaneous commands a single hw >>> queue in HBA will accept." However, I don't see it being used this way >>> in the code. >>> >> >> JFYI, the block layer ensures that no more than can_queue requests are >> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue >> depth is set to shost->can_queue. >> >> Thanks, >> John > > Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls > blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(), > which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag > set queue_depth as needed until it succeeds. > > The key thing is that __blk_mq_alloc_rq_maps() iterates over the > number of HW queues calling __blk_mq_alloc_map_and_request(). > The latter function allocates the map and the requests with a count > of the tag set's queue_depth. There's no logic to apportion the > can_queue value across multiple HW queues. So each HW queue gets > can_queue tags allocated, and the SCSI host driver may see up to > (can_queue * # HW queues) simultaneous requests. > > I'm certainly not an expert in this area, but that's what I see in the > code. We've run live experiments, and can see the number > simultaneous requests sent to the storvsc driver be greater than > can_queue when the # of HW queues is greater than 1, which seems > to be consistent with the code. ah, ok. I assumed that # of HW queues = 1 here. So you're describing a problem similar to https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1bb5@huawei.com/ So if you check nr_hw_queues comment in include/scsi/scsi_host.h, it reads: the total queue depth per host is nr_hw_queues * can_queue. However, for when host_tagset is set, the total queue depth is can_queue. Setting .host_tagset will ensure at most can_queue requests will be sent over all HW queues at any given time. A few SCSI MQ drivers set this now. Thanks, John > > Michael > >> >> >>> During dispatch, In scsi_target_queue_ready(), there is this code: >>> >>> if (busy >= starget->can_queue) >>> goto starved; >>> >>> And the scsi_target->can_queue value should be coming from Scsi_host as >>> mentioned in the scsi_target definition in scsi_device.h >>> /* >>> * LLDs should set this in the slave_alloc host template callout. >>> * If set to zero then there is not limit. >>> */ >>> unsigned int can_queue; >>> >>> So, I don't really see how this would be per hardware queue. >>> >>>> >>>> I agree that the cmd_per_lun setting is also too big, but we should fix that in >>>> the context of getting all of these different settings working together correctly, >>>> and not piecemeal. >>>> >>> >>> Capping Scsi_Host->cmd_per_lun to scsi_driver.can_queue during probe >>> will also prevent the LUN queue_depth from being set to a value that is >>> higher than it can ever be set to again by the user when >>> storvsc_change_queue_depth() is invoked. >>> >>> Also in scsi_sysfs sdev_store_queue_depth() there is this check: >>> >>> if (depth < 1 || depth > sdev->host->can_queue) >>> return -EINVAL; >>> >>> I would also note that VirtIO SCSI in virtscsi_probe(), Scsi_Host->cmd_per_lun >>> is set to the min of the configured cmd_per_lun and >>> Scsi_Host->can_queue: >>> >>> shost->cmd_per_lun = min_t(u32, cmd_per_lun, shost->can_queue); >>> >>> Best, >>> Melanie >>> . >>> >
From: John Garry <john.garry@huawei.com> Sent: Tuesday, March 9, 2021 8:36 AM > > On 09/03/2021 15:57, Michael Kelley wrote: > > From: John Garry <john.garry@huawei.com> Sent: Tuesday, March 9, 2021 2:10 AM > >> > >> On 08/03/2021 17:56, Melanie Plageman wrote: > >>> On Mon, Mar 08, 2021 at 02:37:40PM +0000, Michael Kelley wrote: > >>>> From: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> Sent: Friday, > >> March 5, 2021 3:22 PM > >>>>> > >>>>> The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during > >>>>> allocation. > >>>>> > >>>>> Cap cmd_per_lun at can_queue to avoid dispatch errors. > >>>>> > >>>>> Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> > >>>>> --- > >>>>> drivers/scsi/storvsc_drv.c | 2 ++ > >>>>> 1 file changed, 2 insertions(+) > >>>>> > >>>>> diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c > >>>>> index 6bc5453cea8a..d7953a6e00e6 100644 > >>>>> --- a/drivers/scsi/storvsc_drv.c > >>>>> +++ b/drivers/scsi/storvsc_drv.c > >>>>> @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, > >>>>> (max_sub_channels + 1) * > >>>>> (100 - ring_avail_percent_lowater) / 100; > >>>>> > >>>>> + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, > >> scsi_driver.can_queue); > >>>>> + > >>>> > >>>> I'm not sure what you mean by "avoid dispatch errors". Can you elaborate? > >>> > >>> The scsi_driver.cmd_per_lun is set to 2048. Which is then used to set > >>> Scsi_Host->cmd_per_lun in storvsc_probe(). > >>> > >>> In storvsc_probe(), when doing scsi_scan_host(), scsi_alloc_sdev() is > >>> called and sets the scsi_device->queue_depth to the Scsi_Host's > >>> cmd_per_lun with this code: > >>> > >>> scsi_change_queue_depth(sdev, sdev->host->cmd_per_lun ? > >>> sdev->host->cmd_per_lun : 1); > >>> > >>> During dispatch, the scsi_device->queue_depth is used in > >>> scsi_dev_queue_ready(), called by scsi_mq_get_budget() to determine > >>> whether or not the device can queue another command. > >>> > >>> On some machines, with the 2048 value of cmd_per_lun that was used to > >>> set the initial scsi_device->queue_depth, commands can be queued that > >>> are later not able to be dispatched after running out of space in the > >>> ringbuffer. > >>> > >>> On an 8 core Azure VM with 16GB of memory with a single 1 TiB SSD > >>> (running an fio workload that I can provide if needed), storvsc_do_io() > >>> ends up often returning SCSI_MLQUEUE_DEVICE_BUSY. > >>> > >>> This is the call stack: > >>> > >>> hv_get_bytes_to_write > >>> hv_ringbuffer_write > >>> vmbus_send_packet > >>> storvsc_dio_io > >>> storvsc_queuecommand > >>> scsi_dispatch_cmd > >>> scsi_queue_rq > >>> dispatch_rq_list > >>> > >>>> Be aware that the calculation of "can_queue" in this driver is somewhat > >>>> flawed -- it should not be based on the size of the ring buffer, but instead on > >>>> the maximum number of requests Hyper-V will queue. And even then, > >>>> can_queue doesn't provide the cap you might expect because the blk-mq layer > >>>> allocates can_queue tags for each HW queue, not as a total. > >>> > >>> > >>> The docs for scsi_mid_low_api document Scsi_Host can_queue this way: > >>> > >>> can_queue > >>> - must be greater than 0; do not send more than can_queue > >>> commands to the adapter. > >>> > >>> I did notice that in scsi_host.h, the comment for can_queue does say > >>> can_queue is the "maximum number of simultaneous commands a single hw > >>> queue in HBA will accept." However, I don't see it being used this way > >>> in the code. > >>> > >> > >> JFYI, the block layer ensures that no more than can_queue requests are > >> sent to the host. See scsi_mq_setup_tags(), and how the tagset queue > >> depth is set to shost->can_queue. > >> > >> Thanks, > >> John > > > > Agree on what's in scsi_mq_setup_tags(). But scsi_mq_setup_tags() calls > > blk_mq_alloc_tag_set(), which in turn calls blk_mq_alloc_map_and_requests(), > > which calls __blk_mq_alloc_rq_maps() repeatedly, reducing the tag > > set queue_depth as needed until it succeeds. > > > > The key thing is that __blk_mq_alloc_rq_maps() iterates over the > > number of HW queues calling __blk_mq_alloc_map_and_request(). > > The latter function allocates the map and the requests with a count > > of the tag set's queue_depth. There's no logic to apportion the > > can_queue value across multiple HW queues. So each HW queue gets > > can_queue tags allocated, and the SCSI host driver may see up to > > (can_queue * # HW queues) simultaneous requests. > > > > I'm certainly not an expert in this area, but that's what I see in the > > code. We've run live experiments, and can see the number > > simultaneous requests sent to the storvsc driver be greater than > > can_queue when the # of HW queues is greater than 1, which seems > > to be consistent with the code. > > ah, ok. I assumed that # of HW queues = 1 here. So you're describing a > problem similar to > https://lore.kernel.org/linux-scsi/b3e4e597-779b-7c1e-0d3c-07bc3dab1bb5@huawei.com/ Yes. > > So if you check nr_hw_queues comment in include/scsi/scsi_host.h, it reads: > the total queue depth per host is nr_hw_queues * can_queue. However, for > when host_tagset is set, the total queue depth is can_queue. > > Setting .host_tagset will ensure at most can_queue requests will be sent > over all HW queues at any given time. A few SCSI MQ drivers set this now. > I had seen the "host_tagset" option in some recent investigations, and it looks like it better models what the storvsc driver wants. But given that only a few drivers were using it, it seemed like it might be far enough off the beaten path to be a bit risky. The storvsc driver is the driver for the synthetic SCSI controller provided by Microsoft's Hyper-V hypervisor, and as such it is used heavily in the Azure public cloud. The settings in storvsc have changed incrementally over the years, and there are now some inconsistencies as Melanie has pointed out. Storvsc and the underlying Hyper-V run in VMs with relatively low IOPS limits and also in VMs that can hit hundreds of K IOPS. Getting enough parallelism to drive the high IOPS while not overloading in the low IOPS environments is the challenge. As a purely synthetic device, there aren't really multiple HW queues, but configuring for multiple queues helps drive the higher end IOPS numbers in VMs with sizable vCPU counts. So setting "host_tagset" while allowing multiple HW queues for higher parallelism might be a good approach. We will experiment. Michael
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c index 6bc5453cea8a..d7953a6e00e6 100644 --- a/drivers/scsi/storvsc_drv.c +++ b/drivers/scsi/storvsc_drv.c @@ -1946,6 +1946,8 @@ static int storvsc_probe(struct hv_device *device, (max_sub_channels + 1) * (100 - ring_avail_percent_lowater) / 100; + scsi_driver.cmd_per_lun = min_t(u32, scsi_driver.cmd_per_lun, scsi_driver.can_queue); + host = scsi_host_alloc(&scsi_driver, sizeof(struct hv_host_device)); if (!host)
The scsi_device->queue_depth is set to Scsi_Host->cmd_per_lun during allocation. Cap cmd_per_lun at can_queue to avoid dispatch errors. Signed-off-by: Melanie Plageman (Microsoft) <melanieplageman@gmail.com> --- drivers/scsi/storvsc_drv.c | 2 ++ 1 file changed, 2 insertions(+)