| Message ID | 20230124190308.127318-2-niklas.cassel@wdc.com |
|---|---|
| State | New |
| Series | Add Command Duration Limits support |
On 1/24/23 11:02, Niklas Cassel wrote: > Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should > be executed using duration-limits targets. The duration target to apply > to a command is indicated using the priority level. Up to 8 levels are > supported, with level 0 indiating "no limit". > > This priority class has effect only if the target device supports the > command duration limits feature and this feature is enabled by the user. > > While it is recommended to not use an ioscheduler when using the > IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler, > IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT. > > The reason for this is twofold: > 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a > duration limit descriptor (DLD) inside the device. Users can configure > these limits themselves using passthrough commands, so from a block layer > perspective, Linux has no idea of how each DLD is actually configured. > > By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds > its duration limit (because it was held too long in the scheduler) is > decreased. It is still possible to use the IOPRIO_CLASS_DL priority class > for "low priority" IOs by configuring a large limit in the respective DLD. > > 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands > (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority > commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to > IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these > two incompatible priority classes. Implementing duration limit support using the I/O priority mechanism makes it impossible to configure the I/O priority for commands that have a duration limit. Shouldn't the duration limit be independent of the I/O priority? Am I perhaps missing something? Thanks, Bart.
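For readers following the thread, the interface under discussion is just the per-command I/O priority value: the class selects "duration limits" and the level selects which descriptor on the drive applies. A minimal user space sketch of what that could look like with io_uring is shown below; it is not part of the series, the /dev/sdX path is a placeholder, and the IOPRIO_CLASS_DL value (4) is taken from the proposed uapi enum quoted later in this thread.

/* Sketch only: one read tagged with the proposed duration-limits class.
 * Level 3 selects CDL descriptor 3 on the drive; level 0 would mean
 * "no limit". Needs CAP_SYS_NICE (or CAP_SYS_ADMIN) and a kernel with
 * this series applied. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <liburing.h>
#include <linux/ioprio.h>              /* IOPRIO_PRIO_VALUE() */

#ifndef IOPRIO_CLASS_DL
#define IOPRIO_CLASS_DL 4              /* value from the proposed uapi enum */
#endif

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        void *buf;
        int fd;

        if (posix_memalign(&buf, 4096, 4096))
                return 1;
        fd = open("/dev/sdX", O_RDONLY | O_DIRECT);   /* placeholder device */
        if (fd < 0 || io_uring_queue_init(8, &ring, 0) < 0)
                return 1;

        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, 4096, 0);
        /* The CDL index rides in the per-I/O priority, like any other ioprio. */
        sqe->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 3);

        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
        io_uring_queue_exit(&ring);
        return 0;
}

An application that prefers a per-process setting could instead pass the same encoded value to the ioprio_set() system call.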
On 1/24/23 11:27, Bart Van Assche wrote: > Implementing duration limit support using the I/O priority mechanism > makes it impossible to configure the I/O priority for commands that have > a duration limit. Shouldn't the duration limit be independent of the I/O > priority? Am I perhaps missing something? (replying to my own e-mail) In SAM-6 I found the following: "The device server may use the duration expiration time to determine the order of processing commands with the SIMPLE task attribute within the task set. A difference in duration expiration time between commands may override other scheduling considerations (e.g., different times to access different logical block addresses or vendor specific scheduling considerations). Processing of a collection of commands with different command duration limit settings should cause a command with an earlier duration expiration time to complete with status sooner than a command with a later duration expiration time." Do I understand correctly that it is optional for a SCSI device to interpret the command duration as a priority and that this is not mandatory? Thanks, Bart.
On 1/25/23 04:27, Bart Van Assche wrote: > On 1/24/23 11:02, Niklas Cassel wrote: >> Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should >> be executed using duration-limits targets. The duration target to apply >> to a command is indicated using the priority level. Up to 8 levels are >> supported, with level 0 indiating "no limit". >> >> This priority class has effect only if the target device supports the >> command duration limits feature and this feature is enabled by the user. >> >> While it is recommended to not use an ioscheduler when using the >> IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler, >> IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT. >> >> The reason for this is twofold: >> 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a >> duration limit descriptor (DLD) inside the device. Users can configure >> these limits themselves using passthrough commands, so from a block layer >> perspective, Linux has no idea of how each DLD is actually configured. >> >> By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds >> its duration limit (because it was held too long in the scheduler) is >> decreased. It is still possible to use the IOPRIO_CLASS_DL priority class >> for "low priority" IOs by configuring a large limit in the respective DLD. >> >> 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands >> (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority >> commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to >> IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these >> two incompatible priority classes. > > Implementing duration limit support using the I/O priority mechanism > makes it impossible to configure the I/O priority for commands that have > a duration limit. Shouldn't the duration limit be independent of the I/O > priority? Am I perhaps missing something? I/O priority at the device level does not exist with SAS and with SATA, the ACS specifications mandates that NCQ I/O priority and CDL cannot be used mixed together. So from the device point of view, I/O priority and CDL are mutually exclusive. No issues. Now, if you are talking about the host level I/O priority scheduling done by mq-deadline and bfq, the CDL priority class maps to the RT class. They are the same, as they should. There is nothing more real-time than CDL in my opinion :) Furthermore, if we do not reuse the I/O priority interface, we will have to add another field to BIOs & requests to propagate the cdl index from user space down to the device LLD, almost exactly in the manner of I/O priorities, including all the controls with merging etc. That would be a lot of overhead to achieve the possibility of prioritized CDL commands. CDL in of itself allows the user to define "prioritized" commands by defining CDLs on the drive that are sorted in increasing time limit order, i.e. with low CDL index numbers having low limits, and higher priority within the class (as CDL index == prio level). With that, schedulers can still do the right thing as they do now, with the additional benefit that they can even be improved to base their scheduling decisions on a known time limit for the command execution. But such optimization is not implemented by this series. > > Thanks, > > Bart. >
On 1/25/23 05:36, Bart Van Assche wrote: > On 1/24/23 11:27, Bart Van Assche wrote: >> Implementing duration limit support using the I/O priority mechanism >> makes it impossible to configure the I/O priority for commands that have >> a duration limit. Shouldn't the duration limit be independent of the I/O >> priority? Am I perhaps missing something? > > (replying to my own e-mail) > > In SAM-6 I found the following: "The device server may use the duration > expiration time to determine the order of processing commands with > the SIMPLE task attribute within the task set. A difference in duration > expiration time between commands may override other scheduling > considerations (e.g., different times to access different logical block > addresses or vendor specific scheduling considerations). Processing of a > collection of commands with different command duration limit settings > should cause a command with an earlier duration expiration time to > complete with status sooner than a command with a later duration > expiration time." > > Do I understand correctly that it is optional for a SCSI device to > interpret the command duration as a priority and that this is not mandatory? This describes the expected behavior of the drive in terms of command execution ordering when CDL is used. The text is a little "soft" and sounds as if this behavior is optional because CDL is a combination of time limits AND a policy for each time limit. The policy of a CDL indicates what the drive behavior should be if a command misses its time limit. In short, there are 2 policies: 1) best-effort: the time limit is a hint of sorts. If the drive fails to execute the command before the limit expires, the command is not aborted and execution continues. 2) fast-fail: if the drive fails to execute the command before the time limit expires, the command must be completed with an error immediately. CDL also has a field, settable by the user, that describes the performance degradation the drive is allowed to incur in order to meet the duration limits. That field matters most in the best-effort case, as it indicates how "serious" the user is about the CDL limit "hint". So, as you can see, I think the SAM-6 text is vague because of the many possible variations in scheduling policies that need to be implemented by a drive.
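To make the limit/policy terminology easier to follow in the rest of the thread, here is an illustrative C model of what one CDL descriptor conceptually carries. The names are invented for this sketch and do not match the SPC/ACS log page encoding; only the concepts (a set of time limits, each with a policy telling the drive what to do when the limit is missed, plus the performance trade-off field mentioned above) come from the discussion.

/* Illustrative model only, not the on-the-wire layout. A drive exposes a
 * small table of such descriptors (the thread later mentions 7 for reads
 * and 7 for writes); a command selects one by index. */
enum cdl_policy {
        CDL_POLICY_BEST_EFFORT, /* limit is a hint; keep executing on a miss */
        CDL_POLICY_FAST_FAIL,   /* complete the command with an error on a miss */
        CDL_POLICY_NEXT_LIMIT,  /* the 0x1 policy: defer to the next descriptor */
};

struct cdl_limit {
        unsigned int limit_us;  /* 0 means this limit is not used */
        enum cdl_policy policy; /* drive behavior when the limit is missed */
};

struct cdl_desc {
        struct cdl_limit active;        /* time actively servicing the command */
        struct cdl_limit inactive;      /* time queued inside the drive */
        struct cdl_limit guideline;     /* overall duration target */
        unsigned int perf_vs_duration;  /* allowed performance degradation,
                                           per the field described above */
};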
On 1/24/23 13:29, Damien Le Moal wrote: > I/O priority at the device level does not exist with SAS and with SATA, > the ACS specifications mandates that NCQ I/O priority and CDL cannot be > used mixed together. So from the device point of view, I/O priority and > CDL are mutually exclusive. No issues. > > Now, if you are talking about the host level I/O priority scheduling done > by mq-deadline and bfq, the CDL priority class maps to the RT class. They > are the same, as they should. There is nothing more real-time than CDL in > my opinion :) > > Furthermore, if we do not reuse the I/O priority interface, we will have > to add another field to BIOs & requests to propagate the cdl index from > user space down to the device LLD, almost exactly in the manner of I/O > priorities, including all the controls with merging etc. That would be a > lot of overhead to achieve the possibility of prioritized CDL commands. > > CDL in of itself allows the user to define "prioritized" commands by > defining CDLs on the drive that are sorted in increasing time limit order, > i.e. with low CDL index numbers having low limits, and higher priority > within the class (as CDL index == prio level). With that, schedulers can > still do the right thing as they do now, with the additional benefit that > they can even be improved to base their scheduling decisions on a known > time limit for the command execution. But such optimization is not > implemented by this series. Hi Damien, What if a device that supports I/O priorities (e.g. an NVMe device that supports configuring the SQ priority) and a device that supports command duration limits (e.g. a SATA hard disk) are combined via the device mapper into a single block device? Should I/O be submitted to the dm device with one of the existing I/O priority classes (not supported by SATA hard disks) or with I/O priority class IOPRIO_CLASS_DL (not supported by NVMe devices)? Shouldn't the ATA core translate the existing I/O priority levels into a command duration limit instead of introducing a new I/O priority class that is only supported by ATA devices? Thanks, Bart.
On 1/25/23 07:43, Bart Van Assche wrote: > On 1/24/23 13:29, Damien Le Moal wrote: >> I/O priority at the device level does not exist with SAS and with SATA, >> the ACS specifications mandates that NCQ I/O priority and CDL cannot be >> used mixed together. So from the device point of view, I/O priority and >> CDL are mutually exclusive. No issues. >> >> Now, if you are talking about the host level I/O priority scheduling done >> by mq-deadline and bfq, the CDL priority class maps to the RT class. They >> are the same, as they should. There is nothing more real-time than CDL in >> my opinion :) >> >> Furthermore, if we do not reuse the I/O priority interface, we will have >> to add another field to BIOs & requests to propagate the cdl index from >> user space down to the device LLD, almost exactly in the manner of I/O >> priorities, including all the controls with merging etc. That would be a >> lot of overhead to achieve the possibility of prioritized CDL commands. >> >> CDL in of itself allows the user to define "prioritized" commands by >> defining CDLs on the drive that are sorted in increasing time limit order, >> i.e. with low CDL index numbers having low limits, and higher priority >> within the class (as CDL index == prio level). With that, schedulers can >> still do the right thing as they do now, with the additional benefit that >> they can even be improved to base their scheduling decisions on a known >> time limit for the command execution. But such optimization is not >> implemented by this series. > > Hi Damien, > > What if a device that supports I/O priorities (e.g. an NVMe device that > supports configuring the SQ priority) and a device that supports command > duration limits (e.g. a SATA hard disk) are combined via the device > mapper into a single block device? Should I/O be submitted to the dm > device with one of the existing I/O priority classes (not supported by > SATA hard disks) or with I/O priority class IOPRIO_CLASS_DL (not > supported by NVMe devices)? That is not a use case we considered. My gut feeling is that this is something the target driver should handle when processing a user IO. Note that I was not aware that Linux NVMe driver supported queue priorities... > Shouldn't the ATA core translate the existing I/O priority levels into a > command duration limit instead of introducing a new I/O priority class > that is only supported by ATA devices? There is only one priority class that ATA understands: RT (the level is irrelevant and ignored). All RT class IOs are mapped to high priority NCQ commands. All other classes map to normal priority (no priority bit set) commands. And sure, we could map the level of RT class IOs to a CDL index, as we do for the CDL class, but what would be the point ? The user should use the CDL class in that case. Furthermore, there is one additional thing that we do not yet support but will later: CDL descriptor 0 can be used to set a target time limit for high priority NCQ commands. Without this new feature introduced with CDL, the drive is free to schedule high priority NCQ commands as it wants, and that is hard coded in FW. So you can endup with very aggressive scheduling leading to significant overall IOPS drop and long tail latency for low priority commands. 
See pages 11 and 20 of this presentation for an example: https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf For a drive that supports both CDL and NCQ priority, with the CDL feature turned off, CDL descriptor 0 defines the time limit hint for high priority NCQ commands. Again, CDL and NCQ high priority are mutually exclusive. So for clarity, I really would prefer separating the CDL and RT classes as we did. We could integrate CDL support by reusing the RT class + level for the CDL index, but I think this may be very confusing for users, especially considering that the CDLs on a drive can be defined in any order the user wants, resulting in indexes/levels that do not have any particular order, making it impossible for the host to correctly schedule commands.
On 1/24/23 14:59, Damien Le Moal wrote: > There is only one priority class that ATA understands: RT (the level is > irrelevant and ignored). All RT class IOs are mapped to high priority NCQ > commands. All other classes map to normal priority (no priority bit set) > commands. > > And sure, we could map the level of RT class IOs to a CDL index, as we do > for the CDL class, but what would be the point ? The user should use the > CDL class in that case. > > Furthermore, there is one additional thing that we do not yet support but > will later: CDL descriptor 0 can be used to set a target time limit for > high priority NCQ commands. Without this new feature introduced with CDL, > the drive is free to schedule high priority NCQ commands as it wants, and > that is hard coded in FW. So you can endup with very aggressive scheduling > leading to significant overall IOPS drop and long tail latency for low > priority commands. See page 11 and 20 of this presentation for an example: > > https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf > > For a drive that supports both CDL and NCQ priority, with CDL feature > turned off, CDL descriptor 0 defines the time limit hint for high priority > NCQ commands. Again, CDL and NCQ high priority are mutually exclusive. > > So for clarity, I really would prefer separating CDL and RT classes as we > did. We could integrate CDL support reusing the RT class + level for CDL > index, but I think this may be very confusing for users, especially > considering that the CLDs on a drive can be defined in any order the user > wants, resulting in indexes/levels that does do not have any particular > order, making it impossible for the host to correctly schedule commands. Hi Damien, Thanks again for the detailed reply. Your replies are very informative and help me understand the context better. However, I'm still less than enthusiast about the introduction of the I/O priority class IOPRIO_CLASS_DL. To me command duration limits (CDL) is a mechanism that is supported by one storage standard (SCSI) and of which it is not sure that it will be integrated in other storage standards (NVMe, ...). Isn't the purpose of the block layer to provide an interface that is independent of the specifics of a single storage standard? This is why I'm in favor of letting the ATA core translate one of the existing I/O priority classes into a CDL instead of introducing a new I/O priority class (IOPRIO_CLASS_DL) in the block layer. Others may have a different opinion. Thanks, Bart.
On 1/25/23 09:05, Bart Van Assche wrote: > On 1/24/23 14:59, Damien Le Moal wrote: >> There is only one priority class that ATA understands: RT (the level is >> irrelevant and ignored). All RT class IOs are mapped to high priority NCQ >> commands. All other classes map to normal priority (no priority bit set) >> commands. >> >> And sure, we could map the level of RT class IOs to a CDL index, as we do >> for the CDL class, but what would be the point ? The user should use the >> CDL class in that case. >> >> Furthermore, there is one additional thing that we do not yet support but >> will later: CDL descriptor 0 can be used to set a target time limit for >> high priority NCQ commands. Without this new feature introduced with CDL, >> the drive is free to schedule high priority NCQ commands as it wants, and >> that is hard coded in FW. So you can endup with very aggressive scheduling >> leading to significant overall IOPS drop and long tail latency for low >> priority commands. See page 11 and 20 of this presentation for an example: >> >> https://www.snia.org/sites/default/files/SDC/2021/pdfs/SNIA-SDC21-LeMoal-Be-On-Time-command-duration-limits-Feature-Support-in%20Linux.pdf >> >> For a drive that supports both CDL and NCQ priority, with CDL feature >> turned off, CDL descriptor 0 defines the time limit hint for high priority >> NCQ commands. Again, CDL and NCQ high priority are mutually exclusive. >> >> So for clarity, I really would prefer separating CDL and RT classes as we >> did. We could integrate CDL support reusing the RT class + level for CDL >> index, but I think this may be very confusing for users, especially >> considering that the CLDs on a drive can be defined in any order the user >> wants, resulting in indexes/levels that does do not have any particular >> order, making it impossible for the host to correctly schedule commands. > > Hi Damien, > > Thanks again for the detailed reply. Your replies are very informative > and help me understand the context better. > > However, I'm still less than enthusiast about the introduction of the > I/O priority class IOPRIO_CLASS_DL. To me command duration limits (CDL) > is a mechanism that is supported by one storage standard (SCSI) and of And ATA (ACS) too. Not just SCSI. This is actually an improvement over IO priority (command priority) that is supported only by ATA NCQ and does not exist with SCSI/SBC. > which it is not sure that it will be integrated in other storage > standards (NVMe, ...). Isn't the purpose of the block layer to provide > an interface that is independent of the specifics of a single storage > standard? This is why I'm in favor of letting the ATA core translate one > of the existing I/O priority classes into a CDL instead of introducing a > new I/O priority class (IOPRIO_CLASS_DL) in the block layer. We discussed CDL with Hannes in the context of NVMe over fabrics. Their may be interesting extensions to consider for NVMe in that context (the value for local PCI attached NVMe drive is more limited at best). I would argue that IO priority is the same: that is not supported by all device classes either, and for those that support it, the semantic is not identical (ATA vs NVMe). Yet, we have the RT class that maps to high priority for ATA, and nothing else as far as I know. CDL at least covers SCSI *and* ATA, and as mentioned above, could be used by NVMe-of host drivers to do fancy link selection for a multipath setup based on the link speed for instance. 
We could overload the RT class with a mapping to CDL feature on scsi and ata, but I think this is more confusing/messy than a separate class as we implemented. > > Others may have a different opinion. > > Thanks, > > Bart.
On Tue, Jan 24, 2023 at 02:43:24PM -0800, Bart Van Assche wrote: > What if a device that supports I/O priorities (e.g. an NVMe device that > supports configuring the SQ priority) NVMe does not have any I/O priorities, it only has a very limited priority scheme for queue arbitration.
On 1/24/23 17:19, Damien Le Moal wrote: > On 1/25/23 09:05, Bart Van Assche wrote: >> Thanks again for the detailed reply. Your replies are very informative >> and help me understand the context better. >> >> However, I'm still less than enthusiast about the introduction of the >> I/O priority class IOPRIO_CLASS_DL. To me command duration limits (CDL) >> is a mechanism that is supported by one storage standard (SCSI) and of > > And ATA (ACS) too. Not just SCSI. This is actually an improvement over IO > priority (command priority) that is supported only by ATA NCQ and does not > exist with SCSI/SBC. > >> which it is not sure that it will be integrated in other storage >> standards (NVMe, ...). Isn't the purpose of the block layer to provide >> an interface that is independent of the specifics of a single storage >> standard? This is why I'm in favor of letting the ATA core translate one >> of the existing I/O priority classes into a CDL instead of introducing a >> new I/O priority class (IOPRIO_CLASS_DL) in the block layer. > > We discussed CDL with Hannes in the context of NVMe over fabrics. Their > may be interesting extensions to consider for NVMe in that context (the > value for local PCI attached NVMe drive is more limited at best). > > I would argue that IO priority is the same: that is not supported by all > device classes either, and for those that support it, the semantic is not > identical (ATA vs NVMe). Yet, we have the RT class that maps to high > priority for ATA, and nothing else as far as I know. > > CDL at least covers SCSI *and* ATA, and as mentioned above, could be used > by NVMe-of host drivers to do fancy link selection for a multipath setup > based on the link speed for instance. > > We could overload the RT class with a mapping to CDL feature on scsi and > ata, but I think this is more confusing/messy than a separate class as we > implemented. Hi Damien, The more I think about this, the more I'm convinced that it would be wrong to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that support CDL and drives that do not support CDL. It seems wrong to me to make user space software responsible for figuring out whether or not the drive supports CDL before it can be decided which I/O priority class should be used. This is something the kernel should do instead of user space software. If we would ask Android storage vendors to implement CDL then IOPRIO_CLASS_DL would cause trouble too. Android has support since considerable time to give the foreground application a higher I/O priority than background applications. The cgroup settings for foreground and background applications come from the task_profiles.json file (see also https://android.googlesource.com/platform/system/core/+/master/libprocessgroup/profiles/task_profiles.json). As one can see all the settings in that file are independent of the features of the storage device. Introducing IOPRIO_CLASS_DL in the kernel and using it in task_profiles.json would imply that the storage device type has to be determined before it can be decided whether or not IOPRIO_CLASS_DL can be used. This seems wrong to me. I downloaded the patch series in its entirety and applied it on a local kernel branch. I verified which changes would be needed to replace IOPRIO_CLASS_DL with IOPRIO_CLASS_RT. Can you help me with verifying the patch below? 
Regarding the BFQ changes in the patch below, is an I/O scheduler useful at all if CDL is used since a comment in the BFQ driver says that the disk should do the scheduling instead of BFQ if CDL is used?

Thanks,

Bart.

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 7add9346c585..815b884d6c5a 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5545,14 +5545,6 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 		bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
 		bfqq->new_ioprio = 7;
 		break;
-	case IOPRIO_CLASS_DL:
-		/*
-		 * For the duration-limits class, we want the disk to do the
-		 * scheduling. So map all levels to the highest RT level.
-		 */
-		bfqq->new_ioprio = 0;
-		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
-		break;
 	}
 
 	if (bfqq->new_ioprio >= IOPRIO_NR_LEVELS) {
@@ -5681,8 +5673,6 @@ static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
 		return &bfqg->async_bfqq[1][ioprio][act_idx];
 	case IOPRIO_CLASS_IDLE:
 		return &bfqg->async_idle_bfqq[act_idx];
-	case IOPRIO_CLASS_DL:
-		return &bfqg->async_bfqq[0][0][act_idx];
 	default:
 		return NULL;
 	}
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
index dfb5c3f447f4..8bb6b8eba4ce 100644
--- a/block/blk-ioprio.c
+++ b/block/blk-ioprio.c
@@ -27,7 +27,6 @@
  * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
  *		IOPRIO_CLASS_BE.
  * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
- * @POLICY_ALL_TO_DL: change the I/O priority class into IOPRIO_CLASS_DL.
  *
  * See also <linux/ioprio.h>.
  */
@@ -36,7 +35,6 @@ enum prio_policy {
 	POLICY_NONE_TO_RT = 1,
 	POLICY_RESTRICT_TO_BE = 2,
 	POLICY_ALL_TO_IDLE = 3,
-	POLICY_ALL_TO_DL = 4,
 };
 
 static const char *policy_name[] = {
@@ -44,7 +42,6 @@ static const char *policy_name[] = {
 	[POLICY_NONE_TO_RT] = "none-to-rt",
 	[POLICY_RESTRICT_TO_BE] = "restrict-to-be",
 	[POLICY_ALL_TO_IDLE] = "idle",
-	[POLICY_ALL_TO_DL] = "duration-limits",
 };
 
 static struct blkcg_policy ioprio_policy;
diff --git a/block/ioprio.c b/block/ioprio.c
index 1b3a9da82597..32a456b45804 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -37,7 +37,6 @@ int ioprio_check_cap(int ioprio)
 
 	switch (class) {
 	case IOPRIO_CLASS_RT:
-	case IOPRIO_CLASS_DL:
 		/*
 		 * Originally this only checked for CAP_SYS_ADMIN,
 		 * which was implicitly allowed for pid 0 by security
@@ -48,7 +47,7 @@ int ioprio_check_cap(int ioprio)
 		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
 			return -EPERM;
 		fallthrough;
-		/* RT and DL have prio field too */
+		/* rt has prio field too */
 	case IOPRIO_CLASS_BE:
 		if (data >= IOPRIO_NR_LEVELS || data < 0)
 			return -EINVAL;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index 526d0ea4dbf9..f10c2a0d18d4 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -113,7 +113,6 @@ static const enum dd_prio ioprio_class_to_prio[] = {
 	[IOPRIO_CLASS_RT] = DD_RT_PRIO,
 	[IOPRIO_CLASS_BE] = DD_BE_PRIO,
 	[IOPRIO_CLASS_IDLE] = DD_IDLE_PRIO,
-	[IOPRIO_CLASS_DL] = DD_RT_PRIO,
 };
 
 static inline struct rb_root *
diff --git a/drivers/ata/libata-core.c b/drivers/ata/libata-core.c
index b4761c3c4b91..3065b632e6ae 100644
--- a/drivers/ata/libata-core.c
+++ b/drivers/ata/libata-core.c
@@ -673,7 +673,7 @@ static inline void ata_set_tf_cdl(struct ata_queued_cmd *qc, int ioprio)
 	struct ata_taskfile *tf = &qc->tf;
 	int cdl;
 
-	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
+	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_RT)
 		return;
 
 	cdl = IOPRIO_PRIO_DATA(ioprio) & 0x07;
diff --git a/drivers/scsi/sd_cdl.c b/drivers/scsi/sd_cdl.c
index 59d02dbb5ea1..c5286f5ddae4 100644
--- a/drivers/scsi/sd_cdl.c
+++ b/drivers/scsi/sd_cdl.c
@@ -880,10 +880,10 @@ int sd_cdl_dld(struct scsi_disk *sdkp, struct scsi_cmnd *scmd)
 	unsigned int dld;
 
 	/*
-	 * Use "no limit" if the request ioprio class is not IOPRIO_CLASS_DL
+	 * Use "no limit" if the request ioprio class is not IOPRIO_CLASS_RT
 	 * or if the user specified an invalid CDL descriptor index.
 	 */
-	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_DL)
+	if (IOPRIO_PRIO_CLASS(ioprio) != IOPRIO_CLASS_RT)
 		return 0;
 
 	dld = IOPRIO_PRIO_DATA(ioprio);
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 2f3fc2fbd668..7578d4f6a969 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -20,7 +20,7 @@ static inline bool ioprio_valid(unsigned short ioprio)
 {
 	unsigned short class = IOPRIO_PRIO_CLASS(ioprio);
 
-	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_DL;
+	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_IDLE;
 }
 
 /*
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index 15908b9e9d8c..f70f2596a6bf 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -29,7 +29,6 @@ enum {
 	IOPRIO_CLASS_RT,
 	IOPRIO_CLASS_BE,
 	IOPRIO_CLASS_IDLE,
-	IOPRIO_CLASS_DL,
 };
 
 /*
@@ -38,12 +37,6 @@ enum {
 #define IOPRIO_NR_LEVELS 8
 #define IOPRIO_BE_NR IOPRIO_NR_LEVELS
 
-/*
- * The Duration limits class allows 8 levels: level 0 for "no limit" and levels
- * 1 to 7, each corresponding to a read or write limit descriptor.
- */
-#define IOPRIO_DL_NR_LEVELS 8
-
 enum {
 	IOPRIO_WHO_PROCESS = 1,
 	IOPRIO_WHO_PGRP,
On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote: (snip) > Hi Damien, > > The more I think about this, the more I'm convinced that it would be wrong > to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that > support CDL and drives that do not support CDL. It seems wrong to me to > make user space software responsible for figuring out whether or not the > drive supports CDL before it can be decided which I/O priority class should > be used. This is something the kernel should do instead of user space > software. Well, if we take e.g. NCQ priority as an example, as that is probably the only device side I/O priority feature currently supported by the kernel. If you want to use of NCQ priority, you need to first enable /sys/block/sdX/device/ncq_prio_enable and then submit I/O using IOPRIO_CLASS_RT, so I would argue the user already needs to know that a device supports device side I/O priority, if he wants to make use of it. For CDL there are 7 different limits for reads and 7 different limits for writes, these limits can be configured by the user. So the users that want to get most performance out of their drive will most likely analyze their workloads, and set the limits depending on how their workload actually looks like. Bottom line is that heavy users of CDL will absolutely know how the CDL limits are configured in user space, as they will pick the correct CDL index (prio level) for the descriptor that they want to use for the specific I/O that they are doing. An ioscheduler will most likely be disabled. (For CDL, the limit is from the time the command is submitted to the device, so from the device's PoV, it does not really matter if a command is queued for a long time in a scheduler or not, but from an application PoV, it does not make sense to hold back a command for long if it e.g. has a short limit.) If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be to have something like: $ cat /sys/block/sdX/device/rt_prio_backend [none] ncq-prio cdl Devices that does not support ncq-prio or cdl, e.g. currently NVMe, would just have none (i.e. RT simply means higher host side priority (if a scheduler is used)). SCSI would then have none and cdl (for SCSI devices supporting CDL.) ATA would have none, ncq-prio and cdl. (for ATA devices supporting CDL.) That would theoretically avoid another ioprio class, but like I've just explained, a user space application making use of CDL would for sure know how the descriptors look like anyway, so I'm not sure if there is an actual benefit of doing it this way over simply having a IOPRIO_CLASS_DL. I guess the only benefit would be that we would avoid introducing another I/O priority class (at the expense of additional complexity elsewhere). Kind regards, Niklas
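For comparison, the existing NCQ priority flow Niklas describes boils down to two steps; a rough C sketch is below. The sdX path is a placeholder, ncq_prio_enable and IOPRIO_CLASS_RT are existing interfaces, and with the series applied the CDL equivalent would be the duration limits enable attribute plus IOPRIO_CLASS_DL with a non-zero level.

/* Rough sketch of today's NCQ priority usage. */
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/ioprio.h>      /* IOPRIO_PRIO_VALUE(), IOPRIO_CLASS_RT, IOPRIO_WHO_PROCESS */

int main(void)
{
        /* Step 1: enable device-side priority (placeholder device name). */
        FILE *f = fopen("/sys/block/sdX/device/ncq_prio_enable", "w");

        if (!f)
                return 1;
        fputs("1\n", f);
        fclose(f);

        /* Step 2: put this task's I/O in the RT class; libata then sets the
         * NCQ high priority bit on the resulting commands. Needs CAP_SYS_NICE. */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0)) < 0) {
                perror("ioprio_set");
                return 1;
        }

        /* ... I/O issued from here on is submitted as high priority ... */
        return 0;
}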
On Wed, Jan 25, 2023 at 10:19:45AM +0900, Damien Le Moal wrote: > On 1/25/23 09:05, Bart Van Assche wrote: > > > which it is not sure that it will be integrated in other storage > > standards (NVMe, ...). Isn't the purpose of the block layer to provide > > an interface that is independent of the specifics of a single storage > > standard? This is why I'm in favor of letting the ATA core translate one > > of the existing I/O priority classes into a CDL instead of introducing a > > new I/O priority class (IOPRIO_CLASS_DL) in the block layer. > > We discussed CDL with Hannes in the context of NVMe over fabrics. Their > may be interesting extensions to consider for NVMe in that context (the > value for local PCI attached NVMe drive is more limited at best). I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe does allow rotating media, and they'll want feature parity if CDL is considered useful in other protocols.
On 2023/01/26 8:11, Keith Busch wrote: > On Wed, Jan 25, 2023 at 10:19:45AM +0900, Damien Le Moal wrote: >> On 1/25/23 09:05, Bart Van Assche wrote: >> >>> which it is not sure that it will be integrated in other storage >>> standards (NVMe, ...). Isn't the purpose of the block layer to provide >>> an interface that is independent of the specifics of a single storage >>> standard? This is why I'm in favor of letting the ATA core translate one >>> of the existing I/O priority classes into a CDL instead of introducing a >>> new I/O priority class (IOPRIO_CLASS_DL) in the block layer. >> >> We discussed CDL with Hannes in the context of NVMe over fabrics. Their >> may be interesting extensions to consider for NVMe in that context (the >> value for local PCI attached NVMe drive is more limited at best). > > I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe > does allow rotating media, and they'll want feature parity if CDL is considered > useful in other protocols. True. If NVMe HDDs come to market, we'll definitely want a CDL feature too.
On 2023/01/26 6:23, Niklas Cassel wrote: > On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote: > > (snip) > >> Hi Damien, >> >> The more I think about this, the more I'm convinced that it would be wrong >> to introduce IOPRIO_CLASS_DL. Datacenters will have a mix of drives that >> support CDL and drives that do not support CDL. It seems wrong to me to >> make user space software responsible for figuring out whether or not the >> drive supports CDL before it can be decided which I/O priority class should >> be used. This is something the kernel should do instead of user space >> software. > > Well, if we take e.g. NCQ priority as an example, as that is probably > the only device side I/O priority feature currently supported by the > kernel. > > If you want to use of NCQ priority, you need to first enable > /sys/block/sdX/device/ncq_prio_enable > and then submit I/O using IOPRIO_CLASS_RT, so I would argue the user > already needs to know that a device supports device side I/O priority, > if he wants to make use of it. Yes, absolutely. In addition to this, NCQ high priority feature is optional. The host-level RT class scheduling works the same way regardless of a SATA drive supporting NCQ high priority or not. If ncq_prio_enable is not enabled (or not supported), the scheduler still works as before. If ncq_prio_enable is set for a drive that supports NCQ high prio, then the user gets the additional benefit of *also* having the drive prioritize the commands from high-priority user IOs. > For CDL there are 7 different limits for reads and 7 different > limits for writes, these limits can be configured by the user. > So the users that want to get most performance out of their drive > will most likely analyze their workloads, and set the limits depending > on how their workload actually looks like. > > Bottom line is that heavy users of CDL will absolutely know how the CDL > limits are configured in user space, as they will pick the correct CDL > index (prio level) for the descriptor that they want to use for the > specific I/O that they are doing. An ioscheduler will most likely be > disabled. Yes. And for cases where we still need an IO scheduler (e.g. SMR with mq-deadline), we really cannot use the priority level (CDL index) as a meaningful information to make request scheduling decisions because I think it is simply impossible to reliably define a "priority" order for the 7 read and write descriptors. We cannot map a set of 14 descriptors with a very large possible number of variations to sorted array of priority-like levels. > (For CDL, the limit is from the time the command is submitted to the device, > so from the device's PoV, it does not really matter if a command is queued > for a long time in a scheduler or not, but from an application PoV, it does > not make sense to hold back a command for long if it e.g. has a short limit.) > > > If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be > to have something like: > > $ cat /sys/block/sdX/device/rt_prio_backend > [none] ncq-prio cdl No need for this. We can keep the existing ncq_prio_enable and the proposed duration_limits/enable sysfs attributes. The user cannot enable both at the same time with our patches. So if the user enables ncq_prio_enable, then it will get high priority NCQ commands mapping for any level of the RT class. If duration_limits/enable is set, then the user will get CDL scheduling of commands on the drive. 
But again, the difficulty with this overloading is that we *cannot* implement solid level-based scheduling in IO schedulers, because ordering the CDLs in a meaningful way is impossible. So BFQ handling of the RT class would likely not result in the most ideal scheduling (that would depend heavily on how the CDL descriptors are defined on the drive). Hence my reluctance to overload the RT class for CDL. > Devices that does not support ncq-prio or cdl, > e.g. currently NVMe, would just have none > (i.e. RT simply means higher host side priority (if a scheduler is used)). Yes. Exactly. > SCSI would then have none and cdl > (for SCSI devices supporting CDL.) > > ATA would have none, ncq-prio and cdl. > (for ATA devices supporting CDL.) > > That would theoretically avoid another ioprio class, but like I've just > explained, a user space application making use of CDL would for sure know > how the descriptors look like anyway, so I'm not sure if there is an actual > benefit of doing it this way over simply having a IOPRIO_CLASS_DL. Agreed. And as explained above, I think that reusing the RT class creates more problems than it solves; the simplification is only apparent. > I guess the only benefit would be that we would avoid introducing another > I/O priority class (at the expense of additional complexity elsewhere). Yes. And I think that the added complexity to correctly handle the overloaded RT class is too much. The RT class has been around for a long time for host-level IO priority scheduling. Let's not break it in weird ways. We certainly can work on improving the handling of IOPRIO_CLASS_DL in IO schedulers. But in my opinion, that can be done later, after this initial series introducing CDL support is applied.
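The mutual exclusion described above ("the user cannot enable both at the same time") is conceptually a one-line check in each enable path. A hypothetical sketch follows; the struct and function names are invented and this is not the series' actual code, which lives behind the ncq_prio_enable and duration_limits sysfs attributes.

#include <errno.h>
#include <stdbool.h>

/* Invented stand-in for the per-device state. */
struct example_dev {
        bool ncq_prio_enabled;
        bool cdl_enabled;
};

static int example_set_cdl(struct example_dev *dev, bool enable)
{
        if (enable && dev->ncq_prio_enabled)
                return -EINVAL;         /* NCQ priority and CDL cannot be mixed */
        dev->cdl_enabled = enable;
        return 0;
}

static int example_set_ncq_prio(struct example_dev *dev, bool enable)
{
        if (enable && dev->cdl_enabled)
                return -EINVAL;         /* same rule in the other direction */
        dev->ncq_prio_enabled = enable;
        return 0;
}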
On Wed, Jan 25, 2023 at 04:11:41PM -0700, Keith Busch wrote: > I wouldn't necessarily rule out CDL for PCI attached in some future TP. NVMe > does allow rotating media, and they'll want feature parity if CDL is considered > useful in other protocols. NVMe has a TP for CDL that is technically active, although it doesn't seem to be actively worked on right now.
On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote: > On 2023/01/26 6:23, Niklas Cassel wrote: > > On Wed, Jan 25, 2023 at 10:37:52AM -0800, Bart Van Assche wrote: (snip) > > If we were to reuse IOPRIO_CLASS_RT, then I guess the best option would be > > to have something like: > > > > $ cat /sys/block/sdX/device/rt_prio_backend > > [none] ncq-prio cdl > > No need for this. We can keep the existing ncq_prio_enable and the proposed > duration_limits/enable sysfs attributes. The user cannot enable both at the same > time with our patches. So if the user enables ncq_prio_enable, then it will get > high priority NCQ commands mapping for any level of the RT class. If > duration_limits/enable is set, then the user will get CDL scheduling of commands > on the drive. > > But again, the difficulty with this overloading is that we *cannot* implement a > solid level-based scheduling in IO schedulers because ordering the CDLs in a > meaningful way is impossible. So BFQ handling of the RT class would likely not > result in the most ideal scheduling (that would depend heavily on how the CDL > descriptors are defined on the drive). Hence my reluctance to overload the RT > class for CDL. Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to disable the IO scheduler, so that lower classdata levels wouldn't be prioritized over higher classdata levels, or simply use an IO scheduler that does not care about the classdata level, e.g. mq-deadline.
On 1/26/23 05:53, Niklas Cassel wrote: > On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote: >> But again, the difficulty with this overloading is that we *cannot* implement a >> solid level-based scheduling in IO schedulers because ordering the CDLs in a >> meaningful way is impossible. So BFQ handling of the RT class would likely not >> result in the most ideal scheduling (that would depend heavily on how the CDL >> descriptors are defined on the drive). Hence my reluctance to overload the RT >> class for CDL. > > Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to > disable the IO scheduler, so that lower classdata levels wouldn't be prioritized > over higher classdata levels, or simply use an IO scheduler that does not care > about the classdata level, e.g. mq-deadline. How about making the information about whether or not CDL has been enabled available to the scheduler such that the scheduler can include that information in its decisions? > However, for CDL, things are not as simple as setting a single bit in the > command, because of all the different descriptors, so we must let the classdata > represent the device side priority level, and not the host side priority level > (as we cannot have both, and I agree with you, it is very hard define an order > between the descriptors.. e.g. should a 20 ms policy 0xf descriptor be ranked > higher or lower than a 20 ms policy 0xd descriptor?). How about only supporting a subset of the standard such that it becomes easy to map CDLs to host side priority levels? If users really need the ability to use all standardized CDL features and if there is no easy way to map CDL levels to an I/O priority, is the I/O priority mechanism really the best basis for a user space interface for CDLs? Thanks, Bart.
On 1/27/23 02:33, Bart Van Assche wrote: > On 1/26/23 05:53, Niklas Cassel wrote: >> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote: >>> But again, the difficulty with this overloading is that we *cannot* implement a >>> solid level-based scheduling in IO schedulers because ordering the CDLs in a >>> meaningful way is impossible. So BFQ handling of the RT class would likely not >>> result in the most ideal scheduling (that would depend heavily on how the CDL >>> descriptors are defined on the drive). Hence my reluctance to overload the RT >>> class for CDL. >> >> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to >> disable the IO scheduler, so that lower classdata levels wouldn't be prioritized >> over higher classdata levels, or simply use an IO scheduler that does not care >> about the classdata level, e.g. mq-deadline. > > How about making the information about whether or not CDL has been > enabled available to the scheduler such that the scheduler can include > that information in its decisions? Sure, that is easy to do. But as I mentioned before, I think that is something we can do after this initial support series. >> However, for CDL, things are not as simple as setting a single bit in the >> command, because of all the different descriptors, so we must let the classdata >> represent the device side priority level, and not the host side priority level >> (as we cannot have both, and I agree with you, it is very hard define an order >> between the descriptors.. e.g. should a 20 ms policy 0xf descriptor be ranked >> higher or lower than a 20 ms policy 0xd descriptor?). > > How about only supporting a subset of the standard such that it becomes > easy to map CDLs to host side priority levels? I am opposed to this, for several reasons: 1) We are seeing different use cases from users that cover a wide range of use of CDL descriptors with various definitions. 2) Passthrough commands can be used by a user to change a drive CDL descriptors without the kernel knowing about it, unless we spend our time revalidating the CDL descriptor log page(s)... 3) CDL standard as is is actually very sensible and not overloaded with stuff that is only useful in niche use cases. For each CDL descriptor, you have: * The active time limit, which is a clean way to specify how much time you allow a drive to deal with bad sectors (mostly read case). A typical HDD will try very hard to recover data from a sector, always. As a result, the HDD may spend up to several seconds reading a sector again and again applying different signal processing techniques until it gets the sector ECC checked to return valid data. That of course can hugely increase an IO latency seen by the host. In applications such as erasure coded distributed object stores, maximum latency for an object access can thus be kept low using this limit without compromising the data since the object can always be rebuilt from the erasure codes if one HDD is slow to respond. This limit is also interesting for video streaming/playback to avoid video buffer underflow (at the expense of may be some block noise depending on the codec). * The inactive time limit can be used to tell the drive how long it is allowed to let a command stand in the drive internal queue before processing. This is thus a parameter that allows a host to tune the drive RPO optimization (rotational positioning optimization, e.g. HDD internal command scheduling based on angular sector position on tracks withe the head current position). 
This is a neat way to control max IOPS vs tail latency since drives tend to privilege maximizing IOPS over lowering max tail latency. * The duration guideline limit defines an overall time limit for a command without distinguishing between active and inactive time. It is the easiest to use (the easiest one to understand from a beginner user point of view). This is a neat way to define an intelligent IO prioritization in fact, way better than RT class scheduling on the host or the use of ATA NCQ high priority, as it provides more information to the drive about the urgency of a particular command. That allows the drive to still perform RPO to maximize IOPS without long tail latencies. Chaining such a limit with an active+inactive time limit descriptor using the "next limit" policy (0x1 policy) can also finely define what the drive should do if the guideline limit is exceeded (as the next descriptor can define what to do based on the reason for the limit being exceeded: long internal queueing vs bad sector long access time). > If users really need the ability to use all standardized CDL features > and if there is no easy way to map CDL levels to an I/O priority, is the > I/O priority mechanism really the best basis for a user space interface > for CDLs? As you can see above, yes, we need everything and should not attempt restricting CDL use. The IO priority interface is a perfect fit for CDL in the sense that all we need to pass along from user to device is one number: the CDL index to use for a command. So creating a different interface for this while the IO priority interface does exactly that sounds silly to me. One compromise we could make is: have the IO schedulers completely ignore the CDL prio class for now, that is, have them assume that no IO prio class/level was specified. Given that they are not tuned to handle CDL well anyway, this is probably the best thing to do for now. We still need to have the block layer prevent merging of requests with different CDL descriptors though, which is another reason to reuse the IO prio interface, as the block layer already does this. Less code, which is always a good thing. > > Thanks, > > Bart.
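On the merging point, the check involved is conceptually tiny. The sketch below is not the series' code; it only illustrates why carrying the CDL index in the existing ioprio value makes the rule cheap: the block layer already refuses to merge requests with different ioprio values, which automatically keeps different CDL descriptors apart (IOPRIO_CLASS_DL is again the proposed value 4).

#include <stdbool.h>
#include <linux/ioprio.h>      /* IOPRIO_PRIO_CLASS(), IOPRIO_PRIO_DATA() */

#ifndef IOPRIO_CLASS_DL
#define IOPRIO_CLASS_DL 4      /* value from the proposed uapi enum */
#endif

/* Conceptual sketch: two requests may only merge if they carry the same
 * priority class and, for the duration-limits class, the same level,
 * since the level selects the CDL descriptor. */
static bool example_cdl_merge_ok(unsigned short ioprio_a, unsigned short ioprio_b)
{
        if (IOPRIO_PRIO_CLASS(ioprio_a) != IOPRIO_PRIO_CLASS(ioprio_b))
                return false;
        if (IOPRIO_PRIO_CLASS(ioprio_a) == IOPRIO_CLASS_DL &&
            IOPRIO_PRIO_DATA(ioprio_a) != IOPRIO_PRIO_DATA(ioprio_b))
                return false;           /* different descriptors, different limits */
        return true;
}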
On 1/27/23 09:18, Damien Le Moal wrote: > On 1/27/23 02:33, Bart Van Assche wrote: >> On 1/26/23 05:53, Niklas Cassel wrote: >>> On Thu, Jan 26, 2023 at 09:24:12AM +0900, Damien Le Moal wrote: >>>> But again, the difficulty with this overloading is that we *cannot* implement a >>>> solid level-based scheduling in IO schedulers because ordering the CDLs in a >>>> meaningful way is impossible. So BFQ handling of the RT class would likely not >>>> result in the most ideal scheduling (that would depend heavily on how the CDL >>>> descriptors are defined on the drive). Hence my reluctance to overload the RT >>>> class for CDL. >>> >>> Well, if CDL were to reuse IOPRIO_CLASS_RT, then the user would either have to >>> disable the IO scheduler, so that lower classdata levels wouldn't be prioritized >>> over higher classdata levels, or simply use an IO scheduler that does not care >>> about the classdata level, e.g. mq-deadline. >> >> How about making the information about whether or not CDL has been >> enabled available to the scheduler such that the scheduler can include >> that information in its decisions? > > Sure, that is easy to do. But as I mentioned before, I think that is > something we can do after this initial support series. > >>> However, for CDL, things are not as simple as setting a single bit in the >>> command, because of all the different descriptors, so we must let the classdata >>> represent the device side priority level, and not the host side priority level >>> (as we cannot have both, and I agree with you, it is very hard define an order >>> between the descriptors.. e.g. should a 20 ms policy 0xf descriptor be ranked >>> higher or lower than a 20 ms policy 0xd descriptor?). >> >> How about only supporting a subset of the standard such that it becomes >> easy to map CDLs to host side priority levels? > > I am opposed to this, for several reasons: > > 1) We are seeing different use cases from users that cover a wide range of > use of CDL descriptors with various definitions. > > 2) Passthrough commands can be used by a user to change a drive CDL > descriptors without the kernel knowing about it, unless we spend our time > revalidating the CDL descriptor log page(s)... > > 3) CDL standard as is is actually very sensible and not overloaded with > stuff that is only useful in niche use cases. For each CDL descriptor, you > have: > * The active time limit, which is a clean way to specify how much time > you allow a drive to deal with bad sectors (mostly read case). A typical > HDD will try very hard to recover data from a sector, always. As a result, > the HDD may spend up to several seconds reading a sector again and again > applying different signal processing techniques until it gets the sector > ECC checked to return valid data. That of course can hugely increase an IO > latency seen by the host. In applications such as erasure coded > distributed object stores, maximum latency for an object access can thus > be kept low using this limit without compromising the data since the > object can always be rebuilt from the erasure codes if one HDD is slow to > respond. This limit is also interesting for video streaming/playback to > avoid video buffer underflow (at the expense of may be some block noise > depending on the codec). > * The inactive time limit can be used to tell the drive how long it is > allowed to let a command stand in the drive internal queue before > processing. 
This is thus a parameter that allows a host to tune the drive > RPO optimization (rotational positioning optimization, e.g. HDD internal > command scheduling based on angular sector position on tracks withe the > head current position). This is a neat way to control max IOPS vs tail > latency since drives tend to privilege maximizing IOPS over lowering max > tail latency. > * The duration guideline limit defines an overall time limit for a > command without distinguishing between active and inactive time. It is the > easiest to use (the easiest one to understand from a beginner user point > of view). This is a neat way to define an intelligent IO prioritization in > fact, way better than RT class scheduling on the host or the use of ATA > NCQ high priority, as it provides more information to the drive about the > urgency of a particular command. That allows the drive to still perform > RPO to maximize IOPS without long tail latencies. Chaining such limit with > an active+inactive time limit descriptor using the "next limit" policy > (0x1 policy) can also finely define what the drive should if the guideline > limit is exceeded (as the next descriptor can define what to do based on > the reason for the limit being exceeded: long internal queueing vs bad > sector long access time). Note that all 3 limits can be used in a single CDL descriptor to precisely define how a command should be processed by the device. That is why it is nearly impossible to come up with a meaningful ordering of CDL descriptors as an increasing set of priority levels. > >> If users really need the ability to use all standardized CDL features >> and if there is no easy way to map CDL levels to an I/O priority, is the >> I/O priority mechanism really the best basis for a user space interface >> for CDLs? > > As you can see above, yes, we need everything and should not attempt > restricting CDL use. The IO priority interface is a perfect fit for CDL in > the sense that all we need to pass along from user to device is one > number: the CDL index to use for a command. So creating a different > interface for this while the IO priority interface exactly does that > sounds silly to me. > > One compromise we could do is: have the IO schedulers completely ignore > CDL prio class for now, that is, have them assume that no IO prio > class/level was specified. Given that they are not tuned to handle CDL > well anyway, this is probably the best thing to do for now. > > We still need to have the block layer prevent merging of requests with > different CDL descriptors though, which is another reason to reuse the IO > prio interface as the block layer already does this. Less code, which is > always a good thing. > >> >> Thanks, >> >> Bart. >
On 1/24/23 20:02, Niklas Cassel wrote: > From: Damien Le Moal <damien.lemoal@opensource.wdc.com> > > Introduce the IOPRIO_CLASS_DL priority class to indicate that IOs should > be executed using duration-limits targets. The duration target to apply > to a command is indicated using the priority level. Up to 8 levels are > supported, with level 0 indiating "no limit". > > This priority class has effect only if the target device supports the > command duration limits feature and this feature is enabled by the user. > > While it is recommended to not use an ioscheduler when using the > IOPRIO_CLASS_DL priority class, if using the BFQ or mq-deadline scheduler, > IOPRIO_CLASS_DL is mapped to IOPRIO_CLASS_RT. > > The reason for this is twofold: > 1) Each priority level for the IOPRIO_CLASS_DL priority class represents a > duration limit descriptor (DLD) inside the device. Users can configure > these limits themselves using passthrough commands, so from a block layer > perspective, Linux has no idea of how each DLD is actually configured. > > By mapping a command to IOPRIO_CLASS_RT, the chance that a command exceeds > its duration limit (because it was held too long in the scheduler) is > decreased. It is still possible to use the IOPRIO_CLASS_DL priority class > for "low priority" IOs by configuring a large limit in the respective DLD. > > 2) On ATA drives, IOPRIO_CLASS_DL commands and NCQ priority commands > (IOPRIO_CLASS_RT) cannot be used together. A mix of CDL and high priority > commands cannot be sent to a device. By mapping IOPRIO_CLASS_DL to > IOPRIO_CLASS_RT, we ensure that a device will never receive a mix of these > two incompatible priority classes. > > Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> > Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> > --- > block/bfq-iosched.c | 10 ++++++++++ > block/blk-ioprio.c | 3 +++ > block/ioprio.c | 3 ++- > block/mq-deadline.c | 1 + > include/linux/ioprio.h | 2 +- > include/uapi/linux/ioprio.h | 7 +++++++ > 6 files changed, 24 insertions(+), 2 deletions(-) > Reviewed-by: Hannes Reinecke <hare@suse.de> Cheers, Hannes
On 1/26/23 17:40, Damien Le Moal wrote: > On 1/27/23 09:18, Damien Le Moal wrote: >> On 1/27/23 02:33, Bart Van Assche wrote: >>> How about only supporting a subset of the standard such that it becomes >>> easy to map CDLs to host side priority levels? >> >> I am opposed to this, for several reasons: >> >> 1) We are seeing different use cases from users that cover a wide range of >> use of CDL descriptors with various definitions. >> >> 2) Passthrough commands can be used by a user to change a drive's CDL >> descriptors without the kernel knowing about it, unless we spend our time >> revalidating the CDL descriptor log page(s)... >> 3) The CDL standard as is is actually very sensible and not overloaded with >> stuff that is only useful in niche use cases. For each CDL descriptor, you >> have: >> * The active time limit, which is a clean way to specify how much time >> you allow a drive to deal with bad sectors (mostly read case). A typical >> HDD will try very hard to recover data from a sector, always. As a result, >> the HDD may spend up to several seconds reading a sector again and again >> applying different signal processing techniques until it gets the sector >> ECC checked to return valid data. That of course can hugely increase an IO >> latency seen by the host. In applications such as erasure coded >> distributed object stores, maximum latency for an object access can thus >> be kept low using this limit without compromising the data since the >> object can always be rebuilt from the erasure codes if one HDD is slow to >> respond. This limit is also interesting for video streaming/playback to >> avoid video buffer underflow (at the expense of maybe some block noise >> depending on the codec). >> * The inactive time limit can be used to tell the drive how long it is >> allowed to let a command stand in the drive internal queue before >> processing. This is thus a parameter that allows a host to tune the drive >> RPO optimization (rotational positioning optimization, e.g. HDD internal >> command scheduling based on angular sector position on tracks with the >> head's current position). This is a neat way to control max IOPS vs tail >> latency since drives tend to privilege maximizing IOPS over lowering max >> tail latency. >> * The duration guideline limit defines an overall time limit for a >> command without distinguishing between active and inactive time. It is the >> easiest to use (the easiest one to understand from a beginner user point >> of view). This is a neat way to define an intelligent IO prioritization in >> fact, way better than RT class scheduling on the host or the use of ATA >> NCQ high priority, as it provides more information to the drive about the >> urgency of a particular command. That allows the drive to still perform >> RPO to maximize IOPS without long tail latencies. Chaining such a limit with >> an active+inactive time limit descriptor using the "next limit" policy >> (0x1 policy) can also finely define what the drive should do if the guideline >> limit is exceeded (as the next descriptor can define what to do based on >> the reason for the limit being exceeded: long internal queueing vs bad >> sector long access time). > > Note that all 3 limits can be used in a single CDL descriptor to precisely > define how a command should be processed by the device. That is why it is > nearly impossible to come up with a meaningful ordering of CDL descriptors > as an increasing set of priority levels.
A summary of my concerns is as follows: * The current I/O priority levels (RT, BE, IDLE) apply to all block devices. IOPRIO_CLASS_DL is only supported by certain block devices (some but not all SCSI hard disks). This forces applications to check the capabilities of the storage device before it can be decided whether or not IOPRIO_CLASS_DL can be used. This is not something applications should do but something the kernel should do. Additionally, if multiple dm devices are stacked on top of the block device driver, like in Android, it becomes even more cumbersome to check whether or not the block device supports CDL. * For the RT, BE and IDLE classes, it is well defined which priority number represents a high priority and which priority number represents a low priority. For CDL, only the drive knows the priority details. I think that application software should be able to select a DL priority without having to read the CDL configuration first. I hope that I have made it clear that I think that the proposed user space API will be very painful to use for application developers. Bart.
On 1/28/23 02:23, Bart Van Assche wrote: > A summary of my concerns is as follows: > * The current I/O priority levels (RT, BE, IDLE) apply to all block > devices. IOPRIO_CLASS_DL is only supported by certain block devices > (some but not all SCSI harddisks). This forces applications to check the > capabilities of the storage device before it can be decided whether or > not IOPRIO_CLASS_DL can be used. This is not something applications > should do but something the kernel should do. Additionally, if multiple > dm devices are stacked on top of the block device driver, like in > Android, it becomes even more cumbersome to check whether or not the > block device supports CDL. Yes, RT, BE and IDLE apply to all block devices. And so does CDL in the sense that if a user specifies the CDL class for IOs to a device that does not support CDL, then nothing special will happen. There will be no differentiation of the IOs. That is *exactly* what happens when using RT, BE or IDLE with the none scheduler (e.g. default nvme setup). And the same remark applies to RT class mapping to ATA NCQ priority feature: the user needs to check the device to know if that will happen, *and* also needs to turn on that feature for that mapping to be effective. The levels of the CDL priority class are also very well defined: they map to the CDL descriptors defined on the drive, which can be consulted by the user through sysfs (no special tools needed), so easily discoverable. As for DM devices, these have no scheduler. So any processing of a priority class by a DM target driver is that driver's responsibility. Initially, all that happens is the block layer passing on that information through the stack with the BIOs. That's it. Real action may happen once the physical block device is reached with the IO scheduler for that device, if one is set. At that level, the none scheduler is of no concern, nothing will happen. Kyber also ignores priorities. We are left with only bfq and mq-deadline. The latter only cares about the priority class, ignoring levels. bfq does act on both class and level. IOPRIO_CLASS_DL is equal to 4, so, strictly speaking, it is of lower priority than the IDLE class if you want to consider it as part of that ordering. But we defined it as a different class to allow *not* having to do that. IO schedulers can be modified to ignore that priority class for now, mapping it to say the default BE class for instance. Our current patch set maps the CDL class to the RT class for the schedulers, as that made most sense given the time-sensitive nature of CDL workloads. But we can change that to actually let the scheduler decide if you want. There are no other changes in the block layer that have or need special handling of the CDL class. All very clean in my opinion, no special conditions for that feature. No additional "if" in the hot path, no overhead added. > * For the RT, BE and IDLE classes, it is well defined which priority > number represents a high priority and which priority number represents a > low priority. For CDL, only the drive knows the priority details. I > think that application software should be able to select a DL priority > without having to read the CDL configuration first. The levels of the CDL priority class are also very well defined: they map to the CDL descriptors defined on the drive, which can be consulted by the user through sysfs (no special tools needed), so easily discoverable.
And unless we restrict how CDL descriptors can be defined, which I explained in my previous email is not desirable at all, we cannot and should not try to order levels in some sort of priority semantic. The CDL semantics do not directly define a priority level, only time limits, which may or may not be ordered, depending on the limits definitions. As Niklas pointed out, this is not a "generic" feature that any random application can magically use without modifications. The application must be aware of what CDL is and of how the descriptors are defined. And for 99.99% of the use cases, the CDL descriptors will be defined in a way useful for that application. There is no magic generic set of descriptors defined by default. Though a simple set of increasing time limits can be cleanly mapped to priority levels. A system administrator is free to do that for the system drives if that is what the running applications expect. CDL is a very flexible feature that can cover a lot of use cases. Trying to shoehorn it into the legacy/classic priority semantic framework would only restrict its usefulness. > I hope that I have it made it clear that I think that the proposed user > space API will be very painful to use for application developers. I completely disagree. Reusing the prio class/level API made it easy to allow applications to use the feature. fio support for CDL requires exactly *one line* change, to allow for the CDL class number 4. That's it. From there, one can use the --cmdprio_class=4 and --cmdprio=idx options to exercise a drive. The value of "idx" here of course depends on how the descriptors are set on the drive. But back to the point above. This depends on the application goals and the descriptors are set accordingly for that goal. There is no real discovery needed by the application. The application expects a certain set of CDL limits for its use case, and checking that this set is the one currently defined on the drive is easy to do from an application with the sysfs interface we added. Many users out there have deployed and are using applications taking advantage of the ATA NCQ priority feature, using class RT for high priority IOs. The new CDL class does not require many application changes to be enabled for next gen drives that will have CDL. > > Bart. >
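For readers following along, here is a minimal userspace sketch of the per-process usage described above, using the plain ioprio_set() syscall. The class value 4 (IOPRIO_CLASS_DL) and the meaning of the level as a CDL descriptor index are taken from this series; everything else is the existing uapi, and the descriptor index 3 is only an example.

#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

/* From include/uapi/linux/ioprio.h; IOPRIO_CLASS_DL is added by this series. */
#define IOPRIO_CLASS_SHIFT		13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_DL			4
#define IOPRIO_WHO_PROCESS		1

int main(void)
{
	int cdl_index = 3;	/* example: must match a CDL descriptor configured on the drive */

	/* Apply duration limit descriptor 3 to all subsequent IOs of this process. */
	if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
		    IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, cdl_index)) < 0) {
		perror("ioprio_set");
		return 1;
	}

	/* ... issue reads/writes here ... */
	return 0;
}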
On 1/27/23 16:40, Damien Le Moal wrote: > On 1/28/23 02:23, Bart Van Assche wrote: >> I hope that I have it made it clear that I think that the proposed user >> space API will be very painful to use for application developers. > > I completely disagree. Reusing the prio class/level API made it easy to allow > applications to use the feature. fio support for CDL requires exactly *one line* > change, to allow for the CDL class number 4. That's it. From there, one can use > the --cmdprio_class=4 nd --cmdprio=idx options to exercise a drive. The value of > "idx" here of course depends on how the descriptors are set on the drive. But > back to the point above. This depends on the application goals and the > descriptors are set accordingly for that goal. There is no real discovery needed > by the application. The application expect a certain set of CDL limits for its > use case, and checking that this set is the one currently defined on the drive > is easy to do from an application with the sysfs interface we added. > > Many users out there have deployed and using applications taking advantage of > ATA NCQ priority feature, using class RT for high priority IOs. The new CDL > class does not require many application changes to be enabled for next gen > drives that will have CDL. As I mentioned before, the new I/O priority class IOPRIO_CLASS_DL makes it impossible to use a single I/O priority class across devices that support CDL and devices that do not support CDL. I'm surprised that you keep denying that IOPRIO_CLASS_DL is a royal pain for users who have to support devices that support CDL and devices that do not support CDL. Bart.
On 1/28/23 09:47, Bart Van Assche wrote: > On 1/27/23 16:40, Damien Le Moal wrote: >> On 1/28/23 02:23, Bart Van Assche wrote: >>> I hope that I have it made it clear that I think that the proposed user >>> space API will be very painful to use for application developers. >> >> I completely disagree. Reusing the prio class/level API made it easy to allow >> applications to use the feature. fio support for CDL requires exactly *one line* >> change, to allow for the CDL class number 4. That's it. From there, one can use >> the --cmdprio_class=4 nd --cmdprio=idx options to exercise a drive. The value of >> "idx" here of course depends on how the descriptors are set on the drive. But >> back to the point above. This depends on the application goals and the >> descriptors are set accordingly for that goal. There is no real discovery needed >> by the application. The application expect a certain set of CDL limits for its >> use case, and checking that this set is the one currently defined on the drive >> is easy to do from an application with the sysfs interface we added. >> >> Many users out there have deployed and using applications taking advantage of >> ATA NCQ priority feature, using class RT for high priority IOs. The new CDL >> class does not require many application changes to be enabled for next gen >> drives that will have CDL. > As I mentioned before, the new I/O priority class IOPRIO_CLASS_DL > makes it impossible to use a single I/O priority class across devices > that support CDL and devices that do not support CDL. I'm surprised that > you keep denying that IOPRIO_CLASS_DL is a royal pain for users who have > to support devices that support CDL and devices that do not support CDL. I am not denying anything. I simply keep telling you that CDL is not a generic feature for random applications to use, including those that already use RT/BE/IDLE. It is for applications that know and expect it, and so have a setup suited for CDL use down to the drive CDL descriptors. That includes DM setups. Thinking about CDL in a generic setup for any random application to use is nonsense. And even if that happens and a user not knowing about it still tries it, then, as mentioned, nothing bad will happen. Using CDL in a setup that does not support it is a NOP. That would be the same as not using it. > > Bart.
Damien, Finally had a window where I could sit down and read this extremely long thread. > I am not denying anything. I simply keep telling you that CDL is not a > generic feature for random applications to use, including those that > already use RT/BE/IDLE. It is for applications that know and expect > it, and so have a setup suited for CDL use down to the drive CDL > descriptors. That includes DM setups. > > Thinking about CDL in a generic setup for any random application to > use is nonsense. And even if that happens and a user not knowing about > it still tries it, than as mentioned, nothing bad will happen. Using > CDL in a setup that does not support it is a NOP. That would be the > same as not using it. My observations: - Wrt. ioprio as conduit, I personally really dislike the idea of conflating priority (relative performance wrt. other I/O) with CDL (which is a QoS concept). I would really prefer those things to be separate. However, I do think that the ioprio *interface* is a good fit. A tool like ionice seems like a reasonable approach to letting generic applications set their CDL. If bio space wasn't at a premium, I'd say just keep things separate. But given the inherent incompatibility between kernel I/O scheduling and CDL, it's probably not worth the hassle to separate them. As much as it pains me to mix two concepts which should be completely orthogonal. I wish we could let applications specify both a priority and a CDL at the same time, though. Even if the kernel plumbing in the short term ends up using bi_ioprio as conduit. It's much harder to make changes in the application interface at a later date. - Wrt. "CDL is not a generic feature", I think you are underestimating how many applications actually want something like this. We have many. I don't think we should design for "special interest only, needs custom device tweaking to be usable". We have been down that path before (streams, etc.). And with poor results. I/O hints also tanked but at least we tried to pre-define performance classes that made sense in an abstract fashion. And programmed the mode page on discovered devices so that the classes were identical across all disks, regardless of whether they were SSDs or million dollar arrays. This allowed filesystems to communicate "this is metadata" regardless of the device the I/O was going to. Instead of "index 5 on this device" but "index 42 on the mirror". As such, I don't like the "just customize your settings with cdltools" approach. I'd much rather see us try to define a few QoS classes that make sense that would apply to every app and use those to define the application interface. And then have the kernel program those CDL classes into SCSI/ATA devices by default. Having the kernel provide an abstract interface for bio QoS and configuring a new disk with a sane handful of classes does not prevent $CLOUD_VENDOR from overriding what Linux configured. But at least we'd have a generic approach to block QoS in Linux. Similar to the existing I/O priority infrastructure which is also not tied to any particular hardware feature. A generic implementation also allows us to do fancy things in the hypervisor where we would like to be able to do QoS across multiple devices as well. Without having ATA or SCSI with CDL involved. Or whatever things might look like in NVMe.
On 1/29/23 05:25, Martin K. Petersen wrote: > > Damien, > > Finally had a window where I could sit down a read this extremely long > thread. > >> I am not denying anything. I simply keep telling you that CDL is not a >> generic feature for random applications to use, including those that >> already use RT/BE/IDLE. It is for applications that know and expect >> it, and so have a setup suited for CDL use down to the drive CDL >> descriptors. That includes DM setups. >> >> Thinking about CDL in a generic setup for any random application to >> use is nonsense. And even if that happens and a user not knowing about >> it still tries it, than as mentioned, nothing bad will happen. Using >> CDL in a setup that does not support it is a NOP. That would be the >> same as not using it. > > My observations: > > - Wrt. ioprio as conduit, I personally really dislike the idea of > conflating priority (relative performance wrt. other I/O) with CDL > (which is a QoS concept). I would really prefer those things to be > separate. However, I do think that the ioprio *interface* is a good > fit. A tool like ionice seems like a reasonable approach to letting > generic applications set their CDL. The definition of IOPRIO_CLASS_CDL was more about reusing the ioprio *interface* rather than having CDL support defined as a fully functional IO priority class. As I argued in this thread, and I think you agree, the CDL semantics are more than a simple priority class/level ordering. > If bio space wasn't a premium, I'd say just keep things separate. > But given the inherent incompatibility between kernel I/O scheduling > and CDL, it's probably not worth the hassle to separate them. As much > as it pains me to mix two concepts which should be completely > orthogonal. > > I wish we could let applications specify both a priority and a CDL at > the same time, though. Even if the kernel plumbing in the short term > ends up using bi_ioprio as conduit. It's much harder to make changes > in the application interface at a later date. See below. There may be a solution for that. > - Wrt. "CDL is not a generic feature", I think you are underestimating > how many applications actually want something like this. We have > many. > > I don't think we should design for "special interest only, needs > custom device tweaking to be usable". We have been down that path > before (streams, etc.). And with poor results. OK. > I/O hints also tanked but at least we tried to pre-define performance > classes that made sense in an abstract fashion. And programmed the > mode page on discovered devices so that the classes were identical > across all disks, regardless of whether they were SSDs or million > dollar arrays. This allowed filesystems to communicate "this is > metadata" regardless of the device the I/O was going to. Instead of > "index 5 on this device" but "index 42 on the mirror". > > As such, I don't like the "just customize your settings with > cdltools" approach. I'd much rather see us try to define a few QoS > classes that make sense that would apply to every app and use those > to define the application interface. And then have the kernel program > those CDL classes into SCSI/ATA devices by default. Makes sense. Though I think it will be hard to define a set of QoS hints that are useful for a wide range of applications, and even harder to convert the defined hint classes to CDL descriptors. I fear that we may end up with the same issues as IO hints/streams.
> Having the kernel provide an abstract interface for bio QoS and > configuring a new disk with a sane handful of classes does not > prevent $CLOUD_VENDOR from overriding what Linux configured. But at > least we'd have a generic approach to block QoS in Linux. Similar to > the existing I/O priority infrastructure which is also not tied to > any particular hardware feature. OK. See below about this. > A generic implementation also allows us to do fancy things in the > hypervisor where we would like to be able to do QoS across multiple > devices as well. Without having ATA or SCSI with CDL involved. Or > whatever things might look like in NVMe. Fair point, especially given that virtio actually already forwards a guest ioprio to the host through the virtio block command. Thinking of that particular point together with what you said, I came up with the change shown below as a replacement for this patch 1/18. This changes the 13 bits of ioprio data into a 3-bit QOS hint + a 3-bit IO prio level. This is consistent with the IO prio interface since IO priority levels have to be between 0 and 7 (otherwise, errors are returned). So in fact, the upper 10 bits of the ioprio data are ignored and we can safely use 3 of these bits for an IO hint. This hint applies to all priority classes and levels, that is, for the CDL case, we can enrich any priority with a hint that specifies the CDL index to use for an IO. This falls short of actually defining generic IO hints, but it has the advantage of not breaking anything for current applications using IO priorities and not requiring any change to existing IO schedulers, while still allowing CDL indexes for IOs to be passed down to the scsi & ATA layers (which for now would be the only layers in the kernel acting on the ioprio qos hints). I think that this approach still allows us to enable CDL support, and on top of it, go further and define generic QOS hints that IO schedulers can use and that also potentially map to CDL for scsi & ata (similarly to the RT class IOs mapping to the NCQ priority feature if the user enabled that feature). As mentioned above, I think that defining generic IO hint classes will be difficult. But the change below is, I think, a good starting point that should not prevent working on that. Thoughts ? Bart, Given that you did not like the IOPRIO_CLASS_CDL, what do you think of this approach ?
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index ccf2204477a5..9b3c8fb806f1 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5378,11 +5378,11 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 		bfqq->new_ioprio_class = task_nice_ioclass(tsk);
 		break;
 	case IOPRIO_CLASS_RT:
-		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
 		break;
 	case IOPRIO_CLASS_BE:
-		bfqq->new_ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+		bfqq->new_ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 		bfqq->new_ioprio_class = IOPRIO_CLASS_BE;
 		break;
 	case IOPRIO_CLASS_IDLE:
@@ -5671,7 +5671,7 @@ static struct bfq_queue *bfq_get_queue(struct bfq_data *bfqd,
 					       struct bfq_io_cq *bic,
 					       bool respawn)
 {
-	const int ioprio = IOPRIO_PRIO_DATA(bic->ioprio);
+	const int ioprio = IOPRIO_PRIO_LEVEL(bic->ioprio);
 	const int ioprio_class = IOPRIO_PRIO_CLASS(bic->ioprio);
 	struct bfq_queue **async_bfqq = NULL;
 	struct bfq_queue *bfqq;
diff --git a/block/ioprio.c b/block/ioprio.c
index 32a456b45804..33f327a10811 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -33,7 +33,7 @@ int ioprio_check_cap(int ioprio)
 {
 	int class = IOPRIO_PRIO_CLASS(ioprio);
-	int data = IOPRIO_PRIO_DATA(ioprio);
+	int level = IOPRIO_PRIO_LEVEL(ioprio);
 
 	switch (class) {
 	case IOPRIO_CLASS_RT:
@@ -49,13 +49,13 @@ int ioprio_check_cap(int ioprio)
 		fallthrough;
 		/* rt has prio field too */
 	case IOPRIO_CLASS_BE:
-		if (data >= IOPRIO_NR_LEVELS || data < 0)
+		if (level >= IOPRIO_NR_LEVELS || level < 0)
 			return -EINVAL;
 		break;
 	case IOPRIO_CLASS_IDLE:
 		break;
 	case IOPRIO_CLASS_NONE:
-		if (data)
+		if (level)
 			return -EINVAL;
 		break;
 	default:
diff --git a/fs/f2fs/sysfs.c b/fs/f2fs/sysfs.c
index 83a366f3ee80..eba23e6a7bf6 100644
--- a/fs/f2fs/sysfs.c
+++ b/fs/f2fs/sysfs.c
@@ -314,7 +314,7 @@ static ssize_t f2fs_sbi_show(struct f2fs_attr *a,
 	struct ckpt_req_control *cprc = &sbi->cprc_info;
 	int len = 0;
 	int class = IOPRIO_PRIO_CLASS(cprc->ckpt_thread_ioprio);
-	int data = IOPRIO_PRIO_DATA(cprc->ckpt_thread_ioprio);
+	int level = IOPRIO_PRIO_LEVEL(cprc->ckpt_thread_ioprio);
 
 	if (class == IOPRIO_CLASS_RT)
 		len += scnprintf(buf + len, PAGE_SIZE - len, "rt,");
@@ -323,7 +323,7 @@ static ssize_t f2fs_sbi_show(struct f2fs_attr *a,
 	else
 		return -EINVAL;
 
-	len += scnprintf(buf + len, PAGE_SIZE - len, "%d\n", data);
+	len += scnprintf(buf + len, PAGE_SIZE - len, "%d\n", level);
 
 	return len;
 }
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index f70f2596a6bf..1d90349a19c9 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -37,6 +37,18 @@ enum {
 #define IOPRIO_NR_LEVELS	8
 #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS
 
+/*
+ * The 13-bits of ioprio data for each class provide up to 8 QOS hints and
+ * up to 8 priority levels.
+ */
+#define IOPRIO_PRIO_LEVEL_MASK	(IOPRIO_NR_LEVELS - 1)
+#define IOPRIO_QOS_HINT_SHIFT	10
+#define IOPRIO_NR_QOS_HINTS	8
+#define IOPRIO_QOS_HINT_MASK	(IOPRIO_NR_QOS_HINTS - 1)
+#define IOPRIO_PRIO_LEVEL(ioprio)	((ioprio) & IOPRIO_PRIO_LEVEL_MASK)
+#define IOPRIO_QOS_HINT(ioprio)	\
+	(((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK)
+
 enum {
 	IOPRIO_WHO_PROCESS = 1,
 	IOPRIO_WHO_PGRP,
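To make the proposed encoding concrete, here is a small userspace sketch of how a caller could combine a priority class, a level and a QoS hint under the layout above. The shift value 10 and the level/hint split come from the diff; the class, level and hint numbers are arbitrary examples, not values mandated by the proposal.

#include <stdio.h>

#define IOPRIO_CLASS_SHIFT		13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_RT			1
#define IOPRIO_QOS_HINT_SHIFT		10	/* from the proposal above */

int main(void)
{
	int level = 2;		/* classic priority level, bits 0..2 of the prio data */
	int qos_hint = 5;	/* e.g. the CDL descriptor index, bits 10..12 of the prio data */
	int ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT,
				       (qos_hint << IOPRIO_QOS_HINT_SHIFT) | level);

	/* The kernel side would recover both values with IOPRIO_PRIO_LEVEL()
	 * and IOPRIO_QOS_HINT() as defined in the hunk above. */
	printf("ioprio = 0x%x\n", ioprio);
	return 0;
}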
On 1/30/23 22:44, Hannes Reinecke wrote: > On 1/29/23 04:52, Damien Le Moal wrote: >> On 1/29/23 05:25, Martin K. Petersen wrote: > [ .. ] >>> >>> As such, I don't like the "just customize your settings with >>> cdltools" approach. I'd much rather see us try to define a few QoS >>> classes that make sense that would apply to every app and use those >>> to define the application interface. And then have the kernel program >>> those CDL classes into SCSI/ATA devices by default. >> >> Makes sense. Though I think it will be hard to define a set of QoS hints that >> are useful for a wide range of applications, and even harder to convert the >> defined hint classes to CDL descriptors. I fear that we may end up with the same >> issues as IO hints/streams. >> >>> Having the kernel provide an abstract interface for bio QoS and >>> configuring a new disk with a sane handful of classes does not >>> prevent $CLOUD_VENDOR from overriding what Linux configured. But at >>> least we'd have a generic approach to block QoS in Linux. Similar to >>> the existing I/O priority infrastructure which is also not tied to >>> any particular hardware feature. >> >> OK. See below about this. >> >>> A generic implementation also allows us to do fancy things in the >>> hypervisor where we would like to be able to do QoS across multiple >>> devices as well. Without having ATA or SCSI with CDL involved. Or >>> whatever things might look like in NVMe. >> >> Fair point, especially given that virtio actually already forwards a guest >> ioprio to the host through the virtio block command. Thinking of that particular >> point together with what you said, I came up with the change show below as a >> replacement for this patch 1/18. >> >> This changes the 13-bits ioprio data into a 3-bits QOS hint + 3-bits of IO prio >> level. This is consistent with the IO prio interface since IO priority levels >> have to be between 0 and 7 (otherwise, errors are returned). So in fact, the >> upper 10-bits of the ioprio data are ignored and we can safely use 3 of these >> bits for an IO hint. >> >> This hint applies to all priority classes and levels, that is, for the CDL case, >> we can enrich any priority with a hint that specifies the CDL index to use for >> an IO. >> >> This falls short of actually defining generic IO hints, but this has the >> advantage to not break anything for current applications using IO priorities, >> not require any change to existing IO schedulers, while still allowing to pass >> CDL indexes for IOs down to the scsi & ATA layers (which for now would be the >> only layers in the kernel acting on the ioprio qos hints). >> >> I think that this approach still allows us to enable CDL support, and on top of >> it, go further and define generic QOS hints that IO scheduler can use and that >> also potentially map to CDL for scsi & ata (similarly to the RT class IOs >> mapping to the NCQ priority feature if the user enabled that feature). >> >> As mentioned above, I think that defining generic IO hint classes will be >> difficult. But the change below is I think a good a starting point that should >> not prevent working on that. >> >> Thoughts ? >> > I like the idea. > QoS is one of the recurring topic always coming up sooner or later when > talking of storage networks, so having _some_ concept of QoS in the > linux kernel (for storage) would be beneficial. > > Maybe time for a topic at LSF? Yes. 
I was hoping for a quicker resolution so that we can get the CDL "mechanical" bits in, but without a nice API for it, we cannot :) Trying to compile something with Niklas. So far, we are thinking of having QOS flags + QOS data, the flags determining how (and if) the QOS data is used and what it means. Examples of things we could have: * IOPRIO_QOS_FAILFAST: do not retry the IO if it fails the first time * IOPRIO_QOS_DURATION_LIMIT: then the QOS data indicates the limit to use (number). That can be implemented in schedulers and also map to CDL on drives that support that feature. That is the difficult part: what else ? For now, considering only our target of adding scsi & ata CDL support, the above is enough. But is that enough in general for most users/apps ? > > Cheers, > > Hannes >
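Purely as a strawman of the "QOS flags + QOS data" idea: the two flag names below come from the message above, while the bit layout, mask and helper are assumptions for illustration only and are not part of any posted patch.

/* Hypothetical encoding: low bits carry flags, a few bits above carry the QOS data. */
enum {
	IOPRIO_QOS_FAILFAST		= 1 << 0,	/* do not retry the IO if it fails */
	IOPRIO_QOS_DURATION_LIMIT	= 1 << 1,	/* QOS data = duration limit index */
};

#define IOPRIO_QOS_DATA_SHIFT	3
#define IOPRIO_QOS_DATA_MASK	0x7

/* With IOPRIO_QOS_DURATION_LIMIT set, the QOS data would carry the CDL index. */
static inline unsigned int ioprio_qos(unsigned int flags, unsigned int data)
{
	return flags | ((data & IOPRIO_QOS_DATA_MASK) << IOPRIO_QOS_DATA_SHIFT);
}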
On 1/28/23 12:25, Martin K. Petersen wrote: > - Wrt. ioprio as conduit, I personally really dislike the idea of > conflating priority (relative performance wrt. other I/O) with CDL > (which is a QoS concept). I would really prefer those things to be > separate. However, I do think that the ioprio *interface* is a good > fit. A tool like ionice seems like a reasonable approach to letting > generic applications set their CDL. Hi Martin, My understanding is that ionice uses the ioprio_set() system call and hence only affects foreground I/O but not page cache writeback. This is why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How about not adding CDL support in ioprio_set() and only supporting configuration of CDL via the v2 cgroup mechanism? Thanks, Bart.
On 1/28/23 19:52, Damien Le Moal wrote: > diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h > index f70f2596a6bf..1d90349a19c9 100644 > +/* > + * The 13-bits of ioprio data for each class provide up to 8 QOS hints and > + * up to 8 priority levels. > + */ > +#define IOPRIO_PRIO_LEVEL_MASK (IOPRIO_NR_LEVELS - 1) > +#define IOPRIO_QOS_HINT_SHIFT 10 > +#define IOPRIO_NR_QOS_HINTS 8 > +#define IOPRIO_QOS_HINT_MASK (IOPRIO_NR_QOS_HINTS - 1) > +#define IOPRIO_PRIO_LEVEL(ioprio) ((ioprio) & IOPRIO_PRIO_LEVEL_MASK) > +#define IOPRIO_QOS_HINT(ioprio) \ > + (((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK) > + Hi Damien, How about the following approach? * Do not add QoS support to the ioprio_set() system call since that system call only affects foreground I/O. * Configure QoS via the v2 cgroup mechanism such that QoS policies are applied to both foreground and background I/O. This approach allows using another binary representation for I/O priorities and QoS in the bio.bi_ioprio field than what is supported by the ioprio_set() system call. This approach also allows using names (strings) for QoS settings instead of numbers in the interface between user space and the kernel if that would be considered desirable. Thanks, Bart.
On 1/28/23 19:52, Damien Le Moal wrote: > +/* > + * The 13-bits of ioprio data for each class provide up to 8 QOS hints and > + * up to 8 priority levels. > + */ > +#define IOPRIO_PRIO_LEVEL_MASK (IOPRIO_NR_LEVELS - 1) > +#define IOPRIO_QOS_HINT_SHIFT 10 > +#define IOPRIO_NR_QOS_HINTS 8 > +#define IOPRIO_QOS_HINT_MASK (IOPRIO_NR_QOS_HINTS - 1) > +#define IOPRIO_PRIO_LEVEL(ioprio) ((ioprio) & IOPRIO_PRIO_LEVEL_MASK) > +#define IOPRIO_QOS_HINT(ioprio) \ > + (((ioprio) >> IOPRIO_QOS_HINT_SHIFT) & IOPRIO_QOS_HINT_MASK) Does the QoS level really have to be encoded in bio.bi_ioprio? How about introducing a new field in the existing hole in struct bio? From the pahole output:

struct bio {
	struct bio *              bi_next;              /*     0     4 */
	struct block_device *     bi_bdev;              /*     4     4 */
	blk_opf_t                 bi_opf;               /*     8     4 */
	short unsigned int        bi_flags;             /*    12     2 */
	short unsigned int        bi_ioprio;            /*    14     2 */
	blk_status_t              bi_status;            /*    16     1 */

	/* XXX 3 bytes hole, try to pack */

	atomic_t                  __bi_remaining;       /*    20     4 */
	struct bvec_iter          bi_iter;              /*    24    20 */
	blk_qc_t                  bi_cookie;            /*    44     4 */
	bio_end_io_t *            bi_end_io;            /*    48     4 */
	void *                    bi_private;           /*    52     4 */
	struct bio_crypt_ctx *    bi_crypt_context;     /*    56     4 */
	[ ... ]

Thanks, Bart.
Damien, > Makes sense. Though I think it will be hard to define a set of QoS > hints that are useful for a wide range of applications, and even > harder to convert the defined hint classes to CDL descriptors. I fear > that we may end up with the same issues as IO hints/streams. Hints mainly failed because non-Linux OSes had very different expectations about how this was going to work. So that left device vendors in a situation where they had to essentially support 3 different approaches all implemented using the same protocol. The challenge of being a general purpose OS is to come up with concepts that are applicable in a variety of situations. Twiddling protocol fields is the easy part. I have a couple of experienced CDL users that I'd like to talk to and try to get a better idea of what a suitable set of defaults might look like. > This hint applies to all priority classes and levels, that is, for the > CDL case, we can enrich any priority with a hint that specifies the > CDL index to use for an IO. Yeah, I like that approach better.
Hi Bart! > My understanding is that ionice uses the ioprio_set() system call and > hence only affects foreground I/O but not page cache writeback. This > is why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How > about not adding CDL support in ioprio_set() and only supporting > configuration of CDL via the v2 cgroup mechanism? I suspect applications that care about CDL would probably not go through the page cache. But I don't have a problem supporting cgroups at all. Longer term, for the applications that I'm aware of that care about this, we'd probably want to be able to specify things on a per-I/O basis via io_uring, though.
On 1/31/23 11:58, Martin K. Petersen wrote: > > Hi Bart! > >> My understanding is that ionice uses the ioprio_set() system call and >> hence only affects foreground I/O but not page cache writeback. This >> is why I introduced the ioprio rq-qos policy (block/blk-ioprio.c). How >> about not adding CDL support in ioprio_set() and only supporting >> configuration of CDL via the v2 cgroup mechanism? > > I suspect applications that care about CDL would probably not go through > the page cache. But I don't have a problem supporting cgroups at all. > > Longer term, for the applications that I'm aware of that care about > this, we'd probably want to be able to specify things on a per-I/O basis > via io_uring, though. Absolutely agree here. Similarly to the legacy ioprio, we need to be able to specify per-io (io_uring or libaio) and per context (ioprio_set() or cgroups). I see the per-io and cgroups APIs as complementary. Many of the use cases we are seeing for CDL are transitions from ATA NCQ prio, which relies on the RT class in the per-IO aio_reqprio field with libaio/io_uring. I would like to have continuity with this to facilitate application development. Having only cgroups as the API would rule out per-io use by async IO application engines.
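As an illustration of the per-IO path discussed here, a minimal liburing sketch follows. The class value 4 (IOPRIO_CLASS_DL) and the level-as-CDL-index meaning come from this series; the io_uring_sqe ioprio field is the existing interface, and the device path and descriptor index are only examples.

#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>

#define IOPRIO_CLASS_SHIFT		13
#define IOPRIO_PRIO_VALUE(class, data)	(((class) << IOPRIO_CLASS_SHIFT) | (data))
#define IOPRIO_CLASS_DL			4

int main(void)
{
	struct io_uring ring;
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	char *buf;
	int fd = open("/dev/sda", O_RDONLY);	/* example device */

	if (fd < 0 || posix_memalign((void **)&buf, 4096, 4096) ||
	    io_uring_queue_init(8, &ring, 0))
		return 1;

	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_read(sqe, fd, buf, 4096, 0);
	/* Per-IO: use CDL descriptor 2 for this read only. */
	sqe->ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_DL, 2);

	io_uring_submit(&ring);
	io_uring_wait_cqe(&ring, &cqe);
	printf("read returned %d\n", cqe->res);
	io_uring_cqe_seen(&ring, cqe);
	io_uring_queue_exit(&ring);
	return 0;
}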
On 1/31/23 11:49, Martin K. Petersen wrote: > > Damien, > >> Makes sense. Though I think it will be hard to define a set of QoS >> hints that are useful for a wide range of applications, and even >> harder to convert the defined hint classes to CDL descriptors. I fear >> that we may end up with the same issues as IO hints/streams. > > Hints mainly failed because non-Linux OSes had very different > expectations about how this was going to work. So that left device > vendors in a situation where they had to essentially support 3 different > approaches all implemented using the same protocol. > > The challenge of being a general purpose OS is to come up with concepts > that are applicable in a variety of situations. Twiddling protocol > fields is the easy part. > > I have a couple of experienced CDL users that I'd like to talk to and > try to get a better idea of what a suitable set of defaults might look > like. > >> This hint applies to all priority classes and levels, that is, for the >> CDL case, we can enrich any priority with a hint that specifies the >> CDL index to use for an IO. > > Yeah, I like that approach better. Of note is that even though the IOPRIO_XXX macros in include/uapi/linux/ioprio.h assume a 16-bit value for the priority class + data, of which only 6 bits are usable (3 for the class, 3 for the level), all syscalls and kernel internal interfaces have ioprio defined as an int. So we have in fact 32 bits to play with. We could keep the lower 16 bits for ioprio as it was, and have the upper 16 bits used for QOS hints. More room than the 10 bits between the prio class and level. The only place that will need changing is struct bio since bi_ioprio is defined as an unsigned short. To solve this, as Bart suggested, we could add another unsigned short in the bio struct hole for the qos hints (bi_iohint or bi_ioqoshint). But if we can define a sensible set of hints that covers at least CDL with the 10 free bits we have in the current ioprio, that would be even better I think (fewer changes needed in the block layer).
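A hedged sketch of what the 16/16 split floated here could look like; only the idea (lower 16 bits keep the existing class/data encoding, upper 16 bits carry QoS hints) comes from the email, and the macro names are made up for illustration.

#define IOPRIO_LEGACY_MASK		0xffff
#define IOPRIO_HINT_SHIFT		16

#define IOPRIO_LEGACY_PRIO(ioprio)	((ioprio) & IOPRIO_LEGACY_MASK)
#define IOPRIO_HINT(ioprio)		(((unsigned int)(ioprio)) >> IOPRIO_HINT_SHIFT)
#define IOPRIO_WITH_HINT(prio, hint)	(((hint) << IOPRIO_HINT_SHIFT) | \
					 IOPRIO_LEGACY_PRIO(prio))

This would keep existing ioprio_set() and io_uring users untouched, but, as noted in the email above, bio.bi_ioprio is an unsigned short today, so the hint would either need that field to grow or a second field (bi_iohint / bi_ioqoshint) to survive down the stack.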
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index 815b884d6c5a..7add9346c585 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -5545,6 +5545,14 @@ bfq_set_next_ioprio_data(struct bfq_queue *bfqq, struct bfq_io_cq *bic)
 		bfqq->new_ioprio_class = IOPRIO_CLASS_IDLE;
 		bfqq->new_ioprio = 7;
 		break;
+	case IOPRIO_CLASS_DL:
+		/*
+		 * For the duration-limits class, we want the disk to do the
+		 * scheduling. So map all levels to the highest RT level.
+		 */
+		bfqq->new_ioprio = 0;
+		bfqq->new_ioprio_class = IOPRIO_CLASS_RT;
+		break;
 	}
 
 	if (bfqq->new_ioprio >= IOPRIO_NR_LEVELS) {
@@ -5673,6 +5681,8 @@ static struct bfq_queue **bfq_async_queue_prio(struct bfq_data *bfqd,
 		return &bfqg->async_bfqq[1][ioprio][act_idx];
 	case IOPRIO_CLASS_IDLE:
 		return &bfqg->async_idle_bfqq[act_idx];
+	case IOPRIO_CLASS_DL:
+		return &bfqg->async_bfqq[0][0][act_idx];
 	default:
 		return NULL;
 	}
diff --git a/block/blk-ioprio.c b/block/blk-ioprio.c
index 8bb6b8eba4ce..dfb5c3f447f4 100644
--- a/block/blk-ioprio.c
+++ b/block/blk-ioprio.c
@@ -27,6 +27,7 @@
  * @POLICY_RESTRICT_TO_BE: modify IOPRIO_CLASS_NONE and IOPRIO_CLASS_RT into
  *		IOPRIO_CLASS_BE.
  * @POLICY_ALL_TO_IDLE: change the I/O priority class into IOPRIO_CLASS_IDLE.
+ * @POLICY_ALL_TO_DL: change the I/O priority class into IOPRIO_CLASS_DL.
  *
  * See also <linux/ioprio.h>.
  */
@@ -35,6 +36,7 @@ enum prio_policy {
 	POLICY_NONE_TO_RT	= 1,
 	POLICY_RESTRICT_TO_BE	= 2,
 	POLICY_ALL_TO_IDLE	= 3,
+	POLICY_ALL_TO_DL	= 4,
 };
 
 static const char *policy_name[] = {
@@ -42,6 +44,7 @@ static const char *policy_name[] = {
 	[POLICY_NONE_TO_RT]	= "none-to-rt",
 	[POLICY_RESTRICT_TO_BE]	= "restrict-to-be",
 	[POLICY_ALL_TO_IDLE]	= "idle",
+	[POLICY_ALL_TO_DL]	= "duration-limits",
 };
 
 static struct blkcg_policy ioprio_policy;
diff --git a/block/ioprio.c b/block/ioprio.c
index 32a456b45804..1b3a9da82597 100644
--- a/block/ioprio.c
+++ b/block/ioprio.c
@@ -37,6 +37,7 @@ int ioprio_check_cap(int ioprio)
 
 	switch (class) {
 	case IOPRIO_CLASS_RT:
+	case IOPRIO_CLASS_DL:
 		/*
 		 * Originally this only checked for CAP_SYS_ADMIN,
 		 * which was implicitly allowed for pid 0 by security
@@ -47,7 +48,7 @@ int ioprio_check_cap(int ioprio)
 		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_NICE))
 			return -EPERM;
 		fallthrough;
-		/* rt has prio field too */
+		/* RT and DL have prio field too */
 	case IOPRIO_CLASS_BE:
 		if (data >= IOPRIO_NR_LEVELS || data < 0)
 			return -EINVAL;
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f10c2a0d18d4..526d0ea4dbf9 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -113,6 +113,7 @@ static const enum dd_prio ioprio_class_to_prio[] = {
 	[IOPRIO_CLASS_RT]	= DD_RT_PRIO,
 	[IOPRIO_CLASS_BE]	= DD_BE_PRIO,
 	[IOPRIO_CLASS_IDLE]	= DD_IDLE_PRIO,
+	[IOPRIO_CLASS_DL]	= DD_RT_PRIO,
 };
 
 static inline struct rb_root *
diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
index 7578d4f6a969..2f3fc2fbd668 100644
--- a/include/linux/ioprio.h
+++ b/include/linux/ioprio.h
@@ -20,7 +20,7 @@ static inline bool ioprio_valid(unsigned short ioprio)
 {
 	unsigned short class = IOPRIO_PRIO_CLASS(ioprio);
 
-	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_IDLE;
+	return class > IOPRIO_CLASS_NONE && class <= IOPRIO_CLASS_DL;
 }
 
 /*
diff --git a/include/uapi/linux/ioprio.h b/include/uapi/linux/ioprio.h
index f70f2596a6bf..15908b9e9d8c 100644
--- a/include/uapi/linux/ioprio.h
+++ b/include/uapi/linux/ioprio.h
@@ -29,6 +29,7 @@ enum {
 	IOPRIO_CLASS_RT,
 	IOPRIO_CLASS_BE,
 	IOPRIO_CLASS_IDLE,
+	IOPRIO_CLASS_DL,
 };
 
 /*
@@ -37,6 +38,12 @@ enum {
 #define IOPRIO_NR_LEVELS	8
 #define IOPRIO_BE_NR		IOPRIO_NR_LEVELS
 
+/*
+ * The Duration limits class allows 8 levels: level 0 for "no limit" and levels
+ * 1 to 7, each corresponding to a read or write limit descriptor.
+ */
+#define IOPRIO_DL_NR_LEVELS	8
+
 enum {
 	IOPRIO_WHO_PROCESS = 1,
 	IOPRIO_WHO_PGRP,