Message ID | 20180207211920.6343-1-paolo.valente@linaro.org |
---|---|
State | Accepted |
Commit | a7877390614770965a6925dfed79cbd3eeeb61e0 |
Headers | show |
Series | [BUGFIX,V3] block, bfq: add requeue-request hook | expand |
On 2/7/18 2:19 PM, Paolo Valente wrote: > Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via > RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device > be re-inserted into the active I/O scheduler for that device. As a > consequence, I/O schedulers may get the same request inserted again, > even several times, without a finish_request invoked on that request > before each re-insertion. > > This fact is the cause of the failure reported in [1]. For an I/O > scheduler, every re-insertion of the same re-prepared request is > equivalent to the insertion of a new request. For schedulers like > mq-deadline or kyber, this fact causes no harm. In contrast, it > confuses a stateful scheduler like BFQ, which keeps state for an I/O > request, until the finish_request hook is invoked on the request. In > particular, BFQ may get stuck, waiting forever for the number of > request dispatches, of the same request, to be balanced by an equal > number of request completions (while there will be one completion for > that request). In this state, BFQ may refuse to serve I/O requests > from other bfq_queues. The hang reported in [1] then follows. > > However, the above re-prepared requests undergo a requeue, thus the > requeue_request hook of the active elevator is invoked for these > requests, if set. This commit then addresses the above issue by > properly implementing the hook requeue_request in BFQ. Thanks, applied. -- Jens Axboe
> Il giorno 07 feb 2018, alle ore 23:18, Jens Axboe <axboe@kernel.dk> ha scritto: > > On 2/7/18 2:19 PM, Paolo Valente wrote: >> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via >> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device >> be re-inserted into the active I/O scheduler for that device. As a >> consequence, I/O schedulers may get the same request inserted again, >> even several times, without a finish_request invoked on that request >> before each re-insertion. >> >> This fact is the cause of the failure reported in [1]. For an I/O >> scheduler, every re-insertion of the same re-prepared request is >> equivalent to the insertion of a new request. For schedulers like >> mq-deadline or kyber, this fact causes no harm. In contrast, it >> confuses a stateful scheduler like BFQ, which keeps state for an I/O >> request, until the finish_request hook is invoked on the request. In >> particular, BFQ may get stuck, waiting forever for the number of >> request dispatches, of the same request, to be balanced by an equal >> number of request completions (while there will be one completion for >> that request). In this state, BFQ may refuse to serve I/O requests >> from other bfq_queues. The hang reported in [1] then follows. >> >> However, the above re-prepared requests undergo a requeue, thus the >> requeue_request hook of the active elevator is invoked for these >> requests, if set. This commit then addresses the above issue by >> properly implementing the hook requeue_request in BFQ. > > Thanks, applied. > I Jens, I forgot to add Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> in the patch. Is it still possible to add it? Thanks, Paolo > -- > Jens Axboe
Hi. 08.02.2018 08:16, Paolo Valente wrote: >> Il giorno 07 feb 2018, alle ore 23:18, Jens Axboe <axboe@kernel.dk> ha >> scritto: >> >> On 2/7/18 2:19 PM, Paolo Valente wrote: >>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq >>> via >>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a >>> device >>> be re-inserted into the active I/O scheduler for that device. As a >>> consequence, I/O schedulers may get the same request inserted again, >>> even several times, without a finish_request invoked on that request >>> before each re-insertion. >>> >>> This fact is the cause of the failure reported in [1]. For an I/O >>> scheduler, every re-insertion of the same re-prepared request is >>> equivalent to the insertion of a new request. For schedulers like >>> mq-deadline or kyber, this fact causes no harm. In contrast, it >>> confuses a stateful scheduler like BFQ, which keeps state for an I/O >>> request, until the finish_request hook is invoked on the request. In >>> particular, BFQ may get stuck, waiting forever for the number of >>> request dispatches, of the same request, to be balanced by an equal >>> number of request completions (while there will be one completion for >>> that request). In this state, BFQ may refuse to serve I/O requests >>> from other bfq_queues. The hang reported in [1] then follows. >>> >>> However, the above re-prepared requests undergo a requeue, thus the >>> requeue_request hook of the active elevator is invoked for these >>> requests, if set. This commit then addresses the above issue by >>> properly implementing the hook requeue_request in BFQ. >> >> Thanks, applied. >> > > I Jens, > I forgot to add > Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> > in the patch. > > Is it still possible to add it? > In addition to this I think it should be worth considering CC'ing Greg to pull this fix into 4.15 stable tree. Oleksandr
On 2/9/18 6:21 AM, Oleksandr Natalenko wrote: > Hi. > > 08.02.2018 08:16, Paolo Valente wrote: >>> Il giorno 07 feb 2018, alle ore 23:18, Jens Axboe <axboe@kernel.dk> ha >>> scritto: >>> >>> On 2/7/18 2:19 PM, Paolo Valente wrote: >>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq >>>> via >>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a >>>> device >>>> be re-inserted into the active I/O scheduler for that device. As a >>>> consequence, I/O schedulers may get the same request inserted again, >>>> even several times, without a finish_request invoked on that request >>>> before each re-insertion. >>>> >>>> This fact is the cause of the failure reported in [1]. For an I/O >>>> scheduler, every re-insertion of the same re-prepared request is >>>> equivalent to the insertion of a new request. For schedulers like >>>> mq-deadline or kyber, this fact causes no harm. In contrast, it >>>> confuses a stateful scheduler like BFQ, which keeps state for an I/O >>>> request, until the finish_request hook is invoked on the request. In >>>> particular, BFQ may get stuck, waiting forever for the number of >>>> request dispatches, of the same request, to be balanced by an equal >>>> number of request completions (while there will be one completion for >>>> that request). In this state, BFQ may refuse to serve I/O requests >>>> from other bfq_queues. The hang reported in [1] then follows. >>>> >>>> However, the above re-prepared requests undergo a requeue, thus the >>>> requeue_request hook of the active elevator is invoked for these >>>> requests, if set. This commit then addresses the above issue by >>>> properly implementing the hook requeue_request in BFQ. >>> >>> Thanks, applied. >>> >> >> I Jens, >> I forgot to add >> Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name> >> in the patch. >> >> Is it still possible to add it? >> > > In addition to this I think it should be worth considering CC'ing Greg > to pull this fix into 4.15 stable tree. I can't add the tested-by anymore, but it's easy enough to target for stable after-the-fact. -- Jens Axboe
On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote: > > In addition to this I think it should be worth considering CC'ing Greg > to pull this fix into 4.15 stable tree. This isn't one he can cherry-pick, some munging required, in which case he usually wants a properly tested backport. -Mike
Hi. On pátek 9. února 2018 18:29:39 CET Mike Galbraith wrote: > On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote: > > In addition to this I think it should be worth considering CC'ing Greg > > to pull this fix into 4.15 stable tree. > > This isn't one he can cherry-pick, some munging required, in which case > he usually wants a properly tested backport. > > -Mike Maybe, this could be a good opportunity to push all the pending BFQ patches into the stable 4.15 branch? Because IIUC currently BFQ in 4.15 is just unusable. Paolo? --- block, bfq: add requeue-request hook bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED block, bfq: release oom-queue ref to root group on exit block, bfq: put async queues for root bfq groups too block, bfq: limit sectors served with interactive weight raising block, bfq: limit tags for writes and async I/O block, bfq: increase threshold to deem I/O as random block, bfq: remove superfluous check in queue-merging setup block, bfq: let a queue be merged only shortly after starting I/O block, bfq: check low_latency flag in bfq_bfqq_save_state() block, bfq: add missing rq_pos_tree update on rq removal block, bfq: fix occurrences of request finish method's old name block, bfq: consider also past I/O in soft real-time detection block, bfq: remove batches of confusing ifdefs
> Il giorno 10 feb 2018, alle ore 09:29, Oleksandr Natalenko <oleksandr@natalenko.name> ha scritto: > > Hi. > > On pátek 9. února 2018 18:29:39 CET Mike Galbraith wrote: >> On Fri, 2018-02-09 at 14:21 +0100, Oleksandr Natalenko wrote: >>> In addition to this I think it should be worth considering CC'ing Greg >>> to pull this fix into 4.15 stable tree. >> >> This isn't one he can cherry-pick, some munging required, in which case >> he usually wants a properly tested backport. >> >> -Mike > > Maybe, this could be a good opportunity to push all the pending BFQ patches > into the stable 4.15 branch? Because IIUC currently BFQ in 4.15 is just > unusable. > > Paolo? > Of course ok for me, and thanks Oleksandr for proposing this. These commits should apply cleanly on 4.15, and FWIW have been tested, by me and BFQ users, on 4.15 too in these months. Thanks, Paolo > --- > > block, bfq: add requeue-request hook > bfq-iosched: don't call bfqg_and_blkg_put for !CONFIG_BFQ_GROUP_IOSCHED > block, bfq: release oom-queue ref to root group on exit > block, bfq: put async queues for root bfq groups too > block, bfq: limit sectors served with interactive weight raising > block, bfq: limit tags for writes and async I/O > block, bfq: increase threshold to deem I/O as random > block, bfq: remove superfluous check in queue-merging setup > block, bfq: let a queue be merged only shortly after starting I/O > block, bfq: check low_latency flag in bfq_bfqq_save_state() > block, bfq: add missing rq_pos_tree update on rq removal > block, bfq: fix occurrences of request finish method's old name > block, bfq: consider also past I/O in soft real-time detection > block, bfq: remove batches of confusing ifdefs > >
Hi Paolo, On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote: > Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via > RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device > be re-inserted into the active I/O scheduler for that device. As a No, this behaviour isn't related with commit a6a252e64914, and it has been there since blk_mq_requeue_request() is introduced. And you can see blk_mq_requeue_request() is called by lots of drivers, especially it is often used in error handler, see SCSI's example. > consequence, I/O schedulers may get the same request inserted again, > even several times, without a finish_request invoked on that request > before each re-insertion. > > This fact is the cause of the failure reported in [1]. For an I/O > scheduler, every re-insertion of the same re-prepared request is > equivalent to the insertion of a new request. For schedulers like > mq-deadline or kyber, this fact causes no harm. In contrast, it > confuses a stateful scheduler like BFQ, which keeps state for an I/O > request, until the finish_request hook is invoked on the request. In > particular, BFQ may get stuck, waiting forever for the number of > request dispatches, of the same request, to be balanced by an equal > number of request completions (while there will be one completion for > that request). In this state, BFQ may refuse to serve I/O requests > from other bfq_queues. The hang reported in [1] then follows. > > However, the above re-prepared requests undergo a requeue, thus the > requeue_request hook of the active elevator is invoked for these > requests, if set. This commit then addresses the above issue by > properly implementing the hook requeue_request in BFQ. > > [1] https://marc.info/?l=linux-block&m=151211117608676 > > Reported-by: Ivan Kozik <ivan@ludios.org> > Reported-by: Alban Browaeys <alban.browaeys@gmail.com> > Tested-by: Mike Galbraith <efault@gmx.de> > Signed-off-by: Paolo Valente <paolo.valente@linaro.org> > Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com> > --- > V2: contains fix to bug reported in [2] > V3: implements the improvement suggested in [3] > > [2] https://lkml.org/lkml/2018/2/5/599 > [3] https://lkml.org/lkml/2018/2/7/532 > > block/bfq-iosched.c | 107 ++++++++++++++++++++++++++++++++++++++++------------ > 1 file changed, 82 insertions(+), 25 deletions(-) > > diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c > index 47e6ec7427c4..aeca22d91101 100644 > --- a/block/bfq-iosched.c > +++ b/block/bfq-iosched.c > @@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx) > } > > /* > - * We exploit the bfq_finish_request hook to decrement > - * rq_in_driver, but bfq_finish_request will not be > - * invoked on this request. So, to avoid unbalance, > - * just start this request, without incrementing > - * rq_in_driver. As a negative consequence, > - * rq_in_driver is deceptively lower than it should be > - * while this request is in service. This may cause > - * bfq_schedule_dispatch to be invoked uselessly. > + * We exploit the bfq_finish_requeue_request hook to > + * decrement rq_in_driver, but > + * bfq_finish_requeue_request will not be invoked on > + * this request. So, to avoid unbalance, just start > + * this request, without incrementing rq_in_driver. As > + * a negative consequence, rq_in_driver is deceptively > + * lower than it should be while this request is in > + * service. This may cause bfq_schedule_dispatch to be > + * invoked uselessly. > * > * As for implementing an exact solution, the > - * bfq_finish_request hook, if defined, is probably > - * invoked also on this request. So, by exploiting > - * this hook, we could 1) increment rq_in_driver here, > - * and 2) decrement it in bfq_finish_request. Such a > - * solution would let the value of the counter be > - * always accurate, but it would entail using an extra > - * interface function. This cost seems higher than the > - * benefit, being the frequency of non-elevator-private > + * bfq_finish_requeue_request hook, if defined, is > + * probably invoked also on this request. So, by > + * exploiting this hook, we could 1) increment > + * rq_in_driver here, and 2) decrement it in > + * bfq_finish_requeue_request. Such a solution would > + * let the value of the counter be always accurate, > + * but it would entail using an extra interface > + * function. This cost seems higher than the benefit, > + * being the frequency of non-elevator-private > * requests very low. > */ > goto start_rq; > @@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct request_queue *q, > unsigned int cmd_flags) {} > #endif > > +static void bfq_prepare_request(struct request *rq, struct bio *bio); > + > static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, > bool at_head) > { > @@ -4541,6 +4545,18 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, > else > list_add_tail(&rq->queuelist, &bfqd->dispatch); > } else { > + if (WARN_ON_ONCE(!bfqq)) { > + /* > + * This should never happen. Most likely rq is > + * a requeued regular request, being > + * re-inserted without being first > + * re-prepared. Do a prepare, to avoid > + * failure. > + */ > + bfq_prepare_request(rq, rq->bio); > + bfqq = RQ_BFQQ(rq); > + } > + > idle_timer_disabled = __bfq_insert_request(bfqd, rq); > /* > * Update bfqq, because, if a queue merge has occurred > @@ -4697,22 +4713,44 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd) > bfq_schedule_dispatch(bfqd); > } > > -static void bfq_finish_request_body(struct bfq_queue *bfqq) > +static void bfq_finish_requeue_request_body(struct bfq_queue *bfqq) > { > bfqq->allocated--; > > bfq_put_queue(bfqq); > } > > -static void bfq_finish_request(struct request *rq) > +/* > + * Handle either a requeue or a finish for rq. The things to do are > + * the same in both cases: all references to rq are to be dropped. In > + * particular, rq is considered completed from the point of view of > + * the scheduler. > + */ > +static void bfq_finish_requeue_request(struct request *rq) > { > - struct bfq_queue *bfqq; > + struct bfq_queue *bfqq = RQ_BFQQ(rq); > struct bfq_data *bfqd; > > - if (!rq->elv.icq) > + /* > + * Requeue and finish hooks are invoked in blk-mq without > + * checking whether the involved request is actually still > + * referenced in the scheduler. To handle this fact, the > + * following two checks make this function exit in case of > + * spurious invocations, for which there is nothing to do. > + * > + * First, check whether rq has nothing to do with an elevator. > + */ > + if (unlikely(!(rq->rq_flags & RQF_ELVPRIV))) > + return; > + > + /* > + * rq either is not associated with any icq, or is an already > + * requeued request that has not (yet) been re-inserted into > + * a bfq_queue. > + */ > + if (!rq->elv.icq || !bfqq) > return; > > - bfqq = RQ_BFQQ(rq); > bfqd = bfqq->bfqd; > > if (rq->rq_flags & RQF_STARTED) > @@ -4727,13 +4765,14 @@ static void bfq_finish_request(struct request *rq) > spin_lock_irqsave(&bfqd->lock, flags); > > bfq_completed_request(bfqq, bfqd); > - bfq_finish_request_body(bfqq); > + bfq_finish_requeue_request_body(bfqq); > > spin_unlock_irqrestore(&bfqd->lock, flags); > } else { > /* > * Request rq may be still/already in the scheduler, > - * in which case we need to remove it. And we cannot > + * in which case we need to remove it (this should > + * never happen in case of requeue). And we cannot > * defer such a check and removal, to avoid > * inconsistencies in the time interval from the end > * of this function to the start of the deferred work. > @@ -4748,9 +4787,26 @@ static void bfq_finish_request(struct request *rq) > bfqg_stats_update_io_remove(bfqq_group(bfqq), > rq->cmd_flags); > } > - bfq_finish_request_body(bfqq); > + bfq_finish_requeue_request_body(bfqq); > } > > + /* > + * Reset private fields. In case of a requeue, this allows > + * this function to correctly do nothing if it is spuriously > + * invoked again on this same request (see the check at the > + * beginning of the function). Probably, a better general > + * design would be to prevent blk-mq from invoking the requeue > + * or finish hooks of an elevator, for a request that is not > + * referred by that elevator. > + * > + * Resetting the following fields would break the > + * request-insertion logic if rq is re-inserted into a bfq > + * internal queue, without a re-preparation. Here we assume > + * that re-insertions of requeued requests, without > + * re-preparation, can happen only for pass_through or at_head > + * requests (which are not re-inserted into bfq internal > + * queues). > + */ > rq->elv.priv[0] = NULL; > rq->elv.priv[1] = NULL; > } > @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = { > .ops.mq = { > .limit_depth = bfq_limit_depth, > .prepare_request = bfq_prepare_request, > - .finish_request = bfq_finish_request, > + .requeue_request = bfq_finish_requeue_request, > + .finish_request = bfq_finish_requeue_request, > .exit_icq = bfq_exit_icq, > .insert_requests = bfq_insert_requests, > .dispatch_request = bfq_dispatch_request, This way may not be correct since blk_mq_sched_requeue_request() can be called for one request which won't enter io scheduler. __blk_mq_requeue_request() is called for two cases: - one is that the requeued request is added to hctx->dispatch, such as blk_mq_dispatch_rq_list() - another case is that the request is requeued to io scheduler, such as blk_mq_requeue_request(). For the 1st case, blk_mq_sched_requeue_request() shouldn't be called since it is nothing to do with scheduler, seems we only need to do that for 2nd case. So looks we need the following patch: diff --git a/block/blk-mq.c b/block/blk-mq.c index 23de7fd8099a..a216f3c3c3ce 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct request *rq) trace_block_rq_requeue(q, rq); wbt_requeue(q->rq_wb, &rq->issue_stat); - blk_mq_sched_requeue_request(rq); if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) { blk_mq_rq_update_state(rq, MQ_RQ_IDLE); @@ -725,6 +726,9 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list) { __blk_mq_requeue_request(rq); + /* this request will be re-inserted to io scheduler queue */ + blk_mq_sched_requeue_request(rq); + BUG_ON(blk_queued_rq(rq)); blk_mq_add_to_requeue_list(rq, true, kick_requeue_list); } Thanks, Ming
> Il giorno 23 feb 2018, alle ore 16:07, Ming Lei <ming.lei@redhat.com> ha scritto: > > Hi Paolo, > > On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote: >> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via >> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device >> be re-inserted into the active I/O scheduler for that device. As a > > No, this behaviour isn't related with commit a6a252e64914, and > it has been there since blk_mq_requeue_request() is introduced. > Hi Ming, actually, we wrote the above statement after simply following the call chain that led to the failure. In this respect, the change in commit a6a252e64914: static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, + bool has_sched, struct request *rq) { - if (rq->tag == -1) { + /* dispatch flush rq directly */ + if (rq->rq_flags & RQF_FLUSH_SEQ) { + spin_lock(&hctx->lock); + list_add(&rq->queuelist, &hctx->dispatch); + spin_unlock(&hctx->lock); + return true; + } + + if (has_sched) { rq->rq_flags |= RQF_SORTED; - return false; + WARN_ON(rq->tag != -1); } - /* - * If we already have a real request tag, send directly to - * the dispatch list. - */ - spin_lock(&hctx->lock); - list_add(&rq->queuelist, &hctx->dispatch); - spin_unlock(&hctx->lock); - return true; + return false; } makes blk_mq_sched_bypass_insert return false for all non-flush requests. From that, the anomaly described in our commit follows, for bfq any stateful scheduler that waits for the completion of requests that passed through it. I'm elaborating again a little bit on this in my replies to your next points below. I don't mean that this change is an error, it simply sends a stateful scheduler in an inconsistent state, unless the scheduler properly handles the requeue that precedes the re-insertion into the scheduler. If this clarifies the situation, but there is still some misleading statement in the commit, just let me better understand, and I'll be glad to rectify it, if possible somehow. > And you can see blk_mq_requeue_request() is called by lots of drivers, > especially it is often used in error handler, see SCSI's example. > >> consequence, I/O schedulers may get the same request inserted again, >> even several times, without a finish_request invoked on that request >> before each re-insertion. >> ... >> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = { >> .ops.mq = { >> .limit_depth = bfq_limit_depth, >> .prepare_request = bfq_prepare_request, >> - .finish_request = bfq_finish_request, >> + .requeue_request = bfq_finish_requeue_request, >> + .finish_request = bfq_finish_requeue_request, >> .exit_icq = bfq_exit_icq, >> .insert_requests = bfq_insert_requests, >> .dispatch_request = bfq_dispatch_request, > > This way may not be correct since blk_mq_sched_requeue_request() can be > called for one request which won't enter io scheduler. > Exactly, there are two cases: requeues that lead to subsequent re-insertions, and requeues that don't. The function bfq_finish_requeue_request handles both, and both need to be handled, to inform bfq that it has not to wait for the completion of rq any longer. One special case is when bfq_finish_requeue_request gets invoked even if rq has nothing to do with any scheduler. In that case, bfq_finish_requeue_request exists immediately. > __blk_mq_requeue_request() is called for two cases: > > - one is that the requeued request is added to hctx->dispatch, such > as blk_mq_dispatch_rq_list() yes > - another case is that the request is requeued to io scheduler, such as > blk_mq_requeue_request(). > yes > For the 1st case, blk_mq_sched_requeue_request() shouldn't be called > since it is nothing to do with scheduler, No, if that rq has been inserted and then extracted from the scheduler through a dispatch_request, then it has. The scheduler is stateful, and keeps state for rq, because it must do so, until a completion or a requeue arrive. In particular, bfq may decide that no other bfq_queues must be served before rq has been completed, because this is necessary to preserve its target service guarantees. If bfq is not informed, either through its completion or its requeue hook, then it will wait forever. > seems we only need to do that > for 2nd case. > > So looks we need the following patch: > > diff --git a/block/blk-mq.c b/block/blk-mq.c > index 23de7fd8099a..a216f3c3c3ce 100644 > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct request *rq) > > trace_block_rq_requeue(q, rq); > wbt_requeue(q->rq_wb, &rq->issue_stat); > - blk_mq_sched_requeue_request(rq); > By doing so, if rq has 'passed' through bfq, then you block again bfq forever. Of course, I may be wrong, as I don't have a complete mental model of all what blk-mq does around bfq. But I thin that you can quickly check whether a hang occurs again if you add this change. In particular, it should happen in the USB former failure case. Thanks, Paolo > if (blk_mq_rq_state(rq) != MQ_RQ_IDLE) { > blk_mq_rq_update_state(rq, MQ_RQ_IDLE); > @@ -725,6 +726,9 @@ void blk_mq_requeue_request(struct request *rq, bool kick_requeue_list) > { > __blk_mq_requeue_request(rq); > > + /* this request will be re-inserted to io scheduler queue */ > + blk_mq_sched_requeue_request(rq); > + > BUG_ON(blk_queued_rq(rq)); > blk_mq_add_to_requeue_list(rq, true, kick_requeue_list); > } > > > Thanks, > Ming
Hi Paolo, On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote: > > > > Il giorno 23 feb 2018, alle ore 16:07, Ming Lei <ming.lei@redhat.com> ha scritto: > > > > Hi Paolo, > > > > On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote: > >> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via > >> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device > >> be re-inserted into the active I/O scheduler for that device. As a > > > > No, this behaviour isn't related with commit a6a252e64914, and > > it has been there since blk_mq_requeue_request() is introduced. > > > > Hi Ming, > actually, we wrote the above statement after simply following the call > chain that led to the failure. In this respect, the change in commit > a6a252e64914: > > static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, > + bool has_sched, > struct request *rq) > { > - if (rq->tag == -1) { > + /* dispatch flush rq directly */ > + if (rq->rq_flags & RQF_FLUSH_SEQ) { > + spin_lock(&hctx->lock); > + list_add(&rq->queuelist, &hctx->dispatch); > + spin_unlock(&hctx->lock); > + return true; > + } > + > + if (has_sched) { > rq->rq_flags |= RQF_SORTED; > - return false; > + WARN_ON(rq->tag != -1); > } > > - /* > - * If we already have a real request tag, send directly to > - * the dispatch list. > - */ > - spin_lock(&hctx->lock); > - list_add(&rq->queuelist, &hctx->dispatch); > - spin_unlock(&hctx->lock); > - return true; > + return false; > } > > makes blk_mq_sched_bypass_insert return false for all non-flush > requests. From that, the anomaly described in our commit follows, for > bfq any stateful scheduler that waits for the completion of requests > that passed through it. I'm elaborating again a little bit on this in > my replies to your next points below. Before a6a252e64914, follows blk_mq_sched_bypass_insert() if (rq->tag == -1) { rq->rq_flags |= RQF_SORTED; return false; } /* * If we already have a real request tag, send directly to * the dispatch list. */ spin_lock(&hctx->lock); list_add(&rq->queuelist, &hctx->dispatch); spin_unlock(&hctx->lock); return true; This function still returns false for all non-flush request, so nothing changes wrt. this kind of handling. > > I don't mean that this change is an error, it simply sends a stateful > scheduler in an inconsistent state, unless the scheduler properly > handles the requeue that precedes the re-insertion into the > scheduler. > > If this clarifies the situation, but there is still some misleading > statement in the commit, just let me better understand, and I'll be > glad to rectify it, if possible somehow. > > > > And you can see blk_mq_requeue_request() is called by lots of drivers, > > especially it is often used in error handler, see SCSI's example. > > > >> consequence, I/O schedulers may get the same request inserted again, > >> even several times, without a finish_request invoked on that request > >> before each re-insertion. > >> > > ... > > >> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = { > >> .ops.mq = { > >> .limit_depth = bfq_limit_depth, > >> .prepare_request = bfq_prepare_request, > >> - .finish_request = bfq_finish_request, > >> + .requeue_request = bfq_finish_requeue_request, > >> + .finish_request = bfq_finish_requeue_request, > >> .exit_icq = bfq_exit_icq, > >> .insert_requests = bfq_insert_requests, > >> .dispatch_request = bfq_dispatch_request, > > > > This way may not be correct since blk_mq_sched_requeue_request() can be > > called for one request which won't enter io scheduler. > > > > Exactly, there are two cases: requeues that lead to subsequent > re-insertions, and requeues that don't. The function > bfq_finish_requeue_request handles both, and both need to be handled, > to inform bfq that it has not to wait for the completion of rq any > longer. > > One special case is when bfq_finish_requeue_request gets invoked even > if rq has nothing to do with any scheduler. In that case, > bfq_finish_requeue_request exists immediately. > > > > __blk_mq_requeue_request() is called for two cases: > > > > - one is that the requeued request is added to hctx->dispatch, such > > as blk_mq_dispatch_rq_list() > > yes > > > - another case is that the request is requeued to io scheduler, such as > > blk_mq_requeue_request(). > > > > yes > > > For the 1st case, blk_mq_sched_requeue_request() shouldn't be called > > since it is nothing to do with scheduler, > > No, if that rq has been inserted and then extracted from the scheduler > through a dispatch_request, then it has. The scheduler is stateful, > and keeps state for rq, because it must do so, until a completion or a > requeue arrive. In particular, bfq may decide that no other This 'requeue' to hctx->dispatch is invisible to io scheduler, and from io scheduler view, seems no difference between STS_OK and STS_RESOURCE, since this request will be submitted to lld automatically by blk-mq core. Also in legacy path, no such notification to io scheduler too. > bfq_queues must be served before rq has been completed, because this > is necessary to preserve its target service guarantees. If bfq is not > informed, either through its completion or its requeue hook, then it > will wait forever. The .finish_request will be called after this request is completed for this case, either after this io is completed by device, or timeout. Or .requeue_request will be called if the driver need to requeue it to io scheduler via blk_mq_requeue_request() explicitly. So io scheduler will be made sure to get notified. > > > seems we only need to do that > > for 2nd case. > > > > > So looks we need the following patch: > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c > > index 23de7fd8099a..a216f3c3c3ce 100644 > > --- a/block/blk-mq.c > > +++ b/block/blk-mq.c > > @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct request *rq) > > > > trace_block_rq_requeue(q, rq); > > wbt_requeue(q->rq_wb, &rq->issue_stat); > > - blk_mq_sched_requeue_request(rq); > > > > By doing so, if rq has 'passed' through bfq, then you block again bfq > forever. Could you explain it a bit why bfq is blocked by this way? > > Of course, I may be wrong, as I don't have a complete mental model of > all what blk-mq does around bfq. But I thin that you can quickly > check whether a hang occurs again if you add this change. In > particular, it should happen in the USB former failure case. USB/BFQ runs well on my parted test with this change, and this test can trigger the IO hang issue easily on kyber or without your patch of 'block, bfq: add requeue-request hook'. Thanks, Ming
> Il giorno 23 feb 2018, alle ore 17:17, Ming Lei <ming.lei@redhat.com> ha scritto: > > Hi Paolo, > > On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote: >> >> >>> Il giorno 23 feb 2018, alle ore 16:07, Ming Lei <ming.lei@redhat.com> ha scritto: >>> >>> Hi Paolo, >>> >>> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote: >>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via >>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device >>>> be re-inserted into the active I/O scheduler for that device. As a >>> >>> No, this behaviour isn't related with commit a6a252e64914, and >>> it has been there since blk_mq_requeue_request() is introduced. >>> >> >> Hi Ming, >> actually, we wrote the above statement after simply following the call >> chain that led to the failure. In this respect, the change in commit >> a6a252e64914: >> >> static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, >> + bool has_sched, >> struct request *rq) >> { >> - if (rq->tag == -1) { >> + /* dispatch flush rq directly */ >> + if (rq->rq_flags & RQF_FLUSH_SEQ) { >> + spin_lock(&hctx->lock); >> + list_add(&rq->queuelist, &hctx->dispatch); >> + spin_unlock(&hctx->lock); >> + return true; >> + } >> + >> + if (has_sched) { >> rq->rq_flags |= RQF_SORTED; >> - return false; >> + WARN_ON(rq->tag != -1); >> } >> >> - /* >> - * If we already have a real request tag, send directly to >> - * the dispatch list. >> - */ >> - spin_lock(&hctx->lock); >> - list_add(&rq->queuelist, &hctx->dispatch); >> - spin_unlock(&hctx->lock); >> - return true; >> + return false; >> } >> >> makes blk_mq_sched_bypass_insert return false for all non-flush >> requests. From that, the anomaly described in our commit follows, for >> bfq any stateful scheduler that waits for the completion of requests >> that passed through it. I'm elaborating again a little bit on this in >> my replies to your next points below. > > Before a6a252e64914, follows blk_mq_sched_bypass_insert() > > if (rq->tag == -1) { > rq->rq_flags |= RQF_SORTED; > return false; > } > > /* > * If we already have a real request tag, send directly to > * the dispatch list. > */ > spin_lock(&hctx->lock); > list_add(&rq->queuelist, &hctx->dispatch); > spin_unlock(&hctx->lock); > return true; > > This function still returns false for all non-flush request, so nothing > changes wrt. this kind of handling. > Yep Ming. I don't have the expertise to tell you why, but the failure in the USB case was caused by an rq that is not flush, but for which rq->tag != -1. So, the previous version of blk_mq_sched_bypass_insert returned true, and there was not failure, while after commit a6a252e64914 the function returns true and the failure occurs if bfq does not exploit the requeue hook. You have actually shown it yourself, several months ago, through the simple code instrumentation you made and used to show that bfq was stuck. And I guess it can still be reproduced very easily, unless something else has changed in blk-mq. BTW, if you can shed a light on this fact, that would be a great occasion to learn for me. >> >> I don't mean that this change is an error, it simply sends a stateful >> scheduler in an inconsistent state, unless the scheduler properly >> handles the requeue that precedes the re-insertion into the >> scheduler. >> >> If this clarifies the situation, but there is still some misleading >> statement in the commit, just let me better understand, and I'll be >> glad to rectify it, if possible somehow. >> >> >>> And you can see blk_mq_requeue_request() is called by lots of drivers, >>> especially it is often used in error handler, see SCSI's example. >>> >>>> consequence, I/O schedulers may get the same request inserted again, >>>> even several times, without a finish_request invoked on that request >>>> before each re-insertion. >>>> >> >> ... >> >>>> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = { >>>> .ops.mq = { >>>> .limit_depth = bfq_limit_depth, >>>> .prepare_request = bfq_prepare_request, >>>> - .finish_request = bfq_finish_request, >>>> + .requeue_request = bfq_finish_requeue_request, >>>> + .finish_request = bfq_finish_requeue_request, >>>> .exit_icq = bfq_exit_icq, >>>> .insert_requests = bfq_insert_requests, >>>> .dispatch_request = bfq_dispatch_request, >>> >>> This way may not be correct since blk_mq_sched_requeue_request() can be >>> called for one request which won't enter io scheduler. >>> >> >> Exactly, there are two cases: requeues that lead to subsequent >> re-insertions, and requeues that don't. The function >> bfq_finish_requeue_request handles both, and both need to be handled, >> to inform bfq that it has not to wait for the completion of rq any >> longer. >> >> One special case is when bfq_finish_requeue_request gets invoked even >> if rq has nothing to do with any scheduler. In that case, >> bfq_finish_requeue_request exists immediately. >> >> >>> __blk_mq_requeue_request() is called for two cases: >>> >>> - one is that the requeued request is added to hctx->dispatch, such >>> as blk_mq_dispatch_rq_list() >> >> yes >> >>> - another case is that the request is requeued to io scheduler, such as >>> blk_mq_requeue_request(). >>> >> >> yes >> >>> For the 1st case, blk_mq_sched_requeue_request() shouldn't be called >>> since it is nothing to do with scheduler, >> >> No, if that rq has been inserted and then extracted from the scheduler >> through a dispatch_request, then it has. The scheduler is stateful, >> and keeps state for rq, because it must do so, until a completion or a >> requeue arrive. In particular, bfq may decide that no other > > This 'requeue' to hctx->dispatch is invisible to io scheduler, and from > io scheduler view, seems no difference between STS_OK and STS_RESOURCE, > since this request will be submitted to lld automatically by blk-mq > core. > > Also in legacy path, no such notification to io scheduler too. ok, still, for reasons that at this point I don't know, in the last 9-10 years it has never been reported a failure caused by a request that has been first dispatched by bfq and then re-inserted into bfq without being completed. This new event is the problem. >> bfq_queues must be served before rq has been completed, because this >> is necessary to preserve its target service guarantees. If bfq is not >> informed, either through its completion or its requeue hook, then it >> will wait forever. > > The .finish_request will be called after this request is completed for this > case, either after this io is completed by device, or timeout. > > Or .requeue_request will be called if the driver need to requeue it to > io scheduler via blk_mq_requeue_request() explicitly. > > So io scheduler will be made sure to get notified. > So, you mean that the scheduler will get notified in a way or the other, if it has a requeue hook defined, right? Or, simply, the request will be eventually completed, thus unblocking the scheduler? >> >>> seems we only need to do that >>> for 2nd case. >>> >> >>> So looks we need the following patch: >>> >>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>> index 23de7fd8099a..a216f3c3c3ce 100644 >>> --- a/block/blk-mq.c >>> +++ b/block/blk-mq.c >>> @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct request *rq) >>> >>> trace_block_rq_requeue(q, rq); >>> wbt_requeue(q->rq_wb, &rq->issue_stat); >>> - blk_mq_sched_requeue_request(rq); >>> >> >> By doing so, if rq has 'passed' through bfq, then you block again bfq >> forever. > > Could you explain it a bit why bfq is blocked by this way? It's a theoretical problem (unfortunately not just a contingency you can work around). If you want to guarantee that a process doing sync I/O, and with a higher weight than other processes, gets a share of the total throughput proportional to its weight, then you have to idle the device during the service slot of that process. Otherwise, the pending requests of the other processes will simply slip through *every time* the high-weight process has no pending request because it is blocked by one of its sync request. The final result will be loss of control on the percentage of requests of the high-weight process that get dispatched w.r.t. to the total number of requests dispatched. So, failure to provide the process with its due share of the throughput. I'm working on solutions to compensate for this nasty service constraint, and its effects on throughput, but this is another story. > >> >> Of course, I may be wrong, as I don't have a complete mental model of >> all what blk-mq does around bfq. But I thin that you can quickly >> check whether a hang occurs again if you add this change. In >> particular, it should happen in the USB former failure case. > > USB/BFQ runs well on my parted test with this change, and this test can > trigger the IO hang issue easily on kyber or without your patch of 'block, > bfq: add requeue-request hook'. > Great! According to the facts I reported at the beginning of this email, I guess your patch works because it avoids the nasty case for bfq, i.e., a request re-inserted into bfq without being completed first, right? And then, about the requeues without re-insertions, everything works because the latter type of requests eventually get a completion. Is it correct? Sorry for being pedantic, but because of our lack of expertise on these requeueing mechanism, we worked really a lot on this commit, and it would be frustrating if we bumped again into troubles, after removing it. Apart from that, if this commit is useless, it is perfectly fine for me that it is reverted and replaced by a better solution. In the end, less code in bfq means less room for bugs ;) Thanks, Paolo > Thanks, > Ming
On Sat, Feb 24, 2018 at 08:54:31AM +0100, Paolo Valente wrote: > > > > Il giorno 23 feb 2018, alle ore 17:17, Ming Lei <ming.lei@redhat.com> ha scritto: > > > > Hi Paolo, > > > > On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote: > >> > >> > >>> Il giorno 23 feb 2018, alle ore 16:07, Ming Lei <ming.lei@redhat.com> ha scritto: > >>> > >>> Hi Paolo, > >>> > >>> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote: > >>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via > >>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device > >>>> be re-inserted into the active I/O scheduler for that device. As a > >>> > >>> No, this behaviour isn't related with commit a6a252e64914, and > >>> it has been there since blk_mq_requeue_request() is introduced. > >>> > >> > >> Hi Ming, > >> actually, we wrote the above statement after simply following the call > >> chain that led to the failure. In this respect, the change in commit > >> a6a252e64914: > >> > >> static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, > >> + bool has_sched, > >> struct request *rq) > >> { > >> - if (rq->tag == -1) { > >> + /* dispatch flush rq directly */ > >> + if (rq->rq_flags & RQF_FLUSH_SEQ) { > >> + spin_lock(&hctx->lock); > >> + list_add(&rq->queuelist, &hctx->dispatch); > >> + spin_unlock(&hctx->lock); > >> + return true; > >> + } > >> + > >> + if (has_sched) { > >> rq->rq_flags |= RQF_SORTED; > >> - return false; > >> + WARN_ON(rq->tag != -1); > >> } > >> > >> - /* > >> - * If we already have a real request tag, send directly to > >> - * the dispatch list. > >> - */ > >> - spin_lock(&hctx->lock); > >> - list_add(&rq->queuelist, &hctx->dispatch); > >> - spin_unlock(&hctx->lock); > >> - return true; > >> + return false; > >> } > >> > >> makes blk_mq_sched_bypass_insert return false for all non-flush > >> requests. From that, the anomaly described in our commit follows, for > >> bfq any stateful scheduler that waits for the completion of requests > >> that passed through it. I'm elaborating again a little bit on this in > >> my replies to your next points below. > > > > Before a6a252e64914, follows blk_mq_sched_bypass_insert() > > > > if (rq->tag == -1) { > > rq->rq_flags |= RQF_SORTED; > > return false; > > } > > > > /* > > * If we already have a real request tag, send directly to > > * the dispatch list. > > */ > > spin_lock(&hctx->lock); > > list_add(&rq->queuelist, &hctx->dispatch); > > spin_unlock(&hctx->lock); > > return true; > > > > This function still returns false for all non-flush request, so nothing > > changes wrt. this kind of handling. > > > > Yep Ming. I don't have the expertise to tell you why, but the failure > in the USB case was caused by an rq that is not flush, but for which > rq->tag != -1. So, the previous version of blk_mq_sched_bypass_insert > returned true, and there was not failure, while after commit > a6a252e64914 the function returns true and the failure occurs if bfq > does not exploit the requeue hook. > > You have actually shown it yourself, several months ago, through the > simple code instrumentation you made and used to show that bfq was > stuck. And I guess it can still be reproduced very easily, unless > something else has changed in blk-mq. > > BTW, if you can shed a light on this fact, that would be a great > occasion to learn for me. The difference should be made by commit 923218f6166a84 (blk-mq: don't allocate driver tag upfront for flush rq), which releases driver tag before requeuing the request, but that is definitely the correct thing to do. > > >> > >> I don't mean that this change is an error, it simply sends a stateful > >> scheduler in an inconsistent state, unless the scheduler properly > >> handles the requeue that precedes the re-insertion into the > >> scheduler. > >> > >> If this clarifies the situation, but there is still some misleading > >> statement in the commit, just let me better understand, and I'll be > >> glad to rectify it, if possible somehow. > >> > >> > >>> And you can see blk_mq_requeue_request() is called by lots of drivers, > >>> especially it is often used in error handler, see SCSI's example. > >>> > >>>> consequence, I/O schedulers may get the same request inserted again, > >>>> even several times, without a finish_request invoked on that request > >>>> before each re-insertion. > >>>> > >> > >> ... > >> > >>>> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = { > >>>> .ops.mq = { > >>>> .limit_depth = bfq_limit_depth, > >>>> .prepare_request = bfq_prepare_request, > >>>> - .finish_request = bfq_finish_request, > >>>> + .requeue_request = bfq_finish_requeue_request, > >>>> + .finish_request = bfq_finish_requeue_request, > >>>> .exit_icq = bfq_exit_icq, > >>>> .insert_requests = bfq_insert_requests, > >>>> .dispatch_request = bfq_dispatch_request, > >>> > >>> This way may not be correct since blk_mq_sched_requeue_request() can be > >>> called for one request which won't enter io scheduler. > >>> > >> > >> Exactly, there are two cases: requeues that lead to subsequent > >> re-insertions, and requeues that don't. The function > >> bfq_finish_requeue_request handles both, and both need to be handled, > >> to inform bfq that it has not to wait for the completion of rq any > >> longer. > >> > >> One special case is when bfq_finish_requeue_request gets invoked even > >> if rq has nothing to do with any scheduler. In that case, > >> bfq_finish_requeue_request exists immediately. > >> > >> > >>> __blk_mq_requeue_request() is called for two cases: > >>> > >>> - one is that the requeued request is added to hctx->dispatch, such > >>> as blk_mq_dispatch_rq_list() > >> > >> yes > >> > >>> - another case is that the request is requeued to io scheduler, such as > >>> blk_mq_requeue_request(). > >>> > >> > >> yes > >> > >>> For the 1st case, blk_mq_sched_requeue_request() shouldn't be called > >>> since it is nothing to do with scheduler, > >> > >> No, if that rq has been inserted and then extracted from the scheduler > >> through a dispatch_request, then it has. The scheduler is stateful, > >> and keeps state for rq, because it must do so, until a completion or a > >> requeue arrive. In particular, bfq may decide that no other > > > > This 'requeue' to hctx->dispatch is invisible to io scheduler, and from > > io scheduler view, seems no difference between STS_OK and STS_RESOURCE, > > since this request will be submitted to lld automatically by blk-mq > > core. > > > > Also in legacy path, no such notification to io scheduler too. > > ok, still, for reasons that at this point I don't know, in the last > 9-10 years it has never been reported a failure caused by a request > that has been first dispatched by bfq and then re-inserted into bfq > without being completed. This new event is the problem. This mechanism is invented by blk-mq, and I guess it is for avoiding to disable irq when acquiring hctx->lock since blk-mq's completion handler is done in irq context. Legacy's requeue won't re-insert one request to scheduler queue. > > > >> bfq_queues must be served before rq has been completed, because this > >> is necessary to preserve its target service guarantees. If bfq is not > >> informed, either through its completion or its requeue hook, then it > >> will wait forever. > > > > The .finish_request will be called after this request is completed for this > > case, either after this io is completed by device, or timeout. > > > > Or .requeue_request will be called if the driver need to requeue it to > > io scheduler via blk_mq_requeue_request() explicitly. > > > > So io scheduler will be made sure to get notified. > > > > So, you mean that the scheduler will get notified in a way or the > other, if it has a requeue hook defined, right? Or, simply, the > request will be eventually completed, thus unblocking the scheduler? Right, once .queue_rq() returns BLK_STS_RESOURCE, this rq is added to hctx->dispatch list, and finally it will be completed via driver or timeout handler, or it may be re-inserted to io scheduler queue via blk_mq_requeue_request() by driver in driver's completion handler. So io scheduler will get the notification finally. > > > >> > >>> seems we only need to do that > >>> for 2nd case. > >>> > >> > >>> So looks we need the following patch: > >>> > >>> diff --git a/block/blk-mq.c b/block/blk-mq.c > >>> index 23de7fd8099a..a216f3c3c3ce 100644 > >>> --- a/block/blk-mq.c > >>> +++ b/block/blk-mq.c > >>> @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct request *rq) > >>> > >>> trace_block_rq_requeue(q, rq); > >>> wbt_requeue(q->rq_wb, &rq->issue_stat); > >>> - blk_mq_sched_requeue_request(rq); > >>> > >> > >> By doing so, if rq has 'passed' through bfq, then you block again bfq > >> forever. > > > > Could you explain it a bit why bfq is blocked by this way? > > It's a theoretical problem (unfortunately not just a contingency you > can work around). If you want to guarantee that a process doing sync > I/O, and with a higher weight than other processes, gets a share of > the total throughput proportional to its weight, then you have to idle > the device during the service slot of that process. Otherwise, the > pending requests of the other processes will simply slip through > *every time* the high-weight process has no pending request because it > is blocked by one of its sync request. The final result will be loss > of control on the percentage of requests of the high-weight process > that get dispatched w.r.t. to the total number of requests > dispatched. So, failure to provide the process with its due share of > the throughput. > > I'm working on solutions to compensate for this nasty service > constraint, and its effects on throughput, but this is another story. As I explained, io scheduler will get notified via either .finish_request or .requeue_request finally, so there shouldn't be such block if this patch is in. > > > > >> > >> Of course, I may be wrong, as I don't have a complete mental model of > >> all what blk-mq does around bfq. But I thin that you can quickly > >> check whether a hang occurs again if you add this change. In > >> particular, it should happen in the USB former failure case. > > > > USB/BFQ runs well on my parted test with this change, and this test can > > trigger the IO hang issue easily on kyber or without your patch of 'block, > > bfq: add requeue-request hook'. > > > > Great! According to the facts I reported at the beginning of > this email, I guess your patch works because it avoids the > nasty case for bfq, i.e., a request re-inserted into bfq without being > completed first, right? BFQ still can get notified via .requeue_request when one request is re-inserted to BFQ with this patch. > And then, about the requeues without > re-insertions, everything works because the latter type of requests > eventually get a completion. Is it correct? Right. > > Sorry for being pedantic, but because of our lack of expertise on > these requeueing mechanism, we worked really a lot on this commit, and > it would be frustrating if we bumped again into troubles, after > removing it. Apart from that, if this commit is useless, it is > perfectly fine for me that it is reverted and replaced by a better > solution. In the end, less code in bfq means less room for bugs ;) I agree. We should deal with this issue in a simple/clean way, just like the patch of 'block: kyber: fix domain token leak during requeue'. Now could you please let us know if you are fine with the patch of 'blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch'? Thanks, Ming
> Il giorno 24 feb 2018, alle ore 13:05, Ming Lei <ming.lei@redhat.com> ha scritto: > > On Sat, Feb 24, 2018 at 08:54:31AM +0100, Paolo Valente wrote: >> >> >>> Il giorno 23 feb 2018, alle ore 17:17, Ming Lei <ming.lei@redhat.com> ha scritto: >>> >>> Hi Paolo, >>> >>> On Fri, Feb 23, 2018 at 04:41:36PM +0100, Paolo Valente wrote: >>>> >>>> >>>>> Il giorno 23 feb 2018, alle ore 16:07, Ming Lei <ming.lei@redhat.com> ha scritto: >>>>> >>>>> Hi Paolo, >>>>> >>>>> On Wed, Feb 07, 2018 at 10:19:20PM +0100, Paolo Valente wrote: >>>>>> Commit 'a6a252e64914 ("blk-mq-sched: decide how to handle flush rq via >>>>>> RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device >>>>>> be re-inserted into the active I/O scheduler for that device. As a >>>>> >>>>> No, this behaviour isn't related with commit a6a252e64914, and >>>>> it has been there since blk_mq_requeue_request() is introduced. >>>>> >>>> >>>> Hi Ming, >>>> actually, we wrote the above statement after simply following the call >>>> chain that led to the failure. In this respect, the change in commit >>>> a6a252e64914: >>>> >>>> static bool blk_mq_sched_bypass_insert(struct blk_mq_hw_ctx *hctx, >>>> + bool has_sched, >>>> struct request *rq) >>>> { >>>> - if (rq->tag == -1) { >>>> + /* dispatch flush rq directly */ >>>> + if (rq->rq_flags & RQF_FLUSH_SEQ) { >>>> + spin_lock(&hctx->lock); >>>> + list_add(&rq->queuelist, &hctx->dispatch); >>>> + spin_unlock(&hctx->lock); >>>> + return true; >>>> + } >>>> + >>>> + if (has_sched) { >>>> rq->rq_flags |= RQF_SORTED; >>>> - return false; >>>> + WARN_ON(rq->tag != -1); >>>> } >>>> >>>> - /* >>>> - * If we already have a real request tag, send directly to >>>> - * the dispatch list. >>>> - */ >>>> - spin_lock(&hctx->lock); >>>> - list_add(&rq->queuelist, &hctx->dispatch); >>>> - spin_unlock(&hctx->lock); >>>> - return true; >>>> + return false; >>>> } >>>> >>>> makes blk_mq_sched_bypass_insert return false for all non-flush >>>> requests. From that, the anomaly described in our commit follows, for >>>> bfq any stateful scheduler that waits for the completion of requests >>>> that passed through it. I'm elaborating again a little bit on this in >>>> my replies to your next points below. >>> >>> Before a6a252e64914, follows blk_mq_sched_bypass_insert() >>> >>> if (rq->tag == -1) { >>> rq->rq_flags |= RQF_SORTED; >>> return false; >>> } >>> >>> /* >>> * If we already have a real request tag, send directly to >>> * the dispatch list. >>> */ >>> spin_lock(&hctx->lock); >>> list_add(&rq->queuelist, &hctx->dispatch); >>> spin_unlock(&hctx->lock); >>> return true; >>> >>> This function still returns false for all non-flush request, so nothing >>> changes wrt. this kind of handling. >>> >> >> Yep Ming. I don't have the expertise to tell you why, but the failure >> in the USB case was caused by an rq that is not flush, but for which >> rq->tag != -1. So, the previous version of blk_mq_sched_bypass_insert >> returned true, and there was not failure, while after commit >> a6a252e64914 the function returns true and the failure occurs if bfq >> does not exploit the requeue hook. >> >> You have actually shown it yourself, several months ago, through the >> simple code instrumentation you made and used to show that bfq was >> stuck. And I guess it can still be reproduced very easily, unless >> something else has changed in blk-mq. >> >> BTW, if you can shed a light on this fact, that would be a great >> occasion to learn for me. > > The difference should be made by commit 923218f6166a84 (blk-mq: don't > allocate driver tag upfront for flush rq), which releases driver tag > before requeuing the request, but that is definitely the correct thing > to do. > Ok. We just did a bisection, guided by changes in the function blk_mq_sched_bypass_insert, as that is the function whose different return value led to the hang. And commit a6a252e64914 apparently was the latest commit after which the hang occurs. Anyway, what matters is just that, from some commit on, a requeued request that before that commit was not re-inserted into the scheduler started to be re-inserted. And this was the trigger of the failure. >> >>>> >>>> I don't mean that this change is an error, it simply sends a stateful >>>> scheduler in an inconsistent state, unless the scheduler properly >>>> handles the requeue that precedes the re-insertion into the >>>> scheduler. >>>> >>>> If this clarifies the situation, but there is still some misleading >>>> statement in the commit, just let me better understand, and I'll be >>>> glad to rectify it, if possible somehow. >>>> >>>> >>>>> And you can see blk_mq_requeue_request() is called by lots of drivers, >>>>> especially it is often used in error handler, see SCSI's example. >>>>> >>>>>> consequence, I/O schedulers may get the same request inserted again, >>>>>> even several times, without a finish_request invoked on that request >>>>>> before each re-insertion. >>>>>> >>>> >>>> ... >>>> >>>>>> @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = { >>>>>> .ops.mq = { >>>>>> .limit_depth = bfq_limit_depth, >>>>>> .prepare_request = bfq_prepare_request, >>>>>> - .finish_request = bfq_finish_request, >>>>>> + .requeue_request = bfq_finish_requeue_request, >>>>>> + .finish_request = bfq_finish_requeue_request, >>>>>> .exit_icq = bfq_exit_icq, >>>>>> .insert_requests = bfq_insert_requests, >>>>>> .dispatch_request = bfq_dispatch_request, >>>>> >>>>> This way may not be correct since blk_mq_sched_requeue_request() can be >>>>> called for one request which won't enter io scheduler. >>>>> >>>> >>>> Exactly, there are two cases: requeues that lead to subsequent >>>> re-insertions, and requeues that don't. The function >>>> bfq_finish_requeue_request handles both, and both need to be handled, >>>> to inform bfq that it has not to wait for the completion of rq any >>>> longer. >>>> >>>> One special case is when bfq_finish_requeue_request gets invoked even >>>> if rq has nothing to do with any scheduler. In that case, >>>> bfq_finish_requeue_request exists immediately. >>>> >>>> >>>>> __blk_mq_requeue_request() is called for two cases: >>>>> >>>>> - one is that the requeued request is added to hctx->dispatch, such >>>>> as blk_mq_dispatch_rq_list() >>>> >>>> yes >>>> >>>>> - another case is that the request is requeued to io scheduler, such as >>>>> blk_mq_requeue_request(). >>>>> >>>> >>>> yes >>>> >>>>> For the 1st case, blk_mq_sched_requeue_request() shouldn't be called >>>>> since it is nothing to do with scheduler, >>>> >>>> No, if that rq has been inserted and then extracted from the scheduler >>>> through a dispatch_request, then it has. The scheduler is stateful, >>>> and keeps state for rq, because it must do so, until a completion or a >>>> requeue arrive. In particular, bfq may decide that no other >>> >>> This 'requeue' to hctx->dispatch is invisible to io scheduler, and from >>> io scheduler view, seems no difference between STS_OK and STS_RESOURCE, >>> since this request will be submitted to lld automatically by blk-mq >>> core. >>> >>> Also in legacy path, no such notification to io scheduler too. >> >> ok, still, for reasons that at this point I don't know, in the last >> 9-10 years it has never been reported a failure caused by a request >> that has been first dispatched by bfq and then re-inserted into bfq >> without being completed. This new event is the problem. > > This mechanism is invented by blk-mq, and I guess it is for avoiding to > disable irq when acquiring hctx->lock since blk-mq's completion handler > is done in irq context. > > Legacy's requeue won't re-insert one request to scheduler queue. ok, that's the root cause of this issue then. > >> >> >>>> bfq_queues must be served before rq has been completed, because this >>>> is necessary to preserve its target service guarantees. If bfq is not >>>> informed, either through its completion or its requeue hook, then it >>>> will wait forever. >>> >>> The .finish_request will be called after this request is completed for this >>> case, either after this io is completed by device, or timeout. >>> >>> Or .requeue_request will be called if the driver need to requeue it to >>> io scheduler via blk_mq_requeue_request() explicitly. >>> >>> So io scheduler will be made sure to get notified. >>> >> >> So, you mean that the scheduler will get notified in a way or the >> other, if it has a requeue hook defined, right? Or, simply, the >> request will be eventually completed, thus unblocking the scheduler? > > Right, once .queue_rq() returns BLK_STS_RESOURCE, this rq is added to > hctx->dispatch list, and finally it will be completed via driver or > timeout handler, or it may be re-inserted to io scheduler queue via > blk_mq_requeue_request() by driver in driver's completion handler. > > So io scheduler will get the notification finally. Yes, provided that the scheduler has a requeue hook, otherwise is one case it is not notified before the re-isnertion. And from this troubles follow with a stateful scheduler like bfq. > >> >> >>>> >>>>> seems we only need to do that >>>>> for 2nd case. >>>>> >>>> >>>>> So looks we need the following patch: >>>>> >>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c >>>>> index 23de7fd8099a..a216f3c3c3ce 100644 >>>>> --- a/block/blk-mq.c >>>>> +++ b/block/blk-mq.c >>>>> @@ -712,7 +714,6 @@ static void __blk_mq_requeue_request(struct request *rq) >>>>> >>>>> trace_block_rq_requeue(q, rq); >>>>> wbt_requeue(q->rq_wb, &rq->issue_stat); >>>>> - blk_mq_sched_requeue_request(rq); >>>>> >>>> >>>> By doing so, if rq has 'passed' through bfq, then you block again bfq >>>> forever. >>> >>> Could you explain it a bit why bfq is blocked by this way? >> >> It's a theoretical problem (unfortunately not just a contingency you >> can work around). If you want to guarantee that a process doing sync >> I/O, and with a higher weight than other processes, gets a share of >> the total throughput proportional to its weight, then you have to idle >> the device during the service slot of that process. Otherwise, the >> pending requests of the other processes will simply slip through >> *every time* the high-weight process has no pending request because it >> is blocked by one of its sync request. The final result will be loss >> of control on the percentage of requests of the high-weight process >> that get dispatched w.r.t. to the total number of requests >> dispatched. So, failure to provide the process with its due share of >> the throughput. >> >> I'm working on solutions to compensate for this nasty service >> constraint, and its effects on throughput, but this is another story. > > As I explained, io scheduler will get notified via either > .finish_request or .requeue_request finally, so there shouldn't be > such block if this patch is in. > Well, it depends on which patch is your "this patch" :) I mean, if you refer to your new patch, then I'm afraid the answer is "no, exactly because of what you explained". In fact, in case of a re-insertion, the scheduler gets notified through its requeue hook. But for this mechanism to work, the scheduler must have a its requeue hook defined. bfq had no such a hook defined, and our patch defines exactly it. So, I don'n see how the re-insertion case might be handled correctly without our patch. Maybe I'm just missing something here. >> >>> >>>> >>>> Of course, I may be wrong, as I don't have a complete mental model of >>>> all what blk-mq does around bfq. But I thin that you can quickly >>>> check whether a hang occurs again if you add this change. In >>>> particular, it should happen in the USB former failure case. >>> >>> USB/BFQ runs well on my parted test with this change, and this test can >>> trigger the IO hang issue easily on kyber or without your patch of 'block, >>> bfq: add requeue-request hook'. >>> >> >> Great! According to the facts I reported at the beginning of >> this email, I guess your patch works because it avoids the >> nasty case for bfq, i.e., a request re-inserted into bfq without being >> completed first, right? > > BFQ still can get notified via .requeue_request when one request is re-inserted > to BFQ with this patch. > >> And then, about the requeues without >> re-insertions, everything works because the latter type of requests >> eventually get a completion. Is it correct? > > Right. > >> >> Sorry for being pedantic, but because of our lack of expertise on >> these requeueing mechanism, we worked really a lot on this commit, and >> it would be frustrating if we bumped again into troubles, after >> removing it. Apart from that, if this commit is useless, it is >> perfectly fine for me that it is reverted and replaced by a better >> solution. In the end, less code in bfq means less room for bugs ;) > > I agree. We should deal with this issue in a simple/clean way, just like > the patch of 'block: kyber: fix domain token leak during requeue'. > Our patch "block, bfq: add requeue-request hook" is exactly the bfq counterpart of the patch 'block: kyber: fix domain token leak during requeue' > Now could you please let us know if you are fine with the patch of 'blk-mq: > don't call io sched's .requeue_request when requeueing rq to ->dispatch'? > Absolutely. Just, in view of what you explained, and of what I wrote in this reply, such a patch does not seem to make our patch useless. Thanks, Paolo > > Thanks, > Ming
diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c index 47e6ec7427c4..aeca22d91101 100644 --- a/block/bfq-iosched.c +++ b/block/bfq-iosched.c @@ -3823,24 +3823,26 @@ static struct request *__bfq_dispatch_request(struct blk_mq_hw_ctx *hctx) } /* - * We exploit the bfq_finish_request hook to decrement - * rq_in_driver, but bfq_finish_request will not be - * invoked on this request. So, to avoid unbalance, - * just start this request, without incrementing - * rq_in_driver. As a negative consequence, - * rq_in_driver is deceptively lower than it should be - * while this request is in service. This may cause - * bfq_schedule_dispatch to be invoked uselessly. + * We exploit the bfq_finish_requeue_request hook to + * decrement rq_in_driver, but + * bfq_finish_requeue_request will not be invoked on + * this request. So, to avoid unbalance, just start + * this request, without incrementing rq_in_driver. As + * a negative consequence, rq_in_driver is deceptively + * lower than it should be while this request is in + * service. This may cause bfq_schedule_dispatch to be + * invoked uselessly. * * As for implementing an exact solution, the - * bfq_finish_request hook, if defined, is probably - * invoked also on this request. So, by exploiting - * this hook, we could 1) increment rq_in_driver here, - * and 2) decrement it in bfq_finish_request. Such a - * solution would let the value of the counter be - * always accurate, but it would entail using an extra - * interface function. This cost seems higher than the - * benefit, being the frequency of non-elevator-private + * bfq_finish_requeue_request hook, if defined, is + * probably invoked also on this request. So, by + * exploiting this hook, we could 1) increment + * rq_in_driver here, and 2) decrement it in + * bfq_finish_requeue_request. Such a solution would + * let the value of the counter be always accurate, + * but it would entail using an extra interface + * function. This cost seems higher than the benefit, + * being the frequency of non-elevator-private * requests very low. */ goto start_rq; @@ -4515,6 +4517,8 @@ static inline void bfq_update_insert_stats(struct request_queue *q, unsigned int cmd_flags) {} #endif +static void bfq_prepare_request(struct request *rq, struct bio *bio); + static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, bool at_head) { @@ -4541,6 +4545,18 @@ static void bfq_insert_request(struct blk_mq_hw_ctx *hctx, struct request *rq, else list_add_tail(&rq->queuelist, &bfqd->dispatch); } else { + if (WARN_ON_ONCE(!bfqq)) { + /* + * This should never happen. Most likely rq is + * a requeued regular request, being + * re-inserted without being first + * re-prepared. Do a prepare, to avoid + * failure. + */ + bfq_prepare_request(rq, rq->bio); + bfqq = RQ_BFQQ(rq); + } + idle_timer_disabled = __bfq_insert_request(bfqd, rq); /* * Update bfqq, because, if a queue merge has occurred @@ -4697,22 +4713,44 @@ static void bfq_completed_request(struct bfq_queue *bfqq, struct bfq_data *bfqd) bfq_schedule_dispatch(bfqd); } -static void bfq_finish_request_body(struct bfq_queue *bfqq) +static void bfq_finish_requeue_request_body(struct bfq_queue *bfqq) { bfqq->allocated--; bfq_put_queue(bfqq); } -static void bfq_finish_request(struct request *rq) +/* + * Handle either a requeue or a finish for rq. The things to do are + * the same in both cases: all references to rq are to be dropped. In + * particular, rq is considered completed from the point of view of + * the scheduler. + */ +static void bfq_finish_requeue_request(struct request *rq) { - struct bfq_queue *bfqq; + struct bfq_queue *bfqq = RQ_BFQQ(rq); struct bfq_data *bfqd; - if (!rq->elv.icq) + /* + * Requeue and finish hooks are invoked in blk-mq without + * checking whether the involved request is actually still + * referenced in the scheduler. To handle this fact, the + * following two checks make this function exit in case of + * spurious invocations, for which there is nothing to do. + * + * First, check whether rq has nothing to do with an elevator. + */ + if (unlikely(!(rq->rq_flags & RQF_ELVPRIV))) + return; + + /* + * rq either is not associated with any icq, or is an already + * requeued request that has not (yet) been re-inserted into + * a bfq_queue. + */ + if (!rq->elv.icq || !bfqq) return; - bfqq = RQ_BFQQ(rq); bfqd = bfqq->bfqd; if (rq->rq_flags & RQF_STARTED) @@ -4727,13 +4765,14 @@ static void bfq_finish_request(struct request *rq) spin_lock_irqsave(&bfqd->lock, flags); bfq_completed_request(bfqq, bfqd); - bfq_finish_request_body(bfqq); + bfq_finish_requeue_request_body(bfqq); spin_unlock_irqrestore(&bfqd->lock, flags); } else { /* * Request rq may be still/already in the scheduler, - * in which case we need to remove it. And we cannot + * in which case we need to remove it (this should + * never happen in case of requeue). And we cannot * defer such a check and removal, to avoid * inconsistencies in the time interval from the end * of this function to the start of the deferred work. @@ -4748,9 +4787,26 @@ static void bfq_finish_request(struct request *rq) bfqg_stats_update_io_remove(bfqq_group(bfqq), rq->cmd_flags); } - bfq_finish_request_body(bfqq); + bfq_finish_requeue_request_body(bfqq); } + /* + * Reset private fields. In case of a requeue, this allows + * this function to correctly do nothing if it is spuriously + * invoked again on this same request (see the check at the + * beginning of the function). Probably, a better general + * design would be to prevent blk-mq from invoking the requeue + * or finish hooks of an elevator, for a request that is not + * referred by that elevator. + * + * Resetting the following fields would break the + * request-insertion logic if rq is re-inserted into a bfq + * internal queue, without a re-preparation. Here we assume + * that re-insertions of requeued requests, without + * re-preparation, can happen only for pass_through or at_head + * requests (which are not re-inserted into bfq internal + * queues). + */ rq->elv.priv[0] = NULL; rq->elv.priv[1] = NULL; } @@ -5426,7 +5482,8 @@ static struct elevator_type iosched_bfq_mq = { .ops.mq = { .limit_depth = bfq_limit_depth, .prepare_request = bfq_prepare_request, - .finish_request = bfq_finish_request, + .requeue_request = bfq_finish_requeue_request, + .finish_request = bfq_finish_requeue_request, .exit_icq = bfq_exit_icq, .insert_requests = bfq_insert_requests, .dispatch_request = bfq_dispatch_request,