Message ID | 20201028141251.3608598-1-bigeasy@linutronix.de |
---|---|
State | New |
Headers | show |
Series | [1/3] blk-mq: Don't complete on a remote CPU in force threaded mode | expand |
On Wed, Oct 28, 2020 at 03:12:51PM +0100, Sebastian Andrzej Siewior wrote: > static int blk_softirq_cpu_dead(unsigned int cpu) > { > - /* > - * If a CPU goes away, splice its entries to the current CPU > - * and trigger a run of the softirq > - */ > - local_irq_disable(); > - list_splice_init(&per_cpu(blk_cpu_done, cpu), > - this_cpu_ptr(&blk_cpu_done)); > - raise_softirq_irqoff(BLOCK_SOFTIRQ); > - local_irq_enable(); > - > + blk_complete_reqs(&per_cpu(blk_cpu_done, cpu)); > return 0; How can this be preempted? Can't we keep using this_cpu_ptr here?
On 2020-10-28 14:44:53 [+0000], Christoph Hellwig wrote: > On Wed, Oct 28, 2020 at 03:12:51PM +0100, Sebastian Andrzej Siewior wrote: > > static int blk_softirq_cpu_dead(unsigned int cpu) > > { > > - /* > > - * If a CPU goes away, splice its entries to the current CPU > > - * and trigger a run of the softirq > > - */ > > - local_irq_disable(); > > - list_splice_init(&per_cpu(blk_cpu_done, cpu), > > - this_cpu_ptr(&blk_cpu_done)); > > - raise_softirq_irqoff(BLOCK_SOFTIRQ); > > - local_irq_enable(); > > - > > + blk_complete_reqs(&per_cpu(blk_cpu_done, cpu)); > > return 0; > > How can this be preempted? Can't we keep using this_cpu_ptr here? cpu of the dead CPU != this CPU. Sebastian
On 2020-10-28 15:12:51 [+0100], To linux-block@vger.kernel.org wrote: > --- a/block/blk-mq.c > +++ b/block/blk-mq.c > @@ -667,14 +632,21 @@ bool blk_mq_complete_request_remote(struct request *rq) > return false; > > if (blk_mq_complete_need_ipi(rq)) { … > } else { > if (rq->q->nr_hw_queues > 1) > return false; > - blk_mq_trigger_softirq(rq); > + cpu_list = this_cpu_ptr(&blk_cpu_done); > + if (llist_add(&rq->ipi_list, cpu_list)) > + raise_softirq(BLOCK_SOFTIRQ); > } > > return true; So Mike posted this: | BUG: using smp_processor_id() in preemptible [00000000] code: usb-storage/841 | caller is blk_mq_complete_request_remote.part.0+0xa2/0x120 | CPU: 0 PID: 841 Comm: usb-storage Not tainted 5.10.0-rc1+ #61 | Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-1 04/01/2014 | Call Trace: | dump_stack+0x77/0x97 | check_preemption_disabled+0xbe/0xc0 | blk_mq_complete_request_remote.part.0+0xa2/0x120 | blk_mq_complete_request+0x2e/0x40 | usb_stor_control_thread+0x29a/0x300 | kthread+0x14b/0x170 | ret_from_fork+0x22/0x30 This comes from this_cpu_ptr() because usb_stor_control_thread() runs with enabled preemption. Adding preempt_disable() around it will make the warning go away but will wake the ksoftirqd (this happens now, too). Adding local_bh_disable() around it would perform the completion immediately (instead of waking kssoftirqd) but local_bh_enable() feels slightly more expensive. Are there many drivers completing the SCSI requests in preemtible context? In this case it would be more efficient to complete the request directly (usb_stor_control_thread() goes to sleep after that anyway and there is only one request at a time). Sebastian
On Thu, Oct 29, 2020 at 02:12:12PM +0100, Sebastian Andrzej Siewior wrote: > Are there many drivers completing the SCSI requests in preemtible > context? In this case it would be more efficient to complete the request > directly (usb_stor_control_thread() goes to sleep after that anyway and > there is only one request at a time). Well, usb-storage obviously seems to do it, and the block layer does not prohibit it.
On 2020-10-29 14:05:36 [+0000], Christoph Hellwig wrote: > Well, usb-storage obviously seems to do it, and the block layer > does not prohibit it. Also loop, nvme-tcp and then I stopped looking. Any objections about adding local_bh_disable() around it? Sebastian
On Thu, Oct 29, 2020 at 03:56:23PM +0100, Sebastian Andrzej Siewior wrote: > On 2020-10-29 14:05:36 [+0000], Christoph Hellwig wrote: > > Well, usb-storage obviously seems to do it, and the block layer > > does not prohibit it. > > Also loop, nvme-tcp and then I stopped looking. > Any objections about adding local_bh_disable() around it? To me it seems like the whole IPI plus potentially softirq dance is a little pointless when completing from process context. Sagi, any opinion on that from the nvme-tcp POV?
>>> Well, usb-storage obviously seems to do it, and the block layer >>> does not prohibit it. >> >> Also loop, nvme-tcp and then I stopped looking. >> Any objections about adding local_bh_disable() around it? > > To me it seems like the whole IPI plus potentially softirq dance is > a little pointless when completing from process context. I agree. > Sagi, any opinion on that from the nvme-tcp POV? nvme-tcp should (almost) always complete from the context that matches the rq->mq_ctx->cpu as the thread that processes incoming completions (per hctx) should be affinitized to match it (unless cpus come and go). So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true in normal operation. That leaves the teardowns+aborts, which aren't very interesting here. I would note that nvme-tcp does not go to sleep after completing every I/O like how sebastian indicated usb does. Having said that, today the network stack is calling nvme_tcp_data_ready in napi context (softirq) which in turn triggers the queue thread to handle network rx (and complete the I/O). It's been measured recently that running the rx context directly in softirq will save some latency (possible because nvme-tcp rx context is non-blocking). So I'd think that patch #2 is unnecessary and just add overhead for nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed from napi context, nvme-tcp will probably always go to the IPI path.
On 2020-10-29 13:03:26 [-0700], Sagi Grimberg wrote: > > > > > Well, usb-storage obviously seems to do it, and the block layer > > > > does not prohibit it. > > > > > > Also loop, nvme-tcp and then I stopped looking. > > > Any objections about adding local_bh_disable() around it? > > > > To me it seems like the whole IPI plus potentially softirq dance is > > a little pointless when completing from process context. > > I agree. > > > Sagi, any opinion on that from the nvme-tcp POV? > > nvme-tcp should (almost) always complete from the context that matches > the rq->mq_ctx->cpu as the thread that processes incoming > completions (per hctx) should be affinitized to match it (unless cpus > come and go). in which context? But this is probably nr_hw_queues > 1? > So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true > in normal operation. That leaves the teardowns+aborts, which aren't very > interesting here. The process context invocation is nvme_tcp_complete_timed_out(). > I would note that nvme-tcp does not go to sleep after completing every > I/O like how sebastian indicated usb does. > > Having said that, today the network stack is calling nvme_tcp_data_ready > in napi context (softirq) which in turn triggers the queue thread to > handle network rx (and complete the I/O). It's been measured recently > that running the rx context directly in softirq will save some > latency (possible because nvme-tcp rx context is non-blocking). > > So I'd think that patch #2 is unnecessary and just add overhead for > nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS > steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed > from napi context, nvme-tcp will probably always go to the IPI path. but running it in softirq on the remote CPU would still allow of other packets to come on the remote CPU (which would block BLOCK sofirq if NET_RX is already running). Sebastian
>>>>> Well, usb-storage obviously seems to do it, and the block layer >>>>> does not prohibit it. >>>> >>>> Also loop, nvme-tcp and then I stopped looking. >>>> Any objections about adding local_bh_disable() around it? >>> >>> To me it seems like the whole IPI plus potentially softirq dance is >>> a little pointless when completing from process context. >> >> I agree. >> >>> Sagi, any opinion on that from the nvme-tcp POV? >> >> nvme-tcp should (almost) always complete from the context that matches >> the rq->mq_ctx->cpu as the thread that processes incoming >> completions (per hctx) should be affinitized to match it (unless cpus >> come and go). > > in which context? Not sure what is the question. > But this is probably nr_hw_queues > 1? Yes. >> So for nvme-tcp I don't expect blk_mq_complete_need_ipi to return true >> in normal operation. That leaves the teardowns+aborts, which aren't very >> interesting here. > > The process context invocation is nvme_tcp_complete_timed_out(). Yes. >> I would note that nvme-tcp does not go to sleep after completing every >> I/O like how sebastian indicated usb does. >> >> Having said that, today the network stack is calling nvme_tcp_data_ready >> in napi context (softirq) which in turn triggers the queue thread to >> handle network rx (and complete the I/O). It's been measured recently >> that running the rx context directly in softirq will save some >> latency (possible because nvme-tcp rx context is non-blocking). >> >> So I'd think that patch #2 is unnecessary and just add overhead for >> nvme-tcp.. do note that the napi softirq cpu mapping depends on the RSS >> steering, which is unlikely to match rq->mq_ctx->cpu, hence if completed >> from napi context, nvme-tcp will probably always go to the IPI path. > > but running it in softirq on the remote CPU would still allow of other > packets to come on the remote CPU (which would block BLOCK sofirq if > NET_RX is already running). Not sure I understand your comment, if napi triggers on core X and we complete from that, it will trigger IPI to core Y, and there with patch #2 is will trigger softirq instead of calling ->complete directly no?
On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote: > > in which context? > > Not sure what is the question. The question is in which context do you complete your requests. My guess by now is "usually softirq/NAPI and context in rare error case". > > But this is probably nr_hw_queues > 1? > > Yes. So this means it will either complete directly or issue an IPI. > > but running it in softirq on the remote CPU would still allow of other > > packets to come on the remote CPU (which would block BLOCK sofirq if > > NET_RX is already running). > > Not sure I understand your comment, if napi triggers on core X and we > complete from that, it will trigger IPI to core Y, and there with patch #2 > is will trigger softirq instead of calling ->complete directly no? This is correct. But trigger softirq does not mean that it will wake `ksoftirqd' as it is the case for the usb-storage right now. In your case (completing from NAPI/sofitrq (or for most other driver which complete in their IRQ handler)) it means: - trigger IPI - IPI will OR the BLOCK-softirq bit. - on exit from IPI it will invoke do_softirq() (unless softirq is already pending and got interrupted by the IPI) and complete the Block request. Sebastian
On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote: > On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote: >>> in which context? >> >> Not sure what is the question. > > The question is in which context do you complete your requests. My guess > by now is "usually softirq/NAPI and context in rare error case". There really aren't any rules for this, and it's perfectly legit to complete from process context. Maybe you're a kthread driven driver and that's how you handle completions. The block completion path has always been hard IRQ safe, but possible to call from anywhere.
On 10/31/20 9:00 AM, Jens Axboe wrote: > On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote: >> On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote: >>>> in which context? >>> >>> Not sure what is the question. >> >> The question is in which context do you complete your requests. My guess >> by now is "usually softirq/NAPI and context in rare error case". > > There really aren't any rules for this, and it's perfectly legit to > complete from process context. Maybe you're a kthread driven driver and > that's how you handle completions. The block completion path has always > been hard IRQ safe, but possible to call from anywhere. A more recent example is polled IO, which will always complete from process/task context and very much is fast path. -- Jens Axboe
On Sat, Oct 31, 2020 at 09:01:45AM -0600, Jens Axboe wrote: > On 10/31/20 9:00 AM, Jens Axboe wrote: > > On 10/31/20 4:41 AM, Sebastian Andrzej Siewior wrote: > >> On 2020-10-29 14:07:59 [-0700], Sagi Grimberg wrote: > >>>> in which context? > >>> > >>> Not sure what is the question. > >> > >> The question is in which context do you complete your requests. My guess > >> by now is "usually softirq/NAPI and context in rare error case". > > > > There really aren't any rules for this, and it's perfectly legit to > > complete from process context. Maybe you're a kthread driven driver and > > that's how you handle completions. The block completion path has always > > been hard IRQ safe, but possible to call from anywhere. > > A more recent example is polled IO, which will always complete from > process/task context and very much is fast path. But we never IPI for that anyway, so it is the easy case.
On 2020-10-31 09:00:49 [-0600], Jens Axboe wrote: > There really aren't any rules for this, and it's perfectly legit to > complete from process context. Maybe you're a kthread driven driver and > that's how you handle completions. The block completion path has always > been hard IRQ safe, but possible to call from anywhere. I'm not trying to put restrictions and forbidding completions from a kthread. I'm trying to avoid the pointless softirq dance for no added value. We could: diff --git a/block/blk-mq.c b/block/blk-mq.c index 4f53de48e5038..c4693b3750878 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -644,9 +644,11 @@ bool blk_mq_complete_request_remote(struct request *rq) } else { if (rq->q->nr_hw_queues > 1) return false; + preempt_disable(); cpu_list = this_cpu_ptr(&blk_cpu_done); if (llist_add(&rq->ipi_list, cpu_list)) raise_softirq(BLOCK_SOFTIRQ); + preempt_enable(); } return true; to not break that assumption you just mentioned and provide |static inline void blk_mq_complete_request_local(struct request *rq) |{ | rq->q->mq_ops->complete(rq); |} so that completion issued from from process context (like those from usb-storage) don't end up waking `ksoftird' (running at SCHED_OTHER) completing the requests but rather performing it right away. The softirq dance makes no sense here. As mentioned earlier, the alternative _could_ be to s/preempt_/local_bh_/ in the above patch. This would ensure that any invocation outside of IRQ/Softirq context would invoke the softirq _directly_ at local_bh_enable() time rather than waking the daemon for that purpose. It would also avoid another completion function for the direct case which could be abused if used from outside the thread context. The last one is currently my favorite. Sebastian
On Mon, Nov 02, 2020 at 10:55:33AM +0100, Sebastian Andrzej Siewior wrote: > On 2020-10-31 09:00:49 [-0600], Jens Axboe wrote: > > There really aren't any rules for this, and it's perfectly legit to > > complete from process context. Maybe you're a kthread driven driver and > > that's how you handle completions. The block completion path has always > > been hard IRQ safe, but possible to call from anywhere. > > I'm not trying to put restrictions and forbidding completions from a > kthread. I'm trying to avoid the pointless softirq dance for no added > value. We could: > to not break that assumption you just mentioned and provide > |static inline void blk_mq_complete_request_local(struct request *rq) > |{ > | rq->q->mq_ops->complete(rq); > |} > > so that completion issued from from process context (like those from > usb-storage) don't end up waking `ksoftird' (running at SCHED_OTHER) > completing the requests but rather performing it right away. The softirq > dance makes no sense here. Agreed. But I don't think your above blk_mq_complete_request_local is all that useful either as ->complete is defined by the caller, so we could just do a direct call. Basically we should just return false from blk_mq_complete_request_remote after updating the state when called from process context. But given that IIRC we are not supposed to check what state we are called from we'll need a helper just for updating the state instead and ensure the driver uses the right helper. Now of course we might have process context callers that still want to bounce to the submitting CPU, but in that case we should go directly to a workqueue or similar. Either way doing this properly will probabl involve an audit of all drivers, but I think that is worth it.
>>> There really aren't any rules for this, and it's perfectly legit to >>> complete from process context. Maybe you're a kthread driven driver and >>> that's how you handle completions. The block completion path has always >>> been hard IRQ safe, but possible to call from anywhere. >> >> I'm not trying to put restrictions and forbidding completions from a >> kthread. I'm trying to avoid the pointless softirq dance for no added >> value. We could: > >> to not break that assumption you just mentioned and provide >> |static inline void blk_mq_complete_request_local(struct request *rq) >> |{ >> | rq->q->mq_ops->complete(rq); >> |} >> >> so that completion issued from from process context (like those from >> usb-storage) don't end up waking `ksoftird' (running at SCHED_OTHER) >> completing the requests but rather performing it right away. The softirq >> dance makes no sense here. > > Agreed. But I don't think your above blk_mq_complete_request_local > is all that useful either as ->complete is defined by the caller, > so we could just do a direct call. Basically we should just > return false from blk_mq_complete_request_remote after updating > the state when called from process context. Agreed. > But given that IIRC > we are not supposed to check what state we are called from > we'll need a helper just for updating the state instead and > ensure the driver uses the right helper. Now of course we might > have process context callers that still want to bounce to the > submitting CPU, but in that case we should go directly to a > workqueue or similar. This would mean that it may be suboptimal for nvme-tcp to complete requests in softirq context from the network context (determined by NIC steering). Because in this case, this would trigger workqueue schedule on a per-request basis rather than once per .data_ready call like we do today. Is that correct? It has been observed that completing commands in softirq context (network determined cpu) because basically the completion does IPI + local complete, not IPI + softirq or IPI + workqueue. > Either way doing this properly will probabl involve an audit of all > drivers, but I think that is worth it. Agree.
On 2020-11-02 18:12:38 [+0000], Christoph Hellwig wrote: > > to not break that assumption you just mentioned and provide > > |static inline void blk_mq_complete_request_local(struct request *rq) > > |{ > > | rq->q->mq_ops->complete(rq); > > |} > > > > so that completion issued from from process context (like those from > > usb-storage) don't end up waking `ksoftird' (running at SCHED_OTHER) > > completing the requests but rather performing it right away. The softirq > > dance makes no sense here. > > Agreed. But I don't think your above blk_mq_complete_request_local > is all that useful either as ->complete is defined by the caller, > so we could just do a direct call. In usb-storage case it is hidden somewhere in the SCSI stack but this can probably be changed later on. > Basically we should just > return false from blk_mq_complete_request_remote after updating > the state when called from process context. But given that IIRC > we are not supposed to check what state we are called from > we'll need a helper just for updating the state instead and > ensure the driver uses the right helper. Now of course we might > have process context callers that still want to bounce to the > submitting CPU, but in that case we should go directly to a > workqueue or similar. So instead blk_mq_complete_request_local() you want a helper to set the state in which the completion function is invoked. Sounds more like an argument :) > Either way doing this properly will probabl involve an audit of all > drivers, but I think that is worth it. I'm lost. Should I repost the three patches with a preempt_disable() section (as suggested) to not break preemptible callers? And then move from there to provide callers from preemtible context an alternative? Sebastian
diff --git a/block/blk-mq.c b/block/blk-mq.c index 55bcee5dc0320..421a40968c9ff 100644 --- a/block/blk-mq.c +++ b/block/blk-mq.c @@ -648,6 +648,14 @@ static inline bool blk_mq_complete_need_ipi(struct request *rq) if (!IS_ENABLED(CONFIG_SMP) || !test_bit(QUEUE_FLAG_SAME_COMP, &rq->q->queue_flags)) return false; + /* + * With force threaded interrupts enabled, raising softirq from an SMP + * function call will always result in waking the ksoftirqd thread. + * This is probably worse than completing the request on a different + * cache domain. + */ + if (force_irqthreads) + return false; /* same CPU or cache domain? Complete locally */ if (cpu == rq->mq_ctx->cpu ||
With force threaded interrupts enabled, raising softirq from an SMP function call will always result in waking the ksoftirqd thread. This is not optimal given that the thread runs at SCHED_OTHER priority. Completing the request in hard IRQ-context on PREEMPT_RT (which enforces the force threaded mode) is bad because the completion handler may acquire sleeping locks which violate the locking context. Disable request completing on a remote CPU in force threaded mode. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> --- block/blk-mq.c | 8 ++++++++ 1 file changed, 8 insertions(+)