Message ID: 877dk162mo.ffs@nanos.tec.linutronix.de
State: New
Series: [RFC] r8152: Ensure that napi_schedule() is handled
On Fri, 14 May 2021 22:25:50 +0200 Thomas Gleixner wrote:
> On Fri, May 14 2021 at 12:38, Jakub Kicinski wrote:
> > On Fri, 14 May 2021 12:17:19 +0200 Thomas Gleixner wrote:
> >> The driver invokes napi_schedule() in several places from task
> >> context. napi_schedule() raises the NET_RX softirq bit and relies on the
> >> calling context to ensure that the softirq is handled. That's usually on
> >> return from interrupt or on the outermost local_bh_enable().
> >>
> >> But that's not the case here which causes the soft interrupt handling to be
> >> delayed to the next interrupt or local_bh_enable(). If the task in which
> >> context this is invoked is the last runnable task on a CPU and the CPU goes
> >> idle before an interrupt arrives or a local_bh_disable/enable() pair
> >> handles the pending soft interrupt then the NOHZ idle code emits the
> >> following warning.
> >>
> >>    NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> >>
> >> Prevent this by wrapping the napi_schedule() invocation from task context
> >> into a local_bh_disable/enable() pair.
> >
> > I should have read through my inbox before replying :)
> >
> > I'd go for switching to raise_softirq_irqoff() in ____napi_schedule()...
> > why not?
>
> Except that some instruction cycle beancounters might complain about
> the extra conditional for the sane cases.
>
> But yes, I'm fine with that as well.

That's why this patch is marked RFC :)

When we're in the right context (irq/bh disabled etc.) the cost is just
read of preempt_count() and jump, right? And presumably preempt_count()
is in the cache already, because those sections aren't very long. Let me
make this change locally and see if it is in any way perceivable.

Obviously if anyone sees a way to solve the problem without much
ifdefinery and force_irqthreads checks that'd be great - I don't.
I'd rather avoid pushing this kind of stuff out to the drivers.
On Fri, 14 May 2021 23:10:43 +0200 Thomas Gleixner wrote:
> On Fri, May 14 2021 at 13:46, Jakub Kicinski wrote:
> > On Fri, 14 May 2021 22:25:50 +0200 Thomas Gleixner wrote:
> >> Except that some instruction cycle beancounters might complain about
> >> the extra conditional for the sane cases.
> >>
> >> But yes, I'm fine with that as well.
> >
> > That's why this patch is marked RFC :)
> >
> > When we're in the right context (irq/bh disabled etc.) the cost is just
> > read of preempt_count() and jump, right? And presumably preempt_count()
> > is in the cache already, because those sections aren't very long. Let me
> > make this change locally and see if it is in any way perceivable.
>
> Right. Just wanted to mention it :)
>
> > Obviously if anyone sees a way to solve the problem without much
> > ifdefinery and force_irqthreads checks that'd be great - I don't.
>
> This is not related to force_irqthreads at all. This very driver invokes
> it from plain thread context.

I see, but a driver calling __napi_schedule_irqoff() from its IRQ
handler _would_ be an issue, right? Or do irq threads trigger softirq
processing on exit?

> > I'd rather avoid pushing this kind of stuff out to the drivers.
>
> You could have napi_schedule_intask() or something like that which would
> do the local_bh_disable()/enable() dance around the invocation of
> napi_schedule(). That would also document it clearly in the drivers. A
> quick grep shows a bunch of instances which could be replaced:
>
> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c-5704-	local_bh_disable();
> drivers/net/ethernet/mellanox/mlx4/en_netdev.c-1830-	local_bh_disable();
> drivers/net/usb/r8152.c-1552-	local_bh_disable();
> drivers/net/virtio_net.c-1355-	local_bh_disable();
> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-1650-	local_bh_disable();
> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2015-	local_bh_disable();
> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2225-	local_bh_disable();
> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2235-	local_bh_disable();
> drivers/s390/net/qeth_core_main.c-3515-	local_bh_disable();

Very well aware, I've just sent a patch for mlx5 last week :)

My initial reaction was the same as yours - we should add a lockdep
check, and napi_schedule_intask(). But then I started wondering
if it's all for nothing on rt or with force_irqthreads, and therefore
we should just eat the extra check.
On Fri, May 14 2021 at 14:41, Jakub Kicinski wrote:
> On Fri, 14 May 2021 23:10:43 +0200 Thomas Gleixner wrote:
>> On Fri, May 14 2021 at 13:46, Jakub Kicinski wrote:
>> > On Fri, 14 May 2021 22:25:50 +0200 Thomas Gleixner wrote:
>> >> Except that some instruction cycle beancounters might complain about
>> >> the extra conditional for the sane cases.
>> >>
>> >> But yes, I'm fine with that as well.
>> >
>> > That's why this patch is marked RFC :)
>> >
>> > When we're in the right context (irq/bh disabled etc.) the cost is just
>> > read of preempt_count() and jump, right? And presumably preempt_count()
>> > is in the cache already, because those sections aren't very long. Let me
>> > make this change locally and see if it is in any way perceivable.
>>
>> Right. Just wanted to mention it :)
>>
>> > Obviously if anyone sees a way to solve the problem without much
>> > ifdefinery and force_irqthreads checks that'd be great - I don't.
>>
>> This is not related to force_irqthreads at all. This very driver invokes
>> it from plain thread context.
>
> I see, but a driver calling __napi_schedule_irqoff() from its IRQ
> handler _would_ be an issue, right? Or do irq threads trigger softirq
> processing on exit?

Yes, they do. See irq_forced_thread_fn(). It has a local_bh_disable() /
local_bh_enable() pair around the invocation to ensure that.

>> > I'd rather avoid pushing this kind of stuff out to the drivers.
>>
>> You could have napi_schedule_intask() or something like that which would
>> do the local_bh_disable()/enable() dance around the invocation of
>> napi_schedule(). That would also document it clearly in the drivers. A
>> quick grep shows a bunch of instances which could be replaced:
>>
>> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c-5704-	local_bh_disable();
>> drivers/net/ethernet/mellanox/mlx4/en_netdev.c-1830-	local_bh_disable();
>> drivers/net/usb/r8152.c-1552-	local_bh_disable();
>> drivers/net/virtio_net.c-1355-	local_bh_disable();
>> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-1650-	local_bh_disable();
>> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2015-	local_bh_disable();
>> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2225-	local_bh_disable();
>> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2235-	local_bh_disable();
>> drivers/s390/net/qeth_core_main.c-3515-	local_bh_disable();
>
> Very well aware, I've just sent a patch for mlx5 last week :)
>
> My initial reaction was the same as yours - we should add a lockdep
> check, and napi_schedule_intask(). But then I started wondering
> if it's all for nothing on rt or with force_irqthreads, and therefore
> we should just eat the extra check.

We can make that work, but sure, I'm not going to argue when you decide to
just go for raise_softirq_irqoff().

I just hacked that check up, which is actually useful beyond NAPI. It's
straightforward except for that flush_smp_call_function_from_idle()
oddball, which immediately triggered that assert because block mq uses
__raise_softirq_irqoff() in a smp function call...

See below. Peter might have opinions though :)

Thanks,

        tglx
---
 include/linux/lockdep.h |   21 +++++++++++++++++++++
 include/linux/sched.h   |    1 +
 kernel/smp.c            |    2 ++
 kernel/softirq.c        |   18 ++++++++++++++----
 4 files changed, 38 insertions(+), 4 deletions(-)

--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -636,6 +636,23 @@ do {						\
 		      (!in_softirq() || in_irq() || in_nmi()));	\
 } while (0)
 
+#define lockdep_set_softirq_raise_safe()		\
+do {							\
+	current->softirq_raise_safe = 1;		\
+} while (0)
+
+#define lockdep_clear_softirq_raise_safe()		\
+do {							\
+	current->softirq_raise_safe = 0;		\
+} while (0)
+
+#define lockdep_assert_softirq_raise_ok()		\
+do {							\
+	WARN_ON_ONCE(__lockdep_enabled &&		\
+		     !current->softirq_raise_safe &&	\
+		     !(softirq_count() | hardirq_count())); \
+} while (0)
+
 #else
 # define might_lock(lock) do { } while (0)
 # define might_lock_read(lock) do { } while (0)
@@ -648,6 +665,10 @@ do {						\
 # define lockdep_assert_preemption_enabled() do { } while (0)
 # define lockdep_assert_preemption_disabled() do { } while (0)
 # define lockdep_assert_in_softirq() do { } while (0)
+
+# define lockdep_set_softirq_raise_safe() do { } while (0)
+# define lockdep_clear_softirq_raise_safe() do { } while (0)
+# define lockdep_assert_softirq_raise_ok() do { } while (0)
 #endif
 
 #ifdef CONFIG_PROVE_RAW_LOCK_NESTING
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1058,6 +1058,7 @@ struct task_struct {
 	u64				curr_chain_key;
 	int				lockdep_depth;
 	unsigned int			lockdep_recursion;
+	unsigned int			softirq_raise_safe;
 	struct held_lock		held_locks[MAX_LOCK_DEPTH];
 #endif
 
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -691,7 +691,9 @@ void flush_smp_call_function_from_idle(v
 	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->idle, CFD_SEQ_NOCPU,
 		      smp_processor_id(), CFD_SEQ_IDLE);
 	local_irq_save(flags);
+	lockdep_set_softirq_raise_safe();
 	flush_smp_call_function_queue(true);
+	lockdep_clear_softirq_raise_safe();
 	if (local_softirq_pending())
 		do_softirq();
 
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -664,12 +664,19 @@ void irq_exit(void)
 	lockdep_hardirq_exit();
 }
 
+static inline void ____raise_softirq_irqoff(unsigned int nr)
+{
+	lockdep_assert_irqs_disabled();
+	trace_softirq_raise(nr);
+	or_softirq_pending(1UL << nr);
+}
+
 /*
  * This function must run with irqs disabled!
  */
 inline void raise_softirq_irqoff(unsigned int nr)
 {
-	__raise_softirq_irqoff(nr);
+	____raise_softirq_irqoff(nr);
 
 	/*
 	 * If we're in an interrupt or softirq, we're done
@@ -693,11 +700,14 @@ void raise_softirq(unsigned int nr)
 	local_irq_restore(flags);
 }
 
+/*
+ * Must be invoked with interrupts disabled and either from softirq serving
+ * context or with local bottom halves disabled.
+ */
 void __raise_softirq_irqoff(unsigned int nr)
 {
-	lockdep_assert_irqs_disabled();
-	trace_softirq_raise(nr);
-	or_softirq_pending(1UL << nr);
+	lockdep_assert_softirq_raise_ok();
+	____raise_softirq_irqoff(nr);
 }
 
 void open_softirq(int nr, void (*action)(struct softirq_action *))
On Sat, May 15, 2021 at 01:23:02AM +0200, Thomas Gleixner wrote:
> We can make that work, but sure, I'm not going to argue when you decide to
> just go for raise_softirq_irqoff().
>
> I just hacked that check up, which is actually useful beyond NAPI. It's
> straightforward except for that flush_smp_call_function_from_idle()
> oddball, which immediately triggered that assert because block mq uses
> __raise_softirq_irqoff() in a smp function call...
>
> See below. Peter might have opinions though :)

Yeah, lovely stuff :-)

> +#define lockdep_assert_softirq_raise_ok()		\
> +do {							\
> +	WARN_ON_ONCE(__lockdep_enabled &&		\
> +		     !current->softirq_raise_safe &&	\
> +		     !(softirq_count() | hardirq_count())); \
> +} while (0)

> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -691,7 +691,9 @@ void flush_smp_call_function_from_idle(v
>  	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->idle, CFD_SEQ_NOCPU,
>  		      smp_processor_id(), CFD_SEQ_IDLE);
>  	local_irq_save(flags);
> +	lockdep_set_softirq_raise_safe();
>  	flush_smp_call_function_queue(true);
> +	lockdep_clear_softirq_raise_safe();
>  	if (local_softirq_pending())
>  		do_softirq();

I think it might make more sense to raise hardirq_count() in/for
flush_smp_call_function_queue() callers that aren't already from hardirq
context. That's this site and smpcfd_dying_cpu().

Then we can do away with this new special case.
On Sat, May 15 2021 at 15:09, Peter Zijlstra wrote:
> On Sat, May 15, 2021 at 01:23:02AM +0200, Thomas Gleixner wrote:
>> --- a/kernel/smp.c
>> +++ b/kernel/smp.c
>> @@ -691,7 +691,9 @@ void flush_smp_call_function_from_idle(v
>>  	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->idle, CFD_SEQ_NOCPU,
>>  		      smp_processor_id(), CFD_SEQ_IDLE);
>>  	local_irq_save(flags);
>> +	lockdep_set_softirq_raise_safe();
>>  	flush_smp_call_function_queue(true);
>> +	lockdep_clear_softirq_raise_safe();
>>  	if (local_softirq_pending())
>>  		do_softirq();
>
> I think it might make more sense to raise hardirq_count() in/for
> flush_smp_call_function_queue() callers that aren't already from hardirq
> context. That's this site and smpcfd_dying_cpu().
>
> Then we can do away with this new special case.

Right. Though I just checked smpcfd_dying_cpu(). That one does not run
softirqs after flushing the function queue, and it can't do that because
that's in the CPU dying phase with interrupts disabled, where the CPU is
already half torn down. Especially as softirq processing enables
interrupts, which might cause even more havoc.

Anyway, how is it safe to run arbitrary functions there after the CPU
removed itself from the online mask? That's daft to put it mildly.

Thanks,

        tglx
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -1543,6 +1543,17 @@ void write_mii_word(struct net_device *n
 	r8152_mdio_write(tp, reg, val);
 }
 
+/*
+ * Wrapper around napi_schedule() to ensure that the raised network softirq
+ * is actually handled.
+ */
+static void r8152_napi_schedule(struct napi_struct *napi)
+{
+	local_bh_disable();
+	napi_schedule(napi);
+	local_bh_enable();
+}
+
 static int
 r8152_submit_rx(struct r8152 *tp, struct rx_agg *agg, gfp_t mem_flags);
 
@@ -1753,7 +1764,7 @@ static void read_bulk_callback(struct ur
 		spin_lock_irqsave(&tp->rx_lock, flags);
 		list_add_tail(&agg->list, &tp->rx_done);
 		spin_unlock_irqrestore(&tp->rx_lock, flags);
-		napi_schedule(&tp->napi);
+		r8152_napi_schedule(&tp->napi);
 		return;
 	case -ESHUTDOWN:
 		rtl_set_unplug(tp);
@@ -2640,7 +2651,7 @@ int r8152_submit_rx(struct r8152 *tp, st
 			netif_err(tp, rx_err, tp->netdev,
 				  "Couldn't submit rx[%p], ret = %d\n",
 				  agg, ret);
 
-			napi_schedule(&tp->napi);
+			r8152_napi_schedule(&tp->napi);
 		}
 
 	return ret;
@@ -8202,7 +8213,7 @@ static int rtl8152_post_reset(struct usb
 		usb_submit_urb(tp->intr_urb, GFP_KERNEL);
 
 	if (!list_empty(&tp->rx_done))
-		napi_schedule(&tp->napi);
+		r8152_napi_schedule(&tp->napi);
 
 	return 0;
 }
@@ -8256,7 +8267,7 @@ static int rtl8152_runtime_resume(struct
 		smp_mb__after_atomic();
 
 		if (!list_empty(&tp->rx_done))
-			napi_schedule(&tp->napi);
+			r8152_napi_schedule(&tp->napi);
 
 		usb_submit_urb(tp->intr_urb, GFP_NOIO);
 	} else {