[RFC] r8152: Ensure that napi_schedule() is handled

Message ID	877dk162mo.ffs@nanos.tec.linutronix.de
State	New
Headers	show Return-Path: <linux-usb-owner@kernel.org> From: Thomas Gleixner <tglx@linutronix.de> To: linux-usb@vger.kernel.org, netdev@vger.kernel.org Cc: Michal Svec <msvec@suse.com>, "David S. Miller" <davem@davemloft.net>, Jakub Kicinski <kuba@kernel.org>, Hayes Wang <hayeswang@realtek.com>, Thierry Reding <treding@nvidia.com>, Lee Jones <lee.jones@linaro.org>, Borislav Petkov <bp@alien8.de> Subject: [PATCH RFC] r8152: Ensure that napi_schedule() is handled Date: Fri, 14 May 2021 12:17:19 +0200 Message-ID: <877dk162mo.ffs@nanos.tec.linutronix.de> MIME-Version: 1.0 Content-Type: text/plain Precedence: bulk
Series	[RFC] r8152: Ensure that napi_schedule() is handled \| expand [RFC] r8152: Ensure that napi_schedule() is handled

Thomas Gleixner May 14, 2021, 10:17 a.m. UTC

From: Thomas Gleixner <tglx@linutronix.de>
Date: Fri, 14 May 2021 11:46:08 +0200

The driver invokes napi_schedule() in several places from task
context. napi_schedule() raises the NET_RX softirq bit and relies on the
calling context to ensure that the softirq is handled. That's usually on
return from interrupt or on the outermost local_bh_enable().

But that's not the case here which causes the soft interrupt handling to be
delayed to the next interrupt or local_bh_enable(). If the task in which
context this is invoked is the last runnable task on a CPU and the CPU goes
idle before an interrupt arrives or a local_bh_disable/enable() pair
handles the pending soft interrupt then the NOHZ idle code emits the
following warning.

  NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!

Prevent this by wrapping the napi_schedule() invocation from task context
into a local_bh_disable/enable() pair.

Reported-by: Michal Svec <msvec@suse.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---

Note: That's not the first incident of this. Shouldn't napi_schedule()
      have a debug check (under lockdep) to catch this?

---
 drivers/net/usb/r8152.c |   19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

Jakub Kicinski May 14, 2021, 8:46 p.m. UTC | #1

On Fri, 14 May 2021 22:25:50 +0200 Thomas Gleixner wrote:
> On Fri, May 14 2021 at 12:38, Jakub Kicinski wrote:
> 
> > On Fri, 14 May 2021 12:17:19 +0200 Thomas Gleixner wrote:  
> >> The driver invokes napi_schedule() in several places from task
> >> context. napi_schedule() raises the NET_RX softirq bit and relies on the
> >> calling context to ensure that the softirq is handled. That's usually on
> >> return from interrupt or on the outermost local_bh_enable().
> >> 
> >> But that's not the case here which causes the soft interrupt handling to be
> >> delayed to the next interrupt or local_bh_enable(). If the task in which
> >> context this is invoked is the last runnable task on a CPU and the CPU goes
> >> idle before an interrupt arrives or a local_bh_disable/enable() pair
> >> handles the pending soft interrupt then the NOHZ idle code emits the
> >> following warning.
> >> 
> >>   NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #08!!!
> >> 
> >> Prevent this by wrapping the napi_schedule() invocation from task context
> >> into a local_bh_disable/enable() pair.  
> >
> > I should have read through my inbox before replying :)
> >
> > I'd go for switching to raise_softirq_irqoff() in ____napi_schedule()...
> > why not?  
> 
> Except that some instruction cycle beancounters might complain about
> the extra conditional for the sane cases.
> 
> But yes, I'm fine with that as well. That's why this patch is marked RFC :)

When we're in the right context (irq/bh disabled etc.) the cost is just
read of preempt_count() and jump, right? And presumably preempt_count()
is in the cache already, because those sections aren't very long. Let me
make this change locally and see if it is in any way perceivable.

Obviously if anyone sees a way to solve the problem without much
ifdefinery and force_irqthreads checks that'd be great - I don't.
I'd rather avoid pushing this kind of stuff out to the drivers.

Jakub Kicinski May 14, 2021, 9:41 p.m. UTC | #2

On Fri, 14 May 2021 23:10:43 +0200 Thomas Gleixner wrote:
> On Fri, May 14 2021 at 13:46, Jakub Kicinski wrote:
> > On Fri, 14 May 2021 22:25:50 +0200 Thomas Gleixner wrote:  
> >> Except that some instruction cycle beancounters might complain about
> >> the extra conditional for the sane cases.
> >> 
> >> But yes, I'm fine with that as well. That's why this patch is marked RFC :)  
> >
> > When we're in the right context (irq/bh disabled etc.) the cost is just
> > read of preempt_count() and jump, right? And presumably preempt_count()
> > is in the cache already, because those sections aren't very long. Let me
> > make this change locally and see if it is in any way perceivable.  
> 
> Right. Just wanted to mention it :)
> 
> > Obviously if anyone sees a way to solve the problem without much
> > ifdefinery and force_irqthreads checks that'd be great - I don't.  
> 
> This is not related to force_irqthreads at all. This very driver invokes
> it from plain thread context.

I see, but a driver calling __napi_schedule_irqoff() from its IRQ
handler _would_ be an issue, right? Or do irq threads trigger softirq
processing on exit?

> > I'd rather avoid pushing this kind of stuff out to the drivers.  
> 
> You could have napi_schedule_intask() or something like that which would
> do the local_bh_disable()/enable() dance around the invocation of
> napi_schedule(). That would also document it clearly in the drivers. A
> quick grep shows a bunch of instances which could be replaced:
> 
> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c-5704-		local_bh_disable();
> drivers/net/ethernet/mellanox/mlx4/en_netdev.c-1830-		local_bh_disable();
> drivers/net/usb/r8152.c-1552-	local_bh_disable();
> drivers/net/virtio_net.c-1355-	local_bh_disable();
> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-1650-	local_bh_disable();
> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2015-		local_bh_disable();
> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2225-		local_bh_disable();
> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2235-		local_bh_disable();
> drivers/s390/net/qeth_core_main.c-3515-	local_bh_disable();

Very well aware, I've just sent a patch for mlx5 last week :)

My initial reaction was the same as yours - we should add lockdep
check, and napi_schedule_intask(). But then I started wondering
if it's all for nothing on rt or with force_irqthreads, and therefore
we should just eat the extra check.

Thomas Gleixner May 14, 2021, 11:23 p.m. UTC | #3

On Fri, May 14 2021 at 14:41, Jakub Kicinski wrote:
> On Fri, 14 May 2021 23:10:43 +0200 Thomas Gleixner wrote:
>> On Fri, May 14 2021 at 13:46, Jakub Kicinski wrote:
>> > On Fri, 14 May 2021 22:25:50 +0200 Thomas Gleixner wrote:  
>> >> Except that some instruction cycle beancounters might complain about
>> >> the extra conditional for the sane cases.
>> >> 
>> >> But yes, I'm fine with that as well. That's why this patch is marked RFC :)  
>> >
>> > When we're in the right context (irq/bh disabled etc.) the cost is just
>> > read of preempt_count() and jump, right? And presumably preempt_count()
>> > is in the cache already, because those sections aren't very long. Let me
>> > make this change locally and see if it is in any way perceivable.  
>> 
>> Right. Just wanted to mention it :)
>> 
>> > Obviously if anyone sees a way to solve the problem without much
>> > ifdefinery and force_irqthreads checks that'd be great - I don't.  
>> 
>> This is not related to force_irqthreads at all. This very driver invokes
>> it from plain thread context.
>
> I see, but a driver calling __napi_schedule_irqoff() from its IRQ
> handler _would_ be an issue, right? Or do irq threads trigger softirq
> processing on exit?

Yes, they do. See irq_forced_thread_fn(). It has a local_bh_disable() /
local_bh_ enable() pair around the invocation to ensure that.

>> > I'd rather avoid pushing this kind of stuff out to the drivers.  
>> 
>> You could have napi_schedule_intask() or something like that which would
>> do the local_bh_disable()/enable() dance around the invocation of
>> napi_schedule(). That would also document it clearly in the drivers. A
>> quick grep shows a bunch of instances which could be replaced:
>> 
>> drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c-5704-		local_bh_disable();
>> drivers/net/ethernet/mellanox/mlx4/en_netdev.c-1830-		local_bh_disable();
>> drivers/net/usb/r8152.c-1552-	local_bh_disable();
>> drivers/net/virtio_net.c-1355-	local_bh_disable();
>> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-1650-	local_bh_disable();
>> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2015-		local_bh_disable();
>> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2225-		local_bh_disable();
>> drivers/net/wireless/intel/iwlwifi/pcie/rx.c-2235-		local_bh_disable();
>> drivers/s390/net/qeth_core_main.c-3515-	local_bh_disable();
>
> Very well aware, I've just sent a patch for mlx5 last week :)
>
> My initial reaction was the same as yours - we should add lockdep
> check, and napi_schedule_intask(). But then I started wondering
> if it's all for nothing on rt or with force_irqthreads, and therefore
> we should just eat the extra check.

We can make that work but sure I'm not going to argue when you decide to
just go for raise_softirq_irqsoff().

I just hacked that check up which is actually useful beyond NAPI. It's
straight forward except for that flush_smp_call_function_from_idle()
oddball, which immeditately triggered that assert because block mq uses
__raise_softirq_irqsoff() in a smp function call...

See below. Peter might have opinions though :)

Thanks,

        tglx
---
 include/linux/lockdep.h |   21 +++++++++++++++++++++
 include/linux/sched.h   |    1 +
 kernel/smp.c            |    2 ++
 kernel/softirq.c        |   18 ++++++++++++++----
 4 files changed, 38 insertions(+), 4 deletions(-)

--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -636,6 +636,23 @@ do {									\
 		     (!in_softirq() || in_irq() || in_nmi()));		\
 } while (0)
 
+#define lockdep_set_softirq_raise_safe()				\
+do {									\
+	current->softirq_raise_safe = 1;				\
+} while (0)
+
+#define lockdep_clear_softirq_raise_safe()				\
+do {									\
+	current->softirq_raise_safe = 0;				\
+} while (0)
+
+#define lockdep_assert_softirq_raise_ok()				\
+do {									\
+	WARN_ON_ONCE(__lockdep_enabled &&				\
+		     !current->softirq_raise_safe &&			\
+		     !(softirq_count() | hardirq_count()));		\
+} while (0)
+
 #else
 # define might_lock(lock) do { } while (0)
 # define might_lock_read(lock) do { } while (0)
@@ -648,6 +665,10 @@ do {									\
 # define lockdep_assert_preemption_enabled() do { } while (0)
 # define lockdep_assert_preemption_disabled() do { } while (0)
 # define lockdep_assert_in_softirq() do { } while (0)
+
+# define lockdep_set_softirq_raise_safe()	do { } while (0)
+# define lockdep_clear_softirq_raise_safe()	do { } while (0)
+# define lockdep_assert_softirq_raise_ok()	do { } while (0)
 #endif
 
 #ifdef CONFIG_PROVE_RAW_LOCK_NESTING
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1058,6 +1058,7 @@ struct task_struct {
 	u64				curr_chain_key;
 	int				lockdep_depth;
 	unsigned int			lockdep_recursion;
+	unsigned int			softirq_raise_safe;
 	struct held_lock		held_locks[MAX_LOCK_DEPTH];
 #endif
 
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -691,7 +691,9 @@ void flush_smp_call_function_from_idle(v
 	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->idle, CFD_SEQ_NOCPU,
 		      smp_processor_id(), CFD_SEQ_IDLE);
 	local_irq_save(flags);
+	lockdep_set_softirq_raise_safe();
 	flush_smp_call_function_queue(true);
+	lockdep_clear_softirq_raise_safe();
 	if (local_softirq_pending())
 		do_softirq();
 
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -664,12 +664,19 @@ void irq_exit(void)
 	lockdep_hardirq_exit();
 }
 
+static inline void ____raise_softirq_irqoff(unsigned int nr)
+{
+	lockdep_assert_irqs_disabled();
+	trace_softirq_raise(nr);
+	or_softirq_pending(1UL << nr);
+}
+
 /*
  * This function must run with irqs disabled!
  */
 inline void raise_softirq_irqoff(unsigned int nr)
 {
-	__raise_softirq_irqoff(nr);
+	____raise_softirq_irqoff(nr);
 
 	/*
 	 * If we're in an interrupt or softirq, we're done
@@ -693,11 +700,14 @@ void raise_softirq(unsigned int nr)
 	local_irq_restore(flags);
 }
 
+/*
+ * Must be invoked with interrupts disabled and either from softirq serving
+ * context or with local bottom halfs disabled.
+ */
 void __raise_softirq_irqoff(unsigned int nr)
 {
-	lockdep_assert_irqs_disabled();
-	trace_softirq_raise(nr);
-	or_softirq_pending(1UL << nr);
+	lockdep_assert_softirq_raise_ok();
+	____raise_softirq_irqoff(nr);
 }
 
 void open_softirq(int nr, void (*action)(struct softirq_action *))

Peter Zijlstra May 15, 2021, 1:09 p.m. UTC | #4

On Sat, May 15, 2021 at 01:23:02AM +0200, Thomas Gleixner wrote:
> We can make that work but sure I'm not going to argue when you decide to

> just go for raise_softirq_irqsoff().

> 

> I just hacked that check up which is actually useful beyond NAPI. It's

> straight forward except for that flush_smp_call_function_from_idle()

> oddball, which immeditately triggered that assert because block mq uses

> __raise_softirq_irqsoff() in a smp function call...

> 

> See below. Peter might have opinions though :)


Yeah, lovely stuff :-)


> +#define lockdep_assert_softirq_raise_ok()				\

> +do {									\

> +	WARN_ON_ONCE(__lockdep_enabled &&				\

> +		     !current->softirq_raise_safe &&			\

> +		     !(softirq_count() | hardirq_count()));		\

> +} while (0)


> --- a/kernel/smp.c

> +++ b/kernel/smp.c

> @@ -691,7 +691,9 @@ void flush_smp_call_function_from_idle(v

>  	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->idle, CFD_SEQ_NOCPU,

>  		      smp_processor_id(), CFD_SEQ_IDLE);

>  	local_irq_save(flags);

> +	lockdep_set_softirq_raise_safe();

>  	flush_smp_call_function_queue(true);

> +	lockdep_clear_softirq_raise_safe();

>  	if (local_softirq_pending())

>  		do_softirq();


I think it might make more sense to raise hardirq_count() in/for
flush_smp_call_function_queue() callers that aren't already from hardirq
context. That's this site and smpcfd_dying_cpu().

Then we can do away with this new special case.

Thomas Gleixner May 15, 2021, 7:06 p.m. UTC | #5

On Sat, May 15 2021 at 15:09, Peter Zijlstra wrote:
> On Sat, May 15, 2021 at 01:23:02AM +0200, Thomas Gleixner wrote:

>> --- a/kernel/smp.c

>> +++ b/kernel/smp.c

>> @@ -691,7 +691,9 @@ void flush_smp_call_function_from_idle(v

>>  	cfd_seq_store(this_cpu_ptr(&cfd_seq_local)->idle, CFD_SEQ_NOCPU,

>>  		      smp_processor_id(), CFD_SEQ_IDLE);

>>  	local_irq_save(flags);

>> +	lockdep_set_softirq_raise_safe();

>>  	flush_smp_call_function_queue(true);

>> +	lockdep_clear_softirq_raise_safe();

>>  	if (local_softirq_pending())

>>  		do_softirq();

>

> I think it might make more sense to raise hardirq_count() in/for

> flush_smp_call_function_queue() callers that aren't already from hardirq

> context. That's this site and smpcfd_dying_cpu().

>

> Then we can do away with this new special case.

Right.

Though I just checked smpcfd_dying_cpu(). That ones does not run
softirqs after flushing the function queue and it can't do that because
that's in the CPU dying phase with interrupts disabled where the CPU is
already half torn down.

Especially as softirq processing enables interrupts, which might cause
even more havoc.

Anyway how is it safe to run arbitrary functions there after the CPU
removed itself from the online mask? That's daft to put it mildly.

Thanks,

        tglx

[RFC] r8152: Ensure that napi_schedule() is handled

Commit Message

Comments

Patch