diff mbox series

[RFC] rt: kernel/sched/core: fix kthread_park() pending too long when CPU un-plugged

Message ID 20210107091841.19798-1-ran.wang_1@nxp.com
State New
Headers show
Series [RFC] rt: kernel/sched/core: fix kthread_park() pending too long when CPU un-plugged | expand

Commit Message

Ran Wang Jan. 7, 2021, 9:18 a.m. UTC
When doing CPU un-plug stress test, function smpboot_park_threads() would
get call to park kernel threads (which including ksoftirqd) on that
CPU core, and function wait_task_inactive() would yield for those queued
task(s) by calling schedule_hrtimerout() with mode of HRTIMER_MODE_REL.

stack trace:
...
smpboot_thread_fn
    cpuhp_thread_fun
        cpuhp_invoke_callback
            smpboot_park_threads
              smpboot_park_thread: ksoftirqd/1
                kthread_park
                  wait_task_inactive
                     schedule_hrtimerout

However, when PREEMPT_RT is set, this would cause a pending issue since
schedule_hrtimerout() depend on thread ksoftirqd to complete related
work if it using HRTIMER_MODE_SOFT. So force using HRTIMER_MODE_HARD
in such case.

Suggested-by: Jiafei Pan <jiafei.pan@nxp.com>
Signed-off-by: Ran Wang <ran.wang_1@nxp.com>
---
 kernel/sched/core.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

Comments

Peter Zijlstra Jan. 7, 2021, 10:45 a.m. UTC | #1
On Thu, Jan 07, 2021 at 05:18:41PM +0800, Ran Wang wrote:
> +
> +			if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
> +			    !strncmp(p->comm, "ksoftirqd/", 10))
> +				schedule_hrtimeout(&to,
> +					HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
> +			else
> +				schedule_hrtimeout(&to, HRTIMER_MODE_REL);

This is horrific, why did you not self-censor and spare me the mental
anguish of having to formulate a CoC compliant response?

It also violates coding style, but given the total lack of any sense,
that seems like a minor detail.

Why can't we use HRTIMER_MODE_HARD unconditionally?
Ran Wang Jan. 8, 2021, 8:45 a.m. UTC | #2
Hi Sebastian, Peter

Thursday, January 7, 2021 11:29 PM, Sebastian Siewior wrote:
> 

> On 2021-01-07 11:45:39 [+0100], Peter Zijlstra wrote:

> > On Thu, Jan 07, 2021 at 05:18:41PM +0800, Ran Wang wrote:

> > > +

> > > +			if (IS_ENABLED(CONFIG_PREEMPT_RT) &&

> > > +			    !strncmp(p->comm, "ksoftirqd/", 10))

> > > +				schedule_hrtimeout(&to,

> > > +					HRTIMER_MODE_REL | HRTIMER_MODE_HARD);

> > > +			else

> > > +				schedule_hrtimeout(&to, HRTIMER_MODE_REL);

> >

> > This is horrific, why did you not self-censor and spare me the mental

> > anguish of having to formulate a CoC compliant response?

> >

> > It also violates coding style, but given the total lack of any sense,

> > that seems like a minor detail.

> >

> > Why can't we use HRTIMER_MODE_HARD unconditionally?

> 

> I had a similar patch in -RT and dropped it in v5.10-rc7-rt16.

> It was added because RT couldn't boot since creating the boot-threads didn't work before the ksoftirqd was up. This was fixed by commit

>    26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")


I tried applying above commit to linux-5.6.y-rt, it could resolve my problem on LX2160ARDB, THANKS!

> and live was good again.

> tglx (also) suggested to add HRTIMER_MODE_HARD unconditionally (it looked at SYSTEM_STATE back then) and I was only worried some

> abuse via userland.


Got it.

Regards,
Ran

> This sleep can be triggered by ptrace/strace() and with brief testing I can trigger the sleep there but I don't get it anywhere near where I

> would notice it with cyclictest.

> 

> Sebastian
Sebastian Andrzej Siewior Jan. 8, 2021, 9:05 a.m. UTC | #3
On 2021-01-08 08:45:14 [+0000], Ran Wang wrote:
> Hi Sebastian, Peter

Hi,

> > I had a similar patch in -RT and dropped it in v5.10-rc7-rt16.

> > It was added because RT couldn't boot since creating the boot-threads didn't work before the ksoftirqd was up. This was fixed by commit

> >    26c7295be0c5e ("kthread: Do not preempt current task if it is going to call schedule()")

> 

> I tried applying above commit to linux-5.6.y-rt, it could resolve my problem on LX2160ARDB, THANKS!


so in other words all this could have been avoided by using a supported
or maintained RT series. The v5.4 series has this patch, v5.6 isn't
maintained anymore so it is likely that there is more missing.

Sebastian
Ran Wang Jan. 8, 2021, 9:27 a.m. UTC | #4
Hi Sebastian,

On Friday, January 8, 2021 5:05 PM, Sebastian Siewior wrote:
> 

> On 2021-01-08 08:45:14 [+0000], Ran Wang wrote:

> > Hi Sebastian, Peter

> Hi,

> 

> > > I had a similar patch in -RT and dropped it in v5.10-rc7-rt16.

> > > It was added because RT couldn't boot since creating the boot-threads didn't work before the ksoftirqd was up. This was fixed by commit

> > >    26c7295be0c5e ("kthread: Do not preempt current task if it is

> > > going to call schedule()")

> >

> > I tried applying above commit to linux-5.6.y-rt, it could resolve my problem on LX2160ARDB, THANKS!

> 

> so in other words all this could have been avoided by using a supported or maintained RT series. The v5.4 series has this patch, v5.6 isn't

> maintained anymore so it is likely that there is more missing.


Thanks for let me know this.

The reason I trying linux-5.6-rt is that I have encountered other more serious issues on later RT version (even with v5.10-rc7-rt16), one of them
is CPU hot plug got stuck in irq_work_sync() which called by sugov_stop(), failure happen at 1st loop stress test every time.
I will try to collect more information and create another mail thread later.

Thanks & Regards,
Ran

> Sebastian
Peter Zijlstra Jan. 8, 2021, 9:32 a.m. UTC | #5
On Thu, Jan 07, 2021 at 04:28:43PM +0100, Sebastian Siewior wrote:
> This sleep can be triggered by ptrace/strace() and with brief testing I

> can trigger the sleep there but I don't get it anywhere near where I

> would notice it with cyclictest.


It's a single task wakeup (the caller), doing that from hardirq context
really should not be a problem, we do lots of that in RT already.
Sebastian Andrzej Siewior Jan. 8, 2021, 11:19 a.m. UTC | #6
On 2021-01-08 10:32:36 [+0100], Peter Zijlstra wrote:
> It's a single task wakeup (the caller), doing that from hardirq context

> really should not be a problem, we do lots of that in RT already.


I'm not worry about that single wakeup but about an artificial case
where you manage to accumulate multiple single wake ups in a short
time frame.

Sebastian
diff mbox series

Patch

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 792da55..4cc742a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2054,10 +2054,15 @@  unsigned long wait_task_inactive(struct task_struct *p, long match_state)
 			ktime_t to = NSEC_PER_SEC / HZ;
 
 			set_current_state(TASK_UNINTERRUPTIBLE);
-			schedule_hrtimeout(&to, HRTIMER_MODE_REL);
+
+			if (IS_ENABLED(CONFIG_PREEMPT_RT) &&
+			    !strncmp(p->comm, "ksoftirqd/", 10))
+				schedule_hrtimeout(&to,
+					HRTIMER_MODE_REL | HRTIMER_MODE_HARD);
+			else
+				schedule_hrtimeout(&to, HRTIMER_MODE_REL);
 			continue;
 		}
-
 		/*
 		 * Ahh, all good. It wasn't running, and it wasn't
 		 * runnable, which means that it will never become