diff mbox series

[v2,2/7] serial: qcom-geni: fix shutdown race

Message ID 20241001125033.10625-3-johan+linaro@kernel.org
State Superseded
Headers show
Series serial: qcom-geni: fix receiver enable | expand

Commit Message

Johan Hovold Oct. 1, 2024, 12:50 p.m. UTC
A commit adding back the stopping of tx on port shutdown failed to add
back the locking which had also been removed by commit e83766334f96
("tty: serial: qcom_geni_serial: No need to stop tx/rx on UART
shutdown").

Holding the port lock is needed to serialise against the console code,
which may update the interrupt enable register and access the port
state.

Fixes: d8aca2f96813 ("tty: serial: qcom-geni-serial: stop operations in progress at shutdown")
Fixes: 947cc4ecc06c ("serial: qcom-geni: fix soft lockup on sw flow control and suspend")
Cc: stable@vger.kernel.org	# 6.3
Cc: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
Signed-off-by: Johan Hovold <johan+linaro@kernel.org>
---
 drivers/tty/serial/qcom_geni_serial.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Doug Anderson Oct. 3, 2024, 6:30 p.m. UTC | #1
Hi,

On Tue, Oct 1, 2024 at 5:51 AM Johan Hovold <johan+linaro@kernel.org> wrote:
>
> A commit adding back the stopping of tx on port shutdown failed to add
> back the locking which had also been removed by commit e83766334f96
> ("tty: serial: qcom_geni_serial: No need to stop tx/rx on UART
> shutdown").

Hmmm, when I look at that commit it makes me think that the problem
that commit e83766334f96 ("tty: serial: qcom_geni_serial: No need to
stop tx/rx on UART shutdown") was fixing was re-introduced by commit
d8aca2f96813 ("tty: serial: qcom-geni-serial: stop operations in
progress at shutdown"). ...and indeed, it was. :(

I can't interact with kgdb if I do this:

1. ssh over to DUT
2. Kill the console process (on ChromeOS stop console-ttyMSM0)
3. Drop in the debugger (echo g > /proc/sysrq-trigger)

This bug predates your series, but since it touches the same code
maybe you could fix it at the same time? I will note that commit
e83766334f96 ("tty: serial: qcom_geni_serial: No need to stop tx/rx on
UART shutdown") mentions that it wasn't required for FIFO mode--only
DMA...

Aside from the pre-existing bug, I agree that the locking should be there.


-Doug
Johan Hovold Oct. 9, 2024, 2:10 p.m. UTC | #2
On Thu, Oct 03, 2024 at 11:30:08AM -0700, Doug Anderson wrote:
> On Tue, Oct 1, 2024 at 5:51 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> >
> > A commit adding back the stopping of tx on port shutdown failed to add
> > back the locking which had also been removed by commit e83766334f96
> > ("tty: serial: qcom_geni_serial: No need to stop tx/rx on UART
> > shutdown").
> 
> Hmmm, when I look at that commit it makes me think that the problem
> that commit e83766334f96 ("tty: serial: qcom_geni_serial: No need to
> stop tx/rx on UART shutdown") was fixing was re-introduced by commit
> d8aca2f96813 ("tty: serial: qcom-geni-serial: stop operations in
> progress at shutdown"). ...and indeed, it was. :(
> 
> I can't interact with kgdb if I do this:
> 
> 1. ssh over to DUT
> 2. Kill the console process (on ChromeOS stop console-ttyMSM0)
> 3. Drop in the debugger (echo g > /proc/sysrq-trigger)

Yeah, don't do that then. ;)

Not sure how your "console process" works, but this should only happen
if you do not enable the serial console (console=ttyMSM0) and then try
to use a polled console (as enabling the console will prevent port
shutdown from being called). That should probably just be disallowed.

The console code, and the polled console code bolted on top, is a bit of
a hack so corner cases like this are to be expected.

When the polled console code was introduced it was claimed that it would
have "absolutely zero impact as long as CONFIG_CONSOLE_POLL is
disabled". Perhaps I'm reading too much into it, but that statement is
clearly ignoring the maintenance cost...

Johan
Doug Anderson Oct. 10, 2024, 10:30 p.m. UTC | #3
Hi,

On Wed, Oct 9, 2024 at 7:10 AM Johan Hovold <johan@kernel.org> wrote:
>
> On Thu, Oct 03, 2024 at 11:30:08AM -0700, Doug Anderson wrote:
> > On Tue, Oct 1, 2024 at 5:51 AM Johan Hovold <johan+linaro@kernel.org> wrote:
> > >
> > > A commit adding back the stopping of tx on port shutdown failed to add
> > > back the locking which had also been removed by commit e83766334f96
> > > ("tty: serial: qcom_geni_serial: No need to stop tx/rx on UART
> > > shutdown").
> >
> > Hmmm, when I look at that commit it makes me think that the problem
> > that commit e83766334f96 ("tty: serial: qcom_geni_serial: No need to
> > stop tx/rx on UART shutdown") was fixing was re-introduced by commit
> > d8aca2f96813 ("tty: serial: qcom-geni-serial: stop operations in
> > progress at shutdown"). ...and indeed, it was. :(
> >
> > I can't interact with kgdb if I do this:
> >
> > 1. ssh over to DUT
> > 2. Kill the console process (on ChromeOS stop console-ttyMSM0)
> > 3. Drop in the debugger (echo g > /proc/sysrq-trigger)
>
> Yeah, don't do that then. ;)

The problem is, I don't always have a choice. As talked about in the
message of commit e83766334f96 ("tty: serial: qcom_geni_serial: No
need to stop tx/rx on UART shutdown"), the above steps attempt to
simulate what happened organically: a crash in late shutdown. During
shutdown the agetty has been killed by the init system and I don't
have a choice about it. If I get a kernel crash then (which isn't
uncommon since shutdown code tends to trigger seldom-used code paths)
then I can't debug it. :(

We need to fix this.


> Not sure how your "console process" works, but this should only happen
> if you do not enable the serial console (console=ttyMSM0) and then try
> to use a polled console (as enabling the console will prevent port
> shutdown from being called).

That simply doesn't seem to be the case for me. The port shutdown
seems to be called. To confirm, I put a printout at the start of
qcom_geni_serial_shutdown(). I see in my /proc/cmdline:

console=ttyMSM0,115200n8

...and I indeed verify that I see console messages on my UART. I then run:

stop console-ttyMSM0

...and I see on the UART:

[   92.916964] DOUG: qcom_geni_serial_shutdown
[   92.922703] init: console-ttyMSM0 main process (611) killed by TERM signal

Console messages keep coming out the UART even though the agetty isn't
there. Now I (via ssh) drop into the debugger:

echo g > /proc/sysrq-trigger

I see the "kgdb" prompt but I can't interact with it because
qcom_geni_serial_shutdown() stopped RX.


-Doug
Johan Hovold Oct. 11, 2024, 6:51 a.m. UTC | #4
On Thu, Oct 10, 2024 at 03:30:05PM -0700, Doug Anderson wrote:
> On Wed, Oct 9, 2024 at 7:10 AM Johan Hovold <johan@kernel.org> wrote:
> > On Thu, Oct 03, 2024 at 11:30:08AM -0700, Doug Anderson wrote:

> > > Hmmm, when I look at that commit it makes me think that the problem
> > > that commit e83766334f96 ("tty: serial: qcom_geni_serial: No need to
> > > stop tx/rx on UART shutdown") was fixing was re-introduced by commit
> > > d8aca2f96813 ("tty: serial: qcom-geni-serial: stop operations in
> > > progress at shutdown"). ...and indeed, it was. :(
> > >
> > > I can't interact with kgdb if I do this:
> > >
> > > 1. ssh over to DUT
> > > 2. Kill the console process (on ChromeOS stop console-ttyMSM0)
> > > 3. Drop in the debugger (echo g > /proc/sysrq-trigger)
> >
> > Yeah, don't do that then. ;)
> 
> The problem is, I don't always have a choice. As talked about in the
> message of commit e83766334f96 ("tty: serial: qcom_geni_serial: No
> need to stop tx/rx on UART shutdown"), the above steps attempt to
> simulate what happened organically: a crash in late shutdown. During
> shutdown the agetty has been killed by the init system and I don't
> have a choice about it. If I get a kernel crash then (which isn't
> uncommon since shutdown code tends to trigger seldom-used code paths)
> then I can't debug it. :(

Ok, thanks for clarifying.

> > Not sure how your "console process" works, but this should only happen
> > if you do not enable the serial console (console=ttyMSM0) and then try
> > to use a polled console (as enabling the console will prevent port
> > shutdown from being called).
> 
> That simply doesn't seem to be the case for me. The port shutdown
> seems to be called. To confirm, I put a printout at the start of
> qcom_geni_serial_shutdown(). I see in my /proc/cmdline:
> 
> console=ttyMSM0,115200n8
> 
> ...and I indeed verify that I see console messages on my UART. I then run:
> 
> stop console-ttyMSM0
> 
> ...and I see on the UART:
> 
> [   92.916964] DOUG: qcom_geni_serial_shutdown
> [   92.922703] init: console-ttyMSM0 main process (611) killed by TERM signal
> 
> Console messages keep coming out the UART even though the agetty isn't
> there.

And this is with a Chromium kernel, not mainline? 

If you take a look at tty_port_shutdown() there's a hack in there for
consoles that was added back in 2010 and that prevents shutdown() from
called for console ports.

Put perhaps you manage to hit shutdown() via some other path. Serial
core is not yet using tty_port_hangup() so a hangup might trigger
that...

Could you check that with a dump_stack()?

> Now I (via ssh) drop into the debugger:
> 
> echo g > /proc/sysrq-trigger
> 
> I see the "kgdb" prompt but I can't interact with it because
> qcom_geni_serial_shutdown() stopped RX.

How about simply amending poll_get_char() so that it enables the
receiver if it's not already enabled?

Johan
Doug Anderson Oct. 11, 2024, 2:30 p.m. UTC | #5
Hi,

On Thu, Oct 10, 2024 at 11:51 PM Johan Hovold <johan@kernel.org> wrote:
>
> > > Not sure how your "console process" works, but this should only happen
> > > if you do not enable the serial console (console=ttyMSM0) and then try
> > > to use a polled console (as enabling the console will prevent port
> > > shutdown from being called).
> >
> > That simply doesn't seem to be the case for me. The port shutdown
> > seems to be called. To confirm, I put a printout at the start of
> > qcom_geni_serial_shutdown(). I see in my /proc/cmdline:
> >
> > console=ttyMSM0,115200n8
> >
> > ...and I indeed verify that I see console messages on my UART. I then run:
> >
> > stop console-ttyMSM0
> >
> > ...and I see on the UART:
> >
> > [   92.916964] DOUG: qcom_geni_serial_shutdown
> > [   92.922703] init: console-ttyMSM0 main process (611) killed by TERM signal
> >
> > Console messages keep coming out the UART even though the agetty isn't
> > there.
>
> And this is with a Chromium kernel, not mainline?

Who do you take me for?!?!  :-P :-P :-P Of course it's with mainline.


> If you take a look at tty_port_shutdown() there's a hack in there for
> consoles that was added back in 2010 and that prevents shutdown() from
> called for console ports.
>
> Put perhaps you manage to hit shutdown() via some other path. Serial
> core is not yet using tty_port_hangup() so a hangup might trigger
> that...
>
> Could you check that with a dump_stack()?

Sure. Typed from the agetty itself, here ya go. Git hash is not a
mainline git hash because I have your patches applied. "dirty" is
because of the printout / dump_stack().

lazor-rev9 ~ # stop console-ttyMSM0
[   68.772828] DOUG: qcom_geni_serial_shutdown
[   68.777365] CPU: 2 UID: 0 PID: 589 Comm: login Not tainted
6.12.0-rc1-g0bde0d120d58-dirty #1
ac8ed1a05abcc73f4fafe0543cbc76768ea594e1
[   68.789737] Hardware name: Google Lazor (rev9) with LTE (DT)
[   68.795581] Call trace:
[   68.798124]  dump_backtrace+0xf8/0x120
[   68.802025]  show_stack+0x24/0x38
[   68.805463]  dump_stack_lvl+0x40/0xc8
[   68.809265]  dump_stack+0x18/0x38
[   68.812702]  qcom_geni_serial_shutdown+0x38/0x110
[   68.817578]  uart_port_shutdown+0x48/0x68
[   68.821736]  uart_shutdown+0xcc/0x170
[   68.825530]  uart_hangup+0x54/0x158
[   68.829154]  __tty_hangup+0x20c/0x318
[   68.832954]  tty_vhangup_session+0x20/0x38
[   68.837195]  disassociate_ctty+0xe8/0x1a8
[   68.841355]  do_exit+0x10c/0x358
[   68.844716]  do_group_exit+0x9c/0xa8
[   68.848441]  get_signal+0x408/0x4d8
[   68.852071]  do_signal+0xa8/0x770
[   68.855526]  do_notify_resume+0x78/0x118
[   68.859605]  el0_svc+0x64/0x68
[   68.862790]  el0t_64_sync_handler+0x20/0x128
[   68.867218]  el0t_64_sync+0x1a8/0x1b0
[   68.872933] init: console-ttyMSM0 main process (589) killed by TERM signal


> > Now I (via ssh) drop into the debugger:
> >
> > echo g > /proc/sysrq-trigger
> >
> > I see the "kgdb" prompt but I can't interact with it because
> > qcom_geni_serial_shutdown() stopped RX.
>
> How about simply amending poll_get_char() so that it enables the
> receiver if it's not already enabled?

Yeah, this would probably work.

-Doug
Johan Hovold Oct. 18, 2024, 9:21 a.m. UTC | #6
On Fri, Oct 11, 2024 at 07:30:30AM -0700, Doug Anderson wrote:
> On Thu, Oct 10, 2024 at 11:51 PM Johan Hovold <johan@kernel.org> wrote:
> >
> > > > Not sure how your "console process" works, but this should only happen
> > > > if you do not enable the serial console (console=ttyMSM0) and then try
> > > > to use a polled console (as enabling the console will prevent port
> > > > shutdown from being called).

> > And this is with a Chromium kernel, not mainline?
> 
> Who do you take me for?!?!  :-P :-P :-P Of course it's with mainline.

Heh. Just checking. I was sure about shutdown() not being called when
closing ports, but yeah, you can indeed hit this via hangup() as serial
core was only half-converted over to use the tty port implementation in
2016.

> > If you take a look at tty_port_shutdown() there's a hack in there for
> > consoles that was added back in 2010 and that prevents shutdown() from
> > called for console ports.
> >
> > Put perhaps you manage to hit shutdown() via some other path. Serial
> > core is not yet using tty_port_hangup() so a hangup might trigger
> > that...
> >
> > Could you check that with a dump_stack()?

> lazor-rev9 ~ # stop console-ttyMSM0

> [   68.812702]  qcom_geni_serial_shutdown+0x38/0x110
> [   68.817578]  uart_port_shutdown+0x48/0x68
> [   68.821736]  uart_shutdown+0xcc/0x170
> [   68.825530]  uart_hangup+0x54/0x158
> [   68.829154]  __tty_hangup+0x20c/0x318
> [   68.832954]  tty_vhangup_session+0x20/0x38
> [   68.837195]  disassociate_ctty+0xe8/0x1a8
> [   68.841355]  do_exit+0x10c/0x358
> [   68.844716]  do_group_exit+0x9c/0xa8
> [   68.848441]  get_signal+0x408/0x4d8
> [   68.852071]  do_signal+0xa8/0x770

Thanks for confirming. I see this too when stopping a getty.

> > > Now I (via ssh) drop into the debugger:
> > >
> > > echo g > /proc/sysrq-trigger
> > >
> > > I see the "kgdb" prompt but I can't interact with it because
> > > qcom_geni_serial_shutdown() stopped RX.
> >
> > How about simply amending poll_get_char() so that it enables the
> > receiver if it's not already enabled?
> 
> Yeah, this would probably work.

Seems we should clean up serial core so that it at least behaves
consistently on hangup and close.

Having someone think trough and document how these polled consoles are
supposed to work would also be good and save people modifying these
drivers a lot of work.

If they are restricted to when the console is active, there would be no
need for most of poll_init(), and we already prevent the console from
being shut down on hangup() and close().

And then we now also have the detachable console mess to consider...

Johan
diff mbox series

Patch

diff --git a/drivers/tty/serial/qcom_geni_serial.c b/drivers/tty/serial/qcom_geni_serial.c
index 9ea6bd09e665..b6a8729cee6d 100644
--- a/drivers/tty/serial/qcom_geni_serial.c
+++ b/drivers/tty/serial/qcom_geni_serial.c
@@ -1096,10 +1096,12 @@  static void qcom_geni_serial_shutdown(struct uart_port *uport)
 {
 	disable_irq(uport->irq);
 
+	uart_port_lock_irq(uport);
 	qcom_geni_serial_stop_tx(uport);
 	qcom_geni_serial_stop_rx(uport);
 
 	qcom_geni_serial_cancel_tx_cmd(uport);
+	uart_port_unlock_irq(uport);
 }
 
 static void qcom_geni_serial_flush_buffer(struct uart_port *uport)