diff mbox series

Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1)

Message ID 20221010141630.zfzi7mk7zvnmclzy@techsingularity.net
State New
Headers show
Series Intermittent boot failure after 6492fed7d8c9 (v6.0-rc1) | expand

Commit Message

Mel Gorman Oct. 10, 2022, 2:16 p.m. UTC
Hi Rafael,

I'm seeing intermittent boot failures after 6492fed7d8c9 ("rtc: rtc-cmos:
Do not check ACPI_FADT_LOW_POWER_S0") due to a NULL pointer exception
early in boot. It fails to boot 5 times after 10 boot attempts and I've
only observed it on one machine so far. Either a revert or the patch below
fixes it but it's unlikely it is the correct fix.


Boot failure looks like the below, it's not a vanilla kernel but the
applied patch is not relevant and it's known to fail on a vanilla kernel.
The machine has a E5-2698 v4 CPU plugged into a SGI C2112-4GP3 platform
with a X10DRT-P-Series motherboard.

[   10.924167][    C1] BUG: kernel NULL pointer dereference, address: 0000000000000000
[   10.928016][    C1] #PF: supervisor read access in kernel mode
[   10.928016][    C1] #PF: error_code(0x0000) - not-present page
[   10.928016][    C1] PGD 0 P4D 0 
[   10.928016][    C1] Oops: 0000 [#1] PREEMPT SMP PTI
[   10.928016][    C1] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 6.0.0-mm-pcpnoirq-v1r2 #1 6debc4647ebcbe3e91270f1109aebc1e85510e3e
[   10.928016][    C1] Hardware name: SGI.COM C2112-4GP3/X10DRT-P-Series, BIOS 2.0a 05/09/2016
[   10.928016][    C1] RIP: 0010:rtc_handler+0x73/0xd0
[   10.928016][    C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
[   10.928016][    C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
[   10.928016][    C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
[   10.928016][    C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
[   10.928016][    C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
[   10.928016][    C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
[   10.928016][    C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
[   10.928016][    C1] FS:  0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
[   10.928016][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.928016][    C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
[   10.928016][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   10.928016][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   10.928016][    C1] Call Trace:
[   10.928016][    C1]  <IRQ>
[   10.928016][    C1]  acpi_ev_fixed_event_detect+0x14a/0x18c
[   10.928016][    C1]  acpi_ev_sci_xrupt_handler+0x2c/0x6e
[   10.928016][    C1]  acpi_irq+0x18/0x40
[   10.928016][    C1]  __handle_irq_event_percpu+0x3e/0x2d0
[   10.928016][    C1]  handle_irq_event_percpu+0xf/0x40
[   10.928016][    C1]  handle_irq_event+0x34/0x60
[   10.928016][    C1]  handle_fasteoi_irq+0x7b/0x140
[   10.928016][    C1]  __common_interrupt+0x4b/0x100
[   10.928016][    C1]  common_interrupt+0x58/0xa0
[   10.928016][    C1]  </IRQ>
[   10.928016][    C1]  <TASK>
[   10.928016][    C1]  asm_common_interrupt+0x22/0x40
[   10.928016][    C1] RIP: 0010:cmos_wake_setup.part.9+0x2f/0x120
[   10.928016][    C1] Code: 80 3d 65 16 4a 01 00 53 48 89 fb 0f 84 a5 00 00 00 48 89 da 48 c7 c6 00 0c 6d ad bf 04 00 00 00 e8 53 b8 e7 ff bf 04 00 00 00 <e8> 98 c6 e7 ff 31 f6 bf 04 00 00 00 e8 fd c8 e7 ff 0f b6 0d 34 ce
[   10.928016][    C1] RSP: 0000:ffffaf7f800d7ca8 EFLAGS: 00000246
[   10.928016][    C1] RAX: 0000000000000000 RBX: ffff94049801a000 RCX: 0000000000000004
[   10.928016][    C1] RDX: ffffffffadefef10 RSI: ffffffffadefee20 RDI: 0000000000000004
[   10.928016][    C1] RBP: ffffffffaeaf98a0 R08: 0000000000000000 R09: 0000000000000000
[   10.928016][    C1] R10: 0000000000000000 R11: 000000000000000a R12: ffffffffad6d1750
[   10.928016][    C1] R13: 0000000000000000 R14: ffff93c5111191a0 R15: ffffffffaefe47f8
[   10.928016][    C1]  ? rdinit_setup+0x2f/0x2f
[   10.928016][    C1]  ? cmos_do_probe+0x570/0x570
[   10.928016][    C1]  ? cmos_wake_setup.part.9+0x2a/0x120
[   10.928016][    C1]  cmos_pnp_probe+0x6c/0xa0
[   10.928016][    C1]  pnp_device_probe+0x5b/0xb0
[   10.928016][    C1]  ? driver_sysfs_add+0x75/0xe0
[   10.928016][    C1]  really_probe+0x109/0x3e0
[   10.928016][    C1]  ? pm_runtime_barrier+0x4f/0xa0
[   10.928016][    C1]  __driver_probe_device+0x79/0x170
[   10.928016][    C1]  driver_probe_device+0x1f/0xa0
[   10.928016][    C1]  __driver_attach+0x11e/0x180
[   10.928016][    C1]  ? __device_attach_driver+0x110/0x110
[   10.928016][    C1]  bus_for_each_dev+0x79/0xc0
[   10.928016][    C1]  bus_add_driver+0x1ba/0x250
[   10.928016][    C1]  ? rtc_dev_init+0x34/0x34
[   10.928016][    C1]  driver_register+0x5f/0x100
[   10.928016][    C1]  ? rtc_dev_init+0x34/0x34
[   10.928016][    C1]  cmos_init+0x12/0x70
[   10.928016][    C1]  do_one_initcall+0x5b/0x310
[   10.928016][    C1]  ? rcu_read_lock_held_common+0xe/0x50
[   10.928016][    C1]  ? rcu_read_lock_sched_held+0x23/0x80
[   10.928016][    C1]  kernel_init_freeable+0x2b7/0x319
[   10.928016][    C1]  ? rest_init+0x1b0/0x1b0
[   10.928016][    C1]  kernel_init+0x16/0x140
[   10.928016][    C1]  ret_from_fork+0x22/0x30
[   10.928016][    C1]  </TASK>
[   10.928016][    C1] Modules linked in:
[   10.928016][    C1] CR2: 0000000000000000
[   10.928016][    C1] ---[ end trace 0000000000000000 ]---
[   10.928016][    C1] RIP: 0010:rtc_handler+0x73/0xd0
[   10.928016][    C1] Code: df e8 41 62 f9 ff bf 04 00 00 00 e8 a3 bf e7 ff 31 f6 bf 04 00 00 00 e8 08 c2 e7 ff b8 01 00 00 00 5b 5d 41 5c c3 cc cc cc cc <48> 8b 75 00 31 ff e8 72 fe ff ff eb c0 bf 0b 00 00 00 e8 56 81 77
[   10.928016][    C1] RSP: 0000:ffffaf7f8003eec0 EFLAGS: 00010002
[   10.928016][    C1] RAX: ffffffffad6d0c00 RBX: ffff94049801a000 RCX: 0000000000000000
[   10.928016][    C1] RDX: 0000000000000040 RSI: ffffffffadf00460 RDI: ffff94049801a000
[   10.928016][    C1] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000000004d0
[   10.928016][    C1] R10: 0000000000000000 R11: ffffaf7f8003eff8 R12: 0000000000000000
[   10.928016][    C1] R13: ffffffffae228d82 R14: 0000000000000004 R15: 0000000000000000
[   10.928016][    C1] FS:  0000000000000000(0000) GS:ffff94037ea80000(0000) knlGS:0000000000000000
[   10.928016][    C1] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   10.928016][    C1] CR2: 0000000000000000 CR3: 00000002c7e26001 CR4: 00000000003706e0
[   10.928016][    C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   10.928016][    C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   10.928016][    C1] Kernel panic - not syncing: Fatal exception in interrupt
[   10.928016][    C1] Kernel Offset: 0x2be00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[   10.928016][    C1] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

Comments

Mathieu Chouquet-Stringer Dec. 12, 2022, 6:24 p.m. UTC | #1
Hello Rafael,

On Tue, Oct 11, 2022 at 10:20:50AM +0100, Mel Gorman wrote:
> On Mon, Oct 10, 2022 at 08:29:05PM +0200, Rafael J. Wysocki wrote:
> > > That's less than the previous 5/10 failures but I
> > > cannot be certain it helped without running a lot more boot tests. The
> > > failure happens in the same function as before.
> > 
> > I've overlooked the fact that acpi_install_fixed_event_handler()
> > enables the event on success, so it is a bug to call it when the
> > handler is not ready.
> > 
> > It should help to only enable the event after running cmos_do_probe()
> > where the driver data pointer is set, so please try the attached
> > patch.

I'm hitting this issue on the 6.0 stable releases (aka 6.0.y) and
looking at the stable tree I see this hasn't been merged... I just got
bitten by this on 6.0.12.

Greg, if Rafael agrees, I think you should apply 4919d3eb2ec0 and
0782b66ed2fb to the 6.0.y tree.

Thank you in advance.

Cheers,
Rafael J. Wysocki Dec. 12, 2022, 6:34 p.m. UTC | #2
On Mon, Dec 12, 2022 at 7:25 PM Mathieu Chouquet-Stringer
<me@mathieu.digital> wrote:
>
>         Hello Rafael,
>
> On Tue, Oct 11, 2022 at 10:20:50AM +0100, Mel Gorman wrote:
> > On Mon, Oct 10, 2022 at 08:29:05PM +0200, Rafael J. Wysocki wrote:
> > > > That's less than the previous 5/10 failures but I
> > > > cannot be certain it helped without running a lot more boot tests. The
> > > > failure happens in the same function as before.
> > >
> > > I've overlooked the fact that acpi_install_fixed_event_handler()
> > > enables the event on success, so it is a bug to call it when the
> > > handler is not ready.
> > >
> > > It should help to only enable the event after running cmos_do_probe()
> > > where the driver data pointer is set, so please try the attached
> > > patch.
>
> I'm hitting this issue on the 6.0 stable releases (aka 6.0.y) and
> looking at the stable tree I see this hasn't been merged... I just got
> bitten by this on 6.0.12.
>
> Greg, if Rafael agrees, I think you should apply 4919d3eb2ec0 and
> 0782b66ed2fb to the 6.0.y tree.

This is fine with me, please send an inclusion request to Greg and the
"stable" list.
diff mbox series

Patch

--- drivers/rtc/rtc-cmos.c.orig	2022-10-10 15:11:50.335756567 +0200
+++ drivers/rtc/rtc-cmos.c	2022-10-10 15:11:53.211756691 +0200
@@ -1209,7 +1209,7 @@ 
 	 * Or else, ACPI SCI is enabled during suspend/resume only,
 	 * update rtc irq in that case.
 	 */
-	if (cmos_use_acpi_alarm())
+	if (cmos_use_acpi_alarm() && cmos)
 		cmos_interrupt(0, (void *)cmos->rtc);
 	else {
 		/* Fix me: can we use cmos_interrupt() here as well? */