diff mbox series

RISC-V reserved memory problems

Message ID f8e67f82-103d-156c-deb0-d6d6e2756f5e@microchip.com
State New
Headers show
Series RISC-V reserved memory problems | expand

Commit Message

Conor Dooley Aug. 16, 2022, 8:41 p.m. UTC
Hey all,
We've run into a bit of a problem with reserved memory on PolarFire, or
more accurately a pair of problems that seem to have opposite fixes.

The first of these problems is triggered when trying to implement a
remoteproc driver. To get the reserved memory buffer, remoteproc
does an of_reserved_mem_lookup(), something like:

	np = of_parse_phandle(pdev->of_node, "memory-region", 0);
	if (!np)
		return -EINVAL;

	rmem = of_reserved_mem_lookup(np);
	if (!rmem)
		return -EINVAL;

of_reserved_mem_lookup() then uses reserved_mem[i].name to try and find
a match - but this was triggering kernel panics for us. We did some
debugging and found that the name string's pointer was pointing to an
address in the 0x4000_0000 range. The minimum reproduction for this
crash is attached - it hacks in some print_reserved_mem()s into
setup_vm_final() around a tlb flush so you can see the before/after.
(You'll need a reserved memory node in your dts to replicate)

The output is like so, with the same crash as in the remoteproc driver:

[    0.000000] Linux version 6.0.0-rc1-00001-g0d9d6953d834 (conor@wendy) (riscv64-unknown-linux-gnu-gcc (g5964b5cd727) 11.1.0, GNU ld (GNU Binutils) 2.37) #1 SMP Tue Aug 16 13:42:09 IST 2022
[    0.000000] OF: fdt: Ignoring memory range 0x80000000 - 0x80200000
[    0.000000] Machine model: Microchip PolarFire-SoC Icicle Kit
[    0.000000] earlycon: ns16550a0 at MMIO32 0x0000000020100000 (options '115200n8')
[    0.000000] printk: bootconsole [ns16550a0] enabled
[    0.000000] printk: debug: skip boot console de-registration.
[    0.000000] efi: UEFI not found.
[    0.000000] before flush
[    0.000000] OF: reserved mem: debug name is fabricbuf@ae000000
[    0.000000] after flush
[    0.000000] Unable to handle kernel paging request at virtual address 00000000401c31ac
[    0.000000] Oops [#1]
[    0.000000] Modules linked in:
[    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.0.0-rc1-00001-g0d9d6953d834 #1
[    0.000000] Hardware name: Microchip PolarFire-SoC Icicle Kit (DT)
[    0.000000] epc : string+0x4a/0xea
[    0.000000]  ra : vsnprintf+0x1e4/0x336
[    0.000000] epc : ffffffff80335ea0 ra : ffffffff80338936 sp : ffffffff81203be0
[    0.000000]  gp : ffffffff812e0a98 tp : ffffffff8120de40 t0 : 0000000000000000
[    0.000000]  t1 : ffffffff81203e28 t2 : 7265736572203a46 s0 : ffffffff81203c20
[    0.000000]  s1 : ffffffff81203e28 a0 : ffffffff81203d22 a1 : 0000000000000000
[    0.000000]  a2 : ffffffff81203d08 a3 : 0000000081203d21 a4 : ffffffffffffffff
[    0.000000]  a5 : 00000000401c31ac a6 : ffff0a00ffffff04 a7 : ffffffffffffffff
[    0.000000]  s2 : ffffffff81203d08 s3 : ffffffff81203d00 s4 : 0000000000000008
[    0.000000]  s5 : ffffffff000000ff s6 : 0000000000ffffff s7 : 00000000ffffff00
[    0.000000]  s8 : ffffffff80d9821a s9 : ffffffff81203d22 s10: 0000000000000002
[    0.000000]  s11: ffffffff80d9821c t3 : ffffffff812f3617 t4 : ffffffff812f3617
[    0.000000]  t5 : ffffffff812f3618 t6 : ffffffff81203d08
[    0.000000] status: 0000000200000100 badaddr: 00000000401c31ac cause: 000000000000000d
[    0.000000] [<ffffffff80338936>] vsnprintf+0x1e4/0x336
[    0.000000] [<ffffffff80055ae2>] vprintk_store+0xf6/0x344
[    0.000000] [<ffffffff80055d86>] vprintk_emit+0x56/0x192
[    0.000000] [<ffffffff80055ed8>] vprintk_default+0x16/0x1e
[    0.000000] [<ffffffff800563d2>] vprintk+0x72/0x80
[    0.000000] [<ffffffff806813b2>] _printk+0x36/0x50
[    0.000000] [<ffffffff8068af48>] print_reserved_mem+0x1c/0x24
[    0.000000] [<ffffffff808057ec>] paging_init+0x528/0x5bc
[    0.000000] [<ffffffff808031ae>] setup_arch+0xd0/0x592
[    0.000000] [<ffffffff8080070e>] start_kernel+0x82/0x73c
[    0.000000] ---[ end trace 0000000000000000 ]---
[    0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
[    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---

We traced this back to early_init_fdt_scan_reserved_mem() in
setup_bootmem() - moving it later back up the boot sequence to
after the dt has been remapped etc has fixed the problem for us.

The least movement to get it working is attached, and also pushed
here: git.kernel.org/conor/c/1735589baefc

The second problem is a bit more complicated to explain - but we
found the solution conflicted with the remoteproc fix as we had
to move early_init_fdt_scan_reserved_mem() _earlier_ in the boot
process to solve this one.

We want to have a node in our devicetree that contains some memory
that is non-cached & marked as reserved-memory. Maybe we have just
missed something, but from what we've seen:
- the really early setup looks at the dtb, picks the highest bit
   of memory and puts the dtb etc there so it can start using it
- early_init_fdt_scan_reserved_mem() is then called, which figures
   out if memory is reserved or not.

Unfortunately, the highest bit of memory is the non-cached bit so
everything falls over, but we can avoid this by moving the call to
early_init_fdt_scan_reserved_mem() above the dtb memblock alloc that
takes place right before it in setup_bootmem().

Obviously, both of these changes are moving the function call in
opposite directions and we can only really do one of them. We are not
sure if what we are doing with the non-cached reserved-memory section
is just not permitted & cannot work - or if this is something that
was overlooked for RISC-V specifically and works for other archs.

It does seem like the first issue is a real bug, and I am happy to
submit the patch for that whenever - but having two problems with
opposite fixes seemed as if there was something else lurking that we
just don't have enough understanding to detect.

Any help would be great!

Thanks,
Conor.

Comments

Mike Rapoport March 7, 2023, 11:35 a.m. UTC | #1
Hi Conor,

Sorry for the delay, somehow this slipped between the cracks.

On Thu, Feb 02, 2023 at 10:01:26PM +0000, Conor Dooley wrote:
> Hullo Palmer, Mike & whoever else may read this,
> 
> Just reviving this thread from a little while ago as I have been in the
> area again recently...

TBH, I didn't really dig deep into the issues, but the thought I had was
what if DT was mapped via fixmap until the setup_vm_final() and then it
would be possible to call DT methods early.

Could be I'm shooting in the dark :)
 
> On Tue, Aug 16, 2022 at 08:41:05PM +0000, Conor.Dooley@microchip.com wrote:
> > Hey all,
> > We've run into a bit of a problem with reserved memory on PolarFire, or
> > more accurately a pair of problems that seem to have opposite fixes.
> > 
> > The first of these problems is triggered when trying to implement a
> > remoteproc driver. To get the reserved memory buffer, remoteproc
> > does an of_reserved_mem_lookup(), something like:
> > 
> > 	np = of_parse_phandle(pdev->of_node, "memory-region", 0);
> > 	if (!np)
> > 		return -EINVAL;
> > 
> > 	rmem = of_reserved_mem_lookup(np);
> > 	if (!rmem)
> > 		return -EINVAL;
> > 
> > of_reserved_mem_lookup() then uses reserved_mem[i].name to try and find
> > a match - but this was triggering kernel panics for us. We did some
> > debugging and found that the name string's pointer was pointing to an
> > address in the 0x4000_0000 range. The minimum reproduction for this
> > crash is attached - it hacks in some print_reserved_mem()s into
> > setup_vm_final() around a tlb flush so you can see the before/after.
> > (You'll need a reserved memory node in your dts to replicate)
> > 
> > The output is like so, with the same crash as in the remoteproc driver:
> > 
> > [    0.000000] Linux version 6.0.0-rc1-00001-g0d9d6953d834 (conor@wendy) (riscv64-unknown-linux-gnu-gcc (g5964b5cd727) 11.1.0, GNU ld (GNU Binutils) 2.37) #1 SMP Tue Aug 16 13:42:09 IST 2022
> 
> [...]
> 
> > [    0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
> > 
> > We traced this back to early_init_fdt_scan_reserved_mem() in
> > setup_bootmem() - moving it later back up the boot sequence to
> > after the dt has been remapped etc has fixed the problem for us.
> > 
> > The least movement to get it working is attached, and also pushed
> > here: git.kernel.org/conor/c/1735589baefc
> 
> This one is fixed now, as of commit 50e63dd8ed92 ("riscv: fix reserved
> memory setup").
> 
> > The second problem is a bit more complicated to explain - but we
> > found the solution conflicted with the remoteproc fix as we had
> > to move early_init_fdt_scan_reserved_mem() _earlier_ in the boot
> > process to solve this one.
> > 
> > We want to have a node in our devicetree that contains some memory
> > that is non-cached & marked as reserved-memory. Maybe we have just
> > missed something, but from what we've seen:
> > - the really early setup looks at the dtb, picks the highest bit
> >    of memory and puts the dtb etc there so it can start using it
> > - early_init_fdt_scan_reserved_mem() is then called, which figures
> >    out if memory is reserved or not.
> > 
> > Unfortunately, the highest bit of memory is the non-cached bit so
> > everything falls over, but we can avoid this by moving the call to
> > early_init_fdt_scan_reserved_mem() above the dtb memblock alloc that
> > takes place right before it in setup_bootmem().
> > 
> > Obviously, both of these changes are moving the function call in
> > opposite directions and we can only really do one of them. We are not
> > sure if what we are doing with the non-cached reserved-memory section
> > is just not permitted & cannot work - or if this is something that
> > was overlooked for RISC-V specifically and works for other archs.
> 
> We ended up working around this one by making sure that U-Boot loaded
> the dtb to somewhere that would be inside the kernel's memory map, thus
> avoiding the remapping in the first place.
> 
> We did run into another problem recently though, and 50e63dd8ed92 is
> kinda at fault for it.
> This particular issue was encountered with a devicetree where the
> top-most memory region was entirely reserved & was not observed prior
> to my fix for the first issue.
> 
> On RISC-V, the boot sequence is something like:
> 	setup_bootmem();
> 	setup_vm_final();
> 	unflatten_device_tree();
> 	early_init_fdt_scan_reserved_mem();
> 
> Whereas, before my patch it used to be (give-or-take):
> 	setup_bootmem();
> 	early_init_fdt_scan_reserved_mem();
> 	setup_vm_final();
> 	unflatten_device_tree();
> 
> The difference being that we used to have scanned the reserved memory
> regions before calling setup_vm_final() & therefore know which regions
> we cannot use. As a reminder, calling early_init_fdt_scan_reserved_mem()
> before we've got the dt in a proper virtual memory address will cause
> the kernel to panic if it tries to read a reserved memory node's label.
> 
> As we are now calling setup_vm_final() *before* we know what the
> reserved memory regions are & as RISC-V allocates memblocks from the top
> down, the allocations in setup_vm_final() will be done in the highest
> memory region.
> When early_init_fdt_scan_reserved_mem() then tries to reserve the
> entirety of that top-most memory region, the reservation fails as part
> of this region has already been allocated.
> In the scenario where I found this bug, that top-most region is non-
> cached memory & the kernel ends up panicking.
> The memblock debug code made this pretty easy to spot, otherwise I'd
> probably have spent more than just a few hours trying to figure out why
> it was panicking!
> 
> My "this needs to be fixed today" solution for this problem was calling
> memblock_set_bottom_up(true) in setup_bootmem() & that's what we are
> going to carry downstream for now.
> 
> I haven't tested it (yet) but I suspect that it would also fix our
> problem of the dtb being remapped into a non-cached region of memory
> that we would later go on to reserve too. Non-cached being an issue
> mainly due to the panicking, but failing to reserve (and using!) memory
> regions that are meant to be reserved is very far from ideal even when
> they are memory that the kernel can actually use.
> 
> I have no idea if that is an acceptable solution for upstream though, so
> I guess this is me putting out feelers as to whether this is something I
> should send a patch to do *OR* if this is another sign of the issues
> that you (Mike, Palmer) mentioned in the past.
> If it isn't an acceptable solution, I'm not really too sure how to
> proceed!
> 
> Cheers,
> Conor.
>
Conor Dooley March 20, 2023, 12:11 p.m. UTC | #2
On Thu, Mar 09, 2023 at 04:12:57PM +0100, Alexandre Ghiti wrote:
> On 3/9/23 13:51, Conor Dooley wrote:
> > On Thu, Mar 09, 2023 at 01:45:05PM +0100, Alexandre Ghiti wrote:
> > > On 3/7/23 12:35, Mike Rapoport wrote:
> > > > Hi Conor,
> > > > 
> > > > Sorry for the delay, somehow this slipped between the cracks.
> > > > 
> > > > On Thu, Feb 02, 2023 at 10:01:26PM +0000, Conor Dooley wrote:
> > > > > Hullo Palmer, Mike & whoever else may read this,
> > > > > 
> > > > > Just reviving this thread from a little while ago as I have been in the
> > > > > area again recently...
> > > > TBH, I didn't really dig deep into the issues, but the thought I had was
> > > > what if DT was mapped via fixmap until the setup_vm_final() and then it
> > > > would be possible to call DT methods early.
> > > > 
> > > > Could be I'm shooting in the dark :)
> > > 
> > > I think I understand the issue now, it's because In riscv, we establish 2
> > > different virtual mappings and we map the device tree at 2 different virtual
> > > addresses, which is the problem.
> > > 
> > > So to me, the solution is:
> > > 
> > > - to revert your previous fix, that is calling
> > > early_init_fdt_scan_reserved_mem() before any call to memblock_alloc()
> > > (which could result in an allocation in the area you want to reserve)
> > > 
> > > - to map the device tree at the same virtual address, because
> > > early_init_fdt_scan_reserved_mem() initializes reserved_mem with the dtb
> > > mapping established in setup_vm() and uses reserved_mem with the new mapping
> > > from setup_vm_final (which is what Mike proposes, we should use the fixmap
> > > region to have the same virtual addresses)
> > > 
> > > Hope that makes sense: I'll come up with something this afternoon for you to
> > > test!
> > Sounds good. Please give me some ELI5 commit messages if you can,
> > explanations for this stuff (which I found took a lot of archaeology to
> > understand) would be very welcome next time we need to go back looking
> > at this stuff.
> 
> 
> Can you give it a try here:
> https://github.com/AlexGhiti/riscv-linux/commits/dev/alex/conor_dtb_fixmap_v1
> ?
> 
> That works for me but I need to carefully explain and check that's correct
> though, not upstreamable as is.

Hey Alex,

So I ended up being pretty sick & had to take a week off. I gave this an
initial spin today & it appears to work.
I'll take it for a longer test-drive when you send a "real" patch for
it, but I tested both the lookup by name & the situation that was
allocating in reserved memory and both were not an issue.

Thanks for working on this,
Conor.
diff mbox series

Patch

From 14447dc618a3007bf17dd27c03f7fec095efbc6f Mon Sep 17 00:00:00 2001
From: Valentina Fernandez <valentina.fernandezalanis@microchip.com>
Date: Wed, 13 Jul 2022 10:56:47 +0100
Subject: [PATCH] Debug tlb_flush with reserved memory

---
 arch/riscv/boot/dts/microchip/Makefile        |   1 +
 .../microchip/mpfs-icicle-kit-context-a.dts   | 203 ++++++++++++++++++
 arch/riscv/mm/init.c                          |   9 +-
 drivers/of/of_reserved_mem.c                  |   6 +
 include/linux/of_reserved_mem.h               |   2 +
 5 files changed, 220 insertions(+), 1 deletion(-)
 create mode 100644 arch/riscv/boot/dts/microchip/mpfs-icicle-kit-context-a.dts

diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index d466ec670e1f..5458519c3eb4 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -20,7 +20,7 @@ 
 #include <linux/dma-map-ops.h>
 #include <linux/crash_dump.h>
 #include <linux/hugetlb.h>
-
+#include <linux/of_reserved_mem.h>
 #include <asm/fixmap.h>
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
@@ -1112,8 +1112,15 @@  static void __init setup_vm_final(void)
 
 	/* Move to swapper page table */
 	csr_write(CSR_SATP, PFN_DOWN(__pa_symbol(swapper_pg_dir)) | satp_mode);
+
+	pr_err("before flush\n");
+	print_reserved_mem();
+
 	local_flush_tlb_all();
 
+	pr_err("after flush\n");
+	print_reserved_mem();
+
 	pt_ops_set_late();
 }
 #else
diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 75caa6f5d36f..6aaebb05730d 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -445,3 +445,9 @@  struct reserved_mem *of_reserved_mem_lookup(struct device_node *np)
 	return NULL;
 }
 EXPORT_SYMBOL_GPL(of_reserved_mem_lookup);
+
+void print_reserved_mem(void)
+{
+	pr_err("debug name is %s\n", reserved_mem[0].name);
+}
+EXPORT_SYMBOL_GPL(print_reserved_mem);
\ No newline at end of file
diff --git a/include/linux/of_reserved_mem.h b/include/linux/of_reserved_mem.h
index 4de2a24cadc9..1fe504af3944 100644
--- a/include/linux/of_reserved_mem.h
+++ b/include/linux/of_reserved_mem.h
@@ -40,6 +40,8 @@  int of_reserved_mem_device_init_by_name(struct device *dev,
 void of_reserved_mem_device_release(struct device *dev);
 
 struct reserved_mem *of_reserved_mem_lookup(struct device_node *np);
+
+void print_reserved_mem(void);
 #else
 
 #define RESERVEDMEM_OF_DECLARE(name, compat, init)			\
-- 
2.25.1