[2/2] arm64: mm: enable CONFIG_HOLES_IN_ZONE for NUMA

Message ID 1481706707-6211-3-git-send-email-ard.biesheuvel@linaro.org
State New

Commit Message

Ard Biesheuvel Dec. 14, 2016, 9:11 a.m. UTC
The NUMA code may get confused by the presence of NOMAP regions within
zones, resulting in spurious BUG() checks where the node id deviates
from the containing zone's node id.

Since the kernel has no business reasoning about node ids of pages it
does not own in the first place, enable CONFIG_HOLES_IN_ZONE to ensure
that such pages are disregarded.
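
For readers unfamiliar with the option, its effect can be sketched in plain C. In the generic mm code of this era, pfn_valid_within(pfn) expands to pfn_valid(pfn) when CONFIG_HOLES_IN_ZONE is set and to a constant 1 otherwise. The userspace model below is an illustration only: toy_pfn_valid(), the hole at pfns 4 and 5, and pages_inspected() are hypothetical names and values, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the arch pfn_valid(): pfns 4 and 5 form a NOMAP hole. */
static bool toy_pfn_valid(unsigned long pfn)
{
	return pfn < 4 || pfn > 5;
}

/*
 * Models pfn_valid_within(): a real check with CONFIG_HOLES_IN_ZONE,
 * a constant "true" without it.
 */
static bool toy_pfn_valid_within(unsigned long pfn, bool holes_in_zone)
{
	return holes_in_zone ? toy_pfn_valid(pfn) : true;
}

/* Count the pages a zone walker would inspect in [start, end). */
size_t pages_inspected(unsigned long start, unsigned long end, bool holes_in_zone)
{
	size_t inspected = 0;

	for (unsigned long pfn = start; pfn < end; pfn++) {
		if (!toy_pfn_valid_within(pfn, holes_in_zone))
			continue;	/* hole: its node id is never looked at */
		inspected++;
	}
	return inspected;
}

/* Sanity check for the toy zone covering pfns 0..7. */
void pages_inspected_selfcheck(void)
{
	assert(pages_inspected(0, 8, false) == 8);	/* hole pages are walked too */
	assert(pages_inspected(0, 8, true) == 6);	/* hole pages are skipped */
}
```

Without the option the walker touches the struct page of every pfn in the zone, including those in the hole, which is where a node id mismatch can trip a check.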

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

---
 arch/arm64/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

-- 
2.7.4

Comments

Robert Richter Dec. 15, 2016, 3:39 p.m. UTC | #1
I was going to do some measurements but my kernel crashes now with a
page fault in efi_rtc_probe():

[   21.663393] Unable to handle kernel paging request at virtual address 20251000
[   21.663396] pgd = ffff000009090000
[   21.663401] [20251000] *pgd=0000010ffff90003
[   21.663402] , *pud=0000010ffff90003
[   21.663404] , *pmd=0000000fdc030003
[   21.663405] , *pte=00e8832000250707

The sparsemem config requires the whole section to be initialized.
Your patches do not address this.

On 14.12.16 09:11:47, Ard Biesheuvel wrote:
> +config HOLES_IN_ZONE

> +	def_bool y

> +	depends on NUMA


This enables pfn_valid_within() for arm64 and causes the check for
each page of a section. The arm64 implementation of pfn_valid() is
already expensive (traversing memblock areas). Now, this is increased
by a factor of 2^18 for 4k page size (16384 for 64k). We need to
initialize the whole section to avoid that.
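
The factors quoted above follow from the section size. A minimal sketch of the arithmetic, assuming 1 GiB sparsemem sections (SECTION_SIZE_BITS of 30, which is what arm64 is believed to have used at the time); the macro and function names are made up for illustration.

```c
#include <assert.h>

/* Assumption: 1 GiB sparsemem sections, i.e. SECTION_SIZE_BITS == 30. */
#define TOY_SECTION_SIZE_BITS 30

/* Number of pages, and hence of per-page pfn_valid() calls, in one section. */
unsigned long pages_per_section(unsigned int page_shift)
{
	assert(page_shift <= TOY_SECTION_SIZE_BITS);
	return 1UL << (TOY_SECTION_SIZE_BITS - page_shift);
}

void pages_per_section_selfcheck(void)
{
	assert(pages_per_section(12) == 262144UL);	/* 4k pages: 2^18 */
	assert(pages_per_section(16) == 16384UL);	/* 64k pages: 2^14 */
}
```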

-Robert

[   21.663393] Unable to handle kernel paging request at virtual address 20251000
[   21.663396] pgd = ffff000009090000
[   21.663401] [20251000] *pgd=0000010ffff90003
[   21.663402] , *pud=0000010ffff90003
[   21.663404] , *pmd=0000000fdc030003
[   21.663405] , *pte=00e8832000250707
[   21.663405] 
[   21.663411] Internal error: Oops: 96000047 [#1] SMP
[   21.663416] Modules linked in:
[   21.663425] CPU: 49 PID: 1 Comm: swapper/0 Tainted: G        W       4.9.0.0.vanilla10-00002-g429605e9ab0a #1
[   21.663426] Hardware name: www.cavium.com ThunderX CRB-2S/ThunderX CRB-2S, BIOS 0.3 Sep 13 2016
[   21.663429] task: ffff800feee6bc00 task.stack: ffff800fec050000
[   21.663433] PC is at 0x201ff820
[   21.663434] LR is at 0x201fdfc0
[   21.663435] pc : [<00000000201ff820>] lr : [<00000000201fdfc0>] pstate: 20000045
[   21.663437] sp : ffff800fec053b70
[   21.663440] x29: ffff800fec053bc0 x28: 0000000000000000 
[   21.663443] x27: ffff000008ce3e08 x26: ffff000008c52568 
[   21.663445] x25: ffff000008bf045c x24: ffff000008bdb828 
[   21.663448] x23: 0000000000000000 x22: 0000000000000040 
[   21.663451] x21: ffff800fec053bb8 x20: 0000000020251000 
[   21.663453] x19: ffff800fec053c20 x18: 0000000000000000 
[   21.663456] x17: 0000000000000000 x16: 00000000bbb67a65 
[   21.663459] x15: ffffffffffffffff x14: ffff810016ea291c 
[   21.663461] x13: ffff810016ea2181 x12: 0000000000000030 
[   21.663464] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f 
[   21.663467] x9 : feff716475687163 x8 : ffffffffffffffff 
[   21.663469] x7 : 83f0680000000000 x6 : 0000000000000000 
[   21.663472] x5 : ffff800fc187aab9 x4 : 0002000000000000 
[   21.663474] x3 : ffff800fec053bb8 x2 : 0000000000000000 
[   21.663477] x1 : 83f0680000000000 x0 : 0000000020251000 
[   21.663478] 
[   21.663479] Process swapper/0 (pid: 1, stack limit = 0xffff800fec050020)
...
[   21.663605] [<00000000201ff820>] 0x201ff820
[   21.663617] [<ffff000008c3eef4>] efi_rtc_probe+0x24/0x78
[   21.663625] [<ffff000008586c88>] platform_drv_probe+0x60/0xc8
[   21.663636] [<ffff0000085845d4>] driver_probe_device+0x26c/0x420
[   21.663639] [<ffff0000085848ac>] __driver_attach+0x124/0x128
[   21.663642] [<ffff000008581e08>] bus_for_each_dev+0x70/0xb0
[   21.663644] [<ffff000008583c30>] driver_attach+0x30/0x40
[   21.663647] [<ffff000008583668>] bus_add_driver+0x200/0x2b8
[   21.663650] [<ffff000008585430>] driver_register+0x68/0x100
[   21.663652] [<ffff000008586e3c>] __platform_driver_probe+0x84/0x128
[   21.663654] [<ffff000008c3eec8>] efi_rtc_driver_init+0x20/0x28
[   21.663658] [<ffff000008082d94>] do_one_initcall+0x44/0x138
[   21.663665] [<ffff000008bf0d0c>] kernel_init_freeable+0x1ac/0x24c
[   21.663673] [<ffff00000885e7a0>] kernel_init+0x18/0x110
[   21.663675] [<ffff000008082b30>] ret_from_fork+0x10/0x20
[   21.663679] Code: f9400000 d5033d9f d65f03c0 d5033e9f (f9000001) 
[   21.663688] ---[ end trace e420ef9636e3c9b2 ]---
[   21.663711] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[   21.663711] 
[   21.663713] SMP: stopping secondary CPUs
[   21.670234] Kernel Offset: disabled
[   21.670235] Memory Limit: none
[   22.681333] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
Ard Biesheuvel Dec. 15, 2016, 4:07 p.m. UTC | #2
On 15 December 2016 at 15:39, Robert Richter <robert.richter@cavium.com> wrote:
> I was going to do some measurements but my kernel crashes now with a
> page fault in efi_rtc_probe():
>
> [   21.663393] Unable to handle kernel paging request at virtual address 20251000
> [   21.663396] pgd = ffff000009090000
> [   21.663401] [20251000] *pgd=0000010ffff90003
> [   21.663402] , *pud=0000010ffff90003
> [   21.663404] , *pmd=0000000fdc030003
> [   21.663405] , *pte=00e8832000250707
>
> The sparsemem config requires the whole section to be initialized.
> Your patches do not address this.
>


96000047 is a third level translation fault, and the PTE address has
RES0 bits set. I don't see how this is related to sparsemem, could you
explain?
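
The decoding above can be checked by hand. A minimal sketch, assuming the ESR_ELx field layout from the Arm Architecture Reference Manual (EC in bits [31:26], IL in bit 25, and for data aborts WnR in bit 6 and DFSC in bits [5:0]), and assuming a 4k granule with 48-bit output addresses, so that descriptor bits [51:48] are RES0. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ESR_ELx fields (layout assumed from the Arm ARM). */
static unsigned int esr_ec(uint32_t esr)   { return (esr >> 26) & 0x3f; } /* exception class */
static unsigned int esr_il(uint32_t esr)   { return (esr >> 25) & 0x1; }  /* 32-bit instruction */
static unsigned int esr_wnr(uint32_t esr)  { return (esr >> 6) & 0x1; }   /* 1 = write access */
static unsigned int esr_dfsc(uint32_t esr) { return esr & 0x3f; }         /* data fault status */

/* Assumption: 4k granule, 48-bit output address, so bits [51:48] must be zero. */
static unsigned int pte_res0_bits(uint64_t pte)
{
	return (unsigned int)((pte >> 48) & 0xf);
}

void oops_decode_selfcheck(void)
{
	const uint32_t esr = 0x96000047;		/* "Oops: 96000047" */
	const uint64_t pte = 0x00e8832000250707ULL;	/* "*pte=" from the log */

	assert(esr_ec(esr) == 0x25);	/* data abort taken from the current EL */
	assert(esr_il(esr) == 1);
	assert(esr_wnr(esr) == 1);	/* caused by a write */
	assert(esr_dfsc(esr) == 0x07);	/* translation fault, level 3 */
	assert(pte_res0_bits(pte) == 0x8);	/* bit 51 is set */
}
```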

> On 14.12.16 09:11:47, Ard Biesheuvel wrote:
>> +config HOLES_IN_ZONE
>> +     def_bool y
>> +     depends on NUMA
>
> This enables pfn_valid_within() for arm64 and causes the check for
> each page of a section. The arm64 implementation of pfn_valid() is
> already expensive (traversing memblock areas). Now, this is increased
> by a factor of 2^18 for 4k page size (16384 for 64k). We need to
> initialize the whole section to avoid that.
>


I know that. But if you want something for -stable, we should have
something that is correct first, and only then care about the
performance hit (if there is one)
Hanjun Guo Dec. 16, 2016, 1:57 a.m. UTC | #3
Hi Robert,

On 2016/12/15 23:39, Robert Richter wrote:
> I was going to do some measurements but my kernel crashes now with a
> page fault in efi_rtc_probe():
>
> [   21.663393] Unable to handle kernel paging request at virtual address 20251000
> [   21.663396] pgd = ffff000009090000
> [   21.663401] [20251000] *pgd=0000010ffff90003
> [   21.663402] , *pud=0000010ffff90003
> [   21.663404] , *pmd=0000000fdc030003
> [   21.663405] , *pte=00e8832000250707
>
> The sparsemem config requires the whole section to be initialized.
> Your patches do not address this.


This patch set runs properly on D05: both the boot and the LTP MM
stress test are OK. So it seems the firmware's memory map
configuration differs between our systems. Just a stupid question:
which part is related to this problem? Is it only the Reserved memory?

Thanks
Hanjun
Robert Richter Dec. 16, 2016, 5:10 p.m. UTC | #4
On 15.12.16 16:07:26, Ard Biesheuvel wrote:
> On 15 December 2016 at 15:39, Robert Richter <robert.richter@cavium.com> wrote:
> > I was going to do some measurements but my kernel crashes now with a
> > page fault in efi_rtc_probe():
> >
> > [   21.663393] Unable to handle kernel paging request at virtual address 20251000
> > [   21.663396] pgd = ffff000009090000
> > [   21.663401] [20251000] *pgd=0000010ffff90003
> > [   21.663402] , *pud=0000010ffff90003
> > [   21.663404] , *pmd=0000000fdc030003
> > [   21.663405] , *pte=00e8832000250707
> >
> > The sparsemem config requires the whole section to be initialized.
> > Your patches do not address this.
> >
>
> 96000047 is a third level translation fault, and the PTE address has
> RES0 bits set. I don't see how this is related to sparsemem, could you
> explain?


When initializing the whole section it works. Maybe it uncovers
another bug. Did not yet start debugging this.

>
> > On 14.12.16 09:11:47, Ard Biesheuvel wrote:
> >> +config HOLES_IN_ZONE
> >> +     def_bool y
> >> +     depends on NUMA
> >
> > This enables pfn_valid_within() for arm64 and causes the check for
> > each page of a section. The arm64 implementation of pfn_valid() is
> > already expensive (traversing memblock areas). Now, this is increased
> > by a factor of 2^18 for 4k page size (16384 for 64k). We need to
> > initialize the whole section to avoid that.
> >
>
> I know that. But if you want something for -stable, we should have
> something that is correct first, and only then care about the
> performance hit (if there is one)


I would prefer to check for a performance penalty *before* we put it
into stable. There is no risk at all with the patch I am proposing.
See: https://lkml.org/lkml/2016/12/16/412

-Robert
Will Deacon Jan. 4, 2017, 1:28 p.m. UTC | #5
On Wed, Dec 14, 2016 at 09:11:47AM +0000, Ard Biesheuvel wrote:
> The NUMA code may get confused by the presence of NOMAP regions within
> zones, resulting in spurious BUG() checks where the node id deviates
> from the containing zone's node id.
>
> Since the kernel has no business reasoning about node ids of pages it
> does not own in the first place, enable CONFIG_HOLES_IN_ZONE to ensure
> that such pages are disregarded.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> ---
>  arch/arm64/Kconfig | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 111742126897..0472afe64d55 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -614,6 +614,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
>  	def_bool y
>  	depends on NUMA
>
> +config HOLES_IN_ZONE
> +	def_bool y
> +	depends on NUMA
> +
>  source kernel/Kconfig.preempt
>  source kernel/Kconfig.hz


I'm happy to apply this, but I'll hold off until the first patch is queued
somewhere, since this doesn't help without the VM_BUG_ON being moved.

Alternatively, I can queue both if somebody from the mm camp acks the
first patch.

Will
Ard Biesheuvel Jan. 4, 2017, 1:50 p.m. UTC | #6
On 4 January 2017 at 13:28, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Dec 14, 2016 at 09:11:47AM +0000, Ard Biesheuvel wrote:
>> The NUMA code may get confused by the presence of NOMAP regions within
>> zones, resulting in spurious BUG() checks where the node id deviates
>> from the containing zone's node id.
>>
>> Since the kernel has no business reasoning about node ids of pages it
>> does not own in the first place, enable CONFIG_HOLES_IN_ZONE to ensure
>> that such pages are disregarded.
>>
>> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
>> ---
>>  arch/arm64/Kconfig | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 111742126897..0472afe64d55 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -614,6 +614,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
>>       def_bool y
>>       depends on NUMA
>>
>> +config HOLES_IN_ZONE
>> +     def_bool y
>> +     depends on NUMA
>> +
>>  source kernel/Kconfig.preempt
>>  source kernel/Kconfig.hz
>
> I'm happy to apply this, but I'll hold off until the first patch is queued
> somewhere, since this doesn't help without the VM_BUG_ON being moved.
>
> Alternatively, I can queue both if somebody from the mm camp acks the
> first patch.
>


Actually, I am not convinced the discussion is finalized. These
patches do fix the issue, but Robert also suggested an alternative fix
which may be preferable.

http://marc.info/?l=linux-arm-kernel&m=148190753510107&w=2

I haven't responded to it yet, due to the holidays, but I'd like to
explore that solution a bit further before applying anything, if you
don't mind.

Thanks,
Ard.
Will Deacon Jan. 4, 2017, 2:02 p.m. UTC | #7
On Wed, Jan 04, 2017 at 01:50:20PM +0000, Ard Biesheuvel wrote:
> On 4 January 2017 at 13:28, Will Deacon <will.deacon@arm.com> wrote:
> > On Wed, Dec 14, 2016 at 09:11:47AM +0000, Ard Biesheuvel wrote:
> >> The NUMA code may get confused by the presence of NOMAP regions within
> >> zones, resulting in spurious BUG() checks where the node id deviates
> >> from the containing zone's node id.
> >>
> >> Since the kernel has no business reasoning about node ids of pages it
> >> does not own in the first place, enable CONFIG_HOLES_IN_ZONE to ensure
> >> that such pages are disregarded.
> >>
> >> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
> >> ---
> >>  arch/arm64/Kconfig | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> >> index 111742126897..0472afe64d55 100644
> >> --- a/arch/arm64/Kconfig
> >> +++ b/arch/arm64/Kconfig
> >> @@ -614,6 +614,10 @@ config NEED_PER_CPU_EMBED_FIRST_CHUNK
> >>       def_bool y
> >>       depends on NUMA
> >>
> >> +config HOLES_IN_ZONE
> >> +     def_bool y
> >> +     depends on NUMA
> >> +
> >>  source kernel/Kconfig.preempt
> >>  source kernel/Kconfig.hz
> >
> > I'm happy to apply this, but I'll hold off until the first patch is queued
> > somewhere, since this doesn't help without the VM_BUG_ON being moved.
> >
> > Alternatively, I can queue both if somebody from the mm camp acks the
> > first patch.
> >
>
> Actually, I am not convinced the discussion is finalized. These
> patches do fix the issue, but Robert also suggested an alternative fix
> which may be preferable.
>
> http://marc.info/?l=linux-arm-kernel&m=148190753510107&w=2
>
> I haven't responded to it yet, due to the holidays, but I'd like to
> explore that solution a bit further before applying anything, if you
> don't mind.


Using early_pfn_valid feels like a bodge to me, since having pfn_valid
return false for something that early_pfn_valid says is valid (and is
therefore initialised in the memmap) makes the NOMAP semantics even more
confusing.

But there's no rush, so I'll hold off for the moment. I was under the
impression that things had stalled.

Will
Robert Richter Jan. 5, 2017, 11:24 a.m. UTC | #8
On 04.01.17 14:02:23, Will Deacon wrote:
> Using early_pfn_valid feels like a bodge to me, since having pfn_valid
> return false for something that early_pfn_valid says is valid (and is
> therefore initialised in the memmap) makes the NOMAP semantics even more
> confusing.


The concern I have had with HOLES_IN_ZONE is that it enables
pfn_valid_within() for arm64. This means that each pfn of a section is
checked, whereas otherwise the check is done only once per section. With
up to 2^18 pages per section we traverse the memblock list more often by
that factor. There could be a performance regression. I don't have
numbers yet, since the fix causes another kernel crash. And this is the
next problem I have: the crash doesn't happen otherwise. So either the
fix uncovers another bug or it is incomplete, though the changes look
like they should work. This needs more investigation.
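
The cost argument can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel's memblock code: struct toy_region, toy_search(), toy_pfn_valid() and toy_count_valid() are hypothetical, and the binary search over a sorted region table merely stands in for whatever lookup the real pfn_valid() performs. The point it shows is that per-page validation performs one region search per pfn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified region descriptor (table is sorted, non-overlapping). */
struct toy_region {
	uint64_t base;
	uint64_t size;
	bool nomap;
};

/* Binary search for the region containing addr; counts every search. */
static const struct toy_region *toy_search(const struct toy_region *r, size_t n,
					    uint64_t addr, unsigned long *lookups)
{
	size_t lo = 0, hi = n;

	(*lookups)++;
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (addr < r[mid].base)
			hi = mid;
		else if (addr >= r[mid].base + r[mid].size)
			lo = mid + 1;
		else
			return &r[mid];
	}
	return NULL;
}

/* A pfn is valid only if it lies in a region that is not marked NOMAP. */
bool toy_pfn_valid(const struct toy_region *r, size_t n, uint64_t pfn,
		   unsigned int page_shift, unsigned long *lookups)
{
	const struct toy_region *hit = toy_search(r, n, pfn << page_shift, lookups);

	return hit && !hit->nomap;
}

/* Per-page validation of [start, end): one region search per pfn. */
size_t toy_count_valid(const struct toy_region *r, size_t n, uint64_t start,
		       uint64_t end, unsigned int page_shift, unsigned long *lookups)
{
	size_t valid = 0;

	for (uint64_t pfn = start; pfn < end; pfn++)
		valid += toy_pfn_valid(r, n, pfn, page_shift, lookups);
	return valid;
}

void toy_count_valid_selfcheck(void)
{
	/* 16 pages of 4k; pages 8 and 9 are NOMAP. */
	const struct toy_region map[] = {
		{ 0x0000, 0x8000, false },
		{ 0x8000, 0x2000, true },
		{ 0xA000, 0x6000, false },
	};
	unsigned long lookups = 0;

	assert(toy_count_valid(map, 3, 0, 16, 12, &lookups) == 14);
	assert(lookups == 16);	/* one search per page, not one per range */
}
```

Whether this shows up as a measurable regression depends on how often such walks run, which is exactly what the missing numbers would have to answer.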

-Robert
Ard Biesheuvel Jan. 6, 2017, 12:22 p.m. UTC | #9
On 6 January 2017 at 12:03, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Jan 05, 2017 at 08:49:44PM +0100, Robert Richter wrote:
>> On 05.01.17 13:22:00, Robert Richter wrote:
>> > On 05.01.17 12:08:20, Will Deacon wrote:
>> > > I really can't see how the fix causes a crash, and I couldn't reproduce
>> > > it on any of my boards, nor could any of the Linaro folk afaik. Are you
>> > > definitely running mainline with just these two patches from Ard?
>> >
>> > Yes, just both patches applied. Various other solutions were working.
>>
>> I have retested the same kernel (v4.9 based) as before and now it
>> boots fine including rtc-efi device registration (it was crashing
>> there):
>>
>>  rtc-efi rtc-efi: rtc core: registered rtc-efi as rtc0
>>
>> There could be a difference in firmware and mem setup, though I also
>> downgraded the firmware to test it, but can't reproduce it anymore. I
>> could reliably trigger the crash the first time.
>>
>> FTR the oops.
>
> Hmm, I just can't help but think you were accidentally running with
> additional patches when you saw this oops previously. For example,
> your log looks very similar to this one:
>
>   http://lists.infradead.org/pipermail/linux-arm-kernel/2016-December/473666.html
>
> but then again, these crashes probably often look alike.
>


These are quite different, in fact. In James's case, the UEFI memory
map was missing some entries, so not all memory regions that the
firmware expected to be there were actually mapped, hence the all-zero
*pte. In Robert's case, it looks like the UEFI runtime services page
tables are corrupted, i.e., *pte has RES0 bits set.
Robert Richter Feb. 6, 2017, 1:36 p.m. UTC | #10
On 14.12.16 09:11:47, Ard Biesheuvel wrote:
> The NUMA code may get confused by the presence of NOMAP regions within
> zones, resulting in spurious BUG() checks where the node id deviates
> from the containing zone's node id.
>
> Since the kernel has no business reasoning about node ids of pages it
> does not own in the first place, enable CONFIG_HOLES_IN_ZONE to ensure
> that such pages are disregarded.
>
> Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>


I would rather see a solution other than making pfn_valid checks more
fine grained, but this patch also fixes the issue. So:

Acked-by: Robert Richter <rrichter@cavium.com>
Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 111742126897..0472afe64d55 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -614,6 +614,10 @@  config NEED_PER_CPU_EMBED_FIRST_CHUNK
 	def_bool y
 	depends on NUMA
 
+config HOLES_IN_ZONE
+	def_bool y
+	depends on NUMA
+
 source kernel/Kconfig.preempt
 source kernel/Kconfig.hz