Message ID | nycvar.YSQ.7.76.1710031638450.5407@knanqh.ubzr |
---|---|
State | New |
Headers | show |
Series | mm/percpu.c: use smarter memory allocation for struct pcpu_alloc_info | expand |
On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote: > This can be much smaller than a page on very small memory systems. > Always rounding up the size to a page is wasteful in that case, and > required alignment is smaller than the memblock default. Let's round > things up to a page size only when the actual size is >= page size, and > then it makes sense to page-align for a nicer allocation pattern. Isn't that a temporary area which gets freed later during boot? Thanks. -- tejun
On Tue, 3 Oct 2017, Tejun Heo wrote: > On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote: > > This can be much smaller than a page on very small memory systems. > > Always rounding up the size to a page is wasteful in that case, and > > required alignment is smaller than the memblock default. Let's round > > things up to a page size only when the actual size is >= page size, and > > then it makes sense to page-align for a nicer allocation pattern. > > Isn't that a temporary area which gets freed later during boot? Hmmm... It may get freed through 3 different paths where 2 of them are error paths. What looks like a non-error path is in pcpu_embed_first_chunk() called from setup_per_cpu_areas(). But there are two versions of setup_per_cpu_areas(): one for SMP and one for !SMP. And the !SMP case never calls pcpu_free_alloc_info() currently. I'm not sure i understand that code fully, but maybe the following patch could be a better fit: ----- >8 Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info Unlike the SMP case, the !SMP case does not free the memory for struct pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a chance of being reused by the page allocator later, align it to a page boundary just like its size. Signed-off-by: Nicolas Pitre <nico@linaro.org> diff --git a/mm/percpu.c b/mm/percpu.c index 434844415d..caab63375b 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1416,7 +1416,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, __alignof__(ai->groups[0].cpu_map[0])); ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]); - ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0); + ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE); if (!ptr) return NULL; ai = ptr; @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) if (pcpu_setup_first_chunk(ai, fc) < 0) panic("Failed to initialize percpu areas."); + pcpu_free_alloc_info(ai); } #endif /* CONFIG_SMP */
Hello, On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > I'm not sure i understand that code fully, but maybe the following patch > could be a better fit: > > ----- >8 > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info So, IIRC, the error path is either boot fail or some serious bug in arch code. It really doesn't matter whether we free a page or not. Thanks. -- tejun
Hi Tejun, On Tue, Oct 03, 2017 at 03:36:42PM -0700, Tejun Heo wrote: > > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info > > So, IIRC, the error path is either boot fail or some serious bug in > arch code. It really doesn't matter whether we free a page or not. > In setup_per_cpu_area, a call to either pcpu_embed_first_chunk, pcpu_page_first_chunk, or pcpu_setup_first_chunk is made. The first two eventually call pcpu_setup_first_chunk with a pairing call to pcpu_free_alloc_info right after. This occurs in all implementations. It happens we don't have a pairing call to pcpu_free_alloc_info in the UP setup_per_cpu_area. Thanks, Dennis
On Tue, 3 Oct 2017, Dennis Zhou wrote: > Hi Tejun, > > On Tue, Oct 03, 2017 at 03:36:42PM -0700, Tejun Heo wrote: > > > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info > > > > So, IIRC, the error path is either boot fail or some serious bug in > > arch code. It really doesn't matter whether we free a page or not. > > > > In setup_per_cpu_area, a call to either pcpu_embed_first_chunk, > pcpu_page_first_chunk, or pcpu_setup_first_chunk is made. The first two > eventually call pcpu_setup_first_chunk with a pairing call to > pcpu_free_alloc_info right after. This occurs in all implementations. It > happens we don't have a pairing call to pcpu_free_alloc_info in the UP > setup_per_cpu_area. That was my conclusion too (albeit not stated as clearly) and what my second patch fixed. Nicolas
Hello, On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info > > Unlike the SMP case, the !SMP case does not free the memory for struct > pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a > chance of being reused by the page allocator later, align it to a page > boundary just like its size. > > Signed-off-by: Nicolas Pitre <nico@linaro.org> Applied to percpu/for-4.15 w/ Dennis's ack. Thanks. -- tejun
Hi, On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > On Tue, 3 Oct 2017, Tejun Heo wrote: > > > On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote: > > > This can be much smaller than a page on very small memory systems. > > > Always rounding up the size to a page is wasteful in that case, and > > > required alignment is smaller than the memblock default. Let's round > > > things up to a page size only when the actual size is >= page size, and > > > then it makes sense to page-align for a nicer allocation pattern. > > > > Isn't that a temporary area which gets freed later during boot? > > Hmmm... > > It may get freed through 3 different paths where 2 of them are error > paths. What looks like a non-error path is in pcpu_embed_first_chunk() > called from setup_per_cpu_areas(). But there are two versions of > setup_per_cpu_areas(): one for SMP and one for !SMP. And the !SMP case > never calls pcpu_free_alloc_info() currently. > > I'm not sure i understand that code fully, but maybe the following patch > could be a better fit: > > ----- >8 > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info > > Unlike the SMP case, the !SMP case does not free the memory for struct > pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a > chance of being reused by the page allocator later, align it to a page > boundary just like its size. > > Signed-off-by: Nicolas Pitre <nico@linaro.org> This patch causes my crisv32 qemu emulation to hang with no console output. > > diff --git a/mm/percpu.c b/mm/percpu.c > index 434844415d..caab63375b 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -1416,7 +1416,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, > __alignof__(ai->groups[0].cpu_map[0])); > ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]); > > - ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0); > + ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE); > if (!ptr) > return NULL; > ai = ptr; > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > if (pcpu_setup_first_chunk(ai, fc) < 0) > panic("Failed to initialize percpu areas."); > + pcpu_free_alloc_info(ai); This is the culprit. Everything works fine if I remove this line. No idea if the problem is here or in the cris core. Copying cris maintainers for input. Guenter
On Sat, 18 Nov 2017, Guenter Roeck wrote: > Hi, > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > > On Tue, 3 Oct 2017, Tejun Heo wrote: > > > > > On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote: > > > > This can be much smaller than a page on very small memory systems. > > > > Always rounding up the size to a page is wasteful in that case, and > > > > required alignment is smaller than the memblock default. Let's round > > > > things up to a page size only when the actual size is >= page size, and > > > > then it makes sense to page-align for a nicer allocation pattern. > > > > > > Isn't that a temporary area which gets freed later during boot? > > > > Hmmm... > > > > It may get freed through 3 different paths where 2 of them are error > > paths. What looks like a non-error path is in pcpu_embed_first_chunk() > > called from setup_per_cpu_areas(). But there are two versions of > > setup_per_cpu_areas(): one for SMP and one for !SMP. And the !SMP case > > never calls pcpu_free_alloc_info() currently. > > > > I'm not sure i understand that code fully, but maybe the following patch > > could be a better fit: > > > > ----- >8 > > Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info > > > > Unlike the SMP case, the !SMP case does not free the memory for struct > > pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a > > chance of being reused by the page allocator later, align it to a page > > boundary just like its size. > > > > Signed-off-by: Nicolas Pitre <nico@linaro.org> > > This patch causes my crisv32 qemu emulation to hang with no console output. > > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > index 434844415d..caab63375b 100644 > > --- a/mm/percpu.c > > +++ b/mm/percpu.c > > @@ -1416,7 +1416,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, > > __alignof__(ai->groups[0].cpu_map[0])); > > ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]); > > > > - ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0); > > + ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE); > > if (!ptr) > > return NULL; > > ai = ptr; > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > > > if (pcpu_setup_first_chunk(ai, fc) < 0) > > panic("Failed to initialize percpu areas."); > > + pcpu_free_alloc_info(ai); > > This is the culprit. Everything works fine if I remove this line. Without this line, the memory at the ai pointer is leaked. Maybe this is modifying the memory allocation pattern and that triggers a bug later on in your case. At that point the console driver is not yet initialized and any error message won't be printed. You should enable the early console mechanism in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what that might tell you. Nicolas
On 11/19/2017 12:36 PM, Nicolas Pitre wrote: > On Sat, 18 Nov 2017, Guenter Roeck wrote: > >> Hi, >> >> On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: >>> On Tue, 3 Oct 2017, Tejun Heo wrote: >>> >>>> On Tue, Oct 03, 2017 at 04:57:44PM -0400, Nicolas Pitre wrote: >>>>> This can be much smaller than a page on very small memory systems. >>>>> Always rounding up the size to a page is wasteful in that case, and >>>>> required alignment is smaller than the memblock default. Let's round >>>>> things up to a page size only when the actual size is >= page size, and >>>>> then it makes sense to page-align for a nicer allocation pattern. >>>> >>>> Isn't that a temporary area which gets freed later during boot? >>> >>> Hmmm... >>> >>> It may get freed through 3 different paths where 2 of them are error >>> paths. What looks like a non-error path is in pcpu_embed_first_chunk() >>> called from setup_per_cpu_areas(). But there are two versions of >>> setup_per_cpu_areas(): one for SMP and one for !SMP. And the !SMP case >>> never calls pcpu_free_alloc_info() currently. >>> >>> I'm not sure i understand that code fully, but maybe the following patch >>> could be a better fit: >>> >>> ----- >8 >>> Subject: [PATCH] percpu: don't forget to free the temporary struct pcpu_alloc_info >>> >>> Unlike the SMP case, the !SMP case does not free the memory for struct >>> pcpu_alloc_info allocated in setup_per_cpu_areas(). And to give it a >>> chance of being reused by the page allocator later, align it to a page >>> boundary just like its size. >>> >>> Signed-off-by: Nicolas Pitre <nico@linaro.org> >> >> This patch causes my crisv32 qemu emulation to hang with no console output. >> >>> >>> diff --git a/mm/percpu.c b/mm/percpu.c >>> index 434844415d..caab63375b 100644 >>> --- a/mm/percpu.c >>> +++ b/mm/percpu.c >>> @@ -1416,7 +1416,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, >>> __alignof__(ai->groups[0].cpu_map[0])); >>> ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]); >>> >>> - ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0); >>> + ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), PAGE_SIZE); >>> if (!ptr) >>> return NULL; >>> ai = ptr; >>> @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) >>> >>> if (pcpu_setup_first_chunk(ai, fc) < 0) >>> panic("Failed to initialize percpu areas."); >>> + pcpu_free_alloc_info(ai); >> >> This is the culprit. Everything works fine if I remove this line. > > Without this line, the memory at the ai pointer is leaked. Maybe this is > modifying the memory allocation pattern and that triggers a bug later on > in your case. > > At that point the console driver is not yet initialized and any error > message won't be printed. You should enable the early console mechanism > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what > that might tell you. > The problem is that BUG() on crisv32 does not yield useful output. Anyway, here is the culprit. diff --git a/mm/bootmem.c b/mm/bootmem.c index 6aef64254203..2bcc8901450c 100644 --- a/mm/bootmem.c +++ b/mm/bootmem.c @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, unsigned long end, return 0; pos = bdata->node_low_pfn; } - BUG(); + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", start, end); + return -ENOMEM; } /** diff --git a/mm/percpu.c b/mm/percpu.c index 79e3549cab0f..c75622d844f1 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, */ void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) { + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); memblock_free_early(__pa(ai), ai->__ai_size); } results in: pcpu_free_alloc_info(c0534000 (0x40534000)) ------------[ cut here ]------------ WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa mark_bootmem(): memory range 0x2029a-0x2029b not found and the system keeps booting. If I drop the __pa() from the memblock_free_early() parameter, everything works as expected. Off to the cris maintainers ... I have no idea how to fix the problem. Guenter
On Sun, 19 Nov 2017, Guenter Roeck wrote: > On 11/19/2017 12:36 PM, Nicolas Pitre wrote: > > On Sat, 18 Nov 2017, Guenter Roeck wrote: > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > > > if (pcpu_setup_first_chunk(ai, fc) < 0) > > > > panic("Failed to initialize percpu areas."); > > > > + pcpu_free_alloc_info(ai); > > > > > > This is the culprit. Everything works fine if I remove this line. > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is > > modifying the memory allocation pattern and that triggers a bug later on > > in your case. > > > > At that point the console driver is not yet initialized and any error > > message won't be printed. You should enable the early console mechanism > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what > > that might tell you. > > > > The problem is that BUG() on crisv32 does not yield useful output. > Anyway, here is the culprit. > > diff --git a/mm/bootmem.c b/mm/bootmem.c > index 6aef64254203..2bcc8901450c 100644 > --- a/mm/bootmem.c > +++ b/mm/bootmem.c > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, > unsigned long end, > return 0; > pos = bdata->node_low_pfn; > } > - BUG(); > + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", start, > end); > + return -ENOMEM; > } > > /** > diff --git a/mm/percpu.c b/mm/percpu.c > index 79e3549cab0f..c75622d844f1 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init > pcpu_alloc_alloc_info(int nr_groups, > */ > void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) > { > + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); > memblock_free_early(__pa(ai), ai->__ai_size); The problem here is that there is two possibilities for memblock_free_early(). From include/linux/bootmem.h: #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM) static inline void __init memblock_free_early( phys_addr_t base, phys_addr_t size) { __memblock_free_early(base, size); } #else static inline void __init memblock_free_early( phys_addr_t base, phys_addr_t size) { free_bootmem(base, size); } #endif It looks like most architectures use the memblock variant, including all the ones I have access to. > results in: > > pcpu_free_alloc_info(c0534000 (0x40534000)) > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa > mark_bootmem(): memory range 0x2029a-0x2029b not found Well... PFN_UP(0x40534000) should give 0x40534. How you might end up with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max == end) return 0;" within the loop is rather weird. Nicolas
On 11/19/2017 08:08 PM, Nicolas Pitre wrote: > On Sun, 19 Nov 2017, Guenter Roeck wrote: >> On 11/19/2017 12:36 PM, Nicolas Pitre wrote: >>> On Sat, 18 Nov 2017, Guenter Roeck wrote: >>>> On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: >>>>> @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) >>>>> if (pcpu_setup_first_chunk(ai, fc) < 0) >>>>> panic("Failed to initialize percpu areas."); >>>>> + pcpu_free_alloc_info(ai); >>>> >>>> This is the culprit. Everything works fine if I remove this line. >>> >>> Without this line, the memory at the ai pointer is leaked. Maybe this is >>> modifying the memory allocation pattern and that triggers a bug later on >>> in your case. >>> >>> At that point the console driver is not yet initialized and any error >>> message won't be printed. You should enable the early console mechanism >>> in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what >>> that might tell you. >>> >> >> The problem is that BUG() on crisv32 does not yield useful output. >> Anyway, here is the culprit. >> >> diff --git a/mm/bootmem.c b/mm/bootmem.c >> index 6aef64254203..2bcc8901450c 100644 >> --- a/mm/bootmem.c >> +++ b/mm/bootmem.c >> @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, >> unsigned long end, >> return 0; >> pos = bdata->node_low_pfn; >> } >> - BUG(); >> + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", start, >> end); >> + return -ENOMEM; >> } >> >> /** >> diff --git a/mm/percpu.c b/mm/percpu.c >> index 79e3549cab0f..c75622d844f1 100644 >> --- a/mm/percpu.c >> +++ b/mm/percpu.c >> @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init >> pcpu_alloc_alloc_info(int nr_groups, >> */ >> void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) >> { >> + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); >> memblock_free_early(__pa(ai), ai->__ai_size); > > The problem here is that there is two possibilities for > memblock_free_early(). From include/linux/bootmem.h: > > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM) > > static inline void __init memblock_free_early( > phys_addr_t base, phys_addr_t size) > { > __memblock_free_early(base, size); > } > > #else > > static inline void __init memblock_free_early( > phys_addr_t base, phys_addr_t size) > { > free_bootmem(base, size); > } > > #endif > > It looks like most architectures use the memblock variant, including all > the ones I have access to. > >> results in: >> >> pcpu_free_alloc_info(c0534000 (0x40534000)) >> ------------[ cut here ]------------ >> WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa >> mark_bootmem(): memory range 0x2029a-0x2029b not found > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max > == end) return 0;" within the loop is rather weird. > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000, PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)" because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)". Guenter
On Sun, 19 Nov 2017, Guenter Roeck wrote: > On 11/19/2017 08:08 PM, Nicolas Pitre wrote: > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote: > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote: > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > > > > > if (pcpu_setup_first_chunk(ai, fc) < 0) > > > > > > panic("Failed to initialize percpu areas."); > > > > > > + pcpu_free_alloc_info(ai); > > > > > > > > > > This is the culprit. Everything works fine if I remove this line. > > > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is > > > > modifying the memory allocation pattern and that triggers a bug later on > > > > in your case. > > > > > > > > At that point the console driver is not yet initialized and any error > > > > message won't be printed. You should enable the early console mechanism > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what > > > > that might tell you. > > > > > > > > > > The problem is that BUG() on crisv32 does not yield useful output. > > > Anyway, here is the culprit. > > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c > > > index 6aef64254203..2bcc8901450c 100644 > > > --- a/mm/bootmem.c > > > +++ b/mm/bootmem.c > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, > > > unsigned long end, > > > return 0; > > > pos = bdata->node_low_pfn; > > > } > > > - BUG(); > > > + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", > > > start, > > > end); > > > + return -ENOMEM; > > > } > > > > > > /** > > > diff --git a/mm/percpu.c b/mm/percpu.c > > > index 79e3549cab0f..c75622d844f1 100644 > > > --- a/mm/percpu.c > > > +++ b/mm/percpu.c > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init > > > pcpu_alloc_alloc_info(int nr_groups, > > > */ > > > void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) > > > { > > > + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); > > > memblock_free_early(__pa(ai), ai->__ai_size); > > > > The problem here is that there is two possibilities for > > memblock_free_early(). From include/linux/bootmem.h: > > > > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM) > > > > static inline void __init memblock_free_early( > > phys_addr_t base, phys_addr_t size) > > { > > __memblock_free_early(base, size); > > } > > > > #else > > > > static inline void __init memblock_free_early( > > phys_addr_t base, phys_addr_t size) > > { > > free_bootmem(base, size); > > } > > > > #endif > > > > It looks like most architectures use the memblock variant, including all > > the ones I have access to. > > > > > results in: > > > > > > pcpu_free_alloc_info(c0534000 (0x40534000)) > > > ------------[ cut here ]------------ > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa > > > mark_bootmem(): memory range 0x2029a-0x2029b not found > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max > > == end) return 0;" within the loop is rather weird. > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000, > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)" > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)". OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case. However the bootmem allocator deals with physical addresses not virtual ones. So it shouldn't give you a 0x60000..0x61000 range. Would be interesting to see what result you get on line 860 of mm/bootmem.c. Nicolas
On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote: > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote: > > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote: > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote: > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > > > > > > if (pcpu_setup_first_chunk(ai, fc) < 0) > > > > > > > panic("Failed to initialize percpu areas."); > > > > > > > + pcpu_free_alloc_info(ai); > > > > > > > > > > > > This is the culprit. Everything works fine if I remove this line. > > > > > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is > > > > > modifying the memory allocation pattern and that triggers a bug later on > > > > > in your case. > > > > > > > > > > At that point the console driver is not yet initialized and any error > > > > > message won't be printed. You should enable the early console mechanism > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what > > > > > that might tell you. > > > > > > > > > > > > > The problem is that BUG() on crisv32 does not yield useful output. > > > > Anyway, here is the culprit. > > > > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c > > > > index 6aef64254203..2bcc8901450c 100644 > > > > --- a/mm/bootmem.c > > > > +++ b/mm/bootmem.c > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, > > > > unsigned long end, > > > > return 0; > > > > pos = bdata->node_low_pfn; > > > > } > > > > - BUG(); > > > > + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", > > > > start, > > > > end); > > > > + return -ENOMEM; > > > > } > > > > > > > > /** > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > > > index 79e3549cab0f..c75622d844f1 100644 > > > > --- a/mm/percpu.c > > > > +++ b/mm/percpu.c > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init > > > > pcpu_alloc_alloc_info(int nr_groups, > > > > */ > > > > void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) > > > > { > > > > + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); > > > > memblock_free_early(__pa(ai), ai->__ai_size); > > > > > > The problem here is that there is two possibilities for > > > memblock_free_early(). From include/linux/bootmem.h: > > > > > > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM) > > > > > > static inline void __init memblock_free_early( > > > phys_addr_t base, phys_addr_t size) > > > { > > > __memblock_free_early(base, size); > > > } > > > > > > #else > > > > > > static inline void __init memblock_free_early( > > > phys_addr_t base, phys_addr_t size) > > > { > > > free_bootmem(base, size); > > > } > > > > > > #endif > > > > > > It looks like most architectures use the memblock variant, including all > > > the ones I have access to. > > > > > > > results in: > > > > > > > > pcpu_free_alloc_info(c0534000 (0x40534000)) > > > > ------------[ cut here ]------------ > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found > > > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max > > > == end) return 0;" within the loop is rather weird. > > > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000, > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b > > > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)" > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)". > > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case. > However the bootmem allocator deals with physical addresses not virtual > ones. So it shouldn't give you a 0x60000..0x61000 range. > > Would be interesting to see what result you get on line 860 of > mm/bootmem.c. > Nothing; __alloc_bootmem_low_node() is not called. Call chain is: pcpu_alloc_alloc_info memblock_virt_alloc_nopanic __alloc_bootmem_nopanic ___alloc_bootmem_nopanic and returns 0xc0536000. Guenter
On Mon, 20 Nov 2017, Guenter Roeck wrote: > On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote: > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote: > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote: > > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote: > > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > > > > > > > if (pcpu_setup_first_chunk(ai, fc) < 0) > > > > > > > > panic("Failed to initialize percpu areas."); > > > > > > > > + pcpu_free_alloc_info(ai); > > > > > > > > > > > > > > This is the culprit. Everything works fine if I remove this line. > > > > > > > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is > > > > > > modifying the memory allocation pattern and that triggers a bug later on > > > > > > in your case. > > > > > > > > > > > > At that point the console driver is not yet initialized and any error > > > > > > message won't be printed. You should enable the early console mechanism > > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what > > > > > > that might tell you. > > > > > > > > > > > > > > > > The problem is that BUG() on crisv32 does not yield useful output. > > > > > Anyway, here is the culprit. > > > > > > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c > > > > > index 6aef64254203..2bcc8901450c 100644 > > > > > --- a/mm/bootmem.c > > > > > +++ b/mm/bootmem.c > > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, > > > > > unsigned long end, > > > > > return 0; > > > > > pos = bdata->node_low_pfn; > > > > > } > > > > > - BUG(); > > > > > + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", > > > > > start, > > > > > end); > > > > > + return -ENOMEM; > > > > > } > > > > > > > > > > /** > > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > > > > index 79e3549cab0f..c75622d844f1 100644 > > > > > --- a/mm/percpu.c > > > > > +++ b/mm/percpu.c > > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init > > > > > pcpu_alloc_alloc_info(int nr_groups, > > > > > */ > > > > > void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) > > > > > { > > > > > + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); > > > > > memblock_free_early(__pa(ai), ai->__ai_size); > > > > > > > > The problem here is that there is two possibilities for > > > > memblock_free_early(). From include/linux/bootmem.h: > > > > > > > > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM) > > > > > > > > static inline void __init memblock_free_early( > > > > phys_addr_t base, phys_addr_t size) > > > > { > > > > __memblock_free_early(base, size); > > > > } > > > > > > > > #else > > > > > > > > static inline void __init memblock_free_early( > > > > phys_addr_t base, phys_addr_t size) > > > > { > > > > free_bootmem(base, size); > > > > } > > > > > > > > #endif > > > > > > > > It looks like most architectures use the memblock variant, including all > > > > the ones I have access to. > > > > > > > > > results in: > > > > > > > > > > pcpu_free_alloc_info(c0534000 (0x40534000)) > > > > > ------------[ cut here ]------------ > > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa > > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found > > > > > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up > > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max > > > > == end) return 0;" within the loop is rather weird. > > > > > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000, > > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b > > > > > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)" > > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)". > > > > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case. > > However the bootmem allocator deals with physical addresses not virtual > > ones. So it shouldn't give you a 0x60000..0x61000 range. > > > > Would be interesting to see what result you get on line 860 of > > mm/bootmem.c. > > > Nothing; __alloc_bootmem_low_node() is not called. > > Call chain is: > pcpu_alloc_alloc_info > memblock_virt_alloc_nopanic > __alloc_bootmem_nopanic > ___alloc_bootmem_nopanic But from there it should continue with: alloc_bootmem_core() --> alloc_bootmem_bdata() --> [...] region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + start_off); That's line 585, not 860 as I mentioned. Sorry for the confusion. Nicolas
On Mon, Nov 20, 2017 at 03:21:32PM -0500, Nicolas Pitre wrote: > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote: > > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > > > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote: > > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote: > > > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote: > > > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > > > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > > > > > > > > if (pcpu_setup_first_chunk(ai, fc) < 0) > > > > > > > > > panic("Failed to initialize percpu areas."); > > > > > > > > > + pcpu_free_alloc_info(ai); > > > > > > > > > > > > > > > > This is the culprit. Everything works fine if I remove this line. > > > > > > > > > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is > > > > > > > modifying the memory allocation pattern and that triggers a bug later on > > > > > > > in your case. > > > > > > > > > > > > > > At that point the console driver is not yet initialized and any error > > > > > > > message won't be printed. You should enable the early console mechanism > > > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what > > > > > > > that might tell you. > > > > > > > > > > > > > > > > > > > The problem is that BUG() on crisv32 does not yield useful output. > > > > > > Anyway, here is the culprit. > > > > > > > > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c > > > > > > index 6aef64254203..2bcc8901450c 100644 > > > > > > --- a/mm/bootmem.c > > > > > > +++ b/mm/bootmem.c > > > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, > > > > > > unsigned long end, > > > > > > return 0; > > > > > > pos = bdata->node_low_pfn; > > > > > > } > > > > > > - BUG(); > > > > > > + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", > > > > > > start, > > > > > > end); > > > > > > + return -ENOMEM; > > > > > > } > > > > > > > > > > > > /** > > > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > > > > > index 79e3549cab0f..c75622d844f1 100644 > > > > > > --- a/mm/percpu.c > > > > > > +++ b/mm/percpu.c > > > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init > > > > > > pcpu_alloc_alloc_info(int nr_groups, > > > > > > */ > > > > > > void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) > > > > > > { > > > > > > + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); > > > > > > memblock_free_early(__pa(ai), ai->__ai_size); > > > > > > > > > > The problem here is that there is two possibilities for > > > > > memblock_free_early(). From include/linux/bootmem.h: > > > > > > > > > > #if defined(CONFIG_HAVE_MEMBLOCK) && defined(CONFIG_NO_BOOTMEM) > > > > > > > > > > static inline void __init memblock_free_early( > > > > > phys_addr_t base, phys_addr_t size) > > > > > { > > > > > __memblock_free_early(base, size); > > > > > } > > > > > > > > > > #else > > > > > > > > > > static inline void __init memblock_free_early( > > > > > phys_addr_t base, phys_addr_t size) > > > > > { > > > > > free_bootmem(base, size); > > > > > } > > > > > > > > > > #endif > > > > > > > > > > It looks like most architectures use the memblock variant, including all > > > > > the ones I have access to. > > > > > > > > > > > results in: > > > > > > > > > > > > pcpu_free_alloc_info(c0534000 (0x40534000)) > > > > > > ------------[ cut here ]------------ > > > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa > > > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found > > > > > > > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up > > > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max > > > > > == end) return 0;" within the loop is rather weird. > > > > > > > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000, > > > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b > > > > > > > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)" > > > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)". > > > > > > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case. > > > However the bootmem allocator deals with physical addresses not virtual > > > ones. So it shouldn't give you a 0x60000..0x61000 range. > > > > > > Would be interesting to see what result you get on line 860 of > > > mm/bootmem.c. > > > > > Nothing; __alloc_bootmem_low_node() is not called. > > > > Call chain is: > > pcpu_alloc_alloc_info > > memblock_virt_alloc_nopanic > > __alloc_bootmem_nopanic > > ___alloc_bootmem_nopanic > > But from there it should continue with: > > alloc_bootmem_core() --> > alloc_bootmem_bdata() --> > [...] > region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + start_off); > > That's line 585, not 860 as I mentioned. Sorry for the confusion. > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000 Guenter
On Mon, 20 Nov 2017, Guenter Roeck wrote: > On Mon, Nov 20, 2017 at 03:21:32PM -0500, Nicolas Pitre wrote: > > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > > > On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote: > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > > > > > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote: > > > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote: > > > > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote: > > > > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > > > > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > > > > > > > > > if (pcpu_setup_first_chunk(ai, fc) < 0) > > > > > > > > > > panic("Failed to initialize percpu areas."); > > > > > > > > > > + pcpu_free_alloc_info(ai); > > > > > > > > > > > > > > > > > > This is the culprit. Everything works fine if I remove this line. > > > > > > > > > > > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is > > > > > > > > modifying the memory allocation pattern and that triggers a bug later on > > > > > > > > in your case. > > > > > > > > > > > > > > > > At that point the console driver is not yet initialized and any error > > > > > > > > message won't be printed. You should enable the early console mechanism > > > > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what > > > > > > > > that might tell you. > > > > > > > > > > > > > > > > > > > > > > The problem is that BUG() on crisv32 does not yield useful output. > > > > > > > Anyway, here is the culprit. > > > > > > > > > > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c > > > > > > > index 6aef64254203..2bcc8901450c 100644 > > > > > > > --- a/mm/bootmem.c > > > > > > > +++ b/mm/bootmem.c > > > > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, > > > > > > > unsigned long end, > > > > > > > return 0; > > > > > > > pos = bdata->node_low_pfn; > > > > > > > } > > > > > > > - BUG(); > > > > > > > + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", > > > > > > > start, > > > > > > > end); > > > > > > > + return -ENOMEM; > > > > > > > } > > > > > > > > > > > > > > /** > > > > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > > > > > > index 79e3549cab0f..c75622d844f1 100644 > > > > > > > --- a/mm/percpu.c > > > > > > > +++ b/mm/percpu.c > > > > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init > > > > > > > pcpu_alloc_alloc_info(int nr_groups, > > > > > > > */ > > > > > > > void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) > > > > > > > { > > > > > > > + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); > > > > > > > memblock_free_early(__pa(ai), ai->__ai_size); > > > > > > > > > > > > > > results in: > > > > > > > > > > > > > > pcpu_free_alloc_info(c0534000 (0x40534000)) > > > > > > > ------------[ cut here ]------------ > > > > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa > > > > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found > > > > > > > > > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up > > > > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max > > > > > > == end) return 0;" within the loop is rather weird. > > > > > > > > > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000, > > > > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b > > > > > > > > > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)" > > > > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)". > > > > > > > > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case. > > > > However the bootmem allocator deals with physical addresses not virtual > > > > ones. So it shouldn't give you a 0x60000..0x61000 range. > > > > > > > > Would be interesting to see what result you get on line 860 of > > > > mm/bootmem.c. > > > > > > > Nothing; __alloc_bootmem_low_node() is not called. > > > > > > Call chain is: > > > pcpu_alloc_alloc_info > > > memblock_virt_alloc_nopanic > > > __alloc_bootmem_nopanic > > > ___alloc_bootmem_nopanic > > > > But from there it should continue with: > > > > alloc_bootmem_core() --> > > alloc_bootmem_bdata() --> > > [...] > > region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + start_off); > > > > That's line 585, not 860 as I mentioned. Sorry for the confusion. > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000 If PFN_PHYS(bdata->node_min_pfn)=c0000000 and region=c0536000 that means phys_to_virt() is a no-op. However, from your result above, __pa(0xc0534000) = 0x40534000. So, why is it that phys_to_virt() is a no-op and __pa() is not? virt_to_phys() and __pa() are meant to be the reverse of phys_to_virt() and __va(). Nicolas
On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote: > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > On Mon, Nov 20, 2017 at 03:21:32PM -0500, Nicolas Pitre wrote: > > > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > > > > > On Mon, Nov 20, 2017 at 01:18:38PM -0500, Nicolas Pitre wrote: > > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > > > > > > > > On 11/19/2017 08:08 PM, Nicolas Pitre wrote: > > > > > > > On Sun, 19 Nov 2017, Guenter Roeck wrote: > > > > > > > > On 11/19/2017 12:36 PM, Nicolas Pitre wrote: > > > > > > > > > On Sat, 18 Nov 2017, Guenter Roeck wrote: > > > > > > > > > > On Tue, Oct 03, 2017 at 06:29:49PM -0400, Nicolas Pitre wrote: > > > > > > > > > > > @@ -2295,6 +2295,7 @@ void __init setup_per_cpu_areas(void) > > > > > > > > > > > if (pcpu_setup_first_chunk(ai, fc) < 0) > > > > > > > > > > > panic("Failed to initialize percpu areas."); > > > > > > > > > > > + pcpu_free_alloc_info(ai); > > > > > > > > > > > > > > > > > > > > This is the culprit. Everything works fine if I remove this line. > > > > > > > > > > > > > > > > > > Without this line, the memory at the ai pointer is leaked. Maybe this is > > > > > > > > > modifying the memory allocation pattern and that triggers a bug later on > > > > > > > > > in your case. > > > > > > > > > > > > > > > > > > At that point the console driver is not yet initialized and any error > > > > > > > > > message won't be printed. You should enable the early console mechanism > > > > > > > > > in your kernel (see arch/cris/arch-v32/kernel/debugport.c) and see what > > > > > > > > > that might tell you. > > > > > > > > > > > > > > > > > > > > > > > > > The problem is that BUG() on crisv32 does not yield useful output. > > > > > > > > Anyway, here is the culprit. > > > > > > > > > > > > > > > > diff --git a/mm/bootmem.c b/mm/bootmem.c > > > > > > > > index 6aef64254203..2bcc8901450c 100644 > > > > > > > > --- a/mm/bootmem.c > > > > > > > > +++ b/mm/bootmem.c > > > > > > > > @@ -382,7 +382,8 @@ static int __init mark_bootmem(unsigned long start, > > > > > > > > unsigned long end, > > > > > > > > return 0; > > > > > > > > pos = bdata->node_low_pfn; > > > > > > > > } > > > > > > > > - BUG(); > > > > > > > > + WARN(1, "mark_bootmem(): memory range 0x%lx-0x%lx not found\n", > > > > > > > > start, > > > > > > > > end); > > > > > > > > + return -ENOMEM; > > > > > > > > } > > > > > > > > > > > > > > > > /** > > > > > > > > diff --git a/mm/percpu.c b/mm/percpu.c > > > > > > > > index 79e3549cab0f..c75622d844f1 100644 > > > > > > > > --- a/mm/percpu.c > > > > > > > > +++ b/mm/percpu.c > > > > > > > > @@ -1881,6 +1881,7 @@ struct pcpu_alloc_info * __init > > > > > > > > pcpu_alloc_alloc_info(int nr_groups, > > > > > > > > */ > > > > > > > > void __init pcpu_free_alloc_info(struct pcpu_alloc_info *ai) > > > > > > > > { > > > > > > > > + printk("pcpu_free_alloc_info(%p (0x%lx))\n", ai, __pa(ai)); > > > > > > > > memblock_free_early(__pa(ai), ai->__ai_size); > > > > > > > > > > > > > > > > results in: > > > > > > > > > > > > > > > > pcpu_free_alloc_info(c0534000 (0x40534000)) > > > > > > > > ------------[ cut here ]------------ > > > > > > > > WARNING: CPU: 0 PID: 0 at mm/bootmem.c:385 mark_bootmem+0x9a/0xaa > > > > > > > > mark_bootmem(): memory range 0x2029a-0x2029b not found > > > > > > > > > > > > > > Well... PFN_UP(0x40534000) should give 0x40534. How you might end up > > > > > > > with 0x2029a in mark_bootmem(), let alone not exit on the first "if (max > > > > > > > == end) return 0;" within the loop is rather weird. > > > > > > > > > > > > > pcpu_free_alloc_info: ai=c0536000, __pa(ai)=0x40536000, > > > > > > PFN_UP(__pa(ai))=0x2029b, PFN_UP(ai)=0x6029b > > > > > > > > > > > > bootmem range is 0x60000..0x61000. It doesn't get to "if (max == end)" > > > > > > because "pos (=0x2029b) < bdata->node_min_pfn (=0x60000)". > > > > > > > > > > OK. the 0x2029b is the result of PAGE_SIZE being 8192 in your case. > > > > > However the bootmem allocator deals with physical addresses not virtual > > > > > ones. So it shouldn't give you a 0x60000..0x61000 range. > > > > > > > > > > Would be interesting to see what result you get on line 860 of > > > > > mm/bootmem.c. > > > > > > > > > Nothing; __alloc_bootmem_low_node() is not called. > > > > > > > > Call chain is: > > > > pcpu_alloc_alloc_info > > > > memblock_virt_alloc_nopanic > > > > __alloc_bootmem_nopanic > > > > ___alloc_bootmem_nopanic > > > > > > But from there it should continue with: > > > > > > alloc_bootmem_core() --> > > > alloc_bootmem_bdata() --> > > > [...] > > > region = phys_to_virt(PFN_PHYS(bdata->node_min_pfn) + start_off); > > > > > > That's line 585, not 860 as I mentioned. Sorry for the confusion. > > > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000 > > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and > region=c0536000 that means phys_to_virt() is a no-op. > No, it is |= 0x80000000 > However, from your result above, __pa(0xc0534000) = 0x40534000. > > So, why is it that phys_to_virt() is a no-op and __pa() is not? > > virt_to_phys() and __pa() are meant to be the reverse of phys_to_virt() > and __va(). > I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted left by PFN_PHYS, making it 0xc0000000, which in my understanding is a virtual address. So something is wrong ... presumably node_min_pfn should be 0x20000, not 0x60000. init_bootmem_node() definitely passes virtual pfns as parameters. That doesn't seem to be easy to fix. It seems there is a mixup of physical and virtual addresses in the architecture. Guenter
On Mon, 20 Nov 2017, Guenter Roeck wrote: > On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote: > > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000 > > > > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and > > region=c0536000 that means phys_to_virt() is a no-op. > > > No, it is |= 0x80000000 Then the bootmem registration looks very fishy. If you have: > I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted > left by PFN_PHYS, making it 0xc0000000, which in my understanding is > a virtual address. Exact. #define __pa(x) ((unsigned long)(x) & 0x7fffffff) #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) With that, the only possible physical address range you may have is 0x40000000 - 0x7fffffff, and it better start at 0x40000000. If that's not where your RAM is then something is wrong. This is in fact a very bad idea to define __va() and __pa() using bitwise operations as this hides mistakes like defining physical RAM address at 0xc0000000. Instead, it should look like: #define __pa(x) ((unsigned long)(x) - 0x80000000) #define __va(x) ((void *)((unsigned long)(x) + 0x80000000)) This way, bad physical RAM address definitions will be caught immediately. > That doesn't seem to be easy to fix. It seems there is a mixup of physical > and virtual addresses in the architecture. Well... I don't think there is much else to say other than this needs fixing. Nicolas
On Mon, Nov 20, 2017 at 10:50:46PM -0500, Nicolas Pitre wrote: > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote: > > > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > > > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000 > > > > > > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and > > > region=c0536000 that means phys_to_virt() is a no-op. > > > > > No, it is |= 0x80000000 > > Then the bootmem registration looks very fishy. If you have: > > > I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted > > left by PFN_PHYS, making it 0xc0000000, which in my understanding is > > a virtual address. > > Exact. > > #define __pa(x) ((unsigned long)(x) & 0x7fffffff) > #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) > > With that, the only possible physical address range you may have is > 0x40000000 - 0x7fffffff, and it better start at 0x40000000. If that's > not where your RAM is then something is wrong. > > This is in fact a very bad idea to define __va() and __pa() using > bitwise operations as this hides mistakes like defining physical RAM > address at 0xc0000000. Instead, it should look like: > > #define __pa(x) ((unsigned long)(x) - 0x80000000) > #define __va(x) ((void *)((unsigned long)(x) + 0x80000000)) > > This way, bad physical RAM address definitions will be caught > immediately. > > > That doesn't seem to be easy to fix. It seems there is a mixup of physical > > and virtual addresses in the architecture. > > Well... I don't think there is much else to say other than this needs > fixing. The memory map for the ETRAX FS has the SDRAM mapped at both 0x40000000-0x7fffffff and 0xc0000000-0xffffffff, and the difference is cached and non-cached. That is actively (ab)used in the port, unfortunately, allthough I'm uncertain if this is the problem in this case. I get the same behaviour in my QEMU, but I've not been able to make sense of anything yet... > Nicolas /^JN - Jesper Nilsson -- Jesper Nilsson -- jesper.nilsson@axis.com
On Wed, 22 Nov 2017, Jesper Nilsson wrote: > On Mon, Nov 20, 2017 at 10:50:46PM -0500, Nicolas Pitre wrote: > > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote: > > > > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > > > > > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000 > > > > > > > > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and > > > > region=c0536000 that means phys_to_virt() is a no-op. > > > > > > > No, it is |= 0x80000000 > > > > Then the bootmem registration looks very fishy. If you have: > > > > > I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted > > > left by PFN_PHYS, making it 0xc0000000, which in my understanding is > > > a virtual address. > > > > Exact. > > > > #define __pa(x) ((unsigned long)(x) & 0x7fffffff) > > #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) > > > > With that, the only possible physical address range you may have is > > 0x40000000 - 0x7fffffff, and it better start at 0x40000000. If that's > > not where your RAM is then something is wrong. > > > > This is in fact a very bad idea to define __va() and __pa() using > > bitwise operations as this hides mistakes like defining physical RAM > > address at 0xc0000000. Instead, it should look like: > > > > #define __pa(x) ((unsigned long)(x) - 0x80000000) > > #define __va(x) ((void *)((unsigned long)(x) + 0x80000000)) > > > > This way, bad physical RAM address definitions will be caught > > immediately. > > > > > That doesn't seem to be easy to fix. It seems there is a mixup of physical > > > and virtual addresses in the architecture. > > > > Well... I don't think there is much else to say other than this needs > > fixing. > > The memory map for the ETRAX FS has the SDRAM mapped at both 0x40000000-0x7fffffff > and 0xc0000000-0xffffffff, and the difference is cached and non-cached. > That is actively (ab)used in the port, unfortunately, allthough I'm > uncertain if this is the problem in this case. It certainly is a problem. If your cached RAM is physically mapped at 0xc0000000 and you want it to be virtually mapped at 0xc0000000 then you should have: #define __pa(x) ((unsigned long)(x)) #define __va(x) ((void *)(x)) i.e. no translation. For non-cached RAM access, there are specific interfaces for that. For example, you could have dma_alloc_coherent() take advantage of the fact that memory with the top bit cleared becomes uncached. But __pa() is the wrong interface for obtaining uncached memory. Nicolas
On Wed, Nov 22, 2017 at 03:17:00PM -0500, Nicolas Pitre wrote: > On Wed, 22 Nov 2017, Jesper Nilsson wrote: > > > On Mon, Nov 20, 2017 at 10:50:46PM -0500, Nicolas Pitre wrote: > > > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > > On Mon, Nov 20, 2017 at 07:28:21PM -0500, Nicolas Pitre wrote: > > > > > On Mon, 20 Nov 2017, Guenter Roeck wrote: > > > > > > > > > > > bdata->node_min_pfn=60000 PFN_PHYS(bdata->node_min_pfn)=c0000000 start_off=536000 region=c0536000 > > > > > > > > > > If PFN_PHYS(bdata->node_min_pfn)=c0000000 and > > > > > region=c0536000 that means phys_to_virt() is a no-op. > > > > > > > > > No, it is |= 0x80000000 > > > > > > Then the bootmem registration looks very fishy. If you have: > > > > > > > I think the problem is the 0x60000 in bdata->node_min_pfn. It is shifted > > > > left by PFN_PHYS, making it 0xc0000000, which in my understanding is > > > > a virtual address. > > > > > > Exact. > > > > > > #define __pa(x) ((unsigned long)(x) & 0x7fffffff) > > > #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) > > > > > > With that, the only possible physical address range you may have is > > > 0x40000000 - 0x7fffffff, and it better start at 0x40000000. If that's > > > not where your RAM is then something is wrong. > > > > > > This is in fact a very bad idea to define __va() and __pa() using > > > bitwise operations as this hides mistakes like defining physical RAM > > > address at 0xc0000000. Instead, it should look like: > > > > > > #define __pa(x) ((unsigned long)(x) - 0x80000000) > > > #define __va(x) ((void *)((unsigned long)(x) + 0x80000000)) > > > > > > This way, bad physical RAM address definitions will be caught > > > immediately. > > > > > > > That doesn't seem to be easy to fix. It seems there is a mixup of physical > > > > and virtual addresses in the architecture. > > > > > > Well... I don't think there is much else to say other than this needs > > > fixing. > > > > The memory map for the ETRAX FS has the SDRAM mapped at both 0x40000000-0x7fffffff > > and 0xc0000000-0xffffffff, and the difference is cached and non-cached. > > That is actively (ab)used in the port, unfortunately, allthough I'm > > uncertain if this is the problem in this case. > > It certainly is a problem. If your cached RAM is physically mapped at > 0xc0000000 and you want it to be virtually mapped at 0xc0000000 then you > should have: > > #define __pa(x) ((unsigned long)(x)) > #define __va(x) ((void *)(x)) > > i.e. no translation. Sorry, it's the other way around, cached memory is at 0x40000000 and non-cached is at 0xc0000000, so the translation is right, even if as you pointed out earlier, it should be performed differently. > For non-cached RAM access, there are specific > interfaces for that. For example, you could have dma_alloc_coherent() > take advantage of the fact that memory with the top bit cleared becomes > uncached. But __pa() is the wrong interface for obtaining uncached > memory. > > Nicolas /^JN - Jesper Nilsson -- Jesper Nilsson -- jesper.nilsson@axis.com
Hello, I'm reverting the offending commit till we figure out what's going on. Thanks. -- tejun
On Mon, 27 Nov 2017, Tejun Heo wrote: > Hello, > > I'm reverting the offending commit till we figure out what's going on. It is figured out. The cris port is wrongly initializing the bootmem allocator with virtual memory addresses rather than physical addresses. And because its __va() definition reads like this: #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) then things just work out because the end result is the same whether you give this a physical or a virtual address. Untill you call memblock_free_early(__pa(address)) that is, because values from __pa() don't match with the virtual addresses stuffed in the bootmem allocator anymore. So IMHO I don't think reverting the commit is the right thing to do. That commit is clearly not at fault here. Nicolas
Hello, On Mon, Nov 27, 2017 at 03:31:52PM -0500, Nicolas Pitre wrote: > So IMHO I don't think reverting the commit is the right thing to do. > That commit is clearly not at fault here. It's not about the blame. We just want to avoid breaking boot in a way which is difficult to debug. Once cris is fixed, we can re-apply the patch. Thanks. -- tejun
On Mon, 27 Nov 2017, Tejun Heo wrote: > Hello, > > On Mon, Nov 27, 2017 at 03:31:52PM -0500, Nicolas Pitre wrote: > > So IMHO I don't think reverting the commit is the right thing to do. > > That commit is clearly not at fault here. > > It's not about the blame. We just want to avoid breaking boot in a > way which is difficult to debug. Once cris is fixed, we can re-apply > the patch. In that case I suggest the following instead. No point penalizing everyone for a single architecture's fault. And this will serve as a visible reminder to the cris people that they need to clean up. ----- >8 Subject: percpu: hack to let the CRIS architecture to boot until they clean up Commit 438a506180 ("percpu: don't forget to free the temporary struct pcpu_alloc_info") uncovered a problem on the CRIS architecture where the bootmem allocator is initialized with virtual addresses. Given it has: #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) then things just work out because the end result is the same whether you give this a physical or a virtual address. Untill you call memblock_free_early(__pa(address)) that is, because values from __pa() don't match with the virtual addresses stuffed in the bootmem allocator anymore. Avoid freeing the temporary pcpu_alloc_info memory on that architecture until they fix things up to let the kernel boot like it did before. Signed-off-by: Nicolas Pitre <nico@linaro.org> diff --git a/mm/percpu.c b/mm/percpu.c index 79e3549cab..50e7fdf840 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -2719,7 +2719,11 @@ void __init setup_per_cpu_areas(void) if (pcpu_setup_first_chunk(ai, fc) < 0) panic("Failed to initialize percpu areas."); +#ifdef CONFIG_CRIS +#warning "the CRIS architecture has physical and virtual addresses confused" +#else pcpu_free_alloc_info(ai); +#endif } #endif /* CONFIG_SMP */
Hello, Nicolas. On Mon, Nov 27, 2017 at 03:51:04PM -0500, Nicolas Pitre wrote: > Subject: percpu: hack to let the CRIS architecture to boot until they clean up > > Commit 438a506180 ("percpu: don't forget to free the temporary struct > pcpu_alloc_info") uncovered a problem on the CRIS architecture where > the bootmem allocator is initialized with virtual addresses. Given it > has: > > #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) > > then things just work out because the end result is the same whether you > give this a physical or a virtual address. > > Untill you call memblock_free_early(__pa(address)) that is, because > values from __pa() don't match with the virtual addresses stuffed in the > bootmem allocator anymore. > > Avoid freeing the temporary pcpu_alloc_info memory on that architecture > until they fix things up to let the kernel boot like it did before. > > Signed-off-by: Nicolas Pitre <nico@linaro.org> This totally works for me. Replaced the revert with this one. Thanks! -- tejun
On Mon, Nov 27, 2017 at 12:54:21PM -0800, Tejun Heo wrote: > Hello, Nicolas. > > On Mon, Nov 27, 2017 at 03:51:04PM -0500, Nicolas Pitre wrote: > > Subject: percpu: hack to let the CRIS architecture to boot until they clean up > > > > Commit 438a506180 ("percpu: don't forget to free the temporary struct > > pcpu_alloc_info") uncovered a problem on the CRIS architecture where > > the bootmem allocator is initialized with virtual addresses. Given it > > has: > > > > #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) > > > > then things just work out because the end result is the same whether you > > give this a physical or a virtual address. > > > > Untill you call memblock_free_early(__pa(address)) that is, because > > values from __pa() don't match with the virtual addresses stuffed in the > > bootmem allocator anymore. > > > > Avoid freeing the temporary pcpu_alloc_info memory on that architecture > > until they fix things up to let the kernel boot like it did before. > > > > Signed-off-by: Nicolas Pitre <nico@linaro.org> > > This totally works for me. Replaced the revert with this one. > Same here. Thanks, Guenter > Thanks! > > -- > tejun
On Mon, Nov 27, 2017 at 03:51:04PM -0500, Nicolas Pitre wrote: > On Mon, 27 Nov 2017, Tejun Heo wrote: > > > Hello, > > > > On Mon, Nov 27, 2017 at 03:31:52PM -0500, Nicolas Pitre wrote: > > > So IMHO I don't think reverting the commit is the right thing to do. > > > That commit is clearly not at fault here. > > > > It's not about the blame. We just want to avoid breaking boot in a > > way which is difficult to debug. Once cris is fixed, we can re-apply > > the patch. > > In that case I suggest the following instead. No point penalizing > everyone for a single architecture's fault. And this will serve as a > visible reminder to the cris people that they need to clean up. > > ----- >8 > Subject: percpu: hack to let the CRIS architecture to boot until they clean up > > Commit 438a506180 ("percpu: don't forget to free the temporary struct > pcpu_alloc_info") uncovered a problem on the CRIS architecture where > the bootmem allocator is initialized with virtual addresses. Given it > has: > > #define __va(x) ((void *)((unsigned long)(x) | 0x80000000)) > > then things just work out because the end result is the same whether you > give this a physical or a virtual address. > > Untill you call memblock_free_early(__pa(address)) that is, because > values from __pa() don't match with the virtual addresses stuffed in the > bootmem allocator anymore. > > Avoid freeing the temporary pcpu_alloc_info memory on that architecture > until they fix things up to let the kernel boot like it did before. > > Signed-off-by: Nicolas Pitre <nico@linaro.org> > > diff --git a/mm/percpu.c b/mm/percpu.c > index 79e3549cab..50e7fdf840 100644 > --- a/mm/percpu.c > +++ b/mm/percpu.c > @@ -2719,7 +2719,11 @@ void __init setup_per_cpu_areas(void) > > if (pcpu_setup_first_chunk(ai, fc) < 0) > panic("Failed to initialize percpu areas."); > +#ifdef CONFIG_CRIS > +#warning "the CRIS architecture has physical and virtual addresses confused" > +#else > pcpu_free_alloc_info(ai); > +#endif > } > > #endif /* CONFIG_SMP */ Works for me, and thanks. /^JN - Jesper Nilsson -- Jesper Nilsson -- jesper.nilsson@axis.com
diff --git a/mm/percpu.c b/mm/percpu.c index 434844415d..fe37f85cc2 100644 --- a/mm/percpu.c +++ b/mm/percpu.c @@ -1410,13 +1410,17 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, struct pcpu_alloc_info *ai; size_t base_size, ai_size; void *ptr; - int unit; + int unit, align; - base_size = ALIGN(sizeof(*ai) + nr_groups * sizeof(ai->groups[0]), - __alignof__(ai->groups[0].cpu_map[0])); + align = __alignof__(ai->groups[0].cpu_map[0]); + base_size = ALIGN(sizeof(*ai) + nr_groups * sizeof(ai->groups[0]), align); ai_size = base_size + nr_units * sizeof(ai->groups[0].cpu_map[0]); + if (ai_size >= PAGE_SIZE) { + ai_size = PFN_ALIGN(ai_size); + align = PAGE_SIZE; + } - ptr = memblock_virt_alloc_nopanic(PFN_ALIGN(ai_size), 0); + ptr = memblock_virt_alloc_nopanic(ai_size, align); if (!ptr) return NULL; ai = ptr; @@ -1428,7 +1432,7 @@ struct pcpu_alloc_info * __init pcpu_alloc_alloc_info(int nr_groups, ai->groups[0].cpu_map[unit] = NR_CPUS; ai->nr_groups = nr_groups; - ai->__ai_size = PFN_ALIGN(ai_size); + ai->__ai_size = ai_size; return ai; }
This can be much smaller than a page on very small memory systems. Always rounding up the size to a page is wasteful in that case, and required alignment is smaller than the memblock default. Let's round things up to a page size only when the actual size is >= page size, and then it makes sense to page-align for a nicer allocation pattern. Signed-off-by: Nicolas Pitre <nico@linaro.org>