[v4,0/2] mm: fix initialization of struct page for holes in memory layout

Message ID 20210130221035.4169-1-rppt@kernel.org

Message

Mike Rapoport Jan. 30, 2021, 10:10 p.m. UTC
From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

Commit 73a6e474cb37 ("mm: memmap_init: iterate over
memblock regions rather that check each PFN") exposed several issues with
the memory map initialization and these patches fix those issues.

Initially there were crashes during compaction that Qian Cai reported back
in April [1]. It seemed back then that the problem was fixed, but a few
weeks ago Andrea Arcangeli hit the same bug [2] and there was an additional
discussion at [3].

I didn't appreciate the variety of ways BIOSes can report memory in the first
megabyte, so v3 of this set caused boot failures on several x86 systems.
Hopefully this time I covered all the bases.

The first patch here complements commit bde9cfa3afe4 ("x86/setup: don't
remove E820_TYPE_RAM for pfn 0") for the cases when BIOS reports the first
page as absent or reserved.

The second patch is a more robust version of d3921cb8be29 ("mm: fix
initialization of struct page for holes in memory layout") that can now
handle the above cases as well.

v4:
* make sure pages in the range 0 - start_pfn_of_lowest_zone are initialized
  even if an architecture hides them from the generic mm
* finally make pfn 0 on x86 a part of the memory visible to the generic
  mm as reserved memory.

v3: https://lore.kernel.org/lkml/20210111194017.22696-1-rppt@kernel.org
* use architectural zone constraints to set zone links for struct pages
  corresponding to the holes
* drop implicit update of memblock.memory
* add a patch that sets pfn 0 to E820_TYPE_RAM on x86

v2: https://lore.kernel.org/lkml/20201209214304.6812-1-rppt@kernel.org
* added patch that adds all regions in memblock.reserved that do not
  overlap with memblock.memory to memblock.memory in the beginning of
  free_area_init()

[1] https://lore.kernel.org/lkml/8C537EB7-85EE-4DCF-943E-3CC0ED0DF56D@lca.pw
[2] https://lore.kernel.org/lkml/20201121194506.13464-1-aarcange@redhat.com
[3] https://lore.kernel.org/mm-commits/20201206005401.qKuAVgOXr%akpm@linux-foundation.org

Mike Rapoport (2):
  x86/setup: always add the beginning of RAM as memblock.memory
  mm: fix initialization of struct page for holes in memory layout

 arch/x86/kernel/setup.c |  8 ++++
 mm/page_alloc.c         | 85 ++++++++++++++++++++++++-----------------
 2 files changed, 59 insertions(+), 34 deletions(-)

Comments

Linus Torvalds Jan. 31, 2021, 12:37 a.m. UTC | #1
On Sat, Jan 30, 2021 at 2:10 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
> to memblock.memory and later during memory map initialization this range is
> left outside any zone.

Honestly, this just sounds like memblock being stupid in the first place.

Why aren't these zones padded to sane alignments?

This patch smells like working around the memblock code being fragile
rather than a real fix.

That's *particularly* true when the very line above it did a
"memblock_reserve()" of the exact same range that the memblock_add()
"adds".

              Linus
Mike Rapoport Jan. 31, 2021, 8:03 a.m. UTC | #2
On Sat, Jan 30, 2021 at 04:37:54PM -0800, Linus Torvalds wrote:
> On Sat, Jan 30, 2021 at 2:10 PM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > In either case, e820__memblock_setup() won't add the range 0x0000 - 0x1000
> > to memblock.memory and later during memory map initialization this range is
> > left outside any zone.
>
> Honestly, this just sounds like memblock being stupid in the first place.
>
> Why aren't these zones padded to sane alignments?
 
The implicit alignment of zones would be a guess. What alignment would be
sane here? 1M? MAX_ORDER? pageblock_order?

If an architecture reports its memory at X and we use, say,
round_down(X, 1M) for node[0]->node_start_pfn and zone[0]->zone_start_pfn,
I'm not sure it wouldn't cause a boot failure on some system out there in
the wild.
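
To make the alignment question concrete, here is a minimal sketch of what
rounding a zone start down would do. It is not kernel code: the helper names
round_down_pow2() and extra_pfns() are hypothetical, 4 KiB pages are assumed,
and the alignment is left as a parameter because choosing it is exactly the
open question.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages, as on x86 */

/* Round addr down to a power-of-two alignment (hypothetical helper). */
uint64_t round_down_pow2(uint64_t addr, uint64_t align)
{
	/* The mask trick below only works for non-zero powers of two. */
	assert(align != 0 && (align & (align - 1)) == 0);
	return addr & ~(align - 1);
}

/*
 * Number of page frames a zone would gain at its start if the first
 * reported address were rounded down to the given alignment.
 */
uint64_t extra_pfns(uint64_t start, uint64_t align)
{
	return (start - round_down_pow2(start, align)) >> PAGE_SHIFT;
}
```

For memory reported at 0x1000, extra_pfns(0x1000, 0x100000) is 1: rounding
down to 1M pulls pfn 0 into the zone regardless of what the firmware said
about that page. A larger alignment applied to a higher start address pulls
in correspondingly more frames.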

> This patch smells like working around the memblock code being fragile
> rather than a real fix.
>
> That's *particularly* true when the very line above it did a
> "memblock_reserve()" of the exact same range that the memblock_add()
> "adds".

The most correct thing to do would have been to 

	memblock_add(0, end_of_first_memory_bank);

Somewhere at e820__memblock_setup().

But that would mean we also must change the way e820__memblock_setup()
reserves memory and that seemed to me like really asking for trouble, so
I've limited the registration of memory to the range that's for sure
reserved.

A part of the problem is that x86 adds only usable memory to
memblock.memory omitting holes and reserved areas, while free_area_init()
presumes that memblock.memory covers populated physical memory.
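
The mismatch can be illustrated with a small model. This is a sketch, not the
kernel's implementation: struct range and lowest_registered_pfn() are
hypothetical names, 4 KiB pages and page-aligned range bases are assumed, and
the list simply stands in for memblock.memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Stand-in for one registered memory range (hypothetical type). */
struct range {
	uint64_t base;
	uint64_t size;
};

/*
 * Lowest page frame number covered by any registered range. Generic code
 * that derives zone boundaries only from the registered ranges cannot
 * start a zone below this pfn. Bases are assumed to be page aligned.
 */
uint64_t lowest_registered_pfn(const struct range *mem, size_t n)
{
	uint64_t lowest;
	size_t i;

	assert(n > 0);
	lowest = mem[0].base;
	for (i = 1; i < n; i++)
		if (mem[i].base < lowest)
			lowest = mem[i].base;
	return lowest >> PAGE_SHIFT;
}
```

If the lowest registered range starts at 0x1000, the function returns 1, so
pfn 0 sits below the lowest zone even though a struct page exists for it.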

I've tried implicitly adding ranges from memblock.reserved to
memblock.memory if they were not there and it broke some arm machines:

https://lore.kernel.org/lkml/127999c4-7d56-0c36-7f88-8e1a5c934cae@collabora.com

I do feel that free_area_init() is fragile and no doubt there is room for
improvement there. But I think the safer way forward is to reduce
inconsistencies between arch and generic code, so that we won't need to
guess what the memory layout is at free_area_init() time.
 
>               Linus


-- 
Sincerely yours,
Mike.
Linus Torvalds Jan. 31, 2021, 9:49 p.m. UTC | #3
On Sun, Jan 31, 2021 at 12:04 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> >
> > That's *particularly* true when the very line above it did a
> > "memblock_reserve()" of the exact same range that the memblock_add()
> > "adds".
>
> The most correct thing to do would have been to
>
>         memblock_add(0, end_of_first_memory_bank);
>
> Somewhere at e820__memblock_setup().

You miss my complaint.

Why does the memblock code care about this magical "memblock_add()",
when we just told it that the SAME REGION is reserved by doing a
"memblock_reserve()"?

IOW, I'm not interested in "the correct thing to do would have been
[another memblock_add()]". I'm saying that the memblock code itself is
being confused, and no additional thing should have been required at
all, because we already *did* that memblock_reserve().

See?

Honestly, I'm not seeing it being a good thing to move further towards
memblock code as the primary model for memory initialization, when the
memblock code is so confused.

              Linus
Mike Rapoport Feb. 1, 2021, 2:06 p.m. UTC | #4
On Sun, Jan 31, 2021 at 01:49:27PM -0800, Linus Torvalds wrote:
> On Sun, Jan 31, 2021 at 12:04 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > >
> > > That's *particularly* true when the very line above it did a
> > > "memblock_reserve()" of the exact same range that the memblock_add()
> > > "adds".
> >
> > The most correct thing to do would have been to
> >
> >         memblock_add(0, end_of_first_memory_bank);
> >
> > Somewhere at e820__memblock_setup().
>
> You miss my complaint.
>
> Why does the memblock code care about this magical "memblock_add()",
> when we just told it that the SAME REGION is reserved by doing a
> "memblock_reserve()"?
>
> IOW, I'm not interested in "the correct thing to do would have been
> [another memblock_add()]". I'm saying that the memblock code itself is
> being confused, and no additional thing should have been required at
> all, because we already *did* that memblock_reserve().
>
> See?

There is nothing magical about memblock_add().

Memblock presumes that arch code uses memblock_add() to register populated
physical memory ranges and memblock_reserve() to protect memory ranges that
should not be touched. These ranges do not necessarily overlap, so there
may be reserved ranges that do not have corresponding registered memory.

This lets architectures say "here are the memory banks I have" and "this
memory is in use" (or even "this memory _might_ be in use") independently
of each other.

The downside is that if there is a reserved range there is no way to tell
whether it is backed by populated memory.

We could change these semantics and enforce the overlap, e.g. by
implicitly adding all the reserved ranges to the registered memory.
I've already tried that and found out that there are systems that rely
on memblock's ability to track reserved and available ranges independently.
For example, the arm systems I mentioned in the previous mail always have a
reserved chunk at 0xfe000000 in their DTS, but they may have only 2G of
memory actually populated.
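
The ambiguity can be shown with a toy model of the two lists. This is only a
sketch under stated assumptions: struct range and range_is_registered() are
hypothetical names, not memblock API, and the registered list is assumed to
have adjacent ranges already merged. The concrete sizes used in the note
below are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for one entry of the registered-memory list (hypothetical). */
struct range {
	uint64_t base;
	uint64_t size;
};

/*
 * True if [base, base + size) lies entirely inside one registered range.
 * Assumes adjacent registered ranges have been merged, so a covered range
 * never straddles two entries.
 */
bool range_is_registered(const struct range *mem, size_t n,
			 uint64_t base, uint64_t size)
{
	size_t i;

	assert(size > 0);
	for (i = 0; i < n; i++) {
		uint64_t end = mem[i].base + mem[i].size;

		if (base >= mem[i].base && base + size <= end)
			return true;
	}
	return false;
}
```

With x86-like registered memory starting at 0x1000, a reservation of
[0x0, 0x1000) is reported as not registered. With arm-like registered memory
of 2G starting at 0, a reservation at 0xfe000000 is also reported as not
registered. The answer is the same in both cases, although in the first the
page is real RAM and in the second nothing is populated there, which is why
the reserved list alone cannot be used to infer populated memory.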

Now, on x86 there has been a gap between e820 and memblock since the 2.6
days. As of now, only E820_TYPE_RAM is added to memblock as memory, some of
the E820_*_RESERVED are reserved and on top there are reservations of the
memory that's known to be used by BIOS or kernel.

I'm trying to close this gap in small steps, with changes that I believe
will not break so many things at once that it becomes unmanageable.

> Honestly, I'm not seeing it being a good thing to move further towards
> memblock code as the primary model for memory initialization, when the
> memblock code is so confused.

I'm not sure I follow you here.
If I'm not mistaken, memblock has been used as the primary model for memmap
and page allocator initialization for almost a decade now...
 
>               Linus


-- 
Sincerely yours,
Mike.