
[v2,0/5] mm: place pages to the freelist tail when onlining and undoing isolation

Message ID 20201005121534.15649-1-david@redhat.com

Message

David Hildenbrand Oct. 5, 2020, 12:15 p.m. UTC
When adding separate memory blocks via add_memory*() and onlining them
immediately, the metadata (especially the memmap) of the next block will be
placed onto one of the just added+onlined blocks. This creates a chain of
unmovable allocations: if the last memory block cannot get offlined+removed,
neither can any of the memory blocks it depends on. We directly have
unmovable allocations all over the place.

This can be observed quite easily using virtio-mem; however, it can also
be observed when using DIMMs. The freshly onlined pages will usually be
placed at the head of the freelists, meaning they will be allocated next,
which usually makes the just-added memory un-removable right away. The
fresh pages are also cold, so preferring to allocate other (possibly hot)
pages first feels like the natural thing to do.
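
To illustrate why head placement matters, here is a deliberately simplified
userspace sketch - a toy model, not the buddy allocator and not kernel code,
with made-up names like free_to_head()/free_to_tail(): a single freelist from
which "allocation" always takes the head element. Freshly "onlined" pages added
at the head are handed out first; added at the tail, they are only handed out
once everything else has been used.

/* toy_freelist.c - toy model of head vs. tail freelist placement */
#include <stdio.h>

struct fake_page { int pfn; struct fake_page *next; };

static struct fake_page *freelist;

/* Place a free page at the head - it becomes the next allocation target. */
static void free_to_head(struct fake_page *p)
{
        p->next = freelist;
        freelist = p;
}

/* Place a free page at the tail - it is only used once the rest is gone. */
static void free_to_tail(struct fake_page *p)
{
        struct fake_page **pp = &freelist;

        while (*pp)
                pp = &(*pp)->next;
        p->next = NULL;
        *pp = p;
}

/* "Allocation" always takes the head of the freelist. */
static struct fake_page *take_first_free(void)
{
        struct fake_page *p = freelist;

        if (p)
                freelist = p->next;
        return p;
}

int main(void)
{
        struct fake_page old = { .pfn = 1 }, fresh = { .pfn = 1000 };

        free_to_head(&old);
        free_to_head(&fresh);           /* head placement */
        printf("head placement -> first alloc: pfn %d\n", take_first_free()->pfn);

        freelist = NULL;
        free_to_head(&old);
        free_to_tail(&fresh);           /* tail placement */
        printf("tail placement -> first alloc: pfn %d\n", take_first_free()->pfn);
        return 0;
}

Compiled with any C compiler, the first line prints pfn 1000 (the fresh page
gets allocated immediately), the second prints pfn 1 (the old page is used
first, leaving the fresh one untouched).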

The same applies to the Hyper-V balloon, the Xen balloon, and ppc64 dlpar:
when adding separate, successive memory blocks, each memory block will have
unmovable allocations on it - for example, gigantic pages will fail to
allocate.

While ZONE_NORMAL doesn't provide any guarantee that memory can get
offlined+removed again (any kind of fragmentation with unmovable
allocations is possible), there are many scenarios (hotplugging a lot of
memory, running a workload, hotunplugging some memory/as much as possible)
where we can offline+remove quite a lot with this patchset.

a) To visualize the problem, a very simple example:

Start a VM with 4GB of boot memory and 8GB of virtio-mem memory:

 [root@localhost ~]# lsmem
 RANGE                                 SIZE  STATE REMOVABLE  BLOCK
 0x0000000000000000-0x00000000bfffffff   3G online       yes   0-23
 0x0000000100000000-0x000000033fffffff   9G online       yes 32-103

 Memory block size:       128M
 Total online memory:      12G
 Total offline memory:      0B

Then try to unplug as much as possible using virtio-mem. Observe which
memory blocks are still around. Without this patch set:

 [root@localhost ~]# lsmem
 RANGE                                  SIZE  STATE REMOVABLE   BLOCK
 0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
 0x0000000100000000-0x000000013fffffff    1G online       yes   32-39
 0x0000000148000000-0x000000014fffffff  128M online       yes      41
 0x0000000158000000-0x000000015fffffff  128M online       yes      43
 0x0000000168000000-0x000000016fffffff  128M online       yes      45
 0x0000000178000000-0x000000017fffffff  128M online       yes      47
 0x0000000188000000-0x0000000197ffffff  256M online       yes   49-50
 0x00000001a0000000-0x00000001a7ffffff  128M online       yes      52
 0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
 0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
 0x00000001d0000000-0x00000001d7ffffff  128M online       yes      58
 0x00000001e0000000-0x00000001e7ffffff  128M online       yes      60
 0x00000001f0000000-0x00000001f7ffffff  128M online       yes      62
 0x0000000200000000-0x0000000207ffffff  128M online       yes      64
 0x0000000210000000-0x0000000217ffffff  128M online       yes      66
 0x0000000220000000-0x0000000227ffffff  128M online       yes      68
 0x0000000230000000-0x0000000237ffffff  128M online       yes      70
 0x0000000240000000-0x0000000247ffffff  128M online       yes      72
 0x0000000250000000-0x0000000257ffffff  128M online       yes      74
 0x0000000260000000-0x0000000267ffffff  128M online       yes      76
 0x0000000270000000-0x0000000277ffffff  128M online       yes      78
 0x0000000280000000-0x0000000287ffffff  128M online       yes      80
 0x0000000290000000-0x0000000297ffffff  128M online       yes      82
 0x00000002a0000000-0x00000002a7ffffff  128M online       yes      84
 0x00000002b0000000-0x00000002b7ffffff  128M online       yes      86
 0x00000002c0000000-0x00000002c7ffffff  128M online       yes      88
 0x00000002d0000000-0x00000002d7ffffff  128M online       yes      90
 0x00000002e0000000-0x00000002e7ffffff  128M online       yes      92
 0x00000002f0000000-0x00000002f7ffffff  128M online       yes      94
 0x0000000300000000-0x0000000307ffffff  128M online       yes      96
 0x0000000310000000-0x0000000317ffffff  128M online       yes      98
 0x0000000320000000-0x0000000327ffffff  128M online       yes     100
 0x0000000330000000-0x000000033fffffff  256M online       yes 102-103

 Memory block size:       128M
 Total online memory:     8.1G
 Total offline memory:      0B

With this patch set:

 [root@localhost ~]# lsmem
 RANGE                                 SIZE  STATE REMOVABLE BLOCK
 0x0000000000000000-0x00000000bfffffff   3G online       yes  0-23
 0x0000000100000000-0x000000013fffffff   1G online       yes 32-39

 Memory block size:       128M
 Total online memory:       4G
 Total offline memory:      0B

All memory can get unplugged and all memory blocks can get removed. Of
course, no workload ran and the system was basically idle, but it highlights
the issue - the fairly deterministic chain of unmovable allocations. When a
huge page for the 2MB memmap is needed, a just-onlined 4MB page will be
split. The remaining 2MB page will be used for the memmap of the next
memory block. So one memory block will hold the memmap of the two following
memory blocks. Finally, the pages of the last-onlined memory block will get
used for the next bigger allocations - if any such allocation is unmovable,
all dependent memory blocks cannot get unplugged and removed until that
allocation is gone.
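
For reference, a back-of-the-envelope sketch of where the 2MB memmap per
128MB memory block comes from, assuming 4KiB base pages and a 64-byte
struct page (typical for x86-64, but configuration dependent):

/* memmap_size.c - rough memmap-per-memory-block estimate (assumed constants) */
#include <stdio.h>

#define BASE_PAGE_SIZE          (4UL * 1024)            /* 4 KiB */
#define STRUCT_PAGE_SIZE        64UL                    /* bytes per memmap entry */
#define MEMORY_BLOCK_SIZE       (128UL * 1024 * 1024)   /* 128 MiB */

int main(void)
{
        unsigned long pages_per_block = MEMORY_BLOCK_SIZE / BASE_PAGE_SIZE;
        unsigned long memmap_bytes = pages_per_block * STRUCT_PAGE_SIZE;

        /* 32768 pages * 64 bytes = 2 MiB of memmap per 128 MiB block */
        printf("pages per block: %lu, memmap: %lu MiB\n",
               pages_per_block, memmap_bytes >> 20);
        return 0;
}

With these assumed numbers, a single split of a just-onlined 4MB buddy page
yields two such 2MB chunks - enough for the memmaps of two subsequent memory
blocks, matching the chain described above.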

Note that with bigger memory blocks (e.g., 256MB), *all* memory
blocks are dependent and none can get unplugged again!

b) Experiment with a memory-intensive workload

I performed an experiment with an older version of this patch set
(before we used undo_isolate_page_range() in online_pages()):
hotplug 56GB to a VM with an initial 4GB, onlining all memory to
ZONE_NORMAL right from the kernel when adding it. I then ran various
memory-intensive workloads that consumed most system memory for a total of
45 minutes. Once finished, I tried to unplug as much memory as possible.

With this change, I was able to remove 413 out of 448 added memory blocks
via virtio-mem (adding individual 128MB memory blocks), and 380 out of 448
via individual (256MB) DIMMs. (I don't have any numbers without this
patchset, but looking at the above example, it's at most half of the 448
memory blocks for virtio-mem, and most probably none for DIMMs.)

Again, there are workloads that might behave very differently due to the
nature of ZONE_NORMAL.

This change also affects (besides memory onlining):
- Other users of undo_isolate_page_range(): Pages are always placed to the
  tail.
-- When memory offlining fails
-- When memory isolation fails after having isolated some pageblocks
-- When alloc_contig_range() either succeeds or fails
- Other users of __putback_isolated_page(): Pages are always placed to the
  tail.
-- Free page reporting
- Other users of __free_pages_core()
-- AFAIK, any memory that is getting exposed to the buddy during boot.
   IIUC we will now usually allocate memory from lower addresses within
   a zone first (especially during boot).
- Other users of generic_online_page()
-- Hyper-V balloon

v1 -> v2:
- Avoid changing indentation/alignment of function parameters
- Minor spelling fixes
- "mm/page_alloc: convert "report" flag of __free_one_page() to a proper
   flag"
-- fop_t -> fpi_t
-- Clarify/extend documentation of FPI_SKIP_REPORT_NOTIFY
- "mm/page_alloc: move pages to tail in move_to_free_list()"
-- Perform change for all move_to_free_list()/move_freepages_block() users
   to simplify.
-- Adjust subject/description accordingly.
- "mm/page_alloc: place pages to tail in __free_pages_core()"
-- s/init_single_page/__init_single_page/

RFC -> v1:
- Tweak some patch descriptions
- "mm/page_alloc: place pages to tail in __putback_isolated_page()"
-- FOP_TO_TAIL now has higher precedence than page shuffling
-- Add a note that nothing should rely on FOP_TO_TAIL for correctness
- "mm/page_alloc: always move pages to the tail of the freelist in
   unset_migratetype_isolate()"
-- Use "bool" parameter for move_freepages_block() as requested
- "mm/page_alloc: place pages to tail in __free_pages_core()"
-- Eliminate set_page_refcounted() + page_ref_dec() and add a comment
- "mm/memory_hotplug: update comment regarding zone shuffling"
-- Added

David Hildenbrand (5):
  mm/page_alloc: convert "report" flag of __free_one_page() to a proper
    flag
  mm/page_alloc: place pages to tail in __putback_isolated_page()
  mm/page_alloc: move pages to tail in move_to_free_list()
  mm/page_alloc: place pages to tail in __free_pages_core()
  mm/memory_hotplug: update comment regarding zone shuffling

 mm/memory_hotplug.c | 11 +++---
 mm/page_alloc.c     | 84 +++++++++++++++++++++++++++++++++++----------
 mm/page_isolation.c |  5 +++
 3 files changed, 75 insertions(+), 25 deletions(-)

Comments

Michal Hocko Oct. 6, 2020, 12:12 p.m. UTC | #1
On Mon 05-10-20 14:15:32, David Hildenbrand wrote:
> Whenever we move pages between freelists via move_to_free_list()/
> move_freepages_block(), we don't actually touch the pages:
> 1. Page isolation doesn't actually touch the pages, it simply isolates
>    pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>    When undoing isolation, we move the pages back to the target list.
> 2. Page stealing (steal_suitable_fallback()) moves free pages directly
>    between lists without touching them.
> 3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
>    free pages directly between freelists without touching them.
> 
> We already place pages to the tail of the freelists when undoing isolation
> via __putback_isolated_page(), let's do it in any case (e.g., if order <=
> pageblock_order) and document the behavior. To simplify, let's move the
> pages to the tail for all move_to_free_list()/move_freepages_block() users.
> 
> In 2., the target list is empty, so there should be no change. In 3.,
> we might observe a change, however, highatomic is more concerned about
> allocations succeeding than cache hotness - if we ever realize this
> change degrades a workload, we can special-case this instance and add a
> proper comment.
> 
> This change results in all pages getting onlined via online_pages() to
> be placed to the tail of the freelist.
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Scott Cheloha <cheloha@linux.ibm.com>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Much simpler!
Acked-by: Michal Hocko <mhocko@suse.com>


Thanks!

> ---
>  mm/page_alloc.c     | 10 +++++++---
>  mm/page_isolation.c |  5 +++++
>  2 files changed, 12 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index df5ff0cd6df1..b187e46cf640 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -901,13 +901,17 @@ static inline void add_to_free_list_tail(struct page *page, struct zone *zone,
>  	area->nr_free++;
>  }
>  
> -/* Used for pages which are on another list */
> +/*
> + * Used for pages which are on another list. Move the pages to the tail
> + * of the list - so the moved pages won't immediately be considered for
> + * allocation again (e.g., optimization for memory onlining).
> + */
>  static inline void move_to_free_list(struct page *page, struct zone *zone,
>  				     unsigned int order, int migratetype)
>  {
>  	struct free_area *area = &zone->free_area[order];
>  
> -	list_move(&page->lru, &area->free_list[migratetype]);
> +	list_move_tail(&page->lru, &area->free_list[migratetype]);
>  }
>  
>  static inline void del_page_from_free_list(struct page *page, struct zone *zone,
> @@ -2340,7 +2344,7 @@ static inline struct page *__rmqueue_cma_fallback(struct zone *zone,
>  #endif
>  
>  /*
> - * Move the free pages in a range to the free lists of the requested type.
> + * Move the free pages in a range to the freelist tail of the requested type.
>   * Note that start_page and end_pages are not aligned on a pageblock
>   * boundary. If alignment is required, use move_freepages_block()
>   */
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index abfe26ad59fd..83692b937784 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -106,6 +106,11 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
>  	 * If we isolate freepage with more than pageblock_order, there
>  	 * should be no freepage in the range, so we could avoid costly
>  	 * pageblock scanning for freepage moving.
> +	 *
> +	 * We didn't actually touch any of the isolated pages, so place them
> +	 * to the tail of the freelist. This is an optimization for memory
> +	 * onlining - just onlined memory won't immediately be considered for
> +	 * allocation.
>  	 */
>  	if (!isolated_page) {
>  		nr_pages = move_freepages_block(zone, page, migratetype, NULL);
> -- 
> 2.26.2

-- 
Michal Hocko
SUSE Labs
Vlastimil Babka Oct. 20, 2020, 5:20 p.m. UTC | #2
On 10/5/20 2:15 PM, David Hildenbrand wrote:
> Whenever we move pages between freelists via move_to_free_list()/
> move_freepages_block(), we don't actually touch the pages:
> 1. Page isolation doesn't actually touch the pages, it simply isolates
>     pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
>     When undoing isolation, we move the pages back to the target list.
> 2. Page stealing (steal_suitable_fallback()) moves free pages directly
>     between lists without touching them.
> 3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
>     free pages directly between freelists without touching them.
> 
> We already place pages to the tail of the freelists when undoing isolation
> via __putback_isolated_page(), let's do it in any case (e.g., if order <=
> pageblock_order) and document the behavior. To simplify, let's move the
> pages to the tail for all move_to_free_list()/move_freepages_block() users.
> 
> In 2., the target list is empty, so there should be no change. In 3.,
> we might observe a change, however, highatomic is more concerned about
> allocations succeeding than cache hotness - if we ever realize this
> change degrades a workload, we can special-case this instance and add a
> proper comment.
> 
> This change results in all pages getting onlined via online_pages() to
> be placed to the tail of the freelist.
> 
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
> Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Vlastimil Babka Oct. 20, 2020, 5:21 p.m. UTC | #3
On 10/5/20 2:15 PM, David Hildenbrand wrote:
> As we no longer shuffle via generic_online_page() and when undoing
> isolation, we can simplify the comment.
> 
> We now effectively shuffle only once (properly) when onlining new
> memory.
> 
> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/memory_hotplug.c | 11 ++++-------
>   1 file changed, 4 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 03a00cb68bf7..b44d4c7ba73b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -858,13 +858,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>   	undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
>   
>   	/*
> -	 * When exposing larger, physically contiguous memory areas to the
> -	 * buddy, shuffling in the buddy (when freeing onlined pages, putting
> -	 * them either to the head or the tail of the freelist) is only helpful
> -	 * for maintaining the shuffle, but not for creating the initial
> -	 * shuffle. Shuffle the whole zone to make sure the just onlined pages
> -	 * are properly distributed across the whole freelist. Make sure to
> -	 * shuffle once pageblocks are no longer isolated.
> +	 * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> +	 * the tail of the freelist when undoing isolation). Shuffle the whole
> +	 * zone to make sure the just onlined pages are properly distributed
> +	 * across the whole freelist - to create an initial shuffle.
>   	 */
>   	shuffle_zone(zone);
>   
>
Pankaj Gupta Oct. 21, 2020, 10:58 a.m. UTC | #4
> As we no longer shuffle via generic_online_page() and when undoing
> isolation, we can simplify the comment.
>
> We now effectively shuffle only once (properly) when onlining new
> memory.
>
> Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
> Acked-by: Michal Hocko <mhocko@suse.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> Cc: Mel Gorman <mgorman@techsingularity.net>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Mike Rapoport <rppt@kernel.org>
> Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory_hotplug.c | 11 ++++-------
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 03a00cb68bf7..b44d4c7ba73b 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -858,13 +858,10 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
>         undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
>
>         /*
> -        * When exposing larger, physically contiguous memory areas to the
> -        * buddy, shuffling in the buddy (when freeing onlined pages, putting
> -        * them either to the head or the tail of the freelist) is only helpful
> -        * for maintaining the shuffle, but not for creating the initial
> -        * shuffle. Shuffle the whole zone to make sure the just onlined pages
> -        * are properly distributed across the whole freelist. Make sure to
> -        * shuffle once pageblocks are no longer isolated.
> +        * Freshly onlined pages aren't shuffled (e.g., all pages are placed to
> +        * the tail of the freelist when undoing isolation). Shuffle the whole
> +        * zone to make sure the just onlined pages are properly distributed
> +        * across the whole freelist - to create an initial shuffle.
>          */
>         shuffle_zone(zone);
>

Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>