Message ID: 20220111113314.27173-1-kirill.shutemov@linux.intel.com
Series: Implement support for unaccepted memory
> diff --git a/mm/memblock.c b/mm/memblock.c > index 1018e50566f3..6dfa594192de 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, > */ > kmemleak_alloc_phys(found, size, 0, 0); > > + accept_memory(found, found + size); > return found; > } This could use a comment. Looking at this, I also have to wonder if accept_memory() is a bit too generic. Should it perhaps be: cc_accept_memory() or cc_guest_accept_memory()? > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index c5952749ad40..5707b4b5f774 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page, > unsigned int max_order; > struct page *buddy; > bool to_tail; > + bool offline = PageOffline(page); > > max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order); > > @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page, > clear_page_guard(zone, buddy, order, migratetype); > else > del_page_from_free_list(buddy, zone, order); > + > + if (PageOffline(buddy)) > + offline = true; > + > combined_pfn = buddy_pfn & pfn; > page = page + (combined_pfn - pfn); > pfn = combined_pfn; > @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page, > done_merging: > set_buddy_order(page, order); > > + if (offline) > + __SetPageOffline(page); > + > if (fpi_flags & FPI_TO_TAIL) > to_tail = true; > else if (is_shuffle_order(order)) This is touching some pretty hot code paths. You mention both that accepting memory is slow and expensive, yet you're doing it in the core allocator. That needs at least some discussion in the changelog. > @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page, > static inline bool page_expected_state(struct page *page, > unsigned long check_flags) > { > - if (unlikely(atomic_read(&page->_mapcount) != -1)) > + if (unlikely(atomic_read(&page->_mapcount) != -1) && > + !PageOffline(page)) > return false; Looking at stuff like this, I can't help but think that a: #define PageOffline PageUnaccepted and some other renaming would be a fine idea. I get that the Offline bit can be reused, but I'm not sure that the "Offline" *naming* should be reused. What you're doing here is logically distinct from existing offlining. > if (unlikely((unsigned long)page->mapping | > @@ -1734,6 +1743,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn, > { > if (early_page_uninitialised(pfn)) > return; > + > + maybe_set_page_offline(page, order); > __free_pages_core(page, order); > } > > @@ -1823,10 +1834,12 @@ static void __init deferred_free_range(unsigned long pfn, > if (nr_pages == pageblock_nr_pages && > (pfn & (pageblock_nr_pages - 1)) == 0) { > set_pageblock_migratetype(page, MIGRATE_MOVABLE); > + maybe_set_page_offline(page, pageblock_order); > __free_pages_core(page, pageblock_order); > return; > } > > + accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT); > for (i = 0; i < nr_pages; i++, page++, pfn++) { > if ((pfn & (pageblock_nr_pages - 1)) == 0) > set_pageblock_migratetype(page, MIGRATE_MOVABLE); > @@ -2297,6 +2310,9 @@ static inline void expand(struct zone *zone, struct page *page, > if (set_page_guard(zone, &page[size], high, migratetype)) > continue; > > + if (PageOffline(page)) > + __SetPageOffline(&page[size]); Yeah, this is really begging for comments. Please add some. 
> add_to_free_list(&page[size], zone, high, migratetype); > set_buddy_order(&page[size], high); > } > @@ -2393,6 +2409,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order, > */ > kernel_unpoison_pages(page, 1 << order); > > + if (PageOffline(page)) > + accept_and_clear_page_offline(page, order); > + > /* > * As memory initialization might be integrated into KASAN, > * kasan_alloc_pages and kernel_init_free_pages must be I guess once there are no more PageOffline() pages in the allocator, the only impact from these patches will be a bunch of conditional branches from the "if (PageOffline(page))" that always have the same result. The branch predictors should do a good job with that. *BUT*, that overhead is going to be universally inflicted on all users on x86, even those without TDX. I guess the compiler will save non-x86 users because they'll have an empty stub for accept_and_clear_page_offline() which the compiler will optimize away. It sure would be nice to have some changelog material about why this is OK, though. This is especially true since there's a global spinlock hidden in accept_and_clear_page_offline() wrapping a slow and "costly" operation.
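For context on the "empty stub" point above, here is a minimal sketch of the config-gated pattern being described; the helper name and the CONFIG_UNACCEPTED_MEMORY symbol follow the posted series, but this header fragment itself is purely illustrative:

	struct page;

	#ifdef CONFIG_UNACCEPTED_MEMORY
	void accept_and_clear_page_offline(struct page *page, unsigned int order);
	#else
	/*
	 * Empty stub: on builds without unaccepted-memory support the call
	 * sites have no side effects left, so the compiler is free to drop
	 * the surrounding "if (PageOffline(page))" checks entirely.
	 */
	static inline void accept_and_clear_page_offline(struct page *page,
							 unsigned int order) { }
	#endif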
> > Looking at stuff like this, I can't help but think that a: > > #define PageOffline PageUnaccepted > > and some other renaming would be a fine idea. I get that the Offline > bit can be reused, but I'm not sure that the "Offline" *naming* should > be reused. What you're doing here is logically distinct from existing > offlining. Yes, or using a new pagetype bit to make the distinction clearer. Especially the function names like maybe_set_page_offline() et. Al are confusing IMHO. They are all about accepting unaccepted memory ... and should express that. I assume PageOffline() will be set only on the first sub-page of a high-order PageBuddy() page, correct? Then we'll have to monitor all PageOffline() users such that they can actually deal with PageBuddy() pages spanning *multiple* base pages for a PageBuddy() page. For now it's clear that if a page is PageOffline(), it cannot be PageBuddy() and cannot span more than one base page. E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on individual base pages.
On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote: > > diff --git a/mm/memblock.c b/mm/memblock.c > > index 1018e50566f3..6dfa594192de 100644 > > --- a/mm/memblock.c > > +++ b/mm/memblock.c > > @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, > > */ > > kmemleak_alloc_phys(found, size, 0, 0); > > + accept_memory(found, found + size); > > return found; > > } > > This could use a comment. How about this: /* * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, * requiring memory to be accepted before it can be used by the * guest. * * Accept the memory of the allocated buffer. */ > > Looking at this, I also have to wonder if accept_memory() is a bit too > generic. Should it perhaps be: cc_accept_memory() or > cc_guest_accept_memory()? I'll rename accept_memory() to cc_accept_memory() and accept_and_clear_page_offline() to cc_accept_and_clear_page_offline(). > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > index c5952749ad40..5707b4b5f774 100644 > > --- a/mm/page_alloc.c > > +++ b/mm/page_alloc.c > > @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page, > > unsigned int max_order; > > struct page *buddy; > > bool to_tail; > > + bool offline = PageOffline(page); > > max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order); > > @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page, > > clear_page_guard(zone, buddy, order, migratetype); > > else > > del_page_from_free_list(buddy, zone, order); > > + > > + if (PageOffline(buddy)) > > + offline = true; > > + > > combined_pfn = buddy_pfn & pfn; > > page = page + (combined_pfn - pfn); > > pfn = combined_pfn; > > @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page, > > done_merging: > > set_buddy_order(page, order); > > + if (offline) > > + __SetPageOffline(page); > > + I'll add /* Mark page PageOffline() if any merged page was PageOffline() */ above the 'if'. > > if (fpi_flags & FPI_TO_TAIL) > > to_tail = true; > > else if (is_shuffle_order(order)) > > This is touching some pretty hot code paths. You mention both that > accepting memory is slow and expensive, yet you're doing it in the core > allocator. > > That needs at least some discussion in the changelog. That is page type transfer on page merging. What expensive do you see here? The cachelines with both struct pages are hot already. > > @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page, > > static inline bool page_expected_state(struct page *page, > > unsigned long check_flags) > > { > > - if (unlikely(atomic_read(&page->_mapcount) != -1)) > > + if (unlikely(atomic_read(&page->_mapcount) != -1) && > > + !PageOffline(page)) > > return false; > > Looking at stuff like this, I can't help but think that a: > > #define PageOffline PageUnaccepted > > and some other renaming would be a fine idea. I get that the Offline bit > can be reused, but I'm not sure that the "Offline" *naming* should be > reused. What you're doing here is logically distinct from existing > offlining. I find the Offline name fitting. In both cases page is not accessible without additional preparation. Why do you want to multiply entities? 
> > if (unlikely((unsigned long)page->mapping | > > @@ -1734,6 +1743,8 @@ void __init memblock_free_pages(struct page *page, unsigned long pfn, > > { > > if (early_page_uninitialised(pfn)) > > return; > > + > > + maybe_set_page_offline(page, order); > > __free_pages_core(page, order); > > } > > @@ -1823,10 +1834,12 @@ static void __init deferred_free_range(unsigned long pfn, > > if (nr_pages == pageblock_nr_pages && > > (pfn & (pageblock_nr_pages - 1)) == 0) { > > set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > + maybe_set_page_offline(page, pageblock_order); > > __free_pages_core(page, pageblock_order); > > return; > > } > > + accept_memory(pfn << PAGE_SHIFT, (pfn + nr_pages) << PAGE_SHIFT); > > for (i = 0; i < nr_pages; i++, page++, pfn++) { > > if ((pfn & (pageblock_nr_pages - 1)) == 0) > > set_pageblock_migratetype(page, MIGRATE_MOVABLE); > > @@ -2297,6 +2310,9 @@ static inline void expand(struct zone *zone, struct page *page, > > if (set_page_guard(zone, &page[size], high, migratetype)) > > continue; > > + if (PageOffline(page)) > > + __SetPageOffline(&page[size]); > > Yeah, this is really begging for comments. Please add some. I'll add /* Transfer PageOffline() to newly split pages */ > > > add_to_free_list(&page[size], zone, high, migratetype); > > set_buddy_order(&page[size], high); > > } > > @@ -2393,6 +2409,9 @@ inline void post_alloc_hook(struct page *page, unsigned int order, > > */ > > kernel_unpoison_pages(page, 1 << order); > > + if (PageOffline(page)) > > + accept_and_clear_page_offline(page, order); > > + > > /* > > * As memory initialization might be integrated into KASAN, > > * kasan_alloc_pages and kernel_init_free_pages must be > > I guess once there are no more PageOffline() pages in the allocator, the > only impact from these patches will be a bunch of conditional branches from > the "if (PageOffline(page))" that always have the same result. The branch > predictors should do a good job with that. > > *BUT*, that overhead is going to be universally inflicted on all users on > x86, even those without TDX. I guess the compiler will save non-x86 users > because they'll have an empty stub for accept_and_clear_page_offline() which > the compiler will optimize away. > > It sure would be nice to have some changelog material about why this is OK, > though. This is especially true since there's a global spinlock hidden in > accept_and_clear_page_offline() wrapping a slow and "costly" operation. Okay, I will come up with an explanation in commit message.
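Pulling the quoted hunk and the comment promised earlier in this reply together, the merge tail might end up reading roughly like this (a sketch assembled from the pieces above, not a new posting):

	done_merging:
		set_buddy_order(page, order);

		/* Mark page PageOffline() if any merged page was PageOffline() */
		if (offline)
			__SetPageOffline(page);

		if (fpi_flags & FPI_TO_TAIL)
			to_tail = true;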
On 1/12/22 10:30, Kirill A. Shutemov wrote: > On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote: >>> diff --git a/mm/memblock.c b/mm/memblock.c >>> index 1018e50566f3..6dfa594192de 100644 >>> --- a/mm/memblock.c >>> +++ b/mm/memblock.c >>> @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, >>> */ >>> kmemleak_alloc_phys(found, size, 0, 0); >>> + accept_memory(found, found + size); >>> return found; >>> } >> >> This could use a comment. > > How about this: > > /* > * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, > * requiring memory to be accepted before it can be used by the > * guest. > * > * Accept the memory of the allocated buffer. > */ I think a one-liner that might cue the reader to go look at accept_memory() itself would be fine. Maybe: /* Make the memblock usable when running in picky VM guests: */ That implies that the memory isn't usable without doing this and also points out that it's related to running in a guest. >> Looking at this, I also have to wonder if accept_memory() is a bit too >> generic. Should it perhaps be: cc_accept_memory() or >> cc_guest_accept_memory()? > > I'll rename accept_memory() to cc_accept_memory() and > accept_and_clear_page_offline() to cc_accept_and_clear_page_offline(). > >> >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index c5952749ad40..5707b4b5f774 100644 >>> --- a/mm/page_alloc.c >>> +++ b/mm/page_alloc.c >>> @@ -1064,6 +1064,7 @@ static inline void __free_one_page(struct page *page, >>> unsigned int max_order; >>> struct page *buddy; >>> bool to_tail; >>> + bool offline = PageOffline(page); >>> max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order); >>> @@ -1097,6 +1098,10 @@ static inline void __free_one_page(struct page *page, >>> clear_page_guard(zone, buddy, order, migratetype); >>> else >>> del_page_from_free_list(buddy, zone, order); >>> + >>> + if (PageOffline(buddy)) >>> + offline = true; >>> + >>> combined_pfn = buddy_pfn & pfn; >>> page = page + (combined_pfn - pfn); >>> pfn = combined_pfn; >>> @@ -1130,6 +1135,9 @@ static inline void __free_one_page(struct page *page, >>> done_merging: >>> set_buddy_order(page, order); >>> + if (offline) >>> + __SetPageOffline(page); >>> + > > I'll add > > /* Mark page PageOffline() if any merged page was PageOffline() */ > > above the 'if'. > >>> if (fpi_flags & FPI_TO_TAIL) >>> to_tail = true; >>> else if (is_shuffle_order(order)) >> >> This is touching some pretty hot code paths. You mention both that >> accepting memory is slow and expensive, yet you're doing it in the core >> allocator. >> >> That needs at least some discussion in the changelog. > > That is page type transfer on page merging. What expensive do you see here? > The cachelines with both struct pages are hot already. I meant that comment generically rather than at this specific hunk. Just in general, I think this series needs to acknowledge that it is touching very core parts of the allocator and might make page allocation *MASSIVELY* slower, albeit temporarily. >>> @@ -1155,7 +1163,8 @@ static inline void __free_one_page(struct page *page, >>> static inline bool page_expected_state(struct page *page, >>> unsigned long check_flags) >>> { >>> - if (unlikely(atomic_read(&page->_mapcount) != -1)) >>> + if (unlikely(atomic_read(&page->_mapcount) != -1) && >>> + !PageOffline(page)) >>> return false; >> >> Looking at stuff like this, I can't help but think that a: >> >> #define PageOffline PageUnaccepted >> >> and some other renaming would be a fine idea. 
I get that the Offline bit >> can be reused, but I'm not sure that the "Offline" *naming* should be >> reused. What you're doing here is logically distinct from existing >> offlining. > > I find the Offline name fitting. In both cases page is not accessible > without additional preparation. > > Why do you want to multiply entities? The name wouldn't be bad *if* there was no other use of "Offline". But, logically, your use of "Offline" and the existing use of "Offline" are different things. They are totally orthogonal areas of the code. They should have different names. Again, I'm fine with using the same _bit_ in page->flags. But, the two logical uses need two different names.
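A minimal sketch of the aliasing Dave floats, assuming the underlying bit stays PG_offline and only the name used in the allocator changes; none of these defines are in the posted series:

	/*
	 * Same page->page_type bit as PageOffline(), separate name for the
	 * not-yet-accepted-memory use (illustrative only):
	 */
	#define PageUnaccepted(page)		PageOffline(page)
	#define __SetPageUnaccepted(page)	__SetPageOffline(page)
	#define __ClearPageUnaccepted(page)	__ClearPageOffline(page)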
On Wed, Jan 12, 2022 at 12:31:10PM +0100, David Hildenbrand wrote: > > > > > Looking at stuff like this, I can't help but think that a: > > > > #define PageOffline PageUnaccepted > > > > and some other renaming would be a fine idea. I get that the Offline > > bit can be reused, but I'm not sure that the "Offline" *naming* should > > be reused. What you're doing here is logically distinct from existing > > offlining. > > Yes, or using a new pagetype bit to make the distinction clearer. > Especially the function names like maybe_set_page_offline() et. Al are > confusing IMHO. They are all about accepting unaccepted memory ... and > should express that. "Unaccepted" is UEFI treminology and I'm not sure we want to expose core-mm to it. Power/S390/ARM may have a different name for the same concept. Offline/online is neutral terminology, familiar to MM developers. What if I change accept->online in function names and document the meaning properly? > I assume PageOffline() will be set only on the first sub-page of a > high-order PageBuddy() page, correct? > > Then we'll have to monitor all PageOffline() users such that they can > actually deal with PageBuddy() pages spanning *multiple* base pages for > a PageBuddy() page. For now it's clear that if a page is PageOffline(), > it cannot be PageBuddy() and cannot span more than one base page. > E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on > individual base pages. Right, pages that offline from hotplug POV are never on page allocator's free lists, so it cannot ever step on them.
On Tue, Jan 11, 2022 at 09:17:19AM -0800, Dave Hansen wrote: > On 1/11/22 03:33, Kirill A. Shutemov wrote: > ... > > +void mark_unaccepted(struct boot_params *params, u64 start, u64 end) > > +{ > > + /* > > + * The accepted memory bitmap only works at PMD_SIZE granularity. > > + * If a request comes in to mark memory as unaccepted which is not > > + * PMD_SIZE-aligned, simply accept the memory now since it can not be > > + * *marked* as unaccepted. > > + */ > > + > > + /* Immediately accept whole range if it is within a PMD_SIZE block: */ > > + if ((start & PMD_MASK) == (end & PMD_MASK)) { > > + npages = (end - start) / PAGE_SIZE; > > + __accept_memory(start, start + npages * PAGE_SIZE); > > + return; > > + } > > I still don't quite like how this turned out. It's still a bit unclear to > the reader that this has covered all the corner cases. I think this needs a > better comment: > > /* > * Handle <PMD_SIZE blocks that do not end at a PMD boundary. > * > * Immediately accept the whole block. This handles the case > * where the below round_{up,down}() would "lose" a small, > * <PMD_SIZE block. > */ > if ((start & PMD_MASK) == (end & PMD_MASK)) { > ... > return; > } > > /* > * There is at least one more block to accept. Both 'start' > * and 'end' may not be PMD-aligned. > */ Okay, looks better. Thanks. > > + /* Immediately accept a <PMD_SIZE piece at the start: */ > > + if (start & ~PMD_MASK) { > > + __accept_memory(start, round_up(start, PMD_SIZE)); > > + start = round_up(start, PMD_SIZE); > > + } > > + > > + /* Immediately accept a <PMD_SIZE piece at the end: */ > > + if (end & ~PMD_MASK) { > > + __accept_memory(round_down(end, PMD_SIZE), end); > > + end = round_down(end, PMD_SIZE); > > + } > > /* > * 'start' and 'end' are now both PMD-aligned. > * Record the range as being unaccepted: > */ Okay. > > + if (start == end) > > + return; > > Does bitmap_set()not accept zero-sized 'len' arguments? Looks like it does. Will drop this. > > + bitmap_set((unsigned long *)params->unaccepted_memory, > > + start / PMD_SIZE, (end - start) / PMD_SIZE); > > +} > > The code you have there is _precise_. It will never eagerly accept any area > that _can_ be represented in the bitmap. But, that's kinda hard to > describe. Maybe we should be a bit more sloppy about accepting things up > front to make it easier to describe: > > /* > * Accept small regions that might not be > * able to be represented in the bitmap: > */ > if (end - start < PMD_SIZE*2) { > npages = (end - start) / PAGE_SIZE; > __accept_memory(start, start + npages * PAGE_SIZE); > return; > } > > /* > * No matter how the start and end are aligned, at > * least one unaccepted PMD_SIZE area will remain. > */ > > ... now do the start/end rounding > > That has the downside of accepting a few things that it doesn't *HAVE* to > accept. But, its behavior is very easy to describe. Hm. Okay. I will give it a try. I like how it is now, but maybe it will be better. 
> > > diff --git a/arch/x86/include/asm/unaccepted_memory.h b/arch/x86/include/asm/unaccepted_memory.h > > new file mode 100644 > > index 000000000000..cbc24040b853 > > --- /dev/null > > +++ b/arch/x86/include/asm/unaccepted_memory.h > > @@ -0,0 +1,12 @@ > > +/* SPDX-License-Identifier: GPL-2.0 */ > > +/* Copyright (C) 2020 Intel Corporation */ > > +#ifndef _ASM_X86_UNACCEPTED_MEMORY_H > > +#define _ASM_X86_UNACCEPTED_MEMORY_H > > + > > +#include <linux/types.h> > > + > > +struct boot_params; > > + > > +void mark_unaccepted(struct boot_params *params, u64 start, u64 num); > > + > > +#endif > > diff --git a/arch/x86/include/uapi/asm/bootparam.h b/arch/x86/include/uapi/asm/bootparam.h > > index b25d3f82c2f3..16bc686a198d 100644 > > --- a/arch/x86/include/uapi/asm/bootparam.h > > +++ b/arch/x86/include/uapi/asm/bootparam.h > > @@ -217,7 +217,8 @@ struct boot_params { > > struct boot_e820_entry e820_table[E820_MAX_ENTRIES_ZEROPAGE]; /* 0x2d0 */ > > __u8 _pad8[48]; /* 0xcd0 */ > > struct edd_info eddbuf[EDDMAXNR]; /* 0xd00 */ > > - __u8 _pad9[276]; /* 0xeec */ > > + __u64 unaccepted_memory; /* 0xeec */ > > + __u8 _pad9[268]; /* 0xef4 */ > > } __attribute__((packed)); > > /** > > diff --git a/drivers/firmware/efi/Kconfig b/drivers/firmware/efi/Kconfig > > index 2c3dac5ecb36..36c1bf33f112 100644 > > --- a/drivers/firmware/efi/Kconfig > > +++ b/drivers/firmware/efi/Kconfig > > @@ -243,6 +243,20 @@ config EFI_DISABLE_PCI_DMA > > options "efi=disable_early_pci_dma" or "efi=no_disable_early_pci_dma" > > may be used to override this option. > > +config UNACCEPTED_MEMORY > > + bool > > + depends on EFI_STUB > > + help > > + Some Virtual Machine platforms, such as Intel TDX, introduce > > + the concept of memory acceptance, requiring memory to be accepted > > + before it can be used by the guest. This protects against a class of > > + attacks by the virtual machine platform. > > Some Virtual Machine platforms, such as Intel TDX, require > some memory to be "accepted" by the guest before it can be used. > This requirement protects against a class of attacks by the > virtual machine platform. > > Can we make this "class of attacks" a bit more concrete? Maybe: > > This mechanism helps prevent malicious hosts from making changes > to guest memory. > > ?? Okay. > > + UEFI specification v2.9 introduced EFI_UNACCEPTED_MEMORY memory type. > > + > > + This option adds support for unaccepted memory and makes such memory > > + usable by kernel. 
> > + > > endmenu > > config EFI_EMBEDDED_FIRMWARE > > diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c > > index ae79c3300129..abe862c381b6 100644 > > --- a/drivers/firmware/efi/efi.c > > +++ b/drivers/firmware/efi/efi.c > > @@ -740,6 +740,7 @@ static __initdata char memory_type_name[][13] = { > > "MMIO Port", > > "PAL Code", > > "Persistent", > > + "Unaccepted", > > }; > > char * __init efi_md_typeattr_format(char *buf, size_t size, > > diff --git a/drivers/firmware/efi/libstub/x86-stub.c b/drivers/firmware/efi/libstub/x86-stub.c > > index a0b946182b5e..346b12d6f1b2 100644 > > --- a/drivers/firmware/efi/libstub/x86-stub.c > > +++ b/drivers/firmware/efi/libstub/x86-stub.c > > @@ -9,12 +9,14 @@ > > #include <linux/efi.h> > > #include <linux/pci.h> > > #include <linux/stddef.h> > > +#include <linux/bitmap.h> > > #include <asm/efi.h> > > #include <asm/e820/types.h> > > #include <asm/setup.h> > > #include <asm/desc.h> > > #include <asm/boot.h> > > +#include <asm/unaccepted_memory.h> > > #include "efistub.h" > > @@ -504,6 +506,13 @@ setup_e820(struct boot_params *params, struct setup_data *e820ext, u32 e820ext_s > > e820_type = E820_TYPE_PMEM; > > break; > > + case EFI_UNACCEPTED_MEMORY: > > + if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) > > + continue; > > + e820_type = E820_TYPE_RAM; > > + mark_unaccepted(params, d->phys_addr, > > + d->phys_addr + PAGE_SIZE * d->num_pages); > > + break; > > default: > > continue; > > } > > @@ -575,6 +584,9 @@ static efi_status_t allocate_e820(struct boot_params *params, > > { > > efi_status_t status; > > __u32 nr_desc; > > + bool unaccepted_memory_present = false; > > + u64 max_addr = 0; > > + int i; > > status = efi_get_memory_map(map); > > if (status != EFI_SUCCESS) > > @@ -589,9 +601,55 @@ static efi_status_t allocate_e820(struct boot_params *params, > > if (status != EFI_SUCCESS) > > goto out; > > } > > + > > + if (!IS_ENABLED(CONFIG_UNACCEPTED_MEMORY)) > > + goto out; > > + > > + /* Check if there's any unaccepted memory and find the max address */ > > + for (i = 0; i < nr_desc; i++) { > > + efi_memory_desc_t *d; > > + > > + d = efi_early_memdesc_ptr(*map->map, *map->desc_size, i); > > + if (d->type == EFI_UNACCEPTED_MEMORY) > > + unaccepted_memory_present = true; > > + if (d->phys_addr + d->num_pages * PAGE_SIZE > max_addr) > > + max_addr = d->phys_addr + d->num_pages * PAGE_SIZE; > > + } > > + > > + /* > > + * If unaccepted memory present allocate a bitmap to track what memory > > ^ is > > > + * has to be accepted before access. > > + * > > + * One bit in the bitmap represents 2MiB in the address space: one 4k > > + * page is enough to track 64GiB or physical address space. > > That's a bit awkward and needs a "or->of". Perhaps: > > * One bit in the bitmap represents 2MiB in the address space: > * A 4k bitmap can track 64GiB of physical address space. Okay. > > > + * In the worst case scenario -- a huge hole in the middle of the > > + * address space -- It needs 256MiB to handle 4PiB of the address > > + * space. > > + * > > + * TODO: handle situation if params->unaccepted_memory has already set. > > + * It's required to deal with kexec. > > What happens today with kexec() since its not dealt with? I didn't give it a try, but I assume it will hang. There are more things to do to make kexec working and safe. We will get there, but it is not top priority.
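For reference, a rough sketch of mark_unaccepted() with the "accept small regions eagerly" simplification from this exchange folded in; it is assembled from the quoted fragments and illustrates the suggestion rather than a posted revision:

	void mark_unaccepted(struct boot_params *params, u64 start, u64 end)
	{
		/*
		 * Accept small regions that might not be able to be
		 * represented in the bitmap:
		 */
		if (end - start < 2 * PMD_SIZE) {
			u64 npages = (end - start) / PAGE_SIZE;

			__accept_memory(start, start + npages * PAGE_SIZE);
			return;
		}

		/*
		 * No matter how 'start' and 'end' are aligned, at least one
		 * unaccepted PMD_SIZE area will remain after the rounding
		 * below.
		 */

		/* Immediately accept a <PMD_SIZE piece at the start: */
		if (start & ~PMD_MASK) {
			__accept_memory(start, round_up(start, PMD_SIZE));
			start = round_up(start, PMD_SIZE);
		}

		/* Immediately accept a <PMD_SIZE piece at the end: */
		if (end & ~PMD_MASK) {
			__accept_memory(round_down(end, PMD_SIZE), end);
			end = round_down(end, PMD_SIZE);
		}

		/* 'start' and 'end' are now both PMD-aligned; record the range: */
		bitmap_set((unsigned long *)params->unaccepted_memory,
			   start / PMD_SIZE, (end - start) / PMD_SIZE);
	}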
On 1/12/22 11:29 AM, Kirill A. Shutemov wrote: >>> + * In the worst case scenario -- a huge hole in the middle of the >>> + * address space -- It needs 256MiB to handle 4PiB of the address >>> + * space. >>> + * >>> + * TODO: handle situation if params->unaccepted_memory has already set. >>> + * It's required to deal with kexec. >> What happens today with kexec() since its not dealt with? > I didn't give it a try, but I assume it will hang. > > There are more things to do to make kexec working and safe. We will get > there, but it is not top priority. Well, if we know it's broken, shouldn't we at least turn kexec off? It would be dirt simple to do in Kconfig. As would setting: kexec_load_disabled = true; which would probably also do the trick. That's from three seconds of looking. I'm sure you can come up with something better.
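A sketch of the simpler runtime variant mentioned above; kexec_load_disabled is the existing knob Dave points at, but where and how it would be set is purely illustrative:

	/*
	 * Hypothetical: refuse kexec image loads until unaccepted-memory
	 * support learns to hand the bitmap over across kexec (could live
	 * somewhere in early x86 setup):
	 */
	if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) && boot_params.unaccepted_memory)
		kexec_load_disabled = true;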
On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote: > On 1/11/22 03:33, Kirill A. Shutemov wrote: > > Unaccepted memory bitmap is allocated during decompression stage and > > handed over to main kernel image via boot_params. The bitmap is used to > > track if memory has been accepted. > > > > Reserve unaccepted memory bitmap has to prevent reallocating memory for > > other means. > > I'm having a hard time parsing that changelog, especially the second > paragraph. Could you give it another shot? What about this: Unaccepted memory bitmap is allocated during decompression stage and handed over to main kernel image via boot_params. Kernel tracks what memory has been accepted in the bitmap. Reserve memory where the bitmap is placed to prevent memblock from re-allocating the memory for other needs. ? > > + /* Mark unaccepted memory bitmap reserved */ > > + if (boot_params.unaccepted_memory) { > > + unsigned long size; > > + > > + /* One bit per 2MB */ > > + size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE, > > + PMD_SIZE * BITS_PER_BYTE); > > + memblock_reserve(boot_params.unaccepted_memory, size); > > + } > > Is it OK that the size of the bitmap is inferred from > e820__end_of_ram_pfn()? Is this OK in the presence of mem= and other things > that muck with the e820? Good question. I think we are fine. If kernel is not able to allocate memory from a part of physical address space we don't need the bitmap for it either.
On Tue, Jan 11, 2022 at 12:01:56PM -0800, Dave Hansen wrote: > On 1/11/22 03:33, Kirill A. Shutemov wrote: > > Core-mm requires few helpers to support unaccepted memory: > > > > - accept_memory() checks the range of addresses against the bitmap and > > accept memory if needed; > > > > - maybe_set_page_offline() checks the bitmap and marks a page with > > PageOffline() if memory acceptance required on the first > > allocation of the page. > > > > - accept_and_clear_page_offline() accepts memory for the page and clears > > PageOffline(). > > > ... > > +void accept_memory(phys_addr_t start, phys_addr_t end) > > +{ > > + unsigned long flags; > > + if (!boot_params.unaccepted_memory) > > + return; > > + > > + spin_lock_irqsave(&unaccepted_memory_lock, flags); > > + __accept_memory(start, end); > > + spin_unlock_irqrestore(&unaccepted_memory_lock, flags); > > +} > > Not a big deal, but please cc me on all the patches in the series. This is > called from the core mm patches which I wasn't cc'd on. > > This also isn't obvious, but this introduces a new, global lock into the > fast path of the page allocator and holds it for extended periods of time. > It won't be taken any more once all memory is accepted, but you can sure bet > that it will be noticeable until that happens. > > *PLEASE* document this. It needs changelog and probably code comments. Okay, will do.
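A sketch of the sort of comment being promised here, placed next to the lock; the wording is invented, only the lock and function names come from the quoted patch:

	/*
	 * Protects the unaccepted memory bitmap.  accept_memory() serializes
	 * all callers on this single global lock around a slow, "costly"
	 * acceptance operation.  The page allocator only reaches it for pages
	 * still marked PageOffline(), so once all memory has been accepted
	 * the lock is no longer taken from the allocation fast path.
	 */
	static DEFINE_SPINLOCK(unaccepted_memory_lock);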
On 1/12/22 11:43 AM, Kirill A. Shutemov wrote: > On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote: >> On 1/11/22 03:33, Kirill A. Shutemov wrote: >>> Unaccepted memory bitmap is allocated during decompression stage and >>> handed over to main kernel image via boot_params. The bitmap is used to >>> track if memory has been accepted. >>> >>> Reserve unaccepted memory bitmap has to prevent reallocating memory for >>> other means. >> >> I'm having a hard time parsing that changelog, especially the second >> paragraph. Could you give it another shot? > > What about this: > > Unaccepted memory bitmap is allocated during decompression stage and > handed over to main kernel image via boot_params. > > Kernel tracks what memory has been accepted in the bitmap. > > Reserve memory where the bitmap is placed to prevent memblock from > re-allocating the memory for other needs. > > ? Ahh, I get what you're trying to say now. But, it still really lacks a coherent problem statement. How about this? == Problem == A given page of memory can only be accepted once. The kernel has a need to accept memory both in the early decompression stage and during normal runtime. == Solution == Use a bitmap to communicate the acceptance state of each page between the decompression stage and normal runtime. This eliminates the possibility of attempting to double-accept a page. == Details == Allocate the bitmap during decompression stage and hand it over to the main kernel image via boot_params. In the runtime kernel, reserve the bitmap's memory to ensure nothing overwrites it. >>> + /* Mark unaccepted memory bitmap reserved */ >>> + if (boot_params.unaccepted_memory) { >>> + unsigned long size; >>> + >>> + /* One bit per 2MB */ >>> + size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE, >>> + PMD_SIZE * BITS_PER_BYTE); >>> + memblock_reserve(boot_params.unaccepted_memory, size); >>> + } >> >> Is it OK that the size of the bitmap is inferred from >> e820__end_of_ram_pfn()? Is this OK in the presence of mem= and other things >> that muck with the e820? > > Good question. I think we are fine. If kernel is not able to allocate > memory from a part of physical address space we don't need the bitmap for > it either. That's a good point. If the e820 range does a one-way shrink it's probably fine. The only problem would be if the bitmap had space for for stuff past e820__end_of_ram_pfn() *and* it later needed to be accepted. Would it be worth recording the size of the reservation and then double-checking against it in the bitmap operations?
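A rough sketch of the "record the size and double-check it" idea from the last paragraph; every name below is invented for illustration and none of this is in the posted series:

	/* Remembered at reservation time, in bytes: */
	static unsigned long unaccepted_bitmap_size;

	void __init reserve_unaccepted_bitmap(void)
	{
		if (!boot_params.unaccepted_memory)
			return;

		/* One bit per 2MB */
		unaccepted_bitmap_size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
						      PMD_SIZE * BITS_PER_BYTE);
		memblock_reserve(boot_params.unaccepted_memory,
				 unaccepted_bitmap_size);
	}

	/* Checked by the bitmap helpers before touching bits for [start, end): */
	static bool range_covered_by_bitmap(phys_addr_t end)
	{
		return DIV_ROUND_UP(end, PMD_SIZE) <=
		       unaccepted_bitmap_size * BITS_PER_BYTE;
	}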
On Wed, Jan 12, 2022 at 10:40:53AM -0800, Dave Hansen wrote: > On 1/12/22 10:30, Kirill A. Shutemov wrote: > > On Tue, Jan 11, 2022 at 11:46:37AM -0800, Dave Hansen wrote: > > > > diff --git a/mm/memblock.c b/mm/memblock.c > > > > index 1018e50566f3..6dfa594192de 100644 > > > > --- a/mm/memblock.c > > > > +++ b/mm/memblock.c > > > > @@ -1400,6 +1400,7 @@ phys_addr_t __init memblock_alloc_range_nid(phys_addr_t size, > > > > */ > > > > kmemleak_alloc_phys(found, size, 0, 0); > > > > + accept_memory(found, found + size); > > > > return found; > > > > } > > > > > > This could use a comment. > > > > How about this: > > > > /* > > * Some Virtual Machine platforms, such as Intel TDX or AMD SEV-SNP, > > * requiring memory to be accepted before it can be used by the > > * guest. > > * > > * Accept the memory of the allocated buffer. > > */ > > I think a one-liner that might cue the reader to go look at accept_memory() > itself would be fine. Maybe: > > /* Make the memblock usable when running in picky VM guests: */ I'd s/memblock/found range/ or something like that, memblock is too vague IMO > That implies that the memory isn't usable without doing this and also points > out that it's related to running in a guest.
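Folding Mike's tweak into Dave's one-liner, the memblock hunk could end up looking something like this (a sketch, not a posted revision):

	kmemleak_alloc_phys(found, size, 0, 0);

	/* Make the found range usable when running in picky VM guests: */
	accept_memory(found, found + size);

	return found;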
On 12.01.22 20:15, Kirill A. Shutemov wrote: > On Wed, Jan 12, 2022 at 12:31:10PM +0100, David Hildenbrand wrote: >> >>> >>> Looking at stuff like this, I can't help but think that a: >>> >>> #define PageOffline PageUnaccepted >>> >>> and some other renaming would be a fine idea. I get that the Offline >>> bit can be reused, but I'm not sure that the "Offline" *naming* should >>> be reused. What you're doing here is logically distinct from existing >>> offlining. >> >> Yes, or using a new pagetype bit to make the distinction clearer. >> Especially the function names like maybe_set_page_offline() et. Al are >> confusing IMHO. They are all about accepting unaccepted memory ... and >> should express that. > > "Unaccepted" is UEFI treminology and I'm not sure we want to expose > core-mm to it. Power/S390/ARM may have a different name for the same > concept. Offline/online is neutral terminology, familiar to MM developers. Personally, I'd much rather prefer clear UEFI terminology for now than making the code more confusing to get. We can always generalize later iff there are similar needs by other archs (and if they are able to come up witha better name). But maybe we can find a different name immediately. The issue with online vs. offline I have is that we already have enough confusion: offline page: memory section is offline. These pages are not managed by the buddy. The memmap is stale unless we're dealing with special ZONE_DEVICE memory. logically offline pages: memory section is online and pages are PageOffline(). These pages were removed from the buddy e.g., to free them up in the hypervisor. soft offline pages: memory section is online and pages are PageHWPoison(). These pages are removed from the buddy such that we cannot allocate them to not trigger MCEs. offline pages are exposed to the buddy by onlining them (generic_online_page()), which is init+freeing. PageOffline() and PageHWPoison() are onlined by removing the flag and freeing them to the buddy. Your case is different such that the pages are managed by the buddy and they don't really have online/offline semantics compared to what we already have. All the buddy has to do is prepare them for initial use. I'm fine with reusing PageOffline(), but for the purpose of reading the code, I think we really want some different terminology in page_alloc.c So using any such terminology would make it clearer to me: * PageBuddyUnprepared() * PageBuddyUninitialized() * PageBuddyUnprocessed() * PageBuddyUnready() > > What if I change accept->online in function names and document the meaning > properly? > >> I assume PageOffline() will be set only on the first sub-page of a >> high-order PageBuddy() page, correct? >> >> Then we'll have to monitor all PageOffline() users such that they can >> actually deal with PageBuddy() pages spanning *multiple* base pages for >> a PageBuddy() page. For now it's clear that if a page is PageOffline(), >> it cannot be PageBuddy() and cannot span more than one base page. > >> E.g., fs/proc/kcore.c:read_kcore() assumes that PageOffline() is set on >> individual base pages. > > Right, pages that offline from hotplug POV are never on page allocator's > free lists, so it cannot ever step on them. >
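A sketch of the "new pagetype bit" alternative, following the existing PAGE_TYPE_OPS() pattern in include/linux/page-flags.h; the bit value is a placeholder, the name is one of the candidates listed above, and neither is in the posted series:

	/* Illustrative only: a dedicated page type instead of reusing PG_offline. */
	#define PG_buddy_unready	0x00000800

	PAGE_TYPE_OPS(BuddyUnready, buddy_unready)

	/*
	 * PAGE_TYPE_OPS() generates PageBuddyUnready(), __SetPageBuddyUnready()
	 * and __ClearPageBuddyUnready() on top of page->page_type, just like
	 * the existing PageOffline() helpers.
	 */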
On Wed, Jan 12, 2022 at 11:53:42AM -0800, Dave Hansen wrote: > On 1/12/22 11:43 AM, Kirill A. Shutemov wrote: > > On Tue, Jan 11, 2022 at 11:10:40AM -0800, Dave Hansen wrote: > >> On 1/11/22 03:33, Kirill A. Shutemov wrote: > >> > >>> + /* Mark unaccepted memory bitmap reserved */ > >>> + if (boot_params.unaccepted_memory) { > >>> + unsigned long size; > >>> + > >>> + /* One bit per 2MB */ > >>> + size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE, > >>> + PMD_SIZE * BITS_PER_BYTE); > >>> + memblock_reserve(boot_params.unaccepted_memory, size); > >>> + } > >> > >> Is it OK that the size of the bitmap is inferred from > >> e820__end_of_ram_pfn()? Is this OK in the presence of mem= and other things > >> that muck with the e820? > > > > Good question. I think we are fine. If kernel is not able to allocate > > memory from a part of physical address space we don't need the bitmap for > > it either. > > That's a good point. If the e820 range does a one-way shrink it's > probably fine. The only problem would be if the bitmap had space for > for stuff past e820__end_of_ram_pfn() *and* it later needed to be accepted. It's unlikely, but e820 can grow because of EFI and because of memmap=. To be completely on the safe side, the unaccepted bitmap should be reserved after parse_early_param() and efi_memblock_x86_reserve_range(). Since we anyway do not have memblock allocations before e820__memblock_setup(), the simplest thing would be to put the reservation first thing in e820__memblock_setup().
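A sketch of the suggested placement; the reservation block is lifted from the quoted patch, and only its position (first thing in e820__memblock_setup()) reflects the suggestion above:

	void __init e820__memblock_setup(void)
	{
		/*
		 * Reserve the unaccepted-memory bitmap before any memblock
		 * allocation can happen, so nothing can grab or overwrite it:
		 */
		if (IS_ENABLED(CONFIG_UNACCEPTED_MEMORY) &&
		    boot_params.unaccepted_memory) {
			unsigned long size;

			/* One bit per 2MB */
			size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
					    PMD_SIZE * BITS_PER_BYTE);
			memblock_reserve(boot_params.unaccepted_memory, size);
		}

		/* ... existing e820 -> memblock conversion continues below ... */
	}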
Hi Kirill, ... > > The approach lowers boot time substantially. Boot to shell is ~2.5x > faster for 4G TDX VM and ~4x faster for 64G. > > Patches 1-6/7 are generic and don't have any dependencies on TDX. They > should serve AMD SEV needs as well. TDX-specific code isolated in the > last patch. This patch requires the core TDX patchset which is currently > under review. > I can confirm that this series works for the SEV-SNP guest. I was able to hook the SEV-SNP page validation vmgexit (similar to the TDX patch#7) and have verified that the guest kernel successfully accepted all the memory regions marked unaccepted by the EFI boot loader. Not a big deal, but can I ask you to include me in Cc on future series; I should be able to do more testing on SNP hardware and provide my Tested-by tag. ~ Brijesh