diff mbox series

[PATCHv4,5/8] x86/mm: Reserve unaccepted memory bitmap

Message ID 20220405234343.74045-6-kirill.shutemov@linux.intel.com
State Superseded
Headers show
Series mm, x86/cc: Implement support for unaccepted memory | expand

Commit Message

Kirill A. Shutemov April 5, 2022, 11:43 p.m. UTC
A given page of memory can only be accepted once.  The kernel has a need
to accept memory both in the early decompression stage and during normal
runtime.

A bitmap used to communicate the acceptance state of each page between the
decompression stage and normal runtime.  This eliminates the possibility of
attempting to double-accept a page.

The bitmap is allocated in EFI stub, decompression stage updates the state
of pages used for the kernel and initrd and hands the bitmap over to the
main kernel image via boot_params.

In the runtime kernel, reserve the bitmap's memory to ensure nothing
overwrites it.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
---
 arch/x86/kernel/e820.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

Comments

Dave Hansen April 8, 2022, 6:08 p.m. UTC | #1
On 4/5/22 16:43, Kirill A. Shutemov wrote:
> A given page of memory can only be accepted once.  The kernel has a need
> to accept memory both in the early decompression stage and during normal
> runtime.
> 
> A bitmap used to communicate the acceptance state of each page between the
> decompression stage and normal runtime.  This eliminates the possibility of
> attempting to double-accept a page.

... which is fatal, right?  Could you include that an also the rationale
for why it is fatal?

> The bitmap is allocated in EFI stub, decompression stage updates the state
> of pages used for the kernel and initrd and hands the bitmap over to the
> main kernel image via boot_params.

This is really good info.  Could we maybe expand it a bit?

There are several steps in the bitmap's lifecycle:
1. Bitmap is allocated in the EFI stub
2. The kernel decompression code locates it, accepts some pages before
   decompression and marks them in the bitmap.
3. Decompression code points to the bitmap via a boot_param
4. Main kernel locates bitmap via the boot_param and uses it to guide
   runtime acceptance decisions.

> In the runtime kernel, reserve the bitmap's memory to ensure nothing
> overwrites it.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Acked-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>  arch/x86/kernel/e820.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> index f267205f2d5a..22d1fe48dcba 100644
> --- a/arch/x86/kernel/e820.c
> +++ b/arch/x86/kernel/e820.c
> @@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
>  	int i;
>  	u64 end;
>  
> +	/* Mark unaccepted memory bitmap reserved */
> +	if (boot_params.unaccepted_memory) {
> +		unsigned long size;
> +
> +		/* One bit per 2MB */
> +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> +				    PMD_SIZE * BITS_PER_BYTE);
> +		memblock_reserve(boot_params.unaccepted_memory, size);
> +	}

One oddity here: The size is implied by the e820's contents.  Did you
mention somewhere that unaccepted memory is considered E820_TYPE_RAM?
It *has* to be in order for e820__end_of_ram_pfn() to work, right?
Kirill A. Shutemov April 9, 2022, 8:43 p.m. UTC | #2
On Fri, Apr 08, 2022 at 11:08:48AM -0700, Dave Hansen wrote:
> On 4/5/22 16:43, Kirill A. Shutemov wrote:
> > A given page of memory can only be accepted once.  The kernel has a need
> > to accept memory both in the early decompression stage and during normal
> > runtime.
> > 
> > A bitmap used to communicate the acceptance state of each page between the
> > decompression stage and normal runtime.  This eliminates the possibility of
> > attempting to double-accept a page.
> 
> ... which is fatal, right?  Could you include that an also the rationale
> for why it is fatal?

Actually, no. For TDX, TDX_ACCEPT_PAGE would just return an error. It is
not fatal, but it indicates that there is security hole. VMM can clear the
range of the memory if it can trick the guest to re-accept the memory
blindly:

1. VMM removes the memory range from the guest. VMM can do it at any
   point.
2. VMM re-adds memory for the same range of the guest physical addresses.
3. VMM tricks guest into re-accepting this memory blindly. It makes the
   range of guest physical memory filled with 0.
4. ???
5. PROFIT!

TDX relies on guest to be careful with accepting memory and only do this
for memory that is not in use. 

This patchet is designed to keep unaccepted
memory accounted and prevent such double accpept.


> > The bitmap is allocated in EFI stub, decompression stage updates the state
> > of pages used for the kernel and initrd and hands the bitmap over to the
> > main kernel image via boot_params.
> 
> This is really good info.  Could we maybe expand it a bit?
> 
> There are several steps in the bitmap's lifecycle:
> 1. Bitmap is allocated in the EFI stub

Allocated and populated.

> 2. The kernel decompression code locates it, accepts some pages before
>    decompression and marks them in the bitmap.
> 3. Decompression code points to the bitmap via a boot_param

Actually EFI stub makes boot_param point to the bitmap. Decompression code
and main kernel consume it from there.

> 4. Main kernel locates bitmap via the boot_param and uses it to guide
>    runtime acceptance decisions.
> 
> > In the runtime kernel, reserve the bitmap's memory to ensure nothing
> > overwrites it.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Acked-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  arch/x86/kernel/e820.c | 10 ++++++++++
> >  1 file changed, 10 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
> > index f267205f2d5a..22d1fe48dcba 100644
> > --- a/arch/x86/kernel/e820.c
> > +++ b/arch/x86/kernel/e820.c
> > @@ -1316,6 +1316,16 @@ void __init e820__memblock_setup(void)
> >  	int i;
> >  	u64 end;
> >  
> > +	/* Mark unaccepted memory bitmap reserved */
> > +	if (boot_params.unaccepted_memory) {
> > +		unsigned long size;
> > +
> > +		/* One bit per 2MB */
> > +		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
> > +				    PMD_SIZE * BITS_PER_BYTE);
> > +		memblock_reserve(boot_params.unaccepted_memory, size);
> > +	}
> 
> One oddity here: The size is implied by the e820's contents.  Did you
> mention somewhere that unaccepted memory is considered E820_TYPE_RAM?
> It *has* to be in order for e820__end_of_ram_pfn() to work, right?

"e820_type = E820_TYPE_RAM;" for "case EFI_UNACCEPTED_MEMORY:" in patch
3/8 does this.

I didn't wrote down it explitictly. I will.
diff mbox series

Patch

diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
index f267205f2d5a..22d1fe48dcba 100644
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -1316,6 +1316,16 @@  void __init e820__memblock_setup(void)
 	int i;
 	u64 end;
 
+	/* Mark unaccepted memory bitmap reserved */
+	if (boot_params.unaccepted_memory) {
+		unsigned long size;
+
+		/* One bit per 2MB */
+		size = DIV_ROUND_UP(e820__end_of_ram_pfn() * PAGE_SIZE,
+				    PMD_SIZE * BITS_PER_BYTE);
+		memblock_reserve(boot_params.unaccepted_memory, size);
+	}
+
 	/*
 	 * The bootstrap memblock region count maximum is 128 entries
 	 * (INIT_MEMBLOCK_REGIONS), but EFI might pass us more E820 entries