diff mbox series

[PATCHv11,11/19] x86/tdx: Convert shared memory back to private on kexec

Message ID 20240528095522.509667-12-kirill.shutemov@linux.intel.com
State Superseded
Headers show
Series x86/tdx: Add kexec support | expand

Commit Message

Kirill A. Shutemov May 28, 2024, 9:55 a.m. UTC
TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The second, kexec-ed kernel has no idea what memory is converted this
way. It only sees E820_TYPE_RAM.

Accessing shared memory via private mapping is fatal. It leads to
unrecoverable TD exit.

On kexec walk direct mapping and convert all shared memory back to
private. It makes all RAM private again and second kernel may use it
normally.

The conversion occurs in two steps: stopping new conversions and
unsharing all memory. In the case of normal kexec, the stopping of
conversions takes place while scheduling is still functioning. This
allows for waiting until any ongoing conversions are finished. The
second step is carried out when all CPUs except one are inactive and
interrupts are disabled. This prevents any conflicts with code that may
access shared memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
---
 arch/x86/coco/tdx/tdx.c           | 69 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable.h    |  5 +++
 arch/x86/include/asm/set_memory.h |  3 ++
 arch/x86/mm/pat/set_memory.c      | 41 ++++++++++++++++--
 4 files changed, 115 insertions(+), 3 deletions(-)

Comments

Borislav Petkov May 31, 2024, 3:14 p.m. UTC | #1
On Tue, May 28, 2024 at 12:55:14PM +0300, Kirill A. Shutemov wrote:
> +static void tdx_kexec_finish(void)
> +{
> +	unsigned long addr, end;
> +	long found = 0, shared;
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	addr = PAGE_OFFSET;
> +	end  = PAGE_OFFSET + get_max_mapped();
> +
> +	while (addr < end) {
> +		unsigned long size;
> +		unsigned int level;
> +		pte_t *pte;
> +
> +		pte = lookup_address(addr, &level);
> +		size = page_level_size(level);
> +
> +		if (pte && pte_decrypted(*pte)) {
> +			int pages = size / PAGE_SIZE;
> +
> +			/*
> +			 * Touching memory with shared bit set triggers implicit
> +			 * conversion to shared.
> +			 *
> +			 * Make sure nobody touches the shared range from
> +			 * now on.
> +			 */
> +			set_pte(pte, __pte(0));
> +

Format the below into a comment here:

/* 

The only thing one can do at this point on failure is panic. It is
reasonable to proceed, especially for the crash case because the
kexec-ed kernel is using a different page table so there won't be
a mismatch between shared/private marking of the page so it doesn't
matter.

Also, even if the failure is real and the page cannot be touched as
private, the kdump kernel will boot fine as it uses pre-reserved memory.
What happens next depends on what the dumping process does and there's
a reasonable chance to produce useful dump on crash.

Regardless, the print leaves a trace in the log to give a clue for
debug.

One possible reason for the failure is if kdump raced with memory
conversion. In this case shared bit in page table got set (or not
cleared form shared->private conversion), but the page is actually
private. So this failure is not going to affect the kexec'ed kernel.

*/

<---

> +			if (!tdx_enc_status_changed(addr, pages, true)) {
> +				pr_err("Failed to unshare range %#lx-%#lx\n",
> +				       addr, addr + size);
> +			}
> +
> +			found += pages;
> +		}
> +
> +		addr += size;
> +	}
> +
> +	__flush_tlb_all();
> +
> +	shared = atomic_long_read(&nr_shared);
> +	if (shared != found) {
> +		pr_err("shared page accounting is off\n");
> +		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
> +	}
> +}

...

>  static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>  {
> -	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
> -		return __set_memory_enc_pgtable(addr, numpages, enc);
> +	int ret = 0;
>  
> -	return 0;
> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
> +		if (!down_read_trylock(&mem_enc_lock))
> +			return -EBUSY;
> +
> +		ret = __set_memory_enc_pgtable(addr, numpages, enc);
> +
> +		up_read(&mem_enc_lock);
> +	}

So CC_ATTR_MEM_ENCRYPT is set for SEV* guests too. You need to change
that code here to take the lock only on TDX, where you want it, not on
the others.

Thx.
Kalra, Ashish May 31, 2024, 5:34 p.m. UTC | #2
Hello Boris,

On 5/31/2024 10:14 AM, Borislav Petkov wrote:
>>   static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
>>   {
>> -	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
>> -		return __set_memory_enc_pgtable(addr, numpages, enc);
>> +	int ret = 0;
>>   
>> -	return 0;
>> +	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
>> +		if (!down_read_trylock(&mem_enc_lock))
>> +			return -EBUSY;
>> +
>> +		ret = __set_memory_enc_pgtable(addr, numpages, enc);
>> +
>> +		up_read(&mem_enc_lock);
>> +	}
> So CC_ATTR_MEM_ENCRYPT is set for SEV* guests too. You need to change
> that code here to take the lock only on TDX, where you want it, not on
> the others.

SNP guest kexec patches are based on top of this patch-series and SNP 
guests also need this exclusive mem_enc_lock protection, so 
CC_ATTR_MEM_ENCRYPT makes sense to be used here.

Thanks, Ashish
Borislav Petkov May 31, 2024, 6:06 p.m. UTC | #3
On Fri, May 31, 2024 at 12:34:49PM -0500, Kalra, Ashish wrote:
> SNP guest kexec patches are based on top of this patch-series and SNP guests
> also need this exclusive mem_enc_lock protection, so CC_ATTR_MEM_ENCRYPT
> makes sense to be used here.

Well, for the future, I'd encourage you to always send an Acked-by: you
or Reviewed-by: you as a reply to such patches so that it is clear that
such a change is desired.

Thx.
Dave Hansen June 4, 2024, 4:27 p.m. UTC | #4
On 5/28/24 02:55, Kirill A. Shutemov wrote:
> +/* Stop new private<->shared conversions */
> +static void tdx_kexec_begin(bool crash)
> +{
> +	/*
> +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> +	 * conversions to finish.
> +	 *
> +	 * If race happened, just report and proceed.
> +	 */
> +	if (!set_memory_enc_stop_conversion(!crash))
> +		pr_warn("Failed to stop shared<->private conversions\n");
> +}

I don't like having to pass 'crash' in here.

If interrupts are the problem we have ways of testing for those directly.

If it's being in an oops that's a problem, we have 'oops_in_progress'
for that.

In other words, I'd much rather this function (or better yet
set_memory_enc_stop_conversion() itself) use some existing API to change
its behavior in a crash rather than have the context be passed down and
twiddled through several levels of function calls.

There are a ton of these in the console code:

	if (oops_in_progress)
		foo_trylock();
	else
		foo_lock();

To me, that's a billion times more clear than a 'wait' argument that
gets derives from who-knows-what that I have to trace through ten levels
of function calls.
Kirill A. Shutemov June 5, 2024, 12:43 p.m. UTC | #5
On Tue, Jun 04, 2024 at 09:27:59AM -0700, Dave Hansen wrote:
> On 5/28/24 02:55, Kirill A. Shutemov wrote:
> > +/* Stop new private<->shared conversions */
> > +static void tdx_kexec_begin(bool crash)
> > +{
> > +	/*
> > +	 * Crash kernel reaches here with interrupts disabled: can't wait for
> > +	 * conversions to finish.
> > +	 *
> > +	 * If race happened, just report and proceed.
> > +	 */
> > +	if (!set_memory_enc_stop_conversion(!crash))
> > +		pr_warn("Failed to stop shared<->private conversions\n");
> > +}
> 
> I don't like having to pass 'crash' in here.
> 
> If interrupts are the problem we have ways of testing for those directly.
> 
> If it's being in an oops that's a problem, we have 'oops_in_progress'
> for that.
> 
> In other words, I'd much rather this function (or better yet
> set_memory_enc_stop_conversion() itself) use some existing API to change
> its behavior in a crash rather than have the context be passed down and
> twiddled through several levels of function calls.
> 
> There are a ton of these in the console code:
> 
> 	if (oops_in_progress)
> 		foo_trylock();
> 	else
> 		foo_lock();
> 
> To me, that's a billion times more clear than a 'wait' argument that
> gets derives from who-knows-what that I have to trace through ten levels
> of function calls.

Okay fair enough. Check out the fixup below. Is it what you mean?

One other thing I realized is that these callback are dead code if kernel
compiled without kexec support. Do we want them to be wrapped with
#ifdef COFNIG_KEXEC_CORE everywhere? It is going to be ugly.

Any better ideas?

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 3d23ea0f5d45..1c5aa036b76b 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -834,7 +834,7 @@ static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 }
 
 /* Stop new private<->shared conversions */
-static void tdx_kexec_begin(bool crash)
+static void tdx_kexec_begin(void)
 {
 	/*
 	 * Crash kernel reaches here with interrupts disabled: can't wait for
@@ -842,7 +842,7 @@ static void tdx_kexec_begin(bool crash)
 	 *
 	 * If race happened, just report and proceed.
 	 */
-	if (!set_memory_enc_stop_conversion(!crash))
+	if (!set_memory_enc_stop_conversion())
 		pr_warn("Failed to stop shared<->private conversions\n");
 }
 
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index d490db38db9e..4b2abce2e3e7 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -50,7 +50,7 @@ int set_memory_np(unsigned long addr, int numpages);
 int set_memory_p(unsigned long addr, int numpages);
 int set_memory_4k(unsigned long addr, int numpages);
 
-bool set_memory_enc_stop_conversion(bool wait);
+bool set_memory_enc_stop_conversion(void);
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
 
diff --git a/arch/x86/include/asm/x86_init.h b/arch/x86/include/asm/x86_init.h
index b0f313278967..213cf5379a5a 100644
--- a/arch/x86/include/asm/x86_init.h
+++ b/arch/x86/include/asm/x86_init.h
@@ -152,8 +152,6 @@ struct x86_init_acpi {
  * @enc_kexec_begin		Begin the two-step process of converting shared memory back
  *				to private. It stops the new conversions from being started
  *				and waits in-flight conversions to finish, if possible.
- *				The @crash parameter denotes whether the function is being
- *				called in the crash shutdown path.
  * @enc_kexec_finish		Finish the two-step process of converting shared memory to
  *				private. All memory is private after the call when
  *				the function returns.
@@ -165,7 +163,7 @@ struct x86_guest {
 	int (*enc_status_change_finish)(unsigned long vaddr, int npages, bool enc);
 	bool (*enc_tlb_flush_required)(bool enc);
 	bool (*enc_cache_flush_required)(void);
-	void (*enc_kexec_begin)(bool crash);
+	void (*enc_kexec_begin)(void);
 	void (*enc_kexec_finish)(void);
 };
 
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index fc52ea80cdc8..340af8155658 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -137,7 +137,7 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
 	 * down and interrupts have been disabled. This allows the callback to
 	 * detect a race with the conversion and report it.
 	 */
-	x86_platform.guest.enc_kexec_begin(true);
+	x86_platform.guest.enc_kexec_begin();
 	x86_platform.guest.enc_kexec_finish();
 
 	crash_save_cpu(regs, safe_smp_processor_id());
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 513809b5b27c..0e0a4cf6b5eb 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -723,7 +723,7 @@ void native_machine_shutdown(void)
 	 * conversions to finish cleanly.
 	 */
 	if (kexec_in_progress)
-		x86_platform.guest.enc_kexec_begin(false);
+		x86_platform.guest.enc_kexec_begin();
 
 	/* Stop the cpus and apics */
 #ifdef CONFIG_X86_IO_APIC
diff --git a/arch/x86/kernel/x86_init.c b/arch/x86/kernel/x86_init.c
index 8a79fb505303..82b128d3f309 100644
--- a/arch/x86/kernel/x86_init.c
+++ b/arch/x86/kernel/x86_init.c
@@ -138,7 +138,7 @@ static int enc_status_change_prepare_noop(unsigned long vaddr, int npages, bool
 static int enc_status_change_finish_noop(unsigned long vaddr, int npages, bool enc) { return 0; }
 static bool enc_tlb_flush_required_noop(bool enc) { return false; }
 static bool enc_cache_flush_required_noop(void) { return false; }
-static void enc_kexec_begin_noop(bool crash) {}
+static void enc_kexec_begin_noop(void) {}
 static void enc_kexec_finish_noop(void) {}
 static bool is_private_mmio_noop(u64 addr) {return false; }
 
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 2a548b65ef5f..443a97e515c0 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2240,13 +2240,14 @@ static DECLARE_RWSEM(mem_enc_lock);
  *
  * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
  * The lock is not released to prevent new conversions from being started.
- *
- * If sleep is not allowed, as in a crash scenario, try to take the lock.
- * Failure indicates that there is a race with the conversion.
  */
-bool set_memory_enc_stop_conversion(bool wait)
+bool set_memory_enc_stop_conversion(void)
 {
-	if (!wait)
+	/*
+	 * In a crash scenario, sleep is not allowed. Try to take the lock.
+	 * Failure indicates that there is a race with the conversion.
+	 */
+	if (oops_in_progress)
 		return down_write_trylock(&mem_enc_lock);
 
 	down_write(&mem_enc_lock);
Dave Hansen June 5, 2024, 4:05 p.m. UTC | #6
On 6/5/24 05:43, Kirill A. Shutemov wrote:
> Okay fair enough. Check out the fixup below. Is it what you mean?

Yes.  Much better.

> One other thing I realized is that these callback are dead code if kernel
> compiled without kexec support. Do we want them to be wrapped with
> #ifdef COFNIG_KEXEC_CORE everywhere? It is going to be ugly.
> 
> Any better ideas?

The other callbacks don't have #ifdefs either and they're dependent on
memory encryption as far as I can tell.

I think a simple:

	if (IS_ENABLED(COFNIG_KEXEC_CORE))
		return;

in the top of the callbacks will result in a tiny little stub function
when kexec is disabled.  So the bloat will be limited to kernels that
have TDX compiled in but kexec compiled out (probably never).  The bloat
will be two callback pointer, one tiny stub function, and a quick
call/return in a slow path.

I think that probably ends up being a few dozen bytes of bloat in kernel
text for a "probably never" config.
diff mbox series

Patch

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 979891e97d83..c0a651fa8963 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@ 
 #include <linux/cpufeature.h>
 #include <linux/export.h>
 #include <linux/io.h>
+#include <linux/kexec.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
@@ -14,6 +15,7 @@ 
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/set_memory.h>
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -831,6 +833,70 @@  static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	return 0;
 }
 
+/* Stop new private<->shared conversions */
+static void tdx_kexec_begin(bool crash)
+{
+	/*
+	 * Crash kernel reaches here with interrupts disabled: can't wait for
+	 * conversions to finish.
+	 *
+	 * If race happened, just report and proceed.
+	 */
+	if (!set_memory_enc_stop_conversion(!crash))
+		pr_warn("Failed to stop shared<->private conversions\n");
+}
+
+/* Walk direct mapping and convert all shared memory back to private */
+static void tdx_kexec_finish(void)
+{
+	unsigned long addr, end;
+	long found = 0, shared;
+
+	lockdep_assert_irqs_disabled();
+
+	addr = PAGE_OFFSET;
+	end  = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte)) {
+			int pages = size / PAGE_SIZE;
+
+			/*
+			 * Touching memory with shared bit set triggers implicit
+			 * conversion to shared.
+			 *
+			 * Make sure nobody touches the shared range from
+			 * now on.
+			 */
+			set_pte(pte, __pte(0));
+
+			if (!tdx_enc_status_changed(addr, pages, true)) {
+				pr_err("Failed to unshare range %#lx-%#lx\n",
+				       addr, addr + size);
+			}
+
+			found += pages;
+		}
+
+		addr += size;
+	}
+
+	__flush_tlb_all();
+
+	shared = atomic_long_read(&nr_shared);
+	if (shared != found) {
+		pr_err("shared page accounting is off\n");
+		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+	}
+}
+
 void __init tdx_early_init(void)
 {
 	struct tdx_module_args args = {
@@ -890,6 +956,9 @@  void __init tdx_early_init(void)
 	x86_platform.guest.enc_cache_flush_required  = tdx_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
 
+	x86_platform.guest.enc_kexec_begin	     = tdx_kexec_begin;
+	x86_platform.guest.enc_kexec_finish	     = tdx_kexec_finish;
+
 	/*
 	 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 	 * bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..e39311a89bf4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -140,6 +140,11 @@  static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline bool pte_decrypted(pte_t pte)
+{
+	return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
 #define pmd_dirty pmd_dirty
 static inline bool pmd_dirty(pmd_t pmd)
 {
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 9aee31862b4a..d490db38db9e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,8 +49,11 @@  int set_memory_wb(unsigned long addr, int numpages);
 int set_memory_np(unsigned long addr, int numpages);
 int set_memory_p(unsigned long addr, int numpages);
 int set_memory_4k(unsigned long addr, int numpages);
+
+bool set_memory_enc_stop_conversion(bool wait);
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
+
 int set_memory_np_noalias(unsigned long addr, int numpages);
 int set_memory_nonglobal(unsigned long addr, int numpages);
 int set_memory_global(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index a7a7a6c6a3fb..2a548b65ef5f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2227,12 +2227,47 @@  static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	return ret;
 }
 
+/*
+ * The lock serializes conversions between private and shared memory.
+ *
+ * It is taken for read on conversion. A write lock guarantees that no
+ * concurrent conversions are in progress.
+ */
+static DECLARE_RWSEM(mem_enc_lock);
+
+/*
+ * Stop new private<->shared conversions.
+ *
+ * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
+ * The lock is not released to prevent new conversions from being started.
+ *
+ * If sleep is not allowed, as in a crash scenario, try to take the lock.
+ * Failure indicates that there is a race with the conversion.
+ */
+bool set_memory_enc_stop_conversion(bool wait)
+{
+	if (!wait)
+		return down_write_trylock(&mem_enc_lock);
+
+	down_write(&mem_enc_lock);
+
+	return true;
+}
+
 static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 {
-	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
-		return __set_memory_enc_pgtable(addr, numpages, enc);
+	int ret = 0;
 
-	return 0;
+	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+		if (!down_read_trylock(&mem_enc_lock))
+			return -EBUSY;
+
+		ret = __set_memory_enc_pgtable(addr, numpages, enc);
+
+		up_read(&mem_enc_lock);
+	}
+
+	return ret;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)