
[PATCHv11.1,11/19] x86/tdx: Convert shared memory back to private on kexec

Message ID 20240602142303.3263551-1-kirill.shutemov@linux.intel.com
State New

Commit Message

Kirill A. Shutemov June 2, 2024, 2:23 p.m. UTC
TDX guests allocate shared buffers to perform I/O. It is done by
allocating pages normally from the buddy allocator and converting them
to shared with set_memory_decrypted().

The second, kexec-ed kernel has no idea what memory is converted this
way. It only sees E820_TYPE_RAM.

Accessing shared memory via a private mapping is fatal. It leads to an
unrecoverable TD exit.

On kexec, walk the direct mapping and convert all shared memory back to
private. This makes all RAM private again, so the second kernel can use
it normally.

The conversion occurs in two steps: stopping new conversions and
unsharing all memory. In the case of normal kexec, the stopping of
conversions takes place while scheduling is still functioning. This
allows for waiting until any ongoing conversions are finished. The
second step is carried out when all CPUs except one are inactive and
interrupts are disabled. This prevents any conflicts with code that may
access shared memory.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
Tested-by: Tao Liu <ltao@redhat.com>
---
 arch/x86/coco/tdx/tdx.c           | 90 +++++++++++++++++++++++++++++++
 arch/x86/include/asm/pgtable.h    |  5 ++
 arch/x86/include/asm/set_memory.h |  3 ++
 arch/x86/mm/pat/set_memory.c      | 41 ++++++++++++--
 4 files changed, 136 insertions(+), 3 deletions(-)

Comments

Borislav Petkov June 3, 2024, 8:37 a.m. UTC | #1
On Sun, Jun 02, 2024 at 05:23:03PM +0300, Kirill A. Shutemov wrote:
> +			/*
> +			 * The only thing one can do at this point on failure
> +			 * is panic. It is reasonable to proceed.

It makes even less sense now: panic() means "all stops and we die" and
you say it is reasonable to proceed.

I'm confused.

Kirill A. Shutemov June 4, 2024, 3:32 p.m. UTC | #2
On Mon, Jun 03, 2024 at 10:37:54AM +0200, Borislav Petkov wrote:
> On Sun, Jun 02, 2024 at 05:23:03PM +0300, Kirill A. Shutemov wrote:
> > +			/*
> > +			 * The only thing one can do at this point on failure
> > +			 * is panic. It is reasonable to proceed.
> 
> It makes even less sense now: panic() means "all stops and we die" and
> you say it is reasonable to proceed.
> 
> I'm confused.

Right.

What about the comment below?

			/*
			 * One possible reason for the failure is if kexec raced
			 * with memory conversion. In this case shared bit in
			 * page table got set (or not cleared) during
			 * shared<->private conversion, but the page is actually
			 * private. So this failure is not going to affect the
			 * kexec'ed kernel.
			 *
			 * The only thing one can do at this point on failure
			 * is panic. In the absence of better options, it is
			 * reasonable to proceed, hoping the failure is a
			 * benign shared bit mismatch due to the race.
			 *
			 * Also, even if the failure is real and the page cannot
			 * be touched as private, the kdump kernel will boot
			 * fine as it uses pre-reserved memory. What happens
			 * next depends on what the dumping process does and
			 * there's a reasonable chance to produce useful dump
			 * on crash.
			 *
			 * Regardless, the print leaves a trace in the log to
			 * give a clue for debug.
			 */

Dave Hansen June 4, 2024, 3:47 p.m. UTC | #3
On 6/4/24 08:32, Kirill A. Shutemov wrote:
> What about the comment below?
> 
> 			/*
> 			 * One possible reason for the failure is if kexec raced
> 			 * with memory conversion. In this case shared bit in
> 			 * page table got set (or not cleared) during
> 			 * shared<->private conversion, but the page is actually
> 			 * private. So this failure is not going to affect the
> 			 * kexec'ed kernel.
> 			 *
> 			 * The only thing one can do at this point on failure
> 			 * is panic. In the absence of better options, it is
> 			 * reasonable to proceed, hoping the failure is a
> 			 * benign shared bit mismatch due to the race.
> 			 *
> 			 * Also, even if the failure is real and the page cannot
> 			 * be touched as private, the kdump kernel will boot
> 			 * fine as it uses pre-reserved memory. What happens
> 			 * next depends on what the dumping process does and
> 			 * there's a reasonable chance to produce useful dump
> 			 * on crash.
> 			 *
> 			 * Regardless, the print leaves a trace in the log to
> 			 * give a clue for debug.
> 			 */

It's rambling too much for my taste.

Let's boil this down to what matters:

 1. Failures to change encryption status here can lead a future kernel
    to touch shared memory with a private mapping
 2. That causes an immediate unrecoverable guest shutdown (right?)
 3. kdump kernels should not be affected since they have their own
    memory ranges and its encryption status is not being tweaked here
 4. The pr_err() may help make some sense out of #2 when it happens

I'm not sure the reason behind the failed conversion is important here.

I wouldn't mention panic().

We don't need to opine about what the next kernel might or might not do.

Kirill A. Shutemov June 4, 2024, 4:14 p.m. UTC | #4
On Tue, Jun 04, 2024 at 08:47:22AM -0700, Dave Hansen wrote:
> On 6/4/24 08:32, Kirill A. Shutemov wrote:
> > What about the comment below?
> > 
> > 			/*
> > 			 * One possible reason for the failure is if kexec raced
> > 			 * with memory conversion. In this case shared bit in
> > 			 * page table got set (or not cleared) during
> > 			 * shared<->private conversion, but the page is actually
> > 			 * private. So this failure is not going to affect the
> > 			 * kexec'ed kernel.
> > 			 *
> > 			 * The only thing one can do at this point on failure
> > 			 * is panic. In the absence of better options, it is
> > 			 * reasonable to proceed, hoping the failure is a
> > 			 * benign shared bit mismatch due to the race.
> > 			 *
> > 			 * Also, even if the failure is real and the page cannot
> > 			 * be touched as private, the kdump kernel will boot
> > 			 * fine as it uses pre-reserved memory. What happens
> > 			 * next depends on what the dumping process does and
> > 			 * there's a reasonable chance to produce useful dump
> > 			 * on crash.
> > 			 *
> > 			 * Regardless, the print leaves a trace in the log to
> > 			 * give a clue for debug.
> > 			 */
> 
> It's rambling too much for my taste.
> 
> Let's boil this down to what matters:
> 
>  1. Failures to change encryption status here can lead a future kernel
>     to touch shared memory with a private mapping
>  2. That causes an immediate unrecoverable guest shutdown (right?)

Right.

>  3. kdump kernels should not be affected since they have their own
>     memory ranges and its encryption status is not being tweaked here
>  4. The pr_err() may help make some sense out of #2 when it happens
> 
> I'm not sure the reason behind the failed conversion is important here.

The important part is that failure can be benign. It explains "can" in #1.
But okay.

> I wouldn't mention panic().
> 
> We don't need to opine about what the next kernel might or might not do.

Is this any better?

			/*
			 * If tdx_enc_status_changed() fails, it leaves memory
			 * in an unknown state. If the memory remains shared,
			 * it can result in an unrecoverable guest shutdown on
			 * the first accessed through a private mapping.
			 *
			 * The kdump kernel boot is not impacted as it uses
			 * a pre-reserved memory range that is always private.
			 * However, gathering crash information could lead to
			 * a crash if it accesses unconverted memory through
			 * a private mapping.
			 *
			 * pr_err() may assist in understanding such crashes.
			 */

Borislav Petkov June 4, 2024, 6:05 p.m. UTC | #5
On Tue, Jun 04, 2024 at 07:14:00PM +0300, Kirill A. Shutemov wrote:
> 			/*
> 			 * If tdx_enc_status_changed() fails, it leaves memory
> 			 * in an unknown state. If the memory remains shared,
> 			 * it can result in an unrecoverable guest shutdown on
> 			 * the first accessed through a private mapping.

"access"

So this sentence above can go too, right?

Because that comment is in tdx_kexec_finish() and we're basically going
off to kexec. So can a guest even access it through a private mapping?
We're shutting down so nothing is running anymore...

> 			 * The kdump kernel boot is not impacted as it uses
> 			 * a pre-reserved memory range that is always private.
> 			 * However, gathering crash information could lead to
> 			 * a crash if it accesses unconverted memory through
> 			 * a private mapping.

When does the kexec kernel even get such a private mapping? It is not
even up yet...

> 			 * pr_err() may assist in understanding such crashes.

"Print error info in order to leave bread crumbs for debugging." is what
I'd say.

Thx.

Kirill A. Shutemov June 5, 2024, 12:21 p.m. UTC | #6
On Tue, Jun 04, 2024 at 08:05:54PM +0200, Borislav Petkov wrote:
> On Tue, Jun 04, 2024 at 07:14:00PM +0300, Kirill A. Shutemov wrote:
> > 			/*
> > 			 * If tdx_enc_status_changed() fails, it leaves memory
> > 			 * in an unknown state. If the memory remains shared,
> > 			 * it can result in an unrecoverable guest shutdown on
> > 			 * the first accessed through a private mapping.
> 
> "access"

Okay.

> So this sentence above can go too, right?

I don't think so.

> Because that comment is in tdx_kexec_finish() and we're basically going
> off to kexec. So can a guest even access it through a private mapping?
> We're shutting down so nothing is running anymore...

This kernel can't. But the next kernel can.

Whether a page can be accessed via a private mapping is determined by its
presence in the Secure EPT. This state persists across kexec.

> > 			 * The kdump kernel boot is not impacted as it uses
> > 			 * a pre-reserved memory range that is always private.
> > 			 * However, gathering crash information could lead to
> > 			 * a crash if it accesses unconverted memory through
> > 			 * a private mapping.
> 
> When does the kexec kernel even get such a private mapping? It is not
> even up yet...

Crash kernel provides access to this memory via /proc/vmcore. Crash kernel
will assume all memory there is private.

> > 			 * pr_err() may assist in understanding such crashes.
> 
> "Print error info in order to leave bread crumbs for debugging." is what
> I'd say.

Okay.

Borislav Petkov June 5, 2024, 4:24 p.m. UTC | #7
On Wed, Jun 05, 2024 at 03:21:42PM +0300, Kirill A. Shutemov wrote:
> Whether a page can be accessed via a private mapping is determined by its
> presence in the Secure EPT. This state persists across kexec.

I just love it how I tickle out details each time I touch this comment
because we three can't write a single concise and self-contained
explanation. :-(

Ok, next version:

"Private mappings persist across kexec. If tdx_enc_status_changed() fails
in the first kernel, it leaves memory in an unknown state.

If that memory remains shared, accessing it in the *next* kernel through
a private mapping will result in an unrecoverable guest shutdown.

The kdump kernel boot is not impacted as it uses a pre-reserved memory
range that is always private.  However, gathering crash information
could lead to a crash if it accesses unconverted memory through
a private mapping which is possible when accessing that memory through
/proc/vmcore, for example.

In all cases, print error info in order to leave enough bread crumbs for
debugging."

I think this is getting in the right direction as it actually makes
sense now.

Kirill A. Shutemov June 6, 2024, 12:39 p.m. UTC | #8
On Wed, Jun 05, 2024 at 06:24:19PM +0200, Borislav Petkov wrote:
> On Wed, Jun 05, 2024 at 03:21:42PM +0300, Kirill A. Shutemov wrote:
> > Whether a page can be accessed via a private mapping is determined by its
> > presence in the Secure EPT. This state persists across kexec.
> 
> I just love it how I tickle out details each time I touch this comment
> because we three can't write a single concise and self-contained
> explanation. :-(
> 
> Ok, next version:
> 
> "Private mappings persist across kexec. If tdx_enc_status_changed() fails

s/Private mappings persist /Memory encryption state persists /

> in the first kernel, it leaves memory in an unknown state.
> 
> If that memory remains shared, accessing it in the *next* kernel through
> a private mapping will result in an unrecoverable guest shutdown.
> 
> The kdump kernel boot is not impacted as it uses a pre-reserved memory
> range that is always private.  However, gathering crash information
> could lead to a crash if it accesses unconverted memory through
> a private mapping which is possible when accessing that memory through
> /proc/vmcore, for example.
> 
> In all cases, print error info in order to leave enough bread crumbs for
> debugging."
> 
> I think this is getting in the right direction as it actually makes
> sense now.

Otherwise looks good to me.
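
For readers following the comment wording discussed above, the loop that
the comment sits in can be modelled in user space. This is only a sketch
under stated assumptions: struct model_mapping, MODEL_SHARED_BIT (with an
arbitrarily chosen bit position) and the model_*() helpers are hypothetical
and stand in for the direct-mapping page tables, the shared bit and
pte_decrypted(). The actual conversion call (tdx_enc_status_changed()) and
its failure path are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_SHARED_BIT	(1ULL << 51)	/* hypothetical shared bit position */
#define MODEL_PAGE_SIZE		4096UL

/* Hypothetical stand-in for one direct-mapping entry. */
struct model_mapping {
	uint64_t pte;		/* 0 means "not mapped" in this model */
	unsigned long size;	/* bytes covered by the entry: 4K, 2M, ... */
};

/* Same idea as pte_decrypted(): setting the shared bit changes nothing. */
bool model_pte_decrypted(uint64_t pte)
{
	return (pte | MODEL_SHARED_BIT) == pte;
}

/*
 * Walk all entries, clear every shared one and return how many 4K pages
 * were found shared. The caller can compare the result with its own
 * counter, as the patch does with nr_shared.
 */
long model_unshare_all(struct model_mapping *map, size_t n)
{
	long found = 0;

	for (size_t i = 0; i < n; i++) {
		assert(map[i].size % MODEL_PAGE_SIZE == 0);

		if (map[i].pte && model_pte_decrypted(map[i].pte)) {
			/* Like set_pte(pte, __pte(0)): nobody may touch it now. */
			map[i].pte = 0;
			found += map[i].size / MODEL_PAGE_SIZE;
		}
	}

	return found;
}
```

A 2M shared entry counts as 512 pages in this model, mirroring the
"pages = size / PAGE_SIZE" accounting in tdx_kexec_finish(). A second walk
over the same table finds nothing, since the shared entries were cleared.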

Patch

diff --git a/arch/x86/coco/tdx/tdx.c b/arch/x86/coco/tdx/tdx.c
index 979891e97d83..afd71bc6eb02 100644
--- a/arch/x86/coco/tdx/tdx.c
+++ b/arch/x86/coco/tdx/tdx.c
@@ -7,6 +7,7 @@ 
 #include <linux/cpufeature.h>
 #include <linux/export.h>
 #include <linux/io.h>
+#include <linux/kexec.h>
 #include <asm/coco.h>
 #include <asm/tdx.h>
 #include <asm/vmx.h>
@@ -14,6 +15,7 @@ 
 #include <asm/insn.h>
 #include <asm/insn-eval.h>
 #include <asm/pgtable.h>
+#include <asm/set_memory.h>
 
 /* MMIO direction */
 #define EPT_READ	0
@@ -831,6 +833,91 @@  static int tdx_enc_status_change_finish(unsigned long vaddr, int numpages,
 	return 0;
 }
 
+/* Stop new private<->shared conversions */
+static void tdx_kexec_begin(bool crash)
+{
+	/*
+	 * Crash kernel reaches here with interrupts disabled: can't wait for
+	 * conversions to finish.
+	 *
+	 * If race happened, just report and proceed.
+	 */
+	if (!set_memory_enc_stop_conversion(!crash))
+		pr_warn("Failed to stop shared<->private conversions\n");
+}
+
+/* Walk direct mapping and convert all shared memory back to private */
+static void tdx_kexec_finish(void)
+{
+	unsigned long addr, end;
+	long found = 0, shared;
+
+	lockdep_assert_irqs_disabled();
+
+	addr = PAGE_OFFSET;
+	end  = PAGE_OFFSET + get_max_mapped();
+
+	while (addr < end) {
+		unsigned long size;
+		unsigned int level;
+		pte_t *pte;
+
+		pte = lookup_address(addr, &level);
+		size = page_level_size(level);
+
+		if (pte && pte_decrypted(*pte)) {
+			int pages = size / PAGE_SIZE;
+
+			/*
+			 * Touching memory with shared bit set triggers implicit
+			 * conversion to shared.
+			 *
+			 * Make sure nobody touches the shared range from
+			 * now on.
+			 */
+			set_pte(pte, __pte(0));
+
+			/*
+			 * The only thing one can do at this point on failure
+			 * is panic. It is reasonable to proceed.
+			 *
+			 * Also, even if the failure is real and the page cannot
+			 * be touched as private, the kdump kernel will boot
+			 * fine as it uses pre-reserved memory. What happens
+			 * next depends on what the dumping process does and
+			 * there's a reasonable chance to produce useful dump
+			 * on crash.
+			 *
+			 * Regardless, the print leaves a trace in the log to
+			 * give a clue for debug.
+			 *
+			 * One possible reason for the failure is if kdump raced
+			 * with memory conversion. In this case shared bit in
+			 * page table got set (or not cleared) during
+			 * shared<->private conversion, but the page is actually
+			 * private. So this failure is not going to affect the
+			 * kexec'ed kernel.
+			 */
+			if (!tdx_enc_status_changed(addr, pages, true)) {
+				pr_err("Failed to unshare range %#lx-%#lx\n",
+				       addr, addr + size);
+			}
+
+			found += pages;
+		}
+
+		addr += size;
+	}
+
+	__flush_tlb_all();
+
+	shared = atomic_long_read(&nr_shared);
+	if (shared != found) {
+		pr_err("shared page accounting is off\n");
+		pr_err("nr_shared = %ld, nr_found = %ld\n", shared, found);
+	}
+}
+
 void __init tdx_early_init(void)
 {
 	struct tdx_module_args args = {
@@ -890,6 +977,9 @@  void __init tdx_early_init(void)
 	x86_platform.guest.enc_cache_flush_required  = tdx_cache_flush_required;
 	x86_platform.guest.enc_tlb_flush_required    = tdx_tlb_flush_required;
 
+	x86_platform.guest.enc_kexec_begin	     = tdx_kexec_begin;
+	x86_platform.guest.enc_kexec_finish	     = tdx_kexec_finish;
+
 	/*
 	 * TDX intercepts the RDMSR to read the X2APIC ID in the parallel
 	 * bringup low level code. That raises #VE which cannot be handled
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 65b8e5bb902c..e39311a89bf4 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -140,6 +140,11 @@  static inline int pte_young(pte_t pte)
 	return pte_flags(pte) & _PAGE_ACCESSED;
 }
 
+static inline bool pte_decrypted(pte_t pte)
+{
+	return cc_mkdec(pte_val(pte)) == pte_val(pte);
+}
+
 #define pmd_dirty pmd_dirty
 static inline bool pmd_dirty(pmd_t pmd)
 {
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 9aee31862b4a..d490db38db9e 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -49,8 +49,11 @@  int set_memory_wb(unsigned long addr, int numpages);
 int set_memory_np(unsigned long addr, int numpages);
 int set_memory_p(unsigned long addr, int numpages);
 int set_memory_4k(unsigned long addr, int numpages);
+
+bool set_memory_enc_stop_conversion(bool wait);
 int set_memory_encrypted(unsigned long addr, int numpages);
 int set_memory_decrypted(unsigned long addr, int numpages);
+
 int set_memory_np_noalias(unsigned long addr, int numpages);
 int set_memory_nonglobal(unsigned long addr, int numpages);
 int set_memory_global(unsigned long addr, int numpages);
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index a7a7a6c6a3fb..2a548b65ef5f 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2227,12 +2227,47 @@  static int __set_memory_enc_pgtable(unsigned long addr, int numpages, bool enc)
 	return ret;
 }
 
+/*
+ * The lock serializes conversions between private and shared memory.
+ *
+ * It is taken for read on conversion. A write lock guarantees that no
+ * concurrent conversions are in progress.
+ */
+static DECLARE_RWSEM(mem_enc_lock);
+
+/*
+ * Stop new private<->shared conversions.
+ *
+ * Taking the exclusive mem_enc_lock waits for in-flight conversions to complete.
+ * The lock is not released to prevent new conversions from being started.
+ *
+ * If sleep is not allowed, as in a crash scenario, try to take the lock.
+ * Failure indicates that there is a race with the conversion.
+ */
+bool set_memory_enc_stop_conversion(bool wait)
+{
+	if (!wait)
+		return down_write_trylock(&mem_enc_lock);
+
+	down_write(&mem_enc_lock);
+
+	return true;
+}
+
 static int __set_memory_enc_dec(unsigned long addr, int numpages, bool enc)
 {
-	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT))
-		return __set_memory_enc_pgtable(addr, numpages, enc);
+	int ret = 0;
 
-	return 0;
+	if (cc_platform_has(CC_ATTR_MEM_ENCRYPT)) {
+		if (!down_read_trylock(&mem_enc_lock))
+			return -EBUSY;
+
+		ret = __set_memory_enc_pgtable(addr, numpages, enc);
+
+		up_read(&mem_enc_lock);
+	}
+
+	return ret;
 }
 
 int set_memory_encrypted(unsigned long addr, int numpages)