mbox series

[v3,|,v5.10,0/3] x86/kvm: fixes for hibernation

Message ID 20210531140526.42932-1-krzysztof.kozlowski@canonical.com
Headers show
Series x86/kvm: fixes for hibernation | expand

Message

Krzysztof Kozlowski May 31, 2021, 2:05 p.m. UTC
Hi,

This is version 2 of a backport for v5.10.

Changes since v2:
1. Rebased v3 of v5.4 version, I kept numbering to be consistent.
2. The context in patch 1/3 had to be adjusted.


The first patch specifically fixes bug aftert 2nd resume:
  BUG: Bad page state in process dbus-daemon  pfn:18b01
  page:ffffea000062c040 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 compound_mapcount: -30591
  flags: 0xfffffc0078141(locked|error|workingset|writeback|head|mappedtodisk|reclaim)
  raw: 000fffffc0078141 dead0000000002d0 dead000000000100 0000000000000000
  raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
  page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
  bad because of flags: 0x78141(locked|error|workingset|writeback|head|mappedtodisk|reclaim)

Best regards,
Krzysztof

Vitaly Kuznetsov (3):
  x86/kvm: Teardown PV features on boot CPU as well
  x86/kvm: Disable kvmclock on all CPUs on shutdown
  x86/kvm: Disable all PV features on crash

 arch/x86/include/asm/kvm_para.h | 10 +---
 arch/x86/kernel/kvm.c           | 92 ++++++++++++++++++++++++---------
 arch/x86/kernel/kvmclock.c      | 26 +---------
 3 files changed, 72 insertions(+), 56 deletions(-)

Comments

Paolo Bonzini May 31, 2021, 3:42 p.m. UTC | #1
On 31/05/21 16:05, Krzysztof Kozlowski wrote:
> From: Vitaly Kuznetsov <vkuznets@redhat.com>
> 
> commit 8b79feffeca28c5459458fe78676b081e87c93a4 upstream.
> 
> Various PV features (Async PF, PV EOI, steal time) work through memory
> shared with hypervisor and when we restore from hibernation we must
> properly teardown all these features to make sure hypervisor doesn't
> write to stale locations after we jump to the previously hibernated kernel
> (which can try to place anything there). For secondary CPUs the job is
> already done by kvm_cpu_down_prepare(), register syscore ops to do
> the same for boot CPU.
> 
> Krzysztof:
> This fixes memory corruption visible after second resume from
> hibernation:
> 
>    BUG: Bad page state in process dbus-daemon  pfn:18b01
>    page:ffffea000062c040 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 compound_mapcount: -30591
>    flags: 0xfffffc0078141(locked|error|workingset|writeback|head|mappedtodisk|reclaim)
>    raw: 000fffffc0078141 dead0000000002d0 dead000000000100 0000000000000000
>    raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
>    page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag set
>    bad because of flags: 0x78141(locked|error|workingset|writeback|head|mappedtodisk|reclaim)
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> Message-Id: <20210414123544.1060604-3-vkuznets@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Andrea Righi <andrea.righi@canonical.com>
> [krzysztof: Extend the commit message, adjust for v5.10 context]
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
> ---
>   arch/x86/kernel/kvm.c | 57 +++++++++++++++++++++++++++++++------------
>   1 file changed, 41 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index 7f57ede3cb8e..6af3f9c3956c 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -26,6 +26,7 @@
>   #include <linux/kprobes.h>
>   #include <linux/nmi.h>
>   #include <linux/swait.h>
> +#include <linux/syscore_ops.h>
>   #include <asm/timer.h>
>   #include <asm/cpu.h>
>   #include <asm/traps.h>
> @@ -460,6 +461,25 @@ static bool pv_tlb_flush_supported(void)
>   
>   static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
>   
> +static void kvm_guest_cpu_offline(void)
> +{
> +	kvm_disable_steal_time();
> +	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> +		wrmsrl(MSR_KVM_PV_EOI_EN, 0);
> +	kvm_pv_disable_apf();
> +	apf_task_wake_all();
> +}
> +
> +static int kvm_cpu_online(unsigned int cpu)
> +{
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	kvm_guest_cpu_init();
> +	local_irq_restore(flags);
> +	return 0;
> +}
> +
>   #ifdef CONFIG_SMP
>   
>   static bool pv_ipi_supported(void)
> @@ -587,31 +607,34 @@ static void __init kvm_smp_prepare_boot_cpu(void)
>   	kvm_spinlock_init();
>   }
>   
> -static void kvm_guest_cpu_offline(void)
> +static int kvm_cpu_down_prepare(unsigned int cpu)
>   {
> -	kvm_disable_steal_time();
> -	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> -		wrmsrl(MSR_KVM_PV_EOI_EN, 0);
> -	kvm_pv_disable_apf();
> -	apf_task_wake_all();
> -}
> +	unsigned long flags;
>   
> -static int kvm_cpu_online(unsigned int cpu)
> -{
> -	local_irq_disable();
> -	kvm_guest_cpu_init();
> -	local_irq_enable();
> +	local_irq_save(flags);
> +	kvm_guest_cpu_offline();
> +	local_irq_restore(flags);
>   	return 0;
>   }
>   
> -static int kvm_cpu_down_prepare(unsigned int cpu)
> +#endif
> +
> +static int kvm_suspend(void)
>   {
> -	local_irq_disable();
>   	kvm_guest_cpu_offline();
> -	local_irq_enable();
> +
>   	return 0;
>   }
> -#endif
> +
> +static void kvm_resume(void)
> +{
> +	kvm_cpu_online(raw_smp_processor_id());
> +}
> +
> +static struct syscore_ops kvm_syscore_ops = {
> +	.suspend	= kvm_suspend,
> +	.resume		= kvm_resume,
> +};
>   
>   static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>   			const struct flush_tlb_info *info)
> @@ -681,6 +704,8 @@ static void __init kvm_guest_init(void)
>   	kvm_guest_cpu_init();
>   #endif
>   
> +	register_syscore_ops(&kvm_syscore_ops);
> +
>   	/*
>   	 * Hard lockup detection is enabled by default. Disable it, as guests
>   	 * can get false positives too easily, for example if the host is
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini May 31, 2021, 3:42 p.m. UTC | #2
On 31/05/21 16:05, Krzysztof Kozlowski wrote:
> From: Vitaly Kuznetsov <vkuznets@redhat.com>
> 
> commit 3d6b84132d2a57b5a74100f6923a8feb679ac2ce upstream.
> 
> Crash shutdown handler only disables kvmclock and steal time, other PV
> features remain active so we risk corrupting memory or getting some
> side-effects in kdump kernel. Move crash handler to kvm.c and unify
> with CPU offline.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> Message-Id: <20210414123544.1060604-5-vkuznets@redhat.com>
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com>
> ---
>   arch/x86/include/asm/kvm_para.h |  6 -----
>   arch/x86/kernel/kvm.c           | 44 ++++++++++++++++++++++++---------
>   arch/x86/kernel/kvmclock.c      | 21 ----------------
>   3 files changed, 32 insertions(+), 39 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_para.h b/arch/x86/include/asm/kvm_para.h
> index 9c56e0defd45..69299878b200 100644
> --- a/arch/x86/include/asm/kvm_para.h
> +++ b/arch/x86/include/asm/kvm_para.h
> @@ -92,7 +92,6 @@ unsigned int kvm_arch_para_hints(void);
>   void kvm_async_pf_task_wait_schedule(u32 token);
>   void kvm_async_pf_task_wake(u32 token);
>   u32 kvm_read_and_reset_apf_flags(void);
> -void kvm_disable_steal_time(void);
>   bool __kvm_handle_async_pf(struct pt_regs *regs, u32 token);
>   
>   DECLARE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> @@ -137,11 +136,6 @@ static inline u32 kvm_read_and_reset_apf_flags(void)
>   	return 0;
>   }
>   
> -static inline void kvm_disable_steal_time(void)
> -{
> -	return;
> -}
> -
>   static __always_inline bool kvm_handle_async_pf(struct pt_regs *regs, u32 token)
>   {
>   	return false;
> diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> index be1c42e663c6..7462b79c39de 100644
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -38,6 +38,7 @@
>   #include <asm/tlb.h>
>   #include <asm/cpuidle_haltpoll.h>
>   #include <asm/ptrace.h>
> +#include <asm/reboot.h>
>   #include <asm/svm.h>
>   
>   DEFINE_STATIC_KEY_FALSE(kvm_async_pf_enabled);
> @@ -375,6 +376,14 @@ static void kvm_pv_disable_apf(void)
>   	pr_info("Unregister pv shared memory for cpu %d\n", smp_processor_id());
>   }
>   
> +static void kvm_disable_steal_time(void)
> +{
> +	if (!has_steal_clock)
> +		return;
> +
> +	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
> +}
> +
>   static void kvm_pv_guest_cpu_reboot(void *unused)
>   {
>   	/*
> @@ -417,14 +426,6 @@ static u64 kvm_steal_clock(int cpu)
>   	return steal;
>   }
>   
> -void kvm_disable_steal_time(void)
> -{
> -	if (!has_steal_clock)
> -		return;
> -
> -	wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
> -}
> -
>   static inline void __set_percpu_decrypted(void *ptr, unsigned long size)
>   {
>   	early_set_memory_decrypted((unsigned long) ptr, size);
> @@ -461,13 +462,14 @@ static bool pv_tlb_flush_supported(void)
>   
>   static DEFINE_PER_CPU(cpumask_var_t, __pv_cpu_mask);
>   
> -static void kvm_guest_cpu_offline(void)
> +static void kvm_guest_cpu_offline(bool shutdown)
>   {
>   	kvm_disable_steal_time();
>   	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
>   		wrmsrl(MSR_KVM_PV_EOI_EN, 0);
>   	kvm_pv_disable_apf();
> -	apf_task_wake_all();
> +	if (!shutdown)
> +		apf_task_wake_all();
>   	kvmclock_disable();
>   }
>   
> @@ -613,7 +615,7 @@ static int kvm_cpu_down_prepare(unsigned int cpu)
>   	unsigned long flags;
>   
>   	local_irq_save(flags);
> -	kvm_guest_cpu_offline();
> +	kvm_guest_cpu_offline(false);
>   	local_irq_restore(flags);
>   	return 0;
>   }
> @@ -622,7 +624,7 @@ static int kvm_cpu_down_prepare(unsigned int cpu)
>   
>   static int kvm_suspend(void)
>   {
> -	kvm_guest_cpu_offline();
> +	kvm_guest_cpu_offline(false);
>   
>   	return 0;
>   }
> @@ -637,6 +639,20 @@ static struct syscore_ops kvm_syscore_ops = {
>   	.resume		= kvm_resume,
>   };
>   
> +/*
> + * After a PV feature is registered, the host will keep writing to the
> + * registered memory location. If the guest happens to shutdown, this memory
> + * won't be valid. In cases like kexec, in which you install a new kernel, this
> + * means a random memory location will be kept being written.
> + */
> +#ifdef CONFIG_KEXEC_CORE
> +static void kvm_crash_shutdown(struct pt_regs *regs)
> +{
> +	kvm_guest_cpu_offline(true);
> +	native_machine_crash_shutdown(regs);
> +}
> +#endif
> +
>   static void kvm_flush_tlb_others(const struct cpumask *cpumask,
>   			const struct flush_tlb_info *info)
>   {
> @@ -705,6 +721,10 @@ static void __init kvm_guest_init(void)
>   	kvm_guest_cpu_init();
>   #endif
>   
> +#ifdef CONFIG_KEXEC_CORE
> +	machine_ops.crash_shutdown = kvm_crash_shutdown;
> +#endif
> +
>   	register_syscore_ops(&kvm_syscore_ops);
>   
>   	/*
> diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> index 327a0de01066..c4ac26333bc4 100644
> --- a/arch/x86/kernel/kvmclock.c
> +++ b/arch/x86/kernel/kvmclock.c
> @@ -20,7 +20,6 @@
>   #include <asm/hypervisor.h>
>   #include <asm/mem_encrypt.h>
>   #include <asm/x86_init.h>
> -#include <asm/reboot.h>
>   #include <asm/kvmclock.h>
>   
>   static int kvmclock __initdata = 1;
> @@ -204,23 +203,6 @@ static void kvm_setup_secondary_clock(void)
>   }
>   #endif
>   
> -/*
> - * After the clock is registered, the host will keep writing to the
> - * registered memory location. If the guest happens to shutdown, this memory
> - * won't be valid. In cases like kexec, in which you install a new kernel, this
> - * means a random memory location will be kept being written. So before any
> - * kind of shutdown from our side, we unregister the clock by writing anything
> - * that does not have the 'enable' bit set in the msr
> - */
> -#ifdef CONFIG_KEXEC_CORE
> -static void kvm_crash_shutdown(struct pt_regs *regs)
> -{
> -	native_write_msr(msr_kvm_system_time, 0, 0);
> -	kvm_disable_steal_time();
> -	native_machine_crash_shutdown(regs);
> -}
> -#endif
> -
>   void kvmclock_disable(void)
>   {
>   	native_write_msr(msr_kvm_system_time, 0, 0);
> @@ -350,9 +332,6 @@ void __init kvmclock_init(void)
>   #endif
>   	x86_platform.save_sched_clock_state = kvm_save_sched_clock_state;
>   	x86_platform.restore_sched_clock_state = kvm_restore_sched_clock_state;
> -#ifdef CONFIG_KEXEC_CORE
> -	machine_ops.crash_shutdown  = kvm_crash_shutdown;
> -#endif
>   	kvm_get_preset_lpj();
>   
>   	/*
> 

Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Greg KH June 1, 2021, 6:40 a.m. UTC | #3
On Mon, May 31, 2021 at 04:05:23PM +0200, Krzysztof Kozlowski wrote:
> Hi,

> 

> This is version 2 of a backport for v5.10.

> 

> Changes since v2:

> 1. Rebased v3 of v5.4 version, I kept numbering to be consistent.

> 2. The context in patch 1/3 had to be adjusted.


I also need a 5.12.y version of this series before I can apply these to
older kernel releases.

thanks,

greg k-h
Krzysztof Kozlowski June 1, 2021, 7:02 a.m. UTC | #4
On 01/06/2021 08:40, Greg KH wrote:
> On Mon, May 31, 2021 at 04:05:23PM +0200, Krzysztof Kozlowski wrote:

>> Hi,

>>

>> This is version 2 of a backport for v5.10.

>>

>> Changes since v2:

>> 1. Rebased v3 of v5.4 version, I kept numbering to be consistent.

>> 2. The context in patch 1/3 had to be adjusted.

> 

> I also need a 5.12.y version of this series before I can apply these to

> older kernel releases.


The cherry-pick from v5.10 would work although apply might fail due to
context difference. I'll sent shortly a v5.12 port.


Best regards,
Krzysztof