
[v2] x86/mce: Fix endless loop when run task works after #MC

Message ID 20210706121606.15864-1-dinghui@sangfor.com.cn
State New
Series [v2] x86/mce: Fix endless loop when run task works after #MC

Commit Message

Ding Hui July 6, 2021, 12:16 p.m. UTC
Recently we encountered multiple #MC exceptions on the same task before
its task_work_run() had been called. current->mce_kill_me was added to
the task_works list more than once, which made the task_works list
circular, so task_work_run() ends up in an endless loop.
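
To make the failure mode concrete, here is a minimal user-space model of
the task_work list push (illustration only, not the real
task_work_add()/task_work_run() code):

/*
 * Simplified model: pushing the same callback_head twice before the
 * list is consumed makes its ->next pointer refer to itself, so any
 * walker that follows ->next loops forever.
 */
#include <stdio.h>

struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *);
};

static struct callback_head *task_works;	/* per-task list head */

static void model_task_work_add(struct callback_head *work)
{
	work->next = task_works;	/* on the 2nd add, head == work */
	task_works = work;
}

static void kill_me(struct callback_head *cb) { }

int main(void)
{
	struct callback_head mce_kill_me = { .func = kill_me };

	model_task_work_add(&mce_kill_me);	/* first #MC */
	model_task_work_add(&mce_kill_me);	/* second #MC, no task_work_run() yet */

	/* the list is now circular: a task_work_run()-style walk never ends */
	printf("self-linked: %d\n", mce_kill_me.next == &mce_kill_me);
	return 0;
}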

More seriously, the SIGBUS signal could not be delivered to the
userspace task which triggered the #MC, and I ran into an #MC flood.

Borrow mce_kill_me.func to check whether current->mce_kill_me has
already been added to task_works, which prevents both the duplicate
addition and the overriding of current->mce_xxx. When the work function
is called, the task_works list has already been taken, so it is safe to
clear the func pointer in the callback.

Fixes: 5567d11c21a1 ("x86/mce: Send #MC singal from task work")
Cc: <stable@vger.kernel.org> #v5.8+
Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>

---
v1->v2:
- do not call kill_me_now() from kill_me_maybe(), to avoid a race on
  mce_kill_me.func
- do not override current->mce_xxx before the callback runs

 arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 13 deletions(-)

Comments

Luck, Tony July 6, 2021, 4:44 p.m. UTC | #1
On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote:
> Recently we encountered multiple #MC exceptions on the same task before
> its task_work_run() had been called. current->mce_kill_me was added to
> the task_works list more than once, which made the task_works list
> circular, so task_work_run() ends up in an endless loop.

I saw the same and posted a similar fix a while back:

https://www.spinics.net/lists/linux-mm/msg251006.html

It didn't get merged because some validation tests began failing
around the same time.  I'm now pretty sure I understand what happened
with those other tests.

I'll post my updated version (second patch in a three part series)
later today.

> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c

> +	if (!cmpxchg(&current->mce_kill_me.func, NULL, ch.func)) {
> +		current->mce_addr = m->addr;
> +		current->mce_kflags = m->kflags;
> +		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
> +		current->mce_whole_page = whole_page(m);

You don't need an atomic cmpxchg here (nor the WRITE_ONCE() to clear it).
The task is operating on its own task_struct. Nobody else should touch
the mce_kill_me field.
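
For reference, a sketch of the plain, non-atomic variant being suggested
here, on top of the guard-on-func idea from the patch below (illustration
only, not the merged fix):

static void queue_task_work(struct mce *m, int kill_current_task)
{
	/* The task only ever touches its own task_struct, so a plain
	 * test of mce_kill_me.func is enough to detect a pending entry. */
	if (current->mce_kill_me.func)
		return;		/* already queued, don't add it twice */

	current->mce_addr = m->addr;
	current->mce_kflags = m->kflags;
	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
	current->mce_whole_page = whole_page(m);

	current->mce_kill_me.func = kill_current_task ? kill_me_now
						      : kill_me_maybe;

	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
}

The callbacks would then clear it with a plain store (cb->func = NULL;)
instead of WRITE_ONCE().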

-Tony
Ding Hui July 7, 2021, 3:39 a.m. UTC | #2
On 2021/7/7 0:44, Luck, Tony wrote:
> On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote:
>> Recently we encountered multiple #MC exceptions on the same task before
>> its task_work_run() had been called. current->mce_kill_me was added to
>> the task_works list more than once, which made the task_works list
>> circular, so task_work_run() ends up in an endless loop.
> 
> I saw the same and posted a similar fix a while back:
> 
> https://www.spinics.net/lists/linux-mm/msg251006.html
> 
> It didn't get merged because some validation tests began failing
> around the same time.  I'm now pretty sure I understand what happened
> with those other tests.
> 
> I'll post my updated version (second patch in a three part series)
> later today.
> 

Thanks for your fixes.

After digging into my original problem, I may have found out why I hit
the #MC flood.

My test case:
1. run a qemu-kvm guest VM whose OS is memtest86+.iso
2. inject an SRAR UE into the VM memory and wait for the #MC
When the VM triggers the #MC, I expect qemu to receive the SIGBUS signal
ASAP, and with the modified qemu I then kill the VM.

In this case, do_machine_check() may be called by kvm_machine_check()
in vmx.c.

Before [1], memory_failure() was called from do_machine_check(), so
TIF_SIGPENDING got set due to the SIGBUS signal; vcpu_run() saw the
pending signal and returned to qemu to handle SIGBUS.

After [1], do_machine_check() only adds the task work and does not send
SIGBUS directly; vcpu_run() never breaks out of its for-loop because
vcpu_enter_guest() returns 1 and TIF_SIGPENDING is not set, so the task
work is never executed until something else happens. KVM keeps
re-entering the guest and the #MC is triggered repeatedly.
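
Roughly, the loop I am describing looks like this on my kernel (heavily
simplified pseudo-version of vcpu_run(), not the actual kvm code):

for (;;) {
	r = vcpu_enter_guest(vcpu);	/* guest #MC: do_machine_check() only
					 * queues mce_kill_me, then returns 1 */
	if (r <= 0)
		break;

	if (signal_pending(current)) {	/* TIF_SIGPENDING was never set... */
		r = -EINTR;
		break;			/* ...so we never get back to qemu */
	}
	/* TIF_NOTIFY_RESUME, set by task_work_add(), is not checked here,
	 * so the guest is re-entered and hits the same #MC again. */
}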

Could you consider fixing cases like this?

Also, would you mind giving me some advice on my temporary workaround for
this #MC flood:
I want to check whether do_machine_check() was reached from a real
exception or from KVM, and fall back to calling kill_me_xxx() directly in
the KVM case. (I did a simple test and it met my expectations.)
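
In case it helps, what I tested looks roughly like this;
in_kvm_mce_context() is just a placeholder name for whatever check ends
up telling the two callers apart (illustration only):

static void queue_task_work(struct mce *m, int kill_current_task)
{
	current->mce_addr = m->addr;
	current->mce_kflags = m->kflags;
	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
	current->mce_whole_page = whole_page(m);

	if (in_kvm_mce_context()) {
		/* No timely return-to-userspace is guaranteed on this path,
		 * so handle it right away instead of queueing task work. */
		if (kill_current_task)
			kill_me_now(&current->mce_kill_me);
		else
			kill_me_maybe(&current->mce_kill_me);
		return;
	}

	if (kill_current_task)
		current->mce_kill_me.func = kill_me_now;
	else
		current->mce_kill_me.func = kill_me_maybe;

	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
}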

[1]: commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work")
-- 
Thanks,
- Ding Hui
Ding Hui July 7, 2021, 9:51 a.m. UTC | #3
On 2021/7/7 11:39, Ding Hui wrote:
> On 2021/7/7 0:44, Luck, Tony wrote:
>> On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote:
>>> Recently we encountered multiple #MC exceptions on the same task before
>>> its task_work_run() had been called. current->mce_kill_me was added to
>>> the task_works list more than once, which made the task_works list
>>> circular, so task_work_run() ends up in an endless loop.
>>
>> I saw the same and posted a similar fix a while back:
>>
>> https://www.spinics.net/lists/linux-mm/msg251006.html
>>
>> It didn't get merged because some validation tests began failing
>> around the same time.  I'm now pretty sure I understand what happened
>> with those other tests.
>>
>> I'll post my updated version (second patch in a three part series)
>> later today.
>>
> 
> Thanks for your fixes.
> 
> After digging into my original problem, I may have found out why I hit
> the #MC flood.
> 
> My test case:
> 1. run a qemu-kvm guest VM whose OS is memtest86+.iso
> 2. inject an SRAR UE into the VM memory and wait for the #MC
> When the VM triggers the #MC, I expect qemu to receive the SIGBUS signal
> ASAP, and with the modified qemu I then kill the VM.
> 
> In this case, do_machine_check() may be called by kvm_machine_check()
> in vmx.c.
> 
> Before [1], memory_failure() was called from do_machine_check(), so
> TIF_SIGPENDING got set due to the SIGBUS signal; vcpu_run() saw the
> pending signal and returned to qemu to handle SIGBUS.
> 
> After [1], do_machine_check() only adds the task work and does not send
> SIGBUS directly; vcpu_run() never breaks out of its for-loop because
> vcpu_enter_guest() returns 1 and TIF_SIGPENDING is not set, so the task
> work is never executed until something else happens. KVM keeps
> re-entering the guest and the #MC is triggered repeatedly.
> 

Sorry for my incorrect description.

I figured out that my test kernel is not the latest: it does not have [2]
commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"),
so vcpu_run() only cares about signal_pending() but not TIF_NOTIFY_RESUME,
which is set in task_work_add().

After [2], the #MC flood should not happen.
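
For reference, the difference is roughly this (simplified sketch of the
post-[2] vcpu_run() behaviour, see kernel/entry/kvm.c for the real code):

for (;;) {
	r = vcpu_enter_guest(vcpu);
	if (r <= 0)
		break;

	if (xfer_to_guest_mode_work_pending()) {
		/* This covers TIF_SIGPENDING *and* TIF_NOTIFY_RESUME, so the
		 * mce_kill_me work queued by do_machine_check() gets run (or
		 * we exit to userspace) before the guest is re-entered. */
		r = xfer_to_guest_mode_handle_work(vcpu);
		if (r)
			break;
	}
}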

Thanks also to Thomas Gleixner.

> Could you consider fixing cases like this?
> 
> Also, would you mind giving me some advice on my temporary workaround
> for this #MC flood:
> I want to check whether do_machine_check() was reached from a real
> exception or from KVM, and fall back to calling kill_me_xxx() directly
> in the KVM case. (I did a simple test and it met my expectations.)
> 

So please ignore my questions above.

> [1]: commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work")



-- 
Thanks,
- Ding Hui

Patch

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 22791aadc085..95d244a95486 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1250,6 +1250,7 @@  static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin
 
 static void kill_me_now(struct callback_head *ch)
 {
+	WRITE_ONCE(ch->func, NULL);
 	force_sig(SIGBUS);
 }
 
@@ -1258,15 +1259,22 @@  static void kill_me_maybe(struct callback_head *cb)
 	struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me);
 	int flags = MF_ACTION_REQUIRED;
 	int ret;
+	u64 mce_addr = p->mce_addr;
+	__u64 mce_kflags = p->mce_kflags;
+	bool mce_ripv = p->mce_ripv;
+	bool mce_whole_page = p->mce_whole_page;
 
-	pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr);
+	/* reset func after save p->mce_xxx */
+	WRITE_ONCE(cb->func, NULL);
 
-	if (!p->mce_ripv)
+	pr_err("Uncorrected hardware memory error in user-access at %llx", mce_addr);
+
+	if (!mce_ripv)
 		flags |= MF_MUST_KILL;
 
-	ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags);
-	if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) {
-		set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page);
+	ret = memory_failure(mce_addr >> PAGE_SHIFT, flags);
+	if (!ret && !(mce_kflags & MCE_IN_KERNEL_COPYIN)) {
+		set_mce_nospec(mce_addr >> PAGE_SHIFT, mce_whole_page);
 		sync_core();
 		return;
 	}
@@ -1283,23 +1291,27 @@  static void kill_me_maybe(struct callback_head *cb)
 		force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT);
 	} else {
 		pr_err("Memory error not recovered");
-		kill_me_now(cb);
+		force_sig(SIGBUS);
 	}
 }
 
 static void queue_task_work(struct mce *m, int kill_current_task)
 {
-	current->mce_addr = m->addr;
-	current->mce_kflags = m->kflags;
-	current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
-	current->mce_whole_page = whole_page(m);
+	struct callback_head ch;
 
 	if (kill_current_task)
-		current->mce_kill_me.func = kill_me_now;
+		ch.func = kill_me_now;
 	else
-		current->mce_kill_me.func = kill_me_maybe;
+		ch.func = kill_me_maybe;
+
+	if (!cmpxchg(&current->mce_kill_me.func, NULL, ch.func)) {
+		current->mce_addr = m->addr;
+		current->mce_kflags = m->kflags;
+		current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV);
+		current->mce_whole_page = whole_page(m);
 
-	task_work_add(current, &current->mce_kill_me, TWA_RESUME);
+		task_work_add(current, &current->mce_kill_me, TWA_RESUME);
+	}
 }
 
 /*