Message ID | 20210706121606.15864-1-dinghui@sangfor.com.cn |
---|---|
State | New |
Headers | show |
Series | [v2] x86/mce: Fix endless loop when run task works after #MC | expand |
On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote: > Recently we encounter multi #MC on the same task when it's > task_work_run() has not been called, current->mce_kill_me was > added to task_works list more than once, that make a circular > linked task_works, so task_work_run() will do a endless loop. I saw the same and posted a similar fix a while back: https://www.spinics.net/lists/linux-mm/msg251006.html It didn't get merged because some validation tests began failing around the same time. I'm now pretty sure I understand what happened with those other tests. I'll post my updated version (second patch in a three part series) later today. > diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c > + if (!cmpxchg(¤t->mce_kill_me.func, NULL, ch.func)) { > + current->mce_addr = m->addr; > + current->mce_kflags = m->kflags; > + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); > + current->mce_whole_page = whole_page(m); You don't need an atomic cmpxchg here (nor the WRITE_ONCE() to clear it). The task is operating on its own task_struct. Nobody else should touch the mce_kill_me field. -Tony
On 2021/7/7 0:44, Luck, Tony wrote: > On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote: >> Recently we encounter multi #MC on the same task when it's >> task_work_run() has not been called, current->mce_kill_me was >> added to task_works list more than once, that make a circular >> linked task_works, so task_work_run() will do a endless loop. > > I saw the same and posted a similar fix a while back: > > https://www.spinics.net/lists/linux-mm/msg251006.html > > It didn't get merged because some validation tests began failing > around the same time. I'm now pretty sure I understand what happened > with those other tests. > > I'll post my updated version (second patch in a three part series) > later today. > Thanks for your fixes. After digging my original problem, maybe I find out why I met #MC flood. My test case: 1. run qemu-kvm guest VM, OS is memtest86+.iso 2. inject SRAR UE to VM memory and wait #MC When VM trigger #MC, I expect that qemu will receive SIGBUS signal ASAP, and with the modifed qemu, I will kill VM. In this case, do_machine_check() maybe called by kvm_machine_check() in vmx.c. Before [1], memory_failure() is called in do_machine_check(), so TIF_SIGPENDING is set on due to SIGBUS signal, vcpu_run() checked the pending singal, so return to qemu to handle SIGBUS. After [1], do_machine_check() only add task work but not send SIGBUS directly, vcpu_run() will not break the for-loop because vcpu_enter_guest() return 1 and not set TIF_SIGPENDING on, task works never executed until sth else happen. So the kvm enter guest repeatedly and the #MC is triggered repeatedly. Can you consider to fix cases like this? And do you mind to give me some advice for my temporary workaround about this #MC flood: I want to check the context of do_machine_check() is exception or kvm, and fallback to call kill_me_xxx directly when in kvm context. (I already tested simply and met my expection) [1]: commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work") -- Thanks, - Ding Hui
On 2021/7/7 11:39, Ding Hui wrote: > On 2021/7/7 0:44, Luck, Tony wrote: >> On Tue, Jul 06, 2021 at 08:16:06PM +0800, Ding Hui wrote: >>> Recently we encounter multi #MC on the same task when it's >>> task_work_run() has not been called, current->mce_kill_me was >>> added to task_works list more than once, that make a circular >>> linked task_works, so task_work_run() will do a endless loop. >> >> I saw the same and posted a similar fix a while back: >> >> https://www.spinics.net/lists/linux-mm/msg251006.html >> >> It didn't get merged because some validation tests began failing >> around the same time. I'm now pretty sure I understand what happened >> with those other tests. >> >> I'll post my updated version (second patch in a three part series) >> later today. >> > > Thanks for your fixes. > > After digging my original problem, maybe I find out why I met #MC flood. > > My test case: > 1. run qemu-kvm guest VM, OS is memtest86+.iso > 2. inject SRAR UE to VM memory and wait #MC > When VM trigger #MC, I expect that qemu will receive SIGBUS signal ASAP, > and with the modifed qemu, I will kill VM. > > In this case, do_machine_check() maybe called by kvm_machine_check() in > vmx.c. > > Before [1], memory_failure() is called in do_machine_check(), so > TIF_SIGPENDING is set on due to SIGBUS signal, vcpu_run() checked the > pending singal, so return to qemu to handle SIGBUS. > > After [1], do_machine_check() only add task work but not send SIGBUS > directly, vcpu_run() will not break the for-loop because > vcpu_enter_guest() return 1 and not set TIF_SIGPENDING on, task works > never executed until sth else happen. So the kvm enter guest repeatedly > and the #MC is triggered repeatedly. > Sorry for my incorrect description. I figure out that my test kernel is not the lastest, it's without [2] commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer to guest work function"), so vcpu_run() only care about signal_pending but not TIF_NOTIFY_RESUME which set on in task_work_add(). After [2], #MC flood should not exist. Also thank Thomas Gleixner. > Can you consider to fix cases like this? > > And do you mind to give me some advice for my temporary workaround about > this #MC flood: > I want to check the context of do_machine_check() is exception or kvm, > and fallback to call kill_me_xxx directly when in kvm context. (I > already tested simply and met my expection) > So ignore my ask, please. > [1]: commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work") -- Thanks, - Ding Hui
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c index 22791aadc085..95d244a95486 100644 --- a/arch/x86/kernel/cpu/mce/core.c +++ b/arch/x86/kernel/cpu/mce/core.c @@ -1250,6 +1250,7 @@ static void __mc_scan_banks(struct mce *m, struct pt_regs *regs, struct mce *fin static void kill_me_now(struct callback_head *ch) { + WRITE_ONCE(ch->func, NULL); force_sig(SIGBUS); } @@ -1258,15 +1259,22 @@ static void kill_me_maybe(struct callback_head *cb) struct task_struct *p = container_of(cb, struct task_struct, mce_kill_me); int flags = MF_ACTION_REQUIRED; int ret; + u64 mce_addr = p->mce_addr; + __u64 mce_kflags = p->mce_kflags; + bool mce_ripv = p->mce_ripv; + bool mce_whole_page = p->mce_whole_page; - pr_err("Uncorrected hardware memory error in user-access at %llx", p->mce_addr); + /* reset func after save p->mce_xxx */ + WRITE_ONCE(cb->func, NULL); - if (!p->mce_ripv) + pr_err("Uncorrected hardware memory error in user-access at %llx", mce_addr); + + if (!mce_ripv) flags |= MF_MUST_KILL; - ret = memory_failure(p->mce_addr >> PAGE_SHIFT, flags); - if (!ret && !(p->mce_kflags & MCE_IN_KERNEL_COPYIN)) { - set_mce_nospec(p->mce_addr >> PAGE_SHIFT, p->mce_whole_page); + ret = memory_failure(mce_addr >> PAGE_SHIFT, flags); + if (!ret && !(mce_kflags & MCE_IN_KERNEL_COPYIN)) { + set_mce_nospec(mce_addr >> PAGE_SHIFT, mce_whole_page); sync_core(); return; } @@ -1283,23 +1291,27 @@ static void kill_me_maybe(struct callback_head *cb) force_sig_mceerr(BUS_MCEERR_AR, p->mce_vaddr, PAGE_SHIFT); } else { pr_err("Memory error not recovered"); - kill_me_now(cb); + force_sig(SIGBUS); } } static void queue_task_work(struct mce *m, int kill_current_task) { - current->mce_addr = m->addr; - current->mce_kflags = m->kflags; - current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); - current->mce_whole_page = whole_page(m); + struct callback_head ch; if (kill_current_task) - current->mce_kill_me.func = kill_me_now; + ch.func = kill_me_now; else - current->mce_kill_me.func = kill_me_maybe; + ch.func = kill_me_maybe; + + if (!cmpxchg(¤t->mce_kill_me.func, NULL, ch.func)) { + current->mce_addr = m->addr; + current->mce_kflags = m->kflags; + current->mce_ripv = !!(m->mcgstatus & MCG_STATUS_RIPV); + current->mce_whole_page = whole_page(m); - task_work_add(current, ¤t->mce_kill_me, TWA_RESUME); + task_work_add(current, ¤t->mce_kill_me, TWA_RESUME); + } } /*
Recently we encounter multi #MC on the same task when it's task_work_run() has not been called, current->mce_kill_me was added to task_works list more than once, that make a circular linked task_works, so task_work_run() will do a endless loop. More seriously, the SIGBUS signal can not be delivered to the userspace task which tigger the #MC and I met #MC flood. I borrowed mce_kill_me.func to check whether current->mce_kill_me has been added to task_works, prevent duplicate addition and override current->mce_xxx. When work function be called, the task_works must has been taken, so it is safe to be cleared in callback. Fixes: commit 5567d11c21a1 ("x86/mce: Send #MC singal from task work") Cc: <stable@vger.kernel.org> #v5.8+ Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> --- v1->v2: do not call kill_me_now in kill_me_maybe, to avoid mce_kill_me.func race condition do not override current->mce_xxx before callback arch/x86/kernel/cpu/mce/core.c | 38 ++++++++++++++++++++++------------ 1 file changed, 25 insertions(+), 13 deletions(-)