Message ID | 20241202030527.20586-2-xueshuai@linux.alibaba.com |
---|---|
State | New |
Headers | show |
Series | ACPI: APEI: handle synchronous errors in task work | expand |
On 12/1/2024 7:05 PM, Shuai Xue wrote: > Synchronous error was detected as a result of user-space process accessing > a 2-bit uncorrected error. The CPU will take a synchronous error exception > such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a > memory_failure() work which poisons the related page, unmaps the page, and > then sends a SIGBUS to the process, so that a system wide panic can be > avoided. > > However, no memory_failure() work will be queued when abnormal synchronous > errors occur. These errors can include situations such as invalid PA, > unexpected severity, no memory failure config support, invalid GUID > section, etc. In such case, the user-space process will trigger SEA again. > This loop can potentially exceed the platform firmware threshold or even > trigger a kernel hard lockup, leading to a system reboot. > > Fix it by performing a force kill if no memory_failure() work is queued > for synchronous errors. > > Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> > Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> > --- > drivers/acpi/apei/ghes.c | 11 +++++++++++ > 1 file changed, 11 insertions(+) > > diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c > index a2491905f165..106486bdfefc 100644 > --- a/drivers/acpi/apei/ghes.c > +++ b/drivers/acpi/apei/ghes.c > @@ -801,6 +801,17 @@ static bool ghes_do_proc(struct ghes *ghes, > } > } > > + /* > + * If no memory failure work is queued for abnormal synchronous > + * errors, do a force kill. > + */ > + if (sync && !queued) { > + dev_err(ghes->dev, > + HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n", > + current->comm, task_pid_nr(current)); > + force_sig(SIGBUS); > + } > + > return queued; > } > Looks good. Reviewed-by: Jane Chu <jane.chu@oracle.com> -jane
在 2024/12/17 07:37, jane.chu@oracle.com 写道: > > On 12/1/2024 7:05 PM, Shuai Xue wrote: >> Synchronous error was detected as a result of user-space process accessing >> a 2-bit uncorrected error. The CPU will take a synchronous error exception >> such as Synchronous External Abort (SEA) on Arm64. The kernel will queue a >> memory_failure() work which poisons the related page, unmaps the page, and >> then sends a SIGBUS to the process, so that a system wide panic can be >> avoided. >> >> However, no memory_failure() work will be queued when abnormal synchronous >> errors occur. These errors can include situations such as invalid PA, >> unexpected severity, no memory failure config support, invalid GUID >> section, etc. In such case, the user-space process will trigger SEA again. >> This loop can potentially exceed the platform firmware threshold or even >> trigger a kernel hard lockup, leading to a system reboot. >> >> Fix it by performing a force kill if no memory_failure() work is queued >> for synchronous errors. >> >> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com> >> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> >> Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com> >> --- >> drivers/acpi/apei/ghes.c | 11 +++++++++++ >> 1 file changed, 11 insertions(+) >> >> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c >> index a2491905f165..106486bdfefc 100644 >> --- a/drivers/acpi/apei/ghes.c >> +++ b/drivers/acpi/apei/ghes.c >> @@ -801,6 +801,17 @@ static bool ghes_do_proc(struct ghes *ghes, >> } >> } >> + /* >> + * If no memory failure work is queued for abnormal synchronous >> + * errors, do a force kill. >> + */ >> + if (sync && !queued) { >> + dev_err(ghes->dev, >> + HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n", >> + current->comm, task_pid_nr(current)); >> + force_sig(SIGBUS); >> + } >> + >> return queued; >> } > > Looks good. > > Reviewed-by: Jane Chu <jane.chu@oracle.com> > > -jane Thanks. Best Regards, Shuai
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c index a2491905f165..106486bdfefc 100644 --- a/drivers/acpi/apei/ghes.c +++ b/drivers/acpi/apei/ghes.c @@ -801,6 +801,17 @@ static bool ghes_do_proc(struct ghes *ghes, } } + /* + * If no memory failure work is queued for abnormal synchronous + * errors, do a force kill. + */ + if (sync && !queued) { + dev_err(ghes->dev, + HW_ERR GHES_PFX "%s:%d: synchronous unrecoverable error (SIGBUS)\n", + current->comm, task_pid_nr(current)); + force_sig(SIGBUS); + } + return queued; }