Message ID | cover.1720436039.git.mchehab+huawei@kernel.org |
---|---|
Series | Fix issues with ARM Processor CPER records |
On Mon, 8 Jul 2024 13:18:11 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> From: Shengwei Luo <luoshengwei@huawei.com>
>
> The ARM processor CPER record was added at UEFI 2.6, and hasn't
> any changes up to UEFI 2.10 on its struct.
>
> Yet, the original arm_event trace code added on changeset
> e9279e83ad1f ("trace, ras: add ARM processor error trace event") is
> incomplete, as it only traces some fields of UAPI 2.6 table N.16,
> not exporting at all any information from tables N.17 to N.29 of
> the record.
>
> This is not enough for user to take appropriate action or to log
> what exactly happened.
>
> According to UEFI_2_9 specification chapter N2.4.4, the ARM processor
> error section includes:
>
> - several (ERR_INFO_NUM) ARM processor error information structures
>   (Tables N.17 to N.20);
> - several (CONTEXT_INFO_NUM) ARM processor context information
>   structures (Tables N.21 to N.29);
> - several vendor specific error information structures. The
>   size is given by Section Length minus the size of the other
>   fields.
>
> In addition to those data, it also exports two fields that are
> parsed by the GHES driver when firmware reports it, e. g.:
>
> - error severity
> - cpu logical index
>
> Report all of these information to userspace via trace uAPI, So that
> userspace can properly record the error and take decisions related
> to cpu core isolation according to error severity and other info.
>
> After this patch, all the data from ARM Processor record from table
> N.16 are directly or indirectly visible on userspace:
>
> ====================================== =============================
> UEFI field on table N.16               ARM Processor trace fields
> ====================================== =============================
> Validation                             handled when filling data for
>                                        affinity MPIDR and running
>                                        state.
> ERR_INFO_NUM                           pei_len
> CONTEXT_INFO_NUM                       ctx_len
> Section Length                         indirectly reported by
>                                        pei_len, ctx_len and oem_len
> Error affinity level                   affinity
> MPIDR_EL1                              mpidr
> MIDR_EL1                               midr
> Running State                          running_state
> PSCI State                             psci_state
> Processor Error Information Structure  pei_err - count at pei_len
> Processor Context                      ctx_err- count at ctx_len
> Vendor Specific Error Info             oem - count at oem_len
> ====================================== =============================
>
> It should be noticed that decoding of tables N.17 to N.29, if needed,
> will be handled on userspace. That gives more flexibility, as there
> won't be any need to flood the Kernel with micro-architecture specific
> error decoding).
> Also, decoding the other fields require a complex logic, and should
> be done for each of the several values inside the record field.
> So, let userspace daemons like rasdaemon decode them, parsing such
> tables and having vendor-specific micro-architecture-specific decoders.
>
> [mchehab: modified patch description and fix coding style]
> Fixes: e9279e83ad1f ("trace, ras: add ARM processor error trace event")
> Signed-off-by: Shengwei Luo <luoshengwei@huawei.com>
> Signed-off-by: Jason Tian <jason@os.amperecomputing.com>
> Signed-off-by: Daniel Ferguson <danielf@os.amperecomputing.com>
> Tested-by: Shiju Jose <shiju.jose@huawei.com>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Link: https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-section

A few comments inline, but all of the 'I'd have done this slightly
differently' variety. This is fine as it stands though.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/acpi/apei/ghes.c |  3 +--
>  drivers/ras/ras.c        | 45 +++++++++++++++++++++++++++++++++++--
>  include/linux/ras.h      | 16 ++++++++++----
>  include/ras/ras_event.h  | 48 +++++++++++++++++++++++++++++++++++-----
>  4 files changed, 99 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2589a3536d91..90efca025d27 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -538,9 +538,8 @@ static bool ghes_handle_arm_hw_error(struct acpi_hest_generic_data *gdata,
>  	int sec_sev, i;
>  	char *p;
>
> -	log_arm_hw_error(err);
> -
>  	sec_sev = ghes_severity(gdata->error_severity);
> +	log_arm_hw_error(err, sec_sev);
>  	if (sev != GHES_SEV_RECOVERABLE || sec_sev != GHES_SEV_RECOVERABLE)
>  		return false;
>
> diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
> index 5d94ab79c8c3..75acc09bc96a 100644
> --- a/drivers/ras/ras.c
> +++ b/drivers/ras/ras.c
> @@ -52,10 +52,51 @@ void log_non_standard_event(const guid_t *sec_type, const guid_t *fru_id,
>  	trace_non_standard_event(sec_type, fru_id, fru_text, sev, err, len);
>  }
>
> -void log_arm_hw_error(struct cper_sec_proc_arm *err)
> +void log_arm_hw_error(struct cper_sec_proc_arm *err, const u8 sev)
>  {
>  #if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
> -	trace_arm_event(err);
> +	struct cper_arm_err_info *err_info;
> +	struct cper_arm_ctx_info *ctx_info;
> +	u8 *ven_err_data;
> +	u32 ctx_len = 0;
> +	int n, sz, cpu;
> +	s32 vsei_len;
> +	u32 pei_len;
> +	u8 *pei_err;
> +	u8 *ctx_err;
> +
> +	pei_len = sizeof(struct cper_arm_err_info) * err->err_info_num;
> +	pei_err = (u8 *)err + sizeof(struct cper_sec_proc_arm);
> +
> +	err_info = (struct cper_arm_err_info *)(err + 1);
> +	ctx_info = (struct cper_arm_ctx_info *)(err_info + err->err_info_num);
> +	ctx_err = (u8 *)ctx_info;
> +	for (n = 0; n < err->context_info_num; n++) {
> +		sz = sizeof(struct cper_arm_ctx_info) + ctx_info->size;
> +		ctx_info = (struct cper_arm_ctx_info *)((long)ctx_info + sz);
> +		ctx_len += sz;
> +	}
> +
> +	vsei_len = err->section_length - (sizeof(struct cper_sec_proc_arm) +
> +					  pei_len + ctx_len);
> +	if (vsei_len < 0) {
> +		pr_warn(FW_BUG
> +			"section length: %d\n", err->section_length);
> +		pr_warn(FW_BUG
> +			"section length is too small\n");
> +		pr_warn(FW_BUG
> +			"firmware-generated error record is incorrect\n");
> +		vsei_len = 0;
> +	}
> +	ven_err_data = (u8 *)ctx_info;
> +
> +	cpu = GET_LOGICAL_INDEX(err->mpidr);
> +	/* when return value is invalid, set cpu index to -1 */
> +	if (cpu < 0)
> +		cpu = -1;
> +
> +	trace_arm_event(err, pei_err, pei_len, ctx_err, ctx_len,
> +			ven_err_data, (u32)vsei_len, sev, cpu);
>  #endif
>  }
>
> diff --git a/include/linux/ras.h b/include/linux/ras.h
> index a64182bc72ad..6025afe5736a 100644
> --- a/include/linux/ras.h
> +++ b/include/linux/ras.h
> @@ -24,8 +24,7 @@ int __init parse_cec_param(char *str);
>  void log_non_standard_event(const guid_t *sec_type,
>  			    const guid_t *fru_id, const char *fru_text,
>  			    const u8 sev, const u8 *err, const u32 len);
> -void log_arm_hw_error(struct cper_sec_proc_arm *err);
> -
> +void log_arm_hw_error(struct cper_sec_proc_arm *err, const u8 sev);
>  #else
>  static inline void
>  log_non_standard_event(const guid_t *sec_type,
> @@ -33,7 +32,7 @@ log_non_standard_event(const guid_t *sec_type,
>  		       const u8 sev, const u8 *err, const u32 len)
>  { return; }
>  static inline void
> -log_arm_hw_error(struct cper_sec_proc_arm *err) { return; }
> +log_arm_hw_error(struct cper_sec_proc_arm *err, const u8 sev) { return; }
>  #endif
>
>  struct atl_err {
> @@ -52,5 +51,14 @@ static inline void amd_retire_dram_row(struct atl_err *err) { }
>  static inline unsigned long
>  amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err) { return -EINVAL; }
>  #endif /* CONFIG_AMD_ATL */
> -

I'd keep a blank line here for readability.

> +#if defined(CONFIG_ARM) || defined(CONFIG_ARM64)
> +#include <asm/smp_plat.h>
> +/*
> + * Include ARM specific SMP header which provides a function mapping mpidr to
> + * cpu logical index.
> + */
> +#define GET_LOGICAL_INDEX(mpidr) get_logical_index(mpidr & MPIDR_HWID_BITMASK)
> +#else
> +#define GET_LOGICAL_INDEX(mpidr) -EINVAL
> +#endif /* CONFIG_ARM || CONFIG_ARM64 */
>  #endif /* __RAS_H__ */
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index 7c47151d5c72..ce5214f008eb 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -168,11 +168,24 @@ TRACE_EVENT(mc_event,
>   * This event is generated when hardware detects an ARM processor error
>   * has occurred. UEFI 2.6 spec section N.2.4.4.
>   */
> +#define APEIL "ARM Processor Err Info data len"
> +#define APEID "ARM Processor Err Info raw data"
> +#define APECIL "ARM Processor Err Context Info data len"
> +#define APECID "ARM Processor Err Context Info raw data"
> +#define VSEIL "Vendor Specific Err Info data len"
> +#define VSEID "Vendor Specific Err Info raw data"

I don't think I'd have bothered with these defines, but it doesn't
really matter. traceprintk is strictly for debug convenience etc., so
how it is formatted is not vital. You could perhaps have used a shorter
description such as
"Vendor Specific Err Info (Length %d): %s"
but that would be inconsistent with the existing entries.

>  TRACE_EVENT(arm_event,
>
> -	TP_PROTO(const struct cper_sec_proc_arm *proc),
> +	TP_PROTO(const struct cper_sec_proc_arm *proc, const u8 *pei_err,
> +		 const u32 pei_len,
> +		 const u8 *ctx_err,
> +		 const u32 ctx_len,
> +		 const u8 *oem,
> +		 const u32 oem_len,
> +		 u8 sev,
> +		 int cpu),
>
> -	TP_ARGS(proc),
> +	TP_ARGS(proc, pei_err, pei_len, ctx_err, ctx_len, oem, oem_len, sev, cpu),
>
>  	TP_STRUCT__entry(
>  		__field(u64, mpidr)
> @@ -180,6 +193,14 @@ TRACE_EVENT(arm_event,
>  		__field(u32, running_state)
>  		__field(u32, psci_state)
>  		__field(u8, affinity)
> +		__field(u32, pei_len)
> +		__dynamic_array(u8, buf, pei_len)

Can we do better than naming these buf, buf1, buf2?
The code below would be easier to read if they were
pei_buf, ctx_buf, oem_buf.

> +		__field(u32, ctx_len)
> +		__dynamic_array(u8, buf1, ctx_len)
> +		__field(u32, oem_len)
> +		__dynamic_array(u8, buf2, oem_len)
> +		__field(u8, sev)
> +		__field(int, cpu)
>  	),
>
>  	TP_fast_assign(
> @@ -199,12 +220,29 @@ TRACE_EVENT(arm_event,
>  			__entry->running_state = ~0;
>  			__entry->psci_state = ~0;
>  		}
> +		__entry->pei_len = pei_len;
> +		memcpy(__get_dynamic_array(buf), pei_err, pei_len);
> +		__entry->ctx_len = ctx_len;
> +		memcpy(__get_dynamic_array(buf1), ctx_err, ctx_len);
> +		__entry->oem_len = oem_len;
> +		memcpy(__get_dynamic_array(buf2), oem, oem_len);
> +		__entry->sev = sev;
> +		__entry->cpu = cpu;
>  	),
>
> -	TP_printk("affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
> -		  "running state: %d; PSCI state: %d",
> +	TP_printk("cpu: %d; error: %d; affinity level: %d; MPIDR: %016llx; MIDR: %016llx; "
> +		  "running state: %d; PSCI state: %d; "
> +		  "%s: %d; %s: %s; %s: %d; %s: %s; %s: %d; %s: %s",
> +		  __entry->cpu,
> +		  __entry->sev,
>  		  __entry->affinity, __entry->mpidr, __entry->midr,
> -		  __entry->running_state, __entry->psci_state)
> +		  __entry->running_state, __entry->psci_state,
> +		  APEIL, __entry->pei_len, APEID,
> +		  __print_hex(__get_dynamic_array(buf), __entry->pei_len),
> +		  APECIL, __entry->ctx_len, APECID,
> +		  __print_hex(__get_dynamic_array(buf1), __entry->ctx_len),
> +		  VSEIL, __entry->oem_len, VSEID,
> +		  __print_hex(__get_dynamic_array(buf2), __entry->oem_len))
>  );
>
>  /*
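
For readers following the section layout described in the commit message (fixed header, then ERR_INFO_NUM error information structures, then CONTEXT_INFO_NUM variable-sized context structures, then vendor data), here is a minimal user-space sketch of the same split. It is not the kernel code: `arm_section_split()`, `struct arm_section_layout` and the `ARM_*_SIZE` constants are hypothetical names, and the sizes (40-byte header, 32-byte error information structure, 8-byte context header) and field offsets are assumptions taken from the UEFI CPER layout. Unlike the hunk above, the sketch checks every step against the section length before reading the next context header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Sizes assumed from the UEFI ARM Processor Error Section layout. */
#define ARM_SEC_HDR_SIZE   40u	/* fixed header (table N.16) */
#define ARM_ERR_INFO_SIZE  32u	/* one error information structure */
#define ARM_CTX_HDR_SIZE    8u	/* context header; register array follows */

/* Hypothetical helper type: offsets and lengths of the three blobs. */
struct arm_section_layout {
	size_t pei_off, pei_len;	/* processor error information */
	size_t ctx_off, ctx_len;	/* processor context information */
	size_t oem_off, oem_len;	/* vendor specific error information */
};

static uint16_t read_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the record does not fit in its own length. */
int arm_section_split(const uint8_t *sec, size_t avail,
		      struct arm_section_layout *out)
{
	size_t sec_len, off, n, ctx_num;

	assert(sec != NULL && out != NULL);

	if (avail < ARM_SEC_HDR_SIZE)
		return -1;

	/* Assumed offsets: ERR_INFO_NUM at 4, CONTEXT_INFO_NUM at 6,
	 * Section Length at 8. */
	sec_len = read_le32(sec + 8);
	if (sec_len < ARM_SEC_HDR_SIZE || sec_len > avail)
		return -1;

	out->pei_off = ARM_SEC_HDR_SIZE;
	out->pei_len = (size_t)read_le16(sec + 4) * ARM_ERR_INFO_SIZE;
	if (out->pei_len > sec_len - out->pei_off)
		return -1;

	off = out->pei_off + out->pei_len;
	out->ctx_off = off;
	ctx_num = read_le16(sec + 6);
	for (n = 0; n < ctx_num; n++) {
		size_t sz;

		/* The context header must fit before its size is read. */
		if (sec_len - off < ARM_CTX_HDR_SIZE)
			return -1;
		sz = read_le32(sec + off + 4);
		if (sz > sec_len - off - ARM_CTX_HDR_SIZE)
			return -1;
		off += ARM_CTX_HDR_SIZE + sz;
	}
	out->ctx_len = off - out->ctx_off;

	/* Whatever is left belongs to the vendor specific data. */
	out->oem_off = off;
	out->oem_len = sec_len - off;
	return 0;
}
```

A caller would pass the raw section bytes and then hand the three (offset, length) pairs to whatever consumes them, much as the patch passes pei_err/pei_len, ctx_err/ctx_len and ven_err_data/vsei_len to the trace event.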
On Mon, 8 Jul 2024 13:18:13 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Sometimes it is desired to produce a single log line for errors.
> Add a new helper function for such purpose.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
> ---
>  drivers/firmware/efi/cper.c | 43 +++++++++++++++++++++++++++++++++++++
>  include/linux/cper.h        |  2 ++
>  2 files changed, 45 insertions(+)
>
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index 7d2cdd9e2227..f8c8a15cd527 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -106,6 +106,49 @@ void cper_print_bits(const char *pfx, unsigned int bits,
>  	printk("%s\n", buf);
>  }
>
> +/*
> + * cper_bits_to_str - return a string for set bits
> + * @buf: buffer to store the output string
> + * @buf_size: size of the output string buffer
> + * @bits: bit mask
> + * @strs: string array, indexed by bit position
> + * @strs_size: size of the string array: @strs
> + * @mask: a continuous bitmask used to detect the first valid bit of the
> + *	bitmap.
> + *
> + * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits
> + * mask, add the corresponding string describing the bit in @strs to @buf.

It would be good to document what the return value is. Also, I note
that some fixes for this doc are in patch 6; they should be here.

I wonder if it would be better to return the number of bytes filled?
Currently the return value isn't used, but that feels potentially more
useful than returning the buffer and someone having to run strlen() on
it if they want to append something afterwards. It also allows
detection of an out-of-space condition.

> + */
> +char *cper_bits_to_str(char *buf, int buf_size, unsigned long bits,
> +		       const char * const strs[], unsigned int strs_size)
> +{
> +	int len = buf_size;
> +	char *str = buf;
> +	int i, size;
> +
> +	*buf = '\0';
> +
> +	for_each_set_bit(i, &bits, strs_size) {
> +		if (!(bits & (1U << (i))))
> +			continue;

How would that happen? We are only entering the loop body if that
condition is true.

> +
> +		if (*buf && len > 0) {
> +			*str = '|';
> +			len--;
> +			str++;
> +		}
> +
> +		size = strscpy(str, strs[i], len);
> +		if (size < 0)
> +			break;
> +
> +		len -= size;
> +		str += size;
> +	}
> +	return buf;
> +}
> +EXPORT_SYMBOL_GPL(cper_bits_to_str);
> +
>  static const char * const proc_type_strs[] = {
>  	"IA32/X64",
>  	"IA64",
> diff --git a/include/linux/cper.h b/include/linux/cper.h
> index 265b0f8fc0b3..c2f14b916bfb 100644
> --- a/include/linux/cper.h
> +++ b/include/linux/cper.h
> @@ -584,6 +584,8 @@ const char *cper_mem_err_type_str(unsigned int);
>  const char *cper_mem_err_status_str(u64 status);
>  void cper_print_bits(const char *prefix, unsigned int bits,
>  		     const char * const strs[], unsigned int strs_size);
> +char *cper_bits_to_str(char *buf, int buf_size, unsigned long bits,
> +		       const char * const strs[], unsigned int strs_size);
>  void cper_mem_err_pack(const struct cper_sec_mem_err *,
>  		       struct cper_mem_err_compact *);
>  const char *cper_mem_err_unpack(struct trace_seq *,
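
To make the "return the number of bytes filled" suggestion concrete, here is a rough user-space sketch of what such a helper could look like. It is only an illustration: `bits_to_str_len()` is a hypothetical name, it uses snprintf() where the kernel code uses strscpy(), and the choice to return -1 and drop the partially written entry on overflow is an assumption, not something the thread settled on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Join the names of all set bits with '|'.
 * Returns the number of bytes written (excluding the NUL), or -1 if the
 * next entry would not fit; in that case @buf keeps the entries that did.
 */
int bits_to_str_len(char *buf, size_t buf_size, unsigned long bits,
		    const char * const strs[], unsigned int strs_size)
{
	size_t used = 0;
	unsigned int i;

	assert(buf != NULL && buf_size > 0);
	buf[0] = '\0';

	for (i = 0; i < strs_size && i < sizeof(bits) * 8; i++) {
		int n;

		if (!(bits & (1UL << i)))
			continue;

		/* Prefix a separator for every entry but the first. */
		n = snprintf(buf + used, buf_size - used, "%s%s",
			     used ? "|" : "", strs[i]);
		if (n < 0 || (size_t)n >= buf_size - used) {
			buf[used] = '\0';	/* drop the partial entry */
			return -1;
		}
		used += (size_t)n;
	}
	return (int)used;
}
```

With a return value like this, a caller that wants to append more text can continue at buf + ret without calling strlen(), and a negative value signals that the buffer was too small.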
On Mon, 8 Jul 2024 13:18:15 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> There are two kernel-doc like descriptions at cper, which is used
> by other parts of cper and on ghes driver. They both have kernel-doc
> like descriptions.
>
> Change the tags for them to be actual kernel-doc tags and add them
> to the driver-api documentaion at the UEFI section.
>
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

Other than the blob at the end, which belongs in the earlier patch, LGTM.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  Documentation/driver-api/firmware/efi/index.rst | 11 ++++++++---
>  drivers/firmware/efi/cper.c                     | 10 ++++------
>  2 files changed, 12 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/driver-api/firmware/efi/index.rst b/Documentation/driver-api/firmware/efi/index.rst
> index 4fe8abba9fc6..5a6b6229592c 100644
> --- a/Documentation/driver-api/firmware/efi/index.rst
> +++ b/Documentation/driver-api/firmware/efi/index.rst
> @@ -1,11 +1,16 @@
>  .. SPDX-License-Identifier: GPL-2.0
>
> -============
> -UEFI Support
> -============
> +====================================================
> +Unified Extensible Firmware Interface (UEFI) Support
> +====================================================
>
>  UEFI stub library functions
>  ===========================
>
>  .. kernel-doc:: drivers/firmware/efi/libstub/mem.c
>     :internal:
> +
> +UEFI Common Platform Error Record (CPER) functions
> +==================================================
> +
> +.. kernel-doc:: drivers/firmware/efi/cper.c
> diff --git a/drivers/firmware/efi/cper.c b/drivers/firmware/efi/cper.c
> index f8c8a15cd527..2785c8ea8ad8 100644
> --- a/drivers/firmware/efi/cper.c
> +++ b/drivers/firmware/efi/cper.c
> @@ -69,7 +69,7 @@ const char *cper_severity_str(unsigned int severity)
>  }
>  EXPORT_SYMBOL_GPL(cper_severity_str);
>
> -/*
> +/**
>   * cper_print_bits - print strings for set bits
>   * @pfx: prefix for each line, including log level and prefix string
>   * @bits: bit mask
> @@ -106,18 +106,16 @@ void cper_print_bits(const char *pfx, unsigned int bits,
>  	printk("%s\n", buf);
>  }
>
> -/*
> +/**
>   * cper_bits_to_str - return a string for set bits
>   * @buf: buffer to store the output string
>   * @buf_size: size of the output string buffer
>   * @bits: bit mask
>   * @strs: string array, indexed by bit position
>   * @strs_size: size of the string array: @strs
> - * @mask: a continuous bitmask used to detect the first valid bit of the
> - *	bitmap.
>   *
> - * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits
> - * mask, add the corresponding string describing the bit in @strs to @buf.
> + * Add to @buf the bitmask in hexadecimal. Then, for each set bit in @bits,
> + * add the corresponding string describing the bit in @strs to @buf.

This is in the wrong patch. No point in introducing wrong docs to fix
later.

>   */
>  char *cper_bits_to_str(char *buf, int buf_size, unsigned long bits,
>  		       const char * const strs[], unsigned int strs_size)
On Mon, 8 Jul 2024 13:18:14 +0200
Mauro Carvalho Chehab <mchehab+huawei@kernel.org> wrote:

> Up to UEFI spec, the type byte of CPER struct for ARM processor was

Up to UEFI spec XXX? The version number is missing here.

> defined simply as:
>
> Type at byte offset 4:
>
> 	- Cache error
> 	- TLB Error
> 	- Bus Error
> 	- Micro-architectural Error
> 	All other values are reserved
>
> Yet, there was no information about how this would be encoded.
>
> Spec 2.9A errata corrected it by defining:
>
> 	- Bit 1 - Cache Error
> 	- Bit 2 - TLB Error
> 	- Bit 3 - Bus Error
> 	- Bit 4 - Micro-architectural Error
> 	All other values are reserved
>
> That actually aligns with the values already defined on older
> versions at N.2.4.1. Generic Processor Error Section.
>
> Spec 2.10 also preserve the same encoding as 2.9A
>
> Adjust CPER and GHES handling code for both generic and ARM
> processors to properly handle UEFI 2.9A and 2.10 encoding.
>
> Link: https://uefi.org/specs/UEFI/2.10/Apx_N_Common_Platform_Error_Record.html#arm-processor-error-information
> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>

With the above tidied up,

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
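
As an illustration of the bit encoding quoted in the commit message (bit 1 cache, bit 2 TLB, bit 3 bus, bit 4 micro-architectural, everything else reserved), a small user-space sketch of a decoder follows. The macro and function names are hypothetical and are not the kernel's; only the bit positions come from the text above.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as listed in the commit message; names are made up. */
#define ARM_ERR_TYPE_CACHE	(1u << 1)
#define ARM_ERR_TYPE_TLB	(1u << 2)
#define ARM_ERR_TYPE_BUS	(1u << 3)
#define ARM_ERR_TYPE_UARCH	(1u << 4)
#define ARM_ERR_TYPE_MASK	(ARM_ERR_TYPE_CACHE | ARM_ERR_TYPE_TLB | \
				 ARM_ERR_TYPE_BUS | ARM_ERR_TYPE_UARCH)

/*
 * Fill @names with one string per error type set in @type.
 * Returns how many entries were filled (0 to 4).
 */
unsigned int arm_err_type_decode(uint8_t type, const char *names[4])
{
	unsigned int n = 0;

	assert(names != 0);

	if (type & ARM_ERR_TYPE_CACHE)
		names[n++] = "Cache Error";
	if (type & ARM_ERR_TYPE_TLB)
		names[n++] = "TLB Error";
	if (type & ARM_ERR_TYPE_BUS)
		names[n++] = "Bus Error";
	if (type & ARM_ERR_TYPE_UARCH)
		names[n++] = "Micro-architectural Error";
	return n;
}

/* Returns the bits of @type that fall outside the defined error types. */
uint8_t arm_err_type_reserved(uint8_t type)
{
	return (uint8_t)(type & ~ARM_ERR_TYPE_MASK);
}
```

Because the field is a bitmask, a value such as 0x06 reports both a cache error and a TLB error, which a decoder treating the byte as a plain enumeration would not handle.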