Message ID | 20230601-cxl-cper-v3-0-0189d61f7956@intel.com |
---|---|
Series | efi/cxl-cper: Report CPER CXL component events through trace events |
On 11/1/2023 2:11 PM, Ira Weiny wrote:

[snip]

> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 44a21ab7add5..37add91068c0 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1,5 +1,6 @@
>  // SPDX-License-Identifier: GPL-2.0-only
>  /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
> +#include <asm-generic/unaligned.h>
>  #include <linux/io-64-nonatomic-lo-hi.h>
>  #include <linux/moduleparam.h>
>  #include <linux/module.h>
> @@ -748,6 +749,60 @@ static bool cxl_event_int_is_fw(u8 setting)
>  	return mode == CXL_INT_FW;
>  }
>
> +#define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
> +static int cxl_cper_event_call(struct notifier_block *nb, unsigned long action,
> +			       void *data)
> +{
> +	struct cxl_cper_notifier_data *nd = data;
> +	struct cper_cxl_event_devid *device_id = &nd->rec->hdr.device_id;
> +	enum cxl_event_log_type log_type;
> +	struct cxl_memdev_state *mds;
> +	struct cxl_dev_state *cxlds;
> +	struct pci_dev *pdev;
> +	unsigned int devfn;
> +	u32 hdr_flags;
> +
> +	mds = container_of(nb, struct cxl_memdev_state, cxl_cper_nb);
> +
> +	/* PCI_DEVFN() would require 2 extra bit shifts; skip those */
> +	devfn = (device_id->slot_num & 0xfff8) | (device_id->func_num & 0x07);

devfn = PCI_DEVFN(device_id->device_num, device_id->func_num) should
also work correct?

> +	pdev = pci_get_domain_bus_and_slot(device_id->segment_num,
> +					   device_id->bus_num, devfn);
> +	cxlds = pci_get_drvdata(pdev);
> +	if (cxlds != &mds->cxlds) {

Do we need a error message here?

Thanks,
Smita

> +		pci_dev_put(pdev);
> +		return NOTIFY_DONE;
> +	}
> +
> +	/* Fabricate a log type */
> +	hdr_flags = get_unaligned_le24(nd->rec->event.generic.hdr.flags);
> +	log_type = FIELD_GET(CXL_EVENT_HDR_FLAGS_REC_SEVERITY, hdr_flags);
> +
> +	cxl_event_trace_record(mds->cxlds.cxlmd, log_type, nd->event_type,
> +			       &nd->rec->event);
> +	pci_dev_put(pdev);
> +	return NOTIFY_OK;
> +}
> +
> +static void cxl_unregister_cper_events(void *_mds)
> +{
> +	struct cxl_memdev_state *mds = _mds;
> +
> +	unregister_cxl_cper_notifier(&mds->cxl_cper_nb);
> +}
> +
> +static void register_cper_events(struct cxl_memdev_state *mds)
> +{
> +	mds->cxl_cper_nb.notifier_call = cxl_cper_event_call;
> +
> +	if (register_cxl_cper_notifier(&mds->cxl_cper_nb)) {
> +		dev_err(mds->cxlds.dev, "CPER registration failed\n");
> +		return;
> +	}
> +
> +	devm_add_action_or_reset(mds->cxlds.dev, cxl_unregister_cper_events, mds);
> +}
> +
>  static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  			    struct cxl_memdev_state *mds)
>  {
> @@ -758,8 +813,10 @@ static int cxl_event_config(struct pci_host_bridge *host_bridge,
>  	 * When BIOS maintains CXL error reporting control, it will process
>  	 * event records. Only one agent can do so.
>  	 */
> -	if (!host_bridge->native_cxl_error)
> +	if (!host_bridge->native_cxl_error) {
> +		register_cper_events(mds);
>  		return 0;
> +	}
>
>  	rc = cxl_mem_alloc_event_buf(mds);
>  	if (rc)
>
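For readers following the "Fabricate a log type" step in the quoted patch, here is a minimal userspace sketch of what the two kernel helpers do there: get_unaligned_le24() assembles the 3-byte little-endian flags field, and FIELD_GET() with GENMASK(1, 0) keeps the low two bits. The names le24_to_u32(), REC_SEVERITY_MASK and record_severity() are made up for this illustration and are not in the patch; only the bit layout is taken from the quoted code.

```c
#include <stdint.h>

/* Hypothetical userspace stand-in for the kernel's get_unaligned_le24(). */
static uint32_t le24_to_u32(const uint8_t b[3])
{
	return (uint32_t)b[0] | ((uint32_t)b[1] << 8) | ((uint32_t)b[2] << 16);
}

/* Equivalent of GENMASK(1, 0): bits 1 and 0 set. */
#define REC_SEVERITY_MASK 0x3u

/*
 * Mirrors FIELD_GET(CXL_EVENT_HDR_FLAGS_REC_SEVERITY, hdr_flags) from the
 * quoted patch. The mask starts at bit 0, so no shift is needed.
 */
static unsigned int record_severity(const uint8_t flags[3])
{
	return le24_to_u32(flags) & REC_SEVERITY_MASK;
}
```

A caller would pass the raw three flag bytes from the event header; the patch then uses the resulting two-bit value directly as the log type.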
On Wed, 01 Nov 2023 14:11:18 -0700
Ira Weiny <ira.weiny@intel.com> wrote:

> The uuid printed in the well known events is redundant. The uuid
> defines what the event was.
>
> Remove the uuid from the known events and only report it in the generic
> event as it remains informative there.
>
> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>

Removing the print is fine, but look like this also removes the actual
trace point field. That's userspace ABI. Expanding it is fine, but taking
fields away is more problematic.

Are we sure we don't break anyone? Shiju, will rasdaemon be fine with this
change?

Thanks,

Jonathan

> ---
>  drivers/cxl/core/trace.h | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index a0b5819bc70b..79ed03637604 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -189,7 +189,6 @@ TRACE_EVENT(cxl_overflow,
>  	__string(memdev, dev_name(&cxlmd->dev)) \
>  	__string(host, dev_name(cxlmd->dev.parent)) \
>  	__field(int, log) \
> -	__field_struct(uuid_t, hdr_uuid) \
>  	__field(u64, serial) \
>  	__field(u32, hdr_flags) \
>  	__field(u16, hdr_handle) \
> @@ -203,7 +202,6 @@ TRACE_EVENT(cxl_overflow,
>  	__assign_str(host, dev_name((cxlmd)->dev.parent)); \
>  	__entry->log = (l); \
>  	__entry->serial = (cxlmd)->cxlds->serial; \
> -	memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t)); \
>  	__entry->hdr_length = (hdr).length; \
>  	__entry->hdr_flags = get_unaligned_le24((hdr).flags); \
>  	__entry->hdr_handle = le16_to_cpu((hdr).handle); \
> @@ -212,12 +210,12 @@ TRACE_EVENT(cxl_overflow,
>  	__entry->hdr_maint_op_class = (hdr).maint_op_class
>
>  #define CXL_EVT_TP_printk(fmt, ...) \
> -	TP_printk("memdev=%s host=%s serial=%lld log=%s : time=%llu uuid=%pUb " \
> +	TP_printk("memdev=%s host=%s serial=%lld log=%s : time=%llu " \
>  		"len=%d flags='%s' handle=%x related_handle=%x " \
>  		"maint_op_class=%u : " fmt, \
>  		__get_str(memdev), __get_str(host), __entry->serial, \
>  		cxl_event_log_type_str(__entry->log), \
> -		__entry->hdr_timestamp, &__entry->hdr_uuid, __entry->hdr_length,\
> +		__entry->hdr_timestamp, __entry->hdr_length, \
>  		show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle, \
>  		__entry->hdr_related_handle, __entry->hdr_maint_op_class, \
>  		##__VA_ARGS__)
> @@ -231,15 +229,17 @@ TRACE_EVENT(cxl_generic_event,
>
>  	TP_STRUCT__entry(
>  		CXL_EVT_TP_entry
> +		__field_struct(uuid_t, hdr_uuid)
>  		__array(u8, data, CXL_EVENT_RECORD_DATA_LENGTH)
>  	),
>
>  	TP_fast_assign(
>  		CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr);
> +		memcpy(&__entry->hdr_uuid, &rec->hdr.id, sizeof(uuid_t));
>  		memcpy(__entry->data, &rec->data, CXL_EVENT_RECORD_DATA_LENGTH);
>  	),
>
> -	CXL_EVT_TP_printk("%s",
> +	CXL_EVT_TP_printk("uuid=%pUb %s", &__entry->hdr_uuid,
>  		__print_hex(__entry->data, CXL_EVENT_RECORD_DATA_LENGTH))
>  );
>
>
>-----Original Message----- >From: Jonathan Cameron <jonathan.cameron@huawei.com> >Sent: 03 November 2023 14:28 >To: Ira Weiny <ira.weiny@intel.com> >Cc: Dan Williams <dan.j.williams@intel.com>; Smita Koralahalli ><Smita.KoralahalliChannabasappa@amd.com>; Yazen Ghannam ><yazen.ghannam@amd.com>; Davidlohr Bueso <dave@stgolabs.net>; Dave >Jiang <dave.jiang@intel.com>; Alison Schofield <alison.schofield@intel.com>; >Vishal Verma <vishal.l.verma@intel.com>; Ard Biesheuvel <ardb@kernel.org>; >linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org; linux- >cxl@vger.kernel.org; Shiju Jose <shiju.jose@huawei.com> >Subject: Re: [PATCH RFC v3 1/6] cxl/trace: Remove uuid from event trace known >events > >On Wed, 01 Nov 2023 14:11:18 -0700 >Ira Weiny <ira.weiny@intel.com> wrote: > >> The uuid printed in the well known events is redundant. The uuid >> defines what the event was. >> >> Remove the uuid from the known events and only report it in the >> generic event as it remains informative there. >> >> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> >> Reviewed-by: Dan Williams <dan.j.williams@intel.com> >> Signed-off-by: Ira Weiny <ira.weiny@intel.com> > >Removing the print is fine, but look like this also removes the actual trace point >field. That's userspace ABI. Expanding it is fine, but taking fields away is more >problematic. > >Are we sure we don't break anyone? Shiju, will rasdaemon be fine with this >change? The field hdr_uuid is removed from the common CXL_EVT_TP_entry shared by the trace events cxl_generic_event, cxl_general_media, cxl_dram and cxl_memory_module . rasdaemon will break because of this while processing these trace events and also affects the corresponding error records in the SQLite data base. Rasdaemon needs update to avoid this. 
> >Thanks, > >Jonathan > Thanks, Shiju > > >> --- >> drivers/cxl/core/trace.h | 10 +++++----- >> 1 file changed, 5 insertions(+), 5 deletions(-) >> >> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h index >> a0b5819bc70b..79ed03637604 100644 >> --- a/drivers/cxl/core/trace.h >> +++ b/drivers/cxl/core/trace.h >> @@ -189,7 +189,6 @@ TRACE_EVENT(cxl_overflow, >> __string(memdev, dev_name(&cxlmd->dev)) \ >> __string(host, dev_name(cxlmd->dev.parent)) \ >> __field(int, log) \ >> - __field_struct(uuid_t, hdr_uuid) \ >> __field(u64, serial) \ >> __field(u32, hdr_flags) \ >> __field(u16, hdr_handle) \ >> @@ -203,7 +202,6 @@ TRACE_EVENT(cxl_overflow, >> __assign_str(host, dev_name((cxlmd)->dev.parent)); > \ >> __entry->log = (l); > \ >> __entry->serial = (cxlmd)->cxlds->serial; \ >> - memcpy(&__entry->hdr_uuid, &(hdr).id, sizeof(uuid_t)); > \ >> __entry->hdr_length = (hdr).length; > \ >> __entry->hdr_flags = get_unaligned_le24((hdr).flags); > \ >> __entry->hdr_handle = le16_to_cpu((hdr).handle); > \ >> @@ -212,12 +210,12 @@ TRACE_EVENT(cxl_overflow, >> __entry->hdr_maint_op_class = (hdr).maint_op_class >> >> #define CXL_EVT_TP_printk(fmt, ...) 
\ >> - TP_printk("memdev=%s host=%s serial=%lld log=%s : time=%llu >uuid=%pUb " \ >> + TP_printk("memdev=%s host=%s serial=%lld log=%s : time=%llu " > \ >> "len=%d flags='%s' handle=%x related_handle=%x " > \ >> "maint_op_class=%u : " fmt, > \ >> __get_str(memdev), __get_str(host), __entry->serial, > \ >> cxl_event_log_type_str(__entry->log), > \ >> - __entry->hdr_timestamp, &__entry->hdr_uuid, __entry- >>hdr_length,\ >> + __entry->hdr_timestamp, __entry->hdr_length, > \ >> show_hdr_flags(__entry->hdr_flags), __entry->hdr_handle, > \ >> __entry->hdr_related_handle, __entry->hdr_maint_op_class, > \ >> ##__VA_ARGS__) >> @@ -231,15 +229,17 @@ TRACE_EVENT(cxl_generic_event, >> >> TP_STRUCT__entry( >> CXL_EVT_TP_entry >> + __field_struct(uuid_t, hdr_uuid) >> __array(u8, data, CXL_EVENT_RECORD_DATA_LENGTH) >> ), >> >> TP_fast_assign( >> CXL_EVT_TP_fast_assign(cxlmd, log, rec->hdr); >> + memcpy(&__entry->hdr_uuid, &rec->hdr.id, sizeof(uuid_t)); >> memcpy(__entry->data, &rec->data, >CXL_EVENT_RECORD_DATA_LENGTH); >> ), >> >> - CXL_EVT_TP_printk("%s", >> + CXL_EVT_TP_printk("uuid=%pUb %s", &__entry->hdr_uuid, >> __print_hex(__entry->data, >CXL_EVENT_RECORD_DATA_LENGTH)) ); >> >>
Shiju Jose wrote: > > > >-----Original Message----- > >From: Jonathan Cameron <jonathan.cameron@huawei.com> > >Sent: 03 November 2023 14:28 > >To: Ira Weiny <ira.weiny@intel.com> > >Cc: Dan Williams <dan.j.williams@intel.com>; Smita Koralahalli > ><Smita.KoralahalliChannabasappa@amd.com>; Yazen Ghannam > ><yazen.ghannam@amd.com>; Davidlohr Bueso <dave@stgolabs.net>; Dave > >Jiang <dave.jiang@intel.com>; Alison Schofield <alison.schofield@intel.com>; > >Vishal Verma <vishal.l.verma@intel.com>; Ard Biesheuvel <ardb@kernel.org>; > >linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org; linux- > >cxl@vger.kernel.org; Shiju Jose <shiju.jose@huawei.com> > >Subject: Re: [PATCH RFC v3 1/6] cxl/trace: Remove uuid from event trace known > >events > > > >On Wed, 01 Nov 2023 14:11:18 -0700 > >Ira Weiny <ira.weiny@intel.com> wrote: > > > >> The uuid printed in the well known events is redundant. The uuid > >> defines what the event was. > >> > >> Remove the uuid from the known events and only report it in the > >> generic event as it remains informative there. > >> > >> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> > >> Reviewed-by: Dan Williams <dan.j.williams@intel.com> > >> Signed-off-by: Ira Weiny <ira.weiny@intel.com> > > > >Removing the print is fine, but look like this also removes the actual trace point > >field. That's userspace ABI. Expanding it is fine, but taking fields away is more > >problematic. > > > >Are we sure we don't break anyone? Shiju, will rasdaemon be fine with this > >change? > > The field hdr_uuid is removed from the common CXL_EVT_TP_entry shared by the > trace events cxl_generic_event, cxl_general_media, cxl_dram and cxl_memory_module . > rasdaemon will break because of this while processing these trace events > and also affects the corresponding error records in the SQLite data base. > Rasdaemon needs update to avoid this. > Ok we can leave the uuid field in easy enough. But does rasdaemon use the value of the field for anything? 
In other words does CPER record processing need to generate a proper UUID value? Ira
Smita Koralahalli wrote:
> On 11/1/2023 2:11 PM, Ira Weiny wrote:
> [snip]
> > +#define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
> > +static int cxl_cper_event_call(struct notifier_block *nb, unsigned long action,
> > +			       void *data)
> > +{
> > +	struct cxl_cper_notifier_data *nd = data;
> > +	struct cper_cxl_event_devid *device_id = &nd->rec->hdr.device_id;
> > +	enum cxl_event_log_type log_type;
> > +	struct cxl_memdev_state *mds;
> > +	struct cxl_dev_state *cxlds;
> > +	struct pci_dev *pdev;
> > +	unsigned int devfn;
> > +	u32 hdr_flags;
> > +
> > +	mds = container_of(nb, struct cxl_memdev_state, cxl_cper_nb);
> > +
> > +	/* PCI_DEVFN() would require 2 extra bit shifts; skip those */
> > +	devfn = (device_id->slot_num & 0xfff8) | (device_id->func_num & 0x07);
>
> devfn = PCI_DEVFN(device_id->device_num, device_id->func_num) should
> also work correct?

Device num is the slot number right shifted? If so then yes. I'm not an
expert on the PCIe nomenclature.

>
> > +	pdev = pci_get_domain_bus_and_slot(device_id->segment_num,
> > +					   device_id->bus_num, devfn);
> > +	cxlds = pci_get_drvdata(pdev);
> > +	if (cxlds != &mds->cxlds) {
>
> Do we need a error message here?

No, it is just that this event is not for this device. Another device will
process it. Or if there is no driver loaded for the device it will be
ignored. (Same as would happen if the events were coming through the log
because the driver is not monitoring the log.)

Ira
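To make the devfn exchange above concrete, here is a small userspace sketch comparing the two forms. PCI_DEVFN() is copied from the kernel's include/uapi/linux/pci.h. The helper names and the premise that the CPER slot_num already carries the device number in bits 7:3 are assumptions for illustration; that premise is exactly the open question in this thread, not something settled here.

```c
#include <stdint.h>

/* Same definition as the kernel's PCI_DEVFN() in include/uapi/linux/pci.h. */
#define PCI_DEVFN(slot, func) ((((slot) & 0x1f) << 3) | ((func) & 0x07))

/*
 * Form used in the patch. Assumes (unconfirmed in the thread) that
 * slot_num already holds the device number shifted into bits 7:3.
 */
static unsigned int devfn_from_shifted_slot(uint16_t slot_num, uint8_t func_num)
{
	return (slot_num & 0xfff8) | (func_num & 0x07);
}

/* Form suggested in review: start from an unshifted device number. */
static unsigned int devfn_from_device_num(uint8_t device_num, uint8_t func_num)
{
	return PCI_DEVFN(device_num, func_num);
}
```

Under that assumption the two agree whenever slot_num fits in 8 bits, for example slot_num 0xe8 (device 0x1d) with function 5 gives 0xed either way. They diverge if slot_num has bits set above bit 7, because the 0xfff8 mask keeps those bits while PCI_DEVFN() masks the device number to 5 bits.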
Hi Ira, >-----Original Message----- >From: Ira Weiny <ira.weiny@intel.com> >Sent: 06 November 2023 22:06 >To: Shiju Jose <shiju.jose@huawei.com>; Jonathan Cameron ><jonathan.cameron@huawei.com>; Ira Weiny <ira.weiny@intel.com> >Cc: Dan Williams <dan.j.williams@intel.com>; Smita Koralahalli ><Smita.KoralahalliChannabasappa@amd.com>; Yazen Ghannam ><yazen.ghannam@amd.com>; Davidlohr Bueso <dave@stgolabs.net>; Dave >Jiang <dave.jiang@intel.com>; Alison Schofield <alison.schofield@intel.com>; >Vishal Verma <vishal.l.verma@intel.com>; Ard Biesheuvel <ardb@kernel.org>; >linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org; linux- >cxl@vger.kernel.org >Subject: RE: [PATCH RFC v3 1/6] cxl/trace: Remove uuid from event trace known >events > >Shiju Jose wrote: >> >> >> >-----Original Message----- >> >From: Jonathan Cameron <jonathan.cameron@huawei.com> >> >Sent: 03 November 2023 14:28 >> >To: Ira Weiny <ira.weiny@intel.com> >> >Cc: Dan Williams <dan.j.williams@intel.com>; Smita Koralahalli >> ><Smita.KoralahalliChannabasappa@amd.com>; Yazen Ghannam >> ><yazen.ghannam@amd.com>; Davidlohr Bueso <dave@stgolabs.net>; Dave >> >Jiang <dave.jiang@intel.com>; Alison Schofield >> ><alison.schofield@intel.com>; Vishal Verma >> ><vishal.l.verma@intel.com>; Ard Biesheuvel <ardb@kernel.org>; >> >linux-efi@vger.kernel.org; linux-kernel@vger.kernel.org; linux- >> >cxl@vger.kernel.org; Shiju Jose <shiju.jose@huawei.com> >> >Subject: Re: [PATCH RFC v3 1/6] cxl/trace: Remove uuid from event >> >trace known events >> > >> >On Wed, 01 Nov 2023 14:11:18 -0700 >> >Ira Weiny <ira.weiny@intel.com> wrote: >> > >> >> The uuid printed in the well known events is redundant. The uuid >> >> defines what the event was. >> >> >> >> Remove the uuid from the known events and only report it in the >> >> generic event as it remains informative there. 
>> >> >> >> Reviewed-by: Davidlohr Bueso <dave@stgolabs.net> >> >> Reviewed-by: Dan Williams <dan.j.williams@intel.com> >> >> Signed-off-by: Ira Weiny <ira.weiny@intel.com> >> > >> >Removing the print is fine, but look like this also removes the >> >actual trace point field. That's userspace ABI. Expanding it is >> >fine, but taking fields away is more problematic. >> > >> >Are we sure we don't break anyone? Shiju, will rasdaemon be fine >> >with this change? >> >> The field hdr_uuid is removed from the common CXL_EVT_TP_entry shared >> by the trace events cxl_generic_event, cxl_general_media, cxl_dram and >cxl_memory_module . >> rasdaemon will break because of this while processing these trace >> events and also affects the corresponding error records in the SQLite data >base. >> Rasdaemon needs update to avoid this. >> > >Ok we can leave the uuid field in easy enough. > >But does rasdaemon use the value of the field for anything? In other words does >CPER record processing need to generate a proper UUID value? No. Presently used for logging purpose only in the rasdaemon. > >Ira Thanks, Shiju
Series status/background
========================

This is another RFC version of processing the CXL CPER records through the
CXL trace mechanisms as Dan mentioned in [1].

This raises the cxl event structures to a core header and rearranges them
such that they can be shared most efficiently, thus eliminating a memcpy
Smita noticed. Also BDF is used instead of serial number.

NOTE: I'm still fuzzy on which fields in the CPER record are correct to find
the BDF in the Linux code. It would be nice to double check those for me.

The CPER code remains compile tested only. The original event code
continues to pass cxl-test.

[1] https://lore.kernel.org/all/6528808cef2ba_780ef294c5@dwillia2-xfh.jf.intel.com.notmuch/

Cover letter
============

CXL Component Events, as defined by EFI 2.10 Section N.2.14, wrap a mostly
CXL event payload in an EFI Common Platform Error Record (CPER) record. If
a device is configured for firmware first, CXL event records are not sent
directly to the host.

The CXL sub-system uniquely has DPA to HPA translation information. It
also already properly decodes the event format. Send the CXL CPER records
to the CXL sub-system for processing.

With CXL event logs the device interrupts the host with events. In the EFI
case events are wrapped with device information which needs to be matched
with memdev devices the CXL driver is tracking.

A number of alternatives were considered to match the memdev with the CPER
record. The most robust was to find the PCI device via Bus, Device,
Function and match it to the memdev driver data.

CPER records are identified with GUIDs while CXL event logs contain UUIDs.
The UUID was previously printed for all events. But the UUID is redundant
information which presents unnecessary complexity when processing CPER
data. Remove the UUIDs from known events. Restructure the code to make
sharing the data between CPER/event logs most efficient.

Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
Changes in RFC v3:
- djbw: Share structures between CPER/event logs
- Smita: use BDF to resolve the memdev
- djbw/Smita: various cleanups
- Link to v2: https://lore.kernel.org/r/20230601-cxl-cper-v2-0-314d9c36ab02@intel.com

---
Ira Weiny (6):
      cxl/trace: Remove uuid from event trace known events
      cxl/events: Promote CXL event structures to a core header
      cxl/events: Remove UUID from non-generic event structures
      cxl/events: Create a CXL event union
      firmware/efi: Process CXL Component Events
      cxl/memdev: Register for and process CPER events

 drivers/cxl/core/mbox.c         |  57 +++++++++-----
 drivers/cxl/core/trace.h        |  18 ++---
 drivers/cxl/cxlmem.h            |  96 ++---------------------
 drivers/cxl/pci.c               |  59 +++++++++++++-
 drivers/firmware/efi/cper.c     |  15 ++++
 drivers/firmware/efi/cper_cxl.c |  40 ++++++++++
 drivers/firmware/efi/cper_cxl.h |  29 +++++++
 include/linux/cxl-event.h       | 160 ++++++++++++++++++++++++++++++++++++++
 tools/testing/cxl/test/mem.c    | 166 +++++++++++++++++++++++-----------------
 9 files changed, 451 insertions(+), 189 deletions(-)
---
base-commit: 1c8b86a3799f7e5be903c3f49fcdaee29fd385b5
change-id: 20230601-cxl-cper-26ffc839c6c6

Best regards,