Message ID | 20191025211226.2444-1-rrichter@marvell.com |
---|---|
State | Superseded |
Headers | show |
Series | EDAC, ghes: Fix locking and memory barrier issues | expand |
On Fri, Oct 25, 2019 at 09:13:14PM +0000, Robert Richter wrote: > The ghes registration and refcount is broken in several ways: ... > Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller") > Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path") > Signed-off-by: Borislav Petkov <bp@suse.de> > Signed-off-by: James Morse <james.morse@arm.com> > Signed-off-by: Robert Richter <rrichter@marvell.com> More staring at the rest later, just to point out that the SOB chain is wrong: my and James' SOB must be Co-developed-by: or Originally-by: or so. Or even as freetext in the commit message: "Based on patches by ... " -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
On Fri, Oct 25, 2019 at 09:13:14PM +0000, Robert Richter wrote: > The ghes registration and refcount is broken in several ways: > > * ghes_edac_register() returns with success for a 2nd instance even > if a first instance is still running. How? There's if (atomic_inc_return(&ghes_init) > 1) return 0; there. How would a second instance bypass this? > * The refcount was increased even if a registration failed. This > leads to stale counters preventing the device from being released. That I see - the return path should dec ghes_init. > * The ghes refcount may not be decremented properly on > unregistration. Always decrement the refcount once > ghes_edac_unregister() is called to keep the refcount sane. Right. > * The ghes_pvt pointer is handed to the irq handler before > registration finished. > > * The mci structure could be freed while the irq handler is running. > > Fix this by adding a mutex to ghes_edac_register(). This mutex > serializes instances to register and unregister. The refcount is only > increased if the registration succeeded. This makes sure the refcount > is in a consistent state after registering or unregistering a device. > Note: A spinlock cannot be used here as the code section may sleep. > > The ghes_pvt is protected by ghes_lock now. This better be documented in the driver with a comment above the ghes_pvt thing. I'm assuming the support for multiple instances is going ontop of this? If so, ghes_pvt needs to be an array or so. Also, if you do that, I think you should use mc_devices - see edac_mc_find() et al - instead of growing a special one just for this driver. > This ensures the pointer > is not updated before registration was finished or while the irq > handler is running. It is unset before unregistering the device > including necessary (implicit) memory barriers making the changes > visible to other cpus. Thus, the device can not be used anymore by an > interrupt. > > A refcount is needed. There can be multiple GHES structures being > defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error > Source, "Some platforms may describe multiple Generic Hardware Error > Source structures with different notification types, ..."). > > Another approach to use the mci's device refcount (get_device()) and > have a release function does not work here. A release function will be > called only for device_release() with the last put_device() call. The > device must be deleted *before* that with device_del(). This is only > possible by maintaining an own refcount. > > Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller") > Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path") > Signed-off-by: Borislav Petkov <bp@suse.de> > Signed-off-by: James Morse <james.morse@arm.com> > Signed-off-by: Robert Richter <rrichter@marvell.com> > --- > drivers/edac/ghes_edac.c | 78 ++++++++++++++++++++++++++++------------ > 1 file changed, 56 insertions(+), 22 deletions(-) ... > @@ -457,10 +461,12 @@ static struct acpi_platform_list plat_list[] = { > int ghes_edac_register(struct ghes *ghes, struct device *dev) > { > bool fake = false; > - int rc, num_dimm = 0; > + int rc = 0, num_dimm = 0; > struct mem_ctl_info *mci; > + struct ghes_edac_pvt *pvt; > struct edac_mc_layer layers[1]; > struct ghes_edac_dimm_fill dimm_fill; > + unsigned long flags; > int idx = -1; > > if (IS_ENABLED(CONFIG_X86)) { > @@ -472,11 +478,14 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) > idx = 0; > } > > + /* finish another registration/unregistration instance first */ > + mutex_lock(&ghes_reg_mutex); > + > /* > * We have only one logical memory controller to which all DIMMs belong. > */ > - if (atomic_inc_return(&ghes_init) > 1) > - return 0; > + if (atomic_inc_not_zero(&ghes_init)) That should probably be called ghes_instances now to make it obvious what it is. Also, you can make it a normal variable now since it is being modified under the mutex only. > + goto unlock; > > /* Get the number of DIMMs */ > dmi_walk(ghes_edac_count_dimms, &num_dimm); > @@ -494,12 +503,13 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) > mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, sizeof(struct ghes_edac_pvt)); > if (!mci) { > pr_info("Can't allocate memory for EDAC data\n"); > - return -ENOMEM; > + rc = -ENOMEM; > + goto unlock; > } > > - ghes_pvt = mci->pvt_info; > - ghes_pvt->ghes = ghes; > - ghes_pvt->mci = mci; > + pvt = mci->pvt_info; > + pvt->ghes = ghes; > + pvt->mci = mci; > > mci->pdev = dev; > mci->mtype_cap = MEM_FLAG_EMPTY; > @@ -541,23 +551,47 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) > if (rc < 0) { > pr_info("Can't register at EDAC core\n"); > edac_mc_free(mci); > - return -ENODEV; > + rc = -ENODEV; This needs to "goto unlock". > } > - return 0; > + > + spin_lock_irqsave(&ghes_lock, flags); > + ghes_pvt = pvt; > + spin_unlock_irqrestore(&ghes_lock, flags); > + > + /* only increment on success */ > + atomic_inc(&ghes_init); > + > +unlock: > + mutex_unlock(&ghes_reg_mutex); > + > + return rc; > } -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
On 05.11.19 12:05:11, Borislav Petkov wrote: > On Fri, Oct 25, 2019 at 09:13:14PM +0000, Robert Richter wrote: > > The ghes registration and refcount is broken in several ways: > > > > * ghes_edac_register() returns with success for a 2nd instance even > > if a first instance is still running. > > How? > > There's > > if (atomic_inc_return(&ghes_init) > 1) > return 0; > > there. How would a second instance bypass this? > > > * The refcount was increased even if a registration failed. This > > leads to stale counters preventing the device from being released. > > That I see - the return path should dec ghes_init. Here the bypass: 1) Entering ghes_edac_register(), instance #1 2) ghes_init atomicly set to 1, instance #1 3) Entering ghes_edac_register(), instance #2 4) ghes_init atomicly set to 2, instance #2 5) Leaving ghes_edac_register() with success, instance #2, note: First instance not yet done. 6) Leaving ghes_edac_register() with error, instance #1. There are the following issues with this now: 1) Refcount is 2, though it should be zero as init failed with an error. ghes_edac_register(), instance #2 returned already with success even though: 2) ghes initialization was not yet done, and 3) ghes initialization is going to fail. > > > * The ghes refcount may not be decremented properly on > > unregistration. Always decrement the refcount once > > ghes_edac_unregister() is called to keep the refcount sane. > > Right. > > > * The ghes_pvt pointer is handed to the irq handler before > > registration finished. > > > > * The mci structure could be freed while the irq handler is running. > > > > Fix this by adding a mutex to ghes_edac_register(). This mutex > > serializes instances to register and unregister. The refcount is only > > increased if the registration succeeded. This makes sure the refcount > > is in a consistent state after registering or unregistering a device. > > Note: A spinlock cannot be used here as the code section may sleep. > > > > The ghes_pvt is protected by ghes_lock now. > > This better be documented in the driver with a comment above the > ghes_pvt thing. Ok, will add a comment there. > > I'm assuming the support for multiple instances is going ontop of this? > If so, ghes_pvt needs to be an array or so. Also, if you do that, I > think you should use mc_devices - see edac_mc_find() et al - instead of > growing a special one just for this driver. Right, there will be a parent device created for each instance. But an array is not required as this also can be part of the private data. Only some sort of list head is needed to collect them all. > > > This ensures the pointer > > is not updated before registration was finished or while the irq > > handler is running. It is unset before unregistering the device > > including necessary (implicit) memory barriers making the changes > > visible to other cpus. Thus, the device can not be used anymore by an > > interrupt. > > > > A refcount is needed. There can be multiple GHES structures being > > defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error > > Source, "Some platforms may describe multiple Generic Hardware Error > > Source structures with different notification types, ..."). > > > > Another approach to use the mci's device refcount (get_device()) and > > have a release function does not work here. A release function will be > > called only for device_release() with the last put_device() call. The > > device must be deleted *before* that with device_del(). This is only > > possible by maintaining an own refcount. > > > > Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller") > > Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path") > > Signed-off-by: Borislav Petkov <bp@suse.de> > > Signed-off-by: James Morse <james.morse@arm.com> > > Signed-off-by: Robert Richter <rrichter@marvell.com> > > --- > > drivers/edac/ghes_edac.c | 78 ++++++++++++++++++++++++++++------------ > > 1 file changed, 56 insertions(+), 22 deletions(-) > > ... > > > @@ -457,10 +461,12 @@ static struct acpi_platform_list plat_list[] = { > > int ghes_edac_register(struct ghes *ghes, struct device *dev) > > { > > bool fake = false; > > - int rc, num_dimm = 0; > > + int rc = 0, num_dimm = 0; > > struct mem_ctl_info *mci; > > + struct ghes_edac_pvt *pvt; > > struct edac_mc_layer layers[1]; > > struct ghes_edac_dimm_fill dimm_fill; > > + unsigned long flags; > > int idx = -1; > > > > if (IS_ENABLED(CONFIG_X86)) { > > @@ -472,11 +478,14 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) > > idx = 0; > > } > > > > + /* finish another registration/unregistration instance first */ > > + mutex_lock(&ghes_reg_mutex); > > + > > /* > > * We have only one logical memory controller to which all DIMMs belong. > > */ > > - if (atomic_inc_return(&ghes_init) > 1) > > - return 0; > > + if (atomic_inc_not_zero(&ghes_init)) > > That should probably be called ghes_instances now to make it obvious > what it is. I already thought of renaming this to ghes_refcount. > > Also, you can make it a normal variable now since it is being modified > under the mutex only. I would rather avoid an own implementation of reference counting and also better switch from atomic_* to refcount_* which also provides range checks. There is no benefit doing this our own. Any objections here for the renaming and using the refcount API? > > > + goto unlock; > > > > /* Get the number of DIMMs */ > > dmi_walk(ghes_edac_count_dimms, &num_dimm); > > @@ -494,12 +503,13 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) > > mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, sizeof(struct ghes_edac_pvt)); > > if (!mci) { > > pr_info("Can't allocate memory for EDAC data\n"); > > - return -ENOMEM; > > + rc = -ENOMEM; > > + goto unlock; > > } > > > > - ghes_pvt = mci->pvt_info; > > - ghes_pvt->ghes = ghes; > > - ghes_pvt->mci = mci; > > + pvt = mci->pvt_info; > > + pvt->ghes = ghes; > > + pvt->mci = mci; > > > > mci->pdev = dev; > > mci->mtype_cap = MEM_FLAG_EMPTY; > > @@ -541,23 +551,47 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) > > if (rc < 0) { > > pr_info("Can't register at EDAC core\n"); > > edac_mc_free(mci); > > - return -ENODEV; > > + rc = -ENODEV; > > This needs to "goto unlock". Umm, correct. > > > } > > - return 0; > > + > > + spin_lock_irqsave(&ghes_lock, flags); > > + ghes_pvt = pvt; > > + spin_unlock_irqrestore(&ghes_lock, flags); > > + > > + /* only increment on success */ > > + atomic_inc(&ghes_init); > > + > > +unlock: > > + mutex_unlock(&ghes_reg_mutex); > > + > > + return rc; > > } Will, repost. Thanks for review. -Robert
Just to hold down what we discussed on IRC. On Tue, Nov 05, 2019 at 02:17:42PM +0000, Robert Richter wrote: > Here the bypass: > > 1) Entering ghes_edac_register(), instance #1 That was a misunderstanding, I know what Robert means now. > Right, there will be a parent device created for each instance. But an > array is not required as this also can be part of the private data. > Only some sort of list head is needed to collect them all. Ok. > I would rather avoid an own implementation of reference counting and > also better switch from atomic_* to refcount_* which also provides > range checks. There is no benefit doing this our own. > > Any objections here for the renaming and using the refcount API? None. Thx. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c index 0bb62857ffb2..622e4dde51c3 100644 --- a/drivers/edac/ghes_edac.c +++ b/drivers/edac/ghes_edac.c @@ -29,6 +29,9 @@ struct ghes_edac_pvt { static atomic_t ghes_init = ATOMIC_INIT(0); static struct ghes_edac_pvt *ghes_pvt; +/* GHES registration mutex */ +static DEFINE_MUTEX(ghes_reg_mutex); + /* * Sync with other, potentially concurrent callers of * ghes_edac_report_mem_error(). We don't know what the @@ -79,9 +82,8 @@ static void ghes_edac_count_dimms(const struct dmi_header *dh, void *arg) (*num_dimm)++; } -static int get_dimm_smbios_index(u16 handle) +static int get_dimm_smbios_index(struct mem_ctl_info *mci, u16 handle) { - struct mem_ctl_info *mci = ghes_pvt->mci; int i; for (i = 0; i < mci->tot_dimms; i++) { @@ -198,14 +200,11 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) enum hw_event_mc_err_type type; struct edac_raw_error_desc *e; struct mem_ctl_info *mci; - struct ghes_edac_pvt *pvt = ghes_pvt; + struct ghes_edac_pvt *pvt; unsigned long flags; char *p; u8 grain_bits; - if (!pvt) - return; - /* * We can do the locking below because GHES defers error processing * from NMI to IRQ context. Whenever that changes, we'd at least @@ -216,6 +215,10 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) spin_lock_irqsave(&ghes_lock, flags); + pvt = ghes_pvt; + if (!pvt) + goto unlock; + mci = pvt->mci; e = &mci->error_desc; @@ -348,7 +351,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) p += sprintf(p, "DIMM DMI handle: 0x%.4x ", mem_err->mem_dev_handle); - index = get_dimm_smbios_index(mem_err->mem_dev_handle); + index = get_dimm_smbios_index(mci, mem_err->mem_dev_handle); if (index >= 0) { e->top_layer = index; e->enable_per_layer_report = true; @@ -443,6 +446,7 @@ void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err) grain_bits, e->syndrome, pvt->detail_location); edac_raw_mc_handle_error(type, mci, e); +unlock: spin_unlock_irqrestore(&ghes_lock, flags); } @@ -457,10 +461,12 @@ static struct acpi_platform_list plat_list[] = { int ghes_edac_register(struct ghes *ghes, struct device *dev) { bool fake = false; - int rc, num_dimm = 0; + int rc = 0, num_dimm = 0; struct mem_ctl_info *mci; + struct ghes_edac_pvt *pvt; struct edac_mc_layer layers[1]; struct ghes_edac_dimm_fill dimm_fill; + unsigned long flags; int idx = -1; if (IS_ENABLED(CONFIG_X86)) { @@ -472,11 +478,14 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) idx = 0; } + /* finish another registration/unregistration instance first */ + mutex_lock(&ghes_reg_mutex); + /* * We have only one logical memory controller to which all DIMMs belong. */ - if (atomic_inc_return(&ghes_init) > 1) - return 0; + if (atomic_inc_not_zero(&ghes_init)) + goto unlock; /* Get the number of DIMMs */ dmi_walk(ghes_edac_count_dimms, &num_dimm); @@ -494,12 +503,13 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, sizeof(struct ghes_edac_pvt)); if (!mci) { pr_info("Can't allocate memory for EDAC data\n"); - return -ENOMEM; + rc = -ENOMEM; + goto unlock; } - ghes_pvt = mci->pvt_info; - ghes_pvt->ghes = ghes; - ghes_pvt->mci = mci; + pvt = mci->pvt_info; + pvt->ghes = ghes; + pvt->mci = mci; mci->pdev = dev; mci->mtype_cap = MEM_FLAG_EMPTY; @@ -541,23 +551,47 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev) if (rc < 0) { pr_info("Can't register at EDAC core\n"); edac_mc_free(mci); - return -ENODEV; + rc = -ENODEV; } - return 0; + + spin_lock_irqsave(&ghes_lock, flags); + ghes_pvt = pvt; + spin_unlock_irqrestore(&ghes_lock, flags); + + /* only increment on success */ + atomic_inc(&ghes_init); + +unlock: + mutex_unlock(&ghes_reg_mutex); + + return rc; } void ghes_edac_unregister(struct ghes *ghes) { struct mem_ctl_info *mci; + unsigned long flags; - if (!ghes_pvt) - return; + mutex_lock(&ghes_reg_mutex); if (atomic_dec_return(&ghes_init)) - return; + goto unlock; - mci = ghes_pvt->mci; + /* + * Wait for the irq handler being finished. + */ + spin_lock_irqsave(&ghes_lock, flags); + mci = ghes_pvt ? ghes_pvt->mci : NULL; ghes_pvt = NULL; - edac_mc_del_mc(mci->pdev); - edac_mc_free(mci); + spin_unlock_irqrestore(&ghes_lock, flags); + + if (!mci) + goto unlock; + + mci = edac_mc_del_mc(mci->pdev); + if (mci) + edac_mc_free(mci); + +unlock: + mutex_unlock(&ghes_reg_mutex); }