diff mbox series

EDAC, ghes: Fix locking and memory barrier issues

Message ID 20191025211226.2444-1-rrichter@marvell.com
State Superseded
Headers show
Series EDAC, ghes: Fix locking and memory barrier issues | expand

Commit Message

Robert Richter Oct. 25, 2019, 9:13 p.m. UTC
The ghes registration and refcount is broken in several ways:

 * ghes_edac_register() returns with success for a 2nd instance even
   if a first instance is still running. This is not correct as the
   first instance may fail later. A subsequent registration may not
   finish before the first. Parallel registrations must be avoided.

 * The refcount was increased even if a registration failed. This
   leads to stale counters preventing the device from being released.

 * The ghes refcount may not be decremented properly on
   unregistration. Always decrement the refcount once
   ghes_edac_unregister() is called to keep the refcount sane.

 * The ghes_pvt pointer is handed to the irq handler before
   registration finished.

 * The mci structure could be freed while the irq handler is running.

Fix this by adding a mutex to ghes_edac_register(). This mutex
serializes instances to register and unregister. The refcount is only
increased if the registration succeeded. This makes sure the refcount
is in a consistent state after registering or unregistering a device.
Note: A spinlock cannot be used here as the code section may sleep.

The ghes_pvt is protected by ghes_lock now. This ensures the pointer
is not updated before registration was finished or while the irq
handler is running. It is unset before unregistering the device
including necessary (implicit) memory barriers making the changes
visible to other cpus. Thus, the device can not be used anymore by an
interrupt.

A refcount is needed. There can be multiple GHES structures being
defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error
Source, "Some platforms may describe multiple Generic Hardware Error
Source structures with different notification types, ...").

Another approach to use the mci's device refcount (get_device()) and
have a release function does not work here. A release function will be
called only for device_release() with the last put_device() call. The
device must be deleted *before* that with device_del(). This is only
possible by maintaining an own refcount.

Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller")
Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path")
Signed-off-by: Borislav Petkov <bp@suse.de>

Signed-off-by: James Morse <james.morse@arm.com>

Signed-off-by: Robert Richter <rrichter@marvell.com>

---
 drivers/edac/ghes_edac.c | 78 ++++++++++++++++++++++++++++------------
 1 file changed, 56 insertions(+), 22 deletions(-)

-- 
2.20.1

Comments

Borislav Petkov Oct. 25, 2019, 9:53 p.m. UTC | #1
On Fri, Oct 25, 2019 at 09:13:14PM +0000, Robert Richter wrote:
> The ghes registration and refcount is broken in several ways:


...

> Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller")

> Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path")

> Signed-off-by: Borislav Petkov <bp@suse.de>

> Signed-off-by: James Morse <james.morse@arm.com>

> Signed-off-by: Robert Richter <rrichter@marvell.com>


More staring at the rest later, just to point out that the SOB chain is
wrong: my and James' SOB must be Co-developed-by: or Originally-by: or
so. Or even as freetext in the commit message:

"Based on patches by ... "

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Borislav Petkov Nov. 5, 2019, 11:05 a.m. UTC | #2
On Fri, Oct 25, 2019 at 09:13:14PM +0000, Robert Richter wrote:
> The ghes registration and refcount is broken in several ways:

> 

>  * ghes_edac_register() returns with success for a 2nd instance even

>    if a first instance is still running.


How?

There's

        if (atomic_inc_return(&ghes_init) > 1)
                return 0;

there. How would a second instance bypass this?

>  * The refcount was increased even if a registration failed. This

>    leads to stale counters preventing the device from being released.


That I see - the return path should dec ghes_init.

>  * The ghes refcount may not be decremented properly on

>    unregistration. Always decrement the refcount once

>    ghes_edac_unregister() is called to keep the refcount sane.


Right.

>  * The ghes_pvt pointer is handed to the irq handler before

>    registration finished.

> 

>  * The mci structure could be freed while the irq handler is running.

> 

> Fix this by adding a mutex to ghes_edac_register(). This mutex

> serializes instances to register and unregister. The refcount is only

> increased if the registration succeeded. This makes sure the refcount

> is in a consistent state after registering or unregistering a device.

> Note: A spinlock cannot be used here as the code section may sleep.

> 

> The ghes_pvt is protected by ghes_lock now.


This better be documented in the driver with a comment above the
ghes_pvt thing.

I'm assuming the support for multiple instances is going ontop of this?
If so, ghes_pvt needs to be an array or so. Also, if you do that, I
think you should use mc_devices - see edac_mc_find() et al - instead of
growing a special one just for this driver.

> This ensures the pointer

> is not updated before registration was finished or while the irq

> handler is running. It is unset before unregistering the device

> including necessary (implicit) memory barriers making the changes

> visible to other cpus. Thus, the device can not be used anymore by an

> interrupt.

> 

> A refcount is needed. There can be multiple GHES structures being

> defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error

> Source, "Some platforms may describe multiple Generic Hardware Error

> Source structures with different notification types, ...").

> 

> Another approach to use the mci's device refcount (get_device()) and

> have a release function does not work here. A release function will be

> called only for device_release() with the last put_device() call. The

> device must be deleted *before* that with device_del(). This is only

> possible by maintaining an own refcount.

> 

> Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller")

> Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path")

> Signed-off-by: Borislav Petkov <bp@suse.de>

> Signed-off-by: James Morse <james.morse@arm.com>

> Signed-off-by: Robert Richter <rrichter@marvell.com>

> ---

>  drivers/edac/ghes_edac.c | 78 ++++++++++++++++++++++++++++------------

>  1 file changed, 56 insertions(+), 22 deletions(-)


...

> @@ -457,10 +461,12 @@ static struct acpi_platform_list plat_list[] = {

>  int ghes_edac_register(struct ghes *ghes, struct device *dev)

>  {

>  	bool fake = false;

> -	int rc, num_dimm = 0;

> +	int rc = 0, num_dimm = 0;

>  	struct mem_ctl_info *mci;

> +	struct ghes_edac_pvt *pvt;

>  	struct edac_mc_layer layers[1];

>  	struct ghes_edac_dimm_fill dimm_fill;

> +	unsigned long flags;

>  	int idx = -1;

>  

>  	if (IS_ENABLED(CONFIG_X86)) {

> @@ -472,11 +478,14 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)

>  		idx = 0;

>  	}

>  

> +	/* finish another registration/unregistration instance first */

> +	mutex_lock(&ghes_reg_mutex);

> +

>  	/*

>  	 * We have only one logical memory controller to which all DIMMs belong.

>  	 */

> -	if (atomic_inc_return(&ghes_init) > 1)

> -		return 0;

> +	if (atomic_inc_not_zero(&ghes_init))


That should probably be called ghes_instances now to make it obvious
what it is.

Also, you can make it a normal variable now since it is being modified
under the mutex only.

> +		goto unlock;

>  

>  	/* Get the number of DIMMs */

>  	dmi_walk(ghes_edac_count_dimms, &num_dimm);

> @@ -494,12 +503,13 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)

>  	mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, sizeof(struct ghes_edac_pvt));

>  	if (!mci) {

>  		pr_info("Can't allocate memory for EDAC data\n");

> -		return -ENOMEM;

> +		rc = -ENOMEM;

> +		goto unlock;

>  	}

>  

> -	ghes_pvt	= mci->pvt_info;

> -	ghes_pvt->ghes	= ghes;

> -	ghes_pvt->mci	= mci;

> +	pvt		= mci->pvt_info;

> +	pvt->ghes	= ghes;

> +	pvt->mci	= mci;

>  

>  	mci->pdev = dev;

>  	mci->mtype_cap = MEM_FLAG_EMPTY;

> @@ -541,23 +551,47 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)

>  	if (rc < 0) {

>  		pr_info("Can't register at EDAC core\n");

>  		edac_mc_free(mci);

> -		return -ENODEV;

> +		rc = -ENODEV;


This needs to "goto unlock".

>  	}

> -	return 0;

> +

> +	spin_lock_irqsave(&ghes_lock, flags);

> +	ghes_pvt = pvt;

> +	spin_unlock_irqrestore(&ghes_lock, flags);

> +

> +	/* only increment on success */

> +	atomic_inc(&ghes_init);

> +

> +unlock:

> +	mutex_unlock(&ghes_reg_mutex);

> +

> +	return rc;

>  }


-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Robert Richter Nov. 5, 2019, 2:17 p.m. UTC | #3
On 05.11.19 12:05:11, Borislav Petkov wrote:
> On Fri, Oct 25, 2019 at 09:13:14PM +0000, Robert Richter wrote:

> > The ghes registration and refcount is broken in several ways:

> > 

> >  * ghes_edac_register() returns with success for a 2nd instance even

> >    if a first instance is still running.

> 

> How?

> 

> There's

> 

>         if (atomic_inc_return(&ghes_init) > 1)

>                 return 0;

> 

> there. How would a second instance bypass this?

> 

> >  * The refcount was increased even if a registration failed. This

> >    leads to stale counters preventing the device from being released.

> 

> That I see - the return path should dec ghes_init.


Here the bypass:

1) Entering ghes_edac_register(), instance #1

2) ghes_init atomicly set to 1, instance #1

3) Entering ghes_edac_register(), instance #2

4) ghes_init atomicly set to 2, instance #2

5) Leaving ghes_edac_register() with success, instance #2, note: First
   instance not yet done.

6) Leaving ghes_edac_register() with error, instance #1.

There are the following issues with this now:

1) Refcount is 2, though it should be zero as init failed with an
   error.

ghes_edac_register(), instance #2 returned already with success even
though:

2) ghes initialization was not yet done, and

3) ghes initialization is going to fail.

> 

> >  * The ghes refcount may not be decremented properly on

> >    unregistration. Always decrement the refcount once

> >    ghes_edac_unregister() is called to keep the refcount sane.

> 

> Right.

> 

> >  * The ghes_pvt pointer is handed to the irq handler before

> >    registration finished.

> > 

> >  * The mci structure could be freed while the irq handler is running.

> > 

> > Fix this by adding a mutex to ghes_edac_register(). This mutex

> > serializes instances to register and unregister. The refcount is only

> > increased if the registration succeeded. This makes sure the refcount

> > is in a consistent state after registering or unregistering a device.

> > Note: A spinlock cannot be used here as the code section may sleep.

> > 

> > The ghes_pvt is protected by ghes_lock now.

> 

> This better be documented in the driver with a comment above the

> ghes_pvt thing.


Ok, will add a comment there.

> 

> I'm assuming the support for multiple instances is going ontop of this?

> If so, ghes_pvt needs to be an array or so. Also, if you do that, I

> think you should use mc_devices - see edac_mc_find() et al - instead of

> growing a special one just for this driver.


Right, there will be a parent device created for each instance. But an
array is not required as this also can be part of the private data.
Only some sort of list head is needed to collect them all.

> 

> > This ensures the pointer

> > is not updated before registration was finished or while the irq

> > handler is running. It is unset before unregistering the device

> > including necessary (implicit) memory barriers making the changes

> > visible to other cpus. Thus, the device can not be used anymore by an

> > interrupt.

> > 

> > A refcount is needed. There can be multiple GHES structures being

> > defined (see ACPI 6.3 specification, 18.3.2.7 Generic Hardware Error

> > Source, "Some platforms may describe multiple Generic Hardware Error

> > Source structures with different notification types, ...").

> > 

> > Another approach to use the mci's device refcount (get_device()) and

> > have a release function does not work here. A release function will be

> > called only for device_release() with the last put_device() call. The

> > device must be deleted *before* that with device_del(). This is only

> > possible by maintaining an own refcount.

> > 

> > Fixes: 0fe5f281f749 ("EDAC, ghes: Model a single, logical memory controller")

> > Fixes: 1e72e673b9d1 ("EDAC/ghes: Fix Use after free in ghes_edac remove path")

> > Signed-off-by: Borislav Petkov <bp@suse.de>

> > Signed-off-by: James Morse <james.morse@arm.com>

> > Signed-off-by: Robert Richter <rrichter@marvell.com>

> > ---

> >  drivers/edac/ghes_edac.c | 78 ++++++++++++++++++++++++++++------------

> >  1 file changed, 56 insertions(+), 22 deletions(-)

> 

> ...

> 

> > @@ -457,10 +461,12 @@ static struct acpi_platform_list plat_list[] = {

> >  int ghes_edac_register(struct ghes *ghes, struct device *dev)

> >  {

> >  	bool fake = false;

> > -	int rc, num_dimm = 0;

> > +	int rc = 0, num_dimm = 0;

> >  	struct mem_ctl_info *mci;

> > +	struct ghes_edac_pvt *pvt;

> >  	struct edac_mc_layer layers[1];

> >  	struct ghes_edac_dimm_fill dimm_fill;

> > +	unsigned long flags;

> >  	int idx = -1;

> >  

> >  	if (IS_ENABLED(CONFIG_X86)) {

> > @@ -472,11 +478,14 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)

> >  		idx = 0;

> >  	}

> >  

> > +	/* finish another registration/unregistration instance first */

> > +	mutex_lock(&ghes_reg_mutex);

> > +

> >  	/*

> >  	 * We have only one logical memory controller to which all DIMMs belong.

> >  	 */

> > -	if (atomic_inc_return(&ghes_init) > 1)

> > -		return 0;

> > +	if (atomic_inc_not_zero(&ghes_init))

> 

> That should probably be called ghes_instances now to make it obvious

> what it is.


I already thought of renaming this to ghes_refcount.

> 

> Also, you can make it a normal variable now since it is being modified

> under the mutex only.


I would rather avoid an own implementation of reference counting and
also better switch from atomic_* to refcount_* which also provides
range checks. There is no benefit doing this our own.

Any objections here for the renaming and using the refcount API?

> 

> > +		goto unlock;

> >  

> >  	/* Get the number of DIMMs */

> >  	dmi_walk(ghes_edac_count_dimms, &num_dimm);

> > @@ -494,12 +503,13 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)

> >  	mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, sizeof(struct ghes_edac_pvt));

> >  	if (!mci) {

> >  		pr_info("Can't allocate memory for EDAC data\n");

> > -		return -ENOMEM;

> > +		rc = -ENOMEM;

> > +		goto unlock;

> >  	}

> >  

> > -	ghes_pvt	= mci->pvt_info;

> > -	ghes_pvt->ghes	= ghes;

> > -	ghes_pvt->mci	= mci;

> > +	pvt		= mci->pvt_info;

> > +	pvt->ghes	= ghes;

> > +	pvt->mci	= mci;

> >  

> >  	mci->pdev = dev;

> >  	mci->mtype_cap = MEM_FLAG_EMPTY;

> > @@ -541,23 +551,47 @@ int ghes_edac_register(struct ghes *ghes, struct device *dev)

> >  	if (rc < 0) {

> >  		pr_info("Can't register at EDAC core\n");

> >  		edac_mc_free(mci);

> > -		return -ENODEV;

> > +		rc = -ENODEV;

> 

> This needs to "goto unlock".


Umm, correct.

> 

> >  	}

> > -	return 0;

> > +

> > +	spin_lock_irqsave(&ghes_lock, flags);

> > +	ghes_pvt = pvt;

> > +	spin_unlock_irqrestore(&ghes_lock, flags);

> > +

> > +	/* only increment on success */

> > +	atomic_inc(&ghes_init);

> > +

> > +unlock:

> > +	mutex_unlock(&ghes_reg_mutex);

> > +

> > +	return rc;

> >  }


Will, repost. Thanks for review.

-Robert
Borislav Petkov Nov. 5, 2019, 2:58 p.m. UTC | #4
Just to hold down what we discussed on IRC.

On Tue, Nov 05, 2019 at 02:17:42PM +0000, Robert Richter wrote:
> Here the bypass:

> 

> 1) Entering ghes_edac_register(), instance #1


That was a misunderstanding, I know what Robert means now.

> Right, there will be a parent device created for each instance. But an

> array is not required as this also can be part of the private data.

> Only some sort of list head is needed to collect them all.


Ok.

> I would rather avoid an own implementation of reference counting and

> also better switch from atomic_* to refcount_* which also provides

> range checks. There is no benefit doing this our own.

> 

> Any objections here for the renaming and using the refcount API?


None.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
diff mbox series

Patch

diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 0bb62857ffb2..622e4dde51c3 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -29,6 +29,9 @@  struct ghes_edac_pvt {
 static atomic_t ghes_init = ATOMIC_INIT(0);
 static struct ghes_edac_pvt *ghes_pvt;
 
+/* GHES registration mutex */
+static DEFINE_MUTEX(ghes_reg_mutex);
+
 /*
  * Sync with other, potentially concurrent callers of
  * ghes_edac_report_mem_error(). We don't know what the
@@ -79,9 +82,8 @@  static void ghes_edac_count_dimms(const struct dmi_header *dh, void *arg)
 		(*num_dimm)++;
 }
 
-static int get_dimm_smbios_index(u16 handle)
+static int get_dimm_smbios_index(struct mem_ctl_info *mci, u16 handle)
 {
-	struct mem_ctl_info *mci = ghes_pvt->mci;
 	int i;
 
 	for (i = 0; i < mci->tot_dimms; i++) {
@@ -198,14 +200,11 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 	enum hw_event_mc_err_type type;
 	struct edac_raw_error_desc *e;
 	struct mem_ctl_info *mci;
-	struct ghes_edac_pvt *pvt = ghes_pvt;
+	struct ghes_edac_pvt *pvt;
 	unsigned long flags;
 	char *p;
 	u8 grain_bits;
 
-	if (!pvt)
-		return;
-
 	/*
 	 * We can do the locking below because GHES defers error processing
 	 * from NMI to IRQ context. Whenever that changes, we'd at least
@@ -216,6 +215,10 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 
 	spin_lock_irqsave(&ghes_lock, flags);
 
+	pvt = ghes_pvt;
+	if (!pvt)
+		goto unlock;
+
 	mci = pvt->mci;
 	e = &mci->error_desc;
 
@@ -348,7 +351,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 			p += sprintf(p, "DIMM DMI handle: 0x%.4x ",
 				     mem_err->mem_dev_handle);
 
-		index = get_dimm_smbios_index(mem_err->mem_dev_handle);
+		index = get_dimm_smbios_index(mci, mem_err->mem_dev_handle);
 		if (index >= 0) {
 			e->top_layer = index;
 			e->enable_per_layer_report = true;
@@ -443,6 +446,7 @@  void ghes_edac_report_mem_error(int sev, struct cper_sec_mem_err *mem_err)
 		       grain_bits, e->syndrome, pvt->detail_location);
 
 	edac_raw_mc_handle_error(type, mci, e);
+unlock:
 	spin_unlock_irqrestore(&ghes_lock, flags);
 }
 
@@ -457,10 +461,12 @@  static struct acpi_platform_list plat_list[] = {
 int ghes_edac_register(struct ghes *ghes, struct device *dev)
 {
 	bool fake = false;
-	int rc, num_dimm = 0;
+	int rc = 0, num_dimm = 0;
 	struct mem_ctl_info *mci;
+	struct ghes_edac_pvt *pvt;
 	struct edac_mc_layer layers[1];
 	struct ghes_edac_dimm_fill dimm_fill;
+	unsigned long flags;
 	int idx = -1;
 
 	if (IS_ENABLED(CONFIG_X86)) {
@@ -472,11 +478,14 @@  int ghes_edac_register(struct ghes *ghes, struct device *dev)
 		idx = 0;
 	}
 
+	/* finish another registration/unregistration instance first */
+	mutex_lock(&ghes_reg_mutex);
+
 	/*
 	 * We have only one logical memory controller to which all DIMMs belong.
 	 */
-	if (atomic_inc_return(&ghes_init) > 1)
-		return 0;
+	if (atomic_inc_not_zero(&ghes_init))
+		goto unlock;
 
 	/* Get the number of DIMMs */
 	dmi_walk(ghes_edac_count_dimms, &num_dimm);
@@ -494,12 +503,13 @@  int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	mci = edac_mc_alloc(0, ARRAY_SIZE(layers), layers, sizeof(struct ghes_edac_pvt));
 	if (!mci) {
 		pr_info("Can't allocate memory for EDAC data\n");
-		return -ENOMEM;
+		rc = -ENOMEM;
+		goto unlock;
 	}
 
-	ghes_pvt	= mci->pvt_info;
-	ghes_pvt->ghes	= ghes;
-	ghes_pvt->mci	= mci;
+	pvt		= mci->pvt_info;
+	pvt->ghes	= ghes;
+	pvt->mci	= mci;
 
 	mci->pdev = dev;
 	mci->mtype_cap = MEM_FLAG_EMPTY;
@@ -541,23 +551,47 @@  int ghes_edac_register(struct ghes *ghes, struct device *dev)
 	if (rc < 0) {
 		pr_info("Can't register at EDAC core\n");
 		edac_mc_free(mci);
-		return -ENODEV;
+		rc = -ENODEV;
 	}
-	return 0;
+
+	spin_lock_irqsave(&ghes_lock, flags);
+	ghes_pvt = pvt;
+	spin_unlock_irqrestore(&ghes_lock, flags);
+
+	/* only increment on success */
+	atomic_inc(&ghes_init);
+
+unlock:
+	mutex_unlock(&ghes_reg_mutex);
+
+	return rc;
 }
 
 void ghes_edac_unregister(struct ghes *ghes)
 {
 	struct mem_ctl_info *mci;
+	unsigned long flags;
 
-	if (!ghes_pvt)
-		return;
+	mutex_lock(&ghes_reg_mutex);
 
 	if (atomic_dec_return(&ghes_init))
-		return;
+		goto unlock;
 
-	mci = ghes_pvt->mci;
+	/*
+	 * Wait for the irq handler being finished.
+	 */
+	spin_lock_irqsave(&ghes_lock, flags);
+	mci = ghes_pvt ? ghes_pvt->mci : NULL;
 	ghes_pvt = NULL;
-	edac_mc_del_mc(mci->pdev);
-	edac_mc_free(mci);
+	spin_unlock_irqrestore(&ghes_lock, flags);
+
+	if (!mci)
+		goto unlock;
+
+	mci = edac_mc_del_mc(mci->pdev);
+	if (mci)
+		edac_mc_free(mci);
+
+unlock:
+	mutex_unlock(&ghes_reg_mutex);
 }