
sysfs: driver core: Fix glue dir race condition

Message ID 545C2408.60703@huawei.com

Commit Message

wangyijing Nov. 7, 2014, 1:44 a.m. UTC
On 2014/11/7 1:22, Greg KH wrote:
> On Thu, Nov 06, 2014 at 11:55:47AM -0500, Tejun Heo wrote:
>> Maybe "fix glue dir race condition by not removing them" is a better
>> title?
>>
>> On Thu, Nov 06, 2014 at 04:16:38PM +0800, Yijing Wang wrote:
>>> There is a race condition when removing a glue directory.
>>> It can be reproduced with the following test:
>>>
>>> path 1: Add first child device
>>> device_add()
>>> 	get_device_parent()
>>> 		/*find parent from glue_dirs.list*/
>>> 		list_for_each_entry(k, &dev->class->p->glue_dirs.list, entry)
>>> 			if (k->parent == parent_kobj) {
>>> 				kobj = kobject_get(k);
>>> 				break;
>>> 			}
>>> 		....
>>> 		class_dir_create_and_add()
>>>
>>> path 2: Remove last child device under glue dir
>>> device_del()
>>> 	cleanup_device_parent()
>>> 		cleanup_glue_dir()
>>> 			kobject_put(glue_dir);
>>>
>>> If path 2 has entered cleanup_glue_dir() but has not yet
>>> called kobject_put(glue_dir), the glue dir is still in
>>> the parent's kset list. Meanwhile, path 1 finds the glue
>>> dir in glue_dirs.list. Path 2 may release the glue dir
>>> before path 1 calls kobject_get(), so the kernel reports
>>> a warning and hits a BUG_ON.
>>>
>>> As suggested by Tejun Heo, this fix keeps the glue dir
>>> around once it has been created.
>>
>> I think you prolly want to explain why this is okay / desired.
>> e.g. list how the glue dir is used and how many of them are there and
>> explain that there's no real benefit in removing them.
> 
> I'd really _like_ to remove them if at all possible, as if there isn't
> any "children" in the subdirectory, there shouldn't be a need for that
> directory to be there.
> 
> This seems to be the "classic" problem we have of a kref in a list that
> can be found while the last instance could be removed at the same time.
> I hate to just throw another lock at the problem, but wouldn't a lock to
> protect the list of glue_dirs be the answer here?

Hi Greg, in this case we need to close the race between traversing dev->class->p->glue_dirs.list
and kobject_put(glue_dir) in cleanup_glue_dir().

glue_dirs.list_lock only protects glue_dirs.list itself, but what we need is to prevent
kobject_put(glue_dir) from decreasing the glue_dir refcount while we are traversing
dev->class->p->glue_dirs.list.


---------------------------------------------------------------------------
		/* find our class-directory at the parent and reference it */
		spin_lock(&dev->class->p->glue_dirs.list_lock);
		list_for_each_entry(k, &dev->class->p->glue_dirs.list, entry)	/* <-- A */
			if (k->parent == parent_kobj) {
				kobj = kobject_get(k);
				break;
			}
		spin_unlock(&dev->class->p->glue_dirs.list_lock);
------------------------------------------------------------------------------
static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
{
	/* see if we live in a "glue" directory */
	if (!glue_dir || !dev->class ||
	    glue_dir->kset != &dev->class->p->glue_dirs)
		return;

	kobject_put(glue_dir);	/* <-- B */
}
------------------------------------------------------------------------------


Tejun introduced gdp_mutex in commit 77d3d7c1d561f49 to fix a race condition in get_device_parent().
We could reuse that mutex to close the race between the glue_dirs.list traversal (A) and kobject_put(glue_dir) (B).

Greg, of the two solutions (reusing gdp_mutex, or not removing glue dirs at all), which one do you prefer?

> 
> thanks,
> 
> greg k-h
> 

Comments

Greg KH Nov. 7, 2014, 2:46 a.m. UTC | #1
On Fri, Nov 07, 2014 at 09:44:40AM +0800, Yijing Wang wrote:
> On 2014/11/7 1:22, Greg KH wrote:
> > On Thu, Nov 06, 2014 at 11:55:47AM -0500, Tejun Heo wrote:
> >> Maybe "fix glue dir race condition by not removing them" is a better
> >> title?
> >>
> >> On Thu, Nov 06, 2014 at 04:16:38PM +0800, Yijing Wang wrote:
> >>> There is a race condition when removing a glue directory.
> >>> It can be reproduced with the following test:
> >>>
> >>> path 1: Add first child device
> >>> device_add()
> >>> 	get_device_parent()
> >>> 		/*find parent from glue_dirs.list*/
> >>> 		list_for_each_entry(k, &dev->class->p->glue_dirs.list, entry)
> >>> 			if (k->parent == parent_kobj) {
> >>> 				kobj = kobject_get(k);
> >>> 				break;
> >>> 			}
> >>> 		....
> >>> 		class_dir_create_and_add()
> >>>
> >>> path 2: Remove last child device under glue dir
> >>> device_del()
> >>> 	cleanup_device_parent()
> >>> 		cleanup_glue_dir()
> >>> 			kobject_put(glue_dir);
> >>>
> >>> If path 2 has entered cleanup_glue_dir() but has not yet
> >>> called kobject_put(glue_dir), the glue dir is still in
> >>> the parent's kset list. Meanwhile, path 1 finds the glue
> >>> dir in glue_dirs.list. Path 2 may release the glue dir
> >>> before path 1 calls kobject_get(), so the kernel reports
> >>> a warning and hits a BUG_ON.
> >>>
> >>> As suggested by Tejun Heo, this fix keeps the glue dir
> >>> around once it has been created.
> >>
> >> I think you prolly want to explain why this is okay / desired.
> >> e.g. list how the glue dir is used and how many of them are there and
> >> explain that there's no real benefit in removing them.
> > 
> > I'd really _like_ to remove them if at all possible, as if there isn't
> > any "children" in the subdirectory, there shouldn't be a need for that
> > directory to be there.
> > 
> > This seems to be the "classic" problem we have of a kref in a list that
> > can be found while the last instance could be removed at the same time.
> > I hate to just throw another lock at the problem, but wouldn't a lock to
> > protect the list of glue_dirs be the answer here?
> 
> Hi Greg, in this case we need to close the race between traversing dev->class->p->glue_dirs.list
> and kobject_put(glue_dir) in cleanup_glue_dir().
> 
> glue_dirs.list_lock only protects glue_dirs.list itself, but what we need is to prevent
> kobject_put(glue_dir) from decreasing the glue_dir refcount while we are traversing
> dev->class->p->glue_dirs.list.
> 
> 
> ---------------------------------------------------------------------------
> 		/* find our class-directory at the parent and reference it */
> 		spin_lock(&dev->class->p->glue_dirs.list_lock);
> 		list_for_each_entry(k, &dev->class->p->glue_dirs.list, entry)	/* <-- A */
> 			if (k->parent == parent_kobj) {
> 				kobj = kobject_get(k);
> 				break;
> 			}
> 		spin_unlock(&dev->class->p->glue_dirs.list_lock);
> ------------------------------------------------------------------------------
> static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
> {
> 	/* see if we live in a "glue" directory */
> 	if (!glue_dir || !dev->class ||
> 	    glue_dir->kset != &dev->class->p->glue_dirs)
> 		return;
> 
> 	kobject_put(glue_dir);	/* <-- B */
> }
> ------------------------------------------------------------------------------
> 
> 
> Tejun introduced gdp_mutex in commit 77d3d7c1d561f49 to fix a race condition in get_device_parent().
> We could reuse that mutex to close the race between the glue_dirs.list traversal (A) and kobject_put(glue_dir) (B).
> 
> Greg, of the two solutions (reusing gdp_mutex, or not removing glue dirs at all), which one do you prefer?
> 
> 
> diff --git a/drivers/base/core.c b/drivers/base/core.c
> index 28b808c..645eacf 100644
> --- a/drivers/base/core.c
> +++ b/drivers/base/core.c
> @@ -724,12 +724,12 @@ class_dir_create_and_add(struct class *class, struct kobject *parent_kobj)
>  	return &dir->kobj;
>  }
> 
> +static DEFINE_MUTEX(gdp_mutex);
> 
>  static struct kobject *get_device_parent(struct device *dev,
>  					 struct device *parent)
>  {
>  	if (dev->class) {
> -		static DEFINE_MUTEX(gdp_mutex);
>  		struct kobject *kobj = NULL;
>  		struct kobject *parent_kobj;
>  		struct kobject *k;
> @@ -793,7 +793,9 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
>  	    glue_dir->kset != &dev->class->p->glue_dirs)
>  		return;
> 
> +	mutex_lock(&gdp_mutex);
>  	kobject_put(glue_dir);
> +	mutex_unlock(&gdp_mutex);
>  }
> 
>  static void cleanup_device_parent(struct device *dev)
> 

I much prefer this patch over the other one, as it keeps the same
behavior as today, and fixes the existing bug.

Have you tested it out to see if it works properly?  If so, can you
resend it in a "proper" form so I can apply it?

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe stable" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
wangyijing Nov. 7, 2014, 3:12 a.m. UTC | #2
>> +static DEFINE_MUTEX(gdp_mutex);
>>
>>  static struct kobject *get_device_parent(struct device *dev,
>>  					 struct device *parent)
>>  {
>>  	if (dev->class) {
>> -		static DEFINE_MUTEX(gdp_mutex);
>>  		struct kobject *kobj = NULL;
>>  		struct kobject *parent_kobj;
>>  		struct kobject *k;
>> @@ -793,7 +793,9 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
>>  	    glue_dir->kset != &dev->class->p->glue_dirs)
>>  		return;
>>
>> +	mutex_lock(&gdp_mutex);
>>  	kobject_put(glue_dir);
>> +	mutex_unlock(&gdp_mutex);
>>  }
>>
>>  static void cleanup_device_parent(struct device *dev)
>>
> 
> I much prefer this patch over the other one, as it keeps the same
> behavior as today, and fixes the existing bug.
> 
> Have you tested it out to see if it works properly?  If so, can you
> resend it in a "proper" form so I can apply it?

Yes, we tested it on our system. I will resend it now, thanks!

> 
> thanks,
> 
> greg k-h
> 
Greg KH Nov. 7, 2014, 5:51 a.m. UTC | #3
On Fri, Nov 07, 2014 at 11:12:19AM +0800, Yijing Wang wrote:
> >> +static DEFINE_MUTEX(gdp_mutex);
> >>
> >>  static struct kobject *get_device_parent(struct device *dev,
> >>  					 struct device *parent)
> >>  {
> >>  	if (dev->class) {
> >> -		static DEFINE_MUTEX(gdp_mutex);
> >>  		struct kobject *kobj = NULL;
> >>  		struct kobject *parent_kobj;
> >>  		struct kobject *k;
> >> @@ -793,7 +793,9 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
> >>  	    glue_dir->kset != &dev->class->p->glue_dirs)
> >>  		return;
> >>
> >> +	mutex_lock(&gdp_mutex);
> >>  	kobject_put(glue_dir);
> >> +	mutex_unlock(&gdp_mutex);
> >>  }
> >>
> >>  static void cleanup_device_parent(struct device *dev)
> >>
> > 
> > I much prefer this patch over the other one, as it keeps the same
> > behavior as today, and fixes the existing bug.
> > 
> > Have you tested it out to see if it works properly?  If so, can you
> > resend it in a "proper" form so I can apply it?
> 
> Yes, we tested it on our system. I will resend it now, thanks!

Wonderful, thanks for that, and persisting with this.  I'll queue up
that patch tomorrow morning.

greg k-h

Patch

diff --git a/drivers/base/core.c b/drivers/base/core.c
index 28b808c..645eacf 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -724,12 +724,12 @@  class_dir_create_and_add(struct class *class, struct kobject *parent_kobj)
 	return &dir->kobj;
 }

+static DEFINE_MUTEX(gdp_mutex);

 static struct kobject *get_device_parent(struct device *dev,
 					 struct device *parent)
 {
 	if (dev->class) {
-		static DEFINE_MUTEX(gdp_mutex);
 		struct kobject *kobj = NULL;
 		struct kobject *parent_kobj;
 		struct kobject *k;
@@ -793,7 +793,9 @@  static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
 	    glue_dir->kset != &dev->class->p->glue_dirs)
 		return;

+	mutex_lock(&gdp_mutex);
 	kobject_put(glue_dir);
+	mutex_unlock(&gdp_mutex);
 }

 static void cleanup_device_parent(struct device *dev)