Message ID | 20230803112841.588822-1-huangcun@sangfor.com.cn |
---|---|
State | New |
Headers | show |
Series | scsi: scsi_dh_rdac: Avoid crash when a disk attach failed | expand |
Friendly ping. On 2023/8/3 19:28, Huang Cun wrote: > When a disk fails to attach, the struct rdac_dh_data is released, > but it is not removed from the ctlr->dh_list. When attaching another > disk, the released rdac_dh_data will be accessed and the following > BUG_ON() may be observed: > > [ 414.696167] scsi 5:0:0:7: rdac: Attach failed (8) > ... > [ 423.615364] kernel BUG at drivers/scsi/device_handler/scsi_dh_rdac.c:427! > [ 423.615731] invalid opcode: 0000 [#1] SMP NOPTI > ... > [ 423.623247] Call Trace: > [ 423.623598] rdac_bus_attach+0x203/0x4c0 > [ 423.623949] ? scsi_dh_handler_attach+0x2d/0x90 > [ 423.624300] scsi_dh_handler_attach+0x2d/0x90 > [ 423.624652] scsi_sysfs_add_sdev+0x88/0x270 > [ 423.625004] scsi_probe_and_add_lun+0xc47/0xd50 > [ 423.625354] scsi_report_lun_scan+0x339/0x3b0 > [ 423.625705] __scsi_scan_target+0xe9/0x220 > [ 423.626056] scsi_scan_target+0xf6/0x100 > [ 423.626404] fc_scsi_scan_rport+0xa5/0xb0 > [ 423.626757] process_one_work+0x15e/0x3f0 > [ 423.627106] worker_thread+0x4c/0x440 > [ 423.627453] ? rescuer_thread+0x350/0x350 > [ 423.627804] kthread+0xf8/0x130 > [ 423.628153] ? kthread_destroy_worker+0x40/0x40 > [ 423.628509] ret_from_fork+0x1f/0x40 > > Fixes: 1a5dc166cd88 ("scsi_dh_rdac: update 'access_state' field") > Signed-off-by: Huang Cun <huangcun@sangfor.com.cn> > Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> > Cc: Donglin Peng <pengdonglin@sangfor.com.cn> > --- > drivers/scsi/device_handler/scsi_dh_rdac.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c b/drivers/scsi/device_handler/scsi_dh_rdac.c > index c5538645057a..9d487c2b7708 100644 > --- a/drivers/scsi/device_handler/scsi_dh_rdac.c > +++ b/drivers/scsi/device_handler/scsi_dh_rdac.c > @@ -762,8 +762,10 @@ static int rdac_bus_attach(struct scsi_device *sdev) > > clean_ctlr: > spin_lock(&list_lock); > + list_del_rcu(&h->node); > kref_put(&h->ctlr->kref, release_controller); > spin_unlock(&list_lock); > + synchronize_rcu(); > > failed: > kfree(h);
On 8/3/23 6:28 AM, Huang Cun wrote: > When a disk fails to attach, the struct rdac_dh_data is released, > but it is not removed from the ctlr->dh_list. When attaching another > disk, the released rdac_dh_data will be accessed and the following > BUG_ON() may be observed: > > [ 414.696167] scsi 5:0:0:7: rdac: Attach failed (8) > ... > [ 423.615364] kernel BUG at drivers/scsi/device_handler/scsi_dh_rdac.c:427! > [ 423.615731] invalid opcode: 0000 [#1] SMP NOPTI > ... > [ 423.623247] Call Trace: > [ 423.623598] rdac_bus_attach+0x203/0x4c0 > [ 423.623949] ? scsi_dh_handler_attach+0x2d/0x90 > [ 423.624300] scsi_dh_handler_attach+0x2d/0x90 > [ 423.624652] scsi_sysfs_add_sdev+0x88/0x270 > [ 423.625004] scsi_probe_and_add_lun+0xc47/0xd50 > [ 423.625354] scsi_report_lun_scan+0x339/0x3b0 > [ 423.625705] __scsi_scan_target+0xe9/0x220 > [ 423.626056] scsi_scan_target+0xf6/0x100 > [ 423.626404] fc_scsi_scan_rport+0xa5/0xb0 > [ 423.626757] process_one_work+0x15e/0x3f0 > [ 423.627106] worker_thread+0x4c/0x440 > [ 423.627453] ? rescuer_thread+0x350/0x350 > [ 423.627804] kthread+0xf8/0x130 > [ 423.628153] ? kthread_destroy_worker+0x40/0x40 > [ 423.628509] ret_from_fork+0x1f/0x40 > > Fixes: 1a5dc166cd88 ("scsi_dh_rdac: update 'access_state' field") > Signed-off-by: Huang Cun <huangcun@sangfor.com.cn> > Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> > Cc: Donglin Peng <pengdonglin@sangfor.com.cn> > --- > drivers/scsi/device_handler/scsi_dh_rdac.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c b/drivers/scsi/device_handler/scsi_dh_rdac.c > index c5538645057a..9d487c2b7708 100644 > --- a/drivers/scsi/device_handler/scsi_dh_rdac.c > +++ b/drivers/scsi/device_handler/scsi_dh_rdac.c > @@ -762,8 +762,10 @@ static int rdac_bus_attach(struct scsi_device *sdev) > > clean_ctlr: > spin_lock(&list_lock); > + list_del_rcu(&h->node); > kref_put(&h->ctlr->kref, release_controller); > spin_unlock(&list_lock); > + synchronize_rcu(); > Should this be: spin_lock(&list_lock); list_del_rcu(&h->node); spin_unlock(&list_lock); synchronize_rcu(); kref_put(&h->ctlr->kref, release_controller); ? If you do the synchronize_rcu after the kref_put, then the kref_put could free the rdac_dh_data, while check_ownership is still accessing the rdac_dh_data, right?
On 2023/9/6 23:51, Mike Christie wrote: > On 8/3/23 6:28 AM, Huang Cun wrote: >> When a disk fails to attach, the struct rdac_dh_data is released, >> but it is not removed from the ctlr->dh_list. When attaching another >> disk, the released rdac_dh_data will be accessed and the following >> BUG_ON() may be observed: >> >> [ 414.696167] scsi 5:0:0:7: rdac: Attach failed (8) >> ... >> [ 423.615364] kernel BUG at drivers/scsi/device_handler/scsi_dh_rdac.c:427! >> [ 423.615731] invalid opcode: 0000 [#1] SMP NOPTI >> ... >> [ 423.623247] Call Trace: >> [ 423.623598] rdac_bus_attach+0x203/0x4c0 >> [ 423.623949] ? scsi_dh_handler_attach+0x2d/0x90 >> [ 423.624300] scsi_dh_handler_attach+0x2d/0x90 >> [ 423.624652] scsi_sysfs_add_sdev+0x88/0x270 >> [ 423.625004] scsi_probe_and_add_lun+0xc47/0xd50 >> [ 423.625354] scsi_report_lun_scan+0x339/0x3b0 >> [ 423.625705] __scsi_scan_target+0xe9/0x220 >> [ 423.626056] scsi_scan_target+0xf6/0x100 >> [ 423.626404] fc_scsi_scan_rport+0xa5/0xb0 >> [ 423.626757] process_one_work+0x15e/0x3f0 >> [ 423.627106] worker_thread+0x4c/0x440 >> [ 423.627453] ? rescuer_thread+0x350/0x350 >> [ 423.627804] kthread+0xf8/0x130 >> [ 423.628153] ? kthread_destroy_worker+0x40/0x40 >> [ 423.628509] ret_from_fork+0x1f/0x40 >> >> Fixes: 1a5dc166cd88 ("scsi_dh_rdac: update 'access_state' field") >> Signed-off-by: Huang Cun <huangcun@sangfor.com.cn> >> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> >> Cc: Donglin Peng <pengdonglin@sangfor.com.cn> >> --- >> drivers/scsi/device_handler/scsi_dh_rdac.c | 2 ++ >> 1 file changed, 2 insertions(+) >> >> diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c b/drivers/scsi/device_handler/scsi_dh_rdac.c >> index c5538645057a..9d487c2b7708 100644 >> --- a/drivers/scsi/device_handler/scsi_dh_rdac.c >> +++ b/drivers/scsi/device_handler/scsi_dh_rdac.c >> @@ -762,8 +762,10 @@ static int rdac_bus_attach(struct scsi_device *sdev) >> >> clean_ctlr: >> spin_lock(&list_lock); >> + list_del_rcu(&h->node); >> kref_put(&h->ctlr->kref, release_controller); >> spin_unlock(&list_lock); >> + synchronize_rcu(); >> > > Should this be: > > spin_lock(&list_lock); > list_del_rcu(&h->node); > spin_unlock(&list_lock); > > synchronize_rcu(); > > kref_put(&h->ctlr->kref, release_controller); > > > ? > > If you do the synchronize_rcu after the kref_put, then the kref_put > could free the rdac_dh_data, while check_ownership is still > accessing the rdac_dh_data, right? > You are right. But I think we should keep the kref_put() and release callback be protected by list_lock, and only free the ctlr after synchronize_rcu(). So how about the additional modify (not yet tested): --- a/drivers/scsi/device_handler/scsi_dh_rdac.c +++ b/drivers/scsi/device_handler/scsi_dh_rdac.c @@ -166,6 +166,7 @@ struct rdac_controller { struct scsi_device *ms_sdev; struct list_head ms_head; struct list_head dh_list; + struct rcu_head rcu; }; struct c2_inquiry { @@ -320,7 +321,7 @@ static void release_controller(struct kref *kref) ctlr = container_of(kref, struct rdac_controller, kref); list_del(&ctlr->node); - kfree(ctlr); + kfree_rcu(ctlr, rcu); } static struct rdac_controller *get_controller(int index, char *array_name,
On 2023/9/7 9:45, Ding Hui wrote: > On 2023/9/6 23:51, Mike Christie wrote: >> On 8/3/23 6:28 AM, Huang Cun wrote: >>> When a disk fails to attach, the struct rdac_dh_data is released, >>> but it is not removed from the ctlr->dh_list. When attaching another >>> disk, the released rdac_dh_data will be accessed and the following >>> BUG_ON() may be observed: >>> >>> [ 414.696167] scsi 5:0:0:7: rdac: Attach failed (8) >>> ... >>> [ 423.615364] kernel BUG at drivers/scsi/device_handler/scsi_dh_rdac.c:427! >>> [ 423.615731] invalid opcode: 0000 [#1] SMP NOPTI >>> ... >>> [ 423.623247] Call Trace: >>> [ 423.623598] rdac_bus_attach+0x203/0x4c0 >>> [ 423.623949] ? scsi_dh_handler_attach+0x2d/0x90 >>> [ 423.624300] scsi_dh_handler_attach+0x2d/0x90 >>> [ 423.624652] scsi_sysfs_add_sdev+0x88/0x270 >>> [ 423.625004] scsi_probe_and_add_lun+0xc47/0xd50 >>> [ 423.625354] scsi_report_lun_scan+0x339/0x3b0 >>> [ 423.625705] __scsi_scan_target+0xe9/0x220 >>> [ 423.626056] scsi_scan_target+0xf6/0x100 >>> [ 423.626404] fc_scsi_scan_rport+0xa5/0xb0 >>> [ 423.626757] process_one_work+0x15e/0x3f0 >>> [ 423.627106] worker_thread+0x4c/0x440 >>> [ 423.627453] ? rescuer_thread+0x350/0x350 >>> [ 423.627804] kthread+0xf8/0x130 >>> [ 423.628153] ? kthread_destroy_worker+0x40/0x40 >>> [ 423.628509] ret_from_fork+0x1f/0x40 >>> >>> Fixes: 1a5dc166cd88 ("scsi_dh_rdac: update 'access_state' field") >>> Signed-off-by: Huang Cun <huangcun@sangfor.com.cn> >>> Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> >>> Cc: Donglin Peng <pengdonglin@sangfor.com.cn> >>> --- >>> drivers/scsi/device_handler/scsi_dh_rdac.c | 2 ++ >>> 1 file changed, 2 insertions(+) >>> >>> diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c b/drivers/scsi/device_handler/scsi_dh_rdac.c >>> index c5538645057a..9d487c2b7708 100644 >>> --- a/drivers/scsi/device_handler/scsi_dh_rdac.c >>> +++ b/drivers/scsi/device_handler/scsi_dh_rdac.c >>> @@ -762,8 +762,10 @@ static int rdac_bus_attach(struct scsi_device *sdev) >>> clean_ctlr: >>> spin_lock(&list_lock); >>> + list_del_rcu(&h->node); >>> kref_put(&h->ctlr->kref, release_controller); >>> spin_unlock(&list_lock); >>> + synchronize_rcu(); >> >> Should this be: >> >> spin_lock(&list_lock); >> list_del_rcu(&h->node); >> spin_unlock(&list_lock); >> >> synchronize_rcu(); >> >> kref_put(&h->ctlr->kref, release_controller); >> >> >> ? >> >> If you do the synchronize_rcu after the kref_put, then the kref_put >> could free the rdac_dh_data, while check_ownership is still >> accessing the rdac_dh_data, right? >> > > You are right. > > But I think we should keep the kref_put() and release callback be protected by list_lock, and only free > the ctlr after synchronize_rcu(). > Sorry, I thought again, maybe we don't need to worry about it. The ctlr->kref is protected by list_lock, and release_controller() DO free the ctlr, NOT rdac_dh_data, kfree(rdac_dh_data) is already after synchronize_rcu(). If check_ownership() shared the same ctlr, the kref must be greater than or equal to 2, so the ctlr will not be released by kref_put(). On the other hand, if the ctlr is different, releasing it will not affect others.
diff --git a/drivers/scsi/device_handler/scsi_dh_rdac.c b/drivers/scsi/device_handler/scsi_dh_rdac.c index c5538645057a..9d487c2b7708 100644 --- a/drivers/scsi/device_handler/scsi_dh_rdac.c +++ b/drivers/scsi/device_handler/scsi_dh_rdac.c @@ -762,8 +762,10 @@ static int rdac_bus_attach(struct scsi_device *sdev) clean_ctlr: spin_lock(&list_lock); + list_del_rcu(&h->node); kref_put(&h->ctlr->kref, release_controller); spin_unlock(&list_lock); + synchronize_rcu(); failed: kfree(h);