Message ID | 20220720112852.11440-2-d.bogdanov@yadro.com |
---|---|
State | Superseded |
Headers | show |
Series | target: core: fix race during ACL removal | expand |
On 7/20/22 6:28 AM, Dmitry Bogdanov wrote: > Under huge load there is a possibility of race condition in updating > se_dev_entry object in ACL removal procedure. > > NIP [c0080000154093d0] transport_lookup_cmd_lun+0x1f8/0x3d0 [target_core_mod] > LR [c00800001542ab34] target_submit_cmd_map_sgls+0x11c/0x300 [target_core_mod] > Call Trace: > target_submit_cmd_map_sgls+0x11c/0x300 [target_core_mod] > target_submit_cmd+0x44/0x60 [target_core_mod] > tcm_qla2xxx_handle_cmd+0x88/0xe0 [tcm_qla2xxx] > qlt_do_work+0x2e4/0x3d0 [qla2xxx] > process_one_work+0x298/0x5c > > Despite usage of RCU primitives with deve->se_lun pointer, it has not > became dereference-safe. Because deve->se_lun is updated not > synchronized with a reader. That change might be in a release function > called by synchronize_rcu(). But, in fact, there is no sence to NULL that > pointer for deleting deve. So a better solution is to remove that NULLing. > Hey, For the line above about there being no reason to NULL the pointer are you saying: We should have had a NULL check like: transport_lookup_cmd_lun: ..... se_lun = rcu_dereference(deve->se_lun); if (!se_lun) { rcu_read_unlock(); return TCM_NON_EXISTENT_LUN; } and it would have prevented the crash. But we don't need that or the se_lun RCU use, because the hlist_del_rcu and kfree_rcu calls in core_disable_device_list_for_node will make sure that transport_lookup_cmd_lun either finds a se_dev_entry and while in use will never be freed from under us, or transport_lookup_cmd_lun will never see a se_dev_entry. If that's what you meant then I agree. > Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com> > --- > drivers/target/target_core_device.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c > index 25f33eb25337..335f8cbfe633 100644 > --- a/drivers/target/target_core_device.c > +++ b/drivers/target/target_core_device.c > @@ -434,9 +434,6 @@ void core_disable_device_list_for_node( > kref_put(&orig->pr_kref, target_pr_kref_release); > wait_for_completion(&orig->pr_comp); > > - rcu_assign_pointer(orig->se_lun, NULL); > - rcu_assign_pointer(orig->se_lun_acl, NULL); > - > kfree_rcu(orig, rcu_head); > > core_scsi3_free_pr_reg_from_nacl(dev, nacl);
On Tue, Jul 26, 2022 at 04:40:16PM -0500, Mike Christie wrote: > > On 7/20/22 6:28 AM, Dmitry Bogdanov wrote: > > Under huge load there is a possibility of race condition in updating > > se_dev_entry object in ACL removal procedure. > > > > NIP [c0080000154093d0] transport_lookup_cmd_lun+0x1f8/0x3d0 [target_core_mod] > > LR [c00800001542ab34] target_submit_cmd_map_sgls+0x11c/0x300 [target_core_mod] > > Call Trace: > > target_submit_cmd_map_sgls+0x11c/0x300 [target_core_mod] > > target_submit_cmd+0x44/0x60 [target_core_mod] > > tcm_qla2xxx_handle_cmd+0x88/0xe0 [tcm_qla2xxx] > > qlt_do_work+0x2e4/0x3d0 [qla2xxx] > > process_one_work+0x298/0x5c > > > > Despite usage of RCU primitives with deve->se_lun pointer, it has not > > became dereference-safe. Because deve->se_lun is updated not > > synchronized with a reader. That change might be in a release function > > called by synchronize_rcu(). But, in fact, there is no sence to NULL that > > pointer for deleting deve. So a better solution is to remove that NULLing. > > > Hey, > > For the line above about there being no reason to NULL the pointer are you > saying: > > > We should have had a NULL check like: > > transport_lookup_cmd_lun: > > ..... > > se_lun = rcu_dereference(deve->se_lun); > if (!se_lun) { > rcu_read_unlock(); > return TCM_NON_EXISTENT_LUN; > } > > and it would have prevented the crash. > > But we don't need that or the se_lun RCU use, because the hlist_del_rcu and > kfree_rcu calls in core_disable_device_list_for_node will make sure that > transport_lookup_cmd_lun either finds a se_dev_entry and while in use will > never be freed from under us, or transport_lookup_cmd_lun will never see a > se_dev_entry. > > > If that's what you meant then I agree. Yes, that is exactly how I thought. All access to deve->se_lun is already under rcu_read_lock. And either deve->se_lun is always valid or deve is not valid itself and will not be found in the list_for_*. Will add more details in the commit message.
diff --git a/drivers/target/target_core_device.c b/drivers/target/target_core_device.c index 25f33eb25337..335f8cbfe633 100644 --- a/drivers/target/target_core_device.c +++ b/drivers/target/target_core_device.c @@ -434,9 +434,6 @@ void core_disable_device_list_for_node( kref_put(&orig->pr_kref, target_pr_kref_release); wait_for_completion(&orig->pr_comp); - rcu_assign_pointer(orig->se_lun, NULL); - rcu_assign_pointer(orig->se_lun_acl, NULL); - kfree_rcu(orig, rcu_head); core_scsi3_free_pr_reg_from_nacl(dev, nacl);
Under huge load there is a possibility of race condition in updating se_dev_entry object in ACL removal procedure. NIP [c0080000154093d0] transport_lookup_cmd_lun+0x1f8/0x3d0 [target_core_mod] LR [c00800001542ab34] target_submit_cmd_map_sgls+0x11c/0x300 [target_core_mod] Call Trace: target_submit_cmd_map_sgls+0x11c/0x300 [target_core_mod] target_submit_cmd+0x44/0x60 [target_core_mod] tcm_qla2xxx_handle_cmd+0x88/0xe0 [tcm_qla2xxx] qlt_do_work+0x2e4/0x3d0 [qla2xxx] process_one_work+0x298/0x5c Despite usage of RCU primitives with deve->se_lun pointer, it has not became dereference-safe. Because deve->se_lun is updated not synchronized with a reader. That change might be in a release function called by synchronize_rcu(). But, in fact, there is no sence to NULL that pointer for deleting deve. So a better solution is to remove that NULLing. Signed-off-by: Dmitry Bogdanov <d.bogdanov@yadro.com> --- drivers/target/target_core_device.c | 3 --- 1 file changed, 3 deletions(-)