Message ID | 20210111230943.3701806-1-olteanv@gmail.com |
---|---|
State | New |
Headers | show |
Series | [net] net: dsa: unbind all switches from tree when DSA master unbinds | expand |
On 1/11/21 3:09 PM, Vladimir Oltean wrote: > From: Vladimir Oltean <vladimir.oltean@nxp.com> > > Currently the following happens when a DSA master driver unbinds while > there are DSA switches attached to it: > > $ echo 0000:00:00.5 > /sys/bus/pci/drivers/mscc_felix/unbind > ------------[ cut here ]------------ > WARNING: CPU: 0 PID: 392 at net/core/dev.c:9507 > Call trace: > rollback_registered_many+0x5fc/0x688 > unregister_netdevice_queue+0x98/0x120 > dsa_slave_destroy+0x4c/0x88 > dsa_port_teardown.part.16+0x78/0xb0 > dsa_tree_teardown_switches+0x58/0xc0 > dsa_unregister_switch+0x104/0x1b8 > felix_pci_remove+0x24/0x48 > pci_device_remove+0x48/0xf0 > device_release_driver_internal+0x118/0x1e8 > device_driver_detach+0x28/0x38 > unbind_store+0xd0/0x100 > > Located at the above location is this WARN_ON: > > /* Notifier chain MUST detach us all upper devices. */ > WARN_ON(netdev_has_any_upper_dev(dev)); > > Other stacked interfaces, like VLAN, do indeed listen for > NETDEV_UNREGISTER on the real_dev and also unregister themselves at that > time, which is clearly the behavior that rollback_registered_many > expects. But DSA interfaces are not VLAN. They have backing hardware > (platform devices, PCI devices, MDIO, SPI etc) which have a life cycle > of their own and we can't just trigger an unregister from the DSA > framework when we receive a netdev notifier that the master unregisters. > > Luckily, there is something we can do, and that is to inform the driver > core that we have a runtime dependency to the DSA master interface's > device, and create a device link where that is the supplier and we are > the consumer. Having this device link will make the DSA switch unbind > before the DSA master unbinds, which is enough to avoid the WARN_ON from > rollback_registered_many. > > Note that even before the blamed commit, DSA did nothing intelligent > when the master interface got unregistered either. See the discussion > here: > https://lore.kernel.org/netdev/20200505210253.20311-1-f.fainelli@gmail.com/ > But this time, at least the WARN_ON is loud enough that the > upper_dev_link commit can be blamed. > > The advantage with this approach vs dev_hold(master) in the attached > link is that the latter is not meant for long term reference counting. > With dev_hold, the only thing that will happen is that when the user > attempts an unbind of the DSA master, netdev_wait_allrefs will keep > waiting and waiting, due to DSA keeping the refcount forever. DSA would > not access freed memory corresponding to the master interface, but the > unbind would still result in a freeze. Whereas with device links, > graceful teardown is ensured. It even works with cascaded DSA trees. > > $ echo 0000:00:00.2 > /sys/bus/pci/drivers/fsl_enetc/unbind > [ 1818.797546] device swp0 left promiscuous mode > [ 1819.301112] sja1105 spi2.0: Link is Down > [ 1819.307981] DSA: tree 1 torn down > [ 1819.312408] device eno2 left promiscuous mode > [ 1819.656803] mscc_felix 0000:00:00.5: Link is Down > [ 1819.667194] DSA: tree 0 torn down > [ 1819.711557] fsl_enetc 0000:00:00.2 eno2: Link is Down > > This approach allows us to keep the DSA framework absolutely unchanged, > and the driver core will just know to unbind us first when the master > goes away - as opposed to the large (and probably impossible) rework > required if attempting to listen for NETDEV_UNREGISTER. > > As per the documentation at Documentation/driver-api/device_link.rst, > specifying the DL_FLAG_AUTOREMOVE_CONSUMER flag causes the device link > to be automatically purged when the consumer fails to probe or later > unbinds. So we don't need to keep the consumer_link variable in struct > dsa_switch. > > Fixes: 2f1e8ea726e9 ("net: dsa: link interfaces with the DSA master to get rid of lockdep warnings") > Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Much nicer than dev_hold() indeed tested it and it works quite nicely in exposing use after free bugs, on net-next/master I get the below warning which I have not yet investigated whether it is legit or if this is ARM 32-bit's young KASAN days that are still make us find bugs that are not bugs, cross checking with an ARM64 build right now. # echo 9300000.ethernet > unbind [ 61.352514] ================================================================== [ 61.359856] BUG: KASAN: use-after-free in devlink_nl_port_fill+0x578/0x7a8 [ 61.366847] Read of size 4 at addr c7cf00a8 by task sh/249 [ 61.372413] [ 61.373945] CPU: 1 PID: 249 Comm: sh Not tainted 5.11.0-rc2-g4b0dc83b4219 #132 [ 61.381272] Hardware name: Broadcom STB (Flattened Device Tree) [ 61.387264] Backtrace: [ 61.389768] [<c10fd504>] (dump_backtrace) from [<c10fd7b4>] (show_stack+0x20/0x24) [ 61.397498] r7:c27d6f00 r6:40040093 r5:00000000 r4:c27d6f00 [ 61.403226] [<c10fd794>] (show_stack) from [<c11047b8>] (dump_stack+0xbc/0xe0) [ 61.410588] [<c11046fc>] (dump_stack) from [<c0462624>] (print_address_description.constprop.2+0x3c/0x2e4) [ 61.420440] r10:cbf6b500 r9:cb9a39e0 r8:cb9a3ac0 r7:00000000 r6:c0eb59d0 r5:ed6b8580 [ 61.428372] r4:c7cf00a8 r3:00000100 [ 61.432003] [<c04625e8>] (print_address_description.constprop.2) from [<c0462b28>] (kasan_report+0x1a4/0x1c0) [ 61.442106] r8:cb9a3ac0 r7:00000000 r6:c0eb59d0 r5:00000004 r4:c7cf00a8 [ 61.448883] [<c0462984>] (kasan_report) from [<c0463514>] (__asan_load4+0x6c/0xbc) [ 61.456615] r7:c7c1c860 r6:c7c1c864 r5:b8734730 r4:c7c1c83c [ 61.462345] [<c04634a8>] (__asan_load4) from [<c0eb59d0>] (devlink_nl_port_fill+0x578/0x7a8) [ 61.470930] [<c0eb5458>] (devlink_nl_port_fill) from [<c0eb5e20>] (devlink_port_notify+0x80/0x12c) [ 61.480057] r10:00000003 r9:c7c1c8f8 r8:00000000 r7:c7bc1200 r6:00000008 r5:cbf6b500 [ 61.487984] r4:c7c1c83c [ 61.490566] [<c0eb5da0>] (devlink_port_notify) from [<c0eb6558>] (devlink_port_unregister+0x5c/0x108) [ 61.499949] r7:c7bbe378 r6:c7bc12b4 r5:c7c1c8dc r4:c7c1c83c [ 61.505678] [<c0eb64fc>] (devlink_port_unregister) from [<c0feef60>] (dsa_switch_teardown.part.3+0x188/0x18c) [ 61.515785] r9:c7c1c8f8 r8:00000000 r7:c7bbe378 r6:c7bbe348 r5:c7bbe340 r4:c7c1c800 [ 61.523621] [<c0feedd8>] (dsa_switch_teardown.part.3) from [<c0fef00c>] (dsa_tree_teardown_switches+0xa8/0xc4) [ 61.533827] r9:c2824140 r8:c7bc381c r7:c7bc370c r6:c7bbe340 r5:c7bc381c r4:c7c1c800 [ 61.541663] [<c0feef64>] (dsa_tree_teardown_switches) from [<c0fef2cc>] (dsa_unregister_switch+0x124/0x250) [ 61.551597] r7:c7bc370c r6:c7bc3814 r5:c7bbe340 r4:c7bc3800 [ 61.557328] [<c0fef1a8>] (dsa_unregister_switch) from [<c0ba2f8c>] (bcm_sf2_sw_remove+0x98/0x13c) [ 61.566388] r10:00000003 r9:c2824140 r8:c387f844 r7:c387f848 r6:c7bb8e40 r5:c7bd3c88 [ 61.574315] r4:c7bd3c40 [ 61.576895] [<c0ba2ef4>] (bcm_sf2_sw_remove) from [<c0a57ae8>] (platform_remove+0x44/0x5c) [ 61.585333] r7:c387f848 r6:c287aaf4 r5:c387f810 r4:c287aaf4 [ 61.591063] [<c0a57aa4>] (platform_remove) from [<c0a54930>] (device_release_driver_internal+0x150/0x254) [ 61.600807] r5:c387e810 r4:c387f810 [ 61.604438] [<c0a547e0>] (device_release_driver_internal) from [<c0a4d040>] (device_links_unbind_consumers+0xf8/0x12c) [ 61.615327] r9:c83ec9e8 r8:c2820bc0 r7:c387c810 r6:00000004 r5:c387c870 r4:c387f810 [ 61.623162] [<c0a4cf48>] (device_links_unbind_consumers) from [<c0a54864>] (device_release_driver_internal+0x84/0x254) [ 61.634053] r10:c6de5c30 r9:cbc1d510 r8:c2824190 r7:c387c848 r6:c287db34 r5:c33dd010 [ 61.641984] r4:c387c810 r3:00000000 [ 61.645614] [<c0a547e0>] (device_release_driver_internal) from [<c0a54a88>] (device_driver_detach+0x30/0x34) [ 61.655631] r9:cbc1d510 r8:c2824190 r7:c387c810 r6:00000011 r5:c287db34 r4:c387c810 [ 61.663466] [<c0a54a58>] (device_driver_detach) from [<c0a514d8>] (unbind_store+0x90/0x134) [ 61.671974] r5:c287db34 r4:c2824140 [ 61.675605] [<c0a51448>] (unbind_store) from [<c0a50108>] (drv_attr_store+0x50/0x5c) [ 61.683510] r9:cbc1d510 r8:cbc17b80 r7:00000011 r6:cbc17b80 r5:c6e19400 r4:c0a51448 [ 61.691347] [<c0a500b8>] (drv_attr_store) from [<c055a0e4>] (sysfs_kf_write+0x90/0x9c) [ 61.699427] r7:c6e19400 r6:c6de5c30 r5:00000011 r4:c0a500b8 [ 61.705156] [<c055a054>] (sysfs_kf_write) from [<c055885c>] (kernfs_fop_write+0x178/0x2b4) [ 61.713594] r9:cbc1d510 r8:cb9a3f20 r7:00000000 r6:00000000 r5:cbc1d500 r4:00000011 [ 61.721430] [<c05586e4>] (kernfs_fop_write) from [<c0471aa0>] (vfs_write+0x19c/0x644) [ 61.729431] r10:00000000 r9:000e95b0 r8:cb9a3ea0 r7:cb9a3f20 r6:c05586e4 r5:cbf5a500 [ 61.737357] r4:b87347b4 [ 61.739938] [<c0471904>] (vfs_write) from [<c04721a8>] (ksys_write+0xd4/0x170) [ 61.747324] r10:000e95b0 r9:00000000 r8:00000000 r7:cbf5a500 r6:c2606d48 r5:cbf5a500 [ 61.755250] r4:b87347e0 [ 61.757832] [<c04720d4>] (ksys_write) from [<c047225c>] (sys_write+0x18/0x1c) [ 61.765121] r10:00000004 r9:cb9a0000 r8:c0200228 r7:00000004 r6:aec27d60 r5:000e95b0 [ 61.773048] r4:00000011 [ 61.775629] [<c0472244>] (sys_write) from [<c0200060>] (ret_fast_syscall+0x0/0x2c) [ 61.783341] Exception stack(0xcb9a3fa8 to 0xcb9a3ff0) [ 61.788480] 3fa0: 00000011 000e95b0 00000001 000e95b0 00000011 00000001 [ 61.796776] 3fc0: 00000011 000e95b0 aec27d60 00000004 00000011 000a7f7a 000c21b4 00000000 [ 61.805064] 3fe0: 00000000 b66eda2c aeb5744c aebaf810 [ 61.810185] [ 61.811715] Allocated by task 31: [ 61.815085] kasan_save_stack+0x24/0x48 [ 61.819001] ____kasan_kmalloc.constprop.0+0x98/0xac [ 61.824056] __kasan_kmalloc+0x10/0x14 [ 61.827882] __kmalloc+0x16c/0x2dc [ 61.831355] kvmalloc_node+0x40/0x8c [ 61.835006] alloc_netdev_mqs+0x5c/0x50c [ 61.839012] dsa_slave_create+0x110/0x9c8 [ 61.843102] dsa_register_switch+0xdb0/0x13a4 [ 61.847548] b53_switch_register+0x47c/0x6dc [ 61.851906] bcm_sf2_sw_probe+0xaa4/0xc98 [ 61.855996] platform_probe+0x90/0xf4 [ 61.859741] really_probe+0x184/0x728 [ 61.863481] driver_probe_device+0xa4/0x278 [ 61.867745] __device_attach_driver+0xe8/0x148 [ 61.872274] bus_for_each_drv+0x108/0x158 [ 61.876361] __device_attach+0x190/0x234 [ 61.880362] device_initial_probe+0x1c/0x20 [ 61.884628] bus_probe_device+0xdc/0xec [ 61.888540] deferred_probe_work_func+0xd4/0x11c [ 61.893242] process_one_work+0x420/0x8f8 [ 61.897341] worker_thread+0x4fc/0x91c [ 61.901174] kthread+0x21c/0x22c [ 61.904478] ret_from_fork+0x14/0x20 [ 61.908125] 0x0 [ 61.910014] [ 61.911543] Freed by task 249: [ 61.914652] kasan_save_stack+0x24/0x48 [ 61.918567] kasan_set_track+0x30/0x38 [ 61.922393] kasan_set_free_info+0x30/0x3c [ 61.926575] ____kasan_slab_free+0xec/0x110 [ 61.930840] __kasan_slab_free+0x14/0x18 [ 61.934842] kfree+0xbc/0x2a8 [ 61.937877] kvfree+0x34/0x38 [ 61.940913] netdev_freemem+0x30/0x34 [ 61.944661] netdev_release+0x48/0x50 [ 61.948400] device_release+0x5c/0x100 [ 61.952234] kobject_put+0x14c/0x2d8 [ 61.955888] put_device+0x20/0x24 [ 61.959284] free_netdev+0x170/0x194 [ 61.962935] dsa_slave_destroy+0xac/0xb0 [ 61.966936] dsa_port_teardown.part.2+0xa0/0xb4 [ 61.971558] dsa_tree_teardown_switches+0x50/0xc4 [ 61.976354] dsa_unregister_switch+0x124/0x250 [ 61.980887] bcm_sf2_sw_remove+0x98/0x13c [ 61.984977] platform_remove+0x44/0x5c [ 61.988810] device_release_driver_internal+0x150/0x254 [ 61.994129] device_links_unbind_consumers+0xf8/0x12c [ 61.999266] device_release_driver_internal+0x84/0x254 [ 62.004497] device_driver_detach+0x30/0x34 [ 62.008762] unbind_store+0x90/0x134 [ 62.012411] drv_attr_store+0x50/0x5c [ 62.016145] sysfs_kf_write+0x90/0x9c [ 62.019885] kernfs_fop_write+0x178/0x2b4 [ 62.023975] vfs_write+0x19c/0x644 [ 62.027452] ksys_write+0xd4/0x170 [ 62.030930] sys_write+0x18/0x1c [ 62.034232] ret_fast_syscall+0x0/0x2c [ 62.038056] 0xb66eda2c [ 62.040557] [ 62.042085] The buggy address belongs to the object at c7cf0000 [ 62.042085] which belongs to the cache kmalloc-2k of size 2048 [ 62.054030] The buggy address is located 168 bytes inside of [ 62.054030] 2048-byte region [c7cf0000, c7cf0800) [ 62.064583] The buggy address belongs to the page: [ 62.069435] page:17c208db refcount:1 mapcount:0 mapping:00000000 index:0x0 pfn:0x47cf0 [ 62.077462] head:17c208db order:3 compound_mapcount:0 compound_pincount:0 [ 62.084334] flags: 0x4010200(slab|head) [ 62.088271] raw: 04010200 00000000 00000100 00000122 c3001800 00000000 00080008 00000000 [ 62.096464] raw: ffffffff 00000001 [ 62.099922] page dumped because: kasan: bad access detected [ 62.105562] [ 62.107089] Memory state around the buggy address: [ 62.111947] c7ceff80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 62.118558] c7cf0000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 62.125168] >c7cf0080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 62.131770] ^ [ 62.135667] c7cf0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 62.142278] c7cf0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 62.148880] ================================================================== [ 62.156183] Disabling lock debugging due to kernel taint [ 62.162744] DSA: tree 0 torn down [ 62.171251] brcm-systemport 9300000.ethernet eth0: Link is Down PS: I realize that I never responded to my RFC but I believe that with a cascaded switch what you were seeing was consistent in that to fully unbind all the DSA peripherals you would have to unbind from the deepest level all the way down to the DSA master.
On Mon, Jan 11, 2021 at 03:40:46PM -0800, Florian Fainelli wrote: > ================================================================== > [ 61.359856] BUG: KASAN: use-after-free in devlink_nl_port_fill+0x578/0x7a8 > [ 61.366847] Read of size 4 at addr c7cf00a8 by task sh/249 > [ 61.372413] > [ 61.470930] [<c0eb5458>] (devlink_nl_port_fill) from [<c0eb5e20>] (devlink_port_notify+0x80/0x12c) > [ 61.490566] [<c0eb5da0>] (devlink_port_notify) from [<c0eb6558>] (devlink_port_unregister+0x5c/0x108) > [ 61.505678] [<c0eb64fc>] (devlink_port_unregister) from [<c0feef60>] (dsa_switch_teardown.part.3+0x188/0x18c) > [ 61.523621] [<c0feedd8>] (dsa_switch_teardown.part.3) from [<c0fef00c>] (dsa_tree_teardown_switches+0xa8/0xc4) > [ 61.541663] [<c0feef64>] (dsa_tree_teardown_switches) from [<c0fef2cc>] (dsa_unregister_switch+0x124/0x250) > [ 61.557328] [<c0fef1a8>] (dsa_unregister_switch) from [<c0ba2f8c>] (bcm_sf2_sw_remove+0x98/0x13c) > [ 61.576895] [<c0ba2ef4>] (bcm_sf2_sw_remove) from [<c0a57ae8>] (platform_remove+0x44/0x5c) > [ 61.591063] [<c0a57aa4>] (platform_remove) from [<c0a54930>] (device_release_driver_internal+0x150/0x254) > [ 61.604438] [<c0a547e0>] (device_release_driver_internal) from [<c0a4d040>] (device_links_unbind_consumers+0xf8/0x12c) > [ 61.623162] [<c0a4cf48>] (device_links_unbind_consumers) from [<c0a54864>] (device_release_driver_internal+0x84/0x254) > [ 61.645614] [<c0a547e0>] (device_release_driver_internal) from [<c0a54a88>] (device_driver_detach+0x30/0x34) > [ 61.663466] [<c0a54a58>] (device_driver_detach) from [<c0a514d8>] (unbind_store+0x90/0x134) > [ 61.675605] [<c0a51448>] (unbind_store) from [<c0a50108>] (drv_attr_store+0x50/0x5c) > [ 61.810185] > [ 61.811715] Allocated by task 31: > [ 61.835006] alloc_netdev_mqs+0x5c/0x50c > [ 61.839012] dsa_slave_create+0x110/0x9c8 > [ 61.843102] dsa_register_switch+0xdb0/0x13a4 > [ 61.847548] b53_switch_register+0x47c/0x6dc > [ 61.851906] bcm_sf2_sw_probe+0xaa4/0xc98 > [ 61.855996] platform_probe+0x90/0xf4 > [ 61.859741] really_probe+0x184/0x728 > [ 61.863481] driver_probe_device+0xa4/0x278 > [ 61.867745] __device_attach_driver+0xe8/0x148 > [ 61.872274] bus_for_each_drv+0x108/0x158 > [ 61.876361] __device_attach+0x190/0x234 > [ 61.880362] device_initial_probe+0x1c/0x20 > [ 61.884628] bus_probe_device+0xdc/0xec > [ 61.888540] deferred_probe_work_func+0xd4/0x11c > [ 61.893242] process_one_work+0x420/0x8f8 > [ 61.897341] worker_thread+0x4fc/0x91c > [ 61.901174] kthread+0x21c/0x22c > [ 61.904478] ret_from_fork+0x14/0x20 > [ 61.908125] 0x0 > [ 61.910014] > [ 61.911543] Freed by task 249: > [ 61.934842] kfree+0xbc/0x2a8 > [ 61.937877] kvfree+0x34/0x38 > [ 61.940913] netdev_freemem+0x30/0x34 > [ 61.944661] netdev_release+0x48/0x50 > [ 61.948400] device_release+0x5c/0x100 > [ 61.952234] kobject_put+0x14c/0x2d8 > [ 61.955888] put_device+0x20/0x24 > [ 61.959284] free_netdev+0x170/0x194 > [ 61.962935] dsa_slave_destroy+0xac/0xb0 > [ 61.966936] dsa_port_teardown.part.2+0xa0/0xb4 > [ 61.971558] dsa_tree_teardown_switches+0x50/0xc4 > [ 61.976354] dsa_unregister_switch+0x124/0x250 > [ 61.980887] bcm_sf2_sw_remove+0x98/0x13c > [ 61.984977] platform_remove+0x44/0x5c > [ 61.988810] device_release_driver_internal+0x150/0x254 > [ 61.994129] device_links_unbind_consumers+0xf8/0x12c > [ 61.999266] device_release_driver_internal+0x84/0x254 > [ 62.004497] device_driver_detach+0x30/0x34 > [ 62.008762] unbind_store+0x90/0x134 > [ 62.012411] drv_attr_store+0x50/0x5c > [ 62.016145] sysfs_kf_write+0x90/0x9c > [ 62.019885] kernfs_fop_write+0x178/0x2b4 > [ 62.023975] vfs_write+0x19c/0x644 > [ 62.027452] ksys_write+0xd4/0x170 > [ 62.030930] sys_write+0x18/0x1c > [ 62.034232] ret_fast_syscall+0x0/0x2c > [ 62.038056] 0xb66eda2c What is allocated in dsa_slave_create, freed in dsa_slave_destroy and used in devlink_nl_port_fill via devlink_port->type_dev? It's the good old net_device! Either this is a devlink design bug or we're all holding it wrong. In mlxsw, mlxsw_core_port_fini (which calls devlink_port_unregister) is called after unregister_netdev && free_netdev, just like in DSA's case. Apparently devlink_port_unregister notifies of port deletion over netlink, and accesses the freshly freed net_device in the meanwhile. Adding Jiri to figure out which one it is.
diff --git a/net/dsa/master.c b/net/dsa/master.c index 5a0f6fec4271..cb3a5cf99b25 100644 --- a/net/dsa/master.c +++ b/net/dsa/master.c @@ -309,8 +309,18 @@ static struct lock_class_key dsa_master_addr_list_lock_key; int dsa_master_setup(struct net_device *dev, struct dsa_port *cpu_dp) { int mtu = ETH_DATA_LEN + cpu_dp->tag_ops->overhead; + struct dsa_switch *ds = cpu_dp->ds; + struct device_link *consumer_link; int ret; + /* The DSA master must use SET_NETDEV_DEV for this to work. */ + consumer_link = device_link_add(ds->dev, dev->dev.parent, + DL_FLAG_AUTOREMOVE_CONSUMER); + if (!consumer_link) + netdev_err(dev, + "Failed to create a device link to DSA switch %s\n", + dev_name(ds->dev)); + rtnl_lock(); ret = dev_set_mtu(dev, mtu); rtnl_unlock();