[5.8,024/232] btrfs: open device without device_list_mutex

Message ID

20200820091613.922748607@linuxfoundation.org

State

Superseded

Headers

From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Josef Bacik <josef@toxicpanda.com>,
	David Sterba <dsterba@suse.com>
Subject: [PATCH 5.8 024/232] btrfs: open device without device_list_mutex
Date: Thu, 20 Aug 2020 11:17:55 +0200
Message-Id: <20200820091613.922748607@linuxfoundation.org>
In-Reply-To: <20200820091612.692383444@linuxfoundation.org>
References: <20200820091612.692383444@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: stable-owner@vger.kernel.org
Precedence: bulk

Commit Message

Greg KH Aug. 20, 2020, 9:17 a.m. UTC

From: Josef Bacik <josef@toxicpanda.com>

commit 18c850fdc5a801bad4977b0f1723761d42267e45 upstream.

There has long been a lockdep splat because we open our bdevs under
the ->device_list_mutex at mount time, which acquires the bd_mutex.
Usually this goes unnoticed, but as soon as you use loopback devices
the bd_mutex comes with a whole host of other dependencies, which
results in the splat when you mount a btrfs file system.

======================================================
WARNING: possible circular locking dependency detected
5.8.0-0.rc3.1.fc33.x86_64+debug #1 Not tainted
------------------------------------------------------
systemd-journal/509 is trying to acquire lock:
ffff970831f84db0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_record_root_in_trans+0x44/0x70 [btrfs]

but task is already holding lock:
ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

 -> #6 (sb_pagefaults){.+.+}-{0:0}:
       __sb_start_write+0x13e/0x220
       btrfs_page_mkwrite+0x59/0x560 [btrfs]
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       asm_exc_page_fault+0x1e/0x30

 -> #5 (&mm->mmap_lock#2){++++}-{3:3}:
       __might_fault+0x60/0x80
       _copy_from_user+0x20/0xb0
       get_sg_io_hdr+0x9a/0xb0
       scsi_cmd_ioctl+0x1ea/0x2f0
       cdrom_ioctl+0x3c/0x12b4
       sr_block_ioctl+0xa4/0xd0
       block_ioctl+0x3f/0x50
       ksys_ioctl+0x82/0xc0
       __x64_sys_ioctl+0x16/0x20
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #4 (&cd->lock){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       sr_block_open+0xa2/0x180
       __blkdev_get+0xdd/0x550
       blkdev_get+0x38/0x150
       do_dentry_open+0x16b/0x3e0
       path_openat+0x3c9/0xa00
       do_filp_open+0x75/0x100
       do_sys_openat2+0x8a/0x140
       __x64_sys_openat+0x46/0x70
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #3 (&bdev->bd_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       __blkdev_get+0x6a/0x550
       blkdev_get+0x85/0x150
       blkdev_get_by_path+0x2c/0x70
       btrfs_get_bdev_and_sb+0x1b/0xb0 [btrfs]
       open_fs_devices+0x88/0x240 [btrfs]
       btrfs_open_devices+0x92/0xa0 [btrfs]
       btrfs_mount_root+0x250/0x490 [btrfs]
       legacy_get_tree+0x30/0x50
       vfs_get_tree+0x28/0xc0
       vfs_kern_mount.part.0+0x71/0xb0
       btrfs_mount+0x119/0x380 [btrfs]
       legacy_get_tree+0x30/0x50
       vfs_get_tree+0x28/0xc0
       do_mount+0x8c6/0xca0
       __x64_sys_mount+0x8e/0xd0
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #2 (&fs_devs->device_list_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       btrfs_run_dev_stats+0x36/0x420 [btrfs]
       commit_cowonly_roots+0x91/0x2d0 [btrfs]
       btrfs_commit_transaction+0x4e6/0x9f0 [btrfs]
       btrfs_sync_file+0x38a/0x480 [btrfs]
       __x64_sys_fdatasync+0x47/0x80
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #1 (&fs_info->tree_log_mutex){+.+.}-{3:3}:
       __mutex_lock+0x7b/0x820
       btrfs_commit_transaction+0x48e/0x9f0 [btrfs]
       btrfs_sync_file+0x38a/0x480 [btrfs]
       __x64_sys_fdatasync+0x47/0x80
       do_syscall_64+0x52/0xb0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9

 -> #0 (&fs_info->reloc_mutex){+.+.}-{3:3}:
       __lock_acquire+0x1241/0x20c0
       lock_acquire+0xb0/0x400
       __mutex_lock+0x7b/0x820
       btrfs_record_root_in_trans+0x44/0x70 [btrfs]
       start_transaction+0xd2/0x500 [btrfs]
       btrfs_dirty_inode+0x44/0xd0 [btrfs]
       file_update_time+0xc6/0x120
       btrfs_page_mkwrite+0xda/0x560 [btrfs]
       do_page_mkwrite+0x4f/0x130
       do_wp_page+0x3b0/0x4f0
       handle_mm_fault+0xf47/0x1850
       do_user_addr_fault+0x1fc/0x4b0
       exc_page_fault+0x88/0x300
       asm_exc_page_fault+0x1e/0x30

other info that might help us debug this:

Chain exists of:
  &fs_info->reloc_mutex --> &mm->mmap_lock#2 --> sb_pagefaults

Possible unsafe locking scenario:

     CPU0                    CPU1
     ----                    ----
 lock(sb_pagefaults);
                             lock(&mm->mmap_lock#2);
                             lock(sb_pagefaults);
 lock(&fs_info->reloc_mutex);

 *** DEADLOCK ***

3 locks held by systemd-journal/509:
 #0: ffff97083bdec8b8 (&mm->mmap_lock#2){++++}-{3:3}, at: do_user_addr_fault+0x12e/0x4b0
 #1: ffff97083144d598 (sb_pagefaults){.+.+}-{0:0}, at: btrfs_page_mkwrite+0x59/0x560 [btrfs]
 #2: ffff97083144d6a8 (sb_internal){.+.+}-{0:0}, at: start_transaction+0x3f8/0x500 [btrfs]

stack backtrace:
CPU: 0 PID: 509 Comm: systemd-journal Not tainted 5.8.0-0.rc3.1.fc33.x86_64+debug #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
 dump_stack+0x92/0xc8
 check_noncircular+0x134/0x150
 __lock_acquire+0x1241/0x20c0
 lock_acquire+0xb0/0x400
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 ? lock_acquire+0xb0/0x400
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 __mutex_lock+0x7b/0x820
 ? btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 ? kvm_sched_clock_read+0x14/0x30
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0xc/0xb0
 btrfs_record_root_in_trans+0x44/0x70 [btrfs]
 start_transaction+0xd2/0x500 [btrfs]
 btrfs_dirty_inode+0x44/0xd0 [btrfs]
 file_update_time+0xc6/0x120
 btrfs_page_mkwrite+0xda/0x560 [btrfs]
 ? sched_clock+0x5/0x10
 do_page_mkwrite+0x4f/0x130
 do_wp_page+0x3b0/0x4f0
 handle_mm_fault+0xf47/0x1850
 do_user_addr_fault+0x1fc/0x4b0
 exc_page_fault+0x88/0x300
 ? asm_exc_page_fault+0x8/0x30
 asm_exc_page_fault+0x1e/0x30
RIP: 0033:0x7fa3972fdbfe
Code: Bad RIP value.

Fix this by not holding the ->device_list_mutex at this point.  The
device_list_mutex exists to protect us from modifying the device list
while the file system is running.

However it can also be modified by doing a scan on a device.  But this
action is specifically protected by the uuid_mutex, which we are holding
here.  We cannot race with opening at this point because we have the
->s_mount lock held during the mount.  Not having the
->device_list_mutex here is perfectly safe as we're not going to change
the devices at this point.

CC: stable@vger.kernel.org # 4.19+
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add some comments ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 fs/btrfs/volumes.c |   21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -245,7 +245,9 @@  static int __btrfs_map_block(struct btrf
  *
  * global::fs_devs - add, remove, updates to the global list
  *
- * does not protect: manipulation of the fs_devices::devices list!
+ * does not protect: manipulation of the fs_devices::devices list in general
+ * but in mount context it could be used to exclude list modifications by eg.
+ * scan ioctl
  *
  * btrfs_device::name - renames (write side), read is RCU
  *
@@ -258,6 +260,9 @@  static int __btrfs_map_block(struct btrf
  * may be used to exclude some operations from running concurrently without any
  * modifications to the list (see write_all_supers)
  *
+ * Is not required at mount and close times, because our device list is
+ * protected by the uuid_mutex at that point.
+ *
  * balance_mutex
  * -------------
  * protects balance structures (status, state) and context accessed from
@@ -602,6 +607,11 @@  static int btrfs_free_stale_devices(cons
 	return ret;
 }
 
+/*
+ * This is only used on mount, and we are protected from competing things
+ * messing with our fs_devices by the uuid_mutex, thus we do not need the
+ * fs_devices->device_list_mutex here.
+ */
 static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices,
 			struct btrfs_device *device, fmode_t flags,
 			void *holder)
@@ -1229,8 +1239,14 @@  int btrfs_open_devices(struct btrfs_fs_d
 	int ret;
 
 	lockdep_assert_held(&uuid_mutex);
+	/*
+	 * The device_list_mutex cannot be taken here in case opening the
+	 * underlying device takes further locks like bd_mutex.
+	 *
+	 * We also don't need the lock here as this is called during mount and
+	 * exclusion is provided by uuid_mutex
+	 */
 
-	mutex_lock(&fs_devices->device_list_mutex);
 	if (fs_devices->opened) {
 		fs_devices->opened++;
 		ret = 0;
@@ -1238,7 +1254,6 @@  int btrfs_open_devices(struct btrfs_fs_d
 		list_sort(NULL, &fs_devices->devices, devid_cmp);
 		ret = open_fs_devices(fs_devices, flags, holder);
 	}
-	mutex_unlock(&fs_devices->device_list_mutex);
 
 	return ret;
 }
