[5.14,041/334] io-wq: remove GFP_ATOMIC allocation off schedule out path

From: Jens Axboe <axboe@kernel.dk>

[ Upstream commit d3e9f732c415cf22faa33d6f195e291ad82dc92e ]

Daniel reports that the v5.14-rc4-rt4 kernel throws a BUG when running
stress-ng:

| [   90.202543] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:35
| [   90.202549] in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 2047, name: iou-wrk-2041
| [   90.202555] CPU: 5 PID: 2047 Comm: iou-wrk-2041 Tainted: G        W         5.14.0-rc4-rt4+ #89
| [   90.202559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
| [   90.202561] Call Trace:
| [   90.202577]  dump_stack_lvl+0x34/0x44
| [   90.202584]  ___might_sleep.cold+0x87/0x94
| [   90.202588]  rt_spin_lock+0x19/0x70
| [   90.202593]  ___slab_alloc+0xcb/0x7d0
| [   90.202598]  ? newidle_balance.constprop.0+0xf5/0x3b0
| [   90.202603]  ? dequeue_entity+0xc3/0x290
| [   90.202605]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202610]  ? pick_next_task_fair+0xb9/0x330
| [   90.202612]  ? __schedule+0x670/0x1410
| [   90.202615]  ? io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202618]  kmem_cache_alloc_trace+0x79/0x1f0
| [   90.202621]  io_wqe_dec_running.isra.0+0x98/0xe0
| [   90.202625]  io_wq_worker_sleeping+0x37/0x50
| [   90.202628]  schedule+0x30/0xd0
| [   90.202630]  schedule_timeout+0x8f/0x1a0
| [   90.202634]  ? __bpf_trace_tick_stop+0x10/0x10
| [   90.202637]  io_wqe_worker+0xfd/0x320
| [   90.202641]  ? finish_task_switch.isra.0+0xd3/0x290
| [   90.202644]  ? io_worker_handle_work+0x670/0x670
| [   90.202646]  ? io_worker_handle_work+0x670/0x670
| [   90.202649]  ret_from_fork+0x22/0x30

which is due to the RT kernel not liking a GFP_ATOMIC allocation inside
a raw spinlock. Besides that not working on RT, doing any kind of
allocation from inside schedule() is kind of nasty and should be avoided
if at all possible.

This particular path happens when an io-wq worker goes to sleep, and we
need a new worker to handle pending work. We currently allocate a small
data item to hold the information we need to create a new worker, but we
can instead include this data in the io_worker struct itself and just
protect it with a single bit lock. We only really need one per worker
anyway, as we will have run pending work between two sleep cycles.
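
As an illustration only, the idea of embedding the creation data in the
worker and guarding it with a single bit can be modelled in userspace C.
This is a minimal sketch using C11 atomics, not the code from fs/io-wq.c;
every name in it (struct worker, struct create_request, CREATE_BUSY,
worker_queue_create, worker_finish_create) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; names are illustrative, not from fs/io-wq.c. */
struct create_request {
	int target_queue;		/* whatever the creator needs to know */
};

struct worker {
	atomic_ulong create_state;	/* bit 0 acts as the lock */
	struct create_request create;	/* embedded: no allocation when going to sleep */
};

#define CREATE_BUSY 1UL

/* Try to claim the embedded slot. Never blocks and never allocates. */
static bool worker_queue_create(struct worker *w, int target_queue)
{
	unsigned long old = atomic_fetch_or_explicit(&w->create_state,
						     CREATE_BUSY,
						     memory_order_acquire);
	if (old & CREATE_BUSY)
		return false;	/* a request for this worker is already in flight */

	w->create.target_queue = target_queue;
	return true;
}

/* Runs later, in a context that may sleep; releases the slot when done. */
static int worker_finish_create(struct worker *w)
{
	int target;

	/* Only the holder of the bit may get here. */
	assert(atomic_load_explicit(&w->create_state,
				    memory_order_relaxed) & CREATE_BUSY);

	target = w->create.target_queue;
	atomic_fetch_and_explicit(&w->create_state, ~CREATE_BUSY,
				  memory_order_release);
	return target;
}
```

A second attempt to queue a request while the bit is held simply fails
instead of allocating, which is acceptable under the assumption stated
above that one outstanding request per worker is enough.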

https://lore.kernel.org/lkml/20210804082418.fbibprcwtzyt5qax@beryllium.lan/
Reported-by: Daniel Wagner <dwagner@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/io-wq.c | 72 ++++++++++++++++++++++++++++++------------------------
 1 file changed, 40 insertions(+), 32 deletions(-)

Message ID: 20210913131114.834260017@linuxfoundation.org
State: New

Return-Path: <stable-owner@kernel.org>
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
    stable@vger.kernel.org, Daniel Wagner <dwagner@suse.de>,
    "Peter Zijlstra (Intel)" <peterz@infradead.org>,
    Jens Axboe <axboe@kernel.dk>, Sasha Levin <sashal@kernel.org>
Subject: [PATCH 5.14 041/334] io-wq: remove GFP_ATOMIC allocation off schedule out path
Date: Mon, 13 Sep 2021 15:11:35 +0200
Message-Id: <20210913131114.834260017@linuxfoundation.org>
In-Reply-To: <20210913131113.390368911@linuxfoundation.org>
References: <20210913131113.390368911@linuxfoundation.org>
User-Agent: quilt/0.66
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Precedence: bulk