@@ -20,10 +20,10 @@ On x86:
- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock
-- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock is
- taken inside kvm->arch.mmu_lock, and cannot be taken without already
- holding kvm->arch.mmu_lock (typically with ``read_lock``, otherwise
- there's no need to take kvm->arch.tdp_mmu_pages_lock at all).
+- kvm->arch.mmu_lock is an rwlock. kvm->arch.tdp_mmu_pages_lock and
+ kvm->arch.mmu_unsync_pages_lock are taken inside kvm->arch.mmu_lock, and
+ cannot be taken without already holding kvm->arch.mmu_lock (typically with
+ ``read_lock`` for the TDP MMU, thus the need for additional spinlocks).
Everything else is a leaf: no other lock is taken inside the critical
sections.
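
The ordering rule above is easier to see in code than in prose. Below is a
minimal user-space C sketch of the same nesting, with pthread primitives
standing in for the kernel's rwlock_t/spinlock_t; the lock and function
names are illustrative, not KVM code. The point is only that the inner
spinlock is taken while the mmu_lock stand-in is already held, typically
for read:

#include <pthread.h>

/* User-space stand-ins for kvm->arch.mmu_lock and an inner spinlock. */
static pthread_rwlock_t mmu_lock = PTHREAD_RWLOCK_INITIALIZER;
static pthread_spinlock_t inner_pages_lock;	/* init shown at the end */

static void update_shared_list(void)
{
	/* Outer rwlock first; TDP MMU fault paths take it for read. */
	pthread_rwlock_rdlock(&mmu_lock);

	/*
	 * Inner spinlock second, and only while the rwlock is held.
	 * Holding the rwlock for read does not exclude other readers,
	 * hence the extra spinlock to serialize list updates.
	 */
	pthread_spin_lock(&inner_pages_lock);
	/* ... mutate the list guarded by the inner lock ... */
	pthread_spin_unlock(&inner_pages_lock);

	pthread_rwlock_unlock(&mmu_lock);
}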
@@ -987,6 +987,13 @@ struct kvm_arch {
struct list_head lpage_disallowed_mmu_pages;
struct kvm_page_track_notifier_node mmu_sp_tracker;
struct kvm_page_track_notifier_head track_notifier_head;
+ /*
+ * Protects marking pages unsync during page faults, as TDP MMU page
+ * faults only take mmu_lock for read. For simplicity, the unsync
+ * pages lock is always taken when marking pages unsync regardless of
+ * whether mmu_lock is held for read or write.
+ */
+ spinlock_t mmu_unsync_pages_lock;
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
@@ -2454,6 +2454,7 @@ bool mmu_need_write_protect(struct kvm_v
bool can_unsync)
{
struct kvm_mmu_page *sp;
+ bool locked = false;
if (kvm_page_track_is_active(vcpu, gfn, KVM_PAGE_TRACK_WRITE))
return true;
@@ -2465,9 +2466,34 @@ bool mmu_need_write_protect(struct kvm_v
if (sp->unsync)
continue;
+ /*
+ * TDP MMU page faults require an additional spinlock as they
+ * run with mmu_lock held for read, not write, and the unsync
+ * logic is not thread safe. Take the spinlock regardless of
+ * the MMU type to avoid extra conditionals/parameters; there's
+ * no meaningful penalty if mmu_lock is held for write.
+ */
+ if (!locked) {
+ locked = true;
+ spin_lock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
+
+ /*
+ * Recheck after taking the spinlock; a different vCPU
+ * may have since marked the page unsync. A false
+ * positive on the unprotected check above is not
+ * possible, as clearing sp->unsync _must_ hold mmu_lock
+ * for write, i.e. unsync cannot transition from 1->0
+ * while this CPU holds mmu_lock for read (or write).
+ */
+ if (READ_ONCE(sp->unsync))
+ continue;
+ }
+
WARN_ON(sp->role.level != PG_LEVEL_4K);
kvm_unsync_page(vcpu, sp);
}
+ if (locked)
+ spin_unlock(&vcpu->kvm->arch.mmu_unsync_pages_lock);
/*
* We need to ensure that the marking of unsync pages is visible
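
The lock-on-first-use shape above, where the spinlock is taken lazily and
sp->unsync is rechecked under it, is a double-checked pattern. A
stripped-down user-space C sketch of the same shape (hypothetical names,
a pthread spinlock standing in for the kernel's; like the KVM code, it
relies on the flag never being cleared concurrently):

#include <pthread.h>
#include <stdbool.h>

struct page { bool unsync; };

static pthread_spinlock_t unsync_lock;	/* init shown at the end */

/* Mark every page unsync, taking the lock only on first need. */
static void mark_pages_unsync(struct page **pages, int n)
{
	bool locked = false;

	for (int i = 0; i < n; i++) {
		struct page *p = pages[i];

		if (p->unsync)		/* cheap unlocked check */
			continue;

		if (!locked) {
			locked = true;
			pthread_spin_lock(&unsync_lock);
			/*
			 * Recheck under the lock: another thread may
			 * have set the flag between the unlocked check
			 * and acquiring the lock. (KVM uses READ_ONCE
			 * here for the same reason.)
			 */
			if (p->unsync)
				continue;
		}
		p->unsync = true;
	}
	if (locked)
		pthread_spin_unlock(&unsync_lock);
}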
@@ -5514,6 +5540,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
{
struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
+ spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
+
kvm_mmu_init_tdp_mmu(kvm);
node->track_write = kvm_mmu_pte_write;
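
For completeness of the user-space sketches above: unlike the kernel's
spin_lock_init() added here in kvm_mmu_init_vm(), pthread spinlocks need
an explicit runtime init, e.g.:

/* One-time setup, analogous to the spin_lock_init() call above. */
static void init_sketch_locks(void)
{
	pthread_spin_init(&inner_pages_lock, PTHREAD_PROCESS_PRIVATE);
	pthread_spin_init(&unsync_lock, PTHREAD_PROCESS_PRIVATE);
}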