From patchwork Mon Aug 21 14:15:12 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Muhammad Usama Anjum
X-Patchwork-Id: 715604
From: Muhammad Usama Anjum
To: Peter Xu, Andrew Morton,
    Michał Mirosław, Andrei Vagin, Danylo Mocherniuk, Paul Gofman
Cc: Alexander Viro, Shuah Khan, Christian Brauner, Yang Shi,
    Vlastimil Babka, "Liam R . Howlett", Yun Zhou, Suren Baghdasaryan,
    Alex Sierra, Muhammad Usama Anjum, Matthew Wilcox, Pasha Tatashin,
    Axel Rasmussen, "Gustavo A . R . Silva", Dan Williams,
    linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
    linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Greg KH,
    kernel@collabora.com, Cyrill Gorcunov, Mike Rapoport, Nadav Amit,
    David Hildenbrand
Subject: [PATCH v33 0/6] Implement IOCTL to get and optionally clear info about PTEs
Date: Mon, 21 Aug 2023 19:15:12 +0500
Message-Id: <20230821141518.870589-1-usama.anjum@collabora.com>
X-Mailing-List: linux-kselftest@vger.kernel.org

*Changes in v33*:
- Add PAGE_IS_FILE support for THPs

*Changes in v31 and v32*:
- Minor updates

*Changes in v30*:
- Rebase on top of next-20230815
- Minor nitpicks

*Changes in v29:*
- Polish IOCTL and improve documentation

*Changes in v28:*
- Fix walk_end and add 17 test cases in selftests patch

*Changes in v27:*
- Handle review comments and minor improvements
- Add performance improvement patch on top with test for easy review

*Changes in v26:*
- Code restructuring and API changes in PAGEMAP_IOCTL

*Changes in v25*:
- Do proper filtering on hole as well (hole got missed earlier)

*Changes in v24*:
- Rebase on top of next-20230710
- Place WP markers in case of hole as well

*Changes in v23*:
- Set vec_buf_index in loop only when vec_buf_index is set
- Return -EFAULT instead of -EINVAL if vec is NULL
- Correctly return the walk ending address to the page granularity

*Changes in v22*:
- Interface change:
  - Replace [start, start + len) with [start, end)
  - Return the ending address of the address walk in start

*Changes in v21*:
- Abort walk instead of returning error if WP is to be performed on
  partial hugetlb

*Changes in v20*
- Correct PAGE_IS_FILE and add PAGE_IS_PFNZERO

*Changes in v19*
- Minor changes and interface updates

*Changes in v18*
- Rebase on top of next-20230613
- Minor updates

*Changes in v17*
- Rebase on top of next-20230606
- Minor improvements in PAGEMAP_SCAN IOCTL patch

*Changes in v16*
- Fix a corner case
- Add exclusive PM_SCAN_OP_WP back

*Changes in v15*
- Build fix (Add missed build fix in RESEND)

*Changes in v14*
- Fix build error caused by #ifdef added at last minute in some configs

*Changes in v13*
- Rebase on top of next-20230414
- Give up on using uffd_wp_range() and write new helpers, flush tlb only
  once

*Changes in v12*
- Update and other memory types to UFFD_FEATURE_WP_ASYNC
- Rebase on top of next-20230406
- Review updates

*Changes in v11*
- Rebase on top of next-20230307
- Base patches on UFFD_FEATURE_WP_UNPOPULATED
- Do a lot of cosmetic changes and review updates
- Remove ENGAGE_WP + !GET operation as it can be performed with
  UFFDIO_WRITEPROTECT

*Changes in v10*
- Add specific condition to return error if hugetlb is used with wp
  async
- Move changes in tools/include/uapi/linux/fs.h to separate patch
- Add documentation

*Changes in v9:*
- Correct fault resolution for userfaultfd wp async
- Fix build warnings and errors which were happening on some configs
- Simplify pagemap ioctl's code

*Changes in v8:*
- Update uffd async wp implementation
- Improve PAGEMAP_IOCTL implementation

*Changes in v7:*
- Add uffd wp async
- Update the IOCTL to use uffd under the hood instead of soft-dirty
  flags

*Motivation*
The real motivation for adding the PAGEMAP_SCAN IOCTL is to emulate the
Windows GetWriteWatch() and ResetWriteWatch() syscalls [1].
GetWriteWatch() retrieves the addresses of the pages that have been
written to in a region of virtual memory. This syscall is used by
Windows applications and games. It is currently emulated in userspace in
a rather slow manner.
Our purpose is to enhance the kernel so that this syscall can be
emulated efficiently. Currently, some out-of-tree hack patches are used
to emulate it efficiently in some kernels. We intend to replace those
with these patches, so that gaming on Linux as a whole can benefit. It
means there would be tons of users of this code.

CRIU use case [2] was mentioned by Andrei and Danylo:
> Use cases for migrating sparse VMAs are binaries sanitized with ASAN,
> MSAN or TSAN [3]. All of these sanitizers produce sparse mappings of
> shadow memory [4]. Being able to migrate such binaries allows to highly
> reduce the amount of work needed to identify and fix post-migration
> crashes, which happen constantly.

Andrei defines the following uses of this code:
* It is more granular and allows us to track changed pages more
  effectively. The current interface can clear dirty bits for the entire
  process only. In addition, reading info about pages is a separate
  operation. It means we must freeze the process to read information
  about all its pages and reset dirty bits, and only then can we start
  dumping pages. The information about pages becomes more and more
  outdated while we are processing pages. The new interface solves both
  of these downsides. First, it allows us to read pte bits and clear the
  soft-dirty bit atomically. It means that CRIU will not need to freeze
  processes to pre-dump their memory. Second, it clears soft-dirty bits
  for a specified region of memory. It means CRIU will have up-to-date
  info about pages at the moment of dumping them.
* The new interface should be much faster because basic page filtering
  happens in the kernel. With the old interface, we have to read pagemap
  for each page.

*Implementation Evolution (Short Summary)*