From patchwork Sat Sep 26 04:19:01 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 263433 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C88D9C4727E for ; Sat, 26 Sep 2020 04:19:03 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 8D7CD20882 for ; Sat, 26 Sep 2020 04:19:03 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093943; bh=AJxwQo/L+ID93H6tPKPieFQbSPWKO3Vas2ik0yvPZ+s=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=nRPF4w8b+pwxJzeGh4ZYzKN0I8Rml4jDVQf91hmEFogWJeVubDYT1QqRKmg416rUi Vl1MSwk9/hIV8QiZ733GYwzATjVyoHFAMWkhFNeNG5WYYbFtyvBMiVZk9VPSAJxpXZ LVPIJvRTqca25seLmW96npycZcP715unn8fWaS3A= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730036AbgIZETD (ORCPT ); Sat, 26 Sep 2020 00:19:03 -0400 Received: from mail.kernel.org ([198.145.29.99]:51592 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1730033AbgIZETD (ORCPT ); Sat, 26 Sep 2020 00:19:03 -0400 Received: from localhost.localdomain (c-71-198-47-131.hsd1.ca.comcast.net [71.198.47.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 136A02076D; Sat, 26 Sep 2020 04:19:02 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093942; bh=AJxwQo/L+ID93H6tPKPieFQbSPWKO3Vas2ik0yvPZ+s=; h=Date:From:To:Subject:In-Reply-To:From; b=Zed3OG/wK9qxn9wnbcZ9fmAs2DU7YwezSNB6Y9AsXjQzhtgMbre9rzAVd6MzH7LQd UFtvvFvKJKbQ44+3YsTtarIH+cCOiJ9Ak0aCVJDBRw5pg9eNFrFcJUK5XyPPJXWmos 6v5fAzaCzSye+xx718QJRjHbtkRgmoCJyqRZbB/g= Date: Fri, 25 Sep 2020 21:19:01 -0700 From: Andrew Morton To: akpm@linux-foundation.org, aquini@redhat.com, cmaiolino@redhat.com, david@fromorbit.com, esandeen@redhat.com, hsiangkao@redhat.com, linux-mm@kvack.org, mm-commits@vger.kernel.org, shy828301@gmail.com, stable@vger.kernel.org, torvalds@linux-foundation.org, willy@infradead.org, ying.huang@intel.com Subject: [patch 1/9] mm, THP, swap: fix allocating cluster for swapfile by mistake Message-ID: <20200926041901.P-A_BQZ2V%akpm@linux-foundation.org> In-Reply-To: <20200925211725.0fea54be9e9715486efea21f@linux-foundation.org> User-Agent: s-nail v14.8.16 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Gao Xiang Subject: mm, THP, swap: fix allocating cluster for swapfile by mistake SWP_FS is used to make swap_{read,write}page() go through the filesystem, and it's only used for swap files over NFS. So, !SWP_FS means non NFS for now, it could be either file backed or device backed. Something similar goes with legacy SWP_FILE. So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead. FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y. I reproduced the issue with the following details: Environment: QEMU + upstream kernel + buildroot + NVMe (2 GB) Kernel config: CONFIG_BLK_DEV_NVME=y CONFIG_THP_SWAP=y Some reproducable steps: mkfs.xfs -f /dev/nvme0n1 mkdir /tmp/mnt mount /dev/nvme0n1 /tmp/mnt bs="32k" sz="1024m" # doesn't matter too much, I also tried 16m xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw mkswap /tmp/mnt/sw swapon /tmp/mnt/sw stress --vm 2 --vm-bytes 600M # doesn't matter too much as well Symptoms: - FS corruption (e.g. checksum failure) - memory corruption at: 0xd2808010 - segfault Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.com Fixes: f0eea189e8e9 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device") Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out") Signed-off-by: Gao Xiang Reviewed-by: "Huang, Ying" Reviewed-by: Yang Shi Acked-by: Rafael Aquini Cc: Matthew Wilcox Cc: Carlos Maiolino Cc: Eric Sandeen Cc: Dave Chinner Cc: Signed-off-by: Andrew Morton --- mm/swapfile.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/mm/swapfile.c~mm-thp-swap-fix-allocating-cluster-for-swapfile-by-mistake +++ a/mm/swapfile.c @@ -1078,7 +1078,7 @@ start_over: goto nextsi; } if (size == SWAPFILE_CLUSTER) { - if (!(si->flags & SWP_FS)) + if (si->flags & SWP_BLKDEV) n_ret = swap_alloc_cluster(si, swp_entries); } else n_ret = scan_swap_map_slots(si, SWAP_HAS_CACHE, From patchwork Sat Sep 26 04:19:10 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 309175 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D23EEC4727E for ; Sat, 26 Sep 2020 04:19:13 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 889E421527 for ; Sat, 26 Sep 2020 04:19:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093953; bh=ROsfeFYyajx6sHRc4vsd8UqPQB3IsomCwYlUooeeHLE=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=xNz7J4Cd5/q+GuTjcoqQwbFYpVF1dV5Lm9ZR3cRkgXOHS2ggTTcx0E89P9wXZE1f9 FSEewqEPVf61p2+ThY/7qsVyOsTErAOk2idh3JlHFzfRpMTLQ7lda1sqs1r4cddgia +rKk1j1QJDPTvSFCmuIj+gdUhjTGVvyk1rT90dag= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730013AbgIZETN (ORCPT ); Sat, 26 Sep 2020 00:19:13 -0400 Received: from mail.kernel.org ([198.145.29.99]:51734 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729035AbgIZETM (ORCPT ); Sat, 26 Sep 2020 00:19:12 -0400 Received: from localhost.localdomain (c-71-198-47-131.hsd1.ca.comcast.net [71.198.47.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 7FFA8207C4; Sat, 26 Sep 2020 04:19:10 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093952; bh=ROsfeFYyajx6sHRc4vsd8UqPQB3IsomCwYlUooeeHLE=; h=Date:From:To:Subject:In-Reply-To:From; b=oMbZlQCFKOM+Cd33AndS+plrLvhXBMkEWLCYTd6ddcqRob/LUCS1/U4hQlGQvT5K4 OT+845XS8+fNv0+QqSI+3D8EeWfSethgnLkedb8OFILFbveC7ORU2vTjbD2NI8o+2X XBDEiJfDaWZj82YSILmHBvyRpC2FvOcxrtHUKXWM= Date: Fri, 25 Sep 2020 21:19:10 -0700 From: Andrew Morton To: agordeev@linux.ibm.com, akpm@linux-foundation.org, arnd@arndb.de, aryabinin@virtuozzo.com, benh@kernel.crashing.org, borntraeger@de.ibm.com, bp@alien8.de, catalin.marinas@arm.com, dave.hansen@intel.com, dave.hansen@linux.intel.com, gerald.schaefer@linux.ibm.com, gor@linux.ibm.com, hca@linux.ibm.com, imbrenda@linux.ibm.com, jdike@addtoit.com, jgg@nvidia.com, jhubbard@nvidia.com, linux-mm@kvack.org, linux@armlinux.org.uk, luto@kernel.org, mingo@redhat.com, mm-commits@vger.kernel.org, mpe@ellerman.id.au, paulus@samba.org, peterz@infradead.org, richard@nod.at, rppt@linux.ibm.com, stable@vger.kernel.org, tglx@linutronix.de, torvalds@linux-foundation.org, will@kernel.org Subject: [patch 3/9] mm/gup: fix gup_fast with dynamic page table folding Message-ID: <20200926041910.c0h74uLjj%akpm@linux-foundation.org> In-Reply-To: <20200925211725.0fea54be9e9715486efea21f@linux-foundation.org> User-Agent: s-nail v14.8.16 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Vasily Gorbik Subject: mm/gup: fix gup_fast with dynamic page table folding Currently to make sure that every page table entry is read just once gup_fast walks perform READ_ONCE and pass pXd value down to the next gup_pXd_range function by value e.g.: static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) ... pudp = pud_offset(&p4d, addr); This function passes a reference on that local value copy to pXd_offset, and might get the very same pointer in return. This happens when the level is folded (on most arches), and that pointer should not be iterated. On s390 due to the fact that each task might have different 5,4 or 3-level address translation and hence different levels folded the logic is more complex and non-iteratable pointer to a local copy leads to severe problems. Here is an example of what happens with gup_fast on s390, for a task with 3-levels paging, crossing a 2 GB pud boundary: // addr = 0x1007ffff000, end = 0x10080001000 static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; // pud_offset returns &p4d itself (a pointer to a value on stack) pudp = pud_offset(&p4d, addr); do { // on second iteratation reading "random" stack value pud_t pud = READ_ONCE(*pudp); // next = 0x10080000000, due to PUD_SIZE/MASK != PGDIR_SIZE/MASK on s390 next = pud_addr_end(addr, end); ... } while (pudp++, addr = next, addr != end); // pudp++ iterating over stack return 1; } This happens since s390 moved to common gup code with commit d1874a0c2805 ("s390/mm: make the pxd_offset functions more robust") and commit 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code"). s390 tried to mimic static level folding by changing pXd_offset primitives to always calculate top level page table offset in pgd_offset and just return the value passed when pXd_offset has to act as folded. What is crucial for gup_fast and what has been overlooked is that PxD_SIZE/MASK and thus pXd_addr_end should also change correspondingly. And the latter is not possible with dynamic folding. To fix the issue in addition to pXd values pass original pXdp pointers down to gup_pXd_range functions. And introduce pXd_offset_lockless helpers, which take an additional pXd entry value parameter. This has already been discussed in https://lkml.kernel.org/r/20190418100218.0a4afd51@mschwideX1 Link: https://lkml.kernel.org/r/patch.git-943f1e5dcff2.your-ad-here.call-01599856292-ext-8676@work.hours Fixes: 1a42010cdc26 ("s390/mm: convert to the generic get_user_pages_fast code") Signed-off-by: Vasily Gorbik Reviewed-by: Gerald Schaefer Reviewed-by: Alexander Gordeev Reviewed-by: Jason Gunthorpe Reviewed-by: Mike Rapoport Reviewed-by: John Hubbard Cc: Linus Torvalds Cc: Peter Zijlstra Cc: Dave Hansen Cc: Russell King Cc: Catalin Marinas Cc: Will Deacon Cc: Michael Ellerman Cc: Benjamin Herrenschmidt Cc: Paul Mackerras Cc: Jeff Dike Cc: Richard Weinberger Cc: Dave Hansen Cc: Andy Lutomirski Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Arnd Bergmann Cc: Andrey Ryabinin Cc: Heiko Carstens Cc: Christian Borntraeger Cc: Claudio Imbrenda Cc: [5.2+] Signed-off-by: Andrew Morton --- arch/s390/include/asm/pgtable.h | 42 +++++++++++++++++++++--------- include/linux/pgtable.h | 10 +++++++ mm/gup.c | 18 ++++++------ 3 files changed, 49 insertions(+), 21 deletions(-) --- a/arch/s390/include/asm/pgtable.h~mm-gup-fix-gup_fast-with-dynamic-page-table-folding +++ a/arch/s390/include/asm/pgtable.h @@ -1260,26 +1260,44 @@ static inline pgd_t *pgd_offset_raw(pgd_ #define pgd_offset(mm, address) pgd_offset_raw(READ_ONCE((mm)->pgd), address) -static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address) +static inline p4d_t *p4d_offset_lockless(pgd_t *pgdp, pgd_t pgd, unsigned long address) { - if ((pgd_val(*pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) - return (p4d_t *) pgd_deref(*pgd) + p4d_index(address); - return (p4d_t *) pgd; + if ((pgd_val(pgd) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R1) + return (p4d_t *) pgd_deref(pgd) + p4d_index(address); + return (p4d_t *) pgdp; } +#define p4d_offset_lockless p4d_offset_lockless -static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address) +static inline p4d_t *p4d_offset(pgd_t *pgdp, unsigned long address) { - if ((p4d_val(*p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) - return (pud_t *) p4d_deref(*p4d) + pud_index(address); - return (pud_t *) p4d; + return p4d_offset_lockless(pgdp, *pgdp, address); +} + +static inline pud_t *pud_offset_lockless(p4d_t *p4dp, p4d_t p4d, unsigned long address) +{ + if ((p4d_val(p4d) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R2) + return (pud_t *) p4d_deref(p4d) + pud_index(address); + return (pud_t *) p4dp; +} +#define pud_offset_lockless pud_offset_lockless + +static inline pud_t *pud_offset(p4d_t *p4dp, unsigned long address) +{ + return pud_offset_lockless(p4dp, *p4dp, address); } #define pud_offset pud_offset -static inline pmd_t *pmd_offset(pud_t *pud, unsigned long address) +static inline pmd_t *pmd_offset_lockless(pud_t *pudp, pud_t pud, unsigned long address) +{ + if ((pud_val(pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) + return (pmd_t *) pud_deref(pud) + pmd_index(address); + return (pmd_t *) pudp; +} +#define pmd_offset_lockless pmd_offset_lockless + +static inline pmd_t *pmd_offset(pud_t *pudp, unsigned long address) { - if ((pud_val(*pud) & _REGION_ENTRY_TYPE_MASK) >= _REGION_ENTRY_TYPE_R3) - return (pmd_t *) pud_deref(*pud) + pmd_index(address); - return (pmd_t *) pud; + return pmd_offset_lockless(pudp, *pudp, address); } #define pmd_offset pmd_offset --- a/include/linux/pgtable.h~mm-gup-fix-gup_fast-with-dynamic-page-table-folding +++ a/include/linux/pgtable.h @@ -1427,6 +1427,16 @@ typedef unsigned int pgtbl_mod_mask; #define mm_pmd_folded(mm) __is_defined(__PAGETABLE_PMD_FOLDED) #endif +#ifndef p4d_offset_lockless +#define p4d_offset_lockless(pgdp, pgd, address) p4d_offset(&(pgd), address) +#endif +#ifndef pud_offset_lockless +#define pud_offset_lockless(p4dp, p4d, address) pud_offset(&(p4d), address) +#endif +#ifndef pmd_offset_lockless +#define pmd_offset_lockless(pudp, pud, address) pmd_offset(&(pud), address) +#endif + /* * p?d_leaf() - true if this entry is a final mapping to a physical address. * This differs from p?d_huge() by the fact that they are always available (if --- a/mm/gup.c~mm-gup-fix-gup_fast-with-dynamic-page-table-folding +++ a/mm/gup.c @@ -2485,13 +2485,13 @@ static int gup_huge_pgd(pgd_t orig, pgd_ return 1; } -static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end, +static int gup_pmd_range(pud_t *pudp, pud_t pud, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pmd_t *pmdp; - pmdp = pmd_offset(&pud, addr); + pmdp = pmd_offset_lockless(pudp, pud, addr); do { pmd_t pmd = READ_ONCE(*pmdp); @@ -2528,13 +2528,13 @@ static int gup_pmd_range(pud_t pud, unsi return 1; } -static int gup_pud_range(p4d_t p4d, unsigned long addr, unsigned long end, +static int gup_pud_range(p4d_t *p4dp, p4d_t p4d, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; pud_t *pudp; - pudp = pud_offset(&p4d, addr); + pudp = pud_offset_lockless(p4dp, p4d, addr); do { pud_t pud = READ_ONCE(*pudp); @@ -2549,20 +2549,20 @@ static int gup_pud_range(p4d_t p4d, unsi if (!gup_huge_pd(__hugepd(pud_val(pud)), addr, PUD_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pmd_range(pud, addr, next, flags, pages, nr)) + } else if (!gup_pmd_range(pudp, pud, addr, next, flags, pages, nr)) return 0; } while (pudp++, addr = next, addr != end); return 1; } -static int gup_p4d_range(pgd_t pgd, unsigned long addr, unsigned long end, +static int gup_p4d_range(pgd_t *pgdp, pgd_t pgd, unsigned long addr, unsigned long end, unsigned int flags, struct page **pages, int *nr) { unsigned long next; p4d_t *p4dp; - p4dp = p4d_offset(&pgd, addr); + p4dp = p4d_offset_lockless(pgdp, pgd, addr); do { p4d_t p4d = READ_ONCE(*p4dp); @@ -2574,7 +2574,7 @@ static int gup_p4d_range(pgd_t pgd, unsi if (!gup_huge_pd(__hugepd(p4d_val(p4d)), addr, P4D_SHIFT, next, flags, pages, nr)) return 0; - } else if (!gup_pud_range(p4d, addr, next, flags, pages, nr)) + } else if (!gup_pud_range(p4dp, p4d, addr, next, flags, pages, nr)) return 0; } while (p4dp++, addr = next, addr != end); @@ -2602,7 +2602,7 @@ static void gup_pgd_range(unsigned long if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr, PGDIR_SHIFT, next, flags, pages, nr)) return; - } else if (!gup_p4d_range(pgd, addr, next, flags, pages, nr)) + } else if (!gup_p4d_range(pgdp, pgd, addr, next, flags, pages, nr)) return; } while (pgdp++, addr = next, addr != end); } From patchwork Sat Sep 26 04:19:18 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 263432 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-12.9 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI, MENTIONS_GIT_HOSTING, SIGNED_OFF_BY, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C0B55C4727D for ; Sat, 26 Sep 2020 04:19:20 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9465321527 for ; Sat, 26 Sep 2020 04:19:20 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093960; bh=gqjbGqJ1joxLFctcB/Msx4e1CUN0FUkvjKq/zP4B4zM=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=Q438sXHUX89bFMGJnS8pmERr1lnPd31tc4XhXtJSg6vPhnfxuDRbY5VLroHb50mXt XRhlupFrcYkB37rwiWOoGWo8S7ETAkdVr6wpiDr4KLUcxjqqvcwU4XVbcEailqTLWQ 6Tvszrcrn+clISvfgsGIXeLS+XwRTmuu9ejzZRc4= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730001AbgIZETU (ORCPT ); Sat, 26 Sep 2020 00:19:20 -0400 Received: from mail.kernel.org ([198.145.29.99]:51884 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729035AbgIZETU (ORCPT ); Sat, 26 Sep 2020 00:19:20 -0400 Received: from localhost.localdomain (c-71-198-47-131.hsd1.ca.comcast.net [71.198.47.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id BEF012078B; Sat, 26 Sep 2020 04:19:18 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093959; bh=gqjbGqJ1joxLFctcB/Msx4e1CUN0FUkvjKq/zP4B4zM=; h=Date:From:To:Subject:In-Reply-To:From; b=glp2xNOyJhKoRYpO+EPEp1Wz4yZampURWRfsCERI3F0fWG1GW292568BAd9RwJFcE qyvUK+UV60w93FDkCeLJe5SdrdEOh5VJBT+02DORAVykWjps9YDpBjt/+Y+dEIkEkc IZ3qxC8QmFZMNKiM1HY9QXoz6HMjpb2Sr4BTzzZ0= Date: Fri, 25 Sep 2020 21:19:18 -0700 From: Andrew Morton To: akpm@linux-foundation.org, andy.lavr@gmail.com, joe@perches.com, keescook@chromium.org, linux-mm@kvack.org, linux@rasmusvillemoes.dk, masahiroy@kernel.org, mm-commits@vger.kernel.org, natechancellor@gmail.com, ndesaulniers@google.com, nivedita@alum.mit.edu, samitolvanen@google.com, stable@vger.kernel.org, torvalds@linux-foundation.org Subject: [patch 5/9] lib/string.c: implement stpcpy Message-ID: <20200926041918.OfR3GXvLc%akpm@linux-foundation.org> In-Reply-To: <20200925211725.0fea54be9e9715486efea21f@linux-foundation.org> User-Agent: s-nail v14.8.16 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Nick Desaulniers Subject: lib/string.c: implement stpcpy LLVM implemented a recent "libcall optimization" that lowers calls to `sprintf(dest, "%s", str)` where the return value is used to `stpcpy(dest, str) - dest`. This generally avoids the machinery involved in parsing format strings. `stpcpy` is just like `strcpy` except it returns the pointer to the new tail of `dest`. This optimization was introduced into clang-12. Implement this so that we don't observe linkage failures due to missing symbol definitions for `stpcpy`. Similar to last year's fire drill with: commit 5f074f3e192f ("lib/string.c: implement a basic bcmp") The kernel is somewhere between a "freestanding" environment (no full libc) and "hosted" environment (many symbols from libc exist with the same type, function signature, and semantics). As H. Peter Anvin notes, there's not really a great way to inform the compiler that you're targeting a freestanding environment but would like to opt-in to some libcall optimizations (see pr/47280 below), rather than opt-out. Arvind notes, -fno-builtin-* behaves slightly differently between GCC and Clang, and Clang is missing many __builtin_* definitions, which I consider a bug in Clang and am working on fixing. Masahiro summarizes the subtle distinction between compilers justly: To prevent transformation from foo() into bar(), there are two ways in Clang to do that; -fno-builtin-foo, and -fno-builtin-bar. There is only one in GCC; -fno-buitin-foo. (Any difference in that behavior in Clang is likely a bug from a missing __builtin_* definition.) Masahiro also notes: We want to disable optimization from foo() to bar(), but we may still benefit from the optimization from foo() into something else. If GCC implements the same transform, we would run into a problem because it is not -fno-builtin-bar, but -fno-builtin-foo that disables that optimization. In this regard, -fno-builtin-foo would be more future-proof than -fno-built-bar, but -fno-builtin-foo is still potentially overkill. We may want to prevent calls from foo() being optimized into calls to bar(), but we still may want other optimization on calls to foo(). It seems that compilers today don't quite provide the fine grain control over which libcall optimizations pseudo-freestanding environments would prefer. Finally, Kees notes that this interface is unsafe, so we should not encourage its use. As such, I've removed the declaration from any header, but it still needs to be exported to avoid linkage errors in modules. Link: https://lkml.kernel.org/r/20200914161643.938408-1-ndesaulniers@google.com Link: https://bugs.llvm.org/show_bug.cgi?id=47162 Link: https://bugs.llvm.org/show_bug.cgi?id=47280 Link: https://github.com/ClangBuiltLinux/linux/issues/1126 Link: https://man7.org/linux/man-pages/man3/stpcpy.3.html Link: https://pubs.opengroup.org/onlinepubs/9699919799/functions/stpcpy.html Link: https://reviews.llvm.org/D85963 Reported-by: Sami Tolvanen Suggested-by: Andy Lavr Suggested-by: Arvind Sankar Suggested-by: Joe Perches Suggested-by: Kees Cook Suggested-by: Masahiro Yamada Suggested-by: Rasmus Villemoes Signed-off-by: Nick Desaulniers Tested-by: Nathan Chancellor Cc: Signed-off-by: Andrew Morton --- lib/string.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) --- a/lib/string.c~lib-stringc-implement-stpcpy +++ a/lib/string.c @@ -272,6 +272,30 @@ ssize_t strscpy_pad(char *dest, const ch } EXPORT_SYMBOL(strscpy_pad); +/** + * stpcpy - copy a string from src to dest returning a pointer to the new end + * of dest, including src's %NUL-terminator. May overrun dest. + * @dest: pointer to end of string being copied into. Must be large enough + * to receive copy. + * @src: pointer to the beginning of string being copied from. Must not overlap + * dest. + * + * stpcpy differs from strcpy in a key way: the return value is a pointer + * to the new %NUL-terminating character in @dest. (For strcpy, the return + * value is a pointer to the start of @dest). This interface is considered + * unsafe as it doesn't perform bounds checking of the inputs. As such it's + * not recommended for usage. Instead, its definition is provided in case + * the compiler lowers other libcalls to stpcpy. + */ +char *stpcpy(char *__restrict__ dest, const char *__restrict__ src); +char *stpcpy(char *__restrict__ dest, const char *__restrict__ src) +{ + while ((*dest++ = *src++) != '\0') + /* nothing */; + return --dest; +} +EXPORT_SYMBOL(stpcpy); + #ifndef __HAVE_ARCH_STRCAT /** * strcat - Append one %NUL-terminated string to another From patchwork Sat Sep 26 04:19:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 309174 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D6656C2D0A8 for ; Sat, 26 Sep 2020 04:19:26 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A4149215A4 for ; Sat, 26 Sep 2020 04:19:26 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093966; bh=4XYbOkZD/X+maFrZDRd5RAzqJJ9LoJUpJToY3T3h3R0=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=H32JsRlYrS27ktcDFcKwSOWGEG3y5Q9LXWkD3wvCN+siYoU65eY7ZiCmOxyt25cMh akg0Sef1VUCkW4Cq6mKmFjfrHBVqDHlxvIaq6vfh1gEg6M0I7gL19UXjPYrS9dAQP8 DxX/EZIClMM0xScn70jRBXFVIG2YFDnx1s0TW81E= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729172AbgIZET0 (ORCPT ); Sat, 26 Sep 2020 00:19:26 -0400 Received: from mail.kernel.org ([198.145.29.99]:52008 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729035AbgIZET0 (ORCPT ); Sat, 26 Sep 2020 00:19:26 -0400 Received: from localhost.localdomain (c-71-198-47-131.hsd1.ca.comcast.net [71.198.47.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id DF92D207C4; Sat, 26 Sep 2020 04:19:24 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093965; bh=4XYbOkZD/X+maFrZDRd5RAzqJJ9LoJUpJToY3T3h3R0=; h=Date:From:To:Subject:In-Reply-To:From; b=ARY3by6ITqp2ZnsjOsPvXl+rGGDcHv8XbLOpId2hoXAJYDuU33VhrEDACHeGCnZjR 9n91SUl1xCKp3tjXuT8Qf0r7v9lmsrcvsunSEVd4++RIROvmA21hg5jWaBnCAd9vFX cwn6ysHBq1VfNYgqtGTRudh+891wxKG0QhjTznM0= Date: Fri, 25 Sep 2020 21:19:24 -0700 From: Andrew Morton To: akpm@linux-foundation.org, dan.j.wiilliams@intel.com, hch@lst.de, hpa@zytor.com, jack@suse.cz, jmoyer@redhat.com, linux-mm@kvack.org, mawilcox@microsoft.com, mingo@elte.hu, mingo@redhat.com, mm-commits@vger.kernel.org, mpatocka@redhat.com, ross.zwisler@linux.intel.com, stable@vger.kernel.org, tglx@linutronix.de, torvalds@linux-foundation.org, toshi.kani@hpe.com, viro@zeniv.linux.org.uk Subject: [patch 7/9] arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback Message-ID: <20200926041924.JrUcL1D_9%akpm@linux-foundation.org> In-Reply-To: <20200925211725.0fea54be9e9715486efea21f@linux-foundation.org> User-Agent: s-nail v14.8.16 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Mikulas Patocka Subject: arch/x86/lib/usercopy_64.c: fix __copy_user_flushcache() cache writeback If we copy less than 8 bytes and if the destination crosses a cache line, __copy_user_flushcache would invalidate only the first cache line. This patch makes it invalidate the second cache line as well. Link: https://lkml.kernel.org/r/alpine.LRH.2.02.2009161451140.21915@file01.intranet.prod.int.rdu2.redhat.com Fixes: 0aed55af88345b ("x86, uaccess: introduce copy_from_iter_flushcache for pmem / cache-bypass operations") Signed-off-by: Mikulas Patocka Reviewed-by: Dan Williams Cc: Jan Kara Cc: Jeff Moyer Cc: Ingo Molnar Cc: Christoph Hellwig Cc: Toshi Kani Cc: "H. Peter Anvin" Cc: Al Viro Cc: Thomas Gleixner Cc: Matthew Wilcox Cc: Ross Zwisler Cc: Ingo Molnar Cc: "H. Peter Anvin" Cc: Signed-off-by: Andrew Morton --- arch/x86/lib/usercopy_64.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) --- a/arch/x86/lib/usercopy_64.c~arch-x86-lib-usercopy_64c-fix-__copy_user_flushcache-cache-writeback +++ a/arch/x86/lib/usercopy_64.c @@ -120,7 +120,7 @@ long __copy_user_flushcache(void *dst, c */ if (size < 8) { if (!IS_ALIGNED(dest, 4) || size != 4) - clean_cache_range(dst, 1); + clean_cache_range(dst, size); } else { if (!IS_ALIGNED(dest, 8)) { dest = ALIGN(dest, boot_cpu_data.x86_clflush_size); From patchwork Sat Sep 26 04:19:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 263431 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id D4BF1C47420 for ; Sat, 26 Sep 2020 04:19:31 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9A8DC20882 for ; Sat, 26 Sep 2020 04:19:31 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093971; bh=BLn0rc/DiGFvCooB+p8a0n/+Mt1r6e369KsEgAtvVKI=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=MNuRfacIKrz43XfunmD9vFAIZbFNcx7/5UZBJuSugiju+eCo10DOp10KdFxIVKZz4 cxJf9z46OUd7okZAxWoK05nKjIodCd7+kId1jzoJwJP48YI1vUnb1fieCpN7efzIhA MSbhjhnApRAWP2Y/Yqw/4M4wIZsqPMN0hcF4aX9o= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729426AbgIZETb (ORCPT ); Sat, 26 Sep 2020 00:19:31 -0400 Received: from mail.kernel.org ([198.145.29.99]:52088 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729035AbgIZETa (ORCPT ); Sat, 26 Sep 2020 00:19:30 -0400 Received: from localhost.localdomain (c-71-198-47-131.hsd1.ca.comcast.net [71.198.47.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id B8D8F2080A; Sat, 26 Sep 2020 04:19:28 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093969; bh=BLn0rc/DiGFvCooB+p8a0n/+Mt1r6e369KsEgAtvVKI=; h=Date:From:To:Subject:In-Reply-To:From; b=orlT/les5zNiJvpUPhDU0blR/827VuvNwzokG8zW7ACyysH233PhPc0yKm1oVMLTh pW80MB23KlhAztN/JfCm8xUol+GDHWMR7s/cDde316DJoR7uJNm3zMZiRNfZLP3Rcm MU8mXDXFyL2a6h51kVvxBm0UoD7Wo9gTlHpu39fs= Date: Fri, 25 Sep 2020 21:19:28 -0700 From: Andrew Morton To: akpm@linux-foundation.org, cheloha@linux.ibm.com, david@redhat.com, fenghua.yu@intel.com, gregkh@linuxfoundation.org, ldufour@linux.ibm.com, linux-mm@kvack.org, mhocko@suse.com, mm-commits@vger.kernel.org, nathanl@linux.ibm.com, osalvador@suse.de, rafael@kernel.org, stable@vger.kernel.org, tony.luck@intel.com, torvalds@linux-foundation.org Subject: [patch 8/9] mm: replace memmap_context by meminit_context Message-ID: <20200926041928.9xJHGgkah%akpm@linux-foundation.org> In-Reply-To: <20200925211725.0fea54be9e9715486efea21f@linux-foundation.org> User-Agent: s-nail v14.8.16 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Laurent Dufour Subject: mm: replace memmap_context by meminit_context Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: ------------[ cut here ]------------ kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 NIP: c000000000403f34 LR: c000000000403f2c CTR: 0000000000000000 REGS: c0000004876e3660 TRAP: 0700 Not tainted (5.9.0-rc1+) MSR: 800000000282b033 CR: 24000448 XER: 20040000 CFAR: c000000000846d20 IRQMASK: 0 GPR00: c000000000403f2c c0000004876e38f0 c0000000012f6f00 ffffffffffffffef GPR04: 0000000000000227 c0000004805ae680 0000000000000000 00000004886f0000 GPR08: 0000000000000226 0000000000000003 0000000000000002 fffffffffffffffd GPR12: 0000000088000484 c00000001ec96280 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000004 0000000000000003 GPR20: c00000047814ffe0 c0000007ffff7c08 0000000000000010 c0000000013332c8 GPR24: 0000000000000000 c0000000011f6cc0 0000000000000000 0000000000000000 GPR28: ffffffffffffffef 0000000000000001 0000000150000000 0000000010000000 NIP [c000000000403f34] add_memory_resource+0x244/0x340 LR [c000000000403f2c] add_memory_resource+0x23c/0x340 Call Trace: [c0000004876e38f0] [c000000000403f2c] add_memory_resource+0x23c/0x340 (unreliable) [c0000004876e39c0] [c00000000040408c] __add_memory+0x5c/0xf0 [c0000004876e39f0] [c0000000000e2b94] dlpar_add_lmb+0x1b4/0x500 [c0000004876e3ad0] [c0000000000e3888] dlpar_memory+0x1f8/0xb80 [c0000004876e3b60] [c0000000000dc0d0] handle_dlpar_errorlog+0xc0/0x190 [c0000004876e3bd0] [c0000000000dc398] dlpar_store+0x198/0x4a0 [c0000004876e3c90] [c00000000072e630] kobj_attr_store+0x30/0x50 [c0000004876e3cb0] [c00000000051f954] sysfs_kf_write+0x64/0x90 [c0000004876e3cd0] [c00000000051ee40] kernfs_fop_write+0x1b0/0x290 [c0000004876e3d20] [c000000000438dd8] vfs_write+0xe8/0x290 [c0000004876e3d70] [c0000000004391ac] ksys_write+0xdc/0x130 [c0000004876e3dc0] [c000000000034e40] system_call_exception+0x160/0x270 [c0000004876e3e20] [c00000000000d740] system_call_common+0xf0/0x27c Instruction dump: 48442e35 60000000 0b030000 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14 78a58402 48442db1 60000000 7c7c1b78 <0b030000> 7f23cb78 4bda371d 60000000 ---[ end trace 562fd6c109cd0fb2 ]--- This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are 2 issues here: a. The sysfs memory and node's layouts are broken due to these multiple links b. The link errors in link_mem_sections() should not lead to a system panic. To address a. register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. The patch 3 is addressing the point b. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com Signed-off-by: Laurent Dufour Suggested-by: David Hildenbrand Reviewed-by: David Hildenbrand Reviewed-by: Oscar Salvador Acked-by: Michal Hocko Cc: Greg Kroah-Hartman Cc: "Rafael J . Wysocki" Cc: Nathan Lynch Cc: Scott Cheloha Cc: Tony Luck Cc: Fenghua Yu Cc: Signed-off-by: Andrew Morton --- arch/ia64/mm/init.c | 6 +++--- include/linux/mm.h | 2 +- include/linux/mmzone.h | 11 ++++++++--- mm/memory_hotplug.c | 2 +- mm/page_alloc.c | 10 +++++----- 5 files changed, 18 insertions(+), 13 deletions(-) --- a/arch/ia64/mm/init.c~mm-replace-memmap_context-by-meminit_context +++ a/arch/ia64/mm/init.c @@ -538,7 +538,7 @@ virtual_memmap_init(u64 start, u64 end, if (map_start < map_end) memmap_init_zone((unsigned long)(map_end - map_start), args->nid, args->zone, page_to_pfn(map_start), - MEMMAP_EARLY, NULL); + MEMINIT_EARLY, NULL); return 0; } @@ -547,8 +547,8 @@ memmap_init (unsigned long size, int nid unsigned long start_pfn) { if (!vmem_map) { - memmap_init_zone(size, nid, zone, start_pfn, MEMMAP_EARLY, - NULL); + memmap_init_zone(size, nid, zone, start_pfn, + MEMINIT_EARLY, NULL); } else { struct page *start; struct memmap_init_callback_data args; --- a/include/linux/mm.h~mm-replace-memmap_context-by-meminit_context +++ a/include/linux/mm.h @@ -2416,7 +2416,7 @@ extern int __meminit __early_pfn_to_nid( extern void set_dma_reserve(unsigned long new_dma_reserve); extern void memmap_init_zone(unsigned long, int, unsigned long, unsigned long, - enum memmap_context, struct vmem_altmap *); + enum meminit_context, struct vmem_altmap *); extern void setup_per_zone_wmarks(void); extern int __meminit init_per_zone_wmark_min(void); extern void mem_init(void); --- a/include/linux/mmzone.h~mm-replace-memmap_context-by-meminit_context +++ a/include/linux/mmzone.h @@ -824,10 +824,15 @@ bool zone_watermark_ok(struct zone *z, u unsigned int alloc_flags); bool zone_watermark_ok_safe(struct zone *z, unsigned int order, unsigned long mark, int highest_zoneidx); -enum memmap_context { - MEMMAP_EARLY, - MEMMAP_HOTPLUG, +/* + * Memory initialization context, use to differentiate memory added by + * the platform statically or via memory hotplug interface. + */ +enum meminit_context { + MEMINIT_EARLY, + MEMINIT_HOTPLUG, }; + extern void init_currently_empty_zone(struct zone *zone, unsigned long start_pfn, unsigned long size); --- a/mm/memory_hotplug.c~mm-replace-memmap_context-by-meminit_context +++ a/mm/memory_hotplug.c @@ -729,7 +729,7 @@ void __ref move_pfn_range_to_zone(struct * are reserved so nobody should be touching them so we should be safe */ memmap_init_zone(nr_pages, nid, zone_idx(zone), start_pfn, - MEMMAP_HOTPLUG, altmap); + MEMINIT_HOTPLUG, altmap); set_zone_contiguous(zone); } --- a/mm/page_alloc.c~mm-replace-memmap_context-by-meminit_context +++ a/mm/page_alloc.c @@ -5975,7 +5975,7 @@ overlap_memmap_init(unsigned long zone, * done. Non-atomic initialization, single-pass. */ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, - unsigned long start_pfn, enum memmap_context context, + unsigned long start_pfn, enum meminit_context context, struct vmem_altmap *altmap) { unsigned long pfn, end_pfn = start_pfn + size; @@ -6007,7 +6007,7 @@ void __meminit memmap_init_zone(unsigned * There can be holes in boot-time mem_map[]s handed to this * function. They do not exist on hotplugged memory. */ - if (context == MEMMAP_EARLY) { + if (context == MEMINIT_EARLY) { if (overlap_memmap_init(zone, &pfn)) continue; if (defer_init(nid, pfn, end_pfn)) @@ -6016,7 +6016,7 @@ void __meminit memmap_init_zone(unsigned page = pfn_to_page(pfn); __init_single_page(page, pfn, zone, nid); - if (context == MEMMAP_HOTPLUG) + if (context == MEMINIT_HOTPLUG) __SetPageReserved(page); /* @@ -6099,7 +6099,7 @@ void __ref memmap_init_zone_device(struc * check here not to call set_pageblock_migratetype() against * pfn out of zone. * - * Please note that MEMMAP_HOTPLUG path doesn't clear memmap + * Please note that MEMINIT_HOTPLUG path doesn't clear memmap * because this is done early in section_activate() */ if (!(pfn & (pageblock_nr_pages - 1))) { @@ -6137,7 +6137,7 @@ void __meminit __weak memmap_init(unsign if (end_pfn > start_pfn) { size = end_pfn - start_pfn; memmap_init_zone(size, nid, zone, start_pfn, - MEMMAP_EARLY, NULL); + MEMINIT_EARLY, NULL); } } } From patchwork Sat Sep 26 04:19:31 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andrew Morton X-Patchwork-Id: 309173 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-8.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED, DKIM_VALID, HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI, SIGNED_OFF_BY,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id CCC58C47425 for ; Sat, 26 Sep 2020 04:19:34 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 998AB215A4 for ; Sat, 26 Sep 2020 04:19:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093974; bh=usSBJeQ8z9EC87empoXgaCPunCczq1s1yFugQTQwcUc=; h=Date:From:To:Subject:In-Reply-To:List-ID:From; b=MOZg9Fi0hJ1HoGQxQsW1RuyUUgObTFqtGB9u49zHWPNAwZpfqP69r0CAIcdQM0PH/ CcxGXhE0UvLVQw7w8fEBkl5FfQVjT+Pn66oz5JWhKwNn+tWaoZaTKaK4J2OsP9zeED p8AQCGbpaAxayZglaK2BvlhSATgDBQ2C0S/Bhb28= Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730034AbgIZETe (ORCPT ); Sat, 26 Sep 2020 00:19:34 -0400 Received: from mail.kernel.org ([198.145.29.99]:52162 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729035AbgIZETe (ORCPT ); Sat, 26 Sep 2020 00:19:34 -0400 Received: from localhost.localdomain (c-71-198-47-131.hsd1.ca.comcast.net [71.198.47.131]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 4DFF320882; Sat, 26 Sep 2020 04:19:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1601093972; bh=usSBJeQ8z9EC87empoXgaCPunCczq1s1yFugQTQwcUc=; h=Date:From:To:Subject:In-Reply-To:From; b=EwEPqfL7SdVssscBl3RuNWldsF01lgY1/p/j98DVLSK7VVlcnVUnzp+kE+vrDpIqW 5TAEahOPor2r9JjyvRu+Xs944ZFmXJqpNdQBpr06+J077Lfc6Mqm40M2U5ELliRWgF DpLLNCfO7Jo4HNy+r3HbGvJ/ecjKoSatO2Uu0eAs= Date: Fri, 25 Sep 2020 21:19:31 -0700 From: Andrew Morton To: akpm@linux-foundation.org, cheloha@linux.ibm.com, david@redhat.com, fenghua.yu@intel.com, gregkh@linuxfoundation.org, ldufour@linux.ibm.com, linux-mm@kvack.org, mhocko@suse.com, mm-commits@vger.kernel.org, nathanl@linux.ibm.com, osalvador@suse.de, rafael@kernel.org, stable@vger.kernel.org, tony.luck@intel.com, torvalds@linux-foundation.org Subject: [patch 9/9] mm: don't rely on system state to detect hot-plug operations Message-ID: <20200926041931.4jyPIAVMX%akpm@linux-foundation.org> In-Reply-To: <20200925211725.0fea54be9e9715486efea21f@linux-foundation.org> User-Agent: s-nail v14.8.16 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org From: Laurent Dufour Subject: mm: don't rely on system state to detect hot-plug operations In register_mem_sect_under_node() the system_state's value is checked to detect whether the call is made during boot time or during an hot-plug operation. Unfortunately, that check against SYSTEM_BOOTING is wrong because regular memory is registered at SYSTEM_SCHEDULING state. In addition, memory hot-plug operation can be triggered at this system state by the ACPI [1]. So checking against the system state is not enough. The consequence is that on system with interleaved node's ranges like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] This can be seen on PowerPC LPAR after multiple memory hot-plug and hot-unplug operations are done. At the next reboot the node's memory ranges can be interleaved and since the call to link_mem_sections() is made in topology_init() while the system is in the SYSTEM_SCHEDULING state, the node's id is not checked, and the sections registered to multiple nodes: $ ls -l /sys/devices/system/memory/memory21/node* total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 In that case, the system is able to boot but if later one of theses memory blocks is hot-unplugged and then hot-plugged, the sysfs inconsistency is detected and this is triggering a BUG_ON(): ------------[ cut here ]------------ kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 NIP: c000000000403f34 LR: c000000000403f2c CTR: 0000000000000000 REGS: c0000004876e3660 TRAP: 0700 Not tainted (5.9.0-rc1+) MSR: 800000000282b033 CR: 24000448 XER: 20040000 CFAR: c000000000846d20 IRQMASK: 0 GPR00: c000000000403f2c c0000004876e38f0 c0000000012f6f00 ffffffffffffffef GPR04: 0000000000000227 c0000004805ae680 0000000000000000 00000004886f0000 GPR08: 0000000000000226 0000000000000003 0000000000000002 fffffffffffffffd GPR12: 0000000088000484 c00000001ec96280 0000000000000000 0000000000000000 GPR16: 0000000000000000 0000000000000000 0000000000000004 0000000000000003 GPR20: c00000047814ffe0 c0000007ffff7c08 0000000000000010 c0000000013332c8 GPR24: 0000000000000000 c0000000011f6cc0 0000000000000000 0000000000000000 GPR28: ffffffffffffffef 0000000000000001 0000000150000000 0000000010000000 NIP [c000000000403f34] add_memory_resource+0x244/0x340 LR [c000000000403f2c] add_memory_resource+0x23c/0x340 Call Trace: [c0000004876e38f0] [c000000000403f2c] add_memory_resource+0x23c/0x340 (unreliable) [c0000004876e39c0] [c00000000040408c] __add_memory+0x5c/0xf0 [c0000004876e39f0] [c0000000000e2b94] dlpar_add_lmb+0x1b4/0x500 [c0000004876e3ad0] [c0000000000e3888] dlpar_memory+0x1f8/0xb80 [c0000004876e3b60] [c0000000000dc0d0] handle_dlpar_errorlog+0xc0/0x190 [c0000004876e3bd0] [c0000000000dc398] dlpar_store+0x198/0x4a0 [c0000004876e3c90] [c00000000072e630] kobj_attr_store+0x30/0x50 [c0000004876e3cb0] [c00000000051f954] sysfs_kf_write+0x64/0x90 [c0000004876e3cd0] [c00000000051ee40] kernfs_fop_write+0x1b0/0x290 [c0000004876e3d20] [c000000000438dd8] vfs_write+0xe8/0x290 [c0000004876e3d70] [c0000000004391ac] ksys_write+0xdc/0x130 [c0000004876e3dc0] [c000000000034e40] system_call_exception+0x160/0x270 [c0000004876e3e20] [c00000000000d740] system_call_common+0xf0/0x27c Instruction dump: 48442e35 60000000 0b030000 3cbe0001 7fa3eb78 7bc48402 38a5fffe 7ca5fa14 78a58402 48442db1 60000000 7c7c1b78 <0b030000> 7f23cb78 4bda371d 60000000 ---[ end trace 562fd6c109cd0fb2 ]--- This patch addresses the root cause by not relying on the system_state value to detect whether the call is due to a hot-plug operation. An extra parameter is added to link_mem_sections() detailing whether the operation is due to a hot-plug operation. [1] According to Oscar Salvador, using this qemu command line, ACPI memory hotplug operations are raised at SYSTEM_SCHEDULING state: $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \ -m size=$MEM,slots=255,maxmem=4294967296k \ -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \ -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \ -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \ -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \ -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \ -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \ -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \ -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \ Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.com Fixes: 4fbce633910e ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()") Signed-off-by: Laurent Dufour Reviewed-by: David Hildenbrand Reviewed-by: Oscar Salvador Acked-by: Michal Hocko Cc: Greg Kroah-Hartman Cc: "Rafael J. Wysocki" Cc: Fenghua Yu Cc: Nathan Lynch Cc: Scott Cheloha Cc: Tony Luck Cc: Signed-off-by: Andrew Morton --- drivers/base/node.c | 85 ++++++++++++++++++++++++++--------------- include/linux/node.h | 11 +++-- mm/memory_hotplug.c | 3 - 3 files changed, 64 insertions(+), 35 deletions(-) --- a/drivers/base/node.c~mm-dont-rely-on-system-state-to-detect-hot-plug-operations +++ a/drivers/base/node.c @@ -761,14 +761,36 @@ static int __ref get_nid_for_pfn(unsigne return pfn_to_nid(pfn); } +static int do_register_memory_block_under_node(int nid, + struct memory_block *mem_blk) +{ + int ret; + + /* + * If this memory block spans multiple nodes, we only indicate + * the last processed node. + */ + mem_blk->nid = nid; + + ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj, + &mem_blk->dev.kobj, + kobject_name(&mem_blk->dev.kobj)); + if (ret) + return ret; + + return sysfs_create_link_nowarn(&mem_blk->dev.kobj, + &node_devices[nid]->dev.kobj, + kobject_name(&node_devices[nid]->dev.kobj)); +} + /* register memory section under specified node if it spans that node */ -static int register_mem_sect_under_node(struct memory_block *mem_blk, - void *arg) +static int register_mem_block_under_node_early(struct memory_block *mem_blk, + void *arg) { unsigned long memory_block_pfns = memory_block_size_bytes() / PAGE_SIZE; unsigned long start_pfn = section_nr_to_pfn(mem_blk->start_section_nr); unsigned long end_pfn = start_pfn + memory_block_pfns - 1; - int ret, nid = *(int *)arg; + int nid = *(int *)arg; unsigned long pfn; for (pfn = start_pfn; pfn <= end_pfn; pfn++) { @@ -785,39 +807,34 @@ static int register_mem_sect_under_node( } /* - * We need to check if page belongs to nid only for the boot - * case, during hotplug we know that all pages in the memory - * block belong to the same node. - */ - if (system_state == SYSTEM_BOOTING) { - page_nid = get_nid_for_pfn(pfn); - if (page_nid < 0) - continue; - if (page_nid != nid) - continue; - } - - /* - * If this memory block spans multiple nodes, we only indicate - * the last processed node. + * We need to check if page belongs to nid only at the boot + * case because node's ranges can be interleaved. */ - mem_blk->nid = nid; - - ret = sysfs_create_link_nowarn(&node_devices[nid]->dev.kobj, - &mem_blk->dev.kobj, - kobject_name(&mem_blk->dev.kobj)); - if (ret) - return ret; + page_nid = get_nid_for_pfn(pfn); + if (page_nid < 0) + continue; + if (page_nid != nid) + continue; - return sysfs_create_link_nowarn(&mem_blk->dev.kobj, - &node_devices[nid]->dev.kobj, - kobject_name(&node_devices[nid]->dev.kobj)); + return do_register_memory_block_under_node(nid, mem_blk); } /* mem section does not span the specified node */ return 0; } /* + * During hotplug we know that all pages in the memory block belong to the same + * node. + */ +static int register_mem_block_under_node_hotplug(struct memory_block *mem_blk, + void *arg) +{ + int nid = *(int *)arg; + + return do_register_memory_block_under_node(nid, mem_blk); +} + +/* * Unregister a memory block device under the node it spans. Memory blocks * with multiple nodes cannot be offlined and therefore also never be removed. */ @@ -832,11 +849,19 @@ void unregister_memory_block_under_nodes kobject_name(&node_devices[mem_blk->nid]->dev.kobj)); } -int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn) +int link_mem_sections(int nid, unsigned long start_pfn, unsigned long end_pfn, + enum meminit_context context) { + walk_memory_blocks_func_t func; + + if (context == MEMINIT_HOTPLUG) + func = register_mem_block_under_node_hotplug; + else + func = register_mem_block_under_node_early; + return walk_memory_blocks(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn), (void *)&nid, - register_mem_sect_under_node); + func); } #ifdef CONFIG_HUGETLBFS --- a/include/linux/node.h~mm-dont-rely-on-system-state-to-detect-hot-plug-operations +++ a/include/linux/node.h @@ -99,11 +99,13 @@ extern struct node *node_devices[]; typedef void (*node_registration_func_t)(struct node *); #if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_NUMA) -extern int link_mem_sections(int nid, unsigned long start_pfn, - unsigned long end_pfn); +int link_mem_sections(int nid, unsigned long start_pfn, + unsigned long end_pfn, + enum meminit_context context); #else static inline int link_mem_sections(int nid, unsigned long start_pfn, - unsigned long end_pfn) + unsigned long end_pfn, + enum meminit_context context) { return 0; } @@ -128,7 +130,8 @@ static inline int register_one_node(int if (error) return error; /* link memory sections under this node */ - error = link_mem_sections(nid, start_pfn, end_pfn); + error = link_mem_sections(nid, start_pfn, end_pfn, + MEMINIT_EARLY); } return error; --- a/mm/memory_hotplug.c~mm-dont-rely-on-system-state-to-detect-hot-plug-operations +++ a/mm/memory_hotplug.c @@ -1080,7 +1080,8 @@ int __ref add_memory_resource(int nid, s } /* link memory sections under this node.*/ - ret = link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1)); + ret = link_mem_sections(nid, PFN_DOWN(start), PFN_UP(start + size - 1), + MEMINIT_HOTPLUG); BUG_ON(ret); /* create new memmap entry */