From patchwork Fri Dec 4 05:07:03 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 338392 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.0 required=3.0 tests=BAYES_00, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BD355C433FE for ; Fri, 4 Dec 2020 05:08:06 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5BCEA207CF for ; Fri, 4 Dec 2020 05:08:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727417AbgLDFHu (ORCPT ); Fri, 4 Dec 2020 00:07:50 -0500 Received: from mail.kernel.org ([198.145.29.99]:43780 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725554AbgLDFHu (ORCPT ); Fri, 4 Dec 2020 00:07:50 -0500 From: Andy Lutomirski Authentication-Results: mail.kernel.org; dkim=permerror (bad message/signature format) To: x86@kernel.org, Mathieu Desnoyers Cc: LKML , Nicholas Piggin , Arnd Bergmann , Anton Blanchard , Andy Lutomirski , stable@vger.kernel.org Subject: [PATCH v3 1/4] x86/membarrier: Get rid of a dubious optimization Date: Thu, 3 Dec 2020 21:07:03 -0800 Message-Id: <5afc7632be1422f91eaf7611aaaa1b5b8580a086.1607058304.git.luto@kernel.org> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org sync_core_before_usermode() had an incorrect optimization. If we're in an IRQ, we can get to usermode without IRET -- we just have to schedule to a different task in the same mm and do SYSRET. Fortunately, there were no callers of sync_core_before_usermode() that could have had in_irq() or in_nmi() equal to true, because it's only ever called from the scheduler. While we're at it, clarify a related comment. Cc: stable@vger.kernel.org Reviewed-by: Mathieu Desnoyers Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/sync_core.h | 9 +++++---- arch/x86/mm/tlb.c | 10 ++++++++-- 2 files changed, 13 insertions(+), 6 deletions(-) diff --git a/arch/x86/include/asm/sync_core.h b/arch/x86/include/asm/sync_core.h index 0fd4a9dfb29c..ab7382f92aff 100644 --- a/arch/x86/include/asm/sync_core.h +++ b/arch/x86/include/asm/sync_core.h @@ -98,12 +98,13 @@ static inline void sync_core_before_usermode(void) /* With PTI, we unconditionally serialize before running user code. */ if (static_cpu_has(X86_FEATURE_PTI)) return; + /* - * Return from interrupt and NMI is done through iret, which is core - * serializing. + * Even if we're in an interrupt, we might reschedule before returning, + * in which case we could switch to a different thread in the same mm + * and return using SYSRET or SYSEXIT. Instead of trying to keep + * track of our need to sync the core, just sync right away. */ - if (in_irq() || in_nmi()) - return; sync_core(); } diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 11666ba19b62..569ac1d57f55 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -474,8 +474,14 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, /* * The membarrier system call requires a full memory barrier and * core serialization before returning to user-space, after - * storing to rq->curr. Writing to CR3 provides that full - * memory barrier and core serializing instruction. + * storing to rq->curr, when changing mm. This is because + * membarrier() sends IPIs to all CPUs that are in the target mm + * to make them issue memory barriers. However, if another CPU + * switches to/from the target mm concurrently with + * membarrier(), it can cause that CPU not to receive an IPI + * when it really should issue a memory barrier. Writing to CR3 + * provides that full memory barrier and core serializing + * instruction. */ if (real_prev == next) { VM_WARN_ON(this_cpu_read(cpu_tlbstate.ctxs[prev_asid].ctx_id) != From patchwork Fri Dec 4 05:07:04 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 338807 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.0 required=3.0 tests=BAYES_00, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id C8AD5C1B0D9 for ; Fri, 4 Dec 2020 05:08:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9329D22581 for ; Fri, 4 Dec 2020 05:08:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1725554AbgLDFHv (ORCPT ); Fri, 4 Dec 2020 00:07:51 -0500 Received: from mail.kernel.org ([198.145.29.99]:43820 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726485AbgLDFHv (ORCPT ); Fri, 4 Dec 2020 00:07:51 -0500 From: Andy Lutomirski Authentication-Results: mail.kernel.org; dkim=permerror (bad message/signature format) To: x86@kernel.org, Mathieu Desnoyers Cc: LKML , Nicholas Piggin , Arnd Bergmann , Anton Blanchard , Andy Lutomirski , stable@vger.kernel.org Subject: [PATCH v3 2/4] membarrier: Add an actual barrier before rseq_preempt() Date: Thu, 3 Dec 2020 21:07:04 -0800 Message-Id: X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org It seems to me that most RSEQ membarrier users will expect any stores done before the membarrier() syscall to be visible to the target task(s). While this is extremely likely to be true in practice, nothing actually guarantees it by a strict reading of the x86 manuals. Rather than providing this guarantee by accident and potentially causing a problem down the road, just add an explicit barrier. Cc: stable@vger.kernel.org Reviewed-by: Mathieu Desnoyers Signed-off-by: Andy Lutomirski --- kernel/sched/membarrier.c | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 5a40b3828ff2..6251d3d12abe 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -168,6 +168,14 @@ static void ipi_mb(void *info) static void ipi_rseq(void *info) { + /* + * Ensure that all stores done by the calling thread are visible + * to the current task before the current task resumes. We could + * probably optimize this away on most architectures, but by the + * time we've already sent an IPI, the cost of the extra smp_mb() + * is negligible. + */ + smp_mb(); rseq_preempt(current); } From patchwork Fri Dec 4 05:07:05 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 338808 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.0 required=3.0 tests=BAYES_00, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, URIBL_BLOCKED, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id BCBA6C19437 for ; Fri, 4 Dec 2020 05:08:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 6FDAF207CF for ; Fri, 4 Dec 2020 05:08:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727722AbgLDFHw (ORCPT ); Fri, 4 Dec 2020 00:07:52 -0500 Received: from mail.kernel.org ([198.145.29.99]:43872 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727692AbgLDFHw (ORCPT ); Fri, 4 Dec 2020 00:07:52 -0500 From: Andy Lutomirski Authentication-Results: mail.kernel.org; dkim=permerror (bad message/signature format) To: x86@kernel.org, Mathieu Desnoyers Cc: LKML , Nicholas Piggin , Arnd Bergmann , Anton Blanchard , Andy Lutomirski , stable@vger.kernel.org Subject: [PATCH v3 3/4] membarrier: Explicitly sync remote cores when SYNC_CORE is requested Date: Thu, 3 Dec 2020 21:07:05 -0800 Message-Id: <776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org membarrier() does not explicitly sync_core() remote CPUs; instead, it relies on the assumption that an IPI will result in a core sync. On x86, I think this may be true in practice, but it's not architecturally reliable. In particular, the SDM and APM do not appear to guarantee that interrupt delivery is serializing. While IRET does serialize, IPI return can schedule, thereby switching to another task in the same mm that was sleeping in a syscall. The new task could then SYSRET back to usermode without ever executing IRET. Make this more robust by explicitly calling sync_core_before_usermode() on remote cores. (This also helps people who search the kernel tree for instances of sync_core() and sync_core_before_usermode() -- one might be surprised that the core membarrier code doesn't currently show up in a such a search.) Cc: stable@vger.kernel.org Reviewed-by: Mathieu Desnoyers Signed-off-by: Andy Lutomirski --- kernel/sched/membarrier.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 6251d3d12abe..01538b31f27e 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -166,6 +166,23 @@ static void ipi_mb(void *info) smp_mb(); /* IPIs should be serializing but paranoid. */ } +static void ipi_sync_core(void *info) +{ + /* + * The smp_mb() in membarrier after all the IPIs is supposed to + * ensure that memory on remote CPUs that occur before the IPI + * become visible to membarrier()'s caller -- see scenario B in + * the big comment at the top of this file. + * + * A sync_core() would provide this guarantee, but + * sync_core_before_usermode() might end up being deferred until + * after membarrier()'s smp_mb(). + */ + smp_mb(); /* IPIs should be serializing but paranoid. */ + + sync_core_before_usermode(); +} + static void ipi_rseq(void *info) { /* @@ -301,6 +318,7 @@ static int membarrier_private_expedited(int flags, int cpu_id) if (!(atomic_read(&mm->membarrier_state) & MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE_READY)) return -EPERM; + ipi_func = ipi_sync_core; } else if (flags == MEMBARRIER_FLAG_RSEQ) { if (!IS_ENABLED(CONFIG_RSEQ)) return -EINVAL; From patchwork Fri Dec 4 05:07:06 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Andy Lutomirski X-Patchwork-Id: 338391 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-17.0 required=3.0 tests=BAYES_00, INCLUDES_CR_TRAILER, INCLUDES_PATCH, MAILING_LIST_MULTI, SPF_HELO_NONE, SPF_PASS, USER_AGENT_GIT autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 99762C1B087 for ; Fri, 4 Dec 2020 05:08:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 47D3E22573 for ; Fri, 4 Dec 2020 05:08:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727805AbgLDFHx (ORCPT ); Fri, 4 Dec 2020 00:07:53 -0500 Received: from mail.kernel.org ([198.145.29.99]:43918 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726485AbgLDFHx (ORCPT ); Fri, 4 Dec 2020 00:07:53 -0500 From: Andy Lutomirski Authentication-Results: mail.kernel.org; dkim=permerror (bad message/signature format) To: x86@kernel.org, Mathieu Desnoyers Cc: LKML , Nicholas Piggin , Arnd Bergmann , Anton Blanchard , Andy Lutomirski , stable@vger.kernel.org Subject: [PATCH v3 4/4] membarrier: Execute SYNC_CORE on the calling thread Date: Thu, 3 Dec 2020 21:07:06 -0800 Message-Id: <250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org> X-Mailer: git-send-email 2.28.0 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: stable@vger.kernel.org membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as syncing the core on all sibling threads but not necessarily the calling thread. This behavior is fundamentally buggy and cannot be used safely. Suppose a user program has two threads. Thread A is on CPU 0 and thread B is on CPU 1. Thread A modifies some text and calls membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE). Then thread B executes the modified code. If, at any point after membarrier() decides which CPUs to target, thread A could be preempted and replaced by thread B on CPU 0. This could even happen on exit from the membarrier() syscall. If this happens, thread B will end up running on CPU 0 without having synced. In principle, this could be fixed by arranging for the scheduler to sync_core_before_usermode() whenever switching between two threads in the same mm if there is any possibility of a concurrent membarrier() call, but this would have considerable overhead. Instead, make membarrier() sync the calling CPU as well. As an optimization, this avoids an extra smp_mb() in the default barrier-only mode. Cc: stable@vger.kernel.org Signed-off-by: Andy Lutomirski Reviewed-by: Mathieu Desnoyers --- kernel/sched/membarrier.c | 51 +++++++++++++++++++++++++-------------- 1 file changed, 33 insertions(+), 18 deletions(-) diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c index 01538b31f27e..57266ab32ef9 100644 --- a/kernel/sched/membarrier.c +++ b/kernel/sched/membarrier.c @@ -333,7 +333,8 @@ static int membarrier_private_expedited(int flags, int cpu_id) return -EPERM; } - if (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1) + if (flags != MEMBARRIER_FLAG_SYNC_CORE && + (atomic_read(&mm->mm_users) == 1 || num_online_cpus() == 1)) return 0; /* @@ -352,8 +353,6 @@ static int membarrier_private_expedited(int flags, int cpu_id) if (cpu_id >= nr_cpu_ids || !cpu_online(cpu_id)) goto out; - if (cpu_id == raw_smp_processor_id()) - goto out; rcu_read_lock(); p = rcu_dereference(cpu_rq(cpu_id)->curr); if (!p || p->mm != mm) { @@ -368,16 +367,6 @@ static int membarrier_private_expedited(int flags, int cpu_id) for_each_online_cpu(cpu) { struct task_struct *p; - /* - * Skipping the current CPU is OK even through we can be - * migrated at any point. The current CPU, at the point - * where we read raw_smp_processor_id(), is ensured to - * be in program order with respect to the caller - * thread. Therefore, we can skip this CPU from the - * iteration. - */ - if (cpu == raw_smp_processor_id()) - continue; p = rcu_dereference(cpu_rq(cpu)->curr); if (p && p->mm == mm) __cpumask_set_cpu(cpu, tmpmask); @@ -385,12 +374,38 @@ static int membarrier_private_expedited(int flags, int cpu_id) rcu_read_unlock(); } - preempt_disable(); - if (cpu_id >= 0) + if (cpu_id >= 0) { + /* + * smp_call_function_single() will call ipi_func() if cpu_id + * is the calling CPU. + */ smp_call_function_single(cpu_id, ipi_func, NULL, 1); - else - smp_call_function_many(tmpmask, ipi_func, NULL, 1); - preempt_enable(); + } else { + /* + * For regular membarrier, we can save a few cycles by + * skipping the current cpu -- we're about to do smp_mb() + * below, and if we migrate to a different cpu, this cpu + * and the new cpu will execute a full barrier in the + * scheduler. + * + * For CORE_SYNC, we do need a barrier on the current cpu -- + * otherwise, if we are migrated and replaced by a different + * task in the same mm just before, during, or after + * membarrier, we will end up with some thread in the mm + * running without a core sync. + * + * For RSEQ, don't rseq_preempt() the caller. User code + * is not supposed to issue syscalls at all from inside an + * rseq critical section. + */ + if (flags != MEMBARRIER_FLAG_SYNC_CORE) { + preempt_disable(); + smp_call_function_many(tmpmask, ipi_func, NULL, true); + preempt_enable(); + } else { + on_each_cpu_mask(tmpmask, ipi_func, NULL, true); + } + } out: if (cpu_id < 0)