From patchwork Thu Sep 20 18:48:12 2012
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
X-Patchwork-Id: 11605
Return-Path: <patch+caf_=linaro-patchwork=canonical.com@linaro.org>
X-Original-To: patchwork@peony.canonical.com
Delivered-To: patchwork@peony.canonical.com
Received: from fiordland.canonical.com (fiordland.canonical.com
 [91.189.94.145])
 by peony.canonical.com (Postfix) with ESMTP id 9670223E54
 for <patchwork@peony.canonical.com>;
 Thu, 20 Sep 2012 18:49:22 +0000 (UTC)
Received: from mail-ie0-f180.google.com (mail-ie0-f180.google.com
 [209.85.223.180])
 by fiordland.canonical.com (Postfix) with ESMTP id 21FE0A1861E
 for <linaro-patchwork@canonical.com>;
 Thu, 20 Sep 2012 18:49:22 +0000 (UTC)
Received: by mail-ie0-f180.google.com with SMTP id e10so3420758iej.11
 for <linaro-patchwork@canonical.com>;
 Thu, 20 Sep 2012 11:49:21 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=google.com; s=20120113;
 h=x-forwarded-to:x-forwarded-for:delivered-to:received-spf:from:to:cc
 :subject:date:message-id:x-mailer:in-reply-to:references
 :x-content-scanned:x-cbid:x-gm-message-state;
 bh=1fhzH3dU9ASTO0sI9Y5bgnIPL890GCou6+/ixDo2Xe4=;
 b=j+XAqP8GSHJlBxCXXnkh3uvzlJJbF0ZbG2VGNcE66bXPzaBjCVIqhHb7BNpigb7l0U
 feyXB4fBhioaRgEhSSGosiZ2HccTv+6jbAf0TK9FjVjfa5QR5DiOBZTSapDOAhEMzy/4
 THfFZSYjGGMtYHScyAkM6NlRFu1rO5paMv4CGFLiEQji4iu0W3Y+jY9sO+TJwU9hBvXP
 bJaSHZWXPNriPz0jW54DEfNOuwKaxaq2JJCkBTcx1bQ31mIdyQ2YUEFG2XY58qkKpaWh
 tvvhpwvO5tL7GSGXgFsh6A3N285E9FIxaReX6UrfIcI+y9rgXk0+Yq2czWMot/Mob3/e
 18zw==
Received: by 10.50.242.3 with SMTP id wm3mr3407267igc.0.1348166961914;
 Thu, 20 Sep 2012 11:49:21 -0700 (PDT)
X-Forwarded-To: linaro-patchwork@canonical.com
X-Forwarded-For: patch@linaro.org linaro-patchwork@canonical.com
Delivered-To: patches@linaro.org
Received: by 10.50.184.232 with SMTP id ex8csp92250igc;
 Thu, 20 Sep 2012 11:49:21 -0700 (PDT)
Received: by 10.50.106.136 with SMTP id gu8mr2571149igb.41.1348166961112;
 Thu, 20 Sep 2012 11:49:21 -0700 (PDT)
Received: from e39.co.us.ibm.com (e39.co.us.ibm.com. [32.97.110.160])
 by mx.google.com with ESMTPS id
 eh1si9794106icb.30.2012.09.20.11.49.20
 (version=TLSv1/SSLv3 cipher=OTHER);
 Thu, 20 Sep 2012 11:49:21 -0700 (PDT)
Received-SPF: pass (google.com: domain of paulmck@linux.vnet.ibm.com
 designates 32.97.110.160 as permitted sender)
 client-ip=32.97.110.160; 
Authentication-Results: mx.google.com; spf=pass (google.com: domain of
 paulmck@linux.vnet.ibm.com designates 32.97.110.160 as
 permitted sender) smtp.mail=paulmck@linux.vnet.ibm.com
Received: from /spool/local
 by e39.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use
 Only! Violators will be prosecuted
 for <patches@linaro.org> from <paulmck@linux.vnet.ibm.com>;
 Thu, 20 Sep 2012 12:49:20 -0600
Received: from d03dlp03.boulder.ibm.com (9.17.202.179)
 by e39.co.us.ibm.com (192.168.1.139) with IBM ESMTP SMTP Gateway:
 Authorized Use Only! Violators will be prosecuted; 
 Thu, 20 Sep 2012 12:49:18 -0600
Received: from d03relay05.boulder.ibm.com (d03relay05.boulder.ibm.com
 [9.17.195.107])
 by d03dlp03.boulder.ibm.com (Postfix) with ESMTP id 281FF19D8072
 for <patches@linaro.org>; Thu, 20 Sep 2012 12:48:59 -0600 (MDT)
Received: from d03av01.boulder.ibm.com (d03av01.boulder.ibm.com [9.17.195.167])
 by d03relay05.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id
 q8KImno5150954
 for <patches@linaro.org>; Thu, 20 Sep 2012 12:48:53 -0600
Received: from d03av01.boulder.ibm.com (loopback [127.0.0.1])
 by d03av01.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP
 id q8KImRQ4021345
 for <patches@linaro.org>; Thu, 20 Sep 2012 12:48:40 -0600
Received: from paulmck-ThinkPad-W500 ([9.47.24.72])
 by d03av01.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVin) with ESMTP
 id q8KImNuj020790; Thu, 20 Sep 2012 12:48:25 -0600
Received: by paulmck-ThinkPad-W500 (Postfix, from userid 1000)
 id 8650DEC532; Thu, 20 Sep 2012 11:48:22 -0700 (PDT)
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: linux-kernel@vger.kernel.org
Cc: mingo@elte.hu, laijs@cn.fujitsu.com, dipankar@in.ibm.com,
 akpm@linux-foundation.org, mathieu.desnoyers@polymtl.ca,
 josh@joshtriplett.org, niv@us.ibm.com, tglx@linutronix.de,
 peterz@infradead.org, rostedt@goodmis.org, Valdis.Kletnieks@vt.edu,
 dhowells@redhat.com, eric.dumazet@gmail.com, darren@dvhart.com,
 fweisbec@gmail.com, sbw@mit.edu, patches@linaro.org,
 "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: [PATCH tip/core/rcu 16/23] rcu: Fix day-zero grace-period
 initialization/cleanup race
Date: Thu, 20 Sep 2012 11:48:12 -0700
Message-Id: <1348166900-18716-16-git-send-email-paulmck@linux.vnet.ibm.com>
X-Mailer: git-send-email 1.7.8
In-Reply-To: <1348166900-18716-1-git-send-email-paulmck@linux.vnet.ibm.com>
References: <20120920184751.GA18657@linux.vnet.ibm.com>
 <1348166900-18716-1-git-send-email-paulmck@linux.vnet.ibm.com>
X-Content-Scanned: Fidelis XPS MAILER
x-cbid: 12092018-4242-0000-0000-000002F74F58
X-Gm-Message-State: ALoCoQnA27C635J+EhuQhfxOZm/3UqyrPw5ixQVEMkRL95T/X0cRNmUKuF+ln40C9TnjiS9QU1y5

From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>

The current approach to grace-period initialization is vulnerable to
extremely low-probability races.  These races stem from the fact that
the old grace period is marked completed on the same traversal of the
rcu_node structures that marks the start of the new grace period.
This means that some rcu_node structures will believe that the old grace
period is still in effect at the same time that other rcu_node structures
believe that the new grace period has already started.

These sorts of disagreements can result in too-short grace periods,
as shown in the following scenario:

1.	CPU 0 completes a grace period, but needs an additional
	grace period, so starts initializing one, initializing all
	the non-leaf rcu_node structures and the first leaf rcu_node
	structure.  Because CPU 0 is both completing the old grace
	period and starting a new one, it marks the completion of
	the old grace period and the start of the new grace period
	in a single traversal of the rcu_node structures.

	Therefore, CPUs corresponding to the first rcu_node structure
	can become aware that the prior grace period has completed, but
	CPUs corresponding to the other rcu_node structures will see
	this same prior grace period as still being in progress.

2.	CPU 1 passes through a quiescent state, and therefore informs
	the RCU core.  Because its leaf rcu_node structure has already
	been initialized, this CPU's quiescent state is applied to the
	new (and only partially initialized) grace period.

3.	CPU 1 enters an RCU read-side critical section and acquires
	a reference to data item A.  Note that this CPU believes that
	its critical section started after the beginning of the new
	grace period, and therefore will not block this new grace period.

4.	CPU 16 exits dyntick-idle mode.  Because it was in dyntick-idle
	mode, other CPUs informed the RCU core of its extended quiescent
	state for the past several grace periods.  This means that CPU 16
	is not yet aware that these past grace periods have ended.  Assume
	that CPU 16 corresponds to the second leaf rcu_node structure --
	which has not yet been made aware of the new grace period.

5.	CPU 16 removes data item A from its enclosing data structure
	and passes it to call_rcu(), which queues a callback in the
	RCU_NEXT_TAIL segment of the callback queue.

6.	CPU 16 enters the RCU core, possibly because it has taken a
	scheduling-clock interrupt, or alternatively because it has
	more than 10,000 callbacks queued.  It notes that the second
	most recent grace period has completed (recall that because it
	corresponds to the second as-yet-uninitialized rcu_node structure,
	it cannot yet become aware that the most recent grace period has
	completed), and therefore advances its callbacks.  The callback
	for data item A is therefore in the RCU_NEXT_READY_TAIL segment
	of the callback queue.

7.	CPU 0 completes initialization of the remaining leaf rcu_node
	structures for the new grace period, including the structure
	corresponding to CPU 16.

8.	CPU 16 again enters the RCU core, again possibly because it has
	taken a scheduling-clock interrupt, or alternatively because
	it now has more than 10,000 callbacks queued.  It notes that
	the most recent grace period has ended, and therefore advances
	its callbacks.  The callback for data item A is therefore in
	the RCU_WAIT_TAIL segment of the callback queue.

9.	All CPUs other than CPU 1 pass through quiescent states.  Because
	CPU 1 already passed through its quiescent state, the new grace
	period completes.  Note that CPU 1 is still in its RCU read-side
	critical section, still referencing data item A.

10.	Suppose that CPU 2 was the last CPU to pass through a quiescent
	state for the new grace period, and suppose further that CPU 2
	did not have any callbacks queued, therefore not needing an
	additional grace period.  CPU 2 therefore traverses all of the
	rcu_node structures, marking the new grace period as completed,
	but does not initialize a new grace period.

11.	CPU 16 yet again enters the RCU core, yet again possibly because
	it has taken a scheduling-clock interrupt, or alternatively
	because it now has more than 10,000 callbacks queued.  It notes
	that the new grace period has ended, and therefore advances
	its callbacks.  The callback for data item A is therefore in
	the RCU_DONE_TAIL segment of the callback queue.  This means
	that this callback is now considered ready to be invoked.

12.	CPU 16 invokes the callback, freeing data item A while CPU 1
	is still referencing it.
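
The callback movement in steps 5, 6, 8, and 11 can be sketched with a
toy model.  This is not kernel code: the toy_* names, the single
callback, and the grace-period numbering (gp is the new grace period,
gp - 1 the old one) are assumptions made purely for illustration.

```c
#include <assert.h>

/* Toy stand-ins for the four callback-queue segments named above. */
enum toy_segment {
	TOY_NEXT,	/* RCU_NEXT_TAIL: just queued by call_rcu() */
	TOY_NEXT_READY,	/* RCU_NEXT_READY_TAIL */
	TOY_WAIT,	/* RCU_WAIT_TAIL */
	TOY_DONE,	/* RCU_DONE_TAIL: considered ready to invoke */
};

/* Hypothetical per-CPU state: last completion noticed, and A's segment. */
struct toy_cpu {
	unsigned long completed_seen;
	enum toy_segment cb_a;
};

/*
 * If the leaf's ->completed value differs from what this CPU last
 * noticed, record it and advance the callback by one segment.
 * Returns 1 if the callback was considered for advancement.
 */
static int toy_note_completion(struct toy_cpu *cpu, unsigned long node_completed)
{
	if (node_completed == cpu->completed_seen)
		return 0;
	cpu->completed_seen = node_completed;
	if (cpu->cb_a != TOY_DONE)
		cpu->cb_a = (enum toy_segment)(cpu->cb_a + 1);
	return 1;
}

/* Replay CPU 16's view of the scenario for new grace period number gp. */
static enum toy_segment toy_replay(unsigned long gp)
{
	struct toy_cpu cpu16;

	assert(gp >= 3);
	cpu16.completed_seen = gp - 3;	/* stale after dyntick-idle (step 4) */
	cpu16.cb_a = TOY_NEXT;		/* step 5: call_rcu() queues A */

	toy_note_completion(&cpu16, gp - 2);	/* step 6: leaf not yet updated */
	toy_note_completion(&cpu16, gp - 1);	/* step 8: leaf now initialized */
	toy_note_completion(&cpu16, gp);	/* step 11: new GP marked completed */
	return cpu16.cb_a;
}
```

In this model, toy_replay() leaves the callback in TOY_DONE after three
noticed completions, even though the only grace period that began after
the callback was queued is the one that CPU 1 was wrongly excused from.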

This scenario represents a day-zero bug for TREE_RCU.  This commit
therefore ensures that the old grace period is marked completed in
all leaf rcu_node structures before a new grace period is marked
started in any of them.
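
The difference between the two orderings can be illustrated with a
small userspace sketch.  It is a simplification under stated
assumptions: the rcu_node tree is flattened into an array, locking is
omitted, and all toy_* names are hypothetical rather than taken from
kernel/rcutree.c.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_NODES 4	/* hypothetical node count for the example */

struct toy_node {
	unsigned long gpnum;		/* most recent grace period started */
	unsigned long completed;	/* most recent grace period completed */
};

/* Put every node in the state "old_gp is in progress". */
static void toy_init(struct toy_node *n, size_t nr, unsigned long old_gp)
{
	size_t i;

	assert(old_gp > 0);
	for (i = 0; i < nr; i++) {
		n[i].gpnum = old_gp;
		n[i].completed = old_gp - 1;
	}
}

/*
 * Return 1 if some node already records the start of the new grace
 * period while some node still records old_gp as being in progress.
 */
static int toy_disagree(const struct toy_node *n, size_t nr, unsigned long old_gp)
{
	int started = 0, stale = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		started |= n[i].gpnum == old_gp + 1;
		stale |= n[i].completed != old_gp;
	}
	return started && stale;
}

/* Old approach: record the end and the start in a single traversal. */
static size_t toy_single_pass(struct toy_node *n, size_t nr, unsigned long old_gp)
{
	size_t i, windows = 0;

	for (i = 0; i < nr; i++) {
		n[i].completed = old_gp;
		n[i].gpnum = old_gp + 1;
		windows += toy_disagree(n, nr, old_gp);
	}
	return windows;
}

/* New approach: record the end everywhere, then record the start. */
static size_t toy_two_pass(struct toy_node *n, size_t nr, unsigned long old_gp)
{
	size_t i, windows = 0;

	for (i = 0; i < nr; i++) {
		n[i].completed = old_gp;
		windows += toy_disagree(n, nr, old_gp);
	}
	for (i = 0; i < nr; i++) {
		n[i].gpnum = old_gp + 1;
		windows += toy_disagree(n, nr, old_gp);
	}
	return windows;
}
```

The return value counts the intermediate states in which the nodes
disagree.  In this model the single traversal produces one such state
after every node except the last, while the two-traversal ordering
produces none.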

That said, it would have been insanely difficult to force this race to
happen before the grace-period initialization process was preemptible.
Therefore, this commit is not a candidate for -stable.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>

---
 kernel/rcutree.c |   40 +++++++++++++++++-----------------------
 1 files changed, 17 insertions(+), 23 deletions(-)

diff --git a/kernel/rcutree.c b/kernel/rcutree.c
index c900c3c..25a671c 100644
--- a/kernel/rcutree.c
+++ b/kernel/rcutree.c
@@ -1139,37 +1139,31 @@ static void rcu_gp_cleanup(struct rcu_state *rsp)
 	 * they can do to advance the grace period.  It is therefore
 	 * safe for us to drop the lock in order to mark the grace
 	 * period as completed in all of the rcu_node structures.
-	 *
-	 * But if this CPU needs another grace period, it will take
-	 * care of this while initializing the next grace period.
-	 * We use RCU_WAIT_TAIL instead of the usual RCU_DONE_TAIL
-	 * because the callbacks have not yet been advanced: Those
-	 * callbacks are waiting on the grace period that just now
-	 * completed.
 	 */
-	rdp = this_cpu_ptr(rsp->rda);
-	if (*rdp->nxttail[RCU_WAIT_TAIL] == NULL) {
-		raw_spin_unlock_irq(&rnp->lock);
+	raw_spin_unlock_irq(&rnp->lock);
 
-		/*
-		 * Propagate new ->completed value to rcu_node
-		 * structures so that other CPUs don't have to
-		 * wait until the start of the next grace period
-		 * to process their callbacks.
-		 */
-		rcu_for_each_node_breadth_first(rsp, rnp) {
-			raw_spin_lock_irq(&rnp->lock);
-			rnp->completed = rsp->gpnum;
-			raw_spin_unlock_irq(&rnp->lock);
-			cond_resched();
-		}
-		rnp = rcu_get_root(rsp);
+	/*
+	 * Propagate new ->completed value to rcu_node structures so
+	 * that other CPUs don't have to wait until the start of the next
+	 * grace period to process their callbacks.  This also avoids
+	 * some nasty RCU grace-period initialization races by forcing
+	 * the end of the current grace period to be completely recorded in
+	 * all of the rcu_node structures before the beginning of the next
+	 * grace period is recorded in any of the rcu_node structures.
+	 */
+	rcu_for_each_node_breadth_first(rsp, rnp) {
 		raw_spin_lock_irq(&rnp->lock);
+		rnp->completed = rsp->gpnum;
+		raw_spin_unlock_irq(&rnp->lock);
+		cond_resched();
 	}
+	rnp = rcu_get_root(rsp);
+	raw_spin_lock_irq(&rnp->lock);
 
 	rsp->completed = rsp->gpnum; /* Declare grace period done. */
 	trace_rcu_grace_period(rsp->name, rsp->completed, "end");
 	rsp->fqs_state = RCU_GP_IDLE;
+	rdp = this_cpu_ptr(rsp->rda);
 	if (cpu_needs_another_gp(rsp, rdp))
 		rsp->gp_flags = 1;
 	raw_spin_unlock_irq(&rnp->lock);