
[v3] nvme: rdma/tcp: fix list corruption with anatt timer

Message ID 20210427085246.13728-1-mwilck@suse.com
State New
Series [v3] nvme: rdma/tcp: fix list corruption with anatt timer

Commit Message

Martin Wilck April 27, 2021, 8:52 a.m. UTC
From: Martin Wilck <mwilck@suse.com>

We have observed a few crashes in run_timer_softirq(), where a broken
timer_list struct belonging to an anatt_timer was encountered. The broken
structures look like this; we actually see multiple of them attached to
the same timer base:

crash> struct timer_list 0xffff92471bcfdc90
struct timer_list {
  entry = {
    next = 0xdead000000000122,  // LIST_POISON2
    pprev = 0x0
  },
  expires = 4296022933,
  function = 0xffffffffc06de5e0 <nvme_anatt_timeout>,
  flags = 20
}

If such a timer is encountered in run_timer_softirq(), the kernel
crashes. The test scenario was an I/O load test with lots of NVMe
controllers, some of which were removed and re-added on the storage side.
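
For what it's worth, this poison pattern is consistent with what
detach_timer() in kernel/time/timer.c leaves behind, shown here slightly
abridged:

/* kernel/time/timer.c, slightly abridged */
static inline void detach_timer(struct timer_list *timer, bool clear_pending)
{
	struct hlist_node *entry = &timer->entry;

	__hlist_del(entry);
	if (clear_pending)
		entry->pprev = NULL;
	entry->next = LIST_POISON2;
}

The next = LIST_POISON2 / pprev = 0x0 combination is exactly what
detach_timer(timer, true) produces: the timer was detached once, yet is
still reachable from a timer base.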

I think this may happen when the rdma error recovery work starts, via this
call chain:

nvme_rdma_error_recovery_work()
  /* this stops all sorts of activity for the controller, but not
     the multipath-related work queue and timer */
  nvme_rdma_reconnect_or_remove(ctrl)
    => kicks reconnect_work

work queue: reconnect_work

nvme_rdma_reconnect_ctrl_work()
  nvme_rdma_setup_ctrl()
    nvme_rdma_configure_admin_queue()
      nvme_init_identify()
        nvme_mpath_init()
          /* this sets some fields of the timer_list without taking a lock */
          timer_setup()
          nvme_read_ana_log()
            mod_timer() or del_timer_sync()
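
To illustrate, the end of that call chain roughly corresponds to this
pattern in drivers/nvme/host/multipath.c (a simplified sketch, not
verbatim code):

int nvme_mpath_init(struct nvme_ctrl *ctrl, struct nvme_id_ctrl *id)
{
	...
	/* re-initializes the timer_list, even if the timer is pending */
	timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
	...
	error = nvme_read_ana_log(ctrl);
	...
}

static int nvme_read_ana_log(struct nvme_ctrl *ctrl)
{
	...
	if (nr_change_groups)
		/* enqueues the timer that was just re-initialized above */
		mod_timer(&ctrl->anatt_timer, ctrl->anatt * HZ * 2 + jiffies);
	else
		del_timer_sync(&ctrl->anatt_timer);
	...
}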

The same applies to TCP. The idea for the patch is based on the observation
that
nvme_rdma_reset_ctrl_work() calls nvme_stop_ctrl()->nvme_mpath_stop(),
whereas nvme_rdma_error_recovery_work() stops only the keepalive timer, but
not the anatt timer.
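
For comparison, nvme_stop_ctrl() in drivers/nvme/host/core.c looks roughly
like this (sketch):

void nvme_stop_ctrl(struct nvme_ctrl *ctrl)
{
	nvme_mpath_stop(ctrl);		/* del_timer_sync() on the anatt timer */
	nvme_stop_keep_alive(ctrl);
	...
}

so the reset path quiesces the anatt timer, while the error recovery path
did not.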

I admit that the root cause analysis isn't rock solid yet. In particular, I
can't explain why we see LIST_POISON2 in the "next" pointer, which would
indicate that the timer has been detached before; yet we find it linked to
the timer base when the crash occurs.

OTOH, the anatt_timer is only touched in nvme_mpath_init() (see above) and
nvme_mpath_stop(), so the hypothesis that modifying active timers may cause
the issue isn't totally far-fetched. I suspect that the LIST_POISON2 may
come about in multiple steps.
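
To make the multi-step hypothesis concrete, here is a toy userspace model
of the kernel's hlist (hypothetical demo code, not taken from the kernel)
showing how timer_setup() on a pending timer, followed by mod_timer(), can
leave a poisoned node reachable from a stale base:

#include <stdio.h>
#include <stddef.h>

#define LIST_POISON2 ((struct hlist_node *)0xdead000000000122)

struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
	struct hlist_node *first = h->first;

	n->next = first;
	if (first)
		first->pprev = &n->next;
	h->first = n;
	n->pprev = &h->first;
}

/* mimics detach_timer(timer, true) */
static void detach(struct hlist_node *n)
{
	struct hlist_node *next = n->next;
	struct hlist_node **pprev = n->pprev;

	*pprev = next;
	if (next)
		next->pprev = pprev;
	n->pprev = NULL;
	n->next = LIST_POISON2;
}

int main(void)
{
	struct hlist_head base_a = { NULL }, base_b = { NULL };
	struct hlist_node timer;

	hlist_add_head(&timer, &base_a);  /* mod_timer(): pending on base A */
	timer.pprev = NULL;               /* timer_setup(): looks idle now  */
	hlist_add_head(&timer, &base_b);  /* mod_timer(): enqueued again    */
	detach(&timer);                   /* expiry detaches from base B    */

	/* base A still points at the node, whose ->next is now poison */
	printf("base_a.first = %p, ->next = %p\n",
	       (void *)base_a.first, (void *)base_a.first->next);
	return 0;
}

In this model the node ends up detached and poisoned, yet still linked
into the first base's list; the kernel would crash when it walks that
stale list in expire_timers().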

If anyone has better ideas, please advise. The issue occurs very
sporadically; verifying this by testing will be difficult.

Signed-off-by: Martin Wilck <mwilck@suse.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chao Leng <lengchao@huawei.com>
Cc: stable@vger.kernel.org

---
Changes in v3: Changed the subject line, as suggested by Sagi Grimberg

Changes in v2: Moved call to nvme_mpath_stop() further down, directly before
the call of nvme_rdma_reconnect_or_remove() (Chao Leng)
---
 drivers/nvme/host/multipath.c | 1 +
 drivers/nvme/host/rdma.c      | 1 +
 drivers/nvme/host/tcp.c       | 1 +
 3 files changed, 3 insertions(+)

Comments

Hannes Reinecke April 28, 2021, 6:35 a.m. UTC | #1
On 4/27/21 9:54 PM, Martin Wilck wrote:
> On Tue, 2021-04-27 at 20:05 +0200, Hannes Reinecke wrote:
>> On 4/27/21 6:25 PM, Christoph Hellwig wrote:
>>> On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke wrote:
>>>> As indicated in my previous mail, please change the description.
>>>> We have
>>>> since established an actual reason (duplicate calls to
>>>> add_timer()), so
>>>> please list it here.
>>>
>>> So what happens if the offending add_timer is changed to mod_timer?
>>>
>> I guess that should be fine, as the boilerplate said it can act
>> as a safe version of add_timer.
>>
>> But that would just solve the crash upon add_timer().
> 
> The code doesn't use add_timer(), only mod_timer() and
> del_timer_sync(). And we didn't observe a crash upon add_timer(). What
> we observed was that a timer had been enqueued multiple times, and the
> kernel crashes in expire_timers()->detach_timer(), when it encounters
> an already detached entry in the timer list.
> 
nvme_mpath_init() doesn't use add_timer, but it uses timer_setup(). And
calling that on an already pending timer is even worse :-)

And my point is that the anatt timer is not stopped at the end of
nvme_init_identify() if any of the calls to

nvme_configure_apst()
nvme_configure_timestamp()
nvme_configure_directives()
nvme_configure_acre()

returns with an error. If one of them does, the controller is reset,
causing e.g. nvme_tcp_configure_admin_queue() to be called, which will
call timer_setup() with the original timer still running.
If the (original) timer triggers _after_ that point, we have the crash.
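
For illustration, the tail of nvme_init_identify() looks roughly like this
(a sketch; nvme_mpath_init() has already armed the anatt timer by the time
these calls run):

	ret = nvme_mpath_init(ctrl, id);	/* may arm the anatt timer */
	if (ret < 0)
		return ret;

	ret = nvme_configure_apst(ctrl);
	if (ret < 0)
		return ret;	/* anatt timer keeps running on this path */

	ret = nvme_configure_timestamp(ctrl);
	if (ret < 0)
		return ret;
	...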

Cheers,

Hannes
Martin Wilck April 28, 2021, 6:41 a.m. UTC | #2
On Wed, 2021-04-28 at 08:35 +0200, Hannes Reinecke wrote:
> On 4/27/21 9:54 PM, Martin Wilck wrote:
> > On Tue, 2021-04-27 at 20:05 +0200, Hannes Reinecke wrote:
> > > On 4/27/21 6:25 PM, Christoph Hellwig wrote:
> > > > On Tue, Apr 27, 2021 at 11:33:04AM +0200, Hannes Reinecke
> > > > wrote:
> > > > > As indicated in my previous mail, please change the
> > > > > description.
> > > > > We have
> > > > > since established an actual reason (duplicate calls to
> > > > > add_timer()), so
> > > > > please list it here.
> > > > 
> > > > So what happens if the offending add_timer is changed to
> > > > mod_timer?
> > > > 
> > > I guess that should be fine, as the boilerplate said it can act
> > > as a safe version of add_timer.
> > > 
> > > But that would just solve the crash upon add_timer().
> > 
> > The code doesn't use add_timer(), only mod_timer() and
> > del_timer_sync(). And we didn't observe a crash upon add_timer().
> > What
> > we observed was that a timer had been enqueued multiple times, and
> > the
> > kernel crashes in expire_timers()->detach_timer(), when it
> > encounters
> > an already detached entry in the timer list.
> > 
> nvme_mpath_init() doesn't use add_timer, but it uses timer_setup().

Yes. The notion that this is wrong was the idea behind my patch.

> And
> calling that on an already pending timer is even worse :-)
> 
> And my point is that the anatt timer is not stopped at the end of
> nvme_init_identify() if any of the calls to
> 
> nvme_configure_apst()
> nvme_configure_timestamp()
> nvme_configure_directives()
> nvme_configure_acre()
> 
> returns with an error. If one of them does, the controller is reset,
> causing e.g. nvme_tcp_configure_admin_queue() to be called, which will
> call timer_setup() with the original timer still running.
> If the (original) timer triggers _after_ that point, we have the crash.

You are right. But afaics the problem doesn't have to originate in
these 4 function calls. The same applies even after
nvme_init_identify() has returned. Any error that would trigger the
error recovery work after the anatt timer has started would have this
effect.

Regards,
Martin
Maurizio Lombardi April 28, 2021, 7:21 a.m. UTC | #3
On Wed, Apr 28, 2021 at 9:06 AM Martin Wilck <mwilck@suse.com> wrote:
>
> Yes, that's what I think has happened. timer_setup() doesn't clear any
> pointers in the list of pending timers pointing to this entry. If the
> newly-initialized timer is then added with mod_timer(), it becomes
> linked in a second timer list. When the first one expires, the timer
> will be detached, but only from one of the lists it's pending in. In a
> scenario like the one we faced, this could actually happen multiple
> times. If the detached timer remains linked into a timer list, once
> that list is traversed, the kernel dereferences a pointer with value
> LIST_POISON2, and crashes.

Yes, I think that makes sense.
timer_setup() modifies the timer's base in the "flags" field, so a
subsequent mod_timer() could add the timer to the wrong base structure.

Maurizio
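
For reference, timer_setup() boils down to do_init_timer() in
kernel/time/timer.c, which (slightly abridged) does:

static void do_init_timer(struct timer_list *timer,
			  void (*func)(struct timer_list *),
			  unsigned int flags,
			  const char *name, struct lock_class_key *key)
{
	timer->entry.pprev = NULL;	/* timer_pending() is now false */
	timer->function = func;
	timer->flags = flags | raw_smp_processor_id();
	...
}

Clearing entry.pprev makes a still-enqueued timer look idle, so the next
mod_timer() enqueues it a second time; resetting the CPU number in "flags"
is what can point it at a different base, as described above.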

Patch

diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index a1d476e1ac02..c63dd5dfa7ff 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -586,6 +586,7 @@  void nvme_mpath_stop(struct nvme_ctrl *ctrl)
 	del_timer_sync(&ctrl->anatt_timer);
 	cancel_work_sync(&ctrl->ana_work);
 }
+EXPORT_SYMBOL_GPL(nvme_mpath_stop);
 
 #define SUBSYS_ATTR_RW(_name, _mode, _show, _store)  \
 	struct device_attribute subsys_attr_##_name =	\
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index be905d4fdb47..fc07a7b0dc1d 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1202,6 +1202,7 @@  static void nvme_rdma_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	nvme_mpath_stop(&ctrl->ctrl);
 	nvme_rdma_reconnect_or_remove(ctrl);
 }
 
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index a0f00cb8f9f3..46287b4f4d10 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2068,6 +2068,7 @@  static void nvme_tcp_error_recovery_work(struct work_struct *work)
 		return;
 	}
 
+	nvme_mpath_stop(ctrl);
 	nvme_tcp_reconnect_or_remove(ctrl);
 }