
fs/ceph/mds_client: ignore responses for waiting requests

Message ID 20230302130650.2209938-1-max.kellermann@ionos.com
State New
Series fs/ceph/mds_client: ignore responses for waiting requests

Commit Message

Max Kellermann March 2, 2023, 1:06 p.m. UTC
If a request is put on the waiting list, its submission is postponed
until the session becomes ready (e.g. via `mdsc->waiting_for_map` or
`session->s_waiting`).  If a `CEPH_MSG_CLIENT_REPLY` happens to be
received before `CEPH_MSG_MDS_MAP`, the request gets freed, and then
this assertion fails:

 WARN_ON_ONCE(!list_empty(&req->r_wait));
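For context, this is roughly how a request ends up on that waiting list; the
following is a simplified sketch of the __do_request() logic in
fs/ceph/mds_client.c, for illustration only, not the exact upstream code:

 /* simplified sketch of __do_request(), for illustration only */
 static void __do_request(struct ceph_mds_client *mdsc,
                          struct ceph_mds_request *req)
 {
 	struct ceph_mds_session *session;

 	/* no usable MDS map yet: park the request until CEPH_MSG_MDS_MAP arrives */
 	if (mdsc->mdsmap->m_epoch == 0) {
 		list_add(&req->r_wait, &mdsc->waiting_for_map);
 		return;
 	}

 	/* (in the real code, the target mds and its session are chosen here) */

 	/* session not open yet: park the request on the session's wait list */
 	if (session->s_state != CEPH_MDS_SESSION_OPEN &&
 	    session->s_state != CEPH_MDS_SESSION_HUNG) {
 		list_add(&req->r_wait, &session->s_waiting);
 		return;
 	}

 	/* only now is the request actually sent to the MDS */
 	__send_request(session, req, false);
 }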

This occurred on a server after the Ceph MDS connection failed, and a
corrupt reply packet was received for a waiting request:

 libceph: mds1 (1)10.0.0.10:6801 socket error on write
 libceph: mds1 (1)10.0.0.10:6801 session reset
 ceph: mds1 closed our session
 ceph: mds1 reconnect start
 ceph: mds1 reconnect success
 ceph: problem parsing mds trace -5
 ceph: mds parse_reply err -5
 ceph: mdsc_handle_reply got corrupt reply mds1(tid:5530222)
 [...]
 ------------[ cut here ]------------
 WARNING: CPU: 9 PID: 229180 at fs/ceph/mds_client.c:966 ceph_mdsc_release_request+0x17a/0x180
 Modules linked in:
 CPU: 9 PID: 229180 Comm: kworker/9:3 Not tainted 6.1.8-cm4all1 #45
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 Workqueue: ceph-msgr ceph_con_workfn
 RIP: 0010:ceph_mdsc_release_request+0x17a/0x180
 Code: 39 d8 75 26 5b 48 89 ee 48 8b 3d f9 2d 04 02 5d e9 fb 01 c9 ff e8 56 85 ab ff eb 9c 48 8b 7f 58 e8 db 4d ff ff e9 a4 fe ff ff <0f> 0b eb d6 66 90 0f 1f 44 00 00 41 54 48 8d 86 b8 03 00 00 55 4c
 RSP: 0018:ffffa6f2c0e2bd20 EFLAGS: 00010287
 RAX: ffff8f58b93687f8 RBX: ffff8f592f6374a8 RCX: 0000000000000aed
 RDX: 0000000000000ac0 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: ffff8f592f637148 R08: 0000000000000001 R09: ffffffffa901de00
 R10: 0000000000000001 R11: ffffd630ad09dfc8 R12: ffff8f58b9368000
 R13: ffff8f5806b33800 R14: ffff8f58894f6780 R15: 000000000054626e
 FS:  0000000000000000(0000) GS:ffff8f630f040000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ffc2926df68 CR3: 0000000218dce002 CR4: 00000000001706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  mds_dispatch+0xec5/0x1460
  ? inet_recvmsg+0x4d/0xf0
  ceph_con_process_message+0x6b/0x80
  ceph_con_v1_try_read+0xb92/0x1400
  ceph_con_workfn+0x383/0x4e0
  process_one_work+0x1da/0x370
  ? process_one_work+0x370/0x370
  worker_thread+0x4d/0x3c0
  ? process_one_work+0x370/0x370
  kthread+0xbb/0xe0
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x22/0x30
  </TASK>
 ---[ end trace 0000000000000000 ]---
 ceph: mds1 caps renewed

If we know that a request has not yet been submitted, we should ignore
all responses for it, just like we ignore responses for unknown TIDs.
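For reference, the existing unknown-TID handling at the start of
handle_reply() looks roughly like this (simplified excerpt; the exact code may
differ between kernel versions), and the new check added by this patch drops
replies for still-waiting requests in the same way:

 	/* simplified excerpt of handle_reply(), for illustration */
 	tid = le64_to_cpu(msg->hdr.tid);
 	mutex_lock(&mdsc->mutex);
 	req = lookup_get_request(mdsc, tid);
 	if (!req) {
 		/* reply for a TID we know nothing about: ignore it */
 		dout("handle_reply on unknown tid %llu\n", tid);
 		mutex_unlock(&mdsc->mutex);
 		return;
 	}

 	/* new: a request still on r_wait was never sent to the MDS, so a
 	 * reply for it cannot be genuine either; ignore it as well */
 	if (!list_empty(&req->r_wait)) {
 		pr_err("mdsc_handle_reply on waiting request tid %llu\n", tid);
 		mutex_unlock(&mdsc->mutex);
 		goto out;
 	}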

To: ceph-devel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/ceph/mds_client.c | 7 +++++++
 1 file changed, 7 insertions(+)

Comments

Xiubo Li March 8, 2023, 3:42 a.m. UTC | #1
Hi Max,

Sorry for the late reply.

On 02/03/2023 21:06, Max Kellermann wrote:
> If a request is put on the waiting list, its submission is postponed
> until the session becomes ready (e.g. via `mdsc->waiting_for_map` or
> `session->s_waiting`).  If a `CEPH_MSG_CLIENT_REPLY` happens to be
> received before `CEPH_MSG_MDS_MAP`, the request gets freed, and then
> this assertion fails:

How could this happen?

Since the req hasn't been submitted yet, how could it receive a reply normally?

>   WARN_ON_ONCE(!list_empty(&req->r_wait));
>
> This occurred on a server after the Ceph MDS connection failed, and a
> corrupt reply packet was received for a waiting request:
>
>   libceph: mds1 (1)10.0.0.10:6801 socket error on write
>   libceph: mds1 (1)10.0.0.10:6801 session reset
>   ceph: mds1 closed our session
>   ceph: mds1 reconnect start
>   ceph: mds1 reconnect success
>   ceph: problem parsing mds trace -5
>   ceph: mds parse_reply err -5
>   ceph: mdsc_handle_reply got corrupt reply mds1(tid:5530222)

It should be a corrupted reply, and it led us to look up an incorrect req, 
one which hasn't been submitted yet.

BTW, do you have a dump of the corrupted msg from 'ceph_msg_dump(msg)'?

Then we can check what exactly was corrupted.

Thanks

- Xiubo

>   [...]
>   ------------[ cut here ]------------
>   WARNING: CPU: 9 PID: 229180 at fs/ceph/mds_client.c:966 ceph_mdsc_release_request+0x17a/0x180
>   Modules linked in:
>   CPU: 9 PID: 229180 Comm: kworker/9:3 Not tainted 6.1.8-cm4all1 #45
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
>   Workqueue: ceph-msgr ceph_con_workfn
>   RIP: 0010:ceph_mdsc_release_request+0x17a/0x180
>   Code: 39 d8 75 26 5b 48 89 ee 48 8b 3d f9 2d 04 02 5d e9 fb 01 c9 ff e8 56 85 ab ff eb 9c 48 8b 7f 58 e8 db 4d ff ff e9 a4 fe ff ff <0f> 0b eb d6 66 90 0f 1f 44 00 00 41 54 48 8d 86 b8 03 00 00 55 4c
>   RSP: 0018:ffffa6f2c0e2bd20 EFLAGS: 00010287
>   RAX: ffff8f58b93687f8 RBX: ffff8f592f6374a8 RCX: 0000000000000aed
>   RDX: 0000000000000ac0 RSI: 0000000000000000 RDI: 0000000000000000
>   RBP: ffff8f592f637148 R08: 0000000000000001 R09: ffffffffa901de00
>   R10: 0000000000000001 R11: ffffd630ad09dfc8 R12: ffff8f58b9368000
>   R13: ffff8f5806b33800 R14: ffff8f58894f6780 R15: 000000000054626e
>   FS:  0000000000000000(0000) GS:ffff8f630f040000(0000) knlGS:0000000000000000
>   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   CR2: 00007ffc2926df68 CR3: 0000000218dce002 CR4: 00000000001706e0
>   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>   Call Trace:
>    <TASK>
>    mds_dispatch+0xec5/0x1460
>    ? inet_recvmsg+0x4d/0xf0
>    ceph_con_process_message+0x6b/0x80
>    ceph_con_v1_try_read+0xb92/0x1400
>    ceph_con_workfn+0x383/0x4e0
>    process_one_work+0x1da/0x370
>    ? process_one_work+0x370/0x370
>    worker_thread+0x4d/0x3c0
>    ? process_one_work+0x370/0x370
>    kthread+0xbb/0xe0
>    ? kthread_complete_and_exit+0x20/0x20
>    ret_from_fork+0x22/0x30
>    </TASK>
>   ---[ end trace 0000000000000000 ]---
>   ceph: mds1 caps renewed
>
> If we know that a request has not yet been submitted, we should ignore
> all responses for it, just like we ignore responses for unknown TIDs.
>
> To: ceph-devel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
>   fs/ceph/mds_client.c | 7 +++++++
>   1 file changed, 7 insertions(+)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 27a245d959c0..fa74fdb2cbfb 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3275,6 +3275,13 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>   	}
>   	dout("handle_reply %p\n", req);
>   
> +	/* waiting, not yet submitted? */
> +	if (!list_empty(&req->r_wait)) {
> +		pr_err("mdsc_handle_reply on waiting request tid %llu\n", tid);
> +		mutex_unlock(&mdsc->mutex);
> +		goto out;
> +	}
> +
>   	/* correct session? */
>   	if (req->r_session != session) {
>   		pr_err("mdsc_handle_reply got %llu on session mds%d"
Max Kellermann March 8, 2023, 3:17 p.m. UTC | #2
On Wed, Mar 8, 2023 at 4:42 AM Xiubo Li <xiubli@redhat.com> wrote:
> How could this happen ?
>
> Since the req hasn't been submitted yet, how could it receive a reply
> normally ?

I have no idea. We have frequent problems with the MDS closing the
connection (once or twice a week), and sometimes this leads to the
WARNING, which leaves the server hanging. It looks like a timing
issue, but the MDS connection failures themselves are a separate
problem.
My patch just attempts to address the WARNING; not knowing much about
Ceph internals, my idea was that even if the server sends bad reply
packets, the client shouldn't panic.

> It should be a corrupted reply and it lead us to get a incorrect req,
> which hasn't been submitted yet.
>
> BTW, do you have the dump of the corrupted msg by 'ceph_msg_dump(msg)' ?

Unfortunately not - we have already scrubbed the server that had this
problem and rebooted it with a fresh image including my patch. It
seems I don't have a full copy of the kernel log anymore.

Coincidentally, the patch prevented another kernel hang just a few
minutes ago:

 Mar 08 15:48:53 sweb1 kernel: ceph: mds0 caps stale
 Mar 08 15:49:13 sweb1 kernel: ceph: mds0 caps stale
 Mar 08 15:49:35 sweb1 kernel: ceph: mds0 caps went stale, renewing
 Mar 08 15:49:35 sweb1 kernel: ceph: mds0 caps stale
 Mar 08 15:49:35 sweb1 kernel: libceph: mds0 (1)10.41.2.11:6801 socket error on write
 Mar 08 15:49:35 sweb1 kernel: libceph: mds0 (1)10.41.2.11:6801 session reset
 Mar 08 15:49:35 sweb1 kernel: ceph: mds0 closed our session
 Mar 08 15:49:35 sweb1 kernel: ceph: mds0 reconnect start
 Mar 08 15:49:36 sweb1 kernel: ceph: mds0 reconnect success
 Mar 08 15:49:36 sweb1 kernel: ceph:  dropping dirty+flushing Fx state for 0000000064778286 2199046848012
 Mar 08 15:49:40 sweb1 kernel: ceph: mdsc_handle_reply on waiting request tid 1106187
 Mar 08 15:49:53 sweb1 kernel: ceph: mds0 caps renewed

Since my patch is already in place, the kernel didn't inspect the
unexpected packet any further, and thus didn't dump it.
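
If it would help, the new branch could also dump the offending message next
time it happens; a minimal sketch (assuming ceph_msg_dump() is the appropriate
helper here, as in the existing corrupt-reply path) would be:

 	/* hypothetical extra logging, not part of the patch as posted */
 	if (!list_empty(&req->r_wait)) {
 		pr_err("mdsc_handle_reply on waiting request tid %llu\n", tid);
 		ceph_msg_dump(msg);	/* capture the unexpected reply for analysis */
 		mutex_unlock(&mdsc->mutex);
 		goto out;
 	}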

If you need more information and have a patch with more logging, I
could easily boot those servers with your patch and post that data
next time it happens.

Max
Xiubo Li March 9, 2023, 7:33 a.m. UTC | #3
On 09/03/2023 15:30, Max Kellermann wrote:
> On Thu, Mar 9, 2023 at 6:31 AM Xiubo Li <xiubli@redhat.com> wrote:
>
>> I attached one testing patch based on yours, just with more debug logs
>> added, which won't introduce a perf issue since the logs are only
>> printed in corner cases.
>>
>> Could you help test it ?
> The patch now runs on one of our clusters, and I'll get back to you as
> soon as the problem occurs again. Thanks so far!

Cool, thanks very much.

- Xiubo

Patch

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 27a245d959c0..fa74fdb2cbfb 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3275,6 +3275,13 @@  static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 	}
 	dout("handle_reply %p\n", req);
 
+	/* waiting, not yet submitted? */
+	if (!list_empty(&req->r_wait)) {
+		pr_err("mdsc_handle_reply on waiting request tid %llu\n", tid);
+		mutex_unlock(&mdsc->mutex);
+		goto out;
+	}
+
 	/* correct session? */
 	if (req->r_session != session) {
 		pr_err("mdsc_handle_reply got %llu on session mds%d"