Message ID: 20230302130650.2209938-1-max.kellermann@ionos.com
State: New
Series: fs/ceph/mds_client: ignore responses for waiting requests
Hi Max,

Sorry for the late reply.

On 02/03/2023 21:06, Max Kellermann wrote:
> If a request is put on the waiting list, its submission is postponed
> until the session becomes ready (e.g. via `mdsc->waiting_for_map` or
> `session->s_waiting`). If a `CEPH_MSG_CLIENT_REPLY` happens to be
> received before `CEPH_MSG_MDS_MAP`, the request gets freed, and then
> this assertion fails:

How could this happen?

Since the req hasn't been submitted yet, how could it receive a reply
normally?

> WARN_ON_ONCE(!list_empty(&req->r_wait));
>
> This occurred on a server after the Ceph MDS connection failed, and a
> corrupt reply packet was received for a waiting request:
>
> libceph: mds1 (1)10.0.0.10:6801 socket error on write
> libceph: mds1 (1)10.0.0.10:6801 session reset
> ceph: mds1 closed our session
> ceph: mds1 reconnect start
> ceph: mds1 reconnect success
> ceph: problem parsing mds trace -5
> ceph: mds parse_reply err -5
> ceph: mdsc_handle_reply got corrupt reply mds1(tid:5530222)

It must have been a corrupted reply, and it led us to an incorrect req,
one which hasn't been submitted yet.

BTW, do you have a dump of the corrupted msg from 'ceph_msg_dump(msg)'?
With that we could check what exactly was corrupted.

Thanks

- Xiubo

> [...]
> ------------[ cut here ]------------
> WARNING: CPU: 9 PID: 229180 at fs/ceph/mds_client.c:966 ceph_mdsc_release_request+0x17a/0x180
> Modules linked in:
> CPU: 9 PID: 229180 Comm: kworker/9:3 Not tainted 6.1.8-cm4all1 #45
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
> Workqueue: ceph-msgr ceph_con_workfn
> RIP: 0010:ceph_mdsc_release_request+0x17a/0x180
> Code: 39 d8 75 26 5b 48 89 ee 48 8b 3d f9 2d 04 02 5d e9 fb 01 c9 ff e8 56 85 ab ff eb 9c 48 8b 7f 58 e8 db 4d ff ff e9 a4 fe ff ff <0f> 0b eb d6 66 90 0f 1f 44 00 00 41 54 48 8d 86 b8 03 00 00 55 4c
> RSP: 0018:ffffa6f2c0e2bd20 EFLAGS: 00010287
> RAX: ffff8f58b93687f8 RBX: ffff8f592f6374a8 RCX: 0000000000000aed
> RDX: 0000000000000ac0 RSI: 0000000000000000 RDI: 0000000000000000
> RBP: ffff8f592f637148 R08: 0000000000000001 R09: ffffffffa901de00
> R10: 0000000000000001 R11: ffffd630ad09dfc8 R12: ffff8f58b9368000
> R13: ffff8f5806b33800 R14: ffff8f58894f6780 R15: 000000000054626e
> FS:  0000000000000000(0000) GS:ffff8f630f040000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007ffc2926df68 CR3: 0000000218dce002 CR4: 00000000001706e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
>  <TASK>
>  mds_dispatch+0xec5/0x1460
>  ? inet_recvmsg+0x4d/0xf0
>  ceph_con_process_message+0x6b/0x80
>  ceph_con_v1_try_read+0xb92/0x1400
>  ceph_con_workfn+0x383/0x4e0
>  process_one_work+0x1da/0x370
>  ? process_one_work+0x370/0x370
>  worker_thread+0x4d/0x3c0
>  ? process_one_work+0x370/0x370
>  kthread+0xbb/0xe0
>  ? kthread_complete_and_exit+0x20/0x20
>  ret_from_fork+0x22/0x30
>  </TASK>
> ---[ end trace 0000000000000000 ]---
> ceph: mds1 caps renewed
>
> If we know that a request has not yet been submitted, we should ignore
> all responses for it, just like we ignore responses for unknown TIDs.
>
> To: ceph-devel@vger.kernel.org
> Cc: stable@vger.kernel.org
> Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
> ---
>  fs/ceph/mds_client.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 27a245d959c0..fa74fdb2cbfb 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -3275,6 +3275,13 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
>  	}
>  	dout("handle_reply %p\n", req);
>
> +	/* waiting, not yet submitted? */
> +	if (!list_empty(&req->r_wait)) {
> +		pr_err("mdsc_handle_reply on waiting request tid %llu\n", tid);
> +		mutex_unlock(&mdsc->mutex);
> +		goto out;
> +	}
> +
>  	/* correct session? */
>  	if (req->r_session != session) {
>  		pr_err("mdsc_handle_reply got %llu on session mds%d"
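[For context while following the thread: the assertion under discussion
fires in the request destructor, and r_wait is the list head used to
park requests whose submission was postponed. The snippet below is a
simplified paraphrase of fs/ceph/mds_client.c around v6.1 -- a sketch
for orientation only, not a verbatim excerpt; error handling and
unrelated logic are elided.]

	/* __do_request(): park the request if no MDS/session is usable yet */
	static void __do_request(struct ceph_mds_client *mdsc,
				 struct ceph_mds_request *req)
	{
		int mds = __choose_mds(mdsc, req, &random);

		if (mds < 0 ||
		    ceph_mdsmap_get_state(mdsc->mdsmap, mds) < CEPH_MDS_STATE_ACTIVE) {
			/* postponed until a new CEPH_MSG_MDS_MAP arrives */
			list_add(&req->r_wait, &mdsc->waiting_for_map);
			return;
		}
		...
		if (session->s_state != CEPH_MDS_SESSION_OPEN &&
		    session->s_state != CEPH_MDS_SESSION_HUNG) {
			/* postponed until the session becomes ready */
			list_add(&req->r_wait, &session->s_waiting);
			goto out_session;
		}
		/* only past this point is the request actually sent */
		...
	}

	/* ceph_mdsc_release_request(): runs when the last reference is put */
	void ceph_mdsc_release_request(struct kref *kref)
	{
		struct ceph_mds_request *req =
			container_of(kref, struct ceph_mds_request, r_kref);
		...
		/* mds_client.c:966 in 6.1.8: freeing a request that is
		 * still linked on a wait list trips this warning */
		WARN_ON_ONCE(!list_empty(&req->r_wait));
		...
	}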
On Wed, Mar 8, 2023 at 4:42 AM Xiubo Li <xiubli@redhat.com> wrote:
> How could this happen?
>
> Since the req hasn't been submitted yet, how could it receive a reply
> normally?

I have no idea. We have frequent problems with the MDS closing the
connection (once or twice a week), and sometimes this leads to the
WARNING problem which leaves the server hanging. This seems to be some
timing problem, but that MDS connection problem is a separate issue.

My patch just attempts to address the WARNING; not knowing much about
Ceph internals, my idea was that even if the server sends bad reply
packets, the client shouldn't panic.

> It must have been a corrupted reply, and it led us to an incorrect req,
> one which hasn't been submitted yet.
>
> BTW, do you have a dump of the corrupted msg from 'ceph_msg_dump(msg)'?

Unfortunately not - we have already scrubbed the server that had this
problem and rebooted it with a fresh image including my patch. It seems
I don't have a full copy of the kernel log anymore.

Coincidentally, the patch prevented another kernel hang just a few
minutes ago:

Mar 08 15:48:53 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:13 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 caps went stale, renewing
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 caps stale
Mar 08 15:49:35 sweb1 kernel: libceph: mds0 (1)10.41.2.11:6801 socket error on write
Mar 08 15:49:35 sweb1 kernel: libceph: mds0 (1)10.41.2.11:6801 session reset
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 closed our session
Mar 08 15:49:35 sweb1 kernel: ceph: mds0 reconnect start
Mar 08 15:49:36 sweb1 kernel: ceph: mds0 reconnect success
Mar 08 15:49:36 sweb1 kernel: ceph: dropping dirty+flushing Fx state for 0000000064778286 2199046848012
Mar 08 15:49:40 sweb1 kernel: ceph: mdsc_handle_reply on waiting request tid 1106187
Mar 08 15:49:53 sweb1 kernel: ceph: mds0 caps renewed

Since my patch is already in place, the kernel hasn't checked the
unexpected packet and thus hasn't dumped it.

If you need more information and have a patch with more logging, I could
easily boot those servers with your patch and post that data the next
time it happens.

Max
On 09/03/2023 15:30, Max Kellermann wrote:
> On Thu, Mar 9, 2023 at 6:31 AM Xiubo Li <xiubli@redhat.com> wrote:
>
>> I have attached a testing patch based on yours, which just adds more
>> debug logs. It won't introduce any perf issue, since the logs are only
>> printed in corner cases.
>>
>> Could you help test it?
>
> The patch now runs on one of our clusters, and I'll get back to you as
> soon as the problem occurs again. Thanks so far!

Cool, thanks very much.

- Xiubo
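[Xiubo's testing patch itself is not reproduced on this page. Purely as
an illustrative sketch -- an assumption about what "more debug logs"
could look like, not the actual attachment -- the new branch could dump
the unexpected message the same way the existing corrupt-reply path in
handle_reply() already does with ceph_msg_dump():]

	/* waiting, not yet submitted? */
	if (!list_empty(&req->r_wait)) {
		pr_err("mdsc_handle_reply on waiting request tid %llu\n", tid);
		ceph_msg_dump(msg);	/* hypothetical addition: dump the
					 * bogus reply so its contents can
					 * be analyzed after the fact */
		mutex_unlock(&mdsc->mutex);
		goto out;
	}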
If a request is put on the waiting list, its submission is postponed
until the session becomes ready (e.g. via `mdsc->waiting_for_map` or
`session->s_waiting`). If a `CEPH_MSG_CLIENT_REPLY` happens to be
received before `CEPH_MSG_MDS_MAP`, the request gets freed, and then
this assertion fails:

 WARN_ON_ONCE(!list_empty(&req->r_wait));

This occurred on a server after the Ceph MDS connection failed, and a
corrupt reply packet was received for a waiting request:

 libceph: mds1 (1)10.0.0.10:6801 socket error on write
 libceph: mds1 (1)10.0.0.10:6801 session reset
 ceph: mds1 closed our session
 ceph: mds1 reconnect start
 ceph: mds1 reconnect success
 ceph: problem parsing mds trace -5
 ceph: mds parse_reply err -5
 ceph: mdsc_handle_reply got corrupt reply mds1(tid:5530222)
 [...]
 ------------[ cut here ]------------
 WARNING: CPU: 9 PID: 229180 at fs/ceph/mds_client.c:966 ceph_mdsc_release_request+0x17a/0x180
 Modules linked in:
 CPU: 9 PID: 229180 Comm: kworker/9:3 Not tainted 6.1.8-cm4all1 #45
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
 Workqueue: ceph-msgr ceph_con_workfn
 RIP: 0010:ceph_mdsc_release_request+0x17a/0x180
 Code: 39 d8 75 26 5b 48 89 ee 48 8b 3d f9 2d 04 02 5d e9 fb 01 c9 ff e8 56 85 ab ff eb 9c 48 8b 7f 58 e8 db 4d ff ff e9 a4 fe ff ff <0f> 0b eb d6 66 90 0f 1f 44 00 00 41 54 48 8d 86 b8 03 00 00 55 4c
 RSP: 0018:ffffa6f2c0e2bd20 EFLAGS: 00010287
 RAX: ffff8f58b93687f8 RBX: ffff8f592f6374a8 RCX: 0000000000000aed
 RDX: 0000000000000ac0 RSI: 0000000000000000 RDI: 0000000000000000
 RBP: ffff8f592f637148 R08: 0000000000000001 R09: ffffffffa901de00
 R10: 0000000000000001 R11: ffffd630ad09dfc8 R12: ffff8f58b9368000
 R13: ffff8f5806b33800 R14: ffff8f58894f6780 R15: 000000000054626e
 FS:  0000000000000000(0000) GS:ffff8f630f040000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007ffc2926df68 CR3: 0000000218dce002 CR4: 00000000001706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  mds_dispatch+0xec5/0x1460
  ? inet_recvmsg+0x4d/0xf0
  ceph_con_process_message+0x6b/0x80
  ceph_con_v1_try_read+0xb92/0x1400
  ceph_con_workfn+0x383/0x4e0
  process_one_work+0x1da/0x370
  ? process_one_work+0x370/0x370
  worker_thread+0x4d/0x3c0
  ? process_one_work+0x370/0x370
  kthread+0xbb/0xe0
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x22/0x30
  </TASK>
 ---[ end trace 0000000000000000 ]---
 ceph: mds1 caps renewed

If we know that a request has not yet been submitted, we should ignore
all responses for it, just like we ignore responses for unknown TIDs.

To: ceph-devel@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
---
 fs/ceph/mds_client.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 27a245d959c0..fa74fdb2cbfb 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3275,6 +3275,13 @@ static void handle_reply(struct ceph_mds_session *session, struct ceph_msg *msg)
 	}
 	dout("handle_reply %p\n", req);
 
+	/* waiting, not yet submitted? */
+	if (!list_empty(&req->r_wait)) {
+		pr_err("mdsc_handle_reply on waiting request tid %llu\n", tid);
+		mutex_unlock(&mdsc->mutex);
+		goto out;
+	}
+
 	/* correct session? */
 	if (req->r_session != session) {
 		pr_err("mdsc_handle_reply got %llu on session mds%d"
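[For reference, the hunk above lands directly after handle_reply()'s TID
lookup. Paraphrasing the surrounding upstream code -- simplified, not
verbatim -- the new branch mirrors the existing bail-out for replies
with unknown TIDs:]

	static void handle_reply(struct ceph_mds_session *session,
				 struct ceph_msg *msg)
	{
		struct ceph_mds_client *mdsc = session->s_mdsc;
		u64 tid = le64_to_cpu(msg->hdr.tid);
		struct ceph_mds_request *req;
		...
		mutex_lock(&mdsc->mutex);
		req = lookup_get_request(mdsc, tid);	/* takes a reference */
		if (!req) {
			/* replies with an unknown tid are already ignored */
			dout("handle_reply on unknown tid %llu\n", tid);
			mutex_unlock(&mdsc->mutex);
			return;
		}
		dout("handle_reply %p\n", req);

		/* the new check: a request still linked on r_wait was never
		 * submitted, so a reply carrying its tid cannot be genuine */
		if (!list_empty(&req->r_wait)) {
			pr_err("mdsc_handle_reply on waiting request tid %llu\n", tid);
			mutex_unlock(&mdsc->mutex);
			goto out;	/* the out label drops the reference */
		}
		...
	}

Note that both checks run with mdsc->mutex held, the same lock that
protects the wait lists, so testing list_empty(&req->r_wait) here is
race-free.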