[RFC,0/4] ceph: fix generic/421 test failure

Message ID 20250205000249.123054-1-slava@dubeyko.com

Message

Viacheslav Dubeyko Feb. 5, 2025, 12:02 a.m. UTC
From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>

The generic/421 test fails to finish because of the following hung task:

Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>

There are several issues here:
(1) ceph_kill_sb() doesn't wait for the flushing of all
dirty folios/pages to finish because of the racy nature
of mdsc->stopping_blockers. As a result, mdsc->stopping
becomes CEPH_MDSC_STOPPING_FLUSHED too early.
(2) The ceph_inc_osd_stopping_blocker(fsc->mdsc) call fails
to increment mdsc->stopping_blockers. As a result,
already locked folios/pages are never unlocked
and the logic tries to lock the same page a second time.
(3) The folio_batch of dirty pages found by
filemap_get_folios_tag() is not processed properly.
This is why some dirty pages are simply never
processed and dirty folios/pages remain after unmount.
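
Issue (2) above can be illustrated with a small userspace model. This
is only a sketch under assumptions: every toy_* name is hypothetical,
a pthread mutex stands in for the kernel locking, and the rule that a
new blocker is refused once flushing has started is inferred from the
description above rather than copied from fs/ceph.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the mdsc->stopping states. */
enum toy_stopping { TOY_RUNNING, TOY_FLUSHING, TOY_FLUSHED };

struct toy_mdsc {
	pthread_mutex_t lock;        /* stands in for the kernel lock */
	enum toy_stopping stopping;  /* models mdsc->stopping */
	int stopping_blockers;       /* models mdsc->stopping_blockers */
};

static void toy_mdsc_init(struct toy_mdsc *m)
{
	pthread_mutex_init(&m->lock, NULL);
	m->stopping = TOY_RUNNING;
	m->stopping_blockers = 0;
}

/* Unmount side: from now on, new blockers are refused. */
static void toy_start_flushing(struct toy_mdsc *m)
{
	pthread_mutex_lock(&m->lock);
	m->stopping = TOY_FLUSHING;
	pthread_mutex_unlock(&m->lock);
}

/* Register as a blocker; fails once flushing has started. */
static bool toy_inc_blocker(struct toy_mdsc *m)
{
	bool ok;

	pthread_mutex_lock(&m->lock);
	ok = m->stopping < TOY_FLUSHING;
	if (ok)
		m->stopping_blockers++;
	pthread_mutex_unlock(&m->lock);
	return ok;
}

static void toy_dec_blocker(struct toy_mdsc *m)
{
	pthread_mutex_lock(&m->lock);
	assert(m->stopping_blockers > 0);
	m->stopping_blockers--;
	pthread_mutex_unlock(&m->lock);
}

/*
 * Writer side: lock n folios, then try to register as a blocker.
 * Returns how many folios are still locked on return.
 */
static int toy_writeback(struct toy_mdsc *m, bool *locked, int n,
			 bool unlock_on_failure)
{
	int i, still_locked = 0;

	for (i = 0; i < n; i++)
		locked[i] = true;          /* models folio_lock() */

	if (toy_inc_blocker(m)) {
		for (i = 0; i < n; i++)
			locked[i] = false; /* write done, folio unlocked */
		toy_dec_blocker(m);
	} else if (unlock_on_failure) {
		for (i = 0; i < n; i++)
			locked[i] = false; /* error path releases folios */
	}

	for (i = 0; i < n; i++)
		still_locked += locked[i];
	return still_locked;
}
```

In this model, a writer that loses the race with toy_start_flushing()
and has no unlock-on-failure path returns with all of its folios still
locked, so a later writeback pass would block on them.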

This patchset refactors the ceph_writepages_start()
method and fixes the issues by means of:
(1) introducing a dirty_folios counter and a flush_end_wq
wait queue in struct ceph_mds_client;
(2) incrementing the dirty_folios counter in
ceph_dirty_folio();
(3) decrementing the dirty_folios counter in
writepages_finish() and waking up all waiters on the
queue if the counter is less than or equal to zero;
(4) adding logic to ceph_kill_sb() that checks the value
of the dirty_folios counter and waits while it is
greater than zero;
(5) calling ceph_inc_osd_stopping_blocker() at the
beginning of ceph_writepages_start() and
ceph_dec_osd_stopping_blocker() at its end, in order
to resolve the racy nature of mdsc->stopping_blockers.
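
Steps (1)-(4) above can be sketched in userspace as a counter guarded
by a mutex plus a condition variable. This is a minimal model, not the
patch itself: the toy_* names are hypothetical, and the pthread
condition variable merely stands in for the flush_end_wq wait queue.

```c
#include <assert.h>
#include <pthread.h>

struct toy_flush_state {
	pthread_mutex_t lock;
	pthread_cond_t flush_end;  /* stands in for flush_end_wq */
	long dirty_folios;         /* stands in for dirty_folios */
};

static void toy_flush_init(struct toy_flush_state *s)
{
	pthread_mutex_init(&s->lock, NULL);
	pthread_cond_init(&s->flush_end, NULL);
	s->dirty_folios = 0;
}

/* Step (2): a folio becomes dirty. */
static void toy_dirty_folio(struct toy_flush_state *s)
{
	pthread_mutex_lock(&s->lock);
	s->dirty_folios++;
	pthread_mutex_unlock(&s->lock);
}

/* Step (3): nr folios were written back; wake waiters at <= 0. */
static void toy_writepages_finish(struct toy_flush_state *s, long nr)
{
	pthread_mutex_lock(&s->lock);
	s->dirty_folios -= nr;
	if (s->dirty_folios <= 0)
		pthread_cond_broadcast(&s->flush_end);
	pthread_mutex_unlock(&s->lock);
}

/* Step (4): the unmount path waits while the counter is > 0. */
static void toy_kill_sb_wait(struct toy_flush_state *s)
{
	pthread_mutex_lock(&s->lock);
	while (s->dirty_folios > 0)
		pthread_cond_wait(&s->flush_end, &s->lock);
	assert(s->dirty_folios <= 0);
	pthread_mutex_unlock(&s->lock);
}

/* Helper for inspection only. */
static long toy_dirty_count(struct toy_flush_state *s)
{
	long n;

	pthread_mutex_lock(&s->lock);
	n = s->dirty_folios;
	pthread_mutex_unlock(&s->lock);
	return n;
}
```

In a real run toy_kill_sb_wait() would be called from a different
thread than toy_writepages_finish(); re-checking the counter in a loop
under the lock is what keeps the waiter from missing a wake-up.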

sudo ./check generic/421
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.13.0+ #137 SMP PREEMPT_DYNAMIC Mon Feb  3 20:30:08 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40551:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40551:/scratch /mnt/scratch

generic/421 7s ...  4s
Ran: generic/421
Passed all 1 tests

Viacheslav Dubeyko (4):
  ceph: extend ceph_writeback_ctl for ceph_writepages_start()
    refactoring
  ceph: introduce ceph_process_folio_batch() method
  ceph: introduce ceph_submit_write() method
  ceph: fix generic/421 test failure

 fs/ceph/addr.c       | 1110 +++++++++++++++++++++++++++---------------
 fs/ceph/mds_client.c |    2 +
 fs/ceph/mds_client.h |    3 +
 fs/ceph/super.c      |   11 +
 4 files changed, 746 insertions(+), 380 deletions(-)

Comments

Viacheslav Dubeyko Feb. 12, 2025, 6:05 p.m. UTC | #1
Hi David,

Have you tried the fix? Does it fix the issue on your side?

Thanks,
Slava.

On Tue, 2025-02-04 at 16:02 -0800, Viacheslav Dubeyko wrote:
> [...]
David Howells Feb. 14, 2025, 5:19 p.m. UTC | #2
Okay...   I *think* that fixes the hang.  There was one case where I saw the
hang, but I'm not sure that I had your patches applied or whether I'd managed
to boot the previous kernel that didn't.

So, just with respect to fixing the hang:

	Tested-by: David Howells <dhowells@redhat.com>

There's still the issue of encrypted filenames occasionally showing through
which generic/397 is showing up - but I don't think your patches here fix
that, right?

David
Viacheslav Dubeyko Feb. 14, 2025, 5:36 p.m. UTC | #3
On Fri, 2025-02-14 at 17:19 +0000, David Howells wrote:
> Okay...   I *think* that fixes the hang.  There was one case where I saw the
> hang, but I'm not sure that I had your patches applied or whether I'd managed
> to boot the previous kernel that didn't.
> 
> So, just with respect to fixing the hang:
> 
> 	Tested-by: David Howells <dhowells@redhat.com>
> 
> There's still the issue of encrypted filenames occasionally showing through
> which generic/397 is showing up - but I don't think your patches here fix
> that, right?
> 

This patchset doesn't fix the generic/397 issue. I sent another patch ([PATCH
v2] ceph: Fix kernel crash in generic/397 test) [1] before this one with the
fix.

Thanks,
Slava.

[1]
https://lore.kernel.org/all/CAO8a2SjrDL5TqW70P3yyqv8X-B5jfQRg-eMTs9Nbntr8=Mwbog@mail.gmail.com/T/
David Howells Feb. 14, 2025, 8:35 p.m. UTC | #4
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> > There's still the issue of encrypted filenames occasionally showing through
> > which generic/397 is showing up - but I don't think your patches here fix
> > that, right?
> > 
> 
> This patchset doesn't fix the generic/397 issue. I sent another patch ([PATCH
> v2] ceph: Fix kernel crash in generic/397 test) [1] before this one with the
> fix.

That doesn't fix the problem either.  That seems to be fixing a crash, not:

generic/397       - output mismatch (see /root/xfstests-dev/results//generic/397.out.bad)
    --- tests/generic/397.out   2024-09-12 12:36:14.167441927 +0100
    +++ /root/xfstests-dev/results//generic/397.out.bad 2025-02-14 20:34:10.365900035 +0000
    @@ -1,13 +1,27 @@
     QA output created by 397
    +Only in /xfstest.scratch/ref_dir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
    +Only in /xfstest.scratch/edir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy����Sd�S�e��[�@���7,��
                                                                                            [�g��
    +Only in /xfstest.scratch/edir: 70h6RnwpEg1PMtJp9yQ,2g
    +Only in /xfstest.scratch/edir: HHBOImQ7cdmsZKNhc5yPCX+XKu0+dn4VViEQzd0q3Ig
    +Only in /xfstest.scratch/edir: HXYO3UK3FrxqwSZaNnQ5zQ
    +Only in /xfstest.scratch/edir: PecH6opy8KkkB8ir8Oz0pw
    ...
    (Run 'diff -u /root/xfstests-dev/tests/generic/397.out /root/xfstests-dev/results//generic/397.out.bad'  to see the entire diff)


David
Viacheslav Dubeyko Feb. 14, 2025, 8:45 p.m. UTC | #5
On Fri, 2025-02-14 at 20:35 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > > There's still the issue of encrypted filenames occasionally showing through
> > > which generic/397 is showing up - but I don't think your patches here fix
> > > that, right?
> > > 
> > 
> > This patchset doesn't fix the generic/397 issue. I sent another patch ([PATCH
> > v2] ceph: Fix kernel crash in generic/397 test) [1] before this one with the
> > fix.
> 
> That doesn't fix the problem either.  That seems to be fixing a crash, not:
> 
> generic/397       - output mismatch (see /root/xfstests-dev/results//generic/397.out.bad)
>     --- tests/generic/397.out   2024-09-12 12:36:14.167441927 +0100
>     +++ /root/xfstests-dev/results//generic/397.out.bad 2025-02-14 20:34:10.365900035 +0000
>     @@ -1,13 +1,27 @@
>      QA output created by 397
>     +Only in /xfstest.scratch/ref_dir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
>     +Only in /xfstest.scratch/edir: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy����Sd�S�e��[�@���7,��
>                                                                                             [�g��
>     +Only in /xfstest.scratch/edir: 70h6RnwpEg1PMtJp9yQ,2g
>     +Only in /xfstest.scratch/edir: HHBOImQ7cdmsZKNhc5yPCX+XKu0+dn4VViEQzd0q3Ig
>     +Only in /xfstest.scratch/edir: HXYO3UK3FrxqwSZaNnQ5zQ
>     +Only in /xfstest.scratch/edir: PecH6opy8KkkB8ir8Oz0pw
>     ...
>     (Run 'diff -u /root/xfstests-dev/tests/generic/397.out /root/xfstests-dev/results//generic/397.out.bad'  to see the entire diff)
> 
> 
> 

Do you mean that you applied this modification?

---
 fs/ceph/addr.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 85936f6d2bf7..5e6ba92219f3 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -396,6 +396,15 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 		struct page **pages;
 		size_t page_off;
 
+		/*
+		 * The io_iter.count needs to be corrected to aligned length.
+		 * Otherwise, iov_iter_get_pages_alloc2() operates with
+		 * the initial unaligned length value. As a result,
+		 * ceph_msg_data_cursor_init() triggers BUG_ON() in the case
+		 * if msg->sparse_read_total > msg->data_length.
+		 */
+		subreq->io_iter.count = len;
+
 		err = iov_iter_get_pages_alloc2(&subreq->io_iter, &pages, len, &page_off);
 		if (err < 0) {
 			doutc(cl, "%llx.%llx failed to allocate pages, %d\n",
@@ -405,6 +414,7 @@ static void ceph_netfs_issue_read(struct netfs_io_subrequest *subreq)
 
 		/* should always give us a page-aligned read */
 		WARN_ON_ONCE(page_off);
+
 		len = err;
 		err = 0;
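
The comment added in the hunk above can be illustrated with a toy
model. This sketch makes two assumptions that are not taken from the
patch: that the page-gathering helper never returns more bytes than
the iterator's count, and that the "aligned length" is the requested
length rounded up to a block size. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy iterator: only tracks how many bytes are left. */
struct toy_iter {
	size_t count;
};

/* Round n up to a multiple of blk (hypothetical block size). */
static size_t toy_round_up(size_t n, size_t blk)
{
	assert(blk != 0);
	return (n + blk - 1) / blk * blk;
}

/* Assumed behaviour: never hand out more than the iterator holds. */
static size_t toy_get_pages(struct toy_iter *it, size_t maxsize)
{
	size_t n = maxsize < it->count ? maxsize : it->count;

	it->count -= n;
	return n;
}

/* Bytes obtained for a read of `want` bytes, with or without the fix. */
static size_t toy_issue_read(size_t want, size_t blk, int fix_count)
{
	struct toy_iter it = { .count = want };
	size_t len = toy_round_up(want, blk);

	if (fix_count)
		it.count = len;  /* mirrors subreq->io_iter.count = len */
	return toy_get_pages(&it, len);
}
```

Under these assumptions, an unaligned request yields fewer bytes than
the aligned length unless the count is corrected first, which is the
kind of mismatch the comment in the hunk describes.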
David Howells Feb. 14, 2025, 8:52 p.m. UTC | #6
Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:

> Do you mean that you applied this modification?

See:

	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes

for what I have applied.

David
Viacheslav Dubeyko Feb. 14, 2025, 8:57 p.m. UTC | #7
On Fri, 2025-02-14 at 20:52 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > Do you mean that you applied this modification?
> 
> See:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes
> 
> for what I have applied.
> 

OK. I didn't see such output during the testing:

generic/397       - output mismatch (see /root/xfstests-dev/results//generic/397.out.bad)
    --- tests/generic/397.out   2024-09-12 12:36:14.167441927 +0100
    +++ /root/xfstests-dev/results//generic/397.out.bad 2025-02-14 20:34:10.365900035 +0000
    @@ -1,13 +1,27 @@
     QA output created by 397
    +Only in /xfstest.scratch/ref_dir:
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
yyyyyyyyyyyyyyy
    +Only in /xfstest.scratch/edir:
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy����Sd�S�e��[�@�
��7,��
                                                                               
[�g��
    +Only in /xfstest.scratch/edir: 70h6RnwpEg1PMtJp9yQ,2g
    +Only in /xfstest.scratch/edir: HHBOImQ7cdmsZKNhc5yPCX+XKu0+dn4VViEQzd0q3Ig
    +Only in /xfstest.scratch/edir: HXYO3UK3FrxqwSZaNnQ5zQ
    +Only in /xfstest.scratch/edir: PecH6opy8KkkB8ir8Oz0pw
    ...
    (Run 'diff -u /root/xfstests-dev/tests/generic/397.out /root/xfstests-dev/results//generic/397.out.bad'  to see the entire diff)

Let me double check the test again.

Thanks,
Slava.
Viacheslav Dubeyko Feb. 14, 2025, 11:52 p.m. UTC | #8
On Fri, 2025-02-14 at 20:52 +0000, David Howells wrote:
> Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> wrote:
> 
> > Do you mean that you applied this modification?
> 
> See:
> 
> 	https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=netfs-fixes
> 
> for what I have applied.
> 

I took your branch [1] and compiled the kernel:

git status
HEAD detached at origin/netfs-fixes

But I cannot reproduce the issue:

sudo ./check -g encrypt
FSTYP         -- ceph
PLATFORM      -- Linux/x86_64 ceph-testing-0001 6.14.0-rc2+ #1 SMP PREEMPT_DYNAMIC Fri Feb 14 23:04:17 UTC 2025
MKFS_OPTIONS  -- 127.0.0.1:40137:/scratch
MOUNT_OPTIONS -- -o name=fs,secret=<secret>,ms_mode=crc,nowsync,copyfrom 127.0.0.1:40137:/scratch /mnt/scratch

generic/395 15s ...  10s
generic/396 12s ...  9s
generic/397 13s ...  11s
generic/398 1s ... [not run] kernel doesn't support renameat2 syscall
generic/399 28s ... [not run] Filesystem ceph not supported in _scratch_mkfs_sized_encrypted
generic/419 1s ... [not run] kernel doesn't support renameat2 syscall
generic/421 17s ...  13s
generic/429 24s ...  22s
generic/435 1115s ...  873s
generic/440 18s ...  13s
generic/548 2s ... [not run] xfs_io fiemap  failed (old kernel/wrong fs?)
generic/549 2s ... [not run] encryption policy '-c 5 -n 6 -f 0' is unusable; probably missing kernel crypto API support
generic/550 4s ... [not run] encryption policy '-c 9 -n 9 -f 0' is unusable; probably missing kernel crypto API support
generic/576       [not run] fsverity utility required, skipped this test
generic/580 18s ...  15s
generic/581 21s ...  20s
generic/582 2s ... [not run] xfs_io fiemap  failed (old kernel/wrong fs?)
generic/583 2s ... [not run] encryption policy '-c 5 -n 6 -v 2 -f 0' is unusable; probably missing kernel crypto API support
generic/584 3s ... [not run] encryption policy '-c 9 -n 9 -v 2 -f 0' is unusable; probably missing kernel crypto API support
generic/592 3s ... [not run] kernel does not support encryption policy: '-c 1 -n 4 -v 2 -f 8'
generic/593 18s ...  14s
generic/595 20s ...  19s
generic/602 2s ... [not run] kernel does not support encryption policy: '-c 1 -n 4 -v 2 -f 16'
generic/613 5s ... [not run] _get_encryption_nonce() isn't implemented on ceph
generic/621 6s ... [not run] kernel doesn't support renameat2 syscall
generic/693 6s ... [not run] encryption policy '-c 1 -n 10 -v 2 -f 0' is unusable; probably missing kernel crypto API support
generic/739       [not run] xfs_io set_encpolicy doesn't support -s
Ran: generic/395 generic/396 generic/397 generic/398 generic/399 generic/419
generic/421 generic/429 generic/435 generic/440 generic/548 generic/549
generic/550 generic/576 generic/580 generic/581 generic/582 generic/583
generic/584 generic/592 generic/593 generic/595 generic/602 generic/613
generic/621 generic/693 generic/739
Not run: generic/398 generic/399 generic/419 generic/548 generic/549 generic/550
generic/576 generic/582 generic/583 generic/584 generic/592 generic/602
generic/613 generic/621 generic/693 generic/739
Passed all 27 tests

[1] https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git
Christian Brauner Feb. 28, 2025, 10:22 a.m. UTC | #9
On Tue, 04 Feb 2025 16:02:45 -0800, Viacheslav Dubeyko wrote:
> From: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
> 
> The generic/421 fails to finish because of the issue:
> 
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.894678] INFO: task kworker/u48:0:11 blocked for more than 122 seconds.
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895403] Not tainted 6.13.0-rc5+ #1
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.895867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896633] task:kworker/u48:0 state:D stack:0 pid:11 tgid:11 ppid:2 flags:0x00004000
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.896641] Workqueue: writeback wb_workfn (flush-ceph-24)
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897614] Call Trace:
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897620] <TASK>
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897629] __schedule+0x443/0x16b0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897637] schedule+0x2b/0x140
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897640] io_schedule+0x4c/0x80
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897643] folio_wait_bit_common+0x11b/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897646] ? _raw_spin_unlock_irq+0xe/0x50
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897652] ? __pfx_wake_page_function+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897655] __folio_lock+0x17/0x30
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897658] ceph_writepages_start+0xca9/0x1fb0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897663] ? fsnotify_remove_queued_event+0x2f/0x40
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897668] do_writepages+0xd2/0x240
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897672] __writeback_single_inode+0x44/0x350
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897675] writeback_sb_inodes+0x25c/0x550
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897680] wb_writeback+0x89/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897683] ? finish_task_switch.isra.0+0x97/0x310
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897687] wb_workfn+0xb5/0x410
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897689] process_one_work+0x188/0x3d0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897692] worker_thread+0x2b5/0x3c0
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897694] ? __pfx_worker_thread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897696] kthread+0xe1/0x120
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897699] ? __pfx_kthread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897701] ret_from_fork+0x43/0x70
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897705] ? __pfx_kthread+0x10/0x10
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897707] ret_from_fork_asm+0x1a/0x30
> Jan 3 14:25:27 ceph-testing-0001 kernel: [ 369.897711] </TASK>
> 
> [...]

Applied to the vfs-6.15.ceph branch of the vfs/vfs.git tree.
Patches in the vfs-6.15.ceph branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.15.ceph

[1/4] ceph: extend ceph_writeback_ctl for ceph_writepages_start() refactoring
      https://git.kernel.org/vfs/vfs/c/f08068df4aa4
[2/4] ceph: introduce ceph_process_folio_batch() method
      https://git.kernel.org/vfs/vfs/c/ce80b76dd327
[3/4] ceph: introduce ceph_submit_write() method
      https://git.kernel.org/vfs/vfs/c/1551ec61dc55
[4/4] ceph: fix generic/421 test failure
      https://git.kernel.org/vfs/vfs/c/fd7449d937e7