Message ID: 20221107191136.18048-1-pvorel@suse.cz
Series: Possible bug in zram on ppc64le on vfat
On 07. 11. 22 22:25, Minchan Kim wrote:
> On Mon, Nov 07, 2022 at 08:11:35PM +0100, Petr Vorel wrote:
>> Hi all,
>>
>> following bug is trying to workaround an error on ppc64le, where
>> zram01.sh LTP test (there is also kernel selftest
>> tools/testing/selftests/zram/zram01.sh, but LTP test got further
>> updates) has often mem_used_total 0 although zram is already filled.
>
> Hi, Petr,
>
> Is it happening on only ppc64le?
>
> Is it a new regression? What kernel version did you use?

Hi, I've reported the same issue on kernels 4.12.14 and 5.3.18 internally to our kernel developers at SUSE. The bug report is not public, but I'll copy the bug description here:

A new version of the LTP test zram01 found a sysfs issue with zram devices mounted using the VFAT filesystem. When all available space is filled, e.g. by `dd if=/dev/zero of=/mnt/zram0/file`, the corresponding sysfs file /sys/block/zram0/mm_stat reports that the compressed data size on the device is 0 and the total memory usage is also 0. LTP test zram01 uses these values to calculate the compression ratio, which results in division by zero.

The issue is specific to the PPC64LE architecture and the VFAT filesystem. No other tested filesystem has this issue and I could not reproduce it on other archs (s390 not tested). The issue appears randomly, about every 3 test runs, on SLE-15SP2 and 15SP3 (kernel 5.3). It appears less frequently on SLE-12SP5 (kernel 4.12). Other SLE versions were not tested with the new test version yet. The previous version of the test did not check the VFAT filesystem on zram devices.

I've tried to debug the issue and collected some interesting data (all values come from a zram device with a 25M size limit and the zstd compression algorithm):

- mm_stat values are correct after mkfs.vfat:
  65536 220 65536 26214400 65536 0 0 0

- mm_stat values stay correct after mount:
  65536 220 65536 26214400 65536 0 0 0

- the bug is triggered by filling the filesystem to capacity (using dd):
  4194304 0 0 26214400 327680 64 0 0

- adding `sleep 1` between `dd` and reading mm_stat does not help

- adding `sync` between `dd` and reading mm_stat appears to fix the issue:
  26214400 2404 262144 26214400 327680 399 0 0
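For reference, the sequence described in the report boils down to roughly the following shell sketch (device name, 25M size, mount point and the zstd algorithm are taken from the description above; the sysfs knobs are the standard zram ones, adjust for your setup):

  modprobe zram num_devices=1
  echo zstd > /sys/block/zram0/comp_algorithm
  echo 25M  > /sys/block/zram0/disksize
  mkfs.vfat /dev/zram0
  mkdir -p /mnt/zram0 && mount /dev/zram0 /mnt/zram0

  cat /sys/block/zram0/mm_stat                           # sane after mkfs + mount
  dd if=/dev/zero of=/mnt/zram0/file bs=1k 2>/dev/null   # fill to capacity
  cat /sys/block/zram0/mm_stat                           # compressed size / mem_used_total may read 0 here
  sync
  cat /sys/block/zram0/mm_stat                           # values look sane again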
On Mon, Nov 07, 2022 at 10:47:33PM +0100, Petr Vorel wrote:
> Hi Minchan,
>
> > On Mon, Nov 07, 2022 at 08:11:35PM +0100, Petr Vorel wrote:
> > > Hi all,
> > >
> > > following bug is trying to workaround an error on ppc64le, where
> > > zram01.sh LTP test (there is also kernel selftest
> > > tools/testing/selftests/zram/zram01.sh, but LTP test got further
> > > updates) has often mem_used_total 0 although zram is already filled.
> >
> > Hi, Petr,
> >
> > Is it happening on only ppc64le?
>
> I haven't seen it on other archs (x86_64, aarch64).
>
> > Is it a new regression? What kernel version did you use?
>
> Found on openSUSE kernel, which uses stable kernel releases 6.0.x.
> It's probably much older; first I saw it some years ago (I'm not able to find
> the kernel version), but it was random. Now it's much more common.
>
> Test runs on a VM (I can give the qemu command or whatever you need to know about it).
> I'll try to verify it on some bare metal ppc64le.

Hi Petr and Martin,

Thanks for testing and the meaningful information.

Could you tell me how I could create a VM to run ppc64le and run the test?
I'd like to reproduce it locally to debug it.

Thanks!
On (22/11/10 15:29), Martin Doucha wrote:
> A new version of the LTP test zram01 found a sysfs issue with zram devices
> mounted using the VFAT filesystem. When all available space is filled, e.g.
> by `dd if=/dev/zero of=/mnt/zram0/file`, the corresponding sysfs file
> /sys/block/zram0/mm_stat reports that the compressed data size on the
> device is 0 and the total memory usage is also 0. LTP test zram01 uses these
> values to calculate the compression ratio, which results in division by zero.
>
> The issue is specific to the PPC64LE architecture and the VFAT filesystem. No
> other tested filesystem has this issue and I could not reproduce it on other
> archs (s390 not tested). The issue appears randomly, about every 3 test runs,
> on SLE-15SP2 and 15SP3 (kernel 5.3). It appears less frequently on SLE-12SP5
> (kernel 4.12). Other SLE versions were not tested with the new test version
> yet. The previous version of the test did not check the VFAT filesystem on
> zram devices.

Whoooaa...

> I've tried to debug the issue and collected some interesting data (all
> values come from a zram device with a 25M size limit and the zstd compression
> algorithm):
> - mm_stat values are correct after mkfs.vfat:
>   65536 220 65536 26214400 65536 0 0 0
>
> - mm_stat values stay correct after mount:
>   65536 220 65536 26214400 65536 0 0 0
>
> - the bug is triggered by filling the filesystem to capacity (using dd):
>   4194304 0 0 26214400 327680 64 0 0

Can you try using /dev/urandom for dd, not /dev/zero?
Do you still see zeroes in sysfs output or some random values?
Hi Martin,

thanks a lot for providing a more complete description!

Kind regards,
Petr
> On Mon, Nov 07, 2022 at 10:47:33PM +0100, Petr Vorel wrote:
> > Hi Minchan,
> >
> > > On Mon, Nov 07, 2022 at 08:11:35PM +0100, Petr Vorel wrote:
> > > > Hi all,
> > > >
> > > > following bug is trying to workaround an error on ppc64le, where
> > > > zram01.sh LTP test (there is also kernel selftest
> > > > tools/testing/selftests/zram/zram01.sh, but LTP test got further
> > > > updates) has often mem_used_total 0 although zram is already filled.
> > >
> > > Hi, Petr,
> > >
> > > Is it happening on only ppc64le?
> >
> > I haven't seen it on other archs (x86_64, aarch64).
> >
> > > Is it a new regression? What kernel version did you use?
> >
> > Found on openSUSE kernel, which uses stable kernel releases 6.0.x.
> > It's probably much older; first I saw it some years ago (I'm not able to find
> > the kernel version), but it was random. Now it's much more common.
> >
> > Test runs on a VM (I can give the qemu command or whatever you need to know about it).
> > I'll try to verify it on some bare metal ppc64le.
>
> Hi Petr and Martin,
>
> Thanks for testing and the meaningful information.
>
> Could you tell me how I could create a VM to run ppc64le and run the test?
> I'd like to reproduce it locally to debug it.

I suppose you don't have a ppc64le bare metal machine, thus you run on x86_64.

One way would be to install qemu-system-ppc64 on the host, download an ISO image of any distro which supports ppc64le and install it with virt-manager (which would fill in the necessary qemu parameters).

Another way, which I often use, is to build a system with the Buildroot distribution. You can clone my Buildroot fork, branch debug/zram [1]; I put my configuration there in 3 commits. I added 0001-zram-Debug-mm_stat_show.patch [2] on top of 6.0.7 with a little debugging. All that is needed now is to 1) install qemu-system-ppc64 on the host (a speedup: Buildroot is configured not to compile qemu-system-ppc64 itself), then:

$ make # takes time
$ ./output/images/start-qemu.sh serial-only

When I have a ppc64le host with enough space, I often use rapido [3], but that crashed a stable kernel (another story which I'll report soon).

Hope that helps.

Kind regards,
Petr

[1] https://github.com/pevik/buildroot/commits/debug/zram
[2] https://github.com/pevik/buildroot/blob/debug/zram/0001-zram-Debug-mm_stat_show.patch
[3] https://github.com/rapido-linux/rapido
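Put together, the Buildroot route described above looks roughly like this (repository, branch and script names come from the mail; the package-manager step and the in-guest step are assumptions that depend on the host distribution and on how the LTP test gets into the image):

  # 1) install qemu-system-ppc64 with the host's package manager
  #    (package name varies by distribution)
  # 2) build and boot the prepared ppc64le environment:
  git clone -b debug/zram https://github.com/pevik/buildroot
  cd buildroot
  make                                        # takes time: toolchain, kernel 6.0.7 + debug patch, rootfs
  ./output/images/start-qemu.sh serial-only   # boots the ppc64le guest under qemu-system-ppc64
  # 3) inside the guest, run the zram01.sh reproducer
  #    (copy it in or add LTP to the Buildroot image)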
Hi Sergey,

> On (22/11/10 15:29), Martin Doucha wrote:
> > A new version of the LTP test zram01 found a sysfs issue with zram devices
> > mounted using the VFAT filesystem. When all available space is filled, e.g.
> > by `dd if=/dev/zero of=/mnt/zram0/file`, the corresponding sysfs file
> > /sys/block/zram0/mm_stat reports that the compressed data size on the
> > device is 0 and the total memory usage is also 0. LTP test zram01 uses these
> > values to calculate the compression ratio, which results in division by zero.
>
> > The issue is specific to the PPC64LE architecture and the VFAT filesystem. No
> > other tested filesystem has this issue and I could not reproduce it on other
> > archs (s390 not tested). The issue appears randomly, about every 3 test runs,
> > on SLE-15SP2 and 15SP3 (kernel 5.3). It appears less frequently on SLE-12SP5
> > (kernel 4.12). Other SLE versions were not tested with the new test version
> > yet. The previous version of the test did not check the VFAT filesystem on
> > zram devices.
>
> Whoooaa...
>
> > I've tried to debug the issue and collected some interesting data (all
> > values come from a zram device with a 25M size limit and the zstd compression
> > algorithm):
> > - mm_stat values are correct after mkfs.vfat:
> >   65536 220 65536 26214400 65536 0 0 0
>
> > - mm_stat values stay correct after mount:
> >   65536 220 65536 26214400 65536 0 0 0
>
> > - the bug is triggered by filling the filesystem to capacity (using dd):
> >   4194304 0 0 26214400 327680 64 0 0
>
> Can you try using /dev/urandom for dd, not /dev/zero?
> Do you still see zeroes in sysfs output or some random values?

I'm not sure if Martin has had time to rerun the test. I was not able to reproduce the problem any more on the machine where the test was failing. But I'll have a look into this during this week.

Kind regards,
Petr
On 11. 11. 22 1:48, Sergey Senozhatsky wrote:
> On (22/11/10 15:29), Martin Doucha wrote:
>> I've tried to debug the issue and collected some interesting data (all
>> values come from a zram device with a 25M size limit and the zstd compression
>> algorithm):
>> - mm_stat values are correct after mkfs.vfat:
>>   65536 220 65536 26214400 65536 0 0 0
>>
>> - mm_stat values stay correct after mount:
>>   65536 220 65536 26214400 65536 0 0 0
>>
>> - the bug is triggered by filling the filesystem to capacity (using dd):
>>   4194304 0 0 26214400 327680 64 0 0
>
> Can you try using /dev/urandom for dd, not /dev/zero?
> Do you still see zeroes in sysfs output or some random values?

After 50 test runs on a kernel where the issue is confirmed, I could not reproduce the failure while filling the device from /dev/urandom instead of /dev/zero. The test reported a compression ratio around 1.8-2.5, which means the memory usage reported by mm_stat was 10-13MB.

Note that I had to disable the other filesystems in the test because some of them kept failing with compression ratio <1.
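In other words, the rerun replaces the fill source in the reproducer sketch given earlier (same hypothetical device and mount point):

  dd if=/dev/urandom of=/mnt/zram0/file bs=1k 2>/dev/null   # incompressible data, no same-filled pages
  sync
  cat /sys/block/zram0/mm_stat                              # compressed size and mem_used_total stay non-zero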
> On 11. 11. 22 1:48, Sergey Senozhatsky wrote:
> > On (22/11/10 15:29), Martin Doucha wrote:
> > > I've tried to debug the issue and collected some interesting data (all
> > > values come from a zram device with a 25M size limit and the zstd compression
> > > algorithm):
> > > - mm_stat values are correct after mkfs.vfat:
> > >   65536 220 65536 26214400 65536 0 0 0
> > > - mm_stat values stay correct after mount:
> > >   65536 220 65536 26214400 65536 0 0 0
> > > - the bug is triggered by filling the filesystem to capacity (using dd):
> > >   4194304 0 0 26214400 327680 64 0 0
> > Can you try using /dev/urandom for dd, not /dev/zero?
> > Do you still see zeroes in sysfs output or some random values?

> After 50 test runs on a kernel where the issue is confirmed, I could not
> reproduce the failure while filling the device from /dev/urandom instead of
> /dev/zero. The test reported a compression ratio around 1.8-2.5, which means
> the memory usage reported by mm_stat was 10-13MB.

Martin, thanks a lot for rerunning the tests. I wonder whether the problem with /dev/zero is a kernel bug or just a problem which should be worked around.

> Note that I had to disable the other filesystems in the test because some of
> them kept failing with compression ratio <1.

Yes, I noticed that as well, at least on exfat and btrfs (if I remember correctly). It wouldn't be a problem to just use it for vfat if we agreed the test should be modified.

Kind regards,
Petr
On (22/11/22 16:07), Petr Vorel wrote:
> > On 11. 11. 22 1:48, Sergey Senozhatsky wrote:
> > > On (22/11/10 15:29), Martin Doucha wrote:
[..]
> > > Can you try using /dev/urandom for dd, not /dev/zero?
> > > Do you still see zeroes in sysfs output or some random values?
>
> > After 50 test runs on a kernel where the issue is confirmed, I could not
> > reproduce the failure while filling the device from /dev/urandom instead of
> > /dev/zero. The test reported a compression ratio around 1.8-2.5, which means
> > the memory usage reported by mm_stat was 10-13MB.
>
> Martin, thanks a lot for rerunning the tests. I wonder whether the problem with
> /dev/zero is a kernel bug or just a problem which should be worked around.

Hmm. Does CONFIG_KASAN show anything interesting?
On 29. 11. 22 5:38, Sergey Senozhatsky wrote:
> On (22/11/22 16:07), Petr Vorel wrote:
>> Martin, thanks a lot for rerunning the tests. I wonder whether the problem with
>> /dev/zero is a kernel bug or just a problem which should be worked around.
>
> Hmm. Does CONFIG_KASAN show anything interesting?

Sorry for the delay. We've tried to reproduce the bug with CONFIG_KASAN enabled, but the only affected arch is PPC64LE and KASAN is not available there at all. Our kernel maintainers confirmed that if we need KASAN to debug this, we're out of luck.
After thinking it through, I think I might have an explanation...

On Fri, Aug 04, 2023 at 04:37:11PM +1000, Ian Wienand wrote:
> To recap; this test [1] creates a zram device, makes a filesystem on
> it, and fills it with sequential 1k writes from /dev/zero via dd. The
> problem is that it sees the mem_used_total for the zram device as zero
> in the sysfs stats after the writes; this causes a divide by zero
> error in the script calculation.
>
> An annotated extract:
>
> zram01 3 TINFO: /sys/block/zram1/disksize = '26214400'
> zram01 3 TPASS: test succeeded
> zram01 4 TINFO: set memory limit to zram device(s)
> zram01 4 TINFO: /sys/block/zram1/mem_limit = '25M'
> zram01 4 TPASS: test succeeded
> zram01 5 TINFO: make vfat filesystem on /dev/zram1
>
> >> at this point a cat of /sys/block/zram1/mm_stat shows
> >> 65536 527 65536 26214400 65536 0 0 0
>
> zram01 5 TPASS: zram_makefs succeeded

So I think the thing to note is that mem_used_total is the current number of pages used by the zsmalloc allocator to store compressed data (reported as pages * PAGE_SIZE). So we have made the file system, which is now quiescent and just has basic vfat data; this is compressed and stored, and there's one page allocated for this (arm64, 64k pages).

> zram01 6 TINFO: mount /dev/zram1
> zram01 6 TPASS: mount of zram device(s) succeeded
> zram01 7 TINFO: filling zram1 (it can take long time)
> zram01 7 TPASS: zram1 was filled with '25568' KB
>
> >> however, /sys/block/zram1/mm_stat shows
> >> 9502720 0 0 26214400 196608 145 0 0
> >> the script reads this zero value and tries to calculate the
> >> compression ratio
>
> ./zram01.sh: line 145: 100 * 1024 * 25568 / 0: division by 0 (error token is "0")

At this point, because this test fills from /dev/zero, the zsmalloc pool doesn't actually have anything in it. The filesystem metadata is in use from the writes and is not yet written out as compressed data. The zram page de-duplication has kicked in, and instead of handles to zsmalloc areas for data we just have "this is a page of zeros" recorded. So this is correctly reflecting the fact that we don't actually have anything compressed stored at this time.

> >> If we do a "sync" then redisplay the mm_stat after, we get
> >> 26214400 2842 65536 26214400 196608 399 0 0

Now we've finished writing all our zeros and have synced, we will have finished updating vfat allocations, etc. So this gets compressed and written, and we're back to having some small FS metadata compressed in our 1 page of zsmalloc allocations.

I think what is probably "special" about this reproducer system is that it is slow enough to allow the zero allocation to persist between the end of the test writes and examining the stats.

I'd be happy for any thoughts on the likelihood of this!

If we think this is right, then the point of the end of this test [1] is to ensure a high reported compression ratio on the device, presumably to ensure the compression is working. Filling it with urandom would be unreliable in this regard. I think what we want to do is write something highly compressible like alternating lengths of 0x00 and 0xFF. This will avoid the same-page detection and ensure we actually have compressed data, and we can continue to assert on the high compression ratio reliably.

I'm happy to propose this if we generally agree.

Thanks,

-i

> [1] https://github.com/linux-test-project/ltp/blob/8c201e55f684965df2ae5a13ff439b28278dec0d/testcases/kernel/device-drivers/zram/zram01.sh
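A rough sketch of what such a fill could look like (not the actual LTP change being proposed; the mount path, run length and total size are illustrative). Keeping each run shorter than a page means no page consists of a single repeated word-sized pattern, so the same-page detection is skipped while the data still compresses very well:

  # write alternating 1 KiB runs of 0x00 and 0xff bytes
  fill_compressible() {
      total_kb=$1                                    # total amount to write, in KiB
      i=0
      while [ "$i" -lt "$total_kb" ]; do
          head -c 1024 /dev/zero                     # 1 KiB of 0x00
          head -c 1024 /dev/zero | tr '\000' '\377'  # 1 KiB of 0xff
          i=$((i + 2))
      done
  }

  fill_compressible 25568 > /mnt/zram1/file
  sync
  cat /sys/block/zram1/mm_stat    # compressed size and mem_used_total should now be non-zero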
On (23/08/07 14:44), Ian Wienand wrote:
[..]
> At this point, because this test fills from /dev/zero, the zsmalloc
> pool doesn't actually have anything in it. The filesystem metadata is
> in use from the writes and is not yet written out as compressed data.
> The zram page de-duplication has kicked in, and instead of handles to
> zsmalloc areas for data we just have "this is a page of zeros"
> recorded. So this is correctly reflecting the fact that we don't
> actually have anything compressed stored at this time.
>
> > >> If we do a "sync" then redisplay the mm_stat after, we get
> > >> 26214400 2842 65536 26214400 196608 399 0 0
>
> Now we've finished writing all our zeros and have synced, we will
> have finished updating vfat allocations, etc. So this gets compressed
> and written, and we're back to having some small FS metadata compressed
> in our 1 page of zsmalloc allocations.
>
> I think what is probably "special" about this reproducer system is
> that it is slow enough to allow the zero allocation to persist between
> the end of the test writes and examining the stats.
>
> I'd be happy for any thoughts on the likelihood of this!

Thanks for looking into this. Yes, the fact that /dev/urandom shows non-zero values in mm_stat means that we don't have anything fishy going on in zram; instead we very likely have ZRAM_SAME pages, which don't reach the zsmalloc pool and don't use any physical pages. And that is what the 145 in the mm_stat posted earlier is: we have 145 pages that are filled with the same byte pattern:

> > >> however, /sys/block/zram1/mm_stat shows
> > >> 9502720 0 0 26214400 196608 145 0 0

> If we think this is right, then the point of the end of this test [1]
> is to ensure a high reported compression ratio on the device, presumably
> to ensure the compression is working. Filling it with urandom would
> be unreliable in this regard. I think what we want to do is write something
> highly compressible like alternating lengths of 0x00 and 0xFF.

So variable-length 0x00/0xff runs should work; just make sure that you don't end up with a pattern of sizeof(unsigned long) length. I think fio has an option to generate binary data with a certain level of compressibility. If that option works, then maybe you can just use fio with some static buffer_compress_percentage configuration.
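For example, something along these lines (a hedged sketch, not a tested LTP change: buffer_compress_percentage, refill_buffers and end_fsync are standard fio job options, but the percentage, block size, write size and target path here are made up for illustration):

  fio --name=zram-fill \
      --filename=/mnt/zram1/file \
      --rw=write --bs=64k --size=24M \
      --buffer_compress_percentage=80 \
      --refill_buffers \
      --end_fsync=1                 # flush so mm_stat reflects the written data
  cat /sys/block/zram1/mm_stat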