Message ID | 20250107-virtual_address_range-tests-v1-0-3834a2fb47fe@linutronix.de |
---|---|
Headers | show |
Series | selftests/mm: virtual_address_range: Two bugfixes and a cleanup | expand |
On 08.01.25 07:09, Dev Jain wrote: > > On 07/01/25 8:44 pm, Thomas Weißschuh wrote: >> During the execution of validate_complete_va_space() a lot of memory is >> on the VM subsystem. When running on a low memory subsystem an OOM may >> be triggered, when writing to the dump file as the filesystem may also >> require memory. >> >> On my test system with 1100MiB physical memory: >> >> Tasks state (memory values in pages): >> [ pid ] uid tgid total_vm rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name >> [ 57] 0 57 34359215953 695 256 0 439 1064390656 0 0 virtual_address >> >> Out of memory: Killed process 57 (virtual_address) total-vm:137436863812kB, anon-rss:1024kB, file-rss:0kB, shmem-rss:1756kB, UID:0 pgtables:1039444kB oom_score_adj:0 >> <snip> >> fault_in_iov_iter_readable+0x4a/0xd0 >> generic_perform_write+0x9c/0x280 >> shmem_file_write_iter+0x86/0x90 >> vfs_write+0x29c/0x480 >> ksys_write+0x6c/0xe0 >> do_syscall_64+0x9e/0x1a0 >> entry_SYSCALL_64_after_hwframe+0x77/0x7f >> >> Write the dumped data into /dev/null instead which does not require >> additional memory during write(), making the code simpler as a >> side-effect. >> >> Signed-off-by: Thomas Weißschuh<thomas.weissschuh@linutronix.de> >> --- >> tools/testing/selftests/mm/virtual_address_range.c | 6 ++---- >> 1 file changed, 2 insertions(+), 4 deletions(-) >> >> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c >> index 484f82c7b7c871f82a7d9ec6d6c649f2ab1eb0cd..4042fd878acd702d23da2c3293292de33bd48143 100644 >> --- a/tools/testing/selftests/mm/virtual_address_range.c >> +++ b/tools/testing/selftests/mm/virtual_address_range.c >> @@ -103,10 +103,9 @@ static int validate_complete_va_space(void) >> FILE *file; >> int fd; >> >> - fd = open("va_dump", O_CREAT | O_WRONLY, 0600); >> - unlink("va_dump"); >> + fd = open("/dev/null", O_WRONLY); >> if (fd < 0) { >> - ksft_test_result_skip("cannot create or open dump file\n"); >> + ksft_test_result_skip("cannot create or open /dev/null\n"); >> ksft_finished(); >> } >> >> @@ -152,7 +151,6 @@ static int validate_complete_va_space(void) >> while (start_addr + hop < end_addr) { >> if (write(fd, (void *)(start_addr + hop), 1) != 1) >> return 1; >> - lseek(fd, 0, SEEK_SET); >> >> hop += MAP_CHUNK_SIZE; >> } >> > > The reason I had not used /dev/null was that write() was succeeding to /dev/null > even from an address not in my VA space. I was puzzled about this behaviour of > /dev/null and I chose to ignore it and just use a real file. > > To test this behaviour, run the following program: > > #include <stdio.h> > #include <stdlib.h> > #include <unistd.h> > #include <fcntl.h> > #include <sys/mman.h> > intmain() > { > intfd; > fd = open("va_dump", O_CREAT| O_WRONLY, 0600); > unlink("va_dump"); > // fd = open("/dev/null", O_WRONLY); > intret = munmap((void*)(1UL<< 30), 100); > if(!ret) > printf("munmap succeeded\n"); > intres = write(fd, (void*)(1UL<< 30), 1); > if(res == 1) > printf("write succeeded\n"); > return0; > } > The write will fail as expected, but if you comment out the va_dump > lines and use /dev/null, the write will succeed. What exactly do we want to achieve with the write? Verify that the output of /proc/self/map is reasonable and we can actually resolve a fault / map a page? Why not access the memory directly+signal handler or using /proc/self/mem, so you can avoid the temp file completely?
On 08.01.25 17:13, Thomas Weißschuh wrote: > On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote: >> On 08.01.25 09:05, Thomas Weißschuh wrote: >>> On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote: >>>> >>>> On 07/01/25 8:44 pm, Thomas Weißschuh wrote: >>>>> If not enough physical memory is available the kernel may fail mmap(); >>>>> see __vm_enough_memory() and vm_commit_limit(). >>>>> In that case the logic in validate_complete_va_space() does not make >>>>> sense and will even incorrectly fail. >>>>> Instead skip the test if no mmap() succeeded. >>>>> >>>>> Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()") >>>>> Cc: stable@vger.kernel.org >> >> CC stable on tests is ... odd. > > I thought it was fairly common, but it isn't. > Will drop it. As it's not really a "kernel BUG", it's rather uncommon. >> >> Note that with MAP_NORESRVE, most setups we care about will allow mapping as >> much as you want, but on access OOM will fire. > > Thanks for the hint. > >> So one could require that /proc/sys/vm/overcommit_memory is setup properly >> and use MAP_NORESRVE. > > Isn't the check for lchunks == 0 essentially exactly this? I assume paired with MAP_NORESERVE? Maybe, but it could be better to have something that says "if overcommit_memory is not setup properly I will SKIP this test", but otherwise I expect this to work and will FAIL if it doesn't". Or would you expect to run into lchunks == 0 even if overcommit_memory is setup properly and MAP_NORESERVE is used? (very very low memory that we cannot even create all the VMAs?) > >> Reading from anonymous memory will populate the shared zeropage. To mitigate >> OOM from "too many page tables", one could simply unmap the pieces as they >> are verified (or MAP_FIXED over them, to free page tables). > > The code has to figure out if a verified region was created by mmap(), > otherwise an munmap() could crash the process. > As the entries from /proc/self/maps may have been merged and (I assume) Yes, and partial unmap (in chunk granularity?) would split them again. > the ordering of mappings is not guaranteed, some bespoke logic to establish > the link will be needed. My thinking was that you simply process one /proc/self/maps entry in some chunks. After processing a chunk, you munmap() it. So you would process + munmap in chunks. > > Is it fine to rely on CONFIG_ANON_VMA_NAME? > That would make it much easier to implement. Can you elaborate how you would do it? > > Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even > in very low physical memory conditions. Cool.
> > That is clear. The issue would be to figure which chunks are valid to > unmap. If something critical like the executable file is unmapped, > the process crashes. But see below. Ah, now I see what you mean. Yes, also the stack etc. will be problematic. So IIUC, you want to limit the munmap optimization only to the manually mmap()ed parts. > >>> Is it fine to rely on CONFIG_ANON_VMA_NAME? >>> That would make it much easier to implement. >> >> Can you elaborate how you would do it? > > First set the VMA name after mmap(): > > for (i = 0; i < NR_CHUNKS_LOW; i++) { > ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE, > MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > if (ptr[i] == MAP_FAILED) { > if (validate_lower_address_hint()) > ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n"); > break; > } > > validate_addr(ptr[i], 0); > if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range")) > ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno)); Likely this would prevent merging of VMAs. With a 1 GiB chunk size, and NR_CHUNKS_LOW == 128TiB, you'd already require 128k VMAs. The default limit is frequently 64k. We could just scan the ptr / hptr array to see if this is a manual mmap area or not. If this takes too long, one could sort the arrays by address and perform a binary search. Not the most efficient way of doing it, but maybe good enough for this test? Alternatively, store the pointer in a xarray-like tree instead of two arrays. Requires a bit more memory ... and we'd have to find a simple implementation we could just reuse in this test. So maybe there is a simpler way to get it done.
On Thu, Jan 09, 2025 at 02:05:43PM +0100, David Hildenbrand wrote: > > > > That is clear. The issue would be to figure which chunks are valid to > > unmap. If something critical like the executable file is unmapped, > > the process crashes. But see below. > > Ah, now I see what you mean. Yes, also the stack etc. will be problematic. > So IIUC, you want to limit the munmap optimization only to the manually > mmap()ed parts. Correct. > > > > Is it fine to rely on CONFIG_ANON_VMA_NAME? > > > > That would make it much easier to implement. > > > > > > Can you elaborate how you would do it? > > > > First set the VMA name after mmap(): > > > > for (i = 0; i < NR_CHUNKS_LOW; i++) { > > ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE, > > MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); > > > > if (ptr[i] == MAP_FAILED) { > > if (validate_lower_address_hint()) > > ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n"); > > break; > > } > > > > validate_addr(ptr[i], 0); > > if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range")) > > ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno)); > > Likely this would prevent merging of VMAs. > > With a 1 GiB chunk size, and NR_CHUNKS_LOW == 128TiB, you'd already require > 128k VMAs. The default limit is frequently 64k. They are merged for me, as they all share the same name. PR_SET_VMA(2const) even mentions merging: Note that assigning an attribute to a virtual memory area might prevent it from being merged with adjacent virtual memory areas due to the difference in that attribute's value. is_mergeable_vma() has an explicit check using anon_vma_name_eq(). > We could just scan the ptr / hptr array to see if this is a manual mmap area > or not. If this takes too long, one could sort the arrays by address and > perform a binary search. > > Not the most efficient way of doing it, but maybe good enough for this test? A naive loop is what I tried first, but it took forever. > Alternatively, store the pointer in a xarray-like tree instead of two > arrays. Requires a bit more memory ... and we'd have to find a simple > implementation we could just reuse in this test. So maybe there is a simpler > way to get it done. IMO the prctl() is that simpler way. The only real drawback is the dependency on CONFIG_ANON_VMA_NAME. We can add an entry to tools/testing/selftests/mm/config for it. Thomas
The selftest started failing since commit e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping") was merged. While debugging I stumbled upon another bug and potential cleanup. Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> --- Thomas Weißschuh (3): selftests/mm: virtual_address_range: Fix error when CommitLimit < 1GiB selftests/mm: virtual_address_range: Avoid reading VVAR mappings selftests/mm: virtual_address_range: Dump to /dev/null tools/testing/selftests/mm/virtual_address_range.c | 21 +++++++++++++++------ 1 file changed, 15 insertions(+), 6 deletions(-) --- base-commit: fbfd64d25c7af3b8695201ebc85efe90be28c5a3 change-id: 20250107-virtual_address_range-tests-95843766fa97 Best regards,