mbox series

[0/3] selftests/mm: virtual_address_range: Two bugfixes and a cleanup

Message ID 20250107-virtual_address_range-tests-v1-0-3834a2fb47fe@linutronix.de
Headers show
Series selftests/mm: virtual_address_range: Two bugfixes and a cleanup | expand

Message

Thomas Weißschuh Jan. 7, 2025, 3:14 p.m. UTC
The selftest started failing since commit e93d2521b27f
("x86/vdso: Split virtual clock pages into dedicated mapping")
was merged. While debugging I stumbled upon another bug and potential
cleanup.

Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
---
Thomas Weißschuh (3):
      selftests/mm: virtual_address_range: Fix error when CommitLimit < 1GiB
      selftests/mm: virtual_address_range: Avoid reading VVAR mappings
      selftests/mm: virtual_address_range: Dump to /dev/null

 tools/testing/selftests/mm/virtual_address_range.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)
---
base-commit: fbfd64d25c7af3b8695201ebc85efe90be28c5a3
change-id: 20250107-virtual_address_range-tests-95843766fa97

Best regards,

Comments

David Hildenbrand Jan. 8, 2025, 1:30 p.m. UTC | #1
On 08.01.25 07:09, Dev Jain wrote:
> 
> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>> During the execution of validate_complete_va_space() a lot of memory is
>> on the VM subsystem. When running on a low memory subsystem an OOM may
>> be triggered, when writing to the dump file as the filesystem may also
>> require memory.
>>
>> On my test system with 1100MiB physical memory:
>>
>> 	Tasks state (memory values in pages):
>> 	[  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
>> 	[     57]     0    57 34359215953      695      256        0       439 1064390656        0             0 virtual_address
>>
>> 	Out of memory: Killed process 57 (virtual_address) total-vm:137436863812kB, anon-rss:1024kB, file-rss:0kB, shmem-rss:1756kB, UID:0 pgtables:1039444kB oom_score_adj:0
>> 	<snip>
>> 	fault_in_iov_iter_readable+0x4a/0xd0
>> 	generic_perform_write+0x9c/0x280
>> 	shmem_file_write_iter+0x86/0x90
>> 	vfs_write+0x29c/0x480
>> 	ksys_write+0x6c/0xe0
>> 	do_syscall_64+0x9e/0x1a0
>> 	entry_SYSCALL_64_after_hwframe+0x77/0x7f
>>
>> Write the dumped data into /dev/null instead which does not require
>> additional memory during write(), making the code simpler as a
>> side-effect.
>>
>> Signed-off-by: Thomas Weißschuh<thomas.weissschuh@linutronix.de>
>> ---
>>   tools/testing/selftests/mm/virtual_address_range.c | 6 ++----
>>   1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/tools/testing/selftests/mm/virtual_address_range.c b/tools/testing/selftests/mm/virtual_address_range.c
>> index 484f82c7b7c871f82a7d9ec6d6c649f2ab1eb0cd..4042fd878acd702d23da2c3293292de33bd48143 100644
>> --- a/tools/testing/selftests/mm/virtual_address_range.c
>> +++ b/tools/testing/selftests/mm/virtual_address_range.c
>> @@ -103,10 +103,9 @@ static int validate_complete_va_space(void)
>>   	FILE *file;
>>   	int fd;
>>   
>> -	fd = open("va_dump", O_CREAT | O_WRONLY, 0600);
>> -	unlink("va_dump");
>> +	fd = open("/dev/null", O_WRONLY);
>>   	if (fd < 0) {
>> -		ksft_test_result_skip("cannot create or open dump file\n");
>> +		ksft_test_result_skip("cannot create or open /dev/null\n");
>>   		ksft_finished();
>>   	}
 >>   >> @@ -152,7 +151,6 @@ static int validate_complete_va_space(void)
>>   		while (start_addr + hop < end_addr) {
>>   			if (write(fd, (void *)(start_addr + hop), 1) != 1)
>>   				return 1;
>> -			lseek(fd, 0, SEEK_SET);
>>   
>>   			hop += MAP_CHUNK_SIZE;
>>   		}
>>
> 
> The reason I had not used /dev/null was that write() was succeeding to /dev/null
> even from an address not in my VA space. I was puzzled about this behaviour of
> /dev/null and I chose to ignore it and just use a real file.
> 
> To test this behaviour, run the following program:
> 
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/mman.h>
> intmain()
> {
> intfd;
> fd = open("va_dump", O_CREAT| O_WRONLY, 0600);
> unlink("va_dump");
> // fd = open("/dev/null", O_WRONLY);
> intret = munmap((void*)(1UL<< 30), 100);
> if(!ret)
> printf("munmap succeeded\n");
> intres = write(fd, (void*)(1UL<< 30), 1);
> if(res == 1)
> printf("write succeeded\n");
> return0;
> }
> The write will fail as expected, but if you comment out the va_dump
> lines and use /dev/null, the write will succeed.

What exactly do we want to achieve with the write? Verify that the 
output of /proc/self/map is reasonable and we can actually resolve a 
fault / map a page?

Why not access the memory directly+signal handler or using 
/proc/self/mem, so you can avoid the temp file completely?
David Hildenbrand Jan. 8, 2025, 4:46 p.m. UTC | #2
On 08.01.25 17:13, Thomas Weißschuh wrote:
> On Wed, Jan 08, 2025 at 02:36:57PM +0100, David Hildenbrand wrote:
>> On 08.01.25 09:05, Thomas Weißschuh wrote:
>>> On Wed, Jan 08, 2025 at 11:46:19AM +0530, Dev Jain wrote:
>>>>
>>>> On 07/01/25 8:44 pm, Thomas Weißschuh wrote:
>>>>> If not enough physical memory is available the kernel may fail mmap();
>>>>> see __vm_enough_memory() and vm_commit_limit().
>>>>> In that case the logic in validate_complete_va_space() does not make
>>>>> sense and will even incorrectly fail.
>>>>> Instead skip the test if no mmap() succeeded.
>>>>>
>>>>> Fixes: 010409649885 ("selftests/mm: confirm VA exhaustion without reliance on correctness of mmap()")
>>>>> Cc: stable@vger.kernel.org
>>
>> CC stable on tests is ... odd.
> 
> I thought it was fairly common, but it isn't.
> Will drop it.

As it's not really a "kernel BUG", it's rather uncommon.

>>
>> Note that with MAP_NORESRVE, most setups we care about will allow mapping as
>> much as you want, but on access OOM will fire.
> 
> Thanks for the hint.
> 
>> So one could require that /proc/sys/vm/overcommit_memory is setup properly
>> and use MAP_NORESRVE.
> 
> Isn't the check for lchunks == 0 essentially exactly this?

I assume paired with MAP_NORESERVE?

Maybe, but it could be better to have something that says "if 
overcommit_memory is not setup properly I will SKIP this test", but 
otherwise I expect this to work and will FAIL if it doesn't".

Or would you expect to run into lchunks == 0 even if overcommit_memory 
is setup properly and MAP_NORESERVE is used? (very very low memory that 
we cannot even create all the VMAs?)

> 
>> Reading from anonymous memory will populate the shared zeropage. To mitigate
>> OOM from "too many page tables", one could simply unmap the pieces as they
>> are verified (or MAP_FIXED over them, to free page tables).
> 
> The code has to figure out if a verified region was created by mmap(),
> otherwise an munmap() could crash the process.
> As the entries from /proc/self/maps may have been merged and (I assume)

Yes, and partial unmap (in chunk granularity?) would split them again.

> the ordering of mappings is not guaranteed, some bespoke logic to establish
> the link will be needed.


My thinking was that you simply process one /proc/self/maps entry in 
some chunks. After processing a chunk, you munmap() it.

So you would process + munmap in chunks.

> 
> Is it fine to rely on CONFIG_ANON_VMA_NAME?
> That would make it much easier to implement.

Can you elaborate how you would do it?

> 
> Using MAP_NORESERVE and eager munmap()s, the testcase works nicely even
> in very low physical memory conditions.

Cool.
David Hildenbrand Jan. 9, 2025, 1:05 p.m. UTC | #3
>
> That is clear. The issue would be to figure which chunks are valid to
> unmap. If something critical like the executable file is unmapped,
> the process crashes. But see below.

Ah, now I see what you mean. Yes, also the stack etc. will be 
problematic. So IIUC, you want to limit the munmap optimization only to 
the manually mmap()ed parts.

> 
>>> Is it fine to rely on CONFIG_ANON_VMA_NAME?
>>> That would make it much easier to implement.
>>
>> Can you elaborate how you would do it?
> 
> First set the VMA name after mmap():
> 
> for (i = 0; i < NR_CHUNKS_LOW; i++) {
> 	ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
> 		     MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> 
> 	if (ptr[i] == MAP_FAILED) {
> 		if (validate_lower_address_hint())
> 			ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n");
> 		break;
> 	}
> 
> 	validate_addr(ptr[i], 0);
> 	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range"))
> 		ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno));

Likely this would prevent merging of VMAs.

With a 1 GiB chunk size, and NR_CHUNKS_LOW == 128TiB, you'd already 
require 128k VMAs. The default limit is frequently 64k.

We could just scan the ptr / hptr array to see if this is a manual mmap 
area or not. If this takes too long, one could sort the arrays by 
address and perform a binary search.

Not the most efficient way of doing it, but maybe good enough for this test?

Alternatively, store the pointer in a xarray-like tree instead of two 
arrays. Requires a bit more memory ... and we'd have to find a simple 
implementation we could just reuse in this test. So maybe there is a 
simpler way to get it done.
Thomas Weißschuh Jan. 9, 2025, 1:38 p.m. UTC | #4
On Thu, Jan 09, 2025 at 02:05:43PM +0100, David Hildenbrand wrote:
> >
> > That is clear. The issue would be to figure which chunks are valid to
> > unmap. If something critical like the executable file is unmapped,
> > the process crashes. But see below.
> 
> Ah, now I see what you mean. Yes, also the stack etc. will be problematic.
> So IIUC, you want to limit the munmap optimization only to the manually
> mmap()ed parts.

Correct.

> > > > Is it fine to rely on CONFIG_ANON_VMA_NAME?
> > > > That would make it much easier to implement.
> > > 
> > > Can you elaborate how you would do it?
> > 
> > First set the VMA name after mmap():
> > 
> > for (i = 0; i < NR_CHUNKS_LOW; i++) {
> > 	ptr[i] = mmap(NULL, MAP_CHUNK_SIZE, PROT_READ | PROT_WRITE,
> > 		     MAP_NORESERVE | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> > 
> > 	if (ptr[i] == MAP_FAILED) {
> > 		if (validate_lower_address_hint())
> > 			ksft_exit_fail_msg("mmap unexpectedly succeeded with hint\n");
> > 		break;
> > 	}
> > 
> > 	validate_addr(ptr[i], 0);
> > 	if (prctl(PR_SET_VMA, PR_SET_VMA_ANON_NAME, ptr[i], MAP_CHUNK_SIZE, "virtual_address_range"))
> > 		ksft_exit_fail_msg("prctl(PR_SET_VMA_ANON_NAME) failed: %s\n", strerror(errno));
> 
> Likely this would prevent merging of VMAs.
>
> With a 1 GiB chunk size, and NR_CHUNKS_LOW == 128TiB, you'd already require
> 128k VMAs. The default limit is frequently 64k.

They are merged for me, as they all share the same name.

PR_SET_VMA(2const) even mentions merging:

	Note that assigning an attribute to a virtual memory area might
	prevent it from being merged with adjacent virtual memory areas
	due to the difference in that attribute's value.

is_mergeable_vma() has an explicit check using anon_vma_name_eq().

> We could just scan the ptr / hptr array to see if this is a manual mmap area
> or not. If this takes too long, one could sort the arrays by address and
> perform a binary search.
>
> Not the most efficient way of doing it, but maybe good enough for this test?

A naive loop is what I tried first, but it took forever.

> Alternatively, store the pointer in a xarray-like tree instead of two
> arrays. Requires a bit more memory ... and we'd have to find a simple
> implementation we could just reuse in this test. So maybe there is a simpler
> way to get it done.

IMO the prctl() is that simpler way.
The only real drawback is the dependency on CONFIG_ANON_VMA_NAME.
We can add an entry to tools/testing/selftests/mm/config for it.


Thomas