mbox series

[0/4] implement lightweight guard pages

Message ID cover.1729196871.git.lorenzo.stoakes@oracle.com
Headers show
Series implement lightweight guard pages | expand

Message

Lorenzo Stoakes Oct. 17, 2024, 8:42 p.m. UTC
Userland library functions such as allocators and threading implementations
often require regions of memory to act as 'guard pages' - mappings which,
when accessed, result in a fatal signal being sent to the accessing
process.

The current means by which these are implemented is via a PROT_NONE mmap()
mapping, which provides the required semantics however incur an overhead of
a VMA for each such region.

With a great many processes and threads, this can rapidly add up and incur
a significant memory penalty. It also has the added problem of preventing
merges that might otherwise be permitted.

This series takes a different approach - an idea suggested by Vlasimil
Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
provenance becomes a little tricky to ascertain after this - please forgive
any omissions!)  - rather than locating the guard pages at the VMA layer,
instead placing them in page tables mapping the required ranges.

Early testing of the prototype version of this code suggests a 5 times
speed up in memory mapping invocations (in conjunction with use of
process_madvise()) and a 13% reduction in VMAs on an entirely idle android
system and unoptimised code.

We expect with optimisation and a loaded system with a larger number of
guard pages this could significantly increase, but in any case these
numbers are encouraging.

This way, rather than having separate VMAs specifying which parts of a
range are guard pages, instead we have a VMA spanning the entire range of
memory a user is permitted to access and including ranges which are to be
'guarded'.

After mapping this, a user can specify which parts of the range should
result in a fatal signal when accessed.

By restricting the ability to specify guard pages to memory mapped by
existing VMAs, we can rely on the mappings being torn down when the
mappings are ultimately unmapped and everything works simply as if the
memory were not faulted in, from the point of view of the containing VMAs.

This mechanism in effect poisons memory ranges similar to hardware memory
poisoning, only it is an entirely software-controlled form of poisoning.

Any poisoned region of memory is also able to 'unpoisoned', that is, to
have its poison markers removed.

The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
poisoning.

Poisoning can be performed across multiple VMAs and any existing mappings
will be cleared, that is zapped, before installing the poisoned page table
mappings.

There is no concept of 'nested' poisoning, multiple attempts to poison a
range will, after the first poisoning, have no effect.

Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
memory, so a user can safely unpoison a range of memory and clear only
poison page table mappings leaving the rest intact.

The actual mechanism by which the page table entries are specified makes
use of existing logic - PTE markers, which are used for the userfaultfd
UFFDIO_POISON mechanism.

Unfortunately PTE_MARKER_POISONED is not suited for the guard page
mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
logic to handle it.

We also extend the generic page walk mechanism to allow for installation of
PTEs (carefully restricted to memory management logic only to prevent
unwanted abuse).

We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
specified for a VMA which implies a total removal of memory
characteristics).

It's important to note that the guard page implementation is emphatically
NOT a security feature, so a user can remove the poisoning if they wish. We
simply implement it in such a way as to provide the least surprising
behaviour.

An extensive set of self-tests are provided which ensure behaviour is as
expected and additionally self-documents expected behaviour of poisoned
ranges.

Suggested-by: Vlastimil Babka <vbabka@suze.cz>
Suggested-by: Jann Horn <jannh@google.com>
Suggested-by: David Hildenbrand <david@redhat.com>

v1
* Un-RFC'd as appears no major objections to approach but rather debate on
  implementation.
* Fixed issue with arches which need mmu_context.h and
  tlbfush.h. header imports in pagewalker logic to be able to use
  update_mmu_cache() as reported by the kernel test bot.
* Added comments in page walker logic to clarify who can use
  ops->install_pte and why as well as adding a check_ops_valid() helper
  function, as suggested by Christoph.
* Pass false in full parameter in pte_clear_not_present_full() as suggested
  by Jann.
* Stopped erroneously requiring a write lock for the poison operation as
  suggested by Jann and Suren.
* Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
  consistent with how this is used elsewhere in the kernel as suggested by
  Jann.
* Avoid returning -EAGAIN if we are raced on page faults, just keep looping
  and duck out if a fatal signal is pending or a conditional reschedule is
  needed, as suggested by Jann.
* Avoid needlessly splitting huge PUDs and PMDs by specifying
  ACTION_CONTINUE, as suggested by Jann.

RFC
https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/

Lorenzo Stoakes (4):
  mm: pagewalk: add the ability to install PTEs
  mm: add PTE_MARKER_GUARD PTE marker
  mm: madvise: implement lightweight guard page mechanism
  selftests/mm: add self tests for guard page feature

 arch/alpha/include/uapi/asm/mman.h       |    3 +
 arch/mips/include/uapi/asm/mman.h        |    3 +
 arch/parisc/include/uapi/asm/mman.h      |    3 +
 arch/xtensa/include/uapi/asm/mman.h      |    3 +
 include/linux/mm_inline.h                |    2 +-
 include/linux/pagewalk.h                 |   18 +-
 include/linux/swapops.h                  |   26 +-
 include/uapi/asm-generic/mman-common.h   |    3 +
 mm/hugetlb.c                             |    3 +
 mm/internal.h                            |    6 +
 mm/madvise.c                             |  168 ++++
 mm/memory.c                              |   18 +-
 mm/mprotect.c                            |    3 +-
 mm/mseal.c                               |    1 +
 mm/pagewalk.c                            |  200 ++--
 tools/testing/selftests/mm/.gitignore    |    1 +
 tools/testing/selftests/mm/Makefile      |    1 +
 tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
 18 files changed, 1564 insertions(+), 66 deletions(-)
 create mode 100644 tools/testing/selftests/mm/guard-pages.c

--
2.46.2

Comments

Shuah Khan Oct. 17, 2024, 9:24 p.m. UTC | #1
On 10/17/24 14:42, Lorenzo Stoakes wrote:
> Utilise the kselftest harmness to implement tests for the guard page

Splleing NIT - harmness -> harness

> implementation.
> 
> We start by implement basic tests asserting that guard pages can be

implmenting? By the way checkpatch will catch spelling stuuf.
Please see comments about warnings below.

> established (poisoned), cleared (remedied) and that touching poisoned pages
> result in SIGSEGV. We also assert that, in remedying a range, non-poison
> pages remain intact.
> 
> We then examine different operations on regions containing poison markers
> behave to ensure correct behaviour:
> 
> * Operations over multiple VMAs operate as expected.
> * Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in
>    batches works correctly.
> * Ensuring that munmap() correctly tears down poison markers.
> * Using mprotect() to adjust protection bits does not in any way override
>    or cause issues with poison markers.
> * Ensuring that splitting and merging VMAs around poison markers causes no
>    issue - i.e. that a marker which 'belongs' to one VMA can function just
>    as well 'belonging' to another.
> * Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
> * Ensuring that mlock()'ing a range containing poison markers does not
>    cause issues.
> * Ensuring that mremap() can move a poisoned range and retain poison
>    markers.
> * Ensuring that mremap() can expand a poisoned range and retain poison
>    markers (perhaps moving the range).
> * Ensuring that mremap() can shrink a poisoned range and retain poison
>    markers.
> * Ensuring that forking a process correctly retains poison markers.
> * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
> * Ensuring that lazyfree simply clears poison markers.
> * Ensuring that userfaultfd can co-exist with guard pages.
> * Ensuring that madvise(..., MADV_POPULATE_READ) and
>    madvise(..., MADV_POPULATE_WRITE) error out when encountering
>    poison markers.
> * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do
>    not remove poison markers.

Good summary of test. Does the test require root access?
If so does it check and skip appropriately?

> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>   tools/testing/selftests/mm/.gitignore    |    1 +
>   tools/testing/selftests/mm/Makefile      |    1 +
>   tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
>   3 files changed, 1170 insertions(+)
>   create mode 100644 tools/testing/selftests/mm/guard-pages.c
> 
> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> index 689bbd520296..8f01f4da1c0d 100644
> --- a/tools/testing/selftests/mm/.gitignore
> +++ b/tools/testing/selftests/mm/.gitignore
> @@ -54,3 +54,4 @@ droppable
>   hugetlb_dio
>   pkey_sighandler_tests_32
>   pkey_sighandler_tests_64
> +guard-pages
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index 02e1204971b0..15c734d6cfec 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
>   TEST_GEN_FILES += hugetlb_madv_vs_map
>   TEST_GEN_FILES += hugetlb_dio
>   TEST_GEN_FILES += droppable
> +TEST_GEN_FILES += guard-pages
>   
>   ifneq ($(ARCH),arm64)
>   TEST_GEN_FILES += soft-dirty
> diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
> new file mode 100644
> index 000000000000..2ab0ff3ba5a0
> --- /dev/null
> +++ b/tools/testing/selftests/mm/guard-pages.c
> @@ -0,0 +1,1168 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +
> +#define _GNU_SOURCE
> +#include "../kselftest_harness.h"
> +#include <assert.h>
> +#include <fcntl.h>
> +#include <setjmp.h>
> +#include <errno.h>
> +#include <linux/userfaultfd.h>
> +#include <signal.h>
> +#include <stdbool.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <sys/ioctl.h>
> +#include <sys/mman.h>
> +#include <sys/syscall.h>
> +#include <sys/uio.h>
> +#include <unistd.h>
> +
> +/* These may not yet be available in the uAPI so define if not. */
> +
> +#ifndef MADV_GUARD_POISON
> +#define MADV_GUARD_POISON	102
> +#endif
> +
> +#ifndef MADV_GUARD_UNPOISON
> +#define MADV_GUARD_UNPOISON	103
> +#endif
> +
> +volatile bool signal_jump_set;

Can you add a comment about why volatile is needed.
By the way did you happen to run checkpatck on this. There are
several instances where single statement blocks with braces {}

I noticed a few and ran checkpatch on your patch. There are
45 warnings regarding codeing style.

Please run checkpatch and clean them up so we can avoid followup
checkpatch cleanup patches.

> +sigjmp_buf signal_jmp_buf;
> +
> +static int userfaultfd(int flags)
> +{
> +	return syscall(SYS_userfaultfd, flags);
> +}
> +
> +static void handle_fatal(int c)
> +{
> +	if (!signal_jump_set)
> +		return;
> +
> +	siglongjmp(signal_jmp_buf, c);
> +}
> +
> +static int pidfd_open(pid_t pid, unsigned int flags)
> +{
> +	return syscall(SYS_pidfd_open, pid, flags);
> +}
> +
> +/*
> + * Enable our signal catcher and try to read/write the specified buffer. The
> + * return value indicates whether the read/write succeeds without a fatal
> + * signal.
> + */
> +static bool try_access_buf(char *ptr, bool write)
> +{
> +	bool failed;
> +
> +	/* Tell signal handler to jump back here on fatal signal. */
> +	signal_jump_set = true;
> +	/* If a fatal signal arose, we will jump back here and failed is set. */
> +	failed = sigsetjmp(signal_jmp_buf, 0) != 0;
> +
> +	if (!failed) {
> +		if (write) {
> +			*ptr = 'x';
> +		} else {
> +			const volatile char *chr = ptr;
> +
> +			/* Force read. */
> +			(void)*chr;
> +		}
> +	}
> +
> +	signal_jump_set = false;
> +	return !failed;
> +}
> +
> +/* Try and read from a buffer, return true if no fatal signal. */
> +static bool try_read_buf(char *ptr)
> +{
> +	return try_access_buf(ptr, false);
> +}
> +
> +/* Try and write to a buffer, return true if no fatal signal. */
> +static bool try_write_buf(char *ptr)
> +{
> +	return try_access_buf(ptr, true);
> +}
> +
> +/*
> + * Try and BOTH read from AND write to a buffer, return true if BOTH operations
> + * succeed.
> + */
> +static bool try_read_write_buf(char *ptr)
> +{
> +	return try_read_buf(ptr) && try_write_buf(ptr);
> +}
> +
> +FIXTURE(guard_pages)
> +{
> +	unsigned long page_size;
> +};
> +
> +FIXTURE_SETUP(guard_pages)
> +{
> +	struct sigaction act = {
> +		.sa_handler = &handle_fatal,
> +		.sa_flags = SA_NODEFER,
> +	};
> +
> +	sigemptyset(&act.sa_mask);
> +	if (sigaction(SIGSEGV, &act, NULL)) {
> +		perror("sigaction");
> +		ksft_exit_fail();

Replase the above two with ksft_exit_fail_perror()
There is no need for perror("sigaction"); followed by
ksft_exit_fail();

> +	}
> +
> +	self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
> +};
> +
> +FIXTURE_TEARDOWN(guard_pages)
> +{
> +	struct sigaction act = {
> +		.sa_handler = SIG_DFL,
> +		.sa_flags = SA_NODEFER,
> +	};
> +
> +	sigemptyset(&act.sa_mask);
> +	sigaction(SIGSEGV, &act, NULL);
> +}
> +
> +TEST_F(guard_pages, basic)
> +{
> +	const unsigned long NUM_PAGES = 10;
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_PRIVATE | MAP_ANON, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Trivially assert we can touch the first page. */
> +	ASSERT_TRUE(try_read_write_buf(ptr));
> +
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Establish that 1st page SIGSEGV's. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +
> +	/* Ensure we can touch everything else.*/
> +	for (i = 1; i < NUM_PAGES; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Establish a guard page at the end of the mapping. */
> +	ASSERT_EQ(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
> +			  MADV_GUARD_POISON), 0);
> +
> +	/* Check that both guard pages result in SIGSEGV. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
> +
> +	/* Unpoison the first. */
> +	ASSERT_FALSE(madvise(ptr, page_size, MADV_GUARD_UNPOISON));
> +
> +	/* Make sure we can touch it. */
> +	ASSERT_TRUE(try_read_write_buf(ptr));
> +
> +	/* Unpoison the last. */
> +	ASSERT_FALSE(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
> +			     MADV_GUARD_UNPOISON));
> +
> +	/* Make sure we can touch it. */
> +	ASSERT_TRUE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
> +
> +	/*
> +	 *  Test setting a _range_ of pages, namely the first 3. The first of
> +	 *  these be faulted in, so this also tests that we can poison backed
> +	 *  pages.
> +	 */
> +	ASSERT_EQ(madvise(ptr, 3 * page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Make sure they are all poisoned. */
> +	for (i = 0; i < 3; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +

This one here and
> +	/* Make sure the rest are not. */
> +	for (i = 3; i < NUM_PAGES; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Unpoison them. */
> +	ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0);
> +
> +	/* Now make sure we can touch everything. */
> +	for (i = 0; i < NUM_PAGES; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Now unpoison everything, make sure we don't remove existing entries */
> +	ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0);
> +
> +	for (i = 0; i < NUM_PAGES * page_size; i += page_size) {
> +		ASSERT_EQ(ptr[i], 'x');
> +	}
> +
> +	ASSERT_EQ(munmap(ptr, NUM_PAGES * page_size), 0);
> +}
> +
> +/* Assert that operations applied across multiple VMAs work as expected. */
> +TEST_F(guard_pages, multi_vma)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3;
> +	int i;
> +
> +	/* Reserve a 100 page region over which we can install VMAs. */
> +	ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
> +			  MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_region, MAP_FAILED);
> +
> +	/* Place a VMA of 10 pages size at the start of the region. */
> +	ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr1, MAP_FAILED);
> +
> +	/* Place a VMA of 5 pages size 50 pages into the region. */
> +	ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr2, MAP_FAILED);
> +
> +	/* Place a VMA of 20 pages size at the end of the region. */
> +	ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr3, MAP_FAILED);
> +
> +	/* Unmap gaps. */
> +	ASSERT_EQ(munmap(&ptr_region[10 * page_size], 40 * page_size), 0);
> +	ASSERT_EQ(munmap(&ptr_region[55 * page_size], 25 * page_size), 0);
> +
> +	/*
> +	 * We end up with VMAs like this:
> +	 *
> +	 * 0    10 .. 50   55 .. 80   100
> +	 * [---]      [---]      [---]
> +	 */
> +
> +	/* Now poison the whole range and make sure all VMAs are poisoned. */
> +
> +	/*
> +	 * madvise() is certifiable and lets you perform operations over gaps,
> +	 * everything works, but it indicates an error and errno is set to
> +	 * -ENOMEM. Also if anything runs out of memory it is set to
> +	 * -ENOMEM. You are meant to guess which is which.
> +	 */
> +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), -1);
> +	ASSERT_EQ(errno, ENOMEM);
> +
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
> +	}
> +
> +	for (i = 0; i < 5; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr2[i * page_size]));
> +	}
> +
> +	for (i = 0; i < 20; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr3[i * page_size]));
> +	}
> +
> +	/* Now unpoison the range and assert the opposite. */
> +
> +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), -1);
> +	ASSERT_EQ(errno, ENOMEM);
> +
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr1[i * page_size]));
> +	}
> +
> +	for (i = 0; i < 5; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr2[i * page_size]));
> +	}
> +
> +	for (i = 0; i < 20; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr3[i * page_size]));
> +	}
> +
> +	/* Now map incompatible VMAs in the gaps. */
> +	ptr = mmap(&ptr_region[10 * page_size], 40 * page_size,
> +		   PROT_READ | PROT_WRITE | PROT_EXEC,
> +		   MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +	ptr = mmap(&ptr_region[55 * page_size], 25 * page_size,
> +		   PROT_READ | PROT_WRITE | PROT_EXEC,
> +		   MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/*
> +	 * We end up with VMAs like this:
> +	 *
> +	 * 0    10 .. 50   55 .. 80   100
> +	 * [---][xxxx][---][xxxx][---]
> +	 *
> +	 * Where 'x' signifies VMAs that cannot be merged with those adjacent to
> +	 * them.
> +	 */
> +
> +	/* Multiple VMAs adjacent to one another should result in no error. */
> +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), 0);
> +	for (i = 0; i < 100; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr_region[i * page_size]));
> +	}
> +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), 0);
> +	for (i = 0; i < 100; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr_region[i * page_size]));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr_region, 100 * page_size), 0);
> +}
> +
> +/*
> + * Assert that batched operations performed using process_madvise() work as
> + * expected.
> + */
> +TEST_F(guard_pages, process_madvise)
> +{
> +	const unsigned long page_size = self->page_size;
> +	pid_t pid = getpid();
> +	int pidfd = pidfd_open(pid, 0);
> +	char *ptr_region, *ptr1, *ptr2, *ptr3;
> +	ssize_t count;
> +	struct iovec vec[6];
> +
> +	ASSERT_NE(pidfd, -1);
> +
> +	/* Reserve region to map over. */
> +	ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
> +			  MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_region, MAP_FAILED);
> +
> +	/* 10 pages offset 1 page into reserve region. */
> +	ptr1 = mmap(&ptr_region[page_size], 10 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr1, MAP_FAILED);
> +	/* We want poison markers at start/end of each VMA. */
> +	vec[0].iov_base = ptr1;
> +	vec[0].iov_len = page_size;
> +	vec[1].iov_base = &ptr1[9 * page_size];
> +	vec[1].iov_len = page_size;
> +
> +	/* 5 pages offset 50 pages into reserve region. */
> +	ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr2, MAP_FAILED);
> +	vec[2].iov_base = ptr2;
> +	vec[2].iov_len = page_size;
> +	vec[3].iov_base = &ptr2[4 * page_size];
> +	vec[3].iov_len = page_size;
> +
> +	/* 20 pages offset 79 pages into reserve region. */
> +	ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size,
> +		    PROT_READ | PROT_WRITE,
> +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr3, MAP_FAILED);
> +	vec[4].iov_base = ptr3;
> +	vec[4].iov_len = page_size;
> +	vec[5].iov_base = &ptr3[19 * page_size];
> +	vec[5].iov_len = page_size;
> +
> +	/* Free surrounding VMAs. */
> +	ASSERT_EQ(munmap(ptr_region, page_size), 0);
> +	ASSERT_EQ(munmap(&ptr_region[11 * page_size], 39 * page_size), 0);
> +	ASSERT_EQ(munmap(&ptr_region[55 * page_size], 24 * page_size), 0);
> +	ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0);
> +
> +	/* Now poison in one step. */
> +	count = process_madvise(pidfd, vec, 6, MADV_GUARD_POISON, 0);
> +
> +	/* OK we don't have permission to do this, skip. */
> +	if (count == -1 && errno == EPERM)
> +		ksft_exit_skip("No process_madvise() permissions\n");

Can you make this a bit more informative? What would user do
if they see this message? Are they supposed to run the test
as root?

> +
> +	/* Returns the number of bytes advised. */
> +	ASSERT_EQ(count, 6 * page_size);
> +
> +	/* Now make sure the poisoning was applied. */
> +
> +	ASSERT_FALSE(try_read_write_buf(ptr1));
> +	ASSERT_FALSE(try_read_write_buf(&ptr1[9 * page_size]));
> +
> +	ASSERT_FALSE(try_read_write_buf(ptr2));
> +	ASSERT_FALSE(try_read_write_buf(&ptr2[4 * page_size]));
> +
> +	ASSERT_FALSE(try_read_write_buf(ptr3));
> +	ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size]));
> +
> +	/* Now do the same with unpoison... */
> +	count = process_madvise(pidfd, vec, 6, MADV_GUARD_UNPOISON, 0);
> +
> +	/* ...and everything should now succeed. */
> +
> +	ASSERT_TRUE(try_read_write_buf(ptr1));
> +	ASSERT_TRUE(try_read_write_buf(&ptr1[9 * page_size]));
> +
> +	ASSERT_TRUE(try_read_write_buf(ptr2));
> +	ASSERT_TRUE(try_read_write_buf(&ptr2[4 * page_size]));
> +
> +	ASSERT_TRUE(try_read_write_buf(ptr3));
> +	ASSERT_TRUE(try_read_write_buf(&ptr3[19 * page_size]));
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr1, 10 * page_size), 0);
> +	ASSERT_EQ(munmap(ptr2, 5 * page_size), 0);
> +	ASSERT_EQ(munmap(ptr3, 20 * page_size), 0);
> +	close(pidfd);
> +}
> +
> +/* Assert that unmapping ranges does not leave poison behind. */
> +TEST_F(guard_pages, munmap)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr, *ptr_new1, *ptr_new2;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Poison first and last pages. */
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> +	ASSERT_EQ(madvise(&ptr[9 * page_size], page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Assert that they are poisoned. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[9 * page_size]));
> +
> +	/* Unmap them. */
> +	ASSERT_EQ(munmap(ptr, page_size), 0);
> +	ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0);
> +
> +	/* Map over them.*/
> +	ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE,
> +			MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new1, MAP_FAILED);
> +	ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE,
> +			MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new2, MAP_FAILED);
> +
> +	/* Assert that they are now not poisoned. */
> +	ASSERT_TRUE(try_read_write_buf(ptr_new1));
> +	ASSERT_TRUE(try_read_write_buf(ptr_new2));
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Assert that mprotect() operations have no bearing on guard poison markers. */
> +TEST_F(guard_pages, mprotect)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Poison the middle of the range. */
> +	ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
> +			  MADV_GUARD_POISON), 0);
> +
> +	/* Assert that it is indeed poisoned. */
> +	ASSERT_FALSE(try_read_write_buf(&ptr[5 * page_size]));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[6 * page_size]));
> +
> +	/* Now make these pages read-only. */
> +	ASSERT_EQ(mprotect(&ptr[5 * page_size], 2 * page_size, PROT_READ), 0);
> +
> +	/* Make sure the range is still poisoned. */
> +	ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
> +	ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
> +
> +	/* Make sure we can poison again without issue.*/
> +	ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
> +			  MADV_GUARD_POISON), 0);
> +
> +	/* Make sure the range is, yet again, still poisoned. */
> +	ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
> +	ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
> +
> +	/* Now unpoison the whole range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
> +
> +	/* Make sure the whole range is readable. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_TRUE(try_read_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Split and merge VMAs and make sure guard pages still behave. */
> +TEST_F(guard_pages, split_merge)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr, *ptr_new;
> +	int i;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Poison the whole range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Make sure the whole range is poisoned. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Now unmap some pages in the range so we split. */
> +	ASSERT_EQ(munmap(&ptr[2 * page_size], page_size), 0);
> +	ASSERT_EQ(munmap(&ptr[5 * page_size], page_size), 0);
> +	ASSERT_EQ(munmap(&ptr[8 * page_size], page_size), 0);
> +
> +	/* Make sure the remaining ranges are poisoned post-split. */
> +	for (i = 0; i < 2; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +	for (i = 2; i < 5; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +	for (i = 6; i < 8; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +	for (i = 9; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Now map them again - the unmap will have cleared the poison. */
> +	ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE,
> +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +	ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
> +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +	ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE,
> +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +
> +	/* Now make sure poisoning is as expected. */
> +	for (i = 0; i < 10; i++) {
> +		bool result = try_read_write_buf(&ptr[i * page_size]);
> +
> +		if (i == 2 || i == 5 || i == 8) {
> +			ASSERT_TRUE(result);
> +		} else {
> +			ASSERT_FALSE(result);
> +		}
> +	}
> +
> +	/* Now poison everything again. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Make sure the whole range is poisoned. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Now split the range into three. */
> +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
> +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
> +
> +	/* Make sure the whole range is poisoned for read. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Now reset protection bits so we merge the whole thing. */
> +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
> +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
> +			   PROT_READ | PROT_WRITE), 0);
> +
> +	/* Make sure the whole range is still poisoned. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Split range into 3 again... */
> +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
> +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
> +
> +	/* ...and unpoison the whole range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
> +
> +	/* Make sure the whole range is remedied for read. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_TRUE(try_read_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Merge them again. */
> +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
> +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
> +			   PROT_READ | PROT_WRITE), 0);
> +
> +	/* Now ensure the merged range is remedied for read/write. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Assert that MADV_DONTNEED does not remove guard poison markers. */
> +TEST_F(guard_pages, dontneed)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Back the whole range. */
> +	for (i = 0; i < 10; i++) {
> +		ptr[i * page_size] = 'y';
> +	}
> +
> +	/* Poison every other page. */
> +	for (i = 0; i < 10; i += 2) {
> +		ASSERT_EQ(madvise(&ptr[i * page_size],
> +				  page_size, MADV_GUARD_POISON), 0);
> +	}
> +
> +	/* Indicate that we don't need any of the range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_DONTNEED), 0);
> +
> +	/* Check to ensure poison markers are still in place. */
> +	for (i = 0; i < 10; i++) {
> +		bool result = try_read_buf(&ptr[i * page_size]);
> +
> +		if (i % 2 == 0) {
> +			ASSERT_FALSE(result);
> +		} else {
> +			ASSERT_TRUE(result);
> +			/* Make sure we really did get reset to zero page. */
> +			ASSERT_EQ(ptr[i * page_size], '\0');
> +		}
> +
> +		/* Now write... */
> +		result = try_write_buf(&ptr[i * page_size]);
> +
> +		/* ...and make sure same result. */
> +		if (i % 2 == 0) {

You don't need  { here
> +			ASSERT_FALSE(result);
> +		} else {
> +			ASSERT_TRUE(result);
> +		}

Same here.
  
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Assert that mlock()'ed pages work correctly with poison markers. */
> +TEST_F(guard_pages, mlock)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Populate. */
> +	for (i = 0; i < 10; i++) {
> +		ptr[i * page_size] = 'y';
> +	}
> +
> +	/* Lock. */
> +	ASSERT_EQ(mlock(ptr, 10 * page_size), 0);
> +
> +	/* Now try to poison, should fail with EINVAL. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), -1);
> +	ASSERT_EQ(errno, EINVAL);
> +
> +	/* OK unlock. */
> +	ASSERT_EQ(munlock(ptr, 10 * page_size), 0);
> +
> +	/* Poison first half of range, should now succeed. */
> +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Make sure poison works. */
> +	for (i = 0; i < 10; i++) {
> +		bool result = try_read_write_buf(&ptr[i * page_size]);
> +
> +		if (i < 5) {
> +			ASSERT_FALSE(result);
> +		} else {
> +			ASSERT_TRUE(result);
> +			ASSERT_EQ(ptr[i * page_size], 'x');
> +		}
> +	}
> +
> +	/*
> +	 * Now lock the latter part of the range. We can't lock the poisoned
> +	 * pages, as this would result in the pages being populated and the
> +	 * poisoning would cause this to error out.
> +	 */
> +	ASSERT_EQ(mlock(&ptr[5 * page_size], 5 * page_size), 0);
> +
> +	/*
> +	 * Now unpoison, we do not permit mlock()'d ranges to be remedied as it is
> +	 * a non-destructive operation.
> +	 */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
> +
> +	/* Now check that everything is remedied. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/*
> + * Assert that moving, extending and shrinking memory via mremap() retains
> + * poison markers where possible.
> + *
> + * - Moving a mapping alone should retain markers as they are.
> + */
> +TEST_F(guard_pages, mremap_move)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr, *ptr_new;
> +
> +	/* Map 5 pages. */
> +	ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Place poison markers at both ends of the 5 page span. */
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Make sure the poison is in effect. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/* Map a new region we will move this range into. Doing this ensures
> +	 * that we have reserved a range to map into.
> +	 */
> +	ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE,
> +		       -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +
> +	ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size,
> +			 MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new), ptr_new);
> +
> +	/* Make sure the poison is retained. */
> +	ASSERT_FALSE(try_read_write_buf(ptr_new));
> +	ASSERT_FALSE(try_read_write_buf(&ptr_new[4 * page_size]));
> +
> +	/*
> +	 * Clean up - we only need reference the new pointer as we overwrote the
> +	 * PROT_NONE range and moved the existing one.
> +	 */
> +	munmap(ptr_new, 5 * page_size);
> +}
> +
> +/*
> + * Assert that moving, extending and shrinking memory via mremap() retains
> + * poison markers where possible.
> + *
> + * - Expanding should retain, only now in different position. The user will have
> + *   to unpoison manually to fix up (they'd have to do the same if it were a
> + *   PROT_NONE mapping)
> + */
> +TEST_F(guard_pages, mremap_expand)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr, *ptr_new;
> +
> +	/* Map 10 pages... */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +	/* ...But unmap the last 5 so we can ensure we can expand into them. */
> +	ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
> +
> +	/* Place poison markers at both ends of the 5 page span. */
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Make sure the poison is in effect. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/* Now expand to 10 pages. */
> +	ptr = mremap(ptr, 5 * page_size, 10 * page_size, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Make sure the poison is retained in its original positions. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/* Reserve a region which we can move to and expand into. */
> +	ptr_new = mmap(NULL, 20 * page_size, PROT_NONE,
> +		       MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr_new, MAP_FAILED);
> +
> +	/* Now move and expand into it. */
> +	ptr = mremap(ptr, 10 * page_size, 20 * page_size,
> +		     MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new);
> +	ASSERT_EQ(ptr, ptr_new);
> +
> +	/* Again, make sure the poison is retained in its original
> +	 * positions. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/*
> +	 * A real user would have to unpoison, but would reasonably expect all
> +	 * characteristics of the mapping to be retained, including poison
> +	 * markers.
> +	 */
> +
> +	/* Cleanup. */
> +	munmap(ptr, 20 * page_size);
> +}
> +/*
> + * Assert that moving, extending and shrinking memory via mremap() retains
> + * poison markers where possible.
> + *
> + * - Shrinking will result in markers that are shrunk over being removed. Again,
> + *   if the user were using a PROT_NONE mapping they'd have to manually fix this
> + *   up also so this is OK.
> + */
> +TEST_F(guard_pages, mremap_shrink)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	/* Map 5 pages. */
> +	ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Place poison markers at both ends of the 5 page span. */
> +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Make sure the poison is in effect. */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> +
> +	/* Now shrink to 3 pages. */
> +	ptr = mremap(ptr, 5 * page_size, 3 * page_size, MREMAP_MAYMOVE);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* We expect the poison marker at the start to be retained... */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +
> +	/* ...But remaining pages will not have poison markers. */
> +	for (i = 1; i < 3; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size]));
> +	}
> +
> +	/*
> +	 * As with expansion, a real user would have to unpoison and fixup. But
> +	 * you'd have to do similar manual things with PROT_NONE mappings too.
> +	 */
> +
> +	/*
> +	 * If we expand back to the original size, the end marker will, of
> +	 * course, no longer be present.
> +	 */
> +	ptr = mremap(ptr, 3 * page_size, 5 * page_size, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Again, we expect the poison marker at the start to be retained... */
> +	ASSERT_FALSE(try_read_write_buf(ptr));
> +
> +	/* ...But remaining pages will not have poison markers. */
> +	for (i = 1; i < 5; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size]));
> +	}
> +
> +	/* Cleanup. */
> +	munmap(ptr, 5 * page_size);
> +}
> +
> +/*
> + * Assert that forking a process with VMAs that do not have VM_WIPEONFORK set
> + * retain guard pages.
> + */
> +TEST_F(guard_pages, fork)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	pid_t pid;
> +	int i;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Poison the first 5 pages. */
> +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
> +
> +	pid = fork();
> +	ASSERT_NE(pid, -1);
> +	if (!pid) {
> +		/* This is the child process now. */
> +
> +		/* Assert that the poisoning is in effect. */
> +		for (i = 0; i < 10; i++) {
> +			bool result = try_read_write_buf(&ptr[i * page_size]);
> +
> +			if (i < 5) {
> +				ASSERT_FALSE(result);
> +			} else {
> +				ASSERT_TRUE(result);
> +			}
> +		}
> +
> +		/* Now unpoison the range.*/
> +		ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
> +
> +		exit(0);
> +	}
> +
> +	/* Parent process. */
> +
> +	/* Parent simply waits on child. */
> +	waitpid(pid, NULL, 0);
> +
> +	/* Child unpoison does not impact parent page table state. */
> +	for (i = 0; i < 10; i++) {
> +		bool result = try_read_write_buf(&ptr[i * page_size]);
> +
> +		if (i < 5) {
> +			ASSERT_FALSE(result);
> +		} else {
> +			ASSERT_TRUE(result);
> +		}
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/*
> + * Assert that forking a process with VMAs that do have VM_WIPEONFORK set
> + * behave as expected.
> + */
> +TEST_F(guard_pages, fork_wipeonfork)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	pid_t pid;
> +	int i;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Mark wipe on fork. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_WIPEONFORK), 0);
> +
> +	/* Poison the first 5 pages. */
> +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
> +
> +	pid = fork();
> +	ASSERT_NE(pid, -1);
> +	if (!pid) {
> +		/* This is the child process now. */
> +
> +		/* Poison will have been wiped. */
> +		for (i = 0; i < 10; i++) {
> +			ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> +		}
> +
> +		exit(0);
> +	}
> +
> +	/* Parent process. */
> +
> +	waitpid(pid, NULL, 0);
> +
> +	/* Poison should be in effect.*/
> +	for (i = 0; i < 10; i++) {
> +		bool result = try_read_write_buf(&ptr[i * page_size]);
> +
> +		if (i < 5) {
> +			ASSERT_FALSE(result);
> +		} else {
> +			ASSERT_TRUE(result);
> +		}
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Ensure that MADV_FREE frees poison entries as expected. */
> +TEST_F(guard_pages, lazyfree)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Poison range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Ensure poisoned. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Lazyfree range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0);
> +
> +	/* This should simply clear the poison markers. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */
> +TEST_F(guard_pages, populate)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Poison range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Populate read should error out... */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_READ), -1);
> +	ASSERT_EQ(errno, EFAULT);
> +
> +	/* ...as should populate write. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_WRITE), -1);
> +	ASSERT_EQ(errno, EFAULT);
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove poison markers. */
> +TEST_F(guard_pages, cold_pageout)
> +{
> +	const unsigned long page_size = self->page_size;
> +	char *ptr;
> +	int i;
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Poison range. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> +
> +	/* Ensured poisoned. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Now mark cold. This should have no impact on poison markers. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_COLD), 0);
> +
> +	/* Should remain poisoned. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* OK, now page out. This should equally, have no effect on markers. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0);
> +
> +	/* Should remain poisoned. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +/* Ensure that guard pages do not break userfaultd. */
> +TEST_F(guard_pages, uffd)
> +{
> +	const unsigned long page_size = self->page_size;
> +	int uffd;
> +	char *ptr;
> +	int i;
> +	struct uffdio_api api = {
> +		.api = UFFD_API,
> +		.features = 0,
> +	};
> +	struct uffdio_register reg;
> +	struct uffdio_range range;
> +
> +	/* Set up uffd. */
> +	uffd = userfaultfd(0);
> +	if (uffd == -1 && errno == EPERM)
> +		ksft_exit_skip("No uffd permissions\n");

Same comment here about a user friendly message that say what
user shoul do

> +	ASSERT_NE(uffd, -1);
> +
> +	ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
> +
> +	/* Map 10 pages. */
> +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> +	ASSERT_NE(ptr, MAP_FAILED);
> +
> +	/* Register the range with uffd. */
> +	range.start = (unsigned long)ptr;
> +	range.len = 10 * page_size;
> +	reg.range = range;
> +	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
> +	ASSERT_EQ(ioctl(uffd, UFFDIO_REGISTER, &reg), 0);
> +
> +	/* Poison the range. This should not trigger the uffd. */
> +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> +
> +	/* The poisoning should behave as usual with no uffd intervention. */
> +	for (i = 0; i < 10; i++) {
> +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> +	}
> +
> +	/* Cleanup. */
> +	ASSERT_EQ(ioctl(uffd, UFFDIO_UNREGISTER, &range), 0);
> +	close(uffd);
> +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> +}
> +
> +TEST_HARNESS_MAIN

thanks,
-- Shuah
Lorenzo Stoakes Oct. 18, 2024, 7:12 a.m. UTC | #2
On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
> On 10/17/24 14:42, Lorenzo Stoakes wrote:
> > Utilise the kselftest harmness to implement tests for the guard page
>
> Splleing NIT - harmness -> harness
>
> > implementation.
> >
> > We start by implement basic tests asserting that guard pages can be
>
> implmenting? By the way checkpatch will catch spelling stuuf.
> Please see comments about warnings below.

Thanks. The majority of the checkpatch warnings are invalid so I missed
this. Will fix on respin.

>
> > established (poisoned), cleared (remedied) and that touching poisoned pages
> > result in SIGSEGV. We also assert that, in remedying a range, non-poison
> > pages remain intact.
> >
> > We then examine different operations on regions containing poison markers
> > behave to ensure correct behaviour:
> >
> > * Operations over multiple VMAs operate as expected.
> > * Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in
> >    batches works correctly.
> > * Ensuring that munmap() correctly tears down poison markers.
> > * Using mprotect() to adjust protection bits does not in any way override
> >    or cause issues with poison markers.
> > * Ensuring that splitting and merging VMAs around poison markers causes no
> >    issue - i.e. that a marker which 'belongs' to one VMA can function just
> >    as well 'belonging' to another.
> > * Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
> > * Ensuring that mlock()'ing a range containing poison markers does not
> >    cause issues.
> > * Ensuring that mremap() can move a poisoned range and retain poison
> >    markers.
> > * Ensuring that mremap() can expand a poisoned range and retain poison
> >    markers (perhaps moving the range).
> > * Ensuring that mremap() can shrink a poisoned range and retain poison
> >    markers.
> > * Ensuring that forking a process correctly retains poison markers.
> > * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
> > * Ensuring that lazyfree simply clears poison markers.
> > * Ensuring that userfaultfd can co-exist with guard pages.
> > * Ensuring that madvise(..., MADV_POPULATE_READ) and
> >    madvise(..., MADV_POPULATE_WRITE) error out when encountering
> >    poison markers.
> > * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do
> >    not remove poison markers.
>
> Good summary of test. Does the test require root access?
> If so does it check and skip appropriately?

Thanks and some do, in those cases we skip.

>
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >   tools/testing/selftests/mm/.gitignore    |    1 +
> >   tools/testing/selftests/mm/Makefile      |    1 +
> >   tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> >   3 files changed, 1170 insertions(+)
> >   create mode 100644 tools/testing/selftests/mm/guard-pages.c
> >
> > diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> > index 689bbd520296..8f01f4da1c0d 100644
> > --- a/tools/testing/selftests/mm/.gitignore
> > +++ b/tools/testing/selftests/mm/.gitignore
> > @@ -54,3 +54,4 @@ droppable
> >   hugetlb_dio
> >   pkey_sighandler_tests_32
> >   pkey_sighandler_tests_64
> > +guard-pages
> > diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> > index 02e1204971b0..15c734d6cfec 100644
> > --- a/tools/testing/selftests/mm/Makefile
> > +++ b/tools/testing/selftests/mm/Makefile
> > @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
> >   TEST_GEN_FILES += hugetlb_madv_vs_map
> >   TEST_GEN_FILES += hugetlb_dio
> >   TEST_GEN_FILES += droppable
> > +TEST_GEN_FILES += guard-pages
> >   ifneq ($(ARCH),arm64)
> >   TEST_GEN_FILES += soft-dirty
> > diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
> > new file mode 100644
> > index 000000000000..2ab0ff3ba5a0
> > --- /dev/null
> > +++ b/tools/testing/selftests/mm/guard-pages.c
> > @@ -0,0 +1,1168 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +
> > +#define _GNU_SOURCE
> > +#include "../kselftest_harness.h"
> > +#include <assert.h>
> > +#include <fcntl.h>
> > +#include <setjmp.h>
> > +#include <errno.h>
> > +#include <linux/userfaultfd.h>
> > +#include <signal.h>
> > +#include <stdbool.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <string.h>
> > +#include <sys/ioctl.h>
> > +#include <sys/mman.h>
> > +#include <sys/syscall.h>
> > +#include <sys/uio.h>
> > +#include <unistd.h>
> > +
> > +/* These may not yet be available in the uAPI so define if not. */
> > +
> > +#ifndef MADV_GUARD_POISON
> > +#define MADV_GUARD_POISON	102
> > +#endif
> > +
> > +#ifndef MADV_GUARD_UNPOISON
> > +#define MADV_GUARD_UNPOISON	103
> > +#endif
> > +
> > +volatile bool signal_jump_set;
>
> Can you add a comment about why volatile is needed.

I'm not sure it's really necessary, it's completely standard to do this
with signal handling and is one of the exceptions to the 'volatile
considered harmful' rule.

> By the way did you happen to run checkpatck on this. There are
> several instances where single statement blocks with braces {}
>
> I noticed a few and ran checkpatch on your patch. There are
> 45 warnings regarding codeing style.
>
> Please run checkpatch and clean them up so we can avoid followup
> checkpatch cleanup patches.

No sorry I won't, checkpatch isn't infallible and series trying to 'clean
up' things that aren't issues will be a waste of everybody's time.

There are cases where removing the braces results in compilation error so
obviously we can't remove them, in others it seriously hurts readability.

>
> > +sigjmp_buf signal_jmp_buf;
> > +
> > +static int userfaultfd(int flags)
> > +{
> > +	return syscall(SYS_userfaultfd, flags);
> > +}
> > +
> > +static void handle_fatal(int c)
> > +{
> > +	if (!signal_jump_set)
> > +		return;
> > +
> > +	siglongjmp(signal_jmp_buf, c);
> > +}
> > +
> > +static int pidfd_open(pid_t pid, unsigned int flags)
> > +{
> > +	return syscall(SYS_pidfd_open, pid, flags);
> > +}
> > +
> > +/*
> > + * Enable our signal catcher and try to read/write the specified buffer. The
> > + * return value indicates whether the read/write succeeds without a fatal
> > + * signal.
> > + */
> > +static bool try_access_buf(char *ptr, bool write)
> > +{
> > +	bool failed;
> > +
> > +	/* Tell signal handler to jump back here on fatal signal. */
> > +	signal_jump_set = true;
> > +	/* If a fatal signal arose, we will jump back here and failed is set. */
> > +	failed = sigsetjmp(signal_jmp_buf, 0) != 0;
> > +
> > +	if (!failed) {
> > +		if (write) {
> > +			*ptr = 'x';
> > +		} else {
> > +			const volatile char *chr = ptr;
> > +
> > +			/* Force read. */
> > +			(void)*chr;
> > +		}
> > +	}
> > +
> > +	signal_jump_set = false;
> > +	return !failed;
> > +}
> > +
> > +/* Try and read from a buffer, return true if no fatal signal. */
> > +static bool try_read_buf(char *ptr)
> > +{
> > +	return try_access_buf(ptr, false);
> > +}
> > +
> > +/* Try and write to a buffer, return true if no fatal signal. */
> > +static bool try_write_buf(char *ptr)
> > +{
> > +	return try_access_buf(ptr, true);
> > +}
> > +
> > +/*
> > + * Try and BOTH read from AND write to a buffer, return true if BOTH operations
> > + * succeed.
> > + */
> > +static bool try_read_write_buf(char *ptr)
> > +{
> > +	return try_read_buf(ptr) && try_write_buf(ptr);
> > +}
> > +
> > +FIXTURE(guard_pages)
> > +{
> > +	unsigned long page_size;
> > +};
> > +
> > +FIXTURE_SETUP(guard_pages)
> > +{
> > +	struct sigaction act = {
> > +		.sa_handler = &handle_fatal,
> > +		.sa_flags = SA_NODEFER,
> > +	};
> > +
> > +	sigemptyset(&act.sa_mask);
> > +	if (sigaction(SIGSEGV, &act, NULL)) {
> > +		perror("sigaction");
> > +		ksft_exit_fail();
>
> Replase the above two with ksft_exit_fail_perror()
> There is no need for perror("sigaction"); followed by
> ksft_exit_fail();

Ack will do on respin. Thanks for the tip! Wasn't aware of that.

>
> > +	}
> > +
> > +	self->page_size = (unsigned long)sysconf(_SC_PAGESIZE);
> > +};
> > +
> > +FIXTURE_TEARDOWN(guard_pages)
> > +{
> > +	struct sigaction act = {
> > +		.sa_handler = SIG_DFL,
> > +		.sa_flags = SA_NODEFER,
> > +	};
> > +
> > +	sigemptyset(&act.sa_mask);
> > +	sigaction(SIGSEGV, &act, NULL);
> > +}
> > +
> > +TEST_F(guard_pages, basic)
> > +{
> > +	const unsigned long NUM_PAGES = 10;
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	int i;
> > +
> > +	ptr = mmap(NULL, NUM_PAGES * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_PRIVATE | MAP_ANON, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Trivially assert we can touch the first page. */
> > +	ASSERT_TRUE(try_read_write_buf(ptr));
> > +
> > +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Establish that 1st page SIGSEGV's. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +
> > +	/* Ensure we can touch everything else.*/
> > +	for (i = 1; i < NUM_PAGES; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Establish a guard page at the end of the mapping. */
> > +	ASSERT_EQ(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
> > +			  MADV_GUARD_POISON), 0);
> > +
> > +	/* Check that both guard pages result in SIGSEGV. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
> > +
> > +	/* Unpoison the first. */
> > +	ASSERT_FALSE(madvise(ptr, page_size, MADV_GUARD_UNPOISON));
> > +
> > +	/* Make sure we can touch it. */
> > +	ASSERT_TRUE(try_read_write_buf(ptr));
> > +
> > +	/* Unpoison the last. */
> > +	ASSERT_FALSE(madvise(&ptr[(NUM_PAGES - 1) * page_size], page_size,
> > +			     MADV_GUARD_UNPOISON));
> > +
> > +	/* Make sure we can touch it. */
> > +	ASSERT_TRUE(try_read_write_buf(&ptr[(NUM_PAGES - 1) * page_size]));
> > +
> > +	/*
> > +	 *  Test setting a _range_ of pages, namely the first 3. The first of
> > +	 *  these be faulted in, so this also tests that we can poison backed
> > +	 *  pages.
> > +	 */
> > +	ASSERT_EQ(madvise(ptr, 3 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Make sure they are all poisoned. */
> > +	for (i = 0; i < 3; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
>
> This one here and

Refer you to other comments around this, it hurts readability to do that.

> > +	/* Make sure the rest are not. */
> > +	for (i = 3; i < NUM_PAGES; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Unpoison them. */
> > +	ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0);
> > +
> > +	/* Now make sure we can touch everything. */
> > +	for (i = 0; i < NUM_PAGES; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Now unpoison everything, make sure we don't remove existing entries */
> > +	ASSERT_EQ(madvise(ptr, NUM_PAGES * page_size, MADV_GUARD_UNPOISON), 0);
> > +
> > +	for (i = 0; i < NUM_PAGES * page_size; i += page_size) {
> > +		ASSERT_EQ(ptr[i], 'x');
> > +	}
> > +
> > +	ASSERT_EQ(munmap(ptr, NUM_PAGES * page_size), 0);
> > +}
> > +
> > +/* Assert that operations applied across multiple VMAs work as expected. */
> > +TEST_F(guard_pages, multi_vma)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr_region, *ptr, *ptr1, *ptr2, *ptr3;
> > +	int i;
> > +
> > +	/* Reserve a 100 page region over which we can install VMAs. */
> > +	ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
> > +			  MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr_region, MAP_FAILED);
> > +
> > +	/* Place a VMA of 10 pages size at the start of the region. */
> > +	ptr1 = mmap(ptr_region, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr1, MAP_FAILED);
> > +
> > +	/* Place a VMA of 5 pages size 50 pages into the region. */
> > +	ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
> > +		    PROT_READ | PROT_WRITE,
> > +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr2, MAP_FAILED);
> > +
> > +	/* Place a VMA of 20 pages size at the end of the region. */
> > +	ptr3 = mmap(&ptr_region[80 * page_size], 20 * page_size,
> > +		    PROT_READ | PROT_WRITE,
> > +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr3, MAP_FAILED);
> > +
> > +	/* Unmap gaps. */
> > +	ASSERT_EQ(munmap(&ptr_region[10 * page_size], 40 * page_size), 0);
> > +	ASSERT_EQ(munmap(&ptr_region[55 * page_size], 25 * page_size), 0);
> > +
> > +	/*
> > +	 * We end up with VMAs like this:
> > +	 *
> > +	 * 0    10 .. 50   55 .. 80   100
> > +	 * [---]      [---]      [---]
> > +	 */
> > +
> > +	/* Now poison the whole range and make sure all VMAs are poisoned. */
> > +
> > +	/*
> > +	 * madvise() is certifiable and lets you perform operations over gaps,
> > +	 * everything works, but it indicates an error and errno is set to
> > +	 * -ENOMEM. Also if anything runs out of memory it is set to
> > +	 * -ENOMEM. You are meant to guess which is which.
> > +	 */
> > +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), -1);
> > +	ASSERT_EQ(errno, ENOMEM);
> > +
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
> > +	}
> > +
> > +	for (i = 0; i < 5; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr2[i * page_size]));
> > +	}
> > +
> > +	for (i = 0; i < 20; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr3[i * page_size]));
> > +	}
> > +
> > +	/* Now unpoison the range and assert the opposite. */
> > +
> > +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), -1);
> > +	ASSERT_EQ(errno, ENOMEM);
> > +
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr1[i * page_size]));
> > +	}
> > +
> > +	for (i = 0; i < 5; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr2[i * page_size]));
> > +	}
> > +
> > +	for (i = 0; i < 20; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr3[i * page_size]));
> > +	}
> > +
> > +	/* Now map incompatible VMAs in the gaps. */
> > +	ptr = mmap(&ptr_region[10 * page_size], 40 * page_size,
> > +		   PROT_READ | PROT_WRITE | PROT_EXEC,
> > +		   MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +	ptr = mmap(&ptr_region[55 * page_size], 25 * page_size,
> > +		   PROT_READ | PROT_WRITE | PROT_EXEC,
> > +		   MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/*
> > +	 * We end up with VMAs like this:
> > +	 *
> > +	 * 0    10 .. 50   55 .. 80   100
> > +	 * [---][xxxx][---][xxxx][---]
> > +	 *
> > +	 * Where 'x' signifies VMAs that cannot be merged with those adjacent to
> > +	 * them.
> > +	 */
> > +
> > +	/* Multiple VMAs adjacent to one another should result in no error. */
> > +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_POISON), 0);
> > +	for (i = 0; i < 100; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr_region[i * page_size]));
> > +	}
> > +	ASSERT_EQ(madvise(ptr_region, 100 * page_size, MADV_GUARD_UNPOISON), 0);
> > +	for (i = 0; i < 100; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr_region[i * page_size]));
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr_region, 100 * page_size), 0);
> > +}
> > +
> > +/*
> > + * Assert that batched operations performed using process_madvise() work as
> > + * expected.
> > + */
> > +TEST_F(guard_pages, process_madvise)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	pid_t pid = getpid();
> > +	int pidfd = pidfd_open(pid, 0);
> > +	char *ptr_region, *ptr1, *ptr2, *ptr3;
> > +	ssize_t count;
> > +	struct iovec vec[6];
> > +
> > +	ASSERT_NE(pidfd, -1);
> > +
> > +	/* Reserve region to map over. */
> > +	ptr_region = mmap(NULL, 100 * page_size, PROT_NONE,
> > +			  MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr_region, MAP_FAILED);
> > +
> > +	/* 10 pages offset 1 page into reserve region. */
> > +	ptr1 = mmap(&ptr_region[page_size], 10 * page_size,
> > +		    PROT_READ | PROT_WRITE,
> > +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr1, MAP_FAILED);
> > +	/* We want poison markers at start/end of each VMA. */
> > +	vec[0].iov_base = ptr1;
> > +	vec[0].iov_len = page_size;
> > +	vec[1].iov_base = &ptr1[9 * page_size];
> > +	vec[1].iov_len = page_size;
> > +
> > +	/* 5 pages offset 50 pages into reserve region. */
> > +	ptr2 = mmap(&ptr_region[50 * page_size], 5 * page_size,
> > +		    PROT_READ | PROT_WRITE,
> > +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr2, MAP_FAILED);
> > +	vec[2].iov_base = ptr2;
> > +	vec[2].iov_len = page_size;
> > +	vec[3].iov_base = &ptr2[4 * page_size];
> > +	vec[3].iov_len = page_size;
> > +
> > +	/* 20 pages offset 79 pages into reserve region. */
> > +	ptr3 = mmap(&ptr_region[79 * page_size], 20 * page_size,
> > +		    PROT_READ | PROT_WRITE,
> > +		    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr3, MAP_FAILED);
> > +	vec[4].iov_base = ptr3;
> > +	vec[4].iov_len = page_size;
> > +	vec[5].iov_base = &ptr3[19 * page_size];
> > +	vec[5].iov_len = page_size;
> > +
> > +	/* Free surrounding VMAs. */
> > +	ASSERT_EQ(munmap(ptr_region, page_size), 0);
> > +	ASSERT_EQ(munmap(&ptr_region[11 * page_size], 39 * page_size), 0);
> > +	ASSERT_EQ(munmap(&ptr_region[55 * page_size], 24 * page_size), 0);
> > +	ASSERT_EQ(munmap(&ptr_region[99 * page_size], page_size), 0);
> > +
> > +	/* Now poison in one step. */
> > +	count = process_madvise(pidfd, vec, 6, MADV_GUARD_POISON, 0);
> > +
> > +	/* OK we don't have permission to do this, skip. */
> > +	if (count == -1 && errno == EPERM)
> > +		ksft_exit_skip("No process_madvise() permissions\n");
>
> Can you make this a bit more informative? What would user do
> if they see this message? Are they supposed to run the test
> as root?

OK can update on respin.

>
> > +
> > +	/* Returns the number of bytes advised. */
> > +	ASSERT_EQ(count, 6 * page_size);
> > +
> > +	/* Now make sure the poisoning was applied. */
> > +
> > +	ASSERT_FALSE(try_read_write_buf(ptr1));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr1[9 * page_size]));
> > +
> > +	ASSERT_FALSE(try_read_write_buf(ptr2));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr2[4 * page_size]));
> > +
> > +	ASSERT_FALSE(try_read_write_buf(ptr3));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr3[19 * page_size]));
> > +
> > +	/* Now do the same with unpoison... */
> > +	count = process_madvise(pidfd, vec, 6, MADV_GUARD_UNPOISON, 0);
> > +
> > +	/* ...and everything should now succeed. */
> > +
> > +	ASSERT_TRUE(try_read_write_buf(ptr1));
> > +	ASSERT_TRUE(try_read_write_buf(&ptr1[9 * page_size]));
> > +
> > +	ASSERT_TRUE(try_read_write_buf(ptr2));
> > +	ASSERT_TRUE(try_read_write_buf(&ptr2[4 * page_size]));
> > +
> > +	ASSERT_TRUE(try_read_write_buf(ptr3));
> > +	ASSERT_TRUE(try_read_write_buf(&ptr3[19 * page_size]));
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr1, 10 * page_size), 0);
> > +	ASSERT_EQ(munmap(ptr2, 5 * page_size), 0);
> > +	ASSERT_EQ(munmap(ptr3, 20 * page_size), 0);
> > +	close(pidfd);
> > +}
> > +
> > +/* Assert that unmapping ranges does not leave poison behind. */
> > +TEST_F(guard_pages, munmap)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr, *ptr_new1, *ptr_new2;
> > +
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Poison first and last pages. */
> > +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> > +	ASSERT_EQ(madvise(&ptr[9 * page_size], page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Assert that they are poisoned. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[9 * page_size]));
> > +
> > +	/* Unmap them. */
> > +	ASSERT_EQ(munmap(ptr, page_size), 0);
> > +	ASSERT_EQ(munmap(&ptr[9 * page_size], page_size), 0);
> > +
> > +	/* Map over them.*/
> > +	ptr_new1 = mmap(ptr, page_size, PROT_READ | PROT_WRITE,
> > +			MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr_new1, MAP_FAILED);
> > +	ptr_new2 = mmap(&ptr[9 * page_size], page_size, PROT_READ | PROT_WRITE,
> > +			MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr_new2, MAP_FAILED);
> > +
> > +	/* Assert that they are now not poisoned. */
> > +	ASSERT_TRUE(try_read_write_buf(ptr_new1));
> > +	ASSERT_TRUE(try_read_write_buf(ptr_new2));
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/* Assert that mprotect() operations have no bearing on guard poison markers. */
> > +TEST_F(guard_pages, mprotect)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	int i;
> > +
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Poison the middle of the range. */
> > +	ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
> > +			  MADV_GUARD_POISON), 0);
> > +
> > +	/* Assert that it is indeed poisoned. */
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[5 * page_size]));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[6 * page_size]));
> > +
> > +	/* Now make these pages read-only. */
> > +	ASSERT_EQ(mprotect(&ptr[5 * page_size], 2 * page_size, PROT_READ), 0);
> > +
> > +	/* Make sure the range is still poisoned. */
> > +	ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
> > +	ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
> > +
> > +	/* Make sure we can poison again without issue.*/
> > +	ASSERT_EQ(madvise(&ptr[5 * page_size], 2 * page_size,
> > +			  MADV_GUARD_POISON), 0);
> > +
> > +	/* Make sure the range is, yet again, still poisoned. */
> > +	ASSERT_FALSE(try_read_buf(&ptr[5 * page_size]));
> > +	ASSERT_FALSE(try_read_buf(&ptr[6 * page_size]));
> > +
> > +	/* Now unpoison the whole range. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
> > +
> > +	/* Make sure the whole range is readable. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_TRUE(try_read_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/* Split and merge VMAs and make sure guard pages still behave. */
> > +TEST_F(guard_pages, split_merge)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr, *ptr_new;
> > +	int i;
> > +
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Poison the whole range. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Make sure the whole range is poisoned. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Now unmap some pages in the range so we split. */
> > +	ASSERT_EQ(munmap(&ptr[2 * page_size], page_size), 0);
> > +	ASSERT_EQ(munmap(&ptr[5 * page_size], page_size), 0);
> > +	ASSERT_EQ(munmap(&ptr[8 * page_size], page_size), 0);
> > +
> > +	/* Make sure the remaining ranges are poisoned post-split. */
> > +	for (i = 0; i < 2; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +	for (i = 2; i < 5; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +	for (i = 6; i < 8; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +	for (i = 9; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Now map them again - the unmap will have cleared the poison. */
> > +	ptr_new = mmap(&ptr[2 * page_size], page_size, PROT_READ | PROT_WRITE,
> > +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr_new, MAP_FAILED);
> > +	ptr_new = mmap(&ptr[5 * page_size], page_size, PROT_READ | PROT_WRITE,
> > +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr_new, MAP_FAILED);
> > +	ptr_new = mmap(&ptr[8 * page_size], page_size, PROT_READ | PROT_WRITE,
> > +		       MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr_new, MAP_FAILED);
> > +
> > +	/* Now make sure poisoning is as expected. */
> > +	for (i = 0; i < 10; i++) {
> > +		bool result = try_read_write_buf(&ptr[i * page_size]);
> > +
> > +		if (i == 2 || i == 5 || i == 8) {
> > +			ASSERT_TRUE(result);
> > +		} else {
> > +			ASSERT_FALSE(result);
> > +		}
> > +	}
> > +
> > +	/* Now poison everything again. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Make sure the whole range is poisoned. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Now split the range into three. */
> > +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
> > +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
> > +
> > +	/* Make sure the whole range is poisoned for read. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Now reset protection bits so we merge the whole thing. */
> > +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
> > +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
> > +			   PROT_READ | PROT_WRITE), 0);
> > +
> > +	/* Make sure the whole range is still poisoned. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Split range into 3 again... */
> > +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ), 0);
> > +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size, PROT_READ), 0);
> > +
> > +	/* ...and unpoison the whole range. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
> > +
> > +	/* Make sure the whole range is remedied for read. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_TRUE(try_read_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Merge them again. */
> > +	ASSERT_EQ(mprotect(ptr, 3 * page_size, PROT_READ | PROT_WRITE), 0);
> > +	ASSERT_EQ(mprotect(&ptr[7 * page_size], 3 * page_size,
> > +			   PROT_READ | PROT_WRITE), 0);
> > +
> > +	/* Now ensure the merged range is remedied for read/write. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/* Assert that MADV_DONTNEED does not remove guard poison markers. */
> > +TEST_F(guard_pages, dontneed)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	int i;
> > +
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Back the whole range. */
> > +	for (i = 0; i < 10; i++) {
> > +		ptr[i * page_size] = 'y';
> > +	}
> > +
> > +	/* Poison every other page. */
> > +	for (i = 0; i < 10; i += 2) {
> > +		ASSERT_EQ(madvise(&ptr[i * page_size],
> > +				  page_size, MADV_GUARD_POISON), 0);
> > +	}
> > +
> > +	/* Indicate that we don't need any of the range. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_DONTNEED), 0);
> > +
> > +	/* Check to ensure poison markers are still in place. */
> > +	for (i = 0; i < 10; i++) {
> > +		bool result = try_read_buf(&ptr[i * page_size]);
> > +
> > +		if (i % 2 == 0) {
> > +			ASSERT_FALSE(result);
> > +		} else {
> > +			ASSERT_TRUE(result);
> > +			/* Make sure we really did get reset to zero page. */
> > +			ASSERT_EQ(ptr[i * page_size], '\0');
> > +		}
> > +
> > +		/* Now write... */
> > +		result = try_write_buf(&ptr[i * page_size]);
> > +
> > +		/* ...and make sure same result. */
> > +		if (i % 2 == 0) {
>
> You don't need  { here
> > +			ASSERT_FALSE(result);
> > +		} else {
> > +			ASSERT_TRUE(result);
> > +		}
>
> Same here.

Removing these results in compilation failure so we can't :) this is
probably/possibly a bug in how the assert macros work but fixing that is
out of scope for this series.

> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/* Assert that mlock()'ed pages work correctly with poison markers. */
> > +TEST_F(guard_pages, mlock)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	int i;
> > +
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Populate. */
> > +	for (i = 0; i < 10; i++) {
> > +		ptr[i * page_size] = 'y';
> > +	}
> > +
> > +	/* Lock. */
> > +	ASSERT_EQ(mlock(ptr, 10 * page_size), 0);
> > +
> > +	/* Now try to poison, should fail with EINVAL. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), -1);
> > +	ASSERT_EQ(errno, EINVAL);
> > +
> > +	/* OK unlock. */
> > +	ASSERT_EQ(munlock(ptr, 10 * page_size), 0);
> > +
> > +	/* Poison first half of range, should now succeed. */
> > +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Make sure poison works. */
> > +	for (i = 0; i < 10; i++) {
> > +		bool result = try_read_write_buf(&ptr[i * page_size]);
> > +
> > +		if (i < 5) {
> > +			ASSERT_FALSE(result);
> > +		} else {
> > +			ASSERT_TRUE(result);
> > +			ASSERT_EQ(ptr[i * page_size], 'x');
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Now lock the latter part of the range. We can't lock the poisoned
> > +	 * pages, as this would result in the pages being populated and the
> > +	 * poisoning would cause this to error out.
> > +	 */
> > +	ASSERT_EQ(mlock(&ptr[5 * page_size], 5 * page_size), 0);
> > +
> > +	/*
> > +	 * Now unpoison, we do not permit mlock()'d ranges to be remedied as it is
> > +	 * a non-destructive operation.
> > +	 */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
> > +
> > +	/* Now check that everything is remedied. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/*
> > + * Assert that moving, extending and shrinking memory via mremap() retains
> > + * poison markers where possible.
> > + *
> > + * - Moving a mapping alone should retain markers as they are.
> > + */
> > +TEST_F(guard_pages, mremap_move)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr, *ptr_new;
> > +
> > +	/* Map 5 pages. */
> > +	ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Place poison markers at both ends of the 5 page span. */
> > +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> > +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Make sure the poison is in effect. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> > +
> > +	/* Map a new region we will move this range into. Doing this ensures
> > +	 * that we have reserved a range to map into.
> > +	 */
> > +	ptr_new = mmap(NULL, 5 * page_size, PROT_NONE, MAP_ANON | MAP_PRIVATE,
> > +		       -1, 0);
> > +	ASSERT_NE(ptr_new, MAP_FAILED);
> > +
> > +	ASSERT_EQ(mremap(ptr, 5 * page_size, 5 * page_size,
> > +			 MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new), ptr_new);
> > +
> > +	/* Make sure the poison is retained. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr_new));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr_new[4 * page_size]));
> > +
> > +	/*
> > +	 * Clean up - we only need reference the new pointer as we overwrote the
> > +	 * PROT_NONE range and moved the existing one.
> > +	 */
> > +	munmap(ptr_new, 5 * page_size);
> > +}
> > +
> > +/*
> > + * Assert that moving, extending and shrinking memory via mremap() retains
> > + * poison markers where possible.
> > + *
> > + * - Expanding should retain, only now in different position. The user will have
> > + *   to unpoison manually to fix up (they'd have to do the same if it were a
> > + *   PROT_NONE mapping)
> > + */
> > +TEST_F(guard_pages, mremap_expand)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr, *ptr_new;
> > +
> > +	/* Map 10 pages... */
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +	/* ...But unmap the last 5 so we can ensure we can expand into them. */
> > +	ASSERT_EQ(munmap(&ptr[5 * page_size], 5 * page_size), 0);
> > +
> > +	/* Place poison markers at both ends of the 5 page span. */
> > +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> > +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Make sure the poison is in effect. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> > +
> > +	/* Now expand to 10 pages. */
> > +	ptr = mremap(ptr, 5 * page_size, 10 * page_size, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Make sure the poison is retained in its original positions. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> > +
> > +	/* Reserve a region which we can move to and expand into. */
> > +	ptr_new = mmap(NULL, 20 * page_size, PROT_NONE,
> > +		       MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr_new, MAP_FAILED);
> > +
> > +	/* Now move and expand into it. */
> > +	ptr = mremap(ptr, 10 * page_size, 20 * page_size,
> > +		     MREMAP_MAYMOVE | MREMAP_FIXED, ptr_new);
> > +	ASSERT_EQ(ptr, ptr_new);
> > +
> > +	/* Again, make sure the poison is retained in its original
> > +	 * positions. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> > +
> > +	/*
> > +	 * A real user would have to unpoison, but would reasonably expect all
> > +	 * characteristics of the mapping to be retained, including poison
> > +	 * markers.
> > +	 */
> > +
> > +	/* Cleanup. */
> > +	munmap(ptr, 20 * page_size);
> > +}
> > +/*
> > + * Assert that moving, extending and shrinking memory via mremap() retains
> > + * poison markers where possible.
> > + *
> > + * - Shrinking will result in markers that are shrunk over being removed. Again,
> > + *   if the user were using a PROT_NONE mapping they'd have to manually fix this
> > + *   up also so this is OK.
> > + */
> > +TEST_F(guard_pages, mremap_shrink)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	int i;
> > +
> > +	/* Map 5 pages. */
> > +	ptr = mmap(NULL, 5 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Place poison markers at both ends of the 5 page span. */
> > +	ASSERT_EQ(madvise(ptr, page_size, MADV_GUARD_POISON), 0);
> > +	ASSERT_EQ(madvise(&ptr[4 * page_size], page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Make sure the poison is in effect. */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +	ASSERT_FALSE(try_read_write_buf(&ptr[4 * page_size]));
> > +
> > +	/* Now shrink to 3 pages. */
> > +	ptr = mremap(ptr, 5 * page_size, 3 * page_size, MREMAP_MAYMOVE);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* We expect the poison marker at the start to be retained... */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +
> > +	/* ...But remaining pages will not have poison markers. */
> > +	for (i = 1; i < 3; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size]));
> > +	}
> > +
> > +	/*
> > +	 * As with expansion, a real user would have to unpoison and fixup. But
> > +	 * you'd have to do similar manual things with PROT_NONE mappings too.
> > +	 */
> > +
> > +	/*
> > +	 * If we expand back to the original size, the end marker will, of
> > +	 * course, no longer be present.
> > +	 */
> > +	ptr = mremap(ptr, 3 * page_size, 5 * page_size, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Again, we expect the poison marker at the start to be retained... */
> > +	ASSERT_FALSE(try_read_write_buf(ptr));
> > +
> > +	/* ...But remaining pages will not have poison markers. */
> > +	for (i = 1; i < 5; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr[i + page_size]));
> > +	}
> > +
> > +	/* Cleanup. */
> > +	munmap(ptr, 5 * page_size);
> > +}
> > +
> > +/*
> > + * Assert that forking a process with VMAs that do not have VM_WIPEONFORK set
> > + * retain guard pages.
> > + */
> > +TEST_F(guard_pages, fork)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	pid_t pid;
> > +	int i;
> > +
> > +	/* Map 10 pages. */
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Poison the first 5 pages. */
> > +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	pid = fork();
> > +	ASSERT_NE(pid, -1);
> > +	if (!pid) {
> > +		/* This is the child process now. */
> > +
> > +		/* Assert that the poisoning is in effect. */
> > +		for (i = 0; i < 10; i++) {
> > +			bool result = try_read_write_buf(&ptr[i * page_size]);
> > +
> > +			if (i < 5) {
> > +				ASSERT_FALSE(result);
> > +			} else {
> > +				ASSERT_TRUE(result);
> > +			}
> > +		}
> > +
> > +		/* Now unpoison the range.*/
> > +		ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_UNPOISON), 0);
> > +
> > +		exit(0);
> > +	}
> > +
> > +	/* Parent process. */
> > +
> > +	/* Parent simply waits on child. */
> > +	waitpid(pid, NULL, 0);
> > +
> > +	/* Child unpoison does not impact parent page table state. */
> > +	for (i = 0; i < 10; i++) {
> > +		bool result = try_read_write_buf(&ptr[i * page_size]);
> > +
> > +		if (i < 5) {
> > +			ASSERT_FALSE(result);
> > +		} else {
> > +			ASSERT_TRUE(result);
> > +		}
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/*
> > + * Assert that forking a process with VMAs that do have VM_WIPEONFORK set
> > + * behave as expected.
> > + */
> > +TEST_F(guard_pages, fork_wipeonfork)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	pid_t pid;
> > +	int i;
> > +
> > +	/* Map 10 pages. */
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Mark wipe on fork. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_WIPEONFORK), 0);
> > +
> > +	/* Poison the first 5 pages. */
> > +	ASSERT_EQ(madvise(ptr, 5 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	pid = fork();
> > +	ASSERT_NE(pid, -1);
> > +	if (!pid) {
> > +		/* This is the child process now. */
> > +
> > +		/* Poison will have been wiped. */
> > +		for (i = 0; i < 10; i++) {
> > +			ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> > +		}
> > +
> > +		exit(0);
> > +	}
> > +
> > +	/* Parent process. */
> > +
> > +	waitpid(pid, NULL, 0);
> > +
> > +	/* Poison should be in effect.*/
> > +	for (i = 0; i < 10; i++) {
> > +		bool result = try_read_write_buf(&ptr[i * page_size]);
> > +
> > +		if (i < 5) {
> > +			ASSERT_FALSE(result);
> > +		} else {
> > +			ASSERT_TRUE(result);
> > +		}
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/* Ensure that MADV_FREE frees poison entries as expected. */
> > +TEST_F(guard_pages, lazyfree)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	int i;
> > +
> > +	/* Map 10 pages. */
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Poison range. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Ensure poisoned. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Lazyfree range. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_FREE), 0);
> > +
> > +	/* This should simply clear the poison markers. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_TRUE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/* Ensure that MADV_POPULATE_READ, MADV_POPULATE_WRITE behave as expected. */
> > +TEST_F(guard_pages, populate)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +
> > +	/* Map 10 pages. */
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Poison range. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Populate read should error out... */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_READ), -1);
> > +	ASSERT_EQ(errno, EFAULT);
> > +
> > +	/* ...as should populate write. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_POPULATE_WRITE), -1);
> > +	ASSERT_EQ(errno, EFAULT);
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/* Ensure that MADV_COLD, MADV_PAGEOUT do not remove poison markers. */
> > +TEST_F(guard_pages, cold_pageout)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	char *ptr;
> > +	int i;
> > +
> > +	/* Map 10 pages. */
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Poison range. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* Ensured poisoned. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Now mark cold. This should have no impact on poison markers. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_COLD), 0);
> > +
> > +	/* Should remain poisoned. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* OK, now page out. This should equally, have no effect on markers. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_PAGEOUT), 0);
> > +
> > +	/* Should remain poisoned. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +/* Ensure that guard pages do not break userfaultd. */
> > +TEST_F(guard_pages, uffd)
> > +{
> > +	const unsigned long page_size = self->page_size;
> > +	int uffd;
> > +	char *ptr;
> > +	int i;
> > +	struct uffdio_api api = {
> > +		.api = UFFD_API,
> > +		.features = 0,
> > +	};
> > +	struct uffdio_register reg;
> > +	struct uffdio_range range;
> > +
> > +	/* Set up uffd. */
> > +	uffd = userfaultfd(0);
> > +	if (uffd == -1 && errno == EPERM)
> > +		ksft_exit_skip("No uffd permissions\n");
>
> Same comment here about a user friendly message that say what
> user shoul do

Ack will update on respin.

>
> > +	ASSERT_NE(uffd, -1);
> > +
> > +	ASSERT_EQ(ioctl(uffd, UFFDIO_API, &api), 0);
> > +
> > +	/* Map 10 pages. */
> > +	ptr = mmap(NULL, 10 * page_size, PROT_READ | PROT_WRITE,
> > +		   MAP_ANON | MAP_PRIVATE, -1, 0);
> > +	ASSERT_NE(ptr, MAP_FAILED);
> > +
> > +	/* Register the range with uffd. */
> > +	range.start = (unsigned long)ptr;
> > +	range.len = 10 * page_size;
> > +	reg.range = range;
> > +	reg.mode = UFFDIO_REGISTER_MODE_MISSING;
> > +	ASSERT_EQ(ioctl(uffd, UFFDIO_REGISTER, &reg), 0);
> > +
> > +	/* Poison the range. This should not trigger the uffd. */
> > +	ASSERT_EQ(madvise(ptr, 10 * page_size, MADV_GUARD_POISON), 0);
> > +
> > +	/* The poisoning should behave as usual with no uffd intervention. */
> > +	for (i = 0; i < 10; i++) {
> > +		ASSERT_FALSE(try_read_write_buf(&ptr[i * page_size]));
> > +	}
> > +
> > +	/* Cleanup. */
> > +	ASSERT_EQ(ioctl(uffd, UFFDIO_UNREGISTER, &range), 0);
> > +	close(uffd);
> > +	ASSERT_EQ(munmap(ptr, 10 * page_size), 0);
> > +}
> > +
> > +TEST_HARNESS_MAIN
>
> thanks,
> -- Shuah
>
Shuah Khan Oct. 18, 2024, 3:32 p.m. UTC | #3
On 10/18/24 01:12, Lorenzo Stoakes wrote:
> On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
>> On 10/17/24 14:42, Lorenzo Stoakes wrote:
>>> Utilise the kselftest harmness to implement tests for the guard page
>>
>> Splleing NIT - harmness -> harness
>>
>>> implementation.
>>>
>>> We start by implement basic tests asserting that guard pages can be
>>
>> implmenting? By the way checkpatch will catch spelling stuuf.
>> Please see comments about warnings below.
> 
> Thanks. The majority of the checkpatch warnings are invalid so I missed
> this. Will fix on respin.
> 
>>
>>> established (poisoned), cleared (remedied) and that touching poisoned pages
>>> result in SIGSEGV. We also assert that, in remedying a range, non-poison
>>> pages remain intact.
>>>
>>> We then examine different operations on regions containing poison markers
>>> behave to ensure correct behaviour:
>>>
>>> * Operations over multiple VMAs operate as expected.
>>> * Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in
>>>     batches works correctly.
>>> * Ensuring that munmap() correctly tears down poison markers.
>>> * Using mprotect() to adjust protection bits does not in any way override
>>>     or cause issues with poison markers.
>>> * Ensuring that splitting and merging VMAs around poison markers causes no
>>>     issue - i.e. that a marker which 'belongs' to one VMA can function just
>>>     as well 'belonging' to another.
>>> * Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
>>> * Ensuring that mlock()'ing a range containing poison markers does not
>>>     cause issues.
>>> * Ensuring that mremap() can move a poisoned range and retain poison
>>>     markers.
>>> * Ensuring that mremap() can expand a poisoned range and retain poison
>>>     markers (perhaps moving the range).
>>> * Ensuring that mremap() can shrink a poisoned range and retain poison
>>>     markers.
>>> * Ensuring that forking a process correctly retains poison markers.
>>> * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
>>> * Ensuring that lazyfree simply clears poison markers.
>>> * Ensuring that userfaultfd can co-exist with guard pages.
>>> * Ensuring that madvise(..., MADV_POPULATE_READ) and
>>>     madvise(..., MADV_POPULATE_WRITE) error out when encountering
>>>     poison markers.
>>> * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do
>>>     not remove poison markers.
>>
>> Good summary of test. Does the test require root access?
>> If so does it check and skip appropriately?
> 
> Thanks and some do, in those cases we skip.
> 
>>
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>>    tools/testing/selftests/mm/.gitignore    |    1 +
>>>    tools/testing/selftests/mm/Makefile      |    1 +
>>>    tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
>>>    3 files changed, 1170 insertions(+)
>>>    create mode 100644 tools/testing/selftests/mm/guard-pages.c
>>>
>>> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
>>> index 689bbd520296..8f01f4da1c0d 100644
>>> --- a/tools/testing/selftests/mm/.gitignore
>>> +++ b/tools/testing/selftests/mm/.gitignore
>>> @@ -54,3 +54,4 @@ droppable
>>>    hugetlb_dio
>>>    pkey_sighandler_tests_32
>>>    pkey_sighandler_tests_64
>>> +guard-pages
>>> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
>>> index 02e1204971b0..15c734d6cfec 100644
>>> --- a/tools/testing/selftests/mm/Makefile
>>> +++ b/tools/testing/selftests/mm/Makefile
>>> @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
>>>    TEST_GEN_FILES += hugetlb_madv_vs_map
>>>    TEST_GEN_FILES += hugetlb_dio
>>>    TEST_GEN_FILES += droppable
>>> +TEST_GEN_FILES += guard-pages
>>>    ifneq ($(ARCH),arm64)
>>>    TEST_GEN_FILES += soft-dirty
>>> diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
>>> new file mode 100644
>>> index 000000000000..2ab0ff3ba5a0
>>> --- /dev/null
>>> +++ b/tools/testing/selftests/mm/guard-pages.c
>>> @@ -0,0 +1,1168 @@
>>> +// SPDX-License-Identifier: GPL-2.0-or-later
>>> +
>>> +#define _GNU_SOURCE
>>> +#include "../kselftest_harness.h"
>>> +#include <assert.h>
>>> +#include <fcntl.h>
>>> +#include <setjmp.h>
>>> +#include <errno.h>
>>> +#include <linux/userfaultfd.h>
>>> +#include <signal.h>
>>> +#include <stdbool.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <string.h>
>>> +#include <sys/ioctl.h>
>>> +#include <sys/mman.h>
>>> +#include <sys/syscall.h>
>>> +#include <sys/uio.h>
>>> +#include <unistd.h>
>>> +
>>> +/* These may not yet be available in the uAPI so define if not. */
>>> +
>>> +#ifndef MADV_GUARD_POISON
>>> +#define MADV_GUARD_POISON	102
>>> +#endif
>>> +
>>> +#ifndef MADV_GUARD_UNPOISON
>>> +#define MADV_GUARD_UNPOISON	103
>>> +#endif
>>> +
>>> +volatile bool signal_jump_set;
>>
>> Can you add a comment about why volatile is needed.
> 
> I'm not sure it's really necessary, it's completely standard to do this
> with signal handling and is one of the exceptions to the 'volatile
> considered harmful' rule.
> 
>> By the way did you happen to run checkpatck on this. There are
>> several instances where single statement blocks with braces {}
>>
>> I noticed a few and ran checkpatch on your patch. There are
>> 45 warnings regarding codeing style.
>>
>> Please run checkpatch and clean them up so we can avoid followup
>> checkpatch cleanup patches.
> 
> No sorry I won't, checkpatch isn't infallible and series trying to 'clean
> up' things that aren't issues will be a waste of everybody's time.
> 

Sorry - this violates the coding styles and makes it hard to read.

See process/coding-style.rst:

Do not unnecessarily use braces where a single statement will do.

.. code-block:: c

         if (condition)
                 action();

and

.. code-block:: c

         if (condition)
                 do_this();
         else
                 do_that();

This does not apply if only one branch of a conditional statement is a single
statement; in the latter case use braces in both branches:

.. code-block:: c

         if (condition) {
                 do_this();
                 do_that();
         } else {
                 otherwise();
         }

Also, use braces when a loop contains more than a single simple statement:

.. code-block:: c

         while (condition) {
                 if (test)
                         do_something();
         }

thanks,
-- Shuah
Lorenzo Stoakes Oct. 18, 2024, 4:07 p.m. UTC | #4
On Fri, Oct 18, 2024 at 09:32:17AM -0600, Shuah Khan wrote:
> On 10/18/24 01:12, Lorenzo Stoakes wrote:
> > On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
> > > On 10/17/24 14:42, Lorenzo Stoakes wrote:
> > > > Utilise the kselftest harmness to implement tests for the guard page
> > >
> > > Splleing NIT - harmness -> harness
> > >
> > > > implementation.
> > > >
> > > > We start by implement basic tests asserting that guard pages can be
> > >
> > > implmenting? By the way checkpatch will catch spelling stuuf.
> > > Please see comments about warnings below.
> >
> > Thanks. The majority of the checkpatch warnings are invalid so I missed
> > this. Will fix on respin.
> >
> > >
> > > > established (poisoned), cleared (remedied) and that touching poisoned pages
> > > > result in SIGSEGV. We also assert that, in remedying a range, non-poison
> > > > pages remain intact.
> > > >
> > > > We then examine different operations on regions containing poison markers
> > > > behave to ensure correct behaviour:
> > > >
> > > > * Operations over multiple VMAs operate as expected.
> > > > * Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in
> > > >     batches works correctly.
> > > > * Ensuring that munmap() correctly tears down poison markers.
> > > > * Using mprotect() to adjust protection bits does not in any way override
> > > >     or cause issues with poison markers.
> > > > * Ensuring that splitting and merging VMAs around poison markers causes no
> > > >     issue - i.e. that a marker which 'belongs' to one VMA can function just
> > > >     as well 'belonging' to another.
> > > > * Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
> > > > * Ensuring that mlock()'ing a range containing poison markers does not
> > > >     cause issues.
> > > > * Ensuring that mremap() can move a poisoned range and retain poison
> > > >     markers.
> > > > * Ensuring that mremap() can expand a poisoned range and retain poison
> > > >     markers (perhaps moving the range).
> > > > * Ensuring that mremap() can shrink a poisoned range and retain poison
> > > >     markers.
> > > > * Ensuring that forking a process correctly retains poison markers.
> > > > * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
> > > > * Ensuring that lazyfree simply clears poison markers.
> > > > * Ensuring that userfaultfd can co-exist with guard pages.
> > > > * Ensuring that madvise(..., MADV_POPULATE_READ) and
> > > >     madvise(..., MADV_POPULATE_WRITE) error out when encountering
> > > >     poison markers.
> > > > * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do
> > > >     not remove poison markers.
> > >
> > > Good summary of test. Does the test require root access?
> > > If so does it check and skip appropriately?
> >
> > Thanks and some do, in those cases we skip.
> >
> > >
> > > >
> > > > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > > ---
> > > >    tools/testing/selftests/mm/.gitignore    |    1 +
> > > >    tools/testing/selftests/mm/Makefile      |    1 +
> > > >    tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> > > >    3 files changed, 1170 insertions(+)
> > > >    create mode 100644 tools/testing/selftests/mm/guard-pages.c
> > > >
> > > > diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
> > > > index 689bbd520296..8f01f4da1c0d 100644
> > > > --- a/tools/testing/selftests/mm/.gitignore
> > > > +++ b/tools/testing/selftests/mm/.gitignore
> > > > @@ -54,3 +54,4 @@ droppable
> > > >    hugetlb_dio
> > > >    pkey_sighandler_tests_32
> > > >    pkey_sighandler_tests_64
> > > > +guard-pages
> > > > diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> > > > index 02e1204971b0..15c734d6cfec 100644
> > > > --- a/tools/testing/selftests/mm/Makefile
> > > > +++ b/tools/testing/selftests/mm/Makefile
> > > > @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
> > > >    TEST_GEN_FILES += hugetlb_madv_vs_map
> > > >    TEST_GEN_FILES += hugetlb_dio
> > > >    TEST_GEN_FILES += droppable
> > > > +TEST_GEN_FILES += guard-pages
> > > >    ifneq ($(ARCH),arm64)
> > > >    TEST_GEN_FILES += soft-dirty
> > > > diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
> > > > new file mode 100644
> > > > index 000000000000..2ab0ff3ba5a0
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/mm/guard-pages.c
> > > > @@ -0,0 +1,1168 @@
> > > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > > +
> > > > +#define _GNU_SOURCE
> > > > +#include "../kselftest_harness.h"
> > > > +#include <assert.h>
> > > > +#include <fcntl.h>
> > > > +#include <setjmp.h>
> > > > +#include <errno.h>
> > > > +#include <linux/userfaultfd.h>
> > > > +#include <signal.h>
> > > > +#include <stdbool.h>
> > > > +#include <stdio.h>
> > > > +#include <stdlib.h>
> > > > +#include <string.h>
> > > > +#include <sys/ioctl.h>
> > > > +#include <sys/mman.h>
> > > > +#include <sys/syscall.h>
> > > > +#include <sys/uio.h>
> > > > +#include <unistd.h>
> > > > +
> > > > +/* These may not yet be available in the uAPI so define if not. */
> > > > +
> > > > +#ifndef MADV_GUARD_POISON
> > > > +#define MADV_GUARD_POISON	102
> > > > +#endif
> > > > +
> > > > +#ifndef MADV_GUARD_UNPOISON
> > > > +#define MADV_GUARD_UNPOISON	103
> > > > +#endif
> > > > +
> > > > +volatile bool signal_jump_set;
> > >
> > > Can you add a comment about why volatile is needed.
> >
> > I'm not sure it's really necessary, it's completely standard to do this
> > with signal handling and is one of the exceptions to the 'volatile
> > considered harmful' rule.
> >
> > > By the way did you happen to run checkpatck on this. There are
> > > several instances where single statement blocks with braces {}
> > >
> > > I noticed a few and ran checkpatch on your patch. There are
> > > 45 warnings regarding codeing style.
> > >
> > > Please run checkpatch and clean them up so we can avoid followup
> > > checkpatch cleanup patches.
> >
> > No sorry I won't, checkpatch isn't infallible and series trying to 'clean
> > up' things that aren't issues will be a waste of everybody's time.
> >
>
> Sorry - this violates the coding styles and makes it hard to read.
>
> See process/coding-style.rst:
>
> Do not unnecessarily use braces where a single statement will do.
>
> .. code-block:: c
>
>         if (condition)
>                 action();
>
> and
>
> .. code-block:: c
>
>         if (condition)
>                 do_this();
>         else
>                 do_that();
>
> This does not apply if only one branch of a conditional statement is a single
> statement; in the latter case use braces in both branches:
>
> .. code-block:: c
>
>         if (condition) {
>                 do_this();
>                 do_that();
>         } else {
>                 otherwise();
>         }
>
> Also, use braces when a loop contains more than a single simple statement:
>
> .. code-block:: c
>
>         while (condition) {
>                 if (test)
>                         do_something();
>         }
>
> thanks,
> -- Shuah

Shuah, quoting coding standards to an experienced kernel developer
(maintainer now) is maybe not the best way to engage here + it may have
been more productive for you to first engage on why it is I'm deviating
here.

Firstly, as I said, the code _does not compile_ if I do not use braces in
many cases. This is probably an issue with the macros, but it is out of
scope for this series for me to fix that.

'Fixing' these cases results in:

  CC       guard-pages
guard-pages.c: In function ‘guard_pages_split_merge’:
guard-pages.c:566:17: error: ‘else’ without a previous ‘if’
  566 |                 else
      |                 ^~~~
guard-pages.c: In function ‘guard_pages_dontneed’:
guard-pages.c:666:17: error: ‘else’ without a previous ‘if’
  666 |                 else
      |                 ^~~~
guard-pages.c: In function ‘guard_pages_fork’:
guard-pages.c:957:17: error: ‘else’ without a previous ‘if’
  957 |                 else
      |                 ^~~~
guard-pages.c: In function ‘guard_pages_fork_wipeonfork’:
guard-pages.c:1010:17: error: ‘else’ without a previous ‘if’
 1010 |                 else
      |                 ^~~~

In other cases I am simply not a fan of single line loops where there is a
lot of compound stuff going on:

	for (i = 0; i < 10; i++) {
		ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
	}

vs.

	for (i = 0; i < 10; i++)
		ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));

When there are very many loops like this. This is kind of a test-specific
thing, you'd maybe put more effort into splitting this up + have less
repetition in non-test code.

I'm not going to die on the hill of single-line-for-loops though, so if you
insist I'll change those.

However I simply _cannot_ change the if/else blocks that cause compilation
errors.

This is what I mean about checkpatch being fallible. It's also fallible in
other cases, like variable declarations via macro (understandably).

Expecting checkpatch to give zero warnings is simply unattainable,
unfortunately.

As you seem adamant about strict adherence to checkpatch, and I always try
to be accommodating where I can be, I suggest I fix everything _except
where it breaks the compilation_ does that work for you?

Thanks.
Vlastimil Babka Oct. 18, 2024, 4:10 p.m. UTC | #5
+CC linux-api (also should on future revisions)

On 10/17/24 22:42, Lorenzo Stoakes wrote:
> Userland library functions such as allocators and threading implementations
> often require regions of memory to act as 'guard pages' - mappings which,
> when accessed, result in a fatal signal being sent to the accessing
> process.
> 
> The current means by which these are implemented is via a PROT_NONE mmap()
> mapping, which provides the required semantics however incur an overhead of
> a VMA for each such region.
> 
> With a great many processes and threads, this can rapidly add up and incur
> a significant memory penalty. It also has the added problem of preventing
> merges that might otherwise be permitted.
> 
> This series takes a different approach - an idea suggested by Vlasimil
> Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> provenance becomes a little tricky to ascertain after this - please forgive
> any omissions!)  - rather than locating the guard pages at the VMA layer,
> instead placing them in page tables mapping the required ranges.
> 
> Early testing of the prototype version of this code suggests a 5 times
> speed up in memory mapping invocations (in conjunction with use of
> process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> system and unoptimised code.
> 
> We expect with optimisation and a loaded system with a larger number of
> guard pages this could significantly increase, but in any case these
> numbers are encouraging.
> 
> This way, rather than having separate VMAs specifying which parts of a
> range are guard pages, instead we have a VMA spanning the entire range of
> memory a user is permitted to access and including ranges which are to be
> 'guarded'.
> 
> After mapping this, a user can specify which parts of the range should
> result in a fatal signal when accessed.
> 
> By restricting the ability to specify guard pages to memory mapped by
> existing VMAs, we can rely on the mappings being torn down when the
> mappings are ultimately unmapped and everything works simply as if the
> memory were not faulted in, from the point of view of the containing VMAs.
> 
> This mechanism in effect poisons memory ranges similar to hardware memory
> poisoning, only it is an entirely software-controlled form of poisoning.
> 
> Any poisoned region of memory is also able to 'unpoisoned', that is, to
> have its poison markers removed.
> 
> The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
> which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
> poisoning.
> 
> Poisoning can be performed across multiple VMAs and any existing mappings
> will be cleared, that is zapped, before installing the poisoned page table
> mappings.
> 
> There is no concept of 'nested' poisoning, multiple attempts to poison a
> range will, after the first poisoning, have no effect.
> 
> Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> memory, so a user can safely unpoison a range of memory and clear only
> poison page table mappings leaving the rest intact.
> 
> The actual mechanism by which the page table entries are specified makes
> use of existing logic - PTE markers, which are used for the userfaultfd
> UFFDIO_POISON mechanism.
> 
> Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> logic to handle it.
> 
> We also extend the generic page walk mechanism to allow for installation of
> PTEs (carefully restricted to memory management logic only to prevent
> unwanted abuse).
> 
> We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> specified for a VMA which implies a total removal of memory
> characteristics).
> 
> It's important to note that the guard page implementation is emphatically
> NOT a security feature, so a user can remove the poisoning if they wish. We
> simply implement it in such a way as to provide the least surprising
> behaviour.
> 
> An extensive set of self-tests are provided which ensure behaviour is as
> expected and additionally self-documents expected behaviour of poisoned
> ranges.
> 
> Suggested-by: Vlastimil Babka <vbabka@suze.cz>

Please fix the domain typo (also in patch 3 :)

Thanks for implementing this,
Vlastimil

> Suggested-by: Jann Horn <jannh@google.com>
> Suggested-by: David Hildenbrand <david@redhat.com>
> 
> v1
> * Un-RFC'd as appears no major objections to approach but rather debate on
>   implementation.
> * Fixed issue with arches which need mmu_context.h and
>   tlbfush.h. header imports in pagewalker logic to be able to use
>   update_mmu_cache() as reported by the kernel test bot.
> * Added comments in page walker logic to clarify who can use
>   ops->install_pte and why as well as adding a check_ops_valid() helper
>   function, as suggested by Christoph.
> * Pass false in full parameter in pte_clear_not_present_full() as suggested
>   by Jann.
> * Stopped erroneously requiring a write lock for the poison operation as
>   suggested by Jann and Suren.
> * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
>   consistent with how this is used elsewhere in the kernel as suggested by
>   Jann.
> * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
>   and duck out if a fatal signal is pending or a conditional reschedule is
>   needed, as suggested by Jann.
> * Avoid needlessly splitting huge PUDs and PMDs by specifying
>   ACTION_CONTINUE, as suggested by Jann.
> 
> RFC
> https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
> 
> Lorenzo Stoakes (4):
>   mm: pagewalk: add the ability to install PTEs
>   mm: add PTE_MARKER_GUARD PTE marker
>   mm: madvise: implement lightweight guard page mechanism
>   selftests/mm: add self tests for guard page feature
> 
>  arch/alpha/include/uapi/asm/mman.h       |    3 +
>  arch/mips/include/uapi/asm/mman.h        |    3 +
>  arch/parisc/include/uapi/asm/mman.h      |    3 +
>  arch/xtensa/include/uapi/asm/mman.h      |    3 +
>  include/linux/mm_inline.h                |    2 +-
>  include/linux/pagewalk.h                 |   18 +-
>  include/linux/swapops.h                  |   26 +-
>  include/uapi/asm-generic/mman-common.h   |    3 +
>  mm/hugetlb.c                             |    3 +
>  mm/internal.h                            |    6 +
>  mm/madvise.c                             |  168 ++++
>  mm/memory.c                              |   18 +-
>  mm/mprotect.c                            |    3 +-
>  mm/mseal.c                               |    1 +
>  mm/pagewalk.c                            |  200 ++--
>  tools/testing/selftests/mm/.gitignore    |    1 +
>  tools/testing/selftests/mm/Makefile      |    1 +
>  tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
>  18 files changed, 1564 insertions(+), 66 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/guard-pages.c
> 
> --
> 2.46.2
Lorenzo Stoakes Oct. 18, 2024, 4:17 p.m. UTC | #6
On Fri, Oct 18, 2024 at 06:10:37PM +0200, Vlastimil Babka wrote:
> +CC linux-api (also should on future revisions)
>

They're cc'd :) assuming Linux API <linux-api@vger.kernel.org> is correct
right?

> On 10/17/24 22:42, Lorenzo Stoakes wrote:
> > Userland library functions such as allocators and threading implementations
> > often require regions of memory to act as 'guard pages' - mappings which,
> > when accessed, result in a fatal signal being sent to the accessing
> > process.
> >
> > The current means by which these are implemented is via a PROT_NONE mmap()
> > mapping, which provides the required semantics however incur an overhead of
> > a VMA for each such region.
> >
> > With a great many processes and threads, this can rapidly add up and incur
> > a significant memory penalty. It also has the added problem of preventing
> > merges that might otherwise be permitted.
> >
> > This series takes a different approach - an idea suggested by Vlasimil
> > Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> > provenance becomes a little tricky to ascertain after this - please forgive
> > any omissions!)  - rather than locating the guard pages at the VMA layer,
> > instead placing them in page tables mapping the required ranges.
> >
> > Early testing of the prototype version of this code suggests a 5 times
> > speed up in memory mapping invocations (in conjunction with use of
> > process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> > system and unoptimised code.
> >
> > We expect with optimisation and a loaded system with a larger number of
> > guard pages this could significantly increase, but in any case these
> > numbers are encouraging.
> >
> > This way, rather than having separate VMAs specifying which parts of a
> > range are guard pages, instead we have a VMA spanning the entire range of
> > memory a user is permitted to access and including ranges which are to be
> > 'guarded'.
> >
> > After mapping this, a user can specify which parts of the range should
> > result in a fatal signal when accessed.
> >
> > By restricting the ability to specify guard pages to memory mapped by
> > existing VMAs, we can rely on the mappings being torn down when the
> > mappings are ultimately unmapped and everything works simply as if the
> > memory were not faulted in, from the point of view of the containing VMAs.
> >
> > This mechanism in effect poisons memory ranges similar to hardware memory
> > poisoning, only it is an entirely software-controlled form of poisoning.
> >
> > Any poisoned region of memory is also able to 'unpoisoned', that is, to
> > have its poison markers removed.
> >
> > The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
> > which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
> > poisoning.
> >
> > Poisoning can be performed across multiple VMAs and any existing mappings
> > will be cleared, that is zapped, before installing the poisoned page table
> > mappings.
> >
> > There is no concept of 'nested' poisoning, multiple attempts to poison a
> > range will, after the first poisoning, have no effect.
> >
> > Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> > memory, so a user can safely unpoison a range of memory and clear only
> > poison page table mappings leaving the rest intact.
> >
> > The actual mechanism by which the page table entries are specified makes
> > use of existing logic - PTE markers, which are used for the userfaultfd
> > UFFDIO_POISON mechanism.
> >
> > Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> > mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> > handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> > logic to handle it.
> >
> > We also extend the generic page walk mechanism to allow for installation of
> > PTEs (carefully restricted to memory management logic only to prevent
> > unwanted abuse).
> >
> > We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> > remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> > specified for a VMA which implies a total removal of memory
> > characteristics).
> >
> > It's important to note that the guard page implementation is emphatically
> > NOT a security feature, so a user can remove the poisoning if they wish. We
> > simply implement it in such a way as to provide the least surprising
> > behaviour.
> >
> > An extensive set of self-tests are provided which ensure behaviour is as
> > expected and additionally self-documents expected behaviour of poisoned
> > ranges.
> >
> > Suggested-by: Vlastimil Babka <vbabka@suze.cz>
>
> Please fix the domain typo (also in patch 3 :)
>

Damnnn it! I can't believe I left that in. Sorry about that! Will fix on
respin.

Hopefully not to suse.cs ;)

> Thanks for implementing this,
> Vlastimil

Thanks!

>
> > Suggested-by: Jann Horn <jannh@google.com>
> > Suggested-by: David Hildenbrand <david@redhat.com>
> >
> > v1
> > * Un-RFC'd as appears no major objections to approach but rather debate on
> >   implementation.
> > * Fixed issue with arches which need mmu_context.h and
> >   tlbfush.h. header imports in pagewalker logic to be able to use
> >   update_mmu_cache() as reported by the kernel test bot.
> > * Added comments in page walker logic to clarify who can use
> >   ops->install_pte and why as well as adding a check_ops_valid() helper
> >   function, as suggested by Christoph.
> > * Pass false in full parameter in pte_clear_not_present_full() as suggested
> >   by Jann.
> > * Stopped erroneously requiring a write lock for the poison operation as
> >   suggested by Jann and Suren.
> > * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
> >   consistent with how this is used elsewhere in the kernel as suggested by
> >   Jann.
> > * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
> >   and duck out if a fatal signal is pending or a conditional reschedule is
> >   needed, as suggested by Jann.
> > * Avoid needlessly splitting huge PUDs and PMDs by specifying
> >   ACTION_CONTINUE, as suggested by Jann.
> >
> > RFC
> > https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
> >
> > Lorenzo Stoakes (4):
> >   mm: pagewalk: add the ability to install PTEs
> >   mm: add PTE_MARKER_GUARD PTE marker
> >   mm: madvise: implement lightweight guard page mechanism
> >   selftests/mm: add self tests for guard page feature
> >
> >  arch/alpha/include/uapi/asm/mman.h       |    3 +
> >  arch/mips/include/uapi/asm/mman.h        |    3 +
> >  arch/parisc/include/uapi/asm/mman.h      |    3 +
> >  arch/xtensa/include/uapi/asm/mman.h      |    3 +
> >  include/linux/mm_inline.h                |    2 +-
> >  include/linux/pagewalk.h                 |   18 +-
> >  include/linux/swapops.h                  |   26 +-
> >  include/uapi/asm-generic/mman-common.h   |    3 +
> >  mm/hugetlb.c                             |    3 +
> >  mm/internal.h                            |    6 +
> >  mm/madvise.c                             |  168 ++++
> >  mm/memory.c                              |   18 +-
> >  mm/mprotect.c                            |    3 +-
> >  mm/mseal.c                               |    1 +
> >  mm/pagewalk.c                            |  200 ++--
> >  tools/testing/selftests/mm/.gitignore    |    1 +
> >  tools/testing/selftests/mm/Makefile      |    1 +
> >  tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> >  18 files changed, 1564 insertions(+), 66 deletions(-)
> >  create mode 100644 tools/testing/selftests/mm/guard-pages.c
> >
> > --
> > 2.46.2
>
Lorenzo Stoakes Oct. 18, 2024, 4:22 p.m. UTC | #7
On Fri, Oct 18, 2024 at 05:07:20PM +0100, Lorenzo Stoakes wrote:
[snip]
> Firstly, as I said, the code _does not compile_ if I do not use braces in
> many cases. This is probably an issue with the macros, but it is out of
> scope for this series for me to fix that.
>
> 'Fixing' these cases results in:
>
>   CC       guard-pages
> guard-pages.c: In function ‘guard_pages_split_merge’:
> guard-pages.c:566:17: error: ‘else’ without a previous ‘if’
>   566 |                 else
>       |                 ^~~~
> guard-pages.c: In function ‘guard_pages_dontneed’:
> guard-pages.c:666:17: error: ‘else’ without a previous ‘if’
>   666 |                 else
>       |                 ^~~~
> guard-pages.c: In function ‘guard_pages_fork’:
> guard-pages.c:957:17: error: ‘else’ without a previous ‘if’
>   957 |                 else
>       |                 ^~~~
> guard-pages.c: In function ‘guard_pages_fork_wipeonfork’:
> guard-pages.c:1010:17: error: ‘else’ without a previous ‘if’
>  1010 |                 else
>       |                 ^~~~

[snip]

An added note - the fact this happens makes the macros suspect everywhere,
and I am concerned single-lining them might break or cause failures not to
propagate perhaps, which led to me _never_ single-lining them I believe and
accounts for a bunch of the warnings.

I can go through and manually test each one to make sure before I
single-line them if necessary.
Shuah Khan Oct. 18, 2024, 4:24 p.m. UTC | #8
On 10/18/24 10:07, Lorenzo Stoakes wrote:
> On Fri, Oct 18, 2024 at 09:32:17AM -0600, Shuah Khan wrote:
>> On 10/18/24 01:12, Lorenzo Stoakes wrote:
>>> On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
>>>> On 10/17/24 14:42, Lorenzo Stoakes wrote:
>>>>> Utilise the kselftest harmness to implement tests for the guard page
>>>>
>>>> Splleing NIT - harmness -> harness
>>>>
>>>>> implementation.
>>>>>
>>>>> We start by implement basic tests asserting that guard pages can be
>>>>
>>>> implmenting? By the way checkpatch will catch spelling stuuf.
>>>> Please see comments about warnings below.
>>>
>>> Thanks. The majority of the checkpatch warnings are invalid so I missed
>>> this. Will fix on respin.
>>>
>>>>
>>>>> established (poisoned), cleared (remedied) and that touching poisoned pages
>>>>> result in SIGSEGV. We also assert that, in remedying a range, non-poison
>>>>> pages remain intact.
>>>>>
>>>>> We then examine different operations on regions containing poison markers
>>>>> behave to ensure correct behaviour:
>>>>>
>>>>> * Operations over multiple VMAs operate as expected.
>>>>> * Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in
>>>>>      batches works correctly.
>>>>> * Ensuring that munmap() correctly tears down poison markers.
>>>>> * Using mprotect() to adjust protection bits does not in any way override
>>>>>      or cause issues with poison markers.
>>>>> * Ensuring that splitting and merging VMAs around poison markers causes no
>>>>>      issue - i.e. that a marker which 'belongs' to one VMA can function just
>>>>>      as well 'belonging' to another.
>>>>> * Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
>>>>> * Ensuring that mlock()'ing a range containing poison markers does not
>>>>>      cause issues.
>>>>> * Ensuring that mremap() can move a poisoned range and retain poison
>>>>>      markers.
>>>>> * Ensuring that mremap() can expand a poisoned range and retain poison
>>>>>      markers (perhaps moving the range).
>>>>> * Ensuring that mremap() can shrink a poisoned range and retain poison
>>>>>      markers.
>>>>> * Ensuring that forking a process correctly retains poison markers.
>>>>> * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
>>>>> * Ensuring that lazyfree simply clears poison markers.
>>>>> * Ensuring that userfaultfd can co-exist with guard pages.
>>>>> * Ensuring that madvise(..., MADV_POPULATE_READ) and
>>>>>      madvise(..., MADV_POPULATE_WRITE) error out when encountering
>>>>>      poison markers.
>>>>> * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do
>>>>>      not remove poison markers.
>>>>
>>>> Good summary of test. Does the test require root access?
>>>> If so does it check and skip appropriately?
>>>
>>> Thanks and some do, in those cases we skip.
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>> ---
>>>>>     tools/testing/selftests/mm/.gitignore    |    1 +
>>>>>     tools/testing/selftests/mm/Makefile      |    1 +
>>>>>     tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
>>>>>     3 files changed, 1170 insertions(+)
>>>>>     create mode 100644 tools/testing/selftests/mm/guard-pages.c
>>>>>
>>>>> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
>>>>> index 689bbd520296..8f01f4da1c0d 100644
>>>>> --- a/tools/testing/selftests/mm/.gitignore
>>>>> +++ b/tools/testing/selftests/mm/.gitignore
>>>>> @@ -54,3 +54,4 @@ droppable
>>>>>     hugetlb_dio
>>>>>     pkey_sighandler_tests_32
>>>>>     pkey_sighandler_tests_64
>>>>> +guard-pages
>>>>> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
>>>>> index 02e1204971b0..15c734d6cfec 100644
>>>>> --- a/tools/testing/selftests/mm/Makefile
>>>>> +++ b/tools/testing/selftests/mm/Makefile
>>>>> @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
>>>>>     TEST_GEN_FILES += hugetlb_madv_vs_map
>>>>>     TEST_GEN_FILES += hugetlb_dio
>>>>>     TEST_GEN_FILES += droppable
>>>>> +TEST_GEN_FILES += guard-pages
>>>>>     ifneq ($(ARCH),arm64)
>>>>>     TEST_GEN_FILES += soft-dirty
>>>>> diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
>>>>> new file mode 100644
>>>>> index 000000000000..2ab0ff3ba5a0
>>>>> --- /dev/null
>>>>> +++ b/tools/testing/selftests/mm/guard-pages.c
>>>>> @@ -0,0 +1,1168 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0-or-later
>>>>> +
>>>>> +#define _GNU_SOURCE
>>>>> +#include "../kselftest_harness.h"
>>>>> +#include <assert.h>
>>>>> +#include <fcntl.h>
>>>>> +#include <setjmp.h>
>>>>> +#include <errno.h>
>>>>> +#include <linux/userfaultfd.h>
>>>>> +#include <signal.h>
>>>>> +#include <stdbool.h>
>>>>> +#include <stdio.h>
>>>>> +#include <stdlib.h>
>>>>> +#include <string.h>
>>>>> +#include <sys/ioctl.h>
>>>>> +#include <sys/mman.h>
>>>>> +#include <sys/syscall.h>
>>>>> +#include <sys/uio.h>
>>>>> +#include <unistd.h>
>>>>> +
>>>>> +/* These may not yet be available in the uAPI so define if not. */
>>>>> +
>>>>> +#ifndef MADV_GUARD_POISON
>>>>> +#define MADV_GUARD_POISON	102
>>>>> +#endif
>>>>> +
>>>>> +#ifndef MADV_GUARD_UNPOISON
>>>>> +#define MADV_GUARD_UNPOISON	103
>>>>> +#endif
>>>>> +
>>>>> +volatile bool signal_jump_set;
>>>>
>>>> Can you add a comment about why volatile is needed.
>>>
>>> I'm not sure it's really necessary, it's completely standard to do this
>>> with signal handling and is one of the exceptions to the 'volatile
>>> considered harmful' rule.
>>>
>>>> By the way did you happen to run checkpatck on this. There are
>>>> several instances where single statement blocks with braces {}
>>>>
>>>> I noticed a few and ran checkpatch on your patch. There are
>>>> 45 warnings regarding codeing style.
>>>>
>>>> Please run checkpatch and clean them up so we can avoid followup
>>>> checkpatch cleanup patches.
>>>
>>> No sorry I won't, checkpatch isn't infallible and series trying to 'clean
>>> up' things that aren't issues will be a waste of everybody's time.
>>>
>>
>> Sorry - this violates the coding styles and makes it hard to read.
>>
>> See process/coding-style.rst:
>>
>> Do not unnecessarily use braces where a single statement will do.
>>
>> .. code-block:: c
>>
>>          if (condition)
>>                  action();
>>
>> and
>>
>> .. code-block:: c
>>
>>          if (condition)
>>                  do_this();
>>          else
>>                  do_that();
>>
>> This does not apply if only one branch of a conditional statement is a single
>> statement; in the latter case use braces in both branches:
>>
>> .. code-block:: c
>>
>>          if (condition) {
>>                  do_this();
>>                  do_that();
>>          } else {
>>                  otherwise();
>>          }
>>
>> Also, use braces when a loop contains more than a single simple statement:
>>
>> .. code-block:: c
>>
>>          while (condition) {
>>                  if (test)
>>                          do_something();
>>          }
>>
>> thanks,
>> -- Shuah
> 
> Shuah, quoting coding standards to an experienced kernel developer
> (maintainer now) is maybe not the best way to engage here + it may have
> been more productive for you to first engage on why it is I'm deviating
> here.
> 
> Firstly, as I said, the code _does not compile_ if I do not use braces in
> many cases. This is probably an issue with the macros, but it is out of
> scope for this series for me to fix that.

I am not trying to throw the book at you. When I see 45 of
them I have to ask the reasons why.

> 
> 'Fixing' these cases results in:
> 
>    CC       guard-pages
> guard-pages.c: In function ‘guard_pages_split_merge’:
> guard-pages.c:566:17: error: ‘else’ without a previous ‘if’
>    566 |                 else
>        |                 ^~~~
> guard-pages.c: In function ‘guard_pages_dontneed’:
> guard-pages.c:666:17: error: ‘else’ without a previous ‘if’
>    666 |                 else
>        |                 ^~~~
> guard-pages.c: In function ‘guard_pages_fork’:
> guard-pages.c:957:17: error: ‘else’ without a previous ‘if’
>    957 |                 else
>        |                 ^~~~
> guard-pages.c: In function ‘guard_pages_fork_wipeonfork’:
> guard-pages.c:1010:17: error: ‘else’ without a previous ‘if’
>   1010 |                 else
>        |                 ^~~~
> 
> In other cases I am simply not a fan of single line loops where there is a
> lot of compound stuff going on:
> 
> 	for (i = 0; i < 10; i++) {
> 		ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
> 	}
> 
> vs.
> 
> 	for (i = 0; i < 10; i++)
> 		ASSERT_FALSE(try_read_write_buf(&ptr1[i * page_size]));
> 
> When there are very many loops like this. This is kind of a test-specific
> thing, you'd maybe put more effort into splitting this up + have less
> repetition in non-test code.
> 
> I'm not going to die on the hill of single-line-for-loops though, so if you
> insist I'll change those.
> 
> However I simply _cannot_ change the if/else blocks that cause compilation
> errors.
> 
> This is what I mean about checkpatch being fallible. It's also fallible in
> other cases, like variable declarations via macro (understandably).
> 
> Expecting checkpatch to give zero warnings is simply unattainable,
> unfortunately.
> 
> As you seem adamant about strict adherence to checkpatch, and I always try
> to be accommodating where I can be, I suggest I fix everything _except
> where it breaks the compilation_ does that work for you?

Yes Please. If you leave these in here as soon as the patch hits
next, we start seeing a flood of patches. It becomes harder to fix
these later due to merge conflicts.

It becomes a patch overload issue. Yes I would like to see the ones
that don't result in compile errors fixed.

thanks,
-- Shuah

> 
> Thanks.
Shuah Khan Oct. 18, 2024, 4:25 p.m. UTC | #9
On 10/18/24 10:07, Lorenzo Stoakes wrote:
> On Fri, Oct 18, 2024 at 09:32:17AM -0600, Shuah Khan wrote:
>> On 10/18/24 01:12, Lorenzo Stoakes wrote:
>>> On Thu, Oct 17, 2024 at 03:24:49PM -0600, Shuah Khan wrote:
>>>> On 10/17/24 14:42, Lorenzo Stoakes wrote:
>>>>> Utilise the kselftest harmness to implement tests for the guard page
>>>>
>>>> Splleing NIT - harmness -> harness
>>>>
>>>>> implementation.
>>>>>
>>>>> We start by implement basic tests asserting that guard pages can be
>>>>
>>>> implmenting? By the way checkpatch will catch spelling stuuf.
>>>> Please see comments about warnings below.
>>>
>>> Thanks. The majority of the checkpatch warnings are invalid so I missed
>>> this. Will fix on respin.
>>>
>>>>
>>>>> established (poisoned), cleared (remedied) and that touching poisoned pages
>>>>> result in SIGSEGV. We also assert that, in remedying a range, non-poison
>>>>> pages remain intact.
>>>>>
>>>>> We then examine different operations on regions containing poison markers
>>>>> behave to ensure correct behaviour:
>>>>>
>>>>> * Operations over multiple VMAs operate as expected.
>>>>> * Invoking MADV_GUARD_POISION / MADV_GUARD_REMEDY via process_madvise() in
>>>>>      batches works correctly.
>>>>> * Ensuring that munmap() correctly tears down poison markers.
>>>>> * Using mprotect() to adjust protection bits does not in any way override
>>>>>      or cause issues with poison markers.
>>>>> * Ensuring that splitting and merging VMAs around poison markers causes no
>>>>>      issue - i.e. that a marker which 'belongs' to one VMA can function just
>>>>>      as well 'belonging' to another.
>>>>> * Ensuring that madvise(..., MADV_DONTNEED) does not remove poison markers.
>>>>> * Ensuring that mlock()'ing a range containing poison markers does not
>>>>>      cause issues.
>>>>> * Ensuring that mremap() can move a poisoned range and retain poison
>>>>>      markers.
>>>>> * Ensuring that mremap() can expand a poisoned range and retain poison
>>>>>      markers (perhaps moving the range).
>>>>> * Ensuring that mremap() can shrink a poisoned range and retain poison
>>>>>      markers.
>>>>> * Ensuring that forking a process correctly retains poison markers.
>>>>> * Ensuring that forking a VMA with VM_WIPEONFORK set behaves sanely.
>>>>> * Ensuring that lazyfree simply clears poison markers.
>>>>> * Ensuring that userfaultfd can co-exist with guard pages.
>>>>> * Ensuring that madvise(..., MADV_POPULATE_READ) and
>>>>>      madvise(..., MADV_POPULATE_WRITE) error out when encountering
>>>>>      poison markers.
>>>>> * Ensuring that madvise(..., MADV_COLD) and madvise(..., MADV_PAGEOUT) do
>>>>>      not remove poison markers.
>>>>
>>>> Good summary of test. Does the test require root access?
>>>> If so does it check and skip appropriately?
>>>
>>> Thanks and some do, in those cases we skip.
>>>
>>>>
>>>>>
>>>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>>> ---
>>>>>     tools/testing/selftests/mm/.gitignore    |    1 +
>>>>>     tools/testing/selftests/mm/Makefile      |    1 +
>>>>>     tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
>>>>>     3 files changed, 1170 insertions(+)
>>>>>     create mode 100644 tools/testing/selftests/mm/guard-pages.c
>>>>>
>>>>> diff --git a/tools/testing/selftests/mm/.gitignore b/tools/testing/selftests/mm/.gitignore
>>>>> index 689bbd520296..8f01f4da1c0d 100644
>>>>> --- a/tools/testing/selftests/mm/.gitignore
>>>>> +++ b/tools/testing/selftests/mm/.gitignore
>>>>> @@ -54,3 +54,4 @@ droppable
>>>>>     hugetlb_dio
>>>>>     pkey_sighandler_tests_32
>>>>>     pkey_sighandler_tests_64
>>>>> +guard-pages
>>>>> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
>>>>> index 02e1204971b0..15c734d6cfec 100644
>>>>> --- a/tools/testing/selftests/mm/Makefile
>>>>> +++ b/tools/testing/selftests/mm/Makefile
>>>>> @@ -79,6 +79,7 @@ TEST_GEN_FILES += hugetlb_fault_after_madv
>>>>>     TEST_GEN_FILES += hugetlb_madv_vs_map
>>>>>     TEST_GEN_FILES += hugetlb_dio
>>>>>     TEST_GEN_FILES += droppable
>>>>> +TEST_GEN_FILES += guard-pages
>>>>>     ifneq ($(ARCH),arm64)
>>>>>     TEST_GEN_FILES += soft-dirty
>>>>> diff --git a/tools/testing/selftests/mm/guard-pages.c b/tools/testing/selftests/mm/guard-pages.c
>>>>> new file mode 100644
>>>>> index 000000000000..2ab0ff3ba5a0
>>>>> --- /dev/null
>>>>> +++ b/tools/testing/selftests/mm/guard-pages.c
>>>>> @@ -0,0 +1,1168 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0-or-later
>>>>> +
>>>>> +#define _GNU_SOURCE
>>>>> +#include "../kselftest_harness.h"
>>>>> +#include <assert.h>
>>>>> +#include <fcntl.h>
>>>>> +#include <setjmp.h>
>>>>> +#include <errno.h>
>>>>> +#include <linux/userfaultfd.h>
>>>>> +#include <signal.h>
>>>>> +#include <stdbool.h>
>>>>> +#include <stdio.h>
>>>>> +#include <stdlib.h>
>>>>> +#include <string.h>
>>>>> +#include <sys/ioctl.h>
>>>>> +#include <sys/mman.h>
>>>>> +#include <sys/syscall.h>
>>>>> +#include <sys/uio.h>
>>>>> +#include <unistd.h>
>>>>> +
>>>>> +/* These may not yet be available in the uAPI so define if not. */
>>>>> +
>>>>> +#ifndef MADV_GUARD_POISON
>>>>> +#define MADV_GUARD_POISON	102
>>>>> +#endif
>>>>> +
>>>>> +#ifndef MADV_GUARD_UNPOISON
>>>>> +#define MADV_GUARD_UNPOISON	103
>>>>> +#endif
>>>>> +
>>>>> +volatile bool signal_jump_set;
>>>>
>>>> Can you add a comment about why volatile is needed.
>>>
>>> I'm not sure it's really necessary, it's completely standard to do this
>>> with signal handling and is one of the exceptions to the 'volatile
>>> considered harmful' rule.
>>>
>>>> By the way did you happen to run checkpatck on this. There are
>>>> several instances where single statement blocks with braces {}
>>>>
>>>> I noticed a few and ran checkpatch on your patch. There are
>>>> 45 warnings regarding codeing style.
>>>>
>>>> Please run checkpatch and clean them up so we can avoid followup
>>>> checkpatch cleanup patches.
>>>
>>> No sorry I won't, checkpatch isn't infallible and series trying to 'clean
>>> up' things that aren't issues will be a waste of everybody's time.
>>>
>>
>> Sorry - this violates the coding styles and makes it hard to read.
>>
>> See process/coding-style.rst:
>>
>> Do not unnecessarily use braces where a single statement will do.
>>
>> .. code-block:: c
>>
>>          if (condition)
>>                  action();
>>
>> and
>>
>> .. code-block:: c
>>
>>          if (condition)
>>                  do_this();
>>          else
>>                  do_that();
>>
>> This does not apply if only one branch of a conditional statement is a single
>> statement; in the latter case use braces in both branches:
>>
>> .. code-block:: c
>>
>>          if (condition) {
>>                  do_this();
>>                  do_that();
>>          } else {
>>                  otherwise();
>>          }
>>
>> Also, use braces when a loop contains more than a single simple statement:
>>
>> .. code-block:: c
>>
>>          while (condition) {
>>                  if (test)
>>                          do_something();
>>          }
>>
>> thanks,
>> -- Shuah
> 
> Shuah, quoting coding standards to an experienced kernel developer
> (maintainer now) is maybe not the best way to engage here + it may have
> been more productive for you to first engage on why it is I'm deviating
> here.
> 

This is not the only comment I gave you in this patch and your
other patches.

thanks,
-- Shuah
Lorenzo Stoakes Oct. 18, 2024, 4:41 p.m. UTC | #10
On Fri, Oct 18, 2024 at 10:25:55AM -0600, Shuah Khan wrote:
[snip]
> > > Sorry - this violates the coding styles and makes it hard to read.
> > >
> > > See process/coding-style.rst:
> > >
> > > Do not unnecessarily use braces where a single statement will do.
> > >
> > > .. code-block:: c
> > >
> > >          if (condition)
> > >                  action();
> > >
> > > and
> > >
> > > .. code-block:: c
> > >
> > >          if (condition)
> > >                  do_this();
> > >          else
> > >                  do_that();
> > >
> > > This does not apply if only one branch of a conditional statement is a single
> > > statement; in the latter case use braces in both branches:
> > >
> > > .. code-block:: c
> > >
> > >          if (condition) {
> > >                  do_this();
> > >                  do_that();
> > >          } else {
> > >                  otherwise();
> > >          }
> > >
> > > Also, use braces when a loop contains more than a single simple statement:
> > >
> > > .. code-block:: c
> > >
> > >          while (condition) {
> > >                  if (test)
> > >                          do_something();
> > >          }
> > >
> > > thanks,
> > > -- Shuah
> >
> > Shuah, quoting coding standards to an experienced kernel developer
> > (maintainer now) is maybe not the best way to engage here + it may have
> > been more productive for you to first engage on why it is I'm deviating
> > here.
> >
>
> This is not the only comment I gave you in this patch and your
> other patches.
>

And to be clear - I absolutely appreciate your feedback and in all cases
except ones which would result in things not compiling have acted (rather
promptly) to address them.

I am simply pointing out that it's not my lack of knowledge of these rules
that's the issue it's 3 things:

1. Some code doesn't compile following these rules
2. I therefore don't trust the macros in single-line statements
3. I am not a fan of for/while loops with heavily compound statements

1 and is unavoidable, 2 maybe is avoidable with auditing and 3 is
subjective, and is something I have now undertaken to change.

I've heard a number of kernel developers' opinions on checkpatch and the
overall consensus has been that, while the core style is sacrosanct, strict
adherence to checkpatch warnings is not.

This may be worth a broader discussion (outside of this series). One must
definitely account for cases where the syntactic analysis fails for
instance so 'always strictly adhere' is out unfortunately (why I say it is
fallible) however perhaps 'strict adherence except where it is obviously
wrong' is a position one could take.

Your have a significantly different view on this and that's fine, as I said
I always try to be accommodating.

Key thing is we've found a way forward which I will act on.

Thanks, Lorenzo
Lorenzo Stoakes Oct. 18, 2024, 9:30 p.m. UTC | #11
On Fri, Oct 18, 2024 at 05:17:56PM +0100, Lorenzo Stoakes wrote:
> On Fri, Oct 18, 2024 at 06:10:37PM +0200, Vlastimil Babka wrote:
> > +CC linux-api (also should on future revisions)
> >
>
> They're cc'd :) assuming Linux API <linux-api@vger.kernel.org> is correct
> right?

As discussed on IRC, no I was being a little slow here and hadn't realised
you'd added them, apologies!

Will add them on future respins, sorry guys :)

>
> > On 10/17/24 22:42, Lorenzo Stoakes wrote:
> > > Userland library functions such as allocators and threading implementations
> > > often require regions of memory to act as 'guard pages' - mappings which,
> > > when accessed, result in a fatal signal being sent to the accessing
> > > process.
> > >
> > > The current means by which these are implemented is via a PROT_NONE mmap()
> > > mapping, which provides the required semantics however incur an overhead of
> > > a VMA for each such region.
> > >
> > > With a great many processes and threads, this can rapidly add up and incur
> > > a significant memory penalty. It also has the added problem of preventing
> > > merges that might otherwise be permitted.
> > >
> > > This series takes a different approach - an idea suggested by Vlasimil
> > > Babka (and before him David Hildenbrand and Jann Horn - perhaps more - the
> > > provenance becomes a little tricky to ascertain after this - please forgive
> > > any omissions!)  - rather than locating the guard pages at the VMA layer,
> > > instead placing them in page tables mapping the required ranges.
> > >
> > > Early testing of the prototype version of this code suggests a 5 times
> > > speed up in memory mapping invocations (in conjunction with use of
> > > process_madvise()) and a 13% reduction in VMAs on an entirely idle android
> > > system and unoptimised code.
> > >
> > > We expect with optimisation and a loaded system with a larger number of
> > > guard pages this could significantly increase, but in any case these
> > > numbers are encouraging.
> > >
> > > This way, rather than having separate VMAs specifying which parts of a
> > > range are guard pages, instead we have a VMA spanning the entire range of
> > > memory a user is permitted to access and including ranges which are to be
> > > 'guarded'.
> > >
> > > After mapping this, a user can specify which parts of the range should
> > > result in a fatal signal when accessed.
> > >
> > > By restricting the ability to specify guard pages to memory mapped by
> > > existing VMAs, we can rely on the mappings being torn down when the
> > > mappings are ultimately unmapped and everything works simply as if the
> > > memory were not faulted in, from the point of view of the containing VMAs.
> > >
> > > This mechanism in effect poisons memory ranges similar to hardware memory
> > > poisoning, only it is an entirely software-controlled form of poisoning.
> > >
> > > Any poisoned region of memory is also able to 'unpoisoned', that is, to
> > > have its poison markers removed.
> > >
> > > The mechanism is implemented via madvise() behaviour - MADV_GUARD_POISON
> > > which simply poisons ranges - and MADV_GUARD_UNPOISON - which clears this
> > > poisoning.
> > >
> > > Poisoning can be performed across multiple VMAs and any existing mappings
> > > will be cleared, that is zapped, before installing the poisoned page table
> > > mappings.
> > >
> > > There is no concept of 'nested' poisoning, multiple attempts to poison a
> > > range will, after the first poisoning, have no effect.
> > >
> > > Importantly, unpoisoning of poisoned ranges has no effect on non-poisoned
> > > memory, so a user can safely unpoison a range of memory and clear only
> > > poison page table mappings leaving the rest intact.
> > >
> > > The actual mechanism by which the page table entries are specified makes
> > > use of existing logic - PTE markers, which are used for the userfaultfd
> > > UFFDIO_POISON mechanism.
> > >
> > > Unfortunately PTE_MARKER_POISONED is not suited for the guard page
> > > mechanism as it results in VM_FAULT_HWPOISON semantics in the fault
> > > handler, so we add our own specific PTE_MARKER_GUARD and adapt existing
> > > logic to handle it.
> > >
> > > We also extend the generic page walk mechanism to allow for installation of
> > > PTEs (carefully restricted to memory management logic only to prevent
> > > unwanted abuse).
> > >
> > > We ensure that zapping performed by, for instance, MADV_DONTNEED, does not
> > > remove guard poison markers, nor does forking (except when VM_WIPEONFORK is
> > > specified for a VMA which implies a total removal of memory
> > > characteristics).
> > >
> > > It's important to note that the guard page implementation is emphatically
> > > NOT a security feature, so a user can remove the poisoning if they wish. We
> > > simply implement it in such a way as to provide the least surprising
> > > behaviour.
> > >
> > > An extensive set of self-tests are provided which ensure behaviour is as
> > > expected and additionally self-documents expected behaviour of poisoned
> > > ranges.
> > >
> > > Suggested-by: Vlastimil Babka <vbabka@suze.cz>
> >
> > Please fix the domain typo (also in patch 3 :)
> >
>
> Damnnn it! I can't believe I left that in. Sorry about that! Will fix on
> respin.
>
> Hopefully not to suse.cs ;)
>
> > Thanks for implementing this,
> > Vlastimil
>
> Thanks!
>
> >
> > > Suggested-by: Jann Horn <jannh@google.com>
> > > Suggested-by: David Hildenbrand <david@redhat.com>
> > >
> > > v1
> > > * Un-RFC'd as appears no major objections to approach but rather debate on
> > >   implementation.
> > > * Fixed issue with arches which need mmu_context.h and
> > >   tlbfush.h. header imports in pagewalker logic to be able to use
> > >   update_mmu_cache() as reported by the kernel test bot.
> > > * Added comments in page walker logic to clarify who can use
> > >   ops->install_pte and why as well as adding a check_ops_valid() helper
> > >   function, as suggested by Christoph.
> > > * Pass false in full parameter in pte_clear_not_present_full() as suggested
> > >   by Jann.
> > > * Stopped erroneously requiring a write lock for the poison operation as
> > >   suggested by Jann and Suren.
> > > * Moved anon_vma_prepare() to the start of madvise_guard_poison() to be
> > >   consistent with how this is used elsewhere in the kernel as suggested by
> > >   Jann.
> > > * Avoid returning -EAGAIN if we are raced on page faults, just keep looping
> > >   and duck out if a fatal signal is pending or a conditional reschedule is
> > >   needed, as suggested by Jann.
> > > * Avoid needlessly splitting huge PUDs and PMDs by specifying
> > >   ACTION_CONTINUE, as suggested by Jann.
> > >
> > > RFC
> > > https://lore.kernel.org/all/cover.1727440966.git.lorenzo.stoakes@oracle.com/
> > >
> > > Lorenzo Stoakes (4):
> > >   mm: pagewalk: add the ability to install PTEs
> > >   mm: add PTE_MARKER_GUARD PTE marker
> > >   mm: madvise: implement lightweight guard page mechanism
> > >   selftests/mm: add self tests for guard page feature
> > >
> > >  arch/alpha/include/uapi/asm/mman.h       |    3 +
> > >  arch/mips/include/uapi/asm/mman.h        |    3 +
> > >  arch/parisc/include/uapi/asm/mman.h      |    3 +
> > >  arch/xtensa/include/uapi/asm/mman.h      |    3 +
> > >  include/linux/mm_inline.h                |    2 +-
> > >  include/linux/pagewalk.h                 |   18 +-
> > >  include/linux/swapops.h                  |   26 +-
> > >  include/uapi/asm-generic/mman-common.h   |    3 +
> > >  mm/hugetlb.c                             |    3 +
> > >  mm/internal.h                            |    6 +
> > >  mm/madvise.c                             |  168 ++++
> > >  mm/memory.c                              |   18 +-
> > >  mm/mprotect.c                            |    3 +-
> > >  mm/mseal.c                               |    1 +
> > >  mm/pagewalk.c                            |  200 ++--
> > >  tools/testing/selftests/mm/.gitignore    |    1 +
> > >  tools/testing/selftests/mm/Makefile      |    1 +
> > >  tools/testing/selftests/mm/guard-pages.c | 1168 ++++++++++++++++++++++
> > >  18 files changed, 1564 insertions(+), 66 deletions(-)
> > >  create mode 100644 tools/testing/selftests/mm/guard-pages.c
> > >
> > > --
> > > 2.46.2
> >