selftests/mm: Introduce a test program to assess swap entry allocation for thp_swapout

Message ID 20240620002648.75204-1-21cnbao@gmail.com
State New

Commit Message

Barry Song June 20, 2024, 12:26 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

Both Ryan and Chris have been utilizing the small test program to aid
in debugging and identifying issues with swap entry allocation. While
a real or intricate workload might be more suitable for assessing the
correctness and effectiveness of the swap allocation policy, a small
test program presents a simpler means of understanding the problem and
initially verifying the improvements being made.

Let's endeavor to integrate it into the self-test suite. Although it
presently only accommodates 64KB and 4KB, I'm optimistic that we can
expand its capabilities to support multiple sizes and simulate more
complex systems in the future as required.

Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 tools/testing/selftests/mm/Makefile           |   1 +
 .../selftests/mm/thp_swap_allocator_test.c    | 192 ++++++++++++++++++
 2 files changed, 193 insertions(+)
 create mode 100644 tools/testing/selftests/mm/thp_swap_allocator_test.c

Comments

Chris Li June 20, 2024, 11:34 p.m. UTC | #1
Hi Barry,

Thanks for the wonderful test program.

I have used other swap test programs as well. A lot of those
tests are harder to set up and run.

This test is very quick and simple to run. It can test some
hard-to-hit corner cases for me.

I am able to reproduce the warning and the kernel oops with this test program.
So for me, it serves as a functional test that my allocator does
not produce a crash.
In that regard, it definitely provides value as a functional test.

Having a fallback percentage output is fine, as long as we don't fail the
test based on performance numbers.
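
For reference, the patch derives the fallback percentage from the deltas of
the swpout and swpout_fallback counters. As written, an iteration in which
neither counter moves divides 0.0 by 0 and yields NaN. A minimal sketch of a
guarded calculation follows; the helper name fallback_percentage is
hypothetical and not part of the posted patch.

```c
/*
 * Hypothetical helper, not part of the posted patch.
 *
 * Percentage of mTHP swap-out attempts that fell back to splitting.
 * Returns 0.0 when neither counter moved, which avoids the 0/0
 * division that would otherwise produce NaN.
 */
static double fallback_percentage(unsigned long swpout_inc,
				  unsigned long swpout_fallback_inc)
{
	unsigned long total = swpout_inc + swpout_fallback_inc;

	if (total == 0)
		return 0.0;

	return (double)swpout_fallback_inc * 100.0 / (double)total;
}
```

The value is meant for reporting only, in line with not failing the test on
performance numbers.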

I am also fine with moving the test under tools/mm etc. I see good
value in including the test in the tree one way or the other.


On Wed, Jun 19, 2024 at 5:27 PM Barry Song <21cnbao@gmail.com> wrote:
>
> From: Barry Song <v-songbaohua@oppo.com>
>
> Both Ryan and Chris have been utilizing the small test program to aid
> in debugging and identifying issues with swap entry allocation. While
> a real or intricate workload might be more suitable for assessing the
> correctness and effectiveness of the swap allocation policy, a small
> test program presents a simpler means of understanding the problem and
> initially verifying the improvements being made.
>
> Let's endeavor to integrate it into the self-test suite. Although it
> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> expand its capabilities to support multiple sizes and simulate more
> complex systems in the future as required.
>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  tools/testing/selftests/mm/Makefile           |   1 +
>  .../selftests/mm/thp_swap_allocator_test.c    | 192 ++++++++++++++++++
>  2 files changed, 193 insertions(+)

Assuming we want to keep it as a selftest:
you did not add your test to run_vmtests.sh.

You might need something like this:

--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -418,6 +418,14 @@ CATEGORY="thp" run_test ./khugepaged -s 2

 CATEGORY="thp" run_test ./transhuge-stress -d 20

+# config and swapon zram here.
+
+CATEGORY="thp" run_test ./thp_swap_allocator_test
+
+CATEGORY="thp" run_test ./thp_swap_allocator_test -s
+
+# swapoff zram here.
+
 # Try to create XFS if not provided
 if [ -z "${SPLIT_HUGE_PAGE_TEST_XFS_PATH}" ]; then
     if test_selected "thp"; then


You can use the following XFS test as an example of how to set up the zram swap.
XFS uses a file system mount; you would use swapon.

You also need to update the usage string in run_vmtests.sh.

BTW, here is how I invoke the test runs:

kselftest_override_timeout=500 make -C tools/testing/selftests TARGETS=mm run_tests

The timeout override is not for this test. It is for an earlier test
that otherwise makes run_vmtests.sh exit before it reaches thp_swap. I am
running in a VM, so it is slower than a native machine.

>  create mode 100644 tools/testing/selftests/mm/thp_swap_allocator_test.c
>
> diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
> index e1aa09ddaa3d..64164ad66835 100644
> --- a/tools/testing/selftests/mm/Makefile
> +++ b/tools/testing/selftests/mm/Makefile
> @@ -65,6 +65,7 @@ TEST_GEN_FILES += mseal_test
>  TEST_GEN_FILES += seal_elf
>  TEST_GEN_FILES += on-fault-limit
>  TEST_GEN_FILES += pagemap_ioctl
> +TEST_GEN_FILES += thp_swap_allocator_test
>  TEST_GEN_FILES += thuge-gen
>  TEST_GEN_FILES += transhuge-stress
>  TEST_GEN_FILES += uffd-stress
> diff --git a/tools/testing/selftests/mm/thp_swap_allocator_test.c b/tools/testing/selftests/mm/thp_swap_allocator_test.c
> new file mode 100644
> index 000000000000..4443a906d0f8
> --- /dev/null
> +++ b/tools/testing/selftests/mm/thp_swap_allocator_test.c
> @@ -0,0 +1,192 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * thp_swap_allocator_test
> + *
> + * The purpose of this test program is helping check if THP swpout
> + * can correctly get swap slots to swap out as a whole instead of
> + * being split. It randomly releases swap entries through madvise
> + * DONTNEED and do swapout on two memory areas: a memory area for
> + * 64KB THP and the other area for small folios. The second memory
> + * can be enabled by "-s".
> + * Before running the program, we need to setup a zRAM or similar
> + * swap device by:
> + *  echo lzo > /sys/block/zram0/comp_algorithm
> + *  echo 64M > /sys/block/zram0/disksize
> + *  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
> + *  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
> + *  mkswap /dev/zram0
> + *  swapon /dev/zram0

This setup needs to go into run_vmtests.sh as well.

Also tear it down after the test.

Chris

> + * The expected result should be 0% anon swpout fallback ratio w/ or
> + * w/o "-s".
> + *
> + * Author(s): Barry Song <v-songbaohua@oppo.com>
> + */
> +
> +#define _GNU_SOURCE
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <unistd.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <errno.h>
> +#include <time.h>
> +
> +#define MEMSIZE_MTHP (60 * 1024 * 1024)
> +#define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
> +#define ALIGNMENT_MTHP (64 * 1024)
> +#define ALIGNMENT_SMALLFOLIO (4 * 1024)
> +#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
> +#define TOTAL_DONTNEED_SMALLFOLIO (768 * 1024)
> +#define MTHP_FOLIO_SIZE (64 * 1024)
> +
> +#define SWPOUT_PATH \
> +       "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
> +#define SWPOUT_FALLBACK_PATH \
> +       "/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
> +
> +static void *aligned_alloc_mem(size_t size, size_t alignment)
> +{
> +       void *mem = NULL;
> +
> +       if (posix_memalign(&mem, alignment, size) != 0) {
> +               perror("posix_memalign");
> +               return NULL;
> +       }
> +       return mem;
> +}
> +
> +static void random_madvise_dontneed(void *mem, size_t mem_size,
> +               size_t align_size, size_t total_dontneed_size)
> +{
> +       size_t num_pages = total_dontneed_size / align_size;
> +       size_t i;
> +       size_t offset;
> +       void *addr;
> +
> +       for (i = 0; i < num_pages; ++i) {
> +               offset = (rand() % (mem_size / align_size)) * align_size;
> +               addr = (char *)mem + offset;
> +               if (madvise(addr, align_size, MADV_DONTNEED) != 0)
> +                       perror("madvise dontneed");
> +
> +               memset(addr, 0x11, align_size);
> +       }
> +}
> +
> +static unsigned long read_stat(const char *path)
> +{
> +       FILE *file;
> +       unsigned long value;
> +
> +       file = fopen(path, "r");
> +       if (!file) {
> +               perror("fopen");
> +               return 0;
> +       }
> +
> +       if (fscanf(file, "%lu", &value) != 1) {
> +               perror("fscanf");
> +               fclose(file);
> +               return 0;
> +       }
> +
> +       fclose(file);
> +       return value;
> +}
> +
> +int main(int argc, char *argv[])
> +{
> +       int use_small_folio = 0;
> +       int i;
> +       void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
> +       void *mem2 = NULL;
> +
> +       if (mem1 == NULL) {
> +               fprintf(stderr, "Failed to allocate 60MB memory\n");
> +               return EXIT_FAILURE;
> +       }
> +
> +       if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
> +               perror("madvise hugepage for mem1");
> +               free(mem1);
> +               return EXIT_FAILURE;
> +       }
> +
> +       for (i = 1; i < argc; ++i) {
> +               if (strcmp(argv[i], "-s") == 0)
> +                       use_small_folio = 1;
> +       }
> +
> +       if (use_small_folio) {
> +               mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
> +               if (mem2 == NULL) {
> +                       fprintf(stderr, "Failed to allocate 1MB memory\n");
> +                       free(mem1);
> +                       return EXIT_FAILURE;
> +               }
> +
> +               if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
> +                       perror("madvise nohugepage for mem2");
> +                       free(mem1);
> +                       free(mem2);
> +                       return EXIT_FAILURE;
> +               }
> +       }
> +
> +       for (i = 0; i < 100; ++i) {
> +               unsigned long initial_swpout;
> +               unsigned long initial_swpout_fallback;
> +               unsigned long final_swpout;
> +               unsigned long final_swpout_fallback;
> +               unsigned long swpout_inc;
> +               unsigned long swpout_fallback_inc;
> +               double fallback_percentage;
> +
> +               initial_swpout = read_stat(SWPOUT_PATH);
> +               initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
> +
> +               random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
> +                               TOTAL_DONTNEED_MTHP);
> +
> +               if (use_small_folio) {
> +                       random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
> +                                       ALIGNMENT_SMALLFOLIO,
> +                                       TOTAL_DONTNEED_SMALLFOLIO);
> +               }
> +
> +               if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
> +                       perror("madvise pageout for mem1");
> +                       free(mem1);
> +                       if (mem2 != NULL)
> +                               free(mem2);
> +                       return EXIT_FAILURE;
> +               }
> +
> +               if (use_small_folio) {
> +                       if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
> +                               perror("madvise pageout for mem2");
> +                               free(mem1);
> +                               free(mem2);
> +                               return EXIT_FAILURE;
> +                       }
> +               }
> +
> +               final_swpout = read_stat(SWPOUT_PATH);
> +               final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
> +
> +               swpout_inc = final_swpout - initial_swpout;
> +               swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;
> +
> +               fallback_percentage = (double)swpout_fallback_inc /
> +                       (swpout_fallback_inc + swpout_inc) * 100;
> +
> +               printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
> +                               i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);


Chris

> +       }
> +
> +       free(mem1);
> +       if (mem2 != NULL)
> +               free(mem2);
> +
> +       return EXIT_SUCCESS;
> +}
> --
> 2.34.1
>
>
Huang, Ying June 21, 2024, 2:33 a.m. UTC | #2
David Hildenbrand <david@redhat.com> writes:

> On 20.06.24 11:04, Ryan Roberts wrote:
>> On 20/06/2024 01:26, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Both Ryan and Chris have been utilizing the small test program to aid
>>> in debugging and identifying issues with swap entry allocation. While
>>> a real or intricate workload might be more suitable for assessing the
>>> correctness and effectiveness of the swap allocation policy, a small
>>> test program presents a simpler means of understanding the problem and
>>> initially verifying the improvements being made.
>>>
>>> Let's endeavor to integrate it into the self-test suite. Although it
>>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
>>> expand its capabilities to support multiple sizes and simulate more
>>> complex systems in the future as required.
>> I'll try to summarize the thread with Huang Ying by suggesting this test program
>> is "necessary but not sufficient" to exhaustively test the mTHP swap-out path.
>> I've certainly found it useful and think it would be a valuable addition to the
>> tree.
>>
>> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
>> clear pass/fail result against some criteria and must be able to be run
>> automatically by (e.g.) a CI system.
>
> Likely we should then consider moving other such performance-related
> thingies out of the selftests?

I think that it's good to distinguish between functionality and
performance tests.  For example, the 0-day test system uses virtual
machines to run some functionality tests more efficiently.  But it's
not good to run performance tests in that kind of virtual machine.

--
Best Regards,
Huang, Ying
Ryan Roberts June 21, 2024, 7:25 a.m. UTC | #3
On 20/06/2024 12:34, David Hildenbrand wrote:
> On 20.06.24 11:04, Ryan Roberts wrote:
>> On 20/06/2024 01:26, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Both Ryan and Chris have been utilizing the small test program to aid
>>> in debugging and identifying issues with swap entry allocation. While
>>> a real or intricate workload might be more suitable for assessing the
>>> correctness and effectiveness of the swap allocation policy, a small
>>> test program presents a simpler means of understanding the problem and
>>> initially verifying the improvements being made.
>>>
>>> Let's endeavor to integrate it into the self-test suite. Although it
>>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
>>> expand its capabilities to support multiple sizes and simulate more
>>> complex systems in the future as required.
>>
>> I'll try to summarize the thread with Huang Ying by suggesting this test program
>> is "neccessary but not sufficient" to exhaustively test the mTHP swap-out path.
>> I've certainly found it useful and think it would be a valuable addition to the
>> tree.
>>
>> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
>> clear pass/fail result against some criteria and must be able to be run
>> automatically by (e.g.) a CI system.
> 
> Likely we should then consider moving other such performance-related thingies
> out of the selftests?

Yes, that would get my vote. But of the 4 tests you mentioned that use
clock_gettime(), it looks like transhuge-stress is the only one that doesn't
have a pass/fail result, so is probably the only candidate for moving.

The others either use the times as a timeout and determine failure if the
action didn't occur within the timeout (e.g. ksm_tests.c) or use them to add some
supplemental performance information to an otherwise functionality-oriented test.
Ryan Roberts June 21, 2024, 7:34 a.m. UTC | #4
On 21/06/2024 00:34, Chris Li wrote:

>> + * thp_swap_allocator_test
>> + *
>> + * The purpose of this test program is helping check if THP swpout
>> + * can correctly get swap slots to swap out as a whole instead of
>> + * being split. It randomly releases swap entries through madvise
>> + * DONTNEED and do swapout on two memory areas: a memory area for
>> + * 64KB THP and the other area for small folios. The second memory
>> + * can be enabled by "-s".
>> + * Before running the program, we need to setup a zRAM or similar
>> + * swap device by:
>> + *  echo lzo > /sys/block/zram0/comp_algorithm
>> + *  echo 64M > /sys/block/zram0/disksize
>> + *  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
>> + *  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
>> + *  mkswap /dev/zram0
>> + *  swapon /dev/zram0
> 
> This setup needs to go into run_vmtest.sh as well.
> 
> Also tear it down after the test.

Additionally, if keeping this as a selftest, you'll want to add

CONFIG_ZRAM=y

to tools/testing/selftests/mm/config so that automated systems ensure zram is
available in the kernel under test.

And you will need to ensure that the zram device has a higher priority than any
other already configured swap devices. Otherwise there will not even be an
attempt to use it for mTHP. The easy way is to do "swapoff -a" as the first step
but that makes cleanup tricky.
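
One way to check the priority requirement without "swapoff -a" is to read
/proc/swaps and confirm that the zram device is the highest-priority area
before the test starts. The sketch below is only an illustration; the
function name highest_priority_swap is hypothetical. It assumes the usual
/proc/swaps layout of a header line followed by one line per swap area with
the columns Filename, Type, Size, Used and Priority.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the posted patch.
 *
 * Scan a /proc/swaps style stream and copy the name of the
 * highest-priority swap area into best. Returns 0 on success and
 * -1 if the stream lists no swap area.
 */
static int highest_priority_swap(FILE *f, char *best, size_t best_len)
{
	char line[512], name[256], type[32];
	unsigned long size, used;
	int prio, best_prio = 0, found = 0;

	/* Skip the header line. */
	if (!fgets(line, sizeof(line), f))
		return -1;

	while (fgets(line, sizeof(line), f)) {
		/* Columns: Filename Type Size Used Priority */
		if (sscanf(line, "%255s %31s %lu %lu %d",
			   name, type, &size, &used, &prio) != 5)
			continue;

		if (!found || prio > best_prio) {
			best_prio = prio;
			snprintf(best, best_len, "%s", name);
			found = 1;
		}
	}

	return found ? 0 : -1;
}
```

A caller would open /proc/swaps, pass the stream in, and skip the test (or
raise the zram priority via swapon -p) if the result is not the zram device.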
Barry Song June 21, 2024, 7:47 a.m. UTC | #5
On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 20/06/2024 12:34, David Hildenbrand wrote:
> > On 20.06.24 11:04, Ryan Roberts wrote:
> >> On 20/06/2024 01:26, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> Both Ryan and Chris have been utilizing the small test program to aid
> >>> in debugging and identifying issues with swap entry allocation. While
> >>> a real or intricate workload might be more suitable for assessing the
> >>> correctness and effectiveness of the swap allocation policy, a small
> >>> test program presents a simpler means of understanding the problem and
> >>> initially verifying the improvements being made.
> >>>
> >>> Let's endeavor to integrate it into the self-test suite. Although it
> >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> >>> expand its capabilities to support multiple sizes and simulate more
> >>> complex systems in the future as required.
> >>
> >> I'll try to summarize the thread with Huang Ying by suggesting this test program
> >> is "neccessary but not sufficient" to exhaustively test the mTHP swap-out path.
> >> I've certainly found it useful and think it would be a valuable addition to the
> >> tree.
> >>
> >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
> >> clear pass/fail result against some criteria and must be able to be run
> >> automatically by (e.g.) a CI system.
> >
> > Likely we should then consider moving other such performance-related thingies
> > out of the selftests?
>
> Yes, that would get my vote. But of the 4 tests you mentioned that use
> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> have a pass/fail result, so is probably the only candidate for moving.
>
> The others either use the times as a timeout and determines failure if the
> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
> supplemental performance information to an otherwise functionality-oriented test.

Thank you very much, Ryan. I think you've found a better home for this
tool. I will send v2, relocating it to tools/mm and adding a function to
swap in either the whole mTHPs or a portion of mTHPs by "-a" (aligned
swapin).

So basically, we will have

1. Use MADV_PAGEOUT for rapid swap-out, exercising the swap allocation code
heavily in a short time.

2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
new mTHP is always generated, released or swapped out, similar to the behavior
on a PC or Android phone where many applications are frequently started and
terminated.

3. Swap in with or without the "-a" option to observe how fragments due to
swap-in and the incoming swap-in of large folios will impact swap-out fallback.
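
The difference between the two swap-in modes in item 3 can be sketched as
below. This is not the v2 code, which is not part of this thread; the function
name touch_for_swapin and its parameters are hypothetical, and the sketch
assumes "-a" means touching every page of a randomly chosen 64KB block, while
the default touches a single page inside it.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical sketch, not the v2 implementation.
 *
 * Fault pages back in by reading them. With aligned != 0, every page of
 * a randomly chosen folio-sized block is read; otherwise only one random
 * page inside the block is read, leaving the rest of the block in swap.
 * Returns the number of page reads performed.
 */
static size_t touch_for_swapin(volatile const char *mem, size_t mem_size,
			       size_t folio_size, size_t page_size,
			       size_t rounds, int aligned)
{
	size_t nr_folios = mem_size / folio_size;
	size_t pages_per_folio = folio_size / page_size;
	size_t touched = 0;
	size_t i, p;

	if (nr_folios == 0 || pages_per_folio == 0)
		return 0;

	for (i = 0; i < rounds; i++) {
		volatile const char *folio =
			mem + ((size_t)rand() % nr_folios) * folio_size;

		if (aligned) {
			/* Read one byte per page across the whole block. */
			for (p = 0; p < pages_per_folio; p++) {
				(void)folio[p * page_size];
				touched++;
			}
		} else {
			/* Read one byte from a single random page. */
			p = (size_t)rand() % pages_per_folio;
			(void)folio[p * page_size];
			touched++;
		}
	}

	return touched;
}
```

Running this between the MADV_PAGEOUT passes would leave partially swapped-in
blocks in the unaligned case, which is the fragmentation the fallback counter
is meant to expose.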

And many thanks to Chris for the suggestion on improving it within
selftest, though I prefer to place it in tools/mm.

Thanks
Barry
Ryan Roberts June 21, 2024, 7:58 a.m. UTC | #6
On 21/06/2024 08:47, Barry Song wrote:
> On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 20/06/2024 12:34, David Hildenbrand wrote:
>>> On 20.06.24 11:04, Ryan Roberts wrote:
>>>> On 20/06/2024 01:26, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> Both Ryan and Chris have been utilizing the small test program to aid
>>>>> in debugging and identifying issues with swap entry allocation. While
>>>>> a real or intricate workload might be more suitable for assessing the
>>>>> correctness and effectiveness of the swap allocation policy, a small
>>>>> test program presents a simpler means of understanding the problem and
>>>>> initially verifying the improvements being made.
>>>>>
>>>>> Let's endeavor to integrate it into the self-test suite. Although it
>>>>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
>>>>> expand its capabilities to support multiple sizes and simulate more
>>>>> complex systems in the future as required.
>>>>
>>>> I'll try to summarize the thread with Huang Ying by suggesting this test program
>>>> is "neccessary but not sufficient" to exhaustively test the mTHP swap-out path.
>>>> I've certainly found it useful and think it would be a valuable addition to the
>>>> tree.
>>>>
>>>> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
>>>> clear pass/fail result against some criteria and must be able to be run
>>>> automatically by (e.g.) a CI system.
>>>
>>> Likely we should then consider moving other such performance-related thingies
>>> out of the selftests?
>>
>> Yes, that would get my vote. But of the 4 tests you mentioned that use
>> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
>> have a pass/fail result, so is probably the only candidate for moving.
>>
>> The others either use the times as a timeout and determines failure if the
>> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
>> supplemental performance information to an otherwise functionality-oriented test.
> 
> Thank you very much, Ryan. I think you've found a better home for this
> tool . I will
> send v2, relocating it to tools/mm and adding a function to swap in
> either the whole
> mTHPs or a portion of mTHPs by "-a"(aligned swapin).
> 
> So basically, we will have
> 
> 1. Use MADV_PAGEPUT for rapid swap-out, putting the swap allocation code under
> high exercise in a short time.
> 
> 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
> memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
> new mTHP is always generated, released or swapped out, similar to the behavior
> on a PC or Android phone where many applications are frequently started and
> terminated.
> 
> 3. Swap in with or without the "-a" option to observe how fragments
> due to swap-in
> and the incoming swap-in of large folios will impact swap-out fallback.
> 
> And many thanks to Chris for the suggestion on improving it within
> selftest, though I
> prefer to place it in tools/mm.

All sounds good to me!

If, in the future, you also want to test the vmscan swap-out path, the way I've
been doing that is to run the workload in a memory-constrained cgroup. That
means you don't need to exhaust all your physical RAM, which speeds things up a lot.
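
A minimal sketch of that approach, assuming cgroup v2 and a cgroup directory
that already exists and that the caller may write to: set memory.max, then move
the current process into the group by writing its PID to cgroup.procs. The
function names and the example path in the note below are hypothetical.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write value to <dir>/<file>. Returns 0 on success, -1 on failure. */
static int write_cgroup_file(const char *dir, const char *file,
			     const char *value)
{
	char path[4096];
	FILE *f;
	int n, ok;

	n = snprintf(path, sizeof(path), "%s/%s", dir, file);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;

	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: cap the memory of an existing cgroup v2 group
 * and move the calling process into it.
 */
static int confine_self(const char *cgroup_dir, unsigned long long max_bytes)
{
	char buf[32];

	snprintf(buf, sizeof(buf), "%llu", max_bytes);
	if (write_cgroup_file(cgroup_dir, "memory.max", buf) != 0)
		return -1;

	snprintf(buf, sizeof(buf), "%d", (int)getpid());
	return write_cgroup_file(cgroup_dir, "cgroup.procs", buf);
}
```

For example, confine_self("/sys/fs/cgroup/thp_swap_test", 32ULL << 20) before
allocating and touching the test memory would push reclaim through vmscan once
the 32MB limit is exceeded.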


> 
> Thanks
> Barry
Chris Li June 21, 2024, 8:50 a.m. UTC | #7
On Fri, Jun 21, 2024 at 12:47 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >
> > On 20/06/2024 12:34, David Hildenbrand wrote:
> > > On 20.06.24 11:04, Ryan Roberts wrote:
> > >> On 20/06/2024 01:26, Barry Song wrote:
> > >>> From: Barry Song <v-songbaohua@oppo.com>
> > >>>
> > >>> Both Ryan and Chris have been utilizing the small test program to aid
> > >>> in debugging and identifying issues with swap entry allocation. While
> > >>> a real or intricate workload might be more suitable for assessing the
> > >>> correctness and effectiveness of the swap allocation policy, a small
> > >>> test program presents a simpler means of understanding the problem and
> > >>> initially verifying the improvements being made.
> > >>>
> > >>> Let's endeavor to integrate it into the self-test suite. Although it
> > >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> > >>> expand its capabilities to support multiple sizes and simulate more
> > >>> complex systems in the future as required.
> > >>
> > >> I'll try to summarize the thread with Huang Ying by suggesting this test program
> > >> is "neccessary but not sufficient" to exhaustively test the mTHP swap-out path.
> > >> I've certainly found it useful and think it would be a valuable addition to the
> > >> tree.
> > >>
> > >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
> > >> clear pass/fail result against some criteria and must be able to be run
> > >> automatically by (e.g.) a CI system.
> > >
> > > Likely we should then consider moving other such performance-related thingies
> > > out of the selftests?
> >
> > Yes, that would get my vote. But of the 4 tests you mentioned that use
> > clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> > have a pass/fail result, so is probably the only candidate for moving.
> >
> > The others either use the times as a timeout and determines failure if the
> > action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
> > supplemental performance information to an otherwise functionality-oriented test.
>
> Thank you very much, Ryan. I think you've found a better home for this
> tool . I will
> send v2, relocating it to tools/mm and adding a function to swap in
> either the whole
> mTHPs or a portion of mTHPs by "-a"(aligned swapin).
>
> So basically, we will have
>
> 1. Use MADV_PAGEPUT for rapid swap-out, putting the swap allocation code under
> high exercise in a short time.
>
> 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
> memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
> new mTHP is always generated, released or swapped out, similar to the behavior
> on a PC or Android phone where many applications are frequently started and
> terminated.

Will this cover the case where the ratio of order-0 to order-4 swap
requests changes during LMK and the swapfile is almost full?

If not, please add that :-)

> 3. Swap in with or without the "-a" option to observe how fragments
> due to swap-in
> and the incoming swap-in of large folios will impact swap-out fallback.
>
> And many thanks to Chris for the suggestion on improving it within
> selftest, though I
> prefer to place it in tools/mm.

I am perfectly fine with that. Looking forward to your V2.

Chris
David Hildenbrand June 21, 2024, 8:52 a.m. UTC | #8
On 21.06.24 09:25, Ryan Roberts wrote:
> On 20/06/2024 12:34, David Hildenbrand wrote:
>> On 20.06.24 11:04, Ryan Roberts wrote:
>>> On 20/06/2024 01:26, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> Both Ryan and Chris have been utilizing the small test program to aid
>>>> in debugging and identifying issues with swap entry allocation. While
>>>> a real or intricate workload might be more suitable for assessing the
>>>> correctness and effectiveness of the swap allocation policy, a small
>>>> test program presents a simpler means of understanding the problem and
>>>> initially verifying the improvements being made.
>>>>
>>>> Let's endeavor to integrate it into the self-test suite. Although it
>>>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
>>>> expand its capabilities to support multiple sizes and simulate more
>>>> complex systems in the future as required.
>>>
>>> I'll try to summarize the thread with Huang Ying by suggesting this test program
>>> is "neccessary but not sufficient" to exhaustively test the mTHP swap-out path.
>>> I've certainly found it useful and think it would be a valuable addition to the
>>> tree.
>>>
>>> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
>>> clear pass/fail result against some criteria and must be able to be run
>>> automatically by (e.g.) a CI system.
>>
>> Likely we should then consider moving other such performance-related thingies
>> out of the selftests?
> 
> Yes, that would get my vote. But of the 4 tests you mentioned that use
> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> have a pass/fail result, so is probably the only candidate for moving.
> 
> The others either use the times as a timeout and determines failure if the
> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
> supplemental performance information to an otherwise functionality-oriented test.

Likely for ksm it would make sense to move the really functional parts
to ksm_function_tests.c.

For gup_test it might be similar.
Huang, Ying June 21, 2024, 9:22 a.m. UTC | #9
Barry Song <21cnbao@gmail.com> writes:

> On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 20/06/2024 12:34, David Hildenbrand wrote:
>> > On 20.06.24 11:04, Ryan Roberts wrote:
>> >> On 20/06/2024 01:26, Barry Song wrote:
>> >>> From: Barry Song <v-songbaohua@oppo.com>
>> >>>
>> >>> Both Ryan and Chris have been utilizing the small test program to aid
>> >>> in debugging and identifying issues with swap entry allocation. While
>> >>> a real or intricate workload might be more suitable for assessing the
>> >>> correctness and effectiveness of the swap allocation policy, a small
>> >>> test program presents a simpler means of understanding the problem and
>> >>> initially verifying the improvements being made.
>> >>>
>> >>> Let's endeavor to integrate it into the self-test suite. Although it
>> >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
>> >>> expand its capabilities to support multiple sizes and simulate more
>> >>> complex systems in the future as required.
>> >>
>> >> I'll try to summarize the thread with Huang Ying by suggesting this test program
>> >> is "neccessary but not sufficient" to exhaustively test the mTHP swap-out path.
>> >> I've certainly found it useful and think it would be a valuable addition to the
>> >> tree.
>> >>
>> >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
>> >> clear pass/fail result against some criteria and must be able to be run
>> >> automatically by (e.g.) a CI system.
>> >
>> > Likely we should then consider moving other such performance-related thingies
>> > out of the selftests?
>>
>> Yes, that would get my vote. But of the 4 tests you mentioned that use
>> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
>> have a pass/fail result, so is probably the only candidate for moving.
>>
>> The others either use the times as a timeout and determines failure if the
>> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
>> supplemental performance information to an otherwise functionality-oriented test.
>
> Thank you very much, Ryan. I think you've found a better home for this
> tool . I will
> send v2, relocating it to tools/mm and adding a function to swap in
> either the whole
> mTHPs or a portion of mTHPs by "-a"(aligned swapin).
>
> So basically, we will have
>
> 1. Use MADV_PAGEPUT for rapid swap-out, putting the swap allocation code under
> high exercise in a short time.
>
> 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
> memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
> new mTHP is always generated, released or swapped out, similar to the behavior
> on a PC or Android phone where many applications are frequently started and
> terminated.

Calling MADV_DONTNEED on 64KB of memory and then memset()ing it simulates
large folio swap-in exactly, and that hasn't been merged upstream.  I
don't think this kind of trick is a good idea.

> 3. Swap in with or without the "-a" option to observe how fragments
> due to swap-in
> and the incoming swap-in of large folios will impact swap-out fallback.

It's good to create fragmentation with swap-in, which is more practical
and future-proof.  And I believe that we can reduce the large folio
swap-out fallback rate without the large folio swap-in trick.

> And many thanks to Chris for the suggestion on improving it within
> selftest, though I
> prefer to place it in tools/mm.

--
Best Regards,
Huang, Ying
Barry Song June 21, 2024, 9:43 a.m. UTC | #10
On Fri, Jun 21, 2024 at 9:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 20/06/2024 12:34, David Hildenbrand wrote:
> >> > On 20.06.24 11:04, Ryan Roberts wrote:
> >> >> On 20/06/2024 01:26, Barry Song wrote:
> >> >>> From: Barry Song <v-songbaohua@oppo.com>
> >> >>>
> >> >>> Both Ryan and Chris have been utilizing the small test program to aid
> >> >>> in debugging and identifying issues with swap entry allocation. While
> >> >>> a real or intricate workload might be more suitable for assessing the
> >> >>> correctness and effectiveness of the swap allocation policy, a small
> >> >>> test program presents a simpler means of understanding the problem and
> >> >>> initially verifying the improvements being made.
> >> >>>
> >> >>> Let's endeavor to integrate it into the self-test suite. Although it
> >> >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> >> >>> expand its capabilities to support multiple sizes and simulate more
> >> >>> complex systems in the future as required.
> >> >>
> >> >> I'll try to summarize the thread with Huang Ying by suggesting this test program
> >> >> is "necessary but not sufficient" to exhaustively test the mTHP swap-out path.
> >> >> I've certainly found it useful and think it would be a valuable addition to the
> >> >> tree.
> >> >>
> >> >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
> >> >> clear pass/fail result against some criteria and must be able to be run
> >> >> automatically by (e.g.) a CI system.
> >> >
> >> > Likely we should then consider moving other such performance-related thingies
> >> > out of the selftests?
> >>
> >> Yes, that would get my vote. But of the 4 tests you mentioned that use
> >> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> >> have a pass/fail result, so is probably the only candidate for moving.
> >>
> >> The others either use the times as a timeout and determines failure if the
> >> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
> >> supplemental performance information to an otherwise functionality-oriented test.
> >
> > Thank you very much, Ryan. I think you've found a better home for this
> > tool . I will
> > send v2, relocating it to tools/mm and adding a function to swap in
> > either the whole
> > mTHPs or a portion of mTHPs by "-a"(aligned swapin).
> >
> > So basically, we will have
> >
> > 1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code under
> > high exercise in a short time.
> >
> > 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
> > memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
> > new mTHP is always generated, released or swapped out, similar to the behavior
> > on a PC or Android phone where many applications are frequently started and
> > terminated.
>
> MADV_DONTNEED 64KB memory, then memset() it, this just simulates the
> large folio swap-in exactly, which hasn't been merged by upstream.  I
> don't think that it's a good idea to make such kind of trick.

I disagree. This is how userspace heaps can manage memory deallocation.
Additionally, in the event of an application exit, munmap, or OOM killer, the
amount of freed memory can be much larger than 64KB. The primary purpose
of using MADV_DONTNEED is to release anonymous memory and generate
new mTHP so that the iteration can continue. Otherwise, the test program
becomes entirely pointless, as we only have large folios at the beginning.
That is exactly why Chris has failed to find his bugs by using other small
programs.

On the other hand, we definitely want large folio swap-in; otherwise, mTHP
is just a toy for Android or similar systems, where more than 2/3 of memory
could be in swap. We do NOT want single-use mTHP.
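
To make the iteration concrete, here is a minimal sketch of one cycle as
described above: populate the region, ask for swap-out, then drop it so the
next touch faults fresh memory. It is not the code from the patch. The names
map_aligned(), swapout_and_recycle() and MTHP_SIZE are made up for
illustration, and the MADV_PAGEOUT result is only reported, not enforced,
since it is a best-effort hint.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21	/* value from the Linux UAPI headers (5.4+) */
#endif

#define MTHP_SIZE (64UL * 1024)	/* assumed mTHP size under test */

/*
 * Map an anonymous region whose start is aligned to 'align' (a power of
 * two). The slack from over-allocating is simply left mapped in this sketch.
 */
void *map_aligned(size_t size, size_t align)
{
	void *raw;
	uintptr_t start;

	assert(align && (align & (align - 1)) == 0);
	raw = mmap(NULL, size + align, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return NULL;
	start = ((uintptr_t)raw + align - 1) & ~(uintptr_t)(align - 1);
	return (void *)start;
}

/*
 * One iteration: fault the range in, hint that it should be swapped out,
 * then free it with MADV_DONTNEED so that the next iteration has to fault
 * new folios. Returns the MADV_DONTNEED result; the MADV_PAGEOUT result is
 * passed back through 'pageout_ret' when that pointer is not NULL.
 */
int swapout_and_recycle(void *mem, size_t size, int *pageout_ret)
{
	int ret;

	memset(mem, 0x11, size);		/* populate the range */
	ret = madvise(mem, size, MADV_PAGEOUT);	/* best-effort swap-out */
	if (pageout_ret)
		*pageout_ret = ret;
	return madvise(mem, size, MADV_DONTNEED);
}
```

Calling swapout_and_recycle() in a loop on the same region is what keeps new
folios being generated; without the MADV_DONTNEED step only the first pass
would see large folios.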

>
> > 3. Swap in with or without the "-a" option to observe how fragments
> > due to swap-in
> > and the incoming swap-in of large folios will impact swap-out fallback.
>
> It's good to create fragmentation with swap-in.  Which is more practical
> and future-proof.  And, I believe that we can reduce large folio
> swap-out fallback rate without the large folio swap-in trick.
>
> > And many thanks to Chris for the suggestion on improving it within
> > selftest, though I
> > prefer to place it in tools/mm.
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Barry Song June 21, 2024, 11:20 a.m. UTC | #11
On Fri, Jun 21, 2024 at 4:50 PM Chris Li <chrisl@kernel.org> wrote:
>
> On Fri, Jun 21, 2024 at 12:47 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> > >
> > > On 20/06/2024 12:34, David Hildenbrand wrote:
> > > > On 20.06.24 11:04, Ryan Roberts wrote:
> > > >> On 20/06/2024 01:26, Barry Song wrote:
> > > >>> From: Barry Song <v-songbaohua@oppo.com>
> > > >>>
> > > >>> Both Ryan and Chris have been utilizing the small test program to aid
> > > >>> in debugging and identifying issues with swap entry allocation. While
> > > >>> a real or intricate workload might be more suitable for assessing the
> > > >>> correctness and effectiveness of the swap allocation policy, a small
> > > >>> test program presents a simpler means of understanding the problem and
> > > >>> initially verifying the improvements being made.
> > > >>>
> > > >>> Let's endeavor to integrate it into the self-test suite. Although it
> > > >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> > > >>> expand its capabilities to support multiple sizes and simulate more
> > > >>> complex systems in the future as required.
> > > >>
> > > >> I'll try to summarize the thread with Huang Ying by suggesting this test program
> > > >> is "necessary but not sufficient" to exhaustively test the mTHP swap-out path.
> > > >> I've certainly found it useful and think it would be a valuable addition to the
> > > >> tree.
> > > >>
> > > >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
> > > >> clear pass/fail result against some criteria and must be able to be run
> > > >> automatically by (e.g.) a CI system.
> > > >
> > > > Likely we should then consider moving other such performance-related thingies
> > > > out of the selftests?
> > >
> > > Yes, that would get my vote. But of the 4 tests you mentioned that use
> > > clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> > > have a pass/fail result, so is probably the only candidate for moving.
> > >
> > > The others either use the times as a timeout and determines failure if the
> > > action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
> > > supplemental performance information to an otherwise functionality-oriented test.
> >
> > Thank you very much, Ryan. I think you've found a better home for this
> > tool . I will
> > send v2, relocating it to tools/mm and adding a function to swap in
> > either the whole
> > mTHPs or a portion of mTHPs by "-a"(aligned swapin).
> >
> > So basically, we will have
> >
> > 1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code under
> > high exercise in a short time.
> >
> > 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
> > memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
> > new mTHP is always generated, released or swapped out, similar to the behavior
> > on a PC or Android phone where many applications are frequently started and
> > terminated.
>
> Will this cover the case that the ratio of order 0 and order 4 swap
> requests change during LMK, and swapfile is almost full?
>
> If not, please add that :-)

Due to 2, we ensure a certain proportion of mTHP. Similarly, because of 3, we
maintain a certain proportion of small folios, as we don't support large folios
swap-in, meaning any swap-in will immediately result in small folios. Therefore,
with both 2 and 3, we automatically achieve a system containing both mTHP and
small folios. Additionally, 1 provides the ability to continuously swap them
out. If we set the same sizes for 2 and 3, we'll achieve a 1:1 ratio of large
folios to small folios. How about starting with a 1:1 ratio?

To meet the requirement that the swapfile is almost full, I can increase the
memory to ensure the total size is quite close to zRAM. This way, we give the
small folios a chance to perform a slow scan and observe the impact.
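
One way to observe that impact is to sample the per-size mTHP counters before
and after an iteration and turn the deltas into a fallback percentage. The
sketch below is only an illustration: the sysfs paths assume a kernel that
exposes per-size mTHP stats for 64kB, the helper names are hypothetical, and
the formula (fallbacks over all swap-out attempts) is my assumption about what
is worth reporting.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed locations of the 64kB mTHP swap-out counters. */
#define SWPOUT_PATH \
	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
#define SWPOUT_FALLBACK_PATH \
	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"

/* Read one counter from a sysfs file; returns -1 if it cannot be read. */
long read_counter(const char *path)
{
	FILE *f = fopen(path, "r");
	long val;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/*
 * Share of swap-out attempts that fell back to small folios, in percent.
 * Both arguments are deltas over one iteration and must be non-negative.
 */
double fallback_percent(long swpout_delta, long fallback_delta)
{
	long total;

	assert(swpout_delta >= 0 && fallback_delta >= 0);
	total = swpout_delta + fallback_delta;
	if (total == 0)
		return 0.0;	/* nothing was swapped out in this interval */
	return 100.0 * (double)fallback_delta / (double)total;
}
```

A caller would read both counters with read_counter() around each iteration
and pass the differences to fallback_percent(), skipping the report when
either read returns -1.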

>
> > 3. Swap in with or without the "-a" option to observe how fragments
> > due to swap-in
> > and the incoming swap-in of large folios will impact swap-out fallback.
> >
> > And many thanks to Chris for the suggestion on improving it within
> > selftest, though I
> > prefer to place it in tools/mm.
>
> I am perfectly fine with that. Looking forward to your V2.
>
> Chris

Thanks
Barry
Huang, Ying June 24, 2024, 3:42 a.m. UTC | #12
Barry Song <21cnbao@gmail.com> writes:

> On Fri, Jun 21, 2024 at 9:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >>
>> >> On 20/06/2024 12:34, David Hildenbrand wrote:
>> >> > On 20.06.24 11:04, Ryan Roberts wrote:
>> >> >> On 20/06/2024 01:26, Barry Song wrote:
>> >> >>> From: Barry Song <v-songbaohua@oppo.com>
>> >> >>>
>> >> >>> Both Ryan and Chris have been utilizing the small test program to aid
>> >> >>> in debugging and identifying issues with swap entry allocation. While
>> >> >>> a real or intricate workload might be more suitable for assessing the
>> >> >>> correctness and effectiveness of the swap allocation policy, a small
>> >> >>> test program presents a simpler means of understanding the problem and
>> >> >>> initially verifying the improvements being made.
>> >> >>>
>> >> >>> Let's endeavor to integrate it into the self-test suite. Although it
>> >> >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
>> >> >>> expand its capabilities to support multiple sizes and simulate more
>> >> >>> complex systems in the future as required.
>> >> >>
>> >> >> I'll try to summarize the thread with Huang Ying by suggesting this test program
>> >> >> is "necessary but not sufficient" to exhaustively test the mTHP swap-out path.
>> >> >> I've certainly found it useful and think it would be a valuable addition to the
>> >> >> tree.
>> >> >>
>> >> >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
>> >> >> clear pass/fail result against some criteria and must be able to be run
>> >> >> automatically by (e.g.) a CI system.
>> >> >
>> >> > Likely we should then consider moving other such performance-related thingies
>> >> > out of the selftests?
>> >>
>> >> Yes, that would get my vote. But of the 4 tests you mentioned that use
>> >> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
>> >> have a pass/fail result, so is probably the only candidate for moving.
>> >>
>> >> The others either use the times as a timeout and determines failure if the
>> >> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
>> >> supplemental performance information to an otherwise functionality-oriented test.
>> >
>> > Thank you very much, Ryan. I think you've found a better home for this
>> > tool . I will
>> > send v2, relocating it to tools/mm and adding a function to swap in
>> > either the whole
>> > mTHPs or a portion of mTHPs by "-a"(aligned swapin).
>> >
>> > So basically, we will have
>> >
>> > 1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code under
>> > high exercise in a short time.
>> >
>> > 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
>> > memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
>> > new mTHP is always generated, released or swapped out, similar to the behavior
>> > on a PC or Android phone where many applications are frequently started and
>> > terminated.
>>
>> MADV_DONTNEED 64KB memory, then memset() it, this just simulates the
>> large folio swap-in exactly, which hasn't been merged by upstream.  I
>> don't think that it's a good idea to make such kind of trick.
>
> I disagree. This is how userspace heaps can manage memory
> deallocation.

Sorry, I don't understand how.  Can you show some examples?  Such as
strace log with 64KB aligned MADV_DONTNEED?

> Additionally, in the event of an application exit, munmap, or OOM killer, the
> amount of freed memory can be much larger than 64KB. The primary purpose
> of using MADV_DONTNEED is to release anonymous memory and generate
> new mTHP so that the iteration can continue. Otherwise, the test program
> becomes entirely pointless, as we only have large folios at the beginning.
> That is exactly why Chris has failed to find his bugs by using other small
> programs.

I still don't understand how 64KB-aligned MADV_DONTNEED is used for the
libc/Java heap or munmap in a practical way.  But after more thought, I
think 64KB-aligned MADV_DONTNEED can simulate the fragmentation effect
of process exit to some degree, if the 64KB folios in these processes are
swapped out without splitting.  If you have no other practical use
cases, I suggest making that explicit with comments in the program.

> On the other hand, we definitely want large folios swap-in, otherwise, mTHP
> is just a toy to Android or similar system where more than 2/3 memory could
> be in swap. We do NOT want single-use mTHP.

I agree that large folio swap-in has its value, at least in some
situations.  Whether we should make it the default behavior is another
topic; we can discuss it further in the future.

>>
>> > 3. Swap in with or without the "-a" option to observe how fragments
>> > due to swap-in
>> > and the incoming swap-in of large folios will impact swap-out fallback.
>>
>> It's good to create fragmentation with swap-in.  Which is more practical
>> and future-proof.  And, I believe that we can reduce large folio
>> swap-out fallback rate without the large folio swap-in trick.
>>
>> > And many thanks to Chris for the suggestion on improving it within
>> > selftest, though I
>> > prefer to place it in tools/mm.

--
Best Regards,
Huang, Ying
Barry Song June 24, 2024, 4:05 a.m. UTC | #13
On Mon, Jun 24, 2024 at 3:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Fri, Jun 21, 2024 at 9:24 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> >>
> >> >> On 20/06/2024 12:34, David Hildenbrand wrote:
> >> >> > On 20.06.24 11:04, Ryan Roberts wrote:
> >> >> >> On 20/06/2024 01:26, Barry Song wrote:
> >> >> >>> From: Barry Song <v-songbaohua@oppo.com>
> >> >> >>>
> >> >> >>> Both Ryan and Chris have been utilizing the small test program to aid
> >> >> >>> in debugging and identifying issues with swap entry allocation. While
> >> >> >>> a real or intricate workload might be more suitable for assessing the
> >> >> >>> correctness and effectiveness of the swap allocation policy, a small
> >> >> >>> test program presents a simpler means of understanding the problem and
> >> >> >>> initially verifying the improvements being made.
> >> >> >>>
> >> >> >>> Let's endeavor to integrate it into the self-test suite. Although it
> >> >> >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> >> >> >>> expand its capabilities to support multiple sizes and simulate more
> >> >> >>> complex systems in the future as required.
> >> >> >>
> >> >> >> I'll try to summarize the thread with Huang Ying by suggesting this test program
> >> >> >> is "necessary but not sufficient" to exhaustively test the mTHP swap-out path.
> >> >> >> I've certainly found it useful and think it would be a valuable addition to the
> >> >> >> tree.
> >> >> >>
> >> >> >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
> >> >> >> clear pass/fail result against some criteria and must be able to be run
> >> >> >> automatically by (e.g.) a CI system.
> >> >> >
> >> >> > Likely we should then consider moving other such performance-related thingies
> >> >> > out of the selftests?
> >> >>
> >> >> Yes, that would get my vote. But of the 4 tests you mentioned that use
> >> >> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> >> >> have a pass/fail result, so is probably the only candidate for moving.
> >> >>
> >> >> The others either use the times as a timeout and determines failure if the
> >> >> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
> >> >> supplemental performance information to an otherwise functionality-oriented test.
> >> >
> >> > Thank you very much, Ryan. I think you've found a better home for this
> >> > tool . I will
> >> > send v2, relocating it to tools/mm and adding a function to swap in
> >> > either the whole
> >> > mTHPs or a portion of mTHPs by "-a"(aligned swapin).
> >> >
> >> > So basically, we will have
> >> >
> >> > 1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code under
> >> > high exercise in a short time.
> >> >
> >> > 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
> >> > memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
> >> > new mTHP is always generated, released or swapped out, similar to the behavior
> >> > on a PC or Android phone where many applications are frequently started and
> >> > terminated.
> >>
> >> MADV_DONTNEED 64KB memory, then memset() it, this just simulates the
> >> large folio swap-in exactly, which hasn't been merged by upstream.  I
> >> don't think that it's a good idea to make such kind of trick.
> >
> > I disagree. This is how userspace heaps can manage memory
> > deallocation.
>
> Sorry, I don't understand how.  Can you show some examples?  Such as
> strace log with 64KB aligned MADV_DONTNEED?

In Java heap and memory allocators such as jemalloc and Scudo, memory is freed
using the MADV_DONTNEED flag when either free() is called or garbage collection
occurs. In Android, the Java heap is freed in chunks aligned to 64KB or larger.
In Scudo and jemalloc, there is a configuration option to set the management
granularity. This granularity is set to match the mTHP size (though the default
value is 16KB in the latest Android if we don't run mTHP). Otherwise, you could
end up with millions of partial unmap operations, which would severely degrade
the performance of mTHP.

Imagine libc/Java functioning like a slab allocator. When kfree() is called,
some pages may become completely unoccupied and can be returned to the buddy
allocator. In userspace, memory is given back to the kernel in a similar manner,
typically using MADV_DONTNEED. Therefore, MADV_DONTNEED is the most common
memory reclamation behavior in Android, coming with free(), delete() or GC.

Imagine a system with extensive malloc, free, new, and delete operations, where
objects are constantly being created and destroyed.

On the other hand, whether libc/Java use MADV_DONTNEED to free memory is not
crucial, although they do. We need a method to simulate the lifecycle of
applications (exiting and starting anew) on PCs or Android phones. It doesn't
matter if you use MADV_DONTNEED or munmap to achieve this.

It is important to note that mTHP currently operates on a one-shot basis (after
swap-out, you never get them back as mTHP, as we don't support large folio
swap-in). For the test program, we need a method to generate new mTHPs
continuously. Without this, after the initial iterations, we would be left with
only small folios, rendering the entire test program *pointless*.
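
To illustrate the granularity point, here is a rough sketch of how an
allocator-like caller could hand back only whole, aligned chunks and so avoid
partially unmapping a large folio. It is not taken from jemalloc, Scudo or the
Android runtime; whole_chunks() and release_whole_chunks() are hypothetical
names, and the granularity is assumed to be a power of two that is a multiple
of the page size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Shrink [start, start + len) inward to 'gran' boundaries. Returns the
 * number of bytes covered by whole chunks and stores the first aligned
 * address in '*first'; returns 0 if no whole chunk fits.
 */
size_t whole_chunks(uintptr_t start, size_t len, size_t gran, uintptr_t *first)
{
	uintptr_t lo, hi;

	assert(gran && (gran & (gran - 1)) == 0);
	lo = (start + gran - 1) & ~(uintptr_t)(gran - 1);	/* round up */
	hi = (start + len) & ~(uintptr_t)(gran - 1);		/* round down */
	if (hi <= lo)
		return 0;
	*first = lo;
	return hi - lo;
}

/*
 * Give the whole chunks inside a freed range back to the kernel. Partial
 * chunks at either end are kept, so a 64KB folio is never split by this
 * call. Returns the number of bytes released, or 0 on failure.
 */
size_t release_whole_chunks(void *addr, size_t len, size_t gran)
{
	uintptr_t first;
	size_t bytes = whole_chunks((uintptr_t)addr, len, gran, &first);

	if (bytes == 0)
		return 0;
	if (madvise((void *)first, bytes, MADV_DONTNEED) != 0)
		return 0;
	return bytes;
}
```

With gran set to 64KB, a freed range that starts at byte 1000 and is 200000
bytes long releases two whole chunks (131072 bytes) and keeps the ragged ends.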

>
> > Additionally, in the event of an application exit, munmap, or OOM killer, the
> > amount of freed memory can be much larger than 64KB. The primary purpose
> > of using MADV_DONTNEED is to release anonymous memory and generate
> > new mTHP so that the iteration can continue. Otherwise, the test program
> > becomes entirely pointless, as we only have large folios at the beginning.
> > That is exactly why Chris has failed to find his bugs by using other small
> > programs.
>
> Although I still don't understand how 64KB aligned MADV_DONTNEED is used
> for libc/java heap or munmap in a practical way.  After more thoughts, I
> think 64KB Aligned MADV_DONTNEED can simulate the fragmentation effect
> of processes exit at some degree if 64KB folios in these processes are
> swapped out without splitting.  If you have no other practical use
> cases, I suggest to make it explicit with comments in program.
>
> > On the other hand, we definitely want large folios swap-in, otherwise, mTHP
> > is just a toy to Android or similar system where more than 2/3 memory could
> > be in swap. We do NOT want single-use mTHP.
>
> I agree that large folios swap-in has its value at least in some
> situations.  Whether we should take it as default behavior is another
> topic, we can discuss it further in the future.

Cool. Just imagine that mTHP is beneficial for systems that don't frequently
use swap. However, for Android, where most memory resides in swap, mTHP
acts like a one-way ticket: you end up with small folios and can't revert to
large ones. This is so BAD.

>
> >>
> >> > 3. Swap in with or without the "-a" option to observe how fragments
> >> > due to swap-in
> >> > and the incoming swap-in of large folios will impact swap-out fallback.
> >>
> >> It's good to create fragmentation with swap-in.  Which is more practical
> >> and future-proof.  And, I believe that we can reduce large folio
> >> swap-out fallback rate without the large folio swap-in trick.
> >>
> >> > And many thanks to Chris for the suggestion on improving it within
> >> > selftest, though I
> >> > prefer to place it in tools/mm.
>
> --
> Best Regards,
> Huang, Ying

Thanks
Barry
Huang, Ying June 24, 2024, 6:59 a.m. UTC | #14
Barry Song <21cnbao@gmail.com> writes:

> On Mon, Jun 24, 2024 at 3:44 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Fri, Jun 21, 2024 at 9:24 PM Huang, Ying <ying.huang@intel.com> wrote:
>> >>
>> >> Barry Song <21cnbao@gmail.com> writes:
>> >>
>> >> > On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>> >> >>
>> >> >> On 20/06/2024 12:34, David Hildenbrand wrote:
>> >> >> > On 20.06.24 11:04, Ryan Roberts wrote:
>> >> >> >> On 20/06/2024 01:26, Barry Song wrote:
>> >> >> >>> From: Barry Song <v-songbaohua@oppo.com>
>> >> >> >>>
>> >> >> >>> Both Ryan and Chris have been utilizing the small test program to aid
>> >> >> >>> in debugging and identifying issues with swap entry allocation. While
>> >> >> >>> a real or intricate workload might be more suitable for assessing the
>> >> >> >>> correctness and effectiveness of the swap allocation policy, a small
>> >> >> >>> test program presents a simpler means of understanding the problem and
>> >> >> >>> initially verifying the improvements being made.
>> >> >> >>>
>> >> >> >>> Let's endeavor to integrate it into the self-test suite. Although it
>> >> >> >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
>> >> >> >>> expand its capabilities to support multiple sizes and simulate more
>> >> >> >>> complex systems in the future as required.
>> >> >> >>
>> >> >> >> I'll try to summarize the thread with Huang Ying by suggesting this test program
>> >> >> >> is "necessary but not sufficient" to exhaustively test the mTHP swap-out path.
>> >> >> >> I've certainly found it useful and think it would be a valuable addition to the
>> >> >> >> tree.
>> >> >> >>
>> >> >> >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
>> >> >> >> clear pass/fail result against some criteria and must be able to be run
>> >> >> >> automatically by (e.g.) a CI system.
>> >> >> >
>> >> >> > Likely we should then consider moving other such performance-related thingies
>> >> >> > out of the selftests?
>> >> >>
>> >> >> Yes, that would get my vote. But of the 4 tests you mentioned that use
>> >> >> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
>> >> >> have a pass/fail result, so is probably the only candidate for moving.
>> >> >>
>> >> >> The others either use the times as a timeout and determines failure if the
>> >> >> action didn't occur within the timeout (e.g. ksm_tests.c) or use it to add some
>> >> >> supplemental performance information to an otherwise functionality-oriented test.
>> >> >
>> >> > Thank you very much, Ryan. I think you've found a better home for this
>> >> > tool . I will
>> >> > send v2, relocating it to tools/mm and adding a function to swap in
>> >> > either the whole
>> >> > mTHPs or a portion of mTHPs by "-a"(aligned swapin).
>> >> >
>> >> > So basically, we will have
>> >> >
>> >> > 1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code under
>> >> > high exercise in a short time.
>> >> >
>> >> > 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
>> >> > memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
>> >> > new mTHP is always generated, released or swapped out, similar to the behavior
>> >> > on a PC or Android phone where many applications are frequently started and
>> >> > terminated.
>> >>
>> >> MADV_DONTNEED 64KB memory, then memset() it, this just simulates the
>> >> large folio swap-in exactly, which hasn't been merged by upstream.  I
>> >> don't think that it's a good idea to make such kind of trick.
>> >
>> > I disagree. This is how userspace heaps can manage memory
>> > deallocation.
>>
>> Sorry, I don't understand how.  Can you show some examples?  Such as
>> strace log with 64KB aligned MADV_DONTNEED?
>
> In Java heap and memory allocators such as jemalloc and Scudo, memory is freed
> using the MADV_DONTNEED flag when either free() is called or garbage collection
> occurs. In Android, the Java heap is freed in chunks aligned to 64KB
> or larger.

Originally, I had heard that jemalloc uses MADV_FREE.  Now I know that
it uses MADV_DONTNEED too.  Thanks!

Although I still doubt that the libc/Java allocator will free pages in
exactly 64KB units (IIUC, they should free pages in much larger chunks), I
agree that MADV_DONTNEED is a way to create fragmentation in swap
devices.

> In
> Scudo and jemalloc, there is a configuration option to set the
> management granularity.
> This granularity is set to match the mTHP size(though the default
> value is 16KB in the
> latest Android if we don't run mTHP). Otherwise, you could end up with
> millions of
> partial unmap operations, which would severely degrade the performance of mTHP.
>
> Imagine libc/Java functioning like a slab allocator. When kfree() is
> called, some pages
> may become completely unoccupied and can be returned to the buddy allocator. In
> userspace, memory is given back to the kernel in a similar manner,
> typically using
> MADV_DONTNEED. Therefore, MADV_DONTNEED is the most common memory
> reclamation behavior in Android, coming with free(), delete() or GC.
>
> Imagine a system with extensive malloc, free, new, and delete
> operations, where objects
> are constantly being created and destroyed.
>
> On the other hand, whether libc/Java use MADV_DONTNEED to free memory is not
> crucial, although they do. We need a method to simulate the lifecycle
> of applications
> —exiting and starting anew—on PCs or Android phones. It doesn't matter if you
> use MADV_DONTNEED or munmap to achieve this.
>
> It is important to note that mTHP currently operates on a one-shot
> basis(after swap-out,
> you never get them back as mTHP as we don't support large folios
> swapin). For the test
> program, we need a method to generate new mTHPs continuously. Without this,
> after the initial iterations, we would be left with only folios,
> rendering the entire
> test program *pointless*.

I understand the requirements for new mTHPs.

>>
>> > Additionally, in the event of an application exit, munmap, or OOM killer, the
>> > amount of freed memory can be much larger than 64KB. The primary purpose
>> > of using MADV_DONTNEED is to release anonymous memory and generate
>> > new mTHP so that the iteration can continue. Otherwise, the test program
>> > becomes entirely pointless, as we only have large folios at the beginning.
>> > That is exactly why Chris has failed to find his bugs by using other small
>> > programs.
>>
>> Although I still don't understand how 64KB aligned MADV_DONTNEED is used
>> for libc/java heap or munmap in a practical way.  After more thoughts, I
>> think 64KB Aligned MADV_DONTNEED can simulate the fragmentation effect
>> of processes exit at some degree if 64KB folios in these processes are
>> swapped out without splitting.  If you have no other practical use
>> cases, I suggest to make it explicit with comments in program.
>>

[snip]

--
Best Regards,
Huang, Ying
Barry Song June 24, 2024, 7:55 a.m. UTC | #15
On Mon, Jun 24, 2024 at 7:01 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Mon, Jun 24, 2024 at 3:44 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Fri, Jun 21, 2024 at 9:24 PM Huang, Ying <ying.huang@intel.com> wrote:
> >> >>
> >> >> Barry Song <21cnbao@gmail.com> writes:
> >> >>
> >> >> > On Fri, Jun 21, 2024 at 7:25 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >> >> >>
> >> >> >> On 20/06/2024 12:34, David Hildenbrand wrote:
> >> >> >> > On 20.06.24 11:04, Ryan Roberts wrote:
> >> >> >> >> On 20/06/2024 01:26, Barry Song wrote:
> >> >> >> >>> From: Barry Song <v-songbaohua@oppo.com>
> >> >> >> >>>
> >> >> >> >>> Both Ryan and Chris have been utilizing the small test program to aid
> >> >> >> >>> in debugging and identifying issues with swap entry allocation. While
> >> >> >> >>> a real or intricate workload might be more suitable for assessing the
> >> >> >> >>> correctness and effectiveness of the swap allocation policy, a small
> >> >> >> >>> test program presents a simpler means of understanding the problem and
> >> >> >> >>> initially verifying the improvements being made.
> >> >> >> >>>
> >> >> >> >>> Let's endeavor to integrate it into the self-test suite. Although it
> >> >> >> >>> presently only accommodates 64KB and 4KB, I'm optimistic that we can
> >> >> >> >>> expand its capabilities to support multiple sizes and simulate more
> >> >> >> >>> complex systems in the future as required.
> >> >> >> >>
> >> >> >> >> I'll try to summarize the thread with Huang Ying by suggesting this test program
> >> >> >> >> is "necessary but not sufficient" to exhaustively test the mTHP swap-out path.
> >> >> >> >> I've certainly found it useful and think it would be a valuable addition to the
> >> >> >> >> tree.
> >> >> >> >>
> >> >> >> >> That said, I'm not convinced it is a selftest; IMO a selftest should provide a
> >> >> >> >> clear pass/fail result against some criteria and must be able to be run
> >> >> >> >> automatically by (e.g.) a CI system.
> >> >> >> >
> >> >> >> > Likely we should then consider moving other such performance-related thingies
> >> >> >> > out of the selftests?
> >> >> >>
> >> >> >> Yes, that would get my vote. But of the 4 tests you mentioned that use
> >> >> >> clock_gettime(), it looks like transhuge-stress is the only one that doesn't
> >> >> >> have a pass/fail result, so is probably the only candidate for moving.
> >> >> >>
> >> >> >> The others either use the times as a timeout and determine failure if the
> >> >> >> action didn't occur within the timeout (e.g. ksm_tests.c) or use them to add some
> >> >> >> supplemental performance information to an otherwise functionality-oriented test.
> >> >> >
> >> >> > Thank you very much, Ryan. I think you've found a better home for this
> >> >> > tool. I will send v2, relocating it to tools/mm and adding a function to
> >> >> > swap in either the whole mTHPs or a portion of mTHPs by "-a" (aligned swapin).
> >> >> >
> >> >> > So basically, we will have
> >> >> >
> >> >> > 1. Use MADV_PAGEOUT for rapid swap-out, putting the swap allocation code under
> >> >> > high exercise in a short time.
> >> >> >
> >> >> > 2. Use MADV_DONTNEED to simulate the behavior of libc and Java heap in freeing
> >> >> > memory, as well as for munmap, app exits, or OOM killer scenarios. This ensures
> >> >> > new mTHP is always generated, released or swapped out, similar to the behavior
> >> >> > on a PC or Android phone where many applications are frequently started and
> >> >> > terminated.
> >> >>
> >> >> MADV_DONTNEED 64KB memory, then memset() it, this just simulates the
> >> >> large folio swap-in exactly, which hasn't been merged by upstream.  I
> >> >> don't think that it's a good idea to make such kind of trick.
> >> >
> >> > I disagree. This is how userspace heaps can manage memory
> >> > deallocation.
> >>
> >> Sorry, I don't understand how.  Can you show some examples?  Such as
> >> strace log with 64KB aligned MADV_DONTNEED?
> >
> > In Java heap and memory allocators such as jemalloc and Scudo, memory is freed
> > using the MADV_DONTNEED flag when either free() is called or garbage collection
> > occurs. In Android, the Java heap is freed in chunks aligned to 64KB or larger.
>
> Originally, I heard that MADV_FREE is used by jemalloc.  Now, I
> know that they use MADV_DONTNEED too.  Thanks!
>
> Although I still doubt that the libc/Java allocators free pages in
> exactly 64KB units (IIUC, they should free pages in much larger chunks), I
> agree that MADV_DONTNEED is a way to create fragmentation in swap
> devices.

Right.

They don't always free memory in exact 64KB or mTHP sizes, but we
need to define a minimum granularity. Typically, when many objects are
freed, they combine into a larger free block, which is then released to
the kernel all at once.

As an example, libc might map lots of 4MB VMAs and classify them into
different size categories, some for small objects and others for larger ones.
While attempts are made to consolidate adjacent free blocks to reduce
system calls, MADV_DONTNEED is often used at the minimum granularity
for small objects when merging is temporarily impractical: we don't always
encounter two or more memory blocks where all the objects have been
released :-)
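
To make the pattern concrete, here is a minimal sketch of how an allocator
might hand a free range back to the kernel at a fixed minimum granularity.
It assumes a 64KB granule to match the mTHP size discussed in this thread;
release_free_range() and RELEASE_GRANULE are hypothetical names used only
for illustration and are not taken from jemalloc, Scudo or any libc.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumed release granularity, matching the 64KB mTHP size discussed here. */
#define RELEASE_GRANULE (64 * 1024)

/*
 * Hypothetical helper: return the granule-aligned part of the free range
 * [start, start + len) to the kernel with MADV_DONTNEED.
 * Returns the number of bytes released, 0 if the range does not yet cover
 * a complete granule, or (size_t)-1 if madvise() fails.
 */
static size_t release_free_range(void *start, size_t len)
{
	const uintptr_t mask = (uintptr_t)RELEASE_GRANULE - 1;
	uintptr_t begin = (uintptr_t)start;
	uintptr_t end = begin + len;
	uintptr_t first = (begin + mask) & ~mask;	/* round start up */
	uintptr_t last = end & ~mask;			/* round end down */

	if (first >= last)
		return 0;	/* merging has not produced a full granule yet */

	if (madvise((void *)first, last - first, MADV_DONTNEED) != 0)
		return (size_t)-1;

	return last - first;
}
```

A range that covers only part of a granule is left alone, which is the
"merging is temporarily impractical" case above; once neighbouring objects
are freed and the range grows, whole granules get released in one call.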


>
> > In Scudo and jemalloc, there is a configuration option to set the management
> > granularity. This granularity is set to match the mTHP size (though the default
> > value is 16KB in the latest Android if we don't run mTHP). Otherwise, you could
> > end up with millions of partial unmap operations, which would severely degrade
> > the performance of mTHP.
> >
> > Imagine libc/Java functioning like a slab allocator. When kfree() is called,
> > some pages may become completely unoccupied and can be returned to the buddy
> > allocator. In userspace, memory is given back to the kernel in a similar manner,
> > typically using MADV_DONTNEED. Therefore, MADV_DONTNEED is the most common memory
> > reclamation behavior in Android, coming with free(), delete() or GC.
> >
> > Imagine a system with extensive malloc, free, new, and delete operations, where
> > objects are constantly being created and destroyed.
> >
> > On the other hand, whether libc/Java use MADV_DONTNEED to free memory is not
> > crucial, although they do. We need a method to simulate the lifecycle of
> > applications (exiting and starting anew) on PCs or Android phones. It doesn't
> > matter if you use MADV_DONTNEED or munmap to achieve this.
> >
> > It is important to note that mTHP currently operates on a one-shot basis (after
> > swap-out, you never get them back as mTHP as we don't support large folios
> > swapin). For the test program, we need a method to generate new mTHPs
> > continuously. Without this, after the initial iterations, we would be left with
> > only small folios, rendering the entire test program *pointless*.
>
> I understand the requirements for new mTHPs.
>
> >>
> >> > Additionally, in the event of an application exit, munmap, or OOM killer, the
> >> > amount of freed memory can be much larger than 64KB. The primary purpose
> >> > of using MADV_DONTNEED is to release anonymous memory and generate
> >> > new mTHP so that the iteration can continue. Otherwise, the test program
> >> > becomes entirely pointless, as we only have large folios at the beginning.
> >> > That is exactly why Chris has failed to find his bugs by using other small
> >> > programs.
> >>
> >> I still don't understand how 64KB-aligned MADV_DONTNEED is used
> >> for libc/java heap or munmap in a practical way.  After more thought, I
> >> think 64KB-aligned MADV_DONTNEED can simulate the fragmentation effect
> >> of process exit to some degree if 64KB folios in these processes are
> >> swapped out without splitting.  If you have no other practical use
> >> cases, I suggest making it explicit with comments in the program.
> >>
>
> [snip]
>
> --
> Best Regards,
> Huang, Ying

Patch

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index e1aa09ddaa3d..64164ad66835 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -65,6 +65,7 @@  TEST_GEN_FILES += mseal_test
 TEST_GEN_FILES += seal_elf
 TEST_GEN_FILES += on-fault-limit
 TEST_GEN_FILES += pagemap_ioctl
+TEST_GEN_FILES += thp_swap_allocator_test
 TEST_GEN_FILES += thuge-gen
 TEST_GEN_FILES += transhuge-stress
 TEST_GEN_FILES += uffd-stress
diff --git a/tools/testing/selftests/mm/thp_swap_allocator_test.c b/tools/testing/selftests/mm/thp_swap_allocator_test.c
new file mode 100644
index 000000000000..4443a906d0f8
--- /dev/null
+++ b/tools/testing/selftests/mm/thp_swap_allocator_test.c
@@ -0,0 +1,192 @@ 
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * thp_swap_allocator_test
+ *
+ * The purpose of this test program is to help check whether THP swapout
+ * can correctly get swap slots to swap out as a whole instead of
+ * being split. It randomly releases swap entries through madvise
+ * DONTNEED and does swapout on two memory areas: one memory area for
+ * 64KB THP and the other for small folios. The second memory area
+ * can be enabled by "-s".
+ * Before running the program, we need to set up a zRAM or similar
+ * swap device by:
+ *  echo lzo > /sys/block/zram0/comp_algorithm
+ *  echo 64M > /sys/block/zram0/disksize
+ *  echo never > /sys/kernel/mm/transparent_hugepage/hugepages-2048kB/enabled
+ *  echo always > /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled
+ *  mkswap /dev/zram0
+ *  swapon /dev/zram0
+ * The expected result should be 0% anon swpout fallback ratio w/ or
+ * w/o "-s".
+ *
+ * Author(s): Barry Song <v-songbaohua@oppo.com>
+ */
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <errno.h>
+#include <time.h>
+
+#define MEMSIZE_MTHP (60 * 1024 * 1024)
+#define MEMSIZE_SMALLFOLIO (1 * 1024 * 1024)
+#define ALIGNMENT_MTHP (64 * 1024)
+#define ALIGNMENT_SMALLFOLIO (4 * 1024)
+#define TOTAL_DONTNEED_MTHP (16 * 1024 * 1024)
+#define TOTAL_DONTNEED_SMALLFOLIO (768 * 1024)
+#define MTHP_FOLIO_SIZE (64 * 1024)
+
+#define SWPOUT_PATH \
+	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout"
+#define SWPOUT_FALLBACK_PATH \
+	"/sys/kernel/mm/transparent_hugepage/hugepages-64kB/stats/swpout_fallback"
+
+static void *aligned_alloc_mem(size_t size, size_t alignment)
+{
+	void *mem = NULL;
+
+	if (posix_memalign(&mem, alignment, size) != 0) {
+		perror("posix_memalign");
+		return NULL;
+	}
+	return mem;
+}
+
+static void random_madvise_dontneed(void *mem, size_t mem_size,
+		size_t align_size, size_t total_dontneed_size)
+{
+	size_t num_pages = total_dontneed_size / align_size;
+	size_t i;
+	size_t offset;
+	void *addr;
+
+	for (i = 0; i < num_pages; ++i) {
+		offset = (rand() % (mem_size / align_size)) * align_size;
+		addr = (char *)mem + offset;
+		if (madvise(addr, align_size, MADV_DONTNEED) != 0)
+			perror("madvise dontneed");
+
+		memset(addr, 0x11, align_size);
+	}
+}
+
+static unsigned long read_stat(const char *path)
+{
+	FILE *file;
+	unsigned long value;
+
+	file = fopen(path, "r");
+	if (!file) {
+		perror("fopen");
+		return 0;
+	}
+
+	if (fscanf(file, "%lu", &value) != 1) {
+		perror("fscanf");
+		fclose(file);
+		return 0;
+	}
+
+	fclose(file);
+	return value;
+}
+
+int main(int argc, char *argv[])
+{
+	int use_small_folio = 0;
+	int i;
+	void *mem1 = aligned_alloc_mem(MEMSIZE_MTHP, ALIGNMENT_MTHP);
+	void *mem2 = NULL;
+
+	if (mem1 == NULL) {
+		fprintf(stderr, "Failed to allocate 60MB memory\n");
+		return EXIT_FAILURE;
+	}
+
+	if (madvise(mem1, MEMSIZE_MTHP, MADV_HUGEPAGE) != 0) {
+		perror("madvise hugepage for mem1");
+		free(mem1);
+		return EXIT_FAILURE;
+	}
+
+	for (i = 1; i < argc; ++i) {
+		if (strcmp(argv[i], "-s") == 0)
+			use_small_folio = 1;
+	}
+
+	if (use_small_folio) {
+		mem2 = aligned_alloc_mem(MEMSIZE_SMALLFOLIO, ALIGNMENT_MTHP);
+		if (mem2 == NULL) {
+			fprintf(stderr, "Failed to allocate 1MB memory\n");
+			free(mem1);
+			return EXIT_FAILURE;
+		}
+
+		if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_NOHUGEPAGE) != 0) {
+			perror("madvise nohugepage for mem2");
+			free(mem1);
+			free(mem2);
+			return EXIT_FAILURE;
+		}
+	}
+
+	for (i = 0; i < 100; ++i) {
+		unsigned long initial_swpout;
+		unsigned long initial_swpout_fallback;
+		unsigned long final_swpout;
+		unsigned long final_swpout_fallback;
+		unsigned long swpout_inc;
+		unsigned long swpout_fallback_inc;
+		double fallback_percentage;
+
+		initial_swpout = read_stat(SWPOUT_PATH);
+		initial_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
+
+		random_madvise_dontneed(mem1, MEMSIZE_MTHP, ALIGNMENT_MTHP,
+				TOTAL_DONTNEED_MTHP);
+
+		if (use_small_folio) {
+			random_madvise_dontneed(mem2, MEMSIZE_SMALLFOLIO,
+					ALIGNMENT_SMALLFOLIO,
+					TOTAL_DONTNEED_SMALLFOLIO);
+		}
+
+		if (madvise(mem1, MEMSIZE_MTHP, MADV_PAGEOUT) != 0) {
+			perror("madvise pageout for mem1");
+			free(mem1);
+			if (mem2 != NULL)
+				free(mem2);
+			return EXIT_FAILURE;
+		}
+
+		if (use_small_folio) {
+			if (madvise(mem2, MEMSIZE_SMALLFOLIO, MADV_PAGEOUT) != 0) {
+				perror("madvise pageout for mem2");
+				free(mem1);
+				free(mem2);
+				return EXIT_FAILURE;
+			}
+		}
+
+		final_swpout = read_stat(SWPOUT_PATH);
+		final_swpout_fallback = read_stat(SWPOUT_FALLBACK_PATH);
+
+		swpout_inc = final_swpout - initial_swpout;
+		swpout_fallback_inc = final_swpout_fallback - initial_swpout_fallback;
+
+		fallback_percentage = (swpout_fallback_inc + swpout_inc) == 0 ? 0.0 :
+			(double)swpout_fallback_inc / (swpout_fallback_inc + swpout_inc) * 100;
+
+		printf("Iteration %d: swpout inc: %lu, swpout fallback inc: %lu, Fallback percentage: %.2f%%\n",
+				i + 1, swpout_inc, swpout_fallback_inc, fallback_percentage);
+	}
+
+	free(mem1);
+	if (mem2 != NULL)
+		free(mem2);
+
+	return EXIT_SUCCESS;
+}
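
The per-iteration figure printed by the program is the share of 64KB
swap-out attempts that fell back to splitting. As a hedged illustration of
that arithmetic only, the sketch below computes the same ratio from the two
counter increments. fallback_percentage() is a hypothetical helper that is
not part of the patch, and returning 0 when neither counter moved is an
assumption made here to avoid a 0/0 division (for instance when the 64KB
mTHP size is not enabled).

```c
/*
 * Hypothetical helper (not part of the patch): percentage of 64KB swap-out
 * attempts that fell back, given the per-iteration increments of the
 * swpout and swpout_fallback counters.
 * Returns 0.0 when neither counter moved, which avoids dividing 0 by 0.
 */
static double fallback_percentage(unsigned long swpout_inc,
				  unsigned long swpout_fallback_inc)
{
	unsigned long total = swpout_inc + swpout_fallback_inc;

	if (total == 0)
		return 0.0;

	return (double)swpout_fallback_inc / total * 100;
}
```

For example, 3 whole-folio swap-outs and 1 fallback in an iteration give
1 / (3 + 1) * 100 = 25%, while the expected result stated in the header
comment corresponds to a fallback increment of 0 in every iteration.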